Creating an International Console

By: Marjan Venema

Abstract: This article describes how to create a console application capable of displaying international characters by patching some of Delphi 2009's private system functions, instead of relying on UTF8String casts or on changing the DefaultSystemCodePage

Introduction

Patrick and I are currently involved in porting the Every Angle software from Ansi to full Unicode. Our code base is large – approximately 500K lines of proprietary Delphi code. Part of that code base handles extraction of large volumes of data from SAP servers. Because of these large volumes, we use console applications to ensure that each part of the process gets the maximum out of the 3GB of virtual memory we have available on Win32 platforms.

Many of our console applications serve a dual role: stand-alone utilities that can be run “manually” by a user; and integral parts of the data extraction process that are spawned from a central service. In the latter role their console output is caught using redirected console handles and subsequently added to reports and/or log files.

If we want to release our own software claiming full support for international characters, our console applications need to be able to output all those characters to the console window. And we need to be able to do that regardless of whether the standard console input and output handles are redirected or not.

Console window display

All of our console applications use the standard Write and WriteLn functions to output information to the console. Finding out that these functions do not support Unicode was therefore quite a shock to us. It seems that the decision not to have these functions support Unicode was based on the view that “the console is Ansi or OEM anyway.”

An unfortunate view, as this is simply not the case. A Windows console window will happily show international characters. There is no need to install East-Asian language support or to set a code page other than the default. Just ensure that the console’s font is set to Lucida Console instead of the default raster fonts.

If the font used by your console window is set to “Raster Fonts” (the default), then it would indeed seem that the console cannot display international characters, as shown in the picture below.

[Image: dir command output using the default Raster Fonts]

If you simply change the console’s display font to “Lucida Console” then this is what you get:

[Image: DirCommandLucidaFont]

As you can see, the output from the first “dir” command is still showing a question mark, but the output from the second “dir” command reveals that it is a Cyrillic character, the small DE letter to be precise. No other magic needed. Not even any changes to the console’s current code page. In the pictures above it is unchanged from the default, which is the 437 (US) code page.

Less obvious in the two pictures above is the change to the character before that Cyrillic character. Before the change to the Lucida font the “dir” command shows a simple capital A. After the switch the “dir” command reveals that it is actually a capital A with a grave accent.

Just to show you that interpreting what you see on the console can be a bit tricky, here are some more examples.

This is the output you get if you set the console to code page 1252 (Latin 1) but leave the console’s font set to the default “Raster Fonts”:

[Image: DirCommandAnsi]

It shows an old line-drawing character in place of the capital A with a grave accent, even though the Latin 1 code page does have that character... The raster fonts simply have a different glyph for the ordinal value of that character.
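
If you want to try this yourself: the console’s code page is switched with the chcp command – for example chcp 1252 or chcp 65001, and chcp 437 takes you back to the US default.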

If you set the code page to UTF-8 (65001) – an often recommended strategy for displaying international characters – you get even “prettier” results:

[Image: DirCommandUTF8]

It does show very clearly that both the capital A with a grave accent and the Cyrillic small DE character take two bytes and are apparently sent to the console as UTF-8 encoded information by the dir command.

Anyway, the long and short of it is: if you want to see the correct international characters in your console windows, use the code page you would normally use and set the console window’s font to “Lucida Console” (or any other installed console font with good Unicode support).

Setting a console window’s font to Lucida: Right click on the console window’s caption, select Properties, select the Font tab, select “Lucida Console” as the font and click OK. You’ll get a dialog asking whether you want to do this for the current window only or for all future windows with the same title as well. Your choice…

How to emulate the dir command?

We needed a way to do exactly what the dir command seemed to be doing: sending UTF-8 encoded strings to the console.

There are two very simple ways of doing this:

  • Cast the string to a UTF8String before feeding it to Write/WriteLn
  • Set the DefaultSystemCodePage to UTF-8

Both seem simple enough workarounds for what we wanted: correct output of international characters by our console applications. However, both have drawbacks that we were not willing to put up with.

Casting through UTF8String

Putting each and every string passed to Write/WriteLn through a UTF8String() cast would be very tedious indeed, but you could of course put the cast inside your own custom procedures, such as:

procedure MyWrite(const aString: string);
begin
  Write(UTF8String(aString));
end;

procedure MyWriteLn(const aString: string);
begin
  WriteLn(UTF8String(aString));
end;

Yes, you can do that. However, the compiler magic surrounding Write/WriteLn means that these procedures can be called with any type of parameter you care to throw at them: Integers, Variants, Booleans, Int64s… You would need to write overloads of both MyWrite and MyWriteLn for all the parameter types used in calls to Write/WriteLn. Writing the overloads is simple enough to do, and then “all” you would have to do is change all Write/WriteLn calls in your code to MyWrite/MyWriteLn calls.
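
For example, a couple of hypothetical overloads might look like this (the string versions above would then need the overload directive as well):

procedure MyWriteLn(const aValue: Integer); overload;
begin
  // Non-string types carry nothing that needs re-encoding; just pass them on.
  WriteLn(aValue);
end;

procedure MyWriteLn(const aValue: Boolean); overload;
begin
  WriteLn(aValue);
end;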

For some that might be a good enough solution. For us, it wasn’t acceptable. There were simply too many lines of code that would be affected. The huge number of changes would make our lives very difficult when we get to that point in time where we want to synchronize/merge our Unicode project with other code branches.

Setting DefaultSystemCodePage to UTF-8

Now this is really a very simple and very handy solution for many people. Setting the DefaultSystemCodePage to UTF-8 somewhere in an initialization section ensures that all Write/WriteLn calls send UTF-8 to the console. This sounds like a solution!
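
For completeness, here is a minimal sketch of this workaround (assuming Delphi 2009, where the System unit exposes SetMultiByteConversionCodePage as the way to change DefaultSystemCodePage; CP_UTF8 comes from the Windows unit):

initialization
  // Make every implicit Unicode <-> Ansi conversion - including the one
  // hidden inside Write/WriteLn - use UTF-8 from now on.
  SetMultiByteConversionCodePage(CP_UTF8);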

So, why didn’t we adopt this approach? Because setting the DefaultSystemCodePage affects all (implicit) conversions from Unicode to Ansi and vice versa. And that would spell trouble for us, because our Unicode version still needs to support Ansi data for customers who are really not interested in support for international characters, but do care about disk space and memory requirements. That means we need to use the correct Ansi code page when loading their data – preferably with as few changes to our code as possible, which means doing what we have (rightly or wrongly) done up till now: relying on the system’s code page settings. So mucking about with the DefaultSystemCodePage simply wasn’t an option for us.

Now what?

We still want international characters from our console applications, but can’t use the two simple workarounds available to us. A deeper investigation was called for.

When you Ctrl-Click on a WriteLn symbol in your code, you are taken to the system unit. At the beginning of that unit, you’ll find some comments explaining that predefined constants, types, procedures and functions do not have actual declarations, but are built into the compiler. WriteLn is even mentioned as an example.

So code hooking or code patching – something with which Patrick isn’t exactly unfamiliar, and which usually is a fair line of defense when dealing with unaccommodating third party code – seemed like a no-go this time.

Someone suggested writing a Pascal text driver. I had never heard of such a beast. Digging around on the Internet did turn up several examples. Most deal with getting the WriteLn and Write functions to work with streams – not so surprising when you consider that WriteLn and Write are not just used to display text on the console. Although these examples were not what I was looking for, they did help me get started. I spent quite a few hours on this, only to find out that no matter what I did, by the time my Pascal text driver got its hands on the string passed to WriteLn, it had already been “ansified”.

So I had to dig further into the system unit to find out what was happening to the strings before they arrived in my Pascal text driver.

When you search for WriteLn in the system unit, you’ll find a _WriteLn declaration in the interface section. Try using Ctrl-Shift-Up/Down on it. It doesn’t budge. But when you do a “find again” you arrive at the implementation. Finding it didn’t help much though. When I put a breakpoint on it and ran my Pascal text driver through its paces, the string was already “ansified” by the time we got to _WriteLn.

Just above the _WriteLn declaration in the system unit, you’ll find a bunch of functions whose names all start with _Write. Most interesting of them were:

  • _WriteWString and _Write0WString, taking a WideString parameter
  • _WriteWCString and _Write0WCString, taking a PWideChar parameter
  • _WriteWChar and _Write0WChar, taking a WideChar parameter
  • _Write0UString and _WriteUString, taking a UnicodeString parameter

The _Write0WString, _Write0WCString and _Write0WChar functions all call their _WriteW... namesakes. _Write0UString calls _WriteWString instead of _WriteUString. All _WriteW... versions end up calling _WriteLString.

This makes _WriteLString a good candidate for hooking, were it not for the fact that all _WriteW... versions send an “ansified” string to _WriteLString by doing an AnsiString(s) cast of the string passed into them. So if there was any hooking to be done, it would have to be on the three _WriteW... functions, to prevent them from “ansifying” their input.

Have you ever tried hooking one of the functions or procedures in the system unit? It works fine. Normally. Not with these three _WriteW... functions though. When you try to place a hook on them, the compiler complains about an undeclared identifier, even though all three functions are declared in the interface section of the system unit.

So hooking any of these functions using their names was impossible. Hooking these functions seemed futile anyway: if we can’t reference them because of undeclared identifier complaints, the same complaint will show up when we try to call _WriteLString from our own replacement functions.

Now what? Revisited

Bother. Now what? Patrick?!? Help!

My pathetic plea for help resulted in an afternoon of hacking graduate school. Patrick doing most of the thinking and “ah”-ing, me looking on, able to understand conceptually what was going on, but completely lost when it came to any implementation. I don’t know how he does it, but Patrick is obviously able to juggle knowledge about calling conventions, CPU registers and behavior, ASM opcodes, relative and absolute pointers and where they point to, all in his head and without losing his way...

What did we need? We needed a way to find the address of the _WriteW... functions without being able to call them directly. Left me stumped, but not Patrick. “Simply a matter of creating a procedure” he says, “that causes a call to the function we want and then examining the memory to find the CALL opcode and thus the address the CALL jumps to. When we have that we can analyse the executable code to see what is going on and what we might do about it.”

Yes, well, ok, I understand that bit, now how do we go about it?

_WriteWString

Let’s start with the _WriteWString. We need a procedure that is as empty as possible and causes a call to _WriteWString without calling it directly. Probably by passing a WideString to WriteLn. Something like:

procedure CauseACallTo_WriteWString(const aWideString: WideString);
begin
  WriteLn(Output, aWideString);
end;

Next step is to create a simple program that executes this procedure passing in an actual WideString. We need to make sure the compiler doesn’t optimize this code, so we can’t rely on the project options and need to turn off optimization in the unit itself.

{$OPTIMIZATION OFF}

Of course we also need to keep the linker from excluding this function from the executable by making sure it is either called directly or the linker thinks it may be called indirectly (by its address).
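
Putting this together, a probe project could look something like the sketch below (the names are our own; CauseACallTo_WriteWString is assumed to live in a unit – here called WriteProbeUnit – compiled with {$OPTIMIZATION OFF}):

program WriteProbe;

{$APPTYPE CONSOLE}

uses
  WriteProbeUnit;

begin
  // A direct call is enough to keep the linker from stripping the procedure.
  CauseACallTo_WriteWString('any old WideString will do');
  // Put a breakpoint on the line above (or inside the procedure itself)
  // and open the CPU view when it is hit.
  ReadLn;
end.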

After that it’s time to build and run the application with a breakpoint somewhere so that we can examine what the compiler has done in the CPU view. If you didn’t put a breakpoint in the function itself, you can simply scroll the CPU view until you find the instructions that correspond with the code in our CauseACallTo_WriteWString function.

[Image: DN_CauseACallTo_WriteWString_Step1]

As it turns out, the first call is the call to _Write0WString, even though it now lacks the initial underscore – probably the compiler magic at work. This is rather unfortunate, as we want to hook or patch the _WriteWString function. But we know that _WriteWString is called by the _Write0WString function. So we can still get there; we will “just” have to take it in steps.

First, we need to find the address of the _Write0WString function. The “E8189BFFFF” is the actual CALL instruction executed by the CPU. The first byte, “E8”, is the CALL opcode and the remaining 4 bytes are the destination of the call. This destination is not an absolute address somewhere in memory, but an address relative to the address of the CALL instruction itself (“0040B0EF”).

We can now find the address of the _Write0WString function by calculating $0040B0EF+$FFFF9B18+5

  • The $FFFF9B18 is the relative displacement in bytes. As it is a negative number in two’s complement (it starts with $FF), adding it actually results in a subtraction.
    (Please note that addresses in CPU instructions are stored in little endian, or reverse byte order – which is why the instruction bytes read E8 18 9B FF FF.)
  • The +5 is a correction we need because, although we start our calculation at the address of the CALL instruction ($0040B0EF), the displacement is really relative to the instruction pointer, which has already moved past the CALL by the time the jump is taken. So we first have to move forward an extra 5 bytes – the length of the CALL instruction – before applying the displacement.
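
Worked out by hand (in 32 bit arithmetic, so the overflow is simply discarded): $0040B0EF + $FFFF9B18 = $00404C07, and adding 5 gives $00404C0C as the start of _Write0WString – which fits nicely with the CALL instruction at $00404C0E that we find just inside it in the next step.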

The IDE can do the calculation for us and take us to the desired address. Simply press Ctrl-G (for Go to address) and enter the formula given above.

[Image: DN_CauseACallTo_WriteWString_Step2]

Lo and behold, we have found the _Write0WString function. And true to the code in the system unit, the first call is to WriteWString, which we know and love as _WriteWString. Now we are getting somewhere.

We are still on the hunt for the address of the _WriteWString function. As it is also the destination of a CALL instruction with a relative displacement (opcode $E8), we can use the same method as above and calculate its address as

$00404C0E+$00000001+5

  • The $00000001 is the relative displacement in bytes.
    (Please note that the addresses in the CPU instructions are given in reverse order.)
  • The +5 is again the correction we need because the current instruction pointer has already moved beyond the CALL instruction.

As you might already have noticed, the relative displacement is only 1 byte. Obviously, that means the _WriteWString function starts right after the CALL instruction, and we don’t really have to use the Ctrl-G (Go to address) method to get there. Just look one line down.

The _WriteWString implementation is a bit longer than that of the _Write0WString function, but when you look down through the instructions, you will soon find the call to _WriteLString.

[Image: DN_CauseACallTo_WriteWString_Step3]

Just before the call to _WriteLString, you can see a call to LStrFromWStr (found in the system unit as _LStrFromWStr). That is one of the internal string conversion functions. The instruction just before the call shows that $00000000 is moved into the ECX register, which corresponds to the third parameter of the function (in Delphi’s register calling convention the first three parameters go into EAX, EDX and ECX). And that – surprise, surprise – is the code page to be used for the conversion.

By using code page zero, Delphi is telling the conversion function to use the System Default Code Page. Which is why setting the Default System Code Page to UTF-8 is one of the workarounds available to us.

Finding this also gives us the solution to our problem: “all” we need to do now is to patch this bit of code to use the value for the UTF-8 code page... But before we do, let’s check the other functions that are part of the compiler magic that surround sending strings to the Write/WriteLn functions.

_Write0WCString / _WriteWCString and _Write0WChar / _WriteWChar

To find out what happens when you pass in a PWideChar or a WideChar to WriteLn, we did the same as we did for passing a WideString to WriteLn. We found that their behavior is exactly the same. Passing in a PWideChar results in a call to _Write0WCString, which in turn calls _WriteWCString. Passing in a WideChar results in a call to _Write0WChar, which in turn calls _WriteWChar.

Both _WriteWCString and _WriteWChar contain a call to _LStrFromWStr passing $00000000 as the code page to be used just before they call _WriteLString to do the actual work.

_Write0UString / _WriteUString

The pair of functions that deal with UnicodeStrings are _Write0UString and _WriteUString. As _Write0UString calls _WriteWString, we don’t have to worry about that one. When we find a solution for _WriteWString, we’ll automatically have tackled _Write0UString.

That leaves us with _WriteUString. Using the same method again, we found that when you pass a UnicodeString to the WriteLn function, the compiler generates code that calls _Write0UString. So apparently we don’t need to worry about _WriteUString yet.

Solution

We now have a way to enable our console applications to display international characters:

  • At initialization, set the current code page for console output to UTF-8.
  • Patch the offending $00000000 parameter in the appropriate functions with the value for the UTF-8 code page.
  • At finalization, set the current code page for console output back to what it was when we started.
    Trust me, in this case it is very necessary to clean up after ourselves. When you leave the console output code page set to UTF-8 you’ll find that it is quite hard to execute any .cmd files from the command line...

Why do we need to set the console’s code page to UTF-8 when there was no need to do that manually to get the dir command to produce beautiful Cyrillic characters? Well, you do indeed need the console to be on the UTF-8 code page to show those characters; the dir command probably takes care of that internally, so all the user has to do is take care of the font. And that is exactly the behavior we want from our own applications as well.

Running under Windows Vista or Windows Server 2008, we could even take care of ensuring that the console’s font is set to Lucida Console, simply by using the Get/SetCurrentConsoleFontEx API functions. Unfortunately, earlier versions of Windows do not have these API functions; they must make do with a GetCurrentConsoleFont API function, and there is no corresponding setter.
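
Here is a sketch of what that could look like on Vista or later. We declare the API ourselves on the assumption that your Windows unit does not include it yet (check the declaration against the SDK headers); the procedure needs Windows and SysUtils in its uses clause:

type
  TConsoleFontInfoEx = record
    cbSize: Cardinal;
    nFont: Cardinal;
    dwFontSize: TCoord;
    FontFamily: Cardinal;
    FontWeight: Cardinal;
    FaceName: array[0..LF_FACESIZE - 1] of WideChar;
  end;

// NOTE: a static import like this makes the executable require Vista or later;
// use GetProcAddress instead if the application must still load on older Windows.
function SetCurrentConsoleFontEx(hConsoleOutput: THandle; bMaximumWindow: BOOL;
  var lpConsoleCurrentFontEx: TConsoleFontInfoEx): BOOL; stdcall;
  external kernel32 name 'SetCurrentConsoleFontEx';

procedure UseLucidaConsoleFont;
var
  FontInfo: TConsoleFontInfoEx;
begin
  FillChar(FontInfo, SizeOf(FontInfo), 0);
  FontInfo.cbSize := SizeOf(FontInfo);
  FontInfo.FontFamily := FF_DONTCARE;
  FontInfo.FontWeight := FW_NORMAL;
  FontInfo.dwFontSize.Y := 14; // font height; adjust to taste
  StringToWideChar('Lucida Console', @FontInfo.FaceName[0], LF_FACESIZE);
  if not SetCurrentConsoleFontEx(GetStdHandle(STD_OUTPUT_HANDLE), False, FontInfo) then
    RaiseLastOSError;
end;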

Be careful restoring the font at finalization to what it was before you changed it. At least when I did it manually, it mucked up my console window pretty badly... But I haven’t really looked into it in any more detail. It’s low priority for us, as users who want our Unicode version will probably already have ensured that any console window opens with the appropriate font settings (or they wouldn’t see their own stuff either).

Back to the solution. It gives us two important benefits:

  • Only the Write/WriteLn functions are affected. This means that we can still use the System Default Code Page for all other Unicode to Ansi (and vice versa) conversions.
  • It can be coded in a separate unit. This means that apart from including this unit in the dpr files of our console applications, we do not have to change a single line of code that currently writes to a console window.

There is a drawback too. All code that writes files using AssignFile / Rewrite / Append will now put out UTF-8 as well. And, unless you do something about it, any file written in this manner won’t have the UTF-8 preamble. There are two solutions to that problem. One is to switch to streams for file access. The other is to ensure that the UTF-8 preamble is written to files created or updated using Write/WriteLn. Once the patches are in place, you will have to do that byte by byte.

  // Write UTF-8 Preamble
  // Assumes F: Text opened with Rewrite/Append, Bytes: TBytes, i: Integer,
  // and SysUtils in the uses clause (for TEncoding).
  // NOTE: The AnsiChar cast is necessary because otherwise Write will write out
  // the decimal representation of the byte values.
  // That would leave you with 239187191 (Hex 32 33 39 31 38 37 31 39 31) at the
  // start of your file instead of the Hex EF BB BF we need.
  Bytes := TEncoding.UTF8.GetPreamble;
  for i := Low(Bytes) to High(Bytes) do
    Write(F, AnsiChar(Bytes[i]));

Code page for console output

Getting and setting the code page to be used for console output is pretty simple. We have two Windows API calls to do just that:

  • GetConsoleOutputCP
  • SetConsoleOutputCP

We use the initialization section to retrieve the code page currently used for console output, store it in a variable for use at finalization, and then change the code page to UTF-8. At finalization, all we have to do to clean up after ourselves is change the code page back to the one we retrieved at initialization.

var
  CurrentConsoleOutputCodePage: Integer;
Initialization
  CurrentConsoleOutputCodePage := GetConsoleOutputCP;
  SetConsoleOutputCP(CP_UTF8);
finalization
  SetConsoleOutputCP(CurrentConsoleOutputCodePage);

The CP_UTF8 constant is declared in the Windows unit.

Patching the code page parameters

Patching the code page parameter is a bit more involved. We have to come up with a way to do in code exactly what we just did “live” in the CPU window: find the address of the _WriteWString function. With that address we can find the address of the instruction that puts $00000000 into the ECX register, work out the address of the $00000000 parameter, and replace that parameter with the value of the UTF-8 code page. And of course we need to repeat all that for the _WriteWCString and _WriteWChar functions.

To do all this, the first thing we need are three procedures that cause calls to the respective _WriteW... functions.

procedure CauseACallTo_WriteWString(const aWideString: WideString);
begin
  // Ensure a call to _Write0WString and from there to _WriteWString
  WriteLn(Output, aWideString);
end;

procedure CauseACallTo_WriteWCString(const aPWideChar: PWideChar);
begin
  // Ensure a call to _Write0WCString and from there to _WriteWCString
  WriteLn(Output, aPWideChar);
end;

procedure CauseACallTo_WriteWChar(const aWideChar: WideChar);
begin
  // Ensure a call to _Write0WChar and from there to _WriteWChar
  WriteLn(Output, aWideChar);
end;

Of course we want the patches of the code page parameters in place as soon as possible, so we need to put the patching code in an initialization section. I like to keep these sections as short as possible, so all we do here is add a few calls after setting the console’s code page:

var
  CurrentConsoleOutputCodePage: Integer;
Initialization
  CurrentConsoleOutputCodePage := GetConsoleOutputCP;
  SetConsoleOutputCP(CP_UTF8);

  Patch_WriteXXXFunctionUsing(@CauseACallTo_WriteWString);
  Patch_WriteXXXFunctionUsing(@CauseACallTo_WriteWCString);
  Patch_WriteXXXFunctionUsing(@CauseACallTo_WriteWChar);

What we do here is use a separate procedure to do the patching for us. We call it three times, passing it the address of each of our “trigger” procedures. The code of the patch procedure is still pretty simple:

procedure Patch_WriteXXXFunctionUsing(const aAddressOfTriggerProc: Pointer);
var
  AddressOf_WriteFunction: Pointer;
begin
  AddressOf_WriteFunction := FindAddressOf_WriteFunctionFor(
    aAddressOfTriggerProc);
  Assert(Assigned(AddressOf_WriteFunction));

  PatchCodePageIn(AddressOf_WriteFunction);
end;

Using the function FindAddressOf_WriteFunctionFor, the patch procedure finds the address of the _WriteW... function triggered by the “trigger” procedure whose address it receives in the aAddressOfTriggerProc parameter. It asserts that an address was found and passes that address to the PatchCodePageIn procedure.

Finding the address of a _WriteW... function in code

To find the address of a _WriteW... function we need to do exactly what we did “live” in the CPU window, but now without the benefit of our eyes. As you may recall, finding the address of the _WriteW... function was actually a two-step process. The first step is to use the address of the “trigger” procedure to find the address of the _Write0... function. The second step is to use the address of the _Write0... function to find the address we are really after: that of the _WriteW... function.

To find the address of the _Write0... and _WriteW... functions we used the fact that both the “trigger” procedure and the _Write0... function use a CALL instruction to access the _Write0... and _WriteW... functions respectively. Also very important here is the fact that in both cases the CALL instruction we need is the first CALL instruction we encounter after the starting address of each function.

That makes the implementation of the FindAddressOf_WriteFunctionFor function nicely straightforward:

function FindAddressOf_WriteFunctionFor(const aAddressOfTriggerProc: Pointer): Pointer;
var
  Addr_Write0,
  Addr_Write: Pointer;
begin
  Addr_Write0 := GetDestinationAddressOfFirstCallAfter(aAddressOfTriggerProc);
  Addr_Write := GetDestinationAddressOfFirstCallAfter(Addr_Write0);

  Result := Addr_Write;
end;

The FindAddressOf_WriteFunctionFor first passes the address of the “trigger” procedure to the GetDestinationAddressOfFirstCallAfter function. The result it receives is the address of the _Write0... function. It then calls GetDestinationAddressOfFirstCallAfter again, now passing in the address of the _Write0... function. The result is the address of the _WriteW... function which it returns as its own result.

Now we get to the fun bit: finding the first CALL instruction from the passed in starting address and calculating the absolute address of the CALL instruction’s destination:

type
  MathPtr = Integer; // pointer-sized on the 32 bit platform this code targets

const
  OPCODE_RET = $C3;
  OPCODE_CALL = $E8;
  OPCODE_JMP_REL = $E9;
  OPCODE_CMP = $66;
  OPCODE_MOV_ECX = $B9;

function GetDestinationAddressOfFirstCallAfter(const aCode: PByte): Pointer;
var
  i: Integer;
  AddrPtr: PPointer;
begin
  Result := nil;
  for i := 0 to 100 do
    case aCode[i] of
      OPCODE_RET:
        Exit;
      OPCODE_CALL:
        begin
          // Get the address of CALL's relative destination.
          AddrPtr := PPointer(aCode + i + 1);
          // Read the value at that address. This is a relative pointer.
          Result := AddrPtr^;
          // Translate relative pointer to absolute pointer.
          MathPtr(Result) := MathPtr(AddrPtr) + MathPtr(Result) + SizeOf(Pointer);
          Exit;
        end;
    end;
end;

Safety first, so we start by setting the result to nil. This ensures that the assert in Patch_WriteXXXFunctionUsing will fail if we do not find a CALL instruction.

To find the CALL instruction, we loop over the bytes starting from the address passed in to GetDestinationAddressOfFirstCallAfter. We stop after looking at a maximum of 100 bytes. That 100 is an arbitrary value; it could just as well have been 50 or 20, as long as it is enough to find that first CALL instruction. We also stop if we encounter a RET instruction – the opcode for returning from a function. If we encounter that before finding a CALL, we might as well stop, as continuing would mean scanning beyond the function we are interested in.

WARNING: Scanning executable code and drawing conclusions from it is very tricky, to say the least. You have to realize that looking for the CALL opcode in this way is only “safe” because we checked the functions in the CPU view first and we know that there wasn’t any data containing the opcode we were looking for between the function start and the CALL instruction itself. By “data” we mean all non-opcode bytes, whether they be jump target addresses or any other kind of data you might find in an executable.

When the byte we are looking at matches the CALL opcode, we get to the real work. The first thing we do is work out the address of the first byte of the CALL instruction’s destination pointer.

As you may remember the CALL instruction consists of

  • A single byte of value $E8
  • A destination pointer (4 bytes as we are on a 32 bit platform)

To figure out the absolute destination address of the CALL instruction, we first need to retrieve its relative destination address. The destination pointer starts one byte beyond the byte we are currently looking at, so its address is given by:

          AddrPtr := PPointer(aCode + i + 1);

Where (aCode + i) is the address of the CALL instruction and the + 1 correction is to skip the CALL opcode itself.

To get the value of the destination pointer, we need to dereference the address we just calculated:

          Result := AddrPtr^;

As this address is relative to the address of the CALL instruction, we need to translate it to an absolute address:

          MathPtr(Result) := MathPtr(AddrPtr) + MathPtr(Result) + SizeOf(Pointer);

To do so we:

  • take the address of destination pointer – AddrPtr
  • add the relative destination – Result
  • and adjust for the fact that the current instruction pointer has moved beyond the CALL instruction by adding SizeOf(Pointer)

If you are still awake and with it, you will have noticed that SizeOf(Pointer) equals 4 (at least on a 32 bit platform), whereas the correction for the instruction pointer having moved when we did it “live” was 5:

$0040B0EF+$FFFF9B18+5

That is because when we did it “live” we started from the address of the CALL instruction, instead of the address of the destination pointer. To do the same in the “live” situation we should have done:

($0040B0EF+1)+$FFFF9B18+4

where ($0040B0EF+1) is the address of the destination pointer.

The MathPtr type is declared and used to ensure that we can do math with “normal” Pointers and Pointers to Pointers.

Finding and replacing the code page parameter

To find and replace the code page parameter we scan the code from the address we are given and look for the instruction that puts a value of hex zero into the ECX register. If an address is found where such an instruction begins, we use the WriteProcessMemory function from the Jedi Win32 API library (http://jedi-apilib.sourceforge.net/) to change the instruction’s parameter from $00000000 to the value of the UTF-8 code page.

Here is how the PatchCodePageIn procedure does it:

procedure PatchCodePageIn(const aAddressOf_WriteFunc: Pointer);
var
  Pattern: TBytes;
  PatchAddress: PByte;
  NumberOfBytesWritten: Integer;
begin
  SetLength(Pattern, 5);
  Pattern[0] := OPCODE_MOV_ECX;
  Pattern[1] := $00;
  Pattern[2] := $00;
  Pattern[3] := $00;
  Pattern[4] := $00;
  PatchAddress := FindCodeBytes(aAddressOf_WriteFunc, Pattern);
  Assert(Assigned(PatchAddress));

  PInteger(@(Pattern[1]))^ := CP_UTF8;
  if not JwaWinBase.WriteProcessMemory(
    Windows.GetCurrentProcess(),
    PatchAddress,
    @(Pattern[0]),
    Length(Pattern),
    @NumberOfBytesWritten) then
      RaiseLastOSError;
end;

The pattern it looks for is a set of 5 bytes representing the instruction to put the $00000000 value into the ECX register.

The

  PInteger(@(Pattern[1]))^ := CP_UTF8;

statement looks tricky, but all it does is take the address of the second byte of the Pattern variable:

@(Pattern[1])

and then tells the compiler, by means of the PInteger cast, to treat it as a pointer to an Integer, and finally dereferences that pointer so that the value it points to is overwritten, rather than the pointer itself.

Why do we use the WriteProcessMemory function from the Jedi Win32 API library? Well, if we don’t, the operating system will complain about us writing to memory that is designated as execute-only. The Jedi library function ensures that the operating system is told that we know what we are doing and really want to do exactly what we are telling it to do.
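
If you would rather stay within the standard Windows unit, the same effect can be achieved by lifting the page protection yourself before writing. Here is a sketch of that alternative (not the route we took); it needs Windows and SysUtils:

procedure PatchBytes(const aAddress: Pointer; const aBytes: TBytes);
var
  OldProtect, Dummy: DWORD;
begin
  // Temporarily make the memory page containing the code writable, copy the
  // new bytes in place, then restore the original protection and flush the
  // instruction cache.
  if not VirtualProtect(aAddress, Length(aBytes), PAGE_EXECUTE_READWRITE, OldProtect) then
    RaiseLastOSError;
  try
    Move(aBytes[0], PByte(aAddress)^, Length(aBytes));
  finally
    VirtualProtect(aAddress, Length(aBytes), OldProtect, Dummy);
    FlushInstructionCache(GetCurrentProcess, aAddress, Length(aBytes));
  end;
end;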

Scanning for a pattern of bytes

The FindCodeBytes function is used by the PatchCodePageIn procedure to scan for a specific pattern of bytes. Here is the implementation of that function:

function FindCodeBytes(const aCode: PByte; const aBytes: TBytes): PByte;
const
  MAX_SCAN = 100;
var
  i, j: Integer;
begin
  for i := 0 to MAX_SCAN do
  begin
    Result := aCode + i;
    for j := 0 to Length(aBytes) - 1 do
    begin
      if Result[j] <> aBytes[j]  then
      begin
        Result := nil;
        Break;
      end;
    end;

    if Assigned(Result) then
      Exit;
  end;

  Result := nil;
end;

It starts at the address it is given (aCode) and scans for a maximum of 100 bytes. This again is an arbitrary value, set large enough to ensure that it will find the pattern within the _WriteW... functions.

For each start position within those 100 bytes, it sets the result to the address of the current byte and then scans ahead, checking each byte against the corresponding byte of the pattern it was given. At the first mismatch it resets the result to nil and breaks out of the inner for-loop. If the scan did not end in an early break, the result is still set once the inner for-loop completes, and we can leave the outer for-loop as well – this time with an Exit statement, so that the final “Result := nil;” (which guarantees a nil result when the pattern is never found) is skipped.

Summary

To sum up this long story, we have shown you:

  • that the console window is capable of much more than OEM or Ansi.
  • that to see international characters on a console window, you need to set the console window’s font to Lucida Console (or another console font with good Unicode support).
  • that when a user runs a dir command from a console window, they do not have to mess with the console’s code page setting, and that your own console applications should behave similarly.
  • that to act like the dir command and show international characters correctly, you need to set the console’s code page to UTF-8 in code and, when your application ends, return it to what it was when you started.
  • that there are two simple ways to get international characters to the console. One is casting all strings passed to Write/WriteLn to a UTF8String. The other is to set the DefaultSystemCodePage to UTF-8.
  • that if for some reason these options are not open to you, there is no way to hook into the compiler magic surrounding the Write/WriteLn functions.
  • that there is a way to get our grubby little hands on the address of these “private” system functions and take it from there.
  • how to patch these private system functions so that Write/WriteLn will send UTF-8 to the console window without affecting any other (implicit) Unicode to Ansi (and vice versa) conversions.
  • that you can put these patches in a separate unit so all you have to do to add support for sending international characters to a console window is to include that unit in your console application’s project files.
  • that the side effect of this solution is that files written with Write/WriteLn will also get UTF-8 encoded information, so you would either have to start using streams for file access or ensure that the UTF-8 preamble is included in the file.

Another advantage of this solution is that it works regardless of whether the standard console output handle is redirected or not. It works when you redirect the console’s output handle to a pipe or to a file through CreateProcess. It also works when you redirect the console’s output to a file using the “>” symbol from a command prompt. Just be aware that the output you receive in the pipe is UTF-8 encoded, and that any files created through redirecting the console will be UTF-8 encoded but will lack the UTF-8 preamble.
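
For instance, a consumer of such a redirected log file has to state the encoding explicitly, because without a preamble TEncoding cannot detect it. A sketch (the procedure name is made up; it needs Classes and SysUtils):

procedure LoadRedirectedLog(const aFileName: string);
var
  Log: TStringList;
begin
  Log := TStringList.Create;
  try
    // No UTF-8 preamble in the redirected file, so force the encoding
    // instead of relying on auto-detection.
    Log.LoadFromFile(aFileName, TEncoding.UTF8);
    // ... add the lines to a report or log viewer ...
  finally
    Log.Free;
  end;
end;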

