Delphi in a Unicode World Part II: New RTL Features and Classes to Support Unicode

By: Nick Hodges

Abstract: This article will cover the new features of the Tiburon Runtime Library that will help handle Unicode strings.

Introduction

In Part I, we saw how Unicode support is a huge benefit for Delphi developers, enabling them to work with every character set in the Unicode universe. We also covered the basics of the UnicodeString type and how it will be used in Delphi.

In Part II, we’ll look at some of the new features of the Delphi Runtime Library that support Unicode and general string handling.

TCharacter Class

The Tiburon RTL includes a new class called TCharacter, which is found in the Character unit. It is a sealed class that consists entirely of static class functions. Developers should not create instances of TCharacter, but rather merely call its static class methods directly. Those class functions do a number of things, including:

  • Convert characters to upper or lower case
  • Determine whether a given character is of a certain type, i.e. is the character a letter, a number, a punctuation mark, etc.

TCharacter uses the standards set forth by the Unicode Consortium.

Developers can use the TCharacter class to do many things previously done with sets of chars. For instance, this code:

uses
  Character;

begin
  if MyChar in ['a'..'z', 'A'..'Z'] then
  begin
    ...
  end;
end;

can be easily replaced with

uses
  Character;

begin
  if TCharacter.IsLetter(MyChar) then
  begin
    ...
  end;
end;

The Character unit also contains a number of standalone functions that wrap up the functionality of each class function from TCharacter, so if you prefer a simple function call, the above can be written as:

uses
  Character;

begin
  if IsLetter(MyChar) then
  begin
    ...
  end;
end;

Thus the TCharacter class can be used for almost any manipulation or checking of characters that you might care to do.

In addition, TCharacter contains class methods to determine if a given character is a high or low surrogate of a surrogate pair.
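
As a rough illustration of the checks described above, the following console sketch counts letters with TCharacter.IsLetter and inspects a surrogate pair. It assumes Delphi 2009 or later; the program name and the CountLetters helper are made up for this example, and character codes are used instead of literal characters to avoid source-file encoding issues.

```delphi
program CharacterSketch;

{$APPTYPE CONSOLE}

uses
  Character;

// Hypothetical helper: counts the letters in S using Unicode
// classification instead of a hard-coded set such as ['a'..'z', 'A'..'Z'].
function CountLetters(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if TCharacter.IsLetter(S[I]) then
      Inc(Result);
end;

var
  Clef: string;
begin
  // #$0416 (Cyrillic Zhe) and #$00E9 (e with acute) are letters even
  // though they fall outside 'a'..'z'; '2' and '!' are not.
  Writeln(CountLetters(#$0416 + '2' + #$00E9 + '!'));  // 2

  // U+1D11E (musical G clef) lies outside the Basic Multilingual Plane,
  // so UTF-16 stores it as a high surrogate followed by a low surrogate.
  Clef := TCharacter.ConvertFromUtf32($1D11E);
  Writeln(Length(Clef));                         // 2
  Writeln(TCharacter.IsHighSurrogate(Clef[1]));  // TRUE
  Writeln(TCharacter.IsLowSurrogate(Clef[2]));   // TRUE
end.
```

The same calls are available as standalone functions (IsLetter, IsHighSurrogate, and so on) if you prefer not to go through the class.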

TEncoding Class

The Tiburon RTL also includes a new class called TEncoding. Its purpose is to define a specific type of character encoding so that you can tell the VCL what type of encoding you want used in specific situations.

For instance, you may have a TStringList instance that contains text that you want to write out to a file. Previously, you would have written:

begin
  ...
  MyStringList.SaveToFile('SomeFilename.txt');
  ...
end; 

and the file would have been written out using the default ANSI encoding. That code will still work fine: it will write out the file using ANSI string encoding as it always has. Now that Delphi supports Unicode string data, however, developers may want to write out string data using a specific encoding. Thus, SaveToFile (as well as LoadFromFile) now takes an optional second parameter that defines the encoding to be used:

begin
  ...
  MyStringList.SaveToFile('SomeFilename.txt', TEncoding.Unicode);
  ...
end; 

Execute the above code and the file will be written out as a Unicode (UTF-16) encoded text file.

TEncoding will also convert a given set of bytes from one encoding to another, retrieve information about the bytes and/or characters in a given string or array of characters, convert any string into an array of bytes (TBytes), and provide other functionality that you may need with regard to the specific encoding of a given string or array of chars.
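
The conversions just described might look like the following sketch. It assumes the TEncoding class from SysUtils as shipped in Delphi 2009; the program name and the Utf16ToUtf8 helper are invented for illustration.

```delphi
program EncodingSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: re-encodes UTF-16 bytes as UTF-8 bytes.
function Utf16ToUtf8(const Source: TBytes): TBytes;
begin
  Result := TEncoding.Convert(TEncoding.Unicode, TEncoding.UTF8, Source);
end;

var
  S: string;
  Utf16Bytes, Utf8Bytes: TBytes;
begin
  // "cafe" with an e-acute (U+00E9), written as a character code.
  S := 'caf' + #$00E9;

  Utf16Bytes := TEncoding.Unicode.GetBytes(S);
  Utf8Bytes := Utf16ToUtf8(Utf16Bytes);

  Writeln(Length(Utf16Bytes));              // 8: four elements, two bytes each
  Writeln(Length(Utf8Bytes));               // 5: U+00E9 needs two bytes in UTF-8
  Writeln(TEncoding.UTF8.GetByteCount(S));  // 5, without building the array

  // Decoding the UTF-8 bytes yields the original string again.
  Writeln(TEncoding.UTF8.GetString(Utf8Bytes) = S);  // TRUE
end.
```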

The TEncoding class includes the following class properties that give you singleton access to a TEncoding instance of the given encoding:

    class property ASCII: TEncoding read GetASCII;
    class property BigEndianUnicode: TEncoding read GetBigEndianUnicode;
    class property Default: TEncoding read GetDefault;
    class property Unicode: TEncoding read GetUnicode;
    class property UTF7: TEncoding read GetUTF7;
    class property UTF8: TEncoding read GetUTF8;

The Default property refers to the active ANSI code page. The Unicode property refers to UTF-16.

TEncoding also includes the

class function TEncoding.GetEncoding(CodePage: Integer): TEncoding;

that will return an instance of TEncoding that has an affinity for the code page passed in the parameter.

In addition, it includes the following function:

function GetPreamble: TBytes;

which will return the correct BOM for the given encoding.
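
A small sketch of both members follows. The BytesToHex helper and the program name are hypothetical, and the example assumes that code page 1251 is installed on the machine. It also assumes that an instance obtained from GetEncoding, unlike the singleton class properties, is owned by the caller and should be freed.

```delphi
program PreambleSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: formats a byte array as space-separated hex.
function BytesToHex(const Bytes: TBytes): string;
var
  I: Integer;
begin
  Result := '';
  for I := 0 to High(Bytes) do
  begin
    if I > 0 then
      Result := Result + ' ';
    Result := Result + IntToHex(Bytes[I], 2);
  end;
end;

var
  Cyrillic: TEncoding;
begin
  Writeln('UTF-8:     ', BytesToHex(TEncoding.UTF8.GetPreamble));              // EF BB BF
  Writeln('UTF-16 LE: ', BytesToHex(TEncoding.Unicode.GetPreamble));           // FF FE
  Writeln('UTF-16 BE: ', BytesToHex(TEncoding.BigEndianUnicode.GetPreamble));  // FE FF

  // Assumption: the caller owns the instance returned by GetEncoding.
  Cyrillic := TEncoding.GetEncoding(1251);
  try
    // Windows-1251 stores each Cyrillic letter (here U+0416) in one byte.
    Writeln(Length(Cyrillic.GetBytes(#$0416)));  // 1
  finally
    Cyrillic.Free;
  end;
end.
```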

TEncoding is also interface compatible with the .NET class called Encoding.

TStringBuilder

The RTL now includes a class called TStringBuilder. Its purpose is revealed in its name: it is a class designed to “build up” strings. TStringBuilder contains a number of overloaded functions for adding, replacing, and inserting content into a given string. The string builder class makes it easy to create single strings out of a variety of different data types. All of the Append, Insert, and Replace functions return the TStringBuilder instance itself, so they can easily be chained together to create a single string.

For example, you might choose to use a TStringBuilder in place of a complicated Format statement, and write the following code:

procedure TForm86.Button2Click(Sender: TObject);
var
  MyStringBuilder: TStringBuilder;
  Price: double;
begin
  MyStringBuilder := TStringBuilder.Create('');
  try
    Price := 1.49;
    Label1.Caption := MyStringBuilder.Append('The apples are $').Append(Price).
      Append(' a pound.').ToString;
  finally
    MyStringBuilder.Free;
  end;
end;

TStringBuilder is also interface compatible with the .NET class called StringBuilder.
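
Beyond Append, the Insert and Replace methods chain in the same way. The sketch below is only an illustration under the assumption of Delphi 2009's TStringBuilder in SysUtils; JoinNumbers and the program name are made-up names.

```delphi
program StringBuilderSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: builds a comma-separated list from integers.
function JoinNumbers(const Values: array of Integer): string;
var
  SB: TStringBuilder;
  I: Integer;
begin
  SB := TStringBuilder.Create;
  try
    for I := Low(Values) to High(Values) do
    begin
      if I > Low(Values) then
        SB.Append(', ');
      SB.Append(Values[I]);  // the Integer overload formats the number
    end;
    Result := SB.ToString;
  finally
    SB.Free;
  end;
end;

var
  SB: TStringBuilder;
begin
  Writeln(JoinNumbers([1, 2, 3]));  // 1, 2, 3

  SB := TStringBuilder.Create('apples');
  try
    // Insert takes a zero-based index; Replace swaps every occurrence.
    SB.Insert(0, 'Green ').Replace('apples', 'pears').Append('!');
    Writeln(SB.ToString);  // Green pears!
  finally
    SB.Free;
  end;
end.
```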

Declaring New String Types

Tiburon’s compiler enables you to declare your own string type with an affinity for a given code page. Many code pages are available. (MSDN has a nice rundown of available code pages.) For instance, if you require a string type with an affinity for ANSI-Cyrillic, you can declare:

type
  // The code page for ANSI-Cyrillic is 1251
  CyrillicString = type AnsiString(1251);

The new CyrillicString type is a string type with an affinity for the Cyrillic code page.
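
To see what that affinity means in practice, the following sketch round-trips text through such a type. The FitsInCyrillic helper and the program name are hypothetical, and the example assumes that the conversion tables for code page 1251 are available on the machine.

```delphi
program CyrillicSketch;

{$APPTYPE CONSOLE}

type
  // The code page for ANSI-Cyrillic is 1251
  CyrillicString = type AnsiString(1251);

// Hypothetical helper: True when S survives a round trip through
// code page 1251, i.e. every character has a mapping there.
function FitsInCyrillic(const S: UnicodeString): Boolean;
var
  C: CyrillicString;
begin
  C := CyrillicString(S);          // converts UTF-16 to code page 1251
  Result := UnicodeString(C) = S;  // converts back and compares
end;

begin
  // Two Cyrillic letters (U+0414, U+0430) map cleanly to 1251.
  Writeln(FitsInCyrillic(#$0414#$0430));  // TRUE
  // A CJK ideograph (U+4E2D) has no mapping and is replaced on conversion.
  Writeln(FitsInCyrillic(#$4E2D));        // FALSE
end.
```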

Additional RTL Support for Unicode

The RTL adds a number of routines that support the use of Unicode strings.

StringElementSize

StringElementSize returns the size in bytes of a single element (code unit) of a given string. Note that a single Unicode code point may occupy more than one element. Consider the following code:

procedure TForm88.Button3Click(Sender: TObject);
var
  A: AnsiString;
  U: UnicodeString;
begin
  A := 'This is an AnsiString';
  Memo1.Lines.Add('The ElementSize for an AnsiString is: ' + IntToStr(StringElementSize(A)));
  U := 'This is a UnicodeString';
  Memo1.Lines.Add('The ElementSize for a UnicodeString is: ' + IntToStr(StringElementSize(U)));
end;

The result of the code above will be:

The ElementSize for an AnsiString is: 1
The ElementSize for a UnicodeString is: 2

StringCodePage

StringCodePage will return the Word value that corresponds to the codepage for a given string.

Consider the following code:

procedure TForm88.Button2Click(Sender: TObject);
type
  // The code page for ANSI-Cyrillic is 1251
  CyrillicString = type AnsiString(1251);
var
  A: AnsiString;
  U: UnicodeString;
  U8: UTF8String;
  C: CyrillicString;
begin
  A := 'This is an AnsiString';
  Memo1.Lines.Add('AnsiString Codepage: ' + IntToStr(StringCodePage(A)));
  U := 'This is a UnicodeString';
  Memo1.Lines.Add('UnicodeString Codepage: ' + IntToStr(StringCodePage(U)));
  U8 := 'This is a UTF8string';
  Memo1.Lines.Add('UTF8string Codepage: ' + IntToStr(StringCodePage(U8)));
  C := 'This is a CyrillicString';
  Memo1.Lines.Add('CyrillicString Codepage: ' + IntToStr(StringCodePage(C)));
end;

On a system whose active ANSI code page is 1252, the above code will result in the following output:

AnsiString Codepage: 1252
UnicodeString Codepage: 1200
UTF8string Codepage: 65001
CyrillicString Codepage: 1251

Other RTL Features for Unicode

There are a number of other routines for converting strings from one code page or encoding to another, including:

UnicodeStringToUCS4String
UCS4StringToUnicodeString
UnicodeToUtf8
Utf8ToUnicode
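
As one possible use of the UCS4 routines listed above, the sketch below counts code points by converting to UCS4, where each element holds one full code point. CodePointCount and the program name are hypothetical; the sketch relies on UCS4String being a dynamic array that ends with a zero terminator.

```delphi
program CodePointSketch;

{$APPTYPE CONSOLE}

// Hypothetical helper: counts Unicode code points in S.
function CodePointCount(const S: UnicodeString): Integer;
begin
  // The UCS4String result carries a trailing zero element, hence "- 1".
  Result := Length(UnicodeStringToUCS4String(S)) - 1;
end;

var
  U: UnicodeString;
  U4: UCS4String;
begin
  U := 'Delphi';
  Writeln(CodePointCount(U));  // 6

  // Converting to UCS4 and back returns the original text.
  U4 := UnicodeStringToUCS4String(U);
  Writeln(UCS4StringToUnicodeString(U4) = U);  // TRUE
end.
```

For text containing surrogate pairs, compare the result with Length(S), which counts UTF-16 elements rather than code points.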

In addition, the RTL declares a type called RawByteString, which is a string type with no encoding associated with it:

  RawByteString = type AnsiString($FFFF);

The purpose of the RawByteString type is to enable the passing of string data of any code page without doing any code page conversions. This is most useful for routines that do not care about a specific encoding, such as byte-oriented string searches. Normally, this means that parameters of routines that process strings without regard for the string’s code page should be of type RawByteString. Declaring variables of type RawByteString should rarely, if ever, be done, as this can lead to undefined behavior and potential data loss.
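
A byte-oriented routine of that kind could be sketched as follows. CountByte and the program name are invented for this example; the point is only that the parameter, not a variable, is typed as RawByteString.

```delphi
program RawByteSketch;

{$APPTYPE CONSOLE}

// Hypothetical helper: counts how often a byte value occurs in S.
// Because the parameter is a RawByteString, the caller's bytes arrive
// without code page conversion, whatever AnsiString-based type is passed.
function CountByte(const S: RawByteString; Value: Byte): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if Ord(S[I]) = Value then
      Inc(Result);
end;

var
  U8: UTF8String;
begin
  // "cafe" with an e-acute; U+00E9 is encoded as the bytes C3 A9 in UTF-8.
  U8 := 'caf'#$00E9;
  Writeln(Length(U8));          // 5
  Writeln(CountByte(U8, $C3));  // 1
  Writeln(StringCodePage(U8));  // 65001
end.
```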

In general, string types are assignment compatible with each other.

For instance:

MyUnicodeString := MyAnsiString;

will perform as expected – it will take the contents of the AnsiString and place them into a UnicodeString. You should in general be able to assign one string type to another, and the compiler will do the work needed to make the conversions, if possible.

Some conversions, however, can result in data loss, and one must watch out for this when moving from a string type that includes Unicode data to one that does not. For instance, you can assign a UnicodeString to an AnsiString, but if the UnicodeString contains characters that have no mapping in the active ANSI code page at runtime, those characters will be lost in the conversion. Consider the following code:

procedure TForm88.Button4Click(Sender: TObject);
var
  U: UnicodeString;
  A: AnsiString;
begin
  U := 'This is a UnicodeString';
  A := U;
  Memo1.Lines.Add(A);
  U := 'Добро пожаловать в мир Юникода с использованием Дельфи 2009!!';
  A := U;
  Memo1.Lines.Add(A);
end;

The output of the above when the current OS code page is 1252 is:

This is a UnicodeString
????? ?????????? ? ??? ??????? ? ?????????????? ?????? 2009!!

As you can see, because Cyrillic characters have no mapping in Windows-1252, information was lost when assigning this UnicodeString to an AnsiString. The UnicodeString contained characters not representable in the code page of the AnsiString, so those characters were lost and replaced by question marks in the assignment.

SetCodePage

SetCodePage, declared in the System.pas unit as

procedure SetCodePage(var S: RawByteString; CodePage: Word; Convert: Boolean = True);

is a new RTL procedure that sets a new code page for a given AnsiString. The optional Convert parameter determines whether the payload of the string itself should be converted to the given code page. If Convert is False, only the code page recorded for the string is altered and the payload is left untouched. If Convert is True, the payload of the passed string is converted to the given code page.

SetCodePage should be used sparingly and with great care. Note that if the code page doesn’t actually match the existing payload (i.e. Convert is set to False), unpredictable results can occur. Also, if the existing data in the string is converted and the new code page doesn’t have a representation for a given original character, data loss can occur.
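
One case where Convert = False is appropriate is when you know the bytes are already in a particular code page but the string carries a different label. The sketch below is an illustration only: DecodeAs1251 and the program name are hypothetical, and it assumes that the shipping RTL declares the first parameter as var RawByteString, which is why the AnsiString variable is cast.

```delphi
program SetCodePageSketch;

{$APPTYPE CONSOLE}

// Hypothetical helper: treats the bytes in Raw as Windows-1251 text
// and returns them as a UnicodeString.
function DecodeAs1251(const Raw: AnsiString): UnicodeString;
var
  Temp: AnsiString;
begin
  Temp := Raw;
  // Convert = False: relabel the payload instead of converting it.
  // Assumption: the parameter is declared as var RawByteString.
  SetCodePage(RawByteString(Temp), 1251, False);
  // The conversion to UTF-16 now interprets the bytes as code page 1251.
  Result := UnicodeString(Temp);
end;

var
  A: AnsiString;
begin
  SetLength(A, 1);
  A[1] := AnsiChar($C6);  // byte C6 is Cyrillic Zhe (U+0416) in Windows-1251
  Writeln(DecodeAs1251(A) = #$0416);  // TRUE
end.
```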

Getting TBytes from Strings

The RTL also includes a set of overloaded routines for extracting an array of bytes from a string. As we’ll see in Part III, it is recommended that you use TBytes rather than string as a data buffer. The RTL makes this easy by providing overloaded versions of BytesOf() that take the different string types as a parameter.
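
A brief sketch of these routines follows. It assumes the SysUtils unit of Delphi 2009, including the companion routines WideBytesOf and StringOf, which are not mentioned above; the Utf16ByteLength helper and the program name are made up for illustration.

```delphi
program BytesOfSketch;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: size of the UTF-16 payload of S in bytes.
function Utf16ByteLength(const S: UnicodeString): Integer;
begin
  Result := Length(WideBytesOf(S));
end;

var
  A: AnsiString;
  U: UnicodeString;
  Buffer: TBytes;
begin
  A := 'abc';
  U := 'abc';

  Buffer := BytesOf(A);          // the AnsiString payload, byte for byte
  Writeln(Length(Buffer));       // 3

  Buffer := BytesOf(U);          // converted to the default ANSI encoding
  Writeln(Length(Buffer));       // 3

  Writeln(Utf16ByteLength(U));   // 6: two bytes per UTF-16 element

  // StringOf reverses BytesOf for text the ANSI code page can represent.
  Writeln(StringOf(BytesOf(U)) = U);  // TRUE
end.
```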

Conclusion

Tiburon’s Runtime Library is now completely capable of supporting the new UnicodeString. It includes new classes and routines for handling, processing, and converting Unicode strings, for managing codepages, and for ensuring an easy migration from earlier versions.

In Part III, we’ll cover the specific code constructs that you’ll need to look out for in ensuring that your code is Unicode ready.


Published on: 8/21/2008 1:14:28 PM
