Delphi in a Unicode World Part I: What is Unicode, Why do you need it, and How do you work with it in Delphi?
By: Nick Hodges
Abstract: This article discusses Unicode, how Delphi developers can benefit from using Unicode, and how Unicode will be implemented in Delphi 2009.
In This Article
The Internet has broken down geographical barriers that enable world-wide software distribution. As a result, applications can no longer live in a purely ANSI-based environment. The world has embraced Unicode as the standard means of transferring text and data. Since it provides support for virtually any writing system in the world, Unicode text is now the norm throughout the global technological ecosystem.
What is Unicode?
Unicode is a character encoding scheme that allows virtually all alphabets to be encoded into a single character set. Unicode allows computers to manage and represent text most of the world’s writing systems. Unicode is managed by The Unicode Consortium and codified in a standard. More simply put, Unicode is a system for enabling everyone to use each other’s alphabets. Heck, there is even a Unicode version of Klingon.
This series of articles isn’t meant to give you a full rundown of exactly what Unicode is and how it works; instead it is meant to get you going on using Unicode within Delphi 2009. If you want a good overview of Unicode, Joel Spolsky has a great article entitled “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” which is highly recommended reading. As Joel clearly points out “IT’S NOT THAT HARD”. This article, Part I of III, will discuss why Unicode is important, and how Delphi will implement the new UnicodeString type.
Among the many new features found in Delphi 2009 is the imbuing of Unicode throughout the product. The default string in Delphi is now a Unicode-based string. Since Delphi is largely built with Delphi, the IDE, the compiler, the RTL, and the VCL all are fully Unicode-enabled.
The move to Unicode in Delphi is a natural one. Windows itself is fully Unicode-aware, so it is only natural that applications built for it, use a Unicode string as the default string. And for Delphi developers, the benefits don’t stop merely at being able to use the same string type as Windows.
The addition of Unicode support provides Delphi developers with a great opportunity. Delphi developers now can read, write, accept, produce, display, and deal with Unicode data – and it’s all built right into the product. With only few, or in some cases to zero code changes, your applications can be ready for any kind of data you, your customers or your users can throw at it. Applications that previously restricted to ANSI encoded data can be easily modified to handle almost any character set in the world.
Delphi developers will now be able to serve a global market with their applications -- even if they don’t do anything special to localize or internationalize their applications. Windows itself supports many different localized versions, and Delphi applications need to be able to adapt and work on machines running any of the large number of locales that Windows supports, including the Japanese, Chinese, Greek, or Russian versions of Windows. Users of your software may be entering non-ANSI text into your application or using non-ANSI based path names. ANSI-based applications won’t always work as desired in those scenarios. Windows applications built with a fully Unicode-enabled Delphi will be able to handle and work in those situations. Even if you don’t translate your application into any other spoken languages, your application still needs to be able to work properly -- no matter what the end user’s locale is.
For existing ANSI-based Delphi applications, then opportunity to localize applications and expand the reach of those applications into Unicode-based markets is potentially huge. And if you do want to localize your applications, Delphi makes that very easy, especially now at design-time. The Integrated Translation Environment (ITE) enables you to translate, compile, and deploy an application right in the IDE. If you require external translation services, the IDE can export your project in a form that translators can use in conjunction with the deployable External Translation Manager. These tools work together with the Delphi IDE for both Delphi and C++Builder to make localizing your applications a smooth and easy to manage process.
The world is Unicode-based, and now Delphi developers can be a part of that in a native, organic way. So if you want to be able to handle Unicode data, or if you want to sell your applications to emerging and global markets, you can do it with Delphi 2009.
A Word about Terminology
Unicode encourages the use of some new terms. For instance the idea of “character” is a bit less precise in the world of Unicode than you might be used to. In Unicode, the more precise term is “code point”. In Delphi 2009, the SizeOf(Char) is 2, but even that doesn’t tell the whole story. Depending on the encoding, it is possible for a given character to take up more than two bytes. These sequences are called “Surrogate Pairs”. So a code point is a unique code assigned an element defined by the Unicode.org. Most commonly that is a “character”, but not always.
Another term you will see in relation to Unicode is “BOM”, or Byte Order Mark, and that is a very short prefix used at the beginning of a text file to indicate the type of encoding used for that text file. MSDN has a nice article on what a BOM is. The new TEncoding Class (to be discussed in Part II) has a class method called GetPreamble which returns the BOM for a given encoding.
Now that all that has been explained, we’ll look at how Delphi 2009 implements a Unicode-based string.
The New UnicodeString Type
The default string in Delphi 2009 is the new UnicodeString type. By default, the UnicodeString type will have an affinity for UTF-16, the same encoding used by Windows. This is a change from previous versions which had AnsiString as the default type. The Delphi RTL has in the past included the WideString type to handle Unicode data, but this type is not reference-counted as the AnsiString type is, and thus isn’t as full-featured as Delphi developers expect the default string to be.
For Delphi 2009, a new UnicodeString type has been designed, that incorporates the capabilities of both the AnsiString and WideString types. A UnicodeString can contain either a Unicode-sized character, or an ANSI byte-sized character. (Note that both the AnsiString and WideString types will remain in place.) The Char and PChar types will map to WideChar and PWideChar, respectively. Note, as well, that no string types have disappeared. All the types that developers are used to still exist and work as before.
However, for Delphi 2009, the default string type will be equivalent to UnicodeString. In addition, the default Char type is WideChar, and the default PChar type is PWideChar.
That is, the following code is declared by the compiler:
string = UnicodeString;
Char = WideChar;
PChar = PWideChar;
UnicodeString is assignment compatible with all other string types; however, assignments between AnsiStrings and UnicodeStrings will do type conversions as appropriate. Thus, an assignment of a UnicodeString type to an AnsiString type could result in data-loss. That is, if a UnicodeString contains high-order byte data, a conversion of that string to AnsiString will result in a loss of that high-order byte data.
The important thing to note here is that this new UnicodeString behaves pretty much like strings always have (with the notable exception of their ability to hold Unicode data, of course). You can still add any string data to them, you can index them, you can concatenate them with the ‘+’ sign, etc.
For example, instances of a UnicodeString will still be able to index characters. Consider the following code:
MyString := ‘This
MyChar := MyString;
The variable MyChar will still hold the character found at the first index position, i.e. ‘T’. This functionality of this code hasn’t changed at all. Similarly, if we are handling Unicode data:
MyString := ‘世界您好‘;
MyChar := MyString;
The variable MyChar will still hold the character found at the first index position, i.e. ‘世’.
The RTL provides helper functions that enable users to do explicit conversions between codepages and element size conversions. If the user is using the Move function on the character array, they cannot make assumptions about the element size.
As you can imagine, this new string type has ramifications for existing code. With Unicode, it is no longer true that one Char represents one Byte. In fact, it isn’t even always true that one Char is equal to two bytes! As a result, you may have to make some adjustments to your code. However, we’ve worked very hard to make the transition a smooth one, and we are confident that you’ll be able to be up and running quite quickly. Parts II and III of this series will discuss further the new UnicodeString type, talk about some of the new features of the RTL that support Unicode enablement, and then discuss specific coding idioms that you’ll want to look for in your code. This series should help make your transition to Unicode a smooth and painless endeavor.
With the addition of Unicode as the default string, Delphi can accept, process, and display virtually any alphabet or code page in the world. Applications you build with Delphi 2009 will be able to accept, display, and handle Unicode text with ease, and they will work much better in almost any Windows locale. Delphi developers can now easily localize and translate their applications to enter markets that they have previously been more difficult to enter. It’s a Unicode world out there, and now your Delphi apps can live in it.
In Part II, we’ll discuss the changes and updates to the Delphi Runtime Library that will enable you to work easily with Unicode strings.
Published on: 8/8/2008 5:08:11 PM
Server Response from: ETNASC04