Prepare for Kylix: The Compiler and RTL

By: Danny Thorpe

Abstract: Danny Thorpe provides some technical details of the ongoing work on Delphi and C++Builder for Linux.

by Danny Thorpe, Delphi R&D

What is Kylix?

Recent articles and Borland/Inprise press announcements have stirred up a lot of questions lately. Perhaps the question on many Delphi and C++Builder developers' lips is "Just what exactly is Kylix?" We diligently refer them to the Borland/Inprise Kylix press announcement and the Kylix Q&A article, but invariably the response is "Yes, yes, that's all very nice, but that's marketing stuff. Where's the real info? What will the nuts and bolts look like? What will port, what won't?"

This is the first in a series of Borland community articles intended to brief you, the Delphi and C++Builder developer, on Kylix technical bits that you need to be aware of to prepare yourself and your code for possible porting or migration to the Linux universe. This is a briefing of directions, issues, and solutions currently under evaluation, not a technical specification cast in stone.

Disclaimer: This article describes features of software products that are in development and subject to change without notice. Description of such features here is speculative and does not constitute a binding contract or commitment of service.

C++ coders will probably notice that these first few articles talk almost exclusively about Delphi things. C++Builder and Delphi are Siamese twins - where one goes, the other usually follows. Sometimes Delphi leads with new technology features that appear in the next C++Builder release, sometimes C++Builder leads with features that appear in the next Delphi release. The Kylix project encompasses both Delphi and C++Builder tools for the Linux platform. Right now, the plan is that Delphi for the Linux platform will be the first product produced by the Kylix project. C++Builder will follow after we get Delphi out the door.

This article is required reading for Kylix Kickstart attendees. (There will be a test!)

The Nuts and Bolts

Ok, enough prep verbiage. Let's get down to business. This article will outline what's new, what's different, and what's out of the Object Pascal language, compiler/linker, and Run Time Library (RTL) in Kylix compared to the current Delphi 5 product for Windows. I'm not going to talk about VCL, or the IDE, or anything else. You'll just have to keep an eye on the Borland community site headlines to catch articles on those topics later on.

Command Line Tools

What's New?
  • DCC as a native Linux executable. Whee!
  • All-new built-in assembler, written in portable code. The old built-in assembler was written in TASM. (See next section)
  • DCC produces native Linux x86 executables. Ooh-ah. (Ok, so this item should be painfully obvious but it had to be said or some joker would claim we weren't doing native executables.)
What's Out?
  • tasm. We will not port tasm, Borland's external x86 assembler, to the Linux platform. If you need an external assembler, use Linux's GNU as assembler or the nasm assembler. Note, however, that neither of these supports the TASM Intel assembler syntax, so none of your existing .ASM files will assemble unchanged with either one.

    Your porting options for existing .ASM code are:

    1. Rewrite in Object Pascal or C. Long-term, this is your most portable option (see also PIC later in this article).
    2. Migrate the external .ASM code into Object Pascal inline assembler code blocks. You lose all macros and other TASM shortcuts, but at least it's not a rewrite.
    3. Tweak the external .ASM code into Microsoft masm syntax. nasm supports the basic masm syntax (parameter ordering, opcode mnemonics, etc.), but not all the macro bells and whistles. tasm supports masm syntax too, so you might find a common ground between the platforms.
    4. Rewrite in nasm. When it absolutely, positively has to be out of sight. (obfuscation humor)

  • make. We will not port Borland's make utility to the Linux platform. Instead, you can use Linux's own GNU make utility. The syntaxes supported by these make utilities are almost compatible. <snicker>
  • brcc. We don't currently plan to port Borland's resource script compiler to Linux. We will support the notion of resources in Linux executables and binary resource files to link into projects, but if we provide some sort of text script to binary res file compiler it will be extremely simplistic at best. The leading proposal right now is to use some simple, well-known text format (perhaps INI file format, with resourcename=filename pairs) to just glom together a set of binary files and tack a resource header on the front. We don't need a C++ preprocessor (like BRCC / Microsoft's RC) just to concatenate binary chunks with a resource header - a shell script might suffice.
What's Different?
  • Command line switches. The '/' character is a path separator in Linux, so '/' is no longer a valid command line switch delimiter. Use '-' instead.
  • Semicolon is not a path separator in Linux - colon is. You can't list multiple search paths in a single -u switch using semicolons to separate the paths, for example. Use multiple -u switches, one search path each. The DCC32 compiler on your machine already concatenates multiple -u paths into one search path internally. We might update the compiler to support ':' as a path separator, but don't count on it.
  • Resource restrictions. The details have not been finalized yet, but it is likely that Delphi's embedding of resource data in Linux ELF executables will carry restrictions not found in Windows PE executables. For example, "rip and replace" of resources may be limited to changing only the data of the resource items, but not the total number of resource items nor the names of the resource items. So, using "rip and replace" without relinking your executable, you can replace a bitmap image with a smaller version or a larger version, but you cannot change its resource name or delete the bitmap from the executable entirely or add a new bitmap to the executable file. If you need to change names, add or delete resource items, you will have to relink the application from scratch.
  • Resource introspection, or the enumeration and discovery of resources at runtime, will not be supported. 99% of all Delphi and C++Builder applications know the names of all their resources at compile time. All your DFM forms are resource chunks, and the names of the resource chunks match the class names of the form objects.
  • Resource editing tool opportunities will be limited or more difficult than for Windows executables, for all the reasons listed above.
  • Position Independent Code (PIC). Linux shared object libraries (DLL equivalents) require that all code be relocatable in memory without modification. This has no tangible effect on Pascal source code, but you will have to tweak any inline assembler code that refers to global variables or other absolute addresses.

    PIC rules for inline assembler code:

    1. PIC requires all memory references be made relative to the EBX register, which contains the current module's base address pointer (Linux term: Global Offset Table, or GOT). So, instead of MOV EAX,GlobalVar you would use MOV EAX,[EBX].GlobalVar
    2. PIC requires that you preserve the EBX register not only across calls into your asm code (same as Win32), but also restore the EBX register before making calls to external functions (different from Win32).
    3. While PIC code will work in base executables, it won't be performant. You don't have any choice in shared objects, but in exes you probably still want to get as much performance as you can. You'll probably have to do what we're doing in the RTL, which is {$IFDEF PIC} your asm code for PIC and non-PIC codegen. (or rewrite the routine in Pascal and forget about it)
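A minimal sketch of rule 3, assuming the compiler defines a PIC conditional symbol when generating position independent code, as the RTL approach above implies. The routine keeps its hand-written assembler for non-PIC 32-bit x86 builds and falls back to plain Pascal everywhere else. The names Counter, ReadCounter and USE_PASCAL_BODY are hypothetical, and the [EBX]-relative form appears only as a comment because the final PIC conventions are not settled.

```pascal
program PicSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFNDEF LINUX}{$APPTYPE CONSOLE}{$ENDIF}

// Use the Pascal body when building PIC, or on any non-x86 target.
// CPU386 is the symbol 32-bit x86 compilers define.
{$IFDEF PIC}
  {$DEFINE USE_PASCAL_BODY}
{$ENDIF}
{$IFNDEF CPU386}
  {$DEFINE USE_PASCAL_BODY}
{$ENDIF}

var
  Counter: Integer = 42;  // hypothetical global variable

function ReadCounter: Integer;
{$IFDEF USE_PASCAL_BODY}
begin
  // The compiler generates whatever addressing the target needs.
  // A hand-written PIC version would, per rule 1, use something like
  //   MOV EAX,[EBX].Counter
  Result := Counter;
end;
{$ELSE}
asm
  // Absolute reference to a global: acceptable in a non-PIC exe only.
  MOV EAX, Counter
end;
{$ENDIF}

begin
  WriteLn(ReadCounter);
end.
```

Both branches return the same value, so callers never need to know which body was compiled. Keeping the Pascal body as the PIC path also sidesteps rule 2 entirely, since the compiler takes care of EBX.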

Language Syntax

There won't be a lot of changes to the Object Pascal language syntax. Things that are commonly mistaken as Windows-isms, such as Delphi's interface and GUID types, exist just fine in Kylix. A few things that do rely heavily on Windows implementation and have no equivalent in the Linux OS, such as Variants and resources, will be reimplemented in Kylix.
What's New?
  • Expression evaluation in conditional defines, including access to declared constants:
    {$IF Defined(SomeSymbol) and (SomeConstant < 11.0)}
    ...
    {$ELSE}
    ...
    {$ENDIF}

    Yes, Virginia, this can be used to check the compiler version with a single $IF expression. We even defined a new conditional symbol, CONDITIONALEXPRESSIONS, so you can hide the new $IF from the old compilers in source code that needs to compile everywhere. Note to self: when vacationing in Australia, leave the laptop in California...
  • Pascal Library modules and packages compile to Linux Shared Object (.so) libraries. .so is the Linux equivalent of the Windows .DLL.
  • The conditional symbol LINUX is now defined, indicating the source code is being compiled for the Linux platform.
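Here is a small sketch of how CONDITIONALEXPRESSIONS, $IF and the LINUX symbol might be combined in source that must also compile with older compilers. FeatureLevel, USE_LINUX_PATH and TargetName are hypothetical names. Two assumptions to flag: the sketch closes $IF with {$IFEND}, the spelling accepted by the compilers that eventually shipped with $IF, rather than the preliminary {$ENDIF} shown above; and it avoids {$ELSE} inside the hidden block, because an old compiler skipping the outer $IFDEF would pair that {$ELSE} with the $IFDEF.

```pascal
program CondExprSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFNDEF LINUX}{$APPTYPE CONSOLE}{$ENDIF}

const
  FeatureLevel = 2;  // hypothetical declared constant

// Compilers without $IF never define CONDITIONALEXPRESSIONS,
// so they skip everything up to the matching $ENDIF.
{$IFDEF CONDITIONALEXPRESSIONS}
  {$IF Defined(LINUX) and (FeatureLevel >= 2)}
    {$DEFINE USE_LINUX_PATH}
  {$IFEND}
{$ENDIF}

function TargetName: string;
begin
{$IFDEF USE_LINUX_PATH}
  Result := 'Linux, feature level 2 or later';
{$ELSE}
  Result := 'other platform or older compiler';
{$ENDIF}
end;

begin
  WriteLn(TargetName);
end.
```

Turning the $IF result into a plain symbol with $DEFINE keeps the rest of the source on ordinary $IFDEF tests that every compiler version understands.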
What's Out?
  • Variables on absolute addresses. The syntax var X: Integer absolute $1234; cannot be supported in Position Independent Code and will most likely be thrown out entirely. Using absolute to overlay one variable on top of another variable should not be affected, but it will still earn you some well-deserved ugly looks from your fellow coders.
  • The conditional symbol WIN32 is not defined in Kylix.
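For contrast with the absolute-address form, here is a minimal sketch of the overlay form of absolute, the one that should not be affected. LowByteOf and its local names are hypothetical, and the byte order noted in the comments assumes a little-endian x86 target.

```pascal
program AbsoluteSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFNDEF LINUX}{$APPTYPE CONSOLE}{$ENDIF}

// Return the first byte in memory of a 32-bit value by overlaying
// a byte array on top of a local variable.
function LowByteOf(V: Cardinal): Byte;
var
  Local: Cardinal;
  // Overlay: Bytes occupies the same storage as Local.
  // No fixed address is involved, so PIC is not an issue.
  Bytes: array[0..3] of Byte absolute Local;
begin
  Local := V;
  Result := Bytes[0];  // low-order byte on little-endian x86
end;

begin
  WriteLn(LowByteOf($01020304));
end.
```

The overlay refers to another variable rather than to a number, which is why it survives in position independent code while var X: Integer absolute $1234; does not.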
What's Different?

  • Stdcall calling convention will be mapped to cdecl. This should have no tangible effect on Pascal code, but will affect inline assembler code. Win32 STDCALL has the callee clean up the stack, but in CDECL the caller cleans up. If you have any stdcall routines implemented in inline assembler that don't exit through the normal procedure endpoint, or you have inline assembler code that calls stdcall routines, you'll have some tweaking to do.
  • Safecall calling convention will be mapped to cdecl. Safecall will lose all its special runtime semantics: no function result checking, no raising exceptions, and when implementing a safecall routine, no trapping of exceptions. Since this drastically changes the runtime behavior, we'll probably emit a compiler warning whenever your Kylix code calls or implements a safecall routine. It would be simpler to say Safecall doesn't exist in Kylix, but that would break too much existing code. Mapping Safecall to cdecl will allow most existing code to still run correctly, it just won't deal with exceptions the way the Win32 code does.

Run Time Library

What's New?

  • Portable Variant implementation. We've implemented Variant data transport and coercion in platform independent Object Pascal code. Only the variant data types listed as OLE Automation compatible on the Windows side have been implemented on the Linux side. Win32's 12-byte VT_DECIMAL will not be supported.
  • WideStrings are now reference counted. In Windows, the Delphi WideString is implemented as an OLE BSTR to maximize data compatibility with OLE and ActiveX APIs. OLE BSTRs / WideStrings are not reference counted like Delphi AnsiStrings, so WideStrings tend to be a bit promiscuous in copying themselves all over the place.

    In Linux, there is no WideString compatibility requirement or issue, so we've reimplemented WideStrings to use the same copy-on-write reference count semantics as AnsiStrings. In fact, Kylix WideStrings use many of the same internal RTL support functions as AnsiStrings! How's that for code reuse!

What's Out?
  • Units such as ComObj, ComServ, ActiveX, Windows, etc.
  • Safecall exceptions
  • RaiseLastWin32Error, OleCheck, Win32Check
  • ExpandUNCFileName. Linux doesn't support UNC paths (\\server\directory).
What's Different?
  • Filename case sensitivity. Applications that assume the file system is case insensitive (that is, the application alters the case of user input filenames or doesn't preserve the case of filenames discovered by FindFirst/FindNext) won't work. Period.

  • WideChar is (still) 2 byte Unicode. The Linux widechar type, wchar_t, is actually 4 bytes per character. 4 bytes!!! Ouch! The complete UCS specification calls for 4 bytes per character to ensure that there is enough room in the character set to adequately represent all known languages and texts, living and dead, and room for future expansion, such as planetary invasion by Vogons. It would be a shame if Earth's character set couldn't represent Vogon poetry in its true native iconographs.

    Anyway, nothing in the Linux kernel actually uses 4 byte widechars - the kernel expects strings (filenames and so forth) to be encoded in UTF-8. Delphi WideChar and WideString will remain 2 bytes per character Unicode, which just so happens to be a proper subset of the UCS-4 specification. How do you translate Unicode 2 byte chars to UCS 4 byte chars? Add two bytes of zeros in front.

  • AnsiStrings encoded as UTF-8. In Windows, AnsiStrings can carry multibyte character sequences, dependent upon the user's locale settings. The multibyte encodings for Japanese, Chinese, Hebrew, Arabic, and other locales are all different and usually incompatible. Linux appears to be standardizing on UTF-8, a multibyte encoding of the 4 bytes per char UCS character standard, as the dominant string data carrier.

    UTF-8 has the advantage that it can encode the entire UCS character standard across all known living languages and text systems, and UTF-8 is very easy to parse (unlike some Windows mbcs encodings). Linux does also have locales and code page character sets, so we have some reading to do yet to figure out how they mesh with UTF-8. At this time we're hopeful that we can use UTF-8 for all AnsiString data everywhere and make locale charsets and codepages a non-issue.

    One side effect of UTF-8, though, is that multibyte character sequences can be more than 2 bytes long. Most code in Windows (including parts of the Delphi RTL) assumes that mbcs character sequences are at most 2 bytes in length - a lead byte and a trail byte. I don't believe this two byte assumption would be a problem for any Western character sets, but some of the Eastern languages and perhaps mathematical symbol sets could spike up into the 3 byte UTF-8 range. In the interest of correctness, existing code that looks like

    if p^ in LeadBytes then Inc(p);

    should be modified to handle the possibility of one or more trail bytes following a lead byte. Techniques have yet to be determined.
  • Resource string efficiencies. In Windows, resource strings are stored in the executable file in Unicode format (2 bytes per char). Resource string data is copied into heap allocated memory as a WideString (OLE BSTR) each time the resource string is referenced at runtime.

    In Linux, Delphi resource strings will be encoded in UTF-8 (1 byte per char, usually) in the executable file. References to resource strings will resolve to point directly into the read-only resource section of the executable file mapped into memory by the program loader. No heap allocations, no data copying. It's just there.

  • File times in Unix format. The file time 32 bit integer in Delphi's FindFirst/FindNext TSearchRec and returned by functions such as FileAge and FileGetDate is a DOS packed time on Windows. On Linux, these will return a 32 bit integer in Unix time format. Comparing two such file time integers on the same platform to determine which file was modified more recently will still work fine. Code that unpacks the DOS time fields (say, to extract the year) will not work with the Unix file time integer.
  • DiskFree, DiskSize. Linux doesn't have drive letters. These functions will probably be altered or overloaded to accept a path string instead of a drive letter char. Or, these functions may disappear entirely. To be determined.
  • ExtractFileDrive. See above. We'll probably modify this function to always return an empty string in Kylix.
  • Path separator. Linux uses slash '/' to separate directory names in a path, not backslash '\'. If your code uses the SysUtils utility routines like IsPathDelimiter, IncludeTrailingBackslash and ExcludeTrailingBackslash and the ExtractFilePath family of functions that already exist in Delphi 5, you'll be insulated from the / versus \ platform differences. We'll also introduce a new constant, PathSeparator, which will contain the appropriate character for the platform.
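Returning to the lead byte issue raised under "AnsiStrings encoded as UTF-8": one possible technique, offered only as a sketch and not as what the RTL will do, relies on the fact that every UTF-8 trail byte has the bit pattern 10xxxxxx. Stepping over a character then means consuming the first byte plus every trail byte that follows it. PUtf8Byte, NextUtf8Char, Utf8CharCount and Sample are hypothetical names, and the code does not validate malformed sequences.

```pascal
program Utf8StepSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFNDEF LINUX}{$APPTYPE CONSOLE}{$ENDIF}

type
  PUtf8Byte = ^Byte;

// Advance P past one UTF-8 encoded character: the first byte plus
// every trail byte (bit pattern 10xxxxxx) that follows it.
procedure NextUtf8Char(var P: PUtf8Byte);
begin
  if P^ = 0 then Exit;            // never step over the terminator
  Inc(P);
  while (P^ and $C0) = $80 do
    Inc(P);
end;

// Count the characters in a zero-terminated UTF-8 buffer.
function Utf8CharCount(P: PUtf8Byte): Integer;
begin
  Result := 0;
  while P^ <> 0 do
  begin
    NextUtf8Char(P);
    Inc(Result);
  end;
end;

const
  // 'A' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), terminator
  Sample: array[0..6] of Byte = ($41, $C3, $A9, $E2, $82, $AC, $00);

begin
  WriteLn(Utf8CharCount(@Sample[0]));  // 3 characters in 6 bytes
end.
```

Because the loop tests each following byte instead of assuming exactly one trail byte, the same code handles 2, 3 and 4 byte sequences, and it stops at the terminator even if a sequence is truncated.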

Feedback

Obviously, publishing this information is not a one way street. We need feedback from the Delphi and C++Builder developer communities, as well as from the Linux community at large.

Just one small request: Don't send me email! There are a lot more of you than there are of me. You can attach comments to this article or post comments to the Borland public newsgroups. Responding to comments in a public forum is a much more effective use of Borland's resources than sending essentially the same response to several email queries. Email responses only educate one person at a time. Newsgroups and web posts educate thousands in one fell swoop.

I hope you find this Kylix Compiler and RTL briefing informative and helpful. Now, if you don't mind, I really need to get back to implementing this stuff!

--Danny Thorpe
Senior Engineer, Delphi R&D
Inprise Corporation

