2021-10-02
On Translating Free Pascal/Lazarus Programs
<-Translating Console Applications in Free Pascal

Not very long ago, I uploaded a set of command line utilities to a GitHub repository called poutils. These utilities, which manipulate the .po translation files automatically created by the Lazarus IDE when internationalization is enabled, are written in Free Pascal. However, they were initially written in Delphi in a haphazard fashion over a number of years when I used dxgettext to internationalize some applications. In retrospect, it would have been better to wait until I had studied the GNU gettext translation system more closely before uploading these programs. This text presents some of the information obtained from the GNU project and some experimentation done to identify the way Free Pascal and Lazarus implement gettext.

There is a unit called gettext in the Free Pascal run time library. It provides procedures to translate resource strings from MO files which are compiled PO files. This is only a small part of what is called the Lazarus implementation of [GNU] gettext below. Hopefully, this will not cause confusion.

Table of Contents

  1. PO Files
  2. Simple Lazarus Example
    1. Behind the Scene
    2. First Translation
    3. Second Translation
  3. With a Little Help
  4. Extended Translations
  5. Edge Cases
  6. Message Context and MO Files
  7. Caveat
  8. References and Conclusion

PO Files

Have I earned credits for not naming this section "PO Files In Courage"? It was very tempting.

A PO file, also called a gettext catalogue, is a text file which contains a list of strings used in a program along with their translation into another natural language. Each language into which a program has been translated has its own .po file. The original gettext system (and dxgettext) used a complicated hierarchy of directories to identify the natural language found in each file. As is often the case, the Lazarus team simplified the approach and the convention is now to store all translation files in a subdirectory named languages or locale. The natural language of each translation is specified with a language code suffix (usually a two-letter code, but there are exceptions) before the file extension which is always .po. So app.fr.po and app.es.po are, respectively, the French and Spanish translations of an application named app.

A translation file is composed of entries for each string to be translated. Each entry must contain a pair of strings: the untranslated string tagged with the msgid keyword followed by its translation marked with the msgstr keyword. When a program is launched, each of its untranslated strings is replaced with its translation unless the translated string happens to be blank. This is a one-time operation and should not impact execution time afterwards. The syntax of the basic entry is as follows.

    msgid "untranslated-string"
    msgstr "translated-string"

The opening and closing quotes are not part of the string but are useful as they allow for strings with leading or trailing spaces. If no translation is available, then the entry would be

    msgid "untranslated-string"
    msgstr ""

While the msgstr field can be an empty string, the msgid field can never be blank, which makes sense. There is a single exception which is called the header entry to be discussed later.

For convenience, long strings can be divided into manageable sized chunks.

    msgid ""
    "first-chunk of the untranslated-string"
    "second-chunk of the untranslated-string"
    msgstr ""
    "first-chunk of the translated-string"
    "second-chunk of the translated-string"

I believe the following layout is just as valid but not usually used.

    msgid "first-chunk of the untranslated-string"
    "second-chunk of the untranslated-string"
    msgstr "first-chunk of the translated-string"
    "second-chunk of the translated-string"

The breaks between chunks have no meaning. The untranslated string will be a concatenation of all the chunks, so the untranslated string in the last example is

    first-chunk of the untranslated-stringsecond-chunk of the untranslated-string

By the same token, there is no requirement that the translated string be broken up into the same number of chunks, or that the chunks correspond and so on. Of course it is possible to have multiline strings in a program such as when a label has a three-line caption. This can be done by identifying the ends of lines with the usual \n escape sequence as shown in the following example.

    msgid "first line\nsecond line\nof the untranslated-string"
    msgstr "première ligne\nseconde ligne\nde la chaîne traduite"

Again, this could be written in chunks, with each chunk ending with the new line escape sequence to make the layout of the string more obvious.

    msgid ""
    "first line\n"
    "second line\n"
    "of the untranslated-string"
    msgstr ""
    "première ligne\n"
    "seconde ligne\n"
    "de la chaîne traduite"

As a matter of fact, "untranslated-string and translated-string [respect] the C syntax for a character string, including the surrounding quotes and embedded backslashed escape sequences" according to the GNU documentation. The basic escape sequences \n for a new line, \t for a tabulation, \" for a double quote and \\ for the escape character \ will work in Free Pascal. I have not investigated more esoteric escape sequences. However, the ASCII character 124 (0x7C, or 174 in octal), the vertical bar |, does cause problems. I have found that it was not possible to translate a file dialog filter such as "JSON files|*.json" with "Fichiers JSON|*.json". I tried escaping the character with "\|" without success. Perhaps replacing the character with its Unicode escape sequence would work, but after thinking about it, it struck me that it made no sense to enable a translator to change a file extension. Instead I changed the resource string to "JSON files" and built the filter string from it by appending the constant "|*.json". Are there other "special" ASCII characters that cause problems during translation?
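The chunk and escape rules described above can be illustrated with a minimal sketch. Python is used here only for illustration, and po_string is a hypothetical helper, not something provided by gettext or Lazarus. It joins adjacent quoted chunks and decodes only the four escape sequences listed above; any other escape is left untouched.

```python
import re

# Only the four escapes discussed in the text; anything else is left as-is.
_ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def po_string(chunks):
    """Join quoted PO chunks and decode the basic escape sequences."""
    parts = []
    for chunk in chunks:
        chunk = chunk.strip()
        if len(chunk) < 2 or chunk[0] != '"' or chunk[-1] != '"':
            raise ValueError("chunk must be enclosed in double quotes: %r" % chunk)
        body = chunk[1:-1]  # the surrounding quotes are not part of the string
        parts.append(
            re.sub(r"\\(.)", lambda m: _ESCAPES.get(m.group(1), m.group(0)), body)
        )
    return "".join(parts)  # the breaks between chunks carry no meaning


if __name__ == "__main__":
    print(po_string(['""', '"first line\\n"', '"second line"']))
```

Because the chunks are simply concatenated, a leading empty chunk ("") changes nothing, which is why both layouts shown earlier describe the same string.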

The GNU specification includes other elements in an entry. Here is a definition of a standard entry.

    white-space
    #  translator-comments
    #. extracted-comments
    #: reference…
    #, flag…
    #| msgctxt previous-context
    #| msgid previous-untranslated-string
    msgctxt context
    msgid untranslated-string
    msgstr translated-string

While gettext tools, including the Lazarus implementation, generate a single blank line between entries, I believe this is optional. In GNU gettext all lines that begin with the comment character "#" are optional. In the Lazarus implementation, the reference comment that begins with "#: " is mandatory; otherwise the entry will be treated as a header, or ignored if a header has already been defined. The other types of comments are optional in the Lazarus implementation. The context element (msgctxt) is optional in GNU gettext and Lazarus. As far as I know, the Lazarus IDE does not generate extracted-comments which "xgettext program extracts [] from the program’s source code." Similarly, I have not seen previous context elements in a .po file generated by the Lazarus IDE but that is not evidence that they are never present.

References are "references to the program's code." In the Lazarus implementation, a reference for an entry is the fully qualified name of the entity that owns the untranslated string in lower case. For example, suppose the application has a form of type TForm2 with a label named Label5 whose caption is set to "Value". Then the generated .po file would contain these two entries at a minimum.

    #: tform2.caption
    msgid "Form2"
    msgstr ""

    #: tform2.label5.caption
    msgid "Value"
    msgstr ""

Of course, these references are guaranteed to be unique if two conditions are satisfied: the program compiles without error and no human has touched the generated .po file. Unfortunately, that last condition does not obtain in the author's home.

A message context (msgctxt) is used to resolve ambiguity whenever an untranslated string appears more than once. A unique context should then be added to each such entry to differentiate them. The Lazarus IDE automatically generates any required context, but it is just the entry reference in quotes. More on this later.

From the GNU gettext documentation, I assume that a flag line would look like this:

    #, fuzzy, format-string [, format-string]

where the format-strings (which could be c-format, no-c-format, python-format, no-python-format etc.) specify the type of format (things like %s, %d, %2.f and so on) used in the untranslated string. As far as I know, only the fuzzy keyword is processed by the Lazarus internationalization system, but anything else in the comment will be untouched.

The header entry, which must be the first entry in the .po file, is the only entry that has an empty msgid. Its msgstr contains metadata. Here is the basic entry created by the IDE.

    msgid ""
    msgstr "Content-Type: text/plain; charset=UTF-8"

It is not rare to encounter headers with much more information.

    msgid ""
    msgstr ""
    "Content-Type: text/plain; charset=UTF-8\n"
    "Project-Id-Version: \n"
    "MIME-Version: 1.0\n"
    "Content-Transfer-Encoding: 8bit\n"
    "Language: fr_FR\n"

In addition there is usually some information about the translator's identity and about the date of translation and so on. But what's included is beyond the scope of this post. Also outside of what will be examined in this post is the third type of entry which deals with plural forms. This is just laziness on my part, but perhaps the subject will be reexamined in the future.
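To make the entry layout described in this section concrete, here is a minimal sketch of a reader for such files. It is written in Python purely for illustration; parse_po is a hypothetical name and this is not how Lazarus or GNU gettext read catalogues. It assumes that entries are separated by a blank line, as in the files generated by the IDE, and it leaves escape sequences undecoded.

```python
def parse_po(text):
    """Split PO text into entries (dicts); assumes blank lines separate entries."""
    entries, entry, field = [], None, None

    def unquote(s):
        # Strip the surrounding quotes; escape sequences are left undecoded.
        s = s.strip()
        return s[1:-1] if len(s) >= 2 and s[0] == '"' and s[-1] == '"' else s

    for line in text.splitlines() + [""]:  # trailing blank flushes the last entry
        line = line.strip()
        if not line:  # a blank line ends the current entry
            if entry is not None:
                entries.append(entry)
            entry, field = None, None
            continue
        if entry is None:
            entry = {"reference": None, "flags": [], "msgctxt": None,
                     "msgid": "", "msgstr": ""}
        if line.startswith("#:"):
            entry["reference"] = line[2:].strip()
        elif line.startswith("#,"):
            entry["flags"] = [f.strip() for f in line[2:].split(",") if f.strip()]
        elif line.startswith("#"):
            pass  # other comment types are ignored in this sketch
        elif line.startswith('"') and field:  # continuation chunk
            entry[field] += unquote(line)
        else:
            keyword, _, rest = line.partition(" ")
            if keyword in ("msgctxt", "msgid", "msgstr"):
                field = keyword
                entry[field] = unquote(rest)
    return entries
```

With this model the header is simply the entry whose msgid is empty and whose reference is missing, which mirrors the observation above that an entry without a reference is treated as a header.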

Simple Lazarus Example

Let's experiment with the Lazarus internationalization capabilities with a simple program that does nothing except display three captions and close itself when the button is pressed. Here is its main form in the IDE designer.

It contains a label and a button with text that will be translated. The caption of the button is set to "Close" at design time, while the caption for the label is set at run time in the FormShow event. Its value is set equal to the content of the SHello resource string defined in the source code.

    unit main;

    {$mode objfpc}{$H+}

    interface

    uses
      Classes, SysUtils, Forms, Controls, Graphics, Dialogs, StdCtrls;

    type
      TForm1 = class(TForm)
        Button1: TButton;
        Label1: TLabel;
        procedure Button1Click(Sender: TObject);
        procedure FormShow(Sender: TObject);
      private
      public
      end;

    var
      Form1: TForm1;

    implementation

    {$R *.lfm}

    { Resource strings end up in the generated .po file }
    Resourcestring
      SHello = 'Hello';

    procedure TForm1.FormShow(Sender: TObject);
    begin
      { The label caption is set at run time from the resource string }
      label1.caption := SHello;
    end;

    procedure TForm1.Button1Click(Sender: TObject);
    begin
      close
    end;

    end.

When executed, this is the appearance of the form.

Enable internationalization in the IDE. To do this, bring up the Options for Project: dialog by either pressing the Ctrl+Shift+F11 key combination or navigating through the menu system: Project/Project Options. Select the i18n option in the left panel and then activate the Enable i18n checkbox. The PO Output Directory should also be specified.

As displayed, I chose languages, but that is just for convenience while examining the Lazarus gettext implementation. Normally I use langs for the PO Output Directory and avoid languages or locale because these are the traditional input directories that contain the distributed or manually created translation files. It would not do to let the Lazarus IDE overwrite modified translation files.

Build the project and a file named test.po will be created in the languages directory when the program is compiled. Here it is.

    msgid ""
    msgstr "Content-Type: text/plain; charset=UTF-8"

    #: main.shello
    msgid "Hello"
    msgstr ""

    #: tform1.button1.caption
    msgid "Close"
    msgstr ""

    #: tform1.caption
    msgid "Test"
    msgstr ""

    #: tform1.label1.caption
    msgid "Label1"
    msgstr ""

This is a correctly formed translation file with a header as the first entry and additional valid entries for the visual components and resource strings. Each of those begins with a reference identified by the #: comment marker. The reference is the fully qualified name of the text property in lowercase. The msgid field contains the untranslated string in quotes. The msgstr field which may eventually contain the translated string is empty. It's up to the translator to supply the missing translations.

Since this file has no translations, it is said to be the translation template. As such, it would probably be named test.pot in standard gettext systems, but Lazarus chose not to use the .pot extension. Perhaps using the extension would be a problem on systems with Microsoft Office, which uses the extension for other purposes.

Behind the Scene

The IDE also created a test.lrj file and saved it alongside the main.pas source code.

    {"version":1,"strings":[
    {"hash":371876,"name":"tform1.caption","sourcebytes":[84,101,115,116],"value":"Test"},
    {"hash":86477809,"name":"tform1.label1.caption","sourcebytes":[76,97,98,101,108,49],"value":"Label1"},
    {"hash":4863637,"name":"tform1.button1.caption","sourcebytes":[67,108,111,115,101],"value":"Close"}
    ]}

As can be seen, this is a JSON-formatted file. Clearly, the value associated with the "name" key is the entry reference as stored in the .po catalogue. Similarly, the value associated with the "value" key is the untranslated string. By changing the caption of label1 to "été" it was possible to determine that the array of bytes associated with the key "sourcebytes" is the UTF-8 encoding of the untranslated string, while the "value" is stored as a string with Unicode escape sequences for all non-ASCII characters.

{"hash":13504729,"name":"tform1.label1.caption","sourcebytes":[195,169,116,195,169],"value":"\u00E9t\u00E9"},

By changing the label's name to label2 and noticing no change in the "hash" value, it can be surmised that it is the untranslated string which is hashed.

{"hash":13504729,"name":"tform1.label2.caption","sourcebytes":[195,169,116,195,169],"value":"\u00E9t\u00E9"},
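Both observations can be checked with a few lines of Python (used here only for illustration). The sketch decodes the "sourcebytes" array as UTF-8 and compares it with "value". It also shows that the four hash values quoted in this section are reproduced by a shift-and-add hash of the PJW/ELF family applied to those bytes. That match is an observation on these short samples only; whether the IDE uses exactly this function, in particular for longer strings, is an assumption, and shift_add_hash is a hypothetical name.

```python
import json


def shift_add_hash(data: bytes) -> int:
    """PJW/ELF-style 32-bit hash over raw bytes (assumed, not taken from the IDE)."""
    h = 0
    for byte in data:
        h = ((h << 4) + byte) & 0xFFFFFFFF
        g = h & 0xF0000000
        if g:  # fold the top four bits back into the hash, then clear them
            h ^= g >> 24
            h ^= g
    return h


# The entry shown above, copied from the .lrj file.
entry = json.loads(
    '{"hash":13504729,"name":"tform1.label1.caption",'
    '"sourcebytes":[195,169,116,195,169],"value":"\\u00E9t\\u00E9"}'
)
source = bytes(entry["sourcebytes"])

if __name__ == "__main__":
    print(source.decode("utf-8") == entry["value"])  # bytes are the UTF-8 form of value
    print(shift_add_hash(source) == entry["hash"])   # hash depends on the string only
```

Since the label name never enters the computation, renaming label1 to label2 cannot change the hash, which agrees with the surmise above.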

So that is the content of the file described, but what is its purpose? It is probably used by the IDE to track changes when it rebuilds the files in the PO Output Directory. A lot more happens behind the scenes when translations are added, but these .lrj files merit special mention because it "... is very important that you include [.lrj files] with your source code in the version system you're using, don't add that file to ignored (say .gitignore), else your translations will be broken." (Source: Translations / i18n / localizations for programs)

First Translation

As a first experiment copy the template file to test.fr.po within the same directory and provide translations for the resource string and button in the latter:

    msgid ""
    msgstr "Content-Type: text/plain; charset=UTF-8"

    #: main.shello
    msgid "Hello"
    msgstr "Allô"

    #: tform1.button1.caption
    msgid "Close"
    msgstr "Fermer"

    #: tform1.caption
    msgid "Test"
    msgstr ""

    #: tform1.label1.caption
    msgid "Label1"
    msgstr ""

I used a simple text editor to add the two words "Allô", and "Fermer", but poedit or a similar translation utility could be used. In that case, expect to have a much bigger header, but that has no real consequence. Here is the command to start the program with the French translation no matter what the language settings for the system are.

michel@hp:~/Documents/Lazarus_projects/i18n$ ./test --lang fr
or in Windows
michel@hp:~\Documents\Lazarus_projects\i18n> test --lang fr

This will prove disappointing as neither translation is displayed. The missing ingredient is the DefaultTranslator unit which needs to be added to the uses clause of the main unit.

    unit main;

    {$mode objfpc}{$H+}

    interface

    uses
      Classes, SysUtils, Forms, Controls, Graphics, Dialogs, StdCtrls,
      DefaultTranslator;
    ...

Now the needed code to search for the .po or .mo file and to use it to translate the application will be included. Compile the modified source and launch the test program again.

This time the label and button captions are translated. That shows just how simple it is to internationalize Lazarus programs. The hard work is translating the template into other languages.

Second Translation

Copy the template file to test.es.po within the same directory and provide translations for the resource string and button in the latter:

    msgid ""
    msgstr "Content-Type: text/plain; charset=UTF-8"

    #: main.shello
    msgid "Hello"
    msgstr "Hola"

    #: tform1.button1.caption
    msgid "Close"
    msgstr "Cerca"

    #: tform1.caption
    msgid "Test"
    msgstr ""

    #: tform1.label1.caption
    msgid "Label1"
    msgstr ""

Because of my woefully inadequate knowledge of most languages, that complicated translation was done with the help of the Web with all its inherent potential for errors. My apologies to Hispanophones if the translations make no sense. To take advantage of this new translation, there's no need to recompile the application, just launch it with the appropriate language flag: test --lang es.

This reveals the power of this system. Anyone can translate string resources of an i18n enabled program if supplied with an accurate template file.

The system is actually better than what has been shown. A user will normally not need to specify the language to be used. The DefaultTranslator unit will use the system locale to load the appropriate language file if it exists. The locale of my system is fr_CA which stands for French in Canada. Failing to find a regionalized French version named test.fr_CA.po, the program will load the test.fr.po translation when test is launched without an overriding --lang command line option (see the caveat below).

    michel@hp:~$ locale
    LANG=fr_CA.UTF-8
    LANGUAGE=fr_CA:fr
    ...
    michel@hp:~$ Documents/Lazarus_Projects/i18n/test

If a French or Spanish user wants to see the original language, then the --lang parameter can be used to override the automatic selection of the language file.

    michel@hp:~$ Documents/Lazarus_Projects/i18n/test --lang en
or
    michel@hp:~$ Documents/Lazarus_Projects/i18n/test --lang zebra

Use en, zebra or any other "language" for which a translation file is not provided. In that case DefaultTranslator will not load a translation file and the default language strings will be displayed. Another way to achieve the same goal is to rename the languages directory to something which is not searched.

With a Little Help

The Lazarus gettext implementation provides some welcome help for translators by suggesting translations when possible. Change the Form1 caption to "Hello" in the IDE with the form designer. Build the program and execute it specifying that the French translation is to be used. The result is probably what one would expect.

The form caption is "Hello" because that is what was specified in the form designer. However look at the French translation file which has been updated by the IDE.

    msgid ""
    msgstr "Content-Type: text/plain; charset=UTF-8"

    #: main.shello
    msgctxt "main.shello"
    msgid "Hello"
    msgstr "Allô"

    #: tform1.button1.caption
    msgid "Close"
    msgstr "Fermer"

    #: tform1.caption
    #, fuzzy
    msgctxt "tform1.caption"
    msgid "Hello"
    msgstr "Allô"

    #: tform1.label1.caption
    msgid "Label1"
    msgstr ""

There is now a suggested translation for the form caption. But that is only a proposal because the fuzzy flag was added to the tform1.caption entry. This flag is used by the translation mechanism coded into the application to skip the suggested translation and to use the untranslated string which in this case was "Hello". The flag is also a signal for the translator that the entry needs to be reviewed. Remove the #, fuzzy line from test.fr.po and launch the application again. No need to recompile it. The form caption will now be translated.

Note the addition of the context line msgctxt in the two entries of the PO file that have the same "Hello" msgid. This will be discussed further on.

The IDE did the same thing, mutatis mutandis, with the Spanish translation file. It would have done the same with all other .po files in the PO Output Directory.

Again, all this is very useful. Presumably, the programmer could pass on the .po files to the translators and ask them to look at the "fuzzies". They could decide that the suggested translation is correct and just remove the #, fuzzy line in the entry. Alternatively, they could decide that the suggested translation is not correct in the given circumstance, write the correct translation in the msgstr field and then remove the #, fuzzy flag.
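Finding the "fuzzies" in a large catalogue by eye is tedious, so a translator might use something like the following sketch. It is a hypothetical helper written in Python for illustration, not a Lazarus or gettext tool. It assumes that the #: reference line precedes the #, flag line within an entry and that entries are separated by blank lines, as in the files shown here.

```python
def fuzzy_references(po_text):
    """Return the references of entries whose flag line contains 'fuzzy'."""
    found, reference = [], None
    for line in po_text.splitlines():
        line = line.strip()
        if not line:
            reference = None  # blank line: the next entry starts
        elif line.startswith("#:"):
            reference = line[2:].strip()
        elif line.startswith("#,"):
            flags = [flag.strip() for flag in line[2:].split(",")]
            if "fuzzy" in flags:
                found.append(reference)
    return found
```

Applied to the French file shown earlier in this section, it would report the single reference tform1.caption.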

Extended Translations

Instead of removing the fuzzy flag, remove the whole form caption entry in the translation catalogue and run the application. The form caption will still be translated.

    msgid ""
    msgstr "Content-Type: text/plain; charset=UTF-8"

    #: main.shello
    msgid "Hello"
    msgstr "Bonjour"

    #: tform1.button1.caption
    msgid "Close"
    msgstr "Fermer"

    #: tform1.label1.caption
    msgid "Label1"
    msgstr ""

Clearly when faced with a string which does not have a specific translation, the translation system will search for a PO entry with the same untranslated string and if one is found it uses that entry's translated string. This does raise a number of questions.

As a programmer I often thought it was clever to simplify the translation file by removing entries to benefit from the automatic translation feature described above. I try not to do that anymore because sometimes "Bonjour" is an appropriate translation of "Hello", but at other times "Allô" or "Salut" might be better. Leaving the entries the gettext system places in the template file and letting translators decide how to best translate duplicated untranslated strings is a much better strategy.

Edge Cases

There is no need to follow me down the following rabbit hole because it's an investigation into what happens if the PO file is not correct. The following table shows the displayed form caption depending on the content of the PO file. The program was compiled only once with the resource string for the form caption, as specified in the .lfm file, being "Hello". The tests, divided into 6 groups of four, consisted of changes to the PO file only. In the first two groups of tests, the PO file contained only 2 entries. There was entry 0, which is not shown in the table and which is the header. For the first 4 tests, entry 1 is the translation of the form caption. In the first two tests, the msgid for the caption is the same as its resource string and, not surprisingly, the displayed caption is translated as expected. In the third and fourth tests of the first group, the msgid is incorrect. Nevertheless, the displayed caption is not the original value of the resource string, which shows the importance of the reference field in PO files within the Lazarus implementation.

In the second group of tests (5 to 8), there is no entry for the form caption in the PO file. Instead there is an entry with reference to a non-existing entity. Nevertheless, as test 6 shows, this translation will be used to translate the caption. As tests 7 and 8 show, entry 1 has no impact when it contains a msgid which does not correspond to the caption resource string.

The third and fourth groups of tests (9 to 16) show that an extraneous entry with a translation of the form caption resource string has no impact on the translation of the caption if there is an entry with the correct reference to the form caption. And that remains true whether or not the entry for the caption has a translation. The order in which the entries appear is of no consequence.

          Entry 1                             Entry 2                             Displayed
    Test  reference       msgid    msgstr     reference       msgid    msgstr     caption
    1     tform1.caption  "Hello"  ""                                             Hello
    2     tform1.caption  "Hello"  "Bonjour"                                      Bonjour
    3     tform1.caption  "bye"    ""                                             bye
    4     tform1.caption  "bye"    "Bonjour"                                      Bonjour
    5     ref.nowhere     "Hello"  ""                                             Hello
    6     ref.nowhere     "Hello"  "Allô"                                         Allô
    7     ref.nowhere     "bye"    ""                                             Hello
    8     ref.nowhere     "bye"    "Allô"                                         Hello
    9     ref.nowhere     "Hello"  "Allô"     tform1.caption  "Hello"  ""         Hello
    10    ref.nowhere     "Hello"  "Allô"     tform1.caption  "Hello"  "Bonjour"  Bonjour
    11    ref.nowhere     "Hello"  "Allô"     tform1.caption  "bye"    ""         bye
    12    ref.nowhere     "Hello"  "Allô"     tform1.caption  "bye"    "Bonjour"  Bonjour
    13    tform1.caption  "Hello"  ""         ref.nowhere     "Hello"  "Allô"     Hello
    14    tform1.caption  "Hello"  "Bonjour"  ref.nowhere     "Hello"  "Allô"     Bonjour
    15    tform1.caption  "bye"    ""         ref.nowhere     "Hello"  "Allô"     bye
    16    tform1.caption  "bye"    "Bonjour"  ref.nowhere     "Hello"  "Allô"     Bonjour
    17    ref.nowhere     "bye"    "Allô"     tform1.caption  "Hello"  ""         Hello
    18    ref.nowhere     "bye"    "Allô"     tform1.caption  "Hello"  "Bonjour"  Bonjour
    19    ref.nowhere     "bye"    "Allô"     tform1.caption  "bye"    ""         Hello
    20    ref.nowhere     "bye"    "Allô"     tform1.caption  "bye"    "Bonjour"  Bonjour
    21    tform1.caption  "Hello"  ""         ref.nowhere     "bye"    "Allô"     Hello
    22    tform1.caption  "Hello"  "Bonjour"  ref.nowhere     "bye"    "Allô"     Bonjour
    23    tform1.caption  "bye"    ""         ref.nowhere     "bye"    "Allô"     bye
    24    tform1.caption  "bye"    "Bonjour"  ref.nowhere     "bye"    "Allô"     Bonjour

The last two groups of tests leave me nonplussed. Compare tests 15, 19 and 23. In all three cases, there is a valid entry for the form caption with the correct tform1.caption reference and instructions that the resource string should not be translated, although the msgid is not valid. But somehow the presence and order of appearance of the extraneous entry have an impact on the translation shown. This is not the only anomaly that was encountered. Consider this example.

    msgid ""
    msgstr "Content-Type: text/plain; charset=UTF-8"

    #: main.shello
    msgid "Hello"
    msgstr ""

    #: reference.to.nowhere
    msgid "Hello"
    msgstr "Bonjour"

    #: tform1.button1.caption
    msgid "Close"
    msgstr "Fermer"

    #: tform1.label1.caption
    msgid "Label1"
    msgstr ""

That was surprising because it seemed plausible that, given the correct reference to the SHello resource string and its empty msgstr, the label caption should have been left untranslated and displayed as "Hello".

Message Context and MO Files

As mentioned above, two entries with the same untranslated string in a PO file are considered ambiguous and the GNU gettext system adds a message context field, called msgctxt, to differentiate the entries. The Lazarus implementation also adds a msgctxt field when the same untranslated string occurs more than once and this field is invariably the entry reference in quotes. This duplication may not appear to be that useful, but in actuality this approach satisfies a GNU gettext requirement while preserving the Lazarus reliance on the reference field.
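The rule just described can be modelled in a few lines. This is only a sketch of the behaviour observed in the generated files, written in Python for illustration; add_contexts is a hypothetical name and the code is not taken from the Lazarus sources. Entries are represented as plain dictionaries with "reference" and "msgid" keys.

```python
from collections import Counter


def add_contexts(entries):
    """Set msgctxt to the reference for every msgid that occurs more than once."""
    counts = Counter(entry["msgid"] for entry in entries)
    for entry in entries:
        duplicated = counts[entry["msgid"]] > 1
        entry["msgctxt"] = entry["reference"] if duplicated else None
    return entries


if __name__ == "__main__":
    catalogue = [
        {"reference": "main.shello", "msgid": "Hello"},
        {"reference": "tform1.button1.caption", "msgid": "Close"},
        {"reference": "tform1.caption", "msgid": "Hello"},
    ]
    for entry in add_contexts(catalogue):
        print(entry["reference"], entry["msgctxt"])
```

Only the two "Hello" entries receive a context, which matches the French file shown in the section With a Little Help.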

A MO file (extension .mo) is a binary file compiled from a PO file. In principle the compiled file is smaller and faster to load than the .po file. The set of tools provided with Lazarus does not include a compiler but the GNU gettext project provides one named msgfmt. It is included by default in Mint 20.1 and probably many other Linux distributions, but it is probably not included in Windows. Maybe that is why the Free Pascal wiki recommends using the poedit translation editor to create MO files.

Both these compilers can be used to generate MO files from PO files created by the Lazarus gettext implementation. However, they ignore the reference field. On the other hand, the msgctxt field is used to provide unique translations for ambiguous entries. By keeping both fields identical, the Lazarus gettext implementation ensures compatibility with the original gettext system. At least that's my interpretation of what is happening and I'll stick to it until evidence to the contrary is provided.

It is telling that Lazarus does not provide a compiler and, more to the point, does not ship with any compiled .mo files. The many translations of the IDE itself are stored in .po files only. For my use case, distributing MO files would be a mistake. It only makes sense to do that when one wants to concentrate in one's hands the control for translations which is exactly what I cannot do.

Caveat

A misleading statement was made about the automatic search for the regionalized version of a national language depending on the system's locale. That is not quite true. As I said, the LANG environment variable in my system is fr_CA.UTF-8. Accordingly, the automatic translation mechanism (enabled with the inclusion of the DefaultTranslator unit in the program's uses statement) searches for the file test.fr_CA.UTF-8.po in all the usual places. If such a file is not found, then the search is repeated for a file named test.fr.po, but a search for test.fr_CA.po is never performed. I am not sure if this should be seen as normal behaviour or it should be considered "a feature" or even a bug.

This problem came up in 2017 when I wanted to translate console applications. In my unittranslator.pas unit which replaces LCLTranslator to avoid the latter's dependence on the LCL package, the GetLang function will remove the UTF-8 encoding suffix from the system locale. That works for my needs but I do not know if it is a generally acceptable solution.
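The idea of dropping the encoding suffix before searching can be sketched as follows. This is not the GetLang function from unittranslator.pas nor the LCLTranslator search code; it is a hypothetical Python illustration, and candidate_files is an invented name. It lists the file names one might try, most specific first, once the encoding part of the locale has been removed.

```python
def candidate_files(app, locale_name):
    """List translation file names to try, most specific first."""
    lang = locale_name.split(".", 1)[0]  # "fr_CA.UTF-8" -> "fr_CA"
    names = [f"{app}.{lang}.po"]
    short = lang.split("_", 1)[0]        # "fr_CA" -> "fr"
    if short != lang:
        names.append(f"{app}.{short}.po")
    return names


if __name__ == "__main__":
    print(candidate_files("test", "fr_CA.UTF-8"))
```

With the locale of my system this yields test.fr_CA.po followed by test.fr.po, so a regionalized translation would be found before the generic French one.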

References and Conclusion

The principal reference for translating Lazarus programs is the Translations / i18n / localizations for programs Free Pascal wiki page. Additional information is found in the Everything else about translations page. There are many other wiki pages dedicated to the subject. Do be careful: some of these are old and contain advice that may be technically correct but is nevertheless out of date. Translating Lazarus programs is really as simple as described in the principal reference wiki or as described above.

Extensive documentation for the GNU gettext system is available in many forms. I found the Pology User Manual was also a useful resource.
