PostNuke

Flexible Content Management System

News

PostNuke and its future language system: a translator's viewpoint

Contributed by on Sep 11, 2005 - 10:31 PM

1 What this article is about, and why I'm writing it


PostNuke has developed quite a lot since it forked from PHP-Nuke; many changes have already been implemented in 0.7.6.0, and many more will be reflected in 0.8. However, compared with other parts of the core code, the language system hasn't changed much. And what has changed recently might not always have been a good idea.


Most of the development work on PostNuke is done by speakers of European languages (predominantly English and German, but also Danish, French and Spanish, to name but some). Speakers of those languages might not be aware of the differences between their mother tongue and some of the more-exotic but widely-spoken languages in the world. Since I spend quite a bit of my time applying different languages, I'd like to show a few problems and make a few suggestions that perhaps could be further discussed.


I have looked through the PostNuke language files before, but it was only when I, together with two students, attempted to translate them into Chinese that we noticed how difficult a task that can be. Parts of PostNuke 0.7.6.0 actually cannot be translated meaningfully into Chinese or Japanese anymore.


When I was working on the translation, I happened to "complain" in the pnLanguages forum on PostNuke's NOC (Network Operations Center) about a few of the problems, and was answered by David Nelson of the pnLanguages team. He suggested I should put all my "frustration" (well, it isn't that serious, but some of the problems really are…) into an article, which is what I'm doing right here and now.


2 Current problems with PostNuke's language system


There is a thread [1] on the PostNuke forums, started by ColdRolledSteel, in which he raised a number of very important points. It was that thread that actually made me start thinking about the shortcomings of the current language system, and what the solutions might be.


If you read further on in that thread, you will find comments about 0.7.6.0. It's a long discussion, but if you're interested in PostNuke localization then I recommend you to read all of it, since many serious problems are mentioned. Good-quality localization into some languages may actually become very difficult unless the language system is overhauled.


Most PostNuke enthusiasts know that PostNuke forked from PHP-Nuke. With PHP-Nuke, not only do you have to pay to get access to the latest version of a GPLed software, but also many people would probably agree that there can be significant changes from version to version.


Not all of those changes may be approved-of by many, since the line of development of PHP-Nuke is decided by just one person: Francesco Burzi. You can read many posts in the PostNuke forums about the problems the PostNuke developers have had in fixing problems caused by PHP-Nuke code leftovers.


What seems to be forgotten is that, unfortunately, one of the legacies is the way PostNuke handles languages.


Here is a summary of problems raised in the above-mentioned thread and other threads, that I encountered:


  • too many files;

  • too many phrases used multiple times;

  • language files are hard to handle;

  • separation of grammatical particles can make it impossible to provide a good translation in some languages.

Again, please read the above-mentioned thread [1].


3 Suggestions


3.1 Files and folders


3.1.1 Simple structure instead of chaos


Instead of having one language folder for each language in every module folder, why not have just one central "language" folder with just one file for each language? The file could be an XML file, containing all the strings needed by the core system. If a new module is installed, the installation procedure could copy all needed variables and strings into that file. When a module is uninstalled, the de-installation procedure would then be responsible for removing those strings from the language file (in the same way that tables are created and deleted at present).


3.1.2 Why the current language system is chaotic


Does anyone know how many variables and strings are used in the core system alone? Maybe some of the developers will know (though I seriously doubt it); but, certainly, nobody who tries to produce a language pack knows. What they will soon learn, however, is that many strings show up more than once.


If I recall correctly, there are also a few different variables for the same purpose/string. All this is definitely not very motivating for those who want to translate PostNuke into another language. Plus, the translator may well not remember the exact phrase he/she used the last time this string was encountered a few modules ago. Just as there is often more than one way to express something, there is frequently also more than one way to translate a phrase. So, if you consider consistency in translation to be important, the present language system does not help achieve that.


It would also help if the above-proposed single language file could be processed by software. I mean, the year is 2005 and even translators are using computers these days. Every step that makes administering and translating strings easier will lead to better productivity, higher-quality output and availability in more languages. A translation tool (something like pnDefineMachine) should be included in the PostNuke core distribution, and should allow translation of core and third-party modules. To give translators more flexibility in working, it would be ideal if all identified strings could be exported to a file so that they can be worked on using other software and then re-imported.


3.2 Strings and variables


3.2.1 Simple, unique, short and distinct


All strings should be revised to check whether they are really necessary. They should also be as short and as clear as possible. If one single file were used, it would be easy to eliminate duplicate strings and variables.


Any variable output from the PostNuke code (a number, a user name, etc.) for incorporation within a natural language string should be included in the language file. It should also be possible to choose where in the string the variable is positioned.


In the past, developers often just created two language "defines" and the translator was left to guess what was missing between the two. Sometimes, due to the ordering of "defines" within the language file, the translator also had to realize that two "defines" were indeed used in conjunction, because they did not follow in succession. In such cases, some burrowing through software code files was needed for verification and understanding.


This problem is slowly being addressed. The age check string from the NewUser module's global.php language file is a good example. This "define" allows for free placement of $minage within the translated string:



<code>define('_MUSTBE','You must be '.$minage.' or over, or have parental permission to register for an account on this site.');</code>


All strings should compose a "meaningful phrase". The best would be a complete sentence, but in some cases single words or short phrases may be OK or are unavoidable. Just please, please don't separate any prepositions or similar grammatical particles.


3.2.2 Why translating current strings can be really frustrating


Imagine you tell someone, "Here, this is what you've got to translate. Some strings will show up twice; but some occur three, four or even five times. No, I can't tell you which strings. Just translate it. Oh, and you won't be getting any clues about what some of the cryptic, enigmatic phrases mean, nor where to find them on the site." How motivated do you think that person will be?


How happy will the translator or a webmaster be when he/she encounters something like the "define" below (from the Xanthia module's admin.php language file)?


<code>define('_XA_CONFIGINFO','Simply enter the new value in the text area and click Commit, the changes will take effect immediately. Unfortunately, the ability to create and delete general configurations has been removed, as it was never fully implemented and now never will be.');</code>


Translating a language pack should mean translating the user interface, not the story of PostNuke's life – that's something to include in documentation, no?


In addition, it would be helpful if developers do their best to make strings clear and understandable! This one that used to be in the Downloads module's global.php (now rectified) could phase a native English speaker, let alone someone else: "Let main vote summary decimal out to n places".


The general lack of informative and explanatory comments in the core code files and language files does not make a translator's work easy.


And dealing with problems is made harder when strings occur multiple times.


I know the code is advancing with every new version of PostNuke. But, unfortunately, this is not true from a linguistic point of view. 0.7.6.0 is a step backwards, although it was probably meant to take the language system forwards.


Let's look at some examples involving a number of languages: English, German, Chinese and Japanese, but some others are mentioned as well.


Many native speakers of a language may not be very familiar with all the theory behind their mother tongue's grammar. They may have had to learn it at school, but now they just use it naturally. For localization however, it may be necessary to dive into all that theory a bit.


The English language (along with many others) uses prepositions ("on", "in", "behind"...). But some languages don't have them. Hungarian, Finnish and Japanese use postpositions instead. That's right, they come after the object, not before it. In Chinese, there are prepositions, but many phrases use constructions similar to postpositions.


Let's look at the problems that this can cause. Let's take this sentence: "The book is on the table".


"The" is a definite article; "book" is a noun; "is" is a verb; "on" is a preposition; "the" is a definite article; and "table" is a noun.


"The book is on the table" is a simple sentence which, in German, translates to "Das Buch ist auf dem Tisch." It's hard to deny that German and English are somehow related.


But the same cannot be said of Japanese. "Book" in Japanese is "hon", and "table" is "tsukue". Since a book is not a living creature, the verb used will be "aru" or, in neutral politeness, "arimasu".


But how about "on"? The correct word here is "ue", but you can't just use it that way. "Ue" means "top", "above"; so, in Japanese, you have to say "on top of". "Of" would be "no"; "on top of" becomes "no ue ni" ("inside the top of something", so to speak). Did you notice the different order? Now here's the whole sentence: "Hon wa tsukue no ue ni arimasu." ("Wa" is just a particle marking a sentence's topic.)


While "book", as subject, is still at the beginning of the sentence, it is followed immediately after by the object, "table", then the postposition, and finally the predicate. Now imagine what incorrect or incomprehensible Japanese would be displayed if prepositions and other words for a given phrase or sentence are spread within separate "defines" in the language files. If you read through the forum thread mentioned at the beginning of this article, you will also find a few examples for Finnish. If you "turn the table" and try to make an English sentence translated word-for-word from the Japanese sentence structure, you would get something like "book table on is".


This is another reason why variables that will be processed by the PHP code, and rendered in an output form to be incorporated in the displayed string, should be included in the "define" for that string in the language files.


Let's take another simple example: in English (and many other languages), we say "A of B", as in the phrase "3 of 10 users". But in Japanese, it is "B no A" (and "B de A" in Chinese). So, simplification is a good thing but needs to be done the proper way, otherwise important parts of PostNuke could be practically unusable in a few languages in which the potential user base might be enormous [2]. Do you think a Chinese or Japanese would like a portal that tells him "10 of 3 users are currently on-line"?


By the way, did you notice that there are no articles in the Japanese example? English has one definite article (effectively one gender), French has two (for the masculine and feiminine genders) and German three (for the masculine, feminine and neuter genders) , so there are big differences between languages even for such an elementary point of grammar. But then consider the treatment of number and quantity in different languages: anyone who has examined the language files will know that there are different strings for "day" and "days", since singular and plural exist in English, German and many other languages. However, this isn't true for Chinese: "day" is "tian" in Chinese, and no matter whether you're talking about one, two or a million days, it's still "tian".


3.3 What we can do


3.3.1 Where we really can simplify and standardize


We could create standard "interfaces" for both code (variables) and strings. Since, in the system discussed herein, we would already have only one file, we could establish a list of standard actions like "delete", "create", "submit", etc. This list should be extendible, so that module authors can add any "special" actions for their module (although it would be nice if those "special" actions, too, could be managed by the core development team, otherwise duplication is likely to occur once again). But when already-available actions are used, the module developer would just use the existing variables in their code, and wouldn't have to worry about defining the corresponding strings, since they would already be available in the core "library". Obviously, that implies the existence of a good developer's guide that gives module and block developers proper information about the PostNuke programming API.


3.3.2 Separate what can be separated


So far, we have been advocating the use of language "defines" that contain complete phrases and sentences, incorporating any necessary variables (or, at least, place-holders for variables) for which literal values will be supplied by PHP. As from 0.7.6.0, the developers started to separate prepositions in the core language files. I just tried to show that this may not be a good move, but let's see what we actually can separate out.


It is not a good idea to strip grammatical particles, since grammar can be very different from language to language, and those particles are precisely that: particles. What we always need in order to properly construe a meaning is a semantic unit that conveys enough useful information to allow the receiver to understand the intended sense.


Let me explain this with the above book example: Someone tells you that the book is on the table. However, since a tram was passing by, or you were just distracted by a pretty woman or you were simply asleep, you didn't clearly understand all of it; so you ask that person to tell you the same thing again.


Now, that person may think you are a foreigner and don't understand much English, so he may think "let's make it simple for that poor fellow" and just tells you, "Book". Nice. Yes, I know what a book is. But does he want me to write a book, buy you a book or did I perhaps completely misunderstand and he wants me to book him on a flight to Alaska?


He repeats, "The book." Nice. Now at least I know that he doesn't want to go to Alaska. "Table," he adds. With much imagination, I might start to guess now that there is a table with a book on it. If I had heard the complete phrase just once more, there wouldn't have been any problem.


But how about single words then? In English, there is something called the "imperative mood" for its verbs. The verb itself does not change, since there is no flection. So, at least in English, you don't see a difference between the regular form of a verb and its imperative mood. However, you can shout "Shoot!" or "Fire!" and the receiver of that order will know what to do.


Not every language will use a verb to express this. German will use a noun, "Feuer". Japanese is closer to English this time, it also uses a verb: "Ute" is the imperative form of "utsu", which actually means "to beat". So, what we can strip are units that can stand alone semantically, and actions are one such thing.


It may happen that such an action is not always a single word in every language, but to organize them in a list and to standardize them system-wide would help to create a more consistent user interface. It would be a bit like function calls in a code library, but for strings.


3.4 Error messages


3.4.1 Error codes in one list


It is common in software design to assign codes to errors and to display a message corresponding to the code. As with the actions discussed above, we could create a list of error messages shared in common by any module and block throughout PostNuke.


3.4.2 It says "Xxx". What should I do?


I think everyone has seen "a few" posts in the PostNuke forums asking what this error or that error means. With the above method, a given error would produce the same message throughout the whole system. It would probably even be possible to provide a list of possible causes for each error, so that users can solve their problems more easily. But again, who would want to maintain a list of causes, if every module comes with its own error messages, each of which is phrased differently even though it pertains to what is effectively the same error?


Obviously, because a module often provides some special functionality that does not exist in other modules, there will probably be some module-specific errors. This problem could be solved by asking every module developer to apply for a block of error codes for that module. Again, the installation procedure of that module would have to take care of copying that block of "defines" and strings into the main language file.


3.5 Expand your horizon


PostNuke is GPL software, and so are many other content management systems [3] and portals that exist today. If they want to implement really-extensive internationalization [4] of their design, then they face the same problems as PostNuke. Some of their people may have come up with solutions nobody here thought of, so why not look around and see how they solved some of the problems PostNuke has to deal with?


One example could be Drupal [5], which has found a few solutions for language-related [6] problems [7] that could prove useful.


4 Conclusion


This document is far from being complete. It only takes a glance at a few languages: more input is needed from speakers of other languages. Also, the modifications I suggested may not always be the best and there are probably also other improvements that can be made. However, if PostNuke really intends to move towards broad internationalization, then it will be necessary to revise the language system and, possibly, other parts of its design. This article is intended to contribute to a thought process about that.


Obviously, nobody knows all the "features" and "traps" of all languages. That's why the pnLanguages team needs to have members from as many different nationalities and languages as possible, so that we can assemble guidelines that will allow the correct use of any language in which we have a user base.


Localization and translation are now recognized as key issues for the PostNuke project. When it comes to language and internationalization, more liaison between the pnLanguages team and the core development team will be very important.


Relevant links


1. http://forums.postnuke.com/index.php?name=PNphpBB2&file=viewtopic&t=24602

2. http://en.wikipedia.org/wiki/China#Demographics

3. http://www.opensourcecms.com/

4. http://www.debian.org/doc/manuals/intro-i18n/index.en.html#contents

5. http://drupal.org/

6. http://drupal.org/handbook/modules/locale

7. http://drupal.org/node/24593

8. pnLanguages project on PostNuke's NOC

6195