Chapter 12
Much of today’s software is written for an international market. Worldwide sales enable vendors to maximize profits. In addition, multinational companies often must build systems that cut across countries, cultures, and languages. Language translation can be a difficult issue. Data is often stored in the language of entry, but there can be a need to translate metadata, such as labels in forms and reports. This chapter presents the nucleus of a string translation model.
Table 12.1 summarizes several approaches to language translation. It is convenient to consider abbreviation along with translation.
Language Translation Approaches
Approach | Synopsis | Advantages | Disadvantages |
---|---|---|---|
Attribute translation in place | Each translated or abbreviated attribute has multiple parallel fields. | | |
Phrase-to-phrase translation | A lookup mechanism converts a source phrase into a target language and abbreviation. | | |
Language-neutral translation | Applications store concept IDs. A lookup table maps IDs to phrases. | | |
Automated translation | A software algorithm translates a phrase from one language into another. | | |
One option is to add parallel columns for translations and abbreviations. This approach is certainly simple, but it is verbose (many columns could be needed) and brittle (each added translation or abbreviation causes modification of the schema).
A dedicated lookup table can convert a phrase from a base language to a translated language and handle abbreviations. The advantage is that the application schema is not disrupted. The downside is that phrases can be translated out of context, leading to errors. For example, the word bank has multiple meanings (a financial institution or the edge of a river).
The language-neutral translation service is a robust choice. It also uses a lookup table, but a concept ID represents the source idea. This approach separates the multiple meanings of words and phrases for a clean translation. The drawback is that application databases must replace translatable strings with concept IDs. Consequently, this approach is normally limited to new applications.
Some Web sites implement the last option. For example, Babel Fish and Google Language Tools can both translate a phrase from a source to a target language. Such an approach is not viable for most applications as translation quality is often poor.
The next sections elaborate the first three options.
The simplest approach is to add columns for translations and abbreviations. Figure 12.1 shows an example. The birth place, hair color, and eye color strings are stored in both English and Spanish. The other fields are not translated. This approach is vulnerable to inconsistencies. For example, one person could have brown hair with a Spanish translation and another person could also have brown hair with a different translation.
Consider this approach when only a few fields must be translated. Also consider this approach when XML files store data. XML files can handle parallel fields with nested elements (unlike relational database tables).
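The inconsistency risk described above can be illustrated with a minimal Python sketch. The `Person` class, its field names, and the sample data are hypothetical and are not taken from Figure 12.1; the point is only that parallel fields let the same English value carry different Spanish translations, and that detecting this requires a separate check.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Person:
    # Hypothetical record with parallel English/Spanish fields.
    name: str
    hair_color_en: str
    hair_color_es: str


def inconsistent_translations(people):
    """Return English values that are paired with more than one Spanish value."""
    seen = defaultdict(set)
    for person in people:
        seen[person.hair_color_en].add(person.hair_color_es)
    return {en: es for en, es in seen.items() if len(es) > 1}


# Illustrative data: two people with brown hair, translated differently.
people = [
    Person("Ana", "brown", "castaño"),
    Person("Luis", "brown", "marrón"),
    Person("Eva", "black", "negro"),
]
print(inconsistent_translations(people))
```

Nothing in the parallel-field layout itself prevents the two records for "brown" from diverging, which is why the approach suits cases with only a few translated fields.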
Figure 12.2 and Figure 12.3 model the lookup mechanism for phrase-to-phrase translation. The advantage of this approach is that there is no disruption to any existing application schema. Consider this approach when you can limit the phrase vocabulary and avoid multiple meanings.
A Phrase is a string with a specific Language and AbbreviationType. The Language for a string can be a Dialect, a MajorLanguage, or AllLanguage. A MajorLanguage is a natural language, such as French, English, and Japanese. A Dialect is a variation of a MajorLanguage, such as UK English, US English, and Australian English. AllLanguage has a single record for strings that do not vary across languages.
Each Phrase has an AbbreviationType, which is the maximum length for a string. For example, there may be a short name (5 characters), a medium name (10 characters), a long name (20 characters), and an extra long name (80 characters). Abbreviations are especially handy for reports and user interface forms.
PhraseEquivalence cross-references Phrases with the same meaning. (See the Symmetric relationship antipattern in Chapter 8.) There are synonymous Phrases across Languages and AbbreviationTypes but not for the same Language and AbbreviationType (hence the uniqueness constraint).
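As a rough illustration of AbbreviationType as a maximum length, the following Python sketch uses the example limits from the text. The dictionary `MAX_LENGTH` and the functions `fits` and `shortest_fitting_type` are hypothetical names introduced only for this example.

```python
# Hypothetical abbreviation types, using the example limits from the text.
MAX_LENGTH = {"short": 5, "medium": 10, "long": 20, "extra long": 80}


def fits(text, abbreviation_type):
    """Return True if text respects the length limit of its abbreviation type."""
    return len(text) <= MAX_LENGTH[abbreviation_type]


def shortest_fitting_type(text):
    """Return the tightest abbreviation type that can hold text, or None."""
    for name, limit in sorted(MAX_LENGTH.items(), key=lambda item: item[1]):
        if len(text) <= limit:
            return name
    return None


print(fits("trk", "short"))
print(shortest_fitting_type("camion"))
```

A check like `fits` could run when a person populates the translation database, so that a phrase is never stored under an abbreviation type it exceeds.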
The translation service is dedicated software that runs apart from client applications. The translation database stores corresponding Phrases for various Languages and AbbreviationTypes. (A person must populate the translation database.) Upon request, the service finds the translation given a source Phrase, target Language, and target AbbreviationType.
Figure 12.4 shows a sample application table that could be subject to the translation mechanism. The phrase-to-phrase approach has a language bias. For example, the source data may be stored in English and converted to another language only upon translation mapping. Architecturally, a language bias is undesirable because users may detect the favored language.
The pseudocode in Figure 12.5 illustrates the logic for finding a translation. (The pseudocode is written using the UML’s Object Constraint Language [Warmer-1999].) The basic logic is to first look for an exact match to the target language. Otherwise, if a Dialect is specified, look for the corresponding MajorLanguage. If that fails, then make one more try to look for the AllLanguage record.
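The fallback logic described above (exact Language, then the MajorLanguage of a Dialect, then AllLanguage) can be sketched in Python as follows. This is not a transcription of the OCL in Figure 12.5; the classes, the `equivalents` dictionary standing in for PhraseEquivalence, and the sample phrases are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Language:
    name: str
    parent: Optional["Language"] = None  # a Dialect points at its MajorLanguage


ALL = Language("AllLanguage")
ENGLISH = Language("English")
FRENCH = Language("French")
UK_ENGLISH = Language("UK English", parent=ENGLISH)


@dataclass(frozen=True)
class Phrase:
    text: str
    language: Language
    abbreviation_type: str


def fallback_chain(target):
    """Languages to try in order: exact match, MajorLanguage, AllLanguage."""
    chain = [target]
    if target.parent is not None:
        chain.append(target.parent)
    if ALL not in chain:
        chain.append(ALL)
    return chain


def translate(source, target_language, target_abbreviation, equivalents):
    """Return the first equivalent Phrase along the fallback chain, or None."""
    candidates = equivalents.get(source, [])
    for language in fallback_chain(target_language):
        for phrase in candidates:
            if (phrase.language == language
                    and phrase.abbreviation_type == target_abbreviation):
                return phrase
    return None


# Hypothetical stand-in for PhraseEquivalence: source phrase -> synonyms.
camion = Phrase("camion", FRENCH, "long")
equivalents = {
    camion: [
        Phrase("truck", ENGLISH, "long"),
        Phrase("lorry", UK_ENGLISH, "long"),
        Phrase("trk", ENGLISH, "short"),
    ]
}
```

With this data, a request for a long UK English phrase finds an exact match, while a request for a short UK English phrase falls back to the English record because no short Dialect entry exists.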
Figure 12.6 and Figure 12.7 show a model for a language-neutral translation service. This approach separates the multiple meanings of words and phrases for a clean translation. However, you replace translatable strings with concept IDs, limiting this approach to new applications.
A Phrase is a string with a specific Language and AbbreviationType. The Language for a string can be a Dialect, a MajorLanguage, or AllLanguage. A MajorLanguage is a natural language, such as French, English, and Japanese. A Dialect is a variation of a MajorLanguage, such as UK English, US English, and Australian English. AllLanguage has a single record for strings that do not vary across languages.
Each Phrase has an AbbreviationType, which is the maximum length for a string. For example, a name may be short (5 characters), medium (10 characters), long (20 characters), or extra long (80 characters). Abbreviations are especially handy for reports and forms.
A TranslationConcept is the idea in a person’s mind that underlies a group of related Phrases. The premise of language-neutral translation is that an idea can be precisely expressed in any Language. Of course, this assumption is not exactly true as each language has its nuances. However, it is a good approximation for translating short phrases such as those that occur in user interface screens and reports. The translation service is not intended for long passages such as those in documents and books.
Table 12.2 shows a simple example. A person has the concept “truck” in mind with a translationConceptID of 2054.
Language-Neutral Translation: Sample Phrases
translationConceptID | Language | AbbreviationType | Phrase |
---|---|---|---|
2054 | MajorLanguage = English | long | truck |
2054 | MajorLanguage = French | long | camion |
2054 | MajorLanguage = English | short | trk |
2054 | Dialect = British English | long | lorry |
In practice, many persons could populate data and define redundant TranslationConcepts. Multiple definitions are undesirable but difficult to avoid. These multiple definitions ripple throughout application databases and are difficult to consolidate.
ConceptEquivalence provides a cross reference for synonymous TranslationConcepts and effects a logical merge. (See Chapter 11.) The application tables store translationConceptIDs. ConceptEquivalence serves only as a cross-reference and is not referenced by application tables. (See the Symmetric relationship antipattern in Chapter 8.) Each occurrence of ConceptEquivalence has a preferred TranslationConcept.
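One way to picture the logical merge is a lookup that maps any redundant concept ID to the preferred one. The Python sketch below assumes that each ConceptEquivalence groups a set of synonymous IDs and names one as preferred; the class layout and the IDs 3178 and 4410 are hypothetical, and only 2054 comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptEquivalence:
    # Assumed shape: one preferred concept plus all synonymous concept IDs.
    preferred: int
    members: frozenset


def build_preferred_index(equivalences):
    """Map every synonymous concept ID to its preferred concept ID."""
    index = {}
    for equivalence in equivalences:
        for concept_id in equivalence.members:
            index[concept_id] = equivalence.preferred
    return index


def preferred_concept(concept_id, index):
    """Return the preferred ID, or the ID itself if it has no equivalence."""
    return index.get(concept_id, concept_id)


# Hypothetical data: three redundant definitions of the "truck" concept.
equivalences = [
    ConceptEquivalence(preferred=2054, members=frozenset({2054, 3178, 4410})),
]
index = build_preferred_index(equivalences)
print(preferred_concept(3178, index))
```

Application tables can keep whichever translationConceptID they already store; the service resolves it to the preferred concept at lookup time, so no application data needs to be rewritten.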
The translation service is dedicated software that runs apart from client applications. To use the service, an application database substitutes a translationConceptID for each translatable phrase. For each TranslationConcept, the translation database stores the corresponding Phrases for the pertinent Languages and AbbreviationTypes. (A person must populate the translation database.) Upon request, the service finds the Phrase for the specified TranslationConcept, Language, and AbbreviationType.
Figure 12.8 shows a sample application table that is subject to language-neutral translation. The use of concept IDs works well for a new application. But it would be disruptive for an existing application to change strings to IDs.
The pseudocode in Figure 12.9 illustrates the logic for finding a phrase, given a TranslationConcept, AbbreviationType, and Language. The basic logic is to first look for an exact match to the target language. Otherwise, if a Dialect is specified, look for the corresponding MajorLanguage. If that fails, then make one more try to look for the AllLanguage record.
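The same three-step search can be sketched in Python for the language-neutral case, using the sample phrases from Table 12.2. This is an illustration and not the OCL of Figure 12.9; the dictionary layout, the `PARENT` map from Dialect to MajorLanguage, and the function name `find_phrase` are assumptions.

```python
# Assumed Dialect -> MajorLanguage mapping.
PARENT = {"British English": "English", "US English": "English"}
ALL_LANGUAGE = "AllLanguage"

# Sample phrases from Table 12.2, keyed by (concept ID, language, abbreviation).
PHRASES = {
    (2054, "English", "long"): "truck",
    (2054, "French", "long"): "camion",
    (2054, "English", "short"): "trk",
    (2054, "British English", "long"): "lorry",
}


def find_phrase(concept_id, language, abbreviation_type, phrases=PHRASES):
    """Try the exact language, then its MajorLanguage, then AllLanguage."""
    candidates = [language]
    if language in PARENT:
        candidates.append(PARENT[language])
    candidates.append(ALL_LANGUAGE)
    for candidate in candidates:
        phrase = phrases.get((concept_id, candidate, abbreviation_type))
        if phrase is not None:
            return phrase
    return None


print(find_phrase(2054, "British English", "long"))
print(find_phrase(2054, "British English", "short"))
```

Because the key starts with the concept ID rather than a source string, the lookup has no favored source language, in contrast to the phrase-to-phrase approach.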
A translation service is helpful when software must support multiple languages such as English, French, and Japanese. The need for such a capability often arises and can be delivered as a service apart from any particular application. This chapter presents several approaches to language translation.
Several commercial products have language translation capabilities including Multilizer, Schaudin, Lionbridge, and Xataface.
The terms internationalization and localization are prominent in the literature. “Internationalization is the process of designing a software application so that it can be adapted to various languages and regions without engineering changes. Localization is the process of adapting software for a specific region or language by adding locale-specific components and translating text.” [Wikipedia] The models in this chapter deal with internationalization. The population of data addresses localization.
[Warmer-1999] Jos Warmer and Anneke Kleppe. The Object Constraint Language. Boston: Addison-Wesley, 1999.