Internationalization

Only a few years ago, the World Wide Web was primarily an English-language medium, at least in the eyes of people from the United States and the United Kingdom. Yet English is the native language of only a small minority of the world's population (well under 10%), although it is widely used in economically influential countries and in international trade. Support for English on the Web was obviously essential, but it has become increasingly clear that support for many other languages is essential too, if the Web is to be truly worldwide.

Note

In this lesson, you will learn many terms relating to characters that have very specific meanings. If you are unfamiliar with this material, you might find the new terminology a little confusing. It will make sense if you work through it slowly.



Support for multiple languages raises many issues for people who think primarily or exclusively in English. For example, how do you express characters that cannot simply be typed using an English-language keyboard, such as German letters that carry an umlaut (ü) or French vowels that carry an acute accent (é)?

Note

The idea of internationalization is an important one in the XML world. Because internationalization is a long word, you will often see it abbreviated as i18n.



To understand how to solve this general problem, you need to consider how characters are encoded.

Character Encodings

This section looks at how individual characters and sets of characters are encoded on a computer.

As mentioned earlier, English characters can be entered into computer memory most simply from the keyboard. However, at a fundamental level, computers understand only numbers. A character must be represented in some way as a number so that you can use text in a computer. A mapping from a set of numbers to a set of characters is stored internally.
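The mapping between characters and numbers described above can be inspected directly in most programming languages. The following is a minimal sketch in Python (a language chosen here purely for illustration; it is not otherwise used in this lesson), relying on the built-in ord and chr functions.

```python
# Each character is stored as a number; ord() reveals that number
# and chr() maps a number back to its character.
text = "XML"
codes = [ord(ch) for ch in text]
print(codes)  # [88, 77, 76]

# Reversing the mapping recovers the original text.
restored = "".join(chr(n) for n in codes)
print(restored)  # XML
```

The round trip shows that the text and the list of numbers carry the same information; the computer stores only the numbers.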

English-language characters (plus some other characters) can be represented using the American Standard Code for Information Interchange (ASCII). ASCII is a 7-bit code that defines 128 (2^7) characters, although each character is usually stored in an 8-bit byte. All the letters, digits, and common punctuation marks used in English are among those 128 characters. The hexadecimal number 21, for example, corresponds to the exclamation mark character (!).

Note

The characters up to hexadecimal 0020 don’t display a visible character onscreen, but they might affect screen appearance. Character 0020 (decimal 32), for example, is the space character.
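The specific codes mentioned so far can be checked with a short sketch. It is written in Python as an illustrative assumption (the lesson itself does not use Python), and the variable names are hypothetical.

```python
# Hexadecimal 21 (decimal 33) is the exclamation mark.
exclamation = chr(0x21)
print(exclamation)  # !
print(0x21)         # 33

# Hexadecimal 20 (decimal 32) is the space character.
space = chr(0x20)

# Codes below hexadecimal 20 are control characters with no visible glyph.
control_codes = [n for n in range(0x20) if not chr(n).isprintable()]
print(len(control_codes))  # 32
```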



Let’s look at ASCII characters in a little more detail. If you are using a computer that is running a recent version of Microsoft Windows, you will have access to the Character Map, which allows you to express the ASCII character set as well as many non-English characters. You will use that to begin to explore some issues related to characters, their representation, and their display.

In Windows 2000, to access the Character Map, choose the Start button, Programs, System Tools, Character Map. When you run the Character Map utility, you will see the window shown in Figure 6.1.

Figure 6.1. The Character Map utility in Windows 2000.


In this figure, the font applied initially is the Arial font. So, all characters displayed use the Arial font unless the user chooses an alternative font. We will return to the discussion of fonts a little later.

First, hover the mouse over the exclamation mark (!) in the top-left corner of the Character Map program. You will see a ToolTip:

U+0021 - Exclamation Mark 

The meaning of the final part of the ToolTip is obvious: It is simply the natural English term for the character indicated by the mouse.

The U+0021 corresponds to the hexadecimal number 21 (decimal 33) mentioned earlier. The U+ prefix indicates that the Unicode system (explained in greater detail later in the chapter) is in use and that the particular character is number 0021. The 0021 is expressed in hexadecimal notation; in decimal notation, it would be written as 0033.
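The relationship between the ToolTip text and the underlying number can be reproduced with a small sketch. It assumes Python and its standard unicodedata module; the describe function is a hypothetical helper introduced only for this illustration.

```python
import unicodedata

def describe(ch):
    """Return the U+XXXX label, the decimal value, and the Unicode name of one character."""
    n = ord(ch)
    return f"U+{n:04X}", n, unicodedata.name(ch)

label, decimal, name = describe("!")
print(label)    # U+0021
print(decimal)  # 33
print(name)     # EXCLAMATION MARK
```

The label pads the hexadecimal value to four digits, which is the convention the Character Map ToolTip follows.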

Listing 6.1 shows an XML representation of the exclamation mark in hexadecimal and decimal notation, as well as the literal exclamation mark character.

Listing 6.1. Exclamations.xml: Expressing Characters in Hexadecimal and Decimal
<?xml version='1.0'?> 
<exclamations> 
<exc>&#x0021;</exc> 
<exc>&#0033;</exc> 
<exc>!</exc> 
</exclamations> 

If you open the file in the Internet Explorer 5.5 browser, you will see something similar to Figure 6.2. Any other XML-compatible browser should give a similar appearance. Both character references display as an exclamation mark contained within the exc element, just like the literal exclamation mark in the third exc element.

Figure 6.2. Character references.
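The same result can be confirmed without a browser. The sketch below assumes Python and its standard xml.etree.ElementTree parser (neither is part of the original lesson) and holds the content of Listing 6.1 in a string rather than reading Exclamations.xml from disk.

```python
import xml.etree.ElementTree as ET

# The document from Listing 6.1, held in a string for this sketch.
source = """<?xml version='1.0'?>
<exclamations>
<exc>&#x0021;</exc>
<exc>&#0033;</exc>
<exc>!</exc>
</exclamations>"""

# The parser resolves both character references while reading the document.
root = ET.fromstring(source)
values = [exc.text for exc in root.findall("exc")]
print(values)  # ['!', '!', '!']
```

An XML parser replaces each character reference with the character it names, so an application reading the document cannot tell the three forms apart.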


Of course, what you have done so far you could have done more easily using an English-language keyboard. When you need to include characters that either are not available from an English-language keyboard or can be produced only using obscure key combinations, the Character Map utility becomes more helpful.

Suppose that you want to express a simple idea in German, “Das ist grün” (meaning, “That is green”). You need the u with the umlaut (a mark that looks the same as a diaeresis). This can be copied from the Character Map by clicking the desired character, clicking Select, and then clicking the Copy button.

This enables you to include foreign-language characters that use the Roman alphabet in English-language documents. If you scroll down in the Character Map utility, you will see characters used in such languages as Russian, Hebrew, and Arabic.
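A character such as the u with an umlaut can also be written as a character reference, in the same way as the exclamation mark in Listing 6.1. The following sketch, which assumes Python purely for illustration, derives both forms of the reference from the character's number; the variable names are hypothetical.

```python
u_umlaut = "\u00fc"          # the letter u with an umlaut
code = ord(u_umlaut)         # its number, 252 in decimal

hex_ref = f"&#x{code:04X};"  # hexadecimal character reference
dec_ref = f"&#{code};"       # decimal character reference
print(hex_ref, dec_ref)      # &#x00FC; &#252;
```

Either reference can be placed in an XML document that is typed on a keyboard lacking the character.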

So, the type of facility supplied by the Character Map provides an answer sufficient for expressing many European and some Middle Eastern languages. Of course, many languages cannot conveniently be expressed this way. For example, some Asian languages, such as Chinese, Japanese, and Korean, use many thousands of ideographic characters. An ideographic character represents a word or idea rather than an individual letter. A more general solution to supplying a desired character is needed to accommodate any written language.
