Chapter 20. ASCII and a Cast of Characters

Digital computer memory stores only bits, so anything that we want to work with on the computer must be stored in the form of bits. We've already seen how bits can represent numbers and machine code. The next challenge must be text. After all, the great bulk of the accumulated information of this world is in the form of text, and our libraries are full of books and magazines and newspapers. Although we'd eventually like to use our computers to store sounds and pictures and movies, text is a much easier place to begin.

To represent text in digital form, we must develop some kind of system in which each letter corresponds to a unique code. Numbers and punctuation also occur in text, so codes for these must be developed as well. In short, we need codes for all alphanumeric characters. Such a system is sometimes known as a coded character set, and the individual codes are known as character codes.

The first question must be: How many bits do we need for these codes? The answer isn't an easy one!

When we think about representing text using bits, let's not get too far ahead of ourselves. We're accustomed to seeing text nicely formatted on the pages of a book or in the columns of a magazine or newspaper. Paragraphs are neatly separated into lines of a consistent width. Yet this formatting isn't essential to the text itself. When we read a short story in a magazine and years later encounter that same story in a book, we don't think the story has changed just because the text column is wider in the book than in the magazine.

In other words, don't think about text as formatted into two-dimensional columns on the printed page. Think of text instead as a one-dimensional stream of letters, numbers, and punctuation marks, with perhaps an additional code to indicate the end of one paragraph and the start of another.

Again, if you read a story in a magazine and later see it in a book and the typeface is a little different, is that a big deal? If the magazine version begins

Call me Ishmael.

and the book version begins

Call me Ishmael.

is that something we really want to be concerned with just yet? Probably not. Yes, the typeface subtly affects the tone of the text, but the story hasn't been lost with the change of typeface. The typeface can always be changed back. There's no harm done.

Here's another way we're going to simplify the problem: Let's stick to plain vanilla text. No italics, no boldface, no underlining, no colors, no outlined letters, no subscripts, no superscripts. And no accent marks. No Å or é or ñ or ö. Just the naked Latin alphabet as it's used in 99 percent of English.

In our earlier studies of Morse code and Braille, we've already seen how the letters of the alphabet can be represented in a binary form. Although these systems are fine for their specific purposes, both have their failings when it comes to computers. Morse code, for example, is a variable-width code: It uses shorter codes for frequently used letters and longer codes for less common ones. Such a code is suitable for telegraphy, but it might be awkward for computers. In addition, Morse code doesn't differentiate between uppercase and lowercase versions of letters.

Braille is a fixed-width code, which is much preferable for computers. Every letter is represented by 6 bits. Braille also differentiates between uppercase and lowercase letters, although it does so with the use of a special escape code. This code indicates that the next character is uppercase. What this really means is that every capital letter requires two codes rather than one. Numbers are represented with a shift code: After that special code, the codes that follow are assumed to represent numbers until another shift code signals the return to letters.

Our goal here is to develop a coded character set so that a sentence such as

I have 27 sisters.

can be represented by a series of codes, each of which is a certain number of bits. Some of the codes will represent letters, some will represent punctuation marks, and some will represent numbers. There should even be a code that represents the space between words. There are 18 characters in that sentence (including the spaces between the words). The consecutive character codes for such a sentence are often referred to as a text string.

That we need codes for numbers in a text string such as 27 might seem odd because we've been using bits to represent numbers for many chapters now. We may be tempted to assume that the codes for the 2 and 7 in this sentence are simply the binary numbers 10 and 111. But that's not necessarily the case. In the context of a sentence such as this, the characters 2 and 7 can be treated like any other character found in written English. They can have character codes that are completely unrelated to the actual values of the numbers.

Perhaps the most economical code for text is a 5-bit code that originated in an 1874 printing telegraph developed by Emile Baudot (pronounced baw-doh), an officer in the French Telegraph Service; his code was adopted by the Service in 1877. This code was later modified by Donald Murray and standardized in 1931 by the Comité Consultatif International Télégraphique et Téléphonique (CCITT), which is now known as the International Telecommunication Union (ITU). The code is formally known as the International Telegraph Alphabet No. 2, or ITA-2, and it's more popularly known in the United States as Baudot, although it's more correctly called the Murray code.

In the twentieth century, Baudot was often used in teletypewriters. A Baudot teletypewriter has a keyboard that looks something like a typewriter, except that it has only 30 keys and a spacebar. Teletypewriter keys are actually switches that cause a binary code to be generated and sent down the teletypewriter's output cable, one bit after the other. A teletypewriter also contains a printing mechanism. Codes coming through the teletypewriter's input cable trigger electromagnets that print characters on paper.

Because Baudot is a 5-bit code, there are only 32 codes. The hexadecimal values of these codes range from 00h through 1Fh. Here's how the 32 available codes correspond to the letters of the alphabet:

Hex Code   Baudot Letter      Hex Code   Baudot Letter
00         (unassigned)       10         E
01         T                  11         Z
02         Carriage Return    12         D
03         O                  13         B
04         Space              14         S
05         H                  15         Y
06         N                  16         F
07         M                  17         X
08         Line Feed          18         A
09         L                  19         W
0A         R                  1A         J
0B         G                  1B         Figure Shift
0C         I                  1C         U
0D         P                  1D         Q
0E         C                  1E         K
0F         V                  1F         Letter Shift

Code 00h isn't assigned to anything. Of the remaining 31 codes, 26 are assigned to letters of the alphabet, and the other five are the named codes in the table: Carriage Return, Space, Line Feed, Figure Shift, and Letter Shift.

Code 04h is the Space code, which is used for the space separating words. Codes 02h and 08h are labeled Carriage Return and Line Feed. This terminology comes from the typewriter. When you're typing on a typewriter and reach the end of a line, you push a lever or button that does two things. First, it causes the carriage to be moved to the right so that the next line begins at the left side of the paper. That's a carriage return. Second, the typewriter rolls the carriage so that the next line is underneath the line you just finished. That's the linefeed. In Baudot, separate keyboard keys generate these two codes. A Baudot teletypewriter printer responds to these two codes when printing.

Where are the numbers and punctuation marks in the Baudot system? That's the purpose of code 1Bh, identified in the table as Figure Shift. After the Figure Shift code, all subsequent codes are interpreted as numbers or punctuation marks until the Letter Shift code (1Fh) causes them to revert to the letters. Here are the codes for the numbers and punctuation:

Hex Code   Baudot Figure      Hex Code   Baudot Figure
00         (unassigned)       10         3
01         5                  11         +
02         Carriage Return    12         Who Are You?
03         9                  13         ?
04         Space              14         '
05         #                  15         6
06         ,                  16         $
07         .                  17         /
08         Line Feed          18         -
09         )                  19         2
0A         4                  1A         Bell
0B         &                  1B         Figure Shift
0C         8                  1C         7
0D         0                  1D         1
0E         :                  1E         (
0F         =                  1F         Letter Shift

Actually, the code as formalized by the ITU doesn't define codes 05h, 0Bh, and 16h, and instead reserves them "for national use." The table shows how these codes were used in the United States. The same codes were often used for accented letters of some European languages. The Bell code is supposed to ring an audible bell on the teletypewriter. The "Who Are You?" code activates a mechanism whereby a teletypewriter can identify itself.

Like Morse code, this 5-bit code doesn't differentiate between uppercase and lowercase. The sentence

I SPENT $25 TODAY.

is represented by the following stream of hexadecimal data:

0C 04 14 0D 10 06 01 04 1B 16 19 01 1F 04 01 03 12 18 15 1B 07 02 08

Notice the three shift codes: 1Bh right before the number, 1Fh after the number, and 1Bh again before the final period. The line concludes with carriage-return and linefeed codes.

Unfortunately, if you sent this stream of data to a teletypewriter printer twice in a row, it would come out like this:

I SPENT $25 TODAY.
8 '03,5 $25 TODAY.

What happened? The last shift code the printer received before the second line was a Figure Shift code, so the codes at the beginning of the second line were interpreted as numbers.
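Here's a small Python sketch of a Baudot printer that reproduces the problem. It's only an illustration: the names LETTERS, FIGURES, and baudot_decode are invented for this example, the two lists simply transcribe the tables above, and the sketch assumes that the Bell and "Who Are You?" codes print nothing.

```python
FIGURE_SHIFT = 0x1B
LETTER_SHIFT = 0x1F

# Characters indexed by code 00h-1Fh, transcribed from the two tables.
# None marks codes that print nothing in this sketch (unassigned,
# the shift codes, Bell, and "Who Are You?").
LETTERS = [None, "T", "\r", "O", " ", "H", "N", "M",
           "\n", "L", "R", "G", "I", "P", "C", "V",
           "E", "Z", "D", "B", "S", "Y", "F", "X",
           "A", "W", "J", None, "U", "Q", "K", None]
FIGURES = [None, "5", "\r", "9", " ", "#", ",", ".",
           "\n", ")", "4", "&", "8", "0", ":", "=",
           "3", "+", None, "?", "'", "6", "$", "/",
           "-", "2", None, None, "7", "1", "(", None]


def baudot_decode(codes, in_figures=False):
    """Decode 5-bit codes; return the text and the final shift state."""
    out = []
    for code in codes:
        if code == FIGURE_SHIFT:
            in_figures = True
        elif code == LETTER_SHIFT:
            in_figures = False
        else:
            ch = (FIGURES if in_figures else LETTERS)[code]
            if ch is not None:
                out.append(ch)
    return "".join(out), in_figures


LINE = bytes.fromhex("0C 04 14 0D 10 06 01 04 1B 16 19 01 1F"
                     " 04 01 03 12 18 15 1B 07 02 08")

# The printer remembers its shift state between the two transmissions.
first, state = baudot_decode(LINE)
second, _ = baudot_decode(LINE, in_figures=state)
print(first + second, end="")
```

The decoder has to carry the shift state from one code to the next, so the meaning of any code depends on everything that came before it. That hidden state is what garbles the second line.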

Problems like this are typical nasty results of using shift codes. Although Baudot is certainly an economical code, it's probably preferable to use unique codes for numbers and punctuation, as well as separate codes for lowercase and uppercase letters.

So if we want to figure out how many bits we need for a better character encoding system than Baudot, just add them up: We need 52 codes just for the uppercase and lowercase letters and 10 codes for the digits 0 through 9. We're up to 62 already. Throw in a few punctuation marks, and we top 64 codes, which means we need more than 6 bits. But we seem to have lots of leeway before we exceed 128 characters, which would require 8 bits.

So the answer is 7. We need 7 bits to represent the characters of English text if we want uppercase and lowercase with no shifting.
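The same counting can be written as a short Python sketch. The function name bits_needed is made up for this illustration, and the "three punctuation marks" figure is just a stand-in for "a few."

```python
def bits_needed(count):
    """Smallest number of bits that provides at least `count` codes."""
    return max(1, (count - 1).bit_length())


letters = 26 * 2   # uppercase and lowercase
digits = 10

print(bits_needed(letters + digits))        # 6: 62 codes still fit in 6 bits
print(bits_needed(letters + digits + 3))    # 7: 65 codes no longer fit
print(bits_needed(128), bits_needed(129))   # 7 8
```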

And what are these codes? Well, the actual codes can be anything we want. If we were going to build our own computer, and we were going to build every piece of hardware required by this computer, and we were going to program this computer ourselves and never use the computer to connect to any other computer, we could make up our own codes. All we need do is assign every character we'll be using a unique code.

But since it's rarely the case that computers are built and used in isolation, it makes more sense for everyone to agree to use the same codes. That way, the computers that we build can be more compatible with one another and maybe even actually exchange textual information.

We also probably shouldn't assign codes in a haphazard manner. For example, when we work with text on the computer, certain advantages accrue if the letters of the alphabet are assigned to sequential codes. This ordering scheme makes alphabetizing and sorting easier.

Fortunately, such a standard has already been developed. It's called the American Standard Code for Information Interchange, abbreviated ASCII, and referred to with the unlikely pronunciation ASS-key. It was formalized in 1967 and remains the single most important standard in the entire computer industry. With one big exception (which I'll describe later), whenever you deal with text on a computer you can be sure that ASCII is involved in some way.

ASCII is a 7-bit code using binary codes 0000000 through 1111111, which are hexadecimal codes 00h through 7Fh. Let's take a look at the ASCII codes, but let's not start at the very beginning because the first 32 codes are conceptually a bit more difficult than the rest of the codes. I'll begin with the second batch of 32 codes, which includes punctuation and the ten numeric digits. This table shows the hexadecimal code and the character that corresponds to that code:

Hex Code   ASCII Character    Hex Code   ASCII Character
20         Space              30         0
21         !                  31         1
22         "                  32         2
23         #                  33         3
24         $                  34         4
25         %                  35         5
26         &                  36         6
27         '                  37         7
28         (                  38         8
29         )                  39         9
2A         *                  3A         :
2B         +                  3B         ;
2C         ,                  3C         <
2D         -                  3D         =
2E         .                  3E         >
2F         /                  3F         ?

Notice that 20h is the space character that divides words and sentences.

The next 32 codes include the uppercase letters and some additional punctuation. Aside from the @ sign and the underscore, these punctuation symbols aren't normally found on typewriters. They're all now standard on computer keyboards.

Hex Code   ASCII Character    Hex Code   ASCII Character
40         @                  50         P
41         A                  51         Q
42         B                  52         R
43         C                  53         S
44         D                  54         T
45         E                  55         U
46         F                  56         V
47         G                  57         W
48         H                  58         X
49         I                  59         Y
4A         J                  5A         Z
4B         K                  5B         [
4C         L                  5C         \
4D         M                  5D         ]
4E         N                  5E         ^
4F         O                  5F         _

The next 32 characters include all the lowercase letters and some additional punctuation, again not often found on typewriters:

Hex Code   ASCII Character    Hex Code   ASCII Character
60         `                  70         p
61         a                  71         q
62         b                  72         r
63         c                  73         s
64         d                  74         t
65         e                  75         u
66         f                  76         v
67         g                  77         w
68         h                  78         x
69         i                  79         y
6A         j                  7A         z
6B         k                  7B         {
6C         l                  7C         |
6D         m                  7D         }
6E         n                  7E         ~
6F         o

Notice that this table is missing the last character corresponding to code 7Fh. If you're keeping count, the three tables here show a total of 95 characters. Because ASCII is a 7-bit code, 128 codes are possible, so 33 more codes should be available. I'll get to those shortly.

The text string

Hello, you!

can be represented in ASCII using the hexadecimal codes

48 65 6C 6C 6F 2C 20 79 6F 75 21

Notice the comma (code 2C), the space (code 20) and the exclamation point (code 21) as well as the codes for the letters. Here's another short sentence:

I am 12 years old.

and its ASCII representation:

49 20 61 6D 20 31 32 20 79 65 61 72 73 20 6F 6C 64 2E

Notice that the number 12 in this sentence is represented by the hexadecimal numbers 31h and 32h, which are the ASCII codes for the digits 1 and 2. When the number 12 is part of a text stream, it should not be represented by the hexadecimal codes 01h and 02h, or the BCD code 12h, or the hexadecimal code 0Ch. These other codes all mean something else in ASCII.
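A few lines of Python can confirm these codes and show the difference between the characters "12" and the number 12. The helper name ascii_hex is invented for this sketch.

```python
def ascii_hex(text):
    """Return the ASCII codes of text as space-separated hexadecimal."""
    return " ".join(f"{code:02X}" for code in text.encode("ascii"))


print(ascii_hex("Hello, you!"))          # 48 65 6C 6C 6F 2C 20 79 6F 75 21
print(ascii_hex("I am 12 years old."))

print(ascii_hex("12"))   # 31 32: two character codes
print(f"{12:02X}")       # 0C: the number twelve as a single byte

# Turning the text "12" into a number: each digit's value is its
# ASCII code minus 30h.
value = 0
for code in "12".encode("ascii"):
    value = value * 10 + (code - 0x30)
print(value)             # 12
```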

A particular uppercase letter in ASCII differs from its lowercase counterpart by 20h. This fact makes it fairly easy to write some code that (for example) capitalizes a string of text. Suppose a certain area of memory contains a text string, one character per byte. The following 8080 subroutine assumes that the address of the first character in the text string is stored in register HL. Register C contains the length of that text string, which is the number of characters:

Capitalize: MOV A,C         ; C = number of characters left
            CPI A,00h       ; Compare with 0
            JZ  AllDone     ; If C is 0, we're finished

            MOV A,[HL]      ; Get the next character
            CPI A,61h       ; Check if it's less than 'a'
            JC  SkipIt      ; If so, ignore it

            CPI A,7Bh       ; Check if it's greater than 'z'
            JNC SkipIt      ; If so, ignore it

            SUI A,20h       ; It's lowercase, so subtract 20h
                            ; (SUI ignores the carry left set by CPI)
            MOV [HL],A      ; Store the character

SkipIt:     INX HL          ; Increment the text address
            DCR C           ; Decrement the counter
            JMP Capitalize  ; Go back to the top

AllDone:    RET

The statement that subtracts 20h from the lowercase letter to convert it to uppercase can be replaced with this:

ANI A,DFh

The ANI instruction is an AND Immediate. It performs a bitwise AND operation between the value in the accumulator and the value DFh, which is 11011111 in binary. By bitwise, I mean that the instruction performs an AND operation between each pair of corresponding bits that make up the two numbers. This AND operation preserves all the bits in A except the third from the left, which is set to 0. Setting that bit to 0 also effectively converts an ASCII lowercase letter to uppercase.
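For comparison, here's the same idea as a Python sketch. It isn't a translation of the 8080 routine, just the same range check and bit mask applied to a bytearray; the function name capitalize is chosen for this example.

```python
def capitalize(buffer):
    """Convert ASCII lowercase letters in a bytearray to uppercase, in place."""
    for i, code in enumerate(buffer):
        if 0x61 <= code <= 0x7A:        # 'a' through 'z' only
            buffer[i] = code & 0xDF     # clear the bit worth 20h


text = bytearray(b"Hello, you!")
capitalize(text)
print(text.decode("ascii"))   # HELLO, YOU!
```

The range check still matters with the AND approach. Without it, a character such as { (7Bh) would be turned into [ (5Bh).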

The 95 codes shown above are said to refer to graphic characters because they have a visual representation. ASCII also includes 33 control characters that have no visual representation but instead perform certain functions. For the sake of completeness, here are the 33 ASCII control characters, but don't worry if they seem mostly incomprehensible. At the time ASCII was developed, it was intended mostly for teletypewriters, and many of these codes are currently obscure.

Hex Code   Acronym   Control Character Name
00         NUL       Null (Nothing)
01         SOH       Start of Heading
02         STX       Start of Text
03         ETX       End of Text
04         EOT       End of Transmission
05         ENQ       Enquiry (i.e., Inquiry)
06         ACK       Acknowledge
07         BEL       Bell
08         BS        Backspace
09         HT        Horizontal Tabulation
0A         LF        Line Feed
0B         VT        Vertical Tabulation
0C         FF        Form Feed
0D         CR        Carriage Return
0E         SO        Shift-Out
0F         SI        Shift-In
10         DLE       Data Link Escape
11         DC1       Device Control 1
12         DC2       Device Control 2
13         DC3       Device Control 3
14         DC4       Device Control 4
15         NAK       Negative Acknowledge
16         SYN       Synchronous Idle
17         ETB       End of Transmission Block
18         CAN       Cancel
19         EM        End of Medium
1A         SUB       Substitute Character
1B         ESC       Escape
1C         FS        File Separator or Information Separator 4
1D         GS        Group Separator or Information Separator 3
1E         RS        Record Separator or Information Separator 2
1F         US        Unit Separator or Information Separator 1
7F         DEL       Delete

The idea here is that control characters can be intermixed with graphic characters to do some rudimentary formatting of the text. This is easiest to understand if you think of a device—such as a teletypewriter or a simple printer—that types characters on a page in response to a stream of ASCII codes. The device's printing head normally responds to character codes by printing a character and moving one space to the right. The most important control characters alter this normal behavior.

For example, consider the hexadecimal character string

41 09 42 09 43 09

The 09 character is a Horizontal Tabulation code, or Tab for short. If you think of all the horizontal character positions on the printer page as being numbered starting with 0, the Tab code usually means to print the next character at the next horizontal position that's a multiple of 8, like this:

A       B       C

This is a handy way to keep text lined up in columns.
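The tab rule can be sketched in Python as follows. The function name expand_tabs and the choice to render each Tab as spaces are assumptions made for this illustration; a real printer would simply move its print head.

```python
def expand_tabs(text, tab_width=8):
    """Replace each Tab with spaces up to the next multiple of tab_width."""
    out = []
    column = 0
    for ch in text:
        if ch == "\t":
            spaces = tab_width - column % tab_width
            out.append(" " * spaces)
            column += spaces
        else:
            out.append(ch)
            column += 1
    return "".join(out)


line = bytes.fromhex("41 09 42 09 43 09").decode("ascii")
print(repr(expand_tabs(line)))
```

Python's built-in str.expandtabs method follows the same rule, so line.expandtabs(8) gives the same result.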

Even today, many computer printers respond to a Form Feed code (0Ch) by ejecting the current page and starting a new page.

The Backspace code can be used for printing composite characters on some old printers. For example, suppose the computer controlling the teletypewriter wanted to display a lowercase e with a grave accent mark, like so: è. This could be achieved by using the hexadecimal codes 65 08 60.

By far the most important control codes are Carriage Return and Line Feed, which have the same meaning as the similar Baudot codes. On a printer, the Carriage Return code moves the printing head to the left side of the page, and the Line Feed code moves the printing head one line down. Both codes are generally required to go to a new line. A Carriage Return can be used by itself to print over an existing line, and a Line Feed can be used by itself to skip to the next line without moving to the left margin.

Although ASCII is the dominant standard in the computing world, it isn't used on many of IBM's larger computer systems. In connection with the System/360, IBM developed its own 8-bit character code known as the Extended BCD Interchange Code, or EBCDIC (pronounced EBB-see-dick), which was an extension of an earlier 6-bit code known as BCDIC, which was derived from codes used on IBM punch cards. This style of punch card—capable of storing 80 characters of text—was introduced by IBM in 1928 and used for over 50 years.

[Figure: an IBM punch card]

When considering the relationship between punch cards and their associated 8-bit EBCDIC character codes, keep in mind that these codes evolved over many decades under several different types of technologies. For that reason, don't expect to discover too much logic or consistency.

A character is encoded on a punch card by a combination of one or more rectangular holes punched in a single column. The character itself is often printed near the top of the card. The lower 10 rows are identified by number and are known as the 0-row, the 1-row, and so forth through the 9-row. The unnumbered row above the 0-row is called the 11-row, and the top row is called the 12-row. There is no 10-row.

More IBM punch card terminology: Rows 0 through 9 are known as the digit rows, or digit punches. Rows 11 and 12 are known as the zone rows, or zone punches. And some IBM punch card confusion: Sometimes rows 0 and 9 are considered to be zone rows rather than digit rows.

An 8-bit EBCDIC character code is composed of a high-order nibble (4-bit value) and a low-order nibble. The low-order nibble is the BCD code corresponding to the digit punches of the character. The high-order nibble is a code corresponding (in a fairly arbitrary way) to the zone punches of the character. You'll recall from Chapter 19 that BCD stands for binary-coded decimal, a 4-bit code for digits 0 through 9.

For the digits 0 through 9, there are no zone punches. That lack of punches corresponds to a high-order nibble of 1111. The low-order nibble is the BCD code of the digit punch. Here's a table of EBCDIC codes for the digits 0 through 9:

Hex Code   EBCDIC Character
F0         0
F1         1
F2         2
F3         3
F4         4
F5         5
F6         6
F7         7
F8         8
F9         9

For the uppercase letters, a zone punch of just the 12-row is indicated by the nibble 1100, a zone punch of just the 11-row is indicated by the nibble 1101, and a zone punch of just the 0-row is indicated by the nibble 1110. The EBCDIC codes for the uppercase letters are

Hex Code   EBCDIC Character   Hex Code   EBCDIC Character   Hex Code   EBCDIC Character
C1         A                  D1         J
C2         B                  D2         K                  E2         S
C3         C                  D3         L                  E3         T
C4         D                  D4         M                  E4         U
C5         E                  D5         N                  E5         V
C6         F                  D6         O                  E6         W
C7         G                  D7         P                  E7         X
C8         H                  D8         Q                  E8         Y
C9         I                  D9         R                  E9         Z

Notice the gaps in the numbering of these codes. In some applications, these gaps can be maddening when you're writing programs using EBCDIC text.
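To see why the gaps are a nuisance, here's a Python sketch that builds the uppercase codes from the zone and digit nibbles described above. The function name ebcdic_upper is invented for this example. The last line cross-checks the result against Python's cp037 codec, which implements one particular EBCDIC variant; using that variant is an assumption, since the text doesn't name one.

```python
def ebcdic_upper(letter):
    """EBCDIC code for an uppercase letter, built from zone and digit nibbles."""
    index = ord(letter) - ord("A")        # 0 for A through 25 for Z
    if index < 9:                         # A-I: zone 1100, digits 1-9
        return 0xC0 | (index + 1)
    if index < 18:                        # J-R: zone 1101, digits 1-9
        return 0xD0 | (index - 9 + 1)
    return 0xE0 | (index - 18 + 2)        # S-Z: zone 1110, digits 2-9


ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
codes = [ebcdic_upper(letter) for letter in ALPHABET]
print(" ".join(f"{code:02X}" for code in codes))

# In ASCII, consecutive letters always differ by 1. Not so here:
print(ebcdic_upper("J") - ebcdic_upper("I"))   # 8
print(ebcdic_upper("S") - ebcdic_upper("R"))   # 9

# Cross-check against Python's cp037 codec (an assumed EBCDIC variant).
print(bytes(codes) == ALPHABET.encode("cp037"))
```

A program that steps through the alphabet by adding 1 to a character code works in ASCII but lands on undefined codes in EBCDIC, which is why the three-way test is needed.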

The lowercase letters have the same digit punches as the uppercase letters but different zone punches. For lowercase letters a through i, the 12-row and 0-row are punched, corresponding to the code 1000. For j through r, the 12-row and 11-row are punched. This is the code 1001. For the letters s through z, the 11-row and 0-row are punched—the code 1010. The EBCDIC codes for the lowercase letters are

Hex Code   EBCDIC Character   Hex Code   EBCDIC Character   Hex Code   EBCDIC Character
81         a                  91         j
82         b                  92         k                  A2         s
83         c                  93         l                  A3         t
84         d                  94         m                  A4         u
85         e                  95         n                  A5         v
86         f                  96         o                  A6         w
87         g                  97         p                  A7         x
88         h                  98         q                  A8         y
89         i                  99         r                  A9         z

Of course, there are other EBCDIC codes for punctuation and control characters, but it's hardly necessary to do a full-blown exploration of this system.

It might seem as if each column of an IBM punch card is sufficient to encode 12 bits of information. Each hole is a bit, right? So it should be possible to encode ASCII character codes on a punch card using only 7 of the 12 positions in each column. But in practice, this doesn't work very well. Too many holes get punched, threatening the physical integrity of the card.

Many of the 8-bit codes in EBCDIC aren't defined, suggesting that the use of 7 bits in ASCII makes more sense. At the time ASCII was being developed, memory was very expensive. Some people felt that ASCII should be a 6-bit code using a shift character to differentiate between lowercase and uppercase to conserve memory. Once that idea was rejected, others believed that ASCII should be an 8-bit code because even at that time it was considered more likely that computers would have 8-bit architectures than 7-bit architectures. Of course, 8-bit bytes are now the standard. Although ASCII is technically a 7-bit code, it's almost universally stored as 8-bit values.

The equivalence of bytes and characters is certainly convenient because we can get a rough sense of how much computer memory a particular text document requires simply by counting the characters. To some, the kilos and megas of computer storage are more comprehensible when expressed in terms of text.

For example, a traditional double-spaced typewritten 8½-by-11-inch page with 1-inch margins has about 27 lines of text. Each line is about 6½ inches wide with 10 characters per inch, for a total of about 1750 bytes. A single-spaced typewritten page has about double that, or 3.5 kilobytes.

A page in The New Yorker magazine has 3 columns of text with 60 lines per column and about 40 characters per line. That's 7200 characters (or bytes) per page.

The New York Times has six columns of text per page. If the entire page is covered with text without any titles or pictures (which is highly unusual), each column has 155 lines of about 35 characters each. The entire page has 32,550 characters, or 32 kilobytes.

A hardcover book has about 500 words per page. An average word is about 5 letters—actually 6 characters, counting the space between words. So a book has about 3000 characters per page. Let's say the average book has 333 pages, which may be a made-up figure but nicely implies that the average book is about 1 million bytes, or 1 megabyte.
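The arithmetic behind these estimates is simple enough to check in a few lines of Python. The variable names are made up for this sketch, and the figures are the rough ones quoted above, at one byte per character.

```python
# 27 lines, each 6.5 inches wide at 10 characters per inch
typed_page = 27 * 65

# 3 columns of 60 lines, about 40 characters per line
new_yorker_page = 3 * 60 * 40

# 6 columns of 155 lines, about 35 characters per line
times_page = 6 * 155 * 35

# 500 words per page, 6 characters per word, 333 pages
book = 500 * 6 * 333

print(typed_page)        # 1755
print(new_yorker_page)   # 7200
print(times_page)        # 32550
print(book)              # 999000
```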

Of course, books vary all over the place:

F. Scott Fitzgerald's The Great Gatsby is about 300 kilobytes.
J. D. Salinger's Catcher in the Rye is about 400 kilobytes.
Mark Twain's The Adventures of Huckleberry Finn is about 540 kilobytes.
John Steinbeck's The Grapes of Wrath is about a megabyte.
Herman Melville's Moby Dick is about 1.3 megabytes.
Henry Fielding's The History of Tom Jones is about 2.25 megabytes.
Margaret Mitchell's Gone with the Wind is about 2.5 megabytes.
Stephen King's complete and uncut The Stand is about 2.7 megabytes.
Leo Tolstoy's War and Peace is about 3.9 megabytes.
Marcel Proust's Remembrance of Things Past is about 7.7 megabytes.

The United States Library of Congress has about 20 million books for a total of 20 trillion characters, or 20 terabytes, of text data. (It has a bunch of photographs and sound recordings as well.)

Although ASCII is certainly the most important standard in the computer industry, it isn't perfect. The big problem with the American Standard Code for Information Interchange is that it's just too darn American! Indeed, ASCII is hardly suitable even for other nations whose principal language is English. Although ASCII includes a dollar sign, where is the British pound sign? And what about the accented letters used in many Western European languages? To say nothing of the non-Latin alphabets used in Europe, including Greek, Arabic, Hebrew, and Cyrillic. Or the Brahmi scripts of India and Southeast Asia, including Devanagari, Bengali, Thai, and Tibetan. And how can a 7-bit code possibly handle the tens of thousands of ideographs of Chinese, Japanese, and Korean and the ten thousand–odd Hangul syllables of Korean?

Even when ASCII was being developed, the needs of some other nations were kept in mind, although without much consideration for non-Latin alphabets. According to the published ASCII standard, ten ASCII codes (40h, 5Bh, 5Ch, 5Dh, 5Eh, 60h, 7Bh, 7Ch, 7Dh, and 7Eh) are available to be redefined for national uses. In addition, the number sign (#) can be replaced by the British pound sign (£), and the dollar sign ($) can be replaced by a generalized currency sign (¤) if necessary. Obviously, replacing symbols makes sense only when everyone involved in using a particular text document containing these redefined codes knows about the change.

Because many computer systems store characters as 8-bit values, it's possible to devise an extended ASCII character set that contains 256 characters rather than just 128. In such a character set, codes 00h through 7Fh are defined just as they are in ASCII; codes 80h through FFh can be something else entirely. This technique has been used to define additional character codes to accommodate accented letters and non-Latin alphabets. As an example, here's a 96-character extension of ASCII called the Latin Alphabet No. 1 that defines characters for codes A0h through FFh. In this table, the high-order nibble of the hexadecimal character code is shown in the top row; the low-order nibble is shown in the left column.

 

      A-     B-     C-     D-     E-     F-
-0    NBSP   °      À      Ð      à      ð
-1    ¡      ±      Á      Ñ      á      ñ
-2    ¢      ²      Â      Ò      â      ò
-3    £      ³      Ã      Ó      ã      ó
-4    ¤      ´      Ä      Ô      ä      ô
-5    ¥      µ      Å      Õ      å      õ
-6    ¦      ¶      Æ      Ö      æ      ö
-7    §      ·      Ç      ×      ç      ÷
-8    ¨      ¸      È      Ø      è      ø
-9    ©      ¹      É      Ù      é      ù
-A    ª      º      Ê      Ú      ê      ú
-B    «      »      Ë      Û      ë      û
-C    ¬      ¼      Ì      Ü      ì      ü
-D    SHY    ½      Í      Ý      í      ý
-E    ®      ¾      Î      Þ      î      þ
-F    ¯      ¿      Ï      ß      ï      ÿ

(NBSP is the no-break space and SHY is the soft hyphen.)

The character for code A0h is defined as a no-break space. Usually when a computer program formats text into lines and paragraphs, it breaks each line at a space character, which is ASCII code 20h. Code A0h is supposed to be displayed as a space but can't be used for breaking a line. A no-break space might be used in the text "WW II," for example. Code ADh is defined as a soft hyphen. This is a hyphen used to separate syllables in the middle of words. It appears on the printed page only when it's necessary to break a word between two lines.
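Python's latin-1 codec implements this extension, so a short sketch can show the 8-bit codes directly. The helper name latin1_hex is invented for this example.

```python
def latin1_hex(text):
    """Return the Latin Alphabet No. 1 codes of text as hexadecimal."""
    return " ".join(f"{code:02X}" for code in text.encode("latin-1"))


print(latin1_hex("caf\u00e9"))     # 63 61 66 E9: plain ASCII codes, then E9 for é
print(latin1_hex("WW\u00a0II"))    # 57 57 A0 49 49: A0 is the no-break space
print(latin1_hex("\u00a3"))        # A3: the British pound sign
```

Codes below 80h come out exactly as they would in ASCII; only the accented letters and extra symbols use the upper half of the byte.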

Unfortunately, many different extensions of ASCII have been defined over the decades, leading to much confusion and incompatibility. ASCII has been extended in a more radical way to encode the ideographs of Chinese, Japanese, and Korean. In one popular encoding—called Shift-JIS (Japanese Industrial Standard)—codes 81h through 9Fh actually represent the initial byte of a 2-byte character code. In this way, Shift-JIS allows for the encoding of about 6000 additional characters. Unfortunately, Shift-JIS isn't the only system that uses this technique. Three other standard double-byte character sets (DBCS) are popular in Asia.

That there are a number of incompatible double-byte character sets is only one of the problems with them. The other problem is that some characters—specifically, the normal ASCII characters—are represented by 1-byte codes, while the thousands of ideographs are represented by 2-byte codes. This makes it difficult to work with such character sets.
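Here's a Python sketch of what that difficulty looks like in practice. The function name split_double_byte is made up for this example. The lead-byte range 81h through 9Fh comes from the description above; the second range, E0h through FCh, is an addition taken from the wider Shift-JIS definition rather than from this chapter.

```python
def split_double_byte(data):
    """Split Shift-JIS bytes into 1-byte and 2-byte character codes."""
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        # 81h-9Fh is the range described in the text; E0h-FCh is an
        # assumption drawn from the fuller Shift-JIS definition.
        if 0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xFC:
            chars.append(data[i:i + 2])    # lead byte plus one more byte
            i += 2
        else:
            chars.append(data[i:i + 1])    # ordinary 1-byte code
            i += 1
    return chars


data = "A\u3042B".encode("shift_jis")      # A, hiragana "a", B
print(data.hex(), len(data))               # 4182a042 4
print(split_double_byte(data))
```

Three characters occupy 4 bytes, and the only way to find the third character is to scan from the start, checking each byte to see whether it begins a 2-byte code.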

Under the assumption that it's preferable to have just one unambiguous character encoding system that's suitable for all the world's languages, in 1988 several major computer companies got together and began developing an alternative to ASCII known as Unicode. Whereas ASCII is a 7-bit code, Unicode was designed as a 16-bit code. In this design, each and every character requires 2 bytes. That means that Unicode has character codes ranging from 0000h through FFFFh and can represent 65,536 different characters. That's enough for all the world's languages that are likely to be used in computer communication, with room for expansion.

Unicode doesn't start from scratch. The first 128 characters of Unicode—codes 0000h through 007Fh—are the same as the ASCII characters. Also, Unicode codes 00A0h through 00FFh are the same as the Latin Alphabet No. 1 extension of ASCII that I described earlier. Other worldwide standards are also incorporated into Unicode.

While Unicode may be an obvious improvement over existing character codes, that doesn't guarantee it instant acceptability. ASCII and the myriad flawed extensions of ASCII have become so entrenched in the computing world that it will be difficult to dislodge them.

The only real problem with Unicode is that it makes invalid the old equivalence between one character of text and 1 byte of storage. Encoded in ASCII, The Grapes of Wrath is about 1 megabyte in size. Encoded in Unicode, it's about 2 megabytes. But that's a small price to pay for a universal unambiguous character encoding system.
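The doubling is easy to see in a Python sketch. Python has no codec named simply "Unicode," so this example uses utf-16-be as a stand-in for a 16-bit code stored high byte first; that choice is an assumption made for illustration.

```python
text = "Hello, you!"

ascii_bytes = text.encode("ascii")
unicode_bytes = text.encode("utf-16-be")    # 16 bits per character here

print(len(ascii_bytes), len(unicode_bytes))   # 11 22
print(unicode_bytes.hex())

# For these characters, every 2-byte code is 00h followed by the ASCII code.
print(unicode_bytes[1::2] == ascii_bytes)     # True
```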
