Lesson 9. Multilingual text

After reading lesson 9, you’ll be able to

  • Access and manipulate individual letters
  • Cipher and decipher secret messages
  • Write your programs for a multilingual world

From "Hello, playground" at the beginning, you’ve been using text in your programs. The individual letters, digits, and symbols are called characters. When you string together characters and place them between quotes, it’s called a literal string.

Consider this

You know computers represent numbers with 1s and 0s. If you were a computer, how would you represent the alphabet and human language?

If you said with numbers, you’re right. Characters of the alphabet have numeric values, which means you can manipulate them like numbers.

It’s not entirely straightforward, though. The characters from every written language and countless emoji add up to well over a hundred thousand characters. There are some tricks to representing text in a space-efficient and flexible manner.

9.1. Declaring string variables

Literal values wrapped in quotes are inferred to be of the type string, so the following three lines are equivalent:

peace := "peace"
var peace = "peace"
var peace string = "peace"

If you declare a variable without providing a value, it will be initialized with the zero value for its type. The zero value for the string type is an empty string (""):

var blank string
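
As a quick check of the zero value, the sketch below declares a string without a value and compares it to an empty string. The zeroString helper is a name made up for this illustration, and it relies on a function declaration, which later lessons cover.

```go
package main

import "fmt"

// zeroString is a hypothetical helper that returns a string variable
// declared without a value.
func zeroString() string {
    var blank string // no value provided, so blank gets the zero value
    return blank
}

func main() {
    // The zero value for the string type equals an empty string.
    fmt.Println(zeroString() == "") // Prints true
}
```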

9.1.1. Raw string literals

String literals may contain escape sequences, such as the \n mentioned in lesson 2. To avoid substituting \n for a new line, you can wrap text in backticks (`) instead of quotes ("), as shown in the following listing. Backticks indicate a raw string literal.

Listing 9.1. Raw string literals: raw.go
fmt.Println("peace be upon you\nupon you be peace")
fmt.Println(`strings can span multiple lines with the \n escape sequence`)

The previous listing displays this output:

peace be upon you
upon you be peace
strings can span multiple lines with the \n escape sequence

Unlike conventional string literals, raw string literals can span multiple lines of source code, as shown in the next listing.

Listing 9.2. Multiple-line raw string literals: raw-lines.go
fmt.Println(`
    peace be upon you
    upon you be peace`)

Running listing 9.2 will produce the following output, including the tabs used for indentation:

        peace be upon you
        upon you be peace
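
One way to see exactly what a multiple-line raw string literal keeps is to write the same text as a conventional string literal with \n escapes and compare the two. This is only a sketch: rawLines and quotedLines are names invented here, and the indentation inside the raw string is assumed to be four spaces.

```go
package main

import "fmt"

// rawLines returns a raw string literal that spans three lines of source.
// The line break after the opening backtick and the leading spaces on the
// following lines are all part of the string.
func rawLines() string {
    return `
    peace be upon you
    upon you be peace`
}

// quotedLines spells out the same text with \n escape sequences.
func quotedLines() string {
    return "\n    peace be upon you\n    upon you be peace"
}

func main() {
    fmt.Println(rawLines() == quotedLines())
}
```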

Literal strings and raw strings both result in strings, as the following listing shows.

Listing 9.3. String type: raw-type.go
fmt.Printf("%v is a %[1]T\n", "literal string")             1
fmt.Printf("%v is a %[1]T\n", `raw string literal`)         2

  • 1 Prints literal string is a string
  • 2 Prints raw string literal is a string
Quick check 9.1

Q1:

For the Windows file path C:\go, would you use a string literal or a raw string literal, and why?

QC 9.1 answer

1:

Use a raw string literal `C:\go` because "C:\go" fails with an unknown escape sequence error.
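
If you do need a conventional string literal for a path like this, a backslash can be written as the escape sequence \\. The sketch below compares both spellings; rawPath and escapedPath are names made up for this example.

```go
package main

import "fmt"

// rawPath uses a raw string literal, so the backslash is kept as written.
func rawPath() string {
    return `C:\go`
}

// escapedPath uses a conventional string literal, where \\ is the escape
// sequence for a single backslash.
func escapedPath() string {
    return "C:\\go"
}

func main() {
    // Both literals produce the same five-character string.
    fmt.Println(rawPath(), escapedPath(), rawPath() == escapedPath())
}
```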

 

9.2. Characters, code points, runes, and bytes

The Unicode Consortium assigns numeric values, called code points, to characters. The Unicode code space has room for over one million code points, of which well over 100,000 have been assigned so far. For example, 65 is the code point for the capital letter A, and 128515 is a smiley face 😃.

To represent a single Unicode code point, Go provides rune, which is an alias for the int32 type.

A byte is an alias for the uint8 type. It’s intended for binary data, though byte can be used for English characters defined by ASCII, an older 128-character subset of Unicode.

Type aliases

An alias is another name for the same type, so rune and int32 are interchangeable. Though byte and rune have been in Go from the beginning, Go 1.9 introduced the ability to declare your own type aliases. The syntax looks like this:

type byte = uint8
type rune = int32
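
Because an alias is the same type under another name, a rune can be used wherever an int32 is expected without a conversion, and likewise for byte and uint8. A minimal sketch, with helper names (toInt32, toUint8) invented purely for illustration:

```go
package main

import "fmt"

// toInt32 returns its rune argument as an int32. No conversion is needed,
// because rune is just another name for int32.
func toInt32(r rune) int32 {
    return r
}

// toUint8 does the same for byte and uint8.
func toUint8(b byte) uint8 {
    return b
}

func main() {
    fmt.Println(toInt32('A'), toUint8('*')) // Prints 65 42
}
```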

Both byte and rune behave like the integer types they are aliases for, as shown in the following listing.

Listing 9.4. rune and byte: rune.go
var pi rune = 960
var alpha rune = 940
var omega rune = 969
var bang byte = 33

fmt.Printf("%v %v %v %v\n", pi, alpha, omega, bang)      1

  • 1 Prints 960 940 969 33

To display the characters rather than their numeric values, the %c format verb can be used with Printf:

fmt.Printf("%c%c%c%c\n", pi, alpha, omega, bang)      1

  • 1 Prints πάω!
Tip

Any integer type will work with %c, but the rune alias indicates that the number 960 represents a character.

So that you don’t have to memorize Unicode code points, Go provides character literals. Just enclose a character in single quotes, as in 'A'. If no type is specified, Go will infer a rune, so the following three lines are equivalent:

grade := 'A'
var grade = 'A'
var grade rune = 'A'

The grade variable still contains a numeric value, in this case 65, the code point for a capital 'A'. Character literals can also be used with the byte alias:

var star byte = '*'
Quick check 9.2

1

How many characters does ASCII encode?

2

What type is byte an alias for? What about rune?

3

What are the code points for an asterisk (*), a smiley 😃, and an acute é?

QC 9.2 answer

1

128 characters.

2

A byte is an alias for the uint8 type. A rune is an alias for the int32 type.

3

var star byte = '*'
fmt.Printf("%c %[1]v\n", star) 1

smile := '😃'
fmt.Printf("%c %[1]v\n", smile) 2

acute := 'é'
fmt.Printf("%c %[1]v\n", acute) 3

  • 1 Prints * 42
  • 2 Prints 😃 128515
  • 3 Prints é 233

 

9.3. Pulling the strings

A puppeteer manipulates a marionette by pulling on strings, but strings in Go aren’t susceptible to manipulation. A variable can be assigned to a different string, but strings themselves can’t be altered:

peace := "shalom"
peace = "salām"

Your program can access individual characters, but it can’t alter the characters of a string. The following listing uses square brackets [] to specify an index into a string, which accesses a single byte (ASCII character). The index starts from zero.

Listing 9.5. Indexing into a string: index.go
message := "shalom"
c := message[5]
fmt.Printf("%c\n", c)          1

  • 1 Prints m

Strings in Go are immutable, as they are in Python, Java, and JavaScript. Unlike strings in Ruby and character arrays in C, you can’t modify a string in Go:

message[5] = 'd'        1

  • 1 Cannot assign to message[5]
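
To end up with a changed message, build a new string and assign it to the variable. The sketch below is one possible approach for ASCII text only. It relies on slicing (s[:n]) and the + operator for joining strings, neither of which this lesson has introduced, and replaceLast is a hypothetical helper name.

```go
package main

import "fmt"

// replaceLast returns a new string with the final byte of s replaced by b.
// The original string is left untouched. Intended for ASCII text only.
func replaceLast(s string, b byte) string {
    if len(s) == 0 {
        return s
    }
    return s[:len(s)-1] + string(b)
}

func main() {
    message := "shalom"
    changed := replaceLast(message, 'd')
    fmt.Println(message, changed) // Prints shalom shalod
}
```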
Quick check 9.3

Q1:

Write a program to print each byte (ASCII character) of "shalom", one character per line.

QC 9.3 answer

1:

message := "shalom"
for i := 0; i < 6; i++ {
    c := message[i]
    fmt.Printf("%c\n", c)
}

 

9.4. Manipulating characters with Caesar cipher

One effective method of sending secret messages in the first century BC was to shift every letter, so 'a' becomes 'd', 'b' becomes 'e', and so on. The result might pass for a foreign language:

L fdph, L vdz, L frqtxhuhg.

Julius Caesar

It turns out that manipulating characters as numeric values is really easy with computers, as shown in the following listing.

Listing 9.6. Manipulate a single character: caesar.go
c := 'a'
c = c + 3
fmt.Printf("%c", c)        1

  • 1 Prints d

The code in listing 9.6 has one problem, though. It doesn’t account for all the messages about xylophones, yaks, and zebras. To address this need, the original Caesar cipher wraps around, so 'x' becomes 'a', 'y' becomes 'b', and 'z' becomes 'c'. With 26 characters in the English alphabet, it’s a simple matter:

if c > 'z' {
    c = c - 26
}

To decipher this Caesar cipher, subtract 3 instead of adding 3. But then you need to account for c < 'a' by adding 26. What a pain.
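
Both directions can share one piece of code if the shift amount is allowed to be negative. The following is a minimal sketch for lowercase ASCII letters, assuming the shift stays between -26 and 26; the shift function is a name chosen for this example.

```go
package main

import "fmt"

// shift moves a lowercase ASCII letter n places through the alphabet,
// wrapping around at either end. Other characters are returned unchanged.
// It assumes n is between -26 and 26.
func shift(c rune, n rune) rune {
    if c < 'a' || c > 'z' {
        return c
    }
    c = c + n
    if c > 'z' {
        c = c - 26 // went past the end while ciphering
    }
    if c < 'a' {
        c = c + 26 // went past the start while deciphering
    }
    return c
}

func main() {
    fmt.Printf("%c %c\n", shift('x', 3), shift('a', -3)) // Prints a x
}
```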

Quick check 9.4

Q1:

What is the result of the expression c = c - 'a' + 'A' if c is a lowercase 'g'?

QC 9.4 answer

1:

The letter is converted to uppercase:

c := 'g'
c = c - 'a' + 'A'
fmt.Printf("%c", c)        1

  • 1 Prints G

 

9.4.1. A modern variant

ROT13 (rotate 13) is a 20th century variant of Caesar cipher. It has one difference: it adds 13 instead of 3. Because 13 is half of the 26 letters in the alphabet, applying ROT13 twice restores the original text, so ciphering and deciphering are the same convenient operation.

Let’s suppose, while scanning the heavens for alien communications, the SETI Institute received a transmission with the following message:

message := "uv vagreangvbany fcnpr fgngvba"

We suspect this message is actually English text that was ciphered with ROT13. Call it a hunch. Before you can crack the code, there’s one more thing you need to know. This message is 30 characters long, which can be determined with the built-in len function:

fmt.Println(len(message))       1

  • 1 Prints 30
Note

Go has a handful of built-in functions that don’t require an import statement. The len function can determine the length for a variety of types. In this case, len returns the length of a string in bytes.

The following listing will decipher a message from space. Run it in the Go Playground to find out what the aliens are saying.

Listing 9.7. ROT13 cipher: rot13.go
message := "uv vagreangvbany fcnpr fgngvba"

for i := 0; i < len(message); i++ {        1
    c := message[i]
    if c >= 'a' && c <= 'z' {              2
        c = c + 13
        if c > 'z' {
            c = c - 26
        }
    }
    fmt.Printf("%c", c)
}

  • 1 Iterates through each ASCII character
  • 2 Leaves spaces and punctuation as they are

Note that the ROT13 implementation in the previous listing is only intended for ASCII characters (bytes). It will get confused by a message written in Spanish or Russian. The next section looks at a solution for this issue.
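
The earlier claim that ROT13 ciphering and deciphering are the same operation can be checked by running the cipher twice. The sketch below wraps the loop from listing 9.7 in a function so it can be called repeatedly. The name rot13 is chosen here, building the result with += is just one simple approach, and like the listing it only handles ASCII.

```go
package main

import "fmt"

// rot13 returns message with every lowercase ASCII letter rotated by 13.
// Other bytes are copied as they are. ASCII text only.
func rot13(message string) string {
    result := ""
    for i := 0; i < len(message); i++ {
        c := message[i]
        if c >= 'a' && c <= 'z' {
            c = c + 13
            if c > 'z' {
                c = c - 26
            }
        }
        result += string(c)
    }
    return result
}

func main() {
    once := rot13("hello")
    twice := rot13(once)
    fmt.Println(once, twice) // Prints uryyb hello
}
```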

Quick check 9.5

1

What does the built-in len function do when passed a string?

2

Type listing 9.7 into the Go Playground. What does the message say?

QC 9.5 answer

1

The len function returns the length of a string in bytes.

2

hi international space station

 

9.5. Decoding strings into runes

Strings in Go are encoded with UTF-8, one of several encodings for Unicode code points. UTF-8 is an efficient variable length encoding where a single code point may use 8, 16, 24, or 32 bits (1 to 4 bytes). By using a variable length encoding, UTF-8 makes the transition from ASCII straightforward, because ASCII characters are identical to their UTF-8 encoded counterparts.
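
To see the variable length encoding in action, the RuneLen function from the unicode/utf8 package reports how many bytes a rune needs when encoded as UTF-8. The sketch below multiplies that by 8 to get bits; encodedBits is a hypothetical helper name, and the four sample characters are arbitrary picks.

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

// encodedBits returns how many bits UTF-8 uses to encode r.
func encodedBits(r rune) int {
    return utf8.RuneLen(r) * 8
}

func main() {
    fmt.Printf("%c uses %v bits\n", 'A', encodedBits('A'))   // 8 bits
    fmt.Printf("%c uses %v bits\n", 'π', encodedBits('π'))   // 16 bits
    fmt.Printf("%c uses %v bits\n", '€', encodedBits('€'))   // 24 bits
    fmt.Printf("%c uses %v bits\n", '😃', encodedBits('😃')) // 32 bits
}
```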

Note

UTF-8 is the dominant character encoding for the World Wide Web. It was invented in 1992 by Ken Thompson and Rob Pike, two of the designers of Go.

The ROT13 program in listing 9.7 accessed the individual bytes (8-bit) of the message string without accounting for characters that are multiple bytes long (16, 24, or 32 bits). This is why it works fine for English characters (ASCII), but produces garbled results for Russian and Spanish. You can do better, amigo.

The first step to supporting other languages is to decode characters to the rune type before manipulating them. Fortunately, Go has functions and language features for decoding UTF-8 encoded strings.

The utf8 package provides functions to determine the length of a string in runes rather than bytes and to decode the first character of a string. The DecodeRuneInString function returns the first character and the number of bytes the character consumed, as shown in listing 9.8.

Note

Unlike many programming languages, functions in Go can return multiple values. Multiple return values are discussed in lesson 12.

Listing 9.8. The utf8 package: spanish.go
package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    question := "¿Cómo estás?"

    fmt.Println(len(question), "bytes")                          1
    fmt.Println(utf8.RuneCountInString(question), "runes")       2

    c, size := utf8.DecodeRuneInString(question)
    fmt.Printf("First rune: %c %v bytes", c, size)               3
}

  • 1 Prints 15 bytes
  • 2 Prints 12 runes
  • 3 Prints First rune: ¿ 2 bytes
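
Calling DecodeRuneInString repeatedly, and skipping ahead by the size it reports each time, walks through a whole string one rune at a time. The sketch below counts runes that way. It uses slicing (s[size:]), which this lesson hasn't introduced, and countRunes is a hypothetical name; in practice utf8.RuneCountInString already does this job.

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

// countRunes counts the runes in s by decoding one rune at a time and
// then dropping the bytes that rune consumed.
func countRunes(s string) int {
    count := 0
    for len(s) > 0 {
        _, size := utf8.DecodeRuneInString(s)
        s = s[size:] // continue with the rest of the string
        count++
    }
    return count
}

func main() {
    question := "¿Cómo estás?"
    fmt.Println(countRunes(question), utf8.RuneCountInString(question))
}
```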

The Go language provides the range keyword to iterate over a variety of collections (covered in unit 4). It can also decode UTF-8 encoded strings, as shown in the following listing.

Listing 9.9. Decoding runes: spanish-range.go
question := "¿Cómo estás?"

for i, c := range question {
    fmt.Printf("%v %c\n", i, c)
}

On each iteration, the variables i and c are assigned the byte index into the string and the code point (rune) that starts at that position.
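
The index that range provides is the byte position where each rune starts, so it skips ahead after a multi-byte character. A minimal sketch, using a made-up helper named indexOfRune that returns the index at which a rune first appears (or -1 when it is absent):

```go
package main

import "fmt"

// indexOfRune returns the byte index where target first appears in s,
// or -1 if s doesn't contain it.
func indexOfRune(s string, target rune) int {
    for i, c := range s {
        if c == target {
            return i
        }
    }
    return -1
}

func main() {
    question := "¿Cómo estás?"
    // ¿ occupies bytes 0 and 1, so C starts at index 2.
    fmt.Println(indexOfRune(question, 'C'))
    // ó occupies bytes 3 and 4, so m starts at index 5.
    fmt.Println(indexOfRune(question, 'm'))
}
```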

If you don’t need the index, the blank identifier (an underscore) allows you to ignore it:

for _, c := range question {
    fmt.Printf("%c ", c)            1
}

  • 1 Prints ¿ C ó m o   e s t á s ?

Quick check 9.6

1

How many runes are in the English alphabet "abcdefghijklmnopqrstuvwxyz"? How many bytes?

2

How many bytes are in the rune '¿'?

QC 9.6 answer

1

There are 26 runes and 26 bytes in the English alphabet.

2

There are 2 bytes in the rune '¿'.

 

Summary

  • Escape sequences like \n are ignored in raw string literals (`).
  • Strings are immutable. Individual characters can be accessed but not altered.
  • Strings use a variable length encoding called UTF-8, where each character consumes 1–4 bytes.
  • A byte is an alias for the uint8 type, and rune is an alias for the int32 type.
  • The range keyword can decode a UTF-8 encoded string into runes.

Let’s see if you got this...

Experiment: caesar.go

Decipher the quote from Julius Caesar:

L fdph, L vdz, L frqtxhuhg.

Julius Caesar

Your program will need to shift uppercase and lowercase letters by –3. Remember that 'a' becomes 'x', 'b' becomes 'y', and 'c' becomes 'z', and likewise for uppercase letters.

Experiment: international.go

Cipher the Spanish message “Hola Estación Espacial Internacional” with ROT13. Modify listing 9.7 to use the range keyword. Now when you use ROT13 on Spanish text, characters with accents are preserved.
