No doubt you know that the world is a very small place and that software that recognizes languages other than United States English is important. Here’s the problem: if you think you know what a character is in a language other than English, you are probably mistaken. Most character set encodings, including Unicode, are still evolving. This inherent fuzziness can threaten software security. The rest of this short chapter, based on information learned during Microsoft’s Windows Security Push, describes some of the threats related to internationalization, suggests ways to avoid them, and touches on some other general security best practices.
You’ll often see the term "I18N" when working with foreign language software. I18N means "internationalization" (in which the letter I is followed by 18 characters and then the letter N).
This chapter does not cover general globalization best practices except as they affect security. It’s also assumed that you have read Chapter 10 and Chapter 11. Once you’ve read this chapter, I hope you’ll quickly realize that someone in your group should own the security implications of I18N issues in your applications. Now I’ll explain why.
You should follow two security rules when building applications designed for international audiences:
Use Unicode.
Don’t convert between Unicode and other code pages/character sets.
If you follow these two rules, you’ll run into few I18N-related security issues; in fact, you can jump to the next chapter if these two rules hold true for your application! The rest of you need to know a few things.
A character set encoding maps some set of characters (A, ß, Æ, and so on) to a set of binary values (usually from one to four bytes) called code values or code points. Hundreds of such encodings are in use today, and Microsoft Windows supports several dozen. Every character set encoding, including Unicode, has security issues, mainly due to character conversion. However, Unicode is the only worldwide standard and security experts have given it the most thorough examination. The bulk of Windows and Microsoft Office data is stored in Unicode, and your code will have fewer conversion issues—and potentially fewer security issues—if you also use Unicode. The Microsoft .NET common language runtime and the .NET Framework use only Unicode.
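To make the mapping concrete, here is a minimal Python sketch that encodes the three sample characters under a handful of encodings and then decodes a single byte under two different code pages. The choice of encodings (cp1252, cp437, UTF-8, UTF-16) and the helper names `encode_table` and `decode_under` are assumptions made for illustration; nothing here is specific to the Windows APIs.

```python
# Illustrative sketch: the same character maps to different byte values
# depending on the character set encoding in use.
SAMPLE_CHARS = ["A", "ß", "Æ"]
ENCODINGS = ["cp1252", "cp437", "utf-8", "utf-16-le"]


def encode_table(chars, encodings):
    """Return {char: {encoding: bytes}} for every char/encoding pair."""
    return {ch: {enc: ch.encode(enc) for enc in encodings} for ch in chars}


def decode_under(raw, encodings):
    """Decode the same raw bytes under several encodings (hypothetical helper)."""
    return {enc: raw.decode(enc) for enc in encodings}


if __name__ == "__main__":
    for ch, row in encode_table(SAMPLE_CHARS, ENCODINGS).items():
        cells = ", ".join(f"{enc}={data.hex()}" for enc, data in row.items())
        print(f"U+{ord(ch):04X} {ch}: {cells}")

    # One byte, two code pages, two different characters.
    print(decode_under(b"\xe1", ["cp1252", "cp437"]))
```

The last call is the interesting one from a security standpoint: the byte 0xE1 is "á" in code page 1252 but "ß" in code page 437, so any check performed before a conversion may not hold after it.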
There are three primary binary representations of the Unicode encoding: UTF-8, UTF-16, and UTF-32. Although all three forms represent exactly the same character repertoire, UTF-16 is the primary form supported by Windows and .NET. You will avoid one class of security issue if you use UTF-16. UTF-8 is popular for Internet protocols and on other platforms. Windows National Language Support (NLS) provides two functions, MultiByteToWideChar and WideCharToMultiByte, for converting between UTF-8 and UTF-16. There is little reason to use UTF-32.
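The following Python sketch shows the three forms side by side and a strict UTF-8 to UTF-16 conversion. It uses Python's built-in codecs as a stand-in for the Windows NLS functions, and the helper names `utf_forms` and `utf8_to_utf16` are hypothetical. The rejected input is an overlong (non-shortest-form) UTF-8 sequence; treating that as an example of a UTF-8-specific problem is an assumption on my part, since the text above does not name the class of issue it has in mind.

```python
# Illustrative sketch: one string, three Unicode encoding forms.
def utf_forms(text):
    """Return the UTF-8, UTF-16LE, and UTF-32LE bytes for the same text."""
    return {
        "utf-8": text.encode("utf-8"),
        "utf-16-le": text.encode("utf-16-le"),
        "utf-32-le": text.encode("utf-32-le"),
    }


def utf8_to_utf16(data):
    """Strictly decode UTF-8, then re-encode as UTF-16LE.

    Raises UnicodeDecodeError on malformed input instead of guessing.
    """
    return data.decode("utf-8", errors="strict").encode("utf-16-le")


if __name__ == "__main__":
    # "A" is ASCII, the euro sign is in the Basic Multilingual Plane, and the
    # emoji (U+1F600) lies outside it, so UTF-16 needs a surrogate pair.
    for ch in ["A", "\u20ac", "\U0001F600"]:
        sizes = {name: len(data) for name, data in utf_forms(ch).items()}
        print(f"U+{ord(ch):04X}: {sizes}")

    # 0xC0 0xAF is an overlong encoding of "/"; a strict decoder refuses it.
    try:
        utf8_to_utf16(b"\xc0\xaf")
    except UnicodeDecodeError as err:
        print("rejected:", err.reason)
```

Whatever conversion API you use, the design point is the same: fail on malformed input rather than silently mapping it to something plausible.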