Chapter 2: Dealing with Binary and Random Data

When building solutions that leverage cryptography, we're almost always faced with two issues.

The first is managing binary data, that is, sequences of bytes, including values that can't be represented as printable text. Most people have had the experience of opening a binary file (such as an image or an executable application) in a text editor such as Notepad and being presented with a sequence of random, garbled symbols that can't be read, let alone edited.

In cryptography, encrypted messages, hashes, keys, and sometimes even decrypted messages consist of arbitrary bytes, so they will almost always contain non-printable, binary data. This introduces challenges for developers, as binary data often requires special handling to be visualized (such as while debugging our applications), stored, and transferred.

In this chapter, we'll explore how we can use encodings such as base64 and hex (short for hexadecimal) with Node.js to make binary data representable as text so that it can be more easily dealt with.

The second issue we'll explore is generating random data. Random byte sequences are frequently used in cryptography, as we'll see throughout the book, including as encryption keys, seeds, and salts. Understanding how to generate them with Node.js is a very important part of building applications that leverage cryptography safely.

In this chapter, we'll explore:

  • How to encode and represent binary data as hex or base64 with Node.js. We'll also include a primer on character encodings such as ASCII and UTF-8.
  • How to generate cryptographically safe random byte sequences in Node.js.

Encoding and representing binary data

When using cryptography, including hashes, signatures, and encrypted messages, we commonly have to deal with binary data. As every developer who has worked with binary data or files has experienced, such data cannot easily be printed on screen or copied and pasted into other fields, so it's common to change its representation by encoding it into strings that only use human-readable characters.

Figure 2.1 – Looking at binary data on a terminal. Note how the terminal is trying to interpret every byte sequence as UTF-8 and frequently encounters invalid ones (replaced with the "�" symbol)

In Node.js, binary data is generally stored in Buffer objects and can be represented in multiple ways, including some encodings that are guaranteed to be human-readable. Throughout this book, we'll frequently use two ways of representing binary data: hex (short for "hexadecimal", which uses base16) and base64 encodings.

As we'll explain in a bit, encoded strings are always longer than the original binary data, so they're generally not recommended for storing on a persistent storage medium (for example, in a database or a file on disk). However, unlike binary blobs, they have the advantage of being just sequences of printable ASCII characters. Representations such as hex or base64 make it possible for humans to analyze binary data much more easily, and they are simpler to transfer: they can be copied between different fields or systems more easily than binary data, including through copying and pasting.

A brief word on character encodings and why we encode binary data

Before we dive into ways of representing binary data, it helps to have a refresher on character encodings, including ASCII and UTF-8, as we'll be dealing with them often in this book.

Binary data is made of bytes, which, in modern computer systems, are sequences of 8 bits (each one either 0 or 1). This means that each byte can represent a number from 0 to 255 (2⁸ − 1).

For computers to be able to display text, we have created encodings, which are conventions for converting between numbers and characters so that each character maps to a decimal number, and vice-versa.

One of the first such conventions that is still relevant today is ASCII encoding (American Standard Code for Information Interchange), which contains 128 symbols represented by a 7-bit number, 0 to 127 (this was before 8-bit systems became the standard they are today). The ASCII table includes the Latin alphabet with both lowercase and uppercase letters, numbers, some basic punctuation symbols, and other non-printable characters (things such as newlines, the tab character, and other control characters). For example, in the ASCII table, the number 71 would map to the character G (uppercase Latin g), and 103 would be g (lowercase Latin g).
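You can quickly verify these mappings in Node.js with JavaScript's built-in string methods:

console.log(String.fromCharCode(71))  // -> 'G'
console.log(String.fromCharCode(103)) // -> 'g'
console.log('G'.charCodeAt(0))        // -> 71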

ASCII Table

You can find a full list of the characters in the ASCII table online, for example, at https://www.asciitable.xyz/.

Throughout the years, other character encodings have been created, some of which found more adoption than others. However, one important development was the creation of the Unicode standard, maintained by the Unicode Consortium, whose mission is to create a comprehensive encoding system to represent every character used by every alphabet around the world, and more, including symbols, ligatures, and emojis. The Unicode standard contains a growing list of symbols. As of the time of writing, the current version is Unicode 14.0 and it contains almost 145,000 characters.

In addition to defining the list of symbols, Unicode also contains a few different character encodings. Among those, UTF-8 is by far the most frequently used encoding on the Internet. It can represent each symbol of the Unicode standard by using between 1 and 4 bytes per character.

With UTF-8, the first 128 symbols are mapped exactly as in the ASCII table, so the ASCII table is essentially a "subset" of UTF-8 now. For symbols that are not defined in the ASCII table, including all non-Latin scripts, UTF-8 requires between 2 and 4 total bytes to represent them.
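As a quick illustration, the Buffer.byteLength method from the Node.js standard library shows how the number of bytes grows for non-ASCII symbols:

// ASCII characters need 1 byte in UTF-8; other symbols need 2 to 4
console.log(Buffer.byteLength('A', 'utf8'))  // -> 1
console.log(Buffer.byteLength('è', 'utf8'))  // -> 2
console.log(Buffer.byteLength('€', 'utf8'))  // -> 3
console.log(Buffer.byteLength('😀', 'utf8')) // -> 4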

Use of UTF-8 in Code Samples

Throughout the code samples in this book, when dealing with text, we'll always assume that it is encoded as UTF-8: you'll often see us using methods that convert a Node.js Buffer object into a text string by requesting its UTF-8 representation. However, if your source data uses a different encoding (such as UTF-16 or another one), you will be able to modify your code to support that.
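For instance, here's a minimal sketch of that change for UTF-16 (little-endian) text, using the 'utf16le' encoding supported by Node.js buffers:

// Decoding UTF-16 (little-endian) text instead of UTF-8
const buf = Buffer.from('Hello world!', 'utf16le')
console.log(buf.length)              // -> 24 (2 bytes per character here)
console.log(buf.toString('utf16le')) // -> 'Hello world!'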

The problem with binary data, such as encrypted messages, is that it can contain any sequence of bytes, each one from 0 to 255, which, when interpreted as ASCII-encoded or UTF-8-encoded text, will almost always contain a mix of unprintable characters and byte sequences that are invalid per the UTF-8 standard. Thus, to be able to look at this data conveniently, for ease of transmission or debugging, we need to represent it in alternate forms, such as hex or base64.
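You can see this in action with a short snippet that builds a buffer containing the byte 0xB4, a sequence that is invalid in UTF-8 (we'll encounter this byte again later in this chapter):

// 0xB4 is a lone continuation byte, which is invalid UTF-8, so
// decoding replaces it with the '�' symbol
const buf = Buffer.from([0x48, 0x65, 0xB4])
console.log(buf.toString('utf8'))
// -> 'He�'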

Buffers in Node.js

In Node.js, binary data is commonly stored in Buffer objects. Many methods of the standard library that deal with binary data, including those in the crypto module that we'll be studying in this book, leverage buffers extensively.

There are multiple ways to create a Buffer object in Node.js. The two most important ones for now are the following:

  • Buffer.alloc(size), which creates an empty buffer of size bytes, filled with zeros by default:

    const buf = Buffer.alloc(3)

    console.log(buf)

    // -> <Buffer 00 00 00>

  • Buffer.from(*) creates a Buffer object from a variety of sources, including arrays and ArrayBuffer objects, other Buffer objects, and, most importantly for us, strings.

    When creating a buffer from a string, you can specify two arguments: Buffer.from(string, encoding), where encoding is optional and defaults to 'utf8' for UTF-8-encoded text (we'll see more encodings in the next pages of this chapter).

    For example, these two statements instantiate buffers with identical content:

    const buf1 = Buffer.from('Hello world!', 'utf8')

    const buf2 = Buffer.from('Hello world!')

Once created, Buffer objects contain a variety of properties and methods, some of which we'll encounter throughout this book. For now, it's worth highlighting two of them:

  • buf.toString(encoding) is a method that returns the string representation of the buffer in the specified encoding (which defaults to 'utf8' if not set); see this, for example:

    const buf = Buffer.from('Hello world!', 'utf8')

    console.log(buf.toString('utf8'))

    // -> 'Hello world!'

    In the preceding code, buf.toString('utf8') would have been identical to buf.toString().

  • buf.length is a property that contains the length of the buffer, in bytes; see this, for example:

    const buf = Buffer.from('Hello world!', 'utf8')

    console.log(buf.length)

    // -> 12

    Multi-Byte Characters in Unicode

    Note that as per our discussion regarding encodings, some strings encoded as UTF-8 or UTF-16 may contain multi-byte characters, so their byte length can be different from their string length (or character count). For example, the letter è (Latin e with a grave) is displayed as a single character but uses two bytes when encoded as UTF-8, so:

    'è'.length returns 1 because it counts the number of characters (more precisely, UTF-16 code units).

    (Buffer.from('è', 'utf8')).length returns 2 instead because the letter è in UTF-8 is encoded using two bytes.

    If you're interested in learning more about character encodings, multi-byte characters, and the related topic of Unicode string normalization, I recommend the article What every JavaScript developer should know about Unicode, by Dmitri Pavlutin: https://bit.ly/crypto-unicode.
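    To give a taste of that topic, the same visible character can be stored as different code point sequences, which also changes its byte length:

    // 'è' as a single code point (U+00E8) vs. 'e' plus a combining grave (U+0300)
    const composed = '\u00e8'
    const decomposed = 'e\u0300'
    console.log(composed === decomposed)                  // -> false
    console.log(composed === decomposed.normalize('NFC')) // -> true
    console.log(Buffer.byteLength(decomposed, 'utf8'))    // -> 3 (1 + 2 bytes)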

    Buffer in Node.js

    Full documentation for the Buffer APIs in Node.js can be found at https://nodejs.org/api/buffer.html.

Hex encoding

The first common way of representing binary data is to encode it in hexadecimal (hex for short) format.

With hex encoding, we split each byte into two groups of 4 bits each, each able to represent 16 combinations, with numbers from 0 to 15 (2⁴ − 1). Then we use a simplified encoding in which the first 10 combinations are represented by the numbers 0-9, and the remaining 6 use the letters a-f (case-insensitive).

For example, the number 180 (10110100 in binary) is outside of the bounds of the ASCII table (which only defines characters 0-127) and is not a valid sequence in UTF-8, so it can't be represented with either encoding. When encoding it as hex, we split it into two sequences of 4 bits each: 1011 (11 in decimal, which maps to the letter B) and 0100 (4 in decimal). In hex, then, the representation of 180 is B4.

As you can see, when using hex encoding, the length of data in bytes doubles: while the number 180 can be stored in a single byte (which is not a printable character), writing B4 in a file requires storing 2 characters, so 2 bytes.
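You can verify this conversion with JavaScript's built-in number methods:

console.log((180).toString(16)) // -> 'b4'
console.log(parseInt('b4', 16)) // -> 180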

You can encode any arbitrary byte sequence using hex. For example, the string Hello world! is represented as 48 65 6C 6C 6F 20 77 6F 72 6C 64 21 (for ease of reading, the convention is to use spaces to separate each byte, that is, every 2 hex characters; however, that's not mandatory). Multiple tools allow you to convert from ASCII or UTF-8 text to hex, such as https://bit.ly/crypto-hex. These tools can be used to perform the opposite operation too, as long as the data contains printable characters when decoded: as an example, try converting 48 65 B4 6C 6F (note the B4 byte, which we observed previously can't be represented as text) into UTF-8, and you'll see that your computer will recognize an invalid UTF-8 sequence and display a "�" character to alert you to an error.

Figure 2.2 – Certain byte sequences such as B4 cannot be represented as UTF-8 text

With Node.js, you can create Buffer objects directly from hex-encoded strings, using the Buffer.from method with 'hex' as encoding; the hex-encoded input is case-insensitive; see this, for example (note that spaces are not allowed between octets in the hex-encoded string for this method):

const buf = Buffer.from('48656C6C6F20776F726C6421', 'hex')

console.log(buf.toString('utf8'))

// -> 'Hello world!'

Likewise, you can use the buf.toString('hex') method to get the hex-encoded representation of any Buffer object, regardless of how it was created or whether it contains binary or textual data; see the following, for example:

const buf = Buffer.from('Hi, Buffer!', 'utf8')

console.log(buf.toString('hex'))

// -> '48692c2042756666657221'

Many programming languages, JavaScript included, allow you to write numbers in your code directly using their hexadecimal notation by adding the 0x prefix. For example, the following expression in JavaScript prints true:

console.log(0xB4 === 180) // -> true

While hex encoding is highly inefficient in terms of storage requirements, as it doubles the size of our data, it is often used during development as it has three interesting properties:

  • The length of the original data is always exactly half the length of the hex-encoded string.
  • Each byte is represented by exactly two characters, and it's possible to convert them to decimal with a quick multiplication and addition: multiply the first character by 16, and then add the second (for example, for converting C1 to decimal, remember that C maps to 12, so the result is 12 * 16 + 1 = 193; see the quick check after this list).
  • If the data you've encoded is plain text, each sequence of two hex-encoded characters can map directly to a symbol in the ASCII table. For example, 41 in hex (65 in decimal) corresponds to the letter A (uppercase Latin a).
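As a quick check of the second property, using the byte C1 from the example above:

// 'c' maps to 12, so C1 in decimal is 12 * 16 + 1
console.log(12 * 16 + 1)        // -> 193
console.log(parseInt('c1', 16)) // -> 193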

While these three things might not seem much, they can come in very handy when debugging code that uses binary data!

Base64

The second common way of representing binary data is base64. As the name suggests, this uses an encoding with 64 different symbols, each one representing 6 bits of the underlying data.

Just like hex encoding splits the underlying data into groups of 4 bits and then maps them to a small subset of symbols (16 in total), base64 uses groups of 6 bits and a set of 64 symbols. There are multiple character sets and specifications for base64 encoding, but the most common ones are as follows:

  • The "Base64 standard encoding," as defined by RFC 4648 Section 4, uses the following 64 symbols:

    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/

  • The "Base64 URL encoding," as defined by RFC 4648 Section 5, uses the following 64 symbols:

    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_

The two encodings are very similar (and, unlike hex, these are case-sensitive), but they differ in the symbols used to encode the decimal numbers 62 and 63, respectively, +/ for "standard encoding" and -_ for "URL encoding." In fact, many web applications prefer to use the second format because the characters + and / are not URL-safe, so they have to be encoded when used in a URL with the usual percentage encoding, becoming %2B and %2F, respectively. Instead, - and _ are URL-safe and do not require encoding when used in URLs.
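You can see the difference with the standard encodeURIComponent function; note that the padding character = (described next) is not URL-safe either:

console.log(encodeURIComponent('hCS/+A==')) // -> 'hCS%2F%2BA%3D%3D'
console.log(encodeURIComponent('hCS_-A'))   // -> 'hCS_-A' (unchanged)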

Additionally, sometimes up to 2 padding characters, =, are added to make the length of base64-encoded strings an exact multiple of 4. Depending on the variant of base64 used and the parser library, padding may be required (base64-encoded strings that lack the required padding may not be parsed correctly) or optional (parsers accept strings with or without padding).

As you can already understand, base64 encoding is a bit more complex than the hex one, and we won't get into the details of the specifications or the algorithms for encoding and decoding strings with base64.

The good news is that Node.js supports base64 encoding natively in the Buffer APIs, with 'base64' and 'base64url' available as values for the encoding arguments in the methods we saw previously (note that 'base64url' was added in Node.js 15.7); see this, for example:

const buf1 = Buffer.from('SGk=', 'base64')

console.log(buf1.toString())

// -> 'Hi'

const buf2 = Buffer.from('8424bff8', 'hex')

console.log(buf2.toString('base64'))

// -> 'hCS/+A=='

console.log(buf2.toString('base64url'))

// -> 'hCS_-A'

When using Node.js, note the following:

  • All methods that parse a base64-encoded string (such as Buffer.from) accept base64-encoded data in any form, regardless of whether you're specifying 'base64' or 'base64url' as encoding, with optional padding. This means that the first line of the preceding code sample could have accepted input encoded with base64 in both "standard encoding" and "URL encoding," and using 'base64' or 'base64url' would not make a difference in either case. Additionally, with these methods, padding is always optional.
  • Methods that format a string, encoding it with base64, use "standard encoding" when passing 'base64' as the encoding format (and include padding if necessary), and "URL encoding" when passing 'base64url' (never using padding), as you can see from the preceding code sample in the calls to buf.toString(encoding).

While Node.js is fairly flexible with accepting base64-encoded input that uses either format, other applications, frameworks, or programming languages might not be. Especially when you're passing base64-encoded data between different applications or systems, be mindful to use the correct format!

Base64 URL Encoding in Older Versions of Node.js

As mentioned, 'base64url' was implemented in Node.js 15.7. This has not changed the behavior of methods such as Buffer.from, which were accepting base64-encoded strings in either format before. However, methods such as buf.toString() only supported encoding to base64 in "standard encoding" format.

With Node.js < 15.7, encoding data from a Buffer object to URL-safe base64 required a few extra steps, such as this code, which might not be the prettiest but does the job:

buf.toString('base64')

    .replace(/=/g, '')

    .replace(/\+/g, '-')

    .replace(/\//g, '_')

Despite being quite a bit more complex, and trickier when mixing encoding standards and implementations, base64 is very useful because it's a more storage-efficient encoding than hex, yet it still relies entirely on printable characters in the ASCII table: encoding data using base64 generates strings that are just around 33% larger than the original binary data. Base64 is also widely supported, and it's used by data exchange formats such as JSON and XML when embedding binary data.
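As a quick comparison using the Buffer methods we saw earlier:

const buf = Buffer.from('Hello world!', 'utf8')
console.log(buf.length)                    // -> 12 (bytes of binary data)
console.log(buf.toString('hex').length)    // -> 24 (2x the original size)
console.log(buf.toString('base64').length) // -> 16 (~1.33x the original size)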

Now that we're comfortable with dealing with binary data, let's dive into the first situation (of the many in this book) in which we'll have to manage non-text sequences: generating random bytes.

Generating cryptographically secure random byte sequences

When building applications that leverage cryptography, it's very common to have to generate random byte sequences, and we'll encounter that in every chapter of this book. For example, we'll use random byte sequences as encryption keys (as in Chapter 4, Symmetric Encryption in Node.js) and as salt for hashes (Chapter 3, File and Password Hashing with Node.js).

Thankfully, Node.js already includes a function to generate random data in the crypto module: randomBytes(size, callback).

The importance of randomness

In this book, just as in real-life applications, we're going to use random byte sequences for highly sensitive operations, such as generating encryption keys. Because of that, it's of the utmost importance to be able to have something as close as possible to true randomness. That is: given a number returned by our random number generator, an attacker should not be able to guess the next number.

Computers are deterministic machines, so, by definition, generating random numbers is a challenge for them. True Random Number Generator (TRNG) devices exist and are generally based on the observation of quantum effects; however, these are uncommon.

Instead, for most practical applications, we rely on Cryptographically Secure Pseudo-Random Number Generators (CSPRNGs), which use various sources of entropy (or "noise") to generate unpredictable numbers. These systems are generally built into the kernel of the operating systems, such as /dev/random on Linux, which is continuously seeded by a variety of observations that are "random" and difficult for an attacker to predict (examples include the average time between key presses on a keyboard, the timing of kernel interrupts, and others).

In Node.js, crypto.randomBytes returns random byte sequences using the operating system's CSPRNG, and it's considered safe for cryptographic usage.

Math.random() and Other Non-Cryptographically Safe PRNGs

Functions such as the JavaScript function Math.random() (which is available in Node.js too) are not cryptographically safe, and should not be used for generating random numbers or byte sequences for use in cryptographic operations.

In fact, Math.random() is seeded only once at the beginning of the application or script (at least, this is the case for the V8 runtime as of the time of writing, in early 2021), so an attacker who managed to determine the initial seed would then be able to regenerate the same sequence of random numbers.

Within Node.js, you can verify that this is the case by invoking the node binary with the --random_seed flag set to a number of your choosing and then calling Math.random(). You'll see that, as long as you pass the same number as the value for --random_seed, Math.random() will return the same sequence of "random" numbers even on different invocations of Node.js.
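For example, in a Unix-like shell, something like this should demonstrate the behavior (the exact number printed will vary, but it will be identical across runs with the same seed):

node --random_seed=42 -e 'console.log(Math.random())'
# Running this command twice prints the same "random" number both times;
# changing the seed (or omitting the flag) produces different values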

LavaRand

For an interesting discussion on TRNGs and CSPRNGs, and for a curious and fun approach to generating random entropy ("noise") to seed CSPRNGs, check out this blog post by Cloudflare explaining how they're using a wall of lava lamps in their headquarters to get safer random numbers: https://bit.ly/crypto-lavarand.

Using crypto.randomBytes

As mentioned, the randomBytes(size, callback) function from the crypto package is the recommended way to generate a random sequence of bytes of the length size that are safe for cryptographic usage.

As you can see from its signature, the function is asynchronous, and it passes the result to a legacy Node.js "error-first callback" (the callback argument can be omitted, but then the function performs synchronous I/O that blocks the event loop, which is not recommended).

For example, to generate a random sequence of 32 bytes (256-bit), the traditional way of invoking the function asynchronously is as follows:

2.1: Using crypto.randomBytes

const crypto = require('crypto')

crypto.randomBytes(32, (err, buf) => {

    if (err) {

        throw err

    }

    console.log(buf.toString('hex'))

})

Executing the preceding code will print in the terminal a random, 64-character-long hex string (the hex encoding of 32 bytes).

As those developers who have been writing code for Node.js for longer know, the "error-first callback" style is outdated because it produces code that is harder to read and prone to "callback hell", with many nested functions. Because of that, you'll frequently see us "modernizing" these older functions in this book, converting them to methods that return a Promise object and that can be used with the modern async/await pattern. To do that, we'll be using the promisify function from the util module.

For example, we can rewrite the preceding code as follows:

2.2: Using crypto.randomBytes with async/await

const crypto = require('crypto')

const {promisify} = require('util')

const randomBytes = promisify(crypto.randomBytes)

;(async function() {

    const buf = await randomBytes(32)

    console.log(buf.toString('hex'))

})()

Just like the previous code snippet, this will print a random hex string of 64 characters, but it is much more readable (especially when included in complex applications).

The promisify method allows us to convert functions that use the legacy "error-first callback" style into async ones (technically, functions that return Promise objects), so we can await on them.

As for ;(async function() { … })(), that is an asynchronous Immediately-Invoked Function Expression (IIFE). It is necessary because Node.js does not allow the use of the await keyword outside of a function defined with the async modifier, so we need to define an anonymous async function and invoke it right away.

Top-Level Await

Starting with Node.js 14.8.0, support for the so-called "top-level await" (ability to use the await keyword at the top level, outside of a function) is available, but only in code that uses JavaScript modules, hence, with files using the .mjs extension. Because of that, we'll continue using the async IIFE pattern in this book.
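For reference, here is a minimal sketch of what listing 2.2 could look like as an ES module with top-level await (in a hypothetical file named random.mjs, assuming Node.js 14.8.0 or newer):

// random.mjs – run with: node random.mjs
import crypto from 'crypto'
import {promisify} from 'util'

const randomBytes = promisify(crypto.randomBytes)

// Top-level await: no async IIFE wrapper needed in an ES module
const buf = await randomBytes(32)
console.log(buf.toString('hex'))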

Summary

In this chapter, we learned about two important topics that are key to using cryptographic functions. First, we learned about the most common encodings for textual strings and binary data that we'll use throughout the book, and we looked at the first code samples using the Buffer module in Node.js. We then explored the importance of random numbers in cryptography and learned how to generate cryptographically secure random byte sequences with Node.js.

In the next chapter, we'll start with the first class of cryptographic operations: hashing. We'll learn about how hashing differs from encryption, what it's used for, and how to hash strings, files, and stream data with Node.js. We'll also look at a few different hashing algorithms and how to pick the right one for each scenario.
