Chapter 16

Special Instructions

Abstract

To close the book, this chapter provides a brief description of some of the specialized instructions available on the Intel® Architecture. For each instruction covered, links are provided to white paper resources that describe usages for that instruction.

Keywords

AESNI

AES-NI

PCLMUL

CRC32

SSE4.2

The x87 floating-point instructions were added to the x86 instruction set in order to alleviate some common software problems. Over the years, the Intel® Architecture has collected quite a few specialized instructions that are designed to simplify the lives of developers and improve performance. One of the associated challenges is that as the number of instructions increases, it can be hard for someone not familiar with the architecture to find these helpful instructions. The author thought it fitting to close the book by highlighting some of these lesser-known instructions. Additional information can be found in the white papers listed at the end of each section and within the Intel® Software Developer Manual and Optimization Reference.

16.1 Intel® Advanced Encryption Standard New Instructions (AES-NI)

The Intel® AES New Instructions (AES-NI) extension adds seven new instructions designed to accelerate AES encryption and decryption.

Typically, one of the largest challenges to the wide adoption of various security measures is performance. Despite the value users place on security, their behavior is strongly influenced by what they perceive to improve their performance, and what they perceive to impede their performance. As a result, providing high performance encryption and decryption is a prerequisite for enabling ubiquitous encryption.

Before the introduction of these dedicated AES instructions, the standard technique for improving AES performance was to use a lookup table. This approach has been shown to be susceptible in practice, not just in theory, to side-channel attacks. Rather than exploiting a weakness in the cryptographic algorithm, side-channel attacks focus on data accidentally leaked by the implementation. In the case of a lookup table AES implementation, the side-channel attack uses cache timing to sample which cache lines are accessed by the AES implementation. Eventually, enough samples are collected to reveal the key used in the encryption or decryption process. As a result, not only do the AES-NI instructions improve performance, but they also provide additional security against these types of attacks.

16.1.1 Further Reading

1. http://www.intel.com/content/www/us/en/architecture-and-technology/advanced-encryption-standard--aes-/data-protection-aes-general-technology.html

2. http://www.intel.com/content/dam/doc/white-paper/enterprise-security-aes-ni-white-paper.pdf

16.2 PCLMUL - Packed Carry-Less Multiplication

Introduced in the AES-NI extensions first available in the Intel® Westmere processor generation, the PCLMUL instruction performs carry-less multiplication of two 64-bit integers stored in SIMD registers, storing their product as a 128-bit integer. As the name implies, carry-less multiplication performs integer multiplication but discards any carries that would normally propagate to the next bit position. Because carry-less multiplication is one of the steps for performing multiplication in a Galois field, this instruction is capable of accelerating many different operations. The PCLMUL instruction was included in the AES-NI extension in order to accelerate the Galois Counter Mode (GCM) of AES. Another common usage of PCLMUL is to accelerate the CRC calculation for arbitrary polynomials.
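To make the idea of discarding carries concrete, the following is a minimal sketch of a bit-level software model of carry-less multiplication. It is written in Python purely for illustration and is not how the instruction would be used in practice; the function name clmul64 is hypothetical.

```python
def clmul64(a: int, b: int) -> int:
    """Carry-less product of two 64-bit values, returned as a 128-bit integer."""
    assert 0 <= a < 1 << 64 and 0 <= b < 1 << 64
    product = 0
    for bit in range(64):
        if (b >> bit) & 1:
            # XOR takes the place of addition, so no carry ever propagates.
            product ^= a << bit
    return product


# Ordinary multiplication: 3 * 3 = 9.
print(bin(3 * 3))          # 0b1001
# Carry-less: (x + 1) * (x + 1) = x^2 + 1 when coefficients are taken mod 2.
print(bin(clmul64(3, 3)))  # 0b101
```

Reading each operand as a polynomial with single-bit coefficients, the shifted copies of one operand are combined with XOR rather than addition, which is why the result corresponds to polynomial multiplication over GF(2).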

Since each 128-bit SSE register is capable of holding two packed 64-bit values, the first operand, in AT&T syntax, is an 8-bit immediate that encodes which of the two packed quadwords should be used in the second and third operands. The lowest bit of the lower nibble of this immediate, bit 0, selects the quadword in the third operand, which is used as both a multiplier source and as the final destination for the product. The lowest bit of the higher nibble, bit 4, that is, the fifth bit of the byte, selects the quadword in the second operand. A value of zero in either of these bits encodes the low packed quadword, while a value of one encodes the high packed quadword. Aside from the SSE version, Intel® AVX added a VEX-encoded nondestructive version.
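The selector bits described above can be sketched with a small software model. This is an illustration under stated assumptions rather than a definitive description of the hardware: registers are represented as 128-bit Python integers, and the names pclmulqdq and clmul64 are hypothetical helpers that mirror the AT&T operand order.

```python
MASK64 = (1 << 64) - 1


def clmul64(a: int, b: int) -> int:
    """Carry-less product of two 64-bit values (XOR instead of addition)."""
    product = 0
    for bit in range(64):
        if (b >> bit) & 1:
            product ^= a << bit
    return product


def pclmulqdq(imm8: int, src: int, dst: int) -> int:
    """Model of `pclmulqdq $imm8, src, dst`; returns the new value of dst."""
    # Bit 0 selects the quadword of dst (third operand),
    # bit 4 selects the quadword of src (second operand).
    dst_qword = (dst >> 64) & MASK64 if imm8 & 0x01 else dst & MASK64
    src_qword = (src >> 64) & MASK64 if imm8 & 0x10 else src & MASK64
    return clmul64(dst_qword, src_qword)


src = (0x3 << 64) | 0x5  # high quadword 3, low quadword 5
dst = (0x7 << 64) | 0x2  # high quadword 7, low quadword 2

print(pclmulqdq(0x00, src, dst))  # low  dst, low  src: 10
print(pclmulqdq(0x01, src, dst))  # high dst, low  src: 27
print(pclmulqdq(0x10, src, dst))  # low  dst, high src: 6
print(pclmulqdq(0x11, src, dst))  # high dst, high src: 9
```

The four immediates 0x00, 0x01, 0x10, and 0x11 therefore cover every pairing of low and high quadwords, which is what allows a full 128-bit by 128-bit carry-less product to be assembled from four invocations.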

16.2.1 Further Reading

 http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf

 http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-paper.pdf

 https://software.intel.com/en-us/articles/intel-carry-less-multiplication-instruction-and-its-usage-for-computing-the-gcm-mode/

16.3 CRC32

The SSE4.2 instruction extensions add a CRC32 instruction for calculating the 32-bit CRC for the 0x11EDC6F41 polynomial. For computing a CRC with a different polynomial, use the PCLMUL instruction. Aside from data integrity checks, the CRC32 instruction can also be used as a fast hash function.
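As a rough illustration of what is being computed, the following sketch models the CRC for this polynomial one bit at a time in software. It assumes the bit-reflected form of the polynomial (0x82F63B78), and the initial value and final inversion follow the common CRC-32C convention, which software applies around the accumulation step; the function names are hypothetical.

```python
# 0x11EDC6F41 with the x^32 term dropped and the remaining 32 bits reversed.
POLY_REFLECTED = 0x82F63B78


def crc32_u8(crc: int, byte: int) -> int:
    """One accumulation step over a single byte."""
    crc ^= byte
    for _ in range(8):
        crc = (crc >> 1) ^ (POLY_REFLECTED if crc & 1 else 0)
    return crc


def crc32c(data: bytes) -> int:
    """CRC of a buffer, using the conventional 0xFFFFFFFF init and final XOR."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = crc32_u8(crc, byte)
    return crc ^ 0xFFFFFFFF


print(hex(crc32c(b"123456789")))  # 0xe3069283

# Used as a hash: map a key onto one of 64 buckets (bucket count is arbitrary).
bucket = crc32c(b"example-key") % 64
```

The model spends eight loop iterations per byte, whereas the instruction consumes up to eight bytes per invocation, which is where the performance advantage comes from.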

16.3.1 Further Reading

 http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/hash-method-performance-paper.pdf

16.4 SSE4.2 String Functions

Aside from the CRC32 and POPCNT instructions, the SSE4.2 instruction set extension adds functionality for performing common string operations with SIMD instructions. This new functionality revolves around comparing, searching, and validating strings with five new instructions.

Four of the five new instructions vary slightly in behavior and follow the general format of PCMPxSTRy, where x and y are variable characters that control the string length interpretation and result format, respectively. There are two possible values for x, the character “E,” for explicit length strings, and the character “I,” for implicit length strings. Implicit length strings are terminated with a sentinel character, for example, standard C strings terminated with a NULL character. On the other hand, explicit length strings require the length of the strings to be loaded into general purpose registers. As a result, the implicit length forms, PCMPISTRy, are designed for working on text strings and will automatically stop processing when a NULL character is encountered. The explicit length forms, PCMPESTRy, are instead designed for working on binary strings, where a NULL character isn’t used as a sentinel.

For the explicit length forms, the string length of the second operand, in AT&T syntax, is stored in the RDX or EDX register, depending on the processor mode, while the length of the third operand is stored in the RAX or EAX register. Because the lengths matter to the instruction only with regard to the values loaded into the SIMD registers, both lengths are internally saturated to the maximum number of elements a SIMD register can hold. The values stored in the general purpose registers themselves aren’t affected. This fact can be exploited in order to reduce the number of registers required for some string operations by also using the explicit length registers as counters.
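The counter trick can be sketched as follows. This is a simplified model assuming byte elements, so a register holds 16 of them; taking the absolute value of the register follows the instruction description in the Intel® Software Developer Manual and goes beyond what is stated here. The helper names are hypothetical.

```python
XMM_BYTES = 16  # byte elements in one 128-bit register


def effective_length(reg_value: int, elements: int = XMM_BYTES) -> int:
    """Length the instruction acts on: saturated, register left untouched."""
    return min(abs(reg_value), elements)


def chunk_lengths(total: int) -> list:
    """Lengths seen per 16-byte chunk when the remaining byte count is
    passed unmodified as the explicit length."""
    seen = []
    remaining = total  # doubles as loop counter and explicit length
    while remaining > 0:
        seen.append(effective_length(remaining))
        remaining -= XMM_BYTES
    return seen


print(chunk_lengths(40))  # [16, 16, 8]
```

Because the saturation happens internally, the loop never needs a second register holding min(remaining, 16); the remaining byte count serves as both the loop counter and the length operand.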

The second variable in the PCMPxSTRy form, y, controls the output format. This variable can either be the character “I,” for index, or “M,” for mask. In index mode, the result is stored into the ECX register. In mask mode, the result of each comparison is stored as a mask in the XMM0 register, which serves as an implicit destination operand.
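The difference between the two output formats can be illustrated with a small software model. The sketch below assumes byte elements, implicit length strings, and a comparison that flags every text byte matching any byte of a character set, with the index form reporting the lowest matching position. All function names are hypothetical, and inputs shorter than 16 bytes are treated as if the rest of the register were zero.

```python
WIDTH = 16  # byte elements in one 128-bit register


def implicit_length(block: bytes) -> int:
    """Valid bytes: everything before the first NULL, capped at 16."""
    block = block[:WIDTH]
    nul = block.find(b"\x00")
    return len(block) if nul < 0 else nul


def equal_any_bits(char_set: bytes, text: bytes) -> int:
    """Bit i is set when text[i] matches any valid byte of char_set."""
    valid_set = char_set[:implicit_length(char_set)]
    bits = 0
    for i in range(implicit_length(text)):
        if text[i] in valid_set:
            bits |= 1 << i
    return bits


def as_index(bits: int) -> int:
    """Index form: position of the lowest set bit, or 16 if nothing matched."""
    return (bits & -bits).bit_length() - 1 if bits else WIDTH


bits = equal_any_bits(b"=:", b"a:b=c")
print(as_index(bits))  # index form: 1
print(bin(bits))       # mask form:  0b1010
```

The index form suits searches that stop at the first hit, while the mask form preserves every per-element result for further SIMD processing.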

Therefore, the four new PCMPxSTRy instructions can be defined as follows:

PCMPESTRI Compares two explicit length strings, whose lengths are in RDX/RAX or EDX/EAX, and stores the result in ECX.

PCMPESTRM Compares two explicit length strings and stores the comparison result as a mask in the XMM0 register.

PCMPISTRI Compares two implicit length strings, stopping at the first NULL character, and stores the result in ECX.

PCMPISTRM Compares two implicit length strings and stores the comparison result as a mask in the XMM0 register.

The PCMPxSTRy instructions are designed to handle many different scenarios. As a result, the first operand, in AT&T syntax, is an 8-bit immediate that controls the exact behavior of the comparison. This includes whether the comparisons occur between signed or unsigned characters or words, what should be reported, and so on. Figure 16.1 illustrates the format of this byte.

Figure 16.1 PCMPxSTRy immediate operand.
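As a rough companion to the figure, the following sketch decodes the immediate into its fields. The field positions and names are taken from the instruction descriptions in the Intel® Software Developer Manual as understood here, not from the figure itself, and the function name decode_imm8 is hypothetical.

```python
ELEMENT_FORMAT = ("unsigned bytes", "unsigned words",
                  "signed bytes", "signed words")
AGGREGATION = ("equal any", "ranges", "equal each", "equal ordered")
POLARITY = ("positive", "negative", "masked positive", "masked negative")


def decode_imm8(imm8: int) -> dict:
    """Split a PCMPxSTRy control byte into its fields (bit 7 is reserved)."""
    return {
        "format": ELEMENT_FORMAT[imm8 & 0x3],          # bits 1:0
        "aggregation": AGGREGATION[(imm8 >> 2) & 0x3],  # bits 3:2
        "polarity": POLARITY[(imm8 >> 4) & 0x3],        # bits 5:4
        # Bit 6: for index forms, 0 = least and 1 = most significant index;
        # for mask forms, 0 = bit mask and 1 = byte/word mask.
        "output_select": (imm8 >> 6) & 0x1,
    }


# 0x18: compare unsigned bytes position by position and report mismatches,
# the kind of setting a string-comparison routine might use.
print(decode_imm8(0x18))
# 0x0C: unsigned bytes, substring search.
print(decode_imm8(0x0C))
```

Keeping the decoding in a table like this makes it easier to audit a hand-written immediate than reading the raw hexadecimal value.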

In order to communicate additional information about the result, the arithmetic flags in the EFLAGS register are overloaded with special meanings.
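The overloaded meanings can be summarized with a short model. The flag assignments below follow the instruction descriptions in the Intel® Software Developer Manual as understood here and should be checked against it before being relied upon; the function name and parameters are hypothetical, and operand positions are given in AT&T order.

```python
def pcmpxstr_flags(result_bits: int, len_second: int, len_third: int,
                   width: int = 16) -> dict:
    """Model of EFLAGS after a PCMPxSTRy instruction.

    result_bits: final per-element comparison result
    len_second:  valid elements in the second operand (AT&T order)
    len_third:   valid elements in the third operand (AT&T order)
    width:       elements per register (16 bytes or 8 words)
    """
    return {
        "CF": int(result_bits != 0),    # at least one element matched
        "ZF": int(len_second < width),  # second operand ends in this block
        "SF": int(len_third < width),   # third operand ends in this block
        "OF": result_bits & 1,          # lowest bit of the result
        "AF": 0,
        "PF": 0,
    }


# A full 16-byte block with no match: CF and ZF are both clear, so a scanning
# loop can branch back and load the next block without any extra compare.
flags = pcmpxstr_flags(0, 16, 3)
print(flags["CF"], flags["ZF"])  # 0 0
```

Because the match and end-of-string conditions land in separate flags, a single conditional branch after the instruction can often decide whether the loop should continue.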

16.4.1 Further Reading

 https://software.intel.com/en-us/articles/schema-validation-with-intel-streaming-simd-extensions-4-intel-sse4
