A basic component of many cryptographic algorithms is what is known as a hash function. When a hash function satisfies certain non-invertibility properties, it can be used to make many algorithms more efficient. In the following, we discuss the basic properties of hash functions and attacks on them. We also briefly discuss the random oracle model, which is a method of analyzing the security of algorithms that use hash functions. Later, in Chapter 13, hash functions will be used in digital signature algorithms. They also play a role in security protocols in Chapter 15, and in several other situations.
A cryptographic hash function takes as input a message of arbitrary length and produces as output a message digest of fixed length, for example, 256 bits as depicted in Figure 11.1. Certain properties should be satisfied:
Given a message , the message digest can be calculated very quickly.
Given a , it is computationally infeasible to find an with (in other words, is a one-way, or preimage resistant, function). Note that if is the message digest of some message, we are not trying to find this message. We are only looking for some with .
It is computationally infeasible to find messages and with (in this case, the function is said to be strongly collision resistant).
Note that since the set of possible messages is much larger than the set of possible message digests, there should always be many examples of messages and with . The requirement (3) says that it should be hard to find examples. In particular, if Bob produces a message and its hash , Alice wants to be reasonably certain that Bob does not know another message with , even if both and are allowed to be random strings of symbols.
Preimage resistance and collision resistance are closely related, but we list them separately because they are used in slightly different circumstances. The following argument shows that, for our hash functions, collision resistance implies preimage resistance: Suppose is not preimage resistant. Take a random and compute . If is not preimage resistant, we can quickly find with . Because is many-to-one, it is likely that , so we have a collision, contradicting the collision resistance of . However, there are examples that show that for arbitrary functions, collision resistance does not imply preimage resistance. See Exercise 12.
In practice, it is sometimes sufficient to weaken (3) to require to be weakly collision resistant. This means that given , it is computationally infeasible to find with . This property is also called second preimage resistance.
Requirement (3) is the hardest one to satisfy. In fact, in 2004, Wang, Feng, Lai, and Yu (see [Wang et al.]) found many examples of collisions for the popular hash functions MD4, MD5, HAVAL-128, and RIPEMD. The MD5 collisions have been used by Ondrej Mikle [Mikle] to create two different and meaningful documents with the same hash, and the paper [Lenstra et al.] shows how to produce examples of X.509 certificates (see Section 15.5) with the same MD5 hash (see also Exercise 15). This means that a valid digital signature (see Chapter 13) on one certificate is also valid for the other certificate, hence it is impossible for someone to determine which is the certificate that was legitimately signed by a Certification Authority. It has been reported that weaknesses in MD5 were part of the design of the Flame malware, which attacked several computers in the Middle East, including Iran’s oil industry, from 2010 to 2012.
In 2005, Wang, Yin, and Yu [Wang et al. 2] predicted that collisions could be found for the hash function SHA-1 with around calculations, which is much better than the expected calculations required by the birthday attack (see Section 12.1). In addition, they found collisions in a smaller 60-round version of SHA-1. These weaknesses were a cause for concern for using these hash algorithms and led to research into replacements. Finally, in 2017, a joint project between CWI Amsterdam and Google Research found collisions for SHA-1 [Stevens et al.]. Although SHA-1 is still common, it is starting to be used less and less.
One of the main uses of hash functions is in digital signatures. Since the length of a digital signature is often at least as long as the document being signed, it is much more efficient to sign the hash of a document rather than the full document. This will be discussed in Chapter 13.
Hash functions may also be employed as a check on data integrity. The question of data integrity comes up in basically two scenarios. The first is when the data (encrypted or not) are being transmitted to another person and a noisy communication channel introduces errors to the data. The second occurs when an observer rearranges the transmission in some manner before it gets to the receiver. Either way, the data have become corrupted.
For example, suppose Alice sends Bob long messages about financial transactions with Eve and encrypts them in blocks. Perhaps Eve deduces that the tenth block of each message lists the amount of money that is to be deposited to Eve’s account. She could easily substitute the tenth block from one message into another and increase the deposit.
In another situation, Alice might send Bob a message consisting of several blocks of data, but one of the blocks is lost during transmission. Bob might never realize that the block is missing.
Here is how hash functions can be used. Say we send over the communications channel and it is received as . To check whether errors might have occurred, the recipient computes and sees whether it equals . If any errors occurred, it is likely that , because of the collision-resistance properties of .
Let be a large integer. Let be regarded as an integer between 0 and . This function clearly satisfies (1). However, (2) and (3) fail: Given , let . Then . So is not one-way. Similarly, choose any two values and that are congruent mod . Then , so is not strongly collision resistant.
The following example, sometimes called the discrete log hash function, is due to Chaum, van Heijst, and Pfitzmann [Chaum et al.]. It satisfies (2) and (3) but is much too slow to be used in practice. However, it demonstrates the basic idea of a hash function.
First we select a large prime number such that is also prime (see Exercise 15 in Chapter 13). We now choose two primitive roots and for . Since is a primitive root, there exists such that . However, we assume that is not known (finding , if not given it in advance, involves solving a discrete log problem, which we assume is hard).
The hash function will map integers mod to integers mod . Therefore, the message digest usually contains approximately half as many bits as the message. This is not as drastic a reduction in size as is usually required in practice, but it suffices for our purposes.
Write with . Then define
The following shows that the function is probably strongly collision resistant.
If we know messages with , then we can determine the discrete logarithm .
Proof
Write and . Suppose
Using the fact that , we rewrite this as
Since is a primitive root mod , we know that if and only if . In our case, this means that
Let . There are exactly solutions to the preceding congruence (see Subsection 3.3.1), and they can be found quickly. By the choice of , the only factors of are . Since , it follows that . Therefore, if , then it is a nonzero multiple of of absolute value less than . This means that , so or 2. Therefore, there are at most two possibilities for . Calculate for each possibility; only one of them will yield . Therefore, we obtain , as desired.
On the other hand, if , then the preceding yields . Since , we must have . Therefore, , contrary to our assumption.
It is now easy to show that is preimage resistant. Suppose we have an algorithm that starts with a message digest and quickly finds an with . In this case, it is easy to find with : Choose a random and compute , then compute . Since maps messages to message digests, there are many messages with . It is therefore not very likely that . If it is, try another random . Soon, we should find a collision, that is, messages with . The preceding proposition shows that we can then solve a discrete log problem. Therefore, it is unlikely that such an algorithm exists.
As we mentioned earlier, this hash function is good for illustrative purposes but is impractical because of its slow nature. Although it can be computed efficiently via repeated squaring, it turns out that even repeated squaring is too slow for practical applications. In applications such as electronic commerce, the extra time required to perform the multiplications in software is prohibitive.