In this section, we show that the iterative nature of hash algorithms based on the Merkle-Damgård construction makes them less resistant than expected to finding multicollisions, namely inputs all with the same hash value. This was pointed out by Joux [Joux], who also gave implications for properties of concatenated hash functions, which we discuss below.
Suppose there are people and there are possible birthdays. It can be shown that if , then there is a good chance of at least people having the same birthday. In other words, we expect a -collision. If the output of a hash function is random, then we expect that this estimate holds for -collisions of hash function values. Namely, if a hash function has -bit outputs, hence possible values, and if we calculate values of the hash function, we expect a -collision. However, in the following, we’ll show that often we can obtain collisions much more easily.
In many hash functions, for example, SHA-256, there is a compression function that operates on inputs of a fixed length. Also, there is a fixed initial value . The message is padded to obtain the desired format, then the following steps are performed:
Split the message into blocks .
Let be the initial value .
For , let .
Let .
In SHA-256, the compression function is described in Section 11.4. For each iteration, it takes a 256-bit input from the preceding iteration along with a message block of length 512 and outputs a new string of length 256.
Suppose the output of the function , and therefore also of the hash function , has bits. A birthday attack can find, in approximately steps, two blocks and such that . Let . A second birthday attack finds blocks and with . Continuing in this manner, we let
and use a birthday attack to find and with
This process is continued until we have pairs of blocks , , where is some integer to be determined later.
We claim that each of the messages
(all possible combinations with and ) has the same hash value. This is because of the iterative nature of the hash algorithm. At each calculation , the same value is obtained whether or . Therefore, the output of the function during each step of the hash algorithm is independent of whether an or an is used. Therefore, the final output of the hash algorithm is the same for all messages. We thus have a -collision.
This procedure takes approximately steps and has an expected running time of approximately a constant times (see Exercise 13). Let , for example. Then it takes only around twice as long to find four messages with same hash value as it took to find two messages with the same hash. If the output of the hash function were truly random, rather than produced, for example, by an iterative algorithm, then the above procedure would not work. The expected time to find four messages with the same hash would then be approximately , which is much longer than the time it takes to find two colliding messages. Therefore, it is easier to find multicollisions with an iterative hash algorithm.
An interesting consequence of the preceding discussion relates to attempts to improve hash functions by concatenating their outputs. Suppose we have two hash functions and . Before [Joux] appeared, the general wisdom was that the concatenation
should be a significantly stronger hash function than either or individually. This would allow people to use somewhat weak hash functions to build much stronger ones. However, it now seems that this is not the case. Suppose the output of has bits. Also, assume that is calculated by an iterative algorithm, as in the preceding discussion. No assumptions are needed for . We may even assume that it is a random oracle, in the sense of Section 12.3. In time approximately , we can find messages that all have the same hash value for . We then compute the value of for each of these messages. By the birthday paradox, we expect to find a match among these values of . Since these messages all have the same value, we have a collision for . Therefore, in time proportional to (we’ll explain this estimate shortly), we expect to be able to find a collision for . This is not much longer than the time a birthday attack takes to find a collision for the longer of and , and is much faster than the time that a standard birthday attack would take on this concatenated hash function.
How did we get the estimate for the running time? We used steps to get the messages with the same value. Each of these messages consisted of blocks of a fixed length. We then evaluated for each of these messages. For almost every hash function, the evaluation time is proportional to the length of the input. Therefore, the evaluation time is proportional to for each of the messages that are given to . This gives the term in the estimated running time.