If there are 23 people in a room, the probability is slightly more than 50% that two of them have the same birthday. If there are 30, the probability is around 70%. This might seem surprising; it is called the birthday paradox. Let’s see why it’s true. We’ll ignore leap years (which would slightly lower the probability of a match) and we assume that all birthdays are equally likely (if not, the probability of a match would be slightly higher).
Consider the case of 23 people. We’ll compute the probability that they all have different birthdays. Line them up in a row. The first person uses up one day, so the second person has probability of having a different birthday. There are two days removed for the third person, so the probability is that the third birthday differs from the first two. Therefore, the probability of all three people having different birthdays is . Continuing in this way, we see that the probability that all 23 people have different birthdays is
Therefore, the probability of at least two having the same birthday is
One way to understand the preceding calculation intuitively is to consider the case of 40 people. If the first 30 have a match, we’re done, so suppose the first 30 have different birthdays. Now we have to choose the last 10 birthdays. Since 30 birthdays are already chosen, we have approximately a 10% chance that a randomly chosen birthday will match one of the first 30. And we are choosing 10 birthdays. Therefore, it shouldn’t be too surprising that we get a match. In fact, the probability is 89% that there is a match among 40 people.
More generally, suppose we have objects, where is large. There are people, and each chooses an object (with replacement, so several people could choose the same one). Then
Note that this is only an approximation that holds for large ; for small it is better to use the above product and obtain an exact answer. In Exercise 12, we derive this approximation. Choosing , we find that if , then the probability is 50% that at least two people choose the same object.
To summarize, if there are possibilities and we have a list of length , then there is a good chance of a match. If we want to increase the chance of a match, we can make the list have length or . The main point is that a length of a constant times (instead of something like ) suffices.
For example, suppose we have 40 license plates, each ending in a three-digit number. What is the probability that two of the license plates end in the same three digits? We have , the number of possible three-digit numbers, and , the number of license plates under consideration. Since
the approximate probability of a match is
which is more than 50%. We stress that this is only an approximation. The correct answer is obtained by calculating
The next time you are stuck in traffic (and have a passenger to record numbers), check out this prediction.
But what is the probability that one of these 40 license plates has the same last three digits as yours (assuming that yours ends in three digits)? Each plate has probability of not matching yours, so the probability is that none of the 40 plates matches your plate. The reason the birthday paradox works is that we are not just looking for matches between one fixed plate, such as yours, and the other plates. We are looking for matches between any two plates in the set, so there are many more opportunities for matches.
For more examples, see Examples 36 and 37 in the Computer Appendices.
The applications of these ideas to cryptology require a slightly different setup. Suppose there are two rooms, each with 30 people. What is the probability that someone in the first room has the same birthday as someone in the second room? More generally, suppose there are objects and there are two groups of people. Each person from each group selects an object (with replacement). What is the probability that someone from the first group chooses the same object as someone from the second group? In this case,
If , then the probability of exactly matches is approximately . An analysis of this problem, with generalizations, is given in [Girault et al.]. Note that the present situation differs from the earlier problem of finding a match in one set of people. Here, we have two sets of people, so a total of people. Therefore, the probability of a match in this set is approximately . But around half of the time, these matches are between members of the same group, and half the time the matches are the desired ones, namely, between the two groups. The precise effect is to cut the probability down to .
Again, if there are possibilities and we have two lists of length , then there is a good chance of a match. Also, if we want to increase the chance of a match, we can make the lists have length or . The main point is that a length of a constant times (instead of something like ) suffices.
For example, if we take and , then
Since , there is approximately a 91.5% probability that someone in one group of 30 people has the same birthday as someone in a second group of 30 people.
The birthday attack can be used to find collisions for hash functions if the output of the hash function is not sufficiently large. Suppose that is an -bit hash function. Then there are possible outputs. Make a list for approximately random choices of . Then we have the situation of “people” with possible “birthdays,” so there is a good chance of having two values and with the same hash value. If we make the list longer, for example values of , the probability becomes very high that there is a match.
Similarly, suppose we have two sets of inputs, and . If we compute for approximately randomly chosen and compute for approximately randomly chosen , then we expect some value to be equal to some value . This situation will arise in an attack on signature schemes in Chapter 13, where will be a set of good documents and will be a set of fraudulent documents.
If the output of the hash function is around bits, the above attacks have a high chance of success. It is necessary to make lists of length approximately and to store them. This is possible on most computers. However, if the hash function outputs 256-bit values, then the lists have length around , which is too large, both in time and in memory.
Suppose we are working with a large prime and want to evaluate . In other words, we want to solve . We can do this with high probability by a birthday attack.
Make two lists, both of length around :
The first list contains numbers for approximately randomly chosen values of .
The second list contains numbers for approximately randomly chosen values of .
There is a good chance that there is a match between some element on the first list and some element on the second list. If so, we have
Therefore, is the desired discrete logarithm.
Let’s compare this method with the Baby Step, Giant Step (BSGS) method described in Section 10.2. Both methods have running time and storage space proportional to . However, the BSGS algorithm is deterministic, which means that it is guaranteed to produce an answer. The birthday algorithm is probabilistic, which means that it probably produces an answer, but this is not guaranteed. Moreover, there is a computational advantage to the BSGS algorithm. Computing one member of a list from a previous one requires one multiplication (by or by ). In the birthday algorithm, the exponent is chosen randomly, so must be computed each time. This makes the algorithm slower. Therefore, the BSGS algorithm is somewhat superior to the birthday method.