Blood Testing
15.1 THE PROBLEM
Imagine that a very large number of people, N, are to each have their blood tested for a possible dangerous infection. Think, for example, of the millions of men inducted into the armed services during World War II, and the then not-so-uncommon sexually transmitted infection of syphilis. The obvious way to do this is simply to test each person, for a total of N tests. During World War II a different, quite clever alternative approach was actually used, based on the fact that while syphilis wasn’t uncommon, most men were actually not infected.
The underlying idea was simple: for some number k, pool the blood samples from k men and test the mixture. If all k men are free of infection, then the test will be negative, and one test clears all k men at once. If the test is positive, however, at least one man is infected, and so then all of them are individually tested, for a total of k + 1 tests. Taking the probability of being infected as p, this leads to the following two questions: (1) What is the value of k that minimizes the number of expected tests? (2) For that value of k, what is the expected number of tests (and how does it compare with N, the number of tests if each man is individually tested)?
15.2 THEORETICAL ANALYSIS
Since p is the probability a man will test positive, then 1 − p is the probability he will test negative. Thus, (1 − p)k is the probability a mix of the blood samples from k people will test negative, and 1 − (1 − p)k is the probability the test will be positive (and so k additional, individual tests will need to be done). So, what we have is
1 test with probability (1 − p)k
and
k + 1 tests with probability 1 − (1 − p)k
for each blood mixture formed from k people. Since N people will create N/k blood pools, then, if T (a random variable) is the total number of tests required for N people, the average value of T is
or,
To find that k that minimizes E(T), we’ll next solve the equation d E(T)/dk = 0, a calculation that is valid if k is a continuous variable—which of course it isn’t. But I’ll go ahead with the calculation anyway, as the objection is more a purist’s one than anything else. Since
(1 − p)k = eln{(1 − p)k} = ekln(1 − p),
we have
For the case of p Ü 1 (during the years leading up to World War II a typical value of p for syphilis was something like 0.001 to 0.01), and we can use approximations valid for this situation:
(1 − p)k ≈ 1 − kp
and
ln(1 − p) ≈ −p.
So,
or
Now, since p Ü 1, then p2 ÜÜ 1, and so we can drop the p2k term. This gives us the answer to our first question:
For example, if p = 0.01, then k = 10, while if p = 0.001, then k = 32.
To answer the second question, we’ll put into the expression for E(T) to get
we have
So, if p = 0.01, then E(T) = 0.2N = N/5, while if p = 0.001, then E(T) = 0.063N = N/16. These are dramatic reductions in the expected number of tests required.