A Review of Some Recent Advances in Causal Inference 389
areas involving big data include, for example, systems biology (e.g., [12,19,30,32,40,62]),
neuroscience (e.g., [8,20,49,58]), climate science (e.g., [16,17]), and marketing (e.g., [7]).
21.1.2 Observational versus Experimental Data
Going back to the prisoners example, which of the three posed questions can we answer? This
depends on the origin of the data, and brings us to the distinction between observational
and experimental data.
21.1.2.1 Observational Data
Suppose first that participation in the program was voluntary. Then we would have so-called
observational data, because the subjects (prisoners) chose their own treatment (rehabilitation
program or not), while the researchers just observed the results. From observational
data, we can easily answer question 1. It is difficult, however, to answer questions 2 and 3.
Let us first consider question 2. Because the participants form a self-selected subgroup,
there may be many differences between the participants and the nonparticipants. For
example, the participants may be more motivated to change their lives, and this may
contribute to the difference in rearrest rates. In this case, the effects of the program and
the motivation of the prisoners are said to be mixed up, or confounded.
Next, let us consider question 3. At first sight, one may think that the answer is simply
20%, because this was the rearrest rate among the participants of the program. But again
we have to keep in mind that the participants form a self-selected subgroup that is likely to
have special characteristics. Hence, the rearrest rate of this subgroup cannot be extrapolated
to the entire prisoner population.
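The difficulty with questions 2 and 3 can be made explicit with a short decomposition. The potential-outcome notation used here is an illustrative assumption and is not the notation of this chapter: T indicates participation (T = 1 for participants), and Y(1) and Y(0) denote a prisoner's rearrest indicator with and without the program, so that the observed outcome is Y = Y(T).

```latex
\begin{align*}
E[Y \mid T=1] - E[Y \mid T=0]
  &= \underbrace{E[Y(1) - Y(0) \mid T=1]}_{\text{effect among participants}}
   + \underbrace{E[Y(0) \mid T=1] - E[Y(0) \mid T=0]}_{\text{selection bias}}, \\
E[Y \mid T=1]
  &= E[Y(1) \mid T=1] \;\neq\; E[Y(1)] \quad \text{in general.}
\end{align*}
```

The first line relates to question 2: the selection-bias term is nonzero whenever participants would have had a different rearrest rate than nonparticipants even without the program, for example because they are more motivated. The second line relates to question 3: the rate observed among self-selected participants need not equal the rate that would be seen if the program were applied to the entire prisoner population.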
21.1.2.2 Experimental Data
Now suppose that it was up to the researchers to decide which prisoners participated in the
program. For example, suppose that the researchers rolled a die for each prisoner, and let
him/her participate if the outcome was 1 or 2. Then we would have a so-called randomized
controlled experiment and experimental data.
Let us look again at question 2. Because of the randomization, the motivation level of the
prisoners is likely to be similar in the two groups. Moreover, any other factors of importance
(such as social background, type of crime committed, and number of earlier crimes) are
likely to be similar in the two groups. Hence, the groups are comparable in all respects
(up to chance variation), except for participation in the program. The observed difference
in rearrest rate can therefore be attributed to the program. This answers question 2.
Finally, the answer to question 3 is now 20%, because the randomized treatment
assignment ensures that the participants form a representative sample of the population.
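The contrast between voluntary and randomized participation can be illustrated with a small simulation. The following Python sketch is not taken from the chapter. The hidden variable `motivated`, all probabilities, and the function names `simulate`, `mean`, and `summarize` are assumptions made up for illustration. Only the die rule (participate if the outcome is 1 or 2) follows the text.

```python
import random


def simulate(n, randomized, seed=0):
    """Simulate n prisoners; return rows of (treated, observed, y1, y0).

    y1 / y0 are the (normally unobservable) rearrest outcomes with and
    without the program. All probabilities below are made-up assumptions.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        motivated = rng.random() < 0.5  # hidden confounder (assumed)
        if randomized:
            # Die roll as in the text: participate if the outcome is 1 or 2.
            treated = rng.randint(1, 6) <= 2
        else:
            # Voluntary enrollment: motivated prisoners join more often (assumed).
            treated = rng.random() < (0.6 if motivated else 0.1)
        # Assumed rearrest risk without the program; the program lowers it by 0.2.
        base = 0.3 if motivated else 0.7
        u = rng.random()
        y0 = u < base
        y1 = u < base - 0.2
        rows.append((treated, y1 if treated else y0, y1, y0))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def summarize(rows):
    """Compare what the data show with the simulated ground truth."""
    treated_rate = mean(y for t, y, _, _ in rows if t)
    control_rate = mean(y for t, y, _, _ in rows if not t)
    all_treated_rate = mean(y1 for _, _, y1, _ in rows)
    none_treated_rate = mean(y0 for _, _, _, y0 in rows)
    return {
        "observed_difference": treated_rate - control_rate,    # naive answer to question 2
        "rate_among_participants": treated_rate,               # naive answer to question 3
        "true_effect": all_treated_rate - none_treated_rate,   # ground truth for question 2
        "true_rate_if_all_participate": all_treated_rate,      # ground truth for question 3
    }


if __name__ == "__main__":
    for label, randomized in [("observational", False), ("experimental", True)]:
        print(label, summarize(simulate(100_000, randomized, seed=1)))
```

Under these assumed numbers, the voluntary setting overstates the benefit of the program and understates the rearrest rate that would result if everyone participated, because motivated prisoners are overrepresented among participants. With the die roll, the observed quantities are close to the simulated ground truth.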
Thus, causal questions are best answered by experimental data, and we should work
with such data whenever possible. Experimental data are not always available, however,
because randomized controlled experiments can be unethical, infeasible, time consuming, or
expensive. On the other hand, observational data are often relatively cheap and abundant.
In this chapter, we therefore consider the problem of answering causal questions about
large-scale systems from observational data.
21.1.3 Problem Formulation
It is relatively straightforward to make standard predictions based on observational data
(see the observational world in Figure 21.1), or to estimate causal effects from randomized
controlled experiments (see the experimental world in Figure 21.1). But we want to