Tutorial for Causal Inference 363
20.2 The Scientific Question
The first step in the causal “roadmap” is to specify the scientific objective. As a running
example, we will consider the timing of antiretroviral therapy (ART) initiation and its
impact on outcomes among HIV+ individuals. Early ART initiation has been shown to
improve patient outcomes as well as reduce transmission between discordant couples [9,10].
Suppose we want to learn the effect of immediate ART initiation (i.e., irrespective of CD4+
T-cell count) on mortality. Large consortiums, such as the International Epidemiologic
Databases to Evaluate AIDS and Sustainable East Africa Research in Community Health,
are providing unprecedented quantities of data to answer this and other questions [12,13].
To sharply frame our scientific aim, we need to further specify the system, including the
target population (e.g., patients and context), the exposure (e.g., criteria and timing), and
the outcome. As a second attempt, suppose our goal is to learn the impact of initiating ART
within 1 month of diagnosis on 5-year all-cause mortality among adults recently diagnosed
with HIV in Sub-Saharan Africa. This might seem like an insurmountable task, and it may
seem safer to frame our question in terms of an association. Indeed, there seems to be a
tendency to shy away from causal language when stating the scientific objective. However,
we are not fundamentally interested in the correlation between early ART initiation and
mortality among HIV+ adults. Instead, we want to isolate the effect of interest from the
spurious sources of dependence (e.g., confounding, selection bias, informative censoring) as
shown in Figure 20.1. The framework discussed in this chapter provides a pathway from
our scientific aim to estimation of a statistical parameter that best approximates our causal
effect, while keeping any assumptions transparent.
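To make the gap between an association and an effect concrete, the following minimal sketch uses invented counts (not real data). It assumes a single binary confounder, low CD4+ count at diagnosis, that makes patients both more likely to start ART early and more likely to die. The names COUNTS, crude_risk_difference, and stratum_risk_differences, and every number, are illustrative assumptions rather than anything taken from the studies cited above.

```python
# Hypothetical counts, invented purely for illustration (not real data).
# Key: (low_cd4, early_art) -> (deaths within 5 years, number of patients)
COUNTS = {
    (1, 1): (90, 300),  # low CD4+, early ART
    (1, 0): (40, 100),  # low CD4+, delayed ART
    (0, 1): (5, 100),   # high CD4+, early ART
    (0, 0): (30, 300),  # high CD4+, delayed ART
}


def crude_risk_difference(counts):
    """Risk of death under early ART minus risk under delayed ART,
    pooling over the CD4+ strata (i.e., ignoring the confounder)."""
    deaths = {0: 0, 1: 0}
    patients = {0: 0, 1: 0}
    for (_, early_art), (d, n) in counts.items():
        deaths[early_art] += d
        patients[early_art] += n
    return deaths[1] / patients[1] - deaths[0] / patients[0]


def stratum_risk_differences(counts):
    """Risk difference (early minus delayed) within each CD4+ stratum."""
    result = {}
    for low_cd4 in (0, 1):
        d1, n1 = counts[(low_cd4, 1)]
        d0, n0 = counts[(low_cd4, 0)]
        result[low_cd4] = d1 / n1 - d0 / n0
    return result


crude = crude_risk_difference(COUNTS)
by_stratum = stratum_risk_differences(COUNTS)
print("crude risk difference:", round(crude, 4))
print("risk difference by stratum:",
      {k: round(v, 4) for k, v in by_stratum.items()})
```

With these made-up counts, the pooled comparison suggests higher mortality under early ART, while within each CD4+ stratum the early-initiation group has lower mortality. The reversal arises only because sicker patients are assumed to start treatment sooner, which is the kind of spurious dependence the roadmap is meant to separate from the effect of interest.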
20.3 The Causal Model
The second step of the roadmap is to specify our causal model. Causal inference is distinct
from statistics in that it requires something more than a sample from the observed data
distribution. In particular, causal inference requires specification of background knowledge,
and causal models provide a rigorous language for expressing this knowledge and its limits.
In this chapter, we focus on structural causal models [14] to formally represent which
variables potentially affect one another, the roles of unmeasured factors, and the functional
form of those relationships. Structural causal models unify causal graphs [15], structural
equations [16,17], and counterfactuals. We also briefly introduce the Neyman–Rubin
potential outcomes framework [18–20] and discuss its relation to the structural causal model.
Consider again our running example. Let W denote the set of baseline covariates,
including sociodemographics, clinical measurements, and social constructs. The exposure
A is an indicator, equalling 1 if the patient initiated ART within 1 month of diagnosis and
equalling 0 otherwise (i.e., initiation took longer than 1 month). Finally, the outcome Y is an
indicator that the patient did not survive 5 years of follow-up. These factors have scientific
meaning to the question and comprise the set of endogenous variables: X = {W, A, Y}. They
can be measurable (e.g., age and sex) or unmeasurable and are affected by other variables
in the model.
Each endogenous variable is associated with a set of background factors
U = (U_W, U_A, U_Y) with some joint distribution P_U. These represent all the unmeasured
factors affecting other variables in the model but not included in X. For example, U_A could
include unknown clinic-level factors influencing whether or not a patient initiates early ART.
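As a rough illustration of how the endogenous variables X = {W, A, Y} and the background factors U = (U_W, U_A, U_Y) fit together, the sketch below generates data from one possible structural causal model. Everything specific in it is an assumption made for the example: W is reduced to a single binary covariate, the components of U are drawn as independent uniforms (the text only requires some joint distribution P_U), and the functions f_W, f_A, f_Y with their coefficients are hypothetical.

```python
import math
import random


def expit(x):
    """Logistic function, mapping a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def draw_background(rng):
    """Draw U = (U_W, U_A, U_Y) from an assumed P_U.
    Independence of the components is an assumption of this sketch."""
    return {"U_W": rng.random(), "U_A": rng.random(), "U_Y": rng.random()}


# Hypothetical structural functions: each endogenous variable is a
# deterministic function of other endogenous variables and its own
# background factor.
def f_W(u_w):
    # W: one binary baseline covariate (e.g., low CD4+ count at diagnosis)
    return int(u_w < 0.4)


def f_A(w, u_a):
    # A: 1 if ART is initiated within 1 month of diagnosis, else 0
    return int(u_a < expit(-0.5 + 1.0 * w))


def f_Y(w, a, u_y):
    # Y: 1 if the patient did not survive 5 years of follow-up, else 0
    return int(u_y < expit(-1.5 + 1.2 * w - 0.8 * a))


def generate_unit(rng):
    """Generate one patient's (W, A, Y); U itself is not recorded,
    mirroring the fact that background factors are unmeasured."""
    u = draw_background(rng)
    w = f_W(u["U_W"])
    a = f_A(w, u["U_A"])
    y = f_Y(w, a, u["U_Y"])
    return {"W": w, "A": a, "Y": y}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = [generate_unit(rng) for _ in range(5)]
for unit in sample:
    print(unit)
```

Only W, A, and Y appear in the generated records. All randomness enters through U, so an unknown clinic-level influence on treatment timing would sit inside U_A in this representation.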