378 Handbook of Big Data
• Direct and indirect effects [73–76]: What is the direct effect of early ART initiation on
5-year mortality that is not mediated through changes in HIV RNA viral load?
• Stochastic interventions (nondeterministic interventions) [26]: What would be the
5-year mortality if the distribution of time until ART initiation shifted toward shorter
wait times? What is the impact of early ART initiation on 5-year mortality if HIV RNA
viral load, the intermediate, remained at the value it would have been in the absence of
the exposure (i.e., the natural direct effect [77–79])?
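The last two questions can be written compactly in counterfactual notation. The following is only a sketch under notation assumed for illustration, which the bullets themselves do not introduce: A is a binary indicator of early ART initiation, M is HIV RNA viral load (the intermediate), Y is 5-year mortality, Y_{a,m} and M_a denote counterfactual values, and g^* is a hypothetical, user-specified distribution over initiation times that does not depend on patient covariates.

```latex
% Natural direct effect: contrast exposure levels 1 and 0 while holding the
% intermediate at M_0, the value it would take in the absence of exposure.
\[
  \mathrm{NDE} \;=\; \mathbb{E}\left[ Y_{1, M_0} \right] - \mathbb{E}\left[ Y_{0, M_0} \right]
\]
% Stochastic intervention: the exposure is drawn from g^* rather than fixed.
% Assumes g^* is independent of covariates and of the background factors.
\[
  \mathbb{E}\left[ Y_{g^*} \right] \;=\; \sum_{a} \mathbb{E}\left[ Y_a \right] \, g^*(a)
\]
```

Shifting g^* so that it places more mass on short wait times corresponds to the first stochastic-intervention question above.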
Overall, access to unprecedented amounts of data does not undo the age-old adage:
“correlation is not causation.” Indeed, there are numerous sources of association (dependence)
between two variables: direct effects, indirect effects, measured confounding, unmeasured
confounding, and selection bias. The methods introduced here allow researchers to
move from saying that drug X is associated with an adverse side effect to saying (under the
necessary and transparently stated assumptions) that an adverse side effect is caused by drug
X. Even if the needed identifiability assumptions are not expected to hold, this framework
helps us estimate a statistical parameter that comes as close as possible to the desired causal parameter.
In other words, this framework ensures that the scientific question is driving the analysis
and not the other way around.
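The gap between association and causation can be illustrated with a small simulation. The sketch below is a toy example, not an analysis from this chapter: the data-generating process, the probabilities, and the names simulate and risk are all hypothetical. A measured confounder W raises both the chance of receiving drug X and the chance of the side effect Y, while X has no effect on Y at all.

```python
import random


def simulate(n=50000, seed=0):
    """Draw (w, x, y) triples from a hypothetical model where X does not cause Y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = rng.random() < 0.5                    # measured confounder
        x = rng.random() < (0.8 if w else 0.2)    # drug is more likely when w is true
        y = rng.random() < (0.6 if w else 0.1)    # side effect depends on w only
        rows.append((w, x, y))
    return rows


def risk(rows, x, w=None):
    """Proportion with the side effect among rows with exposure x (and stratum w, if given)."""
    selected = [yi for (wi, xi, yi) in rows if xi == x and (w is None or wi == w)]
    return sum(selected) / len(selected)


rows = simulate()

# Naive contrast: compares treated and untreated while ignoring the confounder.
naive = risk(rows, True) - risk(rows, False)

# Adjusted contrast: stratum-specific risk differences, averaged over the
# marginal distribution of W (standardization over the measured confounder).
p_w = sum(w for (w, _, _) in rows) / len(rows)
adjusted = sum(
    (risk(rows, True, w) - risk(rows, False, w)) * (p_w if w else 1 - p_w)
    for w in (True, False)
)

print(f"naive risk difference:    {naive:.3f}")
print(f"adjusted risk difference: {adjusted:.3f}")
```

Under these assumed probabilities the naive risk difference is 0.3 in expectation, even though the drug has no effect, whereas the adjusted difference is close to zero. The adjustment works here only because the single confounder is measured; with unmeasured confounding or selection bias, no such correction is available from the data alone.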
Appendix: Extensions to Multiple Time Point Interventions
As an introduction to causal inference, we focused on causal parameters corresponding to a
static intervention on a single node. In this appendix, we step through the causal roadmap
for an example of a longitudinal effect, corresponding to a multiple time point intervention.
Step 1—Specify the scientific question: What is the effect of delayed ART initiation
on patient outcomes? As before, we want to be specific about the target population:
recently diagnosed HIV+ adults in Sub-Saharan Africa. We also need to be clear about
the definition and timing of the exposures. For simplicity, let us assume that the patients
have monthly clinic visits and therefore could initiate ART or not each month. (This
framework could easily be extended to shorter or longer time intervals.) Suppose the
outcome is viral suppression after 12 months of follow-up.
Step 2—Specify the causal model: Let baseline (t = 0) be the time that the patient
is diagnosed with HIV. Let $L_0$ represent the vector of baseline covariates, including
sociodemographics, clinical measurements, and social constructs. Likewise, let $L_t$ represent
the vector of time-updated covariates (e.g., clinical measurements). Let $A_t$ be an indicator
that the patient initiated ART at time t. For example, $A_0 = 1$ represents starting ART
on the same day as diagnosis (i.e., month 0), whereas $A_1 = 1$ represents initiation at the
first month visit. Finally, let $Y$ be an indicator that the patient had undetectable HIV
RNA viral load at the end of follow-up. For simplicity, let us consider only three time
points and assume complete follow-up. Our structural causal model $\mathcal{M}^F$, only reflecting
the causal ordering, is given by
Endogenous nodes: $X = (L_0, A_0, L_1, A_1, Y)$
Exogenous nodes: $U = (U_{L_0}, U_{A_0}, U_{L_1}, U_{A_1}, U_Y)$ with some true joint distribution
$P_{U,0}$. We place no assumptions on the set of possible distributions for $U$. (During the
identifiability step, we will need to make some independence assumptions. However,