Throughout, we make several assumptions about the model. The graph G is assumed to
be acyclic (hence a DAG), and the error terms ε_1, ..., ε_p are jointly independent. In terms
of the causal interpretation, these assumptions mean that we do not allow feedback loops
nor unmeasured confounding variables. The above model with these assumptions was called
a structural causal model by [42]. We will simply refer to it as a structural equation model
(SEM ). If all structural equations are linear, we will call it a linear SEM.
We now discuss two important properties of SEMs, namely factorization and d-separation. If X_1, ..., X_p are generated from an SEM with causal DAG G, then the density f(x_1, ..., x_p) of X_1, ..., X_p (assuming it exists) factorizes as

\[
f(x_1, \dots, x_p) = \prod_{i=1}^{p} f_i\bigl(x_i \mid \mathrm{pa}(x_i, G)\bigr), \tag{21.2}
\]

where f_i(x_i | pa(x_i, G)) is the conditional density of X_i given pa(X_i, G).
If a density factorizes according to a DAG as in Equation 21.2, then one can use the DAG to read off conditional independencies that must hold in the distribution (regardless of the choice of the f_i(·)'s), using a graphical criterion called d-separation (see, e.g., Definition 1 in [43]). In particular, the so-called global Markov property implies that when two disjoint sets A and B of vertices are d-separated by a third disjoint set S, then A and B are conditionally independent given S (A ⊥ B | S) in any distribution that factorizes according to the DAG.
Example 21.3 We consider the following structural equations and the corresponding causal DAG for the random variables P, S, R, and M:

P ← g_1(M, ε_P)
S ← g_2(P, ε_S)
R ← g_3(M, S, ε_R)
M ← g_4(ε_M)

[Figure: the corresponding causal DAG on the nodes M, P, S, R, with edges M → P, P → S, S → R, and M → R.]

where ε_P, ε_S, ε_R, and ε_M are mutually independent with arbitrary mean zero distributions. For each structural equation, the variables on the right-hand side appear in the causal DAG as the parents of the variable on the left-hand side.
We denote the random variables by M, P, S, and R, because these structural equations can be used to describe a possible causal mechanism behind the prisoners data (Example 21.1), where M = measure of motivation, P = participation in the program (P = 1 means participation, P = 0 otherwise), S = measure of social skills taught by the program, and R = rearrest (R = 1 means rearrest, R = 0 otherwise).
We see that the causal DAG of this SEM indeed provides a clear and compact description of its causal assumptions. In particular, it allows that motivation directly affects participation and rearrest. Moreover, it allows that participation directly affects social skills, and that social skills directly affect rearrest. The missing edge between M and S encodes the assumption that there is no direct effect of motivation on social skills. In other words, any effect of motivation on social skills goes entirely through participation (see the path M → P → S). Similarly, the missing edge between P and R encodes the assumption that there is no direct effect of participation on rearrest; any effect of participation on rearrest must fully go through social skills (see the path P → S → R).
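To make the generative reading of these structural equations concrete, the following minimal Python sketch simulates data from this SEM under one hypothetical choice of the functions g_1, ..., g_4 and of the error distributions (the chapter leaves these unspecified), and empirically checks one conditional independence implied by d-separation, namely M ⊥ S | P:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical structural functions and mean-zero errors (our own illustrative choices).
eps_M, eps_P, eps_S, eps_R = (rng.normal(size=n) for _ in range(4))

M = eps_M                                         # M <- g_4(eps_M): motivation
P = (M + eps_P > 0).astype(float)                 # P <- g_1(M, eps_P): participation (0/1)
S = 2.0 * P + eps_S                               # S <- g_2(P, eps_S): social skills
R = (-0.5 * M - S + eps_R > 0).astype(float)      # R <- g_3(M, S, eps_R): rearrest (0/1)

# M and S are d-separated given P in this DAG, so the global Markov property predicts
# M independent of S given P; here we check that the sample partial correlation is ~0.
res_M = M - np.polyval(np.polyfit(P, M, 1), P)    # residual of M after regressing on P
res_S = S - np.polyval(np.polyfit(P, S, 1), P)    # residual of S after regressing on P
print(np.corrcoef(res_M, res_S)[0, 1])            # close to 0
```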
21.2.3 Postintervention Distributions and Causal Effects
Now how does the framework of the SEM allow us to move between the observational and experimental worlds? This is straightforward, because an intervention at some variable X_i simply means that we change the generating mechanism of X_i, that is, we change the corresponding structural equation g_i(·) (and leave the other structural equations unchanged). For example, one can let X_i ← ε̃_i, where ε̃_i has some given distribution, or X_i ← x_i' for some fixed value x_i' in the support of X_i. The latter is often denoted as Pearl's do-intervention do(X_i = x_i') and is interpreted as setting the variable X_i to the value x_i' by an outside intervention, uniformly over the entire population [43].
Example 21.4 In the prisoners example (see Examples 21.1 and 21.3), the quantity P(R = 1 | do(P = 1)) represents the rearrest probability when all prisoners are forced to participate in the program, while P(R = 1 | do(P = 0)) is the rearrest probability if no prisoner is allowed to participate in the program. We emphasize that these quantities are generally not equal to the usual conditional probabilities P(R = 1 | P = 1) and P(R = 1 | P = 0), which represent the rearrest probabilities among prisoners who choose to participate or not to participate in the program.

In the gene expression example (see Example 21.2), let X_i and X_j represent the gene expression levels of genes i and j. Then E(X_j | do(X_i = x_i)) represents the average expression level of gene j after setting the gene expression level of gene i to the value x_i by an outside intervention.
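To see the difference numerically, here is a small self-contained simulation sketch in the spirit of the prisoner SEM of Example 21.3 (the functional forms and parameters are our own illustrative choices): conditioning on P selects the self-selected subpopulation, whereas do(P = p) replaces the structural equation for P for everyone.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def simulate(do_P=None):
    """Draw n samples from an illustrative prisoner SEM; do_P forces P for everyone."""
    M = rng.normal(size=n)                                     # motivation
    P = (M + rng.normal(size=n) > 0).astype(float)             # participation (self-selected)
    if do_P is not None:
        P = np.full(n, float(do_P))                            # do(P = do_P): replace g_1
    S = 2.0 * P + rng.normal(size=n)                           # social skills
    R = (-0.5 * M - S + rng.normal(size=n) > 0).astype(float)  # rearrest
    return P, R

P, R = simulate()                                       # observational world
print("P(R=1 | P=1)     ", R[P == 1].mean())            # among those who chose to participate
print("P(R=1 | P=0)     ", R[P == 0].mean())
print("P(R=1 | do(P=1)) ", simulate(do_P=1)[1].mean())  # everyone forced to participate
print("P(R=1 | do(P=0)) ", simulate(do_P=0)[1].mean())  # no one allowed to participate
```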
21.2.3.1 Truncated Factorization Formula
A do-intervention on X_i means that X_i no longer depends on its former parents in the DAG, so that the incoming edges into X_i can be removed. This leads to a so-called truncated DAG. The postintervention distribution factorizes according to this truncated DAG, so that we get

\[
f\bigl(x_1, \dots, x_p \mid do(X_i = x_i')\bigr) =
\begin{cases}
\prod_{j \neq i} f_j\bigl(x_j \mid \mathrm{pa}(x_j, G)\bigr) & \text{if } x_i = x_i', \\
0 & \text{otherwise.}
\end{cases}
\tag{21.3}
\]
This is called the truncated factorization formula [41], the manipulation formula [59] or
the g-formula [52]. Note that this formula heavily uses the factorization formula (Equation
21.2) and the autonomy assumption (see page 391).
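As a concrete illustration of Equation 21.3, the following sketch evaluates the truncated factorization for a small hypothetical binary SEM with DAG X_1 → X_2, X_1 → X_3, X_2 → X_3 (all conditional probability tables are our own toy choices), and contrasts the resulting interventional probability with ordinary conditioning:

```python
# Toy conditional probability tables (hypothetical numbers) for a binary SEM with
# DAG X1 -> X2, X1 -> X3, X2 -> X3, so X1 confounds the effect of X2 on X3.
f1 = {1: 0.3, 0: 0.7}                                       # f_1(x1)
f2 = {(1, 1): 0.8, (0, 1): 0.2, (1, 0): 0.2, (0, 0): 0.8}   # f_2(x2 | x1), keys (x2, x1)
f3 = {(1, x1, x2): 0.1 + 0.3 * x1 + 0.4 * x2                # f_3(x3 = 1 | x1, x2)
      for x1 in (0, 1) for x2 in (0, 1)}
f3.update({(0, x1, x2): 1 - f3[(1, x1, x2)] for x1 in (0, 1) for x2 in (0, 1)})

def joint(x1, x2, x3):
    """Observational joint distribution, Equation 21.2."""
    return f1[x1] * f2[(x2, x1)] * f3[(x3, x1, x2)]

def joint_do_x2(x1, x2, x3, x2_star=1):
    """Postintervention distribution under do(X2 = x2_star), Equation 21.3: drop f_2."""
    return f1[x1] * f3[(x3, x1, x2)] if x2 == x2_star else 0.0

p_do = sum(joint_do_x2(x1, 1, 1) for x1 in (0, 1))           # P(X3 = 1 | do(X2 = 1))
p_obs = (sum(joint(x1, 1, 1) for x1 in (0, 1)) /
         sum(joint(x1, 1, x3) for x1 in (0, 1) for x3 in (0, 1)))  # P(X3 = 1 | X2 = 1)
print(p_do, p_obs)   # the two generally differ because X1 confounds X2 and X3
```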
21.2.3.2 Defining the Total Effect
Summary measures of the postintervention distribution can be used to define total causal effects. In the prisoners example, it is natural to define the total effect of P on R as P(R = 1 | do(P = 1)) − P(R = 1 | do(P = 0)). Again, we emphasize that this is different from P(R = 1 | P = 1) − P(R = 1 | P = 0). In a setting with continuous variables, the total effect of X_i on Y can be defined as

\[
\frac{\partial}{\partial x_i} E\bigl(Y \mid do(X_i = x_i)\bigr) \Big|_{x_i = x_i'}.
\]
21.2.3.3 Computing the Total Effect
A total effect can be computed using, for example, covariate adjustment [43,57], inverse
probability weighting (IPW) [23,53], or instrumental variables (e.g., [4]). In all these
methods, the causal DAG plays an important role, because it tells us which variables can be
used for covariate adjustment, which variables can be used as instruments, or which weights
should be used in IPW.
In this chapter, we focus mostly on linear SEMs. In this setting, the total effect of X_i on Y can be easily computed via linear regression with covariate adjustment. If Y ∈ pa(X_i, G), then the effect of X_i on Y equals zero. Otherwise, it equals the regression coefficient of X_i in the linear regression of Y on X_i and pa(X_i, G) (see Proposition 3.1 of [39]). In other words, we simply regress Y on X_i while adjusting for the parents of X_i in the causal DAG. This is also called adjusting for direct causes of the intervention variable.
Example 21.5 We consider the following linear SEM:

X_1 ← 2X_4 + ε_1
X_2 ← 3X_1 + ε_2
X_3 ← 2X_2 + X_4 + ε_3
X_4 ← ε_4

[Figure: the corresponding causal DAG on X_1, X_2, X_3, X_4, with edge weights 2 on X_4 → X_1, 3 on X_1 → X_2, 2 on X_2 → X_3, and 1 on X_4 → X_3.]
The errors are mutually independent with arbitrary mean zero distributions. We note that
the coefficients in the structural equations are depicted as edge weights in the causal DAG.
Suppose we are interested in the total effect of X_1 on X_3. Then we consider an outside intervention that sets X_1 to the value x_1, that is, do(X_1 = x_1). This means that we change the structural equation for X_1 to X_1 ← x_1. Because the other structural equations do not change, we then obtain X_2 = 3x_1 + ε_2, X_4 = ε_4, and X_3 = 2X_2 + X_4 + ε_3 = 6x_1 + 2ε_2 + ε_4 + ε_3. Hence, E(X_3 | do(X_1 = x_1)) = 6x_1, and differentiating with respect to x_1 yields a total effect of 6.
We note that the total effect of X_1 on X_3 also equals the product of the edge weights along the directed path X_1 → X_2 → X_3. This is true in general for linear SEMs: the total effect of X_i on Y can be obtained by multiplying the edge weights along each directed path from X_i to Y, and then summing over the directed paths (if there is more than one).
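This path rule can be written compactly in matrix form: if B denotes the weighted adjacency matrix of the DAG, then, because B is nilpotent for a DAG, (I − B)^{-1} = I + B + B^2 + ... sums the products of edge weights over all directed paths, so its off-diagonal entries are the total effects. A minimal numpy sketch for Example 21.5 (the indexing convention is ours):

```python
import numpy as np

# Weighted adjacency matrix of the DAG in Example 21.5: B[j, i] is the weight of the
# edge from X_{i+1} to X_{j+1} (0-based indices), read off the structural equations.
B = np.zeros((4, 4))
B[0, 3] = 2   # X4 -> X1
B[1, 0] = 3   # X1 -> X2
B[2, 1] = 2   # X2 -> X3
B[2, 3] = 1   # X4 -> X3

effects = np.linalg.inv(np.eye(4) - B)   # = I + B + B^2 + ... (B is nilpotent)
print(effects[2, 0])   # total effect of X1 on X3: 3 * 2 = 6
print(effects[2, 3])   # total effect of X4 on X3: 1 + 2 * 3 * 2 = 13
```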
The total effect can also be obtained via regression. Because pa(X_1, G) = {X_4}, the total effect of X_1 on X_3 equals the coefficient of X_1 in the regression of X_3 on X_1 and X_4. It can be easily verified that this again yields 6. One can also verify that adjusting for any other subset of {X_2, X_4} does not yield the correct total effect.
21.3 Causal Structure Learning
The material in the previous section can be used if the causal DAG is known. In settings
with big data, however, it is rare that one can draw the causal DAG. In this section, we
therefore consider methods for learning DAGs from observational data. Such methods are
called causal structure learning methods.
Recall from Section 21.2.2 that DAGs encode conditional independencies via
d-separation. Thus, by considering conditional independencies in the observational distribution, one may hope to reverse-engineer the causal DAG that generated the data.
Unfortunately, this does not work in general, because the same set of d-separation
relationships can be encoded by several DAGs. Such DAGs are called Markov equivalent
and form a Markov equivalence class.
A Markov equivalence class can be described uniquely by a completed partially directed acyclic graph (CPDAG) [3,9]. The skeleton of the CPDAG is defined as follows. Two nodes X_i and X_j are adjacent in the CPDAG if and only if, in any DAG in the Markov equivalence class, X_i and X_j cannot be d-separated by any set of the remaining nodes. The orientation of the edges in the CPDAG is as follows. A directed edge X_i → X_j in the CPDAG means that the edge X_i → X_j occurs in all DAGs in the Markov equivalence class. An undirected edge X_i - X_j in the CPDAG means that there is a DAG in the Markov equivalence class with X_i → X_j, as well as a DAG with X_i ← X_j.
It can happen that a distribution contains more conditional independence relationships
than those that are encoded by the DAG via d-separation. If this is not the case, then the
distribution is called faithful with respect to the DAG. If a distribution is both Markov
and faithful with respect to a DAG, then the conditional independencies in the distribution
correspond exactly to d-separation relationships in the DAG, and the DAG is called a perfect
map of the distribution.
Problem setting. Throughout this section, we consider the following setting. We are given n i.i.d. observations of X, where X = (X_1, ..., X_p) is generated from an SEM. We assume that the corresponding causal DAG G is a perfect map of the distribution of X. We aim to learn the Markov equivalence class of G.
In the following three subsections we discuss so-called constraint-based, score-based,
and hybrid methods for this task. The discussed algorithms are available in the R-package
pcalg [29]. In the last subsection we discuss a class of methods that can be used if one is
willing to impose additional restrictions on the SEM that allow identification of the causal
DAG (rather than its CPDAG).
21.3.1 Constraint-Based Methods
Constraint-based methods learn the CPDAG by exploiting conditional independence
constraints in the observational distribution. The most prominent example of such a method
is probably the PC algorithm [60]. This algorithm first estimates the skeleton of the
underlying CPDAG, and then determines the orientation of as many edges as possible.
We discuss the estimation of the skeleton in more detail. Recall that, under the Markov and faithfulness assumptions, two nodes X_i and X_j are adjacent in the CPDAG if and only if they are conditionally dependent given all subsets of X \ {X_i, X_j}. Therefore, adjacency of X_i and X_j can be determined by testing X_i ⊥ X_j | S for all possible subsets S ⊆ X \ {X_i, X_j}.
This naive approach is used in the SGS algorithm [60]. It quickly becomes computationally
infeasible for a large number of variables.
The PC algorithm avoids this computational trap by using the following fact about DAGs: two nodes X_i and X_j in a DAG G are d-separated by some subset of the remaining nodes if and only if they are d-separated by pa(X_i, G) or by pa(X_j, G). This fact may seem of little help at first, because we do not know pa(X_i, G) and pa(X_j, G) (then we would know the DAG!). It is helpful, however, because it allows a clever ordering of the conditional independence tests in the PC algorithm, as follows. The algorithm starts with a complete undirected graph. It then assesses, for all pairs of variables, whether they are marginally independent. If a pair of variables is found to be independent, then the edge
between them is removed. Next, for each pair of nodes (X_i, X_j) that are still adjacent, it tests conditional independence of the corresponding random variables given all possible subsets of size 1 of adj(X_i, G') \ {X_j} and of adj(X_j, G') \ {X_i}, where G' is the current graph. Again, it removes the edge if such a conditional independence is deemed to be true. The algorithm continues in this way, considering conditioning sets of increasing size, until the size of the conditioning sets is larger than the size of the adjacency sets of the nodes.
This procedure gives the correct skeleton when using perfect conditional independence
information. To see this, note that at any point in the procedure, the current graph is a
supergraph of the skeleton of the CPDAG. By construction of the algorithm, this ensures
that X_i ⊥ X_j | pa(X_i, G) and X_i ⊥ X_j | pa(X_j, G) were assessed.
After applying certain edge orientation rules, the output of the PC algorithm is a
partially directed graph, the estimated CPDAG. This output depends on the ordering
of the variables (except in the limit of an infinite sample size), because the ordering
determines which conditional independence tests are done. This issue was studied in [14],
where it was shown that the order-dependence can be very severe in high-dimensional
settings with many variables and a small sample size (see Section 21.4.3 for a data
example). Moreover, an order-independent version of the PC algorithm, called PC-stable,
was proposed in [14]. This version is now the default implementation in the R-package
pcalg [29].
We note that the user has to specify a significance level α for the conditional
independence tests. Because of multiple testing, this parameter does not play the role of an
overall significance level. It should rather be viewed as a tuning parameter for the algorithm,
where smaller values of α typically lead to sparser graphs.
The PC and PC-stable algorithms are computationally feasible for sparse graphs with
thousands of variables. Both PC and PC-stable were shown to be consistent in sparse high-
dimensional settings, when the joint distribution is multivariate Gaussian and conditional
independence is assessed by testing for zero partial correlation [14,28]. By using rank
class of Gaussian copula or nonparanormal models [21].
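For intuition, here is a heavily simplified Python sketch of the skeleton phase described above, using Fisher's z-test for zero partial correlation (so it implicitly assumes multivariate Gaussian data). It keeps the PC-stable idea of freezing the adjacency sets within each level, but omits the edge orientation rules and many refinements; full implementations of PC and PC-stable are available via the function pc in the R-package pcalg.

```python
import itertools
import numpy as np
from scipy import stats

def partial_corr_pvalue(corr, i, j, S, n):
    """Fisher z-test p-value for zero partial correlation of X_i and X_j given X_S."""
    idx = [i, j] + list(S)
    prec = np.linalg.pinv(corr[np.ix_(idx, idx)])
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    z = np.arctanh(np.clip(r, -0.999999, 0.999999))
    stat = abs(z) * np.sqrt(n - len(S) - 3)
    return 2 * stats.norm.sf(stat)

def pc_skeleton(data, alpha=0.01):
    """Simplified PC-style skeleton search (no orientation phase)."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    adj = {i: set(range(p)) - {i} for i in range(p)}   # start from the complete graph
    level = 0
    while any(len(adj[i]) - 1 >= level for i in range(p)):
        frozen = {i: set(adj[i]) for i in range(p)}    # PC-stable: freeze adjacency sets
        for i in range(p):
            for j in sorted(frozen[i]):
                if j not in adj[i]:
                    continue                           # edge already removed at this level
                for S in itertools.combinations(sorted(frozen[i] - {j}), level):
                    if partial_corr_pvalue(corr, i, j, S, n) > alpha:
                        adj[i].discard(j)              # X_i independent of X_j given S:
                        adj[j].discard(i)              # remove the edge i - j
                        break
        level += 1
    return adj   # adjacency sets of the estimated skeleton
```

As discussed above, the significance level alpha acts as a tuning parameter: smaller values typically remove more edges and hence produce sparser skeletons.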
21.3.2 Score-Based Methods
Score-based methods learn the CPDAG by (greedily) searching for an optimally scoring
DAG, where the score measures how well the data fit the DAG, while penalizing the
complexity of the DAG.
A prominent example of such an algorithm is the greedy equivalence search (GES)
algorithm [10]. GES is a grow–shrink algorithm that consists of two phases: a forward
phase and a backward phase. The forward phase starts with an initial estimate (often the
empty graph) of the CPDAG, and sequentially adds single edges, each time choosing the
edge addition that yields the maximum improvement of the score, until the score can no
longer be improved. The backward phase starts with the output of the forward phase, and
sequentially deletes single edges, each time choosing the edge deletion that yields a maximum
improvement of the score, until the score can no longer be improved. A computational
advantage of GES over the traditional DAG-search methods is that it searches over the
space of all possible CPDAGs, instead of over the space of all possible DAGs.
The GES algorithm requires the scoring criterion to be score equivalent, meaning
that every DAG in a Markov equivalence class gets the same score. Moreover, the
choice of scoring criterion is crucial for computational and statistical performances. The
so-called decomposability property of a scoring criterion allows fast updates of scores during
the forward and the backward phase. For example, (penalized) log-likelihood scores are decomposable.
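To illustrate decomposability, the following sketch (our own, for the Gaussian/linear case) computes a BIC-type score of a DAG as a sum of node-wise terms, each depending only on a node and its parents:

```python
import numpy as np

def gaussian_bic(data, parents):
    """BIC-type score of a DAG for Gaussian data, as a sum of local (node, parents) terms.

    parents: dict mapping each column index of `data` to the list of its parent indices.
    """
    n, _ = data.shape
    score = 0.0
    for j, pa in parents.items():
        y = data[:, j]
        X = np.column_stack([np.ones(n)] + [data[:, k] for k in pa])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = resid @ resid / n                         # MLE of the error variance
        loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
        score += loglik - 0.5 * np.log(n) * (len(pa) + 2)  # penalty: coefficients + variance
    return score
```

Because the score is a sum over nodes, rescoring after a single-edge addition or deletion in the forward or backward phase only requires recomputing the local term of the affected child node.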