Throughout, we make several assumptions about the model. The graph G is assumed to
be acyclic (hence a DAG), and the error terms ε_1, ..., ε_p are jointly independent. In terms
of the causal interpretation, these assumptions mean that we do not allow feedback loops
nor unmeasured confounding variables. The above model with these assumptions was called
a structural causal model by [42]. We will simply refer to it as a structural equation model
(SEM ). If all structural equations are linear, we will call it a linear SEM.
We now discuss two important properties of SEMs, namely factorization and d-separation. If X_1, ..., X_p are generated from an SEM with causal DAG G, then the density f(x_1, ..., x_p) of X_1, ..., X_p (assuming it exists) factorizes as

\[
f(x_1, \dots, x_p) = \prod_{i=1}^{p} f_i\bigl(x_i \mid \mathrm{pa}(x_i, G)\bigr), \tag{21.2}
\]

where f_i(x_i | pa(x_i, G)) is the conditional density of X_i given pa(X_i, G).
If a density factorizes according to a DAG as in Equation 21.2, then one can use the DAG to read off conditional independencies that must hold in the distribution (regardless of the choice of the f_i(·)'s), using a graphical criterion called d-separation (see, e.g., Definition 1 in [43]). In particular, the so-called global Markov property implies that when two disjoint sets A and B of vertices are d-separated by a third disjoint set S, then A and B are conditionally independent given S (A ⊥ B | S) in any distribution that factorizes according to the DAG.
Example 21.3 We consider the following structural equations and the corresponding causal DAG for the random variables P, S, R, and M:

P ← g_1(M, ε_P)
S ← g_2(P, ε_S)
R ← g_3(M, S, ε_R)
M ← g_4(ε_M)

[Figure: the corresponding causal DAG on the nodes M, P, S, R, with edges M → P, P → S, S → R, and M → R.]

where ε_P, ε_S, ε_R, and ε_M are mutually independent with arbitrary mean zero distributions. For each structural equation, the variables on the right-hand side appear in the causal DAG as the parents of the variable on the left-hand side.
We denote the random variables by M, P, S, and R, because these structural equations can be used to describe a possible causal mechanism behind the prisoners data (Example 21.1), where M = measure of motivation, P = participation in the program (P = 1 means participation, P = 0 otherwise), S = measure of social skills taught by the program, and R = rearrest (R = 1 means rearrest, R = 0 otherwise).
We see that the causal DAG of this SEM indeed provides a clear and compact description of its causal assumptions. In particular, it allows that motivation directly affects participation and rearrest. Moreover, it allows that participation directly affects social skills, and that social skills directly affect rearrest. The missing edge between M and S encodes the assumption that there is no direct effect of motivation on social skills. In other words, any effect of motivation on social skills goes entirely through participation (see the path M → P → S). Similarly, the missing edge between P and R encodes the assumption that there is no direct effect of participation on rearrest; any effect of participation on rearrest must fully go through social skills (see the path P → S → R).
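To make the generative reading of these structural equations concrete, the following minimal Python sketch simulates data from this SEM under one hypothetical choice of the functions g_1, ..., g_4 and of the error distributions (the chapter leaves these unspecified), and empirically checks one conditional independence implied by d-separation, namely M ⊥ S | P:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical structural functions and mean-zero errors (our own illustrative choices).
eps_M, eps_P, eps_S, eps_R = (rng.normal(size=n) for _ in range(4))

M = eps_M                                         # M <- g_4(eps_M): motivation
P = (M + eps_P > 0).astype(float)                 # P <- g_1(M, eps_P): participation (0/1)
S = 2.0 * P + eps_S                               # S <- g_2(P, eps_S): social skills
R = (-0.5 * M - S + eps_R > 0).astype(float)      # R <- g_3(M, S, eps_R): rearrest (0/1)

# M and S are d-separated given P in this DAG, so the global Markov property predicts
# M independent of S given P; here we check that the sample partial correlation is ~0.
res_M = M - np.polyval(np.polyfit(P, M, 1), P)    # residual of M after regressing on P
res_S = S - np.polyval(np.polyfit(P, S, 1), P)    # residual of S after regressing on P
print(np.corrcoef(res_M, res_S)[0, 1])            # close to 0
```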
21.2.3 Postintervention Distributions and Causal Effects
Now how does the framework of the SEM allow us to move between the observational and experimental worlds? This is straightforward, because an intervention at some variable X_i simply means that we change the generating mechanism of X_i, that is, we change the corresponding structural equation g_i(·) (and leave the other structural equations unchanged). For example, one can let X_i ← ε̃_i, where ε̃_i has some given distribution, or X_i ← x_i' for some fixed value x_i' in the support of X_i. The latter is often denoted as Pearl's do-intervention do(X_i = x_i') and is interpreted as setting the variable X_i to the value x_i' by an outside intervention, uniformly over the entire population [43].
Example 21.4 In the prisoners example (see Examples 21.1 and 21.3), the quantity P(R = 1 | do(P = 1)) represents the rearrest probability when all prisoners are forced to participate in the program, while P(R = 1 | do(P = 0)) is the rearrest probability if no prisoner is allowed to participate in the program. We emphasize that these quantities are generally not equal to the usual conditional probabilities P(R = 1 | P = 1) and P(R = 1 | P = 0), which represent the rearrest probabilities among prisoners who choose to participate or not to participate in the program.

In the gene expression example (see Example 21.2), let X_i and X_j represent the gene expression levels of genes i and j. Then E(X_j | do(X_i = x_i)) represents the average expression level of gene j after setting the gene expression level of gene i to the value x_i by an outside intervention.
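To see the difference numerically, here is a small self-contained simulation sketch in the spirit of the prisoner SEM of Example 21.3 (the functional forms and parameters are our own illustrative choices): conditioning on P selects the self-selected subpopulation, whereas do(P = p) replaces the structural equation for P for everyone.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def simulate(do_P=None):
    """Draw n samples from an illustrative prisoner SEM; do_P forces P for everyone."""
    M = rng.normal(size=n)                                     # motivation
    P = (M + rng.normal(size=n) > 0).astype(float)             # participation (self-selected)
    if do_P is not None:
        P = np.full(n, float(do_P))                            # do(P = do_P): replace g_1
    S = 2.0 * P + rng.normal(size=n)                           # social skills
    R = (-0.5 * M - S + rng.normal(size=n) > 0).astype(float)  # rearrest
    return P, R

P, R = simulate()                                       # observational world
print("P(R=1 | P=1)     ", R[P == 1].mean())            # among those who chose to participate
print("P(R=1 | P=0)     ", R[P == 0].mean())
print("P(R=1 | do(P=1)) ", simulate(do_P=1)[1].mean())  # everyone forced to participate
print("P(R=1 | do(P=0)) ", simulate(do_P=0)[1].mean())  # no one allowed to participate
```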
21.2.3.1 Truncated Factorization Formula
A do-intervention on X_i means that X_i no longer depends on its former parents in the DAG, so that the incoming edges into X_i can be removed. This leads to a so-called truncated DAG. The postintervention distribution factorizes according to this truncated DAG, so that we get

\[
f\bigl(x_1, \dots, x_p \mid do(X_i = x_i')\bigr) =
\begin{cases}
\prod_{j \neq i} f_j\bigl(x_j \mid \mathrm{pa}(x_j, G)\bigr) & \text{if } x_i = x_i', \\
0 & \text{otherwise.}
\end{cases}
\tag{21.3}
\]
This is called the truncated factorization formula [41], the manipulation formula [59] or
the g-formula [52]. Note that this formula heavily uses the factorization formula (Equation
21.2) and the autonomy assumption (see page 391).
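As a concrete illustration of Equation 21.3, the following sketch evaluates the truncated factorization for a small hypothetical binary SEM with DAG X_1 → X_2, X_1 → X_3, X_2 → X_3 (all conditional probability tables are our own toy choices), and contrasts the resulting interventional probability with ordinary conditioning:

```python
# Toy conditional probability tables (hypothetical numbers) for a binary SEM with
# DAG X1 -> X2, X1 -> X3, X2 -> X3, so X1 confounds the effect of X2 on X3.
f1 = {1: 0.3, 0: 0.7}                                       # f_1(x1)
f2 = {(1, 1): 0.8, (0, 1): 0.2, (1, 0): 0.2, (0, 0): 0.8}   # f_2(x2 | x1), keys (x2, x1)
f3 = {(1, x1, x2): 0.1 + 0.3 * x1 + 0.4 * x2                # f_3(x3 = 1 | x1, x2)
      for x1 in (0, 1) for x2 in (0, 1)}
f3.update({(0, x1, x2): 1 - f3[(1, x1, x2)] for x1 in (0, 1) for x2 in (0, 1)})

def joint(x1, x2, x3):
    """Observational joint distribution, Equation 21.2."""
    return f1[x1] * f2[(x2, x1)] * f3[(x3, x1, x2)]

def joint_do_x2(x1, x2, x3, x2_star=1):
    """Postintervention distribution under do(X2 = x2_star), Equation 21.3: drop f_2."""
    return f1[x1] * f3[(x3, x1, x2)] if x2 == x2_star else 0.0

p_do = sum(joint_do_x2(x1, 1, 1) for x1 in (0, 1))           # P(X3 = 1 | do(X2 = 1))
p_obs = (sum(joint(x1, 1, 1) for x1 in (0, 1)) /
         sum(joint(x1, 1, x3) for x1 in (0, 1) for x3 in (0, 1)))  # P(X3 = 1 | X2 = 1)
print(p_do, p_obs)   # the two generally differ because X1 confounds X2 and X3
```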
21.2.3.2 Defining the Total Effect
Summary measures of the postintervention distribution can be used to define total causal effects. In the prisoners example, it is natural to define the total effect of P on R as P(R = 1 | do(P = 1)) − P(R = 1 | do(P = 0)). Again, we emphasize that this is different from P(R = 1 | P = 1) − P(R = 1 | P = 0). In a setting with continuous variables, the total effect of X_i on Y can be defined as

\[
\frac{\partial}{\partial x_i} E\bigl(Y \mid do(X_i = x_i)\bigr) \Big|_{x_i = x_i'}.
\]
21.2.3.3 Computing the Total Effect
A total effect can be computed using, for example, covariate adjustment [43,57], inverse
probability weighting (IPW) [23,53], or instrumental variables (e.g., [4]). In all these
methods, the causal DAG plays an important role, because it tells us which variables can be
used for covariate adjustment, which variables can be used as instruments, or which weights
should be used in IPW.
In this chapter, we focus mostly on linear SEMs. In this setting, the total effect of X_i on Y can be easily computed via linear regression with covariate adjustment. If Y ∈ pa(X_i, G), then the effect of X_i on Y equals zero. Otherwise, it equals the regression coefficient of X_i in the linear regression of Y on X_i and pa(X_i, G) (see Proposition 3.1 of [39]). In other words, we simply regress Y on X_i while adjusting for the parents of X_i in the causal DAG. This is also called adjusting for direct causes of the intervention variable.
Example 21.5 We consider the following linear SEM:

X_1 ← 2X_4 + ε_1
X_2 ← 3X_1 + ε_2
X_3 ← 2X_2 + X_4 + ε_3
X_4 ← ε_4

[Figure: the corresponding causal DAG on X_1, X_2, X_3, X_4, with edge weights 2 on X_4 → X_1, 3 on X_1 → X_2, 2 on X_2 → X_3, and 1 on X_4 → X_3.]
The errors are mutually independent with arbitrary mean zero distributions. We note that
the coefficients in the structural equations are depicted as edge weights in the causal DAG.
Suppose we are interested in the total effect of X_1 on X_3. Then we consider an outside intervention that sets X_1 to the value x_1, that is, do(X_1 = x_1). This means that we change the structural equation for X_1 to X_1 ← x_1. Because the other structural equations do not change, we then obtain X_2 = 3x_1 + ε_2, X_4 = ε_4, and X_3 = 2X_2 + X_4 + ε_3 = 6x_1 + 2ε_2 + ε_4 + ε_3. Hence, E(X_3 | do(X_1 = x_1)) = 6x_1, and differentiating with respect to x_1 yields a total effect of 6.
We note that the total effect of X_1 on X_3 also equals the product of the edge weights along the directed path X_1 → X_2 → X_3. This is true in general for linear SEMs: the total effect of X_i on Y can be obtained by multiplying the edge weights along each directed path from X_i to Y, and then summing over the directed paths (if there is more than one).
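This path rule can be written compactly in matrix form: if B denotes the weighted adjacency matrix of the DAG, then, because B is nilpotent for a DAG, (I − B)^{-1} = I + B + B^2 + ... sums the products of edge weights over all directed paths, so its off-diagonal entries are the total effects. A minimal numpy sketch for Example 21.5 (the indexing convention is ours):

```python
import numpy as np

# Weighted adjacency matrix of the DAG in Example 21.5: B[j, i] is the weight of the
# edge from X_{i+1} to X_{j+1} (0-based indices), read off the structural equations.
B = np.zeros((4, 4))
B[0, 3] = 2   # X4 -> X1
B[1, 0] = 3   # X1 -> X2
B[2, 1] = 2   # X2 -> X3
B[2, 3] = 1   # X4 -> X3

effects = np.linalg.inv(np.eye(4) - B)   # = I + B + B^2 + ... (B is nilpotent)
print(effects[2, 0])   # total effect of X1 on X3: 3 * 2 = 6
print(effects[2, 3])   # total effect of X4 on X3: 1 + 2 * 3 * 2 = 13
```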
The total effect can also be obtained via regression. Because pa(X_1, G) = {X_4}, the total effect of X_1 on X_3 equals the coefficient of X_1 in the regression of X_3 on X_1 and X_4. It can be easily verified that this again yields 6. One can also verify that adjusting for any other subset of {X_2, X_4} does not yield the correct total effect.
21.3 Causal Structure Learning
The material in the previous section can be used if the causal DAG is known. In settings
with big data, however, it is rare that one can draw the causal DAG. In this section, we
therefore consider methods for learning DAGs from observational data. Such methods are
called causal structure learning methods.
Recall from Section 21.2.2 that DAGs encode conditional independencies via
d-separation. Thus, by considering conditional independencies in the observational distribution, one may hope to reverse-engineer the causal DAG that generated the data.
Unfortunately, this does not work in general, because the same set of d-separation
relationships can be encoded by several DAGs. Such DAGs are called Markov equivalent
and form a Markov equivalence class.
A Markov equivalence class can be described uniquely by a completed partially directed acyclic graph (CPDAG) [3,9]. The skeleton of the CPDAG is defined as follows. Two nodes X_i and X_j are adjacent in the CPDAG if and only if, in any DAG in the Markov equivalence class, X_i and X_j cannot be d-separated by any set of the remaining nodes. The orientation of the edges in the CPDAG is as follows. A directed edge X_i → X_j in the CPDAG means that the edge X_i → X_j occurs in all DAGs in the Markov equivalence class. An undirected edge X_i - X_j in the CPDAG means that there is a DAG in the Markov equivalence class with X_i → X_j, as well as a DAG with X_i ← X_j.
It can happen that a distribution contains more conditional independence relationships
than those that are encoded by the DAG via d-separation. If this is not the case, then the
distribution is called faithful with respect to the DAG. If a distribution is both Markov
and faithful with respect to a DAG, then the conditional independencies in the distribution
correspond exactly to d-separation relationships in the DAG, and the DAG is called a perfect
map of the distribution.
Problem setting. Throughout this section, we consider the following setting. We are given n i.i.d. observations of X, where X = (X_1, ..., X_p) is generated from an SEM. We assume that the corresponding causal DAG G is a perfect map of the distribution of X. We aim to learn the Markov equivalence class of G.
In the following three subsections we discuss so-called constraint-based, score-based,
and hybrid methods for this task. The discussed algorithms are available in the R-package
pcalg [29]. In the last subsection we discuss a class of methods that can be used if one is
willing to impose additional restrictions on the SEM that allow identification of the causal
DAG (rather than its CPDAG).
21.3.1 Constraint-Based Methods
Constraint-based methods learn the CPDAG by exploiting conditional independence
constraints in the observational distribution. The most prominent example of such a method
is probably the PC algorithm [60]. This algorithm first estimates the skeleton of the
underlying CPDAG, and then determines the orientation of as many edges as possible.
We discuss the estimation of the skeleton in more detail. Recall that, under the Markov and faithfulness assumptions, two nodes X_i and X_j are adjacent in the CPDAG if and only if they are conditionally dependent given all subsets of X \ {X_i, X_j}. Therefore, adjacency of X_i and X_j can be determined by testing X_i ⊥ X_j | S for all possible subsets S ⊆ X \ {X_i, X_j}.
This naive approach is used in the SGS algorithm [60]. It quickly becomes computationally
infeasible for a large number of variables.
The PC algorithm avoids this computational trap by using the following fact about DAGs: two nodes X_i and X_j in a DAG G are d-separated by some subset of the remaining nodes if and only if they are d-separated by pa(X_i, G) or by pa(X_j, G). This fact may seem of little help at first, because we do not know pa(X_i, G) and pa(X_j, G) (then we would know the DAG!). It is helpful, however, because it allows a clever ordering of the conditional independence tests in the PC algorithm, as follows. The algorithm starts with a complete undirected graph. It then assesses, for all pairs of variables, whether they are marginally independent. If a pair of variables is found to be independent, then the edge
between them is removed. Next, for each pair of nodes (X_i, X_j) that are still adjacent, it tests conditional independence of the corresponding random variables given all possible subsets of size 1 of adj(X_i, G') \ {X_j} and of adj(X_j, G') \ {X_i}, where G' is the current graph. Again, it removes the edge if such a conditional independence is deemed to be true. The algorithm continues in this way, considering conditioning sets of increasing size, until the size of the conditioning sets is larger than the size of the adjacency sets of the nodes.
This procedure gives the correct skeleton when using perfect conditional independence
information. To see this, note that at any point in the procedure, the current graph is a
supergraph of the skeleton of the CPDAG. By construction of the algorithm, this ensures
that X_i ⊥ X_j | pa(X_i, G) and X_i ⊥ X_j | pa(X_j, G) were assessed.
After applying certain edge orientation rules, the output of the PC algorithm is a
partially directed graph, the estimated CPDAG. This output depends on the ordering
of the variables (except in the limit of an infinite sample size), because the ordering
determines which conditional independence tests are done. This issue was studied in [14],
where it was shown that the order-dependence can be very severe in high-dimensional
settings with many variables and a small sample size (see Section 21.4.3 for a data
example). Moreover, an order-independent version of the PC algorithm, called PC-stable,
was proposed in [14]. This version is now the default implementation in the R-package
pcalg [29].
We note that the user has to specify a significance level α for the conditional
independence tests. Because of multiple testing, this parameter does not play the role of an
overall significance level. It should rather be viewed as a tuning parameter for the algorithm,
where smaller values of α typically lead to sparser graphs.
The PC and PC-stable algorithms are computationally feasible for sparse graphs with
thousands of variables. Both PC and PC-stable were shown to be consistent in sparse high-
dimensional settings, when the joint distribution is multivariate Gaussian and conditional
independence is assessed by testing for zero partial correlation [14,28]. By using rank
class of Gaussian copula or nonparanormal models [21].
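For intuition, here is a heavily simplified Python sketch of the skeleton phase described above, using Fisher's z-test for zero partial correlation (so it implicitly assumes multivariate Gaussian data). It keeps the PC-stable idea of freezing the adjacency sets within each level, but omits the edge orientation rules and many refinements; full implementations of PC and PC-stable are available via the function pc in the R-package pcalg.

```python
import itertools
import numpy as np
from scipy import stats

def partial_corr_pvalue(corr, i, j, S, n):
    """Fisher z-test p-value for zero partial correlation of X_i and X_j given X_S."""
    idx = [i, j] + list(S)
    prec = np.linalg.pinv(corr[np.ix_(idx, idx)])
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    z = np.arctanh(np.clip(r, -0.999999, 0.999999))
    stat = abs(z) * np.sqrt(n - len(S) - 3)
    return 2 * stats.norm.sf(stat)

def pc_skeleton(data, alpha=0.01):
    """Simplified PC-style skeleton search (no orientation phase)."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    adj = {i: set(range(p)) - {i} for i in range(p)}   # start from the complete graph
    level = 0
    while any(len(adj[i]) - 1 >= level for i in range(p)):
        frozen = {i: set(adj[i]) for i in range(p)}    # PC-stable: freeze adjacency sets
        for i in range(p):
            for j in sorted(frozen[i]):
                if j not in adj[i]:
                    continue                           # edge already removed at this level
                for S in itertools.combinations(sorted(frozen[i] - {j}), level):
                    if partial_corr_pvalue(corr, i, j, S, n) > alpha:
                        adj[i].discard(j)              # X_i independent of X_j given S:
                        adj[j].discard(i)              # remove the edge i - j
                        break
        level += 1
    return adj   # adjacency sets of the estimated skeleton
```

As discussed above, the significance level alpha acts as a tuning parameter: smaller values typically remove more edges and hence produce sparser skeletons.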
21.3.2 Score-Based Methods
Score-based methods learn the CPDAG by (greedily) searching for an optimally scoring
DAG, where the score measures how well the data fit the DAG, while penalizing the
complexity of the DAG.
A prominent example of such an algorithm is the greedy equivalence search (GES)
algorithm [10]. GES is a grow–shrink algorithm that consists of two phases: a forward
phase and a backward phase. The forward phase starts with an initial estimate (often the
empty graph) of the CPDAG, and sequentially adds single edges, each time choosing the
edge addition that yields the maximum improvement of the score, until the score can no
longer be improved. The backward phase starts with the output of the forward phase, and
sequentially deletes single edges, each time choosing the edge deletion that yields a maximum
improvement of the score, until the score can no longer be improved. A computational
advantage of GES over the traditional DAG-search methods is that it searches over the
space of all possible CPDAGs, instead of over the space of all possible DAGs.
The GES algorithm requires the scoring criterion to be score equivalent, meaning
that every DAG in a Markov equivalence class gets the same score. Moreover, the
choice of scoring criterion is crucial for computational and statistical performances. The
so-called decomposability property of a scoring criterion allows fast updates of scores during
the forward and the backward phase. For example, (penalized) log-likelihood scores are decomposable.
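To illustrate decomposability, the following sketch (our own, for the Gaussian/linear case) computes a BIC-type score of a DAG as a sum of node-wise terms, each depending only on a node and its parents:

```python
import numpy as np

def gaussian_bic(data, parents):
    """BIC-type score of a DAG for Gaussian data, as a sum of local (node, parents) terms.

    parents: dict mapping each column index of `data` to the list of its parent indices.
    """
    n, _ = data.shape
    score = 0.0
    for j, pa in parents.items():
        y = data[:, j]
        X = np.column_stack([np.ones(n)] + [data[:, k] for k in pa])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = resid @ resid / n                         # MLE of the error variance
        loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
        score += loglik - 0.5 * np.log(n) * (len(pa) + 2)  # penalty: coefficients + variance
    return score
```

Because the score is a sum over nodes, rescoring after a single-edge addition or deletion in the forward or backward phase only requires recomputing the local term of the affected child node.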