2.5 Case of Big-p
Our focus now shifts to statistical considerations in Big-p settings. The Big-p case has become commonplace, even central, to the Big Data field, yet the problems it brings are not always clearly identified and discussed. We will explore such points in the following sections.
2.5.1 Running Examples: Linear Regression and PCA
For concreteness, let us take linear regression models and principal components analysis (PCA) as our running examples.
Linear models are important due to their ubiquity, and due to the fact, noted in Section
2.5.2, that nonparametric regression methodology may be problematic in Big-p settings.
PCA plays a major role in many analyses of Big Data, ironically because it is often used
for dimensionality reduction—turning Big-p into Small-p.
We will assume the standard notation for the regression case: We have a matrix X of
predictor variable data and vector Y of response variable data, and the usual formula
\hat{\beta} = (X'X)^{-1} X'Y    (2.2)
gives us the vector of estimated coefficients.
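As a quick illustration of Equation 2.2, here is a minimal R sketch (our own simulated data and variable names, purely illustrative) computing the closed-form estimate and checking it against lm():

# Minimal sketch: closed-form OLS estimate of Equation 2.2 vs. lm(), simulated data
set.seed(1)
n <- 100
X <- cbind(1, matrix(rnorm(n * 3), n, 3))      # predictor matrix, including an intercept column
beta <- c(2, 1, -1, 0.5)                       # arbitrary "true" coefficients
Y <- as.vector(X %*% beta + rnorm(n))          # simulated response
betahat <- solve(t(X) %*% X, t(X) %*% Y)       # (X'X)^{-1} X'Y, Equation 2.2
cbind(closed_form = as.vector(betahat), via_lm = coef(lm(Y ~ X - 1)))   # the two columns agree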
2.5.2 Curse of Dimensionality
The curse of dimensionality (COD), an old notion due to Bellman, is often mentioned in
Big Data contexts. Indeed, Bellman’s concept may be viewed as the forerunner of Big-p
research.
2.5.2.1 Data Sparsity
A major aspect of the COD is the idea that most points in a bounded region of high-
dimensional space are distant from each other. For instance, consider nearest-neighbor re-
gression, in which we predict the response variable—the Y value—of a new observation to
be the average of the responses of the k observations in our training set whose predictor vari-
ables are closest to the new one. With high values of p, one can show that those neighboring
observations tend to be far from the new one [2]. They thus tend to be unrepresentative, especially at the edges of the data space.
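A small simulation makes the point concrete; this R sketch (entirely our own, with arbitrary sample sizes) measures how far a new observation lies from its nearest neighbor in a fixed-size training set as p grows:

# Sketch: distance from a new point to its nearest training neighbor grows with p
set.seed(1)
nn_dist <- function(p, n = 1000) {
  train <- matrix(runif(n * p), n, p)        # training predictors, uniform on the unit cube
  newx <- runif(p)                           # a new observation
  d <- sqrt(colSums((t(train) - newx)^2))    # Euclidean distances to all training points
  min(d)                                     # distance to the nearest neighbor
}
sapply(c(2, 10, 50, 200), nn_dist)           # grows steadily with p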
These considerations are described dramatically in [18, p. 101]: “To maintain a given
degree of accuracy of [a nonparametric regression] estimator, the sample size must increase
exponentially with [p].” This is quite an irony. Early researchers in nonparametric regression
methods would have envied analysts who enjoy today’s Big-n datasets, as the nonparametric
methods need large n to be effective, for fixed p. But the COD implies that all this can be
lost if we are also in a Big-p setting. Thus, parametric models turn out to be useful even
with Big Data.
However, arguably the impact of the COD is not yet fully clear. For example, a common
explanation for the COD is that as p →∞, the volume of the unit hypersphere relative to
that of the unit hypercube goes to zero [4]. Therefore, if the variables have, say, a uniform
distribution on the hypercube, the probability that a point is within 0.5 unit of the center
goes to 0. This supports the notion that data for large p are sparse, a condition we will refer to here as data sparsity, to distinguish it from the notion of variable sparsity introduced earlier.
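The volume calculation behind this argument is easy to check. Taking the ball of radius 0.5 inscribed in the unit cube [0,1]^p (one common version of the statement; conventions vary), the fraction of the cube's volume it occupies is \pi^{p/2} (0.5)^p / \Gamma(p/2 + 1), which a one-line R function confirms shrinks rapidly:

# Fraction of the unit cube [0,1]^p occupied by the inscribed ball of radius 0.5
ball_frac <- function(p) pi^(p / 2) * 0.5^p / gamma(p / 2 + 1)
signif(ball_frac(c(2, 5, 10, 20)), 3)    # 0.785, 0.164, 0.00249, 2.46e-08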
Yet as pointed out in [12], this data sparsity argument depends crucially on choosing
the l_2 (Euclidean) metric

d[(x_1, ..., x_p), (y_1, ..., y_p)] = \sqrt{(x_1 - y_1)^2 + \cdots + (x_p - y_p)^2}    (2.3)
If one instead uses the l_\infty metric

d[(x_1, ..., x_p), (y_1, ..., y_p)] = \max(|x_1 - y_1|, ..., |x_p - y_p|)    (2.4)
then all points are within 0.5 of the center, so that the volume argument no longer implies
sparsity.
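A quick Monte Carlo check (ours; the sample sizes and dimensions are arbitrary) shows the contrast between the two metrics directly, by estimating the proportion of uniformly distributed points lying within 0.5 of the cube's center under each:

# Sketch: proportion of points within 0.5 of the center of the unit cube, l2 vs. l-infinity
set.seed(1)
within_half <- function(p, n = 10000) {
  x <- matrix(runif(n * p), n, p) - 0.5       # points on the unit cube, centered
  l2 <- sqrt(rowSums(x^2))                    # Euclidean distance to the center
  linf <- apply(abs(x), 1, max)               # l-infinity distance to the center
  c(l2 = mean(l2 < 0.5), linf = mean(linf < 0.5))
}
sapply(c(2, 5, 10, 20), within_half)          # the l2 row drops to 0; the l-infinity row stays at 1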
However, there are other arguments for sparsity that do not involve a specific metric.
In [7], for instance, it is shown, again with the unit hypercube as our data space, that the
size of a subcube needed to contain a certain fraction of our data grows with p. The results
in [2] apply to general metrics.
2.5.2.2 Variable Sparsity May Ameliorate the COD
The work showing data sparsity in Section 2.5.2.1 shares a common assumption: all variables/dimensions are treated identically. Revisiting this assumption may be the key
to dealing with the COD.
In nonparametric prediction applications, for instance, it may be that the unweighted
sums in Equations 2.3 and 2.4 should be replaced by weighted ones, with the more important
predictor variables carrying larger weights, say
d[(x_1, ..., x_p), (y_1, ..., y_p)] = \sqrt{w_1 (x_1 - y_1)^2 + \cdots + w_p (x_p - y_p)^2}    (2.5)
Then as p grows, neighbors of a given point are mainly those points that are close for
the important predictor variables. Sparsity is then not an issue, and the COD becomes
less relevant [12]. In essence, the effective value of p remains bounded as the actual value
increases.
Of course, finding the proper set of weights would be a challenge. If one knows the rough order of importance of the predictors in a regression setting, one might try, say, w_i = i^{-\gamma}, and choose \gamma by cross-validation, for instance. But in any case, the above argument at least shows that, in principle, the COD may not be as large an obstacle as presently thought.
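A sketch of such a weighted metric, using the weighting scheme w_i = i^{-\gamma} just mentioned (the function name and the choice \gamma = 1.5 are ours, purely for illustration; in practice \gamma would be tuned, e.g., by cross-validation):

# Sketch: weighted Euclidean distance of Equation 2.5 with w_i = i^(-gamma)
wdist <- function(x, y, gamma) {
  w <- (seq_along(x))^(-gamma)     # heavier weights on the (presumed) more important predictors
  sqrt(sum(w * (x - y)^2))
}
set.seed(1)
wdist(runif(100), runif(100), gamma = 1.5)    # distance between two points in p = 100 dimensions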
Another approach to the COD has been assumptions of variable sparsity. In regression,
this may be a reasonable assumption, motivated as follows. Suppose
E(Y | X_1, X_2, ...) = \sum_{i=1}^{\infty} \beta_i X_i    (2.6)

say with the X_i i.i.d. N(0,1). Then the simple existence of the right-hand side implies that \beta_i \to 0 as i \to \infty: the variance of the right-hand side is \sum_i \beta_i^2, which must be finite. In other words, only a finite number of the \beta_i are large, fitting our definition of variable sparsity. And typically our predictors are correlated, with this again reducing the effective value of p.
This is part of the motivation of LASSO regression [16]. Here, we choose the estimate \hat{\beta} to minimize

||Y - X\hat{\beta}||^2    (2.7)

as usual, except now with the constraint

\sum_i |\hat{\beta}_i| \le \lambda    (2.8)
This typically yields an estimator with many 0s, that is, a sparse solution.
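For readers who wish to see this sparsity in action, here is a minimal sketch on simulated data. It uses the R glmnet package rather than the chapter's own lars example of Section 2.5.5, and all settings are our illustrative choices:

# Sketch: LASSO produces many exactly-zero coefficient estimates (glmnet, simulated data)
library(glmnet)
set.seed(1)
n <- 200; p <- 50
x <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, 1.5, rep(0, p - 3))          # only 3 truly nonzero coefficients
y <- as.vector(x %*% beta + rnorm(n))
cvfit <- cv.glmnet(x, y)                      # choose the penalty by cross-validation
b <- coef(cvfit, s = "lambda.min")            # fitted coefficients (sparse matrix)
sum(b != 0)                                   # far fewer than p + 1 nonzero entries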
The situation is not so clearcut in the case of PCA. To see why, first recall that the
eigenvalues in PCA represent an apportionment of the total variance of our p variables.
Many analysts create a scree plot, displaying the eigenvalues in decreasing order, with the
goal of determining an effective value of p. This produces good results in many cases,
especially if the data have been collected with regression of a specific variable as one’s goal.
The above discussion involving Equation 2.6 suggests that a scree plot will begin to flatten
after a few eigenvalues.
By contrast, consider a dataset of people, for instance. Potentially there are a virtually
limitless number of variables that could be recorded for each person, say: height, weight,
age, gender, hair color (multivariate, several frequencies), years of education, highest degree
attained, field of degree, age of first trip more than 100 miles from home, number of surgeries,
type of work, length of work experience, number of coworkers, number of times filing for
political office, latitude/longitude/altitude of home, marital status, number of marriages,
hours of sleep per night and many, many more. While there are some correlations between
these variables, it is clear that the effective value of p can be extremely large.
In this kind of situation, the total variance can increase without bound as p grows, and
there will be many strong principal components. In such settings, PCA will likely not work
well for dimensionality reduction. Sparse PCA methods, such as that of [19], would not help either, since the situation is not sparse.
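The two situations are easy to contrast in a small simulation (a sketch with our own data-generating choices): with a few strong common factors, the scree plot drops sharply after the first few eigenvalues; with essentially unrelated variables, the eigenvalues remain comparable and PCA buys little:

# Sketch: scree behavior, low-dimensional structure vs. essentially unrelated variables
set.seed(1)
n <- 500; p <- 50
f <- matrix(rnorm(n * 3), n, 3)                                  # 3 common factors
x_struct <- f %*% matrix(rnorm(3 * p), 3, p) + 0.1 * matrix(rnorm(n * p), n, p)
x_unrel <- matrix(rnorm(n * p), n, p)                            # unrelated variables
ev_struct <- prcomp(x_struct, scale. = TRUE)$sdev^2              # eigenvalues, correlation scale
ev_unrel <- prcomp(x_unrel, scale. = TRUE)$sdev^2
round(head(ev_struct, 6), 2)    # a few large eigenvalues, then a sharp drop
round(head(ev_unrel, 6), 2)     # eigenvalues all of comparable size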
Another approach to dealing with the COD that may be viewed as a form of vari-
able sparsity is to make very strong assumptions about the distributions of our data in
application-specific contexts. A notable example is [5] for genomics data. Here, each gene is
assumed to have either a zero or nonzero effect, the latter with probability φ, which is to be
estimated from the data. Based on these very strong assumptions, the problem becomes in
some sense finite-dimensional, and significance tests are then performed, while controlling
for false discovery rates.
2.5.3 Principle of Extreme Values
In addition to sparsity, the COD may be viewed as a reflection of a familiar phenomenon
that will be convenient to call the principle of extreme values (PEV):
PEV: Say U_1, ..., U_p are events of low probability. As p increases, the probability that at least one of them will occur goes to 1.
This is very imprecisely stated, but all readers will immediately recognize it, as it describes,
for example, the problem of multiple comparisons mentioned earlier. It is thus not a new
concept, but since it will arise often here, we give it a name.
The PEV can be viewed as more general than the COD. Consider, for instance, linear regression models and the famous formula for the covariance matrix of the estimated coefficient vector,

Cov(\hat{\beta}) = \sigma^2 (X'X)^{-1}    (2.9)
Suppose all elements of the predictor variables were doubled. Then, the standard errors of
the estimated coefficients would be halved. In other words, the more dispersed our predictor variables are, the better. Thus, data sparsity is actually helpful here, rather than problematic as in the nonparametric regression case.
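This is easily verified numerically; in the following R sketch (simulated data of our own), doubling the predictor halves the reported standard error of its coefficient:

# Sketch: doubling the predictor values halves the coefficient standard error
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
se1 <- summary(lm(y ~ x))$coefficients["x", "Std. Error"]
se2 <- summary(lm(y ~ I(2 * x)))$coefficients["I(2 * x)", "Std. Error"]
c(se1, se2, se1 / se2)    # the ratio is 2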
Thus, the COD may not be an issue with parametric models, but the PEV always comes
into play, even in the parametric setting. Consider selection of predictor variables for a linear
regression model, for example. The likelihood of spurious results, that is, of some predictors
appearing to be important even though they are not, grows with p, again by the PEV.
There is potential trouble in this regard for PCA as well. Say we are working with variables W_1, ..., W_p, and write W = (W_1, ..., W_p)'. The largest eigenvalue is

\lambda_1 = \max_u Var(u'W) = \max_u u' Cov(W) u, \quad ||u|| = 1    (2.10)
Our estimate of \lambda_1 will be calculated using the sample covariance matrix \widehat{Cov}(W) of W in place of Cov(W) in Equation 2.10, that is,

\hat{\lambda}_1 = \max_u u' \widehat{Cov}(W) u, \quad ||u|| = 1    (2.11)
The PEV then comes into play for large p relative to n. Since we are maximizing, the estimate of \lambda_1 will be greatly biased upward. For example, suppose X_1, X_2, ... are i.i.d. N(0,1). Let W_1 = X_1 + X_2, with W_i = X_i for i > 1. In this case, the true population value of \lambda_1 is 2.62 (independent of p), while in simulated samples for various p, the following values of the estimated \lambda_1 were obtained:

    p       estimated \lambda_1
    50      2.62
    500     4.34
    5000    17.17
In other words, the PEV will likely result in severe distortions in PCA.
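The flavor of this experiment is easy to reproduce. The chapter does not report the sample size n it used, so in the sketch below n = 1000 is our arbitrary choice; only the qualitative pattern of upward bias, not the exact values in the table, should be expected:

# Sketch: upward bias of the largest sample eigenvalue as p grows, n fixed
set.seed(1)
top_eig <- function(p, n = 1000) {
  x <- matrix(rnorm(n * p), n, p)
  w <- cbind(x[, 1] + x[, 2], x[, -1])             # W1 = X1 + X2, Wi = Xi for i > 1
  s <- svd(scale(w, center = TRUE, scale = FALSE), nu = 0, nv = 0)$d[1]
  s^2 / (n - 1)                                    # largest eigenvalue of the sample covariance matrix
}
sapply(c(50, 500, 5000), top_eig)                  # bias grows with p; the true value is about 2.62 throughout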
2.5.4 What Theoretical Literature Tells Us
There is a vast theoretical literature on the Big-p case. No attempt is made here to sum-
marize it, but again as a guide to our thoughts, here are just a few citations (somewhat
arbitrarily chosen):
The 1988 paper by Portnoy [14] has already been mentioned. It roughly finds that we are safe if p is o(\sqrt{n}). But it does not exclude the possibility that methods might be found that allow much larger values of p than this.
One common way to deal with Big-p settings in regression modeling is to do variable selection, thus reducing p. A component in many variable selection techniques is cross-validation. To assess a model, we repeatedly fit the model to one part of the data and use the result to predict the remaining, left-out, portion. To the surprise of many, if not all, researchers in this field, work in 1993 by Shao [15] showed that cross-validation does not select the correct model consistently unless the size of the left-out portion, relative to n, goes to 1 as n goes to infinity; equivalently, the fitted portion must become a negligible fraction of n. While this is not a major impediment in practice, the surprising nature of Shao's result suggests that our intuition may be weak in finding good methods for the Big-p case, many of which are based on choosing a tuning parameter by cross-validation. Arguably, Shao's result is yet another consequence of the PEV: unless the fitted portion is kept small relative to n, there will be enough subsamples in the cross-validation process containing extreme values to distort the final estimator.
Work such as [8] in 2009 has investigated conditions under which PCA estimates are
consistent for fixed n and p → ∞. They consider spiked patterns of eigenvalues, as proposed by Johnstone. Their results are positive, but the technical conditions would likely be quite difficult (or impossible) to verify in practice.
Indeed, difficult-to-verify technical conditions plague the entire theoretical literature on Big-p topics. Moreover, some of the results are negative, for example, showing potential lack of consistency of the LASSO, as in [9].
This issue of verifying technical assumptions is key. For example, consider [8], the PCA
analysis cited above. First, they assume a ρ-mixing condition. Roughly speaking, this says
that there is some permutation of our variables such that any pair of the (newly indexed)
variables are less correlated if their indices are distant from each other, with the correlation
going to zero as the distance goes to infinity. This may be a natural assumption in a
time series context, but questionable in many others. Moreover, the assumption would
be difficult to verify empirically, since the PEV implies there will be many spurious correlations.
Another assumption made in that paper is that

\frac{\sum_{i=1}^{p} \lambda_{ip}^2}{\left( \sum_{i=1}^{p} \lambda_{ip} \right)^2} \to 0 \quad \text{as } p \to \infty    (2.12)
where \lambda_{ip} is the ith-largest eigenvalue for the first p variables. Again, the PEV is a problem here, as it causes, as noted earlier, large biases in the estimated eigenvalues, a big obstacle to verifying the assumption.
The list of assumptions of course does not stop there. One result, for instance, makes an
assumption of uniformly bounded eighth moments on components of a certain complicated
matrix expression.
To be sure, in a mathematical sense, work such as this has been deep and highly impressive. However, its connection to practical data analysis is unclear.
2.5.5 Example
An approach this author has found useful in explaining Big-p issues to data analysts has
been that of Foster and Stine at Wharton [6]. Interested in the effects of overfitting, they added noise variables, completely unrelated to the real data, to a dataset. After performing a linear regression analysis, every one of the fake predictor variables was found to be significant. This is an excellent illustration of the PEV, and of the hazards of relying on significance testing for variable selection.
This author ran a similar analysis using LASSO, specifically the R lars package. The
dataset is from Census 2000, for those working as programmers or engineers in Silicon Valley
counties. There were 20,009 observations, and 10 variables, such as age and education, with
wage income taken as the response variable. In the spirit of the Wharton study, 50 noise variables, i.i.d. N(0,1), were added. The original data were centered and scaled.
After running LASSO on the augmented dataset, the value of λ in Equation 2.8 that
minimized the Mallows Cp value occurred at Step 20. The latter corresponded to inclusion
of variables 6, 1, 8, 3, 5, 9, 2, 48, 26, 45, 27, 33, 14, 21, 58, 55, 49, 38, and 42 (in that order).
In other words, LASSO identified 12 of the fake variables as important predictors. It also
failed to pick up two of the real variables: 7 and 10.
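The flavor of this analysis can be emulated on simulated data (the Census data themselves are not reproduced here). The following sketch uses the lars package as in the chapter, with made-up predictors and our own arbitrary settings, and selects the step minimizing Mallows' Cp:

# Sketch: LASSO via lars with added noise predictors; select the Cp-minimizing step
library(lars)
set.seed(1)
n <- 20000; p_real <- 10; p_noise <- 50
x_real <- matrix(rnorm(n * p_real), n, p_real)      # stand-ins for the real predictors
x_noise <- matrix(rnorm(n * p_noise), n, p_noise)   # pure noise variables
y <- as.vector(x_real %*% rnorm(p_real)) + rnorm(n) # response depends only on the real predictors
x <- scale(cbind(x_real, x_noise))                  # center and scale, as in the chapter
fit <- lars(x, y, type = "lasso")
best <- which.min(summary(fit)$Cp)                  # row of the path minimizing Mallows' Cp
sel <- which(coef(fit)[best, ] != 0)                # variables included at that step
sel[sel > p_real]                                   # any index above 10 is a selected noise variable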
The invention of the LASSO spawned many refinements, such as elastic net, SCAD,
and so on. But it is clear that the PEV is problematic for any such procedure in practice.
2.6 Conclusion
This chapter had the goal of clarifying the problems of Big-n and Big-p in Big Data. Though
there is an effective method for dealing with the computational issues in Big-n, it is argued that the statistical issues with Big-p continue to loom large. The data analyst should not