2.5 Case of Big-p
Our focus now shifts to statistical considerations in Big-p settings. The Big-p case has become commonplace, even central, to the Big Data field, yet the problems it brings are not always clearly identified and discussed. We will explore such points in the following sections.
2.5.1 Running Examples: Linear Regression and PCA
For concreteness, let us take linear regression models and principal components analysis (PCA) as our running examples.
Linear models are important due to their ubiquity, and due to the fact, noted in Section
2.5.2, that nonparametric regression methodology may be problematic in Big-p settings.
PCA plays a major role in many analyses of Big Data, ironically because it is often used
for dimensionality reduction—turning Big-p into Small-p.
We will assume the standard notation for the regression case: We have a matrix X of
predictor variable data and vector Y of response variable data, and the usual formula
\hat{\beta} = (X'X)^{-1} X'Y    (2.2)
gives us the vector of estimated coefficients.
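As a quick illustration of Equation 2.2, here is a minimal R sketch (our own simulated data and variable names, purely illustrative) computing the closed-form estimate and checking it against lm():

# Minimal sketch: closed-form OLS estimate of Equation 2.2 vs. lm(), simulated data
set.seed(1)
n <- 100
X <- cbind(1, matrix(rnorm(n * 3), n, 3))      # predictor matrix, including an intercept column
beta <- c(2, 1, -1, 0.5)                       # arbitrary "true" coefficients
Y <- as.vector(X %*% beta + rnorm(n))          # simulated response
betahat <- solve(t(X) %*% X, t(X) %*% Y)       # (X'X)^{-1} X'Y, Equation 2.2
cbind(closed_form = as.vector(betahat), via_lm = coef(lm(Y ~ X - 1)))   # the two columns agree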
2.5.2 Curse of Dimensionality
The curse of dimensionality (COD), an old notion due to Bellman, is often mentioned in
Big Data contexts. Indeed, Bellman’s concept may be viewed as the forerunner of Big-p
research.
2.5.2.1 Data Sparsity
A major aspect of the COD is the idea that most points in a bounded region of high-
dimensional space are distant from each other. For instance, consider nearest-neighbor re-
gression, in which we predict the response variable—the Y value—of a new observation to
be the average of the responses of the k observations in our training set whose predictor vari-
ables are closest to the new one. With high values of p, one can show that those neighboring
observations tend to be far from the new one [2]. They thus tend to be unrepresentative, especially at the edges of the data space.
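A small simulation makes the point concrete; this R sketch (entirely our own, with arbitrary sample sizes) measures how far a new observation lies from its nearest neighbor in a fixed-size training set as p grows:

# Sketch: distance from a new point to its nearest training neighbor grows with p
set.seed(1)
nn_dist <- function(p, n = 1000) {
  train <- matrix(runif(n * p), n, p)        # training predictors, uniform on the unit cube
  newx <- runif(p)                           # a new observation
  d <- sqrt(colSums((t(train) - newx)^2))    # Euclidean distances to all training points
  min(d)                                     # distance to the nearest neighbor
}
sapply(c(2, 10, 50, 200), nn_dist)           # grows steadily with p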
These considerations are described dramatically in [18, p. 101]: “To maintain a given
degree of accuracy of [a nonparametric regression] estimator, the sample size must increase
exponentially with [p].” This is quite an irony. Early researchers in nonparametric regression
methods would have envied analysts who enjoy today’s Big-n datasets, as the nonparametric
methods need large n to be effective, for fixed p. But the COD implies that all this can be
lost if we are also in a Big-p setting. Thus, parametric models turn out to be useful even
with Big Data.
However, arguably the impact of the COD is not yet fully clear. For example, a common
explanation for the COD is that as p →∞, the volume of the unit hypersphere relative to
that of the unit hypercube goes to zero [4]. Therefore, if the variables have, say, a uniform
distribution on the hypercube, the probability that a point is within 0.5 unit of the center
goes to 0. This supports the notion that data for large p are sparse, a condition we will refer to here as data sparsity, to distinguish it from the notion of variable sparsity introduced earlier.
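The volume calculation behind this argument is easy to check. Taking the ball of radius 0.5 inscribed in the unit cube [0,1]^p (one common version of the statement; conventions vary), the fraction of the cube's volume it occupies is \pi^{p/2} (0.5)^p / \Gamma(p/2 + 1), which a one-line R function confirms shrinks rapidly:

# Fraction of the unit cube [0,1]^p occupied by the inscribed ball of radius 0.5
ball_frac <- function(p) pi^(p / 2) * 0.5^p / gamma(p / 2 + 1)
signif(ball_frac(c(2, 5, 10, 20)), 3)    # 0.785, 0.164, 0.00249, 2.46e-08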
Yet as pointed out in [12], this data sparsity argument depends crucially on choosing
the l_2 (Euclidean) metric

d[(x_1, ..., x_p), (y_1, ..., y_p)] = \sqrt{(x_1 - y_1)^2 + \cdots + (x_p - y_p)^2}    (2.3)
If one instead uses the l_\infty metric

d[(x_1, ..., x_p), (y_1, ..., y_p)] = \max(|x_1 - y_1|, ..., |x_p - y_p|)    (2.4)
then all points are within 0.5 of the center, so that the volume argument no longer implies
sparsity.
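A quick Monte Carlo check (ours; the sample sizes and dimensions are arbitrary) shows the contrast between the two metrics directly, by estimating the proportion of uniformly distributed points lying within 0.5 of the cube's center under each:

# Sketch: proportion of points within 0.5 of the center of the unit cube, l2 vs. l-infinity
set.seed(1)
within_half <- function(p, n = 10000) {
  x <- matrix(runif(n * p), n, p) - 0.5       # points on the unit cube, centered
  l2 <- sqrt(rowSums(x^2))                    # Euclidean distance to the center
  linf <- apply(abs(x), 1, max)               # l-infinity distance to the center
  c(l2 = mean(l2 < 0.5), linf = mean(linf < 0.5))
}
sapply(c(2, 5, 10, 20), within_half)          # the l2 row drops to 0; the l-infinity row stays at 1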
However, there are other arguments for sparsity that do not involve a specific metric.
In [7], for instance, it is shown, again with the unit hypercube as our data space, that the
size of a subcube needed to contain a certain fraction of our data grows with p. The results
in [2] apply to general metrics.
2.5.2.2 Variable Sparsity May Ameliorate the COD
The work showing data sparsity in Section 2.5.2.1 shares a common assumption: all variables/dimensions are treated identically. Revisiting this assumption may be the key
to dealing with the COD.
In nonparametric prediction applications, for instance, it may be that the unweighted
sums in Equations 2.3 and 2.4 should be replaced by weighted ones, with the more important
predictor variables carrying larger weights, say
d[(x_1, ..., x_p), (y_1, ..., y_p)] = \sqrt{w_1 (x_1 - y_1)^2 + \cdots + w_p (x_p - y_p)^2}    (2.5)
Then as p grows, neighbors of a given point are mainly those points that are close for
the important predictor variables. Sparsity is then not an issue, and the COD becomes
less relevant [12]. In essence, the effective value of p remains bounded as the actual value
increases.
Of course, finding the proper set of weights would be a challenge. If one knows the rough order of importance of the predictors in a regression setting, one might try, say, w_i = i^{-\gamma}, and choose \gamma by cross-validation, for instance. But in any case, the above argument at least shows that, in principle, the COD may not be as large an obstacle as presently thought.
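A sketch of such a weighted metric, using the weighting scheme w_i = i^{-\gamma} just mentioned (the function name and the choice \gamma = 1.5 are ours, purely for illustration; in practice \gamma would be tuned, e.g., by cross-validation):

# Sketch: weighted Euclidean distance of Equation 2.5 with w_i = i^(-gamma)
wdist <- function(x, y, gamma) {
  w <- (seq_along(x))^(-gamma)     # heavier weights on the (presumed) more important predictors
  sqrt(sum(w * (x - y)^2))
}
set.seed(1)
wdist(runif(100), runif(100), gamma = 1.5)    # distance between two points in p = 100 dimensions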
Another approach to the COD has been assumptions of variable sparsity. In regression,
this may be a reasonable assumption, motivated as follows. Suppose
E(Y | X_1, X_2, ...) = \sum_{i=1}^{\infty} \beta_i X_i    (2.6)

say with the X_i i.i.d. N(0,1). Then the simple existence of the right-hand side implies that \beta_i \to 0 as i \to \infty: the variance of the right-hand side is \sum_i \beta_i^2, which must be finite. In other words, only a finite number of the \beta_i are large, fitting our definition of variable sparsity. And typically our predictors are correlated, with this again reducing the effective value of p.
This is part of the motivation of LASSO regression [16]. Here, we choose the estimate \hat{\beta} to minimize

||Y - X\hat{\beta}||^2    (2.7)

as usual, except now with the constraint

\sum_i |\hat{\beta}_i| \le \lambda    (2.8)
This typically yields an estimator with many 0s, that is, a sparse solution.
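For readers who wish to see this sparsity in action, here is a minimal sketch on simulated data. It uses the R glmnet package rather than the chapter's own lars example of Section 2.5.5, and all settings are our illustrative choices:

# Sketch: LASSO produces many exactly-zero coefficient estimates (glmnet, simulated data)
library(glmnet)
set.seed(1)
n <- 200; p <- 50
x <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, 1.5, rep(0, p - 3))          # only 3 truly nonzero coefficients
y <- as.vector(x %*% beta + rnorm(n))
cvfit <- cv.glmnet(x, y)                      # choose the penalty by cross-validation
b <- coef(cvfit, s = "lambda.min")            # fitted coefficients (sparse matrix)
sum(b != 0)                                   # far fewer than p + 1 nonzero entries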
The situation is not so clearcut in the case of PCA. To see why, first recall that the
eigenvalues in PCA represent an apportionment of the total variance of our p variables.
Many analysts create a scree plot, displaying the eigenvalues in decreasing order, with the
goal of determining an effective value of p. This produces good results in many cases,
especially if the data have been collected with regression of a specific variable as one’s goal.
The above discussion involving Equation 2.6 suggests that a scree plot will begin to flatten
after a few eigenvalues.
By contrast, consider a dataset of people, for instance. Potentially there are a virtually
limitless number of variables that could be recorded for each person, say: height, weight,
age, gender, hair color (multivariate, several frequencies), years of education, highest degree
attained, field of degree, age of first trip more than 100 miles from home, number of surgeries,
type of work, length of work experience, number of coworkers, number of times filing for
political office, latitude/longitude/altitude of home, marital status, number of marriages,
hours of sleep per night and many, many more. While there are some correlations between
these variables, it is clear that the effective value of p can be extremely large.
In this kind of situation, the total variance can increase without bound as p grows, and
there will be many strong principal components. In such settings, PCA will likely not work
well for dimensionality reduction. Sparse PCA methods, such as that of [19], would not help either, since the situation is not sparse.
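The two situations are easy to contrast in a small simulation (a sketch with our own data-generating choices): with a few strong common factors, the scree plot drops sharply after the first few eigenvalues; with essentially unrelated variables, the eigenvalues remain comparable and PCA buys little:

# Sketch: scree behavior, low-dimensional structure vs. essentially unrelated variables
set.seed(1)
n <- 500; p <- 50
f <- matrix(rnorm(n * 3), n, 3)                                  # 3 common factors
x_struct <- f %*% matrix(rnorm(3 * p), 3, p) + 0.1 * matrix(rnorm(n * p), n, p)
x_unrel <- matrix(rnorm(n * p), n, p)                            # unrelated variables
ev_struct <- prcomp(x_struct, scale. = TRUE)$sdev^2              # eigenvalues, correlation scale
ev_unrel <- prcomp(x_unrel, scale. = TRUE)$sdev^2
round(head(ev_struct, 6), 2)    # a few large eigenvalues, then a sharp drop
round(head(ev_unrel, 6), 2)     # eigenvalues all of comparable size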
Another approach to dealing with the COD that may be viewed as a form of vari-
able sparsity is to make very strong assumptions about the distributions of our data in
application-specific contexts. A notable example is [5] for genomics data. Here, each gene is
assumed to have either a zero or nonzero effect, the latter with probability φ, which is to be
estimated from the data. Based on these very strong assumptions, the problem becomes in
some sense finite-dimensional, and significance tests are then performed, while controlling
for false discovery rates.
2.5.3 Principle of Extreme Values
In addition to sparsity, the COD may be viewed as a reflection of a familiar phenomenon
that will be convenient to call the principle of extreme values (PEV):
PEV: Say U_1, ..., U_p are events of low probability. As p increases, the probability that at least one of them will occur goes to 1.
This is very imprecisely stated, but all readers will immediately recognize it, as it describes,
for example, the problem of multiple comparisons mentioned earlier. It is thus not a new
concept, but since it will arise often here, we give it a name.
The PEV can be viewed as more general than the COD. Consider, for instance, linear regression models and the famous formula for the covariance matrix of the estimated coefficient vector,

Cov(\hat{\beta}) = \sigma^2 (X'X)^{-1}    (2.9)
Suppose all elements of the predictor variables were doubled. Then, the standard errors of
the estimated coefficients would be halved. In other words, the more dispersed our predictor variables are, the better. Thus, data sparsity is actually helpful here, rather than problematic as in the nonparametric regression case.
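This is easily verified numerically; in the following R sketch (simulated data of our own), doubling the predictor halves the reported standard error of its coefficient:

# Sketch: doubling the predictor values halves the coefficient standard error
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
se1 <- summary(lm(y ~ x))$coefficients["x", "Std. Error"]
se2 <- summary(lm(y ~ I(2 * x)))$coefficients["I(2 * x)", "Std. Error"]
c(se1, se2, se1 / se2)    # the ratio is 2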
Thus, the COD may not be an issue with parametric models, but the PEV always comes
into play, even in the parametric setting. Consider selection of predictor variables for a linear
regression model, for example. The likelihood of spurious results, that is, of some predictors
appearing to be important even though they are not, grows with p, again by the PEV.
There is potential trouble in this regard for PCA as well. Say we are working with variables W_1, ..., W_p, and write W = (W_1, ..., W_p)'. The largest eigenvalue is

\lambda_1 = \max_u Var(u'W) = \max_u u' Cov(W) u, \quad ||u|| = 1    (2.10)
Our estimate of \lambda_1 will be calculated using the sample covariance matrix \widehat{Cov}(W) of W in place of Cov(W) in Equation 2.10, that is,

\hat{\lambda}_1 = \max_u u' \widehat{Cov}(W) u, \quad ||u|| = 1    (2.11)
The PEV then comes into play for large p relative to n. Since we are maximizing, the estimate of \lambda_1 will be greatly biased upward. For example, suppose X_1, X_2, ... are i.i.d. N(0,1). Let W_1 = X_1 + X_2, with W_i = X_i for i > 1. In this case, the true population value of \lambda_1 is 2.62 (independent of p), while in simulated samples for various p, the following values of the estimated \lambda_1 were obtained:

    p       estimated \lambda_1
    50      2.62
    500     4.34
    5000    17.17
In other words, the PEV will likely result in severe distortions in PCA.
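The flavor of this experiment is easy to reproduce. The chapter does not report the sample size n it used, so in the sketch below n = 1000 is our arbitrary choice; only the qualitative pattern of upward bias, not the exact values in the table, should be expected:

# Sketch: upward bias of the largest sample eigenvalue as p grows, n fixed
set.seed(1)
top_eig <- function(p, n = 1000) {
  x <- matrix(rnorm(n * p), n, p)
  w <- cbind(x[, 1] + x[, 2], x[, -1])             # W1 = X1 + X2, Wi = Xi for i > 1
  s <- svd(scale(w, center = TRUE, scale = FALSE), nu = 0, nv = 0)$d[1]
  s^2 / (n - 1)                                    # largest eigenvalue of the sample covariance matrix
}
sapply(c(50, 500, 5000), top_eig)                  # bias grows with p; the true value is about 2.62 throughout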
2.5.4 What Theoretical Literature Tells Us
There is a vast theoretical literature on the Big-p case. No attempt is made here to sum-
marize it, but again as a guide to our thoughts, here are just a few citations (somewhat
arbitrarily chosen):
The 1988 paper by Portnoy [14] has already been mentioned. It roughly finds that we are safe if p is o(\sqrt{n}). But it does not exclude the possibility that methods might be found that allow much larger values of p than this.
One common way to deal with Big-p settings in regression modeling is to do variable selection, thus reducing p. A component in many variable selection techniques is cross-validation. To assess a model, we repeatedly fit the model to one part of the data and use the result to predict the remaining, left-out, portion. To the surprise of many, if not all, researchers in this field, work in 1993 by Shao [15] showed that cross-validation does not select the correct model consistently unless the size of the left-out portion, relative to n, goes to 1 as n goes to infinity; equivalently, the fitted portion must become a negligible fraction of n. While this is not a major impediment in practice, the surprising nature of Shao's result suggests that our intuition may be weak in finding good methods for the Big-p case, many of which are based on choosing a tuning parameter by cross-validation. Arguably, Shao's result is yet another consequence of the PEV: unless the fitted portion is kept small relative to n, there will be enough subsamples in the cross-validation process containing extreme values to distort the final estimator.
Work such as [8] in 2009 has investigated conditions under which PCA estimates are
consistent for fixed n and p → ∞. They consider spiked patterns of eigenvalues, as proposed by Johnstone. Their results are positive, but the technical conditions would likely be quite difficult (or impossible) to verify in practice.
Indeed, difficult-to-verify technical conditions plague the entire theoretical literature on Big-p topics. Moreover, some of the results are negative, for example, showing potential lack of consistency of the LASSO, as in [9].
This issue of verifying technical assumptions is key. For example, consider [8], the PCA
analysis cited above. First, they assume a ρ-mixing condition. Roughly speaking, this says
that there is some permutation of our variables such that any pair of the (newly indexed)
variables are less correlated if their indices are distant from each other, with the correlation
going to zero as the distance goes to infinity. This may be a natural assumption in a
time series context, but questionable in many others. Moreover, the assumption would
be difficult to verify empirically, since the PEV implies there will be many spurious correlations.
Another assumption made in that paper is that

\frac{\sum_{i=1}^{p} \lambda_{ip}^2}{\left( \sum_{i=1}^{p} \lambda_{ip} \right)^2} \to 0 \quad \text{as } p \to \infty    (2.12)
where \lambda_{ip} is the ith-largest eigenvalue for the first p variables. Again, the PEV is a problem here, as it causes, as noted earlier, large biases in the estimated eigenvalues, a big obstacle to verifying the assumption.
The list of assumptions of course does not stop there. One result, for instance, makes an
assumption of uniformly bounded eighth moments on components of a certain complicated
matrix expression.
To be sure, in a mathematical sense, work such as this has been deep and highly impressive. However, its connection to practical data analysis is unclear.
2.5.5 Example
An approach this author has found useful in explaining Big-p issues to data analysts has
been that of Foster and Stine at Wharton [6]. Interested in the effects of overfitting, they added noise variables, completely unrelated to the real data, to a dataset. After performing a linear regression analysis, every one of the fake predictor variables was found to be significant. This is an excellent illustration of the PEV, and of the hazards of relying on significance testing for variable selection.
This author ran a similar analysis using LASSO, specifically the R lars package. The
dataset is from Census 2000, for those working as programmers or engineers in Silicon Valley
counties. There were 20,009 observations, and 10 variables, such as age and education, with
wage income taken as the response variable. In the spirit of the Wharton study, 50 noise variables, i.i.d. N(0,1), were added. The original data were centered and scaled.
After running LASSO on the augmented dataset, the value of λ in Equation 2.8 that
minimized the Mallows Cp value occurred at Step 20. The latter corresponded to inclusion
of variables 6, 1, 8, 3, 5, 9, 2, 48, 26, 45, 27, 33, 14, 21, 58, 55, 49, 38, and 42 (in that order).
In other words, LASSO identified 12 of the fake variables as important predictors. It also
failed to pick up two of the real variables: 7 and 10.
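The flavor of this analysis can be emulated on simulated data (the Census data themselves are not reproduced here). The following sketch uses the lars package as in the chapter, with made-up predictors and our own arbitrary settings, and selects the step minimizing Mallows' Cp:

# Sketch: LASSO via lars with added noise predictors; select the Cp-minimizing step
library(lars)
set.seed(1)
n <- 20000; p_real <- 10; p_noise <- 50
x_real <- matrix(rnorm(n * p_real), n, p_real)      # stand-ins for the real predictors
x_noise <- matrix(rnorm(n * p_noise), n, p_noise)   # pure noise variables
y <- as.vector(x_real %*% rnorm(p_real)) + rnorm(n) # response depends only on the real predictors
x <- scale(cbind(x_real, x_noise))                  # center and scale, as in the chapter
fit <- lars(x, y, type = "lasso")
best <- which.min(summary(fit)$Cp)                  # row of the path minimizing Mallows' Cp
sel <- which(coef(fit)[best, ] != 0)                # variables included at that step
sel[sel > p_real]                                   # any index above 10 is a selected noise variable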
The invention of the LASSO spawned many refinements, such as elastic net, SCAD,
and so on. But it is clear that the PEV is problematic for any such procedure in practice.
2.6 Conclusion
This chapter had the goal of clarifying the problems of Big-n and Big-p in Big Data. Though
there is an effective method for dealing with the computational issues in Big-n, it is argued that the statistical issues with Big-p continue to loom large. The data analyst should not