This typically yields an estimator with many 0s, that is, a sparse solution.
The situation is not so clear-cut in the case of PCA. To see why, first recall that the
eigenvalues in PCA represent an apportionment of the total variance of our p variables.
Many analysts create a scree plot, displaying the eigenvalues in decreasing order, with the
goal of determining an effective value of p. This produces good results in many cases,
especially if the data have been collected with regression of a specific variable as one’s goal.
The above discussion involving Equation 2.6 suggests that a scree plot will begin to flatten
after a few eigenvalues.
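To make the scree plot concrete, here is a minimal Python sketch (ours, purely illustrative) using scikit-learn's PCA on simulated data in which a few latent factors carry most of the variance; the data-generating values are arbitrary assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Simulated data: 500 observations of 20 variables driven by
# 3 latent factors plus a little noise (all values illustrative).
latent = rng.normal(size=(500, 3))
loadings = rng.normal(size=(3, 20))
X = latent @ loadings + 0.3 * rng.normal(size=(500, 20))

pca = PCA().fit(X)
plt.plot(range(1, 21), pca.explained_variance_, "o-")
plt.xlabel("component number")
plt.ylabel("eigenvalue")
plt.title("Scree plot")
plt.show()  # the curve flattens after the first few components
```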
By contrast, consider a dataset of people. There is potentially a virtually
limitless number of variables that could be recorded for each person, say: height, weight,
age, gender, hair color (multivariate, several frequencies), years of education, highest degree
attained, field of degree, age of first trip more than 100 miles from home, number of surgeries,
type of work, length of work experience, number of coworkers, number of times filing for
political office, latitude/longitude/altitude of home, marital status, number of marriages,
hours of sleep per night, and many, many more. While there are some correlations among
these variables, it is clear that the effective value of p can be extremely large.
In this kind of situation, the total variance can increase without bound as p grows, and
there will be many strong principal components. In such settings, PCA will likely not work
well for dimension reduction. Sparse PCA methods, for example [19], would not work
either, since the situation is not sparse.
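This can be seen numerically. In the following sketch (simulated data; the 0.5 cutoff for a "strong" component is an arbitrary assumption), the variables are generated independently, so the total variance grows roughly linearly in p and no small set of components dominates.

```python
import numpy as np

rng = np.random.default_rng(1)
for p in (10, 100, 1000):
    # 2000 observations of p independent variables: a caricature of
    # the "people" example, with little shared structure.
    X = rng.normal(size=(2000, p))
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    strong = (eigvals > 0.5).sum()  # arbitrary cutoff for "strong"
    print(f"p = {p:4d}  total variance = {eigvals.sum():7.1f}  "
          f"strong components = {strong}")
```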
Another approach to dealing with the COD, one that may be viewed as a form of variable
sparsity, is to make very strong assumptions about the distributions of our data in
application-specific contexts. A notable example is [5] for genomics data. Here, each gene is
assumed to have either a zero or a nonzero effect, the latter with probability φ, which is to be
estimated from the data. Based on these very strong assumptions, the problem becomes in
some sense finite-dimensional, and significance tests are then performed, while controlling
for false discovery rates.
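We do not reproduce the method of [5] here, but the flavor of such two-group models can be conveyed by a toy sketch: each "gene" yields a test statistic drawn from a null or a nonnull component, and φ is estimated by a simple EM iteration. Everything below (normal components, known nonnull mean, parameter values) is an illustrative assumption, not the actual model in [5].

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: 10,000 "genes," 5% of which have a true nonzero effect.
phi_true, mu = 0.05, 3.0
nonzero = rng.random(10_000) < phi_true
z = rng.normal(loc=np.where(nonzero, mu, 0.0))

# EM for the mixture z ~ (1 - phi) N(0,1) + phi N(mu,1),
# with mu treated as known for brevity.
phi = 0.5
for _ in range(200):
    dens0 = np.exp(-0.5 * z**2)           # null density (up to a constant)
    dens1 = np.exp(-0.5 * (z - mu) ** 2)  # nonnull density
    resp = phi * dens1 / ((1 - phi) * dens0 + phi * dens1)
    phi = resp.mean()
print(f"estimated phi = {phi:.3f}")  # should be near 0.05
```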
2.5.3 Principle of Extreme Values
In addition to sparsity, the COD may be viewed as a reflection of a familiar phenomenon
that will be convenient to call the principle of extreme values (PEV):
PEV: Say $U_1, \ldots, U_p$ are events of low probability. As $p$ increases, the probability
that at least one of them will occur goes to 1.
This is very imprecisely stated, but all readers will immediately recognize it, as it describes,
for example, the problem of multiple comparisons mentioned earlier. It is thus not a new
concept, but since it will arise often here, we give it a name.
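A back-of-the-envelope calculation makes the PEV concrete. Assuming, purely for illustration, that the events are independent and each has probability q = 0.01, the chance that at least one occurs is $1 - (1 - q)^p$:

```python
# P(at least one of p independent events of probability q occurs)
q = 0.01
for p in (10, 100, 1000):
    print(f"p = {p:4d}: {1 - (1 - q) ** p:.4f}")
# p =   10: 0.0956
# p =  100: 0.6340
# p = 1000: 1.0000 (to four places)
```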
The PEV can be viewed as more general than the COD. Consider, for instance, linear
regression models and the famous formula for the covariance matrix of the estimated
coefficient vector,

$$\mathrm{Cov}(\widehat{\beta}) = \sigma^2 (X'X)^{-1} \qquad (2.9)$$
Suppose all elements of the predictor variables were doubled. Then $X'X$ would be multiplied
by 4, and by Equation 2.9 the standard errors of the estimated coefficients would be halved.
In other words, the more dispersed our predictor
variables are, the better. Thus, data sparsity is actually helpful, rather than problematic as
in the nonparametric regression case.
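The halving is easy to verify numerically from Equation 2.9; in this sketch the design matrix and σ² are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 3
X = rng.normal(size=(n, p))
sigma2 = 1.0  # arbitrary error variance

def std_errs(X):
    # Square roots of the diagonal of sigma^2 (X'X)^{-1}, per Equation 2.9
    return np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

print(std_errs(X) / std_errs(2 * X))  # each ratio is exactly 2.0
```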
Thus, the COD may not be an issue with parametric models, but the PEV always comes
into play, even in the parametric setting. Consider selection of predictor variables for a linear
regression model, for example. The likelihood of spurious results, that is, of some predictors
appearing to be important even though they are not, grows with p, again by the PEV.
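This growth is easy to simulate: regress pure noise on p noise predictors and count the coefficients "significant" at the 0.05 level. The sketch below assumes ordinary least squares and arbitrary sample sizes; on average roughly 0.05p spurious "discoveries" appear, more as p grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 100
for p in (5, 20, 50):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)  # pure noise: no predictor truly matters
    Xc = np.column_stack([np.ones(n), X])          # add intercept
    beta = np.linalg.lstsq(Xc, y, rcond=None)[0]   # OLS fit
    resid = y - Xc @ beta
    s2 = resid @ resid / (n - p - 1)               # error variance estimate
    se = np.sqrt(s2 * np.diag(np.linalg.inv(Xc.T @ Xc)))
    t = beta / se
    pvals = 2 * stats.t.sf(np.abs(t[1:]), df=n - p - 1)
    print(f"p = {p:2d}: {(pvals < 0.05).sum()} spuriously significant")
```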