1. While the sample size might be large, the number of features for which measurements are available is often much larger. For instance, a web search company might track every single query made by each of its users. In this setting, the number of users (the sample size) is huge, but the number of possible queries (the set of features) is orders of magnitude larger. Similarly, in biology, new technologies have made it possible to obtain a detailed molecular snapshot of the activity of a tissue sample or even a single cell; however, this snapshot is expensive, and so the sample size is typically quite small relative to the number of molecular measurements obtained. These examples are high dimensional: there are many more features than observations.
2. When faced with old-fashioned small data, statisticians often seek to answer very simple questions, such as "Is this parameter nonzero?" and "Is X correlated with Y?" However, on the basis of Big Data, we often ask far more complex questions, such as "Which of these 1,000,000 parameters are nonzero?" and "What are the conditional dependence relationships among these 30,000 features?"
In essence, people ask very complex questions on the basis of Big Data; answering these
questions leads to statistical challenges. Because so much data are available and so many
features have been measured, we become ambitious and seek to fit very complex models.
However, this level of complexity often cannot be supported by the available sample size,
leading to a host of problems, such as overfitting. Even seemingly simple tasks—such as
fitting a linear model to predict a response on the basis of a set of features—can be quite
challenging with Big Data.
In the classical statistical setting, we often estimate a parameter vector θ as the
minimizer of some loss function L(θ), which might be the negative log-likelihood of the data
under some model. In the context of Big Data, reducing the complexity of the parameter
space is of paramount importance. We can do this by making an assumption about the
structure of θ—for instance, that it is sparse or low-rank—and selecting a penalty function
P that encourages this structure in θ. We can then estimate θ as \(\hat{\theta}\), a minimizer of
\[
\underset{\theta}{\operatorname{minimize}} \;\; \bigl\{ L(\theta) + \lambda P(\theta) \bigr\}, \tag{16.1}
\]
where λ is a nonnegative tuning parameter that controls the tradeoff between the loss
function (which encourages the parameter estimate to fit the data) and the penalty function
(which encourages the parameter estimate to conform to our structural assumptions). For
a suitably chosen penalty function P, Equation 16.1 can overcome some of the problems
associated with the complexity of the parameter space, thereby yielding a good estimate
of θ. This makes Equation 16.1 a valuable framework when working with Big Data.
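To make Equation 16.1 concrete, here is a minimal sketch in Python, assuming a squared-error loss L(θ) = ½‖y − Xθ‖² and an ℓ1 penalty P(θ) = ‖θ‖₁ (one instance of the framework, corresponding to sparse linear regression), solved by proximal gradient descent. The function names soft_threshold and penalized_estimate, the step-size rule, and the toy data are illustrative choices, not prescriptions from the chapter.

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of the l1 penalty: shrinks each coordinate toward zero.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def penalized_estimate(X, y, lam, step=None, n_iter=500):
    """Minimize L(theta) + lam * P(theta) for squared-error loss and an
    l1 penalty, via proximal gradient descent (one instance of Eq. 16.1)."""
    n, p = X.shape
    if step is None:
        # Reciprocal of the Lipschitz constant of the loss gradient.
        step = 1.0 / np.linalg.norm(X, 2) ** 2
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)   # gradient of 0.5 * ||y - X theta||^2
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta

# Toy high-dimensional example: n = 50 observations, p = 200 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
beta = np.zeros(200)
beta[:5] = 2.0                                 # only 5 truly nonzero coefficients
y = X @ beta + 0.1 * rng.standard_normal(50)
theta_hat = penalized_estimate(X, y, lam=1.0)
print(np.flatnonzero(np.abs(theta_hat) > 1e-6))  # indices of estimated nonzeros
```

Even though the number of features far exceeds the sample size, the ℓ1 penalty in this sketch encourages a sparse estimate, so the handful of truly nonzero coefficients can be recovered.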
Typically, the loss function L is convex. If the penalty function P is also chosen to be
convex, then Equation 16.1 is a convex optimization problem. This means that every local
optimum is a global optimum and that we have access to a sophisticated set of tools for
finding the minimizer of Equation 16.1 (Boyd and Vandenberghe, 2004). This is particularly
important in the context of Big Data, for which efficient algorithms and implementations
are critical.
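Because the problem is convex, a custom algorithm is not even required: a generic convex solver suffices, and whatever optimum it returns is a global one. Below is a minimal sketch using CVXPY, an off-the-shelf convex modeling tool that is an assumption of this example rather than something covered in the chapter, applied to the same loss + penalty problem.

```python
import numpy as np
import cvxpy as cp

# Toy high-dimensional data: n = 50 observations, p = 200 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))
y = rng.standard_normal(50)

theta = cp.Variable(200)
lam = 1.0                                   # nonnegative tuning parameter
loss = 0.5 * cp.sum_squares(y - X @ theta)  # convex loss L(theta)
penalty = cp.norm1(theta)                   # convex penalty P(theta)
problem = cp.Problem(cp.Minimize(loss + lam * penalty))
problem.solve()                             # any local optimum is global

print(problem.value)                        # optimal objective value
print(np.count_nonzero(np.abs(theta.value) > 1e-6))  # nonzero coefficients
```

The appeal of this approach is that swapping in a different convex loss or penalty requires changing only one line, at the cost of generally slower performance than a specialized algorithm on very large problems.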
In this chapter, we will explore the loss + penalty framework given in Equation 16.1 in the setting where both L and P are convex functions. We will consider three motivating
examples: linear regression, matrix completion, and Gaussian graphical modeling. In
Section 16.2, we will explore the convex loss functions associated with these motivating
examples. In Section 16.3, we will introduce some convex penalty functions that can be used
to induce particular types of structure in the parameters. We will explore three algorithms
for solving convex optimization problems in Section 16.4 and will apply them to our three
motivating examples in Section 16.5. Finally, we close with a discussion in Section 16.6.