Penalized Estimation in Complex Models
extreme, when λ is large, we get a model that is less sensitive to the data observed, which
is beneficial when there are not enough observations to support a more complex model.
The optimal choice of the parameter λ depends on many factors that are unknowable to a
data analyst. Conceptually, it depends on how much information the data contain about
the unknown parameter θ, on the choice of penalty, and on the structure possessed by θ.
In practice, cross-validation, AIC, and BIC are often used to select a value of λ (see, e.g.,
Hastie et al. 2009).
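As a concrete illustration of selecting λ by cross-validation, the following sketch (the data, the λ grid, and the function names are invented for illustration) performs K-fold cross-validation for ridge regression, a special case of penalized estimation with squared-error loss and an L2 penalty:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: argmin ||y - X theta||^2 + lam ||theta||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_select_lambda(X, y, lambdas, n_folds=5, seed=0):
    """Return the lambda in the grid with smallest average held-out squared error."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds      # balanced random fold assignment
    errors = np.zeros(len(lambdas))
    for i, lam in enumerate(lambdas):
        for k in range(n_folds):
            train, test = folds != k, folds == k
            theta = ridge_fit(X[train], y[train], lam)
            errors[i] += np.mean((y[test] - X[test] @ theta) ** 2)
    return lambdas[np.argmin(errors)]

# Synthetic example: sparse true coefficient vector plus Gaussian noise.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 10))
theta_true = np.zeros(10)
theta_true[:3] = [2.0, -1.0, 0.5]
y = X @ theta_true + rng.standard_normal(100)
best_lam = cv_select_lambda(X, y, np.array([0.01, 0.1, 1.0, 10.0, 100.0]))
```

Each candidate λ is scored by its average prediction error on held-out folds; the minimizer of that curve is the selected value.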
In recent years, the theoretical properties of the solution to Equation 16.1 have been
studied extensively. A common thread in this work is that under a certain set of assumptions,
if the complexity of the model is allowed to grow faster than the sample size, then
penalized estimators can work reliably in settings where the classical unpenalized versions
fail completely. Many authors have analyzed Equation 16.1 in the context of specific
loss functions and specific penalties (see Bühlmann and van de Geer 2011 and references
therein for a comprehensive overview of theoretical results). Negahban et al. (2012) consider
Equation 16.1 in considerable generality.
Throughout this chapter, we have assumed that the loss and penalty functions are
convex. As mentioned earlier, convex formulations are attractive from a computational
standpoint—we do not have to worry about suboptimality of local minima, and we can
apply standard optimization methods, such as those presented in Section 16.4. Beyond this,
the convex framework has an advantage in that a solution θ̂ to Equation 16.1 has a simple
characterization in terms of a set of optimality conditions (see, e.g., Boyd and Vandenberghe
2004); this facilitates the study of its theoretical properties.
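To make these optimality conditions concrete, consider the lasso (squared-error loss with an L1 penalty), where a solution θ̂ satisfies the subgradient conditions |Xⱼᵀ(y − Xθ̂)| ≤ λ for every coordinate j, with equality Xⱼᵀ(y − Xθ̂) = λ·sign(θ̂ⱼ) whenever θ̂ⱼ ≠ 0. The sketch below (an illustration, not from the chapter; the data are synthetic) computes a lasso solution by coordinate descent and verifies these conditions numerically:

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the scalar solution of an L1-penalized problem."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Coordinate descent for (1/2)||y - X theta||^2 + lam * ||theta||_1."""
    n, p = X.shape
    theta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    r = y - X @ theta                        # running residual
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * theta[j]          # remove coordinate j's contribution
            theta[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
            r -= X[:, j] * theta[j]          # add back the updated contribution
    return theta

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + rng.standard_normal(50)
lam = 5.0
theta_hat = lasso_cd(X, y, lam)
# Optimality check: correlations of features with the residual.
grad = X.T @ (y - X @ theta_hat)
```

At a solution, every entry of `grad` lies in [−λ, λ], and entries on the support of θ̂ sit exactly at ±λ; this check is what makes convex solutions easy to certify.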
In this chapter, we have ignored a large literature of penalized estimation methods
that make use of nonconvex penalties (Fan and Li, 2001; Zhang, 2010). These methods
yield estimators that closely match the desired problem structure and that have attractive
theoretical properties. However, they pay a price in terms of local minima and difficulties
in computation.
In recent years, the success of penalized estimation techniques has led to a rich interplay
between the fields of statistics and optimization. For example, Agarwal et al. (2012) revisit
the theoretical convergence rate of PGM and show that a faster rate is obtained up to the
level of accuracy that is of interest in a statistical context. Another recent line of work
studies the tradeoffs between computational efficiency and statistical optimality (Berthet
and Rigollet, 2013; Chandrasekaran and Jordan, 2013; Ma and Wu, 2015).
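For reference, the PGM iteration whose convergence rate is analyzed in that line of work can be sketched for the lasso as follows. This is a minimal illustration under standard choices (a constant step size set by the Lipschitz constant of the smooth part); the function names and data are invented for the example:

```python
import numpy as np

def prox_l1(z, t):
    """Proximal operator of t * ||.||_1, i.e., soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def pgm_lasso(X, y, lam, n_iter=2000):
    """Proximal gradient method for (1/2)||y - X theta||^2 + lam * ||theta||_1.

    Each step takes a gradient step on the smooth loss, then applies the
    proximal operator of the (scaled) penalty.
    """
    p = X.shape[1]
    L = np.linalg.norm(X, 2) ** 2            # Lipschitz constant of the gradient
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)         # gradient of the smooth loss
        theta = prox_l1(theta - grad / L, lam / L)
    return theta

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + rng.standard_normal(50)
theta_pgm = pgm_lasso(X, y, lam=5.0)
```

The separation into a smooth gradient step and a cheap proximal step is what lets PGM handle the nondifferentiable penalty while retaining gradient-method convergence guarantees.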
In this chapter, we have presented a flexible toolkit for fitting complex models to large-
scale data. As the data continue to grow in scale, so too will the complexity of the questions
being asked, and consequently the need for penalized estimation techniques to answer those
questions.
Acknowledgments
JB was supported in part by NSF Award DMS-1405746 and DW was supported in part by
NSF CAREER Award DMS-1252624.
References
Agarwal, A., Negahban, S. and Wainwright, M. J. (2012), Fast global convergence of
gradient methods for high-dimensional statistical recovery, The Annals of Statistics 40(5),
2452–2482.