M. de Carvalho*
Faculty of Mathematics, Pontificia Universidad Católica de Chile, Santiago, Chile
My experience discussing concepts of risk and statistics of extremes with practitioners started in 2009, while I was a visiting researcher at the Portuguese Central Bank (Banco de Portugal). At the beginning, practitioner colleagues were intrigued by the methods I was applying, and the questions were recurrent: “What is the difference between statistics of extremes and survival analysis (or duration analysis)?1 And why don't you apply empirical estimators?” The short answer is that when modeling rare catastrophic events, we need to extrapolate beyond observed data—into the tails of a distribution—and standard inference methods often fail to deal with this properly. To see this, suppose that we observe a random sample of losses $Y_1, \dots, Y_n$ and that we estimate the survivor function $S(y) = P(Y > y)$, using the empirical survivor function, $\widehat{S}_n(y) = n^{-1} \sum_{i=1}^n I(Y_i > y)$, for $y \geq 0$. Now, suppose that we want to assess the probability of observing a loss just larger than the maximum observed loss, $Y_{n,n} = \max\{Y_1, \dots, Y_n\}$. Obviously, the probability of that event turns out to be zero [$\widehat{S}_n(y) = 0$, for all $y \geq Y_{n,n}$], thus illustrating that the empirical survivor function is unable to extrapolate into the right tail of the loss distribution. As put simply by Taleb (2012, p. 46), “the fool believes that the tallest mountain in the world will be equal to the tallest one he has observed.”
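The point is easy to make concrete with a small simulation sketch (the generalized Pareto loss distribution, sample size, and threshold below are arbitrary choices of mine): the empirical survivor function assigns probability exactly zero to any loss beyond the sample maximum, whereas a standard peaks-over-threshold fit extrapolates a strictly positive tail probability.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(1)
losses = genpareto.rvs(c=0.3, size=5000, random_state=rng)  # toy heavy-tailed losses

y = 1.01 * losses.max()          # a loss just beyond the observed maximum

# Empirical survivor function: proportion of losses exceeding y
S_emp = np.mean(losses > y)      # exactly 0 for any y >= max(losses)

# Peaks-over-threshold: fit a GPD to exceedances over a high threshold u
u = np.quantile(losses, 0.9)
exc = losses[losses > u] - u
xi_hat, _, sigma_hat = genpareto.fit(exc, floc=0)

# Extrapolated tail probability: P(Y > y) = P(Y > u) P(Y - u > y - u | Y > u)
S_pot = np.mean(losses > u) * genpareto.sf(y - u, xi_hat, scale=sigma_hat)

print(S_emp)   # 0.0
print(S_pot)   # small, but strictly positive
```

The particular numbers are immaterial; what matters is that the empirical estimate vanishes beyond the largest observed loss by construction, while the fitted tail does not.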
In this chapter, I revisit some viewpoints that I shared with the discussion group “Future of Statistics of Extremes” at the ESSEC Conference on Extreme Events in Finance, which took place at Royaumont Abbey, France, on 15–17 December 2014
extreme-events-in-finance.essec.edu
and which led to the editor's invitation to write this chapter. My goal is to provide a personal view on some recent concepts and methods of statistics of extremes and to discuss challenges and opportunities that could lead to future developments. The scope is far from encyclopedic, and many other interesting perspectives can be found throughout this monograph.
In Section 9.2, I note that a bivariate extreme value distribution is an example of what I call here a measure-dependent measure, I briefly review kernel density estimators for the spectral density, and I discuss families of spectral measures. In Section 9.3, I argue that the spectral density ratio model (de Carvalho and Davison, 2014), the proportional tails model (Einmahl et al., 2015), and the exponential families for heavy-tailed data (Fithian and Wager, 2015) share similar construction principles; in addition, I discuss en passant a new nonparametric estimator for the so-called scedasis function, which is one of the main estimation targets in the proportional tails model. Comments on potential future developments are scattered across the chapter, and a miscellany of topics is included in Section 9.4.
Throughout this chapter I use the acronym EVD to denote extreme value distribution.
Let $P_\theta$ be a probability measure on a measurable space, and let $\Theta$ be a parameter space. The family $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is a statistical model. Obviously, not every statistical model is appropriate for modeling risk. As mentioned in Section 9.1, candidate statistical models should possess the ability to extrapolate into the tails of a distribution, beyond existing data.
See Coles (2001, Theorem 3.1.1). Here, $\mu \in \mathbb{R}$ and $\sigma > 0$ are location and scale parameters, while $\xi \in \mathbb{R}$ is a shape parameter that determines the rate of decay of the tail: $\xi = 0$, light tail (Gumbel); $\xi > 0$, heavy tail (Fréchet); $\xi < 0$, short tail (Weibull). The generalized EVD in (9.1) is a three-parameter family that plays an important role in statistics of univariate extremes.
In some cases we want to assess the risk of observing simultaneously large values of two random variables (say, two simultaneous large losses in a portfolio), and the mathematical basis for such modeling is that of statistics of bivariate extremes. In this context, “extremal dependence” is often interpreted as a synonym of risk. Moving from one dimension to two sharply increases the complexity of models for extremes. The first challenge one faces when modeling bivariate extremes is that the estimation object of interest is infinite dimensional, whereas in the univariate case only three parameters are needed. The intuition is the following: when modeling bivariate extremes, apart from the marginal distributions, we are also interested in the extremal dependence structure of the data, and—as we shall see in Theorem 9.4—only an infinite-dimensional object is flexible enough to capture the “spectrum” of all possible types of dependence.
Let $(X, Y)$ be a random vector, where I assume that $X$ and $Y$ are unit Fréchet marginally distributed, that is, $P(X \leq z) = P(Y \leq z) = \exp(-1/z)$, for $z > 0$. Similarly to the univariate case, the classical theory for characterizing the extremal behavior of bivariate extremes is based on block maxima, here given by the componentwise maxima $(M_{n,1}, M_{n,2}) = (\max_i X_i, \max_i Y_i)$; note that the componentwise maxima need not be a sample point. Similarly to the univariate case, we focus on the standardized maxima, which for unit Fréchet marginals is given by the standardized componentwise maxima $(M_{n,1}/n, M_{n,2}/n)$. Next, I define a special type of statistical model that plays a key role in bivariate extreme value modeling.
What are the relevant statistical models for statistics of bivariate extremes? Is there an extension of the generalized EVD for the bivariate setting? The following is a bivariate analogue to Theorem 9.1.
See Coles (2001, Theorem 8.1). Throughout I refer to the distribution $G$ in (9.3) as a bivariate EVD. Note the similarities between (9.1) and (9.3): both start with an “exp,” but the exponent of the bivariate EVD is governed by the spectral measure $H$, whereas that of the univariate EVD is governed by the three parameters $(\mu, \sigma, \xi)$. To understand why $H$ needs to obey the moment constraint (9.2), let $x \to \infty$ or $y \to \infty$ in (9.3), so as to recover the unit Fréchet marginals. Some further comments are in order. First, since (9.2) is the only constraint on $H$, neither $H$ nor the bivariate EVD can have a finite parameterization. Second, a bivariate extreme value distribution is an example of a measure-dependent measure, as introduced in Definition 9.2.
A pseudo-polar transformation is useful for understanding the role of the so-called spectral measure $H$. Define $(R, W) = (X + Y, X/(X + Y))$, and refer to $R$ and $W$ as the pseudo-radius and pseudo-angle, respectively. If $X$ is relatively large, then $W \approx 1$; if $Y$ is relatively large, then $W \approx 0$. de Haan and Resnick (1977) have shown that $P(W \leq w \mid R > r) \to H(w)$, as $r \to \infty$. Thus, when the pseudo-radius is large, the pseudo-angles are approximately distributed according to $H$. Perfect (extremal) dependence corresponds to $H$ being degenerate at $1/2$, whereas independence corresponds to $H$ placing half of its mass at 0 and the other half at 1. The spectral probability measure determines the interactions between joint extremes and is thus an estimation target of interest; other functionals of the spectral measure are also often used, such as the spectral density $h = \mathrm{d}H/\mathrm{d}w$ or the Pickands (1981) dependence function $A(w)$, for $w \in [0, 1]$. The cases of extremal independence and perfect extremal dependence correspond, respectively, to the bivariate EVDs $G(x, y) = \exp\{-(1/x + 1/y)\}$ and $G(x, y) = \exp\{-\max(1/x, 1/y)\}$, for $x, y > 0$.
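The pseudo-polar transformation is easy to illustrate by simulation. In the sketch below I use independent unit Fréchet pairs (so that the limiting spectral measure places half of its mass at each endpoint); the sample size and radial threshold are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20000
# Independent unit Fréchet pairs: X = -1/log(U), with U ~ Uniform(0, 1)
x = -1.0 / np.log(rng.uniform(size=n))
y = -1.0 / np.log(rng.uniform(size=n))

# Pseudo-polar transformation
r = x + y          # pseudo-radius
w = x / r          # pseudo-angle in (0, 1)

# Retain pseudo-angles whose pseudo-radius exceeds a high threshold
w_ext = w[r > np.quantile(r, 0.95)]

# Under independence, H puts mass 1/2 at each endpoint, so the extreme
# pseudo-angles should pile up near 0 and 1, with mean close to 1/2
print(round(np.mean((w_ext < 0.1) | (w_ext > 0.9)), 2))
print(round(float(w_ext.mean()), 2))
```

With dependent pairs (e.g., from a logistic model), the retained pseudo-angles would instead concentrate toward the interior of the unit interval.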
In practice, we have to deal with a statistical problem—lack of knowledge on $H$—and an inference challenge—that is, obtaining estimates that obey the marginal moment constraint and define a density on the unit interval. Indeed, as posed by Coles (2001, p. 146), “it is not straightforward to constrain nonparametric estimators to satisfy functional constraints of the type” of Eq. (9.2). Inference should be conducted using pseudo-angles $w_1, \dots, w_k$, which are constructed from a sample of size $n$ by thresholding the pseudo-radius at a sufficiently high threshold. Kernel smoothing estimators for the spectral density have recently been proposed by de Carvalho et al. (2013) and are based on
Here $\beta(w; a, b)$ denotes the beta density with shape parameters $a, b > 0$, and $\nu > 0$ is a parameter responsible for the level of smoothing, which can be obtained through cross-validation. Each beta density is centered around a pseudo-angle in the sense that $\int_0^1 w \, \beta(w; \nu w_i, \nu(1 - w_i)) \, \mathrm{d}w = w_i$, for $i = 1, \dots, k$. And how can we obtain the probability masses, $p_1, \dots, p_k$? There are at least two options. A simple one is to consider Euclidean likelihood methods (Owen, 2001, pp. 63–66), in which case the vector of probability masses solves
By the method of Lagrange multipliers, we obtain $\hat{p}_i = k^{-1}\{1 - (\overline{w} - 1/2) S^{-2} (w_i - \overline{w})\}$, where $\overline{w} = k^{-1} \sum_{i=1}^k w_i$ and $S^2 = k^{-1} \sum_{i=1}^k (w_i - \overline{w})^2$. This yields the following estimator, known as the smooth Euclidean likelihood spectral density:
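A minimal numerical sketch of the Euclidean likelihood construction is given below; it assumes the closed-form masses $p_i = k^{-1}\{1 - (\overline{w} - 1/2) S^{-2} (w_i - \overline{w})\}$, with $\overline{w}$ and $S^2$ the sample mean and variance of the pseudo-angles, and the simulated “pseudo-angles” are a stand-in rather than thresholded data.

```python
import numpy as np
from scipy.stats import beta

def euclidean_weights(w):
    """Euclidean likelihood masses p_i = (1/k){1 - (wbar - 1/2) S^-2 (w_i - wbar)};
    they satisfy sum p_i = 1 and sum p_i w_i = 1/2 exactly, by construction."""
    k = len(w)
    wbar = w.mean()
    s2 = np.mean((w - wbar) ** 2)
    return (1.0 - (wbar - 0.5) * (w - wbar) / s2) / k

def smooth_euclidean_density(grid, w, nu=50.0):
    """Smooth Euclidean likelihood spectral density: a beta mixture with
    component means at the pseudo-angles and Euclidean likelihood masses."""
    p = euclidean_weights(w)
    dens = np.zeros_like(grid, dtype=float)
    for pi, wi in zip(p, w):
        dens += pi * beta.pdf(grid, nu * wi, nu * (1.0 - wi))
    return dens

rng = np.random.default_rng(0)
w = rng.beta(2.0, 2.0, size=200)        # stand-in pseudo-angles
p = euclidean_weights(w)
print(round(float(p.sum()), 8))         # 1.0: normalization constraint
print(round(float(np.sum(p * w)), 8))   # 0.5: moment constraint
```

Both printed constraints hold exactly by the algebra above, although individual masses can be negative—a known quirk of Euclidean likelihood.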
Another option proposed by de Carvalho et al. (2013) is to consider a similar approach to that of Einmahl and Segers (2009), in which case the vector of probability masses solves the following empirical likelihood (Owen, 2001) problem:
Again by the method of Lagrange multipliers, the solution is $\hat{p}_i = k^{-1} \{1 + \lambda(w_i - 1/2)\}^{-1}$, for $i = 1, \dots, k$, where $\lambda$ is the Lagrange multiplier associated with the second equality constraint in (9.7), defined implicitly as the solution to the equation
This yields the following estimator, known as the smooth empirical likelihood spectral density:
One can readily construct smooth estimators for the corresponding spectral measures; the smooth Euclidean spectral measure and smooth empirical likelihood spectral measure are, respectively, given by
where $I_w(a, b)$ denotes the regularized incomplete beta function, with $a, b > 0$. By construction, both estimators, (9.6) and (9.8), obey the moment constraint, so that, for example,
Put differently, realizations of these random probability measures are elements of the set of spectral measures obeying (9.2). Examples of applications of these estimators in finance can be found in Kiriliouk et al. (2015, Figure 4). At the moment, the large sample properties of these estimators remain unknown.
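As a small numerical check of the empirical likelihood construction, the sketch below assumes the masses $p_i = [k\{1 + \lambda(w_i - 1/2)\}]^{-1}$, with $\lambda$ found by a one-dimensional root search over the constraint equation; the “pseudo-angles” are again simulated stand-ins.

```python
import numpy as np
from scipy.optimize import brentq

def empirical_likelihood_weights(w):
    """Empirical likelihood masses p_i = 1 / [k {1 + lam (w_i - 1/2)}],
    with lam solving sum_i (w_i - 1/2) / {1 + lam (w_i - 1/2)} = 0."""
    k = len(w)
    a = w - 0.5
    # 1 + lam * a_i must remain positive for every pseudo-angle, which
    # brackets lam between -1/max(a) and -1/min(a)
    lo = -1.0 / a.max() + 1e-8
    hi = -1.0 / a.min() - 1e-8
    lam = brentq(lambda l: np.sum(a / (1.0 + l * a)), lo, hi)
    return 1.0 / (k * (1.0 + lam * a))

rng = np.random.default_rng(1)
w = rng.beta(3.0, 2.0, size=300)       # stand-in pseudo-angles, mean away from 1/2
p = empirical_likelihood_weights(w)
print(round(float(p.sum()), 6))        # 1.0: normalization constraint
print(round(float(np.sum(p * w)), 6))  # 0.5: moment constraint
```

Unlike the Euclidean likelihood masses, these weights are positive by construction, at the price of the root search for $\lambda$.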
Other estimators for the spectral measure (obeying (9.2)) can be found in Boldi and Davison (2007), Guillotte et al. (2011), and Sabourin and Naveau (2014).
Formally, $\{P_x : x \in \mathcal{X}\}$ is a set of predictor-dependent (henceforth pd) probability measures if the $P_x$ are probability measures on $([0, 1], \mathcal{B}_{[0,1]})$, indexed by a covariate $x \in \mathcal{X}$; here $\mathcal{B}_{[0,1]}$ is the Borel sigma-algebra on $[0, 1]$. Analogously, I define the following:
And why do we care about pd spectral measures? Pd spectral measures allow us to assess how extremal dependence evolves over a certain covariate $x$; that is, they allow us to model nonstationary extremal dependence structures. Indeed, in many settings of applied interest it seems natural to regard risk from a covariate-adjusted viewpoint, and this leads us to ideas of “conditional risk.” If we want to develop such ideas for bivariate extremes—that is, if we want to assess systematic variation of risk according to a covariate—we need to allow for nonstationary extremal dependence structures.
To describe how extremal dependence may change over a predictor, I now introduce the concept of spectral surface.
A simple spectral surface can be constructed from a parametric spectral density whose parameter is indexed by the predictor. In Figure 9.1, I represent a spectral surface based on such a model; larger values of the predictor lead to larger levels of extremal dependence. Other spectral surfaces can be readily constructed from parametric models for the spectral density; see, for instance, Coles (2001, Section 8.2.1).
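As a toy illustration of a spectral surface (the symmetric beta family and the log-link for its concentration parameter below are my own choices for illustration, not the model behind Figure 9.1): a pd spectral density $h_x(w) = \beta(w; \alpha(x), \alpha(x))$, with $\alpha(x) = \exp(\beta_0 + \beta_1 x)$, satisfies the moment constraint for every $x$ by symmetry, and larger $x$ concentrates mass at $w = 1/2$, that is, stronger extremal dependence.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import beta

def pd_spectral_density(w, x, b0=0.0, b1=1.0):
    """Toy pd spectral density h_x(w) = beta(w; alpha(x), alpha(x)),
    with log-link alpha(x) = exp(b0 + b1 * x).  Symmetry about 1/2
    guarantees the moment constraint for every covariate value x."""
    a = np.exp(b0 + b1 * x)
    return beta.pdf(w, a, a)

grid = np.linspace(0.0, 1.0, 1001)
for x in (0.0, 1.0, 2.0):
    h = pd_spectral_density(grid, x)
    moment = trapezoid(grid * h, grid)   # numerically 1/2 for every x
    print(round(float(moment), 3))
```

Evaluating the surface over a grid of $(w, x)$ pairs gives exactly the kind of plot shown in Figure 9.1, with mass shifting toward $w = 1/2$ as the predictor grows.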
Let's now regard the subject of pd bivariate extremes from another viewpoint. Modeling nonstationarity in marginal distributions has been the focus of much recent literature in applied extreme value modeling; see for instance Coles (2001, Chapter 6). The simplest approach in this setting was popularized long ago by Davison and Smith (1990), and it is based on indexing the location and scale parameters of the generalized EVD by a predictor, say, by taking
And how should one model “nonstationary bivariate extremes,” if one must? Surprisingly, by comparison with the marginal case, approaches to modeling nonstationarity in the extremal dependence structure have received relatively little attention. Such approaches are important for assessing the dynamics governing the extremal dependence of variables of interest. For example, has the extremal dependence between returns of the CAC 40 and the DAX 30 been constant over time, or has it been changing over the years?
By using pd spectral measures, we are essentially indexing the parameter of the bivariate extreme value distribution (the spectral measure $H$) by a covariate, and thus the approach can be regarded as an analogue of the Davison–Smith paradigm in (9.10), but for the bivariate setting. In the same way that (9.10) is a covariate-adjusted version of the generalized EVD (9.1), the following concept can be regarded as a pd version of the bivariate EVD in (9.3).
Similarly to Section 9.2.2, in practice we need to obtain estimates that obey the marginal moment constraint and define a density on the unit interval, for all $x$. It is not straightforward to construct nonparametric estimators able to yield valid pd spectral measures. Indeed, any such estimator, $\widehat{H}_x$, needs to obey the moment constraint, that is, $\int_0^1 w \, \mathrm{d}\widehat{H}_x(w) = 1/2$, for all $x \in \mathcal{X}$. Castro and de Carvalho (2016) and Castro et al. (2015) are currently developing models for these contexts, but there are still plenty of opportunities here.3
Needless to say, other pd objects of interest can readily be constructed. For example, a pd version of the Pickands (1981) dependence function can be defined as $A_x(w)$, for $w \in [0, 1]$, and a pd coefficient of extremal dependence, $\chi_x$, can also be constructed. Using the fact that $\chi = 2\{1 - A(1/2)\}$ (de Carvalho and Ramos, 2012, p. 91), the pd $\chi_x$ can be defined as $\chi_x = 2\{1 - A_x(1/2)\}$, for $x \in \mathcal{X}$.
Beyond pd spectral measures, other families of spectral measures are of interest. In a recent paper, de Carvalho and Davison (2014) proposed a model for a family of spectral measures $\{H_1, \dots, H_K\}$. The applied motivation for the concept was to track the effect of explanatory variables on joint extremes. Put differently, their main concern was with the joint modeling of extremal events when data are gathered from several populations, to each of which corresponds a vector of covariates. Thus, conceptually, some of the ingredients of pd spectral measures and related modeling objectives are already present in de Carvalho and Davison (2014). Each element in the family should be regarded as a “distorted version” of a baseline spectral measure, in a sense that I will make precise in what follows. Formally, spectral density ratio families are defined as follows.
From (9.11), we can write all the normalization and moment constraints for this family as a function of the baseline spectral measure and the tilting parameters, that is,
Inference is based on the combined sample of pseudo-angles from the $K$ spectral distributions. Details on estimation and inference through empirical likelihood methods can be found in de Carvalho and Davison (2011, 2014). An extremely appealing feature of their model is that it allows for borrowing strength across samples, in the sense that the estimate of each spectral measure is based on the pooled pseudo-angles, instead of simply those from the corresponding sample. Although flexible, their approach requires a substantial computational investment; in particular, inference entails intensive constrained optimization problems—even for a moderate $K$—so that the estimates obey empirical versions of the normalization and moment constraints in (9.12). Their approach allows for modeling extremal dependence in settings such as Figure 9.2a, but it excludes data configurations such as Figure 9.2b. The pd-based approach of Castro et al. (2015) allows for inference to be conducted in both settings in Figure 9.2.
The main goal of this section is to describe the link between the specifications underlying the spectral density ratio model, discussed in Section 9.2.4, the proportional tails model (Einmahl et al., 2015), and the exponential families for heavy-tailed data (Fithian and Wager, 2015).
The proportional tails model is essentially an approach for modeling nonstationary extremes. Suppose that at time points $1, \dots, n$ we gather independent observations $X_1, \dots, X_n$, respectively sampled from continuous distribution functions $F_1, \dots, F_n$, all with a common right endpoint $x^\star$. Suppose further that there exists a (time-invariant) baseline distribution function $F$, also with right endpoint $x^\star$, and a continuous function $c$ on $[0, 1]$, such that
Here $c$ is the so-called scedasis density, and following Einmahl et al. (2015) I assume the normalization constraint $\int_0^1 c(s) \, \mathrm{d}s = 1$. Equation (9.13) is the key specification of the proportional tails model. Roughly speaking, the scedasis density tells us how much more (or less) mass there is in the tail $1 - F_i(x)$, relative to the baseline tail, $1 - F(x)$, for a large $x$; uniform scedasis corresponds to a constant frequency of extremes over time.
The question arises naturally: “If the scedasis density provides an indication of the ‘relative frequency’ of extremes over time, would it seem natural that such a function could be somehow connected to the intensity measure of the point process characterization of univariate extremes (Coles, 2001, Section 7.3)?” To get an idea of how the concepts relate, I sketch here a heuristic argument; I stress that the argument is heuristic, and my aim does not go beyond shedding some light on how these ideas connect. Consider the following artificial setting. Suppose that we could gather a large sample from $F$ and that at each time point $i$ we could also collect a large sample from $F_i$, for $i = 1, \dots, n$. Then, the definition of scedasis in (9.13) and arguments similar to those in Coles (2001, Section 4.2.2) suggest that, for a sufficiently large threshold,
where $\Lambda$ is the intensity measure of the limiting Poisson process for univariate extremes (cf. Coles, 2001, Theorem 7.1.1). Thus, it can be seen from (9.14) that, in this artificial setting, the scedasis density can be literally interpreted as a measure of the relative intensity of the extremes at period $i$, with respect to a (time-invariant) baseline.
Another important question is: “How can we estimate the scedasis density?” Einmahl et al. (2015) propose a kernel-based estimator
where $h > 0$ is a bandwidth and $K$ is a kernel; in addition, $X_{n,1} \leq \cdots \leq X_{n,n}$ denote the order statistics of $X_1, \dots, X_n$. Specifically, Einmahl et al. (2015) recommend $K$ to be a symmetric kernel on $[-1, 1]$. A conceptual problem with using a kernel on $[-1, 1]$ is that it allows the scedasis density to put mass outside $[0, 1]$.4 Using ideas similar to those involved in the construction of the smooth spectral density estimators in Section 9.2.2, I propose here the following estimator:
Indeed, each beta density is centered close to $i/n$, with the concentration determined by the parameter controlling the level of smoothing. My goal here is not to recommend one estimator over the other, but rather to provide a brief description of the strengths and limitations of both approaches. In Figure 9.3, I illustrate how the two estimators, (9.15) and (9.16), perform on the same data used by Einmahl et al. (2015) and on simulated data (single-run experiment).5 The data consist of daily negative returns of the Standard and Poor's index from 1988 to 2007, and I use the same value for $k$ and the same bandwidth and (biweight) kernel [$K(s) = \tfrac{15}{16}(1 - s^2)^2$, for $|s| \leq 1$] as the authors; I also follow the authors' settings for the simulated data. Finally, I fix the smoothing parameter for illustration.
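For readers who want to experiment, here is a sketch of both constructions (on simulated stationary data, so the true scedasis is flat). The kernel estimator follows the displayed form of (9.15); the beta mixture is my own variant of the idea behind (9.16), with shape parameters $1 + \nu(i/n)$ and $1 + \nu(1 - i/n)$, chosen so that both parameters stay above one and the density is finite on all of $[0, 1]$.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import beta

def biweight(t):
    """Biweight kernel K(t) = (15/16)(1 - t^2)^2 on [-1, 1]."""
    return np.where(np.abs(t) <= 1.0, 15.0 / 16.0 * (1.0 - t ** 2) ** 2, 0.0)

def scedasis_kernel(s, x, k, h):
    """Kernel scedasis estimator: (1/(kh)) sum_i 1{X_i > X_{n,n-k}} K((s - i/n)/h)."""
    n = len(x)
    thresh = np.sort(x)[n - k - 1]       # (k+1)-th largest observation
    exceed = x > thresh                  # exactly k exceedances for continuous data
    t = np.arange(1, n + 1) / n
    s = np.atleast_1d(s)
    return np.array([np.sum(exceed * biweight((si - t) / h)) for si in s]) / (k * h)

def scedasis_beta(s, x, k, nu=50.0):
    """Beta-mixture sketch: the exceedance at time i/n contributes a beta
    density with shapes 1 + nu*(i/n) and 1 + nu*(1 - i/n), so all of the
    mass stays inside [0, 1] and the density is finite at the endpoints."""
    n = len(x)
    thresh = np.sort(x)[n - k - 1]
    t = np.arange(1, n + 1) / n
    s = np.atleast_1d(s)
    dens = np.zeros_like(s, dtype=float)
    for ti in t[x > thresh]:
        dens += beta.pdf(s, 1.0 + nu * ti, 1.0 + nu * (1.0 - ti)) / k
    return dens

rng = np.random.default_rng(7)
x = rng.pareto(3.0, size=2000)           # stationary series: true scedasis is flat
grid = np.linspace(0.0, 1.0, 501)
c_kern = scedasis_kernel(grid, x, k=100, h=0.1)
c_beta = scedasis_beta(grid, x, k=100)
print(round(float(trapezoid(c_beta, grid)), 2))   # beta mixture integrates to 1
```

On this stationary toy series both estimates hover around one in the interior, with the usual boundary effects showing up only in the kernel version.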
In the Standard and Poor's example in Figure 9.3a, it can be seen that both estimators capture similar dynamics; the gray rectangles represent contraction periods of the US economy as dated by the National Bureau of Economic Research (NBER). It is interesting to observe that the local maxima of the scedasis density are relatively close to economic contraction periods. Indeed, “turning points” (local maxima and minima) of the scedasis density seem like an interesting estimation target for many settings of applied interest.
The estimator in (9.16) has the appealing feature of putting all the mass of the scedasis density inside the unit interval, and some further numerical experiments suggest that it tends to behave similarly to (9.15), except at the boundary. However, a shortcoming of the method in (9.16) is that it may not be defined at the endpoints 0 and 1, and hence it could be inappropriate for forecasting purposes.
The proportional tails model is extremely appealing and simple to fit. A possible shortcoming is that it does not allow the tail index to change over time. For applications in which we suspect that the tail index may change over time, the generalized additive approach of Chavez-Demoulin and Davison (2005) is a sensible alternative; although the model is more challenging to implement, it can be readily fitted with the R package QRM.
A problem that seems relevant in practice is that of cluster analysis for the proportional tails model. To see this, suppose that one estimates the scedasis density and the tail index for several stocks. It seems natural to wonder: “How can we cluster stocks whose scedasis functions look most alike, or—perhaps more interestingly—how can we cluster stocks with both a similar scedasis and a similar tail index?”
Lastly, I would like to comment that it seems conceivable that Bernstein polynomials could be used for scedasis density estimation. In particular, a natural question is “Would it be possible to construct a prior over the space of all integrated scedasis functions?” Random Bernstein polynomials could seem like the way to go; see Petrone (1999) and references therein.
In this section, I sketch some basic ideas on exponential families for heavy-tailed data; I will be briefer here than in Section 9.3.1. My goal is mainly to introduce the model specification and move on; further details can be found in Fithian and Wager (2015).
The starting point for the Fithian–Wager approach is to model the conditional right tail law of a population of interest, $P_1(\cdot \mid X > u)$, as an exponential family with carrier measure $P_0(\cdot \mid X > u)$, for a sufficiently large threshold $u$. Two random samples are assumed to be available, one from $P_0$ and one from $P_1$, with sizes $n_0 \gg n_1$; hence, the applied setting of interest is one where the sample from $P_0$ is much larger than the one from $P_1$. The model specification is
where the sufficient statistic takes a specific parametric form; this functional form is motivated by the case where $P_0$ and $P_1$ are generalized Pareto distributions (cf. Fithian and Wager, 2015, p. 487).
In common with the spectral density ratio model, the Fithian–Wager model is motivated by the gains from borrowing strength across samples. Fithian and Wager are not, however, concerned with spectral measures, but rather with estimating a (small-sample) mean of a heavy-tailed distribution by borrowing information from a much larger sample from a related population with the same tail behavior. More concretely, the authors propose a semiparametric method for estimating the mean of $P_1$, using the decomposition $\mu = P(X \leq u) \, E(X \mid X \leq u) + P(X > u) \, \{u + M(u)\}$, where $M(u) = E(X - u \mid X > u)$ is the mean residual life. The Fithian–Wager estimator for the (small-sample) mean can be written as
where the plug-in components are computed from the data, for a large threshold $u$. Here, the tilt can be computed through a logistic regression with an intercept and the sufficient statistic as predictor, as a consequence of results on imbalanced logistic regression (Owen, 2007). As can be observed from (9.18), the main trick in the estimation of the mean is the exponential tilt-based estimator of the mean residual lifetime $M(u)$.
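The Fithian–Wager tilt itself requires the second, larger sample, but the decomposition that (9.18) plugs into is easy to sketch on a single sample. Below, the mean residual life is estimated by a plain GPD fit to the exceedances, $M(u) \approx \hat{\sigma}/(1 - \hat{\xi})$ for $\hat{\xi} < 1$; this illustrates the decomposition only, not the exponential tilting step.

```python
import numpy as np
from scipy.stats import genpareto

def tail_decomposed_mean(x, u):
    """Mean via mu = P(X<=u) E[X | X<=u] + P(X>u) (u + M(u)), with the
    mean residual life M(u) = sigma/(1 - xi) taken from a GPD fit to the
    exceedances over u (valid for xi < 1)."""
    below = x[x <= u]
    exc = x[x > u] - u
    p_tail = exc.size / x.size
    xi_hat, _, sigma_hat = genpareto.fit(exc, floc=0)
    m_u = sigma_hat / (1.0 - xi_hat)     # GPD mean residual life
    return (1.0 - p_tail) * below.mean() + p_tail * (u + m_u)

rng = np.random.default_rng(3)
x = genpareto.rvs(c=0.2, scale=1.0, size=5000, random_state=rng)
u = np.quantile(x, 0.9)
print(tail_decomposed_mean(x, u))   # should be close to the true mean 1/(1 - 0.2)
```

Replacing the single-sample GPD fit of $M(u)$ by the exponential tilt-based estimator—fitted on the pooled exceedances—is precisely where the Fithian–Wager approach gains efficiency.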
From the previous sections, it may have become obvious that the common link underlying the specifications of the spectral density ratio model, the proportional tails model, and the exponential families for heavy-tailed data is the assumption that all members of a family of interest are obtained through a suitable “distortion” of a certain baseline measure. In this section, I make this link more precise.
Some examples are presented in the succeeding text.
Here, I comment on the need for further developing models compatible with both asymptotic dependence and asymptotic independence. In two influential papers, Poon et al. (2003, 2004) put forward that asymptotic independence is observed in many pairs of stock market returns. This had important consequences in finance, mostly because inferences in a seminal paper (Longin and Solnik, 2001) had been based on the assumption of asymptotic dependence, and hence risk had perhaps been overestimated earlier. However, an important question is: “What if pairs of financial losses can move over time from asymptotic independence to asymptotic dependence, and the other way around?” Some markets are believed to be more integrated these days than in the past, so for such markets it is relevant to ask whether they could have entered an “asymptotic dependence regime.” An accurate answer to this question would, however, require models able to allow for smooth transitions from asymptotic independence to asymptotic dependence, and vice versa; but, as already mentioned in Section 9.2.3, there is at the moment a shortage of models for nonstationary extremal dependence structures. Wadsworth et al. (2016) present an interesting approach for modeling asymptotic (in)dependence.
An important reference here is Genton et al. (2015), but there is a wealth of problems to work on in this direction, so I stop my comment here.
Is there a way to reduce dimension in such a way that the interesting features of the data—in terms of tails of multivariate distributions—are preserved?6 I think it is fair to say that, apart from some remarkable exceptions, most models for multivariate extremes have been applied only to low-dimensional settings. I remember that at a seminal workshop on high-dimensional extremes, organized by Anthony Davison at the Ecole Polytechnique Fédérale de Lausanne (September 14–18, 2009), “high dimensional” actually meant “two dimensional” for most talks—and all speakers were top scientists in the field.
Principal component analysis (PCA) itself would seem inappropriate, since principal axes are constructed so as to find the directions that account for most variation, and for our axes of interest (whatever they are…), variation does not seem to be the most reasonable objective. A naive approach could be to use PCA for compositional data (Jolliffe, 2002, Section 13.3) and apply it to the pseudo-angles themselves. Such an approach could perhaps provide a simple way to disentangle dependence into components of practical interest.
Theory and methods are the backbone of our field; without regular variation, we would not have gone far anyway. But, beyond theory, should our community be investing even more than it already is in modeling and applications? As put simply by Box (1979), “all models are wrong, but some are useful.” However, while most of us agree that models only provide an approximation to reality, we seem to be very demanding about the way we develop theory for such—wrong yet useful—models. Some models entail ingenious approximations to reality and yet are very successful in practice. Should we venture more in this direction in the future? Applied work can also motivate new, and useful, theories. Should we venture more into collaborating with researchers from other fields, or into creating more conferences such as the ESSEC Conference on Extreme Events in Finance, where one has the opportunity to regard risk and extremes from a broader perspective, so as to think out of the box? Should the journal Extremes include an Applications and Case Studies section?
What has our community been supplying in terms of communication of risk and extremes? Silence, for the most part. There have certainly been some noteworthy initiatives, but perhaps mostly from people outside our field, such as David Spiegelhalter and David Hand. My own view is that it would be excellent if, in the near future, leading scientists in our field could be more involved in communicating risk and extremes to the general public, either by writing newspaper and magazine articles or by promoting the popularization of science. Our community is becoming more and more aware of this need, I think; I was happy to see Paul Embrechts recently showing his concern about this matter at EVA 2015 in Ann Arbor.
How can we accurately elicit prior information when modeling extreme events in finance, in cases where a conflict of interest may exist? Suppose that a regulator requires a bank to report an estimate. If prior information is gathered from a bank expert—and if the bank is better off by misreporting—then how can we trust the accuracy of the inferences? In such cases, I think the only Bayesian analysis a regulator should be willing to accept is an objective Bayes-based analysis; see Berger (2006) for a review of objective Bayes.