Chapter 2
This chapter introduces the basic model descriptions of survival data. The distribution of survival data may be either continuous or discrete, but the nonparametric estimators of the distribution of survival time are discrete in any case. Therefore, some familiarity with discrete models is necessary before the estimators can be introduced.
In the present chapter, only nonparametric estimation is discussed. This means that no assumptions whatsoever are made about the true underlying distribution. Parametric models are presented in Chapter 5.
Traditional statistical model descriptions are based on the density and the cumulative distribution functions. These functions are not so suitable for use with censored and/or truncated data. The survival and hazard functions are better suited, as will be demonstrated here. It should, however, be acknowledged that all these functions are simple functions of each other, so the information they convey is in principle the same, it is only convenience that determines which one(s) to choose and use in a particular application.
We start with a motivating example, the life table, and its connection to death risks and survival probabilities in a population; see Table 2.1.
Life table for Swedish females 2010

| age(x) | pop | deaths | risk(%) | alive at age x |
|---|---|---|---|---|
| 0 | 55 407 | 117 | 0.211 | 100 000 |
| 1 | 54 386 | 22 | 0.040 | 99 789 |
| 2 | 53 803 | 11 | 0.020 | 99 748 |
| 3 | 53 486 | 7 | 0.013 | 99 728 |
| 4 | 52 544 | 4 | 0.008 | 99 715 |
| ⋯ | ⋯ | ⋯ | ⋯ | ⋯ |
| 96 | 3 074 | 1 030 | 33.507 | 6 529 |
| 97 | 2 204 | 817 | 37.061 | 4 341 |
| 98 | 1 473 | 624 | 42.363 | 2 732 |
| 99 | 920 | 433 | 47.065 | 1 575 |
| 100+ | 1 400 | 837 | 59.786 | 834 |
Example 7 Swedish female life table 2010.
Statistics Sweden (SCB) is a government agency that produces statistics and has a coordinating role for the official statistics of Sweden. From their home page (http://www.scb.se) it is possible to download vital population statistics, like population size and number of deaths by age, gender, and year. From such data it is possible to construct a life table; see Table 2.1. It is constructed in the following way. The first column, headed by age(x), contains the ages in completed years. The second column, headed pop, contains the number of women in Sweden at the corresponding ages. It is equal to the mean population size over the year. Since population numbers from the SCB are taken at the turn of the year, the numbers in column two are averages of two columns in the SCB file. The third column contains the total number of deaths in each age category during the year. The fourth and fifth columns are derived from the previous ones. The risk is essentially the ratio between the number of deaths and the population size for each age.
Column five, alive at age x, contains the actual construction of the life table. It depicts the size of a birth cohort, year by year, exposed to the death risks of column four. It starts with 100 000 newborn females. During the first year of life the death risk is 0.211%, meaning that we expect 0.00211 × 100 000 = 211 girls to die in the first year. In other words, 100 000 − 211 = 99 789 girls will remain alive at their first anniversary (at exact age one year). That is the number in the second row of the fifth column. The calculations then continue in the same way through the remaining rows. Note, though, that the last interval (100+) is an open interval; that is, it has no fixed upper limit.
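The construction of the cohort column can be sketched in a few lines of code. The following Python snippet (an illustration only; the book's own examples are in R) applies the rounded risks printed in Table 2.1 to a starting cohort of 100 000. Because the printed risks are rounded to three decimals, later cohort sizes may differ from the table by an individual or two.

```python
# Reproduce the "alive at age x" column of Table 2.1 from the death risks.
# The risks are in percent, as printed in the table (rounded to 3 decimals).
risks_percent = [0.211, 0.040, 0.020, 0.013, 0.008]

alive = [100_000]  # the synthetic cohort starts with 100 000 newborn females
for r in risks_percent:
    deaths = round(alive[-1] * r / 100)  # expected deaths during the year
    alive.append(alive[-1] - deaths)

print(alive)  # [100000, 99789, 99749, 99729, 99716, 99708]
```

The first step reproduces the table exactly (100 000 − 211 = 99 789); the small later discrepancies come from the rounding of the published risks.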
Note the difference in interpretation between the numbers in the second column (“pop”) and those in the last (“alive at age x”); the former give the age distribution in a population in a given year (referred to as period data), while the latter constitute synthetic cohort data; that is, a group of newborns is followed from birth to death, and the size of the (still alive) cohort is recorded age by age. That is, “individuals” in the synthetic cohort live their whole lives under the mortality conditions of the year 2010. Of course, no real individual can experience this scenario, hence the name “synthetic cohort.” Nevertheless, it is a useful tool to illustrate the state of affairs regarding mortality in a given year.
It makes sense to plot the life table; see Figure 2.1. By dividing all numbers by the original cohort size (in our case 100 000), they may be interpreted as probabilities.
The survival function S(t) is defined as the probability of surviving past t, or with a formula,

S(t) = P(T ≥ t), t > 0,
where T is the (random) life length under study. The symbol P(T ≥ t) reads “the probability that T is equal to or larger than t”; here t is a fixed number while T denotes a “random quantity.” In statistical language, T is called a random variable. In our example, T is the (future) life length of a randomly chosen newborn female, of course, unknown at birth.
We will, for the time being, assume that S is a “nice” function, smooth and differentiable. That will ensure that the following definitions are unambiguous.
The density function f is defined as minus the derivative of the survival function, or
f(t) = −(∂/∂t) S(t), t > 0.
The intuitive interpretation of this definition is that, for small enough s, the following approximation is close to the true value:
P(t₀ ≤ T < t₀ + s) ≈ s f(t₀).
This is illustrated in Figure 2.2; for a short enough interval (t0, t0 + s], the probability of an observation falling in that interval is well approximated by the area of a rectangle with sides of lengths s and f(t0), or sf(t0). The formal mathematical definition as a limit is given in equation (2.1).
f(t) = lim_{s→0} P(t ≤ T < t + s) / s, t > 0. (2.1)
The hazard function is central to the understanding of survival analysis, so the reader is recommended to get at least an intuitive understanding of it. One way of thinking of it is as an “instant probability”; at a given age t, it measures the risk of dying in a short interval (t, t + s) immediately after t, for an individual who is still alive at t.
h(t) = lim_{s→0} P(t ≤ T < t + s | T ≥ t) / s, t > 0,

and, in analogy with the density, for small enough s,

P(t₀ ≤ T < t₀ + s | T ≥ t₀) ≈ s h(t₀).
Note the difference between the density and the hazard functions. The former is (the limit of) an unconditional probability, while the latter is (the limit of) a conditional probability. Otherwise, the hazard function is also a kind of density function, with a similar graphical interpretation; see Figure 2.3.
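As noted earlier, all these functions are simple functions of each other. One step worth making explicit: combining the limit definitions of f and h with the definition of conditional probability gives

```latex
h(t) = \lim_{s \to 0} \frac{P(t \le T < t + s \mid T \ge t)}{s}
     = \lim_{s \to 0} \frac{P(t \le T < t + s)}{s \, P(T \ge t)}
     = \frac{f(t)}{S(t)}, \quad t > 0,
```

so the hazard is the density rescaled by the probability of still being at risk.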
The cumulative hazard function is defined as the integral of the hazard function,
H(t) = ∫₀ᵗ h(s) ds, t ≥ 0.
That is, an intuitive interpretation is that the cumulative hazard function successively accumulates the instant risks.
The cumulative hazard function is important because it is fairly easy to estimate nonparametrically (i.e., without any restrictions), in contrast to the hazard and density functions.
Example 8 The exponential distribution
Perhaps the simplest continuous life length distribution is the exponential distribution. It is simple because its hazard function is constant:
h(t) = λ, λ > 0, t ≥ 0.

From this, it is easy to calculate the other functions that characterize the exponential distribution. The cumulative hazard function is

H(t) = λt, λ > 0, t ≥ 0,

the survival function is

S(t) = e^{−λt}, λ > 0, t ≥ 0,

and the density function is

f(t) = λe^{−λt}, λ > 0, t ≥ 0.
The property of constant hazard implies no aging. This is not a realistic property for human mortality, but, as we will see, it is a useful benchmark, and a useful building block for modeling mortality over short time intervals (piecewise constant hazard). The exponential distribution is described in detail in Appendix B.
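The relations between the four functions are easy to verify numerically. The Python sketch below (an illustration; λ = 0.5 and t = 3 are arbitrary choices) checks that H(t) = −log S(t) and f(t) = h(t)S(t) hold for the exponential distribution:

```python
import math

lam = 0.5  # hazard level (arbitrary choice for illustration)
t = 3.0

h = lam                       # constant hazard
H = lam * t                   # cumulative hazard
S = math.exp(-lam * t)        # survival function
f = lam * math.exp(-lam * t)  # density

# General relations tying the four functions together:
assert math.isclose(H, -math.log(S))  # H(t) = -log S(t)
assert math.isclose(f, h * S)         # f(t) = h(t) S(t)
```

These two identities hold for any continuous life length distribution, not only the exponential; the constant hazard is what makes the exponential special.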
So far we have assumed, implicitly or explicitly, that time is continuous. We will now introduce discrete time survival models, and the reason is two-fold: (i) Even if data are generated from truly continuous time models, nonparametric estimation of these models will, as will be shown later, give rise to estimators corresponding to a discrete time model. This is an inherent property of nonparametric maximum likelihood estimators. Thus, in order to study the properties of these estimators, we need some knowledge of discrete time models. (ii) Data are discrete, usually through grouping. For instance, life lengths may be measured in full years, introducing tied data.
It is important to realize that in practice all data are discrete. For instance, it is impossible to measure time with infinite precision. Therefore, all data are more or less rounded. If data are so much rounded that the result is heavily tied data, true discrete-data models are called for.
Discrete time models will now be introduced. Let R be a discrete random variable with support (r₁, r₂, ..., r_k) (positive real numbers, usually 1, 2, ... or 0, 1, 2, ...) and probability mass function

p_i = P(R = r_i), i = 1, ..., k,

with p_i > 0, i = 1, ..., k, and ∑_{i=1}^k p_i = 1.
Then
F(t) = ∑_{i: r_i ≤ t} p_i, −∞ < t < ∞,

is the cumulative distribution function, and

S(t) = ∑_{i: r_i ≥ t} p_i, −∞ < t < ∞,

is the survival function.
The discrete time hazard function is defined as

h_i = P(R = r_i | R ≥ r_i) = p_i / (p_i + ··· + p_k), i = 1, ..., k. (2.3)

Note that here, the hazard at any given time point is a conditional probability, so it must always be bounded to lie between zero and one. In the continuous case, on the other hand, the hazard function may take any positive value. Further note that if, like here, the support is finite, the last “hazard atom” is always equal to one (having lived to the “last station,” one is bound to die).

The system (2.3) of equations has a unique solution, easily found by recursion:

p_i = h_i ∏_{j=1}^{i−1} (1 − h_j), i = 1, ..., k.

From this, we get the discrete time survival function at each support point as

S(r_i) = ∑_{j=i}^{k} p_j = ∏_{j=1}^{i−1} (1 − h_j), i = 1, ..., k,

and the general definition

S(t) = ∏_{j: r_j < t} (1 − h_j), t ≥ 0.

It is easily seen that S is decreasing, S(0) = 1, and S(∞) = 0, as it should be.
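The one-to-one correspondence between the probability mass function and the hazard atoms h_i = P(R = r_i | R ≥ r_i) can be checked in code. Below is a small Python sketch (illustrative, with an arbitrary probability mass function) that computes the hazard atoms and then recovers the probabilities by the recursion:

```python
p = [0.2, 0.3, 0.5]  # an arbitrary pmf on support (r1, r2, r3)

# Hazard atoms: h_i = p_i / (p_i + ... + p_k).
h = [p[i] / sum(p[i:]) for i in range(len(p))]
# h is approximately [0.2, 0.375, 1.0]; the last atom is always 1
# when the support is finite.

# Recover the pmf by the recursion: p_i = h_i * prod_{j<i} (1 - h_j).
surv = 1.0  # running value of prod_{j<i} (1 - h_j)
p_back = []
for hi in h:
    p_back.append(hi * surv)
    surv *= 1.0 - hi

assert all(abs(a - b) < 1e-12 for a, b in zip(p, p_back))
```

The running product `surv` is exactly the survival function evaluated at the successive support points.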
Example 9 The geometric distribution
The geometric distribution has support on {1, 2, ...} (another version also includes zero in the support; this is the case for the one in R), and the hazard function is constant:

h_i = h, 0 < h < 1, i = 1, 2, ....

Thus, the geometric distribution is the discrete analogue of the exponential distribution in that it implies no aging. The probability mass function is, from (2.3),

p_i = h(1 − h)^{i−1}, i = 1, 2, ...,

and the survival function becomes, from (2.3),

S(t) = (1 − h)^{[t]}, t ≥ 0,

where [t] denotes the largest integer smaller than or equal to t (rounding downwards). In Figure 2.4 the four functions for a geometric distribution with h = 0.25 are plotted.
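A quick numerical check of the geometric case, with h = 0.25 as in Figure 2.4 (a Python sketch, for illustration):

```python
h = 0.25  # constant discrete hazard, as in Figure 2.4

# pmf: p_i = h * (1 - h)^(i - 1), i = 1, 2, ...
p = [h * (1 - h) ** (i - 1) for i in range(1, 6)]

# The probability of surviving past support point i is
# 1 - (p_1 + ... + p_i) = (1 - h)^i, matching S(t) = (1 - h)^[t]
# for non-integer t between the support points.
S_after = [1 - sum(p[:i]) for i in range(1, 6)]
for i in range(1, 6):
    assert abs(S_after[i - 1] - (1 - h) ** i) < 1e-12
```

With h = 0.25 the first few survival values are 0.75, 0.5625, 0.421875, ..., decreasing geometrically as the theory prescribes.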
As an introductory example, look at an extremely simple data set: 4, 2*, 6, 1, 3* (starred observations are right censored; in the figure, deaths are marked with +, censored observations with a small circle); see Figure 2.5.
How should the survival function be estimated? The answer is that we take it in steps. First, the hazard “atoms” are estimated. It is done nonparametrically, and the result as such is not very useful. Its potential lies in that it is used as the building block in constructing estimates of the cumulative hazards and survival functions.
In Figure 2.6, the observed event times of Figure 2.5 are marked by vertical dashed lines at durations 1, 4, and 6, respectively. In the estimation of the hazard atoms, the concept of a risk set is of vital importance.
The risk set R(t) at duration t, t > 0, is defined mathematically as

R(t) = {i : individual i is present and under observation just prior to t},

or in words, the risk set at time t consists of all individuals present and under observation just prior to t. The reason we do not say “present at time t” is that it is vital to include those individuals who have an event or are right censored at the exact time t. In our example, the risk sets R(1), R(4), and R(6) are the interesting ones. Numbering the five individuals in the order their observations were listed (so individual 1 has life length 4, individual 2 is censored at 2, and so on), they are

R(1) = {1, 2, 3, 4, 5}, R(4) = {1, 3}, R(6) = {3}.

The estimation of the hazard atoms is simple. First, we assume that the probability of an event at times where no event is observed is zero. Then, at times where events do occur, we count the number of events and divide that number by the size of the corresponding risk set. The result is (see also Figure 2.7)

ĥ(1) = 1/5 = 0.2, ĥ(4) = 1/2 = 0.5, ĥ(6) = 1/1 = 1.
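The whole computation for the toy data set can be carried out by hand, or mechanically as in this Python sketch (an illustration, not the book's R code; the event indicator is 1 for a death and 0 for a right-censored observation):

```python
# Toy data: (exit time, event indicator) for the sample 4, 2*, 6, 1, 3*.
data = [(4, 1), (2, 0), (6, 1), (1, 1), (3, 0)]

event_times = sorted({t for t, d in data if d == 1})  # [1, 4, 6]

atoms = {}
for t in event_times:
    # Size of the risk set R(t): everyone still under observation
    # just prior to t (i.e., with exit time >= t).
    at_risk = sum(1 for time, _ in data if time >= t)
    deaths = sum(1 for time, d in data if time == t and d == 1)
    atoms[t] = deaths / at_risk

print(atoms)  # {1: 0.2, 4: 0.5, 6: 1.0}
```

Note how the censored observations at 2 and 3 contribute to the risk set at duration 1 but not at durations 4 and 6.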
As is evident from Figure 2.7, the estimated hazard atoms are too irregular to be of practical use; they need smoothing. The simplest way of smoothing them is to calculate the cumulative sums, which leads to the Nelson–Aalen estimator of the cumulative hazards function; see the next section. There are more direct smoothing techniques to get reasonable estimators of the hazard function itself, for example, kernel estimators, but they will not be discussed here. See, for example, Silverman (1986) for a general introduction to kernel smoothing.
From the theoretical relation H(t) = ∑_{s ≤ t} h(s), we immediately get

Ĥ(t) = ∑_{s ≤ t} ĥ(s),

which is the Nelson–Aalen estimator (Nelson 1972, Aalen 1978); see Figure 2.8. The sizes of the jumps are equal to the heights of the “spikes” in Figure 2.7.
From the theoretical relation (2.3), we get

Ŝ(t) = ∏_{s < t} (1 − ĥ(s)), (2.5)

see also Figure 2.9. Equation (2.5) may be given a heuristic interpretation: in order to survive time t, one must survive all “spikes” (or shocks) that come before time t. The multiplication principle for conditional probabilities then gives equation (2.5).
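Continuing the toy example, the hazard atoms 0.2, 0.5, and 1 at durations 1, 4, and 6 yield the two estimators by cumulative summation and by cumulative multiplication, respectively (a Python sketch, for illustration):

```python
atoms = {1: 0.2, 4: 0.5, 6: 1.0}  # hazard atoms from the toy data set

def nelson_aalen(t):
    """Cumulative hazard estimate: sum of atoms at times <= t."""
    return sum(h for s, h in atoms.items() if s <= t)

def kaplan_meier(t):
    """Survival estimate: product of (1 - atom) over times < t."""
    surv = 1.0
    for s, h in atoms.items():
        if s < t:
            surv *= 1.0 - h
    return surv

na5 = nelson_aalen(5)   # 0.2 + 0.5
km6 = kaplan_meier(6)   # (1 - 0.2) * (1 - 0.5)
```

Both estimators are step functions: the Nelson–Aalen estimator jumps up at each event time, while the Kaplan–Meier estimator jumps down just after it.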
For the male mortality data set mort,
> require(eha)
> data(mort)
> head(mort)
id enter exit event birthdate ses
1 1 0.000 20.000 0 1800.010 upper
2 2 3.478 17.562 1 1800.015 lower
3 3 0.000 13.463 0 1800.031 upper
4 3 13.463 20.000 0 1800.031 lower
5 4 0.000 20.000 0 1800.064 lower
6 5 0.000 0.089 0 1800.084 lower
an indicator for death is introduced, called event. The value 1 indicates that the corresponding life time is fully observed, while the value 0 indicates a right-censored lifetime. Another common coding scheme is TRUE and FALSE, respectively.
In the R package eha, the plot function can be used to plot both Nelson–Aalen curves and Kaplan–Meier curves. The code for the Nelson–Aalen and Kaplan–Meier plots is
> par(mfrow = c(1, 2)) # Two plots, "one row, two columns".
> with(mort, plot(Surv(enter, exit, event), fn = "cum"))
> with(mort, plot(Surv(enter, exit, event), fn = "surv"))
and the result is seen in Figure 2.10. The package eha must be loaded for this to work. Note the use of the function with; it tells the plot function that it should get its data (enter, exit, event) from the data frame mort. The function Surv from the package survival creates a “survival object,” which is used in many places. It is, for instance, the response in all functions that perform regression analysis on survival data.
Note that the “Duration” in Figure 2.10 is duration (in years) since the day each man became 20 years of age. The men are followed until death or age 40, whichever comes first. The right-hand panel shows that approximately 25% of the men alive at age 20 died before they reached 40.
If the Kaplan–Meier estimator is wanted as numbers, use survfit in combination with coxreg or coxph. Since the output will have approximately as many rows as there are distinct event times, we illustrate it by taking a (small) random sample from mort.
> indx <- sample(NROW(mort), size = 50, replace = FALSE)
> rsa <- mort[indx,]
> fit <- coxph(Surv(enter, exit, event) ~ 1, data = rsa)
> s.fit <- survfit(fit)
> options(width=70)
> summary(s.fit)
Call: survfit(formula = fit)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0.526 46 1 0.978 0.0213 0.938 1.000
3.615 39 1 0.954 0.0321 0.893 1.000
4.036 38 1 0.929 0.0397 0.854 1.000
5.334 36 1 0.904 0.0460 0.818 0.998
8.653 30 1 0.874 0.0532 0.776 0.985
9.690 30 1 0.845 0.0587 0.738 0.968
10.139 29 1 0.817 0.0633 0.702 0.951
10.462 28 1 0.788 0.0672 0.667 0.931
13.515 29 1 0.761 0.0701 0.636 0.912
13.584 28 1 0.735 0.0725 0.605 0.891
The functions coxreg and coxph both fit Cox regression models, which is the topic of the next chapter. Here we fit a regression model with just a constant term and no covariates. This results in estimation of the “baseline hazards” which survfit transforms to elements of the Kaplan–Meier estimator seen above. Note also that, by default, a 95% confidence band is produced around the estimator.
You can also plot the output of the fit (s.fit) by simply typing
> plot(s.fit)
The result is not shown here. It will look similar to the right-hand panel of Figure 2.10.
It is also possible to fit a parametric model to data with the aid of the function phreg (it also works with the function aftreg). Just fit a “parametric proportional hazards model with no covariates.” Note that the default distribution in phreg is the Weibull distribution, which means that if no distribution is specified (through the argument dist), then the Weibull distribution is assumed.

> par(mfrow = c(1, 2))
> fit.w <- phreg(Surv(enter, exit, event) ~ 1, data = mort)
> plot(fit.w, fn = c("cum"))
> plot(fit.w, fn = c("sur"))
We can also show the parameter estimates. For the interpretations of these, see Appendix B, the Weibull distribution.
> fit.w
Call:
phreg(formula = Surv(enter, exit, event) ~ 1, data = mort)
Covariate W.mean Coef Exp(Coef) se(Coef) Wald p
log(scale) 3.785 44.015 0.066 0.000
log(shape) 0.332 1.393 0.058 0.000
Events 276
Total time at risk 17038
Max. log. likelihood -1399.3
More about this will be discussed in Chapter 6.