Chapter 2

Single Sample Data

2.1 Introduction

The basic model descriptions of survival data are introduced. Basically, the distribution of survival data may be either continuous or discrete, but the nonparametric estimators of the distributions of survival time are discrete in any case. Therefore, when introducing the estimators, the discrete models are necessary to know.

In the present chapter, only nonparametric estimation is discussed. This means that no assumptions whatsoever are made about the true underlying distribution. Parametric models are presented in Chapter 5.

2.2 Continuous Time Model Descriptions

Traditional statistical model descriptions are based on the density and the cumulative distribution functions. These functions are not so suitable for use with censored and/or truncated data. The survival and hazard functions are better suited, as will be demonstrated here. It should, however, be acknowledged that all these functions are simple functions of each other, so the information they convey is in principle the same, it is only convenience that determines which one(s) to choose and use in a particular application.

2.2.1 The Survival Function

We start with a motivating example, the life table, and its connection to death risks and survival probabilities in a population; see Table 2.1.

Table 2.1

Life table for Swedish females 2010

age(x)

pop

deaths

risk(%)

alive at age x

0

55 407

117

0.211

100000

1

54 386

22

0.040

99 789

2

53 803

11

0.020

99 748

3

53 486

7

0.013

99 728

4

52 544

4

0.008

99 715

96

3 074

1 030

33.507

6 529

97

2 204

817

37.061

4 341

98

1 473

624

42.363

2 732

99

920

433

47.065

1 575

100+

1 400

837

59.786

834

Example 7 Swedish female life table 2009.

Statistics Sweden (SCB) is a government agency that produces statistics and has a coordinating role for the official statistics of Sweden. From their home page (http://www.scb.se) it is possible to download vital population statistics, like population size and number of deaths by age, gender, and year. From such data it is possible to construct a life table; see Table 2.1. It is constructed in the following way. The first column, headed by age(x), contains the ages in completed years. The second column, headed pop, contains the numbers of women in Sweden in the corresponding ages. It is equal to the mean population size over the year. Since population numbers from the SCB are taken at the turn of the years, the numbers in column two are averages of two columns in the SCB file. The third column is the total numbers of deaths in each age category during the year. The fourth and fifth columns are derived from the previous ones. The risk is essentially the ratio between the number of deaths and population size for each age.

Column five, alive at age x, contains the actual construction of the life table. It depicts the actual size of a birth cohort, year by year, exposed to the death risks of column four. It starts with 100 000 newborn females. Under the first year of life the death risk is 0.211%, meaning that we expect 0.00211 × 100 000 = 211 girls to die the first year. In other words, 100 000−211 = 99789 girls will remain alive at their first anniversary (at exact age one year). That is the number in the second row in the fifth column. The calculations then continue in the same way through the last two columns. Note, though, that the last interval (100+) is an open interval; that is, it has no fixed upper limit.

Note the difference in interpretation between the numbers in the second column (“pop”) and those in the last (“alive at age x”); the former give the age distribution in a population a given year (referred to as period data), while the latter constitute synthetic cohort data; that is, a group of newborn are followed from birth to death, and the actual size of the (still alive) cohort is recorded age by age. That is, “individuals” in the synthetic cohort live their whole lives under the mortality conditions of the year 2009. Of course, no real individual can experience this scenario, hence the name “synthetic cohort.” Nevertheless, it is a useful tool to illustrate the state of affairs regarding mortality in a given year.

It makes sense to plot the life table, see Figure 2.1. By dividing all numbers by the original cohort size (in our case 100,000), they may be interpreted as probabilities.

Figure 2.1

Figure showing life table and survival function, females, sweden 2009. median life length is 86; see construction in figure.

Life table and survival function, females, Sweden 2009. Median life length is 86; see construction in figure.

The survival function S(t), t > 0, is defined as the probability of surviving past t, t > 0, or with a formula,

S(t)=P(Tt),t>0,

where T is the (random) life length under study. The symbol P(Tt) reads “the probability that T is equal to or larger than t”; here t is a fixed number while T denotes a “random quantity.” In statistical language, T is called a random variable. In our example, T is the (future) life length of a randomly chosen newborn female, of course, unknown at birth.

We will, for the time being, assume that S is a “nice” function, smooth and differentiable. That will ensure that the following definitions are unambiguous.

2.2.2 The Density Function

The density function f is defined as minus the derivative of the survival function, or

f(t)=tS(t),t>0.

The intuitive interpretation of this definition is that, for small enough s, the following approximation is close to the true value:

P(t0T<t0+s)sf(t0)

This is illustrated in Figure 2.2; for a short enough interval (t0, t0 + s], the probability of an observation falling in that interval is well approximated by the area of a rectangle with sides of lengths s and f(t0), or sf(t0). The formal mathematical definition as a limit is given in equation (2.1).

f(t)=lims0P(tT<t+s)s,t>0.(2.1)

Figure 2.2

Figure showing interpretation of the density function f.

Interpretation of the density function f.

2.2.3 The Hazard Function

The hazard function is central to the understanding of survival analysis, so the reader is recommended to get at least an intuitive understanding of it. One way of thinking of it is as an “instant probability”; at a given age t, it measures the risk of dying in a short interval (t, t + s) immediately after t, for an individual who still is alive at t.

h(t)=lims0P(tT<t+s|Tt)sP(t0T<t0+s|Tt0)sh(t0)

Note the difference between the density and the hazard functions. The former is (the limit of) an unconditional probability, while the latter is (the limit of) a conditional probability. Otherwise, the hazard function is also a kind of density function, with a similar graphical interpretation; see Figure 2.3.

Figure 2.3

Figure showing interpretation of the hazard function h.

Interpretation of the hazard function h.

2.2.4 The cumulative hazard function

The cumulative hazard function is defined as the integral of the hazard function,

H(t)=t0h(s)ds,t0.

That is, an intuitive interpretation is that the cumulative hazard function successively accumulates the instant risks.

The cumulative hazard function is important because it is fairly easy to estimate nonparametrically (i.e., without any restrictions), in contrast to the hazard and density functions.

Example 8 The exponential distribution

Perhaps the simplest continuous life length distribution is the exponential distribution. It is simple because its hazard function is constant:

h(t)=λ,λ>0,  t0.

From this, it is easy to calculate the other functions that characterize the exponential distribution. The cumulative hazards function is

H(t)=λt,λ>0,  t0,

the survival function is

S(t)=eλt,λ>0,  t0,

and the density function is

f(t)=λeλt,λ>0,  t0.

The property of constant hazard implies no aging. This is not a realistic property for human mortality, but, as we will see, a useful benchmark, and a useful model for modelling mortality over short time intervals (piece-wise constant hazard). The exponential distribution is described in detail in Appendix B.

2.3 Discrete Time Models

So far we have assumed, implicitly or explicitly, that time is continuous. We will now introduce discrete time survival models, and the reason is two-fold: (i) Even if data are generated from truly continuous time models, nonparametric estimation of these models will, as will be shown later, give rise to estimators corresponding to a discrete time model. This is an inherent property of nonparametric maximum likelihood estimators. Thus, in order to study the properties of these estimators, we need some knowledge of discrete time models. (ii) Data are discrete, usually through grouping. For instance, life lengths may be measured in full years, introducing tied data.

It is important to realize that in practice all data are discrete. For instance, it is impossible to measure time with infinite precision. Therefore, all data are more or less rounded. If data are so much rounded that the result is heavily tied data, true discrete-data models are called for.

Discrete time models will now be introduced. Let R be a discrete random variable with

  • support (r1,r2,...,rk) (positive real numbers, usually 1, 2,... or 0, 1, 2, ...),

  • probability mass function

    pi=P(R=ri),i=1,...,k,

    with pi>0,        i=1,...,k and ki=1pi=1.

Then

F(t)=i:ritpi,x<t<,

is the cumulative distribution function, and

S(t)=i:rixpi,x<t<,

is the survival function.

The discrete time hazard function is defined as

hi=P(R=ri|Rri)=pij=ikpj,i=1,...,k.

Note that here, the hazard at any given time point is a conditional probability, so it must always be bounded to lie between zero and one. In the continuous case, on the other hand, the hazard function may take any positive value. Further note that if, like here, the support is finite, the last “hazard atom” is always equal to one (having lived to the “last station,” one is bound to die).

The system (2.3) of equations has a unique solution, easily found by recursion:

pi=hij=1i1(1hj),i=1,...,k.(2.2)

From this, we get the discrete time survival function at each support point as

S(ri)=j=ikpj=j=1i1(1hj),i=1,...,k,

and the general definition

S(t)=j:rj<t(1hj),t0(2.3)

It is easily seen that S is decreasing, S(0) = 1, and S(∞) = 0, as it should be.

Example 9 The geometric distribution

The geometric distribution has support on {1,2,...} (another version also includes zero in the support; this is the case for the one in R), and the hazard function h is constant:

hi=h,0<h<1,  i=1,2,....

Thus, the geometric distribution is the discrete analogue to the exponential distribution in that it implies no aging. The probability mass function is, from (2.3),

pi=h(1h)i1,i=1,2,...,

and the survival function becomes, from (2.3),

S(t)=(1h)[ t ],t>0,

where [t] denotes the largest integer smaller than or equal to t (rounding downwards). In Figure 2.4 the four functions for a geometric distribution with h = 0.25 are plotted.

Figure 2.4

Figure showing the geometric distribution with success probability 0.25.

The geometric distribution with success probability 0.25.

2.4 Nonparametric Estimators

As an introductory example, look at an extremely simple data set: 4, 2*, 6, 1, 3* (starred observations are right censored; in the figure, deaths are marked with +, censored observations with a small circle); see Figure 2.5.

Figure 2.5

Figure showing a simple survival data set.

A simple survival data set.

How should the survival function be estimated? The answer is that we take it in steps. First, the hazard “atoms” are estimated. It is done nonparametrically, and the result as such is not very useful. Its potential lies in that it is used as the building block in constructing estimates of the cumulative hazards and survival functions.

2.4.1 The Hazard Atoms

In Figure 2.6, the observed event times in Figure 2.5 are marked, the vertical dashed lines at durations 1, 4, and 6, respectively. In the estimation of the hazard atoms, the concept of risk set is of vital importance.

Figure 2.6

Figure showing preliminaries for estimating the hazard function.

Preliminaries for estimating the hazard function.

The risk set R(t) at duration t, t > 0 is defined mathematically as

R(t)={ all individuals under observation at t },(2.4)

or in words, the risk set at time t consists of all individuals present and under observation just prior to t. The reason we do not say “present at time t” is that it is vital to include those individuals who have an event or are right censored at the exact time t. In our example, the risk sets R(1), R(4), and R(6) are the interesting ones. They are:

R(1)={ 1,2,3,4,5 }R(4)={ 1,3 }R(6)={ 3 }

The estimation of the hazard atoms is simple. First, we assume that the probability of an event at times where no event is observed, is zero. Then, at times where events do occur, we count the number of events and divides that number by the size of the corresponding risk set. The result is (see also Figure 2.7)

h^(1)=15=0.2h^(4)=12=0.5h^(6)=11=1

Figure 2.7

Figure showing nonparametric estimation of the hazard function.

Nonparametric estimation of the hazard function.

As is evident from Figure 2.7, the estimated hazard atoms will be too irregular to be of practical use; they need smoothing. The simplest way of smoothing them is to calculate the cumulative sums, which leads to the Nelson–Aalen estimator of the cumulative hazards function; see the next section. There are more direct smoothing techniques to get reasonable estimators of the hazard function itself, for example, kernel estimators, but they will not be discussed here. See, for example, (Silverman 1986) for a general introduction to kernel smoothing.

2.4.2 The Nelson–Aalen Estimator

From the theoretical relation, we immediately get

H^(t)=sth^(s),t0

which is the Nelson–Aalen estimator (Nelson 1972, Aalen 1978); see Figure 2.8. The sizes of the jumps are equal to the heights of the “spikes” in Figure 2.7.

Figure 2.8

Figure showing the nelson–aalen estimator.

The Nelson–Aalen estimator.

2.4.3 The Kaplan–Meier Estimator.

From the theoretical relation (2.3), we get

S^(t)=s<t(1h^(s)),t0;(2.5)

see also Figure 2.9. Equation (2.5) may be given a heuristic interpretation: In order to survive time t, one must survive all “spikes” (or shocks) that come before time t. The multiplication principle for conditional probabilities then gives equation (2.5).

Figure 2.9

Figure showing the kaplan–meier estimator

The Kaplan–Meier estimator

2.5 Doing it in R

For the male mortality data set mort,

> require(eha)
> data(mort)
> head(mort)
 id enter  exit event birthdate  ses
1 1 0.000 20.000 0 1800.010 upper
2 2 3.478 17.562 1 1800.015 lower
3 3 0.000 13.463 0 1800.031 upper
4 3 13.463 20.000 0 1800.031 lower
5 4 0.000 20.000 0 1800.064 lower
6 5 0.000 0.089 0 1800.084 lower

an indicator for death is introduced, called event. The value 1 indicates that the corresponding life time is fully observed, while the value 0 indicates a right-censored lifetime. Another common coding scheme is TRUE and FALSE, respectively.

2.5.1 Nonparametric Estimation

In the R package eha, the plot function can be used to plot both Nelson– Aalen curves and Kaplan–Meier curves. The code for the Nelson–Aalen and Kaplan–Meier plots is

> par(mfrow = c(1, 2))# Two plots, “one row, two columns”.
> with(mort, plot(Surv(enter, exit, event), fn = “cum”))
> with(mort, plot(Surv(enter, exit, event), fn = “surv”))

and the result is seen in Figure 2.10 The package eha must be loaded for this to work. Note the use of the function with; it tells the plot function that it should get its data (enter, exit, event) from the data frame mort. The function Surv from the package survival creates a “survival object,” which is used in many places. It is, for instance, the response in all functions that perform regression analysis on survival data.

Figure 2.10

Figure showing nelson–aalen plot (left panel) and kaplan–meier plot (right panel), male mortality data. the dashed lines in the survival function plot are 95% confidence limits.

Nelson–Aalen plot (left panel) and Kaplan–Meier plot (right panel), male mortality data. The dashed lines in the Survival function plot are 95% confidence limits.

Figure 2.11

Figure showing weibull fit to the male mortality data.

Weibull fit to the male mortality data.

Note that the “Duration” in Figure 2.10 is duration (in years) since the day each man became 20 years of age. They are followed until death or age 40, whichever comes first. The right hand figure shows that approximately 25% of the men alive at age 20 died before they became 40.

If the Kaplan–Meier estimator is wanted as numbers, use survfit in combination with coxreg or coxph. Since the output will have approximately as many rows as there are distinct event times, we illustrate it by taking a (small) random sample from mort.

> indx <- sample(NROW(mort), size = 50, replace = FALSE)
> rsa <- mort[indx,]
> fit <- coxph(Surv(enter, exit, event) ˜ 1, data = rsa)
> s.fit <- survfit(fit)
> options(width=70)
> summary(s.fit)
Call: survfit(formula = fit)
  time n.risk n.event survival std.err lower 95% CI upper 95% CI
 0.526 46  1 0.978 0.0213   0.938   1.000
 3.615 39  1 0.954 0.0321   0.893   1.000
 4.036 38  1 0.929 0.0397   0.854   1.000
 5.334 36  1 0.904 0.0460   0.818   0.998
 8.653 30  1 0.874 0.0532   0.776   0.985
 9.690 30  1 0.845 0.0587   0.738   0.968
 10.139 29  1 0.817 0.0633   0.702   0.951
 10.462 28  1 0.788 0.0672   0.667   0.931
 13.515 29  1 0.761 0.0701   0.636   0.912
 13.584 28  1 0.735 0.0725   0.605   0.891

The functions coxreg and coxph both fit Cox regression models, which is the topic of the next chapter. Here we fit a regression model with just a constant term and no covariates. This results in estimation of the “baseline hazards” which survfit transforms to elements of the Kaplan–Meier estimator seen above. Note also that, by default, a 95% confidence band is produced around the estimator.

You can also plot the output of the fit (s.fit) by simply typing

> plot(s.fit)

The result is not shown here. It will look similar to the right-hand panel of Figure 2.10.

2.5.2 Parametric Estimation

It is also possible to fit a parametric model to data with the aid of the function phreg (it also works with the function aftreg). Just fit a “parametric proportional hazards model with no covariates.” Note that the default distribution

> par(mfrow = c(1, 2))
> fit.w <- phreg(Surv(enter, exit, event) ˜ 1, data = mort)
> plot(fit.w, fn = c(“cum”))
> plot(fit.w, fn = c(“sur”))

in phreg is the Weibull distribution, which means that if no distribution is specified (through the argument dist), then the Weibull distribution is assumed.

We can also show the parameter estimates. For the interpretations of these, see Appendix B, the Weibull distribution.

> fit.w
Call:
phreg(formula = Surv(enter, exit, event) ˜ 1, data = mort)
Covariate  W.mean  Coef Exp(Coef) se(Coef) Wald p
log(scale)    3.785 44.015 0.066  0.000
log(shape)    0.332 1.393 0.058  0.000
Events    276
Total time at risk  17038
Max. log. likelihood  -1399.3

More about this will be discussed in Chapter 6.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset