Chapter 5

More on Cox Regression

5.1 Introduction

This chapter introduces concepts that are necessary in practical work: time-dependent covariates, communal covariates, the handling of tied event times, model checking, and sensitivity analysis.

5.2 Time-Varying Covariates

Only a special kind of time-varying covariate can be handled in R by the packages eha and survival, namely so-called piecewise constant functions. How this is done is best described by an example.

Example 14 Civil status

The covariate (factor) civil status (called x in the R data frame) is an explanatory variable in a mortality study; it changes value from 0 to 1 at marriage. How should this be coded in the data file? The solution is to create two records (for individuals who marry), each with a fixed value of x:

  1. Original record: (t0, t, d, x(s), t0 < s ≤ t), for an individual married at time T, t0 < T < t, where

    $$x(s) = \begin{cases} 0, & s \le T \\ 1, & s > T \end{cases}$$

  2. First new record: (t0, T, 0, 0), always censored.

  3. Second new record: (T, t, d, 1).

With T = 3, the data file will contain the two records shown in Table 5.1. In this way, a time-varying covariate can always be handled by utilizing left truncation and right censoring. See also Figure 5.1 for an illustration. Note that this situation is formally equivalent to one with two individuals, one unmarried and one married, where the first is right censored at exactly the same time as the second enters the study (left truncation).

Figure 5.1


A time-varying covariate (civil status, 0 = unmarried, 1 = married). Right censored and left truncated at age 30 (at marriage).

Table 5.1

The coding of a time-varying covariate

id   enter   exit   event   civil status
23       0      3       0              0
23       3      8       1              1
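As a concrete illustration, here is a minimal base-R sketch of the splitting; the function name and the column civ.st are hypothetical, but the result reproduces Table 5.1.

split.at.marriage <- function(rec, Tm) {
    ## rec: one-row data frame with columns id, enter, exit, event
    ## Tm: age at marriage, assumed to lie strictly inside (enter, exit)
    first  <- transform(rec, exit = Tm, event = 0)  # record 1: censored at marriage
    second <- transform(rec, enter = Tm)            # record 2: keeps original exit and event
    rbind(cbind(first,  civ.st = 0),
          cbind(second, civ.st = 1))
}
rec <- data.frame(id = 23, enter = 0, exit = 8, event = 1)
split.at.marriage(rec, Tm = 3)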

A word of caution: Marital status may be interpreted as an internal covariate; that is, the change of marital status is individual and may depend on health status. For instance, maybe only healthy persons get married. If so, a vital condition in Cox regression is violated: the risk of dying may only be modeled by conditions in the strict past history of each individual. Generally, the use of time-dependent covariates is dangerous, and one should always think of possible reverse causality when allowing for them.

5.3 Communal covariates

Communal (external) covariates are covariates that vary in time outside any individual and are common to all individuals at each specific calendar time. In the econometric literature, such a variable is often called exogenous. This can be viewed as a special case of a time-varying covariate, but without the danger of reverse causality discussed earlier.

Example 15 Did food prices affect mortality in the 19th century?

For some 19th century parishes in southern Sweden, yearly time series of food prices are available. In this example we use deviation from the log trend in annual rye prices, shown in Figure 5.2. The idea behind this choice of transformation is to focus on deviations from the “normal” as a proxy for short-term variation in economic stress.
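The construction of such a series is not shown here; as a rough sketch (with a hypothetical data frame rye containing columns year and price), the deviations could be obtained as the residuals from a linear trend fitted to the log prices:

## Hypothetical sketch: rye is an assumed data frame of yearly rye prices.
price.dev <- data.frame(year = rye$year,
                        foodprices = residuals(lm(log(price) ~ year, data = rye)))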

> data(scania)
> summary(scania)
       id              enter        exit           event
 Min.   :   1.0   Min.   :50   Min.   :50.00   Min.   :0.0000
 1st Qu.: 483.5   1st Qu.:50   1st Qu.:55.37   1st Qu.:0.0000
 Median : 966.0   Median :50   Median :62.70   Median :1.0000
 Mean   : 966.0   Mean   :50   Mean   :63.97   Mean   :0.5624
 3rd Qu.:1448.5   3rd Qu.:50   3rd Qu.:72.05   3rd Qu.:1.0000
 Max.   :1931.0   Max.   :50   Max.   :80.00   Max.   :1.0000
   birthdate        sex       parish      ses        immigrant
 Min.   :1763   male  :975   1: 159   upper: 375   no : 759
 1st Qu.:1786   female:956   2: 181   lower:1556   yes:1172
 Median :1808                3: 180
 Mean   :1806                4: 258
 3rd Qu.:1825                5:1153
 Max.   :1845
> scania[scania$id == 1,]
  id enter  exit event birthdate sex parish  ses immigrant
29 1 50 59.242 1 1781.454 male  1 lower   no

Now, to add the food prices as a time-varying covariate, use the function make.communal.

> require(eha)
> scania <- make.communal(scania, logrye[, 2, drop = FALSE],
      start = 1801.75)
> scania[scania$id == 1,]
 enter  exit event birthdate id sex parish  ses immigrant
1 50.000 50.296 0 1781.454 1 male  1 lower   no
2 50.296 51.296 0 1781.454 1 male  1 lower   no
3 51.296 52.296 0 1781.454 1 male  1 lower   no
4 52.296 53.296 0 1781.454 1 male  1 lower   no
5 53.296 54.296 0 1781.454 1 male  1 lower   no
6 54.296 55.296 0 1781.454 1 male  1 lower   no
7 55.296 56.296 0 1781.454 1 male  1 lower   no
8 56.296 57.296 0 1781.454 1 male  1 lower   no
9 57.296 58.296 0 1781.454 1 male  1 lower   no
10 58.296 59.242 1 1781.454 1 male  1 lower no
  foodprices
1  0.126
2  0.423
3  -0.019
4  -0.156
5  -0.177
6  -0.085
7  -0.007
8  0.104
9  0.118
10 -0.197

A new variable, foodprices, is added, and each individual's duration is split up into one-year-long intervals (except the first and last). This is how foodprices is treated as a time-varying covariate.

> fit <- coxreg(Surv(enter, exit, event) ~ ses + sex + foodprices,
    data = scania)
> fit
Call:
coxreg(formula = Surv(enter, exit, event) ~ ses + sex + foodprices,
 data = scania)
Covariate           Mean    Coef Rel.Risk   S.E.   Wald p
ses
            upper  0.202       0        1 (reference)
            lower  0.798  -0.030    0.970  0.075    0.686
sex
             male  0.504       0        1 (reference)
           female  0.496   0.041    1.042  0.061    0.500
foodprices         0.002   0.321    1.378  0.182    0.077

Events                      1086
Total time at risk         26979
Max. log. likelihood     -7184.1
LR test statistic            3.7
Degrees of freedom             3
Overall p-value         0.295547

Figure 5.2


Log rye price deviations from trend, 19th century southern Sweden.

Socioeconomic status and sex do not matter much, but food prices do; the effect is almost significant at the 5% level.

5.4 Tied Event Times

Tied event times cannot, in theory, occur with continuous-time data, but durations and life lengths cannot be measured with infinite precision. Data are always more or less rounded, and tied event times occur frequently in practice. This may cause problems (biased estimates) if ties occur too frequently. There are a few ways to handle tied data; the so-called exact method considers all possible permutations of the tied event times in each risk set. It works as shown in the following example.

Example 16 Likelihood contribution at tied event time.

Ri = {1, 2, 3}, where subjects 1 and 2 have events; there are two possible orderings:

$$L_i(\beta) = \frac{\psi(1)}{\psi(1)+\psi(2)+\psi(3)} \times \frac{\psi(2)}{\psi(2)+\psi(3)} + \frac{\psi(2)}{\psi(1)+\psi(2)+\psi(3)} \times \frac{\psi(1)}{\psi(1)+\psi(3)} = \frac{\psi(1)\psi(2)}{\psi(1)+\psi(2)+\psi(3)} \left\{ \frac{1}{\psi(2)+\psi(3)} + \frac{1}{\psi(1)+\psi(3)} \right\}$$
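As a small numerical illustration of this formula, the contribution can be computed for any given values of ψ(1), ψ(2), ψ(3); the values used below are made up.

exact.contribution <- function(psi) {
    ## psi: vector holding psi(1), psi(2), psi(3) for the three subjects at risk
    p12 <- psi[1] / sum(psi) * psi[2] / sum(psi[2:3])      # ordering: 1 before 2
    p21 <- psi[2] / sum(psi) * psi[1] / sum(psi[c(1, 3)])  # ordering: 2 before 1
    p12 + p21
}
exact.contribution(c(1.2, 0.8, 1.0))   # psi(i) = exp(beta * x_i), made-up values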

The main drawback of the exact method is that it easily becomes unacceptably slow, because of the huge number of permutations that may be necessary to consider.

Fortunately, there are a few excellent approximations, most notably Efron’s (Efron 1977), which is the default method in most survival packages in R. Another common approximation, due to Breslow (Breslow 1974), is the default in some statistical software and also possible to choose in the eha and survival packages in R. Finally, there is no cost involved in using these approximations in the case of no ties at all; they will all give identical results.

With too many ties in the data, there is always the possibility of using discrete-time methods. One simple way is to use the option method = "ml" in the coxreg function in the R package eha.

Example 17 Birth intervals.

We look at lengths of birth intervals between first and second births for married women in 19th century northern Sweden, a subset of the data set fert, available in the eha package. Four runs with coxreg are performed, with all the possible ways of treating ties.

> require(eha)
> data(fert)
> first <- fert[fert$parity == 1,]
> ## Default method (not necessary to specify method = "efron"):
> fit.e <- coxreg(Surv(next.ivl, event) ~ year + age, data = first,
    method = "efron")
> ## Breslow:
> fit.b <- coxreg(Surv(next.ivl, event) ~ year + age, data = first,
    method = "breslow")
> ## The hybrid mppl:
> fit.mp <- coxreg(Surv(next.ivl, event) ~ year + age, data = first,
    method = "mppl")
> ## True discrete:
> fit.ml <- coxreg(Surv(next.ivl, event) ~ year + age, data = first,
    method = "ml")

Then we compose a table of the four results; see Table 5.2.

Table 5.2

Comparison of four methods of handling ties, estimated coefficients

           year       age
Efron    0.0017   −0.0390
Breslow  0.0017   −0.0390
MPPL     0.0016   −0.0390
ML       0.0016   −0.0390

There is almost no difference in results between the four methods. For the exact method mentioned above, the number of permutations is about 7 × 10^192, which is impossible to handle.

With the help of the function risksets in eha, it is easy to check the composition of risk sets in general and the frequency of ties in particular.

> rs <- risksets(Surv(first$next.ivl, first$event))
> tt <- table(rs$n.event)
> tt
 1  2  3  4  5  6  7  9
369 199 133 60 27 14  2  2

This output says that 369 risk sets have only one event each (no ties), 199 risk sets have exactly two events each, etc.

The object rs is a list with seven components. Two of them are n.event, which counts the number of events in each risk set, and size, which gives the size (number of individuals under observation) of each risk set. Both vectors are ordered by risktimes, which is another component of the list. For the rest of the components, see the help page for risksets.

It is easy to produce the Nelson–Aalen plot with the output from risksets. The code is

> plot(rs$risktimes, cumsum(rs$n.event / rs$size), type = "s",
   xlab = "Duration (years)", ylab = "Cumulative hazards")
> abline(h = 0)

and the result is shown in Figure 5.3.

Figure 5.3


Nelson–Aalen plot with the aid of the function risksets in eha. Birth intervals between first and second births, 19th century Swedish data.

One version of the corresponding survival function can be constructed by

> sur <- exp(-cumsum(rs$n.event / rs$size))
> plot(rs$risktimes, sur, type = "s",
   xlab = "Duration (years)", ylab = "Surviving fraction")
> abline(h = 0)

and the result is shown in Figure 5.4.
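For comparison, one could overlay the Kaplan–Meier estimate from survfit in the survival package (a sketch; the two curves should be close but not identical, since the construction above exponentiates the Nelson–Aalen estimate):

> require(survival)
> km <- survfit(Surv(first$next.ivl, first$event) ~ 1)
> lines(km$time, km$surv, type = "s", lty = 2)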

Figure 5.4


Survival plot with the aid of the function risksets in eha. Birth intervals between first and second births, 19th century Swedish data.

5.5 Stratification

Stratification means that data are split up into groups called strata, and a separate partial likelihood function is created for each stratum, but with common values of the regression parameters corresponding to the common explanatory variables. In the estimation, these partial likelihoods are multiplied together, and the product is treated as a likelihood function. The regression coefficients are thus restricted to be the same across strata, while the baseline hazard functions are allowed to differ between strata.
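In symbols, a sketch of the construction just described: with strata s = 1, . . ., S, a common coefficient vector β, and within-stratum partial likelihoods L_s,

$$L(\beta) = \prod_{s=1}^{S} L_s(\beta),$$

where each L_s is built from the risk sets of stratum s alone, so that every stratum keeps its own, unspecified baseline hazard function.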

There are typically two reasons for stratification. First, if the proportionality assumption (see Section 5.8) does not hold for a factor covariate, a way out is to stratify on it. Second, a factor may have too many levels, so that it is inappropriate to treat it as an ordinary factor. This argument is similar to the one about using a frailty model (Chapter 7). Stratification is also a useful tool with matched data; see Section 9.6.

When a factor does not produce proportional hazards between categories, stratify on the categories. For tests of the proportionality assumption, see Section 5.8.

Example 18 Birth intervals.

For the birth interval data, we stratify on ses in the Cox regression:

> data(fert)
> fert1 <- fert[fert$parity == 1,]
> fit <- coxreg(Surv(next.ivl, event) ~ strata(ses) +
    age + year + prev.ivl + parish, data = fert1)

and plot the fit; see Figure 5.5. The nonproportionality pattern is clearly visible.

> plot(fit)

Figure 5.5


Cumulative hazards by socioeconomic status, birth intervals.

5.6 Sampling of Risk Sets

Some of the work by Langholz & Borgan (1995) is implemented in eha. The idea behind sampling of risk sets is that with huge data sets, each risk set will contain only a few (often one) events but many survivors. From an efficiency point of view, not much is lost by using only a random fraction of the survivors in each risk set. The gain is, of course, computation speed and memory savings. How it is done in eha is shown with an example.

Example 19 Sampling of risk sets, male mortality.

For the male mortality data set, we compare a full analysis with one where only four survivors are sampled in each risk set.

> data(mort)
> system.time(fit <- coxreg(Surv(enter, exit, event) ~ ses,
      data = mort))
  user system elapsed
 0.020  0.000  0.017
> system.time(fit.4 <- coxreg(Surv(enter, exit, event) ~ ses,
       max.survs = 4, data = mort))

  user system elapsed
 0.012  0.000  0.012

> fit

Call:
coxreg(formula = Surv(enter, exit, event) ~ ses, data = mort)
Covariate           Mean    Coef Rel.Risk   S.E.   Wald p
ses
            lower  0.416       0        1 (reference)
            upper  0.584  -0.480    0.619  0.121    0.000
Events                       276
Total time at risk         17038
Max. log. likelihood     -1845.3
LR test statistic           15.8
Degrees of freedom             1
Overall p-value      7.01379e-05

> fit.4

Call:
coxreg(formula = Surv(enter, exit, event) ~ ses, data = mort,
    max.survs = 4)
Covariate           Mean    Coef Rel.Risk   S.E.   Wald p
ses
            lower  0.416       0        1 (reference)
            upper  0.584  -0.501    0.606  0.135    0.000
Events                       276
Total time at risk         17038
Max. log. likelihood     -437.98
LR test statistic           13.9
Degrees of freedom             1
Overall p-value       0.00019136

The results are comparable, and a gain in execution time is noted. Of course, with the small data sets we work with here, the difference is of no practical importance.

It is worth noting that the function risksets has an argument max.survs which, when set, samples survivors for each risk set in the data. The output can then be used as input to coxreg; see the relevant help pages for more information.
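A sketch of that route is given below; the risksets call follows the text above, while the argument name rs for feeding the sampled risk sets back into coxreg is an assumption that should be checked against the help pages.

## Sketch; the rs argument name is assumed, see help(coxreg) and help(risksets).
rs4 <- risksets(Surv(mort$enter, mort$exit, mort$event), max.survs = 4)
fit.rs <- coxreg(Surv(enter, exit, event) ~ ses, data = mort, rs = rs4)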

5.7 Residuals

Residuals are generally the key to judging how well a model fits the data. It is easy to show how this works in linear regression.

Example 20 Linear regression.

Figure 5.6 shows a scatter plot of bivariate data and a fitted straight line. The residuals are the vertical distances between the line and the points.

Figure 5.6


Residuals in linear regression.

Unfortunately, this kind of simple and intuitive graph does not exist for the results of a Cox regression analysis. However, there are a few ideas on how to accomplish something similar. The main idea goes as follows. If T is a survival time and S(t) the corresponding survivor function (continuous time), then it is a mathematical fact that U = S(T) is uniformly distributed on the interval (0, 1). This in turn implies that −log(U) is exponentially distributed with rate (parameter) 1.
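A quick numerical illustration of this fact (simulated data, not part of the book's examples): if T is exponential with rate 2, then H(T) = 2T should behave like a unit exponential.

> set.seed(1)
> tt <- rexp(10000, rate = 2)
> c(mean(2 * tt), var(2 * tt))   # both should be close to 1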

Since H(t) = −log S(t) (H is the cumulative hazard function for T), it follows that H(T) is exponentially distributed with rate 1. This motivates the Cox–Snell residuals: given a sample T1, . . ., Tn of survival times,

$$r_{Ci} = \hat{H}(T_i; x_i) = \exp(\beta x_i)\,\hat{H}_0(T_i), \quad i = 1, \ldots, n,$$

which should behave like a censored sample from an exponential distribution with parameter 1, if the model is good. If, for some i, Ti is censored, so is the residual.

5.7.1 Martingale Residuals

A drawback of the Cox–Snell residuals is that they contain censored values. An attempt to correct the censored residuals leads to the so-called martingale residuals. The idea is simple; it builds on the "lack-of-memory" property of the exponential distribution. The expected value is 1, and by adding 1 to the censored residuals, they become predictions (estimates) of the corresponding uncensored values. Finally, a twist is added: subtract 1 from all values and multiply by −1, which leads to the definition of the martingale residuals:

$$r_{Mi} = \delta_i - r_{Ci}, \quad i = 1, \ldots, n,$$

where δi is the event indicator for the ith observation. These residuals may be interpreted as the observed minus the expected number of events for each individual.

Direct plots of the martingale residuals tend to be less informative, especially with large data sets.

Example 21 Plot of martingale residuals.

The data set kidney from the survival package is used. For more information about it, read its help page in R. For each patient, two survival times are recorded, but only one will be used here. That is accomplished with the function duplicated.

> require(eha)
> data(kidney)
> head(kidney)
  id time status age sex disease frail
1  1    8      1  28   1   Other   2.3
2  1   16      1  28   1   Other   2.3
3  2   23      1  48   2      GN   1.9
4  2   13      0  48   2      GN   1.9
5  3   22      1  32   1   Other   1.2
6  3   28      1  32   1   Other   1.2
> k1 <- kidney[!duplicated(kidney$id),]
> fit <- coxreg(Surv(time, status) ~ disease + age + sex,
    data = k1)

The extractor function residuals is used to extract the martingale residuals from the fit.

> plot(residuals(fit), ylab = "Residuals")

See Figure 5.7.

Figure 5.7


Martingale residuals from a fit of the kidney data.
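The same fit can also be used to recover the Cox–Snell residuals discussed above, since r_Ci = δ_i − r_Mi. A minimal sketch of the classical check, plotting the Nelson–Aalen estimate based on the (censored) Cox–Snell residuals against the unit-slope line expected under a unit exponential:

> rM <- residuals(fit)                 # martingale residuals from the kidney fit
> rC <- k1$status - rM                 # Cox-Snell residuals
> rs.res <- risksets(Surv(rC, k1$status))
> plot(rs.res$risktimes, cumsum(rs.res$n.event / rs.res$size), type = "s",
     xlab = "Cox-Snell residual", ylab = "Cumulative hazard")
> abline(0, 1, lty = 2)                # reference line for a unit exponential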

With a large data set, the plot will be hard to interpret; see Figure 5.8, which shows the martingale residuals from a fit of the male mortality (mort) data in the package eha. The residuals are clearly grouped in two clusters: the positive ones correspond to observations with an event, while the negative ones correspond to censored observations. It is hard to draw any conclusions from plots like this one. However, the residuals are used in functions that evaluate goodness of fit.

Figure 5.8


Martingale residuals from a fit of the male mortality data.

5.8 Checking Model Assumptions

5.8.1 Proportionality

It is extremely important to check the proportionality assumption in the proportional hazards models. We have already seen how violations of the assumption can be dealt with (Section 5.5).

We illustrate this with the birth interval data set fert in the eha package.

Example 22 Birth intervals, proportionality assumption.

In order to keep observations reasonably independent, we concentrate on one specific birth interval per woman, the interval between the second and third births. That is, our sample consists of all women with at least two births, and we monitor each woman from the day of her second delivery until the third delivery or, if that never happens, throughout her observable fertile period, i.e., to age 50 or to loss of follow-up. The latter results in a right-censored interval.

> require(eha)
> ##data(births)
> fert2 <- fert[fert$parity == 2,]

Then we save the fit from a Cox regression,

> fit <- coxreg(Surv(next.ivl, event) ~ ses + age + year + parish,
    data = fert2)

and check the proportionality assumption. It is done with the function cox.zph in the survival package.

> prop.full <- cox.zph(fit)
> prop.full
                rho   chisq       p
sesunknown -0.02773  1.1235 0.28918
sesupper   -0.05311  4.2624 0.03896
seslower    0.04774  3.4199 0.06441
age        -0.07322  6.8550 0.00884
year       -0.00315  0.0156 0.90071
parishNOR   0.00907  0.1234 0.72538
parishSKL   0.01883  0.5333 0.46524
GLOBAL           NA 20.1604 0.00523

In the table above, look first at the last row, GLOBAL. It is the result of a test of the global null hypothesis that proportionality holds. The small p-value tells us that we have a big problem with the proportionality assumption. Looking further up, we see two possible problems, the variables age and ses. Unfortunately, age is a continuous variable. A categorical variable can be stratified upon, but with age we have to categorize first.

Let us first investigate the effect of stratifying on ses:

> fit1 <- coxreg(Surv(next.ivl, event) ~ strata(ses) +
     age + year + parish, data = fert2)
> plot(fit1)

and plot the result; see Figure 5.9. The nonproportionality is visible; some curves cross.

Figure 5.9


Stratification on socio-economic status.

The test of proportionality of the stratified model shows that we still have a problem with age.

> fit1.zph <- cox.zph(fit1)
> print(fit1.zph)
                rho   chisq       p
age       -0.074152 6.94200 0.00842
year       0.000807 0.00101 0.97469
parishNOR  0.010598 0.16842 0.68152
parishSKL  0.021465 0.69300 0.40515
GLOBAL           NA 9.51235 0.04949

Since age is a continuous covariate, we may have to categorize it. First, its distribution is checked with a histogram; see Figure 5.10. The range of age may reasonably be split into four equal-length intervals with the cut function.

Figure 5.10


The distribution of age, birth interval data.

> hist(fert2$age, main = "", xlab = "age")
> fert2$qage <- cut(fert2$age, 4)
> fit2 <- coxreg(Surv(next.ivl, event) ~ strata(ses) +
    qage + year + parish, data = fert2)

and then the proportionality assumption is tested again.

> fit2.zph <- cox.zph(fit2)
> fit2.zph
                     rho   chisq      p
qage(24.2,31.5]  0.00529 0.03975 0.8420
qage(31.5,38.8] -0.02639 1.01577 0.3135
qage(38.8,46]   -0.05763 5.13047 0.0235
year             0.00235 0.00879 0.9253
parishNOR        0.00501 0.03762 0.8462
parishSKL        0.01500 0.33875 0.5606
GLOBAL                NA 8.73344 0.1891

The result is now reasonable, with only the highest age group deviating slightly. This is not so strange; fertility becomes essentially zero in the upper forties, and therefore it is natural that the proportionality assumption is violated there. One way of dealing with the problem, apart from stratification, would be to analyze the age groups separately.
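A sketch of that alternative, assuming each age group contains enough events for a separate fit:

> fits <- lapply(split(fert2, fert2$qage), function(d)
     coxreg(Surv(next.ivl, event) ~ strata(ses) + year + parish, data = d))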

5.8.2 Log-Linearity

In the Cox regression model, the effect of a covariate on the hazard is log-linear; that is, it affects the log hazard linearly. Sometimes a covariate needs to be transformed to fit into the pattern. The first question is then: Should the covariate be transformed in the analysis? A graphical test can be done by plotting the martingale residuals from a null fit against the covariate; see Figure 5.11.

> x <- rnorm(100)
> tid <- rexp(100, exp(1.5 * x^2))
> event <- rep(1, 100)
> fit <- coxreg(Surv(tid, event) ~ 1)
> plot(x, residuals(fit))
> lines(lowess(x, residuals(fit)))

Figure 5.11


Plot of residuals against the explanatory variable x.

The function lowess fits a "smooth" curve to a scatterplot. The curvature is clearly visible; let us see what happens if we make the plot with x^2 instead of x. (Note that we are now cheating!)

> plot(x^2, residuals(fit))
> lines(lowess(x^2, residuals(fit)))

See Figure 5.12. This plot indicates more clearly that an increasing x^2 is associated with an increasing risk for the event to occur.

Figure 5.12


Plot of residuals against the explanatory variable x^2.

5.9 Fixed Study Period Survival

Sometimes survival up to some fixed time is of interest. For instance, in medicine five-year survival may be used as a measure of the success of a treatment. The measure is then the probability that a patient will survive five years after treatment.

This is simple to implement; it is a binary experiment where, for each patient, survival or not is recorded (plus covariates, such as treatment). The only complicating factor is incomplete observations, that is, right censoring, in which case it is recommended to use ordinary Cox regression anyway. Otherwise, logistic regression with a suitable link function works well. Of course, as mentioned earlier, the cloglog link is what gets you closest to the proportional hazards model, but usually it makes very little difference. See Kalbfleisch & Prentice (2002, p. 334 ff.) for slightly more detail.
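A minimal sketch under simplifying assumptions, using the male mortality data from Section 5.6: individuals under observation from the start (enter == 0) who are not censored before duration 5 are kept, five-year death is coded as a binary outcome, and a cloglog link keeps the model close to a proportional hazards formulation. The cut-off and variable names are chosen here for illustration only.

> mort0 <- mort[mort$enter == 0 & (mort$exit >= 5 | mort$event == 1), ]
> mort0$dead5 <- as.numeric(mort0$event == 1 & mort0$exit <= 5)
> fit5 <- glm(dead5 ~ ses, family = binomial(link = "cloglog"), data = mort0)
> summary(fit5)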

5.10 Left- or Right-Censored Data

Sometimes the available data are either left or right censored; that is, for each individual i, we only know whether survival beyond ti occurred or not, so the data are of the form

$$(t_i, \delta_i), \quad i = 1, \ldots, n,$$

with δi = 1 if survival (died after ti) and δi = 0 if not (died before ti). It is still possible to estimate the survivor function S nonparametrically. The likelihood function is

$$L(S) = \prod_{i=1}^{n} \{S(t_i)\}^{\delta_i} \{1 - S(t_i)\}^{1 - \delta_i}.$$

How to maximize this under the restrictions $t_i < t_j \Rightarrow S(t_i) \ge S(t_j)$ with the EM algorithm is shown by Groeneboom & Wellner (1992).
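As a sketch of a computational alternative (not the authors' code): for data of this current-status type, the maximizer of L(S) under the monotonicity restriction can also be obtained with the pool-adjacent-violators algorithm, available in base R as isoreg; distinct observation times are assumed here.

npmle.current.status <- function(t, delta) {
    ## delta = 1 if the individual survived past t, 0 otherwise
    o <- order(t)
    ## S(t) must be non-increasing, so fit an isotonic regression to -delta
    fit <- isoreg(t[o], -delta[o])
    data.frame(t = t[o], S.hat = -fit$yf)
}
## made-up illustration
npmle.current.status(t = c(30, 40, 50, 60, 70), delta = c(1, 1, 0, 1, 0))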
