Chapter 4

Poisson Regression

4.1 Introduction

Here Poisson regression is introduced and its connection to Cox regression is discussed. We start by defining the Poisson distribution, then discuss the connection to Cox regression and tabular lifetime data.

4.2 The Poisson Distribution

The Poisson distribution is used for count data, that is, when the result may be any positive integer 0, 1, 2,..., without upper limit. The probability density function (pdf) P of a random variable X following a Poisson distribution is

P(X=k)=λkk!exp(λ),λ>0;  k=0,1,2,...,(4.1)

The parameter λ is both the mean and the variance of the distribution. In Figure 4.1 the pdf (4.1) is plotted for some values of λ. Note that when λ increases, the distribution looks more and more like a normal distribution.

Figure 4.1

Figure showing the poisson cdf for different values of the mean, λ.

The Poisson cdf for different values of the mean, λ.

In R, the Poisson distribution is represented by four functions, dpois, ppois, qpois, and rpois, representing the probability density function (pdf); the cumulative distribution function (cdf); the quantile function (the inverse of the cdf); and random number generation, respectively. See the help page for the Poisson distribution for more detail. In fact, this is the scheme present for all probability distributions available in R.

For example, the upper-left bar plot in Figure 4.1 is produced in R by

> barplot(dpois(0:5, lambda = 0.5), axes = FALSE,
   main = expression(paste(lambda, “ = ”, 0.5)))

Note that the function dpois is vectorized:

> dpois(0:5, lambda = 0.5)
[1] 0.6065306597 0.3032653299 0.0758163325 0.0126360554
[5] 0.0015795069 0.0001579507

Example 12 Marital fertility

As an example where the Poisson distribution may be relevant, we look at the number of children born to a woman after marriage. The data frame fert in eha can be used to calculate the number of births per married woman i Skellefteå during the 19th century; however, this data set contains only marriages with one or more births. Let us instead count the number of births beyond one.

> library(eha)
> data(fert)
> f0 <- fert[fert$event == 1,]
> kids <- tapply(f0$id, f0$id, length) - 1
> barplot(table(kids))

The result is shown in Figure 4.2. The question is now: Does this look like a Poisson distribution? One way of checking this is to plot the theoretic distribution with the same mean (the parameter λ) as the sample mean in the data.

Figure 4.2

Figure showing number of children beyond one for married women with at least one child.

Number of children beyond one for married women with at least one child.

> lam <- mean(kids)
> barplot(dpois(0:12, lambda = lam))

The result is shown in Figure 4.3. Obviously, the fertility data do not follow the Poisson distribution so well. It is, in fact, over-dispersed compared to the Poisson distribution. A simple way to check that is to calculate the sample mean and variance of the data. If data come from a Poisson distribution, these numbers should be equal (theoretically) or reasonably close.

Figure 4.3

Figure showing theoretical poisson distribution with λ = 4.549.

Theoretical Poisson distribution with λ = 4.549.

> mean(kids)
[1] 4.548682
> var(kids)
[1] 8.586838

They are not very close, which also is obvious from comparing the graphs.

4.3 The Connection to Cox Regression

There is an interesting connection between Cox regression and Poisson regression, which can be illustrated as follows.

> dat <- data.frame(enter = rep(0, 4), exit = 1:4,
     event = rep(1, 4), x = c(0, 1, 0, 1))
> dat
 enter exit event x
1 0 1 1 0
2 0 2 1 1
3 0 3 1 0
4 0 4 1 1

We have generated a very simple data set, dat. It consists of four life lengths with a covariate x. A Cox regression, where the covariate x is related to survival is given by

> library(eha)
> fit <- coxreg(Surv(enter, exit, event) ˜ x, data = dat)
> fit
Call:
coxreg(formula = Surv(enter, exit, event) ˜ x, data = dat)
Covariate   Mean  Coef Rel.Risk S.E. Wald p
x    0.600 −0.941 0.390 1.240  0.448
Events    4
Total time at risk   10
Max. log. likelihood  -2.87
LR test statistic  0.62
Degrees of freedom   1
Overall p-value   0.43248
> fit$hazards
[[1]]
 [,1]  [,2]
[1,] 1 0.2246892
[2,] 2 0.3508641
[3,] 3 0.4493785
[4,] 4 1.6004852
attr(,“class”)
[1] “hazdata”

Note that fit$hazards is a list with one component per stratum. In this case there is only one stratum, so the list has one component. This component is a matrix with two columns; the first contains the observed distinct failure times, and the second the corresponding estimates of the hazard “atoms.”

Now, this can be replicated by Poisson regression! First, the data set is summarized around each observed failure time. That is, a snapshot of the risk set at each failure time is created by the function toBinary.

> library(eha)
> datB <- toBinary(dat)
> datB
 event riskset risktime x orig.row
1 1  1   1 0   1
2 0  1   1 1   2
3 0  1   1 0   3
4 0  1   1 1   4
2.1  1  2   2 1   2
3.1  0  2   2 0   3
4.1  0  2   2 1   4
3.2  1  3   3 0   3
4.2  0  3   3 1   4
4.3  1  4   4 1   4

Note that three new ”covariates” are created; riskset, risktime, and orig.row. In the Poisson regression to follow, the factor riskset must be included as a cluster. Formally, this is equivalent to including it as a factor covariate, but for data set with many distinct failure times, the number of levels will be too large to be run by glm. The function glmmboot in eha is better suited.

> fit2 <- glmmboot(event ˜ x, cluster = riskset,
    family = poisson, data = datB)
> coefficients(fit2)
  x
-0.9406136

The parameter estimate corresponding to x is exactly the same as in the Cox regression. The baseline hazard function is estimated with the aid of the frailty estimates for the clustering based on riskset.

> exp(fit2$frail)
[1] 0.3596118 0.5615528 0.7192236 2.5615528

These numbers are the hazard atoms in the baseline hazard estimation, so they should be equal to the output from fit$hazards in the Cox regression. This is not the case, and this depends on the fact that coxreg estimates the baseline hazards at the mean of the covariate, that is, at x = 0.5 in our toy example. The easiest way to make the numbers comparable is by subtracting 0.5 from x in the Poisson regression.

> datB$x <- datB$x - fit$means
> fit2 <- glmmboot(event ˜ x, cluster = riskset,
    family = poisson, data = datB)
> exp(fit2$frail)
[1] 0.2246892 0.3508641 0.4493785 1.6004852

This is exactly what coxreg produced.

This correspondence between Poisson and Cox regression was pointed out by Søren Johansen (Johansen 1983).

4.4 The Connection to the Piecewise Constant Hazards Model

For completeness, the connection between the Poisson distribution and the piecewise constant hazard model is mentioned here. This is explored in detail in Section 6.2.4 (Chapter 6).

4.5 Tabular Lifetime Data

Although it is possible to use Poisson regression in place of Cox regression, the most useful application of Poisson regression in survival analysis is with tabular data.

Example 13 Mortality in ages 61–80, Sweden 2007

In eha, there is the data set swe07, which contains population size and number of deaths by age and sex in Sweden 2007. The age span is restricted to ages 61–80.

> data(swe07)
> head(swe07)
 pop deaths sex age log.pop
1 63483 286 female  61 11.05853
2 63770 309 female  62 11.06304
3 64182 317 female  63 11.06948
4 63097 366 female  64 11.05243
5 61671 387 female  65 11.02957
6 57793 419 female  66 10.96462
> tail(swe07)
 pop deaths sex  age log.pop
35 31074  884 male 75 10.34413
36 29718  904 male 76 10.29951
37 29722 1062 male 77 10.29964
38 28296 1112 male 78 10.25048
39 27550 1219 male 79 10.22376
40 25448 1365 male 80 10.14439

The Poisson model is, where Dij is number of deaths and Pij population size for age i and sex j, i = 61,..., 80; j = female (0), male (1), and λij is the corresponding mortality,

Dij~Poisson(λijPij),i=61,62,...,80;  j=0,1,

with

λ61,jP61,j=P61,jexp(γ+β*j)=exp(log(P61,j)+γ+β*j),j=0,1,

and

λijPij=Pijexp(γ+αi+β*j)=exp(log(Pij)+γ+αi+β*j),i=62,63,...,80;  j=0,1.

This calculation shows that it is the log of the population sizes (log(Pij) that is the correct offset to use in the Poisson regression. First, we want age to be a factor (no restrictions like linearity), then the R function glm (“generalized linear model”) is used to fit a Poisson regression model.

> swe07$age <- factor(swe07$age)
> fit <- glm(deaths ˜ offset(log.pop) + sex + age, family = poisson,
   data = swe07)
> drop1(fit, test = “Chisq”)
Single term deletions
Model:
deaths ˜ offset(log.pop) + sex + age
  Df Deviance AIC LRT  Pr(Chi)
<none>   10.2  382.8
sex 1  1428.0 1798.5 1417.7 < 2.2e-16
age 19  9764.6 10099.2 9754.4 < 2.2e-16

The function drop1 is used above to perform two LR tests. The results are that both variables are highly statistically significant, meaning that both are very important in describing the variation in mortality over sex and age. To know how, we present the parameter estimates.

> summary(fit)
Call:
glm(formula = deaths ˜ offset(log.pop) + sex + age, family = poisson,
 data = swe07)
Deviance Residuals:
 Min   1Q Median  3Q   Max
-0.94763 -0.40528  0.00104 0.32469 1.00890

Coefficients:
   Estimate Std. Error z value  Pr(>|z|)
(Intercept) -5.41436 0.03777 -143.358 < 2e-16
sexmale  0.46500 0.01246  37.311 < 2e-16
age62   0.04018 0.05171  0.777 0.43705
age63   0.15654 0.05019  3.119 0.00181
age64   0.28218 0.04896  5.764 8.23e-09
age65   0.36374 0.04834  7.524 5.30e-14
age66   0.48712 0.04786 10.178 < 2e-16
age67   0.61210 0.04755 12.873 < 2e-16
age68   0.62987 0.04868 12.938 < 2e-16
age69   0.81418 0.04742 17.168 < 2e-16
age70   0.81781 0.04744 17.239 < 2e-16
age71   0.93264 0.04690 19.887 < 2e-16
age72   1.03066 0.04659 22.122 < 2e-16
age73   1.12376 0.04618 24.334 < 2e-16
age74   1.27287 0.04548 27.985 < 2e-16
age75   1.37810 0.04507 30.577 < 2e-16
age76   1.45245 0.04481 32.413 < 2e-16
age77   1.61461 0.04366 36.979 < 2e-16
age78   1.73189 0.04317 40.115 < 2e-16
age79   1.83958 0.04266 43.125 < 2e-16
age80   2.00326 0.04217 47.502 < 2e-16

(Dispersion parameter for poisson family taken to be 1)

 Null deviance: 10876.873 on 39 degrees of freedom
Residual deviance: 10.232 on 19 degrees of freedom
AIC: 382.81

Number of Fisher Scoring iterations: 3

The results tell us that males have a distinctly higher mortality than females, and that mortality steadily increases with age (no surprises!). The parameter estimates for age also opens up the possibility to simplify the model by assuming a linear or quadratic effect of age on mortality. We leave that possibility for later.

One question remains: Is the female advantage relatively the same over ages? To answer that question, we introduce an interaction:

> fit1 <- glm(deaths ˜ offset(log.pop) + sex * age,
   family = poisson, data = swe07)
> drop1(fit1, test = “Chisq”)
Single term deletions
Model:
deaths ˜ offset(log.pop) + sex * age
   Df Deviance AIC LRT Pr(Chi)
<none>   0.000 410.58
sex:age 19  10.232 382.81 10.232  0.947

Note that the drop1 function only tests the interaction term. This is because, as we saw before, the main effects tend to be meaningless in the presence of interaction. However, here there is no sign of interaction, and we can conclude that in a survival model with sex as covariate, we have proportional hazards! This is most easily seen in graphs. The first plot is based on the estimated Poisson model without interaction. Some calculations are first needed.

> beta <- coefficients(fit)[2]
> alpha <- coefficients(fit)[-2] # Remove sex
> alpha[2:length(alpha)] <- alpha[2:length(alpha)] + alpha[1]
> lambda.females <- exp(alpha)
> lambda.males <- exp(alpha + beta)

Then the plot of the hazard functions by

> plot(61:80, lambda.males, ylim = c(0, 0.06), xlab = “Age”,
   type = “s”)
> lines(61:80, lambda.females, type = “s”, lty = 2)
> abline(h = 0)

Note the parameter ylim. It sets the scale on the y axis, and some trial and error may be necessary to get a good choice. The function line adds a curve to an already existing plot, and the function abline adds a straight line; the argument h = 0 results in a horizontal line. See the help pages for these functions for further information. at 0. The result is shown in Figure 4.4.

Figure 4.4

Figure showing model based hazard functions for males (upper) and females (lower).

Model based hazard functions for males (upper) and females (lower).

We can also make a plot based only on raw data. For that it is only necessary to calculate the occurrence-exposure rates. An occurrence-exposure rate is simply the ratio between the number of occurrences (of some event) divided by total exposure (waiting) time. In the present example, mean population size during one year is a very good approximation of the exact total exposure time (in years).

> femal <- swe07[swe07$sex == “female”,]
> mal <- swe07[swe07$sex == “male”,]
> f.rate <- femal$deaths / femal$pop
> m.rate <- mal$deaths / mal$pop

And finally the plot of the raw death rates is created as follows.

> plot(61:80, m.rate, ylim = c(0, 0.06), xlab = “Age”,
   ylab = “Mortality”, type = “s”)
> lines(61:80, f.rate, col = “red”, type = “s”)
> abline(h = 0)

with the result as shown in Figure 4.5.

Figure 4.5

Figure showing hazard functions for males (upper) and females (lower) based on raw data.

Hazard functions for males (upper) and females (lower) based on raw data.

The differences between Figure 4.4 and Figure 4.5 are very small, showing that the Poisson model without interaction fits the Swedish old age mortality data very well.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset