Chapter 1

Event History and Survival Data

1.1 Introduction

What characterizes event history and survival data the most is its dynamic nature. Individuals are followed over time, and during that course, the timings of events of interest are noted. Naturally, things may happen that make it necessary to interrupt the follow-up of an individual, such as the individual suddenly disappearing for some reason. With classical statistical tools, such as linear regression, these observations are difficult or impossible to handle in the analysis. The methods discussed in this book aim, among other things, at solving such problems.

In this introductory chapter, the use of the special techniques that constitute survival and event history analysis is motivated. The concepts of right censoring and left truncation are defined and discussed. The data sets used throughout the book are also presented in this chapter.

The environment R is freely available (under a GPL license) for download from http://cran.r-project.org. There you will find precompiled versions for Linux, MacOS X, and Microsoft Windows, as well as the full source code, which is open. See Appendix C for more information.

1.2 Survival Data

Survival data (survival times) constitute the simplest form of event history data. A survival time is defined as the time it takes for an event to occur, measured from a well-defined start event. Thus, there are three basic elements which must be well defined: a time origin, a scale for measuring time, and an event. The response in a statistical analysis of such data is the exact time elapsed from the time origin to the time at which the event occurs. The challenge, which motivates special methods, is that in most applications, this duration cannot always be observed exactly.

As an introduction to the research questions that are suitable for handling with event history and survival analysis, let us look at a data set found in the eha package (Broström 2012) in R (R Development Core Team 2011).

Example 1 Old age mortality

The data set oldmort in eha contains survival data from the parish Sundsvall in the mid-east of 19th century Sweden. The name oldmort is an acronym for old age mortality. The source is digitized information from historical parish registers and church books. More information about this can be found at the web page of The Demographic Data Base at Umeå University (DDB), http://www.ddb.umu.se.

The sampling was done as follows: Every person who was present and alive and 60 years of age or above anytime between 1 January 1860 and 31 December 1879 was followed from the entrance age (for most people that would be 60) until the age when last seen, determined by death, out-migration, or surviving until 31 December 1879. Those born during the eighteenth century would enter observation at an age above 60, given that they lived long enough, that is, at least until 1 January 1860.

Two ways of ending the observation of a person are distinguished: either by death or by something else, namely out-migration or the end of the study period. In the first case we say that the event of interest has occurred; in the second case it has not.

After installing the eha package and starting an R session (see Appendix C), the data set is loaded as follows.

> library(eha)
> data(oldmort)

The first line loads the package eha into the workspace of R, and the second line loads the data set oldmort found in eha. Let us look at the first few lines of oldmort. It is conveniently done with the aid of the R function head:

> head(oldmort)
         id  enter   exit event birthdate m.id f.id    sex
1 765000603 94.510 95.813  TRUE  1765.490   NA   NA female
2 765000669 94.266 95.756  TRUE  1765.734   NA   NA female
3 768000648 91.093 91.947  TRUE  1768.907   NA   NA female
4 770000562 89.009 89.593  TRUE  1770.991   NA   NA female
5 770000707 89.998 90.211  TRUE  1770.002   NA   NA female
6 771000617 88.429 89.762  TRUE  1771.571   NA   NA female
        civ  ses.50 birthplace imr.birth   region
1     widow unknown     remote  22.20000    rural
2 unmarried unknown     parish  17.71845 industry
3     widow unknown     parish  12.70903    rural
4     widow unknown     parish  16.90544 industry
5     widow  middle     region  11.97183    rural
6     widow unknown     parish  13.08594    rural

The variables in oldmort have the following definitions and interpretations:

id A unique id number for each individual.

enter, exit The start age and stop age for this record (spell). For instance, in row No. 1, individual No. 765000603 enters under observation at age 94.51 and exits at age 95.813. Age is calculated as the number of days elapsed since birth, and this number is then divided by 365.25 to get age in years. The denominator is the average length of a year, taking into account that (almost) every fourth year is 366 days long.

The first individual was born around 1 July 1765, and so was almost 95 years of age when the study started. Suppose that this woman had died at age 94; then she would not have been in our study at all. This property of our sampling procedure is a special case of a phenomenon called length-biased sampling. That is, of those born in the eighteenth century, only those who lived well beyond 60 are included. This bias must be compensated for in the analysis, and this is accomplished by conditioning on the fact that these persons were alive on 1 January 1860. This technique is called left truncation.

event A logical variable (taking values TRUE or FALSE) indicating if the exit is a death (TRUE) or not (FALSE). For our first individual, the value is TRUE, indicating that she died at the age of 95.813 years.

birthdate The birth date expressed as the time (in years) elapsed since January 1, year 0 (which by the way does not exist). For instance, the (pseudo) date 1765.490 is really June 27, 1765. The fraction 0.490 is the fraction of the year 1765 that elapsed until the birth of individual No. 765000603.

m.id Mother’s id. It is unknown for all the individuals listed above. That is the meaning of the symbol NA, which stands for Not Available. The oldest people in the data set typically have no links to parents.

f.id Father’s id. See m.id.

sex A categorical variable with the levels female and male.

civ Civil status. A categorical variable with three levels: unmarried, married, and widow(er).

ses.50 Socioeconomic status (SES) at age 50. Based on occupation information. There is a large proportion of NA (missing values) in this variable. This is quite natural, because this variable was of secondary interest to the record holder (the priest in the parish). The occupation is only noted in connection to a vital event in the family (such as a death, birth, marriage, or in- or out-migration). For those who were above 50 at the start of the period there is no information on SES at 50.

birthplace A categorical variable describing the place of birth. The categories parish and remote represent born in parish and born outside parish, respectively; the listing above also shows a third value, region.

imr.birth A rather specific variable. It measures the infant mortality rate in the birth parish at the time of birth (per cent).

region Present geographical area of residence. The parishes in the region are grouped into three regions, Sundsvall town, rural, and industry. The industry is the sawmill one, which grew rapidly in this area during the late part of the 19th century. The Sundsvall area was, in fact, one of the largest sawmill areas in Europe at this time.
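The age calculation described under enter and exit above can be sketched in base R as follows. The helper name age_in_years and the example dates are hypothetical, chosen only for illustration; they are not part of eha or of the oldmort data.

```r
# Age in years: days elapsed since birth, divided by 365.25.
# 'age_in_years' is a hypothetical helper, not an eha function.
age_in_years <- function(birth, date) {
    as.numeric(as.Date(date) - as.Date(birth)) / 365.25
}

# A made-up person born on 1 January 1800, observed on 1 January 1860.
a <- age_in_years("1800-01-01", "1860-01-01")
a
```

The result is slightly below 60, because the period contains 14 leap days rather than the 15 that the average year length of 365.25 days would imply (1800 was not a leap year).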

Of special interest is the triple (enter, exit, event), because it represents the response variable, or what can be seen of it. More specifically, the sampling frame is all persons observed to be alive and above 60 years of age between 1 January 1860 and 31 December 1879. The start event for these individuals is their 60th birthday, and the stop event is death. Clearly, many individuals in the data set did not die before 1 January 1880, so for them we do not know the full duration between the start and stop events; such individuals are said to be right censored (the exact meaning of which will be given soon). The third component in the survival object (enter, exit, event), that is, event, is a logical variable taking the value TRUE if exit is the true duration (the interval ends with a death) and FALSE if the individual is still alive at the duration “last seen.”

Individuals aged 60 or above between 1 January 1860 and 31 December 1879 are included in the study. Those who are above 60 at this start date are included only if they did not die between the age of 60 and the age at 1 January 1860. If this is not taken into account, a bias in the estimation of mortality will result. The proper way of dealing with this problem is to use left truncation, which is indicated by the variable enter. If we look at the first rows of oldmort, we see that the enter variable is very large; it is the age for each individual at 1 January 1860. You can add enter and birthdate for the first six individuals to see that:

> oldmort$enter[1:6] + oldmort$birthdate[1:6]
[1] 1860 1860 1860 1860 1860 1860

The statistical implication (description) of left truncation is that its presence forces the analysis to be conditional on survival up to the age enter.
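In terms of a survival function S(t) = P(T > t), conditioning on survival up to the age enter means working with S(t) / S(enter) for t at or above enter. The following is a minimal sketch of that idea, assuming (purely for illustration) exponentially distributed durations with a made-up rate; the function names cond_surv and S_exp are hypothetical.

```r
# Conditional survival given survival up to 'enter': S(t) / S(enter).
cond_surv <- function(t, enter, S) {
    stopifnot(all(t >= enter))
    S(t) / S(enter)
}

# Illustrative assumption: exponential durations with rate 0.1 per year.
S_exp <- function(t) pexp(t, rate = 0.1, lower.tail = FALSE)

# Probability of surviving past duration 35, given alive at duration 30,
# compared with the unconditional probability of surviving past 35.
p_cond   <- cond_surv(35, enter = 30, S = S_exp)
p_uncond <- S_exp(35)
c(conditional = p_cond, unconditional = p_uncond)
```

The conditional probability is much larger than the unconditional one, which is why treating a left-truncated person as if followed from the time origin would distort the estimated mortality.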

A final important note: In order to get the actual duration at exit, we must subtract 60 from the value of exit. When we actually perform a survival analysis in R, we should subtract 60 from both enter and exit before we begin. It is not absolutely necessary in the case of Cox regression, because of the flexibility of the baseline hazard in the model (it is, in fact, left unspecified!). However, for parametric models, it may be important in order to avoid dealing with truncated distributions.
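The shift of the time origin to age 60 can be sketched as follows. To keep the example self-contained, the values of enter, exit, and event for the first six records are typed in from the listing above rather than taken from the eha package; the data frame name om is arbitrary.

```r
# First six records of oldmort (values copied from the listing above).
om <- data.frame(
    enter = c(94.510, 94.266, 91.093, 89.009, 89.998, 88.429),
    exit  = c(95.813, 95.756, 91.947, 89.593, 90.211, 89.762),
    event = rep(TRUE, 6)
)

# Move the time origin to age 60: duration = age - 60.
om$enter <- om$enter - 60
om$exit  <- om$exit - 60
om
```

With the full data set, the same two assignments would be applied to oldmort itself after data(oldmort).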

Now let us think of the research questions that could be answered by analyzing this data set. Since the data contain individual information on the length of life after 60, it is quite natural to study what determines a long life and what are the conditions that are negatively correlated with long life.

Obvious questions are: (i) Do women live longer than men? (Yes), (ii) Is it advantageous for a long life to be married? (Yes), (iii) Does socioeconomic status play any role for a long life? (Don’t know), and (iv) Does place of birth have any impact on a long life, and if so, is it different for women and men?

The answers to these, and other, questions will be given later. The methods in later chapters of the book are all illustrated on a few core examples. They are all introduced in this chapter.

The data set oldmort contains only two states, referred to as Alive and Dead, and one possible transition, from Alive to Dead; see Figure 1.1. The ultimate study object in survival analysis is the time it takes from entering state Alive (e.g., becoming 60 years of age) until entering state Dead (e.g., death). This time interval is defined by the exact time of two events, which we may call birth and death, although in practice these two events may be almost any kind of events. Economists, for instance, are interested in the duration of out-of-work spells, where “birth” refers to the event of losing the job, and “death” refers to the event of getting a job. In a clinical trial for treatment of cancer, the starting event time may be time of operation, and the final event time is time of relapse (if any).

Figure 1.1

Survival data.

1.3 Right Censoring

When an individual is lost to follow-up, we say that she is right censored; see Figure 1.2. As indicated in Figure 1.2, the true age at death is T, but due to right censoring, the only information available is that death age T is larger than C. The number C is the age at which this individual was last seen. In ordinary, classical regression analysis, such data are difficult, if not impossible, to handle. Discarding such information may introduce bias. The modern theory of survival analysis offers simple ways to deal with right-censored data.

Figure 1.2

Right censoring.
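How a right-censored observation arises from the true age at death T and the age last seen C can be sketched as follows. All numbers are invented for illustration; in real data, T is of course unknown for the censored individuals.

```r
# Hypothetical true death ages (T) and ages last seen (C) for four people.
T_true <- c(72.3, 81.0, 65.5, 90.2)
C_cens <- c(80.0, 75.0, 70.0, 85.0)

# What is actually observed: the smaller of the two ages, plus an indicator
# telling whether that age is a death (TRUE) or a censoring (FALSE).
exit  <- pmin(T_true, C_cens)
event <- T_true <= C_cens

data.frame(exit, event)
```

For the second and fourth persons, the only information retained is that death occurred after ages 75 and 85, respectively.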

A natural question to ask is: If there is right censoring, should there not also be something called left censoring, and if so, what is it? The answer is yes: left censoring refers to a situation where the only thing known about a death age is that it is less than a certain value C. Note carefully that this is different from left truncation; see the next section.

1.4 Left Truncation

The concept of left truncation, or delayed entry, is well illustrated by the data set oldmort that was discussed in detail in Section 1.2. Please note the difference compared to left censoring. Unfortunately, you may still see articles where these two concepts are confused.

It is illustrative to think of the construction of the data set oldmort as a statistical follow-up study, starting on 1 January 1860. On that day, all persons present in the parish and 60 years of age or above are included in the study. It is decided that the study will end on 31 December 1879; that is, the study period (follow-up time) is 20 years. The interesting event in this study is death. This means that the start event is the sixtieth anniversary of birth, and the final event is death. Due to the calendar time constraints (and migration), not all individuals will be observed to die (especially those who live long), and moreover, some individuals will enter the study after the “starting” event, the sixtieth anniversary. A person who entered late (say he is 65 on 1 January 1860) would not have been included had he died at age 63 (say). Therefore, in the analysis, we must condition on the fact that he was alive at 65. Another way of putting this is to say that this observation is left truncated at age 65.

People who are too young at the start date will be included from the day they reach 60, if that happens before the closing date, 31 December 1879. They are not left truncated, but the later they enter the study, the higher their probability of being right censored.

1.5 Time Scales

In demographic applications age is often a natural time scale, that is, time is measured from birth. In the old age data just discussed, time was measured from age 60 instead. In a case like this, where there is a common “late start age,” it doesn’t matter much, but in other situations it does. Imagine, for instance, that interest lies in studying the time it takes for a woman to give birth to her first child after marriage. The natural way of measuring time is to start the clock at the day of marriage, but a possible (but not necessarily recommended!) alternative is to start the clock at some (small) common age of the women, for instance, at birth. This would give left truncated (at marriage) observations, since women were sampled at marriage. There are two clocks ticking, and you have to make a choice. Generally, it is important to realize that there often are alternatives, and that the result of an analysis may depend strongly on the choice made.
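The two clocks in the marriage example can be made concrete with a small sketch. The three women and their ages are invented for illustration, and the column names are hypothetical.

```r
# Hypothetical women: age at marriage and age at first birth (in years).
women <- data.frame(
    id         = 1:3,
    age.marr   = c(22.0, 25.5, 19.0),
    age.birth1 = c(23.1, 27.0, 20.4)
)

# Clock 1: time since marriage; everyone starts at duration zero.
dur <- data.frame(id = women$id, enter = 0,
                  exit = women$age.birth1 - women$age.marr)

# Clock 2: age; observation starts at the age at marriage,
# so each record is left truncated at that age.
age <- data.frame(id = women$id, enter = women$age.marr,
                  exit = women$age.birth1)

dur
age
```

The length of each observed interval, exit minus enter, is the same under both clocks; what differs is where on the time scale the interval is placed, and hence which women are compared with each other at any given time point.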

1.5.1 The Lexis Diagram

Two time scales are nearly always present in demographic research: age (or duration) and calendar time. For instance, an investigation of mortality may be limited in these two directions. In Figure 1.3 this is illustrated for a study of old age mortality during the years 1829 to 1894. “Old age mortality” is defined as mortality from age 50 and onwards to age 100. The Lexis diagram is a way of showing the interplay between the two time scales and (human) life lines. Age moves vertically and calendar time horizontally, which implies that individual lives move diagonally, from birth to death, from southwest to northeast, in the Lexis diagram. In our example study, we are only interested in the part of the life lines that appear inside the rectangle.

Figure 1.3

Lexis diagram; time period 1829–1894 and ages 50–100 under study.

Assume that the data set at hand is saved in the text file ‘lex.dat’. Note that this data set is not part of eha; it is only used here for the illustration of the Lexis diagram.

> lex <- read.table("Data/lex.dat", header = TRUE)
> lex
 id enter  exit event birthdate sex
1 1 0 98.314 1 1735.333  male
2 2 0 87.788 1 1750.033  male
3 3 0 71.233 0 1760.003 female
4 4 0 87.965 1 1799.492  male
5 5 0 82.338 1 1829.003 female
6 6 0 45.873 1 1815.329 female
7 7 0 74.112 1 1740.513 female

How do we restrict the data to fit into the rectangle given by the Lexis diagram in Figure 1.3? With the two functions age.window and cal.window, it is easy. The former fixes the ‘age cut’ while the latter makes the ‘calendar time cut’.

The age cut:

> require(eha)
> lex <- age.window(lex, c(50, 100))
> lex
 id enter  exit event birthdate sex
1 1 50 98.314 1 1735.333  male
2 2 50 87.788 1 1750.033  male
3 3 50 71.233 0 1760.003 female
4 4 50 87.965 1 1799.492  male
5 5 50 82.338 1 1829.003 female
7 7 50 74.112 1 1740.513 female

Note that individual No. 6 dropped out completely because she died too young. Then the calendar time cut:

> lex <- cal.window(lex, c(1829, 1895))
> lex
 id enter  exit event birthdate sex
1 1 93.667 98.314 1 1735.333  male
2 2 78.967 87.788 1 1750.033  male
3 3 68.997 71.233 0 1760.003 female
4 4 50.000 87.965 1 1799.492  male
5 5 50.000 65.997 0 1829.003 female

and here individual No. 7 disappeared because she died before 1 January 1829. Her death date is her birth date plus her age at death, 1740.513 + 74.112 = 1814.625, or 17 August 1814.
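The logic behind the two cuts can be sketched in base R. This is not how eha implements age.window and cal.window; the helper cut_spells is hypothetical and only mimics their effect on the seven records listed above, which are typed in here so that the example is self-contained.

```r
# The lex data as listed above.
lex0 <- data.frame(
    id        = 1:7,
    enter     = 0,
    exit      = c(98.314, 87.788, 71.233, 87.965, 82.338, 45.873, 74.112),
    event     = c(1, 1, 0, 1, 1, 1, 1),
    birthdate = c(1735.333, 1750.033, 1760.003, 1799.492,
                  1829.003, 1815.329, 1740.513),
    sex       = c("male", "male", "female", "male",
                  "female", "female", "female")
)

# Hypothetical helper: keep the part of each spell inside the ages [lo, hi].
# lo and hi may be single numbers or vectors with one value per row.
cut_spells <- function(dat, lo, hi) {
    new.enter <- pmax(dat$enter, lo)
    new.exit  <- pmin(dat$exit, hi)
    dat$event <- ifelse(dat$exit > hi, 0, dat$event)  # cut off: censored
    dat$enter <- new.enter
    dat$exit  <- new.exit
    dat[dat$enter < dat$exit, ]                       # drop empty spells
}

# Age cut: ages 50 to 100.
lex1 <- cut_spells(lex0, 50, 100)

# Calendar time cut: 1829 to 1895, translated to each person's age scale.
lex2 <- cut_spells(lex1, 1829 - lex1$birthdate, 1895 - lex1$birthdate)
lex2
```

The calendar time cut is just an age cut with person-specific limits, obtained by subtracting the birth date from the calendar limits. Individual No. 5 illustrates the censoring step: her spell is cut at age 1895 - 1829.003 = 65.997, and event is set to zero.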

1.6 Event History Data

Event history data arise, as the name suggests, by following subjects over time and making notes about what happens and when. Usually the interest is concentrated on a few specific kinds of events. The main applications in this book are demography and epidemiology, and hence the events of primary interest are births, deaths, marriages, and migration.

Example 2 Marital fertility in 19th century Sweden

As a rather complex example, let us look at marital fertility in 19th century Sweden; see Figure 1.4.

Figure 1.4

Marital fertility. Follow-up starts at marriage and ends when the marriage is dissolved or the woman has had her fifth birth or she becomes 50 years of age, whichever comes first.

In a marital fertility study, women are typically followed over time from the time of their marriage until the time the marriage is dissolved or their fertility period is over, say at age 50, whichever comes first. The marriage dissolution may be due to the death of the woman or of her husband, or it may be due to a divorce. If the study is limited to a given geographical area, women may get lost to follow-up due to out-migration. This event gives rise to a right-censored observation.

During the follow-up, the exact timings of child births are recorded. Interest in the analysis may lie in investigating which factors, if any, affect the length of birth intervals. A data set may look like this:

 id parity age year next.ivl event prev.ivl ses parish
1 1  0 24 1825 0.411 1  NA farmer SKL
2 1  1 25 1826  22.348 0 0.411 farmer SKL
3 2  0 18 1821 0.304 1  NA unknown SKL
4 2  1 19 1821 1.837 1 0.304 unknown SKL
5 2  2 21 1823 2.546 1 1.837 unknown SKL
6 2  3 23 1826 2.541 1 2.546 unknown SKL
7 2  4 26 1828 2.431 1 2.541 unknown SKL
8 2  5 28 1831 2.472 1 2.431 unknown SKL
9 2  6 31 1833 3.173 0 2.472 unknown SKL

These are the first nine rows, corresponding to the first two mothers in the data file. The variable id is mother’s id, a label that uniquely identifies each individual.

A birth interval has a start point (in time) and an end point. These points are the time points of births, except for the first interval, where the start point is time of marriage, and the last interval, which is open to the right. However, the last interval is stopped at the time of marriage dissolution or when the mother becomes 50, whichever comes first. The variable parity is zero for the first interval, between date of marriage and date of first birth, one for the next interval, and so forth. The last (highest) number is thus equal to the total number of births for a woman during her first marriage (disregarding twin births, etc.).

Here is a variable-by-variable description of the data set.

id The mother’s unique id.

parity Order of previous birth; see above for details. Starts at zero.

age Mother’s age at the event defining the start of the interval.

year Calendar year for the birth defining the start of the interval.

next.ivl The time in years from the birth at parity to the birth at parity + 1, or, for the woman’s last interval, to the time of right censoring.

event An indicator for the interval ending with a birth. It is always equal to 1, except for the last interval, which always has event equal to zero.

prev.ivl The length of the interval preceding this one. For the first interval of a woman, it is always NA (Not Available).

ses Socioeconomic status (based on occupation data).

parish The home parish of mother at birth.

Just to make it clear: The first woman has id 1. She is represented by two records, meaning that she gave birth to one child. She waited 0.411 years from marriage to the first birth, and 22.348 years from the first birth to the second, which never happened. The second woman (id 2) is represented by seven records, implying that she gave birth to six children. And so on.
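The bookkeeping in this description can be checked with a few lines of base R. The nine records are typed in from the listing above (only the columns needed here), and the data frame name fert9 is arbitrary.

```r
# The nine records listed above (id, parity, next.ivl, and event only).
fert9 <- data.frame(
    id       = c(1, 1, 2, 2, 2, 2, 2, 2, 2),
    parity   = c(0, 1, 0, 1, 2, 3, 4, 5, 6),
    next.ivl = c(0.411, 22.348, 0.304, 1.837, 2.546,
                 2.541, 2.431, 2.472, 3.173),
    event    = c(1, 0, 1, 1, 1, 1, 1, 1, 0)
)

# Number of births per mother: the number of intervals ending with a birth.
births <- tapply(fert9$event, fert9$id, sum)
births

# prev.ivl is the length of the preceding interval for the same mother,
# and NA for her first interval.
fert9$prev.ivl <- ave(fert9$next.ivl, fert9$id,
                      FUN = function(x) c(NA, head(x, -1)))
fert9
```

The reconstructed prev.ivl agrees with the column shown in the listing, which illustrates that this covariate refers only to the past of each interval.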

Of course, in an analysis of birth intervals we are interested in causal effects: why are some intervals short while others are long? The dependence on the history can be modeled by lengths of previous intervals (for the same mother), parity, survival of earlier births, and so on. Note that all relevant covariate information must refer to the past. More about that later.

The first interval of a woman is different from the others, since it starts with marriage. It therefore makes sense to analyze these intervals separately. The last interval of a woman is also special; it always ends with a right censoring, at the latest when the woman is 50 years of age. You should think of the data for a woman as generated sequentially in time, starting on the day of her marriage. Follow-up is made to the next birth, as long as she is alive, the marriage is still intact, and she is younger than 50 years of age. If there is no next birth, that is, she reaches 50, or the marriage is dissolved (most often by death of one of the spouses), the interval is censored at the duration when she was last under observation. Censoring can also occur through emigration or by reaching the end of follow-up, in this case 5 November 1901.

Example 3 The illness-death model

Another useful setup is the so-called illness-death model; see Figure 1.5. Individuals may move back and forth between the states Healthy and Diseased, and from each of these two states there is a pathway to Death, which is an absorbing state, meaning that once in that state, you never leave it.

Figure 1.5

The illness-death model.
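The structure of the illness-death model can be written down as a matrix of allowed transitions. The following is only a sketch of the state diagram described in Example 3; the object names are arbitrary.

```r
states <- c("Healthy", "Diseased", "Death")

# allowed[i, j] is TRUE if a direct move from state i to state j is possible.
allowed <- matrix(
    c(FALSE, TRUE,  TRUE,    # from Healthy
      TRUE,  FALSE, TRUE,    # from Diseased
      FALSE, FALSE, FALSE),  # from Death: no way out
    nrow = 3, byrow = TRUE,
    dimnames = list(from = states, to = states)
)
allowed

# An absorbing state has no outgoing transitions.
absorbing <- states[rowSums(allowed) == 0]
absorbing
```

The model has four possible transitions, and the row of the matrix without any TRUE entry identifies Death as the only absorbing state.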

1.7 More Data Sets

A few examples and data sets will be used repeatedly throughout the book, and we give a brief description of them here. They are all available in the R package eha, which is loaded into a running R session by the call

> require(eha)

This loads the eha package, if it wasn’t loaded before. In the examples to follow, we assume that this is already done. The main data source is the Demographic Data Base, Umeå University, Sweden. However, one data set is taken from the home page of Statistics Sweden (http://www.scb.se).

Example 4 Survival of males aged 20

This data set is included in the R (R Development Core Team 2011) package eha (Broström 2012). It contains information about 1023 males, age 20 between 1 January 1800 and 31 December 1819, and living in Skellefteå, a parish in the north-east of Sweden. The total number of records in the data frame is 1211; that is, some individuals are represented by more than one record in the data file. The reason for that is that the socioeconomic status (ses) is one of the covariates in the file, and it changes over time. Each time a change is recorded, a new record is created for that individual, with the new value of SES. For instance, the third and fourth rows in the data frame are

> library(eha)
> data(mort)
> mort[3:4,]
 id enter  exit event birthdate  ses
3 3 0.000 13.463 0 1800.031 upper
4 3 13.463 20.000 0 1800.031 lower

Note that the variable id is the same (3) for the two records, meaning that both records are information about individual No. 3. The variable enter is the time (in years) that has elapsed since the 20th birthday, and exit likewise. The information about him is that he was born on 1800.031, or January 12, 1800, and he is followed from his 20th birthday, that is, from January 12, 1820. He is in an upper socioeconomic status until he is 20 + 13.46311 = 33.46311 years of age, when he unfortunately is degraded to a lower ses. He is then followed until 20 years have elapsed, that is, until his 40th birthday. The variable event tells us that he is alive when we stop observing him; the value zero indicates that the follow-up ends with right censoring.
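The translation from a decimal birth date such as 1800.031 to a calendar date can be sketched as follows. The helper decimal_to_date is hypothetical, and it assumes that the fraction is the share of the calendar year that has elapsed; since the dates are stored with three decimals, the result may differ by a day from a date computed with another convention.

```r
# Hypothetical helper: convert a decimal year such as 1800.031 to a Date.
# The fraction is taken as the share of that calendar year that has elapsed.
decimal_to_date <- function(x) {
    year  <- floor(x)
    start <- as.Date(paste0(year, "-01-01"))
    end   <- as.Date(paste0(year + 1, "-01-01"))
    days  <- as.numeric(end - start)          # 365 or 366
    start + floor((x - year) * days)
}

born  <- decimal_to_date(1800.031)        # birth date of individual No. 3
start <- decimal_to_date(1800.031 + 20)   # start of follow-up, at age 20
format(c(born, start))
```

With this convention, the two calls give 12 January 1800 and 12 January 1820, in agreement with the description of individual No. 3.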

In an analysis of male mortality with this data set, we could ask whether there is a socioeconomic difference in mortality, and also if it changes over time. That would typically be done by Cox regression or by a parametric proportional hazards model. More about that follows in later chapters.

Example 5 Infant mortality

This data set is taken from Broström (1987) and concerns the interplay between infant and maternal mortality in 19th century Sweden (source: The Demographic Data Base, Umeå University, Sweden). More specifically, we are interested in estimating the effect of a mother’s death on the infant’s survival chances. Because maternal mortality was rare (around one per 200 births), matching is used. This is performed as follows: for each child experiencing the death of its mother (before age one), two matched controls were selected. The criteria were: same age as the case at the event, same sex, birth year, parish, socioeconomic status, and marital status of mother. The triplets so created were followed until age one, and any deaths of the infants were recorded. The data collected in this way are part of the eha package under the name infants, and the first rows of the data frame are shown here:

> data(infants)
> head(infants)
 stratum enter exit event mother age sex  parish  civst
1  1 55 365 0  dead 26 boy Nedertornea married
2  1 55 365 0 alive 26 boy Nedertornea married
3  1 55 365 0 alive 26 boy Nedertornea married
4  2 13  76 1  dead 23 girl Nedertornea married
5  2 13 365 0 alive 23 girl Nedertornea married
6  2 13 365 0 alive 23 girl Nedertornea married
 ses year
1 farmer 1877
2 farmer 1870
3 farmer 1882
4 other 1847
5 other 1847
6 other 1848

A short description of the variables follows.

stratum denotes the id of the triplets, 35 in all.

enter is the age in days of the case, when its mother died.

exit is the age in days when follow-up ends. It takes the value 365 (one year) for those who survived their first anniversary.

event indicates whether a death (1) or a survival (0) was observed.

mother has value dead for all cases and the value alive for the controls.

age Age of mother at infant’s birth.

sex Sex of the infant.

parish Birth parish.

civst Civil status of mother, married or unmarried.

ses Socioeconomic status, often the father’s, based on registrations of occupation.

year Calendar year of the birth.

This data set is discussed and analyzed in Chapter 8.
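The matched design can be verified on the rows shown above by cross-tabulating stratum against the mother’s status. Only the six listed rows are used here, typed in by hand; the data frame name inf6 is arbitrary.

```r
# The six rows listed above (stratum, mother, and event only).
inf6 <- data.frame(
    stratum = c(1, 1, 1, 2, 2, 2),
    mother  = c("dead", "alive", "alive", "dead", "alive", "alive"),
    event   = c(0, 0, 0, 1, 0, 0)
)

# Stratum by mother's status: one case and two controls in each triplet.
tab <- table(inf6$stratum, inf6$mother)
tab
```

With the full data set loaded from eha, the same table call applied to infants would be expected to show this pattern in every stratum.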

Example 6 Old age mortality, tabular data

This data set is taken from Statistics Sweden. It is freely available on the web site http://www.scb.se. The aggregated data set contains information about population size and number of deaths by sex and age for the ages 61 and above for the year 2007.

> data(swe07)
> head(swe07)
 pop deaths sex age log.pop
1 63483 286 female 61 11.05853
2 63770 309 female 62 11.06304
3 64182 317 female 63 11.06948
4 63097 366 female 64 11.05243
5 61671 387 female 65 11.02957
6 57793 419 female 66 10.96462
> tail(swe07)
 pop deaths sex age log.pop
35 31074 884 male 75 10.34413
36 29718 904 male 76 10.29951
37 29722  1062 male 77 10.29964
38 28296  1112 male 78 10.25048
39 27550  1219 male 79 10.22376
40 25448  1365 male 80 10.14439

The variables have the following meanings.

pop Average population size 2007 in the age and for the sex given on the same row. The average is based on the population at the beginning and end of the year 2007.

deaths The observed number of deaths in the age and for the sex given on the same row.

sex Female or male.

age Age in completed years.

log.pop The natural logarithm of pop. This variable is used as offset in a Poisson regression.

See Chapter 4 for how to analyze this data set.
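To indicate how log.pop enters as an offset, here is a minimal sketch using only the six female rows printed above, typed in by hand. It is an illustration of the mechanics, not the analysis presented in Chapter 4, and six rows are far too few for any substantive conclusion.

```r
# Female rows for ages 61 to 66, copied from the listing above.
swe <- data.frame(
    pop    = c(63483, 63770, 64182, 63097, 61671, 57793),
    deaths = c(286, 309, 317, 366, 387, 419),
    age    = 61:66
)

# Crude death rates: deaths divided by average population size.
swe$rate <- swe$deaths / swe$pop

# Poisson regression of deaths on age with log(pop) as offset, so that
# the model describes the log death rate as a linear function of age.
fit <- glm(deaths ~ age + offset(log(pop)), family = poisson, data = swe)
coef(fit)
```

The offset term log(pop) has its coefficient fixed at one, which turns a model for the expected number of deaths into a model for the death rate.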
