Chapter 24: Working with Multiple Observations per Subject

24.1  Introduction

24.2  Identifying the First or Last Observation in a Group

24.3  Counting the Number of Visits Using PROC FREQ

24.4  Computing Differences between Observations

24.5  Computing Differences between the First and Last Observation in a BY Group Using the LAG Function

24.6  Computing Differences between the First and Last Observation in a BY Group Using a RETAIN Statement

24.7  Using a Retained Variable to “Remember” a Previous Value

24.8  Problems

 

24.1  Introduction

You might encounter data sets with multiple observations per subject (or other groupings such as transactions on a single day or observations grouped in other ways). These data structures are sometimes referred to as longitudinal data.

Because SAS processes one observation at a time, you need special techniques to perform calculations across observations. For example, if you have a data set representing patient visits to a clinic, you might want to compare patient values from one visit to the next or from the first visit to the last visit. You may want to retrieve the data from the first or last visit for each patient.

The good news is that SAS has all the tools you need to accomplish these tasks. This chapter shows you how to use these tools.

24.2  Identifying the First or Last Observation in a Group

For most of the examples in this chapter, you will use a data set (Clinic) consisting of medical data on patients who visit a clinic. Here is a listing of this data set:

Figure 24.1: Listing of Data Set Clinic

Figure 24.1: Listing of Data Set

Variables in this data set are patient ID (ID), visit date (VisitDate), diagnosis (Dx), heart rate (HR), systolic blood pressure (SBP), and diastolic blood pressure (DBP). Notice that some patients have several visits, but one patient, Number 303, has a single visit.

One of the important tasks when dealing with data sets like this is to identify when you are processing the first or the last observation for each patient (or the variable that you want to use to identify a group). Program 24.1 shows how to do this:

Program 24.1: Creating FIRST. and LAST. Variables

  proc sort data=Learn.Clinic out=Clinic;

     by ID VisitDate;

  run;

  data Last;

     set clinic;

     by ID;

     put ID= VisitDate= First.ID= Last.ID=;

     if last.

  run;

The key to this program is the BY statement following the SET statement. In this example, the data set was also sorted by VisitDate within each ID so that the first observation for each ID would correspond to the first visit date and the last observation for each ID would correspond to the last visit date.

This BY statement following a SET statement creates two temporary SAS variables, First.ID and Last.ID. The PUT statement in the program was added so that you can see the value of these two temporary variables.

Here is the section of the SAS log showing these data values.

Note: The shading was added to help identify multiple visits for each patient.

Log File Showing FIRST.ID and LAST.ID Values

 ID=101 VisitDate=10/21/2005 FIRST.ID=1 LAST.ID=0

 ID=101 VisitDate=02/25/2006 FIRST.ID=0 LAST.ID=1

 ID=255 VisitDate=09/01/2005 FIRST.ID=1 LAST.ID=0

 ID=255 VisitDate=12/18/2005 FIRST.ID=0 LAST.ID=0

 ID=255 VisitDate=02/01/2006 FIRST.ID=0 LAST.ID=0

 ID=255 VisitDate=04/01/2006 FIRST.ID=0 LAST.ID=1

 ID=303 VisitDate=10/10/2006 FIRST.ID=1 LAST.ID=1

 ID=409 VisitDate=09/01/2005 FIRST.ID=1 LAST.ID=0

 ID=409 VisitDate=10/02/2005 FIRST.ID=0 LAST.ID=0

 ID=409 VisitDate=12/15/2006 FIRST.ID=0 LAST.ID=1

 ID=712 VisitDate=04/06/2006 FIRST.ID=1 LAST.ID=0

 ID=712 VisitDate=04/15/2006 FIRST.ID=0 LAST.ID=1

It should be pretty clear what these two variables represent: First.ID is true (1) for the first visit for each patient and false (0) otherwise. Last.ID is true for the last visit for each patient and false otherwise. For Patient Number 303, both First.ID and Last.ID are true (it is both the first and the last visit for this patient).

To create a data set consisting of the last visit for each patient, the program uses a subsetting IF statement to check when the value of Last.ID is equal to 1. Here is a listing of data set Last:

Figure 24.2: Listing of Data Set Last

Figure 24.2: Listing of Data Set Last

A common data processing requirement is to perform such operations as setting counters equal to 0,when you are processing the first observation in a BY group and outputting counts when you are processing the last observation in a BY group.

Program 24.2 uses a DATA step to count the number of visits for each patient. (We will demonstrate later how to do this using PROC FREQ.)

Program 24.2: Counting the Number of Visits per Patient Using the DATA Step

  data Count;

     set Clinic;

     by ID;

     *Initialize counter at first visit;

     if first.ID then N_visits = 0;

     *Increment the visit counter;

     N_visits + 1;

     *Output an observation at the last visit;

     if last.ID then output;

  run;

Each time you encounter a new ID, you set the variable N_visits equal to 0. You then increment this variable by one for each iteration of the DATA step, using a sum statement. Finally, an observation is written out to data set Count when you are processing the last visit for each patient. Here is a listing of data set Count:

Figure 24.3: Listing of Data Set Count

Figure 24.3: Listing of Data Set Count

Notice that the values for VisitDate, Dx, HR, SBP, and DBP are values from the last visit for each patient.

24.3  Counting the Number of Visits Using PROC FREQ

Instead of using a DATA step, you can use PROC FREQ to compute the number of visits per patient. Here is the solution:

Program 24.3: Using PROC FREQ to Count the Number of Observations in a BY Group

  proc freq data=Learn.Clinic noprint;

     tables ID / out=Counts;

  run;

 

The NOPRINT option is included because you want PROC FREQ to create a data set and you do not need printed output. In this program, you name the data set Counts. PROC FREQ uses the variable names Count and Percent for the variables representing the frequencies and percentages for each value of the variable listed in the TABLES statement. This is clear if you look at the listing of data set Counts created by Program 24.3.

Figure 24.4: Listing of Data Set Counts

Figure 24.4: Listing of Data Set Counts

You can use the RENAME= and DROP= or KEEP= data set options to control what variables are in this data set and what they are called. For example, the next program renames Count to N_Visits and drops the Percent variable. In addition, a MERGE statement adds the variable N_Visits to each observation in the original Clinic data set. Here is the program:

Program 24.4: Using the RENAME= and DROP= Data Set Options to Control the Output Data Set

  proc freq data=Clinic noprint;

     tables ID / out=Counts (rename=(count = N_Visits)

                             drop=percent);

  run;

  data Clinic;

     merge Learn.Clinic Counts;

     by ID;

  run;

The RENAME= data set option renames the variable Count (the name chosen by SAS) to N_Visits. The DROP= data set option tells SAS that you don’t want the variable Percent in the new data set. You don’t need to sort data set Counts because it will be in ID order. The default ordering for PROC FREQ is to order values by their internal (unformatted) values. Data set Clinic is already in ID order. Here is the result:

Figure 24.5: Listing of Data Set Clinic

Figure 24.5: Listing of Data Set Clinic

You now have the same result produced by Program 24.2. You may prefer the DATA step approach or PROC FREQ to do the counting for you.

24.4  Computing Differences between Observations

Suppose you want to see changes in heart rate (HR), systolic blood pressure (SBP), and diastolic blood pressure (DBP) from visit to visit. The LAG function provides an easy way to compare values from the present observation to ones in a previous observation. Here is a program to output changes from one visit to the next:

Program 24.5: Computing Differences between Observations

  data Difference;

     set Clinic;

     by ID;

     *Delete patients with only one visit;

     if first.ID and last.ID then delete;

     Diff_HR = HR - lag(HR);

     Diff_SBP = SBP - lag(SBP);

     Diff_DBP = DBP - lag(DBP);

     if not first.ID then output;

  run;

There are a few important points to notice about this program. First, you delete patients with only one visit (when both First.ID and Last.ID are true). Next, you use the LAG function to obtain values of HR, SBP, and DBP from the previous visit. Let’s take a moment and follow the DATA step logic.

The first patient (ID=101) has more than one visit, so the first IF statement is not true. The values of the three Diff variables are all missing during the first iteration of the DATA step. First.ID is true, so NOT First.ID is false, and an observation is not written out to the data set.

During the second iteration of the DATA step, the three Diff variables represent the difference between the current values and the values from the last visit. The last IF statement is true, and an observation is written out to data set Difference.

The next observation is the first visit for Patient Number 255. Again, this patient has more than one visit, so the first IF statement is not true. The three Diff variables represent the difference between the three values for Patient Number 255 and the last visit for Patient Number 101! This seems wrong, but hold on, this will work out OK since you are not going to output an observation this time (the last IF statement is not true). You can think of this iteration of the DATA step as “priming the pump” so that the LAG function returns the correct values on the next iteration of the DATA step. You might be tempted to execute the three DIFF statements conditionally. Do not do this. You need to execute the LAG function for every iteration of the DATA step in this program so that you get the correct difference values.

The next observation from the Clinic data set is the second visit for Patient Number 255. Now the three DIFF statements correctly compute the difference between the current values for Patient Number 255 and the values from the previous visit. Because the last IF statement is now true, this observation is written out to the Difference data set.

You may think this explanation is too long and tedious—and you may be right. However, it is really important that you understand the logic of this program if you are to use the LAG function correctly.

By the way, for you compulsive programmers (not me!), you can use the DIF function to compute the three differences directly. Refer to Chapter 11 for details on this function.

Here is a listing of data set Difference:

Figure 24.6: Listing of Data Set Difference

Figure 24.6: Listing of Data Set Difference

The value of HR, SBP, and DBP represent the values for the current visit. The three Diff values represent these values minus the values from the previous visit.

24.5  Computing Differences between the First and Last Observation in a BY Group Using the LAG Function

This section demonstrates how to compute the difference between the first and last visit for each patient. Remember how you were cautioned never to execute the LAG function conditionally? The program that follows does just that—and it works:

Program 24.6: Computing Differences between the First and Last Observation in a BY Group

  data First_Last;

     set Clinic;

     by ID;

     *Delete patients with only one visit;

     if first.ID and last.ID then delete;

     if first.ID or last.ID then do;

        Diff_HR = HR - lag(HR);

        Diff_SBP = SBP - lag(SBP);

        Diff_DBP = DBP - lag(DBP);

     end;

     if last.ID then output;

  run;

The first time the LAG function executes is when the first visit for each patient is being processed. The next time the LAG function executes is during the last visit. Therefore, the differences are the values in the last visit minus the values from the first visit (the last time the LAG function executed). Notice also that an observation is only written out during the last visit for each patient. Here is a listing of this data set:

Figure 24.7: Listing of Data Set First_Last

Figure 24.7: Listing of Data Set First_Last

The DIFF variables in this listing represent the difference between values recorded at the first visit minus values recorded at the last visit.

24.6  Computing Differences between the First and Last Observation in a BY Group Using a RETAIN Statement

Because SAS has so many useful tools, there are often several ways to solve the same problem. This section describes how to accomplish the same task as the previous section, except you use retained variables instead of the LAG function to accomplish the task.

Using retained variables is one of the best ways to “remember” values from previous observations. By default, remember that variables that do not come from SAS data sets are set to a missing value during each iteration of the DATA step. A RETAIN statement allows you to tell SAS not to do this. Armed with this knowledge, look at the following program:

Program 24.7: Demonstrating the Use of Retained Variables

  data First_Last;

     set Clinic;

     by ID;

     *Delete patients with only one visit;

     if first.ID and last.ID then delete;

     retain First_HR First_SBP First_DBP;

     if first.ID then do;

        First_HR = HR;

        First_SBP = SBP;

        First_DBP = DBP;

     end;   

     if last.ID then do;

        Diff_HR = HR - First_HR;

        Diff_SBP = SBP - First_SBP;

        Diff_DBP = DBP - First_DBP;

        output;

     end;

     drop First_: ;

  run;

You start out the same as Program 24.6. Next, you use a RETAIN statement so that the three variables, First_HR, First_SBP, and First_DBP, are not set back to missing when the DATA step iterates. (The RETAIN statement operates at compile time and is not an executable statement.)

Next, when you are processing the first visit for each patient, you set the three retained variables equal to the values of HR, SBP, and DBP, respectively. These values remain in the program data vector (PDV) and are not replaced until the first visit for a new patient is processed. When you reach the last visit for each patient, you compute the three difference values and output them.

One final comment on this program involves the DROP statement. The value First_: refers to all variables that begin with the characters First_.

The resulting data set is identical to the one created in Program 24.6.

24.7  Using a Retained Variable to “Remember” a Previous Value

This section describes another problem where using a retained variable greatly simplifies the code. For each patient in your Clinic data set, you want to check if the patient had a systolic blood pressure (SBP) over 140 during any of the visits. Here is the program:

Program 24.8: Using a Retained Variable to “Remember” a Previous Value

  data Hypertension;

     set Learn.Clinic;

     by ID;

     retain HighBP;

     if first.ID then HighBP = 'No ';

     if SBP gt 140 then HighBP = 'Yes';

     if last.ID then output;

  run;

This program uses the retained variable HighBP to “remember” if a patient ever had a systolic blood pressure over 140. As each new patient is processed, HighBP is set to No. Then, if any SBP is greater than 140, HighBP is set to Yes. Because this value is retained, it remains equal to Yes even if the SBP is less than 140 on all subsequent visits. When SAS reaches the last visit for each patient, an observation is written to the Hypertension data set. Here is a listing of Hypertension:

Figure 24.8: Listing of Data Set Hypertension

Figure 24.8: Listing of Data Set Hypertension

Even though SAS operates one observation at a time, you can use the techniques in this chapter to perform computations between observations in a SAS data set.

 

24.8  Problems

Solutions to odd-numbered problems are located at the back of this book. Solutions to all problems are available to professors. If you are a professor, visit the book’s companion website at support.sas.com/cody for information about how to obtain the solutions to all problems.

1.       Data set DailyPrices contains a stock symbol and from 1 to 5 observations showing the price for each date. Here is a listing of this data set:

image shown here

Note: These prices are made up but reflect the approximate prices when the first edition of this book was written in 2007.

Create a listing showing the stock symbol, the date, and the price from the most recent date for each stock.

 

2.       Create a temporary SAS data set containing the average price for each stock and the number of values used to create this average. In addition, this data set should contain the minimum and maximum price for each stock. A listing of this data set should look like this:

image shown here

3.       Create a temporary SAS data set with two variables, N_Days and Symbol, by writing a DATA step and using the SAS data set DailyPrices as input. N_Days represents the number of observations for each stock.

4.       Repeat Problem 3 using PROC FREQ to count the number of observations for each unique stock symbol.

5.       Using the SAS data set DailyPrices, compute the difference between the price on the last day minus the price on the first day. Omit any stocks that have only one observation. Do this using a DATA step and retained variables.

6.       Repeat Problem 5 using the LAG or DIF function.

7.       Using the SAS data set DailyPrices, compute day-to-day differences for all stocks that have more than one observation.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset