Chapter 8. Managing Data in IMLPlus

Contents

  • 8.1 Overview of Managing Data in IMLPlus

  • 8.2 Creating a Data Object

  • 8.3 Creating a Data Object from a SAS Data Set

  • 8.4 Creating Linked Graphs from a Data Object

  • 8.5 Creating a Data Object from a Matrix

  • 8.6 Creating a SAS Data Set from a Data Object

  • 8.7 Creating a Matrix from a Data Object

  • 8.8 Adding New Variables to a Data Object

    • 8.8.1 Variable Transformations

    • 8.8.2 Adding Variables for Predicted and Residual Values

    • 8.8.3 A Module to Add Variables from a SAS Data Set

  • 8.9 Review: The Purpose of the DataObject Class

8.1 Overview of Managing Data in IMLPlus

As introduced in Section 6.3, the most important class in the IMLPlus language is the DataObject class. The DataObject class manages an in-memory copy of data. The class also manages graphical information about observations such as the shape and color of markers, the selected state of observations, and whether observations are displayed in plots or are excluded.

As shown in Chapter 7, "Creating Statistical Graphs," it is the DataObject class that enables you to create dynamically linked graphs. Graphs that are created from the same data object are automatically linked to each other.

This chapter describes how to create a data object from a source of data and how to create a SAS data set or a SAS/IML matrix from a data object. The chapter also describes how to modify a data object by adding new variables such as predicted values, residual values, and transformed variables.

8.2 Creating a Data Object

The DataObject class has several "Create" methods that instantiate a data object. A data object can be instantiated from any of several sources of data: from a SAS/IML matrix, from a Microsoft Excel worksheet, from an R data frame, from a SAS server data set, or from a SAS data set stored on the client PC. A server data set is one that is in a SAS library such as Work or Sashelp, or in a libref that you defined by using the LIBNAME statement. A client data set is one that can be accessed by the operating system of the computer running SAS/IML Studio. For example, this could be a data set on a hard drive or USB drive of the local PC, or any data set that is accessible through a mounted, networked drive.

The following table lists methods that create a data object from various data sources:

Table 8.1. Creating a Data Object

Method                     Source
-----------------------    -------------------------------------------
Create                     SAS/IML matrix
CreateFromExcelFile        Microsoft Excel workbook
CreateFromFile             SAS data set on local PC or networked drive
CreateFromR                Data frame or matrix in R
CreateFromServerDataSet    SAS data set in libref

8.3 Creating a Data Object from a SAS Data Set

Most SAS programmers store data in a SAS data set. The CreateFromFile method instantiates a data object from a SAS data set on the client PC (the file has a sas7bdat extension). The CreateFromServerDataSet method instantiates a data object from a SAS data set on a SAS server, which is typically located in Work, Sasuser, or a user-defined libref such as MyLib.

Creating a data object from a SAS data set is easy: you need to declare the name of the data object and then instantiate the object, as shown in the following statements:

/* create a data object from a SAS data set */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");

The data object is invisible, so the statements do not produce any windows. If you want a tabular view of the data, you can create an instance of the DataTable class, as described in Section 6.8.
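The client-side case can be sketched in the same way. The following statements are a minimal sketch rather than part of the original program: the path is hypothetical and must be replaced by the location of a sas7bdat file on your PC, and the name dobjClient is chosen only to avoid a clash with dobj.

```sas
/* sketch: create a data object from a SAS data set on the client PC.
   The path is hypothetical; substitute the location of your own file. */
declare DataObject dobjClient;
dobjClient = DataObject.CreateFromFile("C:\MyData\Movies.sas7bdat");
DataTable.Create(dobjClient);       /* optional: open a tabular view */
```

The call to DataTable.Create is included only so that the result is visible.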

8.4 Creating Linked Graphs from a Data Object

The DataObject class provides a uniform interface for setting properties of the data, for retrieving data, and for creating dynamically linked graphs. By using the DataObject class, you can read data from various sources and then manipulate the data without regard to the details of how the data are stored.

The following statements (which continue the program in the previous section) create a scatter plot and a bar chart. The first argument to each Create method is a data object. Subsequent arguments name variables in that data object. Because the scatter plot and the bar chart are created from a common data object, they are automatically linked to each other and to any other graphical or tabular view of the same data object.

/* create plots from a common data object */
declare ScatterPlot plot;
plot = ScatterPlot.Create(dobj, "ReleaseDate", "Budget");

declare BarChart bar;
bar = BarChart.Create(dobj, "MPAARating");

The resulting graphs are shown in Figure 8.1. The figure shows the graphs after clicking on the PG category in the bar chart. Note that the 66 movies that are rated PG are selected. These observations are displayed as highlighted—both in the bar chart and also in the scatter plot.

Figure 8.1. Linked Graphs

Sometimes it is useful to label observations in a scatter plot by using values of a particular variable. You can tell the data object which variable is the "label variable" by using the SetRoleVar method in the DataObject class, as shown in the following statement:

dobj.SetRoleVar(ROLE_LABEL, "Title");

When you click on an observation marker in a scatter plot, the value of the "label variable" is displayed in the graph. For this example, the scatter plot displays the title of the movie. For example, click on the movie with the largest budget to discover that the movie is Spider-Man 3.

Figure 8.2 shows another example of a linked graph. It is created by the following statements:

declare BarChart barl;
barl = BarChart.Create(dobj, "Profanity");

Recall that the Profanity variable contains a number 0-10 that represents the level of profane language in a movie as reported by the kids-in-mind.com Web site. The bar chart shows both the distribution of the Profanity variable and the distribution of the selected PG-rated movies (shown with cross-hatching). The bar chart makes it easy to compare the conditional distribution (that is, the distribution of Profanity given that the movie is rated PG) with the distribution of all movies. Figure 8.2 shows that PG-rated movies tend to have relatively low levels of profanity: the mean for the PG-rated movies appears to be close to 2, whereas the general mean for all of the movies appears to be closer to 4.5.

Figure 8.2. Conditional Distribution (Cross-Hatched) of Profanity for PG-Rated Movies

After a data object is instantiated, there is no connection between the data object and the source of the data (in this case, Sasuser.Movies). You can delete or modify data in the data object without affecting the data source. Similarly, the source of the data can be deleted or modified without altering the in-memory data object.

8.5 Creating a Data Object from a Matrix

When the data you want to graph are in a SAS/IML matrix, you can create a data object directly from the matrix data. The Create method instantiates a data object from a SAS/IML matrix. For example, the following statements demonstrate how to create a data object from a matrix that contains three random variables:

/* create a data object from data in a matrix */
x = j(100, 3);                     /* 100 observations, 3 variables */
call randgen(x, "Normal");                                     /* 1 */
varNames = 'x1':'x3';                                          /* 2 */
declare DataObject dobj;
dobj = DataObject.Create("Normal Data", varNames, x);          /* 3 */

The program contains the following steps:

  1. The matrix x, which contains 100 observations and three columns, is filled with pseudorandom numbers from the standard normal distribution.

  2. The character vector varNames contains names to assign to the columns of x. The notation 'x1':'x3' uses the index operator to generate a vector of names with a common prefix.

  3. A data object is created from x. The columns are named according to the values of varNames.

The previous program does not create any output or graphs. However, you can see the contents of a data object by creating a data table from the data object:

DataTable.Create(dobj);

The data table might appear behind the program window. If so, move your programming window to reveal the data table.

8.6 Creating a SAS Data Set from a Data Object

You might want to save the contents of a data object, especially if you have added variables that you intend to use in future analyses. The following table summarizes the methods in the DataObject class that are frequently used to save the contents of a data object to a SAS data set.

Table 8.2. Creating a Data Set from a Data Object

Method                      Destination
------------------------    -------------------------------------------
WriteToFile                 SAS data set on local PC or networked drive
WriteToServerDataSet        SAS data set in libref
WriteVarsToServerDataSet    SAS data set in libref

The WriteVarsToServerDataSet method is useful for writing a subset of variables to a SAS data set. This is especially useful as a prelude to calling a SAS procedure. As explained in Section 8.2, a data object can be created from many sources of data, so if you want to call a SAS procedure, you need to make sure that the relevant variables are in a SAS data set on the SAS server. You could write the entire data object to a libref such as Work, but it is more efficient to write only the variables that are actually needed for the analysis.

For example, suppose you want to write a module named Skewness that calls the MEANS procedure to compute the skewness statistic for each specified variable in a data object. The following statements implement and call the Skewness module:

/* define a module that computes the skewness of variables in a
 * data object
 */
start Skewness(DataObject dobj, VarNames);              /* 1 */
   dobj.WriteVarsToServerDataSet(VarNames,
                         "Work", "Temp", true);         /* 2 */
   submit VarNames;
      proc means data=Temp noprint;                     /* 3 */
         var &VarNames;
         output out=Skew skewness= ;                    /* 4 */
      run;
   endsubmit;

   use Skew;
   read all var VarNames into x;                        /* 5 */
   close Skew;

   call delete("Work", "Temp");                         /* 6 */
   call delete("Work", "Skew");

   return ( x );
finish;

/* begin the main program */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Vehicles");

vars = {"MPG_Hwy" "MPG_City" "Engine_Liters"};
s = Skewness(dobj, vars);                         /* call the module */
print s[colname=vars label="Skewness"];

The Skewness module consists of the following main steps:

  1. The Skewness module is defined to take two arguments: a data object and a vector of variable names.

  2. The WriteVarsToServerDataSet method writes a SAS data set named Work.Temp that contains the variables in the data object that are specified in the VarNames vector. The last argument to the method (true) specifies that observations that are excluded from analysis are not copied to the data set. (See the section "Attributes of Observations" on page 246 for further details.)

  3. The MEANS procedure is called from within a SUBMIT block. The procedure reads the data from Work.Temp and analyzes the variables specified in the VarNames vector.

  4. The OUTPUT statement outputs the skewness statistic for each variable. The program does not specify a value for the SKEWNESS= option; therefore, the output variables have the same names as the input variables.

  5. The USE and READ statements read the output from the MEANS procedure into a row vector named x.

  6. The DELETE subroutine deletes the temporary data sets created during the computation.

The results of calling the Skewness module are shown in Figure 8.3.

Figure 8.3. The Skewness of Three Variables

In this example, the Sasuser.Vehicles data are already in a SAS data set in a libref, so it was not necessary to use the WriteVarsToServerDataSet method. However, the Skewness module is written so that it works for any data in a data object, regardless of the data source. Furthermore, recall that a data object is independent of the data from which it was instantiated. You can use DataObject class methods to delete observations or to exclude them from the analysis (see Section 10.5.2). Therefore, it is a good programming practice to write variables in a data object to a SAS data set prior to calling a SAS procedure.

8.7 Creating a Matrix from a Data Object

You might want to use SAS/IML operations and functions to analyze data that are in a data object. You can get data from one or more variables in a data object by using the GetVarData method in the DataObject class.

The GetVarData method can be called in two ways: you can get all observations for a set of variables, or you can specify the observation numbers that you want to get. Getting all of the observations is shown in the following statements, which get all values for the Hybrid variable in the Vehicles data set:

/* extract values from a variable into a vector */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Vehicles");
dobj.GetVarData("Hybrid", h);

In the previous statements, the h vector contains the data in the Hybrid variable. In particular, the vector contains a one for each observation that is for a hybrid-electric vehicle, and a zero for other observations. You can use the LOC function to find conventional vehicles, and then extract only those observations for each of several variables:

/* extract certain observations into a matrix */
idx = loc(h=0);                       /* find conventional vehicles */
VarNames = {"MPG_Hwy" "MPG_City" "Engine_Liters"};
/* extract data for specified vars and specified obs into matrix x  */
dobj.GetVarData(VarNames, idx, x);
corr = corr(x);                       /* correlation between vars   */
print corr[r=VarNames c=VarNames];

The previous statements fill the x matrix with data from the specified variables and for the specified observations. Then a correlation matrix is computed by using the CORR function. The result is shown in Figure 8.4. Notice that the x matrix contains observations only for the traditional (not hybrid-electric) vehicles.

Figure 8.4. Correlation Matrix for Conventional Vehicles

8.8 Adding New Variables to a Data Object

As mentioned previously, after a data object is instantiated you can delete or modify data in the data object without affecting the data source. For example, you can add or delete variables. Deleting a variable is accomplished with the DeleteVar method in the DataObject class. You can add new variables with the AddVar, AddVars, and AddAnalysisVar methods.

In practice, adding variables occurs more frequently than deleting variables. There are several reasons for needing to add new variables to a DataObject. This chapter discusses two reasons: transformations of variables, and adding predicted and residual values from a regression model.
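The basic pattern of adding and deleting a variable can be sketched on a small data object. The following statements are an illustrative sketch, not part of the programs in this chapter: the names dobjTmp, s, and Sum are hypothetical, and the sketch assumes that DeleteVar takes the name of the variable to delete.

```sas
/* sketch: add a variable to a data object and then delete a variable */
x = j(10, 2);                          /* 10 observations, 2 variables  */
call randgen(x, "Normal");
declare DataObject dobjTmp;
dobjTmp = DataObject.Create("Temp Data", {"x1" "x2"}, x);

s = x[,1] + x[,2];                     /* data for the new variable     */
dobjTmp.AddVar("Sum", "x1 + x2", s);   /* name, label, data             */
dobjTmp.DeleteVar("x2");               /* assumed: argument is var name */
DataTable.Create(dobjTmp);             /* view the result               */
```

Because the data object is independent of its source, neither call changes the matrix x.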

8.8.1 Variable Transformations

As mentioned in Section 3.3.1, it is common to transform data during exploratory data analysis and statistical modeling. If a variable is heavily skewed, it is common to apply a logarithmic transformation in an attempt to work with a more symmetric distribution.

The program in this section creates a data object from the Movies data set. Methods in the DataObject class are used to copy the Budget variable into a SAS/IML vector. SAS/IML functions are used to transform the data, and then a DataObject method is used to create a new variable in the data object that contains the transformed data. The program is as follows:

/* transform data, add new variable, and create histogram */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");
dobj.GetVarData("Budget", b);                            /* 1 */

/* apply log transform */
log_b = log(b);                                          /* 2 */
dobj.AddVar("LogBudget", "log(Budget)", log_b);          /* 3 */
Histogram.Create(dobj,"LogBudget");                      /* 4 */

As usual, the object of the DataObject class is called dobj. It is instantiated from the Movies data. The program contains the following steps:

  1. The GetVarData method retrieves the data in the Budget variable and puts the data into a vector called b.

  2. The LOG function applies a logarithmic transform to the data and stores the transformed data in the log_b vector.

  3. The AddVar method adds the transformed data to the data object. The first argument is the name of the new variable, in this case LogBudget. The second argument is the label for the new variable. The third argument is the vector that contains the data for the new variable.

  4. The data object is invisible, so it is reassuring to create a data table or graph to make sure that the program worked. In this case, the program creates a histogram of the new variable.

The result of the program is shown in Figure 8.5. The logarithmic transformation has created a new variable whose distribution is more symmetric than the distribution of the original data.

Figure 8.5. Transformed Data

You can also call the AddVar method with only two arguments, as shown in the following statement:

dobj.AddVar("LogBudget", log_b);              /* alternative signature */

In this case the label for the new variable is omitted. Consequently the name of the variable is also used as the label for the new variable.

Notice that the Budget variable contains only positive values, so it is safe to take the log of each observation. If you are not certain whether the data contain nonpositive values, you must take greater care in writing the SAS/IML statements that transform the data. A common convention is to assign a missing value to the logarithm of a nonpositive value.
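One way to apply this convention is sketched in the following statements. The sketch assumes that b is the vector that was filled by the GetVarData call earlier in this section; it is an illustration of the convention and not necessarily the only way to write it.

```sas
/* sketch: assign a missing value to the log of a nonpositive value */
log_b = j(nrow(b), ncol(b), .);      /* initialize all values to missing */
idx = loc(b > 0);                    /* indices of positive values       */
if ncol(idx) > 0 then
   log_b[idx] = log(b[idx]);         /* transform only positive values   */
```

The check on ncol(idx) guards against the case in which no value is positive, because the LOC function returns an empty matrix in that case.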

8.8.2 Adding Variables for Predicted and Residual Values

In Section 4.3, the GLM procedure is used to model a response variable. You can modify the program in that section to model the Mpg_Hwy variable by a quadratic function of the size of a vehicle's engine, as represented by the Engine_Liters variable. The GLM procedure can write an output data set that contains variables such as the predicted and residual values for the data.

This section describes how you can add variables that contain predicted and residual values to a data object. As shown in the example in the previous section, you first read the variables into vectors, and then add the vectors to the data object by calling the AddVar method.

For example, the following program calls the GLM procedure to compute a quadratic regression model. The procedure creates an output data set that contains predicted and residual values for a linear model. The program begins by creating a data object from the Vehicles data, creating a scatter plot of two variables, and calling the GLM procedure:

/* call a SAS procedure to create predicted and residual values */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Vehicles");
dobj.SetRoleVar(ROLE_LABEL, "Model");     /* label for selected obs */

declare ScatterPlot p;
p = ScatterPlot.Create(dobj, "Engine_Liters", "Mpg_Hwy");

submit;
proc glm data=Sasuser.Vehicles;
   model Mpg_Hwy = Engine_Liters | Engine_Liters;
   output out=GLMOut P=Pred R=Resid;
quit;
endsubmit;

The Pred and Resid variables are in the GLMOut data set. After you read them into SAS/IML vectors, you can call the AddVar method twice, once for each variable, or you can call the AddVars method to add both variables at once, as shown in the following statements:

/* add predicted and residual values to a data object */
use GLMOut;
read all var {"Pred" "Resid"};
close GLMOut;

dobj.AddVars( {"Pred" "Resid"},
              {"Predicted Values" "Residual Values"},
              pred || resid );
ScatterPlot.Create(dobj, "Engine_Liters", "Resid");

The arguments to the AddVars method are similar to those for the AddVar method: the first argument names the new variables, the second (optional) argument specifies the variables' labels, and the third provides the data.
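For comparison, the alternative that calls the AddVar method once for each variable can be sketched as follows. These statements are meant to replace the AddVars call, not to follow it, and they assume that the pred and resid vectors have already been read from the GLMOut data set.

```sas
/* alternative sketch: add each variable with a separate AddVar call */
dobj.AddVar("Pred",  "Predicted Values", pred);
dobj.AddVar("Resid", "Residual Values",  resid);
```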

The program creates a scatter plot, as shown in Figure 8.6. The scatter plot does not reveal any pattern to the residuals that might indicate an incorrectly specified model. Several outliers in the residual plot are selected. The selected observations indicate that the Prius Hybrid and the Corvette get better gas mileage than would be expected from the model, whereas the RX-8 gets substantially less mileage than would be expected.

Figure 8.6. A Residual Plot

8.8.3 A Module to Add Variables from a SAS Data Set

The analysis in the previous section calls a SAS procedure (GLM) and writes the results of the procedure to an output data set. These results are then read into vectors and added to an IMLPlus data object. This sequence of operations occurs so frequently that SAS/IML Studio is distributed with a module that reads variables from a SAS data set in a libref and adds those variables to a data object. The module is named CopyServerDataToDataObject; it is documented in the online Help chapter "IMLPlus Module Reference."

The module is implemented so that it not only adds the variables but also preserves the formats of variables. For example, the following module call is an alternative way to add the Pred and Resid variables from the GLMOut data set:

/* add predicted and residual values to a data object */
VarNamesInDataSet    = {"Pred" "Resid"};
VarNamesInDataObject = {"Pred" "Resid"}; /* same for this example */
Labels = {"Predicted Values" "Residual Values"};
ok = CopyServerDataToDataObject("work", "GLMOut", dobj,
     VarNamesInDataSet, VarNamesInDataObject, Labels,
     1 /* replace variable if it already exists */ );

The module returns 1 if it succeeds and returns 0 if it fails. Notice that you can specify the names of the variables in the data object independently from the names of the variables in the SAS data set.

The last argument to the module specifies what to do if one of the variables you are adding has the same name as an existing variable. The various options are documented in the online Help.

If you specify an empty matrix for the penultimate argument, then the module uses the labels (if any) in the SAS data set as labels for the corresponding new variables in the data object. How can you specify an empty matrix? By using a matrix name that has not been assigned a value in the program. A useful convention is to reserve the matrix _NULL_ for the empty matrix, and never assign _NULL_ a value. With that convention, the simplest way to add variables from a SAS data set into a data object is shown in the following statements:

ok = CopyServerDataToDataObject("work", "GLMOut", dobj,
     VarNamesInDataSet, VarNamesInDataObject, _NULL_, 1 );

If you need the data in a SAS/IML vector, you can retrieve the data from a data object by using the GetVarData method in the DataObject class. This is described in "Creating a Matrix from a Data Object" on page 179.
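For example, after the Resid variable has been added to the data object, you might retrieve it and select the observations with large residuals. The following statements are a sketch under stated assumptions: the cutoff value is hypothetical, and SelectObs is a DataObject method that is not otherwise used in this chapter.

```sas
/* sketch: retrieve residuals and select observations with large residuals */
dobj.GetVarData("Resid", r);           /* copy the variable into vector r  */
cutoff = 10;                           /* hypothetical threshold           */
idx = loc(abs(r) > cutoff);            /* obs numbers of large residuals   */
if ncol(idx) > 0 then do;
   dobj.SelectObs(idx);                /* highlight them in linked views   */
end;
```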

8.9 Review: The Purpose of the DataObject Class

A data object serves the following purposes:

  • to read data into memory from various sources

  • to be a uniform programming interface by providing methods that set and get data and properties of observations and variables

  • to coordinate the display of data in dynamically linked graphs and data tables

These roles of the DataObject class are shown schematically in Figure 6.1.

You can use methods in the data object to manage the properties of observations and variables. For example, each observation is represented by a marker in scatter plots. The data object ensures that the properties of each observation marker (for example, color and shape) are the same in all graphical or tabular views of the data. The same is true for properties of variables: properties such as formats, labels, and roles are maintained by the data object. Chapter 10, "Marker Shapes, Colors, and Other Attributes of Data," describes how to manage these and other properties of observations and variables.

When you are exploring data, an important property of an observation is whether or not the observation is in a selected state. Each observation in a data object can be in a selected state or in an unselected state. The data object makes sure that observations that are selected in one graph are highlighted in all other graphs. For example, in Figure 8.1, the PG-rated movies are selected in the bar chart. When you click on a bar, the scatter plot updates and highlights the markers for those same movies. Similarly, when you select a group of observations in the scatter plot, the bar chart displays the distribution of ratings for the selected observations.
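The selected state can also be queried from a program. The following statements are a minimal sketch that assumes dobj is the data object created from the Movies data and that some observations might have been selected interactively; GetSelectedObsNumbers is a DataObject method that is not otherwise used in this chapter, and the names selIdx and titles are chosen for illustration.

```sas
/* sketch: find out which observations are currently selected */
dobj.GetSelectedObsNumbers(selIdx);    /* obs numbers of selected obs   */
if ncol(selIdx) > 0 then do;
   dobj.GetVarData("Title", selIdx, titles);
   print titles;                       /* titles of the selected movies */
end;
else print "No observations are selected.";
```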
