Contents
8.2 Creating a Data Object
8.3 Creating a Data Object from a SAS Data Set
8.4 Creating Linked Graphs from a Data Object
8.5 Creating a Data Object from a Matrix
8.6 Creating a SAS Data Set from a Data Object
8.7 Creating a Matrix from a Data Object
8.8 Adding New Variables to a Data Object
8.8.1 Variable Transformations
8.8.2 Adding Variables for Predicted and Residual Values
8.8.3 A Module to Add Variables from a SAS Data Set
8.9 Review: The Purpose of the DataObject Class
As introduced in Section 6.3, the most important class in the IMLPlus language is the DataObject class. The DataObject class manages an in-memory copy of data. The class also manages graphical information about observations such as the shape and color of markers, the selected state of observations, and whether observations are displayed in plots or are excluded.
As shown in Chapter 7, "Creating Statistical Graphs," it is the DataObject class that enables you to create dynamically linked graphs. Graphs that are created from the same data object are automatically linked to each other.
This chapter describes how to create a data object from a source of data and how to create a SAS data set or a SAS/IML matrix from a data object. The chapter also describes how to modify a data object by adding new variables such as predicted values, residual values, and transformed variables.
The DataObject class has several "Create" methods that instantiate a data object. A data object can be instantiated from any of several sources of data: from a SAS/IML matrix, from a Microsoft Excel worksheet, from an R data frame, from a SAS server data set, or from a SAS data set stored on the client PC. A server data set is one that is in a SAS library such as Work or Sashelp, or in a libref that you defined by using the LIBNAME statement. A client data set is one that can be accessed by the operating system of the computer running SAS/IML Studio. For example, this could be a data set on a hard drive or USB drive of the local PC, or any data set that is accessible through a mounted, networked drive.
The following table lists methods that create a data object from various data sources:
Most SAS programmers store data in a SAS data set. The CreateFromFile method instantiates a data object from a SAS data set on the client PC (the file has a sas7bdat extension). The CreateFromServerDataSet method instantiates a data object from a SAS data set on a SAS server, which is typically located in Work, Sasuser, or a user-defined libref such as MyLib.
Creating a data object from a SAS data set is easy: you need to declare the name of the data object and then instantiate the object, as shown in the following statements:
/* create a data object from a SAS data set */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");
The data object is invisible, so the statements do not produce any windows. If you want a tabular view of the data, you can create an instance of the DataTable class, as described in Section 6.8.
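The CreateFromFile method is used in the same way for a data set on the client PC. The following is a minimal sketch, not a definitive usage: the file location C:\MyData\Movies.sas7bdat and the object name dobjFile are hypothetical, and the sketch assumes that the method accepts the path to the file as its single argument.

```sas
/* sketch: create a data object from a SAS data set on the client PC */
/* ASSUMPTION: the path below is hypothetical; substitute your own file */
declare DataObject dobjFile;
dobjFile = DataObject.CreateFromFile("C:\MyData\Movies.sas7bdat");
```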
The DataObject class provides a uniform interface for setting properties of the data, for retrieving data, and for creating dynamically linked graphs. By using the DataObject class, you can read data from various sources and then manipulate the data without regard to the details of how the data are stored.
The following statements (which continue the program in the previous section) create a scatter plot and a bar chart. The first argument to each Create method is a data object. Subsequent arguments name variables in that data object. Because the scatter plot and the bar chart are created from a common data object, they are automatically linked to each other and to any other graphical or tabular view of the same data object.
/* create plots from a common data object */
declare ScatterPlot plot;
plot = ScatterPlot.Create(dobj, "ReleaseDate", "Budget");
declare BarChart bar;
bar = BarChart.Create(dobj, "MPAARating");
The resulting graphs are shown in Figure 8.1. The figure shows the graphs after clicking on the PG category in the bar chart. Note that the 66 movies that are rated PG are selected. These observations are highlighted in both the bar chart and the scatter plot.
Sometimes it is useful to label observations in a scatter plot by using values of a particular variable. You can tell the data object which variable is the "label variable" by using the SetRoleVar method in the DataObject class, as shown in the following statement:
dobj.SetRoleVar(ROLE_LABEL, "Title");
When you click on an observation marker in a scatter plot, the value of the "label variable" is displayed in the graph. For this example, the scatter plot displays the title of the movie. For example, click on the movie with the largest budget to discover that the movie is Spider-Man 3.
Figure 8.2 shows another example of a linked graph. It is created by the following statements:
declare BarChart barl;
barl = BarChart.Create(dobj, "Profanity");
Recall that the Profanity variable contains a number from 0 to 10 that represents the level of profane language in a movie as reported by the kids-in-mind.com Web site. The bar chart shows both the distribution of the Profanity variable and the distribution of the selected PG-rated movies (shown with cross-hatching). The bar chart makes it easy to compare the conditional distribution (that is, the distribution of Profanity given that the movie is rated PG) with the distribution of all movies. Figure 8.2 shows that PG-rated movies tend to have relatively low levels of profanity: the mean for the PG-rated movies appears to be close to 2, whereas the general mean for all of the movies appears to be closer to 4.5.
After a data object is instantiated, there is no connection between the data object and the source of the data (in this case, Sasuser.Movies). You can delete or modify data in the data object without affecting the data source. Similarly, the source of the data can be deleted or modified without altering the in-memory data object.
When the data you want to graph are in a SAS/IML matrix, you can create a data object directly from the matrix data. The Create method instantiates a data object from a SAS/IML matrix. For example, the following statements demonstrate how to create a data object from a matrix that contains three random variables:
/* create a data object from data in a matrix */
x = j(100, 3); /* 100 observations, 3 variables */
call randgen(x, "Normal"); /* 1 */
varNames = 'x1':'x3'; /* 2 */
declare DataObject dobj;
dobj = DataObject.Create("Normal Data", varNames, x); /* 3 */
The program contains the following steps:
1. The matrix x, which contains 100 observations and three columns, is filled with pseudorandom numbers from the standard normal distribution.
2. The character vector varNames contains names to assign to the columns of x. The notation 'x1':'x3' uses the index operator to generate a vector of names with a common prefix.
3. A data object is created from x. The columns are named according to the values of varNames.
The previous program does not create any output or graphs. However, you can see the contents of a data object by creating a data table from the data object:
DataTable.Create(dobj);
The data table might appear behind the program window. If so, move your programming window to reveal the data table.
You might want to save the contents of a data object, especially if you have added variables that you intend to use in future analyses. The following table summarizes the methods in the DataObject class that are frequently used to save the contents of a data object to a SAS data set.
Table 8.2. Creating a Data Set from a Data Object
Method | Destination |
---|---|
WriteToFile | SAS data set on local PC or networked drive |
WriteToServerDataSet | SAS data set in libref |
WriteVarsToServerDataSet | SAS data set in libref |
The WriteVarsToServerDataSet method is useful for writing a subset of variables to a SAS data set. This is especially useful as a prelude to calling a SAS procedure. As explained in Section 8.2, a data object can be created from many sources of data, so if you want to call a SAS procedure, you need to make sure that the relevant variables are in a SAS data set on the SAS server. You could write the entire data object to a libref such as Work, but it is more efficient to write only the variables that are actually needed for the analysis.
For example, suppose you want to write a module named Skewness that calls the MEANS procedure to compute the skewness statistic for each specified variable in a data object. The following statements implement and call the Skewness module:
/* define a module that computes the skewness of variables in a
* data object
*/
start Skewness(DataObject dobj, VarNames);         /* 1 */
   dobj.WriteVarsToServerDataSet(VarNames,
                  "Work", "Temp", true);           /* 2 */
   submit VarNames;
      proc means data=Temp noprint;                /* 3 */
         var &VarNames;
         output out=Skew skewness= ;               /* 4 */
      run;
   endsubmit;
   use Skew;
   read all var VarNames into x;                   /* 5 */
   close Skew;
   call delete("Work", "Temp");                    /* 6 */
   call delete("Work", "Skew");
   return ( x );
finish;
/* begin the main program */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Vehicles");
vars = {"MPG_Hwy" "MPG_City" "Engine_Liters"};
s = Skewness(dobj, vars); /* call the module */
print s[colname=vars label="Skewness"];
The Skewness module consists of the following main steps:
1. The Skewness module is defined to take two arguments: a data object and a vector of variable names.
2. The WriteVarsToServerDataSet method writes a SAS data set named Work.Temp that contains the variables in the data object that are specified in the VarNames vector. The last argument to the method (true) specifies that observations that are excluded from analysis are not copied to the data set. (See the section "Attributes of Observations" on page 246 for further details.)
3. The MEANS procedure is called from within a SUBMIT block. The procedure reads the data from Work.Temp and analyzes the variables specified in the VarNames vector.
4. The OUTPUT statement outputs the skewness statistic for each variable. The program does not specify a value for the SKEWNESS= option; therefore, the output variables have the same names as the input variables.
5. The USE and READ statements read the output from the MEANS procedure into a row vector named x.
6. The DELETE subroutine deletes the temporary data sets created during the computation.
The results of calling the Skewness module are shown in Figure 8.3.
In this example, the Sasuser.Vehicles data are already in a SAS data set in a libref, so it was not necessary to use the WriteVarsToServerDataSet method. However, the Skewness module is written so that it works for any data in a data object, regardless of the data source. Furthermore, recall that a data object is independent of the data from which it was instantiated. You can use DataObject class methods to delete observations or to exclude them from the analysis (see Section 10.5.2). Therefore, it is a good programming practice to write variables in a data object to a SAS data set prior to calling a SAS procedure.
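To illustrate that the module does not depend on the data source, the following sketch calls the Skewness module on a data object that is created from a matrix of pseudorandom normal data rather than from a SAS data set. It assumes that the Skewness module defined earlier in this section has already been run; the names z, zNames, and dobjMat are hypothetical and are chosen only to avoid reusing names from the main program.

```sas
/* sketch: call the Skewness module on a data object created from a matrix */
z = j(100, 3);                          /* 100 observations, 3 variables */
call randgen(z, "Normal");              /* fill with standard normal values */
zNames = 'x1':'x3';                     /* names for the three columns */
declare DataObject dobjMat;             /* hypothetical object name */
dobjMat = DataObject.Create("Normal Data", zNames, z);
sMat = Skewness(dobjMat, zNames);       /* module writes Work.Temp internally */
print sMat[colname=zNames label="Skewness"];
```

Because the data never existed in a libref, the call to the WriteVarsToServerDataSet method inside the module is what makes the variables available to the MEANS procedure.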
You might want to use SAS/IML operations and functions to analyze data that are in a data object. You can get data from one or more variables in a data object by using the GetVarData method in the DataObject class.
The GetVarData method can be called in two ways: you can get all observations for a set of variables, or you can specify the observation numbers that you want to get. Getting all of the observations is shown in the following statements, which get all values for the Hybrid variable in the Vehicles data set:
/* extract values from a variable into a vector */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Vehicles");
dobj.GetVarData("Hybrid", h);
In the previous statements, the h vector contains the data in the Hybrid variable. In particular, the vector contains a one for each observation that is for a hybrid-electric vehicle, and a zero for other observations. You can use the LOC function to find conventional vehicles, and then extract only those observations for each of several variables:
/* extract certain observations into a matrix */
idx = loc(h=0); /* find conventional vehicles */
VarNames = {"MPG_Hwy" "MPG_City" "Engine_Liters"};
/* extract data for specified vars and specified obs into matrix x */
dobj.GetVarData(VarNames, idx, x);
corr = corr(x); /* correlation between vars */
print corr[r=VarNames c=VarNames];
The previous statements fill the x matrix with data from the specified variables and for the specified observations. Then a correlation matrix is computed by using the CORR function. The result is shown in Figure 8.4. Notice that the x matrix contains observations only for the traditional (not hybrid-electric) vehicles.
As mentioned previously, after a data object is instantiated you can delete or modify data in the data object without affecting the data source. For example, you can add or delete variables. Deleting a variable is accomplished with the DeleteVar method in the DataObject class. You can add new variables with the AddVar, AddVars, and AddAnalysisVar methods.
In practice, adding variables occurs more frequently than deleting variables. There are several reasons for needing to add new variables to a DataObject. This chapter discusses two reasons: transformations of variables, and adding predicted and residual values from a regression model.
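As a brief illustration of adding and deleting a variable, the following sketch adds a square-root transformation of the Budget variable to a data object and then removes it. The variable name SqrtBudget and the object name dobjTmp are hypothetical. The sketch also assumes that the DeleteVar method takes the name of the variable as its single argument; consult the online Help for the exact signature.

```sas
/* sketch: add a variable to a data object, then delete it */
declare DataObject dobjTmp;                 /* hypothetical object name */
dobjTmp = DataObject.CreateFromServerDataSet("Sasuser.Movies");
dobjTmp.GetVarData("Budget", b);            /* copy Budget into vector b */
sqrt_b = sqrt(b);                           /* Budget values are positive */
dobjTmp.AddVar("SqrtBudget", sqrt_b);       /* hypothetical variable name */
dobjTmp.DeleteVar("SqrtBudget");            /* ASSUMPTION: one-argument form */
```

Neither call changes the Sasuser.Movies data set, because the data object is an independent in-memory copy.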
As mentioned in Section 3.3.1, it is common to transform data during exploratory data analysis and statistical modeling. If a variable is heavily skewed, it is common to apply a logarithmic transformation in an attempt to work with a more symmetric distribution.
The program in this section creates a data object from the Movies data set. Methods in the DataObject class are used to copy the Budget variable into a SAS/IML vector. SAS/IML functions are used to transform the data, and then a DataObject method is used to create a new variable in the data object that contains the transformed data. The program is as follows:
/* transform data, add new variable, and create histogram */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");
dobj.GetVarData("Budget", b); /* 1 */
/* apply log transform */
log_b = log(b); /* 2 */
dobj.AddVar("LogBudget", "log(Budget)", log_b); /* 3 */
Histogram.Create(dobj,"LogBudget"); /* 4 */
As usual, the object of the DataObject class is called dobj. It is instantiated from the Movies data. The program contains the following steps:
1. The GetVarData method retrieves the data in the Budget variable and puts the data into a vector called b.
2. The LOG function applies a logarithmic transform to the data and stores the transformed data in the log_b vector.
3. The AddVar method adds the transformed data to the data object. The first argument is the name of the new variable, in this case LogBudget. The second argument is the label for the new variable. The third argument is the vector that contains the data for the new variable.
4. The data object is invisible, so it is reassuring to create a data table or graph to make sure that the program worked. In this case, the program creates a histogram of the new variable.
The result of the program is shown in Figure 8.5. The logarithmic transformation has created a new variable whose distribution is more symmetric than the distribution of the original data.
You can also call the AddVar method with only two arguments, as shown in the following statement:
dobj.AddVar("LogBudget", log_b); /* alternative signature */
In this case, the label for the new variable is omitted. Consequently, the name of the variable is also used as the label for the new variable.
Notice that the Budget variable contains only positive values, so it is safe to take the log of each observation. If you are not certain whether the data contain nonpositive values, you must take greater care in writing the SAS/IML statements that transform the data. A common convention is to assign a missing value to the logarithm of a nonpositive value, as shown in the following statements:
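A minimal sketch of this convention, assuming that b is the vector that was filled by the GetVarData method earlier in this section, might look like the following. The result vector is first filled with missing values, and the LOG function is applied only to the elements that the LOC function identifies as positive.

```sas
/* sketch: log transform that assigns missing values to nonpositive data */
log_b = j(nrow(b), ncol(b), .);   /* allocate result filled with missing values */
idx = loc(b > 0);                 /* indices of the positive values */
if ncol(idx) > 0 then             /* LOC returns an empty matrix if none found */
   log_b[idx] = log(b[idx]);      /* transform only the positive values */
```

The test on the number of columns of idx guards against the case in which no value is positive, because indexing with an empty matrix is an error. The resulting log_b vector can be passed to the AddVar method in the same way as before.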
In Section 4.3, the GLM procedure is used to model a response variable. You can modify the program in that section to model the Mpg_Hwy variable by a quadratic function of the size of a vehicle's engine, as represented by the Engine_Liters variable. The GLM procedure can write an output data set that contains variables such as the predicted and residual values for the data.
This section describes how you can add variables that contain predicted and residual values to a data object. As shown in the example in the previous section, you first read the variables into vectors, and then add the vectors to the data object by calling the AddVar method.
For example, the following program calls the GLM procedure to compute a quadratic regression model. The procedure creates an output data set that contains predicted and residual values for a linear model. The program begins by creating a data object from the Vehicles data, creating a scatter plot of two variables, and calling the GLM procedure:
/* call a SAS procedure to create predicted and residual values */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Vehicles");
dobj.SetRoleVar(ROLE_LABEL, "Model"); /* label for selected obs */
declare ScatterPlot p;
p = ScatterPlot.Create(dobj, "Engine_Liters", "Mpg_Hwy");
submit;
proc glm data=Sasuser.Vehicles;
model Mpg_Hwy = Engine_Liters | Engine_Liters;
output out=GLMOut P=Pred R=Resid;
quit;
endsubmit;
The Pred and Resid variables are in the GLMOut data set. After you read them into SAS/IML vectors, you can call the AddVar method twice, once for each variable, or you can call the AddVars method to add both variables at once, as shown in the following statements:
/* add predicted and residual values to a data object */
use GLMOut;
read all var {"Pred" "Resid"};
close GLMOut;
dobj.AddVars( {"Pred" "Resid"},
{"Predicted Values" "Residual Values"},
pred || resid );
ScatterPlot.Create(dobj, "Engine_Liters", "Resid");
The arguments to the AddVars method are similar to those for the AddVar method: the first argument names the new variables, the second (optional) argument specifies the variables' labels, and the third provides the data.
The program creates a scatter plot, as shown in Figure 8.6. The scatter plot does not reveal any pattern to the residuals that might indicate an incorrectly specified model. Several outliers in the residual plot are selected. The selected observations indicate that the Prius Hybrid and the Corvette get better gas mileage than would be expected from the model, whereas the RX-8 gets substantially less mileage than would be expected.
The analysis in the previous section calls a SAS procedure (GLM) and writes the results of the procedure to an output data set. These results are then read into vectors and added to an IMLPlus data object. This sequence of operations occurs so frequently that SAS/IML Studio is distributed with a module that reads variables from a SAS data set in a libref and adds those variables to a data object. The module is named CopyServerDataToDataObject; it is documented in the online Help chapter "IMLPlus Module Reference."
The module is implemented so that it not only adds the variables but also preserves the formats of variables. For example, the following module call is an alternative way to add the Pred and Resid variables from the GLMOut data set:
/* add predicted and residual values to a data object */
VarNamesInDataSet = {"Pred" "Resid"};
VarNamesInDataObject = {"Pred" "Resid"}; /* same for this example */
Labels = {"Predicted Values" "Residual Values"};
ok = CopyServerDataToDataObject("work", "GLMOut", dobj,
VarNamesInDataSet, VarNamesInDataObject, Labels,
1 /* replace variable if it already exists */ );
The module returns 1 if it succeeds and returns 0 if it fails. Notice that you can specify the names of the variables in the data object independently from the names of the variables in the SAS data set.
The last argument to the module specifies what to do if one of the variables you are adding has the same name as an existing variable. The various options are documented in the online Help.
If you specify an empty matrix for the penultimate argument, then the module uses the labels (if any) in the SAS data set as labels for the corresponding new variables in the data object. How can you specify an empty matrix? By using a matrix name that has not been assigned a value in the program. A useful convention is to reserve the matrix _NULL_ for the empty matrix, and never assign _NULL_ a value. With that convention, the simplest way to add variables from a SAS data set into a data object is shown in the following statements:
ok = CopyServerDataToDataObject("work", "GLMOut", dobj,
VarNamesInDataSet, VarNamesInDataObject, _NULL_, 1 );
If you need the data in a SAS/IML vector, you can retrieve the data from a data object by using the GetVarData method in the DataObject class. This is described in Section 8.7, "Creating a Matrix from a Data Object."
A data object serves the following purposes:
to read data into memory from various sources
to be a uniform programming interface by providing methods that set and get data and properties of observations and variables
to coordinate the display of data in dynamically linked graphs and data tables
These roles of the DataObject class are shown schematically in Figure 6.1.
You can use methods in the data object to manage the properties of observations and variables. For example, each observation is represented by a marker in scatter plots. The data object ensures that the properties of each observation marker (for example, color and shape) are the same in all graphical or tabular views of the data. The same is true for properties of variables: properties such as formats, labels, and roles are maintained by the data object. Chapter 10, "Marker Shapes, Colors, and Other Attributes of Data," describes how to manage these and other properties of observations and variables.
When you are exploring data, an important property of an observation is whether or not the observation is in a selected state. Each observation in a data object can be in a selected state or in an unselected state. The data object makes sure that observations that are selected in one graph are highlighted in all other graphs. For example, in Figure 8.1, the PG-rated movies are selected in the bar chart. When you click on a bar, the scatter plot updates and highlights the markers for those same movies. Similarly, when you select a group of observations in the scatter plot, the bar chart displays the distribution of ratings for the selected observations.