Contents
10.1 Overview of Data Attributes
10.2 Changing Marker Properties
10.2.1 Using Marker Shapes to Indicate Values of a Categorical Variable
10.2.2 Using Marker Colors to Indicate Values of a Continuous Variable
10.2.3 Coloring by Values of a Continuous Variable
10.3 Changing the Display Order of Categories
10.3.1 Setting the Display Order of a Categorical Variable
10.3.2 Using a Statistic to Set the Display Order of a Categorical Variable
10.4 Selecting Observations
10.5 Getting and Setting Attributes of Data
10.5.1 Properties of Variables
10.5.2 Attributes of Observations
The DataObject class provides methods that can be grouped into two categories: methods that provide access to the data and methods that describe how to represent the data in graphs or analyses. The first category includes methods such as GetVarData and AddVar. These methods retrieve data or add a variable to the data object; they are described in the previous chapter.
In contrast, the second category of methods affects attributes of the data. Examples of attributes include the color and shape of an observation marker. These properties affect the way that the data are displayed in graphs.
This chapter describes how to use methods in the DataObject class to manage attributes of observations and variables.
The shape and color of observation markers can be used to visually indicate the value of a third variable that does not appear on the graph. The shape of a marker often encodes the value of a categorical variable with a small number of categories, whereas color often encodes the value of a continuous variable. For example, if you have data about patients in a drug trial, you might want to use marker shape to represent male versus female, or perhaps to distinguish individuals in a test group from individuals in the control group. In the same study, you might use color to indicate a continuous variable such as the age or weight of a patient.
This section describes how to use methods in the DataObject class to set properties for markers based on values of other variables.
Suppose that you want to set the shape of markers to reflect the MPAA rating of movies in the Movies data set. For definiteness, suppose you want to create a scatter plot of the US_Gross variable versus the ReleaseDate variable for the data in the Movies data set. You decide to encode the marker shapes according to the following table:
Table 10.1. Movie Ratings, Marker Symbols, and IMLPlus Constants
MPAA Rating | Symbol | IMLPlus Constant |
---|---|---|
G | □ | MARKER_SQUARE |
NR | × | MARKER_X |
PG | Δ | MARKER_TRIANGLE |
PG-13 | + | MARKER_PLUS |
R | ∇ | MARKER_INVTRIANGLE |
The last column in the table is the IMLPlus constant that specifies a marker shape in the SetMarkerShape method of the DataObject class. The SetMarkerShape method enables you to specify the marker shape to use when plotting specific observations. The following program finds the observations that are associated with each MPAA rating category by using the technique presented in Section 3.3.5:
/* use marker shape to encode category */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");
dobj.SetRoleVar(ROLE_LABEL, "Title"); /* 1 */
dobj.GetVarData("MPAARating", Group); /* 2 */
u = unique(Group); /* 3 */
shapes = MARKER_SQUARE || MARKER_X ||
         MARKER_TRIANGLE || MARKER_PLUS ||
         MARKER_INVTRIANGLE; /* 4 */
do i = 1 to ncol(u); /* 5 */
   idx = loc(Group = u[i]);
   dobj.SetMarkerShape(idx, shapes[i]); /* 6 */
end;
declare ScatterPlot plot;
plot = ScatterPlot.Create(dobj, "ReleaseDate", "US_Gross");
plot.SetMarkerSize(5); /* 7 */
The program begins by creating a data object from the Movies data set. The program consists of the following main steps:
1. The SetRoleVar method is called. (This statement is optional.) When you click on an observation marker in a scatter plot, the value of the "role variable" is displayed in the graph. For this example, clicking on an observation displays the title of the movie.
2. The GetVarData method in the DataObject class retrieves data from the data object. The SAS/IML vector Group is created to hold the values of the MPAARating variable. Notice that, in this case, you could also have gotten these data from Sasuser.Movies by using the SAS/IML USE and READ statements.
3. The UNIQUE function returns a sorted vector of the unique values of the Group vector. These values are stored in the vector u.
4. A vector, shapes, is created. It contains the marker shapes that correspond to Table 10.1, listed in the same sorted order as the categories in u.
5. The program loops over the group categories in u. The purpose of the loop is to assign the shape in shapes[i] to all observations whose MPAA rating is equal to u[i].
6. After the LOC function finds the observations in each category, the SetMarkerShape method sets the marker shape for those observations. Notice that you do not need to check whether idx is empty, because each category in u occurs at least once in Group.
7. After the scatter plot is created, the SetMarkerSize method increases the size of markers displayed in the graph. If the marker size is too small, it can be difficult to distinguish between different marker shapes.
The scatter plot appears in Figure 10.1. You can click on an observation in a scatter plot to select that observation. In Figure 10.1 several of the largest-grossing movies are selected, which causes the titles of the movies to be displayed. Notice that most of the top-grossing movies are rated PG-13 (+), followed by PG-rated movies (Δ). There are few R-rated movies (∇) that grossed more than $100 million, which is surprising considering the total number of R-rated movies in the data.
It is interesting to note that the largest grossing titles are all sequels that are part of successful movie franchises: Pirates of the Caribbean, Star Wars, and Spider-Man. The reader is invited to explore other top-grossing movies in the data and determine the titles of the movies. How many are sequels?
The program in this section used prior knowledge of the data in order to construct the shapes vector: the shapes vector has five shapes, one for each category of the MPAARating variable. However, in general, you might not know how many categories a variable contains. For example, none of the movies included in this data set are rated NC-17, although that is a valid (but rare) MPAA rating.
One solution to this problem is to reuse shapes if there are more categories than shapes. The following statements use the SAS/IML MOD function to ensure that the index into the shapes vector is always a number between 1 and the number of shapes, regardless of the value of the looping variable i:
/* if more categories than shapes, reuse shapes */
j = 1 + mod(i-1, ncol(shapes));
shape = shapes[j];
dobj.SetMarkerShape(idx, shape);
Recall that the expression mod(s, t) gives the remainder after dividing s by t. Consequently, the expression 1 + mod(i − 1, n) is a number between 1 and n for all positive integers i.
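As a quick check of the index formula, the following sketch evaluates the expression for twelve values of the looping variable when five shapes are available. The values of n and i are illustrative and are not taken from the Movies example.

```sas
/* sketch: reuse n shapes when there are more categories than shapes */
n = 5;                 /* number of available shapes (illustrative) */
i = 1:12;              /* hypothetical category indices */
j = 1 + mod(i-1, n);   /* 1 2 3 4 5 1 2 3 4 5 1 2 */
print i, j;
```

The index j never exceeds n, so an expression such as shapes[j] is always a valid subscript.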
You can use this programming technique to write a module that sets the marker shapes of observations according to the categories of some variable, as shown in the following statements:
Recall from Section 6.10 that you can specify objects as arguments to IMLPlus modules. The first argument of the SetMarkerShapeByGroup module is an object of the DataObject class. To pass objects to a module, you must specify the name of the class in addition to the name of the argument when you list the arguments in the START statement. Any arguments that are not preceded by a class name are assumed to be SAS/IML matrices.
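The listing for the module does not appear in this text, so the following is only a sketch assembled from the loop and the MOD statements shown earlier. The module name and the DataObject argument come from the description; the argument name VarName and the RUN statement that calls the module are assumptions.

```sas
/* sketch of a module that sets marker shapes by category;
   the argument name VarName is an assumption */
start SetMarkerShapeByGroup(DataObject dobj, VarName);
   shapes = MARKER_SQUARE || MARKER_X ||
            MARKER_TRIANGLE || MARKER_PLUS ||
            MARKER_INVTRIANGLE;
   dobj.GetVarData(VarName, Group);
   u = unique(Group);                   /* sorted categories */
   do i = 1 to ncol(u);
      idx = loc(Group = u[i]);          /* observations in ith category */
      j = 1 + mod(i-1, ncol(shapes));   /* reuse shapes if needed */
      dobj.SetMarkerShape(idx, shapes[j]);
   end;
finish;

run SetMarkerShapeByGroup(dobj, "MPAARating");
```

Because VarName is not preceded by a class name, it is treated as a SAS/IML matrix, which here holds the name of the grouping variable.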
The graph in Figure 10.1 displays a great deal of information. It displays two continuous variables and one categorical variable. However, the markers in the graph are all black. The SetMarkerColor method in the DataObject class enables you to specify a color for given observations.
The SetMarkerColor method can be used directly to color individual markers. It is also used indirectly when you call any of several modules provided with SAS/IML Studio. These modules enable you to manipulate colors and easily color observations according to the value of a variable. The subsequent subsections discuss colors and how to perform common tasks with colors in SAS/IML Studio. This section begins with a description of how to specify colors in IMLPlus.
This section describes how colors are represented in IMLPlus. This is an advanced topic that can be omitted during an initial reading. The key point of this section is that there are two ways of representing colors: as ordered triples or as integers. There are also two modules distributed with SAS/IML Studio that enable you to convert between these representations: the RGBToInt and IntToRGB modules.
There are many ways to specify colors. IMLPlus represents colors by using the RGB coordinate system, which represents a color as an additive mixture of three primary colors: red, green, and blue.
One way to represent a color in RGB coordinates is as an ordered triple: the first coordinate represents the amount of red in the color, the second coordinate represents the amount of green, and the third coordinate represents the amount of blue. A popular representation of colors assigns eight bits (one byte) to each of the three colors. This means that each coordinate in the RGB system is an integer between zero and 255. In these coordinates, a value of 0 means the absence of a color and a value of 255 means the color is fully present. Thus (0, 0, 0) represents black and (255, 255, 255) represents white. Red is (255, 0, 0), whereas blue is (0, 0, 255). A mixture of primary colors such as (65, 105, 225) is a bluish shade that some might call "royal blue."
A second representation of colors packs the three RGB coordinates into a single integer. This is a very compact representation, although not very intuitive! The packing is accomplished by writing the integer in hexadecimal notation: use the lower two place-values to store the amount of blue in the color, use the third and fourth place-values to store the amount of green, and use the fifth and sixth place-values to store the red component.
In this compact representation, a value of 00 in the appropriate place-values means the absence of a color, whereas a value of FF means the color is fully present. In SAS you specify a hexadecimal value by prefixing the number with '0' and appending an 'x' to the end of the number. Thus 0000000x represents black, whereas 0FFFFFFx represents white. Red is 0FF0000x, whereas blue is 00000FFx. The royal blue with RGB values (65, 105, 225) is compactly represented as 04169E1x.
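The packing can also be written as ordinary arithmetic. In the following worked example, the symbols r, g, and b are introduced here for illustration and denote the red, green, and blue components, each an integer between 0 and 255:

```latex
\[
\text{color} = 65536\,r + 256\,g + b,
\qquad
65536 \cdot 65 + 256 \cdot 105 + 225
  = 4259840 + 26880 + 225
  = 4286945
  = \mathtt{4169E1}_{16}.
\]
```

Each pair of hexadecimal digits holds one component: 41 is 65 (red), 69 is 105 (green), and E1 is 225 (blue).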
SAS/IML Studio comes with two modules that convert colors between ordered triples and integers. The RGBToInt module converts ordered triples to integers, whereas the IntToRGB module converts in the other direction. These modules are demonstrated by the following statements:
/* convert colors between different representations */
RoyalBlueRGB = {65 105 225}; /* input as ordered triple */
IntColor = RGBToInt(RoyalBlueRGB);
print IntColor[format=hex6.]; /* print as hexadecimal */
RoyalBlueInt = 04169E1x; /* input as hexadecimal */
TripletColor = IntToRGB(RoyalBlueInt);
print TripletColor; /* print as ordered triple */
The output is shown in Figure 10.2. Note that the base-10 representation of the IntColor integer (which is 4286945) is never displayed. The hexadecimal representation of the integer is printed by using the FORMAT=HEX6. option. Similarly, the integer stored in RoyalBlueInt is specified by its hexadecimal value by using the '0' prefix and the 'x' suffix.
IMLPlus provides a number of predefined colors such as BLACK, RED, and GREEN. These predefined colors are stored as integers. The following program prints the hexadecimal values and ordered triples for a small subset of the predefined colors:
/* names and RGB values of predefined colors */
colName = {"BLACK","RED","GREEN","BLUE","ORANGE","PINK","YELLOW","WHITE"};
color = BLACK// RED// GREEN// BLUE// ORANGE// PINK// YELLOW// WHITE;
RGB = IntToRGB(color);
print colName color[format=hex6.] RGB[colname={"Red" "Green" "Blue"}];
In Section 8.8.2, you learned how to add residual values to a data object. It is often convenient to change the color or marker of observations whose residual values are far from zero. Observations that are not well predicted by the model are called outliers.
The following statements assume that the residual values are in the data object in a variable called Resid. The statements set the color of any observation that has a large residual:
/* set color of observations with large residual values */
dobj.GetVarData("Resid", resid);
idx = loc(resid<=-10 | resid>=10);
if ncol(idx)>0 then
   dobj.SetMarkerColor(idx, RED);
You can see the colored observations on a scatter plot (such as Figure 8.6) that includes the Resid variable.
The definition of a "large" residual is data dependent. In this example, an observation is colored red if the actual Mpg_City value differs from the predicted value by ten or more miles per gallon.
Although this example used an arbitrary value (10) to determine which residuals are considered "large," you can also detect and color outliers by using less arbitrary criteria. For example, some authors recommend computing the externally studentized residuals and examining an observation when the absolute value of the studentized residual exceeds 2. (The RSTUDENT= option in the GLM OUTPUT statement computes externally studentized residuals.) Regression diagnostics are discussed in Chapter 12, "Regression Diagnostics."
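A minimal sketch of the studentized-residual criterion follows. It assumes that the externally studentized residuals have already been added to the data object in a variable named RStudent; that variable name is hypothetical and does not appear in the text.

```sas
/* sketch: color observations whose studentized residual exceeds 2
   in absolute value; "RStudent" is a hypothetical variable name */
dobj.GetVarData("RStudent", rstud);
idx = loc(abs(rstud) > 2);
if ncol(idx)>0 then
   dobj.SetMarkerColor(idx, RED);
```

The only change from the previous statements is the criterion inside the LOC function, which no longer depends on the units of the response variable.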
It is convenient to use marker shapes to indicate levels of a categorical variable. However, colors are better for encoding the values of continuous variables and discrete ordinal variables because you can associate one color (say, blue) with low values and another (say, red) with high values. Often one or more additional colors are used to represent intermediate values.
The set of colors that represents values of a variable is called a color ramp (or sometimes a color map). Often a linear mapping is used to associate values of a variable to colors: the minimum value of the variable is mapped to the first color, the maximum value is mapped to the last color, and some scheme is used to associate colors with intermediate values. Nonlinear schemes are sometimes used when the distribution of the values is highly skewed.
The next two sections describe ways to associate colors with values. The simplest is for each color to represent a large range of values. (For example, you can divide the data into quartiles.) In this scheme, there is no need to interpolate colors: each observation is placed into one bin, and each bin is assigned a color. The more complicated way to associate colors with values is to specify a small number of colors but to interpolate colors so that only a small range of values is associated with a color.
One Color for a Large Range of Values
Suppose you want to color observations in the Movies data set according to values of the US_Gross variable. You decide to use four colors to encode the US_Gross values: blue for the movies with low values of US_Gross, cyan and orange for movies with intermediate values, and red for movies that generated large box-office revenues. To be specific, suppose you want to associate revenue and colors by using the following rules:
Table 10.2. Encoding Colors for Movie Revenue Ranges
US Gross | Color | Comment |
---|---|---|
(0,25] | BLUE | A flop |
(25,100] | CYAN | A typical movie |
(100,200] | ORANGE | A hit |
> 200 | RED | A blockbuster hit |
The values that divide one category from another are called cutoff values or sometimes break points. For this example, the cutoff values are 25, 100, and 200. The cutoff values, together with the minimum and maximum value of US_Gross, form endpoints of intervals. All values within an interval are assigned the same color.
A naive implementation would loop over all observations, and use the SetMarkerColor method to assign a color for that observation based on the cutoff values for US_Gross. This is inefficient. As pointed out in the section "Writing Efficient SAS/IML Programs" on page 79, you should avoid loops over all observations. It is almost always better to loop over categories and to use the LOC function to identify the observations that belong to each category. The following program loops over the four color categories:
/* color each observation based on value of a continuous variable */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");
dobj.GetVarData("US_Gross", v); /* 1 */
EndPts = {0 25 100 200} || max(v); /* 2 */
Color = BLUE || CYAN || ORANGE || RED; /* 3 */
do i = 1 to ncol(Color); /* 4 */
   idx = loc(v>EndPts[i] & v<=EndPts[i+1]); /* 5 */
   if ncol(idx)>0 then
      dobj.SetMarkerColor(idx, Color[i]);
end;
The following list describes the main steps of the program:
1. The data in the US_Gross variable are copied into the vector v.
2. The cutoff values define four intervals. The first interval is (0, 25], the second is (25, 100], and so on. So that the program does not need to treat the last category ("blockbusters") differently from the previous categories, it is useful to append the largest data value, max(v), as the right endpoint of the last interval.
3. The corresponding colors are defined as in Table 10.2.
4. The loop is over the categories defined by the four intervals.
5. This is the key step of the program. The LOC function finds all observations in the ith interval. The ith interval consists of values greater than EndPts[i] and less than or equal to EndPts[i+1]. The color of these observations is set to Color[i] by calling the SetMarkerColor method.
You should carefully study the behavior of the statements inside the loop during the last iteration (when i=4 in the example). The LOC function finds elements of v that exceed EndPts[4] (200) and are less than or equal to EndPts[5], which corresponds to max(v). If the EndPts vector had only four elements, then the program would halt with an index-out-of-bounds error. By using max(v) as the fifth element of EndPts, the program avoids this error and also avoids having to handle the case v > 200 differently from the other cases.
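To see the interval logic without a data object, the following sketch applies the same loop to a small vector of made-up revenues and records the interval number instead of setting a color. The data values and the name bin are illustrative.

```sas
/* sketch: assign each value to one of four intervals */
v = {10, 25, 60, 150, 200, 350};     /* made-up revenues */
EndPts = {0 25 100 200} || max(v);   /* 0 25 100 200 350 */
bin = j(nrow(v), 1, .);              /* missing until assigned */
do i = 1 to ncol(EndPts)-1;
   idx = loc(v>EndPts[i] & v<=EndPts[i+1]);
   if ncol(idx)>0 then
      bin[idx] = i;
end;
print v bin;                         /* bin = 1 1 2 3 3 4 */
```

Values on a cutoff (25 and 200) fall into the lower interval because each interval is closed on the right. A value that is missing or equal to 0 satisfies none of the conditions and would keep a missing bin number.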
It is left as an exercise for the reader to view the colored observations by creating a scatter plot of any two variables in the Sasuser.Movies data set.
Color-Blending: Few Values for each Color
The previous section assigned a single color to a range of values for US_Gross. In mathematical terms, the function that maps values to colors is piecewise constant. It is also possible to construct a piecewise linear function that maps values to colors. Defining this mapping requires linear interpolation between colors. Specifically, given two colors, what are the colors between them?
Linear interpolation of real numbers is straightforward: given numbers $x$ and $y$, the numbers between them are parameterized by $(1-t)x + ty$ for $t \in [0,1]$. Linear interpolation of colors works similarly. If two colors, $x = (x_r, x_g, x_b)$ and $y = (y_r, y_g, y_b)$, are represented as triples of integers, then for each $t \in [0,1]$, the triple $c = (1-t)x + ty$ is between $x$ and $y$. In general, $c$ is not a triple of integers, so in practice you must specify a color close to $c$, typically by using the INT, FLOOR, or ROUND functions in Base SAS software.
Color interpolation makes it easy to color observation markers by values of a continuous variable. Let $v$ be the vector that contains the values. Assume first that you have a color ramp with two colors, $a$ and $b$ (represented as triples). The idea is to assign the color $a$ to any observations with the value $\min(v)$ and assign the color $b$ to any observations with the value $\max(v)$. For intermediate values, interpolate colors between $a$ and $b$. How can you do this? First, normalize the values of $v$ by applying a linear transformation: for each element of $v$, let $t_i = (v_i - \min(v)) / (\max(v) - \min(v))$. Notice that the values of $t$ are in the interval $[0,1]$. Consequently, you can assign the observation with value $v_i$ the color closest to the triple $(1 - t_i)a + t_i b$.
The following program implements these ideas for a two-color ramp that varies between light brown and dark brown. The program uses linear interpolation to color-code observations according to values of a continuous variable.
/* linearly interpolate color of each observation */
dobj.GetVarData("US_Gross", v);
a = IntToRGB(CREAM); /* 1 */
b = IntToRGB(BROWN);
t = (v-min(v)) / (max(v)-min(v)); /* 2 */
colors = (1-t)*a + t*b; /* 3 */
dobj.SetMarkerColor(1:nrow(v), colors); /* 4 */
The previous statements were not written to handle missing values in v, but can be easily adapted. This is left to the reader. It is also trivial to change the program to use a color ramp that is formed by any other pair of colors. The steps of the program are as follows:
1. Represent the two colors as RGB triples, a and b.
2. Normalize the values of v. The new vector t has values in the interval [0,1].
3. Linearly interpolate between the colors. The colors vector does not contain integer values, but you can apply the INT, FLOOR, or ROUND functions to the colors vector. If you do not explicitly truncate the colors, the SetMarkerColor method truncates the values (which is equivalent to applying the INT function).
4. Call the SetMarkerColor method to set the colors of the observations.
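One possible adaptation for missing values is sketched here; it is not the only approach. It assumes that a, b, and dobj are defined as in the preceding program and that observations with a missing value should simply keep their current color.

```sas
/* sketch: interpolate colors only for nonmissing values */
dobj.GetVarData("US_Gross", v);
idx = loc(v ^= .);                    /* nonmissing observations */
if ncol(idx)>0 then do;
   w = v[idx];                        /* column vector of valid values */
   t = (w-min(w)) / (max(w)-min(w));  /* assumes max(w) > min(w) */
   colors = round((1-t)*a + t*b);     /* one RGB triple per row */
   dobj.SetMarkerColor(idx, colors);
end;
```

The sketch rounds the interpolated triples explicitly rather than relying on the truncation described in Step 3.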
It is not initially apparent that Step 3 involves matrix computations, but it does. The vector t is an n × 1 vector, and the vectors a and b are 1 × 3 row vectors. Therefore, the matrix product t*b is an n × 3 matrix, and similarly for the product (1-t)*a.
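The dimensions are easy to verify on a small example. The following sketch interpolates between pure red and pure blue for three values of t; the colors and the values of t are chosen only for illustration.

```sas
/* sketch: interpolate between two RGB triples for three values of t */
a = {255 0 0};                   /* 1 x 3 row vector (red)  */
b = {0 0 255};                   /* 1 x 3 row vector (blue) */
t = {0, 0.25, 1};                /* 3 x 1 column vector     */
colors = round((1-t)*a + t*b);   /* 3 x 3 matrix, one color per row */
print colors[colname={"Red" "Green" "Blue"}];
/* rows: (255,0,0), (191,0,64), (0,0,255) */
```

The first and last rows reproduce the endpoint colors, and the middle row is one quarter of the way from red to blue.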
The scatter plot shown in Figure 10.4 displays US_Gross on the vertical axis. Even though the image in this book is monochromatic, the gradient of colors from a light color to a dark color is apparent. In practice, it is common to use color to depict a variable that is not present in the graph.
The ideas of this section can be extended to color ramps with more than two colors. Suppose you have a vector $v$ and want to color observations by using a color ramp with $k$ colors $c_1, c_2, \ldots, c_k$. Divide the range of $v$ into $k-1$ equal intervals defined by the evenly spaced values $L_1, L_2, \ldots, L_k$, where $L_1 = \min(v)$ and $L_k = \max(v)$. To the $i$th interval $[L_i, L_{i+1}]$, associate the two-color ramp with colors $c_i$ and $c_{i+1}$. For an observation whose value lies in the interval $[L_i, L_{i+1}]$, color it according to the algorithm for two-color color ramps. In this way, you reduce the problem to one already solved: to handle a color ramp with k colors, simply handle k − 1 color ramps with two colors.
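The reduction to two-color ramps can be sketched in a few statements. The data values and the three-color ramp (blue, white, red) are made up for illustration, and the sketch assumes that the values are nonmissing and not all equal.

```sas
/* sketch: piecewise-linear interpolation for a ramp of k colors */
v = {0, 5, 10, 15, 20};            /* made-up data values */
ramp = {  0   0 255,               /* c1: blue  */
        255 255 255,               /* c2: white */
        255   0   0};              /* c3: red   */
k = nrow(ramp);
L = min(v) + (0:(k-1)) * (max(v)-min(v)) / (k-1);  /* L1,...,Lk */
L[k] = max(v);                     /* guard against rounding error */
colors = j(nrow(v), 3, .);
do i = 1 to k-1;
   idx = loc(v>=L[i] & v<=L[i+1]);             /* ith interval */
   if ncol(idx)>0 then do;
      t = (v[idx] - L[i]) / (L[i+1] - L[i]);   /* values in [0,1] */
      colors[idx,] = round((1-t)*ramp[i,] + t*ramp[i+1,]);
   end;
end;
print v colors;
```

A value that equals an interior break point belongs to two adjacent intervals, but both ramps assign it the same color, so the order of assignment does not matter.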
SAS/IML Studio distributes several modules that can help you interpolate colors and color-code observations. The modules are briefly described in Table 10.3.
Table 10.3. IMLPlus Modules for Manipulating Colors
An ordinal variable is a categorical variable for which the various categories can be ordered. For example, a variable that contains the days of the week has a natural ordering: Sunday, Monday,..., Saturday.
By default, all graphs display categorical variables in alphanumeric order. This order might not be the most appropriate to display. For example, if you create a bar chart for days of the week or months of the year, the bar chart should present categories in a chronological order rather than in alphabetical order. Similarly, for a variable with values "Low," "Medium," and "High," the values of the variable suggest an order that is not alphabetical.
Sometimes it is appropriate to order categories according to the values of some statistic applied to the categories. For example, you might want to order a bar chart according to the frequency counts, or you might want to order a box plot according to the mean (or median) value of the categories.
This section describes how to change the order in which a graph displays the levels of a categorical variable. For information about using the SAS/IML Studio GUI to set the order of a variable, see the section "Ordering Categories of a Nominal Variable" (Chapter 11, SAS/IML Studio User's Guide).
Suppose you want to create a bar chart showing how many movies in the Movies data set were released for each month, across all years. You can determine the month that each movie was released by applying the MONNAMEw. format to the ReleaseDate variable.
The following program creates a new character variable (Month) in the data object that contains the month in which each movie was released. The program also creates a bar chart (shown in Figure 10.5) of the Month variable.
/* create a bar chart of variable; bars displayed alphabetically */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");
dobj.GetVarData("ReleaseDate", r);
month = putn(r,"monname3."); /* Jan, Feb, ..., Dec */
dobj.AddVar("Month", month); /* add new character variable */
declare BarChart bar;
bar = BarChart.Create(dobj, "Month");
The categories in Figure 10.5 are plotted in alphabetical order. This makes it difficult to compare consecutive months or to observe seasonal trends in the data. The months have a natural chronological ordering. You can set the ordering of a variable by using the SetVarValueOrder method in the DataObject class. This is shown in the following statements:
/* change order of bars; display chronologically */
order = {"Jan" "Feb" "Mar" "Apr" "May" "Jun"
"Jul" "Aug" "Sep" "Oct" "Nov" "Dec"};
/* alternatively: order = putn(mdy(1:12,1,1960), "monname3."); */
dobj.SetVarValueOrder("Month", order);
All graphs that display the variable will update to reflect the new ordering, as shown in Figure 10.6.
Figure 10.6 shows the relative changes from month to month. Notice that the distributors release relatively few movies in January and February; perhaps the large number of December movies are still playing in the theaters? It is also curious that there are more releases in the months of March and September than in the summer months.
Sometimes you might want to emphasize differences between groups by ordering categories according to some statistic calculated for data in each category. For example, you might want to order a bar chart according to the frequency counts, or you might want to order a box plot according to the mean (or median) value of the categories.
The following program demonstrates this idea by creating a box plot of the US_Gross variable versus the month that the movie was released. The Month variable is ordered according to the mean of the US_Gross variable for movies that are released during that month (for any year).
The previous section describes how you can create a new character variable in the data object (named Month) that contains the name of the month in which the movie was released. The following program continues the example. The program uses the technique of Section 3.3.5 to compute the mean gross revenue for each month:
/* compute a statistic for each category */
GroupVar = "Month"; /* variable that contains categories */
dobj.GetVarData(GroupVar, group);
dobj.GetVarData("US_Gross", y);
u = unique(group); /* find the categories */
numGroups = ncol(u); /* number of categories */
stat = j(1, numGroups); /* allocate a vector for results */
do i = 1 to numGroups; /* for each group... */
   idx = loc(group=u[i]); /* find the observations in that group */
   m = y[idx]; /* extract the values */
   stat[i] = m[:]; /* compute statistic for group */
end;
print stat[colname=u];
The categories are stored in the u vector in alphanumeric order. The mean of the ith group is stored in the stat vector, as shown in Figure 10.7.
The means are known for each group, so the following statements order the months according to their means:
/* sort categories according to the statistic for each category */
r = rank(stat);
print r[colname=u];
sorted_u = u; /* copy data in u */
sorted_u[r] = u; /* permute categories into sorted order */
print sorted_u;
dobj.SetVarValueOrder(GroupVar, sorted_u);
declare BoxPlot box;
box = BoxPlot.Create(dobj, GroupVar, "US_Gross");
The program shows how the RANK function is used to sort the u vector according to the values of the stat vector. The output from the program is shown in Figure 10.8. The smallest mean is stored in stat[12], which corresponds to movies released in September. Consequently, r[12] contains the value 1, which indicates that September needs to be the first entry when the vector is sorted. The second entry should be October, and so on.
The actual sorting is accomplished by using the r vector to permute the entries of u. First, the vector sorted_u is created as a copy of u. This ensures that sorted_u is the same size and type (character or numeric) as u. Then the permutation occurs. This is shown schematically in Figure 10.9. The first entry in r is 4, so sorted_u[4] is assigned the contents of u[1], which is "Apr". The second entry in r is 6, so sorted_u[6] is assigned the contents of u[2], which is "Aug". This process continues for each entry. Finally, the last (twelfth) entry in r is 1, so sorted_u[1] is assigned the contents of u[12], which is "Sep". In this way, the entries of sorted_u are sorted according to the values of stat.
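The permutation is easier to follow on a short vector. The following sketch uses four made-up categories and statistics; the names match the program above, but the values are illustrative.

```sas
/* sketch: sort categories by a statistic by using RANK */
u    = {"A" "B" "C" "D"};   /* categories in alphanumeric order */
stat = {30 10 40 20};       /* made-up statistic for each category */
r = rank(stat);             /* 3 1 4 2 */
sorted_u = u;               /* same size and type as u */
sorted_u[r] = u;            /* B D A C */
print r, sorted_u;
```

Category A has the third-smallest statistic, so it is copied into sorted_u[3]; category B has the smallest, so it is copied into sorted_u[1], and so on.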
The last statements of the program reorder the categories of the Month variable by using the SetVarValueOrder method in the DataObject class. All graphs that include the Month variable display the months in the order given by the sorted_u vector, as shown in the box plot in Figure 10.10.
The program in this section is written so as to work in many situations. For example, if you assign "MPAARating" to GroupVar, then the program creates a box plot of the MPAA ratings sorted according to the mean US gross revenues for each ratings category. (If GroupVar contains the name of a numeric variable, the program still runs correctly provided that you comment out the PRINT statements, since the COLNAME= option expects a character variable.)
Selecting observations in IMLPlus graphics is a major reason to use SAS/IML Studio in data analysis. You can discover relationships among variables in your data by selecting observations in one graph and seeing those same observations highlighted in other graphs.
You can select observations interactively by clicking on a bar in a bar chart or by using a selection rectangle to select several observations in a scatter plot. However, sometimes it is useful to select observations by running a program. Selecting observations with a program can be useful if the criterion that specifies the selected observations is complicated. It is also useful for selecting observations during a presentation or as part of an analysis that you run frequently, or for highlighting observations that have missing values in a certain variable.
The following program selects observations in a data object created from the Movies data set for which the World_Gross variable has a missing value:
/* select observations that contain a missing value for a variable */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");
dobj.GetVarData("World_Gross", wGross);
/* find observations with missing values */
idx = loc(wGross=.);
if ncol(idx)>0 then
   dobj.SelectObs(idx);
The SelectObs method selects observations. The observation numbers are specified by the first argument to the method. The SelectObs method has an optional second argument that specifies whether to deselect all observations prior to selecting new observations. By default, new selections do not deselect any existing selections, so you can call the SelectObs method several times to build up the union of observations that satisfy any of several criteria. For example, the following statements (which continue the preceding program) find all movies for which the world gross revenues are more than ten times the US revenue. These movies (which did far better internationally than in the US) are added to the set of selected observations:
/* select observations that satisfy a formula */
dobj.GetVarData("US_Gross", usGross);
jdx = loc(wGross>10*usGross);
if ncol(jdx)>0 then
   dobj.SelectObs(jdx); /* add to previous selection */
The results of the previous statements are shown in Figure 10.11. The size of the selected observations is increased relative to the unselected observations so that they are more visible. You can increase the size difference between selected and unselected observations by pressing ALT+UP ARROW in the plot. As of SAS/IML Studio 3.3, there is no method that increases this size difference from a program.
If you want to replace existing selected observations, you can call the SelectObs method with the second argument equal to false. If you want to clear existing selected observations, you can call the DeselectAllObs method in the DataObject class. Similarly, the DeselectObs method allows you to deselect specified observations. You can use the DeselectObs method to specify the intersection of criteria. You can also use the XSECT function to form the intersection of criteria.
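As a sketch of these two approaches, the following statements continue the preceding program. They assume that the wGross and usGross vectors and the jdx index vector are still defined. The cutoff value of 5 for US_Gross is a hypothetical threshold chosen only for illustration:

```sas
/* sketch: replace a selection and form an intersection of criteria */
if ncol(jdx)>0 then do;
   /* replace (rather than add to) the current selection */
   dobj.SelectObs(jdx, false);
   /* approach 1: deselect observations that fail a second criterion */
   kdx = loc(usGross >= 5);            /* hypothetical cutoff */
   if ncol(kdx)>0 then
      dobj.DeselectObs(kdx);
end;

/* approach 2: compute the intersection first, then select it */
ldx = loc(usGross < 5);                /* same hypothetical cutoff */
if ncol(jdx)>0 & ncol(ldx)>0 then do;
   both = xsect(jdx, ldx);             /* obs that satisfy both criteria */
   if ncol(both)>0 then
      dobj.SelectObs(both, false);
end;
```

The second approach keeps the set logic in ordinary SAS/IML vectors, which can be easier to inspect with the PRINT statement before any selection is changed.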
The following table lists the frequently used DataObject methods that select or deselect observations:
Table 10.4. Methods in the DataObject Class That Select or Deselect Observations
Method | Description |
---|---|
DeselectAllObs | Deselects all observations |
DeselectObs | Deselects specified observations |
SelectObs | Selects specified observations |
SelectObsWhere | Selects observations that satisfy a criterion |
In addition to viewing selected observations in IMLPlus graphs, there is a second reason you might want to select observations in a data object: many DataObject methods operate on the selected observations unless a vector of observation numbers is explicitly specified. The following table lists DataObject methods that, by default, operate on selected observations:
Table 10.5. DataObject Class Methods That Can Operate on Selected Observations
Method | Description |
---|---|
GetSelectedObsNumbers | Gets the observation numbers for the selected observations |
GetVarSelectedData | Gets the data for the selected observations and for a list of specified variables |
IncludeInAnalysis | Specifies whether an observation should be included in statistical analyses |
IncludeInPlots | Specifies whether an observation should be included in IMLPlus graphs |
SetMarkerColor | Sets the color of the marker that represents an observation |
SetMarkerShape | Sets the shape of the marker that represents an observation |
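For example, the following sketch continues the Movies program. It retrieves the observation numbers of whatever is currently selected and passes them explicitly to other statements. It assumes that the wGross vector from the earlier statements is still defined; the name selGross is arbitrary:

```sas
/* sketch: use the current selection in a program */
dobj.GetSelectedObsNumbers(selIdx);    /* obs numbers of selected obs */
if nrow(selIdx)>0 then do;
   selGross = wGross[selIdx];          /* World_Gross for selected obs */
   dobj.SetMarkerColor(selIdx, RED);   /* color the selected markers */
end;
```

Passing selIdx explicitly makes the dependence on the current selection visible in the program text.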
Previous sections describe methods such as SetRoleVar, SetMarkerColor, and SelectObs, which all affect attributes of data. This section gives further examples of using DataObject methods to change attributes for observations and properties for variables.
SAS data sets contain some properties of variables: the length of a variable (more often used for character variables than for numeric) and the variable's label, format, and informat. Even the name of the variable is a property and can be changed by using the RENAME statement in a DATA step. The DataObject class has methods that get or set all of these properties.
In addition, the DataObject class maintains additional properties. For example, a variable in a data object can be nominal. All character variables are nominal; numerical variables are assumed to represent continuous quantities unless explicitly set to be nominal. A variable can also be selected or unselected. Selected variables are used by analyses chosen from the SAS/IML Studio GUI to automatically fill in fields in some dialog boxes. The analyses built into SAS/IML Studio are accessible from the Analysis menu; they are described in the SAS/IML Studio User's Guide.
The following program calls several DataObject methods to set variable properties. It then displays a data table.
/* call DataObject methods related to variable properties */
obsNum = t(1:5); /* 5 × 1 vector */
x = obsNum / 100;
declare DataObject dobj;
dobj = DataObject.Create("Properties", {"ObsNum","x"}, obsNum||x);
/* call some "Set" methods for variable properties */
/* standard SAS variable properties */
dobj.SetVarFormat("x", "Percent6.1"); /* set format */
dobj.SetVarLabel("x", "A few values"); /* set label */
/* IMLPlus properties */
dobj.SetRoleVar(ROLE_WEIGHT, "x"); /* set role (=WEIGHT) */
dobj.SelectVar("x"); /* select variable */
/* displays data table in front of any other SAS/IML Studio windows */
DataTable.Create(dobj).ActivateWindow();
The data table created by the program is shown in Figure 10.12. You can see from the data table that the x variable is selected (notice the highlighting), is assigned the "weight" role (notice the 'W' in the column header), and is assigned the PERCENT6.1 format (notice the values).
You can also retrieve variable properties. The following example continues the previous program statements:
/* call some "Get" methods for variable properties */
dobj.GetVarNames(varNames); /* get name of all vars */
dobj.GetSelectedVarNames(varName); /* get name of sel. vars */
/* standard SAS variable properties */
format = dobj.GetVarFormat(varName); /* get format */
label = dobj.GetVarLabel(varName); /* get label */
informat = dobj.GetVarInformat(varName); /* get informat */
if informat="" then /* return empty string */
informat = "None"; /* if no informat */
/* IMLPlus properties */
weightName = dobj.GetRoleVar(ROLE_WEIGHT); /* get name of WEIGHT var*/
isNominal = dobj.IsNominal(varName); /* is variable nominal? */
print varName format informat label; /* SAS properties */
print weightName isNominal; /* IMLPlus properties */
The output from the program is shown in Figure 10.13. The program is not very interesting; its purpose is to demonstrate how to set and get frequently used properties of variables. Notice that the DataObject class contains methods that get and set standard SAS properties (variable name, format, and informat) in addition to IMLPlus properties (role, selected state). For information on how to set and examine variable properties by using the SAS/IML Studio GUI, see Chapter 4, "Interacting with the Data Table" (SAS/IML Studio User's Guide).
Although SAS data sets do not contain attributes for each observation, the DataObject class does. The DataObject class has methods to set and retrieve attributes for observations. For example, you can set the marker color and marker shape for an observation, and specify whether the observation is included in plots and included in analyses. You can also select or deselect an observation.
The following program continues the previous program by calling DataObject methods to set attributes of observations in the data object:
/* call DataObject methods related to observation attributes */
/* call some "Set" methods for observation attributes */
dobj.DeselectAllVar(); /* remove selected var */
dobj.SetMarkerColor(1:2, RED); /* set marker color */
dobj.SetMarkerShape(2:3, MARKER_STAR); /* set marker shape */
dobj.IncludeInAnalysis(3:4, false); /* set analysis indicator */
dobj.IncludeInPlots(4:5, false); /* set plot indicator */
dobj.SelectObs({1,3,5}); /* select observations */
The data table created by the program is shown in Figure 10.14. Notice that the selected observations are highlighted, and that the row headers have icons that show the shape for each observation, in addition to whether the observation is included in plots or in analyses. The icons also show the color for each observation, although the colors are not apparent in the figure.
In the same way, you can call methods that retrieve attributes for observations. This is shown in the following statements:
/* call some "Get" methods for observation attributes */
dobj.GetMarkerFillColor(colorIdx); /* get all colors */
dobj.GetMarkerShape(shapeIdx); /* get all shapes */
dobj.GetObsNumbersInAnalysis(analIdx); /* get only obs numbers with */
dobj.GetObsNumbersInPlots(plotsIdx); /* a certain indicator */
dobj.GetSelectedObsNumbers(selIdx); /* get selected obs numbers */
/* compute some combinations */
selColor = colorIdx[selIdx]; /* color of selected obs */
shapePlots = shapeIdx[plotsIdx]; /* shapes in plots */
selAnal = xsect(analIdx, selIdx); /* selected obs in analyses */
/* printing convenience:
create a format that associates marker names with marker values */
submit;
proc format;
value shape
0='Square' 1='Plus' 2='Circle' 3='Diamond'
4='X' 5='Triangle' 6='InvTriangle' 7='Star';
run;
endsubmit;
print selColor[format=hex6.] shapePlots[format=shape.], selAnal;
The output from the program is shown in Figure 10.15.
There are several statements in the program that require further explanation. Notice that there are two kinds of "Get" methods for observation attributes. Some (such as GetMarkerShape) create a vector that always has n rows, where n is the number of observations in the data set. The elements of the vector are colors, shapes, or some other attribute. In contrast, other methods (such as GetSelectedObsNumbers) create a vector that contains k rows, where k is the number of observations that satisfy a criterion. The elements of the vector are observation numbers (the indices of observations) for which a given criterion is true. Consequently, you use colorIdx[selIdx] in order to compute the colors of selected observations, whereas you use the XSECT function to compute the indices that are both selected and included in analyses.
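The distinction can be explored without a data object. The following sketch uses hand-typed, hypothetical vectors that stand in for the results of the "Get" methods for a five-observation example. The shape values (0 and 7) are assumptions chosen to match the SHAPE. format in the program:

```sas
/* sketch: attribute vectors versus vectors of observation numbers */
shapeIdx = {0, 7, 7, 0, 0};   /* attribute vector: one row per observation */
plotsIdx = {1, 2, 3};         /* obs numbers: included in plots */
analIdx  = {1, 2, 5};         /* obs numbers: included in analyses */
selIdx   = {1, 3, 5};         /* obs numbers: selected */

/* subscript an attribute vector by a vector of observation numbers */
shapePlots = shapeIdx[plotsIdx];      /* {0, 7, 7} */

/* intersect two vectors of observation numbers */
selAnal = xsect(analIdx, selIdx);     /* {1 5} */
print shapePlots, selAnal;
```

Subscripting is appropriate when one operand holds attribute values and the other holds observation numbers. The XSECT function is appropriate when both operands hold observation numbers.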
Notice how the SUBMIT block is used to call the FORMAT procedure in order to print names for the values returned by the GetMarkerShape method. The FORMAT procedure creates a new format that associates the values of IMLPlus keywords with explanatory text strings. For example, the keyword MARKER_STAR has the numeric value 7, as you can determine by using the PRINT statement. The newly created format (named SHAPE.) is used to print the values of the shapePlots vector.
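As a small check, the following sketch prints the value of the keyword with and without the format. It assumes that the SHAPE. format from the preceding SUBMIT block has already been created in the same SAS session; the name starValue is arbitrary:

```sas
/* sketch: inspect the numeric value behind an IMLPlus keyword */
starValue = MARKER_STAR;            /* copy keyword value into a matrix */
print starValue;                    /* prints the numeric value */
print starValue[format=shape.];     /* prints the associated text */
```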