Chapter 10. Marker Shapes, Colors, and Other Attributes of Data

Contents

  • 10.1 Overview of Data Attributes

  • 10.2 Changing Marker Properties

    • 10.2.1 Using Marker Shapes to Indicate Values of a Categorical Variable

    • 10.2.2 Using Marker Colors to Indicate Values of a Continuous Variable

    • 10.2.3 Coloring by Values of a Continuous Variable

  • 10.3 Changing the Display Order of Categories

    • 10.3.1 Setting the Display Order of a Categorical Variable

    • 10.3.2 Using a Statistic to Set the Display Order of a Categorical Variable

  • 10.4 Selecting Observations

  • 10.5 Getting and Setting Attributes of Data

    • 10.5.1 Properties of Variables

    • 10.5.2 Attributes of Observations

10.1 Overview of Data Attributes

The DataObject class provides methods that can be grouped into two categories: methods that provide access to the data and methods that describe how to represent the data in graphs or analyses. The first category includes methods such as GetVarData and AddVar. These methods retrieve data or add a variable to the data object; they are described in the previous chapter.

In contrast, methods in the second category affect attributes of the data. Examples of attributes include the color and shape of an observation marker. These attributes affect the way that the data are displayed in graphs.

This chapter describes how to use methods in the DataObject class to manage attributes of observations and variables.

10.2 Changing Marker Properties

The shape and color of observation markers can be used to visually indicate the value of a third variable that does not appear on the graph. The shape of a marker often encodes the value of a categorical variable with a small number of categories, whereas color often encodes the value of a continuous variable. For example, if you have data about patients in a drug trial, you might want to use marker shape to represent male versus female, or perhaps to distinguish individuals in a test group from individuals in the control group. In the same study, you might use color to indicate a continuous variable such as the age or weight of a patient.

This section describes how to use methods in the DataObject class to set properties for markers based on values of other variables.

10.2.1 Using Marker Shapes to Indicate Values of a Categorical Variable

Suppose that you want to set the shape of markers to reflect the MPAA rating of movies in the Movies data set. For definiteness, suppose you want to create a scatter plot of the US_Gross variable versus the ReleaseDate variable for the data in the Movies data set. You decide to encode the marker shapes according to the following table:

Table 10.1. Movie Ratings, Marker Symbols, and IMLPlus Constants

MPAA Rating   Symbol   IMLPlus Constant
G             □        MARKER_SQUARE
NR            ×        MARKER_X
PG            Δ        MARKER_TRIANGLE
PG-13         +        MARKER_PLUS
R             ∇        MARKER_INVTRIANGLE

The last column in the table is the IMLPlus constant that specifies a marker shape in the SetMarkerShape method of the DataObject class. The SetMarkerShape method enables you to specify the marker shape to use when plotting specific observations. The following program finds the observations that are associated with each MPAA rating category by using the technique presented in Section 3.3.5:

/* use marker shape to encode category */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");
dobj.SetRoleVar(ROLE_LABEL, "Title");                   /* 1 */
dobj.GetVarData("MPAARating", Group);                   /* 2 */
u = unique(Group);                                      /* 3 */
shapes = MARKER_SQUARE || MARKER_X ||
         MARKER_TRIANGLE || MARKER_PLUS ||
         MARKER_INVTRIANGLE;                            /* 4 */
do i = 1 to ncol(u);                                    /* 5 */
   idx = loc(Group = u[i]);
   dobj.SetMarkerShape(idx, shapes[i]);                 /* 6 */
end;

declare ScatterPlot plot;
plot = ScatterPlot.Create(dobj, "ReleaseDate", "US_Gross");
plot.SetMarkerSize(5);                                  /* 7 */

The program begins by creating a data object from the Movies data set. The program consists of the following main steps:

  1. The SetRoleVar method is called. (This statement is optional.) When you click on an observation marker in a scatter plot, the value of the "role variable" is displayed in the graph. For this example, clicking on an observation displays the title of the movie.

  2. The GetVarData method in the DataObject class retrieves data from the data object. The SAS/IML vector Group is created to hold the values of the MPAARating variable. Notice that, in this case, you could also have gotten these data from Sasuser.Movies by using the SAS/IML USE and READ statements.

  3. The UNIQUE function returns a sorted vector of the unique values of the Group vector. These values are stored in the vector u.

  4. A vector, shapes, is created. It contains the marker shapes that correspond to Table 10.1.

  5. The program loops over the group categories in u. The purpose of the loop is to assign the shape in shapes[i] to all observations whose MPAA rating is equal to u[i].

  6. After the LOC function finds the observations in each category, the SetMarkerShape method sets the marker shape for those observations. Notice that you do not need to check whether idx is empty, because each category in u occurs at least once in Group.

  7. After the scatter plot is created, the SetMarkerSize method increases the size of markers displayed in the graph. If the marker size is too small, it can be difficult to distinguish between different marker shapes.

The scatter plot appears in Figure 10.1. You can click on an observation in a scatter plot to select that observation. In Figure 10.1 several of the largest-grossing movies are selected, which causes the titles of the movies to be displayed. Notice that most of the top-grossing movies are rated PG-13 (+), followed by PG-rated movies (Δ). There are few R-rated movies (∇) that grossed more than $100 million, which is surprising considering the total number of R-rated movies in the data.

Figure 10.1. Marker Shapes Correspond to MPAA Rating

It is interesting to note that the largest grossing titles are all sequels that are part of successful movie franchises: Pirates of the Caribbean, Star Wars, and Spider-Man. The reader is invited to explore other top-grossing movies in the data and determine the titles of the movies. How many are sequels?

The program in this section used prior knowledge of the data in order to construct the shapes vector: the shapes vector has five shapes, one for each category of the MPAARating variable. However, in general, you might not know how many categories a variable contains. For example, none of the movies included in this data set are rated NC-17, but that is a valid (but rare) MPAA rating.

One solution to this problem is to reuse shapes if there are more categories than shapes. The following statements, which belong inside the loop over categories, use the SAS/IML MOD function to ensure that the index into the shapes vector is always a number between 1 and the number of shapes, regardless of the value of the looping variable i:

/* if more categories than shapes, reuse shapes */
j = 1 + mod(i-1, ncol(shapes));
shape = shapes[j];
dobj.SetMarkerShape(idx, shape);

Recall that the expression mod(s, t) gives the remainder after dividing s by t. Consequently, the expression 1 + mod(i − 1, n) is a number between 1 and n for all positive integers i.
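The wrap-around behavior of this expression can be checked with a few SAS/IML statements. This is a minimal sketch; the value n = 3 is an arbitrary illustration rather than anything taken from the Movies example.

```sas
/* wrap a looping index into the range 1..n */
n = 3;                      /* hypothetical number of available shapes */
i = 1:7;                    /* looping indices 1, 2, ..., 7            */
j = 1 + mod(i-1, n);        /* j = {1 2 3 1 2 3 1}                     */
print i, j;
```

The fourth index wraps back to 1, so a fourth category would reuse the first shape.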

You can use this programming technique to write a module, SetMarkerShapeByGroup, that sets the marker shapes of observations according to the categories of some variable.

Recall from Section 6.10 that you can specify objects as arguments to IMLPlus modules. The first argument of the SetMarkerShapeByGroup module is an object of the DataObject class. To pass objects to a module, you must specify the name of the class in addition to the name of the argument when you list the arguments in the START statement. Any arguments that are not preceded by a class name are assumed to be SAS/IML matrices.
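A module of this kind might look like the following sketch. It only combines the loop of Section 10.2.1 with the MOD technique and should not be read as the definitive listing: the module name and the DataObject first argument come from the text, whereas the argument names VarName and shapes are assumptions made for illustration.

```sas
/* sketch: set marker shapes according to the categories of a variable */
/* VarName and shapes are hypothetical argument names                  */
start SetMarkerShapeByGroup(DataObject dobj, VarName, shapes);
   dobj.GetVarData(VarName, Group);      /* values of the grouping variable  */
   u = unique(Group);                    /* sorted unique categories         */
   do i = 1 to ncol(u);
      idx = loc(Group = u[i]);           /* observations in the ith category */
      j = 1 + mod(i-1, ncol(shapes));    /* reuse shapes if there are too few */
      dobj.SetMarkerShape(idx, shapes[j]);
   end;
finish;

/* example call; assumes dobj was created from Sasuser.Movies as before */
shapes = MARKER_SQUARE || MARKER_X || MARKER_TRIANGLE ||
         MARKER_PLUS || MARKER_INVTRIANGLE;
run SetMarkerShapeByGroup(dobj, "MPAARating", shapes);
```

Because the index into shapes is wrapped, the same call works unchanged for a variable with more than five categories.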

10.2.2 Using Marker Colors to Indicate Values of a Continuous Variable

The graph in Figure 10.1 displays a great deal of information. It displays two continuous variables and one categorical variable. However, the markers in the graph are all black. The SetMarkerColor method in the DataObject class enables you to specify a color for given observations.

The SetMarkerColor method can be used directly to color individual markers. It is also used indirectly when you call any of several modules provided with SAS/IML Studio. These modules enable you to manipulate colors and easily color observations according to the value of a variable. The subsequent subsections discuss colors and how to perform common tasks with colors in SAS/IML Studio. This section begins with a description of how to specify colors in IMLPlus.

10.2.2.1 Color Representation in IMLPlus

This section describes how colors are represented in IMLPlus. This is an advanced topic that can be omitted during an initial reading. The key point of this section is that there are two ways of representing colors: as ordered triples or as integers. There are also two modules distributed with SAS/IML Studio that enable you to convert between these representations: the RGBToInt and IntToRGB modules.

There are many ways to specify colors. IMLPlus represents colors by using the RGB coordinate system, which represents a color as an additive mixture of three primary colors: red, green, and blue.

One way to represent a color in RGB coordinates is as an ordered triple: the first coordinate represents the amount of red in the color, the second coordinate represents the amount of green, and the third coordinate represents the amount of blue. A popular representation of colors assigns eight bits (one byte) to each of the three colors. This means that each coordinate in the RGB system is an integer between 0 and 255. In these coordinates, a value of 0 means the absence of a color and a value of 255 means the color is fully present. Thus (0, 0, 0) represents black and (255, 255, 255) represents white. Red is (255, 0, 0), whereas blue is (0, 0, 255). A mixture of primary colors such as (65, 105, 225) is a bluish shade that some might call "royal blue."

A second representation of colors packs the three RGB coordinates into a single integer. This is a very compact representation, although not very intuitive! The packing is accomplished by writing the integer in hexadecimal notation: use the lower two place-values to store the amount of blue in the color, use the third and fourth place-values to store the amount of green, and use the fifth and sixth place-values to store the red component.

In this compact representation, a value of 00 in the appropriate place-values means the absence of a color, whereas a value of FF means the color is fully present. In SAS you specify a hexadecimal value by prefixing the number with '0' and appending an 'x' to the end of the number. Thus 0000000x represents black, whereas 0FFFFFFx represents white. Red is 0FF0000x, whereas blue is 00000FFx. The royal blue with RGB values (65, 105, 225) is compactly represented as 04169E1x.
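The packing can also be carried out by hand, which may make the place-value description more concrete. The following sketch uses only arithmetic and the PRINT statement; the variable names are chosen for illustration. Each pair of hexadecimal digits is one byte, so the green component is multiplied by 256 and the red component by 256² = 65536.

```sas
/* pack an RGB triple into a single integer by hand */
r = 65;  g = 105;  b = 225;          /* royal blue as an ordered triple  */
IntColor = r*65536 + g*256 + b;      /* red, green, blue place-values    */
print IntColor;                      /* 4286945 in base 10               */
print IntColor[format=hex6.];        /* 4169E1 in hexadecimal            */
```

The hexadecimal digits 41, 69, and E1 are the values 65, 105, and 225, respectively.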

SAS/IML Studio comes with two modules that convert colors between ordered triples and integers. The RGBToInt module converts ordered triples to integers, whereas the IntToRGB module converts in the other direction. These modules are demonstrated by the following statements:

/* convert colors between different representations */
RoyalBlueRGB = {65 105 225};             /* input as ordered triple */
IntColor = RGBToInt(RoyalBlueRGB);
print IntColor[format=hex6.];            /* print as hexadecimal    */

RoyalBlueInt = 04169E1x;                 /* input as hexadecimal    */
TripletColor = IntToRGB(RoyalBlueInt);
print TripletColor;                      /* print as ordered triple */

The output is shown in Figure 10.2. Note that the base-10 representation of the IntColor integer (which is 4286945) is never displayed. The hexadecimal representation of the integer is printed by using the FORMAT=HEX6. option. Similarly, the integer stored in RoyalBlueInt is specified by its hexadecimal value by using the '0' prefix and the 'x' suffix.

Figure 10.2. Color Represented as Ordered Triple and Integer

IMLPlus provides a number of predefined colors such as BLACK, RED, and GREEN. These predefined colors are stored as integers. The following program prints the hexadecimal values and ordered triples for a small subset of the predefined colors:

/* names and RGB values of predefined colors */
colName = {"BLACK","RED","GREEN","BLUE","ORANGE","PINK","YELLOW","WHITE"};
color = BLACK// RED// GREEN// BLUE// ORANGE// PINK// YELLOW// WHITE;
RGB = IntToRGB(color);
print colName color[format=hex6.] RGB[colname={"Red" "Green" "Blue"}];

Figure 10.3. Hexadecimal and RGB Representation of Some Predefined Colors

10.2.2.2 Using Color to Mark Outliers

In Section 8.8.2, you learned how to add residual values to a data object. It is often convenient to change the color or marker of observations whose residual values are far from zero. Observations that are not well predicted by the model are called outliers.

The following statements assume that the residual values are in the data object in a variable called Resid. The statements set the color of any observation that has a large residual:

/* set color of observations with large residual values */
dobj.GetVarData("Resid", resid);
idx = loc(resid<=-10 | resid>=10);
if ncol(idx)>0 then
   dobj.SetMarkerColor(idx, RED);

You can see the colored observations on a scatter plot (such as Figure 8.6) that includes the Resid variable.

The definition of a "large" residual is data dependent. In this example, an observation is colored red if the actual Mpg_City value differs from the predicted value by ten or more miles per gallon.

Although this example used an arbitrary value (10) to determine which residuals are considered "large," you can also detect and color outliers by using less arbitrary criteria. For example, some authors recommend computing the externally studentized residuals and examining an observation when the absolute value of the studentized residual exceeds 2. (The RSTUDENT= option in the GLM OUTPUT statement computes externally studentized residuals.) Regression diagnostics are discussed in Chapter 12, "Regression Diagnostics."
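The studentized criterion can be applied with the same pattern as the previous statements. The following sketch assumes that the externally studentized residuals have already been added to the data object in a variable named RStudent; that variable name is hypothetical and is not part of the earlier example.

```sas
/* sketch: color observations whose studentized residual exceeds 2 */
/* assumes a variable named "RStudent" exists in the data object   */
dobj.GetVarData("RStudent", rs);
idx = loc(abs(rs) > 2);              /* |studentized residual| > 2  */
if ncol(idx)>0 then
   dobj.SetMarkerColor(idx, RED);
```

As before, the check on ncol(idx) guards against the case in which no observation satisfies the criterion.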

10.2.3 Coloring by Values of a Continuous Variable

It is convenient to use marker shapes to indicate levels of a categorical variable. However, colors are better for encoding the values of continuous variables and discrete ordinal variables because you can associate one color (say, blue) with low values and another (say, red) with high values. Often one or more additional colors are used to represent intermediate values.

The set of colors that represents values of a variable is called a color ramp (or sometimes a color map). Often a linear mapping is used to associate values of a variable to colors: the minimum value of the variable is mapped to the first color, the maximum value is mapped to the last color, and some scheme is used to associate colors with intermediate values. Nonlinear schemes are sometimes used when the distribution of the values is highly skewed.

The next two sections describe ways to associate colors with values. The simplest is for each color to represent a large range of values. (For example, you can divide the data into quartiles.) In this scheme, there is no need to interpolate colors: each observation is placed into one bin, and each bin is assigned a color. The more complicated way to associate colors with values is to specify a small number of colors but to interpolate colors so that only a small range of values is associated with a color.

One Color for a Large Range of Values

Suppose you want to color observations in the Movies data set according to values of the US_Gross variable. You decide to use four colors to encode the US_Gross values: blue for the movies with low values of US_Gross, cyan and orange for movies with intermediate values, and red for movies that generated large box-office revenues. To be specific, suppose you want to associate revenue and colors by using the following rules:

Table 10.2. Encoding Colors for Movie Revenue Ranges

US Gross     Color     Comment
(0,25]       BLUE      A flop
(25,100]     CYAN      A typical movie
(100,200]    ORANGE    A hit
> 200        RED       A blockbuster hit

The values that divide one category from another are called cutoff values or sometimes break points. For this example, the cutoff values are 25, 100, and 200. The cutoff values, together with the minimum and maximum value of US_Gross, form endpoints of intervals. All values within an interval are assigned the same color.

A naive implementation would loop over all observations, and use the SetMarkerColor method to assign a color for that observation based on the cutoff values for US_Gross. This is inefficient. As pointed out in the section "Writing Efficient SAS/IML Programs" on page 79, you should avoid loops over all observations. It is almost always better to loop over categories and to use the LOC function to identify the observations that belong to each category. The following program loops over the four color categories:

/* color each observation based on value of a continuous variable */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");
dobj.GetVarData("US_Gross", v);                         /* 1 */
EndPts = {0       25      100      200} || max(v);      /* 2 */
Color  = BLUE || CYAN || ORANGE || RED;                 /* 3 */
do i = 1 to ncol(Color);                                /* 4 */
   idx = loc(v>EndPts[i] & v<=EndPts[i+1]);             /* 5 */
   if ncol(idx)>0 then
      dobj.SetMarkerColor(idx, Color[i]);
end;

The following list describes the main steps of the program:

  1. The data in the US_Gross variable are copied into the vector v.

  2. The cutoff values define four intervals. The first interval is (0, 25], the second is (25, 100], and so on. So that the program does not need to treat the last category ("blockbusters") differently from the previous categories, it is useful to specify a large number (max(v)) to use as the right endpoint of the last interval.

  3. The corresponding colors are defined as in Table 10.2.

  4. The loop is over the categories defined by the four intervals.

  5. This is the key step of the program. The LOC function finds all observations in the ith interval. The ith interval consists of values greater than EndPts[i] and less than or equal to EndPts[i+1]. The color of these observations is set to Color[i] by calling the SetMarkerColor method.

You should carefully study the behavior of the statements inside the loop during the last iteration (when i=4 in the example). The LOC function finds elements of v that exceed EndPts[4] (200) and are less than or equal to EndPts[5], which corresponds to max(v). If the EndPts vector had only four elements, then the program would halt with an index-out-of-bounds error. By using max(v) as the fifth element of EndPts, the program avoids this error and also avoids having to handle the case v > 200 differently than the other cases.
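The interval logic can be examined in isolation on a small vector. The following sketch uses made-up values rather than the Movies data, and it records the interval number of each value in a hypothetical vector named bin instead of setting a marker color.

```sas
/* sketch: assign each value to one of four half-open intervals */
v = {10, 25, 60, 150, 200, 300};         /* made-up revenue values      */
EndPts = {0 25 100 200} || max(v);       /* right endpoint of last bin  */
bin = j(nrow(v), 1, .);                  /* interval number per value   */
do i = 1 to ncol(EndPts)-1;
   idx = loc(v>EndPts[i] & v<=EndPts[i+1]);
   if ncol(idx)>0 then
      bin[idx] = i;
end;
print v bin;                             /* bin = {1, 1, 2, 3, 3, 4}    */
```

The values 25 and 200 fall into the lower of the two adjacent intervals because each interval is closed on the right, and the maximum value (300) is captured by the last interval during the final iteration.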

It is left as an exercise for the reader to view the colored observations by creating a scatter plot of any two variables in the Sasuser.Movies data set.

Color-Blending: Few Values for each Color

The previous section assigned a single color to a range of values for US_Gross. In mathematical terms, the function that maps values to colors is piecewise constant. It is also possible to construct a piecewise linear function that maps values to colors. Defining this mapping requires linear interpolation between colors. Specifically, given two colors, what are the colors between them?

Linear interpolation of real numbers is straightforward: given numbers $x$ and $y$, the numbers between them are parameterized by $(1 - t)x + ty$ for $t \in [0,1]$. Linear interpolation of colors works similarly. If two colors, $x = (x_r, x_g, x_b)$ and $y = (y_r, y_g, y_b)$, are represented as triples of integers, then for each $t \in [0,1]$, the triple $c = (1 - t)x + ty$ is between $x$ and $y$. In general, $c$ is not a triple of integers, so in practice you must specify a color close to $c$, typically by using the INT, FLOOR, or ROUND functions in Base SAS software.

Color interpolation makes it easy to color observation markers by values of a continuous variable. Let $v$ be the vector that contains the values. Assume first that you have a color ramp with two colors, $a$ and $b$ (represented as triples). The idea is to assign the color $a$ to any observations with the value $\min(v)$ and assign the color $b$ to any observations with the value $\max(v)$. For intermediate values, interpolate colors between $a$ and $b$. How can you do this? First, normalize the values of $v$ by applying a linear transformation: for each element of $v$, let $t_i = (v_i - \min(v))/(\max(v) - \min(v))$. Notice that the values of $t$ are in the interval $[0,1]$. Consequently, you can assign the observation with value $v_i$ the color closest to the triple $(1 - t_i)a + t_i b$.

The following program implements these ideas for a two-color ramp that varies between light brown and dark brown. The program uses linear interpolation to color-code observations according to values of a continuous variable.

/* linearly interpolate color of each observation */
dobj.GetVarData("US_Gross", v);
a = IntToRGB(CREAM);                                   /* 1 */
b = IntToRGB(BROWN);
t = (v-min(v)) / (max(v)-min(v));                      /* 2 */
colors = (1-t)*a + t*b;                                /* 3 */
dobj.SetMarkerColor(1:nrow(v), colors);                /* 4 */

The previous statements were not written to handle missing values in v, but can be easily adapted. This is left to the reader. It is also trivial to change the program to use a color ramp that is formed by any other pair of colors. The steps of the program are as follows:

  1. Represent the two colors as RGB triples, a and b.

  2. Normalize the values of v. The new vector t has values in the interval [0,1].

  3. Linearly interpolate between the colors. The colors vector does not contain integer values, but you can apply the INT, FLOOR, or ROUND functions to the colors vector. If you do not explicitly truncate the colors, the SetMarkerColor method truncates the values (which is equivalent to applying the INT function).

  4. Call the SetMarkerColor method to set the colors of the observations.

It is not initially apparent that Step 3 involves matrix computations, but it does. The vector t is an n × 1 vector, and the vectors a and b are 1 × 3 row vectors. Therefore, the matrix product t*b is an n × 3 matrix, and similarly for the product (1-t)*a.
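The dimensions are easy to confirm on a small example. The following sketch replaces the Movies data with three made-up values and uses white and black as the endpoint colors so that the interpolated rows are easy to read.

```sas
/* sketch: matrix dimensions in two-color interpolation */
v = {0, 5, 10};                       /* made-up values, 3 x 1           */
a = {255 255 255};                    /* white, 1 x 3                    */
b = {0 0 0};                          /* black, 1 x 3                    */
t = (v-min(v)) / (max(v)-min(v));     /* t = {0, 0.5, 1}                 */
colors = (1-t)*a + t*b;               /* (3 x 1)*(1 x 3) gives 3 x 3     */
dims = nrow(colors) || ncol(colors);  /* dims = {3 3}                    */
print t colors, dims;
```

Each row of colors is the RGB triple for one observation: the first row is white, the last row is black, and the middle row (127.5 in each coordinate) is a gray that still needs to be truncated or rounded to integers.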

The scatter plot shown in Figure 10.4 displays US_Gross on the vertical axis. Even though the image in this book is monochromatic, the gradient of colors from a light color to a dark color is apparent. In practice, it is common to use color to depict a variable that is not present in the graph.

Figure 10.4. A Color Ramp

The ideas of this section can be extended to color ramps with more than two colors. Suppose you have a vector $v$ and want to color observations by using a color ramp with $k$ colors $c_1, c_2, \ldots, c_k$. Divide the range of $v$ into $k - 1$ equal intervals defined by the evenly spaced values $L_1, L_2, \ldots, L_k$, where $L_1 = \min(v)$ and $L_k = \max(v)$. To the $i$th interval $[L_i, L_{i+1}]$, associate the two-color ramp with colors $c_i$ and $c_{i+1}$. For an observation that has a value of $v$ in the interval $[L_i, L_{i+1}]$, color it according to the algorithm for two-color color ramps. In this way, you reduce the problem to one already solved: to handle a color ramp with k colors, simply handle k − 1 color ramps with two colors.
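The reduction to two-color ramps can be sketched as a short SAS/IML function module. The module name MultiRampColors and its arguments are hypothetical, and the sketch assumes that v has no missing values and is not constant. It is intended only to illustrate the algorithm, not to reproduce the modules that are distributed with SAS/IML Studio.

```sas
/* sketch: interpolate colors for a ramp with k colors            */
/* ramp is a k x 3 matrix; each row is an RGB triple              */
start MultiRampColors(v, ramp);
   k = nrow(ramp);
   s = (k-1) * (v-min(v)) / (max(v)-min(v));  /* rescale v to [0, k-1]      */
   i = floor(s) + 1;                          /* interval that contains v   */
   i = i >< (k-1);                            /* put max(v) in last interval */
   t = s - (i-1);                             /* position within interval   */
   colors = (1-t)#ramp[i, ] + t#ramp[i+1, ];  /* blend the endpoint colors  */
   return( colors );
finish;

ramp = {  0   0 255,                          /* blue  */
        255 255 255,                          /* white */
        255   0   0};                         /* red   */
v = {0, 1, 2, 3, 4};                          /* made-up values */
colors = MultiRampColors(v, ramp);
print v colors;
```

For these values, the first, third, and fifth rows of colors are exactly blue, white, and red, and the second and fourth rows are the midpoints (127.5, 127.5, 255) and (255, 127.5, 127.5).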

SAS/IML Studio distributes several modules that can help you interpolate colors and color-code observations. The modules are briefly described in Table 10.3.

Table 10.3. IMLPlus Modules for Manipulating Colors

IMLPlus Module         Description
BlendColors            Interpolates colors from a color ramp
ColorCodeObs           Uses linear interpolation to color observations according to values of a continuous variable
ColorCodeObsByGroups   Colors observations according to levels of one or more categorical variables
IntToRGB               Converts colors from a hexadecimal representation to ordered triples of RGB values
RGBToInt               Converts colors from ordered triples of RGB values to a hexadecimal representation

10.3 Changing the Display Order of Categories

An ordinal variable is a categorical variable for which the various categories can be ordered. For example, a variable that contains the days of the week has a natural ordering: Sunday, Monday,..., Saturday.

By default, all graphs display categorical variables in alphanumeric order. This order might not be the most appropriate to display. For example, if you create a bar chart for days of the week or months of the year, the bar chart should present categories in a chronological order rather than in alphabetical order. Similarly, for a variable with values "Low," "Medium," and "High," the values of the variable suggest an order that is not alphabetical.

Sometimes it is appropriate to order categories according to the values of some statistic applied to the categories. For example, you might want to order a bar chart according to the frequency counts, or you might want to order a box plot according to the mean (or median) value of the categories.

This section describes how to change the order in which a graph displays the levels of a categorical variable. For information about using the SAS/IML Studio GUI to set the order of a variable, see the section "Ordering Categories of a Nominal Variable" (Chapter 11, SAS/IML Studio User's Guide).

10.3.1 Setting the Display Order of a Categorical Variable

Suppose you want to create a bar chart showing how many movies in the Movies data set were released for each month, across all years. You can determine the month that each movie was released by applying the MONNAMEw. format to the ReleaseDate variable.

The following program creates a new character variable (Month) in the data object that contains the month in which each movie was released. The program also creates a bar chart (shown in Figure 10.5) of the Month variable.

/* create a bar chart of variable; bars displayed alphabetically */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");
dobj.GetVarData("ReleaseDate", r);
month = putn(r,"monname3.");          /* Jan, Feb, ..., Dec         */
dobj.AddVar("Month", month);          /* add new character variable */

declare BarChart bar;
bar = BarChart.Create(dobj, "Month");

Figure 10.5. Graph of Categories in Alphabetical Order

The categories in Figure 10.5 are plotted in alphabetical order. This makes it difficult to compare consecutive months or to observe seasonal trends in the data. The months have a natural chronological ordering. You can set the ordering of a variable by using the SetVarValueOrder method in the DataObject class. This is shown in the following statements:

/* change order of bars; display chronologically */
order = {"Jan" "Feb" "Mar" "Apr" "May" "Jun"
         "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"};
/* alternatively: order = putn(mdy(1:12,1,1960), "monname3."); */
dobj.SetVarValueOrder("Month", order);

All graphs that display the variable will update to reflect the new ordering, as shown in Figure 10.6.

Figure 10.6. Graph of Categories in Chronological Order

Figure 10.6 shows the relative changes from month to month. Notice that the distributors release relatively few movies in January and February; perhaps the large number of December movies are still playing in the theaters? It is also curious that there are more releases in the months of March and September than in the summer months.

10.3.2 Using a Statistic to Set the Display Order of a Categorical Variable

Sometimes you might want to emphasize differences between groups by ordering categories according to some statistic calculated for data in each category. For example, you might want to order a bar chart according to the frequency counts, or you might want to order a box plot according to the mean (or median) value of the categories.

The following program demonstrates this idea by creating a box plot of the US_Gross variable versus the month that the movie was released. The Month variable is ordered according to the mean of the US_Gross variable for movies that are released during that month (for any year).

The previous section describes how you can create a new character variable in the data object (named Month) that contains the name of the month in which the movie was released. The following program continues the example. The program uses the technique of Section 3.3.5 to compute the mean gross revenue for each month:

/* compute a statistic for each category */
GroupVar = "Month";          /* variable that contains categories   */
dobj.GetVarData(GroupVar, group);
dobj.GetVarData("US_Gross", y);
u = unique(group);           /* find the categories                 */
numGroups = ncol(u);         /* number of categories                */
stat = j(1, numGroups);      /* allocate a vector for results       */
do i = 1 to numGroups;       /* for each group...                   */
   idx = loc(group=u[i]);    /* find the observations in that group */
   m = y[idx];               /* extract the values                  */
   stat[i] = m[:];           /* compute statistic for group         */
end;
print stat[colname=u];

The categories are stored in the u vector in alphanumeric order. The mean of the ith group is stored in the stat vector, as shown in Figure 10.7.

Figure 10.7. Statistics for Each Category in Alphabetical Order

The means are known for each group, so the following statements order the months according to their means:

/* sort categories according to the statistic for each category */
r = rank(stat);
print r[colname=u];
sorted_u = u;                /* copy data in u                       */
sorted_u[r] = u;             /* permute categories into sorted order */
print sorted_u;

dobj.SetVarValueOrder(GroupVar, sorted_u);
declare BoxPlot box;
box = BoxPlot.Create(dobj, GroupVar, "US_Gross");

The program shows how the RANK function is used to sort the u vector according to the values of the stat vector. The output from the program is shown in Figure 10.8. The smallest mean is stored in stat[12] which corresponds to movies released in September. Consequently, r[12] contains the value 1, which indicates that September needs to be the first entry when the vector is sorted. The second entry should be October, and so on.

Figure 10.8. Ranks of Categories and the Sorted Categories

The actual sorting is accomplished by using the r vector to permute the entries of u. First, the vector sorted_u is created as a copy of u. This ensures that sorted_u is the same size and type (character or numeric) as u. Then the permutation occurs. This is shown schematically in Figure 10.9. The first entry in r is 4, so sorted_u[4] is assigned the contents of u[1] which is "Apr". The second entry in r is 6, so sorted_u[6] is assigned the contents of u[2] which is "Aug". This process continues for each row. Finally, the last (twelfth) entry in r is 1, so sorted_u[1] is assigned the contents of u[12] which is "Sep". In this way, the rows of sorted_u are sorted according to the values of stat.
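The permutation is easier to follow on a small example. The following sketch uses four made-up categories and statistics in place of the twelve months.

```sas
/* sketch: sort categories by a statistic by using ranks */
u    = {"A" "B" "C" "D"};     /* made-up categories            */
stat = {30 10 40 20};         /* made-up statistic per category */
r = rank(stat);               /* r = {3 1 4 2}                  */
sorted_u = u;                 /* same size and type as u        */
sorted_u[r] = u;              /* sorted_u = {"B" "D" "A" "C"}   */
print r, sorted_u;
```

Category "B" has the smallest statistic, so its rank is 1 and it is placed first in sorted_u; category "C" has the largest statistic and is placed last.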

Figure 10.9. Permuting Rows to Sort Categories
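
The permutation step can be checked on a small example. The following statements are a minimal sketch that uses a made-up four-element vector of categories and made-up statistics (neither comes from the Movies data). They use only the RANK function and subscript assignment, so they should run in SAS/IML Studio or, when preceded by a PROC IML statement, in PROC IML:

```sas
/* hypothetical data: four categories and a statistic for each */
u    = {"A", "B", "C", "D"};
stat = {30, 10, 40, 20};

r = rank(stat);        /* r = {3, 1, 4, 2}: rank of each statistic */
sorted_u = u;          /* same size and type as u                  */
sorted_u[r] = u;       /* u[i] is copied to position r[i]          */
print r sorted_u;      /* sorted_u = {"B", "D", "A", "C"}          */
```

Because stat[2] is the smallest value, r[2] is 1 and "B" moves to the first position. If stat contains tied values, the RANK function breaks the ties arbitrarily and still returns a permutation of 1, 2, ..., n, so the assignment remains valid.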

The last statements of the program reorder the categories of the Month variable by using the SetVarValueOrder method in the DataObject class. All graphs that include the Month variable display the months in the order given by the sorted_u vector, as shown in the box plot in Figure 10.10.

Figure 10.10. Box Plots Ordered by the Mean of Each Category

The program in this section is written so as to work in many situations. For example, if you assign "MPAARating" to GroupVar then the program creates a box plot of the MPAA ratings sorted according to the mean US gross revenues for each ratings category. (If GroupVar contains the name of a numeric variable, the program still runs correctly provided that you comment out the PRINT statements since the COLNAME= option expects a character variable.)

10.4 Selecting Observations

Selecting observations in IMLPlus graphics is a major reason to use SAS/IML Studio in data analysis. You can discover relationships among variables in your data by selecting observations in one graph and seeing those same observations highlighted in other graphs.

You can select observations interactively by clicking on a bar in a bar chart or by using a selection rectangle to select several observations in a scatter plot. However, sometimes it is useful to select observations by running a program. Selecting observations with a program can be useful if the criterion that specifies the selected observations is complicated. It is also useful for selecting observations during a presentation or as part of an analysis that you run frequently, or for highlighting observations that have missing values in a certain variable.

The following program selects observations in a data object created from the Movies data set for which the World_Gross variable has a missing value:

/* select observations that contain a missing value for a variable */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");
dobj.GetVarData("World_Gross", wGross);
/* find observations with missing values */
idx = loc(wGross=.);
if ncol(idx)>0 then
   dobj.SelectObs(idx);

The SelectObs method selects the observations whose observation numbers are given by its first argument. An optional second argument specifies whether to deselect all observations before selecting the new ones. By default, existing selections are kept, so you can call the SelectObs method several times to build up the union of observations that satisfy any of several criteria. For example, the following statements (which continue the preceding program) find all movies for which the world gross revenues are more than ten times the US revenue. These movies (which did far better internationally than in the US) are added to the set of selected observations:

/* select observations that satisfy a formula */
dobj.GetVarData("US_Gross", usGross);
jdx = loc(wGross>10*usGross);
if ncol(jdx)>0 then
   dobj.SelectObs(jdx);                 /* add to previous selection */

The results of the previous statements are shown in Figure 10.11. The selected observations are drawn larger than the unselected observations so that they are more visible. You can increase the size difference between selected and unselected observations by pressing ALT+UP ARROW in the plot. As of SAS/IML Studio 3.3, there is no method that changes this size difference programmatically.

Figure 10.11. Observations That Satisfy Either of Two Criteria

If you want to replace existing selected observations, you can call the SelectObs method with the second argument equal to false. If you want to clear existing selected observations, you can call the DeselectAllObs method in the DataObject class. Similarly, the DeselectObs method allows you to deselect specified observations. You can use the DeselectObs method to specify the intersection of criteria. You can also use the XSECT function to form the intersection of criteria.
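
As an illustration of the XSECT approach, the following statements are a minimal sketch that selects only the movies that satisfy both of two criteria. The sketch assumes that dobj is the data object created earlier in this section. The two criteria and the threshold value are hypothetical and were chosen only for illustration:

```sas
/* select observations that satisfy BOTH of two criteria */
dobj.GetVarData("US_Gross", usGross);
dobj.GetVarData("World_Gross", wGross);
kdx1 = loc(usGross > 100);          /* hypothetical criterion 1        */
kdx2 = loc(wGross > 2*usGross);     /* hypothetical criterion 2        */
if ncol(kdx1)>0 & ncol(kdx2)>0 then do;
   both = xsect(kdx1, kdx2);        /* obs numbers common to both      */
   if ncol(both)>0 then
      dobj.SelectObs(both, false);  /* false: replace prior selection  */
end;
```

The NCOL checks guard against empty matrices, which the LOC function returns when no observation satisfies a criterion. When the intersection is empty, this sketch leaves the current selection unchanged; you can call the DeselectAllObs method in that case if you prefer an empty selection.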

The following table lists the frequently used DataObject methods that select or deselect observations:

Table 10.4. Methods in the DataObject Class That Select or Deselect Observations

Method            Description
DeselectAllObs    Deselects all observations
DeselectObs       Deselects specified observations
SelectObs         Selects specified observations
SelectObsWhere    Selects observations that satisfy a criterion

In addition to viewing selected observations in IMLPlus graphs, there is a second reason you might want to select observations in a data object: many DataObject methods operate on the selected observations unless a vector of observation numbers is explicitly specified. The following table lists DataObject methods that, by default, operate on selected observations:

Table 10.5. DataObject Class Methods That Can Operate on Selected Observations

Method                   Description
GetSelectedObsNumbers    Gets the observation numbers for the selected observations
GetVarSelectedData       Gets the data for the selected observations and for a list of specified variables
IncludeInAnalysis        Specifies whether an observation should be included in statistical analyses
IncludeInPlots           Specifies whether an observation should be included in IMLPlus graphs
SetMarkerColor           Sets the color of the marker that represents an observation
SetMarkerShape           Sets the shape of the marker that represents an observation
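
You can also pass the selected observation numbers explicitly. The following statements are a minimal sketch, assuming that dobj is an existing data object in which some observations might be selected. The statements retrieve the selected observation numbers and pass them to the SetMarkerColor method, using only the calling forms that appear in this chapter:

```sas
/* color the currently selected observations, if there are any */
dobj.GetSelectedObsNumbers(selIdx);
if ncol(selIdx)>0 then
   dobj.SetMarkerColor(selIdx, RED);
```

The NCOL check skips the call when nothing is selected, in which case selIdx is assumed to be an empty matrix.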

10.5 Getting and Setting Attributes of Data

Previous sections describe methods such as SetRoleVar, SetMarkerColor, and SelectObs, which all affect attributes of data. This section gives further examples of using DataObject methods to change attributes for observations and properties for variables.

10.5.1 Properties of Variables

SAS data sets contain some properties of variables: the length of a variable (more often used for character variables than for numeric) and the variable's label, format, and informat. Even the name of the variable is a property and can be changed by using the RENAME statement in a DATA step. The DataObject class has methods that get or set all of these properties.

In addition, the DataObject class maintains additional properties. For example, a variable in a data object can be nominal. All character variables are nominal; numerical variables are assumed to represent continuous quantities unless explicitly set to be nominal. A variable can also be selected or unselected. Selected variables are used by analyses chosen from the SAS/IML Studio GUI to automatically fill in fields in some dialog boxes. The analyses built into SAS/IML Studio are accessible from the Analysis menu; they are described in the SAS/IML Studio User's Guide.

The following program calls several DataObject methods to set variable properties. It then displays a data table.

/* call DataObject methods related to variable properties */
obsNum = t(1:5);                              /* 5 × 1 vector       */
x = obsNum / 100;
declare DataObject dobj;
dobj = DataObject.Create("Properties", {"ObsNum","x"}, obsNum||x);
/* call some "Set" methods for variable properties */
/* standard SAS variable properties */
dobj.SetVarFormat("x", "Percent6.1");         /* set format         */
dobj.SetVarLabel("x", "A few values");        /* set label          */

/* IMLPlus properties */
dobj.SetRoleVar(ROLE_WEIGHT, "x");            /* set role (=WEIGHT) */
dobj.SelectVar("x");                          /* select variable    */

/* displays data table in front of any other SAS/IML Studio windows */
DataTable.Create(dobj).ActivateWindow();

The data table created by the program is shown in Figure 10.12. You can see from the data table that the x variable is selected (notice the highlighting), is assigned the "weight" role (notice the 'W' in the column header), and is assigned the PERCENT6.1 format (notice the values).

Figure 10.12. The Result of Calling DataObject Methods

You can also retrieve variable properties. The following example continues the previous program statements:

/* call some "Get" methods for variable properties */
dobj.GetVarNames(varNames);                /* get name of all vars  */
dobj.GetSelectedVarNames(varName);         /* get name of sel. vars */
/* standard SAS variable properties */
format = dobj.GetVarFormat(varName);       /* get format            */
label = dobj.GetVarLabel(varName);         /* get label             */
informat = dobj.GetVarInformat(varName);   /* get informat          */
if informat="" then                        /* return empty string   */
   informat = "None";                      /*    if no informat     */
/* IMLPlus properties */
weightName = dobj.GetRoleVar(ROLE_WEIGHT); /* get name of WEIGHT var*/
isNominal = dobj.IsNominal(varName);       /* is variable nominal?  */

print varName format informat label;       /* SAS properties        */
print weightName isNominal;                /* IMLPlus properties    */

The output from the program is shown in Figure 10.13. The program is not very interesting; its purpose is to demonstrate how to set and get frequently used properties of variables. Notice that the DataObject class contains methods that get and set standard SAS properties (variable name, format, and informat) in addition to IMLPlus properties (role, selected state). For information on how to set and examine variable properties by using the SAS/IML Studio GUI, see Chapter 4, "Interacting with the Data Table" (SAS/IML Studio User's Guide).

Figure 10.13. Variable Properties

10.5.2 Attributes of Observations

Although SAS data sets do not contain attributes for each observation, the DataObject class does. The DataObject class has methods to set and retrieve attributes for observations. For example, you can set the marker color and marker shape for an observation, and specify whether the observation is included in plots and included in analyses. You can also select or deselect an observation.

The following program continues the previous program by calling DataObject methods to set attributes of observations in the data object:

/* call DataObject methods related to observation attributes */
/* call some "Set" methods for observation attributes */
dobj.DeselectAllVar();                    /* remove selected var    */
dobj.SetMarkerColor(1:2, RED);            /* set marker color       */
dobj.SetMarkerShape(2:3, MARKER_STAR);    /* set marker shape       */
dobj.IncludeInAnalysis(3:4, false);       /* set analysis indicator */
dobj.IncludeInPlots(4:5, false);          /* set plot indicator     */
dobj.SelectObs({1,3,5});                  /* select observations    */

The data table created by the program is shown in Figure 10.14. Notice that the selected observations are highlighted, and that the row headers have icons that show the shape for each observation, in addition to whether the observation is included in plots or in analyses. The icons also show the color for each observation, although the colors are not apparent in the figure.

Figure 10.14. The Result of Calling DataObject Methods

In the same way, you can call methods that retrieve attributes for observations. This is shown in the following statements:

/* call some "Get" methods for observation attributes */
dobj.GetMarkerFillColor(colorIdx);     /* get all colors            */
dobj.GetMarkerShape(shapeIdx);         /* get all shapes            */
dobj.GetObsNumbersInAnalysis(analIdx); /* get only obs numbers with */
dobj.GetObsNumbersInPlots(plotsIdx);   /* a certain indicator       */
dobj.GetSelectedObsNumbers(selIdx);    /* get selected obs numbers  */

/* compute some combinations */
selColor = colorIdx[selIdx];           /* color of selected obs     */
shapePlots = shapeIdx[plotsIdx];       /* shapes in plots           */
selAnal = xsect(analIdx, selIdx);      /* selected obs in analyses  */

/* printing convenience:
   create a format that associates marker names with marker values  */
submit;
proc format;
   value shape
   0='Square' 1='Plus'     2='Circle'      3='Diamond'
   4='X'      5='Triangle' 6='InvTriangle' 7='Star';
run;
endsubmit;

print selColor[format=hex6.] shapePlots[format=shape.], selAnal;

The output from the program is shown in Figure 10.15.

Figure 10.15. Observation Attributes

There are several statements in the program that require further explanation. Notice that there are two kinds of "Get" methods for observation attributes. Some (such as GetMarkerShape) create a vector that always has n rows, where n is the number of observations in the data set. The elements of the vector are colors, shapes, or some other attribute. In contrast, other methods (such as GetSelectedObsNumbers) create a vector that contains k rows, where k is the number of observations that satisfy a criterion. The elements of the vector are observation numbers (the indices of observations) for which a given criterion is true. Consequently, you use colorIdx[selIdx] in order to compute the colors of selected observations, whereas you use the XSECT function to compute the indices that are both selected and included in analyses.
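
The difference between the two kinds of vectors can be seen without a data object. The following statements are a minimal sketch that uses mock vectors for n = 5 observations. The values are typed by hand for illustration rather than retrieved by the "Get" methods, although they mimic the settings made earlier in this section:

```sas
/* mock attribute and index vectors for n = 5 observations */
shapeIdx = {0, 7, 7, 0, 0};          /* one shape value per observation */
selIdx   = {1, 3, 5};                /* numbers of selected obs         */
analIdx  = {1, 2, 5};                /* numbers of obs in analyses      */

selShape = shapeIdx[selIdx];         /* attribute lookup: {0, 7, 0}     */
selAnal  = xsect(analIdx, selIdx);   /* index intersection: {1 5}       */
print selShape, selAnal;
```

Subscripting is appropriate when one vector holds an attribute for every observation and the other holds observation numbers. The XSECT function is appropriate when both vectors hold observation numbers.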

Notice how the SUBMIT block is used to call the FORMAT procedure in order to print names for the values returned by the GetMarkerShape method. The FORMAT procedure creates a new format which associates the values of IMLPlus keywords with explanatory text strings. For example, the keyword MARKER_STAR has the numeric value 7 as you can determine by using the PRINT statement. The newly created format (named SHAPE.) is used to print the values of the shapePlots vector.
