Chapter 7. Creating Statistical Graphs

Contents

  • 7.1 Overview of Creating Statistical Graphs

  • 7.2 The Source of Data for a Graph

  • 7.3 Bar Charts

    • 7.3.1 Creating a Bar Chart from a Vector

    • 7.3.2 Creating a Bar Chart from a Data Object

    • 7.3.3 Modifying the Appearance of a Graph

    • 7.3.4 Frequently Used Bar Chart Methods

  • 7.4 Histograms

    • 7.4.1 Creating a Histogram from a Vector

    • 7.4.2 Creating a Histogram from a Data Object

    • 7.4.3 Frequently Used Histogram Methods

  • 7.5 Scatter Plots

    • 7.5.1 Creating a Scatter Plot from Vectors

    • 7.5.2 Creating a Scatter Plot from a Data Object

  • 7.6 Line Plots

    • 7.6.1 Creating a Line Plot for a Single Variable

    • 7.6.2 Creating a Line Plot for Several Variables

    • 7.6.3 Creating a Line Plot with a Classification Variable

    • 7.6.4 Frequently Used Line Plot Methods

  • 7.7 Box Plots

    • 7.7.1 Creating a Box Plot

    • 7.7.2 Creating a Grouped Box Plot

    • 7.7.3 Frequently Used Box Plot Methods

  • 7.8 Summary of Graph Types

  • 7.9 Displaying the Data Used to Create a Graph

  • 7.10 Changing the Format of a Graph Axis

  • 7.11 Summary of Creating Graphs

  • 7.12 References

7.1 Overview of Creating Statistical Graphs

This chapter introduces some of the statistical graphs available in IMLPlus. It describes how to create these graphs from SAS/IML vectors and from a data object. The chapter also introduces the concept of calling methods in order to modify simple attributes of graphs.

The chapter uses some object-oriented terminology that is described in Chapter 6, "Understanding IMLPlus Classes." However, you do not need to master object-oriented programming in order to follow the examples in this chapter. By imitating the syntax in the examples, you can create and use simple graphs in your own IMLPlus programs. The following list summarizes important terms and concepts:

  • A class is a "template" that defines a graph, including what data are required to create it and what functions can be used to modify the graph. A bar chart is defined by the BarChart class, a histogram is defined by the Histogram class, and so on.

  • Graphs are created and modified by functions called methods. Each class has its own methods, although some methods are common to several classes.

  • An object is a programming variable that refers to an instance of a class. In IMLPlus, you can use the declare keyword to specify that a variable in your program is an object of some class. For example, the following statement declares that bar is an object of the BarChart class:

    declare BarChart bar;

  • Each graph is attached to an in-memory copy of the data. This in-memory copy is known as a "data object" since it is an object of the DataObject class. Graphs that are attached to the same data object are dynamically linked to each other. The DataObject class is discussed in Chapter 8, "Managing Data in IMLPlus."

The examples in this chapter do not run in PROC IML, so be sure you are using SAS/IML Studio.

7.2 The Source of Data for a Graph

In general, there are two ways to create a graph in IMLPlus: from SAS/IML vectors and from an object of the DataObject class. The DataObject class is introduced in Chapter 6 and is described more fully in Chapter 8.

If you already have the data in SAS/IML vectors and you just want a quick and convenient way to visualize the data, you can create graphs directly from the vectors. However, if you want to create several graphs that are dynamically linked to each other, you should create the graphs from a common data object.
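
As a minimal sketch of the second approach, the following statements attach two graphs to one data object so that they are dynamically linked. The object names linkedBar and linkedHist are hypothetical; the Create calls have the same form as the ones shown later in this chapter.

```sas
/* sketch: two graphs that share one data object are dynamically linked */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");

declare BarChart linkedBar;                       /* hypothetical name */
linkedBar = BarChart.Create(dobj, "MPAARating");

declare Histogram linkedHist;                     /* hypothetical name */
linkedHist = Histogram.Create(dobj, "Budget");
```

Because both graphs are created from dobj rather than from separate vectors, they refer to the same in-memory copy of the data.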

7.3 Bar Charts

The section "Analyzing Observations by Categories" on page 68 shows how to read a categorical variable from a SAS data set and how to use the LOC and UNIQUE functions to count and display a table that shows the frequency of each category in the variable. A bar chart is a graphical representation of the same information. For example, Figure 7.1 displays the number of movies in the Movies data set for each rating category. This graphically displays the tabular data shown in Figure 3.13.

Figure 7.1. Bar Chart of Movie Ratings
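
The tabulation that underlies a bar chart can be sketched with the UNIQUE and LOC functions. The following statements are an illustration only: the rating vector contains hypothetical values rather than the Movies data.

```sas
/* sketch: count the observations in each category of a character vector */
rating = {"PG-13" "R" "PG" "R" "G" "PG-13" "R" "PG-13"}; /* hypothetical data */
u = unique(rating);                      /* sorted unique categories         */
count = j(1, ncol(u), 0);                /* allocate vector of counts        */
do i = 1 to ncol(u);
   count[i] = ncol(loc(rating = u[i]));  /* number of matching observations  */
end;
print count[colname=u];
```

For these hypothetical values, the categories G, PG, PG-13, and R have counts 1, 1, 3, and 3. A bar chart of the same vector would display one bar for each category, with heights equal to these counts.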

7.3.1 Creating a Bar Chart from a Vector

You can create Figure 7.1 from a vector that contains the Motion Picture Association of America (MPAA) rating for each movie. The following program shows one way to create the bar chart:

/* create a bar chart from a vector of data */
use Sasuser.Movies;
read all var {"MPAARating"};                               /* 1 */
close Sasuser.Movies;

declare BarChart bar;                                      /* 2 */
bar = BarChart.Create("Ratings", MPAARating);              /* 3 */

The program contains three main steps:

  1. Create the vector MPAARating from the data in the MPAARating variable.

  2. Use the declare keyword to specify that bar is an object of the BarChart class.

  3. Create the BarChart object from the data in the MPAARating vector. The first argument to this method is a string that names the associated data object; the second argument is the vector that contains the data.

The Create method displays the graph, as shown in Figure 7.1. The bar chart shows that most movies made in the US between 2005 and 2007 were rated PG-13 or R. Only a small number of movies were rated G. There was also a smaller number of "not rated" (NR) movies that were never submitted to the MPAA for rating.

7.3.2 Creating a Bar Chart from a Data Object

If the data are in a data object, you can create a similar bar chart.

Create a data object from the Movies data by using the following statements:

/* create data object from SAS data set */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");

The following statements create a bar chart:

/* create a bar chart from a data object */
declare BarChart bar2;
bar2 = BarChart.Create(dobj, "MPAARating");

The program creates the BarChart object, bar2, from the data in the MPAARating variable. The first argument to this method is a data object; the second argument is the name of the variable that contains the data.

The bar chart created in this way is linked to the data object and to all other graphs and data tables that share the same data object. It looks like Figure 7.1, except that the horizontal axis is labeled by the name of the MPAARating variable.

7.3.3 Modifying the Appearance of a Graph

You can call BarChart class methods to modify the appearance of the bar chart. Each method causes the graph to update.

For example, suppose you want to modify the appearance of the graph in Figure 7.1, which corresponds to the bar object. You might want to change the label for the horizontal axis (which currently reads "X") and add labels that indicate the number of observations in each MPAA category. Because bar is a variable in the IMLPlus program, you can call BarChart methods on it by using the ObjectName.MethodName() syntax, as shown in the following statements:

bar.SetAxisLabel(XAXIS, "MPAA Rating"); /* change the axis label    */
bar.ShowBarLabels();                    /* show counts for each bar */

The first statement sets the label for the horizontal axis to "MPAA Rating". The second statement causes the bar chart to display the count for each category. Both statements are optional, but they result in a more interpretable graph, as shown in Figure 7.2. The bar chart updates itself after each method call.

Figure 7.2. Modified Bar Chart of Movie Ratings

7.3.4 Frequently Used Bar Chart Methods

Each graph type has methods that control the appearance of the graph. Table 7.1 summarizes frequently used methods in the BarChart class.

Table 7.1. Frequently Used Methods in the BarChart Class

Method             Description
-----------------  ---------------------------------------------------------
BarChart.Create    Creates a bar chart
SetOtherThreshold  Specifies a cutoff percentage for determining which
                   categories are combined into a generic category called
                   "Others"
ShowBarLabels      Shows or hides labels of the count or percentage for each
                   category
ShowPercentage     Specifies whether the graph's vertical axis displays
                   counts or percentages

The complete set of BarChart methods is documented in the SAS/IML Studio online Help. To view the online Help, select Help ▸ Help Topics from the SAS/IML Studio main menu, and then select the chapter titled "IMLPlus Class Reference."

The result of the ShowBarLabels method is shown in Figure 7.2. The other methods are used in the following statements to create Figure 7.3.

bar.ShowPercentage(true);      /* display percents                  */
bar.SetOtherThreshold(5.0);    /* merge bars that have small counts */

Figure 7.3. Result of Calling Bar Chart Methods

The ShowPercentage method specifies the units of the vertical axis: frequency or percentage. The SetOtherThreshold method is useful when there are many categories that contain a relatively small number of observations. The argument to the method is a percentage in the range [0,100]. For example, 5.0 means 5% of observations. Any categories that contain fewer than 5% of the total observations are combined into an "Others" category, as shown in Figure 7.3. This enables you to more easily explore and analyze data in the relatively large categories.
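
The following statements sketch the arithmetic behind the threshold. The category names are taken from the Movies data, but the counts are hypothetical, and the strict "less than" comparison is an assumption based on the phrase "fewer than 5%" in the preceding paragraph.

```sas
/* sketch: which categories fall below a 5% threshold? */
category  = {"G" "NR" "PG" "PG-13" "R"};
count     = {10 4 40 80 66};              /* hypothetical counts; sum is 200 */
pct       = 100 * count / sum(count);     /* percentage in each category     */
threshold = 5.0;

smallIdx = loc(pct < threshold);          /* categories below the threshold  */
if ncol(smallIdx) > 0 then do;
   merged    = category[smallIdx];        /* categories that would be merged */
   othersPct = sum(pct[smallIdx]);        /* size of the "Others" category   */
   print merged, othersPct;
end;
```

For these hypothetical counts, only the NR category (2% of the observations) falls below the threshold. The G category contains exactly 5% of the observations, so under the stated assumption it is not merged.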

The BarChart class can also access methods provided by any of its base classes. For example, the SetAxisLabel method is provided by the Plot class. Base classes are described in Chapter 6, "Understanding IMLPlus Classes."

7.4 Histograms

A histogram is a graphical representation of the distribution of a variable. A histogram bins the data and counts how many observations are contained in each bin. For example, Figure 7.4 divides the budgets of movies in the Movies data set into bins of 25 million dollars and displays the number of movies in each bin.

Figure 7.4. Histogram of Movie Budgets

7.4.1 Creating a Histogram from a Vector

You can create a histogram from a vector that contains univariate data. The following program shows one way to create a histogram of the Budget variable in the Movies data set, as shown in Figure 7.4:

/* create a histogram from a vector of data */
use Sasuser.Movies;
read all var {"Budget"};                        /* 1 */
close Sasuser.Movies;
declare Histogram hist;                         /* 2 */
hist = Histogram.Create("Budget", Budget);      /* 3 */
hist.SetAxisLabel(XAXIS, "Budget (million $)"); /* change axis label */

The program contains three main steps:

  1. Create the vector Budget from the data in the Budget variable.

  2. Use the declare keyword to specify that hist is an object of the Histogram class.

  3. Create the Histogram object from the data in the Budget vector. The first argument to this method is a string that names the associated data object; the second argument is the vector that contains the data.

The resulting histogram is shown in Figure 7.4. The histogram shows that about two-thirds of movies in the years 2005-2007 had budgets of less than 50 million dollars. The distribution of budgets has a long tail; several movies had budgets of more than 150 million dollars, and one movie had a budget in excess of 250 million dollars.
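
If you want to check a proportion that you read from the histogram, you can compute it directly from the vector. The following statements are a sketch that continues the previous program; the names isValid and propLow are hypothetical.

```sas
/* sketch: proportion of nonmissing budgets that are less than 50 (million $) */
isValid = (Budget ^= .);                               /* 1 for nonmissing values */
propLow = sum(isValid & (Budget < 50)) / sum(isValid);
print propLow;
```

The explicit test for nonmissing values matters because a missing value compares as less than any number in SAS/IML comparisons.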

7.4.2 Creating a Histogram from a Data Object

If the data are in a data object, you can create a similar histogram.

Create a data object from the Movies data by using the following statements:

/* create data object from SAS data set */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");

The following statements create a histogram:

/* create a histogram from a data object */
declare Histogram hist2;
hist2 = Histogram.Create(dobj, "Budget");

The program creates the Histogram object, hist2, from the data in the Budget variable. The first argument to this method is a data object; the second argument is the name of the variable that contains the data.

The histogram created in this way is linked to the data object and to all other graphs and data tables that share the same data object. It looks like Figure 7.4, except that the horizontal axis is labeled by the name of the Budget variable.

7.4.3 Frequently Used Histogram Methods

Each graph type has methods that control the appearance of the graph. Table 7.2 summarizes frequently used methods in the Histogram class.

Table 7.2. Frequently Used Methods in the Histogram Class

Method            Description
----------------  ----------------------------------------------------------
Histogram.Create  Creates a histogram
ReBin             Specifies the offset and width for the histogram bins
ShowBarLabels     Shows or hides labels of the count, percentage, or density
                  for each category
ShowDensity       Specifies whether the graph's vertical axis displays
                  counts or density
ShowPercentage    Specifies whether the graph's vertical axis displays
                  counts or percentages

To view the complete documentation for graph methods, select Help ▸ Help Topics from the SAS/IML Studio main menu, and then select the chapter titled "IMLPlus Class Reference."

The ShowBarLabels and ShowPercentage methods in the Histogram class behave identically to their counterparts in the BarChart class and are discussed in Section 7.3.4. The ShowDensity method scales the vertical axis so that the total area of the histogram bars is unity. This scale is useful when you want to overlay a parametric or kernel density estimate on the histogram, as described in the section "Case Study: Plotting a Density Estimate" on page 214.
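
The density scale can be sketched with a few lines of arithmetic. The bin counts below are hypothetical; the computation assumes the usual definition of a density histogram, in which the height of each bar is the bin count divided by the product of the sample size and the bin width.

```sas
/* sketch: convert bin counts to the density scale */
count = {12 30 25 8 5};            /* hypothetical bin counts        */
h = 25;                            /* bin width                      */
n = sum(count);                    /* number of observations         */
density = count / (n*h);           /* height of each bar             */
area = sum(density * h);           /* total area of the bars         */
print density, area;
```

The total area equals one (up to rounding error), which is the same scale that a probability density function uses. That is why a density estimate can be overlaid on a histogram only after the histogram is rescaled in this way.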

The SAS/IML Studio for SAS/STAT Users documentation has a chapter on "Adjusting Axes and Ticks" that describes how to use the ReBin method. The ReBin method is useful when you want to specify a new bin width for the histogram. For example, you might want bin widths for ages of adults to be an integral multiple of five. Alternatively, you might want to choose a bin width that is optimal for exhibiting features of the data density.

The following statements modify the histogram shown in Figure 7.4, which corresponds to the hist object. The program compares the default histogram bin widths (computed according to an algorithm by Terrell and Scott (1985)) with robust bin widths suggested by Freedman and Diaconis, as presented in Scott (1992, p. 75). The robust bin widths are computed as $2.603 \cdot \mathrm{IQR} \cdot n^{-1/3}$, where IQR is the interquartile range of the data and n is the number of nonmissing observations in the data.

The following statements continue the program in Section 7.4.1:

/* calculate new histogram bins */
/* get current anchor and bin width: Terrell and Scott (1985) */
x0 = hist.GetAxisTickAnchor(XAXIS);             /* 4 */
h0 = hist.GetAxisTickUnit(XAXIS);

/* Freedman-Diaconis robust rule (1981) */
nNonmissing = sum(Budget^=.);                   /* 5 */
q = quartile(Budget);                           /* 6 */
IQR = q[4] - q[2];
h = 2.603 * IQR * nNonmissing##(-1/3);          /* 7 */
print x0 h0 h;

h = round(h, 1);                                /* round to integer */
hist.ReBin(x0, h);                              /* 8 */

Figure 7.5. Tick Anchor, Default Bin Width, and New Bin Width

The previous statements implement five new steps, which are numbered to match the comments in the program:

  4. Get the current histogram anchor (x0) and bin width (h0). The GetAxisTickAnchor and GetAxisTickUnit methods are provided by the Plot class, which is a base class of the Histogram class.

  5. Compute the number of nonmissing observations (nNonmissing). Although the Budget variable does not contain missing values, it is a good idea to write programs that handle missing values, as discussed in the section "Handling Missing Values" on page 65.

  6. Call the QUARTILE module (which is part of the IMLMLIB module library) to compute the five-number summary of the data. The vector q contains the minimum, first quartile, second quartile (median), third quartile, and maximum values of the data. The interquartile range, IQR, is computed as q[4]-q[2].

  7. Compute the Freedman-Diaconis bin width. The bin width is approximately 15.9 and is stored in the matrix h.

  8. Call the ReBin method with the existing anchor and the new bin width. For this example, the bin width is rounded to the nearest integer so that the axis tick labels look nicer.

The revised histogram is shown in Figure 7.6. The histogram looks similar to Figure 7.4, but the new Freedman-Diaconis bin width is substantially smaller than the default bin width, as shown in Figure 7.5. With the smaller bin width, the distribution appears to be bimodal, with one peak near 20 million dollars and another smaller peak near 150 million dollars.

Figure 7.6. Rebinned Histogram of Movie Budgets

7.5 Scatter Plots

A scatter plot displays the values of two variables plotted against each other. In many cases it enables you to explore bivariate relationships in your data. For example, Figure 7.7 plots the value of the US_Gross variable versus the value of the Budget variable for movies in the Movies data set.

Figure 7.7. Scatter Plot of Gross Revenue versus Movie Budgets

7.5.1 Creating a Scatter Plot from Vectors

You can create a scatter plot from data in two vectors in much the same way as you create a histogram. The data in a scatter plot are usually continuous numeric data. However, the IMLPlus scatter plot also supports categorical data (including character data) for either axis.

The following program shows one way to create a scatter plot of the US_Gross variable versus the Budget variable:

/* create a scatter plot from vectors of data */
use Sasuser.Movies;
read all var {"Budget" "US_Gross"};
close Sasuser.Movies;

declare ScatterPlot p;
p = ScatterPlot.Create("Scatter", Budget, US_Gross);
p.SetAxisLabel(XAXIS, "Budget (million $)");
p.SetAxisLabel(YAXIS, "US Gross (million $)");

The program contains the same steps as the program in Section 7.4, except that this program creates an object of the ScatterPlot class. The resulting scatter plot is shown in Figure 7.7.

The scatter plot shows that these two variables are correlated: movies produced with relatively modest budgets typically bring in modest revenues, whereas big-budget movies often bring in larger revenues. However, the graph also shows movies that generated revenue that is disproportionate to their budgets. Some movies with small budgets generated a relatively large gross revenue. Other movies were disappointments: their revenue was only a fraction of the cost of producing the movie.
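
One way to quantify the linear association that the scatter plot displays is to compute the Pearson correlation coefficient. The following statements are a sketch that assumes SAS/IML 9.22 or later, in which the CORR function is available; the chapter itself does not report the value of the correlation.

```sas
/* sketch: Pearson correlation between Budget and US_Gross */
use Sasuser.Movies;
read all var {"Budget" "US_Gross"};
close Sasuser.Movies;

r = corr(Budget || US_Gross);              /* 2 x 2 correlation matrix   */
print (r[1,2])[label="Correlation"];       /* the off-diagonal element   */
```

A single correlation coefficient summarizes only the overall linear trend. The scatter plot remains the better tool for finding the individual movies whose revenue is disproportionate to their budget.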

7.5.2 Creating a Scatter Plot from a Data Object

If the data are in a data object, you can create a similar scatter plot.

Create a data object from the Movies data by using the following statements:

/* create data object from SAS data set */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");

The following statements create a scatter plot:

/* create a scatter plot from a data object */
declare ScatterPlot p2;
p2 = ScatterPlot.Create(dobj, "Budget", "US_Gross");

The program creates the ScatterPlot object, p2, from the data in the Budget and US_Gross variables. The first argument to this method is a data object; the second and third arguments are the names of the variables that contain the data.

The scatter plot created in this way is linked to the data object and to all other graphs and data tables that share the same data object.

7.6 Line Plots

A line plot displays the values of one or more variables plotted against another variable, which is often a time variable. A line plot is useful for investigating how quantities change over time. For example, Figure 7.8 plots the value of the Budget variable versus the value of the ReleaseDate variable for movies in the Movies data set.

Figure 7.8. Line Plot of Movie Budgets versus Date of Release

You can create a line plot from data in two (or more) vectors in the same way as you create a scatter plot. Often, each variable in a line plot is continuous. However, the IMLPlus line plot also supports categorical data (including character data) for either axis.

7.6.1 Creating a Line Plot for a Single Variable

The simplest line plot shows a single variable plotted against time.

7.6.1.1 Creating a Line Plot from Vectors

The following program shows one way to create a line plot of the Budget variable versus the ReleaseDate variable:

/* create a line plot from vectors of data */
use Sasuser.Movies;
read all var {"ReleaseDate" "Budget"};
close Sasuser.Movies;

declare LinePlot line;
line = LinePlot.Create("Line", ReleaseDate, Budget);
line.SetAxisLabel(XAXIS, "Release Date");
line.SetAxisLabel(YAXIS, "Budget (million $)");

The program contains the same steps as the program in Section 7.5. (Not shown are statements that place a DATE7. format on the horizontal axis; these statements are described in Section 7.10.) The resulting line plot is shown in Figure 7.8.

The line plot shows that movies with large budgets are often released in the May-July and November-December time periods.

7.6.1.2 Creating a Line Plot from a Data Object

If the data are in a data object, you can create a similar line plot.

Create a data object from the Movies data by using the following statements:

/* create data object from SAS data set */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");

The following statements create a line plot:

/* create a line plot from a data object */
declare LinePlot line2;
line2 = LinePlot.Create(dobj, "ReleaseDate", "Budget");

The program creates the LinePlot object, line2, from the data in the ReleaseDate and Budget variables. The first argument to this method is a data object; the second and third arguments are the names of the variables that contain the data.

The line plot created in this way is linked to the data object and to all other graphs and data tables that share the same data object.

7.6.2 Creating a Line Plot for Several Variables

Although Figure 7.8 shows a plot of a single Y variable versus time, you can use a line plot to display several Y variables simultaneously.

7.6.2.1 Creating a Line Plot from Vectors

The following statements create a graph that displays three functions:

/* create a line plot of several Y variables */
x = t(do(-3.3, 3.3, 0.1));    /* evenly spaced points (t=transpose) */
normal = pdf("normal", x);    /* evaluate normal density at x       */
t4 = pdf("t", x, 4);          /* evaluate t distrib with 4 d.o.f    */
t12 = pdf("t", x, 12);        /* evaluate t distrib with 12 d.o.f   */

declare LinePlot lp;
lp = LinePlot.Create("Line", x, normal || t12 || t4);
lp.ShowObs(false);            /* do not show observation markers    */
lp.SetLineWidth(2);           /* set all line widths to 2           */
lp.SetLineStyle(2, DASHED);   /* set style of second line           */
lp.SetLineStyle(3, DOTTED);   /* set style of third line            */

The graph is shown in Figure 7.9. The program uses the PDF function to evaluate three different probability density functions on a set of evenly spaced values: the standardized normal distribution, a Student's t distribution with four degrees of freedom, and a t distribution with 12 degrees of freedom. The line plot enables you to compare these three distributions. The markers are not displayed so that the three curves are easier to see. The line plot has several methods that enable you to control the color, style, and width of each line. In this program, the SetLineWidth method sets the width of all lines, and the SetLineStyle method sets the line styles for the second and third lines.

Figure 7.9. Line Plot of Probability Density Functions

7.6.2.2 Creating a Line Plot from a Data Object

The data for the previous line plot do not exist in a SAS data set. Nevertheless, you can create a data object from the data in the vectors and then create a line plot from the data object.

Assume that the x, normal, t4, and t12 vectors contain the data as defined in the previous section. You can create a data object for these data by using the following statements:

/* create data object from vectors */
varNames = {"x" "pdf_normal" "pdf_t12" "pdf_t4"};
declare DataObject dobjVec;
dobjVec = DataObject.Create("Data", varNames, x || normal || t12 || t4 );

Notice that the data object is created from a matrix by using the Create method. The first argument is an identifier for the data object. This string is used when the data object is written to a data set. It is also used in the title bar of a graph to identify the data object that underlies the graph. The second argument is a list of variable names. The third argument is a matrix of data; each column corresponds to a variable.

The following statements create a line plot:

/* create a line plot of several variables from a data object */
declare LinePlot lp2;
lp2 = LinePlot.Create(dobjVec, "x", {"pdf_normal" "pdf_t12" "pdf_t4"});

The program creates the LinePlot object, lp2, from variables in the data object. Each PDF variable is plotted against the x variable. The first argument to the Create method is a data object; the second and third arguments are the names of the variables that contain the data. You can continue the program by calling the ShowObs, SetLineWidth, and SetLineStyle methods that are used in the previous section.
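
For example, the following statements sketch that continuation. They repeat the method calls from Section 7.6.2.1, applied to the lp2 object instead of the lp object:

```sas
/* sketch: apply the same line attributes to the plot created from dobjVec */
lp2.ShowObs(false);            /* do not show observation markers */
lp2.SetLineWidth(2);           /* set all line widths to 2        */
lp2.SetLineStyle(2, DASHED);   /* second line: pdf_t12            */
lp2.SetLineStyle(3, DOTTED);   /* third line: pdf_t4              */
```

The line numbers follow the order of the Y variables in the Create call, so the second line is the t distribution with 12 degrees of freedom and the third line is the t distribution with four degrees of freedom.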

The line plot created in this way is linked to the data object and to all other graphs and data tables that share the same data object.

7.6.3 Creating a Line Plot with a Classification Variable

The LinePlot class also enables you to create a line plot by specifying a single Y variable, one or more classification variables, and a time variable. The joint levels of the classification variables determine which observations are joined in the line plot. This is useful for comparing two or more groups. Line plots of this type are created by using the CreateWithGroup method of the LinePlot class.

7.6.3.1 Creating a Line Plot from Vectors

Recall that the Sex, Violence, and Profanity variables in the Movies data set represent a measurement of the amount of sexual content, violence, and profane language (respectively) for each movie, as judged by the raters at the kids-in-mind.com Web site. It is interesting to investigate how these quantities relate to the MPAA rating of a movie.

The following statements create a line plot that enables you to compare the values of a quantity as a function of time for movies rated G, PG, PG-13, and R. The quantity is the total amount of sexual content, violence, and profane language for each movie.

/* create a line plot by specifying a classification (group) variable   */
use Sasuser.Movies where (MPAARating^="NR");
read all var {"ReleaseDate" "MPAARating" "Sex" "Violence" "Profanity"};
close Sasuser.Movies;

Total = Sex + Violence + Profanity;     /* score for "adult situations" */

declare LinePlot lpg;
lpg = LinePlot.CreateWithGroup("Line",
                               ReleaseDate, /* x coordinate             */
                               Total,       /* y quantity to plot       */
                               MPAARating   /* classification variable  */
                               );
lpg.SetAxisLabel(XAXIS, "Release Date");
lpg.SetAxisLabel(YAXIS, "Sex + Violence + Profanity");
lpg.SetLineWidth(2);
lpg.SetLineStyle(2, DASHED);
lpg.SetLineStyle(3, DOTTED);

The graph is shown in Figure 7.10. The Total vector contains the sum of the Sex, Violence, and Profanity vectors. This is a measure of the level of "adult situations" in each movie. As shown in Figure 7.10, the R-rated movies have the highest average adult situation score, with a mean near 20. Movies rated PG-13 have a mean score near 13, whereas PG-rated movies have a mean score near 7.5. The G-rated movies have a mean score of about 5. The four time series appear to be stationary in time. That is, there do not seem to be any trends in the time period spanned by the data.
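
The group means that are estimated from the graph can also be computed directly. The following statements are a sketch that continues the previous program and uses the UNIQUE and LOC functions; the names u, idx, v, and meanTotal are hypothetical.

```sas
/* sketch: mean of the Total score within each MPAA rating category */
u = unique(MPAARating);              /* sorted unique categories       */
meanTotal = j(1, ncol(u), .);        /* allocate vector of group means */
do i = 1 to ncol(u);
   idx = loc(MPAARating = u[i]);     /* observations in i_th category  */
   v = Total[idx];                   /* scores for that category       */
   meanTotal[i] = v[:];              /* mean of the group              */
end;
print meanTotal[colname=u];
```

The loop visits each rating category once, extracts the scores for that category, and uses the subscript reduction operator to compute their mean.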

The graph also indicates that the MPAA rating board and the raters at the kids-in-mind.com Web site appear to be using consistent standards in evaluating movies. Consequently, the Total vector might serve as a useful discriminant function for predicting a movie's MPAA rating based on the measures of sexual content, violence, and profanity as determined by the Web site. For example, you could use the Web site ratings to predict MPAA ratings for the three movies in the data that were not rated by the MPAA.

Lastly, the graph shows that there was a two-month period (March-April, 2007) during which no PG- or G-rated movies were released.

Figure 7.10. Line Plot of Adult Content for Each MPAA Rating

7.6.3.2 Creating a Line Plot from a Data Object

If the data are in a data object, you can create a similar line plot. There are two ways to create a data object that contains the Total variable, which is the sum of other variables. The simplest approach (used in this section) is to use the DATA step to create a new data set that contains the Total variable. (The alternative approach, used in the section "Variable Transformations" on page 181, is to use the GetVarData and AddVar methods in the DataObject class.)

Recall from Chapter 4 that you can use the SUBMIT statement to execute SAS statements. The following statements call the DATA step and create a data object from the resulting data set:

/* use the DATA step to transform a variable */
submit;
data NewMovies;
   set Sasuser.Movies;
   Total = Sex + Violence + Profanity; /* score for "adult situations" */
run;
endsubmit;

declare DataObject dobjNew;
dobjNew = DataObject.CreateFromServerDataSet("Work.NewMovies");

The following statements create a line plot:

/* create a line plot with a group variable from a data object */
declare LinePlot lpg2;
lpg2 = LinePlot.CreateWithGroup(dobjNew,
                             "ReleaseDate", /* x coordinate            */
                             "Total",       /* y quantity to plot      */
                             "MPAARating"   /* classification variable */
                             );

The program creates the LinePlot object, lpg2, from variables in the data object. The first argument to the CreateWithGroup method is the data object. The second and third arguments are the names of the variables that contain the data. The fourth argument names the variables whose levels define the groups used in the graph.

The line plot created in this way is linked to the data object and to all other graphs and data tables that share the same data object.
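
The program in Section 7.6.3.1 uses a WHERE clause to exclude movies that are not rated, whereas the DATA step in this section keeps every movie. If you want the two line plots to show the same groups, one option is to apply the same condition in the DATA step. The following statements are a sketch; the names RatedMovies, dobjRated, and lpg3 are hypothetical.

```sas
/* sketch: exclude "not rated" movies before creating the data object */
submit;
data RatedMovies;
   set Sasuser.Movies(where=(MPAARating ^= "NR"));
   Total = Sex + Violence + Profanity; /* score for "adult situations" */
run;
endsubmit;

declare DataObject dobjRated;
dobjRated = DataObject.CreateFromServerDataSet("Work.RatedMovies");

declare LinePlot lpg3;
lpg3 = LinePlot.CreateWithGroup(dobjRated, "ReleaseDate", "Total", "MPAARating");
lpg3.SetAxisLabel(XAXIS, "Release Date");
lpg3.SetAxisLabel(YAXIS, "Sex + Violence + Profanity");
```

Filtering in the DATA step keeps the data object small and ensures that any other graph created from dobjRated also excludes the unrated movies.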

7.6.4 Frequently Used Line Plot Methods

Table 7.3 summarizes frequently used methods in the LinePlot class.

Table 7.3. Frequently Used Methods in the LinePlot Class

Method                    Description
------------------------  ------------------------------------------------
LinePlot.Create           Creates a line plot with one or more Y variables
LinePlot.CreateWithGroup  Creates a line plot in which each line is
                          determined by categories of a grouping variable
AddVar                    Adds a new Y variable to an existing line plot
ConnectPoints             Specifies whether to connect observations for
                          each line
SetLineAttributes         Specifies the color, style, and width of a line
SetLineColor              Specifies the color of a line
SetLineStyle              Specifies the style of a line
SetLineWidth              Specifies the width of a line
SetLineMarkerShape        Specifies the marker shape for a line
ShowPoints                Specifies whether to show the markers for a line

To view the complete documentation for graph methods, select Help ▸ Help Topics from the SAS/IML Studio main menu, and then select the chapter titled "IMLPlus Class Reference."

7.7 Box Plots

A box plot is a schematic representation of the distribution of a variable. The left graph in Figure 7.11 shows a box plot for the US_Gross variable restricted to the G-rated movies in the Movies data set. (The left graph also shows the default appearance of a box plot in SAS/IML Studio.) The right graph shows a box plot for the same data, but labels important features of the box plot.

A box plot enables you to find the five-number summary of the data: the minimum value, the 25th percentile, the 50th percentile, the 75th percentile, and the maximum value. (The 25th, 50th, and 75th percentiles are also called the first quartile (Q1), the median, and the third quartile (Q3), respectively.) For example, Figure 7.11 indicates the five-number summary for the US gross revenues for G-rated movies. The minimum value is about 8 million dollars. The quartiles of the data are approximately 25, 55, and 85 million dollars. The maximum value is about 245 million dollars.
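
The five-number summary can also be computed directly from the data. The following statements are a minimal sketch, not part of the original program: they assume a SAS/IML release that supports the QNTL call and assume that the MPAARating variable uses the value "G" for G-rated movies. Because software packages use slightly different definitions of sample quantiles, the printed values might differ slightly from the positions that you read from the graph.

```sas
/* sketch: compute a five-number summary for the G-rated movies */
use Sasuser.Movies;
read all var {"US_Gross"} where(MPAARating = "G");  /* assumed level name */
close Sasuser.Movies;

call qntl(q, US_Gross, {0.25, 0.5, 0.75});     /* Q1, median, Q3 */
fiveNum = min(US_Gross) // q // max(US_Gross); /* stack into a 5 x 1 vector */
print fiveNum[rowname={"Min" "Q1" "Median" "Q3" "Max"}];
```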

A box plot has a wide central box that contains 50% of the data. The lower and upper edges of the box are positioned at the 25th and 75th percentiles of the data, respectively. A line segment inside the box indicates the median value of the data. The height of the box indicates the interquartile range (IQR), which is a robust estimate of the scale of the data.

Above and below the main box are two thinner boxes that are called whiskers. The lengths of the whiskers are determined by the IQR. Specifically, the upper (respectively, lower) whisker extends to an observation whose distance from the 75th (respectively, 25th) percentile does not exceed 1.5 IQR. If there are observations whose distance to the main box exceeds 1.5 IQR, these observations are plotted individually and are called univariate outliers.
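
The whisker rule can be sketched in a few SAS/IML statements. The example is illustrative only: it assumes that the QNTL call is available and that the US_Gross variable has no missing values. The names lowerFence, upperFence, inIdx, and outIdx are introduced here for the illustration and do not appear elsewhere in the chapter.

```sas
/* sketch: locate the whisker ends and the outliers for a vector of data */
use Sasuser.Movies;
read all var {"US_Gross"};
close Sasuser.Movies;

call qntl(q, US_Gross, {0.25, 0.75});    /* q[1] = Q1, q[2] = Q3 */
IQR = q[2] - q[1];
lowerFence = q[1] - 1.5*IQR;             /* limits for the whiskers */
upperFence = q[2] + 1.5*IQR;

/* assumes that US_Gross has no missing values */
inIdx  = loc(US_Gross >= lowerFence & US_Gross <= upperFence);
outIdx = loc(US_Gross <  lowerFence | US_Gross >  upperFence);

lowerWhisker = min(US_Gross[inIdx]);     /* most extreme values inside the fences */
upperWhisker = max(US_Gross[inIdx]);
print lowerWhisker upperWhisker;
if ncol(outIdx) > 0 then
   print (US_Gross[outIdx])[label="Outliers"];
```

Each whisker ends at an observed data value, not at the fence itself, which is why the sketch takes the minimum and maximum of the observations that lie inside the fences.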

Figure 7.11. Main Features of a Box Plot

7.7.1 Creating a Box Plot

You can create a box plot that schematically shows the distribution of a single variable. Figure 7.12 shows a box plot for the US_Gross variable for all movies in the Movies data set.

7.7.1.1 Creating a Box Plot from a Vector

The following program shows one way to create the box plot shown in Figure 7.12:

/* create a box plot from a vector of data */
use Sasuser.Movies;
read all var {"US_Gross"};                          /* 1 */
close Sasuser.Movies;

declare BoxPlot box;                                /* 2 */
box = BoxPlot.Create("Box", US_Gross);              /* 3 */
box.SetAxisLabel(YAXIS, "US Gross (million $)");

The program contains three main steps:

  1. Create the vector US_Gross from the data in the US_Gross variable.

  2. Use the declare keyword to specify that box is an object of the BoxPlot class.

  3. Create the BoxPlot object from the data in the US_Gross vector. The first argument to this method is a string that names the associated data object; the second argument is the vector that contains the data.

Figure 7.12. A Box Plot of Gross Revenue

7.7.1.2 Creating a Box Plot from a Data Object

If the data are in a data object, you can create a similar box plot.

Create a data object from the Movies data by using the following statements:

/* create data object from SAS data set */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");

The following statements create a box plot of the US gross revenues of all movies:

/* create a box plot from a data object */
declare BoxPlot box2;
box2 = BoxPlot.Create(dobj, "US_Gross");

The program creates the BoxPlot object, box2, from the data in the US_Gross variable. The first argument to this method is a data object; the second argument is the name of the variable that contains the data.

The box plot created in this way is linked to the data object and to all other graphs and data tables that share the same data object.

7.7.2 Creating a Grouped Box Plot

If there is a grouping variable in the data, you can create a box plot for the data in each group.

7.7.2.1 Creating a Grouped Box Plot from Vectors

The following statements create a box plot for the US_Gross variable, grouped by each MPAA category in the Movies data set:

/* create a box plot from vectors of data */
use Sasuser.Movies;
read all var {"MPAARating" "US_Gross"};
close Sasuser.Movies;

declare BoxPlot bg;
bg = BoxPlot.Create("Box Plot", MPAARating, US_Gross);
bg.SetAxisLabel(XAXIS, "MPAA Rating");
bg.SetAxisLabel(YAXIS, "US Gross (million $)");

The box plot is shown in Figure 7.13. It shows features of the distribution of the US_Gross variable for each ratings category. You can see that the movies rated G, PG, and PG-13 have similar median US gross revenues. Furthermore, the first and third quartiles of the data are similar for those movies. However, the PG and PG-13 movies seem to have more extreme values than the G-rated movies. The R-rated movies generate comparatively less revenue, as seen in the box plot by the smaller median and Q3 value. There are only a few NR (not rated) movies, but these movies did not generate large revenues when compared with the other rating categories.
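
To check a visual comparison such as this one, you can compute the quartiles for each rating category. The following statements are a sketch under the assumption that the QNTL call is available; the matrix named stats is introduced only for this illustration.

```sas
/* sketch: compute quartiles of US_Gross for each MPAA rating category */
use Sasuser.Movies;
read all var {"MPAARating" "US_Gross"};
close Sasuser.Movies;

u = unique(MPAARating);                  /* sorted row vector of categories */
stats = j(ncol(u), 3, .);                /* one row per category */
do i = 1 to ncol(u);
   idx = loc(MPAARating = u[i]);         /* observations in the i_th category */
   call qntl(q, US_Gross[idx], {0.25, 0.5, 0.75});
   stats[i,] = q`;                       /* store Q1, median, Q3 as a row */
end;
print stats[rowname=u colname={"Q1" "Median" "Q3"}];
```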

Figure 7.13. A Box Plot of Gross Revenue versus MPAA Rating

7.7.2.2 Creating a Grouped Box Plot from a Data Object

If the data are in a data object, you can create a similar box plot.

Create a data object from the Movies data by using the following statements:

/* create data object from SAS data set */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");

The following statements create a box plot:

/* create a box plot with a group variable from a data object */
declare BoxPlot bg2;
bg2 = BoxPlot.Create(dobj, "MPAARating", "US_Gross");

The program creates the BoxPlot object, bg2, from the data in the MPAARating and US_Gross variables. The first argument to this method is a data object; the second and third arguments are the names of variables that contain the data.

The box plot created in this way is linked to the data object and to all other graphs and data tables that share the same data object.

7.7.3 Frequently Used Box Plot Methods

The BoxPlot class has methods that control the appearance of box plots. Table 7.4 summarizes the frequently used methods in the BoxPlot class.

Table 7.4. Frequently Used Methods in the BoxPlot Class

Method             Description
BoxPlot.Create     Creates a box plot
SetWhiskerLength   Specifies the length of the box plot whiskers as a multiple of the interquartile range
ShowMeanMarker     Specifies whether to overlay the mean and standard deviation on the box plot
ShowNotches        Specifies whether to display a notched box plot. This variation of the box plot indicates that medians of two box plots are significantly different at approximately the 0.05 significance level if the corresponding notches do not overlap.

The complete set of BoxPlot methods is documented in the SAS/IML Studio online Help.

You can use the SetWhiskerLength method to control which observations are plotted as outliers. The default multiplier is 1.5. If you set the whisker length to zero, then all observations in the first and fourth quartiles are plotted. In contrast, if you set the whisker length to 3, then only extreme outliers are explicitly plotted.
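
To see how the multiplier affects the number of plotted outliers, you can count the observations that fall outside the whisker limits for several multipliers. This is a sketch only: it assumes that the QNTL call is available and that US_Gross has no missing values, and the names k and numOut are hypothetical.

```sas
/* sketch: count the outliers for several whisker-length multipliers */
use Sasuser.Movies;
read all var {"US_Gross"};
close Sasuser.Movies;

call qntl(q, US_Gross, {0.25, 0.75});    /* q[1] = Q1, q[2] = Q3 */
IQR = q[2] - q[1];
k = {0, 1.5, 3};                         /* multipliers to compare */
numOut = j(nrow(k), 1, 0);
do i = 1 to nrow(k);
   isOut = (US_Gross < q[1] - k[i]*IQR) | (US_Gross > q[2] + k[i]*IQR);
   numOut[i] = sum(isOut);               /* number of plotted outliers */
end;
print k numOut;
```

For the multiplier 0, the count is the number of observations that lie strictly outside the central box.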

A box plot shows the median and IQR for the data, but sometimes it is useful to compare these statistics to their nonrobust counterparts, the mean and the standard deviation. You can use the ShowMeanMarker method to overlay a line segment that represents the mean of the data, and an ellipse (or diamond) that indicates the standard deviation.
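
The same comparison can be made numerically. The following sketch computes the mean and standard deviation from basic matrix operations and the median and IQR from the QNTL call (assumed to be available). It assumes that US_Gross has no missing values; the names xbar, sd, and med are introduced for this illustration.

```sas
/* sketch: compare robust and nonrobust estimates of location and scale */
use Sasuser.Movies;
read all var {"US_Gross"};
close Sasuser.Movies;

n = nrow(US_Gross);
xbar = US_Gross[:];                          /* mean of the data */
sd = sqrt( ssq(US_Gross - xbar) / (n-1) );   /* sample standard deviation */

call qntl(q, US_Gross, {0.25, 0.5, 0.75});
med = q[2];                                  /* median */
IQR = q[3] - q[1];                           /* interquartile range */
print xbar sd, med IQR;
```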

The box plot displays the sample median. You can use a notched box plot to compare the median values of two groups: the medians are different (at approximately a 0.05 significance level) if their notched regions do not overlap.

The following statements modify Figure 7.13 to illustrate the methods in this section. The result is shown in Figure 7.14.

bg.SetWhiskerLength(3);
bg.ShowMeanMarker();
bg.ShowNotches();

Figure 7.14. Modified Box Plot of Gross Revenue versus MPAA Rating

7.8 Summary of Graph Types

SAS/IML Studio provides other statistical graphs, but this book focuses on the simple bar chart, histogram, scatter plot, line plot, and box plot. For completeness, Table 7.5 lists the graph types available in SAS/IML Studio.

Each graph is described in the SAS/IML Studio User's Guide. The methods available for each graph are described in the online Help. To view the online Help, select Help ► Help Topics from the SAS/IML Studio main menu, and then select the chapter titled "IMLPlus Class Reference."

Table 7.5. IMLPlus Graphs

Graph          Comments
BarChart       Described in Section 7.3, "Bar Charts."
BoxPlot        Described in Section 7.7, "Box Plots."
ContourPlot    Useful when you want to visualize a fitted surface or a regression model with two explanatory variables. This graph uses a simple contouring algorithm that is not suitable for noisy or highly correlated data.
Histogram      Described in Section 7.4, "Histograms."
LinePlot       Described in Section 7.6, "Line Plots."
MosaicPlot     Useful in understanding the relationships between two or three categorical variables. If you analyze categorical data, learn how to interpret this graph.
PolygonPlot    Useful in creating interactive maps.
RotatingPlot   A three-dimensional scatter plot. It also has the capability to plot a fitted surface or a regression model with two explanatory variables.
ScatterPlot    Described in Section 7.5, "Scatter Plots."

7.9 Displaying the Data Used to Create a Graph

Each section of this chapter contains an example of how to create a graph from vectors. What is not initially apparent is that creating a graph from vectors also creates an object of the DataObject class that contains the contents of the vector.

In SAS/IML Studio you can display a tabular view of the data that underlies a graph by pressing F9 when the graph is the active window. You can click in a graph window to make it the active window. The title bar of the active window usually has a different color than the title bars of inactive windows.

For example, Figure 7.2 shows a bar chart that is created from a vector. If you press F9 in the bar chart, SAS/IML Studio displays the data table in Figure 7.15. Notice that the data are the contents of the vector that is used to create the graph. Other variables that are in the Movies data set are not present.

Figure 7.15. Data Table for the Bar Chart of Movie Ratings

In contrast, Section 7.3.2 shows how to create a data object directly from the Movies data set and then create the bar chart from the data object. If you press F9 in the bar chart that is created in this way, then you see a tabular view of the entire Movies data set. Furthermore, the second approach preserves variable properties such as formats and labels.

7.10 Changing the Format of a Graph Axis

Suppose that a variable in a SAS data set has a SAS format (for example, the DATE7. format). How can you get the scatter plot to display the formatted values shown in Figure 7.17? There are two ways. The preferred way is to create the scatter plot from an object of the DataObject class; the scatter plot automatically uses the format that is associated with a variable in a data object.

However, if you use the READ statement to create a SAS/IML vector from a variable with a SAS format, the vector contains the raw data, not the formatted values. Consequently, a graph that you create from that vector does not display any formatted values. For example, the following program creates Figure 7.16:

/* create a scatter plot from vectors of data */
use Sasuser.Movies;
read all var {"ReleaseDate" "Budget"};
close Sasuser.Movies;

declare ScatterPlot p;
p = ScatterPlot.Create("Scatter", ReleaseDate, Budget);
p.SetAxisLabel(XAXIS, "Release Date");
p.SetAxisLabel(YAXIS, "Budget (million $)");

Figure 7.16. Scatter Plot of Movie Budgets versus Date of Release

Notice that each tick mark on the horizontal axis is a numerical value. These are representative of the data in the ReleaseDate vector. When a DATEw. format is applied to these data, they are displayed as dates. Figure 7.16 would be more understandable if the ticks on the horizontal axis displayed dates. The following statements use the DATE7. format to print the date values that are displayed on the horizontal axis of the scatter plot:

x = do(16500, 17500, 250);      /* row vector of sequential values */
print x, x[format=DATE7.];

Figure 7.17. Result of Applying the DATE7. Format
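
If you want to relate a tick value to a calendar date in the other direction, you can call Base SAS functions from a SAS/IML program. The following statements are a small sketch that uses the MDY and PUTN functions; the names d and s are hypothetical.

```sas
/* sketch: convert between calendar dates and raw SAS date values */
d = mdy(1, 1, 2006);          /* raw SAS date value for 01JAN2006 */
s = putn(d, "DATE7.");        /* character string that contains the formatted value */
print d s;
```

A SAS date value is the number of days since 01JAN1960, which is why the unformatted tick marks in Figure 7.16 are large integers.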

Recall from Section 7.9 that every graph has a data object that is associated with it. There is a method that enables you to get the object of the DataObject class that is associated with a graph or data table. The method is named GetDataObject, and it is implemented in the DataView class. (Recall from Section 6.9 that all graph classes and the DataTable class are derived from the DataView class.) You can then use the SetVarFormat method of the DataObject class to set a format for the horizontal variable, which is named X. You can set a format for the Y variable in the same way, as shown in the following statements:

/* set formats for variables in a data object */
declare DataObject dobj;
dobj = p.GetDataObject();
dobj.SetVarFormat("X", "DATE7.");
dobj.SetVarFormat("Y", "DOLLAR4.");

Because graphs and data tables automatically respond to changes made to the associated data object, the scatter plot updates its axes as shown in Figure 7.18. The scatter plot now more clearly shows the release dates for movies released during 2005-2007.

Figure 7.18. Scatter Plot with Formatted Tick Labels

You can use the same technique to change other attributes of the graph. For example, you can change a marker's shape or color by calling a method in the associated data object. These techniques are described in Chapter 10, "Marker Shapes, Colors, and Other Attributes of Data."

7.11 Summary of Creating Graphs

You can create graphs in two ways: from SAS/IML vectors and from an object of the DataObject class. It is the DataObject class that enables you to create dynamically linked graphs. Graphs that are created from the same data object are automatically linked to each other.

Creating a graph from SAS/IML vectors is quick and often convenient, but it has the following drawbacks:

  1. The graphs that are created from vectors initially display "X" for the label of the horizontal axis. You can use the SetAxisLabel method to explicitly set a label, but this is somewhat inconvenient.

  2. If a variable has a format, that format is lost when the variable is read into a SAS/IML vector. For example, the graphs in this chapter that use the ReleaseDate variable must explicitly set a DATEw. format if the dates are to be displayed correctly.

  3. Any variable not specified when creating the graph is lost. For example, you cannot label observations by the names of movies in order to better identify interesting observations.

  4. Two graphs created from the same data are completely independent. When you create graphs from vectors, you cannot select PG-rated movies in a bar chart and see those same movies highlighted in a second graph, as shown in Figure 6.5.

Graphs created from a data object do not suffer from these drawbacks. IMLPlus graphics created from a common data object are dynamically linked to each other and automatically display variable names and formats. Most of the graphs in this book are created from a data object.
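
As a brief illustration of the second approach, the following sketch creates two graphs from one data object so that observations selected in one graph are highlighted in the other. The BoxPlot.Create call matches the form used in Section 7.7.2.2; the BarChart.Create call is assumed to follow the data-object form described in Section 7.3.2.

```sas
/* sketch: two linked graphs that share a single data object */
declare DataObject dobj;
dobj = DataObject.CreateFromServerDataSet("Sasuser.Movies");

declare BarChart bar;
bar = BarChart.Create(dobj, "MPAARating");    /* assumed form from Section 7.3.2 */

declare BoxPlot box;
box = BoxPlot.Create(dobj, "MPAARating", "US_Gross");
```

Because both graphs read from dobj, selecting the PG bar in the bar chart should highlight the PG-rated movies in the box plot as well.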

7.12 References

Scott, D. W. (1992), Multivariate Density Estimation: Theory, Practice, and Visualization, New York: John Wiley & Sons.

Terrell, G. R. and Scott, D. W. (1985), "Oversmoothed Nonparametric Density Estimates," Journal of the American Statistical Association, 80, 209-214.
