Chapter 4: Data Discovery with Multivariate Data

Introduction

Use Tables to Explore Multivariate Data

PivotTables

Tabulate in JMP

Use Graphs to Explore Multivariate Data

Graph Builder

Scatterplot

Explore a Larger Data Set

Trellis Chart

Bubble Plot

Explore a Real-World Data Set

Use Graph Builder to Examine Results of Analyses

Generate a Trellis Chart and Examine Results

Use Dynamic Linking to Explore Comparisons in a Small Data Subset

Return to Graph Builder to Sort and Visualize a Larger Data Set

Introduction

Most data sets in the real world are multivariate. That is, they contain more than two variables. Generally speaking, a data set can be viewed conceptually as shown in Figure 4.1, where the columns represent variables and the rows are objects. The rows, or objects, are the perspective from which the observations or measurements were taken (for example, by event, case, transaction, person, or company). The columns, or variables, are the characteristics by which the objects were measured. Multivariate data analysis is when more than two variables are analyzed simultaneously.

Figure 4.1: A Conceptual View of a Data Set

image

Figure 4.2 provides a framework of statistical and visual methods for analyzing multivariate data. The initial multivariate analytical steps should be the data discovery of relationships through tables and graphs. Some of this data discovery will include univariate descriptive statistics or distribution analysis and perhaps some bivariate analyses (for example, scatterplots and contingency tables, as discussed in Chapter 2). In this chapter, you will explore some of the multivariate tables and graphs to consider.

Figure 4.2: A Framework for Multivariate Analysis

image

As shown in Figure 4.2, there are numerous multivariate statistical and data mining techniques available to analyze multivariate data. Several of the more popular multivariate techniques and some of the data mining techniques, will be discussed in this text:

   principal components analysis

   cluster analysis

   multiple regression

   ANOVA

   logistic regression

    generalized regression

   decision trees

   K-nearest neighbors

   neural networks

   bootstrap forests

   boosted trees

Generally speaking, these techniques are categorized as interdependence and dependence techniques (Figure 4.2). The interdependence techniques examine relationships either between the variables (columns) or the observations (rows) without considering causality. With dependence techniques, one or more variables are identified as the dependent variables, and it is assumed that X variables cause Y values. The objective of these dependence techniques is to examine and measure the relationship between other (independent) variables and the dependent variable or variables. Two dependence techniques sometimes covered in an introductory statistics course, multiple regression and ANOVA, will be reviewed in Chapter 5.

One last word of caution and advice before you start your multivariate journey. (This is just a reiteration of the sixth fundamental concept discussed in Chapter 2.) All the many discovery tools and the numerous statistical techniques cannot guarantee useful insights until you try them. Successful results really depend on the data set. On the other hand, there are times when it is inappropriate to use a tool or technique. Throughout the following chapters, you will learn when and when not to use a tool or technique.

Use Tables to Explore Multivariate Data

Row and column tables, contingency tables, crosstabs, Excel PivotTables, and JMP Tabulate are basic OLAP tools that are used to produce tables to examine or summarize relationships between two or more variables. To illustrate, see the survey data from 20 students in the file Countif.xls and in the worksheet rawdata from Chapter 2, as shown in Table 2.2.

PivotTables

A PivotTable is one of the most powerful tools in Excel. It enables you to summarize and explore data. To generate a PivotTable in Excel, begin as follows:

1.   Highlight cells A1:G23.

2.   Click Insert on the top menu. A new set of icons appears.

3.   All the way to the left, click PivotTable. When the dialog box appears, click OK. A new worksheet will appear similar to Figure 4.3.

Figure 4.3 PivotTable Worksheet

image

Continue with the following steps:

1.   In the upper part of the PivotTable Field List box (on the right side of the worksheet, Figure 4.3), drag Major all the way to the left, in Column A, where it says Drop Row Fields Here, and release the mouse button.

2.   Similarly, drag Gender to where it says Drop Column Fields Here. Notice back in the upper part of the PivotTable Field List box (Figure 4.3), under Choose fields to add to report, that Major and Gender have check marks next to them. In the lower part of the box, under Column labels, see Gender. Under Row labels, see Major.

3.   Drag Salary to the left where it says Drop Value Fields Here. Salary in the upper part of the PivotTable Field List box is now checked.

4.   Below, under ∑ Values, is the Sum of Salary. Click the drop-down menu for the Summary of Salary.

5.   Click the Value Field Setting, and a new dialog box similar to Figure 4.4 will appear.

6.   In the dialog box, you can change the displayed results to various summary statistics. Click Average (notice that the Custom Name changes to Average of Salary). Click OK.

Figure 4.4: Value Field Setting Dialog Box

image

As the JMP diagram in the Fit Y by X dialog box directs you (see Figure 2.12), the rows and columns are categorical data in a contingency table. This is, in general, the same for these OLAP tools. The data in the PivotTable can be further sliced by dragging a categorical variable to the Drop Report Filter Fields Here area.

The resulting PivotTable should look similar to Figure 4.5, which shows the average salary by Gender and Major.

Figure 4.5: Resulting Excel PivotTable

image

Tabulate in JMP

To generate a similar table in JMP, open Countif.jmp. Select AnalyzeTabulate. The Tabulate dialog box appears (Figure 4.6).

Figure 4.6: The Tabulate Dialog Box

image

Then complete the following steps:

1.   Drag Salary to the Resulting Cells area. The table now shows the sum of all the salaries (1078375).

2.   In the Control Panel, drag Mean to the right and on top of the word Sum. The average salary is 49017.0454.

3.   Drag Gender to Salary, where it now shows Salary (Female Mean=45695.555556 and Male Mean=51316.538462).

4.   Finally, drag Major to the white square to the left of the Mean for Females (45695.55).

5.   Click the red triangle and click Show Chart. The table and chart should now look like Figure 4.7.

If you want to copy and paste the JMP table and chart into Microsoft Word or Microsoft PowerPoint, then do as follows:

1.   Move your mouse to the top horizontal lines near the top of the Tabulate dialog box (see Figure 4.7). The result is the activation of the Menu and Toolbars(see Figure 4.7). A row of icons will appear as shown in Figure 4.7.

2.   Click the Selection icon as shown in Figure 4.7.

3.   Click on the table, just outside its border, right-click the mouse, and select Copy.

4.   Go to a Word or PowerPoint file and paste them in. The table and chart will appear like Figure 4.8.

Figure 4.7: Selection Button to Copy the Output

image

Figure 4.8: Resulting Copy of the Tabulate Table and Chart (into a Microsoft Word Document)

image

To copy and paste JMP results from other JMP tools or techniques, you follow this same process.

Use Graphs to Explore Multivariate Data

Graphical presentation tools provide powerful insights into multivariate data. JMP provides a large selection of visualization tools beyond the simple XY scatterplots, bar charts, and pie charts. One such tool that is extremely useful in exploring multivariate data is the Graph Builder.

Graph Builder

Initially, look at and use the Graph Builder with the Countif.jmp data set.

Select GraphGraph Builder. The Graph Builder dialog box will appear, as in Figure 4.9. To the left of the dialog box, there is a window that has a list of your variables, and to the right is a graphical sandbox in which you can design your graph. As noted in Figure 4.9, there are several areas called drop zones where you can drop the variables. You can bring over to these drop zones either continuous or categorical (ordinal and nominal) data.

Figure 4.9: Graph Builder Dialog Box

image

Complete the following steps:

1.   Click Salary and drag it around to the different drop zones. The graph immediately reacts as you move over the drop zones.

2.   Release Salary in the Y drop zone area.

3.   Click Years, and again drag it around to the different drop zones and observe how the graph reacts.

Notice the following:

   When you move Years to Group X, you get several vertical graphs. This is further notice that years are grouped—or put into intervals (Figure 4.10).

   When you move Years to Group Y, you will get several horizontal graphs (Figure 4.11).

   When you move Years to Wrap and Overlay, Years is grouped into Year categories (Figures 4.12 and 4.13).

Figure 4.10 Graph of Salary and Year When Year Is in Drop Zone Group X

image

Figure 4.11 Graph of Salary and Year When Year Is in Drop Zone Group Y

image

Figure 4.12 Graph of Salary and Year When Year Is in Drop Zone Wrap

image

Figure 4.13 Graph of Salary and Year When Year Is in Drop Zone Overlap

image

Next, complete the following steps to produce a trellis chart of Major, Gender by Usefulness:

1.   Release Years in the X drop zone.

2.   Click Major and drag it around to the drop zones.

3.   Move Major over the Y and X drop zones, and observe how you can add another variable to those drop zones.

4.   Release Major in the Group X drop zone.

5.   Click Gender and drag it to the Group Y drop zone.

You now see four graphs as shown in Figure 4.14. You can quickly see that the Business students have higher salaries and are older (more years since they graduated).

Figure 4.14: Salary versus Year, by Major and Gender

image

Scatterplot

Another useful visualization tool for providing insights to multivariate data is the scatterplot matrix. Data that consists of k variables would require k(k−1) two-dimensional XY scatterplots. Thus, every combination of two variables is in an XY scatterplot. The upper triangle of graphs is the same as the lower triangle of graphs, except that the coordinates are reversed. Thus, usually only one side of the triangle is shown. An advantage of the scatterplot matrix is that changes in one dimension can be observed by scanning across all the graphs in a particular row or column.

For example, produce a scatterplot matrix of the Countif.jmp data as follows:

1.   Select GraphScatterplot Matrix, and the scatterplot matrix dialog box will appear.

2.   Hold down the Ctrl key and click every variable, one by one, except for Gender Code. All the variables are now highlighted except for Gender Code.

3.   Click Y, Columns, and all the variables are listed.

4.   Notice toward the lower left of the dialog box that there is Matrix Format. There is a drop-down menu with Lower Triangular selected. If you click the drop-down menu, you can see the different options available. Leave it as Lower Triangular. Click OK.

A scatterplot matrix similar to Figure 4.15 will appear.

Figure 4.15: Scatterplot Matrix of Countif.jmp

image

For every scatterplot in the leftmost column of the scatterplot matrix (which is labeled Major), the x-axis is Major. Move the mouse on top of any point and left-click. That observation is now highlighted in that scatterplot but also in the other scatterplots. So you can observe how that observation performs in the other dimensions. The row number also appears.

Similarly, you can hold down the left mouse button and highlight an area of points. The corresponding selected points in the other scatterplots are also highlighted, as well as those observations in the data table.

Click the red triangle and Fit Line. In the scatterplots where both variables are continuous (in this case, Salary by Years, Salary by GPA, and GPA by Years), a line is fitted with a confidence interval as shown in Figure 4.15. The smaller the width of the interval, the stronger the relationship. That is, Salary and Years have a strong relationship.

Generally speaking, the scatterplots in a scatterplot matrix are more informative with continuous variables than with categorical variables. Nevertheless, insight into the effect of a categorical variable on the other variables can be visualized with the scatterplot matrix by using the Group option. Here is an example:

1.   Select GraphScatterplot Matrix.

2.   Click Recall. All your variables are now listed under Y, Columns.

3.   Select MajorGroup.

4.   Click OK. The scatterplot matrix appears.

5.   Click the red triangle and select Density EllipsesShaded Ellipses.

As shown in Figure 4.16, when you examine the ellipses, it appears that Arts and Sciences (A&S) students have lower salaries than Business students, and, in the data set, there were more A&S students that have recently graduated than Business students.

Figure 4.16: Scatterplot with Shaded Ellipses

image

Explore a Larger Data Set

Now explore HomePriceIndexCPI1.jmp, another data set that contains time as a variable. This file includes the quarterly home price index and consumer price index for each of the states in the United States and for the District of Columbia from May 1975 until September 2009. This table has been sorted by state and date.

Trellis Chart

Initially, you could examine all the variables in a scatterplot matrix. If you do, the scatterplots are not very helpful—there are too many dates and states. Instead, use the Graph Builder:

1.   In JMP, open HomePriceIndexCPI1.jmp.

2.   Select Graph ▶ Graph Builder.

3.   Click Date and drag it to the X drop zone.

4.   Click Home Price Index, and drag it to the Y drop zone. The graph is not too informative yet.

5.   Now, click State and drag it to the Wrap drop zone.

You now have 51 small charts, by State, of Home Price Index versus Date, as shown in Figure 4.17. This set of charts is called a trellis chart. A chart is created for each level of a categorical variable. (Similarly, the charts in Figure 4.17 are trellis charts.) Trellis charts can be extremely effective in discovering relationships with multivariate data. With this data, you can see the trend of each state’s home price index. Some states increase significantly, such as CA, HI, and DC (considerably highlighted), and some very slowly, such as GA, MO, and MS (slightly highlighted).

Figure 4.17: Trellis Chart of the Home Price Indices, by State

image

You can add CPI to the charts by dragging CPI to the Y drop zone (just to the right of the Y-axis label Home Price Index), as shown in Figure 4.18. But perhaps an easier way is to start over: Select Home Price Index, hold down the Shift key, click CPI, and then click Y.

Figure 4.18: Trellis Chart of Home Price Index and CPI by State

image

Bubble Plot

Another informative visualization chart is the Bubble plot, which can be very effective with data over time. To generate a Bubble plot with the Home Price Index data, complete the following steps:

1.   Select GraphBubble Plot.

2.   Drag Home Price Index to Y.

3.   Drag State to X.

4.   Drag Date to Time.

5.   Drag Home Price Index to Coloring, as shown in Figure 4.19. Click OK.

6.   Click the go arrow key icon (blue triangle), and observe how the home price indexes for the states increase over time until 2008, and then they decrease.

Figure 4.19 Bubble Plot Dialog Box

image

Figure 4.20 through Figure 23 capture the bubble plot at four dates. You can increase the speed by moving the slider to the right. Further, you can save this as a Flash file, so that you can send it in an email or include it as part of a presentation, by clicking the red triangle and selecting Save for Adobe Flash platform (.SWF).

Figure 4.20: Bubble Plot of Home Price Index by State on 3/1975

image

Figure 4.21: Bubble Plot of Home Price Index by State on 3/2005

image

Figure 4.22: Bubble Plot of Home Price Index by State on 3/2008

image

Figure 4.23: Bubble Plot of Home Price Index by State on 9/2009

image

Explore a Real-World Data Set

Now you will examine another data set, profit by product.jmp. This file contains sales data, including Revenue, Cost of Sales, and two calculated columns—Gross Profit and GP% (percentage of gross profit). (Notice the plus sign to the right of the Gross Profit and GP% variables, respectively. Double-click the plus sign for either variable. The Formula dialog box appears with the corresponding formulas.)

The data is organized by these items:

   time (quarter)

   distribution channel

   product line

   customer ID

Use Graph Builder to Examine Results of Analyses

If you were examining this data set “in the real world,” the first step would be to generate and examine the results of some univariate and bivariate analyses with descriptive statistics, scatterplots, and tables. Next, since the data set is multivariate, graphics are likely to provide significant insights. So use the JMP Graph Builder:

1.   Select Graph ▶ Graph Builder.

2.   Drag Quarter to the X drop zone.

3.   Drag Revenue to the Y drop zone.

4.   Instead of viewing the box plot, look at a line of the average revenue by quarter by right-clicking any open space in the graph and then selecting Box Plot ▶ Change to Contour. The density Revenue values are displayed by Quarter, as in Figure 4.24.

5.   Now right-click inside any of the shaded revenue density areas. Then select Contour ▶ Change to ▶ Line. The line graph will resemble what is shown in Figure 4.25.

Figure 4.24: Revenue by Quarter Using Contours

image

Figure 4.25: Revenue by Quarter Using Line Graphs

image

Generate a Trellis Chart and Examine Results

Next complete the following steps:

1.   Add Cost of Sales and Gross Profit to the Y axis, by dragging Cost of Sales to the Y drop zone. (Be careful to not remove Revenue—release the mouse button just to the right of the Y axis.)

2.   Similarly, drag Gross Profit just to the right of Y.

3.   Drag Product Line to the Wrap drop zone.

4.   Click Done.

These steps should produce a trellis chart similar to Figure 4.26.

Figure 4.26: Trellis Chart of the Average Revenue, Cost of Sales, and Gross Product over Time (Quarter) by Product Line

image

Examining Figure 4.26, you can see that, in terms of average Gross Profit, the Credit Products product line did not do too well in Quarter 3. You can focus or drill down on the Revenue and Cost of Sales of the Credit Products by using the Data Filter feature of JMP. The Data Filter, when used in concert with the Graph Builder, enhances your exploration capability.

1.   On the main menu in the data sheet window, select RowsData Filter.

2.   Click Product Line, and click Add.

3.   Check the Include option. Then, under the list of Product Lines, click Credit Products. The graph now includes only Credit Products and looks like Figure 4.27. In the rows panel, note that 5,144 rows are Selected and 19,402 are Excluded.

4.   Because you are looking only at Credit Products, drag Product Line from the top graph back over and release it on top of the other variables (at the left side in the Select Columns box). You can now focus on all your Credit Products customers by dragging Customer ID to the Wrap drop zone. The graph should look like Figure 4.28.

Figure 4.27: Graph of the Average Revenue, Cost of Sales, and Gross Product, by Quarter for the Credit Products Product Line

image

Figure 4.28: Graph of the Average Revenue, Cost of Sales, and Gross Product, by Quarter for the Credit Products Product Line, by Customer ID

image

However, if you have too many customers, this trellis chart might be much too busy. So, with the filter in play, build another graph:

1.   Click Start Over in the Graph Builder dialog box.

2.   Drag Customer ID to the Y drop zone.

3.   Drag Revenue to the X drop zone.

4.   Right-click on any point, and then select Box PlotChange toBar.

5.   Drag GP% to the X drop zone (toward the right, near the x-value of 400).

6.   On the right side of the graph with GP%, right-click anywhere on the box plot and then select Box PlotChange toBar. The graph should look similar to Figure 4.29.

Figure 4.29: Bar Chart of the Average Revenue and %GP of Credit Products Customers

image

Use Dynamic Linking to Explore Comparisons in a Small Data Subset

You can quickly see for your Credit Products product line that the average sales (revenue) of customers IAS and CAM are relatively high, but their average percentage gross profit is rather low. You might want to further examine these customers separately. An effective feature of JMP is called dynamic linking: As you interact with the graph, you also interact with the data table. So you can easily create a subset of the data with just these two customers:

1.   Click on one of the bars for IAS. Then hold down the Shift key and click one of the bars for CAM. The bars for these two companies should be highlighted.

2.   In the Graph Builder window, click the View Associated Data icon in the extreme lower right. A data table will be displayed with 64 rows selected.

3.   Select TablesSubset.

4.   In the Subset window, change the Output table name from its default to CAM IAS profit by product. Click OK. A new data table is created and displayed with only 64 rows for customers IAS and CAM only.

5.   Select FileSave. You can further analyze these two customers at a later time. Now return to Graph Builder.

Return to Graph Builder to Sort and Visualize a Larger Data Set

If you have difficulty getting the Graph Builder dialog box back (Figure 4.29), navigate to the JMP Home Window and click profit by product—Graph Builder.

Those two customers were rather easy to identify. But what if you had a large number of customers? You can sort the customers by average percent gross profit by again dragging GP% to the Y drop zone. With the cursor somewhere on the y-axis, right-click, and then click Order ByGP%, descending. Now, you can easily see the customers who are not doing well in terms of average percent gross profit as shown in Figure 4.30.

As annotated in Figure 4.29, if you click the View Associated Data icon in the lower-right corner, then a table will list the associated data with the graph.

Figure 4.30: Bar Chart of the Average Revenue and %GP of Credit Products Customers in Ascending Order

image

You can further explore the Credit Products product line and develop a better understanding of the data by adding another element to your filter. For example, you might postulate that there is a correlation between average size of sales and gross profit. You can visualize this relationship by following these next steps:

1.   In the Data Filter dialog box, click the AND icon.

2.   In the list of Add Filter Columns, click Revenue.

3.   Click Add. A scroll bar will appear as shown in Figure 4.31.

4.   Double-click $0.5, change the value to 1350, and press Enter. As a result, 169 rows are matched. The graph should look similar to Figure 4.32.

Figure 4.31: Two Data Filters with Scroll bar

image

Figure 4.32: Bar Chart Using Further Elements of the Data Filter

image

You can see that you are losing money with four customers. If you want to examine them further, you could subset these four customers to another JMP table.

With these examples, you have seen that by using several of the JMP visualization tools, and the additional layer of the data filter, you can quickly explore and develop a better understanding of a large amount of multivariate data.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset