Chapter Seven

Approachable Analytics

In Chapters 5 and 6, we introduced you to many of the objects available in the Designer application of SAS Visual Analytics. These objects let you create visualizations that display your data so that you can communicate a message to a viewer. This works well for reporting and dashboards, but sometimes your data needs additional investigation to see what is going on under the covers. SAS Visual Analytics has an application very similar to the Designer, called the Explorer, that gives you access to additional objects and data analysis features.

Compared to the Designer, which focuses on displaying your data, the Explorer enables you to learn more about your data. Organizations are always searching for that piece of information that gives them a competitive advantage, and you can only find it by digging deep into your data. The Explorer application gives you all of the Designer objects as well as more advanced ones that let you see your data in different views. Some of the objects also have analysis features that run calculations behind the scenes to give you even more information about your data.

In this chapter, we walk through the Explorer and show how it is set up differently from the Designer. Then we go over some of the advanced objects that you can use in the Explorer, as well as the additional data analysis features that are built into those objects.

Note: The Explorer application applies to SAS Visual Analytics 7.3 and earlier versions. In version 8.1, the Explorer and Designer are merged into the same section along with all of the objects. That version is covered in Chapter 13.

About the Explorer

In the Explorer application, you create explorations, which are metadata objects that contain visualizations and other data settings. Explorations can be saved and shared with other users. Within an exploration, objects are used in visualizations. Each exploration can have multiple visualizations, but each visualization can contain only one object.

At first glance, you might not be able to see the difference between the Explorer and Designer since they are laid out similarly.

Figure 7.1 Explorer layout

image

The Explorer is set up with a panel on the left side for your data. This is where you can add data and see all of the data items once they have been added. The drop-down list at the top of the panel gives you the options to manipulate data such as adding a new hierarchy, calculated item, data source filter, and so on.

All of the objects can be found as icons in the toolbar near the top of the window. Clicking on any of them places that object into the visualization. You can also select an object by clicking on the drop-down list that sits to the left of the visualization name. The objects can be controlled with the right panel. This contains tabs where you can add data items, set up filtering, and change other properties.

Figure 7.2 Creating visualizations

image

Visualizations are added to the canvas area by clicking on the icon to the left of the objects. This icon also has a drop-down menu that contains all of the objects so that you can do everything in one step. When you add a new visualization, the canvas splits and gives you a window for each one, as shown above with Visualization 2. A visualization can also be minimized by clicking the minimize icon in the top right corner of its window. Minimized visualizations are placed in the bar at the bottom of the canvas, as shown with Visualization 3 at the bottom of Figure 7.2.

Automatic chart feature

A feature that is unique to the Explorer is the first object called Automatic Chart. This object is different from others in that it changes based on the data items that you provide it. Depending on the mixture of categories, measures, geography items, and date fields, SAS chooses the chart that works the best for that combination.

SAS has a built-in algorithm that determines which chart you get, based on the data items assigned:

•   One measure – Histogram
•   One category – Bar chart
•   One aggregated measure – Crosstab
•   One datetime category and any number of other categories or measures – Line chart
•   One geography and up to two measures – Geo map
•   One geography and three or more measures – Bar chart
•   One document collection – Word cloud
•   Two measures – Scatter plot or heat map
•   Three or more measures – Scatter plot matrix or correlation matrix
•   One or more categories and any number of measures and geographies – Bar chart
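Conceptually, the selection behaves like a set of rules checked in order. The short Python sketch below paraphrases the combinations in the list above; the function name, the rule order, and the exact checks are assumptions made for illustration, not SAS's internal implementation.

```python
# Illustrative, rule-based chart selection paraphrased from the list above.
# This is a sketch of the idea only, not SAS's algorithm; the order in which
# the rules are checked is an assumption.
def pick_chart(measures=0, categories=0, geographies=0,
               datetime_categories=0, document_collections=0,
               aggregated_measures=0):
    """Return a chart type for a given mix of data items."""
    if document_collections >= 1:
        return "Word cloud"
    if datetime_categories >= 1:
        return "Line chart"
    if categories >= 1:
        return "Bar chart"
    if geographies >= 1:
        return "Geo map" if measures <= 2 else "Bar chart"
    if aggregated_measures >= 1:
        return "Crosstab"
    if measures == 1:
        return "Histogram"
    if measures == 2:
        return "Scatter plot or heat map"
    if measures >= 3:
        return "Scatter plot matrix or correlation matrix"
    return "No chart yet"

print(pick_chart(measures=2))                 # Scatter plot or heat map
print(pick_chart(geographies=1, measures=1))  # Geo map
print(pick_chart(categories=1, measures=3))   # Bar chart
```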

Figure 7.3 Using the automatic chart feature

image

In the example above, you can tell that we are using the Automatic Chart from the label in the top left corner of the visualization and from the information contained in the Roles tab. We took the VA_SAMPLE_SMALLINSIGHT data set and added Order Total and Vendor Rating as measures to the chart. From that combination, the automatic chart generated the heat map shown above. We could click Use Heat Map to switch to a heat map permanently, or we can continue to change the data items.

Figure 7.4 Removing roles in an automatic chart

image

In this example, we removed Vendor Rating and added Product Line. You can see that the automatic chart has now generated a bar chart for us.

The automatic chart feature in the Explorer is a good way to start learning more about a data set, especially one that you are looking at for the first time. By adding and removing data items, you get a better sense of what your data contains as well as different views of it through all of the charts.

Box plots

Mean and median, along with maximum and minimum, are some of the first concepts taught in statistics. That is because these values explain a lot about your data and give you a better sense of the values that you are working with. Given any measure in your data, the Box Plot object in the Explorer automatically summarizes the distribution of the data and displays those statistics in a box-and-whisker format. For this example, we are going to use the CARS data set in the SASHELP library.

Interpreting the results

The Box Plot object requires just a single measure for the visualization, but a category and additional measures can be incorporated. In the following figure, we added the MPG (Highway) data item as a measure in the box plot.

Figure 7.5 Box plot example

image

The dark blue shaded area in the middle is our box, and the lines below and above it are our whiskers. The whiskers represent the minimum and maximum values of the range of data, while the box represents the middle 50% of data points. This means that Q1, the bottom of the box, is the 25th percentile and Q3, the top of the box, is the 75th percentile. The range from Q1 to Q3 is the interquartile range. The median is the middle bar between the two, where half of the data points fall below it and half fall above it. If you position the pointer over the middle of the box plot, you get a context box that gives you the exact values for all of these statistics.

The example above is the default setting for the Box Plot, but there are some changes that you can make in the Properties tab to better visualize the distribution of your data.

Figure 7.6 Box plot example with outliers

image

In the above figure, we checked the Show averages box and changed the Outliers drop-down list from Hide Outliers to Show Outliers in the Properties tab. The average shows up as a diamond in the box plot, and the outliers appear beyond the whiskers as either black dots or a light blue shaded box. Outliers are plotted as individual dots unless there are too many points to plot, in which case they are shown as the light shaded box.

The Box Plot identifies an outlier as any data point that lies more than 1.5 times the interquartile range beyond either end of the box. So if our box runs from 24 to 29, the interquartile range is 5. Multiplying that by 1.5 gives 7.5, and extending the box by that amount on each side gives a range of 16.5-36.5. Anything outside of that range is considered an outlier by the object.
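If you want to verify this rule outside the application, the arithmetic is easy to reproduce. The following Python sketch applies the same 1.5 x IQR fences to a small set of hypothetical MPG values (not the actual SASHELP.CARS data).

```python
import numpy as np

# Reproduce the 1.5 x IQR outlier rule on hypothetical MPG values.
mpg = np.array([22, 24, 24, 25, 26, 27, 28, 29, 29, 31, 45])

q1, q3 = np.percentile(mpg, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = mpg[(mpg < lower_fence) | (mpg > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences: [{lower_fence}, {upper_fence}], outliers: {outliers}")
```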

Should you exclude outliers?

Before you can determine what to do with an outlier, you need to understand the cause. If the outlier is a measurement error, then it makes sense to exclude it. For example, if the MPG was listed as 288, it is likely that someone made a data entry error. The outlier can be excluded. There might be other cases where your model or underlying assumptions are incorrect. Study the outliers to ensure you understand their causes.

Each of the Outliers options changes the way your data looks and can change the underlying statistics. With the default Hide Outliers, the outliers are not shown in the visualization, and the data points that would be outliers are incorporated into the whiskers. Show Outliers displays the outliers in the visualization and shortens the whiskers so that they do not include them; the outliers are still counted in all of the other statistics. The Ignore Outliers option removes the outliers from the visualization entirely and recalculates the statistics without them. That option is shown below; you can see how the values in our box-and-whisker plot have changed.

Figure 7.7 Box plot ignoring outliers

image

Adding more data items

The Box Plot can go beyond a single measure and handle a category and/or multiple measures. From our previous example, we added the Type data field as a Category in the Roles tab. This displays a separate box-and-whisker plot for each category.

Figure 7.8 Box plots with a category

image

The distribution across categories allows for better analysis now that you can see how each one is broken out. The Hybrid box-and-whisker plot has only three data points, which is why it does not have any whiskers. Also notice how all of the box-and-whisker plots vary in size and shape. For the Truck category, the minimum value is very close to the Q1 and median values. This shows that the category has many more data points at the lower end of the range than at the higher end, where the top whisker is.

Now we’re going to add MPG (City) to the Measures in the Roles tab.

Figure 7.9 Box plot with a category and multiple measures

image

By adding another measure, the number of box-and-whisker plots doubles because each category now has one plot per measure. It's interesting that the Truck and Hybrid categories do not change much from Highway to City, but the Sedan, Sports, and Wagon categories do. Also, the Sedan is the only category that has multiple outliers for both the City and Highway measures.

When to use Box Plots

As you can see from the above example, box plots are a good way to see into the measures and look at the distribution of values. Here are a few situations in which to consider using a box plot:

•   Exploration – Box plots can be a good place to start looking at your data since you can see how each of the measures are distributed and how they look with different categories.

•   Outliers – Depending on the data set, you might want to find out if you have outliers that you would like to exclude from your analysis. The Box Plot provides a way to easily calculate and visualize them.

Histograms

As with Box Plots, Histograms are another way for you to see a frequency distribution of your data. Only one measure can be used with them, but histograms can be an effective way to see how your data is spread out as counts or percent of frequency. For this example, we are going to use the SASHELP.CARS data set that we used in the Box Plot example and switch over to a histogram.

Changing objects in a visualization

Within a Visualization, there are multiple ways that you can change the object to get a different look at your data.

Figure 7.10 Where to change an object

image

Now you'll notice in this example that the Histogram is not available. That's because at the end of the Box Plot section we had the data items Type, MPG (Highway), and MPG (City) selected in the Roles tab. When changing objects, the visualization reuses the data items from the Roles tab, and some objects have restrictions on the number and type of data items that they can handle. Because the histogram uses only a single measure, it is grayed out until just one measure remains in the Roles tab.

After taking out the category Type and measure MPG (City), we then select the histogram and the visualization is updated.

Figure 7.11 Change visualization to histogram

image

Histogram options

In the Properties tab, there are a few options that you can apply to the histogram. Similar to a bar chart, you can change the bar direction to horizontal. You can also change how the frequency is displayed by selecting Count or Percent. The default is Count, but if you change it to Percent, the frequency axis updates to show what percentage of all data points each bin contains. Finally, you can choose your own Bin count; the number of values in your measure determines the default bin count that is automatically applied.
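Conceptually, the binning works as in the following Python sketch, which splits the range between the minimum and maximum of a hypothetical measure into evenly spaced bins and reports each bin as a count and as a percent. The data here is made up for illustration; it is not the CARS data.

```python
import numpy as np

# Evenly spaced binning of one hypothetical measure, with the frequency of
# each bin reported as a count and as a percent of all data points.
rng = np.random.default_rng(7)
mpg_highway = rng.normal(27, 5, 400)        # stand-in values, not SASHELP.CARS

bin_count = 10
counts, edges = np.histogram(mpg_highway, bins=bin_count)
percents = 100 * counts / counts.sum()

for lo, hi, c, p in zip(edges[:-1], edges[1:], counts, percents):
    print(f"{lo:6.1f} to {hi:6.1f}: count={c:3d}  percent={p:5.1f}%")
```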

Figure 7.12 Histogram example

image

As shown above, when you use your own bin count, the object takes the maximum and minimum values and then spreads the bins evenly across that range.

When exploring data, histograms can be a good way to get an understanding of how measures are distributed. The box plot shows you plenty of statistics about a measure, but it can't show how densely your data is packed into one area the way a histogram can.

Using a correlation matrix

Data analysis is about finding pieces of information within your data that can lead to a competitive advantage or help improve an organization, and finding relationships between data items is one way to get to those bits of insight. The Correlation Matrix is one way to measure relationships between measures in the Explorer. The object takes in different measures, compares them against each other, and then determines whether a relationship exists between them and how strong it is. For this example, we are going to use the VA_SAMPLE_K12_STUDENT data set that is shipped with SAS Visual Analytics.

Calculating a correlation

Correlations in SAS Visual Analytics are calculated using Pearson's product-moment correlation coefficient. This calculation takes two measures and all of their values and generates a coefficient that describes their relationship.

The coefficient can range from -1 to 1. Values between -1 and 0 indicate a negative relationship, which means that as one of the measures increases, the other decreases, and vice versa. A correlation of 0 shows no linear relationship at all. Values between 0 and 1 indicate a positive relationship, which means that as one measure increases or decreases, the other follows. SAS classifies these correlations as Weak, Moderate, or Strong, as shown in Figure 7.13.
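To make the calculation concrete, the following Python sketch computes a Pearson correlation for two hypothetical measures that were generated to be related. The data is invented for illustration and is not the SAS sample data.

```python
import numpy as np

# Pearson correlation for two hypothetical measures built to be related.
rng = np.random.default_rng(1)
math_score = rng.normal(75, 10, 200)
reading_score = 0.8 * math_score + rng.normal(0, 6, 200)

r = np.corrcoef(math_score, reading_score)[0, 1]
print(f"correlation = {r:.2f}")   # close to +1, a strong positive relationship
```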

Figure 7.13 How SAS categorizes correlation values

image

Understanding the matrix

In this example, we look at a combination of comprehension levels, scores, attendance days, and discipline days for students in a school district. You can get a matrix of correlations of all the measures against one another by using the Within one set of measures option in the Roles tab.

Figure 7.14 Correlation matrix example

image

Here’s what we learn from the above figure:

•   There’s a strong relationship between math and reading scores.

•   Attendance days are not related to test scores or comprehension levels.

•   Discipline days are not related to test scores or comprehension levels.

The strength of the correlation determines the color of the box. If you position the pointer over any of the boxes, then you get to see the actual numbers as well as how SAS has classified the correlation.

What's interesting in this example is how attendance days and discipline days are not related to scores at all. One might expect that students who attend school less often, or who are disciplined more often, would learn less, but that might not be the case.

If you do not want the full matrix and instead just want to compare a single measure against others, or two different sets against each other, you can use the Between two sets of measures option under Show correlations in the Roles tab. This breaks the chart into an X axis and a Y axis, and you control which measures appear on each. Below, we wanted to see which measure had the highest correlation value with Math score, so we put it on its own on the Y axis.

Figure 7.15 Correlation example between two sets of measures

image

Interpreting a correlation value

With a lot of data and a strong correlation between the values of two measures, you might assume that the correlation indicates a relationship between the concepts that the values are measuring. That is not always the case. The phrase correlation does not equal causation is common in the field of statistics. It means that just because two measures have related values (which is what correlation measures), the concepts behind the measures do not necessarily have a direct relationship. There are many different forms of an apparent relationship between data items. In Stephen Few's book Now You See It, he breaks correlations down into four possibilities:

•   One measure causes the other's behavior

•   Neither causes the other's behavior; both are caused by other variables

•   Neither causes the other's behavior; another variable connects them

•   Correlation is erroneous due to insufficient or bad data

In our previous example, the math and reading scores had a strong positive correlation with each other, which indicates that as one goes up, the other should follow. Because we also saw minimal relationships between those scores and the attendance days and discipline days, we might conclude that some other factor, such as the intelligence of each student, is causing the high correlation between the math and reading scores.

Shark Attacks and Ice Cream Sales

If you compared the amount of ice cream sold to the number of shark attacks by month or season, those two measures would have a strong correlation value. Both increase in the summer months compared to the winter. Obviously, they have nothing to do with each other except that they both change as the seasons turn. This is an example of neither measure causing the other's behavior because another variable connects them: seasonality.

Forecasting

Forecasting is not an object by itself; rather, it is an extension of the Line Chart object. It gets its own section here because it is one way to do predictive analysis in the application and offers many options for forecasting your data.

If your data contains a date or time item, the Forecasting option in the Explorer's Line Chart object predicts how the data trends over an upcoming period. In this section, you learn how the forecasting is done and how to use the Scenario Analysis option that comes with it. For our example, we are going to use the VA_SAMPLE_SMALLINSIGHT data set.

Working with the forecasting option

The forecasting option is found in the Roles tab of the line chart. The option is grayed out until a date item is added in the Category section. Once that is populated, you can select the Forecasting option. When selected, a vertical line appears in the line chart dividing the end of your data from the beginning of the forecast results.

Figure 7.16 Forecasting example

image

As long as you have a date field and a measure, anything can be forecasted. Popular examples include sales, weather, and company performance. For this example, we are going to stick with the Small Insight company data. Below is an example with the forecasting option turned on for the Order Total measure and Transaction Date - Month (a duplicate of Transaction Date with the MMYYYY format).

The data ends in October 2013, so the forecast starts in November 2013, which is where the gray vertical line is placed. The dark blue line in the forecast shows the most likely trajectory of the Order Total, and the blue shaded area is the confidence interval. By looking at the legend at the bottom, you can see that we are working with a 95% confidence interval. This means that the model projects a 95% chance that the future Order Total falls somewhere in the blue shaded area.

Figure 7.17 Forecasting options

image

For this example, the forecast only goes out six more periods. This is called the forecast duration, and it can be changed, along with the confidence interval, in the Properties tab. At the bottom of the tab, there is an option to change those values, shown in Figure 7.17 above.

As you increase the forecast duration, the confidence band typically expands: the further into the future you go, the more uncertainty there is. It's important to note that models like these work better with as much data as you can give them. If you only have a few data points, the model is going to have a hard time producing accurate results.
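As a rough illustration of why the band widens, the sketch below computes a 95% interval under a simple random-walk assumption, where the forecast error variance grows with the horizon. The numbers are hypothetical and the model is not necessarily the one SAS selects; the point is only the widening effect.

```python
import numpy as np

# A 95% interval that widens with the forecast horizon, for a random-walk
# model. Both values below are hypothetical.
last_value = 2.26   # last observed monthly total, in millions
sigma = 0.35        # one-step-ahead forecast error (standard deviation)

for h in range(1, 7):                       # six periods ahead
    half_width = 1.96 * sigma * np.sqrt(h)  # error variance grows with h
    print(f"h={h}: {last_value - half_width:5.2f} to {last_value + half_width:5.2f}")
```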

Building better forecasts

A good indicator of future performance is past performance. When creating a forecast, the more historical data you can provide, the more likely your forecast is to produce a better result.

How is the data modeled?

The forecast runs your data through six different models and selects the one with the best fit. Here is a list of the different exponential smoothing models available:

•   Damped-trend

•   Linear

•   Seasonal

•   Simple

•   Winters method (additive)

•   Winters method (multiplicative)

As the data is modeled, the Root-Mean-Square-Error (RMSE) is calculated for each model behind the scenes. The RMSE is a measure of how close the predicted values are to the real data. The lower the RMSE, the more accurate the model is. SAS Visual Analytics then selects the model with the lowest RMSE to use in the forecast.
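The selection step itself is simple to sketch. The Python example below compares two hypothetical candidates, simple exponential smoothing and a flat mean forecast, by their RMSE and keeps the better one. It is a simplified stand-in for the six smoothing models that SAS actually fits.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error between observed and predicted values."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.sqrt(np.mean((actual - predicted) ** 2))

def simple_smoothing(y, alpha=0.3):
    """One-step-ahead predictions from simple exponential smoothing."""
    preds, level = [], y[0]
    for value in y:
        preds.append(level)                         # forecast the current point
        level = alpha * value + (1 - alpha) * level
    return preds

# Hypothetical monthly totals; compare two candidate models by RMSE.
y = [3.1, 3.4, 3.2, 3.8, 3.6, 4.0, 4.1, 3.9, 4.4, 4.6]
candidates = {
    "simple smoothing": simple_smoothing(y),
    "overall mean": [np.mean(y)] * len(y),
}
scores = {name: rmse(y, preds) for name, preds in candidates.items()}
print(scores, "-> best fit:", min(scores, key=scores.get))
```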

Figure 7.18 Forecast analysis tab

image

After selecting the forecasting option, you can see which model was used, as well as a table of the results, by opening the analysis section at the bottom of the line chart. As shown in Figure 7.18, the Winters Method (Additive) algorithm was selected for the forecast used in the first example.

Look for underlying factors

To improve our analysis, we don't want to base the forecast on a single historical measure alone. Other data points might influence how that measure is modeled, and bringing them in can make the model even stronger because it incorporates multiple variables.

The models that the application runs to build our forecast are able to include other measures in the analysis. By going to the Underlying factors section in the Roles tab and clicking the drop-down list, you can add one or more measures from your data set into the analysis. As with the original forecasting, SAS runs the data through the models, adding autoregressive integrated moving average (ARIMA) models to the original six, to determine the best fit. If an added measure does not have an influence on the model, it is grayed out. When the new measure does influence the model, the chart is updated with the results. In the figure below, we added Sales Rep Rating as a possible underlying factor.

Figure 7.19 Forecasting with underlying factors

image

Note: The data goes all the way back to December 1997 but the slider has been moved to just show the past three years.

Continuing our forecast example with Sales Rep Rating added as a possible underlying factor, the forecast has been updated with the results. The top chart is similar to our original forecast of Small Insight's monthly sales, except that the forecasted section has improved. In our first run, the projection for six months out had a 95% confidence range for predicted monthly sales of $1.19-$5.16 million. With Sales Rep Rating as a factor, that confidence band narrows to $2.01-$4.82 million, which is a notable reduction in the range.

Using the scenario analysis

Once you have found an underlying factor that influences the forecast, the Scenario Analysis button at the bottom of the Roles tab becomes available. After you click it, a window shows the forecasted data field and the underlying factor. There are two options for users to change: Goal Seeking and Scenario Analysis.

With Scenario Analysis, you can manipulate the underlying factors and see how the forecast would change based on those new values. In our example, we are trying to get our average Sales Rep Rating to around 60% and want to investigate how that could affect our monthly sales. We set this expectation by clicking the Sales Rep Rating button on the left side of the screen and selecting Set Series Values. A pop-up window appears where the values can be set as a fixed number, a numeric increment, or a percentage increase.

Figure 7.20 Forecasting with scenario analysis

image

After you click OK, the forecasted numbers for the Sales Rep Rating are updated with the 60% setting. A gray line in the underlying factor's forecast section indicates the original data points. Since the underlying factor has been altered, Scenario Analysis is the only option available in the right menu. When Apply is selected, the forecast is updated with the new results.

In Figure 7.20, the data points and the confidence band have now started to trend higher. The gap from the original is not that large for the first forecasted period, but if you position the pointer over a month interval, the original baseline data is shown along with the updated forecast points. The forecasted monthly sales for April 2016 have risen from a baseline of $2.26 million to $2.65 million. The takeaway from this model is that improving the Sales Rep Rating over time might have a significant influence on monthly sales.

Figure 7.21 Forecasting with goal seeking

image

Goal seeking works in a similar way, except that you change the forecasted values and then see how the underlying factors would have to change to produce those results. Since the underlying factors can have only a small influence on the forecast and do not have a confidence range of their own, you only get an accurate result with something that is highly correlated. So for this example, we keep the Order Totals by month but use Order Marketing Costs as the underlying factor instead.

For the analysis in this example, we increased the forecasted Order Total by 10% by clicking the data item's button on the left side. Once you do that, an Apply button appears on the right side underneath Goal Seeking. When that is clicked, the graph updates with the results. You can see that the two line graphs are very similar, which makes sense: the more you market your products, the more likely they are to sell. At the six-month mark, the Order Marketing Cost has gone from $142,856 to $157,750, an increase of 10.42%. This goal seeking analysis shows that, ignoring other factors, the marketing costs would have to rise by 10.42% in order to hit the 10% increase in Order Total.

Word clouds

All of the objects that we have looked at so far focus on analyzing measures, with some able to add categories for another layer of investigation. Switching things up, word clouds are objects that explore categorical data items. Each word cloud object takes a category field that contains text and can analyze it for words of importance or frequency, or add a measure to place a value on certain words. For this example, rather than use a data set already in SAS, we use the option to import data from Twitter.

Loading social media data

At the Add Data Source window in the Explorer, you have the option to load any of the tables already loaded into LASR, or you can import data from your local machine, a server, Hadoop (if installed), or public data on the Internet. In the Other section of the Import Data window, there are options to get data from Facebook, Google Analytics, and Twitter.

Figure 7.22 How to load social media data

image

When you click the Twitter link, it first asks you to log in to an account. Once you do that, you can click the link again to get to the Import Twitter Data screen. For this example, we signed in as the Twitter account @zencos, which you can see in the bottom left corner of the window. In this window, you can also enter what you would like to search for, the number of tweets to pull, and the output table name for LASR.

Figure 7.23 Import twitter data window

image

In our example, this data set was created two days after Apple released the iPhone7, and we want to gauge how people react and feel about the product. With that in mind, we enter #iPhone7, which is the most popular hashtag used for the product on Twitter. We enter 18,000 for the number of tweets since that is the maximum allowed and we want as much data as possible for our analysis. We also check the Do not import retweets box so that we are only grabbing original tweets and not duplicates. After clicking OK, the data pull begins; it takes more than a few seconds since it is scraping all publicly available tweets that contain the hashtag and Twitter handle that we are searching for.

Setting up the word cloud

The Word Cloud object has two different ways to visualize the category data item being analyzed. One is to use category values, which takes the category data item as a whole string and uses frequency or another measure to determine how to display those values. The other is to use text analytics, which examines each word in the string; each word is then displayed based on its frequency or importance within the text.
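At its core, the frequency-based display rests on a plain word count. The following Python sketch tallies terms across a few invented tweets; the real text analytics option also weights terms by importance and handles sentiment, which this sketch does not attempt.

```python
import re
from collections import Counter

# Tally term frequencies across a few hypothetical tweets. The size of each
# word in a frequency-based word cloud is driven by counts like these.
tweets = [
    "Loving the new camera on the iPhone7",
    "No headphone jack on the iPhone7 is a strange choice",
    "The iPhone7 battery and camera are great",
]

terms = Counter()
for tweet in tweets:
    terms.update(re.findall(r"[a-z0-9']+", tweet.lower()))

for term, freq in terms.most_common(5):
    print(f"{term:10s} {freq}")
```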

Using category values

Under Show word cloud in the Roles tab, there is an option to select category values. With category values, we want to look at location to see where people are tweeting from, so we add authorlocation to the Words role in the Roles tab.

Figure 7.24 Word cloud example

image

It turns out that quite a few people tweeting about the product were from the United States and its major cities. Apple is headquartered in the United States and many of its customers live there, so it makes sense that a lot of people from these areas would be tweeting about the new phone. The phone was also released in 24 other countries throughout the weekend, which is why you see other countries and cities scattered across the word cloud.

Note: Twitter gives users the option to put in their own locations, which is why the data is not standardized.

Also, notice the warning sign in the bottom left corner. The word cloud limits the number of words allowed so that the differences in word size remain visible and the words stay readable.

The Roles tab also lets you change what determines the size and color of the words, similar to the Treemap object covered in Chapter 5. We are going to leave Frequency as the size role but put the measure retweetcount into the Color role. This aggregates that measure for each of the location words displayed in the object. You can change the color gradient by going to the Properties tab.

Figure 7.25 Word cloud with a measure

image

San Francisco, CA had an overwhelming number of retweets among its tweets as well. Because Apple is headquartered near San Francisco and many other companies in the industry are located nearby, there were probably a lot of employees, journalists, and consumers tweeting about the iPhone 7.

Was the product release a positive or negative experience for everyone? Let’s jump into the text analytics option to find out.

Using text analytics

On the Roles tab, you can select Using text analytics to switch how the word cloud handles the data. This option needs a document collection instead of a category so that it can analyze individual terms inside the category data field. You can create a document collection by right-clicking the category data item and selecting Document Collection. For this example, we do that with the body data item, which is the actual text of the tweet in each row of data. The body data item is then available in the Document collection drop-down list on the Roles tab.

Figure 7.26 Using text analytics

image

After selecting a document collection, the word cloud is updated with the results. Each term is given a topic term weight that reflects both the frequency of the term and its importance, and that value determines how large the term appears in the word cloud. You can see that there are a lot of common words, but also others that are important to what we are looking for. There were many hardware updates to the new iPhone, including a better camera, improved battery, new home button, and removal of the headphone jack. These are all popular terms in the tweets that were analyzed.

Now we want to look into whether the people tweeting about the new features were happy with the changes. In the Properties tab, there is a Text Analytics Settings section where you can check Analyze document sentiment. Clicking this option brings up a window where you can also check the Identify term roles box, which groups like nouns together, and specify how many topics you want created. After clicking OK, the word cloud is updated with the sentiment analysis. Similar terms are also grouped into topics that can be found in the top left corner or as a tab in the bottom analysis section. We want to find a grouping related to the new features; by clicking that drop-down list, you can find terms that commonly appear together.

Figure 7.27 Text analytics with sentiment analysis

image

The sentiment analysis looks at everything in the document collection field and determines the nature of the text. Each item gets categorized as positive (green), neutral (yellow), or negative (red). The results can be found in the upper right corner of the word cloud as well as in the Topics tab. With mostly neutral tweets and a few more negative than positive ones, there is definitely a mixed reaction to the decision to remove the headphone jack. However, when switching to topics that included iPhone7 or home button, the reactions were about three to one in favor of the positive. It appears that people are publicly enjoying the new product, but they are still on the fence about doing without a place to plug in their headphones.

Word clouds with unstructured data

In a 2014 SAS Global Forum paper, "SAS Admins Need a Dashboard, Too," the authors use a word cloud example that looks at SAS procedures and DATA steps executed on a server. In the word cloud, you can see which ones were using the most memory and CPU time on the server. The data was pulled from APM artifact tables, which are similar to logs. This is a useful visualization because administrators can see which executions could be slowing down a server just by picking out the most common words in an unstructured log field. Thinking about this makes you wonder what sources of unstructured data you might have around that could be easily analyzed with a word cloud.

Scatter plot

A scatter plot is a graph that plots an individual point for each row of data based on where it lands according to the X-axis and Y-axis variables. Similar to the Correlation Matrix, the Scatter Plot object is used to show relationships between two measures. However, this object adds a visual aspect: you can see all of your data points and where they fall between the two measures. Using each measure as an axis, the scatter plot displays each point in the chart, which gives you a two-dimensional view of how the data is distributed across both measures. This can give you an idea of how the measures are related to one another. For this example, we use the BODYFAT and CARS data sets that are located in the SASHELP library.

Data analysis

The setup for the Scatter Plot is straightforward: you just need to add your two measures to the visualization or to the Roles tab.

Figure 7.28 Scatter plot example

image

In this example, we added WEIGHT and BODYFAT as our measures. For each row of data, a point is plotted on the graph based on the values of the two measures. If you position the pointer over any of the data points, their values show in the details box.

The graph gives us a good indication of how the data varies between the two measures. Most of it is concentrated between 150-200 for WEIGHT and 10-30 for BODYFAT. Beyond that, the data fluctuates a lot; even someone at a weight of 150 can fall anywhere between 5 and 30 for BODYFAT. That being said, we are interested in whether there is a possible relationship between the two measures. If BODYFAT goes up or down, can we expect a change in WEIGHT as well? For that type of analysis, we can add lines of best fit to the scatter plot to help us out.

Lines of best fit are a way to model the relationship between variables. In the scatter plot object, this is done by going to the Properties tab and clicking the Fit Line drop-down list. The fit line is formed between the two measures by taking in all of the data points and calculating the line that best represents the relationship, which is the line that maximizes the R-square value. R-squared ranges from zero to one; the closer the value is to one, the more significant the possible relationship between the two measures. In this example, we have selected a Linear line of best fit.
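To make those statistics concrete, here is a Python sketch that fits a straight line to two hypothetical measures, generated to loosely resemble the weight and bodyfat pattern rather than taken from the SASHELP data, and computes the R-square and correlation values that the Analysis tab reports.

```python
import numpy as np

# Fit a straight line to two hypothetical measures and compute R-squared and
# the correlation, the two statistics the Analysis tab shows for a linear fit.
rng = np.random.default_rng(3)
weight = rng.normal(178, 28, 250)
bodyfat = 0.12 * weight - 2 + rng.normal(0, 6, 250)

slope, intercept = np.polyfit(weight, bodyfat, 1)
predicted = slope * weight + intercept
ss_res = np.sum((bodyfat - predicted) ** 2)
ss_tot = np.sum((bodyfat - bodyfat.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
correlation = np.corrcoef(weight, bodyfat)[0, 1]

print(f"R-squared = {r_squared:.2f}, correlation = {correlation:.2f}")
```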

Figure 7.29 Scatter plot with a fit line

image

After the selection is made, all possible linear lines are evaluated and the one with the highest R-square value for the data is displayed. You can see that it goes straight through the data points and has an upward trajectory. The Analysis tab at the bottom is where you can find the R-square value (0.38) and the correlation value. Just like the correlation matrix, linear fit lines display the correlation calculation as well. With a correlation value of 0.61, SAS tells us that there is a possible strong relationship between these two measures.

In the Fit Line drop-down list, there are also other options: Best Fit, Cubic, Quadratic, and PSpline. Quadratic and Cubic can be used if your data is varied or has points of change where a trend takes the data in a new direction. Quadratic lines have one curve, whereas Cubic lines have two, similar to an S shape. The PSpline line fits the line in pieces, which can have multiple curves and breaks across the data. The Best Fit option calculates the R-square value of the Linear, Cubic, and Quadratic lines and then displays the one with the highest R-square value. In the following figure, we switch our selection to the Best Fit option.

Figure 7.30 Scatter plot with best fit option

image

For our two measures, the Quadratic line of best fit had the highest R-square value and fit our data the best. You can tell that it is a Quadratic line by the single curve, and a property in the Analysis tab tells you which line was selected. The R-square value in the Analysis tab is 0.40, compared to the 0.38 that we had with the Linear line.
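The Best Fit logic can be sketched the same way: fit each candidate curve and keep the one with the highest R-square value. The Python example below does this for linear, quadratic, and cubic fits on hypothetical data; it illustrates the selection idea rather than SAS's actual code.

```python
import numpy as np

# Fit linear, quadratic, and cubic curves to hypothetical data and keep the
# one with the highest R-squared, a sketch of the Best Fit idea only.
def r_squared(y, predicted):
    ss_res = np.sum((y - predicted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(5)
x = rng.uniform(120, 250, 250)
y = -0.002 * (x - 200) ** 2 + 0.2 * x + rng.normal(0, 5, 250)  # slightly curved

fits = {}
for name, degree in [("Linear", 1), ("Quadratic", 2), ("Cubic", 3)]:
    coeffs = np.polyfit(x, y, degree)
    fits[name] = round(r_squared(y, np.polyval(coeffs, x)), 3)

print(fits, "-> best fit:", max(fits, key=fits.get))
```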

Interpreting lines of best fit

In Stephen Few's book Now You See It, he goes over four patterns to look for when analyzing relationships in data.

•   Direction - Does the slope go up or down? If the slope is going up, then you have a positive relationship, which means as one measure increases, so does the other. When the slope is going down then you have a negative relationship and as one measure increases, the other decreases.

•   Strength - How condensed are the data points to the line? If most of the data points follow the line, then the relationship is going to be stronger. However, if they are scattered all over or they are all compressed into one small area, then a relationship might not be as obvious.

•   Shape - Is it a straight line or curved? A straight line signifies a simple relationship: when one measure goes one way, the other follows suit. A curve means that there could be a changing point, that is, as your data follows the line, there comes a point where the relationship changes. These points of curvature can be very important for understanding more about your data.

•   Outliers - Are there any outliers? Outliers can be good to find examples of what doesn’t follow the relationship.

For our bodyfat-to-weight example, our best fit line is a positive curved line with a clustered area of values and a few possible outliers. The positive relationship was not that apparent at first, but with the addition of the line it's easier to visualize how the data trends upward. This indicates a possible correlation: as bodyfat or weight increases, so does the other. The curved shape is a bit vague, though. We have an outlier on the far right side that is not close to any of the other data points. With that point's bodyfat being low compared to the upward trajectory of the line where the points are condensed, it could have thrown off our model. We finish this analysis in the Heat Map section, which offers another way to visualize data in a manner similar to the scatter plot.

Adding categories

Stepping away from the data analysis portion of the scatter plot, there is also a way to add categories to your chart, which can make for great visualizations. In the Roles tab under Color, you can add a category into the chart that splits the data markers into different shapes and colors. In the figure below, we are looking at the CARS data set with measures of HorsePower and Weight. The Type category was added in as a color.

Figure 7.31 Scatter plot with categories

image

In the legend at the bottom, all of the types of cars are separated into different colors and shapes. When you look at the graph, it becomes very easy to see the variance between the types. The sports cars have relatively low weight, and their horsepower differs throughout the graph; the trajectory is only slightly upward. The SUVs and trucks are the opposite, with a steep upward weight-to-horsepower trajectory but never exceeding 350 horsepower.

Adding a category can give you good insight into how the individual data points are spread out and how the values in the category compare. Note that when you add a category, you lose the option to add a fit line.

Heat map

The Heat Map is a unique object that provides an alternative to the scatter plot and is better at visualizing relationships when categorical data is involved. It has an X axis and a Y axis that can contain both measures and categories. Measures are binned into cells that each cover a range of values; with categories, each unique value has its own cell. There is also a color role that defaults to Frequency, but any measure can be used. This role changes the color of a cell depending on where the value of the measure falls within the range of colors. For the example with two measures, we continue our analysis of the BODYFAT data set from the Scatter Plot section. Then we use the Insight Toy Company data set to look at how adding a category changes the object and what you can learn from it.

Data analysis

In the previous section on scatter plots, we looked at the BODYFAT data set and how the data was spread out according to the weight and bodyfat measures. In the figure below, we have added them to the axes of the heat map, except this time we have removed an outlier from the data set.

Figure 7.32 Heat map example

image

This is the same graph that we saw with the scatter plot, except now we can tell how concentrated the data is by the color of the bins. The Bin count option in the Properties tab lets you change the size of the bins: increasing the count makes each range smaller, and decreasing it makes each range larger. The color gradient determines the color range of the measure that you have in the color role. For this chart, we have left it at the default white to blue. You can see in the chart how the middle sections, which are much more concentrated with data points, become darker shades of blue.
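Conceptually, the heat map's cells come from two-dimensional binning. The Python sketch below bins two hypothetical measures into a 10-by-10 grid and counts the points in each cell, which is the quantity that drives the color when Frequency is in the color role.

```python
import numpy as np

# Bin two hypothetical measures into a 10 x 10 grid and count the points in
# each cell, the quantity that colors the cells when Frequency is the color role.
rng = np.random.default_rng(9)
weight = rng.normal(178, 28, 250)
bodyfat = 0.12 * weight - 2 + rng.normal(0, 6, 250)

counts, x_edges, y_edges = np.histogram2d(weight, bodyfat, bins=10)
row, col = np.unravel_index(np.argmax(counts), counts.shape)

print("points in the densest cell:", int(counts[row, col]))
print("that cell covers weight", round(x_edges[row], 1), "to", round(x_edges[row + 1], 1),
      "and bodyfat", round(y_edges[col], 1), "to", round(y_edges[col + 1], 1))
```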

Figure 7.33 Heat map with fit line

image

The Properties tab also has an option for the fit line, as there was with the scatter plot. This is only available when you have two measures as the axes. As we did with the scatter plot example, we select the Best Fit option, which is shown in Figure 7.33.

Back when we were analyzing our fit line from the scatter plot, we got a Quadratic line as our best fit but were unsure whether the outlier was affecting our model. In this instance, we have removed that outlier from the data in the Filters tab and get the same result. The Quadratic line begins to flatten when we get to the higher range of bodyfat and weight, which indicates a non-linear relationship compared to the other parts of the graph. This means that in those higher ranges, an increase in weight does not necessarily come with the same, or possibly any, increase in bodyfat.

Using a category

Even though using two measures works similarly in the heat map and the scatter plot, adding a category is much different. In a scatter plot, you added a category as the third data item (color), and the object split the data point markers according to that category. With a heat map, the category goes into an axis. This enables you to view all of the values of the category according to where they fall in the range of the measure on the other axis. Then the color of each cell, which represents the collective value of the measure for all data points that fall into that cell, gives you another visual to analyze. In the figure below, we look at the Facility Country and Product Quality data fields in the Insight Toy Company data set with Frequency in the color role.

Figure 7.34 Heat map with a category

image

There's a lot to interpret in this visualization. This data set is based on orders, and what sticks out right away is how concentrated those orders are in the United States facilities. This could simply be because most customer orders come from within the United States and the company has more facilities there than in any other country. More interesting, though, is that the United States and Mexico are the only countries with orders that have a Product Quality lower than 70%. That's not just a few orders either; it includes orders in every bin from 70% down to 62%. There are 30 other countries, and all of them are in fairly similar ranges. It's probably worth investigating the facilities producing that low product quality further.

Similar to the bubble plot and treemap from the data visualization chapter, the heat map analyzes three different data items, which gives a unique look at the data. This can be overwhelming to analyze, but there is information to be discovered this way. If you looked only at average product quality for each country in Figure 7.34, the United States and Mexico would trend lower, but you would not see how drastic the difference is compared to the other countries the way the heat map shows it.

Other tips when using the Explorer

The Explorer and Designer applications look similar, but do have differences in functionality. Here are a few additional features to be aware of with the Explorer.

Include and exclude

Sometimes there is a cluster of data points or a category value that you want to look into more closely. By using the include and exclude options, you can easily apply a filter to your data in the object and get to what you want to look at. This method can be much quicker than working out the exact filter conditions when you are just exploring your data. Here's how you can access these options:

1.   Select or highlight the data that you want to keep or exclude from the analysis. Then right-click the area.

image

2.   At the bottom of the options are the Include and Exclude choices. Select the one that you want to use.

image

After your selection is made, the visualization is updated. In the Filters tab on the right panel, you can see that the selection filter has been updated with what was selected. You can always delete this filter to go back to the original visualization.

3.   These filters even stay with you as you change objects. In this example, we right-clicked the object and selected a scatter plot.

image

Moving visualizations to the Designer

As you have seen, the Explorer offers a few more objects and data analysis features than the Designer does. That's appropriate since the Explorer is meant more for diving into your data, but sometimes you might want to use those visualizations in a report so that you can share them with other users. However, if you try to open an exploration in the Designer, you won't be able to find it.

When you open a file in the Designer, the application looks for report objects. Explorations and reports are saved as different objects, which is why you do not see any of your explorations in the folders. You can get around this problem by exporting a visualization as a report object and then importing it into a report. Here are the steps:

1.   Open the exploration that has the visualization that you want to move into the Designer. Then, select File ▶ Export ▶ Export as a Report.

image

2.   Select the visualizations that you want to export as a report. You can select more than one. Click OK and then save the Report.

image

3.   Go to the Designer and open a report. Then click the Import tab in the left panel. Choose Select a report to import in the drop-down list and then Import a report. Find your visualizations that you saved as a report.

image

4.   Open the report object and you can see your exploration with all your visualizations and their objects listed in a folder structure in the Import tab. Drag the objects into the canvas.

image

Proceed with caution!

In the preceding figure, we brought in a visualization whose data analysis feature is not available in the Designer. There is a warning sign in the bottom right corner of the object that lets us know we cannot edit the data. Because these features are not available in the Designer, anything involving the data cannot be changed, so there are no options for the Roles, Display Rules, Filters, and so on. You can still change how the object looks with some options in the Properties and Styles tabs. So before moving a visualization object over, it's a good idea to make sure your data is finalized.

References

The following sources were used in the chapter or provide more information about the topics discussed:

Aanderud, Tricia, and Michelle Homes. 2014. “SAS Admins Need a Dashboard, Too.” Proceedings of the SAS Global Forum 2014 Conference. Paper 1247-2014. Cary, NC: SAS Institute Inc.

Few, Stephen. 2009. Now You See It: Simple Visualization Techniques for Quantitative Analysis. Oakland, CA: Analytics Press.

Gonick, Larry, and Woollcott Smith. 1993. The Cartoon Guide to Statistics. New York, NY: HarperCollins Publishers Inc.

Huff, Darrell. 1954. How to Lie with Statistics. New York, NY: W.W. Norton & Company, Inc.

Milton, Michael. 2009. Head First Data Analysis. Sebastopol, CA: O’Reilly Media Inc.

SAS Institute Inc. 2015. SAS Visual Analytics 7.3: User’s Guide. Cary, NC: SAS Institute Inc.
