Chapter 7
IN THIS CHAPTER
Working with things great and small
Understanding symmetry, peaks, and plateaus
Getting descriptive
Serving up statistics on a tray
Measures of central tendency and variability are excellent ways of summarizing a set of scores. They aren’t the only ways, though. Central tendency and variability make up a subset of descriptive statistics. Some descriptive statistics are intuitive — like count, maximum, and minimum. Some are not — like skewness and kurtosis.
In this chapter, I discuss descriptive statistics, and I show you Excel’s capabilities for calculating them and visualizing them.
The most fundamental descriptive statistic I can imagine is the number of scores in a set of scores. Excel offers five ways to determine that number. Yes, five ways. Count them.
Given an array of cells, COUNT
gives you the amount of those cells that contain numerical data. Figure 7-1 shows that I’ve entered a group of scores, selected a cell to hold COUNT
’s result, and opened the Function Arguments dialog box for COUNT
.
Here are the steps:
Enter your data into the worksheet and select a cell for the result.
I entered data into columns C, D, and E to show off COUNT’s multi-argument capability. I selected cell C14 to hold the count.
In the Function Arguments dialog box, enter the appropriate values for the arguments.
In the Number1 box, I entered one of the data columns for this example, like C1:C12.
I clicked in the Number2 box and entered another data column. I entered D1:D6.
I clicked in the Number3 box and entered the last column, which in this example is E1:E2.
COUNTA
works like COUNT
, except that its tally includes cells that contain text and logical values in its tally.
COUNTBLANK
counts the number of blank cells in an array. In Figure 7-2, I use the numbers from the preceding example, but I extend the array to include cells D7 through D12 and E3 through E12. The array in the Range box is C1:E12. The Argument Functions dialog box for COUNTBLANK
shows the number of blank cells (16, for this example).
COUNTIF
shows the number of cells whose value meets a specified criterion. Figure 7-3 reuses the data once again, showing the Arguments Function dialog box for COUNTIF
. Although the range is C1:E12, COUNTIF
doesn’t include blank cells.
The criterion I used, >= 89, tells COUNTIF
to count only the cells whose values are greater than or equal to 89. For this example, that count is 2.
COUNTIFS
can use multiple criteria to determine the count. If the criteria come from two arrays, they must have the same number of cells. This is because COUNTIFS
counts pairs of cells. It includes a pair of cells in the count if one of the cells meets a criterion and the other meets a criterion. Take a look at Figure 7-4.
In this example, COUNTIFS
operates in C1:C6 and D1:D6. The criterion for the cells in column C is >=40. The criterion for the cells in column D is >50. This means that COUNTIFS
counts cell-pairs whose C cell holds a value greater than or equal to 40 and whose D cell holds a value greater than 50. Only two cell-pairs meet these conditions, as the dialog box shows.
You can use a cell range more than once in COUNTIFS
. For example,
=COUNTIFS(C1:C12,">30",C1:C12,"<60")
gives the number of cells in which the value is between 30 and 60 (not including 30 or 60).
Two more descriptive statistics that probably require no introduction are the maximum and the minimum. These, of course, are the largest value and the smallest value in a group of scores.
Excel has worksheet functions that determine a group’s largest and smallest values. I show you what MAX
is all about. The others work in a similar fashion.
Figure 7-5 reuses the scores from the preceding examples.
I selected a cell to hold their maximum value and opened the Function Arguments dialog box for MAX
. Here are the steps you can follow:
Type your data into the worksheet and select a cell to hold the result.
I entered data into columns C, D, and E to show off MAX
’s multi-argument capability. For this example, I selected cell C14.
MAX
.In the Function Arguments dialog box, enter the appropriate values for the arguments.
In the Number1 box, I entered one of the data columns: C1:C12.
Clicking the Number2 box creates and opens the Number3 box. In the Number2 box, I entered another array: D1:D6.
I clicked in the Number3 box and entered the last array: E1:E2.
MAX
ignores any text or logical values it encounters along the way. MAXA
takes text and logical values into account when it finds the maximum. If MAXA
encounters the logical value TRUE
, it converts that value to 1
. MAXA
converts FALSE
, or any text other than “TRUE
”, to 0
.
MIN
and MINA
work the same way as MAX
and MAXA
, except that they find the minimum rather than the maximum. Take care when you use MINA
, because the conversions of logical values and text to 0 and 1 influence the result. With the numbers in the preceding example, the minimum is 22. If you enter FALSE
or other text into a cell in any of the arrays, MINA
gives 0 as the minimum. If you enter TRUE
, MINA
gives 1 as the minimum.
In this section, I discuss some little-used statistics that are related to the mean and the variance. For most people, the mean and the variance are enough to describe a set of data. These other statistics, skewness and kurtosis, go just a bit further. You might use them someday if you have a huge set of data and you want to provide some in-depth description.
Think of the mean as locating a group of scores by showing you where their center is. This is the starting point for the other statistics. With respect to the mean:
Figure 7-6 shows three histograms. The first is symmetric; the other two are not. The symmetry and the asymmetry are reflected in the skewness statistic.
For the symmetric histogram, the skewness is 0. For the second histogram — the one that tails off to the right — the value of the skewness statistic is positive. It’s also said to be skewed to the right. For the third histogram (which tails off to the left), the value of the skewness statistic is negative. It’s also said to be skewed to the left.
In the formula, is the mean of the scores, N is the number of scores, and s is the standard deviation. This formula is for the skewness of a sample. The formula for the skewness in a population uses N rather than N-1.
I include this formula for completeness. If you’re ever concerned with skewness of a sample, you probably won’t use this formula anyway because Excel’s SKEW
function does the work for you.
To use SKEW
:
Type your numbers into a worksheet and select a cell for the result.
For this example, I’ve entered scores into the first ten rows of columns B, C, D, and E. (See Figure 7-7.) I selected cell H2 for the result.
SKEW
.In the Function Arguments dialog box, type the appropriate values for the arguments.
In the Number1 box, enter the array of cells that holds the data. For this example, the array is B1:E10. With the data array entered, the Function Arguments dialog box shows the skewness, which for this example is negative.
The Function Arguments dialog box for SKEW.P
(the skewness of a population) looks the same. As I mention earlier, population skewness incorporates N rather than N-1.
Figure 7-8 shows two histograms. The first has a peak at its center; the second is flat. The first is said to be leptokurtic. its kurtosis is positive. The second is platykurtic; its kurtosis is negative.
Here’s how. The formula for kurtosis is
where is the mean of the scores, N is the number of scores, and s is the standard deviation.
Uh, why 3? The 3 comes into the picture because that’s the kurtosis of something special called the standard normal distribution. (I discuss normal distributions at length in Chapter 8.) Technically, statisticians refer to this formula as kurtosis excess — meaning that it shows the kurtosis in a set of scores that’s in excess of the standard normal distribution’s kurtosis. If you’re about to ask the question “Why is the kurtosis of the standard normal distribution equal to 3?” don’t ask.
This is another formula you’ll probably never use because Excel’s KURT
function takes care of business. Figure 7-9 shows the scores from the preceding example, a selected cell, and the Function Arguments dialog box for KURT
.
To use KURT
:
Enter your numbers into a worksheet and select a cell for the result.
For this example, I enter scores into the first ten rows of columns B, C, D, and E. I select cell H2 for the result.
KURT
.In the Function Arguments dialog box, enter the appropriate values for the arguments.
In the Number1 box, I enter the array of cells that holds the data. Here, the array is B1:E10. With the data array entered, the Function Arguments dialog box shows the kurtosis, which for this example is negative.
Although the calculations for skewness and kurtosis are all well and good, it’s helpful to see how the scores are distributed. To do this, you create a frequency distribution, a table that divides the possible scores into intervals and shows the number (the frequency) of scores that fall into each interval.
Excel gives you two ways to create a frequency distribution. One is a worksheet, and the other is a data analysis tool.
I show you the FREQUENCY
worksheet function in Chapter 2 when I introduce array functions. Here, I give you another look. In the upcoming example, I reuse the data from the skewness and kurtosis discussions so that you can see what the distribution of those scores looks like.
Figure 7-10 shows the data once again, along with a selected array, labeled Frequency. I’ve also added the label Intervals to a column, and in that column I put the interval boundaries. Each number in that column is the upper bound of an interval. The figure also shows the Function Arguments dialog box for FREQUENCY
.
This is an array function, so the steps are a bit different from the functions I show you earlier in this chapter:
Enter the scores into an array of cells.
The array, as in the preceding examples, is B1:E10.
Enter the intervals into an array.
I entered 5, 10, 15, 20, 25, and 30 into G2:G7.
Select an array for the resulting frequencies.
I put Frequency as the label at the top of column H, so I selected H2 through H7 to hold the resulting frequencies.
FREQUENCY
.In the Function Arguments dialog box, enter the appropriate values for the arguments.
In the Data_array box, I entered the cells that hold the scores. In this example, that’s B1:E10.
FREQUENCY
refers to intervals as bins, and holds the intervals in the Bins_array box. For this example, G2:G7 goes into the Bins_array box.
After I identified both arrays, the Function Arguments dialog box shows the frequencies inside a pair of curly brackets. Look closely at Figure 7-10 and you see that Excel adds a frequency of zero to the end of the set of frequencies in the third line of the dialog box.
Press Ctrl+Shift+Enter to close the Function Arguments dialog box.
Use this keystroke combination because FREQUENCY
is an array function.
When you close the Function Arguments dialog box, the frequencies go into the appropriate cells, as Figure 7-11 shows.
=FREQUENCY(Data,Interval)
which might be easier to understand than
=FREQUENCY(B1:E10,G2:G7)
(Don’t remember how to assign a name to a range of cells? Take a look at Chapter 2.)
Here’s another way to create a frequency distribution — with the Histogram data analysis tool. To show you that the two methods are equivalent, I use the data from the FREQUENCY
example. Figure 7-12 shows the data along with the Histogram dialog box.
The steps are listed here:
In the Histogram dialog box, enter the appropriate values.
The data are in cells B1 through E10, so B1:E10 goes into the Input Range box. The easiest way to enter this array is to click in B1, press and hold the Shift key, and then click in E10. Excel puts the absolute reference format ($B$1:$E$10) into the Input Range box.
In the Bin Range box, I enter the array that holds the intervals. In this example, that’s G2 through G7. I click in G2, press and hold the Shift key, and then click in G7. The absolute reference format ($G$2:$G$7) appears in the Bin Range box.
Figure 7-13 shows Histogram’s output. The table matches up with what FREQUENCY
produces. Notice that Histogram adds More to the Bin column. The size of the histogram is somewhat smaller when it first appears. I used the mouse to stretch the histogram and give it the appearance you see in the figure. The histogram shows that the distribution tails off to the left (consistent with the negative skewness statistic) and seems to not have a distinctive peak (consistent with the negative kurtosis statistic). Notice also the chart toolset (the three icons) that appears to the right of the histogram. The tools enable you to modify the histogram in a variety of ways. (See Chapter 3.)
By the way, the other check box options in the Histogram dialog box are Pareto Chart and Cumulative Percentage. The Pareto Chart option sorts the intervals in order, from highest frequency to lowest, before creating the graph. The Cumulative Percentage option shows the percentage of scores in an interval combined with the percentages in all the preceding intervals. Selecting this check box also puts a cumulative percentage line in the histogram. (You’d have to select both the Pareto Chart option and the Cumulative Percentage option to duplicate the effect of the Pareto chart I describe in Chapter 3.)
If you’re dealing with individual descriptive statistics, the worksheet functions I discuss get the job done nicely. If you want an overall report that presents nearly all the descriptive statistical information in one place, use the Data Analysis tool I describe in the next section.
In Chapter 2, I show you the Descriptive Statistics tool to introduce Excel’s data analysis tools. Here’s a slightly more complex example. Figure 7-14 shows three columns of scores and the Descriptive Statistics dialog box. I’ve labeled the columns First, Second, and Third so that you can see how this tool incorporates labels.
Here are the steps for using this tool:
In the Descriptive Statistics dialog box, enter the appropriate values.
In the Input Range box, I enter the data. The easiest way to do this is to move the cursor to the upper-left cell (C1), press the Shift key, and click in the lower-right cell (E9). That puts $C$1:$E$9 into Input Range.
Click OK to close the dialog box.
The new tabbed sheet (ply) opens, displaying statistics that summarize the data.
As Figure 7-15 shows, the statistics summarize each column separately. When this page first opens, the columns that show the statistic names are too narrow, so the figure shows what the page looks like after I widen the columns.
The Descriptive Statistics tool gives values for these statistics: Mean, standard error, median, mode, standard deviation, sample variance, kurtosis, skewness, range, minimum, maximum, sum, and count. Except for standard error and range, I discuss all of them.
Range is just the difference between the maximum and the minimum. Standard error is more involved, and I defer the explanation until Chapter 9. For now, I’ll just say that standard error is the standard deviation divided by the square root of the sample size and leave it at that.
By the way, one of the check boxes left deselected in the example’s Step 6 provides something called the Confidence Limit of the Mean, which I also defer until Chapter 9. The remaining two check boxes, Kth Largest and Kth Smallest, work like the functions LARGE
and SMALL
.
Quick Analysis was a wonderful addition to Excel 2013, but it still hasn’t made its way to the Mac. You select a range of data and an icon appears in the lower-right corner of the selection. Clicking the icon (or pressing Ctrl+Q) opens numerous possibilities for visualizing and summarizing the selected data. Mousing over these possibilities gives you a preview of what they look like. Selecting one puts it into your worksheet.
The worksheet in Figure 7-16 shows the percentages by age group that used the indicated media in 2006 (Source: U.S. Statistical Abstract). I selected the data, which caused the Quick Analysis icon to appear. Clicking the icon opened the panel with the options.
Figure 7-17 shows what happens when I mouse over FORMATTING | Data Bars.
Want to see what a column chart looks like? Mouse over CHARTS | Clustered Column. (See Figure 7-18.)
How about inserting the means? That’s TOTALS | Average, as in Figure 7-19.
I could go on all day with this, but I’ll just show you a couple more. If you’d like to add some professional-looking table effects to the selection, try TABLES | Table. (See Figure 7-20.)
I couldn’t finish off this topic without taking a look at the sparklines in Quick Analysis. When I mouse over SPARKLINES | Column, the result is Figure 7-21. The columns give a concise look at important age-related trends: Contrast the Internet sparkline with the sparklines for Newspapers and TV.
Suppose that you’re working with a cell range full of data. You might like to quickly know the status of the average and perhaps some other descriptive statistics about the data without going to the trouble of using several statistical functions. You can customize the status bar at the bottom of the worksheet to track these values for you and display them whenever you select the cell range. To do this, right-click the status bar to open the Customize Status Bar menu. (See Figure 7-22.) In the area third from the bottom, selecting all the items displays the values I mention in the preceding section (along with the count of items in the range — numerical and non-numerical).
Figure 7-23 shows these values displayed on the status bar for the cells I selected.