Chapter 21

Ten Statistical and Graphical Tips and Traps

IN THIS CHAPTER

Determining significance

Being wary of graphs

Being cautious with regression

Using concepts carefully

The world of statistics is full of pitfalls, but it’s also full of opportunities. Whether you’re a user of statistics or someone who has to interpret them, it’s possible to fall into the pitfalls. It’s also possible to walk around them. Here are ten tips and traps from the areas of hypothesis testing, regression, correlation, and graphs.

Significant Doesn’t Always Mean Important

As I say earlier in the book, significance is, in many ways, a poorly chosen term. When a statistical test yields a significant result, and the decision is to reject H0, that doesn’t guarantee that the study behind the data is an important one. Statistics can only help decision making about numbers and inferences about the processes that produced them. They can’t make those processes important or earth shattering. Importance is something you have to judge for yourself — and no statistical test can do that for you.

Trying to Not Reject a Null Hypothesis Has a Number of Implications

Let me tell you a story: Some years ago, an industrial firm was trying to show that it was finally in compliance with environmental clean-up laws. The company took numerous measurements of the pollution in the body of water surrounding its factory, compared the measurements with a null hypothesis-generated set of expectations, and found that it couldn’t reject H0 with α = .05. The measurements didn’t differ significantly (there’s that word again) from “clean” water.

This, the company claimed, was evidence that it had cleaned up its act. Closer inspection revealed that the data approached significance, but the pollution wasn’t quite of a high enough magnitude to reject H0. Does this mean the company is not polluting?

Not at all. In striving to “prove” a null hypothesis, the company had stacked the deck in its own favor. It set a high barrier to get over, didn’t clear it, and then patted itself on the back.

Every so often, it’s appropriate to try to not reject H0. When you set out on that path, be sure to set a high value of α (about .20–.30), so that small divergences from H0 cause rejection of H0. (I discuss this topic in Chapter 10, and I mention it in other parts of the book. I think it’s important enough to mention again here.)
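If you go down that road in Excel, here’s a rough sketch of what the comparison might look like. The cell range and the “clean water” benchmark of 5 are made up for illustration; Z.TEST returns the one-tailed p-value for testing whether the mean of the measurements exceeds the benchmark:

=Z.TEST(A2:A31, 5)

=IF(Z.TEST(A2:A31, 5) < 0.20, "Reject H0", "Do not reject H0")

The first formula gives the p-value. The second compares it with the lenient criterion of .20 rather than the usual .05, so that even modest evidence of pollution counts against H0.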

Regression Isn’t Always Linear

When trying to fit a regression model to a scatterplot, the temptation is to immediately use a line. This is the best-understood regression model, and when you get the hang of it, slopes and intercepts aren’t all that daunting.

But linear regression isn’t the only kind of regression. It’s possible to fit a curve through a scatterplot. I won’t kid you: The statistical concepts behind curvilinear regression are more difficult to understand than the concepts behind linear regression.

It’s worth taking the time to master those concepts, however. Sometimes, a curve is a much better fit than a line. (This is partly a plug for Chapter 22, where I take you through curvilinear regression — and some of the concepts behind it.)
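If you’d like a preview in Excel, here’s a sketch (the cell ranges are made up): LINEST can fit a second-degree polynomial if you hand it the x-values raised to the powers 1 and 2:

=LINEST(B2:B20, A2:A20^{1,2})

In Excel 365 the result spills into three cells; in older versions, you array-enter it with Ctrl+Shift+Enter. Reading from left to right, the three values are the coefficient of x², the coefficient of x, and the intercept.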

Extrapolating Beyond a Sample Scatterplot Is a Bad Idea

Whether you’re working with linear regression or curvilinear regression, keep in mind that it’s inappropriate to generalize beyond the boundaries of the scatterplot.

Suppose you’ve established a solid predictive relationship between a test of mathematics aptitude and performance in mathematics courses, and your scatterplot covers only a narrow range of mathematics aptitude. You have no way of knowing whether the relationship holds up beyond that range. Predictions outside that range aren’t valid.

Your best bet is to expand the scatterplot by testing more people. You might find that the original relationship tells only part of the story.
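If you’re generating predictions in Excel, one way to build in this discipline is a sketch like the following (the ranges and the new x-value in D2 are made up):

=IF(AND(D2 >= MIN(A2:A31), D2 <= MAX(A2:A31)), TREND(B2:B31, A2:A31, D2), "Outside the observed range")

TREND fits the least-squares line to the x-values in column A and the y-values in column B and predicts a y for the value in D2, but only when D2 falls inside the range of x-values you actually observed.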

Examine the Variability Around a Regression Line

Careful analysis of residuals (the differences between observed and predicted values) can tell you a lot about how well the line fits the data. A foundational assumption is that variability around a regression line is the same up and down the line. If it isn’t, the model might not be as predictive as you think. If the variability is systematic (greater variability at one end than at the other), curvilinear regression might be more appropriate than linear. The standard error of estimate alone won’t always tip you off; you have to look at the residuals themselves.
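Here’s a quick way to eyeball the residuals in Excel, with made-up ranges: Put the predictions next to the observations and subtract. With x-values in A2:A31 and y-values in B2:B31, enter these in C2 and D2 and copy them down:

=TREND($B$2:$B$31, $A$2:$A$31, A2)

=B2-C2

Column C holds the predicted values, and column D holds the residuals (observed minus predicted). Plot column D against column A: If the residuals fan out at one end of the line, the same-variability assumption is in trouble. For the overall spread, =STEYX(B2:B31, A2:A31) gives the standard error of estimate, but as I just said, that single number won’t always reveal a systematic pattern.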

A Sample Can Be Too Large

Believe it or not, this sometimes happens with correlation coefficients. A very large sample can make a small correlation coefficient statistically significant. For example, with 100 degrees of freedom and α = .05, a correlation coefficient of .195 is cause for rejecting the null hypothesis that the population correlation coefficient is equal to zero.

But what does that correlation coefficient really mean? The coefficient of determination, r², is just .038, meaning that SSRegression is less than 4 percent of SSTotal. (See Chapter 15.) That’s a very small association.
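If you want to run the numbers yourself, here’s a sketch with made-up cell ranges. CORREL and RSQ give you r and r², and the last two formulas carry out the t-test of the hypothesis that the population correlation is zero (with r in cell F1 and 100 degrees of freedom):

=CORREL(A2:A103, B2:B103)

=RSQ(A2:A103, B2:B103)

=F1*SQRT(100)/SQRT(1-F1^2)

=T.DIST.2T(ABS(F1*SQRT(100)/SQRT(1-F1^2)), 100)

With r = .195 and 100 degrees of freedom, t comes out to about 1.99 and the two-tailed p-value to just under .05. Significant? Technically, yes. Impressive? Not really.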

Bottom line: When looking at a correlation coefficient, be aware of the sample size. If it’s large enough, it can make a trivial association turn out statistically significant. (Hmmm … significance — there it is again!)

Consumers: Know Your Axes

When you look at a graph, make sure you know what’s on each axis. Make sure you understand the units of measure. Do you understand the independent variable? Do you understand the dependent variable? Can you describe each one in your own words? If the answer to any of these questions is "No," you don’t understand the graph you’re looking at.

When looking at a graph in a TV ad, be very wary if it disappears too quickly, before you can see what’s on the axes. The advertiser may be trying to create a lingering false impression about a bogus relationship inside the graph. The graphed relationship might be as valid as that other staple of TV advertising — scientific proof via animated cartoon: Tiny animated scrub brushes cleaning cartoon teeth might not necessarily guarantee whiter teeth for you if you buy the product. (I know that’s off-topic, but I had to get it in.)

Graphing a Categorical Variable as Though It’s a Quantitative Variable Is Just Wrong

So you’re just about ready to compete in the Rock-Paper-Scissors World Series. In preparation for this international tournament, you’ve tallied all your matches from the past ten years, listing the percentage of times you won when you played each role.

To summarize all the outcomes, you’re about to use Excel’s graphics capabilities to create a graph. One thing’s sure: Whatever your preference rock-paper-scissors-wise, the graph absolutely, positively had better not look like Figure 21-1.

FIGURE 21-1: Absolutely the wrong way to graph categorical data.

So many people create these kinds of graphs — people who should know better. The line in the graph implies continuity from one point to another. With these data, of course, that’s impossible. What’s between rock and paper? Why are they equal units apart? Why are the three categories in that order? (Can you tell this is my pet peeve?)

Simply put, a line graph is not the proper graph when at least one of your variables is a set of categories. Instead, create a column graph. A pie chart works here, too, because the data are percentages and you have just a few slices. (See Chapter 3 for Yogi Berra’s pie-slice guidelines.)

When I wrote the first edition of this book, I whimsically came up with the idea of a Rock-Paper-Scissors World Series for this example. Between then and now, I found out … there really is one! (The World RPS Society puts it on.)

Whenever Appropriate, Include Variability in Your Graph

When the points in your graph represent means, make sure that the graph includes the standard error of each mean. This gives the viewer an idea of the variability in the data, and that’s an important part of the story. Here’s another plug: In Chapter 22, I show you how to do that in Excel.
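The standard error of a mean is just the standard deviation divided by the square root of the sample size. Here’s what that looks like in Excel, with a made-up range:

=STDEV.S(A2:A31)/SQRT(COUNT(A2:A31))

Calculate one of these for each group, and then add them to the chart as custom error bars around each mean.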

Means by themselves don’t always tell you the whole story. Take every opportunity to examine variances and standard deviations. You may find some hidden nuggets. Systematic variation — high values of variance associated with large means, for example — might be a clue about a relationship you didn’t see before.

Be Careful When Relating Statistics Textbook Concepts to Excel

If you’re serious about doing statistical work, you’ll probably have occasion to look into a statistics text or two. Bear in mind that the symbols in some areas of statistics aren’t standard: For example, some texts use M rather than x̄ to represent the sample mean, and some represent a deviation from the mean with just x.

Connecting textbook concepts to Excel’s statistical functions can be a challenge because of the texts and because of Excel. Messages in dialog boxes and in Help files might contain symbols other than the ones you read about, or they might use the same symbols but in a different way. This discrepancy might lead you to make an incorrect entry into a parameter in a dialog box, resulting in an error that’s hard to trace.
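Here’s one example of the kind of mismatch I mean, with a made-up range. A textbook’s s² and σ² map onto two different Excel functions, and picking the wrong one quietly changes your answer:

=VAR.S(A2:A31)

=VAR.P(A2:A31)

VAR.S divides the sum of squared deviations by n − 1 (what most texts call the sample variance, s²); VAR.P divides by n (the population variance, σ²). The same goes for STDEV.S and STDEV.P.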
