Chapter 2: Statistics Review

Introduction

Fundamental Concepts 1 and 2

FC1: Always Take a Random and Representative Sample

FC2: Remember That Statistics Is Not an Exact Science

Fundamental Concept 3: Understand a Z-Score

Fundamental Concept 4

FC4: Understand the Central Limit Theorem

Learn from an Example

Fundamental Concept 5

Understand One-Sample Hypothesis Testing

Consider p-Values

Fundamental Concept 6

Understand That Few Approaches/Techniques Are Correct—Many Are Wrong

Three Possible Outcomes When You Choose a Technique

Introduction

Regardless of the academic field of study—business, psychology, or sociology—the first applied statistics course introduces the following statistical foundation topics:

   descriptive statistics

   probability

   probability distributions (discrete and continuous)

   sampling distribution of the mean

   confidence intervals

   one-sample hypothesis testing and perhaps two-sample hypothesis testing

   simple linear regression

   multiple linear regression

   ANOVA

Not considering the mechanics or processes of performing these statistical techniques, what fundamental concepts should you remember? We believe there are six fundamental concepts:

   FC1: Always take a random and representative sample.

   FC2: Statistics is not an exact science.

   FC3: Understand a z-score.

   FC4: Understand the central limit theorem (not every distribution has to be bell-shaped).

   FC5: Understand one-sample hypothesis testing and p-values.

   FC6: Few approaches are correct; many are wrong.

Let’s examine each concept further.

Fundamental Concepts 1 and 2

The first two fundamental concepts explain why we take a random and representative sample and that the sample statistics are estimates that vary from sample to sample.

FC1: Always Take a Random and Representative Sample

What is a random and representative sample (called a 2R sample)? Here, representative means representative of the population of interest. A good example is state election polling. You do not want to sample everyone in the state. First, an individual must be old enough and registered to vote; you cannot vote if you are not registered. Next, not everyone who is registered actually votes, so you must ask whether a given registered voter plans to vote. You are not interested in individuals who do not plan to vote; you don’t care about their voting preferences because they will not affect the election. Thus, the population of interest is those individuals who are registered to vote and plan to vote.

From this representative population of registered voters who plan to vote, you want to choose a random sample. Random means that each individual has an equal chance of being selected. Suppose that there is a huge container of balls, one for each individual who is identified as registered and planning to vote. From this container, you draw a certain number of balls, without replacing any ball. In such a case, each individual has an equal chance of being drawn.

You want the sample to be a 2R sample, but why? For two related reasons. First, if the sample is a 2R sample, then the sample distribution of observations will follow a pattern resembling that of the population. Suppose that the population distribution of interest is the weights of sumo wrestlers and horse jockeys (sort of a ridiculous distribution of interest, but that should help you remember why it is important). What does the shape of the population distribution of weights of sumo wrestlers and jockeys look like? Probably somewhat like the distribution in Figure 2.1. That is, it’s bimodal, or two-humped.

Figure 2.1: Population Distribution of the Weights of Sumo Wrestlers and Jockeys

image

If you take a 2R sample, the distribution of sampled weights will look somewhat like the population distribution in Figure 2.2, where the solid line is the population distribution and the dashed line is the sample distribution.

Figure 2.2: Population and a Sample Distribution of the Weights of Sumo Wrestlers and Jockeys

image

Why not exactly the same? Because it is a sample, not the entire population. It can differ, but just slightly. If the sample were the entire population, then it would look exactly the same. Again, so what? Why is this so important?

The population parameters (such as the population mean, µ, the population variance, σ2, or the population standard deviation, σ) are the true values of the population. These are the values that you are interested in knowing. You would know these values exactly only if you were to sample the entire population of interest (that is, take a census). In most real-world situations, this would require a prohibitively large number of observations (costing too much and taking too much time).

Because the sample is a 2R sample, the sample distribution of observations is very similar to the population distribution of observations. Therefore, the sample statistics, calculated from the sample, are good estimates of their corresponding population parameters. That is, statistically they will be relatively close to their population parameters because you took a 2R sample. For these reasons, you take a 2R sample.

FC2: Remember That Statistics Is Not an Exact Science

The sample statistics (such as the sample mean, sample variance, and sample standard deviation) are estimates of their corresponding population parameters. It is highly unlikely that they will equal their corresponding population parameter. It is more likely that they will be slightly below or slightly above the actual population parameter, as shown in Figure 2.2.

Further, if another 2R sample is taken, most likely the sample statistics from the second sample will be different from the first sample. They will be slightly less or more than the actual population parameter.

For example, suppose that a company’s union is on the verge of striking. You take a 2R sample of 2,000 union workers. Assume that this sample size is statistically large. Out of the 2,000, 1,040 of them say that they are going to strike. First, 1,040 out of 2,000 is 52%, which is greater than 50%. Can you therefore conclude that they will go on strike? Given that 52% is an estimate of the percentage of all union workers who are willing to strike, you know that another 2R sample will provide another percentage: perhaps higher, perhaps lower, and perhaps even less than 50%. By using statistical techniques, you can test the likelihood of the population parameter being greater than 50%. (You can construct a confidence interval; if the lower confidence limit is greater than 50%, you can be highly confident that the true population proportion is greater than 50%. Or you can conduct a hypothesis test to measure the likelihood that the proportion is greater than 50%.)
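To make the strike example concrete, here is a minimal sketch in Python (our own illustration; the book’s workflow uses Excel and JMP) of the confidence interval and one-sided hypothesis test just described, with scipy supplying the normal distribution:

```python
import math
from scipy.stats import norm

n = 2000        # 2R sample size (from the example)
x = 1040        # workers who say they will strike
p_hat = x / n   # sample proportion = 0.52

# 95% confidence interval for the population proportion
se = math.sqrt(p_hat * (1 - p_hat) / n)
z = norm.ppf(0.975)
lower, upper = p_hat - z * se, p_hat + z * se
print(f"95% CI: ({lower:.4f}, {upper:.4f})")

# One-sided test of H0: p <= 0.50 versus H1: p > 0.50
se0 = math.sqrt(0.5 * 0.5 / n)          # standard error under H0
z_calc = (p_hat - 0.5) / se0
p_value = 1 - norm.cdf(z_calc)
print(f"z = {z_calc:.3f}, p-value = {p_value:.4f}")
```

With these counts, the lower limit of the 95% confidence interval falls just below 50%, which is exactly the kind of sampling uncertainty this concept warns about.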

Bottom line: When you take a 2R sample, your sample statistics will be good (statistically relatively close, that is, not too far away) estimates of their corresponding population parameters. And you must realize that these sample statistics are estimates, in that, if other 2R samples are taken, they will produce different estimates.

Fundamental Concept 3: Understand a Z-Score

Suppose that you are sitting in on a marketing meeting. The marketing manager is presenting the past performance of one product over the past several years. Some of the statistical information that the manager provides is the average monthly sales and the standard deviation. (More than likely, the manager would not present the standard deviation, but a quick conservative estimate of it is (Max − Min)/4; the manager most likely would give the minimum and maximum values.)

Suppose that the average monthly sales are $500 million and the standard deviation is $10 million. The marketing manager then presents a new advertising campaign that he or she claims will increase sales to $570 million per month. Suppose also that the new advertising looks promising. What is the likelihood of this happening? Calculate the z-score as follows:

z = (x − μ) / σ = (570 − 500) / 10 = 7

The z-score (and the t-score) is not just a number. The z-score is the number of standard deviations that a value, such as the 570, lies from the mean of 500. The z-score can provide you some guidance, regardless of the shape of the distribution. A z-score greater than 3 in absolute value is considered an outlier and highly unlikely. In the example, if the new marketing campaign is as effective as suggested, the likelihood of increasing monthly sales by 7 standard deviations is extremely low.

On the other hand, what if you calculated the standard deviation and it was $50 million? The z-score is now 1.4. As you might expect, such an increase can occur; depending on how promising you find the new advertising campaign, you might well believe that it will. So the value of $570 million can be far from, or close to, the mean of $500 million. It depends on the spread of the data, which is measured by the standard deviation.

In general, the z-score is like a traffic light. If it is greater than the absolute value of 3 (denoted |3|), the light is red; this is an extreme value. If the z-score is between |1.65| and |3|, the light is yellow; this value is borderline. If the z-score is less than |1.65|, the light is green, and the value is just considered random variation. (The cutpoints of 3 and 1.65 might vary slightly depending on the situation.)
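The traffic-light rule is easy to express in code. Here is a minimal Python sketch (the function name and the cutpoints of 1.65 and 3 simply restate the rule above; they are not from any library):

```python
def z_score_signal(x, mean, sd):
    """Classify a value by how many standard deviations it lies from the mean."""
    z = (x - mean) / sd
    if abs(z) > 3:
        return z, "red: extreme value"
    elif abs(z) >= 1.65:
        return z, "yellow: borderline"
    else:
        return z, "green: likely random variation"

# The marketing example: mean of $500M, standard deviation of $10M, claim of $570M
print(z_score_signal(570, 500, 10))   # (7.0, 'red: extreme value')
# With a standard deviation of $50M instead, the claim is unremarkable
print(z_score_signal(570, 500, 50))   # (1.4, 'green: likely random variation')
```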

Fundamental Concept 4

This concept is where most students become lost in their first statistics class. They complete their statistics course thinking every distribution is normal or bell-shaped, but that is not true. However, if the FC1 assumption is not violated and the central limit theorem holds, then something called the sampling distribution of the sample means will be bell-shaped. And this sampling distribution is used for inferential statistics; that is, it is applied in constructing confidence intervals and performing hypothesis tests.

FC4: Understand the Central Limit Theorem

If you take a 2R sample, the histogram of the sample distribution of observations will be close to the histogram of the population distribution of observations (FC1). You also know that the sample mean from sample to sample will vary (FC2).

Suppose that you actually know the value of the population mean, that you took every possible sample of size n (where n is any number greater than 30), and that you calculated the sample mean for each sample. Given all these sample means, you then produce a frequency distribution and corresponding histogram of the sample means. You call this distribution the sampling distribution of sample means. A good number of the sample means will be slightly below or slightly above the population mean, fewer will be farther away (above and below), and each sample mean is equally likely to be greater than or less than the population mean. If you try to visualize this, the distribution of all these sample means will be bell-shaped, as in Figure 2.3. This should make intuitive sense.

Figure 2.3: Population Distribution and Sample Distribution of Observations and Sampling Distribution of the Means for the Weights of Sumo Wrestlers and Jockeys

image

Nevertheless, there is one major problem. To get this distribution of sample means, every possible sample of size n would need to be collected and analyzed. That, in most cases, is an enormous number of samples and would be prohibitive. Also, in the real world, you take only one 2R sample.

This is where the central limit theorem (CLT) comes to our rescue. The CLT holds regardless of the shape of the population distribution of observations—whether it is normal, bimodal (like the sumo wrestlers and jockeys), or any other shape, as long as a 2R sample is taken and the sample size is greater than 30. Then the sampling distribution of sample means will be approximately normal, with a mean equal to the population mean and a standard deviation of σ / √n (which is called the standard error and is estimated from the sample by s / √n).

What does this mean in terms of performing statistical inferences about the population? You do not have to take an enormous number of samples. You need to take only one 2R sample with a sample size greater than 30. In most situations, this will not be a problem. (If it is an issue, you should use nonparametric statistical techniques.) If you have a 2R sample greater than 30, you can approximate the sampling distribution of sample means by using the sample’s x̄ and standard error, s / √n. If you collect a 2R sample greater than 30, the CLT holds. As a result, you can use inferential statistics. That is, you can construct confidence intervals and perform hypothesis tests. The fact that you can approximate the sampling distribution of the sample means by taking only one 2R sample greater than 30 is rather remarkable and is why the CLT is known as the “cornerstone of statistics.”
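If you want to see the CLT at work before turning to the Excel example below, the following Python sketch simulates it. The 30% sumo share matches the example file used next; the specific weight distributions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Illustrative bimodal population: 30% sumo wrestlers, 70% jockeys
n_pop = 5000
is_sumo = rng.random(n_pop) < 0.30
weights = np.where(is_sumo,
                   rng.normal(400, 30, n_pop),   # sumo wrestler weights (lbs)
                   rng.normal(115, 8, n_pop))    # jockey weights (lbs)

mu, sigma = weights.mean(), weights.std()

# Take many 2R samples of size 30 and record each sample mean
n, n_samples = 30, 1000
sample_means = np.array([rng.choice(weights, size=n, replace=False).mean()
                         for _ in range(n_samples)])

print(f"population mean = {mu:.1f}, sigma/sqrt(n) = {sigma / np.sqrt(n):.2f}")
print(f"mean of sample means = {sample_means.mean():.1f}, "
      f"sd of sample means = {sample_means.std():.2f}")
# A histogram of sample_means is approximately bell-shaped and centered on the
# population mean, even though the population itself is sharply bimodal.
```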

Learn from an Example

The implications of the CLT can be further illustrated with an empirical example. The example that you will use is the population of the weights of sumo wrestlers and jockeys.

Open the Excel file called SumowrestlersJockeysnew.xls and go to the first worksheet, called “data.” In column A, you see the generated population of 5,000 sumo wrestlers’ and jockeys’ weights, with 30% of them being sumo wrestlers.

First, you need the Excel Data Analysis add-in. (If you have loaded it already, you can jump to the next paragraph.) To load the Data Analysis add-in:1

1.   Click File from the list of options at the top of the window. A box of options will appear.

2.   On the left side toward the bottom, click Options. A dialog box will appear with a list of options on the left.

3.   Click Add-Ins. The right side of this dialog box will now list Add-Ins. Toward the bottom of the dialog box there will appear the following:
image

4.   Click Go. A new dialog box will appear listing the available Add-Ins, each with a check box on the left. Select the check boxes for Analysis ToolPak and Analysis ToolPak - VBA. Then click OK.

Now, you can generate the population distribution of weights:

1.   Click Data on the list of options at the top of the window. Then click Data Analysis. A new dialog box will appear with an alphabetically ordered list of Analysis tools.

2.   Click Histogram and OK.

3.   In the Histogram dialog box, for the Input Range, enter $A$2:$A$5001; for the Bin Range, enter $H$2:$H$37; for the Output range, enter $K$1. Then click the options Cumulative Percentage and Chart Output and click OK, as in Figure 2.4.

Figure 2.4: Excel Data Analysis Tool Histogram Dialog Box

image

A frequency distribution and histogram similar to Figure 2.5 will be generated.

Figure 2.5: Results of the Histogram Data Analysis Tool

image

Given the population distribution of sumo wrestlers and jockeys, you will generate a random sample of 30 and a corresponding dynamic frequency distribution and histogram (you will understand the term dynamic shortly):

1.   Select the 1 random sample worksheet. In columns C and D, you will find percentages that are based on the cumulative percentages in column M of the worksheet data. Also, in column E, you will find the average (or midpoint) of that particular range.

2.   In cell K2, enter =RAND(). Copy and paste K2 into cells K3 to K31.

3.   In cell L2, enter =VLOOKUP(K2,$C$2:$E$37,3). Copy and paste L2 into cells L3 to L31. (In this case, the VLOOKUP function finds the row in $C$2:$E$37 whose value in column C is the largest value less than or equal to K2 and returns the value found in the third column (column E) of that row. A Python sketch of this lookup idea appears after this list.)

4.   You have now generated a random sample of 30. If you press F9, the random sample will change.

5.   To produce the corresponding frequency distribution (and be careful!), highlight the cells P2 to P37. In cell P2, enter the following: =FREQUENCY(L2:L31,O2:O37). Before pressing Enter, simultaneously hold down Ctrl and Shift and then press Enter. The FREQUENCY function finds the frequency for each bin in O2:O37 over the cells L2:L31, and the key combination creates an array formula. Again, as you press the F9 key, the random sample and corresponding frequency distribution change. (Hence, it is called a dynamic frequency distribution.)

a.   To produce the corresponding dynamic histogram, highlight the cells P2 to P37. Click Insert from the top list of options. Click the Chart type Column icon. An icon menu of column graphs is displayed. Under 2-D Column, click the leftmost icon. A histogram of your frequency distribution is produced, similar to Figure 2.6.

b.   To add the axis labels, under the group of Chart Tools at the top of the screen (remember to click on the graph), click Layout. A menu of options appears below. Select Axis Titles ▶ Primary Horizontal Axis Title ▶ Title Below Axis. Type Weights and press Enter. For the vertical axis, select Axis Titles ▶ Primary Vertical Axis Title ▶ Vertical Title. Type Frequency and press Enter.

c.   If you press F9, the random sample changes, the frequency distribution changes, and the histogram changes. As you can see, the histogram is definitely not bell-shaped and does look somewhat like the population distribution in Figure 2.5.
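What step 3 implements is inverse-transform sampling: a uniform random number is looked up against the cumulative percentages to return a bin midpoint. Here is a minimal Python sketch of the same idea (the cumulative percentages and midpoints are illustrative stand-ins for columns C through E of the worksheet):

```python
import numpy as np

rng = np.random.default_rng()

# Stand-ins for the worksheet's lookup table: the cumulative percentage at the
# start of each bin and the corresponding bin midpoint (weights in lbs)
cum_pct   = np.array([0.00, 0.35, 0.70, 0.85])
midpoints = np.array([108.0, 122.0, 390.0, 430.0])

u = rng.random(30)                                    # like =RAND() in K2:K31
idx = np.searchsorted(cum_pct, u, side="right") - 1   # like VLOOKUP's range match
sample = midpoints[idx]                               # the random sample of 30

print(sample[:10])
```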

Now, go to the sampling distribution worksheet. Much in the way that you generated a random sample in the 1 random sample worksheet, 50 random samples, each of size 30, were generated in columns L to BI. Below each random sample, the average of that sample is calculated in row 33. Further, in column BL is the dynamic frequency distribution, with a corresponding histogram of the 50 sample means. If you press F9, the 50 random samples, averages, frequency distribution, and histogram all change. The histogram of the sampling distribution of sample means (which is based on only 50 samples, not on every possible sample) is not bimodal but is, for the most part, bell-shaped.

Figure 2.6: Histogram of a Random Sample of 30 Sumo Wrestlers’ and Jockeys’ Weights

image

Fundamental Concept 5

One of the inferential statistical techniques that you can apply, thanks to the CLT, is one-sample hypothesis testing of the mean.

Understand One-Sample Hypothesis Testing

Generally speaking, hypothesis testing consists of two hypotheses: the null hypothesis, called H0, and the opposite of H0, the alternative hypothesis, called H1 or Ha. The null hypothesis for one-sample hypothesis testing of the mean tests whether the population mean is equal to, less than or equal to, or greater than or equal to a particular constant: µ = k, µ ≤ k, or µ ≥ k. An excellent analogy for hypothesis testing is the judicial system. The null hypothesis, H0, is that you are innocent, and the alternative hypothesis, H1, is that you are guilty.

Once the hypotheses are identified, the statistical test statistic is calculated. For simplicity’s sake, only the z-test is discussed here, although most of what is presented is pertinent to other statistical tests, such as t, F, and χ2. The calculated test statistic is called Zcalc, and it is compared to what is here called the critical z, Zcritical. The Zcritical value is based on what is called a level of significance, α, which is usually equal to 0.10, 0.05, or 0.01. The level of significance is the probability of rejecting H0 when H0 is actually correct; that is, the probability of making a Type I error. Relating this to the judicial system, it is the probability of wrongly finding someone guilty when in reality they are innocent. So you want to keep the level of significance rather small. Remember that statistics is not an exact science: you are dealing with estimates of the actual values. (The only way that you can be completely certain is to use the entire population.) So you want to keep the likelihood of making such an error relatively small.

There are two possible statistical decisions and conclusions, based on comparing the two z-values, Zcalc and Zcritical. If |Zcalc| > |Zcritical|, you reject H0. When you reject H0, there is enough statistical evidence to support H1. That is, in terms of the judicial system, there is overwhelming evidence to conclude that the individual is guilty. On the other hand, you fail to reject H0 when |Zcalc| ≤ |Zcritical|, and you conclude that there is not enough statistical evidence to support H1. The judicial system would then say that the person is innocent, but, in reality, this is not necessarily true. You just did not have enough evidence to say that the person is guilty.

As discussed under FC3, “Understand a Z-Score,” |Zcalc| is not simply a number. It represents the number of standard deviations that a value lies from the mean; in this case, from the hypothesized value used in H0. So you reject H0 when you have a relatively large |Zcalc|, that is, when |Zcalc| > |Zcritical|. In this situation, the value is a relatively large number of standard deviations away from the hypothesized value. Conversely, when you have a relatively small |Zcalc| (that is, |Zcalc| ≤ |Zcritical|), you fail to reject H0: the value is relatively near the hypothesized value, and the difference could be due simply to random variation.
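Here is a minimal Python sketch of this decision rule (the data summary is invented for illustration; scipy supplies Zcritical from α):

```python
import math
from scipy.stats import norm

# Hypothetical summary: test H0: mu = 500 versus H1: mu != 500
mu0, xbar, s, n = 500, 508, 25, 49
alpha = 0.05

z_calc = (xbar - mu0) / (s / math.sqrt(n))   # standard error = s / sqrt(n)
z_critical = norm.ppf(1 - alpha / 2)         # two-tailed critical value (1.96)

if abs(z_calc) > abs(z_critical):
    print(f"|Zcalc| = {abs(z_calc):.2f} > {z_critical:.2f}: reject H0")
else:
    print(f"|Zcalc| = {abs(z_calc):.2f} <= {z_critical:.2f}: fail to reject H0")
```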

Consider p-Values

Instead of comparing the two z-values, Zcalc and Zcritical, another, more generalizable approach that can also be used with other hypothesis tests (such as t, F, and χ2) is the concept known as the p-value. The p-value is the probability, assuming that H0 is true, of observing a test statistic at least as extreme as the one calculated. In terms of the one-sample hypothesis test using the z, the p-value is the tail probability associated with Zcalc. So, as shown in Table 2.1, a relatively large |Zcalc| results in rejecting H0 and has a relatively small p-value. Alternatively, a relatively small |Zcalc| results in not rejecting H0 and has a relatively large p-value. The p-value and |Zcalc| have an inverse relationship: relatively large |Zcalc| values are associated with relatively small p-values, and relatively small |Zcalc| values are associated with relatively large p-values, as the sketch after Table 2.1 illustrates.

Table 2.1: Decisions and Conclusions to Hypothesis Tests in Relationship to the p-Value

Critical Value            p-Value        Statistical Decision   Conclusion
|Zcalc| > |Zcritical|     p-value < α    Reject H0              There is enough evidence to say that H1 is true.
|Zcalc| ≤ |Zcritical|     p-value ≥ α    Do Not Reject H0       There is not enough evidence to say that H1 is true.
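A short Python sketch makes the inverse relationship visible (the Zcalc values are arbitrary; the two-tailed p-values are computed with scipy):

```python
from scipy.stats import norm

for z_calc in (0.5, 1.0, 2.0, 3.0):
    # Two-tailed p-value associated with Zcalc
    p_value = 2 * (1 - norm.cdf(abs(z_calc)))
    print(f"|Zcalc| = {z_calc:.1f}  ->  p-value = {p_value:.4f}")
# The larger |Zcalc| is, the smaller the p-value, and vice versa.
```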

The general interpretation of a p-value is as follows:

   Less than 1%: There is overwhelming evidence that supports the alternative hypothesis.

   Between 1% and 5%: There is strong evidence that supports the alternative hypothesis.

   Between 5% and 10%: There is weak evidence that supports the alternative hypothesis.

   Greater than 10%: There is little to no evidence that supports the alternative hypothesis.

An excellent real-world example of p-values is the criterion that the U.S. Food and Drug Administration (FDA) uses to approve new drugs. A new drug is said to be effective if it has a p-value less than .05 (and the FDA does not change this threshold of .05). So a new drug is approved only if there is strong evidence that it is effective.

Fundamental Concept 6

In your first statistics course, a large and perhaps overwhelming number of approaches and techniques were presented. When do you use them? Do you remember why you use them? Some approaches/techniques should not even be considered with certain data. Two major questions should be asked when considering the use of a statistical approach or technique:

   Is it statistically appropriate?

   What will it possibly tell you?

Understand That Few Approaches/Techniques Are Correct—Many Are Wrong

An important factor to consider in deciding which technique to use is whether one or more of the variables is categorical or continuous. Categorical data can be nominal, such as gender, or ordinal, such as a Likert scale. Continuous data can have decimals (or no decimals, in which case the values are integers), and you can measure the distance between values. With categorical data, you cannot measure distance. Simply in terms of graphing, you would use bar and pie charts for categorical data but not for continuous data; graphing a continuous variable requires a histogram or box plot. When summarizing data, descriptive statistics are insightful for continuous variables, whereas a frequency distribution is much more useful for categorical data, as the sketch below shows.
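The same division of labor shows up directly in code. Here is a minimal pandas sketch (the column names echo the survey data in Illustration 1 below, but the values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Gender":     ["F", "M", "M", "F", "M"],            # nominal
    "Usefulness": [3, 4, 2, 5, 4],                      # ordinal (Likert scale)
    "Salary":     [41000, 52000, 38500, 61000, 47250],  # continuous
})

# Descriptive statistics are insightful for continuous variables...
print(df["Salary"].describe())

# ...while a frequency distribution suits categorical variables.
print(df["Gender"].value_counts())
print(df["Usefulness"].value_counts().sort_index())
```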

Illustration 1

To illustrate, use the data in Table 2.2, found in the worksheet rawdata of the file Countif.xls. The data consists of survey responses from 20 students who were asked how useful their statistics class was (column C), where 1 represents extremely not useful and 5 represents extremely useful, along with some individual descriptors: major (Business or Arts and Sciences (A&S)), gender, current salary, GPA, and years since graduating. Major and gender (and, correspondingly, gender code) are examples of nominal data. The Likert scale of usefulness is an example of ordinal data. Salary, GPA, and years are examples of continuous data.

Table 2.2: Data and Descriptive Statistics in the Countif.xls File and Worksheet Statistics

image

Some descriptive statistics, derived from Excel functions, are found in rows 25 to 29 of the stats worksheet. These descriptive statistics are valuable in understanding the continuous data. For example, because the average salary is somewhat less than the median, the salary data could be considered slightly left-skewed, with a minimum of $31,235 and a maximum of $65,437. Descriptive statistics for the categorical data are not very helpful. For example, for the usefulness variable, an average of 3.35 was calculated, slightly above the middle value of 3; a frequency distribution would give much more insight.

Next examine this data in JMP. First, however, you must read the data from Excel.

Ways JMP Can Access Data in Excel

There are three ways that you can open an Excel file in JMP. One way is similar to opening any file in JMP; another is directly from inside Excel (when JMP has been added to Excel as an add-in); and the third is by copying and pasting the data from Excel:

1.   To open the file in JMP, first open JMP. From the top menu, click File ▶ Open. Locate the Countif.xls Excel file on your computer and click it in the selection window. The Excel Import Wizard will appear, as shown in Figure 2.7. In the upper right corner, click the worksheet called rawdata. Click Import. The data table should then appear.

2.   If you want to open JMP from within Excel (and you are in worksheet rawdata), on the top Excel menu, click JMP. (Note: The first time you use this approach, select Preferences. Check the box for Use the first rows as column names. Click OK. Subsequent use of this approach does not require you to click Preferences.) Highlight cells A1:G23. Click Data Table. JMP should open, and the data table will appear.

3.   In Excel, copy the data, including the column names. In JMP, click File ▶ New ▶ Data Table. In the new data table, click Edit and select Paste with Column Names.

Figure 2.7: Excel Import Wizard Dialog Box

image

Figure 2.8: Modeling Types of Gender

image

Now that you have the data in the worksheet rawdata from the Excel file Countif.xls in JMP, let’s examine it.

In JMP, as illustrated in Figure 2.8, move your cursor to the Columns panel on top of the red bar chart symbol to the left of the variable Gender. The cursor should change and look like a hand.

Right-click. You will get three rows of options: continuous (which is grayed out), ordinal, and nominal. Next to Nominal will be a dark colored marker, which indicates the JMP software’s best guess of what type of data the column Gender is: Nominal.

If you move your cursor over the blue triangle beside Usefulness, you will see the dark colored marker next to Continuous. But the data is actually ordinal, so click Ordinal. JMP now considers that column ordinal (note that the blue triangle changes to a green bar chart).

Following the same process, change the column Gender code to nominal (the blue triangle now changes to a red bar chart). The data table should look like Figure 2.9. To save the file as a JMP file, first, in the Table panel, right-click Notes and select Delete. At the top menu, click File ▶ Save As, enter the filename Countif, and click OK.

Figure 2.9: The Data Table for Countif.jmp after Modeling Type Changes

image

At the top menu in JMP, select Analyze ▶ Distribution. The Distribution dialog box will appear. In this dialog box, click Major, hold down the Shift key, click Years, and release. All the variables should be highlighted, as in Figure 2.10.

Figure 2.10: The JMP Distribution Dialog Box

image

Click Y, Columns, and all the variables should be transferred to the box on the right. Click OK, and a new window will appear. Examine Figure 2.11 and your Distribution window in JMP. All the categorical variables (Major, Gender, Usefulness, and Gender code), whether nominal or ordinal, have frequency counts and a histogram but no descriptive statistics. The continuous variables have descriptive statistics and a histogram.

As shown in Figure 2.11, click the area/bar of the Major histogram for Business. You can immediately see the distribution of Business students within each variable; they are highlighted in each of the histograms.

Most of the time in JMP, if you are looking for more information to display or for statistical options, they can usually be found by clicking a red triangle. For example, notice in Figure 2.11 that, just to the left of each variable’s name, there is a red triangle. Click any one of these red triangles, and you will see a list of options. For example, click Histogram Options and deselect Vertical. Here’s another example: click the red triangle next to Summary Statistics (note that summary statistics are listed for continuous variables only), and click Customize Summary Statistics. Select the check box, or check boxes, for the summary statistics that you want displayed, such as Median, Minimum, or Maximum, and then click OK.

Figure 2.11: Distribution Output for Countif.jmp Data

image

Illustration 2

What if you want to further examine the relationship between Business and the other variables, or the relationship between any two of these variables (in essence, perform some bivariate analysis)? You can click any of the bars in the histograms to see the corresponding data in the other histograms. You could possibly look at every combination, but what is the right approach? JMP provides excellent direction. The bivariate diagram in the lower left of the new window, as in Figure 2.12, provides guidance on which technique is appropriate. For example:

1.   Select Analyze ▶ Fit Y by X.

2.   Drag Salary to the white box to the right of Y, Response (or click Salary and then click Y, Response).

3.   Similarly, click Years, hold down the left mouse button, and drag it to the white box to the right of X, Factor. The Fit Y by X dialog box should look like Figure 2.12. According to the lower left diagram in Figure 2.12, bivariate analysis will be performed.

4.   Click OK.

Figure 2.12: Fit Y by X Dialog Box

image

In the new Fit Y by X output, click the red triangle to the left of Bivariate Fit of Salary by Years, and click Fit Line. The output will look like Figure 2.13. The positive coefficient of 7743.7163 indicates a positive relationship: as Years increases, Salary also increases; that is, the slope is positive. In contrast, a negative relationship has a negative slope, so, as the X variable increases, the Y variable decreases. The RSquare value, or coefficient of determination, is 0.847457, which shows a strong relationship.

Figure 2.13: Bivariate Analysis of Salary by Years

image

RSquare values can range from 0 (no relationship) to 1 (an exact/perfect relationship). To obtain the correlation, you take the square root of the RSquare value and multiply it as follows:

   by 1 if it has a positive slope (as it is for this illustration), or

   by −1 if it has a negative slope.

This calculation results in what is called the correlation of the variables Salary and Years. Correlation values near −1 or 1 show strong linear associations. (A negative correlation implies a negative linear relationship, and a positive correlation implies a positive linear relationship.) Correlation values near 0 imply no linear relationship. In this example, Salary and Years have a very strong correlation of 0.920574 = 1 × √0.847457.
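These slope, RSquare, and correlation relationships can be verified with a few lines of Python (using scipy’s linregress; the Years and Salary values below are invented, not the Countif data):

```python
import numpy as np
from scipy.stats import linregress

years  = np.array([1, 2, 3, 5, 7, 9, 12, 15])
salary = np.array([33000, 35500, 39000, 42000, 48500, 52000, 58000, 64500])

fit = linregress(years, salary)
print(f"slope   = {fit.slope:.1f}")        # positive: salary rises with years
print(f"RSquare = {fit.rvalue**2:.4f}")    # coefficient of determination
# correlation = sign(slope) * sqrt(RSquare), which is simply r itself
print(f"corr    = {np.sign(fit.slope) * np.sqrt(fit.rvalue**2):.4f}")
print(f"r       = {fit.rvalue:.4f}")       # agrees with the line above
```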

On the other hand, what if you drag Major and Gender to Y, Response and X, Factor, respectively, in the Fit Y by X dialog box (Figure 2.12) and click OK? The bivariate analysis diagram in the lower left of Figure 2.12 suggests a contingency analysis. The contingency analysis output is shown in Figure 2.14.

The Mosaic Plot graphs the percentages from the contingency table. As shown in Figure 2.14, the Mosaic Plot visually shows what appears to be a significant difference in Gender by Major. However, looking at the χ2 test of independence results, the p-value, or Prob>ChiSq, is 0.1933. The χ2 test assesses whether the row variable is significantly related to the column variable; that is, in this case, is Gender related to Major and vice versa? With a p-value of 0.1933, you would fail to reject H0 and conclude that there is not a significant relationship between Major and Gender.

In general, using the χ2 test of independence when one or more of the expected values are less than 5 is not advisable. In this case, if you click the red triangle next to Contingency Table and click Expected, you will see the expected value in the last row of each cell, as seen in Figure 2.14. (You can observe that, for both A&S and Business in the Female row, the expected value is 4.5. So, in this circumstance, the χ2 test of independence and its results should be ignored.) A Python sketch of this check appears after Figure 2.14.

Figure 2.14: Contingency Analysis of Major by Gender

image
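The same χ2 test of independence, including the expected counts behind the rule of 5, can be reproduced in Python. The 2×2 counts below are illustrative, chosen so that the Female expected counts come out to 4.5, mirroring Figure 2.14; they are not the actual Countif data:

```python
import numpy as np
from scipy.stats import chi2_contingency

#                     A&S  Business
observed = np.array([[ 4,   5],    # Female
                     [ 7,   6]])   # Male

# correction=False gives the plain Pearson chi-square (no Yates correction)
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(f"chi-square = {chi2:.4f}, p-value = {p_value:.4f}, dof = {dof}")
print("expected counts:\n", expected)

# Rule of thumb: if any expected count is below 5, do not rely on the test
if (expected < 5).any():
    print("Warning: at least one expected count < 5; ignore these results.")
```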

As illustrated, JMP, through the bivariate analysis diagram of the Fit Y by X dialog box, helps the analyst select the proper statistical method to use. The Y variable is usually considered to be a dependent variable. For example, if the X variable is continuous and the Y variable is categorical (nominal or ordinal), then, according to the lower left diagram in Figure 2.12, logistic regression will be used. This will be discussed in Chapter 6. In another scenario, with the X variable categorical and the Y variable continuous, JMP will suggest one-way ANOVA, which will be discussed in Chapter 5. If two (or more) variables have no dependency (that is, they are interdependent), there are other techniques to use, as you will learn in this book.

Three Possible Outcomes When You Choose a Technique

Depending on the type of data, some techniques are appropriate and some are not. As you can see, one of the major factors is the type of data being considered: essentially, continuous or categorical. Although JMP is a great help, just because an approach/technique appears appropriate, before running it you need to step back and ask yourself what the results could provide. Part of that answer requires understanding of, and knowledge about, the actual problem situation being solved or examined. For example, you could be considering a bivariate analysis of GPA and Years. But logically they are not related, and if a relationship were demonstrated, it would most likely be a spurious one. What would it mean?

So you might decide that you have an appropriate approach/technique, and it could provide some meaningful insight. However, you cannot guarantee that you will get the results that you expect or anticipate. You are not sure how it will work out. Yes, the approach/technique is appropriate. But depending on the theoretical and actual relationship that underlies the data, it might or might not be helpful.

When using a certain technique, three possible outcomes could occur:

   The technique is not appropriate to use with the data and should not be used.

   The technique is appropriate to use with the data. However, the results are not meaningful.

   The technique is appropriate to use with the data, and the results are meaningful.

This process of exploration is all part of developing and telling the statistical story behind the data.

1 At this time, Macs do not have the Data Analysis ToolPaks.
