Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 12. Statistics and Data Analysis

Watching in a trance The crew is certain Nothing left to chance ... Starting to collect Requested data "What will it affect When all is done?" Thinks Major Tom
—Peter Schilling, "Major Tom (Coming Home)"

12.0 Introduction

Ask statisticians what software they use, and chances are (no pun intended), they will mention SAS, SPSS, or maybe even R. Those systems are quite good, but most are highly specialized for statistical work. With the release of version 7, Wolfram has substantially beefed up the statistical capabilities of Mathematica. Given everything else Mathematica can do, it is now a compelling alternative for statistics and data analysis. An entire Mathematica statistical cookbook could be written; therefore, this chapter is necessarily incomplete. I have selected these recipes for this chapter to provide jumping-off points for further exploration. You should consult the Mathematica documentation for more depth, and nonexperts should consider Sarah Boslaugh and Paul Andrew Watters’ Statistics in a Nutshell (O’Reilly) for a broad overview of the relevant concepts.

Even readers without much interest in statistics are encouraged to skim these recipes because there are demonstrations here that have application outside statistics proper. Most users of Mathematica are comfortable with basic statistical metrics, such as mean and variance, but perhaps you are rusty on quantiles. All are covered in 12.1 Computing Common Statistical Metrics of Numerical and Symbolic Data. Every programmer needs to generate random numbers from time to time, and it is useful to know how to use different distributions beside the standard uniform distribution (12.2 Generating Pseudorandom Numbers with a Given Distribution). Students and teachers of probability will appreciate Mathematica’s ability to manipulate and plot a variety of common (and not so common) distributions (12.3 Working with Probability Distributions) as well as the ability to illustrate statistical theorems and puzzles (12.4 Demonstrating the Central Limit Theorem and 12.16 Creating Stochastic Simulations). Advanced statisticians and researchers will get a lot of use out of Mathematica’s data analysis features, covered in 12.5 Computing Covariance and Correlation of Vectors and Matrices through 12.13 Grouping Data into Clusters. Finally, 12.14 Creating Common Statistical Plots demonstrates plots that are specific to statistical analysis.

This chapter often synthesizes data using random generation. In these cases, I seed the random number generator with a specific seed so the results are repeatable. There is no magic behind the seeds specified other than they provided a reasonable result. When I use specific data in these recipes, it is plausible but entirely fabricated and should not be construed as coming from an actual experiment.

12.1 Computing Common Statistical Metrics of Numerical and Symbolic Data

Problem

You want to perform common statistical analysis of data sets. These metrics represent the entry-level statistical functions that all users of Mathematica should have under their belts.

Solution

It should come as little surprise that Mathematica is equipped with the standard statistical functions. Here I use the byte count of Mathematica files on my folder as a source of data.

In[1]:=  data = N[ FileByteCount /@
             FileNames[FileNameJoin[{NotebookDirectory[],"*.nb"}]]];
         (*Compute the mean.*)
         Mean[data]
Out[2]=  2.45023 × 10⁶

The statistical functions you will use most in Mathematica (Mean, Median, Max, Min, Variance, and StandardDeviation) have obvious names and obvious uses. Here I get a bit fancy by computing a table in one step by using Through with the list of functions.

Not quite as pedestrian, quantiles are a common concept in statistics that generalizes the concept of median to other subdivisions.

In[4]:=  (*Find the lower quantile.*)
         Quantile[data, 1/4]
Out[4]=  14412.

In[5]:=  (*Find the 1/2, 1/3, 1/4, ... 1/10.*)
         Quantile[data, #] & /@ Table[1/n, {n, 2, 10}]
Out[5]=  {114698., 26623., 14412., 7712., 6102., 5456., 4775., 3865., 3514.}

In[6]:=  Quantile[data, 1/2]
Out[6]=  114698.

When used with default parameters Quantile always returns some element in the actual list. Thus, Quantile[data, 1/2] may not be the same as Median.

In[7]:=  Quantile[data, 1/2] == Median[data]
Out[7]=  True

With the following parameters, Quantile and Median are identical. See Quantile documentation for the meaning of these parameters.

In[8]:=  Quantile[data, 1/2, {{1/2,0}, {0,1}}] == Median[data]
Out[8]=  True

Discussion

The basic functions covered in the solution are no doubt familiar and hardly warrant further elaboration except to note their generality.

All of the statistics functions in Mathematica work with SparseArray, which is very convenient when you have a very large but sparse data set.

Further, given Mathematica’s symbolic nature, you should not be too surprised that it can do more than other common data analysis applications, such as MS Excel.

What does this result mean? It is the formula for computing the variance of a set of data with 3 a’s, 1 b, 2 c’s and 2 d’s. You can use this formula using ReplaceAll.

This is exactly the result you would get if you took the direct route.

This may seem completely academic; for many of you, it will be so. Yet consider that symbolic form allows you to perform further symbolic manipulations that account for properties you may know about the symbolic data. For example, imagine the items were all angles in radians in a given relationship and you wanted to know the formula for the variance of their sine. Such examples are contrived only until you need to do a similar transformation.

These symbolic capabilities also imply you can use these functions with common distributions rather than on individual values.

Another common statistical metric is the mode. This function is called Commonest in Mathematica and can be used to find the commonest or the n commonest. Related to this is a new function in version 7, Tally, that gives the individual counts.

In[16]:=  list = First[ RealDigits[Pi, 10, 50]];

In[17]:=  {Commonest[list], Commonest[list, 3]}
Out[17]=  {{3}, {3, 1, 9}}

In[18]:=  Tally[list]
Out[18]=  {{3, 9}, {1, 5}, {4, 4}, {5, 5}, {9, 8}, {2, 5}, {6, 4}, {8, 5}, {7, 4}, {0, 1}}

12.2 Generating Pseudorandom Numbers with a Given Distribution

Problem

You want to generate random numbers that have nonuniform distributions. Many recipes in this book use RandomReal and RandomInteger, but these functions give uniform distributions unless you specify otherwise.

Solution

Both RandomReal and RandomInteger can take a distribution as their first argument. RandomReal uses continuous distributions, including NormalDistribution, HalfNormalDistribution, LogNormalDistribution, InverseGaussianDistribution, GammaDistribution, ChiSquareDistribution, and others. RandomInteger uses discrete distributions, such as BernoulliDistribution, GeometricDistribution, HypergeometricDistribution, PoissonDistribution, and others.

In[19]:=  RandomReal[NormalDistribution[], 10]
Out[19]=  {-0.96524, 1.19926, 0.989088, 0.156427, -0.336326,
           -1.66671, 0.149802, -0.464219, -0.998164, 0.948215}

In[20]:=  RandomInteger[PoissonDistribution[5], 10]
Out[20]=  {5, 2, 6, 5, 6, 4, 3, 4, 4, 5}

Discussion

You can visualize distributions using BinCounts and BarChart.

Another way to visualize the various continuous distributions is to generate a random raster using each distribution. How would you rewrite this to remove the redundancy? (Hint: functional programming!)

12.3 Working with Probability Distributions

Problem

You want to compute the probability density function (PDF) and cumulative density function (CDF) of various distributions. You may also want to determine the characteristic function of the associated distribution.

Solution

Use PDF to compute the probability density function and CDF to compute the cumulative density function. I illustrate the use of these functions using the standardized normal distribution (mean 0 and variance 1).

Discussion

The CDF is obtained from the PDF by integrating the PDF from –∞ to x, which you can illustrate in Mathematica very easily. The implementation given here is designed to execute the integration only once and then store it as a new function for subsequent evaluation, so it is almost as fast as the built-in CDF. There is no compelling reason to use this over the built-in CDF implementation. It is here strictly as an illustration of the relationship. If you use Mathematica to teach statistics, it is a good idea to peek under the covers of black box functions like CDF whenever possible.

Clearly, you can also obtain the closed-form formula for the CDF of any particular distribution.

Find the value at –∞.

So the closed-form value for the CDF of the normal distribution is

In[29]:=  cumNormDist[x_] := Erf[x/Sqrt[2]]/2 + 0.5

The classic application of a PDF is in computing the probability that a particular value falls within some range. For example, consider the probability of a value falling between 0 and 0.25 for various distributions.

In[30]:=  Integrate[PDF[#, x], {x, 0, 0.25}] & /@
           {UniformDistribution[{0, 1}], NormalDistribution[0, 1],
            HalfNormalDistribution[1], ChiSquareDistribution[2]}
Out[30]=  {0.25, 0.0987063, 0.158106, 0.117503}

Based on the definition of the CDF, it is easy to see that it computes the probability that a value will be less than or equal to a specific value. Subtracting the CDF from 1 will give the probability of a value being greater than a specified limit.

In[31]:=  (*Probability that a normally distributed random variable will
           be less than or equal to 0.5*)CDF[NormalDistribution[0, 1], 0.5]
Out[31]=  0.691462

In[32]:=  (*Probability that a normally distributed random variable will
           be greater than 0.8*)1 - CDF[NormalDistribution[0,1], 0.8]
Out[32]=  0.211855

In[33]:=  (*Probability that a normally distributed random
            variable will be less than -1 or greater than 1*)
          CDF[NormalDistribution[0,1], -1.] +
           (1 - CDF[NormalDistribution[0,1], 1.])
Out[33]=  0.317311

When you plot a PDF, you can use ColorFunction to highlight regions of interest, but make sure you also set Filling Axis and ColorFunctionScaling False. Here I plot the regions of interest whose total area (and hence probability) is approximately 0.317311.

Use CharacteristicFunction[dist,var] to extract the characteristic function of a distribution in terms of a variable var. Here are the functions for five common distributions.

12.4 Demonstrating the Central Limit Theorem

Problem

You want to illustrate the central limit theorem (CLT) to yourself or your students.

Solution

The CLT states that the mean of sufficiently large samples from any distribution will approximate a normal distribution. You can illustrate this by averaging suitably large random samples from a nonnormal distribution, such as the uniform distribution.

Discussion

The CLT is often stated in a very technical way. In Statistics in a Nutshell, Boslaugh and Watters explain that the CLT "states that the sampling distribution of the sample mean approximates the normal distribution, regardless of the distribution of the population from which samples are drawn, if the sample size is sufficiently large" (137). Other references define it in an equally technical way. The solution shows that the concept is not difficult, although the result is certainly not obvious. The solution demonstrates 200 samples of uniformly generated lists of random numbers, each of length 30, being averaged and then the counts of each integer-valued range being organized into bins and plotted. The shape looks roughly normal, which is the prediction of the CLT. BinCounts, Mean, and RandomReal are relatively easy to understand (see prior recipes), so this makes the idea behind the CLT rather concrete.

To further emphasize that this is not a property of the uniform distribution, you can substitute other distributions. These use finer grained bins due to the tighter range of numbers generated, but the result is similar. As an exercise, wrap a Manipulate around the code in the "Solution" section above and adjust both the sample size and the number of samples. This will illustrate that the validity of the CLT is predicated on a sufficiently large number of samples of sufficiently large size.

12.5 Computing Covariance and Correlation of Vectors and Matrices

Problem

You want to measure the relationship between data sets to see if they vary about the mean in a similar way (covariance) or if there is a linear relationship (correlation).

Solution

Discussion

Covariance and Correlation both operate on matrices. If you pass a single matrix, it will return a covariance (or correlation) matrix resulting from computing the covariance between each column. To demonstrate this clearly, I’ll engineer a matrix with an obvious relationship between the first and second column and a weak correlation of these in a third column. The output matrix will always be symmetrical. The correlation matrix will always have ones on the diagonal, since these entries represent correlations of columns with themselves. You can also pass two matrices, in which case you get the covariance (or correlation) with respective columns.

   In[40]:=  SeedRandom[2];
             (data = Transpose[{{0, 1, 2, 3, 4, 5, 6, 7, 8, 9},
                   {0, 10, 20, 30, 40, 50, 60, 70, 80, 90},
                   RandomReal[{-1, 1}, 10]}]) // TableForm
Out[41]//TableForm=
             0 0  0.44448
             1 10 -0.781103
             2 20 -0.0585946
             3 30 0.0711637
             4 40 0.166355
             5 50 -0.412115
             6 60 -0.669691
             7 70 0.202516
             8 80 0.508435
             9 90 0.542246

   In[42]:=  Covariance[data] // TableForm
Out[42]//TableForm=
             9.16667  91.6667 0.467288
             91.6667  916.667 4.67288
             0.467288 4.67288 0.228412

   In[43]:=  Correlation[data] // TableForm
Out[43]//TableForm=
             1.       1.       0.322938
             1.       1.       0.322938
             0.322938 0.322938 1.

   In[44]:=  Correlation[data, data^2] // TableForm
Out[44]//TableForm=
             0.962691 0.962691 0.00604923
             0.962691 0.962691 0.00604923
             0.442467 0.442467 -0.522003

12.6 Measuring the Shape of Data

Problem

You want to summarize the shape of your data using some common statistical measures.

Solution

Use Skewness to measure the asymmetry of a distribution. A symmetrical distribution like the NormalDistribution will have skewness of zero. A positive skewness indicates the right tail is longer, while a negative skewness indicates the left tail is longer.

Use QuartileSkewness to measure if the median is closer to the upper or lower quartile. QuartileSkewness is a more robust measure of skewness in the presence of extreme values.

In[47]:=  data = {0.1, 0.3, 0.7, 1, 0.6, 99, 0.8, 2, 2.1, 0.95, 1.7, 0.69};
          {QuartileSkewness[data], Skewness[data]}
Out[48]=  {0.618257, 3.01242}

Use Kurtosis to measure the sharpness of the peak of a distribution. A high kurtosis distribution has a sharper peak and longer, fatter tails, whereas a low kurtosis distribution has a more rounded peak and shorter, thinner tails.

Discussion

CentralMoment is a fundamental measure that underlies statistical measures of shape. It is computed as

The second central moment of a data set is called the population variance (which is not as commonly used as sample variance as computed by the Variance function).

In[50]:=  data = {0.1, 0.3, 0.7, 1, 0.6, 99, 0.8, 2, 2.1, 0.95, 1.7, 0.69};

In[51]:=  Table[CentralMoment[data, i], {i, 1, 3}]
Out[51]=  {1.77636 × 10^–15, 734.086, 59915.}

Skewness is equivalent to CentralMoment[list,3]/CentralMoment[list,2]^(3/2); Kurtosis is CentralMoment[list,4]/CentralMoment[list,2]^2.

12.7 Finding and Adjusting for Outliers

Problem

You have a large data set and you want to identify outliers and possibly adjust the statistics to compensate.

Solution

A simple way to identify outliers is to use Sort and inspect the beginning and end of the list. You can also look at a certain number of elements near the minimum and maximum using Nearest.

In[52]:=  data = Join[{0.0001, 0.0005}, RandomReal[{10, 30}, 500], {1000, 1007}];
           {min, max} = {Min[data], Max[data]};
          {Nearest[data, min, 5], Nearest[data, max, 5]}
Out[54]=  {{0.0001, 0.0005, 10.0021, 10.1101, 10.1403},
           {1007, 1000, 29.9915, 29.9773, 29.975}}

You can also compute the trimmed mean, which is the mean after dropping a fraction of the smallest and largest elements.

In[55]:=  {Mean[data], TrimmedMean[data, 0.2]}
Out[55]=  {24.0623, 20.173}

Discussion

Here I take advantage of a feature of Tally that allows you to provide custom equivalence function. The idea here is to treat values within a specified distance of each other as equal. In this case, I use distance 5. This shows that there are 3 clusters of values in the data and some outliers with low frequency of occurrence.

   In[56]:=  Tally[data, (Abs[#1 - #2] < 5) &] //TableForm
Out[56]//TableForm=
             0.0001  2
             25.5715 235
             10.4722 135
             17.0082 130
             1000    1
             1007    1

12.8 Fitting Data Using a Linear Model

Problem

You have a data set and would like to find a linear model of the data. A linear model is commonly called a "linear regression." A linear model has various statistics that define its accuracy, and you typically want to obtain these as well.

Solution

In[57]:= data = Table[{x, x + RandomReal[{-2, 3}]}, {x, 1, 20}];

Use Fit in versions prior to Mathematica 7.

Use LinearModelFit in version 7 and above to build a linear model that you can then use to plot or extract statistics.

Discussion

LinearModelFit is a vast improvement over Fit since it is not just a way to synthesize a function. Once you have constructed a linear model, you can query its various properties, of which there are quite a few. To find out what is available, simply ask the model. Ask for a specific property by name.

In[62]:=  lm["Properties"]
Out[62]=  {AdjustedRSquared, AIC, ANOVATable, ANOVATableDegreesOfFreedom,
           ANOVATableEntries, ANOVATableFStatistics, ANOVATableMeanSquares,
           ANOVATablePValues, ANOVATableSumsOfSquares, BetaDifferences, BestFit,
           BestFitParameters, BIC, CatcherMatrix, CoefficientOfVariation,
           CookDistances, CorrelationMatrix, CovarianceMatrix, CovarianceRatios,
           Data, DesignMatrix, DurbinWatsonD, EigenstructureTable,
           EigenstructureTableEigenvalues, EigenstructureTableEntries,
           EigenstructureTableIndexes, EigenstructureTablePartitions,
           EstimatedVariance, FitDifferences, FitResiduals, Function,
           FVarianceRatios, HatDiagonal, MeanPredictionBands,
           MeanPredictionConfidenceIntervals, MeanPredictionConfidenceIntervalTable,
           MeanPredictionConfidenceIntervalTableEntries, MeanPredictionErrors,
           ParameterConfidenceIntervals, ParameterConfidenceIntervalTable,
           ParameterConfidenceIntervalTableEntries, ParameterConfidenceRegion,
           ParameterErrors, ParameterPValues, ParameterTable, ParameterTableEntries,
           ParameterTStatistics, PartialSumOfSquares, PredictedResponse, Properties,
           Response, RSquared, SequentialSumOfSquares, SingleDeletionVariances,
           SinglePredictionBands, SinglePredictionConfidenceIntervals,
           SinglePredictionConfidenceIntervalTable,
           SinglePredictionConfidenceIntervalTableEntries, SinglePredictionErrors,
           StandardizedResiduals, StudentizedResiduals, VarianceInflationFactors}

In[63]:=  lm["RSquared"]
Out[63]=  0.944788

In[64]:=  lm["MeanPredictionErrors"]
Out[64]=  {0.627101, 0.579603, 0.533846, 0.490318, 0.449667, 0.412744,
           0.380636, 0.354652, 0.336216, 0.326608, 0.326608, 0.336216, 0.354652,
           0.380636, 0.412744, 0.449667, 0.490318, 0.533846, 0.579603, 0.627101}

In[65]:=  lm["BestFit"]
Out[65]=  0.981879 + 0.990357 ×

You can also get the best Fit function by using Normal.

In[66]:=  Normal[lm]
Out[66]=  0.981879 + 0.990357 ×

12.9 Fitting Data Using a Nonlinear Model

Problem

You want to fit data to a function for which you have knowledge of the mathematical model. Specifically, you know the model is nonlinear and, hence, neither Fit nor LinearModelFit is appropriate.

Solution

Use FindFit in versions prior to Mathematica 7.

Use NonLinearModel fit in Mathematica 7 as a more complete solution.

Discussion

As with LinearModelFit, NonlinearModelFit encapsulates a wealth of information.

In[73]:=  nlm["Properties"]
Out[73]=  {AdjustedRSquared, AIC, ANOVATable, ANOVATableDegreesOfFreedom,
           ANOVATableEntries, ANOVATableMeanSquares, ANOVATableSumsOfSquares,
           BestFit, BestFitParameters, BIC, CorrelationMatrix, CovarianceMatrix,
           CurvatureConfidenceRegion, Data, EstimatedVariance, FitCurvatureTable,
           FitCurvatureTableEntries, FitResiduals, Function, HatDiagonal,
           MaxIntrinsicCurvature, MaxParameterEffectsCurvature, MeanPredictionBands,
           MeanPredictionConfidenceIntervals, MeanPredictionConfidenceIntervalTable,
           MeanPredictionConfidenceIntervalTableEntries,
           MeanPredictionErrors, ParameterBias, ParameterConfidenceIntervals,
           ParameterConfidenceIntervalTable, ParameterConfidenceIntervalTableEntries,
           ParameterConfidenceRegion, ParameterErrors, ParameterPValues,
           ParameterTable, ParameterTableEntries, ParameterTStatistics,
           PredictedResponse, Properties, Response, RSquared, SingleDeletionVariances,
           SinglePredictionBands, SinglePredictionConfidenceIntervals,
           SinglePredictionConfidenceIntervalTable,
           SinglePredictionConfidenceIntervalTableEntries,
           SinglePredictionErrors, StandardizedResiduals, StudentizedResiduals}

For example, you can extract and plot confidence bands for various confidence levels.

Or you can extract a variety of statistics.

12.10 Creating Interpolation Functions from Data

Problem

You have a set of data points and want to construct a function you can use to predict values at other points.

Solution

Normally, you would interpolate data that was obtained in the wild without any a priori notion of the underlying function. However, as a simple illustration, I’ll sample data from a known function.

In[82]:=  xvalues = Sort[RandomReal[{-4 Pi, 4 Pi}, 18]];
          data = Table[{x, Sin[x]}, {x, xvalues}];
          fData = Interpolation[data]
Out[84]=  InterpolatingFunction[{{-11.3374, 12.5436}}, <>]

Discussion

Interpolation returns an InterpolationFunctionObject, which can be used just like a normal function. The default order for Interpolation is 3 but this can be varied using the option InterpolationOrder.

12.11 Testing for Statistically Significant Difference Between Groups Using ANOVA

Problem

You have experimental data suggesting a linear relationship between an independent and dependent variables; however, you are unsure if the relationship is causal. You run an experiment using an experimental group and a control group. You want to know if the results of the experiment are statistically significant.

Solution

Analysis of variance (ANOVA) is a popular statistical technique that is very important in the analysis of experimental results. Mathematica provides this functionality in a package aptly named ANOVA`. To illustrate the use of this package, I borrow a toy example from Boslaugh and Watters’ Statistics in a Nutshell. Imagine you collected the data in table coffeeIQ suggesting a relationship between coffee consumption in cups and IQ as measured by some standardized IQ test.

The question that remains is whether there is a causal relationship between caffeine and IQ, since one could equally suppose smart people just like to drink coffee. To investigate further, you design an experiment with two randomly selected groups: everyone in the first group receives a caffeine pill, and those in the second group receive a placebo. The pills are administered in a double-blind method, at the same time, under the exact same conditions, and each group is administered an IQ test at a specific time after the pills were taken. From these experiments you obtain the following data, where the first entry is 1 for those who received the caffeine and 0 for those who received the placebo. The second entry is the measured IQ.

In[94]:=  experiments = {{1, 110}, {1, 100}, {1, 120}, {1, 125}, {1, 120}, {1, 120},
             {1, 115}, {1, 98}, {1, 95}, {1, 91}, {0, 100}, {0, 95}, {0, 100},
             {0, 122}, {0, 115}, {0, 88}, {0, 97}, {0, 87}, {0, 92}, {0, 76}};

Using ANOVA you see

Here the important results are the FRatio (higher is better) and PValue (smaller is better). The PValue is the probability of obtaining the result at least as extreme as the one that was actually observed, given that the null hypothesis is true. Typically one will reject the null hypothesis when the PValue is less than 0.05.

Discussion

You may wonder why the output of ANOVA is formatted as it is. Here Mathematica is emulating a popular statistics package called Minitab. You can drill down to the raw values easily enough.

In[97]:=  (ANOVA /. ANOVA[experiments ])[[1]]
Out[97]=  {{1, 744.2, 744.2, 4.47415, 0.0486171}, {18, 2994., 166.333}, {19, 3738.2}}

The solution shows a one-way ANOVA. It is frequently the case that there are multiple independent variables. In this case, you must describe the model and variables more precisely. For example, suppose you were measuring height and age of men as a predictor of income. For the purpose of this experiment, we will designate men under 5’10" as "short," assigning them height classification 1 and "tall" men classification 2. Similarly, we will define "young" men as under 40 with age classification 1 and "mature" men with age classification 2.

Here I use All in the model input to indicate I want to analyze all products of the main effects. You can also specify the products individually. For example, if you want to analyze the significance of height and height and age together, you can specify the model parameter as {height, age height}.

There are a few standard post hoc tests you can run to determine which group’s means were significantly different given SignificanceLevel (default is 0.05). I will not delve into the statistics behind these tests. You should refer to one of the resources in the See Also section. The output is fairly self-explanatory. Here we see that using the Bonferroni and Tukey tests, variation in income due to height was statistically significant between groups 1 and 2, but age did not show up as significant for either test.

Returning to the data from the Solution section, we can see how the tests can pass at one significance level but fail at a tighter tolerance. Note also how I use the output as a replacement rule to extract only the test results.

In the examples given here, I have also used the option CellMeans False, which suppresses the display of the means.

12.12 Hypothesis Testing with Categorical Data

Problem

You want to determine if there are statistically significant relationships within categorical data.

Solution

The chi-square test is a standard computation on categorical data. Categorical data is that for which the response is a choice among a set of discrete categories rather than a measurement on a continuous scale. Common examples are sex {male, female}, party {Democrat, Republican}, or sometimes data that could be placed on a scale but for simplicity is lumped into discrete groups, for example, blood pressure {low, normal, prehypertensive, hypertensive}. Experiments using categorical data often result in tables; hence, the data is called row-column (RC) data.

Here is a simplest possible example (borrowed from Statistics in a Nutshell) showing a two-by-two table relating smoking to lung cancer.

	`Lung Cancer Diagnosis`	`No Lung Cancer Diagnosis`
`Currently smoke`	`60`	`300`
`Do not currently smoke`	`10`	`390`

The chi-square test is a test for independence. If the RC data is independent, there is no demonstrated relationship between smoking and cancer (the null hypothesis); otherwise, there is evidence for the alternate hypothesis. The chi-square statistic starts with the computation of expected, values for each cell. The formula is

This is easily computed for the entire table using Outer.

In[104]:=  data := {{60, 300}, {10, 390}};

Out[105]=  expectedValues[rc_List] := Module[{rowTotals, colTotals, grandTotal},
             colTotals := Total[rc];
             rowTotals := Total[Transpose[rc]];
             grandTotal := Total[rowTotals];
             Outer[Times, rowTotals, colTotals] / grandTotal]

The chi-square value is computed by taking the differences between expected and observed, squaring the result, dividing it by expected, and summing all the ratios.

  In[106]:=  chiSquare[data_List] := Module[{ev}, ev = expectedValues[data];
                   Total[((data - ev) ^ 2) / ev, 2]]

  In[107]:=  expectedValues[data] // N // TableForm
Out[107]//TableForm=
             33.1579 326.842
             36.8421 363.158

  In[108]:=  chiSquare[data] // N
  Out[108]=  45.4741

To interpret this result, you need to compute PValue. The smaller the p-value, the more confident you can be in rejecting the null hypothesis.

Discussion

The second argument to ChiSquarePValue specifies the degrees of freedom of the distribution. In the solution example, we use 1 without explanation. The rule for computing the degrees of freedom for RC data is (numRows -1)(numCols -1).

In[111]:=  degreesOfFreedom[rc_List] := Times  @@ (Dimensions[rc] - 1)

In[112]:=  degreesOfFreedom[data]
Out[112]=  1

In the literature you will often find tables of critical values for various distributions relative to a significance level called alpha (α). For example, a common value for alpha is 0.05, which represents 95% confidence or (1 - α)* 100%. The critical value for a specified degree of freedom is the lower (or upper) bound for chiSquare in the solution that would give you the required confidence. Computing the critical value is the problem of finding a limit that gives the specified alpha as the area under the PDF for the distribution. We can compute these values efficiently using FindRoot and NIntegrate.

In[113]:=  chiSqUpperP[criticalValue_, df_] := With[{infinity = 1000}, NIntegrate[
              PDF[ChiSquareDistribution[df], x], {x, criticalValue, infinity}]]
           chiSqLowerP[criticalValue_, df_] := NIntegrate[
             PDF[ChiSquareDistribution[df], x], {x, 0, criticalValue}]
           criticalValueUpper[alpha_, df_] :=
            FindRoot[chiSqUpperP[c, df] == alpha, {c, 0.1}]
           criticalValueLower[alpha_, df_] :=
            FindRoot[chiSqLowerP[c, df] == alpha, {c, 0.1}]

The critical value for the experiment in the solution is

Our result was 45.47, so the result was well over the critical value. A result below the lower critical value is also acceptable, but clearly that does not apply to this experiment.

Given these functions, you can create your own tables of critical values like those in the NIST/SEMATECH e-Handbook of Statistical Methods website ( http://bit.ly/AbGvb ).

12.13 Grouping Data into Clusters

Problem

You want to group data in separate lists based on a metric like Euclidean distance or Hamming distance. This problem arises in a wide variety of contexts, including market research, demographics, informatics, risk analysis, and so forth.

Solution

Use FindClusters with the default Euclidean distance function for numbers and vectors.

In[121]:=  FindClusters[{1, 100, 2, 101, 3, 102, 1000, 1010, 4, 1020, 7}]
Out[121]=  {{1, 2, 3, 4, 7}, {100, 101, 102}, {1000, 1010, 1020}}

When you use FindClusters with strings, this distance function is "edit distance" or the number of character changes to get from one string to another.

In[122]:=  FindClusters[DictionaryLookup[_~~ "ead" ~~ _]]
Out[122]=  {{beads, heads, leads, reads}, {beady, heady, Meade, Reade, ready}}

You can insist on a specific number of clusters.

In[123]:=  FindClusters[{1, 100, 2, 101, 3, 102, 1000, 1010, 4, 1020, 7}, 4]
Out[123]=  {{1, 2, 3, 4}, {100, 101, 102}, {1000, 1010, 1020}, {7}}

Discussion

If you need to cluster data by a key or criterion that is not part of the data, transform the data into the form {key1→data1, key2→data2, ...}. When FindClusters sees this format, it will cluster that data using the keys. For example, say you retrieve some data from a database with names and ages and you want to cluster names by age.

If you don’t want to lose the ages, you can use the following variation:

There is also a variation that is more convenient when the keys and values are in separate lists.

You can also handle the situation via a custom distance function, which is a more general solution since the function can use other metrics besides Euclidean distance.

Mathematica provides a variety of built-in distance functions that cater to different conceptions of closeness as well as different data types. For numbers, vectors, and higher-order tensors, you can use EuclideanDistance, SquaredEuclideanDistance, ManhattanDistance, ChessboardDistance, CanberraDistance, CosineDistance, CorrelationDistance, or BrayCurtisDistance. For example, CosineDistance (also known as angular distance) is often used with highly dimensional data. Here we generate a data set of 800 vectors of length 50. By design, the vectors are clumped into four groups by magnitude, so it should be of little surprise that FindClusters using default EuclideanDistance discovers four clusters.

In[131]:=  data =
             Join[RandomReal[{-10, -5}, {200, 50}], RandomReal[{-5, 0}, {200, 50}],
              RandomReal[{0, 1}, {200, 50}], RandomReal[{5, 10}, {200, 50}]];

In[132]:=  Length[FindClusters[data]]
Out[132]=  4

However, using CosineDistance, which is insensitive to vector length, only two clusters are found.

In[133]:=  Length[FindClusters[data, DistanceFunction -> CosineDistance]]
In[133]:=  2

For Boolean vectors, you can use MatchingDissimilarity, JaccardDissimilarity, RussellRaoDissimilarity, SokalSneathDissimilarity, RogersTanimotoDissimilarity, DiceDissimilarity, and YuleDissimilarity. Consider a problem that turns the game of 20 Questions on its head. I devised 20 questions in a somewhat haphazard fashion and then selected a bunch of nouns as they came into my head (Table 12-1). The idea here is to associate a Boolean vector with each noun based on how one might answer the questions in relation to the noun. Some of the questions are very subjective, and some don’t really apply to all nouns, but to stay in the domain of Boolean, I forced myself to choose either true or false.

Table 12-1. Twenty Questions

Number	Question
`1`	`Is it living?`
`2`	`Is it bigger than a bread box?`
`3`	`Is it soft?`
`4`	`Is it visible?`
`5`	`Is it man - made?`
`6`	`Is it flammable?`
`7`	`Is it famous?`
`8`	`Does it run on electricity?`
`9`	`Does it have hair or fur?`
`10`	`Does it process information?`
`11`	`Does it usually cost more than $1000?`
`12`	`Is it mostly one color?`
`13`	`Can you sell it legally?`
`14`	`Does it conduct electricity?`
`15`	`Can you bend it without breaking and it retains its new shape?`
`16`	`Can an average human lift it?`
`17`	`Can it been seen with the unaided eye?`
`18`	`Can you transfer it over the Internet?`
`19`	`Is it scary?`
`20`	`Does its English name come before Lizard in the dictionary?`

The nouns I applied these questions to are

In[135]:= words = {"cat", "PC", "Java", "bird", "airplane", "Obama", "Mathematica",
             "Hillary Clinton", "weather", "time", "wind", "tunnel",
             "carpenter", "house", "red", "beer", "LSD", "Nintendo Wii",
             "John Lennon", "Paul McCartney", "Howard Stern", "mother", "Linux",
             "candle", "paper", "rock", "scissors", "steak", "broccoli"};

I’ll only show part of the data set (you can find it in the file 20Q.nb in the downloads from the book’s website).

Assuming the full data set is stored in the variable data, we can see how FindClusters partitions the data using the various Boolean distance functions.

By transforming Boolean value to 0 and 1, you can see how EuclideanDistance and ManhattanDistance tend to create a larger number of clusters.

For strings, you can choose from EditDistance, DamerauLevenshteinDistance, and HammingDistance.

HammingDistance requires equal length strings, otherwise it will report an error. I added a preprocessing function that pads each string at the end with blanks to make each as long as the longest string in the list.

For advanced applications of FindCluster, you can tweak fine-grained aspects of the clustering algorithm via the Method option. Consult the FindClusters tutorial for detailed specifications of Method that provide for custom significance tests and linkage tests.

12.14 Creating Common Statistical Plots

Problem

You want to visualize experimental data in a manner that effectively summarizes all the standard statistical measures.

Solution

The BoxWhiskerPlot is an excellent way to visually convey the essential statistics of one or more data sets.

Discussion

A box plot shows the minimum, maximum, median (black line), and middle quantile (box). There are options to change orientation (BoxOrientation), spacing (BoxExtraSpacing), styles (BoxLineStyle, BoxMedianStyle, BoxFillingStyle), and display of outliers (BoxOutliers, BoxOutlierMarkers). You can also show other quantiles using BoxQuantile.

Other common statistical chart types include StemLeafPlot, ParetoPlot, QuantilePlot, and PairwiseScatterPlot.

A Pareto plot combines a bar chart of percentages of categories with a plot of cumulative percentages. It is often used in quality control applications for which the data might be defects for various products.

Quantile plots are used to visualize whether two data sets come from the same population. The closeness of fit to a straight line indicates the degree to which the data comes from the same population.

PairwiseScatterPlot plots each column of a matrix against each of the other columns. The diagonals will always be straight lines. The following plot of 2006, 2007, and 2008 Dow Jones Industrial Average (DJIA) data shows how 2006 and 2008 had nearly inverse trends, whereas 2007 deviated in the middle of the year from the 2008 data.

12.15 Quasi-Random Number Generation

Problem

You need to generate random numbers, but you want to avoid the inevitable clustering that occurs using pseudorandom generators. This type of generator is sometimes called quasirandom.

Solution

Notice the clumping in this randomly generated list plot of 500 points.

The van der Corput sequence takes the digits of an integer in a given base b, and then reflects them about the decimal point. This maps the numbers from 1 to n into a set of numbers [0,1] in an even distribution, provided n is one less than a power of the base.

In[155]:=  corput[n_, b_] :=
             IntegerDigits[n, b].(b ^ Range[-Floor[Log[b, n] + 1], -1]);
           SetAttributes[corput, Listable]

The Halton sequence shows that a good way to distribute the values in n dimensions is to use the first n primes as the bases used with van der Corput.

As you can see, this gives far less clumpy distribution of points than RandomReal gives.

Discussion

These quasirandom numbers are often used in simulations and Monte Carlo methods. One problem with these sequences is that they always give you the same set of numbers. One possibility is to perturb each number by a small random epsilon (e). This more or less preserves the even distribution provided the random perturbation is small.

12.16 Creating Stochastic Simulations

Problem

You want to create a simulation as a means of developing a better understanding of the long-term behavior of a system governed by randomness.

Solution

One of the most well-known types of stochastic processes is the random walk. A random walk can occur in one-, two-, three-, or even higher dimensional space, but it is easiest to visualize in one or two dimensions. Here I show a random walk on a 2D lattice. A particle (or drunkard, if you prefer) starts at the origin {0,0} and can take a step east {0,1}, west {0,-1}, north {1,0} or south {-1,0}.

In[161]:=  latticeWalk2D[n_] := Module[{start = {0,0},
              east = {1, 0}, west = {-1, 0}, north = {0, 1}, south  = {0, -1}},
             NestList[# + RandomChoice[{east, west, north, south}] &, start, n]]

The walk is generated by specifying a number of steps and can be visualized using ListLinePlot or using arrows for each step, as I show in Out[163] below. Here I use SeedRandom only to make sure I always get the same walk no matter how many times this notebook is evaluated before going to press!

In[162]:=  SeedRandom[1004] ; walk  = latticeWalk2D[50]
Out[162]=  {{0, 0}, {-1, 0}, {0, 0}, {0, 1}, {0, 0}, {0, 1}, {-1, 1}, {-1, 2},
            {0, 2}, {-1, 2}, {-1, 1}, {-2, 1}, {-1, 1}, {-1, 2}, {0, 2}, {0, 1},
            {0, 2}, {-1, 2}, {-1, 1}, {-1, 0}, {-1, 1}, {-2, 1}, {-3, 1}, {-4, 1},
            {-4, 2}, {-5, 2}, {-5, 3}, {-6, 3}, {-6, 2}, {-5, 2}, {-6, 2}, {-6, 1},
            {-5, 1}, {-5, 0}, {-5, -1}, {-5, -2}, {-6, -2}, {-7, -2}, {-8, -2},
            {-9, -2}, {-10, -2}, {-10, -3}, {-10, -4}, {-11, -4}, {-10, -4},
            {-11, -4}, {-10, -4}, {-11, -4}, {-12, -4}, {-11, -4}, {-10, -4}}

Discussion

Some simulations contain constraints on what can happen at each step. For example, if you wanted a walk for which a back-step is disallowed, you could remember the previous step and remove its inverse from the population on the generation.

In[164]:=  latticeWalk2DNoBackStep[n_] :=
            Module[{start = {0, 0}, east = {1, 0}, west = {-1, 0},
              north = {0, 1}, south = {0, -1}, steps, last},
             steps = {east, west, north, south};
             (*Initialize last to a step not in the
              population so not to remove anything the first time.*)
             last = {1, 1};
             (*At each step the inverse (-last)
              is removed from possible steps using Complement.*)
             NestList[# + (last = RandomChoice[Complement[steps, {-last}]]) &,
              start, n]]

In[165]:=  SeedRandom[778]; walk = latticeWalk2DNoBackStep[25]
Out[165]=  {{0, 0}, {1, 0}, {2, 0}, {2, -1}, {3, -1}, {4, -1}, {5, -1}, {5, -2},
            {6, -2}, {6, -1}, {6, 0}, {5, 0}, {4, 0}, {4, 1}, {5, 1}, {5, 2}, {6, 2},
            {7, 2}, {7, 1}, {7, 0}, {8, 0}, {9, 0}, {9, 1}, {8, 1}, {8, 2}, {8, 3}}

Given a simulation, you will usually want to understand its behavior over many runs. One obvious metric is the distance from the origin. You might postulate, for example, that the average distance from the origin for latticeWalk2D will be less than latticeWalk2DNoBackStep. By running the simulation 500 times for each case and computing the mean, median, and other statistics, you can be more confident this intuition is correct. You can also see that the advantage seems to be only about two steps.

Simulation is also powerful as a tool for persuading someone of a truth that seems to defy intuition. A famous example is the Monty Hall problem. This problem is named after a U.S. game show called "Let’s Make a Deal," which was popular in the 1960s and ’70s and hosted by Monty Hall. A well-known statement of the problem was published in Parade magazine ("Ask Marilyn," Sept. 1990, 16):

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice?

For many people, the intuitive answer is that there is no advantage in switching because there is a 50/50 chance you have the car either way you go. There even seems to be a bias for not switching, based on the platitude "go with your first instincts." However, if you analyze the problem correctly (see the following analysis) there is a 2/3 probability of getting the car if you switch. But the analysis is subtle and apparently fails to convince even some very intelligent people, so perhaps a simulation is helpful. An advantage of creating the simulation is that it makes it clear just what you mean by this problem. Specifically, we are talking about the best decision over many trials for a problem where the initial choice of door is random, the placement of the car is random, and Monty always shows the door that contains a goat. In this simulation, we call sticking with your first choice strategyl and switching to the remaining door strategy2. The simulation is purposefully without any cute functional programming tricks so it is clear that at each step we are accurately following the rules of the game.

In[168]:=  (*randomPick is similar to RandomChoice except we want
            the position of the choice rather than the choice itself.*)
           randomPick[choices_List] :=
            Module[{}, RandomInteger[{1, Length[choices]}]]
           (*simulateStrat1VSStrat2 computes the winnings over a number
            of trials for strategy1 (stick) and strategy2 (switch).*)
           simulateStrat1VSStrat2[trials_Integer] :=
            Module[{GOAT = 0, CAR = 1, doors, firstPick, secondPick,
              winnings1 = 0, winnings2 = 0, doorsTemp, makePrizes},
             (*There are 3 possible initial game configurations. These
              can be generated using Permutations.*)
             SeedRandom[];
             Do[
              (*Randomly pick one of the
               three possible initial game configurations.*)
              doors = RandomSample[{GOAT, GOAT, CAR}];
              (*Contestant picks a door at random. Recall this
               is the position of the prize, not the prize itself.*)
              firstPick = randomPick[doors];
              (*Winnings of contestant who keeps first pick always*)
              winnings1 += doors[[firstPick]];
              (*Delete first pick from choices.*)
              doorsTemp = Drop[doors, {firstPick}];
              (*Delete goat from remaining; this is where Monty shows the goat.
                Here I use position to find a goat and, since there could be two,
                 I arbitrarily remove the first.*)
              doorsTemp = Drop[doorsTemp, Position[doorsTemp, GOAT][[1]]];
              (*Contestant following second
               strategy always switches to remaining prize.*)
              secondPick = doorsTemp[[1]];
              (*Winnings of contestant who switches*)
              winnings2 += secondPick,
              {trials}];
             {winnings1, winnings2}]

You can now simulate any number of games and compare the accumulated winnings over that many games. Here I show the results where the number of games varies from 10 to 100,000 in increments of powers of 10. Clearly strategy2, always switching, is the way to play the Monty Hall game.

   In[170]:=  Table[simulateStrat1VSStrat2[10^i], {i, 1, 5}] // TableForm
Out[170]//TableForm=
             4      6
             34     66
             304    696
             3345   6655
             33321 66679

The Monty Hall game analysis that leads to the correct conclusion is as follows: Consider the probability of not picking the car at the start of the game. Since there are 2 goats and 1 car, the probability of not picking a car is 2/3. Now consider what happens when Monty shows you a goat. In effect he tells you that IF you did not pick a car initially, THEN there is definitely a car behind the remaining door. We agreed that the probability of not having picked the car initially was 2/3, so now the probability of the car being behind the remaining door must be 2/3. The simulation we’re given shows this to be the case.

Table of Contents for 12. Statistics and Data Analysis

Create new playlist

Sign In

Sign Up

Chapter 12. Statistics and Data Analysis

12.0 Introduction

12.1 Computing Common Statistical Metrics of Numerical and Symbolic Data

Problem

Solution

Discussion

See Also

12.2 Generating Pseudorandom Numbers with a Given Distribution

Problem

Solution

Discussion

See Also

12.3 Working with Probability Distributions

Problem

Solution

Discussion

See Also

12.4 Demonstrating the Central Limit Theorem

Problem

Solution

Discussion

See Also

12.5 Computing Covariance and Correlation of Vectors and Matrices

Problem

Solution

Discussion

12.6 Measuring the Shape of Data

Problem

Solution

Discussion

12.7 Finding and Adjusting for Outliers

Problem

Solution

Discussion

12.8 Fitting Data Using a Linear Model

Problem

Solution

Discussion

See Also

12.9 Fitting Data Using a Nonlinear Model

Problem

Solution

Discussion

See Also

12.10 Creating Interpolation Functions from Data

Problem

Solution

Discussion

12.11 Testing for Statistically Significant Difference Between Groups Using ANOVA

Problem

Solution

Discussion

See Also

12.12 Hypothesis Testing with Categorical Data

Problem

Solution

Discussion

See Also

12.13 Grouping Data into Clusters

Problem

Solution

Discussion

See Also

12.14 Creating Common Statistical Plots

Problem

Solution

Discussion

See Also

12.15 Quasi-Random Number Generation

Problem

Solution

Discussion

See Also

12.16 Creating Stochastic Simulations

Problem

Solution

Table of Contents for
12. Statistics and Data Analysis