Time for action – analyzing random values

We will generate random values that mimic a normal distribution and analyze the generated data with statistical functions from the scipy.stats package.

  1. Generate random values from a normal distribution using the scipy.stats package:
    generated = stats.norm.rvs(size=900)
  2. Fit a normal distribution to the generated values. The fit returns estimates of the mean and standard deviation of the dataset:
    print("Mean", "Std", stats.norm.fit(generated))

    The mean and standard deviation appear as follows:

    Mean Std (0.0071293257063200707, 0.95537708218972528)
    
  3. Skewness tells us how skewed (asymmetric) a probability distribution is (see http://en.wikipedia.org/wiki/Skewness). Perform a skewness test. This test returns two values. The first value is the test statistic. The second value is the p-value: the probability of observing a skewness at least as far from 0 as that of our dataset, assuming the data really come from a normal distribution.

    Note

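    Generally speaking, the p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. In this case, the null hypothesis is that the sample comes from a population with the skewness of a normal distribution (which is 0 because of symmetry). A small p-value is evidence against the null hypothesis; a large p-value means the data are consistent with it.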

    P-values range from 0 to 1:

    print("Skewtest", "pvalue", stats.skewtest(generated))

    The result of the skewness test appears as follows:

    Skewtest pvalue (-0.62120640688766893, 0.5344638245033837)
    

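    A p-value of 0.53 is far above the commonly used 0.05 threshold, so the test gives us no reason to reject the hypothesis that the skewness matches that of a normal distribution. It does not mean that there is a 53 percent chance we are not dealing with a normal distribution. It is instructive to see what happens if we generate more or fewer points. For 900,000 points, we get a p-value of 0.16. For 20 generated values, the p-value is 0.50. Because the values really are drawn from a normal distribution, the p-value varies from run to run and does not settle toward a particular value as the sample grows.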

  4. Kurtosis tells us how heavy the tails of a probability distribution are compared with those of a normal distribution. Perform a kurtosis test. This test is set up similarly to the skewness test, but, of course, applies to kurtosis:
    print("Kurtosistest", "pvalue", stats.kurtosistest(generated))

    The result of the kurtosis test appears as follows:

    Kurtosistest pvalue (1.3065381019536981, 0.19136963054975586)
    

    The p-value for 900,000 generated values is 0.028. For 20 generated values, the p-value is 0.88.

  5. A normality test checks whether a dataset is consistent with the normal distribution. Perform a normality test. This test also returns two values, of which the first is the test statistic and the second is a p-value:
    print("Normaltest", "pvalue", stats.normaltest(generated))

    The result of the normality test appears as follows:

    Normaltest pvalue (2.09293921181506, 0.35117535059841687)
    

    The p-value for 900,000 generated values is 0.035. For 20 generated values, the p-value is 0.79.

  6. We can find the value at a certain percentile easily with SciPy:
    print("95 percentile", stats.scoreatpercentile(generated, 95))

    The value at the 95th percentile appears as follows:

    95 percentile 1.54048860252
    
  7. Do the opposite of the previous step: find the percentile rank of the value 1, that is, the percentage of generated values that fall below 1:
    print("Percentile at 1", stats.percentileofscore(generated, 1))

    The percentile at 1 appears as follows:

    Percentile at 1 85.5555555556
    
  8. Plot the generated values in a histogram with matplotlib (more information about matplotlib can be found in the previous chapter, Chapter 9, Plotting with matplotlib):
    plt.hist(generated)

    The histogram of the generated random values is as follows:

    [Figure: histogram of the 900 generated random values]
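
The p-values in steps 3 to 5 change every time the script runs, which makes a single value hard to interpret. The following is a minimal sketch, not part of statistics.py, that repeats the skewness test on many independent normal samples and counts how often the p-value falls below 0.05. The seed, the number of trials, and the variable names are arbitrary choices made for illustration, and np.random.default_rng assumes a reasonably recent NumPy.

```python
import numpy as np
from scipy import stats

# Arbitrary seed (an assumption) so that the run is repeatable.
rng = np.random.default_rng(12345)

n_trials = 2000     # number of independent samples; illustrative choice
sample_size = 900   # same sample size as in the steps above

# Draw all samples at once: one row per trial.
samples = rng.normal(size=(n_trials, sample_size))

# Run the skewness test along each row, giving one p-value per trial.
pvalues = stats.skewtest(samples, axis=1).pvalue

# Fraction of trials in which the test would reject normality at 0.05.
fraction_below = np.mean(pvalues < 0.05)
print("Fraction of p-values below 0.05:", fraction_below)
```

Since every sample here really is normal, the printed fraction should land near 0.05. In other words, the test rejects a truly normal sample about one time in twenty at that threshold, so an occasional small p-value is expected.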

What just happened?

We created a dataset from a normal distribution and analyzed it with the scipy.stats module (see statistics.py):

from __future__ import print_function
from scipy import stats
import matplotlib.pyplot as plt

generated = stats.norm.rvs(size=900)
print("Mean", "Std", stats.norm.fit(generated))
print("Skewtest", "pvalue", stats.skewtest(generated))
print("Kurtosistest", "pvalue", stats.kurtosistest(generated))
print("Normaltest", "pvalue", stats.normaltest(generated))
print("95 percentile", stats.scoreatpercentile(generated, 95))
print("Percentile at 1", stats.percentileofscore(generated, 1))
plt.title('Histogram of 900 random normally distributed values')
plt.hist(generated)
plt.grid()
plt.show()
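
Because the values are random, the numbers printed by the script differ on every run. The sketch below shows one possible way to make a run repeatable by passing random_state to rvs(), and to unpack the returned pairs into named variables. The seed value 42 and the variable names are assumptions made for illustration; they do not appear in statistics.py.

```python
from scipy import stats

# random_state fixes the seed of the generator; 42 is an arbitrary choice.
generated = stats.norm.rvs(size=900, random_state=42)

# norm.fit returns the estimated location (mean) and scale (std).
mean, std = stats.norm.fit(generated)
print("Mean", mean, "Std", std)

# The test results unpack into the test statistic and the p-value.
statistic, pvalue = stats.normaltest(generated)
print("Normaltest statistic", statistic, "pvalue", pvalue)

# Generating again with the same seed reproduces the same values.
repeat = stats.norm.rvs(size=900, random_state=42)
print("Same values on rerun:", (generated == repeat).all())
```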

Have a go hero – improving the data generation

Judging from the histogram in the previous Time for action section, there is room for improvement when it comes to generating the data. Try using NumPy or different parameters of the scipy.stats.norm.rvs() function.
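
As a possible starting point, and not the only answer to this exercise, the following sketch generates a larger sample in two ways: with a NumPy random generator and with explicit loc and scale arguments to scipy.stats.norm.rvs(). The sample size of 9000, the seed, and the choice of 30 bins are assumptions picked for illustration.

```python
import numpy as np
from scipy import stats

# Arbitrary seed (an assumption) so that the run is repeatable.
rng = np.random.default_rng(2024)

# Option 1: NumPy generator; loc is the mean, scale the standard deviation.
with_numpy = rng.normal(loc=0.0, scale=1.0, size=9000)

# Option 2: scipy.stats.norm.rvs() with explicit parameters.
with_scipy = stats.norm.rvs(loc=0.0, scale=1.0, size=9000, random_state=2024)

# Compare the fitted mean and standard deviation of both samples.
for label, values in (("NumPy", with_numpy), ("SciPy", with_scipy)):
    print(label, "fit:", stats.norm.fit(values))

# Bin the values without plotting; plt.hist(with_numpy, bins=30) would
# draw the same bins as a histogram.
counts, edges = np.histogram(with_numpy, bins=30)
print("Bins:", len(counts), "Total count:", counts.sum())
```

With more values and more bins, the histogram should follow the bell shape more closely than the 900-value plot with the default number of bins.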
