Time for action – performing simple statistics

We can use some kind of threshold to weed out outliers, but there is a better way. It is called the median, and it basically picks the middle value of a sorted set of values (see https://www.khanacademy.org/math/probability/descriptive-statistics/central_tendency/e/mean_median_and_mode). One half of the data is below the median and the other half is above it. For example, if we have the values of 1, 2, 3, 4, and 5, then the median will be 3, since it is in the middle.
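
To see why the median is a better guard against outliers than a threshold or a plain average, here is a minimal sketch. The `prices` array is made up for illustration (it is not taken from data.csv), and its last value is a deliberate outlier:

```python
import numpy as np

# Hypothetical values for illustration only; 500.0 is a deliberate outlier.
prices = np.array([1.0, 2.0, 3.0, 4.0, 500.0])

mean_price = np.mean(prices)      # dragged far upward by the outlier
median_price = np.median(prices)  # still the middle of the sorted values

print("mean =", mean_price)
print("median =", median_price)
```

The mean lands well above four of the five values, while the median stays at the middle value.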

These are the steps to calculate the median:

  1. Create a new Python script and call it simplestats.py. You already know how to load the data from a CSV file into an array. So, copy that line of code and make sure that it only gets the close price. The code should appear like this:
    c=np.loadtxt('data.csv', delimiter=',', usecols=(6,), unpack=True)
  2. The function that will do the magic for us is called median(). We will call it and print the result immediately. Add the following line of code:
    print("median =", np.median(c))

    The program prints the following output:

    median = 352.055
    
  3. Since it is our first time using the median() function, we would like to check whether this is correct. We could do it by going through the file and finding the correct value, but that is no fun. Instead, we will mimic the median algorithm by sorting the close price array and printing the middle value of the sorted array. The sort() function does the first part for us (older code uses msort() here, which recent NumPy releases have deprecated and removed; for a one-dimensional array the two are equivalent). Call the function, store the sorted array, and then print it:
    sorted_close = np.sort(c)
    print("sorted =", sorted_close)

    This prints the close prices sorted in ascending order (the output figure is not reproduced here).

    Yup, it works! Let's now get the middle value of the sorted array:

    N = len(c)
    print("middle =", sorted_close[(N - 1) // 2])

    The preceding snippet gives us the following output:

    middle = 351.99
    
  4. Hey, that's a different value than the one the median() function gave us. How come? Upon further investigation, we find that the median() function return value doesn't even appear in our file. That's even stranger! Before filing bugs with the NumPy team, let's have a look at the documentation:
    $ python
    >>> import numpy as np
    >>> help(np.median)

    This mystery is easy to solve. It turns out that our naive algorithm only works for arrays with odd lengths. For even-length arrays, the median is calculated from the average of the two array values in the middle. Therefore, type the following code:

    print("average middle =", (sorted_close[N // 2] + sorted_close[(N - 1) // 2]) / 2)

    This prints the following output:

    average middle = 352.055
    
  5. Another statistical measure that we are concerned with is variance. Variance tells us how much a variable varies (see https://www.khanacademy.org/math/probability/descriptive-statistics/variance_std_deviation/e/variance). In our case, it also tells us how risky an investment is, since a stock price that varies too wildly is bound to get us into trouble.

    Calculate the variance of the close price (with NumPy, this is just a one-liner):

    print("variance =", np.var(c))

    This gives us the following output:

    variance = 50.1265178889
    
  6. Not that we don't trust NumPy or anything, but let's double-check using the definition of variance, as found in the documentation. Mind you, this definition might be different than the one in your statistics book, but that is quite common in the field of statistics.

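    Note

    The population variance is defined as the mean of the squared deviations from the mean, that is, the sum of the squared deviations divided by the number of elements in the array:

    $$\frac{1}{n}\sum_{i=1}^{n}\left(a_i - \bar{a}\right)^2$$

    Here, n is the number of elements and $\bar{a}$ is the mean of the array. Some books tell us to divide by the number of elements in the array minus one instead (this is called the sample variance). NumPy's var() uses the population definition by default, so that is the one we check: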

    print("variance from definition =", np.mean((c - c.mean())**2))

    The output is as follows:

    variance from definition = 50.1265178889
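
The difference between the two variance definitions mentioned in step 6 can be sketched as follows. The `values` array is invented for illustration; the `ddof` argument of `np.var()` selects the divisor (n - ddof), so `ddof=1` should give the sample variance:

```python
import numpy as np

# Hypothetical data for illustration only (not from data.csv).
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(values)

# Sum of squared deviations from the mean.
squared_dev_sum = ((values - values.mean()) ** 2).sum()

population_var = squared_dev_sum / n       # divide by n
sample_var = squared_dev_sum / (n - 1)     # divide by n - 1

print("population:", population_var, np.var(values))
print("sample:    ", sample_var, np.var(values, ddof=1))
```

For large arrays the two results are close; the gap matters mostly for small samples.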
    

What just happened?

Maybe you noticed something new. We suddenly called the mean() function on the c array. Yes, this is legal, because the ndarray class has a mean() method. This is for your convenience. For now, just keep in mind that this is possible. The code for this example can be found in simplestats.py:

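from __future__ import print_function
import numpy as np

# Load only the close price column from the CSV file.
c = np.loadtxt('data.csv', delimiter=',', usecols=(6,), unpack=True)
print("median =", np.median(c))

# Sort a copy of the prices (np.sort replaces the removed np.msort).
sorted_close = np.sort(c)
print("sorted =", sorted_close)

# Integer division (//) keeps the indices valid in Python 3.
N = len(c)
print("middle =", sorted_close[(N - 1) // 2])
print("average middle =", (sorted_close[N // 2] + sorted_close[(N - 1) // 2]) / 2)

print("variance =", np.var(c))
print("variance from definition =", np.mean((c - c.mean())**2))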
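
The manual median check from steps 3 and 4 can be folded into one small helper that handles both odd and even lengths. This is only a sketch: `middle_value` is a hypothetical name, not a NumPy function, and the test arrays are made up:

```python
import numpy as np

def middle_value(values):
    """Median by sorting; a hypothetical helper mirroring the manual check."""
    ordered = np.sort(values)
    n = len(ordered)
    # For odd n both indices are the same element; for even n they are
    # the two elements on either side of the middle.
    lower = ordered[(n - 1) // 2]
    upper = ordered[n // 2]
    return (lower + upper) / 2

odd = np.array([5.0, 1.0, 3.0])
even = np.array([4.0, 1.0, 3.0, 2.0])

print(middle_value(odd), np.median(odd))
print(middle_value(even), np.median(even))
```

Averaging the two indices in every case avoids a separate branch for odd-length arrays, since averaging an element with itself leaves it unchanged.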