We can use some kind of threshold to weed out outliers, but there is a better way. It is called the median, and it basically picks the middle value of a sorted set of values (see https://www.khanacademy.org/math/probability/descriptive-statistics/central_tendency/e/mean_median_and_mode). One half of the data is below the median and the other half is above it. For example, if we have the values of 1, 2, 3, 4, and 5, then the median will be 3, since it is in the middle.
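To make the robustness claim concrete, here is a minimal sketch that compares the median and the mean on the example values from the text. The second array and its outlier value of 500 are made up for illustration and do not come from the data file.

```python
import numpy as np

# Values from the text, plus a copy where the last value is replaced
# by a made-up outlier (500 is an assumption for illustration only).
values = np.array([1, 2, 3, 4, 5])
with_outlier = np.array([1, 2, 3, 4, 500])

# The median stays at the middle value in both cases.
print("median =", np.median(values))
print("median with outlier =", np.median(with_outlier))

# The mean is pulled strongly toward the outlier.
print("mean =", np.mean(values))
print("mean with outlier =", np.mean(with_outlier))
```

The median is 3.0 for both arrays, while the mean jumps from 3.0 to 102.0 once the outlier is present.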
These are the steps to calculate the median:
Create a new script named simplestats.py. You already know how to load the data from a CSV file into an array, so copy that line of code and make sure that it only gets the close price. The code should look like this:
c = np.loadtxt('data.csv', delimiter=',', usecols=(6,), unpack=True)
The function that finds the median is called median(). We will call it and print the result immediately. Add the following line of code:
print("median =", np.median(c))
The program prints the following output:
median = 352.055
Since this is our first time using the median() function, we would like to check whether this is correct. Obviously, we can do it by just going through the file and finding the correct value, but that is no fun. Instead, we will just mimic the median algorithm by sorting the close price array and printing the middle value of the sorted array. The sort() function does the first part for us (older code uses msort() here, which was deprecated in NumPy 1.24 and removed in NumPy 2.0; for a one-dimensional array, the two are equivalent). Call the function, store the sorted array, and then print it:
sorted_close = np.sort(c)
print("sorted =", sorted_close)
This prints the close prices in ascending order. Yup, it works! Let's now get the middle value of the sorted array:
N = len(c)
print("middle =", sorted_close[(N - 1) // 2])
The preceding snippet gives us the following output:
middle = 351.99
That value differs from the one the median() function gave us. How come? Upon further investigation, we find that the return value of the median() function doesn't even appear in our file. That's even stranger! Before filing bugs with the NumPy team, let's have a look at the documentation:
$ python
>>> import numpy as np
>>> help(np.median)
This mystery is easy to solve. It turns out that our naive algorithm only works for arrays with odd lengths. For even-length arrays, the median is calculated from the average of the two array values in the middle. Therefore, type the following code:
print("average middle =", (sorted_close[N // 2] + sorted_close[(N - 1) // 2]) / 2)
This prints the following output:
average middle = 352.055
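The odd and even cases can be folded into one small helper. The following is a sketch rather than NumPy's actual implementation; the name median_by_sorting and the two test arrays are hypothetical and only serve to compare the result with np.median().

```python
import numpy as np

def median_by_sorting(values):
    """Return the median by sorting and averaging the middle element(s).

    Hypothetical helper for illustration; not how NumPy implements median().
    """
    ordered = np.sort(values)
    n = len(ordered)
    # For odd n both indices point at the same element;
    # for even n they point at the two middle elements.
    lower = ordered[(n - 1) // 2]
    upper = ordered[n // 2]
    return (lower + upper) / 2

# Made-up arrays: one with odd length, one with even length.
odd = np.array([5.0, 1.0, 3.0])
even = np.array([4.0, 1.0, 3.0, 2.0])

print("odd:", median_by_sorting(odd), np.median(odd))
print("even:", median_by_sorting(even), np.median(even))
```

Using integer division (//) matters here, because array indices must be integers and a plain / yields a float in Python 3.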
Variance tells us how much a variable varies (see https://www.khanacademy.org/math/probability/descriptive-statistics/variance_std_deviation/e/variance). In our case, it also tells us how risky an investment is, since a stock price that varies too wildly is bound to get us into trouble.
Calculate the variance of the close price (with NumPy, this is just a one-liner):
print("variance =", np.var(c))
This gives us the following output:
variance = 50.1265178889
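We can double-check this against the definition. By default, var() computes the population variance: the mean of the squared deviations from the mean, that is, their sum divided by the number of elements in the array. Some books tell us to divide by the number of elements in the array minus one instead (this is called the sample variance). The following check uses the population definition: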
print("variance from definition =", np.mean((c - c.mean())**2))
The output is as follows:
variance from definition = 50.1265178889
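The difference between the two definitions can be sketched with the ddof parameter of np.var(), which sets the value subtracted from the number of elements in the divisor. The data array below is made up for illustration and is not taken from the close prices.

```python
import numpy as np

# Made-up data for illustration only.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(data)
squared_deviations = (data - data.mean()) ** 2

# Population variance: divide by n (the default of np.var, ddof=0).
population_var = squared_deviations.sum() / n
# Sample variance: divide by n - 1 (np.var with ddof=1).
sample_var = squared_deviations.sum() / (n - 1)

print("population:", population_var, np.var(data))
print("sample:", sample_var, np.var(data, ddof=1))
```

For a large array such as a long price history, the two results are close, since dividing by n or by n - 1 makes little difference.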
Maybe you noticed something new. We suddenly called the mean() function on the c array. Yes, this is legal, because the ndarray class has a mean() method. This is for your convenience. For now, just keep in mind that this is possible. The code for this example can be found in simplestats.py:
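from __future__ import print_function
import numpy as np

# Load only the close price (column 6) from the CSV file.
c = np.loadtxt('data.csv', delimiter=',', usecols=(6,), unpack=True)
print("median =", np.median(c))

# np.msort() was removed in NumPy 2.0; np.sort() is equivalent for a 1-D array.
sorted_close = np.sort(c)
print("sorted =", sorted_close)

N = len(c)
# Integer division (//) keeps the indices integers.
print("middle =", sorted_close[(N - 1) // 2])
print("average middle =", (sorted_close[N // 2] + sorted_close[(N - 1) // 2]) / 2)

# Population variance: NumPy one-liner and the same value from the definition.
print("variance =", np.var(c))
print("variance from definition =", np.mean((c - c.mean())**2))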
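As a small illustration of the ndarray convenience methods mentioned before the listing, the following sketch calls mean() and var() both as functions and as methods. The prices array is hypothetical, chosen so that the snippet runs without data.csv.

```python
import numpy as np

# Hypothetical close prices, so that the example runs without data.csv.
prices = np.array([350.0, 352.5, 351.0, 353.5])

# mean() and var() exist both as NumPy functions and as ndarray methods.
print("mean:", np.mean(prices), prices.mean())
print("variance:", np.var(prices), prices.var())

# median() is only available as a function; ndarray has no median() method.
print("has median method:", hasattr(prices, "median"))
print("median:", np.median(prices))
```

Both spellings of mean() and var() give the same result, so the choice between them is mostly a matter of style.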