Time for action – summarizing data

We will summarize the data by whole business week, running from Monday to Friday. During the period covered by the data, there was one holiday: Presidents' Day, on February 21. This happened to be a Monday, and the US stock exchanges were closed on this day, so the sample has no entry for it. The first day in the sample is a Friday, which is inconvenient. Use the following instructions to summarize the data:

  1. To simplify, look only at the first three weeks in the sample; later, you can have a go at improving this:
    close = close[:16]
    dates = dates[:16]

    We will be building on the code from the previous Time for action section.

  2. First, find the first Monday in the sample data. Recall that Mondays have the code 0 in Python. This is what we will put in the condition of the where() function. The result is a tuple holding an array of the matching indices. Flatten it with the ravel() function, and then take the first element, at position 0:
    # get first Monday
    first_monday = np.ravel(np.where(dates == 0))[0]
    print("The first Monday index is", first_monday)

    This will print the following output:

    The first Monday index is 1
    
  3. The next logical step is to find the last Friday in the sample. The logic is similar to the one for finding the first Monday, and the code for Friday is 4. This time, we want the last matching element, at index -1:
    # get last Friday
    last_friday = np.ravel(np.where(dates == 4))[-1]
    print("The last Friday index is", last_friday)

    This will give us the following output:

    The last Friday index is 15
    
  4. Next, create an array with the indices of all the days in the three weeks:
    weeks_indices = np.arange(first_monday, last_friday + 1)
    print("Weeks indices initial", weeks_indices)
  5. Split the array into three pieces of five elements each with the split() function:
    weeks_indices = np.split(weeks_indices, 3)
    print("Weeks indices after split", weeks_indices)

    This splits the array as follows:

    Weeks indices after split [array([1, 2, 3, 4, 5]), array([ 6,  7,  8,  9, 10]), array([11, 12, 13, 14, 15])]
    
  6. In NumPy, array dimensions are called axes. Now, we will get fancy with the apply_along_axis() function. This function calls another function, which we will provide, on each one-dimensional slice of an array along a given axis. Currently, we have a list of three arrays. Each one corresponds to one week in our sample and contains the indices of that week's days. Call the apply_along_axis() function by supplying our function, called summarize(), which must be defined before this call runs (its definition follows in the next step). Furthermore, specify the axis or dimension number (here 1), the array to operate on, and any additional arguments for the summarize() function:
    weeksummary = np.apply_along_axis(summarize, 1, weeks_indices, open, high, low, close)
    print("Week summary", weeksummary)
  7. For each week, the summarize() function returns a tuple that holds the ticker symbol and the open, high, low, and close prices for the week, similar to end-of-day data:
    def summarize(a, o, h, l, c):
        monday_open = o[a[0]]
        week_high = np.max( np.take(h, a) )
        week_low = np.min( np.take(l, a) )
        friday_close = c[a[-1]]
        
        return("APPL", monday_open, week_high, week_low, friday_close)

    Notice that we used the take() function to get the actual values from indices. Calculating the high and low values for the week was easily done with the max() and min() functions. The open for the week is the open for the first day in the week, Monday. Likewise, the close is the close for the last day of the week, Friday. Because the tuple mixes a string with numbers, NumPy stores every element of the result as a string:

    Week summary [['APPL' '335.8' '346.7' '334.3' '346.5']
     ['APPL' '347.89' '360.0' '347.64' '356.85']
     ['APPL' '356.79' '364.9' '349.52' '350.56']]
    
  8. Store the data in a file with the NumPy savetxt() function:
    np.savetxt("weeksummary.csv", weeksummary, delimiter=",", fmt="%s")

    As you can see, we have specified a filename, the array we want to store, a delimiter (in this case a comma), and the format in which to write each value (here %s, because the array holds strings).

    The format string starts with a percent sign. Second is an optional flag. The - flag means left justify, 0 means left pad with zeros, and + means precede with + or -. Third is an optional width, which indicates the minimum number of characters. Fourth, a dot is followed by a number linked to precision. Finally, there comes a character specifier; in our example, the character specifier is a string. The character codes are described as follows:

    Character code    Description
    c                 character
    d or i            signed decimal integer
    e or E            scientific notation with e or E
    f                 decimal floating point
    g,G               use the shorter of e,E or f
    o                 signed octal
    s                 string of characters
    u                 unsigned decimal integer
    x,X               unsigned hexadecimal integer

    View the generated file in your favorite editor, or type the following at the command line:

    $ cat weeksummary.csv
    APPL,335.8,346.7,334.3,346.5
    APPL,347.89,360.0,347.64,356.85
    APPL,356.79,364.9,349.52,350.56
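
The format string rules from the last step can be tried out directly. The following is a minimal sketch, assuming that Python's % operator follows the same flag, width, precision, and character code conventions as the fmt argument of savetxt(). The values array is made up for illustration, and an in-memory buffer stands in for a file on disk:

```python
import io
import numpy as np

# Flag, width, and precision with the plain % operator.
print("%-8.2f|" % 3.14159)   # left-justified in 8 characters: '3.14    |'
print("%08.2f" % 3.14159)    # left-padded with zeros: '00003.14'
print("%+d" % 42)            # sign always shown: '+42'
print("%e" % 12345.678)      # scientific notation

# The same kind of format string passed to savetxt().
# 'values' is made-up data, not taken from the sample file.
values = np.array([[335.8, 346.7], [347.89, 360.0]])
buffer = io.StringIO()
np.savetxt(buffer, values, delimiter=",", fmt="%.2f")
print(buffer.getvalue())
```

With a numeric array, a format such as %.2f gives every value the same number of decimals, which the %s format used in the step above does not.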
    

What just happened?

We did something that is not even possible in some programming languages. We defined a function and passed it as an argument to the apply_along_axis() function.

Note

The programming paradigm described here is called functional programming. You can read more about functional programming in Python at https://docs.python.org/2/howto/functional.html.
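
To see the idea in isolation, here is a minimal sketch with made-up numbers. The helper value_range() and its scale parameter are hypothetical names chosen for this illustration; they are not part of the example code in this section:

```python
import numpy as np

def value_range(row, scale):
    """Return the scaled spread of one row (hypothetical helper)."""
    return (np.max(row) - np.min(row)) * scale

# Made-up data: three rows of five values each.
data = np.array([[1, 5, 3, 2, 4],
                 [10, 12, 11, 15, 13],
                 [7, 7, 7, 7, 7]])

# value_range is passed as an object, without calling it.
# apply_along_axis() calls it once per row (axis 1) and forwards
# the extra argument 2 as 'scale'.
spreads = np.apply_along_axis(value_range, 1, data, 2)
print(spreads)
```

Any arguments placed after the array are handed to the function unchanged on every call, which is how the price arrays reach summarize().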

Arguments for the summarize() function were neatly passed by apply_along_axis() (see weeksummary.py):

from __future__ import print_function
import numpy as np
from datetime import datetime

# Monday 0
# Tuesday 1
# Wednesday 2
# Thursday 3
# Friday 4
# Saturday 5
# Sunday 6
def datestr2num(s):
    return datetime.strptime(s, "%d-%m-%Y").date().weekday()

# Note: the name open shadows Python's built-in open() for the rest of this script.
dates, open, high, low, close = np.loadtxt('data.csv', delimiter=',', usecols=(1, 3, 4, 5, 6),
                                           converters={1: datestr2num}, unpack=True)
close = close[:16]
dates = dates[:16]

# get first Monday
first_monday = np.ravel(np.where(dates == 0))[0]
print("The first Monday index is", first_monday)

# get last Friday
last_friday = np.ravel(np.where(dates == 4))[-1]
print("The last Friday index is", last_friday)

weeks_indices = np.arange(first_monday, last_friday + 1)
print("Weeks indices initial", weeks_indices)

weeks_indices = np.split(weeks_indices, 3)
print("Weeks indices after split", weeks_indices)

def summarize(a, o, h, l, c):
    monday_open = o[a[0]]
    week_high = np.max( np.take(h, a) )
    week_low = np.min( np.take(l, a) )
    friday_close = c[a[-1]]

    return("APPL", monday_open, week_high, week_low, friday_close)

weeksummary = np.apply_along_axis(summarize, 1, weeks_indices, open, high, low, close)
print("Week summary", weeksummary)

np.savetxt("weeksummary.csv", weeksummary, delimiter=",", fmt="%s")
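
The listing above needs data.csv. To experiment with the index logic without that file, the following sketch uses made-up weekday codes and prices (hypothetical values, not taken from the data set). The helper high_low() is a simplified, numeric-only stand-in for summarize():

```python
import numpy as np

# Made-up sample: a Friday followed by two full business weeks.
days = np.array([4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4])
highs = np.array([10., 11., 12., 13., 12., 11., 14., 15., 16., 15., 14.])
lows = highs - 1.0

first_monday = np.ravel(np.where(days == 0))[0]
last_friday = np.ravel(np.where(days == 4))[-1]

# Indices 1..10, split into two equal pieces of five days each.
weeks = np.split(np.arange(first_monday, last_friday + 1), 2)

def high_low(indices, h, l):
    # Numeric-only summary, so the result stays a float array.
    return np.max(np.take(h, indices)), np.min(np.take(l, indices))

summary = np.apply_along_axis(high_low, 1, weeks, highs, lows)
print(first_monday, last_friday)
print(summary)
```

Because high_low() returns only numbers, the result is a float array and not an array of strings, which may be more convenient for further calculations.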

Have a go hero – improving the code

Change the code to deal with a holiday. Time the code to see how big the speedup due to apply_along_axis() is.
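
One possible starting point for the holiday case, offered as a sketch and not as the intended solution: rather than splitting into equal pieces, start a new week wherever the weekday code drops. The weekday codes below are made up, and the second week has no Monday:

```python
import numpy as np

# Made-up weekday codes: a Friday, a full week, then a week with no Monday.
days = np.array([4, 0, 1, 2, 3, 4, 1, 2, 3, 4])

# A new week begins wherever the weekday code is lower than the day before.
week_starts = np.where(np.diff(days) < 0)[0] + 1

# split() also accepts a list of cut positions, so pieces may differ in size.
weeks = np.split(np.arange(len(days)), week_starts)
for week in weeks:
    print(week, days[week])
```

The pieces now have different lengths, so they no longer form a rectangular array for apply_along_axis(); under this approach each week would have to be summarized in a loop or a list comprehension.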
