Time for action – summarizing data

The data we will summarize covers whole business weeks, from Monday to Friday. During the period covered by the data, there was one holiday, Presidents' Day, on February 21st. It fell on a Monday, and the US stock exchanges were closed that day, so the sample has no entry for it. The first day in the sample is a Friday, which is inconvenient. Use the following instructions to summarize the data:

  1. To simplify, we will just have a look at the first three weeks in the sample—you can later have a go at improving this.
    close = close[:16]
    dates = dates[:16]

    We will build on the code from the Time for action – dealing with dates tutorial.

  2. First, we will find the first Monday in our sample data. Recall that Mondays have the code 0 in Python. This is what we will put in the condition of a where function. The where function returns a tuple of index arrays, so flatten the result with the ravel function and then take the first element, at index 0.
    # get first Monday
    first_monday = np.ravel(np.where(dates == 0))[0]
    print "The first Monday index is", first_monday

    This will print the following output:

    The first Monday index is 1
  3. The next logical step is to find the last Friday in the sample. The logic is similar to the one for finding the first Monday, and the code for Friday is 4. This time, we want the last element, which has index -1.
    # get last Friday
    last_friday = np.ravel(np.where(dates == 4))[-1]
    print "The last Friday index is", last_friday

    This will give us the following output:

    The last Friday index is 15

    Next, create an array with the indices of all the days in the three weeks:

    weeks_indices = np.arange(first_monday, last_friday + 1)
    print "Weeks indices initial", weeks_indices
  4. Split the array into three pieces of size 5 with the split function. The second argument is the number of equal pieces, not the size of each piece.
    weeks_indices = np.split(weeks_indices, 3)
    print "Weeks indices after split", weeks_indices

    It splits the array, as follows:

    Weeks indices after split [array([1, 2, 3, 4, 5]), array([ 6,  7,  8,  9, 10]), array([11, 12, 13, 14, 15])]
  5. In NumPy, dimensions are called axes. Now, we will get fancy with the apply_along_axis function. This function calls another function, which we provide, on each one-dimensional slice of an array along a given axis. Currently, we have a list of three arrays. Each one corresponds to one week in our sample and contains the indices of that week's items. Call the apply_along_axis function by supplying our function, called summarize, that we will define shortly. Further specify the axis or dimension number (here 1, so that each week's row of indices is passed to summarize), the array to operate on, and a variable number of arguments for the summarize function, if any. In a script, the definition of summarize must come before this call.
    weeksummary = np.apply_along_axis(summarize, 1, weeks_indices, open, high, low, close)
    print "Week summary", weeksummary
  6. Write the summarize function. The summarize function returns, for each week, a tuple that holds the open, high, low, and close prices for the week, similarly to end-of-day data.
    def summarize(a, o, h, l, c):
        monday_open = o[a[0]]
        week_high = np.max( np.take(h, a) )
        week_low = np.min( np.take(l, a) )
        friday_close = c[a[-1]]
        
        return("APPL", monday_open, week_high, week_low, friday_close)

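    Notice that we used the take function to get the actual values from indices. Calculating the high and low values of the week was easily done with the max and min functions. The open for the week is the open for the first day in the week, Monday. Likewise, the close is the close for the last day of the week, Friday. Because the returned tuple contains a string, NumPy stores every value in the result as a string. The call from step 5 prints the following output: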

    Week summary [['APPL' '335.8' '346.7' '334.3' '346.5']
     ['APPL' '347.89' '360.0' '347.64' '356.85']
     ['APPL' '356.79' '364.9' '349.52' '350.56']]
  7. Store the data in a file with the NumPy savetxt function.
    np.savetxt("weeksummary.csv", weeksummary, delimiter=",", fmt="%s")

    As you can see, we specify a filename, the array we want to store, a delimiter (in this case a comma), and the format used to write each value (here %s, because the array holds strings).

    The format string starts with a percent sign. Second is an optional flag. The - flag means left justify, 0 means left pad with zeroes, and + means always precede the number with + or -. Third is an optional width, which indicates the minimum number of characters. Fourth, a dot is followed by a number that sets the precision. Finally, there comes a character specifier; in our example, the character specifier is s, for a string.

    Character code    Description
    c                 character
    d or i            signed decimal integer
    e or E            scientific notation with e or E
    f                 decimal floating point
    g or G            use the shorter of e, E, or f
    o                 signed octal
    s                 string of characters
    u                 unsigned decimal integer
    x or X            unsigned hexadecimal integer

    View the generated file in your favorite editor or type in the following commands in the command line:

    cat weeksummary.csv
    APPL,335.8,346.7,334.3,346.5
    APPL,347.89,360.0,347.64,356.85
    APPL,356.79,364.9,349.52,350.56
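
The format string rules from step 7 can be tried out directly with Python's % operator, which takes the same kind of format string that savetxt accepts through fmt. The following is a minimal sketch; the value 335.8 and the variable names are only illustrative and are not part of weeksummary.py.

```python
# Hypothetical sample value, used only to show the effect of each format.
price = 335.8

left_justified = "%-8.2f|" % price   # "-" flag, minimum width 8, precision 2
zero_padded = "%08.2f" % price       # "0" flag pads with zeroes on the left
signed = "%+.1f" % price             # "+" flag always shows the sign
scientific = "%e" % price            # e: scientific notation
as_string = "%s" % "APPL"            # s: string of characters

for text in (left_justified, zero_padded, signed, scientific, as_string):
    print(text)
```

The trailing | in the first format is there only to make the padding added by the width visible.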
    

What just happened?

We did something that is not even possible in some programming languages. We defined a function and passed it as an argument to the apply_along_axis function. Arguments for the summarize function were neatly passed by apply_along_axis (see weeksummary.py).

import numpy as np
from datetime import datetime

# Monday 0
# Tuesday 1
# Wednesday 2
# Thursday 3
# Friday 4
# Saturday 5
# Sunday 6
def datestr2num(s):
    return datetime.strptime(s, "%d-%m-%Y").date().weekday()

dates, open, high, low, close=np.loadtxt('data.csv', delimiter=',', usecols=(1, 3, 4, 5, 6), converters={1: datestr2num}, unpack=True)
close = close[:16]
dates = dates[:16]

# get first Monday
first_monday = np.ravel(np.where(dates == 0))[0]
print "The first Monday index is", first_monday

# get last Friday
last_friday = np.ravel(np.where(dates == 4))[-1]
print "The last Friday index is", last_friday

weeks_indices = np.arange(first_monday, last_friday + 1)
print "Weeks indices initial", weeks_indices

weeks_indices = np.split(weeks_indices, 3)
print "Weeks indices after split", weeks_indices

def summarize(a, o, h, l, c):
    monday_open = o[a[0]]
    week_high = np.max( np.take(h, a) )
    week_low = np.min( np.take(l, a) )
    friday_close = c[a[-1]]

    return("APPL", monday_open, week_high, week_low, friday_close)

weeksummary = np.apply_along_axis(summarize, 1, weeks_indices, open, high, low, close)
print "Week summary", weeksummary

np.savetxt("weeksummary.csv", weeksummary, delimiter=",", fmt="%s")
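
To see how apply_along_axis forwards the extra arguments, it can help to run it on a tiny array where the result is easy to check by hand. The sketch below uses made-up prices and a hypothetical week_range function (neither is part of weeksummary.py); it returns a plain number instead of a tuple, so the result stays numeric.

```python
import numpy as np

def week_range(a, h, l):
    # a holds the indices of one week; h and l are the extra arguments
    # that apply_along_axis passes through unchanged.
    return np.max(np.take(h, a)) - np.min(np.take(l, a))

# Made-up data: six days, grouped into two "weeks" of three days each.
high = np.array([11.0, 12.0, 13.0, 14.0, 15.0, 16.0])
low = np.array([9.0, 8.0, 10.0, 12.0, 10.0, 13.0])
weeks = np.array([[0, 1, 2], [3, 4, 5]])

# Axis 1 means that each row of weeks is handed to week_range in turn.
result = np.apply_along_axis(week_range, 1, weeks, high, low)
print(result)
```

Each row of weeks plays the role of one entry of weeks_indices, and high and low stand in for the price arrays loaded from data.csv.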

Have a go hero – improving the code

Change the code to deal with a holiday. Time the code to see how big the speedup due to apply_along_axis is.
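
One possible starting point for the holiday part, sketched here under assumptions rather than given as the intended solution: a new week begins wherever the weekday code drops, so the split positions can be derived from the dates array instead of assuming five days per week. The weekday codes and closing prices below are made up, with the Monday of the second full week missing.

```python
import numpy as np

# Made-up weekday codes (0 = Monday ... 4 = Friday). The sample starts on
# a Friday, and the second full week has no Monday (a holiday).
dates = np.array([4, 0, 1, 2, 3, 4, 1, 2, 3, 4, 0, 1, 2, 3, 4])
close = np.arange(100.0, 115.0)

# A week starts at every position where the weekday code decreases.
week_starts = np.where(np.diff(dates) < 0)[0] + 1
weeks = np.split(np.arange(len(dates)), week_starts)

# The weeks now differ in length, so they cannot be stacked into a 2-D
# array for apply_along_axis; a plain loop is used instead.
# weeks[0] is the leading partial week (the lone Friday) and is skipped.
summary = [(close[w[0]], close[w[-1]]) for w in weeks[1:]]
print(summary)
```

When split is given a list of positions instead of a count, it cuts the array at those positions, which is what allows weeks of unequal length.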
