In this section, we show you how to handle time series data. We will start by creating a time series from data read in from a CSV file:
In [7]: ibmData=pd.read_csv('ibm-common-stock-closing-prices-1959_1960.csv')
        ibmData.head()
Out[7]:
    TradeDate  closingPrice
0  1959-06-29           445
1  1959-06-30           448
2  1959-07-01           450
3  1959-07-02           447
4  1959-07-06           451

5 rows x 2 columns
The source of this information can be found at http://datamarket.com.
We would like the TradeDate
column to be a series of datetime
values so that we can index it and create a time series. Let us first check the type of values in the TradeDate
series:
In [16]: type(ibmData['TradeDate'])
Out[16]: pandas.core.series.Series

In [12]: type(ibmData['TradeDate'][0])
Out[12]: str
Next, we convert it to a Timestamp
type:
In [17]: ibmData['TradeDate']=pd.to_datetime(ibmData['TradeDate'])
         type(ibmData['TradeDate'][0])
Out[17]: pandas.tslib.Timestamp
We can now use the TradeDate
column as an index:
In [113]: # Convert DataFrame to TimeSeries
          # Resampling creates NaN rows for weekend dates, hence use dropna
          ibmTS=ibmData.set_index('TradeDate').resample('D')['closingPrice'].dropna()
          ibmTS
Out[113]:
TradeDate
1959-06-29    445
1959-06-30    448
1959-07-01    450
1959-07-02    447
1959-07-06    451
...
Name: closingPrice, Length: 255
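The same conversion can also be done in a single step by asking read_csv to parse and index the date column during the read itself. This is a minimal sketch using a small in-memory CSV standing in for the IBM file (the data here is a hypothetical excerpt, not the full dataset):

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for the IBM closing-prices file
csv_data = io.StringIO(
    "TradeDate,closingPrice\n"
    "1959-06-29,445\n"
    "1959-06-30,448\n"
    "1959-07-01,450\n"
)

# parse_dates + index_col build the DatetimeIndex during the read itself
ts = pd.read_csv(csv_data, parse_dates=['TradeDate'],
                 index_col='TradeDate')['closingPrice']
print(ts.index.dtype)  # datetime64[ns]
```

This avoids the separate to_datetime and set_index calls when the date column parses cleanly.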
A DateOffset object represents a change or offset in time. The key features of a DateOffset object are as follows:

- It can be added to or subtracted from a datetime object to obtain a shifted date
- It can be multiplied by an integer so that the offset is applied multiple times
- It has rollforward and rollback methods to move a date forward to the next offset date or backward to the previous one

We illustrate how we use a DateOffset object as follows:
In [371]: xmasDay=pd.datetime(2014,12,25)
          xmasDay
Out[371]: datetime.datetime(2014, 12, 25, 0, 0)

In [373]: boxingDay=xmasDay+pd.DateOffset(days=1)
          boxingDay
Out[373]: Timestamp('2014-12-26 00:00:00', tz=None)

In [390]: today=pd.datetime.now()
          today
Out[390]: datetime.datetime(2014, 5, 31, 13, 7, 36, 440060)
Note that datetime.datetime is different from pd.Timestamp. The former is a standard Python class and is relatively inefficient, while the latter is based on the numpy.datetime64 datatype. The pd.DateOffset object works with pd.Timestamp, and adding it to a datetime.datetime object casts that object into a pd.Timestamp object.
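This casting behavior can be sketched in a couple of lines; the isinstance check below simply confirms the type promotion described above:

```python
import datetime
import pandas as pd

d = datetime.datetime(2014, 12, 25)

# Adding a DateOffset to a plain datetime.datetime yields a pandas Timestamp
result = d + pd.DateOffset(days=1)
print(isinstance(result, pd.Timestamp))  # True
```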
The following illustrates the command for one week from today:
In [392]: today+pd.DateOffset(weeks=1)
Out[392]: Timestamp('2014-06-07 13:07:36.440060', tz=None)
The following illustrates the command for five years from today:
In [394]: today+2*pd.DateOffset(years=2, months=6)
Out[394]: Timestamp('2019-05-30 13:07:36.440060', tz=None)
Here is an example of using the rollforward
functionality. QuarterBegin
is a DateOffset
object that is used to increment a given datetime
object to the start of the next calendar quarter:
In [18]: lastDay=pd.datetime(2013,12,31)

In [24]: from pandas.tseries.offsets import QuarterBegin
         dtoffset=QuarterBegin()
         lastDay+dtoffset
Out[24]: Timestamp('2014-03-01 00:00:00', tz=None)

In [25]: dtoffset.rollforward(lastDay)
Out[25]: Timestamp('2014-03-01 00:00:00', tz=None)
Thus, we can see that the next quarter after December 31, 2013 starts on March 1, 2014. Timedeltas are similar to DateOffsets but work with datetime.datetime objects. Their use is illustrated by the following commands:
In [40]: weekDelta=datetime.timedelta(weeks=1)
         weekDelta
Out[40]: datetime.timedelta(7)

In [39]: today=pd.datetime.now()
         today
Out[39]: datetime.datetime(2014, 6, 2, 3, 56, 0, 600309)

In [41]: today+weekDelta
Out[41]: datetime.datetime(2014, 6, 9, 3, 56, 0, 600309)
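One practical difference worth noting: timedelta works only in fixed spans (weeks, days, seconds, and so on), while DateOffset understands calendar units such as months. A small sketch, using an arbitrary month-end date:

```python
import datetime
import pandas as pd

jan31 = datetime.datetime(2014, 1, 31)

# A "month" as a timedelta must be approximated by a fixed number of days
print(jan31 + datetime.timedelta(days=30))   # 2014-03-02 00:00:00

# DateOffset knows calendar months and clips to the end of February
print(jan31 + pd.DateOffset(months=1))       # 2014-02-28 00:00:00
```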
In this section, we explore various methods for Time Series objects such as shifting, frequency conversion, and resampling.
Sometimes, we may wish to shift the values in a time series backward or forward in time. One possible scenario is a dataset containing the start dates of last year's new employees at a firm, where the human resources department wishes to shift these dates forward by one year so that the employees' benefits can be activated. We can do this by using the shift() function as follows:
In [117]: ibmTS.shift(3)
Out[117]:
TradeDate
1959-06-29    NaN
1959-06-30    NaN
1959-07-01    NaN
1959-07-02    445
1959-07-06    448
1959-07-07    450
1959-07-08    447
...
This shifts all the calendar days. However, if we wish to shift only business days, we must use the following command:
In [119]: ibmTS.shift(3, freq=pd.datetools.bday)
Out[119]:
TradeDate
1959-07-02    445
1959-07-03    448
1959-07-06    450
1959-07-07    447
1959-07-09    451
In the preceding snippet, we have specified the freq
argument to shift; this tells the function to shift only the business days. The shift
function has a freq
argument whose value can be a DateOffset
class, timedelta
-like object, or an offset alias. Thus, using ibmTS.shift(3, freq='B')
would also produce the same result.
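The offset-alias form can be sketched on a tiny two-point series (the dates here are arbitrary illustrations, not the IBM data):

```python
import pandas as pd

# Mon 2014-06-02 and Tue 2014-06-03
idx = pd.to_datetime(['2014-06-02', '2014-06-03'])
s = pd.Series([1, 2], index=idx)

# Shift the index forward by two business days via the 'B' alias
shifted = s.shift(2, freq='B')
print(shifted.index.tolist())
# [Timestamp('2014-06-04 00:00:00'), Timestamp('2014-06-05 00:00:00')]
```

Note that shifting with freq moves the index while leaving the values attached to their rows.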
We can use the asfreq function to change frequencies, as follows:
In [131]: # Frequency conversion using asfreq
          ibmTS.asfreq('BM')
Out[131]:
1959-06-30    448
1959-07-31    428
1959-08-31    425
1959-09-30    411
1959-10-30    411
1959-11-30    428
1959-12-31    439
1960-01-29    418
1960-02-29    419
1960-03-31    445
1960-04-29    453
1960-05-31    504
1960-06-30    522
Freq: BM, Name: closingPrice, dtype: float64
In this case, we just obtain the values corresponding to the last business day of each month from the ibmTS time series. Here, 'BM' stands for business month end frequency. For a list of all possible frequency aliases, go to http://bit.ly/1cMI3iA.
If we specify a frequency that is smaller than the granularity of the data, the gaps will be filled in with NaN
values:
In [132]: ibmTS.asfreq('H')
Out[132]:
1959-06-29 00:00:00    445
1959-06-29 01:00:00    NaN
1959-06-29 02:00:00    NaN
1959-06-29 03:00:00    NaN
...
1960-06-29 23:00:00    NaN
1960-06-30 00:00:00    522
Freq: H, Name: closingPrice, Length: 8809
We can also apply the asfreq
method to the Period
and PeriodIndex
objects similar to how we do for the datetime
and Timestamp
objects. Period
and PeriodIndex
are introduced later and are used to represent time intervals.
The asfreq
method accepts a method argument that allows you to forward fill (ffill
) or back fill the gaps, similar to fillna
:
In [140]: ibmTS.asfreq('H', method='ffill')
Out[140]:
1959-06-29 00:00:00    445
1959-06-29 01:00:00    445
1959-06-29 02:00:00    445
1959-06-29 03:00:00    445
...
1960-06-29 23:00:00    522
1960-06-30 00:00:00    522
Freq: H, Name: closingPrice, Length: 8809
The TimeSeries.resample
function enables us to summarize/aggregate more granular data based on a sampling interval and a sampling function.
Downsampling is a term that originates from digital signal processing and refers to the process of reducing the sampling rate of a signal. In the case of data, we use it to reduce the amount of data that we wish to process.
The opposite process is upsampling, which is used to increase the amount of data to be processed and requires interpolation to obtain the intermediate data points. For more information on downsampling and upsampling, refer to Practical Applications of Upsampling and Downsampling at http://bit.ly/1JC95HD and Downsampling Time Series for Visual Representation at http://bit.ly/1zrExVP.
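Both directions can be sketched on a toy minutely series. Note that this sketch uses the newer pandas style, where resample() returns an object on which you call an aggregation or fill method, rather than taking a how or fill_method argument as in the session listings in this chapter:

```python
import pandas as pd

# Toy minutely series covering six minutes of "prices"
idx = pd.date_range('2014-05-27 09:30', periods=6, freq='min')
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# Downsampling: 3-minute bins aggregated with the mean
down = s.resample('3min').mean()
print(down.tolist())        # [2.0, 5.0]

# Upsampling: a 30-second grid, gaps forward-filled
up = s.resample('30s').ffill()
print(len(up))              # 11
```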
Here, we examine some tick data for use in resampling. Before we examine the data, we need to prepare it; in doing so, we will learn some useful techniques for handling time series data.
Here is an example that uses tick data for stock prices of Google for Tuesday, May 27, 2014:
In [150]: googTickData=pd.read_csv('./GOOG_tickdata_20140527.csv')

In [151]: googTickData.head()
Out[151]:
    Timestamp    close    high     low    open  volume
0  1401197402  555.008  556.41  554.35  556.38   81100
1  1401197460  556.250  556.30  555.25  555.25   18500
2  1401197526  556.730  556.75  556.05  556.39    9900
3  1401197582  557.480  557.67  556.73  556.73   14700
4  1401197642  558.155  558.66  557.48  557.59   15700

5 rows x 6 columns
The source for the preceding data can be found at http://bit.ly/1MKBwlB.
As you can see from the preceding section, we have a Timestamp column along with the columns for the close, high, low, and opening prices and the volume of trades of the Google stock.
So, why does the Timestamp column seem a bit strange? Well, tick data Timestamps are generally expressed in epoch time (for more information, refer to http://en.wikipedia.org/wiki/Unix_epoch) as a more compact means of storage. We'll need to convert this into a more human-readable time, and we can do this as follows:
In [201]: googTickData['tstamp']=pd.to_datetime(googTickData['Timestamp'],unit='s',utc=True)

In [209]: googTickData.head()
Out[209]:
    Timestamp    close    high     low    open  volume               tstamp
0  1401197402  555.008  556.41  554.35  556.38   81100  2014-05-27 13:30:02
1  1401197460  556.250  556.30  555.25  555.25   18500  2014-05-27 13:31:00
2  1401197526  556.730  556.75  556.05  556.39    9900  2014-05-27 13:32:06
3  1401197582  557.480  557.67  556.73  556.73   14700  2014-05-27 13:33:02
4  1401197642  558.155  558.66  557.48  557.59   15700  2014-05-27 13:34:02

5 rows x 7 columns
We would now like to make the tstamp column the index and eliminate the epoch Timestamp column:
In [210]: googTickTS=googTickData.set_index('tstamp')
          googTickTS=googTickTS.drop('Timestamp',axis=1)
          googTickTS.head()
Out[210]:
tstamp                 close    high     low    open  volume
2014-05-27 13:30:02  555.008  556.41  554.35  556.38   81100
2014-05-27 13:31:00  556.250  556.30  555.25  555.25   18500
2014-05-27 13:32:06  556.730  556.75  556.05  556.39    9900
2014-05-27 13:33:02  557.480  557.67  556.73  556.73   14700
2014-05-27 13:34:02  558.155  558.66  557.48  557.59   15700

5 rows x 5 columns
Note that the tstamp index column has the times in UTC; we can convert them to US/Eastern time using the tz_localize and tz_convert methods:
In [211]: googTickTS.index=googTickTS.index.tz_localize('UTC').tz_convert('US/Eastern')

In [212]: googTickTS.head()
Out[212]:
tstamp                       close    high     low    open  volume
2014-05-27 09:30:02-04:00  555.008  556.41  554.35  556.38   81100
2014-05-27 09:31:00-04:00  556.250  556.30  555.25  555.25   18500
2014-05-27 09:32:06-04:00  556.730  556.75  556.05  556.39    9900
2014-05-27 09:33:02-04:00  557.480  557.67  556.73  556.73   14700
2014-05-27 09:34:02-04:00  558.155  558.66  557.48  557.59   15700

5 rows x 5 columns

In [213]: googTickTS.tail()
Out[213]:
tstamp                        close    high     low     open  volume
2014-05-27 15:56:00-04:00  565.4300  565.48  565.30  565.385   14300
2014-05-27 15:57:00-04:00  565.3050  565.46  565.20  565.400   14700
2014-05-27 15:58:00-04:00  565.1101  565.31  565.10  565.310   23200
2014-05-27 15:59:00-04:00  565.9400  566.00  565.08  565.230   55600
2014-05-27 16:00:00-04:00  565.9500  565.95  565.95  565.950  126000

5 rows x 5 columns

In [214]: len(googTickTS)
Out[214]: 390
From the preceding output, we can see ticks for every minute of the trading day—from 9:30 a.m., when the stock market opens, until 4:00 p.m., when it closes. This results in 390 rows in the dataset since there are 390 minutes between 9:30 a.m. and 4:00 p.m.
Suppose we want to obtain a snapshot every 5 minutes instead of every minute. We can achieve this by using downsampling as follows:
In [216]: googTickTS.resample('5Min').head(6)
Out[216]:
tstamp                         close       high        low       open  volume
2014-05-27 09:30:00-04:00  556.72460  557.15800  555.97200  556.46800   27980
2014-05-27 09:35:00-04:00  556.93648  557.64800  556.85100  557.34200   24620
2014-05-27 09:40:00-04:00  556.48600  556.79994  556.27700  556.60678    8620
2014-05-27 09:45:00-04:00  557.05300  557.27600  556.73800  556.96600    9720
2014-05-27 09:50:00-04:00  556.66200  556.93596  556.46400  556.80326   14560
2014-05-27 09:55:00-04:00  555.96580  556.35400  555.85800  556.23600   12400

6 rows x 5 columns
The default function used for resampling is the mean. However, we can also specify other functions, such as the minimum, and we can do this via the how
parameter to resample:
In [245]: googTickTS.resample('10Min', how=np.min).head(4)
Out[245]:
tstamp                       close      high     low    open  volume
2014-05-27 09:30:00-04:00  555.008  556.3000  554.35  555.25    9900
2014-05-27 09:40:00-04:00  556.190  556.5600  556.13  556.35    3500
2014-05-27 09:50:00-04:00  554.770  555.5500  554.77  555.55    3400
2014-05-27 10:00:00-04:00  554.580  554.9847  554.45  554.58    1800
Various function names can be passed to the how parameter, such as sum, ohlc, max, min, std, mean, median, first, and last.
The ohlc function returns the open-high-low-close values for the time series data in each interval; that is, the first, maximum, minimum, and last values. To specify whether the left or right endpoint of each interval is closed, we can pass the closed parameter as follows:
In [254]: pd.set_option('display.precision',5)
          googTickTS.resample('5Min', closed='right').tail(3)
Out[254]:
tstamp                        close      high       low      open      volume
2014-05-27 15:45:00-04:00  564.3167  564.3733  564.1075  564.1700  12816.6667
2014-05-27 15:50:00-04:00  565.1128  565.1725  565.0090  565.0650  13325.0000
2014-05-27 15:55:00-04:00  565.5158  565.6033  565.3083  565.4158  40933.3333

3 rows x 5 columns
Thus, in the preceding command, we can see that the last row shows the tick at 15:55 instead of 16:00.
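The ohlc aggregation mentioned earlier can be sketched on a toy price series (again using the newer method-call form of resample rather than the how argument):

```python
import pandas as pd

# Six minutes of toy prices
idx = pd.date_range('2014-05-27 09:30', periods=6, freq='min')
prices = pd.Series([555.0, 556.2, 556.7, 557.4, 558.1, 557.9], index=idx)

# One open-high-low-close bar per 3-minute bin
bars = prices.resample('3min').ohlc()
print(bars.columns.tolist())   # ['open', 'high', 'low', 'close']
print(bars['close'].tolist())  # [556.7, 557.9]
```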
For upsampling, we need to specify a fill method to determine how the gaps should be filled via the fill_method
parameter:
In [263]: googTickTS[:3].resample('30s', fill_method='ffill')
Out[263]:
tstamp                       close    high     low    open  volume
2014-05-27 09:30:00-04:00  555.008  556.41  554.35  556.38   81100
2014-05-27 09:30:30-04:00  555.008  556.41  554.35  556.38   81100
2014-05-27 09:31:00-04:00  556.250  556.30  555.25  555.25   18500
2014-05-27 09:31:30-04:00  556.250  556.30  555.25  555.25   18500
2014-05-27 09:32:00-04:00  556.730  556.75  556.05  556.39    9900

5 rows x 5 columns

In [264]: googTickTS[:3].resample('30s', fill_method='bfill')
Out[264]:
tstamp                       close    high     low    open  volume
2014-05-27 09:30:00-04:00  555.008  556.41  554.35  556.38   81100
2014-05-27 09:30:30-04:00  556.250  556.30  555.25  555.25   18500
2014-05-27 09:31:00-04:00  556.250  556.30  555.25  555.25   18500
2014-05-27 09:31:30-04:00  556.730  556.75  556.05  556.39    9900
2014-05-27 09:32:00-04:00  556.730  556.75  556.05  556.39    9900

5 rows x 5 columns
Unfortunately, the fill_method parameter currently supports only two methods: forward fill and back fill. An interpolation method would be valuable.
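In more recent pandas versions, the resampler does expose an interpolate() method, so linear interpolation of upsampled gaps is possible. A minimal sketch, assuming the newer method-call resample API:

```python
import pandas as pd

# Two prices one minute apart
idx = pd.to_datetime(['2014-05-27 09:30:00', '2014-05-27 09:31:00'])
s = pd.Series([555.0, 556.0], index=idx)

# Upsample to 30-second bins and fill the gap by linear interpolation
filled = s.resample('30s').interpolate()
print(filled.tolist())  # [555.0, 555.5, 556.0]
```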
To specify offsets, a number of aliases are available; some of the most commonly used ones are as follows:

- B: business day frequency
- D: calendar day frequency
- W: weekly frequency
- M: month end frequency
- Q: quarter end frequency
- A: year end frequency
- H: hourly frequency
- T: minutely frequency
- S: secondly frequency

These aliases can also be combined. In the following case, we resample every 7 minutes and 30 seconds:
In [267]: googTickTS.resample('7T30S').head(5)
Out[267]:
tstamp                        close      high       low      open   volume
2014-05-27 09:30:00-04:00  556.8266  557.4362  556.3144  556.8800  28075.0
2014-05-27 09:37:30-04:00  556.5889  556.9342  556.4264  556.7206  11642.9
2014-05-27 09:45:00-04:00  556.9921  557.2185  556.7171  556.9871   9800.0
2014-05-27 09:52:30-04:00  556.1824  556.5375  556.0350  556.3896  14350.0
2014-05-27 10:00:00-04:00  555.2111  555.4368  554.8288  554.9675  12512.5

5 rows x 5 columns
Suffixes can be applied to the frequency aliases to specify when in a frequency period to start; these are known as anchoring offsets. Examples include W-SUN (weekly, anchored on Sundays), Q-DEC (quarterly, with the year ending in December), and A-JAN (annually, with the year ending in January).
These offsets can be used as arguments to the date_range and bdate_range functions, as well as to constructors for index types such as PeriodIndex and DatetimeIndex. A comprehensive discussion can be found in the pandas documentation at http://pandas.pydata.org/pandas-docs/stable/timeseries.html.
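As a quick sketch of an anchored alias in date_range, 'W-SUN' generates weekly dates falling on Sundays (the start date here is arbitrary):

```python
import pandas as pd

# 2014-06-01 is a Sunday, so the range stays anchored on Sundays
rng = pd.date_range('2014-06-01', periods=3, freq='W-SUN')
print([d.strftime('%Y-%m-%d') for d in rng])
# ['2014-06-01', '2014-06-08', '2014-06-15']
```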
When dealing with time series, there are two main concepts that you have to consider: points in time and ranges, or time spans. In pandas, the former is represented by the Timestamp datatype, which is equivalent to, and interchangeable with, Python's datetime.datetime (datetime) datatype. The latter (time span) is represented by the Period datatype, which is specific to pandas.
Each of these datatypes has index datatypes associated with them: DatetimeIndex
for Timestamp
/Datetime
and PeriodIndex
for Period
. These index datatypes are basically subtypes of numpy.ndarray
that contain the corresponding Timestamp and Period datatypes and can be used as indexes for Series and DataFrame objects.
The Period
datatype is used to represent a range or span of time. Here are a few examples:
# Period representing May 2014
In [287]: pd.Period('2014', freq='A-MAY')
Out[287]: Period('2014', 'A-MAY')

# Period representing a specific day, June 11, 2014
In [292]: pd.Period('06/11/2014')
Out[292]: Period('2014-06-11', 'D')

# Period representing 11AM, Nov 11, 1918
In [298]: pd.Period('11/11/1918 11:00',freq='H')
Out[298]: Period('1918-11-11 11:00', 'H')
We can add integers to Periods, which advances the period by the requisite number of units of the frequency:
In [299]: pd.Period('06/30/2014')+4
Out[299]: Period('2014-07-04', 'D')

In [303]: pd.Period('11/11/1918 11:00',freq='H') - 48
Out[303]: Period('1918-11-09 11:00', 'H')
We can also calculate the difference between two Periods; this returns the number of units of frequency between them:
In [304]: pd.Period('2014-04', freq='M')-pd.Period('2013-02', freq='M')
Out[304]: 14
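One caveat: in recent pandas versions, the difference between two Periods is returned as an offset object (carrying an n attribute) rather than a plain integer, so code relying on the integer result may need adjusting. A sketch that extracts the count either way:

```python
import pandas as pd

diff = pd.Period('2014-04', freq='M') - pd.Period('2013-02', freq='M')

# Older pandas returns the int 14; newer pandas returns a 14-month offset
months = getattr(diff, 'n', diff)
print(months)  # 14
```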
A PeriodIndex object, which is an index type for Period objects, can be created in two ways:

1. From a series of Period objects, using the period_range function, an analogue of date_range:

In [305]: perRng=pd.period_range('02/01/2014','02/06/2014',freq='D')
          perRng
Out[305]: <class 'pandas.tseries.period.PeriodIndex'>
          freq: D
          [2014-02-01, ..., 2014-02-06]
          length: 6

In [306]: type(perRng[:2])
Out[306]: pandas.tseries.period.PeriodIndex

In [307]: perRng[:2]
Out[307]: <class 'pandas.tseries.period.PeriodIndex'>
          freq: D
          [2014-02-01, 2014-02-02]
As we can confirm from the preceding output, when you pull back the covers, a PeriodIndex is really an ndarray of Period objects underneath.
2. Via a direct call to the PeriodIndex constructor:

In [312]: JulyPeriod=pd.PeriodIndex(['07/01/2014','07/31/2014'], freq='D')
          JulyPeriod
Out[312]: <class 'pandas.tseries.period.PeriodIndex'>
          freq: D
          [2014-07-01, 2014-07-31]
The difference between the two approaches, as can be seen from the preceding output, is that period_range fills in the resulting ndarray, but the PeriodIndex constructor does not; you have to specify all the values that should be in the index.
We can convert the Period
and PeriodIndex
datatypes to Datetime
/Timestamp
and DatetimeIndex
datatypes via the to_period
and to_timestamp
functions, as follows:
In [339]: worldCupFinal=pd.to_datetime('07/13/2014', errors='raise')
          worldCupFinal
Out[339]: Timestamp('2014-07-13 00:00:00')

In [340]: worldCupFinal.to_period('D')
Out[340]: Period('2014-07-13', 'D')

In [342]: worldCupKickoff=pd.Period('06/12/2014','D')
          worldCupKickoff
Out[342]: Period('2014-06-12', 'D')

In [345]: worldCupKickoff.to_timestamp()
Out[345]: Timestamp('2014-06-12 00:00:00', tz=None)

In [346]: worldCupDays=pd.date_range('06/12/2014',periods=32, freq='D')
          worldCupDays
Out[346]: <class 'pandas.tseries.index.DatetimeIndex'>
          [2014-06-12, ..., 2014-07-13]
          Length: 32, Freq: D, Timezone: None

In [347]: worldCupDays.to_period()
Out[347]: <class 'pandas.tseries.period.PeriodIndex'>
          freq: D
          [2014-06-12, ..., 2014-07-13]
          length: 32