Upsampling time series data

In upsampling, the frequency of the time series is increased. As a result, we have more sample points than data points. One of the main questions is how to account for the entries in the series where we have no measurement.

Let's start with hourly data for a single day:

>>> rng = pd.date_range('4/29/2015 8:00', periods=10, freq='H')
>>> ts = pd.Series(np.random.randint(0, 100, len(rng)), index=rng)
>>> ts.head()
2015-04-29 08:00:00    30
2015-04-29 09:00:00    27
2015-04-29 10:00:00    54
2015-04-29 11:00:00     9
2015-04-29 12:00:00    48
Freq: H, dtype: int64

If we upsample to data points taken every 15 minutes, our time series will be extended with NaN values:

>>> ts.resample('15min')
>>> ts.head()
2015-04-29 08:00:00    30
2015-04-29 08:15:00   NaN
2015-04-29 08:30:00   NaN
2015-04-29 08:45:00   NaN
2015-04-29 09:00:00    27

There are various ways to deal with missing values, which can be controlled by the fill_method keyword argument to resample. Values can be filled either forward or backward:

>>> ts.resample('15min', fill_method='ffill').head()
2015-04-29 08:00:00    30
2015-04-29 08:15:00    30
2015-04-29 08:30:00    30
2015-04-29 08:45:00    30
2015-04-29 09:00:00    27
Freq: 15T, dtype: int64
>>> ts.resample('15min', fill_method='bfill').head()
2015-04-29 08:00:00    30
2015-04-29 08:15:00    27
2015-04-29 08:30:00    27
2015-04-29 08:45:00    27
2015-04-29 09:00:00    27

With the limit parameter, it is possible to control the number of missing values to be filled:

>>> ts.resample('15min', fill_method='ffill', limit=2).head()
2015-04-29 08:00:00    30
2015-04-29 08:15:00    30
2015-04-29 08:30:00    30
2015-04-29 08:45:00   NaN
2015-04-29 09:00:00    27
Freq: 15T, dtype: float64

If you want to adjust the labels during resampling, you can use the loffset keyword argument:

>>> ts.resample('15min', fill_method='ffill', limit=2, loffset='5min').head()
2015-04-29 08:05:00    30
2015-04-29 08:20:00    30
2015-04-29 08:35:00    30
2015-04-29 08:50:00   NaN
2015-04-29 09:05:00    27
Freq: 15T, dtype: float64

There is another way to fill in missing values. We could employ an algorithm to construct new data points that would somehow fit the existing points, for some definition of somehow. This process is called interpolation.

We can ask pandas to interpolate a time series for us:

>>> tsx = ts.resample('15min')
>>> tsx.interpolate().head()
2015-04-29 08:00:00    30.00
2015-04-29 08:15:00    29.25
2015-04-29 08:30:00    28.50
2015-04-29 08:45:00    27.75
2015-04-29 09:00:00    27.00
Freq: 15T, dtype: float64

We saw the default interpolate method – a linear interpolation – in action. pandas assumes a linear relationship between two existing points.

pandas supports over a dozen interpolation functions, some of which require the scipy library to be installed. We will not cover interpolation methods in this chapter, but we encourage you to explore the various methods yourself. The right interpolation method will depend on the requirements of your application.

Upsampling time series data

While, by default, pandas objects are time zone unaware, many real-world applications will make use of time zones. As with working with time in general, time zones are no trivial matter: do you know which countries have daylight saving time and do you know when the time zone is switched in those countries? Thankfully, pandas builds on the time zone capabilities of two popular and proven utility libraries for time and date handling: pytz and dateutil:

>>> t = pd.Timestamp('2000-01-01')
>>> t.tz is None
True

To supply time zone information, you can use the tz keyword argument:

>>> t = pd.Timestamp('2000-01-01', tz='Europe/Berlin')
>>> t.tz
<DstTzInfo 'Europe/Berlin' CET+1:00:00 STD>

This works for ranges as well:

>>> rng = pd.date_range('1/1/2000 00:00', periods=10, freq='D', tz='Europe/London')
>>> rng
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04','2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08','2000-01-09', '2000-01-10'], dtype='datetime64[ns]', freq='D', tz='Europe/London')

Time zone objects can also be constructed beforehand:

>>> import pytz
>>> tz = pytz.timezone('Europe/London')
>>> rng = pd.date_range('1/1/2000 00:00', periods=10, freq='D', tz=tz)
>>> rng
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04','2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08','2000-01-09', '2000-01-10'], dtype='datetime64[ns]', freq='D', tz='Europe/London')

Sometimes, you will already have a time zone unaware time series object that you would like to make time zone aware. The tz_localize function helps to switch between time zone aware and time zone unaware objects:

>>> rng = pd.date_range('1/1/2000 00:00', periods=10, freq='D')
>>> ts = pd.Series(np.random.randn(len(rng)), rng)
>>> ts.index.tz is None
True
>>> ts_utc = ts.tz_localize('UTC')
>>> ts_utc.index.tz
<UTC>

To move a time zone aware object to other time zones, you can use the tz_convert method:

>>> ts_utc.tz_convert('Europe/Berlin').index.tz
<DstTzInfo 'Europe/Berlin' LMT+0:53:00 STD>

Finally, to detach any time zone information from an object, it is possible to pass None to either tz_convert or tz_localize:

>>> ts_utc.tz_convert(None).index.tz is None
True
>>> ts_utc.tz_localize(None).index.tz is None
True
Upsampling time series data
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset