The autocorrelation of a time series can inform us about repeating patterns or serial correlation. The latter refers to the correlation between the signal at a given time and at a later time. The analysis of the autocorrelation can thereby inform us about the timescale of the fluctuations. Here, we use this tool to analyze the evolution of baby names in the US, based on data provided by the United States Social Security Administration.
>>> import os
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    %matplotlib inline
We download the dataset and extract the archive locally into the babies subdirectory. There is one CSV file per year. Each file contains all baby names given that year, with their respective frequencies.
>>> import io
    import requests
    import zipfile
>>> url = ('https://github.com/ipython-books/'
           'cookbook-2nd-data/blob/master/'
           'babies.zip?raw=true')
    r = io.BytesIO(requests.get(url).content)
    zipfile.ZipFile(r).extractall('babies')
>>> %ls babies
yob1902.txt
yob1903.txt
yob1904.txt
...
yob2014.txt
yob2015.txt
yob2016.txt
We read the data with pandas, loading it into a dictionary that contains one DataFrame per year:
>>> files = [file for file in os.listdir('babies')
             if file.startswith('yob')]
>>> years = np.array(sorted([int(file[3:7])
                             for file in files]))
>>> data = {year:
            pd.read_csv('babies/yob%d.txt' % year,
                        index_col=0, header=None,
                        names=['First name',
                               'Gender',
                               'Number'])
            for year in years}
>>> data[2016].head()
>>> def get_value(name, gender, year):
        """Return the number of babies born a given year,
        with a given gender and a given name."""
        dy = data[year]
        try:
            return dy[dy['Gender'] == gender]['Number'][name]
        except KeyError:
            return 0
>>> def get_evolution(name, gender):
        """Return the evolution of a baby name over
        the years."""
        return np.array([get_value(name, gender, year)
                         for year in years])
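The lookup logic can be tried without downloading the archive. The following is a minimal sketch on a toy table: the names, counts, and the helper lookup() are illustrative assumptions, laid out in the same three-column format as the yearly files. It shows why the frame is filtered by gender first (a name may appear once per gender) and how a missing name falls back to zero.

```python
import io
import pandas as pd

# Toy data in the same "name,gender,count" layout as the yearly files.
# The values are made up for illustration only.
csv_text = "Emma,F,100\nJordan,F,10\nJordan,M,20\nNoah,M,95\n"
toy = pd.read_csv(io.StringIO(csv_text), index_col=0, header=None,
                  names=['First name', 'Gender', 'Number'])

def lookup(frame, name, gender):
    """Return the count for a name and gender, or 0 if absent."""
    try:
        # Filter by gender first so that the name index is unique.
        return frame[frame['Gender'] == gender]['Number'][name]
    except KeyError:
        return 0

jordan_m = lookup(toy, 'Jordan', 'M')   # the name exists for both genders
emma_m = lookup(toy, 'Emma', 'M')       # no such row: falls back to 0
print(jordan_m, emma_m)
```

Without the gender filter, toy['Number']['Jordan'] would return two rows rather than a single count.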
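We define a function that computes the autocorrelation of a signal. It is essentially based on NumPy's correlate() function.
>>> def autocorr(x):
        result = np.correlate(x, x, mode='full')
        return result[result.size // 2:]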
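To see why autocorr() keeps only the second half of the output, it can help to run np.correlate() on a tiny signal. In the following sketch, the three-sample array is an illustrative assumption and not part of the dataset. With mode='full', the result covers lags -2 to 2 and is symmetric around lag 0, so the slice starting at the middle element keeps the non-negative lags only.

```python
import numpy as np

def autocorr(x):
    result = np.correlate(x, x, mode='full')
    # The middle element corresponds to lag 0.
    return result[result.size // 2:]

x = np.array([1, 2, 3])  # toy signal, for illustration only
full = np.correlate(x, x, mode='full')

# Lags -2, -1, 0, 1, 2:
print(full)         # [ 3  8 14  8  3]
# Lags 0, 1, 2:
print(autocorr(x))  # [14  8  3]
```

The lag-0 value, 14, is the sum of the squared samples, which is why it is also the maximum used for normalization.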
>>> def autocorr_name(name, gender, color, axes=None):
        x = get_evolution(name, gender)
        z = autocorr(x)

        # Evolution of the name.
        axes[0].plot(years, x, '-o' + color,
                     label=name)
        axes[0].set_title("Baby names")
        axes[0].legend()

        # Autocorrelation.
        axes[1].plot(z / float(z.max()),
                     '-' + color, label=name)
        axes[1].legend()
        axes[1].set_title("Autocorrelation")
>>> fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    autocorr_name('Olivia', 'F', 'k', axes=axes)
    autocorr_name('Maria', 'F', 'y', axes=axes)
The autocorrelation of Olivia decays much faster than that of Maria. This is mainly because of the steep increase in the popularity of the name Olivia at the end of the twentieth century. By contrast, the name Maria varies more slowly overall, and its autocorrelation decays more slowly.
A time series is a sequence indexed by time. Important applications include stock markets, product sales, weather forecasting, biological signals, and many others. Time series analysis is an important part of statistical data analysis, signal processing, and machine learning.
There are various definitions of the autocorrelation. Here, we define the autocorrelation of a time series $(x_n)$ of length $N$ as:

$$R(k) = \frac{1}{N} \sum_{n} x_n x_{n+k}$$
In the previous plot, we normalized the autocorrelation by its maximum so as to compare the autocorrelation of two signals. The autocorrelation quantifies the average similarity between the signal and a shifted version of the same signal, as a function of the delay between the two. In other words, the autocorrelation can give us information about repeating patterns as well as the timescale of the signal's fluctuations. The faster the autocorrelation decays to zero, the faster the signal varies.
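The link between the decay of the autocorrelation and the timescale of the signal can be checked on synthetic data. The following sketch assumes two Gaussian bumps of different widths as test signals (they are not taken from the baby names data) and a hypothetical helper, decay_lag(), that reports the first lag at which the normalized autocorrelation drops below one half.

```python
import numpy as np

def autocorr(x):
    result = np.correlate(x, x, mode='full')
    return result[result.size // 2:]

def decay_lag(x, threshold=0.5):
    """First lag at which the normalized autocorrelation is below threshold.

    Hypothetical helper for illustration. It assumes the threshold is
    actually crossed; otherwise np.argmax returns 0.
    """
    z = autocorr(x)
    z = z / z.max()  # normalize by the maximum, reached at lag 0
    return int(np.argmax(z < threshold))

t = np.arange(-100, 101)
slow = np.exp(-t**2 / (2 * 20.0**2))  # wide bump: slowly varying signal
fast = np.exp(-t**2 / (2 * 5.0**2))   # narrow bump: quickly varying signal

print("slow signal:", decay_lag(slow))
print("fast signal:", decay_lag(fast))
```

The narrow bump should reach the threshold at a smaller lag than the wide one, in line with the statement that a faster decay indicates a faster-varying signal.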
Autocorrelation functions are also available in statsmodels, documented at http://statsmodels.sourceforge.net/stable/tsa.html