Data analysis

The following is the definition of data analysis given by Wikipedia:

"Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making."

In the Python scientific computing community, the definition tends to lean away from business software applications and more towards statistical analysis. In this section, while we will see a little math, our purpose is not to engage in a rigorous exploration of mathematical analysis, but rather to provide a plausible context for using various tools in the matplotlib ecosystem that assist in high-level plotting and related activities.

Pandas, SciPy, and Seaborn

We've just learned more about Seaborn, and we'll be working more with it in this section. We've used Pandas a bit, but haven't formally introduced it, nor SciPy.

The Pandas project describes generic Python as a great tool for data munging and preparation, but not as strong in the areas of data analysis and modeling. This is the area that Pandas was envisioned to focus upon, filling a much-needed gap in the suite of available libraries and allowing one to carry out entire data analysis workflows in Python without having to switch to tools like R or SPSS.

The scipy library is one of the core packages that make up the SciPy stack. It provides many user-friendly and efficient numerical routines, such as those for numerical integration, clustering, signal processing, and statistics, among others.

We have imported pandas as pd and will import stats from scipy. The seaborn module is already imported in our chapter notebook as sns. As these module aliases appear, note that we are taking advantage of these high-level libraries, to great effect, as you will soon see.
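For reference, the aliases assumed throughout this section correspond to imports along these lines (a sketch of the notebook setup; your environment may differ):

```python
# Conventional aliases for the libraries used in this section.
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
```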

Examining and shaping a dataset

For demonstration purposes in this chapter, we requested the mean monthly temperatures (in Fahrenheit) and mean monthly precipitation (in inches) for the century ranging from 1894 to 2013, in the small farming town of Saint Francis, Kansas from the United States Historical Climatology Network (USHCN). Our goal is to select a dataset amenable to statistical analysis and explore the various ways in which the data (raw and analyzed) may be presented to reveal patterns, which may be more easily uncovered using the tools of high-level plotting in matplotlib.

Let us do some more imports and then use Pandas to read this CSV data and instantiate a DataFrame object with it:

In [27]: import calendar
         from scipy import stats
         sns.set(style="darkgrid")
In [28]: data_file = "../data/KS147093_0563_data_only.csv"
         data = pd.read_csv(data_file)

As part of creating the DataFrame object, the headers of the CSV file are converted to column names, by which we can refer to the data later:

In [29]: data.columns
Out[29]: Index(['State ID', 'Year', 'Month',
                'Precipitation (in)',
                'Mean Temperature (F)'],
                dtype='object')

As a quick sanity check on our data loading, we can view the first few lines of the set with this command:

In [30]: data.head()
Out[30]:

The following table is the result of the preceding command:

    State ID  Year  Month  Precipitation (in)  Mean Temperature (F)
0   '147093'  1894      1                0.43                  25.4
1   '147093'  1894      2                0.69                  22.5
2   '147093'  1894      3                0.45                  42.1
3   '147093'  1894      4                0.62                  53.7
4   '147093'  1894      5                0.64                  62.9
The months in our dataset are numbers; that's exactly what we want for some calculations; for others (and for display purposes) we will sometimes need these as names. In fact, we will need month names, month numbers, and a lookup dictionary with both. Let us do that now:

month_nums = list(range(1, 13))
month_lookup = {x: calendar.month_name[x] for x in month_nums}
month_names = [x[1] for x in sorted(month_lookup.items())]

We can use the lookup to edit our data in-place with the following:

data["Month"] = data["Month"].map(month_lookup)

If you run data.head(), you will see that the month numbers have been replaced with the month names that we defined in the lookup dictionary.
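For a self-contained illustration of what map does here, consider a toy frame (the values are made up, not from the USHCN data):

```python
import calendar
import pandas as pd

# A tiny stand-in for the climate DataFrame.
df = pd.DataFrame({"Month": [1, 2, 3]})
month_lookup = {x: calendar.month_name[x] for x in range(1, 13)}

# map() replaces each month number with its name from the lookup.
df["Month"] = df["Month"].map(month_lookup)
print(df["Month"].tolist())  # → ['January', 'February', 'March']
```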

Since we have changed our dataset, let us reload the CSV for the cases when we want to use the raw data as is, with its month numbers:

data_raw = pd.read_csv(data_file)

For the purposes of keeping the example code clear, we'll define some more variables as well, and then confirm that we have data that makes sense.

In [32]: years = data["Year"].values
         temps_degrees = data["Mean Temperature (F)"].values
         precips_inches = data["Precipitation (in)"].values
In [33]: years_min = data.get("Year").min()
         years_min
Out[33]: 1894
In [34]: years_max = data.get("Year").max()
         years_max
Out[34]: 2013
In [35]: temp_max = data.get("Mean Temperature (F)").max()
         temp_max
Out[35]: 81.799999999999997
In [36]: temp_min = data.get("Mean Temperature (F)").min()
         temp_min
Out[36]: 13.199999999999999
In [37]: precip_max = data.get("Precipitation (in)").max()
         precip_max
Out[37]: 11.31
In [38]: precip_min = data.get("Precipitation (in)").min()
         precip_min
Out[38]: 0.0

Next, we are going to create a Pandas pivot table. This spreadsheet-like feature of Pandas allows one to create new views of old data, creating new indices, limiting columns, and so on. The DataFrame object we obtained when reading the CSV data provided us with an automatic, incremented index and all the data from our file, in columns. We're going to need a view of the data, where the rows are months and the columns are years. Our first pivot table will give us a DataFrame object with just that setup:
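As a minimal sketch of what pivot does, consider a toy long-form table (made-up values; shown here with explicit keyword arguments for clarity):

```python
import pandas as pd

# A tiny stand-in for the raw climate table.
raw = pd.DataFrame({
    "Year":  [1894, 1894, 1895, 1895],
    "Month": [1, 2, 1, 2],
    "Temp":  [25.4, 22.5, 24.0, 26.1],
})

# pivot() reshapes the long-form rows into a Month-by-Year matrix.
pivoted = raw.pivot(index="Month", columns="Year", values="Temp")
print(pivoted.loc[1, 1894])  # → 25.4
```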

In [39]: temps = data_raw.pivot(
             "Month", "Year", "Mean Temperature (F)")
         temps.index = [calendar.month_name[x] for x in temps.index]

Typing temps by itself in the notebook will render an elided table of values, showing you what the shape of this new DataFrame is.

Let us do the same thing with the precipitation data in our dataset:

In [40]: precips = data_raw.pivot(
             "Month", "Year", "Precipitation (in)")
         precips.index = [
             calendar.month_name[x] for x in precips.index]

We've just taken the necessary steps of preparing our data for analysis and plotting, which we're going to be doing for the rest of this chapter.

Analysis of temperature

We will be utilizing the temperature portion of our dataset (for the period of 1894-2013) in this section, to demonstrate functionality in Pandas, SciPy, and Seaborn, as it relates to the use of these for the purpose of high-level plotting and associated data analysis.

Throughout the rest of the chapter, do keep in mind that this analysis is done to provide examples of usage of libraries. It is not meant to provide deep insights into the nature of climatology or to draw conclusions about the state of our environment. To that point, this dataset is for a single small farming town in the American High Plains. There's not enough data in this set to do much science, but there's plenty to explore.

Since we will be discussing temperature, we should create a palette of colors that intuitively translates to the range of temperatures we will be examining. After some experimentation, we have settled on the following:

In [41]: temps_colors = ["#FCF8D4", "#FAEAB9", "#FAD873",
                         "#FFA500", "#FF8C00", "#B22222"]
         sns.palplot(temps_colors)

The following set of colors forming a palette is the result of the preceding code:

Analysis of temperature

Next, let us convert this list of colors to a color map that can be used by matplotlib:

In [42]: temps_cmap = mpl.colors.LinearSegmentedColormap.from_list(
             "temp colors", temps_colors)

That being said and done, our first plot won't actually use color. We first need to build some intuition about our raw data. Let us see what it looks like. Keep in mind that our temperature data points represent the mean temperature for every month, from 1894 through the end of 2013. Given that these are discrete data points, a scatter plot is a good choice for a first view of the data, as it will quickly reveal any obvious patterns such as clustering. The scatter plot is created as follows:

In [43]: sns.set(style="ticks")

         (figure, axes) = plt.subplots(figsize=(18,6))
         scatter = axes.scatter(
             years, temps_degrees, s=100, color="0.5",
             alpha=0.5)
         axes.set_xlim([years_min, years_max])
         axes.set_ylim([temp_min - 5, temp_max + 5])
         axes.set_title(
             ("Mean Monthly Temperatures from 1894-2013\n"
              "Saint Francis, KS, USA"),
             fontsize=20)
         axes.set_xlabel("Years", fontsize=16)
         _ = axes.set_ylabel(
             "Temperature (F)", fontsize=16)

The following scatter plot is the result of the preceding code:

Analysis of temperature

We've seen code like this before, so there are no surprises. The one bit that may be new is the Seaborn style that we set at the beginning. Some of the plots we'll be generating look better with ticks, with white backgrounds, with darker backgrounds, and so on. As such, you will notice that we occasionally make calls to Seaborn's styling functions.

Something else you might have noticed was that we assigned the last call to the don't care variable. Sometimes we want IPython to print its results and sometimes we don't. We've decided to enable output by default, so if we're not interested in seeing the output of a function call (in this case it would have been a Python object's representation), we simply assign it to a variable.

There are a few things that we might notice upon first seeing this data rendered as a scatter plot:

  • There appears to be a banding at the minimum and maximum temperatures
  • The banding in the minimum temperatures looks like it might be a bit wider
  • It appears that the mean temperatures are trending slightly upward
  • The lower temperatures seem to trend upward more than the higher temperatures

The first point we can address immediately with a logical inference: we are examining data that is cyclic in nature (due primarily to the axial tilt of the planet, and thus the corresponding seasonal temperatures). Cyclic processes can be described with trigonometric functions such as the sine or cosine of a scalar value. If we were to sample points from a continuous trigonometric function and scatter plot them on a Cartesian coordinate system, we'd see a familiar banding pattern: as the y values reach the maximum, the density of points appears greater, because the same vertical space is used to plot both the increase to and the decrease from the maximum.
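We can check this banding intuition numerically with a quick sketch: sample a sine wave at evenly spaced points and bin the resulting y values.

```python
import numpy as np

# Sample a sine wave uniformly in t and bin the y values.
t = np.linspace(0, 2 * np.pi, 100_000)
y = np.sin(t)
counts, _ = np.histogram(y, bins=10, range=(-1, 1))

# The outermost bins (near the minimum and maximum) collect far more
# points than the central bins: the banding seen in the scatter plot.
print(counts[0] > 2 * counts[5], counts[-1] > 2 * counts[5])  # → True True
```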

For our dataset, we can expect that mean temperatures increase during the summer months, typically hold there for a month or two, and then decrease towards the minimum, where the same pattern will apply for the winter months.

As for the other three points of observation, we will need to do some analysis to discern whether those observations are valid or not. Where should we start? How about the following:

  1. Let us get the minimum and maximum values for every year.
  2. Then, find the line that describes those values across the century.
  3. Examine the slopes of the minimum and maximum lines.
  4. Then, compare the slopes with each other.

The first two steps amount to performing a linear regression on the maximum and minimum values. SciPy has just the thing for us: scipy.stats.linregress. We'll use the results from that function to create a Pandas Series, with which we can run calculations. We'll define a quick little function to make that a bit easier, and then use it:

In [44]: def get_fit(series, m, b):
             x = series.index
             y = m * x + b
             return pd.Series(y, x)

         temps_max_x = temps.max().index
         temps_max_y = temps.max().values
         temps_min_x = temps.min().index
         temps_min_y = temps.min().values

         (temps_max_slope,
          temps_max_intercept,
          _, _, _) = stats.linregress(temps_max_x, temps_max_y)
         temps_max_fit = get_fit(
             temps.max(), temps_max_slope, temps_max_intercept)

         (temps_min_slope,
          temps_min_intercept,
          _, _, _) = stats.linregress(temps_min_x, temps_min_y)
         temps_min_fit = get_fit(
             temps.min(), temps_min_slope, temps_min_intercept)

The linregress function returns the slope of the regression line, its intercept, the correlation coefficient, the p-value, and the standard error of the estimate. For our purposes, we're just interested in the slope and intercept, so we ignore the other values. Let us look at the results:

In [45]: (temps_max_slope, temps_min_slope)
Out[45]: (0.015674352385582326, 0.04552191124383638)

So what does this mean? Let us do a quick refresher: the slope m is defined as the change in y values over the change in x values:

    m = Δy / Δx

In our case, the y values are the minimum and maximum mean monthly temperatures in degrees Fahrenheit; the x values are the years these measurements were taken.

The slope for the minimum mean monthly temperatures over the last 120 years is about three times greater than that of the maximum mean monthly temperatures:

In [46]: temps_min_slope/temps_max_slope
Out[46]: 2.9042291588205336

Let us go back to our scatter plot and superimpose our linear fits for the maximum and minimum annual means:

In [47]: (figure, axes) = plt.subplots(figsize=(18,6))
         scatter = axes.scatter(
             years, temps_degrees, s=100, color="0.5", alpha=0.5)
         temps_max_fit.plot(
             ax=axes, lw=5, color=temps_colors[5], alpha=0.7)
         temps_min_fit.plot(
             ax=axes, lw=5, color=temps_colors[3], alpha=0.7)
         axes.set_xlim([years_min, years_max])
         axes.set_ylim([temp_min - 5, temp_max + 5])
         axes.set_title(("Mean Monthly Temperatures from 1894-2013\n"
                         "Saint Francis, KS, USA\n"
                         "(with max and min fit)"), fontsize=20)
         axes.set_xlabel("Years", fontsize=16)
         _ = axes.set_ylabel("Temperature (F)", fontsize=16)

The following scatter plot is the result of the preceding code:

Analysis of temperature

It still looks like there is a greater rise in the minimum mean temperatures than the maximums. We can get a better visual by superimposing the two lines. Let us remove the vertical distance and compare:

In [48]: diff_1894 = temps_max_fit.iloc[0] - temps_min_fit.iloc[0]
         diff_2013 = temps_max_fit.iloc[-1] - temps_min_fit.iloc[-1]
         (diff_1894, diff_2013)
Out[48]: (53.125096418732781, 49.573236914600542)

Note that we have used the iloc attribute on our Pandas Series objects. The iloc attribute allows one to extract elements of a Series by integer position.
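A quick illustration of iloc on a toy Series (the values and index labels are made up):

```python
import pandas as pd

# iloc indexes by position, regardless of the Series' index labels.
s = pd.Series([53.1, 51.0, 49.6], index=[1894, 1950, 2013])
print(s.iloc[0])   # → 53.1 (first element)
print(s.iloc[-1])  # → 49.6 (last element)
```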

With this, we have the difference between high and low in 1894 and then the same difference in 2013, the latter being a smaller difference by a few degrees. We can overlay our two linear regression lines by shifting one of them downwards, so that they converge on the same point (this is done solely for comparison reasons):

In [49]: vert_shift = temps_max_fit - diff_2013

         (figure, axes) = plt.subplots(figsize=(18,6))
         vert_shift.plot(
             ax=axes, lw=5, color=temps_colors[5], alpha=0.7)
         temps_min_fit.plot(
             ax=axes, lw=5, color=temps_colors[3], alpha=0.7)
         axes.set_xlim([years_min, years_max])
         axes.set_ylim([vert_shift.min() - 5, vert_shift.max() + 1])
         axes.set_title(("Mean Monthly Temperature Difference "
                         "from 1894-2013\nSaint Francis, KS, USA\n"
                         "(vertical offset adjusted to "
                         "converge at 2013)"),
                        fontsize=20)
         axes.set_xlabel("Years", fontsize=16)
         _ = axes.set_ylabel(
             "Temperature\nDifference (F)", fontsize=16)

The following plot is the result of the preceding code:

Analysis of temperature

Now, we can really see the difference and can confirm that the rise in minimum mean temperatures is greater than the rise in maximum means.

Let us take a big jump from scatter plots and linear regressions to the heatmap functionality that Seaborn provides. Despite the name, heat maps don't have any intrinsic relationship with temperature. The idea behind heat maps is to present a dataset as a matrix where each value is encoded as a color, thus allowing one to easily see patterns of values across an entire dataset. Creating a heat map directly in matplotlib can be a rather complicated affair, but Seaborn makes it very easy for us, as shown here:

In [50]: sns.set(style="darkgrid")
In [51]: (figure, axes) = plt.subplots(figsize=(17,9))
         axes.set_title(("Heat Map\nMean Monthly Temperatures, "
                         "1894-2013\nSaint Francis, KS, USA"),
                        fontsize=20)
         sns.heatmap(
             temps, cmap=temps_cmap,
             cbar_kws={"label": "Temperature (F)"})
         figure.tight_layout()

The following heat map is the result of the preceding code:

Analysis of temperature

Given that this is a town in the Northern hemisphere near the 40th parallel, we don't see any surprises:

  • Highest temperatures are in the summer
  • Lowest temperatures are in the winter

There is some interesting summer banding in the 1930s, which indicates several years of hotter-than-normal summers. We see something similar for a few years, starting around 1998 and 1999. There also seems to be a wide band of cold Decembers from 1907 through to about 1932.

Next, we're going to look at Seaborn's clustermap functionality. Cluster maps of this sort are very useful in sorting out data that may have hidden (or not) hierarchical structure. We don't expect that with this dataset, so this is more of a demonstration of the plot than anything. However, it might have a few insights for us. We shall see.

Due to the fact that this is a composite plot, we'll need to access subplot axes, as provided by the clustermap class:

In [53]: clustermap = sns.clustermap(
             temps, figsize=(19, 12),
             cbar_kws={"label": "Temperature\n(F)"},
             cmap=temps_cmap)
         _ = clustermap.ax_col_dendrogram.set_title(
             ("Cluster Map\nMean Monthly Temperatures, 1894-2013\n"
              "Saint Francis, KS, USA"),
             fontsize=20)

The following heat map is the result of the preceding code:

Analysis of temperature

You've probably noticed that everything got rearranged; here's what happened: while keeping the temperatures for each year together, the x (years) and y (months) values have been sorted and grouped so that each sits closest to those with which it shares the most similarity. Here's what we can discern from the graph with regard to our current dataset:

  • The century's temperature patterns each year can be viewed in two groups: higher and lower temperatures
  • January and December share similar low-temperature patterns, with the next closest being February
  • The next grouping of similar temperature patterns is November and March, sibling to the January/December/February grouping
  • The last grouping of the low-temperature months is the April/October pairing

A similar analysis (with no surprises) can be done for the high-temperature months.

Looking across the x axis, we can view patterns/groupings by year. With careful tracing (ideally with a larger rendering of the cluster map), one could identify similar temperature patterns in various years. Though this doesn't reveal anything intrinsically, it could assist in additional analysis (for example, by pointing towards historical records in which trends might be discovered).

In the preceding cluster map, we passed a value for the color map to use, the one we defined at the beginning of this section. If we leave that out, Seaborn will do something quite nice: it will normalize our data and then select a color map that highlights values above and below the mean:

In [53]: clustermap = sns.clustermap(
             temps, z_score=1, figsize=(19, 12),
             cbar_kws={"label": "Normalized\nTemperature (F)"})
         _ = clustermap.ax_col_dendrogram.set_title(
             ("Normalized Cluster Map\nMean Monthly Temperatures, "
              "1894-2013\nSaint Francis, KS, USA"),
             fontsize=20)

The following normalized cluster map is the result:

Analysis of temperature

Note that we get the same grouping as in the previous cluster map; only the values at each coordinate of the map (and the associated colors) have changed. This view offers great insight for statistical data: not only do we see the large and obvious grouping of values above and below the mean, but the colors make it obvious how far any given point is from the overall mean.
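The normalization Seaborn applies here is a z-score: each value has the mean subtracted and is divided by the standard deviation along one axis. A minimal sketch with made-up temperatures:

```python
import numpy as np

# Standardize a row of (made-up) monthly temperatures:
# subtract the mean, divide by the standard deviation.
temps_row = np.array([25.4, 42.1, 62.9, 81.8, 53.7, 13.2])
z = (temps_row - temps_row.mean()) / temps_row.std()

# Z-scores are centered on zero: values above the mean are positive,
# values below the mean are negative.
print(np.isclose(z.mean(), 0))  # → True
```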

With the following plot, we're going to return to two previous plots:

  • The temperature heat map
  • The scatter plot for our temperature data

Seaborn has an option for heat maps to display a histogram above them. We will see this usage when we examine the precipitation data. However, for the temperatures, counts for a year aren't quite as meaningful as the actual values for each month of that year. As such, we will replace the standard histogram with our scatter plot:

In [54]: figure = plt.figure(figsize=(18,13))
         grid_spec = plt.GridSpec(2, 2,
                                  width_ratios=[50, 1],
                                  height_ratios=[1, 3],
                                  wspace=0.05, hspace=0.05)
         scatter_axes = figure.add_subplot(grid_spec[0])
         cluster_axes = figure.add_subplot(grid_spec[2])
         colorbar_axes = figure.add_subplot(grid_spec[3])

         scatter_axes.scatter(years,
                              temps_degrees,
                              s=40,
                              c="0.3",
                              alpha=0.5)
         scatter_axes.set(xticks=[], ylabel="Yearly Temp. (F)")
         scatter_axes.set_xlim([years_min, years_max])
         scatter_axes.set_title(
             ("Heat Map with Scatter Plot\nMean Monthly "
              "Temperatures, 1894-2013\nSaint Francis, KS, USA"),
             fontsize=20)
         sns.heatmap(temps,
                     cmap=temps_cmap,
                     ax=cluster_axes,
                     cbar_ax=colorbar_axes,
                     cbar_kws={"orientation": "vertical"})
         _ = colorbar_axes.set(xlabel="Temperature\n(F)")

The following scatter plot and heat map are the result:

Analysis of temperature

Next, we're going to take a closer look at the average monthly temperatures by month, using a histogram matrix. To do this, we'll need a new pivot. Our first one created a pivot with the Month data as the index; now, we want to index by Year. We could repeat the trick of keeping the months in the correct order by converting month numbers to names after creating the pivot table; in the case of the histogram matrix plot, however, that alone won't help us. To keep the sorting correct, we'll prepend the zero-filled month number:

In [55]: temps2 = data_raw.pivot(
             "Year", "Month", "Mean Temperature (F)")
         temps2.columns = [
           str(x).zfill(2) + " - " + calendar.month_name[x]
           for x in temps2.columns]
         monthly_means = temps2.mean()

We'll use the histogram provided by Pandas for this. Unfortunately, Pandas does not return the figure and axes that it creates with its hist wrapper. Instead, it returns a NumPy array of subplots. As such, we're left with fewer options than we might like for further tweaking of the plot. Our use of plt.text is a quick hack (of trial and error) that lets us label the overall figure (instead of the enclosing axes, as we'd prefer).

In [56]: axes = temps2.hist(figsize=(16,12))
         plt.text(-20, -10, "Temperatures (F)", fontsize=16)
         plt.text(-74, 77, "Counts", rotation="vertical", fontsize=16)
         _ = plt.suptitle(
             ("Temperature Counts by Month, 1894-2013\n"
              "Saint Francis, KS, USA"),
             fontsize=20)

The following plots are the result of the preceding code:

Analysis of temperature

This provides a nice view on the number of occurrences for temperature ranges in each month over the course of the century. For the most part, these have roughly normal distributions, as we would expect.

Now what we'd like to do is:

  • Look at the mean temperature for all months over the century
  • Show the constituent data that generated that mean
  • Trace the maximum, mean, and minimum temperatures

Let us tackle that last one first. The minimum, maximum, and mean are discrete values in our case, one for each month. What we'd like to see is what a smooth curve through those points might look like (as a visual aid more than anything). At first, one might think of using NumPy's or Pandas' histogram and distribution plotting capabilities. That would be perfect if we were just binning data. That is not what we are doing, though: we are not generating counts for data that falls in a given range; we are looking at temperatures in a given range. The next thought might be to use the 2D histogram capabilities of NumPy, and while that does work, it's a rather different type of plot than what we want.

Instead of trying to fit our needs into the tools, what data do we have and what work could we do on that with the tools at hand? We already have our maximums, means, and minimums. We have temperatures per month over the course of the century. We just need to connect our discrete points with a smooth, continuous line.

SciPy provides just the thing: spline interpolation. This will give us a smooth curve for our discrete values:

In [57]: from scipy.interpolate import UnivariateSpline

         smooth_mean = UnivariateSpline(
             month_nums, list(monthly_means), s=0.5)
         means_xs = np.linspace(0, 13, 2000)
         means_ys = smooth_mean(means_xs)

         smooth_maxs = UnivariateSpline(
             month_nums, list(temps2.max()), s=0)
         maxs_xs = np.linspace(0, 13, 2000)
         maxs_ys = smooth_maxs(maxs_xs)

         smooth_mins = UnivariateSpline(
             month_nums, list(temps2.min()), s=0)
         mins_xs = np.linspace(0, 13, 2000)
         mins_ys = smooth_mins(mins_xs)

We'll use the raw data from the beginning of this section, since we'll be doing interpolation on our x values (month numbers):

In [58]: temps3 = data_raw[["Month", "Mean Temperature (F)"]]

Now we can plot our means for all months, a scatter plot (with data points as lines, in this case) for each month superimposed over each mean, and finally our maximum / mean / minimum interpolations:

In [59]: (figure, axes) = plt.subplots(figsize=(18,10))
         axes.bar(
             month_nums, monthly_means, width=0.96, align="center",
             alpha=0.6)
         axes.scatter(
             temps3["Month"], temps3["Mean Temperature (F)"],
             s=2000, marker="_", alpha=0.6)
         axes.plot(means_xs, means_ys, "b", linewidth=6, alpha=0.6)
         axes.plot(maxs_xs, maxs_ys, "r", linewidth=6, alpha=0.2)
         axes.plot(mins_xs, mins_ys, "y", linewidth=6, alpha=0.5)
         axes.axis(
             (0.5, 12.5,
             temps_degrees.min() - 5, temps_degrees.max() + 5))
         axes.set_title(
             ("Mean Monthly Temperatures from 1894-2013\n"
              "Saint Francis, KS, USA"),
             fontsize=20)
         axes.set_xticks(month_nums)
         axes.set_xticklabels(month_names)
         _ = axes.set_ylabel("Temperature (F)", fontsize=16)

The following plot is the result of the preceding code:

Analysis of temperature

You may have noticed that these plot components have echoes of the box plot in them. They do, in fact, share some basic qualities with the box plot, which was invented by the famous statistician John Tukey.

Note

The inventor of many important concepts, John Tukey is often forgotten as the person who coined the term bit, a term which now permeates multiple industries and has made its way into the vocabularies of non-specialists, too.

Box plots concisely and visually convey the following bits (couldn't resist) of information:

  • Top of the box: 75th percentile
  • Line across the box: median
  • Bottom of the box: 25th percentile
  • Height of the box: the fourth spread (interquartile range)
  • Upper whisker: greatest non-outlying value
  • Lower whisker: smallest non-outlying value
  • Dots above and below: outliers

Sometimes, you will see box plots of different widths; the width indicates the relative size of the datasets.

The box plot allows one to view data without making any assumptions about it; the basic statistics are there to view, in plain sight.
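As a concrete illustration of those quantities, here is how the box plot's statistics can be computed by hand (using a small made-up sample, not the climate data):

```python
import numpy as np

# A small sample standing in for one month's temperatures.
sample = np.array([22.5, 25.4, 26.1, 27.0, 28.3, 29.9, 31.2, 45.0])

q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1  # the "fourth spread" (height of the box)

# Values beyond 1.5 * IQR from the box edges are drawn as outliers.
outliers = sample[(sample < q1 - 1.5 * iqr) | (sample > q3 + 1.5 * iqr)]
print(outliers)  # → [45.]
```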

The following plot will overlay a box plot on our bar chart of means (and line-scatter plot of values):

In [64]: (figure, axes) = plt.subplots(figsize=(18,10))
         axes.bar(
             month_nums, monthly_means, width=0.96, align="center",
             alpha=0.6)
         axes.scatter(
             temps3["Month"], temps3["Mean Temperature (F)"],
             s=2000, marker="_", alpha=0.6)
         sns.boxplot(temps2, ax=axes)
         axes.axis(
             (0.5, 12.5,
              temps_degrees.min() - 5, temps_degrees.max() + 5))
         axes.set_title(
             ("Mean Monthly Temperatures, 1894-2013\n"
              "Saint Francis, KS, USA"),
             fontsize=20)
         axes.set_xticks(month_nums)
         axes.set_xticklabels(month_names)
         _ = axes.set_ylabel("Temperature (F)", fontsize=16)

The following plot is the result of the preceding code:

Analysis of temperature

Now, we can easily identify the spread, the outliers, the area that contains half of the distribution, and so on. Though pretty, the color of each box merely reflects the relative closeness of its values to those of its neighbors, and as such offers no significant insight.

A variation on the box plot that focuses on the probability distribution rather than quartiles is the violin plot, an example of which we saw earlier in the introduction to Seaborn. We will configure a violin plot to show our data points as lines (the stick option), thus combining our use of the line-scatter plot above with the box plot:

In [65]: sns.set(style="whitegrid")
In [66]: (figure, axes) = plt.subplots(figsize=(18, 10))
         sns.violinplot(temps2, bw=0.2, lw=1, inner="stick")
         axes.set_title(
             ("Violin Plots\nMean Monthly Temperatures, 1894-2013\n"
              "Saint Francis, KS, USA"),
             fontsize=20)
         axes.set_xticks(month_nums)
         axes.set_xticklabels(month_names)
         _ = axes.set_ylabel("Temperature (F)", fontsize=16)

The following violin plot is the result of the preceding code:

Analysis of temperature

In the violin plot, the outliers are part of the probability distribution, though they are just as easy to identify as they are in the box plot due to the thinning of the distribution at these points.

For our final plot of this section, we will dip back into mathematics and finish up with a feature from the Pandas library with a plot of Andrews' curves. Andrews' curves can be useful when attempting to uncover a hidden structure in datasets of higher dimensions. As such, it may be a bit forced in our case; we're essentially looking at just two dimensions: the temperature and the time of year. That being said, it is such a useful tool that it's worth covering, if only on a toy example.

Andrews' curves are groups of lines where each line represents a point in the input dataset, and the line is a transformation of that point. In fact, the line is a plot of a finite Fourier series, and is defined as follows:

f_x(t) = x_1/√2 + x_2 sin(t) + x_3 cos(t) + x_4 sin(2t) + x_5 cos(2t) + ...

This function is then plotted over the interval -π < t < π. Thus, each data point may be viewed as a line between -π and π. The function can be thought of as the projection of the data point onto the following vector:

(1/√2, sin(t), cos(t), sin(2t), cos(2t), ...)
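As a rough sketch of this transformation (not the Pandas implementation; the `andrews_curve` helper name is hypothetical, for illustration only), the finite Fourier series can be computed directly with NumPy:

```python
import numpy as np

def andrews_curve(point, ts):
    """Finite Fourier series f_x(t) for a single data point x.

    f_x(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + ...
    """
    point = np.asarray(point, dtype=float)
    ts = np.asarray(ts, dtype=float)
    result = np.full_like(ts, point[0] / np.sqrt(2.0))
    for i, coeff in enumerate(point[1:], start=1):
        k = (i + 1) // 2                        # harmonic number: 1, 1, 2, 2, ...
        trig = np.sin if i % 2 == 1 else np.cos # odd terms use sin, even use cos
        result += coeff * trig(k * ts)
    return result

# Five sample values of t across [-pi, pi] for the point (1, 2, 3)
ts = np.linspace(-np.pi, np.pi, 5)
curve = andrews_curve([1.0, 2.0, 3.0], ts)
```

At t = 0, for instance, the sine terms vanish and the result reduces to x_1/√2 + x_3.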

Let us see it in action:

In [67]: months_cmap = sns.cubehelix_palette(
             8, start=-0.5, rot=0.75, as_cmap=True)

         (figure, axes) = plt.subplots(figsize=(18, 10))
         temps4 = data_raw[["Mean Temperature (F)", "Month"]]
         axes.set_xticks([-np.pi, -np.pi/2, 0, np.pi/2, np.pi])
         axes.set_xticklabels(
             [r"$-\pi$", r"$-\frac{\pi}{2}$",
              r"$0$", r"$\frac{\pi}{2}$", r"$\pi$"])
         axes.set_title(
             ("Andrews Curves for\nMean Monthly Temperatures, "
              "1894-2013\nSaint Francis, KS, USA"),
             fontsize=20)
         axes.set_xlabel(
             (r"Data points mapped to lines in the range "
              r"$[-\pi, \pi]$"),
             fontsize=16)
         axes.set_ylabel(r"$f_{x}(t)$", fontsize=16)
         pd.tools.plotting.andrews_curves(
             temps4, class_column="Month", ax=axes,
             colormap=months_cmap)
         axes.axis(
             [-np.pi, np.pi] + [x * 1.025 for x in axes.axis()[2:]])
         _ = axes.legend(labels=month_names, loc=(0, 0.67))

The following plot is the result of the preceding code:

Analysis of temperature

If we examine the rendered curves, we see the same patterns that we identified in the cluster map plots:

  • The temperatures of January and December are similar (thus the light and dark banding, staying close together)
  • Likewise for the temperatures during the summer months
  • The alternating bands of color represent the same relationship that we saw in the cluster map, where months of similar temperature patterns (though at different times of year) were paired together

Notice that the curves preserve the distance between the high and low temperatures. This is another property of Andrews' curves. Others include the following:

  • The mean is preserved
  • Linear relationships are preserved
  • The variance is preserved
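Since each curve is linear in the components of its data point, the mean-preservation property is easy to check numerically. This is a small sketch with a hypothetical `andrews_curve` helper (not library code):

```python
import numpy as np

def andrews_curve(point, ts):
    # f_x(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + ...
    point = np.asarray(point, dtype=float)
    result = np.full_like(ts, point[0] / np.sqrt(2.0))
    for i, coeff in enumerate(point[1:], start=1):
        k = (i + 1) // 2
        trig = np.sin if i % 2 == 1 else np.cos
        result += coeff * trig(k * ts)
    return result

ts = np.linspace(-np.pi, np.pi, 200)
rng = np.random.default_rng(0)
points = rng.normal(size=(10, 4))   # ten 4-dimensional data points

# Mean preservation: the curve of the mean point coincides with the
# pointwise mean of the ten individual curves.
mean_curve = andrews_curve(points.mean(axis=0), ts)
curves_mean = np.mean([andrews_curve(p, ts) for p in points], axis=0)
```

The same linearity argument is what makes linear relationships between points show up as linear relationships between curves.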

Note

Things to keep in mind when using Andrews' curves in your projects:

  • The order of the variables matters; changing that order will result in different curves
  • The lower frequencies show up better; as such, put the variables you feel to be more important first

For example, if we did have a dataset with more variables that contributed to the temperature, such as atmospheric pressure or wind speed, we might have defined our Pandas DataFrame with the columns in this order:

temps4 = data_raw[
    ["Mean Temperature (F)", "Wind Speed (kn)",
     "Pressure (Pa)", "Month"]]

Analysis of precipitation

In the Analysis of precipitation section of the IPython Notebook for this chapter, all the graphs explored in the temperature section are also created for the precipitation data. We will leave that review as an exercise for the interested reader. However, it is worth noting a few of the differences between the two datasets, so we will highlight those here.

Our setup for the precipitation colors is as follows:

In [68]: sns.set(style="darkgrid")
In [69]: precips_colors = ["#f2d98f", "#f8ed39", "#a7cf38",
                           "#7fc242", "#4680c2", "#3a53a3",
                           "#6e4a98"]
         sns.palplot(precips_colors)

The following graph is obtained as the result of the preceding code:

Analysis of precipitation

The first precipitation graph will be the one we had mentioned before: a combination of the precipitation amount heat map and a histogram of the total counts for the corresponding year:

In [72]: figure = plt.figure(figsize=(18, 13))
         grid_spec = plt.GridSpec(2, 2,
                                  width_ratios=[50, 1],
                                  height_ratios=[1, 3],
                                  wspace=0.05, hspace=0.05)
         hist_axes = figure.add_subplot(grid_spec[0])
         cluster_axes = figure.add_subplot(grid_spec[2])
         colorbar_axes = figure.add_subplot(grid_spec[3])

         precips_sum = precips.sum(axis=0)
         years_unique = data["Year"].unique()
         hist_axes.bar(years_unique, precips_sum, 1,
                       ec="w", lw=2, color="0.5", alpha=0.5)
         hist_axes.set(
             xticks=[], ylabel="Total Yearly\nPrecip. (in)")
         hist_axes.set_xlim([years_min, years_max])
         hist_axes.set_title(
             ("Heat Map with Histogram\nMean Monthly Precipitation, "
              "1894-2013\nSaint Francis, KS, USA"),
             fontsize=20)

         sns.heatmap(precips,
                     cmap=precips_cmap,
                     ax=cluster_axes,
                     cbar_ax=colorbar_axes,
                     cbar_kws={"orientation": "vertical"})
         _ = colorbar_axes.set(xlabel="Precipitation\n(in)")

The following plot is the result of the preceding code:

Analysis of precipitation

This plot very nicely allows us to scan the heat map and then trace upwards to the histogram for a quick summary for any year we find interesting. In point of fact, we notice the purple month of May in 1923 right away. The histogram confirms for us that this was the rainiest year of the century for Saint Francis, KS. A quick search on the Internet for kansas rain 1923 yields a USGS page discussing major floods along the Arkansas River where they mention "flood stages on the Ninnescah [river] were the highest known."

In contrast to the temperature data, we can see that the precipitation is highly irregular. This is confirmed when rendering the histograms for the months of the century: few, if any, of them resemble a normal distribution. The cluster map does bear closer examination, however: the clustering of the years could reveal stretches of drought.
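One way to quantify that impression is a normality test from scipy.stats. The following sketch uses deterministic, synthetic stand-ins for the two datasets (the real columns live in the chapter notebook): ideal normal quantiles for the temperature-like series, and quantiles of a right-skewed gamma for the precipitation-like series, checked with the D'Agostino-Pearson `normaltest`:

```python
import numpy as np
from scipy import stats

# Deterministic stand-ins: ideal normal quantiles (temperature-like)
# versus quantiles of a right-skewed gamma (precipitation-like).
n = 500
qs = (np.arange(n) + 0.5) / n
normal_like = stats.norm.ppf(qs, loc=55.0, scale=5.0)
skewed_like = stats.gamma.ppf(qs, a=1.2, scale=2.0)

# D'Agostino-Pearson test; the null hypothesis is normality.
_, p_normal = stats.normaltest(normal_like)
_, p_skewed = stats.normaltest(skewed_like)
# p_normal stays large (no evidence against normality), while
# p_skewed is vanishingly small, as with the precipitation months.
```

Applied to the real monthly series, a battery of such tests would make the "few normal distributions" observation precise rather than visual.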

The other plot we will include in this section is the precipitation box plot, as there are some pretty significant outliers:

In [84]: (figure, axes) = plt.subplots(figsize=(18,10))
         axes.bar(
             month_nums, monthly_means, width=0.99, align="center",
             alpha=0.6)
         axes.scatter(
             precips3["Month"], precips3["Precipitation (in)"],
             s=2000, marker="_", alpha=0.6)
         sns.boxplot(precips2, ax=axes)
         axes.axis(
             (0.5, 12.5,
              precips_inches.min(), precips_inches.max() + 0.25))
         axes.set_title(
             ("Mean Monthly Precipitation from 1894-2013\n"
              "Saint Francis, KS, USA"),
             fontsize=20)
         axes.set_xticks(month_nums)
         axes.set_xticklabels(month_names)
         _ = axes.set_ylabel("Precipitation (in)", fontsize=16)

The following plot is the result of the preceding code:

Analysis of precipitation

The greatest amount of rain we saw was in May of 1923, and there is its outlying data point in the plot. We see another one almost as high in August. Referencing our precipitation heat map, we easily locate the other purple month corresponding to the heaviest rains, and sure enough: it's in August (1933).
