Visualizing the results

Finally, the juicy bits. In this section, we're going to visualize some results. From a data science perspective, I'm not very interested in going deep into analysis, especially because the data is completely random, but still, this code will get you started with graphs and other features.

Something I learned in my life, and this may come as a surprise to you, is that looks also count, so it's very important that when you present your results, you do your best to make them pretty.

First, we tell the Notebook to render matplotlib graphs in the cell output frame, which is convenient. We do it with the following magic command:

#24
%matplotlib inline

Then, we proceed with some styling:

#25
import matplotlib.pyplot as plt
# apply the 'classic' style first, then 'ggplot' on top of it
plt.style.use(['classic', 'ggplot'])
import pylab
# switch the default font family to serif
pylab.rcParams.update({'font.family': 'serif'})
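
If you are curious about which styles your matplotlib installation offers, you can ask matplotlib itself; this little sketch just prints the list (the exact names will vary between installations):

import matplotlib.pyplot as plt
# print the names of all the styles this installation knows about
print(plt.style.available)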

The purpose of this cell is to make the graphs we will look at in this section a little bit prettier. You can also instruct the Notebook to do this when you start it from the console, by passing a parameter, but I wanted to show you this way too, since it can be annoying to have to restart the Notebook just because you want to plot something. In this way, you can do it on the fly and then keep working.

We also use pylab to set the font.family to serif. This might not be necessary on your system. Try to comment it out and execute the Notebook, and see whether anything changes.

Now that the DataFrame is complete, let's run df.describe() again:
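
#26
df.describe()

The results should look something like this: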

This kind of quick result is perfect for satisfying those managers who have 20 seconds to dedicate to you and just want rough numbers.

Once again, please keep in mind that our campaigns have different currencies, so these numbers are actually meaningless. The point here is to demonstrate the DataFrame capabilities, not to get to a correct or detailed analysis of real data.

Alternatively, a graph is usually much better than a table with numbers because it's much easier to read and it gives you immediate feedback. So, let's graph out the four pieces of information we have on each campaign—'Budget', 'Spent', 'Clicks', and 'Impressions':

#27
df[['Budget', 'Spent', 'Clicks', 'Impressions']].hist(
    bins=16, figsize=(16, 6));

We extract those four columns (this gives us another DataFrame made up of only those columns) and call the histogram hist() method on it. We specify the number of bins and the figure size, but apart from that, everything is done automatically.

One important thing: since this instruction is the only one in the cell (which also means it's the last one), the Notebook will print its result before drawing the graph. To suppress this behavior and have only the graph drawn, with no printing, just add a semicolon at the end (you thought I was reminiscing about Java, didn't you?). Here are the graphs:

They are beautiful, aren't they? Did you notice the serif font? How about the meaning of those figures? If you go back and take a look at the way we generated the data, you will see that all these graphs make perfect sense (a sketch of the generating code follows the list):

  • Budget is simply a random integer in an interval, therefore we were expecting a uniform distribution, and there we have it; it's practically a constant line.
  • Spent is a uniform distribution as well, but the high end of its interval is the budget, which is moving. This means we should expect something like a quadratic hyperbola that decreases to the right. And there it is as well.
  • Clicks was generated with a triangular distribution with a mean roughly 20% of the interval size, and you can see that the peak is right there, at about 20% to the left.
  • Impressions was a Gaussian distribution, which is the one that assumes the famous bell shape. The mean was exactly in the middle and we had a standard deviation of 2. You can see that the graph matches those parameters.
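
To make the list above concrete, here is a sketch of how each of those values might have been drawn for a single campaign. The exact bounds are my own placeholders rather than the chapter's (the data was generated earlier in the chapter), but the distributions match the descriptions in the list:

import random

budget = random.randint(75, 250)           # uniform integer: flat histogram
spent = random.randint(0, budget)          # uniform, with a moving upper bound
# triangular: low, high, and a mode placed at roughly 20% of the interval
low, high = 10 ** 2, 10 ** 5
clicks = int(random.triangular(low, high, low + (high - low) / 5))
impressions = int(random.gauss(0.5e6, 2))  # bell shape: mean 500,000, std dev 2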

Good! Let's plot out the measures we calculated:

#28
df[['CTR', 'CPC', 'CPI']].hist(
    bins=20, figsize=(16, 6))

Here is the plot representation:

We can see that the CPC values are heavily concentrated towards the left of the plot, meaning that most of them are very low. The CPI has a similar shape, but is less extreme.
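
As a reminder, these three measures were derived from the raw columns earlier in the chapter. Assuming the standard definitions of click-through rate, cost per click, and cost per impression, the calculation looks like this:

# assumed standard definitions
df['CTR'] = df.Clicks / df.Impressions   # click-through rate
df['CPC'] = df.Spent / df.Clicks         # cost per click
df['CPI'] = df.Spent / df.Impressions    # cost per impression

With Spent divided by click and impression counts that are often large, it is no surprise that most of the values pile up near zero.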

Now, all this is nice, but if you wanted to analyze only a particular segment of the data, how would you do it? We can apply a mask to a DataFrame so that we get back another one containing only the rows that satisfy the mask condition. It's like applying a global, row-wise if clause:

#29
mask = (df.Spent > 0.75 * df.Budget)
df[mask][['Budget', 'Spent', 'Clicks', 'Impressions']].hist(
    bins=15, figsize=(16, 6), color='g');

In this case, I prepared mask to filter out all the rows for which the amount spent is less than or equal to 75% of the budget. In other words, we'll include only those campaigns for which we have spent more than three-quarters of the budget. Notice that in mask, I am showing you an alternative way of asking for a DataFrame column: direct property access (object.property_name) instead of dictionary-like access (object['property_name']). If property_name is a valid Python name, you can use the two ways interchangeably (JavaScript works like this as well).
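
By the way, masks can also be combined, using the bitwise operators & (and) and | (or); each condition needs its own parentheses because of operator precedence. This quick sketch is mine, not the chapter's (the gender value 'M' comes from the data we generated):

# campaigns that spent more than 75% of their budget and target men
big_spenders = (df.Spent > 0.75 * df.Budget) & (df['Target Gender'] == 'M')
df[big_spenders].head()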

mask is applied in the same way that we access a dictionary with a key: df[mask] gives us back another DataFrame, and on that we select only the relevant columns and call hist() again. This time, just for fun, we want the results to be green:

Note that the shapes of the graphs haven't changed much, apart from the Spent graph, which is quite different. The reason for this is that we've asked only for the rows where the amount spent is more than 75% of the budget. This means that we're including only the rows where the amount spent is close to the budget, and budget numbers come from a uniform distribution. Therefore, it is quite obvious that the Spent graph is now assuming that kind of shape. If you make the boundary even tighter and ask for 85% or more, you'll see the Spent graph become more and more like the Budget one.
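
Trying that tighter boundary is a one-line change; here is a sketch of the 85% version of the same cell:

# same plot as before, keeping only campaigns that spent more than 85%
mask = (df.Spent > 0.85 * df.Budget)
df[mask][['Budget', 'Spent', 'Clicks', 'Impressions']].hist(
    bins=15, figsize=(16, 6), color='g');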

Let's now ask for something different. How about the measures 'Spent', 'Clicks', and 'Impressions', grouped by day of the week:

#30
# numeric_only=True is needed on recent pandas versions, to skip the
# non-numeric columns when summing
df_weekday = df.groupby(['Day of Week']).sum(numeric_only=True)
df_weekday[['Impressions', 'Spent', 'Clicks']].plot(
    figsize=(16, 6), subplots=True);

The first line creates a new DataFrame, df_weekday, by asking df for a grouping by 'Day of Week'. The function used to aggregate the data is addition (hence, sum()).

The second line gets a slice of df_weekday using a list of column names, something we're accustomed to by now. On the result, we call plot(), which is a bit different from hist(). The subplots=True option makes plot() draw three independent graphs:

Interestingly enough, we can see that most of the action happens on Sundays and Wednesdays. If this were meaningful data, this would potentially be important information to give to our clients, which is why I'm showing you this example.

Note that the days are sorted alphabetically, which scrambles them up a bit. Can you think of a quick solution that would fix the issue? I'll leave it to you as an exercise to come up with something, but if you want a nudge, there is a sketch of one possible approach right below.
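
One possible approach, and certainly not the only one, is to make the column an ordered categorical, so that sorting follows the calendar instead of the alphabet. This is only a sketch, and it assumes the column holds full day names such as 'Monday':

# assumption: the column holds full day names such as 'Monday'
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
        'Friday', 'Saturday', 'Sunday']
df['Day of Week'] = pd.Categorical(
    df['Day of Week'], categories=days, ordered=True)

After this, re-running cell #30 will group the days in calendar order (pd being the usual pandas import).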

Let's finish this presentation section with a couple more things. First, a simple aggregation. We want to aggregate on 'Target Gender' and 'Target Age', and show 'Impressions' and 'Spent'. For both, we want to see 'mean' and the standard deviation ('std'):

#31
agg_config = {
    'Impressions': ['mean', 'std'],
    'Spent': ['mean', 'std'],
}
df.groupby(['Target Gender', 'Target Age']).agg(agg_config)

It's very easy to do. We prepare a dictionary to use as a configuration. Then, we perform a grouping on the 'Target Gender' and 'Target Age' columns, and we pass our configuration dictionary to the agg() method. The result is truncated and rearranged a little bit to make it fit, and shown here:

                            Impressions                    Spent
                                   mean       std           mean
Target Gender Target Age                                        
B             20-25       499999.741573  1.904111  218917.000000
              20-30       499999.618421  2.039393  237180.644737
              20-35       499999.358025  2.039048  256378.641975
...                                 ...       ...            ...
M             20-25       499999.355263  2.108421  277232.276316
              20-30       499999.635294  2.075062  252140.117647
              20-35       499999.835821  1.871614  308598.149254  

This is the textual representation, of course, but you can also have the HTML one. 
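
As a side note, on recent pandas versions (0.25 and later) you can achieve something similar with flat, readable column names by using named aggregation instead of a configuration dictionary; here is a sketch:

df.groupby(['Target Gender', 'Target Age']).agg(
    impressions_mean=('Impressions', 'mean'),
    impressions_std=('Impressions', 'std'),
    spent_mean=('Spent', 'mean'),
    spent_std=('Spent', 'std'),
)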

Let's do one more thing before we wrap this chapter up. I want to show you something called a pivot table. It's kind of a buzzword in the data environment, so an example such as this one, albeit very simple, is a must:

#32
# assumes numpy was imported earlier, as is customary: import numpy as np
pivot = df.pivot_table(
    values=['Impressions', 'Clicks', 'Spent'],
    index=['Target Age'],
    columns=['Target Gender'],
    aggfunc=np.sum
)
pivot

We create a pivot table that shows us, for each 'Target Age', the totals of 'Impressions', 'Clicks', and 'Spent', subdivided according to 'Target Gender'. The aggregation function (aggfunc) used to calculate the results is the numpy.sum function (numpy.mean would be the default, had I not specified anything).

After creating the pivot table, we simply print it with the last line in the cell, and here's a crop of the result:

It's pretty clear and provides very useful information when the data is meaningful.
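
Since the pivot's columns form a two-level index (measure first, then gender), individual figures are selected with a tuple. A small sketch, where the age label '20-25' and the gender 'M' are simply examples taken from the data above:

# one cell: total clicks for the 20-25 age group, targeting men
pivot.loc['20-25', ('Clicks', 'M')]
# one block: all three measures, targeting men only
pivot.xs('M', axis=1, level='Target Gender')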

That's it! I'll leave you to discover more about the wonderful world of IPython, Jupyter, and data science. I strongly encourage you to get comfortable with the Notebook environment. It's much better than a console, it's extremely practical and fun to use, and you can even create slides and documents with it.
