First, we have another round of imports:
#1
import json
import calendar
import numpy as np
from pandas import DataFrame
import arrow
import pandas as pd
The json and calendar libraries come from the standard library. numpy is the NumPy library, the fundamental package for scientific computing with Python. NumPy stands for Numerical Python, and it is one of the most widely used libraries in the data science environment. I'll say a few words about it later on in this chapter. pandas is the very core upon which the whole project is based. The name pandas derives from panel data, and the project describes itself as a Python data analysis library. Among many other things, it provides DataFrame, a matrix-like data structure with advanced processing capabilities. It's customary to import DataFrame separately and then import pandas as pd.
arrow is a nice third-party library that speeds up dealing with dates dramatically. Technically, we could do it with the standard library, but I see no reason not to expand the range of the example and show you something different.
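For context, here is a rough sketch of the kind of date parsing we could do with the standard library's datetime alone, assuming the compact YYYYMMDD format that appears in the campaign names; arrow wraps this sort of chore in a friendlier API:

```python
from datetime import datetime

# Parse a compact YYYYMMDD date, the format used in the campaign
# names, with the standard library alone.
start = datetime.strptime('20180810', '%Y%m%d')
print(start.date())  # 2018-08-10
```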
After the imports, we load the data as follows:
#2
with open('data.json') as stream:
    data = json.loads(stream.read())
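Incidentally, json.load can consume the stream directly, which is equivalent to calling json.loads on its contents; a minimal self-contained sketch, using an in-memory stream and a made-up record standing in for the real file:

```python
import io
import json

# json.load reads from a file-like object directly, so it is
# equivalent to json.loads(stream.read()). The record is made up.
stream = io.StringIO('[{"user": "alice", "cmp_clicks": 42}]')
data = json.load(stream)
print(data[0]['user'])  # alice
```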
And finally, it's time to create DataFrame:
#3
df = DataFrame(data)
df.head()
We can inspect the first five rows using the head method of DataFrame. You should see something like this:
Jupyter renders the output of the df.head() call as HTML automatically. In order to have a text-based output, simply wrap df.head() in a print call.
The DataFrame structure is very powerful. You can filter rows and columns, aggregate data, and perform many other operations. You can operate on rows or columns without the time penalty you would pay if you processed the data with pure Python. This is because, under the covers, pandas harnesses the power of the NumPy library, which in turn draws its incredible speed from the low-level implementation of its core.
Using DataFrame allows us to couple the power of NumPy with spreadsheet-like capabilities so that we'll be able to work on our data in a fashion that is similar to what an analyst could do. Only, we do it with code.
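To give a flavor of that vectorized style, here is a minimal sketch on a tiny made-up frame mimicking two of the campaign columns (the numbers and the ctr column are illustrative, not part of the project's data):

```python
import pandas as pd

# A tiny made-up frame mimicking two of the campaign columns.
demo = pd.DataFrame({
    'cmp_clicks': [100, 250, 400],
    'cmp_impr': [1000, 1000, 2000],
})

# One vectorized expression computes the click-through rate for
# every row at once; no explicit Python loop is needed.
demo['ctr'] = demo['cmp_clicks'] / demo['cmp_impr']
print(demo['ctr'].tolist())  # [0.1, 0.25, 0.2]
```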
But let's go back to our project. Let's see two ways to quickly get a bird's eye view of the data:
#4
df.count()
count returns the number of non-null cells in each column. This is useful for understanding how sparse your data is. In our case, we have no missing values, so the output is:
cmp_bgt       5037
cmp_clicks    5037
cmp_impr      5037
cmp_name      5037
cmp_spent     5037
user          5037
dtype: int64
Nice! We have 5,037 rows, and the data type is int64 (that is, 64-bit integers). Given that we have 1,000 users and the number of campaigns per user is a random number between 2 and 8 (so, 5 on average), we're exactly in line with what I was expecting:
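To see what count reports when data is sparse, here is a minimal sketch on a tiny made-up frame with one missing budget (unlike the project's data, which has no gaps):

```python
import numpy as np
import pandas as pd

# A tiny made-up frame with one missing budget, to show what
# count reports on sparse data: only non-null cells are counted.
sparse = pd.DataFrame({
    'cmp_bgt': [1000, 2000, np.nan],
    'cmp_name': ['a', 'b', 'c'],
})
print(sparse.count())
# cmp_bgt     2
# cmp_name    3
# dtype: int64
```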
#5
df.describe()
The describe method is a nice, quick way to introspect a bit further:
cmp_bgt cmp_clicks cmp_impr cmp_spent
count 5037.000000 5037.000000 5037.000000 5037.000000
mean 496930.317054 40920.962676 499999.498312 246963.542783
std 287126.683484 21758.505210 2.033342 217822.037701
min 1057.000000 341.000000 499993.000000 114.000000
25% 247663.000000 23340.000000 499998.000000 64853.000000
50% 491650.000000 37919.000000 500000.000000 183716.000000
75% 745093.000000 56253.000000 500001.000000 379478.000000
max 999577.000000 99654.000000 500008.000000 975799.000000
As you can see, it gives us several measures, such as count, mean, std (standard deviation), min, and max, and shows how the data is distributed across the quartiles. Thanks to this method, we already have a rough idea of how our data is structured.
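The 25%, 50%, and 75% rows of describe are the quartiles, the same values the quantile method returns directly; a minimal sketch on a made-up series:

```python
import pandas as pd

# A tiny made-up series: its quartiles fall exactly on the
# second, third, and fourth values.
s = pd.Series([10, 20, 30, 40, 50])
print(s.quantile([0.25, 0.5, 0.75]).tolist())  # [20.0, 30.0, 40.0]
```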
Let's see which are the three campaigns with the highest and lowest budgets:
#6
df.sort_values(by=['cmp_bgt'], ascending=False).head(3)
This gives the following output:
      cmp_bgt  cmp_clicks  cmp_impr                           cmp_name
3321   999577        8232    499997  GRZ_20180810_20190107_40-55_M_EUR
2361   999534       53223    499999  GRZ_20180516_20191030_25-30_B_EUR
2220   999096       13347    499999  KTR_20180620_20190809_40-50_F_USD
And a call to tail shows us the ones with the lowest budgets:
#7
df.sort_values(by=['cmp_bgt'], ascending=False).tail(3)
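As an aside, pandas also provides nlargest and nsmallest, which return the top or bottom n rows for a column without a separate sort call; a minimal sketch on made-up budgets:

```python
import pandas as pd

# Made-up budgets; nsmallest returns the n rows with the lowest
# values, already sorted ascending, without an explicit sort.
budgets = pd.DataFrame({'cmp_bgt': [500, 999000, 1200, 750000, 3000]})
print(budgets.nsmallest(3, 'cmp_bgt')['cmp_bgt'].tolist())  # [500, 1200, 3000]
```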