Creating the DataFrame

First, we have another round of imports:

#1
import json
import calendar
import numpy as np
from pandas import DataFrame
import arrow
import pandas as pd

The json and calendar libraries come from the standard library. numpy is the NumPy library, the fundamental package for scientific computing with Python. NumPy stands for Numerical Python, and it is one of the most widely used libraries in the data science ecosystem. I'll say a few words about it later on in this chapter. pandas is the very core upon which the whole project is based; its name derives from panel data, and the project describes itself as the Python Data Analysis Library. Among many other things, it provides DataFrame, a matrix-like data structure with advanced processing capabilities. It's customary to import DataFrame separately and then to import pandas as pd.

arrow is a nice third-party library that dramatically speeds up dealing with dates. Technically, we could do everything with the standard library, but I see no reason not to expand the range of the example and show you something different.
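To give you a quick taste of arrow before we put it to work, here is a minimal sketch; the dates below are just illustrative values, not part of the project:

# a minimal taste of arrow (illustrative dates, not project code)
import arrow

start = arrow.get('20180810', 'YYYYMMDD')  # parse a compact date string
end = start.shift(months=+3)               # date arithmetic made easy
print(start.format('YYYY-MM-DD'))          # 2018-08-10
print(end.format('YYYY-MM-DD'))            # 2018-11-10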

After the imports, we load the data as follows:

#2
with open('data.json') as stream:
    data = json.loads(stream.read())
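As a side note, json.load can parse straight from the file object, so the explicit read call is not strictly necessary; the two forms are equivalent here:

# equivalent: let json parse directly from the open file
with open('data.json') as stream:
    data = json.load(stream)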

And finally, it's time to create the DataFrame:

#3
df = DataFrame(data)
df.head()

We can inspect the first five rows using the head method of DataFrame; you should see the first five rows of our data, displayed as a table.

Jupyter renders the output of the df.head() call as HTML automatically. In order to have a text-based output, simply wrap df.head() in a print call.

The DataFrame structure is very powerful, and it allows us to manipulate much of its content. You can filter rows and columns, aggregate data, and perform many other operations, all without the time penalty you would pay if you were working on the data with pure Python. This happens because, under the covers, pandas harnesses the power of the NumPy library, which in turn draws its incredible speed from the low-level implementation of its core.

Using DataFrame allows us to couple the power of NumPy with spreadsheet-like capabilities, so that we can work on our data in a fashion similar to what an analyst would do. Only, we do it with code.
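To make this a little more concrete, here is a quick sketch of the kind of one-liners DataFrame gives us. It only uses columns we have just loaded, and the budget threshold is arbitrary:

# a taste of DataFrame operations (sketch; arbitrary threshold)
rich = df[df['cmp_bgt'] > 750000]                # boolean row filtering
pair = df[['cmp_name', 'cmp_bgt']]               # column selection
total_spent = df['cmp_spent'].sum()              # vectorized aggregation
avg_bgt_per_user = df.groupby('user')['cmp_bgt'].mean()  # per-user mean

Each of these runs in compiled NumPy code rather than in a Python loop, which is exactly where the speed comes from.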

But let's go back to our project. Let's see two ways to quickly get a bird's eye view of the data:

#4
df.count()

count yields a count of the non-empty cells in each column. This is a good way to understand how sparse your data is. In our case, we have no missing values, so the output is:

cmp_bgt       5037
cmp_clicks    5037
cmp_impr      5037
cmp_name      5037
cmp_spent     5037
user          5037
dtype: int64

Nice! We have 5,037 rows, and the data type of the counts is int64 (dtype: int64 means long integers, because they each take 64 bits). Given that we have 1,000 users, and that the number of campaigns per user is a random number between 2 and 8, we're exactly in line with what I was expecting.
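By the way, count really shines when the data does have holes. Here is a tiny, made-up frame, not part of our project, that shows how missing cells are skipped (it reuses the DataFrame and numpy imports from snippet #1):

# count skips missing cells (illustrative frame, not our data)
sparse = DataFrame({'a': [1, 2, np.nan], 'b': [np.nan, np.nan, 3]})
print(sparse.count())  # a -> 2, b -> 1

Back to our campaigns: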

#5
df.describe()

The describe method is a nice, quick way to introspect a bit further:

             cmp_bgt    cmp_clicks      cmp_impr     cmp_spent
count    5037.000000   5037.000000   5037.000000   5037.000000
mean   496930.317054  40920.962676 499999.498312 246963.542783
std    287126.683484  21758.505210      2.033342 217822.037701
min      1057.000000    341.000000 499993.000000    114.000000
25%    247663.000000  23340.000000 499998.000000  64853.000000
50%    491650.000000  37919.000000 500000.000000 183716.000000
75%    745093.000000  56253.000000 500001.000000 379478.000000
max    999577.000000  99654.000000 500008.000000 975799.000000

As you can see, it gives us several measures, such as count, mean, std (standard deviation), min, and max, and shows how the data is distributed across the quartiles. Thanks to this method, we already have a rough idea of how our data is structured.
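Should the default quartiles not be enough, describe also accepts a percentiles argument; the cut points below are arbitrary. On non-numeric columns, describe switches to a different set of measures:

# custom percentiles (arbitrary cut points) and an object column
df.describe(percentiles=[.05, .25, .5, .75, .95])
df['cmp_name'].describe()  # count, unique, top, freq for strings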

Let's see which are the three campaigns with the highest and lowest budgets:

#6
df.sort_values(by=['cmp_bgt'], ascending=False).head(3)

This gives the following output:

      cmp_bgt  cmp_clicks  cmp_impr                           cmp_name
3321   999577        8232    499997  GRZ_20180810_20190107_40-55_M_EUR   
2361   999534       53223    499999  GRZ_20180516_20191030_25-30_B_EUR   
2220   999096       13347    499999  KTR_20180620_20190809_40-50_F_USD

And a call to tail shows us the ones with the lowest budgets:

#7
df.sort_values(by=['cmp_bgt'], ascending=False).tail(3)
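Incidentally, pandas also provides nlargest and nsmallest as shortcuts for exactly this kind of query; they return the same campaigns as snippets #6 and #7, although the row order may differ:

# top and bottom three budgets without an explicit sort
df.nlargest(3, 'cmp_bgt')
df.nsmallest(3, 'cmp_bgt')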