Unpacking the user data

We now do the exact same thing for each piece of user JSON data. We call apply on the user Series, running the unpack_user_json function, which takes a JSON user object and transforms it into a list of its fields. We can then inject that list into a brand new DataFrame, user_df. After that, we'll join user_df back with df, as we did with campaign_df:

#11
def unpack_user_json(user):
    # very optimistic as well, expects user objects
    # to have all attributes
    user = json.loads(user.strip())
    return [
        user['username'],
        user['email'],
        user['name'],
        user['gender'],
        user['age'],
        user['address'],
    ]

user_data = df['user'].apply(unpack_user_json)
user_cols = [
    'username', 'email', 'name', 'gender', 'age', 'address']
user_df = DataFrame(
    user_data.tolist(), columns=user_cols, index=df.index)

Very similar to the previous operation, isn't it? We should also note here that, when creating user_df, we need to instruct DataFrame about the column names and the index. Let's join and take a quick peek:

#12
df = df.join(user_df)

#13
df[['user'] + user_cols].head(2)

The output shows us that everything went well. We're good, but we're not done yet. If you call df.columns in a cell, you'll see that we still have ugly names for our columns. Let's change that:

#14
better_columns = [
    'Budget', 'Clicks', 'Impressions',
    'cmp_name', 'Spent', 'user',
    'Type', 'Start', 'End',
    'Target Age', 'Target Gender', 'Currency',
    'Username', 'Email', 'Name',
    'Gender', 'Age', 'Address',
]
df.columns = better_columns

Good! Now, with the exception of 'cmp_name' and 'user', we only have nice names.

Completing the dataset

The next step is to add some extra columns. For each campaign, we have the number of clicks and impressions, and we have the amounts spent. This allows us to introduce three measurement ratios: CTR, CPC, and CPI. They stand for Click Through Rate, Cost Per Click, and Cost Per Impression, respectively.

The last two are straightforward, but CTR is not. Suffice it to say that it is the ratio between clicks and impressions. It gives you a measure of how many clicks were performed on a campaign advertisement per impression—the higher this number, the more successful the advertisement is in attracting users to click on it:

#15
def calculate_extra_columns(df):
    # Click Through Rate
    df['CTR'] = df['Clicks'] / df['Impressions']
    # Cost Per Click
    df['CPC'] = df['Spent'] / df['Clicks']
    # Cost Per Impression
    df['CPI'] = df['Spent'] / df['Impressions']

calculate_extra_columns(df)

I wrote this as a function, but I could have just written the code directly in the cell; it's not important. What I want you to notice here is that we're adding each of those three columns with a single line of code, but the DataFrame applies the operation (the division, in this case) automatically to each pair of cells from the appropriate columns. So, even though they look like just three divisions, these are actually 5037 * 3 divisions, because they are performed for each row. Pandas does a lot of work for us, and it also does a very good job of hiding the complexity of it.
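If you want to convince yourself that the division really is element-wise, here is a minimal, self-contained sketch with invented numbers (it is not part of our notebook) that mirrors what happens with the 'Clicks' and 'Impressions' columns:

from pandas import DataFrame

# toy data, just to show the vectorized division
toy = DataFrame({'Clicks': [10, 20], 'Impressions': [100, 400]})
toy['CTR'] = toy['Clicks'] / toy['Impressions']
print(toy['CTR'].tolist())  # [0.1, 0.05]: one division per row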

The function, calculate_extra_columns, takes a DataFrame and works directly on it. This mode of operation is called working in-place. Do you remember how list.sort() sorts a list in the same way? It is the same deal. You could also say that this function is not pure, which means it has side effects: it modifies the mutable object it is passed as an argument.
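If the in-place versus pure distinction needs a refresher, here is a quick comparison with two hypothetical helpers (neither exists in our notebook); the first mirrors what calculate_extra_columns does, the second is a pure-style alternative:

def add_ctr_in_place(frame):
    # modifies the DataFrame it receives: a side effect
    frame['CTR'] = frame['Clicks'] / frame['Impressions']

def with_ctr(frame):
    # returns a new DataFrame; the original is left untouched
    return frame.assign(CTR=frame['Clicks'] / frame['Impressions'])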

We can take a look at the results by filtering on the relevant columns and calling head:

#16
df[['Spent', 'Clicks', 'Impressions',
    'CTR', 'CPC', 'CPI']].head(3)

This shows us that the calculations were performed correctly on each row:

    Spent  Clicks  Impressions       CTR       CPC       CPI
0   39383   62554       499997  0.125109  0.629584  0.078766
1  210452   36176       500001  0.072352  5.817448  0.420903
2  342507   62299       500001  0.124598  5.497793  0.685013

Now, I want to verify the accuracy of the results manually for the first row:

#17
clicks = df['Clicks'][0]
impressions = df['Impressions'][0]
spent = df['Spent'][0]
CTR = df['CTR'][0]
CPC = df['CPC'][0]
CPI = df['CPI'][0]
print('CTR:', CTR, clicks / impressions)
print('CPC:', CPC, spent / clicks)
print('CPI:', CPI, spent / impressions)

This yields the following output:

CTR: 0.1251087506525039 0.1251087506525039
CPC: 0.6295840393899671 0.6295840393899671
CPI: 0.0787664725988356 0.0787664725988356

This is exactly what we saw in the previous output. Of course, I wouldn't normally need to do this, but I wanted to show you how you can perform calculations this way. You can access a Series (a column) by passing its name to the DataFrame in square brackets, and then you access each row by its position, exactly as you would with a regular list or tuple.
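A small aside: df['Clicks'][0] works here because our DataFrame still has the default 0-based integer index, so the label 0 coincides with the first position. A slightly more explicit way to do the same lookup, sketched below, goes through .loc with the row label:

first_row = df.loc[0]  # the row labelled 0
print(first_row['Clicks'], first_row['Impressions'], first_row['Spent'])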

We're almost done with our DataFrame. All we are missing now is a column that tells us the duration of the campaign and a column that tells us which day of the week corresponds to the start date of each campaign. This allows me to expand on how to play with date objects:

#18
def get_day_of_the_week(day):
    number_to_day = dict(enumerate(calendar.day_name, 1))
    return number_to_day[day.isoweekday()]

def get_duration(row):
    return (row['End'] - row['Start']).days

df['Day of Week'] = df['Start'].apply(get_day_of_the_week)
df['Duration'] = df.apply(get_duration, axis=1)

We used two different techniques here, but let's go through the code first.

get_day_of_the_week takes a date object. If you cannot immediately see what it does, please take a few moments to work it out for yourself before reading the explanation. Use the inside-out technique, as we've done a few times before.

So, as I'm sure you know by now, if you put calendar.day_name in a list call, you get ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']. This means that, if we enumerate calendar.day_name starting from 1, we get pairs such as (1, 'Monday'), (2, 'Tuesday'), and so on. If we feed these pairs to a dictionary, we get a mapping between the days of the week as numbers (1, 2, 3, ...) and their names. When the mapping is created, in order to get the name of a day, we just need to know its number. To get it, we call date.isoweekday(), which tells us which day of the week that date is (as a number). You feed that into the mapping and, boom! You have the name of the day.
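Here is the same trick in isolation, so you can see the intermediate mapping, using the start date of our first campaign as the sample date:

import calendar
from datetime import date

number_to_day = dict(enumerate(calendar.day_name, 1))
print(number_to_day[1], number_to_day[7])  # Monday Sunday
print(number_to_day[date(2019, 3, 24).isoweekday()])  # Sunday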

get_duration is interesting as well. First, notice it takes an entire row, not just a single value. What happens in its body is that we perform a subtraction between a campaign's end and start dates. When you subtract date objects, the result is a timedelta object, which represents a given amount of time. We take the value of its .days property. It is as simple as that.
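Again, in isolation, using the first campaign's start and end dates as an example:

from datetime import date

delta = date(2020, 11, 6) - date(2019, 3, 24)
print(type(delta).__name__, delta.days)  # timedelta 593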

Now, we can introduce the fun part, the application of those two functions.

The first application is performed on a Series object, like we did before for 'user' and 'cmp_name'; there is nothing new here.

The second one is applied to the whole DataFrame and, in order to instruct pandas to perform that operation on the rows, we pass axis=1.
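As a reminder of what axis=1 means, here is a compact sketch with toy data (again, not part of our notebook) that contrasts the two directions in which DataFrame.apply can work:

from pandas import DataFrame

toy = DataFrame({'a': [1, 2], 'b': [10, 20]})
print(toy.apply(sum))          # axis=0 (default): one result per column
print(toy.apply(sum, axis=1))  # axis=1: one result per row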

We can verify the results very easily, as shown here:

#19
df[['Start', 'End', 'Duration', 'Day of Week']].head(3)

The preceding code yields the following output:

        Start         End  Duration Day of Week
0  2019-03-24  2020-11-06       593      Sunday
1  2017-05-21  2018-07-24       429      Sunday
2  2017-12-18  2018-02-08        52      Monday

So, we now know that between the 24th of March, 2019 and the 6th of November, 2020 there are 593 days, and that the 24th of March, 2019 is a Sunday.

If you're wondering what the purpose of this is, I'll provide an example. Imagine that you have a campaign that is tied to a sports event that usually takes place on a Sunday. You may want to inspect your data by day of the week so that you can correlate it to the various measurements you have. We're not going to do that in this project, but it was useful to see, if only for the different way of calling apply() on a DataFrame.
