Cleaning the data

Let's start cleaning the data:

#8
rough_data = get_data(users)
rough_data[:2] # let's take a peek

We simulate fetching the data from a source and then inspect it. The Notebook is well suited to inspecting each step, and you can make the cells as coarse or as fine-grained as you need. The first item in rough_data looks like this:

{'user': '{"username": "samuel62", "name": "Tonya Lucas", "gender": "F", "email": "[email protected]", "age": 27, "address": "PSC 8934, Box 4049\nAPO AA 43073"}',
 'campaigns': [{'cmp_name': 'GRZ_20171018_20171116_35-55_B_EUR',
   'cmp_bgt': 999613,
   'cmp_spent': 43168,
   'cmp_clicks': 35603,
   'cmp_impr': 500001},
  ...
  {'cmp_name': 'BYU_20171122_20181016_30-45_B_USD',
   'cmp_bgt': 561058,
   'cmp_spent': 472283,
   'cmp_clicks': 44823,
   'cmp_impr': 499999}]}

So, we now start working on it:

#9
data = []
for datum in rough_data:
    for campaign in datum['campaigns']:
        campaign.update({'user': datum['user']})
        data.append(campaign)
data[:2]  # let's take another peek

The first thing we need to do before we can feed this data to a DataFrame is to denormalize it. This means transforming the data into a list whose items are campaign dictionaries, each augmented with the data of the user it belongs to (at this stage, still a JSON string). Users are therefore duplicated in every campaign they belong to. The first item in data looks like this:

{'cmp_name': 'GRZ_20171018_20171116_35-55_B_EUR',
 'cmp_bgt': 999613,
 'cmp_spent': 43168,
 'cmp_clicks': 35603,
 'cmp_impr': 500001,
 'user': '{"username": "samuel62", "name": "Tonya Lucas", "gender": "F", "email": "[email protected]", "age": 27, "address": "PSC 8934, Box 4049\nAPO AA 43073"}'}

You can see that the user object has been brought into the campaign dictionary, and that it is repeated in every campaign that belongs to that user.
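The same denormalization can be sketched on a tiny made-up sample. The records and the denormalize helper below are hypothetical, not part of the chapter's code, and unlike cell #9 this version builds a new dictionary for each campaign instead of updating the original one in place:

```python
def denormalize(records):
    """Flatten user records into one dict per campaign, each carrying its user."""
    flat = []
    for record in records:
        for campaign in record['campaigns']:
            # Copy the campaign and attach the user, leaving the input untouched.
            flat.append({**campaign, 'user': record['user']})
    return flat


# Made-up sample data, shaped like the items in rough_data.
sample = [
    {
        'user': '{"username": "alice"}',
        'campaigns': [
            {'cmp_name': 'A', 'cmp_clicks': 10},
            {'cmp_name': 'B', 'cmp_clicks': 20},
        ],
    },
    {
        'user': '{"username": "bob"}',
        'campaigns': [{'cmp_name': 'C', 'cmp_clicks': 30}],
    },
]

flat = denormalize(sample)
print(len(flat))        # 3: one item per campaign
print(flat[1]['user'])  # the first user's string, repeated for the second campaign
```

Copying each campaign is a design choice made for this sketch only: it keeps the source records unchanged, which makes the cell safe to run more than once.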

Now, to make the second part of the chapter deterministic, I'm going to save the data I generated here, so that I (and you, too) can load it from the next Notebook and get the same results:

#10
with open('data.json', 'w') as stream:
    stream.write(json.dumps(data))

You should find the data.json file in the source code for the book. Now we are done with ch13-dataprep, so we can close it, and open up ch13.
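As a rough sketch of what loading the file back involves, the snippet below writes a small made-up list and reads it again with json.load. The sample data and the sample.json file name are assumptions, chosen so that the example does not overwrite the real data.json; the file is created in a temporary directory:

```python
import json
import os
import tempfile

# Made-up data standing in for the denormalized campaign list.
sample = [{'cmp_name': 'A', 'cmp_clicks': 10, 'user': '{"username": "alice"}'}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.json')  # hypothetical file name

    # Save, in the same way as cell #10.
    with open(path, 'w') as stream:
        stream.write(json.dumps(sample))

    # Load, as a later Notebook would.
    with open(path) as stream:
        loaded = json.load(stream)

print(loaded == sample)  # True: the list survives the round trip unchanged
```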
