Every column shows an attribute

What's in a name? Yes, I know this is Shakespeare, but the question is worth asking here: what is an attribute? How would you define it? As a first approximation, we can say that an attribute is something that describes a specific feature of our entity. It follows that every column should show data related to just one feature of our customer, since we saw previously that every record should relate to one single customer, with no repetition.

Since we already know that our customers appear more than once within the dataset, we can infer that somewhere a column is showing more than one feature. But how can we be sure about it?

Let's look back at the structure of our data:

'data.frame': 891330 obs. of 9 variables:
$ attr_3 : num 0 0 0 0 0 0 0 0 0 0 ...
$ attr_4 : num 0 0 0 0 0 0 0 0 0 0 ...
$ attr_5 : num 0 0 0 0 0 0 0 0 0 0 ...
$ attr_6 : num 0 0 0 0 0 0 0 0 0 0 ...
$ attr_7 : num 0 0 0 0 0 0 0 0 0 0 ...
$ default_flag : Factor w/ 3 levels "?","0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ customer_code: num 1 2 3 4 5 6 7 8 9 10 ...
$ parameter : chr "attr_8" "attr_8" "attr_8" "attr_8" ...
$ value : num NA NA -1e+06 -1e+06 NA NA NA -1e+06 NA -1e+06 ...

We can see that the first five columns are named attr_3, attr_4, and so on (what attributes are they? I don't know; we will ask someone in the office later). Then we find a default_flag column and a familiar customer_code column. Can you see the next column? It has quite a mysterious name, parameter, and it appears to be populated with repetitions of attr_8. But we should not let the first rows bias our ideas: are we sure that only attr_8 is in there? What would you do to gain some more information on this?

We can once more employ the unique() function, which results in the following:

stored_data$parameter %>% unique()
[1] "attr_8" "attr_9" "attr_10" "attr_11" "attr_12" "attr_13"
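As a brief aside, a frequency table goes one step further than unique(): it also tells us how many rows carry each label and how many rows each customer occupies. The following is only a minimal sketch on a small made-up data frame; toy_data and all of its values are assumptions for illustration, not the actual stored_data:

```r
# Toy stand-in for stored_data: two customers, three labels each.
# All values are invented for illustration.
toy_data <- data.frame(
  customer_code = c(1, 1, 1, 2, 2, 2),
  parameter     = c("attr_8", "attr_9", "attr_10", "attr_8", "attr_9", "attr_10"),
  value         = c(NA, 10, 3, -1e+06, 12, 5),
  stringsAsFactors = FALSE
)

# Distinct labels stored in the parameter column
distinct_labels <- unique(toy_data$parameter)

# Number of rows carrying each label
label_counts <- table(toy_data$parameter)

# Number of rows per customer: more than one means the customer is repeated
rows_per_customer <- table(toy_data$customer_code)

print(distinct_labels)
print(label_counts)
print(rows_per_customer)
```

In this toy case every customer shows up once per label, which is the kind of repetition we suspected in our own data.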

Here we are: a full bouquet of attributes was hiding within that mysterious label. This is relevant, but as far as I can see those are just labels. Where are the actual values of our attributes? We could have found something such as eye color, height, and number of sons, but where should we look to find the number of sons for a given customer? Just one column is left, and we should definitely check it out: the value column (I know, it is not very original; I could actually change this in future editions of the book). What do we find in there? Our str() function tells us the following:

num NA NA -1e+06 -1e+06 NA NA NA -1e+06 NA -1e+06 ...

This is not actually that much, but at least it lets us understand that there are some numbers in there. Things are much clearer now: some attributes are in good order in separate columns, while others are melted within one column. This is probably because the table was obtained by merging together different data sources, as the side note made us aware; this is often the case with industrial data.
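To make the idea of attributes melted within one column more concrete, here is a minimal sketch of how such label/value pairs could be spread back into one column per attribute using base R's reshape(). The toy_long data frame and its values are invented for illustration, and this is not necessarily how we will tidy the real table:

```r
# Toy long-format table: one row per customer and attribute label.
# All values are invented for illustration.
toy_long <- data.frame(
  customer_code = c(1, 1, 2, 2),
  parameter     = c("attr_8", "attr_9", "attr_8", "attr_9"),
  value         = c(NA, 10, -1e+06, 12),
  stringsAsFactors = FALSE
)

# Spread the labels into separate columns: one row per customer,
# one column per attribute (named value.attr_8, value.attr_9)
toy_wide <- reshape(
  toy_long,
  idvar     = "customer_code",  # identifies the entity (one row each)
  timevar   = "parameter",      # column holding the attribute labels
  direction = "wide"
)

print(toy_wide)
```

After reshaping, each customer appears only once and each column describes a single feature, which is exactly what the rule asks for.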

So even the second rule is broken. What about the third?
