Data type checks with str()

Here we are going to leverage the str() function, which we met a few paragraphs ago. We used it to take a closer look at our data frame, and we noticed that its output has one line per column, showing the name of the column, its type (more precisely, the type of the vector forming that column), and its first few values. We are now going to look more carefully through the output to see whether anything strange appears:

str(clean_stored_data)

This will result in:

'data.frame': 27060 obs. of  15 variables: 
$ attr_3 : num 0 0 0 0 0 0 0 0 0 0 ...
$ attr_4 : num 1 1 1 1 1 1 1 1 1 1 ...
$ attr_5 : chr "0" "0" "0" "0" ...
$ attr_6 : num 1 1 1 1 1 1 1 1 1 1 ...
$ attr_7 : num 0 0 0 0 0 0 0 0 0 0 ...
$ default_numeric: chr "?" "?" "0" "0" ...
$ default_flag : Factor w/ 3 levels "?","0","1": 1 1 2 2 2 2 2 2 2 2 ...
$ customer_code : num 6 13 22216 22217 22220 ...
$ geographic_area: chr "north_america" "asia" "north_america" "europe" ...
$ attr_10 : num -1.00e+06 -1.00e+06 1.00 9.94e-01 0.00 ...
$ attr_11 : num 0 0.00482 0.01289 0.06422 0.01559 ...
$ attr_12 : num 89 20 150000 685773 11498 ...
$ attr_13 : num 89 20 150000 1243 11498 ...
$ attr_8 : num 0 0.0601 0 0.2066 0.2786 ...
$ attr_9 : num 0.00 1.07 0.00 1.00 -1.00e+06 ...

Since we are still waiting for the colleague's email, we do not know the meaning and content of each attribute, and we cannot have clear expectations about the type of data within each column. Nevertheless, we can observe that a first group of columns, from attr_3 to attr_7, shows zeros or ones. We then have default_numeric, with question marks and some zeros, and default_flag, with question marks, zeros, and ones. These are followed by customer_code, geographic_area, and a final group of attributes, from 8 to 13, which all show numbers. Leaving aside the columns showing question marks, which we are going to handle later, do you think that all the other columns show the type that we could reasonably expect?

If we look back at the first group of attributes, from 3 to 7, we observe that most of them have a numeric type, which is reasonable given that they show 0 and 1 as values. Most of them, but not all. Can you see the attr_5 column? str() says that it is a character vector, and we can get further assurance of this by running the mode() function on it:

mode(clean_stored_data$attr_5)

[1] "character"
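As a side note, mode() reports how a vector is stored, which does not always coincide with its class. The following is a minimal sketch on made-up vectors (toy_chr and toy_fct are hypothetical names, not part of our dataset) showing where the two agree and where they differ:

```r
# Toy vectors, invented for illustration only
toy_chr <- c("0", "1", "0")
toy_fct <- factor(c("?", "0", "1"))

mode(toy_chr)   # "character"
class(toy_chr)  # "character"

# A factor is stored as integer codes, so mode() reports "numeric"
mode(toy_fct)   # "numeric"
class(toy_fct)  # "factor"
```

This is worth keeping in mind for columns such as default_flag, which str() reports as a factor.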

Before moving on and changing its type to numeric, let's be sure about the values it stores, so that we do not cast to numeric a vector that actually holds mixed types of data. A quick first check is the range() function, which returns the smallest and largest values stored in a vector (for a character vector, in sort order). It shows only the two extremes; to list every distinct value, we could use the unique() function instead:

range(clean_stored_data$attr_5)
[1] "0" "1"
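To see why the difference between range() and unique() matters, consider the following minimal sketch. It uses a made-up vector (toy_values is a hypothetical name, not part of our dataset) that hides an unexpected value between the two extremes:

```r
# Toy character vector, invented for illustration only
toy_values <- c("0", "1", "0", "0.5", "1")

# range() returns only the smallest and largest element
# (sort order for characters), so "0.5" goes unnoticed
range(toy_values)    # "0" "1"

# unique() lists every distinct value, in order of first appearance
unique(toy_values)   # "0" "1" "0.5"

# table() additionally counts how often each value occurs
table(toy_values)
```

When a column has only a handful of distinct values, as seems to be the case for attr_5, unique() or table() gives the more complete picture.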

Having checked what this column stores, we are ready to change its type from character to numeric. But how do we do it? Let me introduce you to a nice family of R functions, the as.something functions. Within R, we have a group of functions dedicated to data casting, which all follow the same logic: take that object, cast it, and give it back to me as [type of data].

Here, you can substitute [type of data] with one of many possible types, for instance:

  • Characters, as.character
  • Numbers, as.numeric
  • Data frames, as.data.frame
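As a quick illustration of the family, here is a minimal sketch on made-up values (none of them come from our dataset). It also shows what happens when a value cannot be cast, which is worth knowing given the question marks we spotted earlier:

```r
# Toy values, invented for illustration only
as.character(42)     # "42"
as.numeric("3.14")   # 3.14

# A list of equal-length vectors becomes a data frame with 2 rows and 2 columns
as.data.frame(list(a = 1:2, b = c("x", "y")))

# A value that cannot be read as a number becomes NA, with a warning
as.numeric(c("0", "1", "?"))   # 0 1 NA
```

The NA produced by the last call is the reason for checking the content of a column before casting it.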

Let's see how this works by applying as.numeric to our attr_5 column. We are going to apply it and get a sense of the result by running mode() on it once again: 

clean_stored_data$attr_5 %>%
  as.numeric() %>%
  mode()

Which will result in the following output: 

[1] "numeric"

This is exactly what we expected. The final step is to actually replace the original vector residing within the clean_stored_data data frame with this numeric version. We do this by leveraging the mutate() function from the dplyr package, which can be employed to create new columns or change existing ones within a data frame:

clean_stored_data %>%
  mutate(attr_5 = as.numeric(attr_5)) %>%
  str()

The output now shows attr_5 as a numeric vector. To store this result, let's assign it to a new clean_casted_stored_data object:

clean_stored_data %>%
  mutate(attr_5 = as.numeric(attr_5)) -> clean_casted_stored_data

As you probably remember, while talking about data type checks earlier, we stated that these kinds of checks should be implemented in some automatic way within the code employed to process your data. But how do we perform those kinds of checks within the flow of our code? As usual, R comes to help us with another family of functions that could be considered close relatives of the as.something functions: the is.something functions. I am sure you are starting to understand how this works: you just have to pass your object to the function, and it will return TRUE if the object is of the type you are looking for, and FALSE if it is not. For instance, run the following code:

is.data.frame(clean_stored_data$attr_5)

This will return FALSE, since our well-known attr_5 is a vector rather than a data frame. As was the case for the as.something family, the is.something family is composed of a lot of members, such as:

  • is.numeric
  • is.character
  • is.integer
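The following minimal sketch shows a few of these functions at work on made-up vectors (toy_chr and toy_num are hypothetical names, not part of our dataset):

```r
# Toy vectors, invented for illustration only
toy_chr <- c("0", "1")
toy_num <- as.numeric(toy_chr)

is.character(toy_chr)  # TRUE
is.numeric(toy_chr)    # FALSE
is.numeric(toy_num)    # TRUE

# as.numeric() produces doubles, so is.integer() is FALSE here
is.integer(toy_num)    # FALSE
is.integer(1L)         # TRUE
```

Note the difference between is.numeric and is.integer: a vector holding only whole numbers is not necessarily stored as an integer vector.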

Once you have placed a data type check at the right point of your code flow, you will be able to provide a different flow for each possible outcome, defining different chunks of code to be executed depending on whether the check passes or fails. To do this, you will probably have to employ a conditional statement. Those kinds of statements are outside the scope of this book, but you can find a brief introduction to them in the appendix, Dealing with Dates, Relative Paths and Functions.
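To give a rough idea of what such a check could look like, here is a minimal sketch in base R. The ensure_numeric() helper is hypothetical, written only for this illustration, and the policy it applies (cast character vectors, stop on anything that cannot be cast) is an assumption rather than a rule taken from our analysis:

```r
# Hypothetical helper: makes sure a vector is numeric before further processing
ensure_numeric <- function(x) {
  if (is.numeric(x)) {
    # Check passed: return the vector untouched
    return(x)
  }
  if (is.character(x)) {
    # Check failed, but a cast may still be possible
    converted <- suppressWarnings(as.numeric(x))
    # Stop if the cast turned any non-missing value into NA
    if (any(is.na(converted) & !is.na(x))) {
      stop("vector contains values that are not numbers")
    }
    return(converted)
  }
  stop("unsupported type: ", class(x)[1])
}

ensure_numeric(c("0", "1", "0"))  # 0 1 0
ensure_numeric(c(0.5, 2))         # 0.5 2.0
```

A vector such as c("?", "0") would make the helper stop with an error, which is usually preferable to letting silently introduced NA values flow into later steps.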
