This first coding stage must include all the processes that are completely independent of the application. Although they can eventually be scheduled automatically (for example, if the data source changes over time and has to be refreshed), we can think of them as processes that need to be run just once each time the data source changes.
In our example, we will include the elimination of variables and the recoding. After this process, the processed data sources have to be saved, of course. In the following piece of code, we will load the dataset in the same way as we did before and eliminate the corresponding columns:
#Retrieve Data
#strip.white = TRUE removes the blank that follows each comma in the raw file,
#so that labels such as "Private" match exactly when they are recoded later
data.adult <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                       header = FALSE, strip.white = TRUE)
names(data.adult) <- c("age", "workclass", "fnlwgt", "education",
                       "education.num", "marital.status", "occupation",
                       "relationship", "race", "sex", "capital.gain",
                       "capital.loss", "hours.per.week", "native.country",
                       "earnings")

#Eliminate variables
data.adult$fnlwgt <- NULL
data.adult$relationship <- NULL
data.adult$native.country <- NULL
data.adult$capital.loss <- NULL
data.adult$capital.gain <- NULL
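The five assignments above can also be written as a single step. The following is a minimal sketch on a made-up data frame, not the chapter's code: the toy values and the name drop.cols are assumptions introduced for illustration. It keeps every column whose name is not in the list of columns to drop, and it does not fail if one of the listed names is absent.

```r
# Toy data frame standing in for data.adult (values are illustrative only)
toy <- data.frame(age    = c(39, 50),
                  fnlwgt = c(77516, 83311),
                  sex    = c("Male", "Male"))

# Hypothetical vector with the names of the columns to eliminate;
# "relationship" is not in the toy data and is simply ignored
drop.cols <- c("fnlwgt", "relationship")

# Keep only the columns that are not listed in drop.cols
toy <- toy[, setdiff(names(toy), drop.cols), drop = FALSE]

print(names(toy))
```

Assigning NULL column by column, as in the chapter, is easier to read when only a few variables are removed; the setdiff() form is mainly convenient when the list is long or is built programmatically.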
Now it is time to recode the variables. This is customized, variable-by-variable work, as there is no way to generalize groupings across variables. The code for recoding each of them is explained in the next section.
With the following line, we can obtain the proportion per category:
round(prop.table(table(data.adult$workclass)),4)
The outcome is as follows:
##                ?      Federal-gov        Local-gov     Never-worked
##           0.0564           0.0295           0.0643           0.0002
##          Private     Self-emp-inc Self-emp-not-inc        State-gov
##           0.6970           0.0343           0.0780           0.0399
##      Without-pay
##           0.0004
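To see how the proportions are built, the call can be split into its three steps: table() counts each category, prop.table() divides the counts by their total, and round() trims the result to four decimals. The sketch below uses a small made-up vector rather than the real column, so its numbers are not those of the dataset.

```r
# Toy vector standing in for data.adult$workclass (values are made up)
workclass <- factor(c("Private", "Private", "Private", "State-gov", "?"))

# Step 1: absolute frequency of each category
counts <- table(workclass)

# Step 2: divide each count by the total number of observations
props <- prop.table(counts)

# Step 3: round to four decimals for display
props <- round(props, 4)

print(props)
```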
As anticipated in the first section of this chapter, 70 percent of the sample is concentrated in Private. Of the remaining categories, three correspond to government employment and two to self-employment; each of these two groups can be merged into a single category. Never-worked and Without-pay make only a marginal contribution.
Although they probably mean different things, Never-worked and Without-pay occur so rarely that we will put both of them in Others. The ? category will be treated as missing. This naturally loses some information, but because these categories are so infrequent, reducing the number of levels helps the user to compare, as there will be only a few categories to see.
As there are just a few levels in this case, we can do the recoding manually. The function used for this purpose is mapvalues() from the plyr package. It receives the original labeled factor variable and two equally long vectors (from and to) that hold the original labels and their replacements:
#mapvalues() comes from the plyr package
library(plyr)

data.adult$workclass <- mapvalues(data.adult$workclass,
  from = c("Federal-gov", "Local-gov", "State-gov",
           "Self-emp-inc", "Self-emp-not-inc",
           "Never-worked", "Without-pay", "?"),
  to = c(rep("Government", 3), rep("Self-employed", 2),
         rep("Others", 2), NA))
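For readers who prefer not to depend on plyr, the same regrouping can be sketched with base R only. This is an alternative technique, not the chapter's method: it relies on the fact that the levels of a factor can be assigned a named list, where each name is a new level and its value lists the old levels merged into it. Old levels that are not mentioned in the list become NA, which is how the ? category is handled here. The toy vector is an assumption made for illustration.

```r
# Toy factor; the values are illustrative, not taken from the real dataset
workclass <- factor(c("Private", "Federal-gov", "Local-gov",
                      "Self-emp-inc", "Never-worked", "?"))

# Each list name is a new level; its value lists the old levels merged into it.
# Old levels that are not mentioned (here "?") are turned into NA.
levels(workclass) <- list(
  "Private"       = "Private",
  "Government"    = c("Federal-gov", "Local-gov", "State-gov"),
  "Self-employed" = c("Self-emp-inc", "Self-emp-not-inc"),
  "Others"        = c("Never-worked", "Without-pay")
)

# useNA = "ifany" also shows how many values ended up as missing
print(table(workclass, useNA = "ifany"))
```

Unlike mapvalues(), this form requires every level that should be kept, such as Private, to be listed explicitly, and it only applies to factors, not to character vectors.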
Something similar is done with the rest of the variables. The following is the recoding code for each of them:
#Recode marital.status
data.adult$marital.status <- mapvalues(data.adult$marital.status,
  from = c("Married-AF-spouse", "Married-civ-spouse",
           "Divorced", "Married-spouse-absent", "Separated"),
  to = c(rep("Married", 2), rep("Divorced/Separated", 3)))

#Recode occupation
data.adult$occupation <- mapvalues(data.adult$occupation,
  from = "?", to = NA)

#Recode race
data.adult$race <- mapvalues(data.adult$race,
  from = "Amer-Indian-Eskimo", to = "Other")
After this, education and education.num are merged into education, so that its levels follow the order given by education.num, and education.num is erased:
unique.education <- unique(data.adult[, c("education.num", "education")])
levels.ordered <- as.character(
  unique.education$education[order(unique.education$education.num)])
data.adult$education <- as.factor(data.adult$education.num)
levels(data.adult$education) <- levels.ordered
data.adult$education.num <- NULL
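The merge is easier to follow on a small example. The sketch below repeats the same steps on a made-up data frame with three education labels; the data frame toy and its values are assumptions used only for illustration. The point is that the resulting levels are ordered by the numeric code rather than alphabetically.

```r
# Toy data: education.num is the numeric rank of each education label
toy <- data.frame(
  education.num = c(13, 9, 10, 9, 13),
  education     = c("Bachelors", "HS-grad", "Some-college",
                    "HS-grad", "Bachelors"),
  stringsAsFactors = TRUE
)

# One row per distinct (number, label) pair
unique.education <- unique(toy[, c("education.num", "education")])

# Labels sorted by their numeric rank instead of alphabetically
levels.ordered <- as.character(
  unique.education$education[order(unique.education$education.num)])

# Build the factor from the numbers (its levels are sorted numerically),
# then replace those numeric levels with the labels in the same order
toy$education <- as.factor(toy$education.num)
levels(toy$education) <- levels.ordered

# The numeric column is now redundant
toy$education.num <- NULL

print(levels(toy$education))
```

This only works when each education.num value corresponds to exactly one education label, which the code takes for granted.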
Finally, the dataset is saved in a binary file (ideally, .rda or .rds), for performance and space reasons:
save(data.adult, file = "Path_to_application/rawdata_adult.rda")
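The two binary formats mentioned above behave slightly differently. save() stores the object together with its name, and load() restores it under that same name; saveRDS() stores a single object without a name, and readRDS() returns it so that it can be assigned to any variable. The following is a minimal sketch of the .rds variant, assuming a small made-up data frame and a temporary file instead of the application path.

```r
# Small stand-in data frame (hypothetical content)
toy <- data.frame(age = c(39, 50),
                  workclass = factor(c("Private", "Government")))

# Temporary file used only for this illustration
rds.path <- tempfile(fileext = ".rds")

# Write the single object to disk in binary form
saveRDS(toy, file = rds.path)

# Read it back; the result can be assigned to any name
restored <- readRDS(rds.path)

# Check that the round trip preserved the data
print(identical(toy, restored))
```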
This will be the complete script:
library(plyr)

#Retrieve Data
data.adult <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                       header = FALSE, strip.white = TRUE)
names(data.adult) <- c("age", "workclass", "fnlwgt", "education",
                       "education.num", "marital.status", "occupation",
                       "relationship", "race", "sex", "capital.gain",
                       "capital.loss", "hours.per.week", "native.country",
                       "earnings")

#Eliminate variables
data.adult$fnlwgt <- NULL
data.adult$relationship <- NULL
data.adult$native.country <- NULL
data.adult$capital.loss <- NULL
data.adult$capital.gain <- NULL

#Recode workclass
data.adult$workclass <- mapvalues(data.adult$workclass,
  from = c("Federal-gov", "Local-gov", "State-gov",
           "Self-emp-inc", "Self-emp-not-inc",
           "Never-worked", "Without-pay", "?"),
  to = c(rep("Government", 3), rep("Self-employed", 2),
         rep("Others", 2), NA))

#Recode marital.status
data.adult$marital.status <- mapvalues(data.adult$marital.status,
  from = c("Married-AF-spouse", "Married-civ-spouse",
           "Divorced", "Married-spouse-absent", "Separated"),
  to = c(rep("Married", 2), rep("Divorced/Separated", 3)))

#Recode occupation
data.adult$occupation <- mapvalues(data.adult$occupation,
  from = "?", to = NA)

#Recode race
data.adult$race <- mapvalues(data.adult$race,
  from = "Amer-Indian-Eskimo", to = "Other")

#Merge education and education.num
unique.education <- unique(data.adult[, c("education.num", "education")])
levels.ordered <- as.character(
  unique.education$education[order(unique.education$education.num)])
data.adult$education <- as.factor(data.adult$education.num)
levels(data.adult$education) <- levels.ordered
data.adult$education.num <- NULL

#Save data
save(data.adult, file = "Path_to_application/rawdata_adult.rda")