Pre-application processing

This first coding stage must include all the processes that are completely independent of the application. Although they can eventually be scheduled automatically (for example, if the data source changes over time and has to be refreshed), we can think of them as processes that need to be run just once each time the data source changes.

In our example, we will include the elimination of variables and the recoding. After this process, the processed data sources have to be saved, of course. In the following piece of code, we will load the dataset in the same way as we did before and eliminate the corresponding columns:

#Retrieve Data

data.adult <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", header = F, strip.white = T)

names(data.adult) <- c("age", "workclass", "fnlwgt", "education", "education.num", "marital.status", "occupation", "relationship", "race", "sex", "capital.gain", "capital.loss", "hours.per.week", "native.country","earnings")

#Eliminate variables

data.adult$fnlwgt <- NULL
data.adult$relationship <- NULL
data.adult$native.country <- NULL
data.adult$capital.loss <- NULL
data.adult$capital.gain <- NULL
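
The raw file separates fields with a comma followed by a space, so the strip.white argument of read.csv() matters for the exact label matching done during recoding. The following is a minimal sketch of the effect; it uses two made-up inline rows instead of the real file, and sample.text, raw, and clean are illustrative names only:

```r
# Two made-up rows in the same "comma + space" layout as the raw file
sample.text <- "39, State-gov, Bachelors\n50, ?, HS-grad"

# Without strip.white, text fields keep their leading blank
raw <- read.csv(text = sample.text, header = FALSE)

# With strip.white = TRUE, text fields are trimmed
clean <- read.csv(text = sample.text, header = FALSE, strip.white = TRUE)

raw$V2
clean$V2

# An exact comparison with "?" only finds the value in the trimmed version
sum(raw$V2 == "?")
sum(clean$V2 == "?")
```

If the labels are not trimmed on load, a recoding that looks for "?" or "Private" will silently match nothing.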

Now, it is time to recode the variables. This is customized, variable-by-variable work, as there is no way to generalize groupings across variables. The code for recoding each one of them is explained below.

Workclass

With the following line, we can obtain the proportion per category:

round(prop.table(table(data.adult$workclass)),4)

The outcome is as follows:

##          ?      Federal-gov        Local-gov     Never-worked
##     0.0564           0.0295           0.0643           0.0002
##   Private     Self-emp-inc Self-emp-not-inc        State-gov
##     0.6970           0.0343           0.0780           0.0399
##      Without-pay
##           0.0004

As anticipated in the first section of this chapter, about 70 percent of the sample is concentrated in Private. Of the remaining categories, three correspond to government employment and two to self-employment; each of these two groups can be merged into a single category. Never-worked and Without-pay make a very marginal contribution.

Although they probably mean different things, Never-worked and Without-pay are so infrequent that we will group them both under Others. The ? category will be treated as missing. This naturally loses some information, but reducing the number of categories, given how infrequent these are, will help the user to compare, as there will be only a few categories to see.

As there are just a few levels in this case, we can do the recoding manually. The function used for this purpose is mapvalues() from the plyr package, which has to be loaded first with library(plyr). It receives the original labeled factor variable and two equally long vectors (from and to) that correspond to the original labels and their replacements:

data.adult$workclass <- mapvalues(data.adult$workclass,
        from = c("Federal-gov","Local-gov", "State-gov",
        "Self-emp-inc", "Self-emp-not-inc",
        "Never-worked", "Without-pay", "?"),
        to = c(rep("Government",3), rep("Self-employed",2), rep("Others",2), NA))
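
To see how mapvalues() treats labels that are listed, labels that are not listed, and NA replacements, the following sketch may help. It runs on a small made-up factor (toy.workclass and toy.recoded are illustrative names) and assumes that the plyr package is installed:

```r
library(plyr)  # assumption: the plyr package is installed

# Made-up values, used only to illustrate the call
toy.workclass <- factor(c("Private", "Local-gov", "State-gov",
                          "?", "Self-emp-inc", "Private"))

# 'from' and 'to' must have the same length; position i of 'from'
# is replaced by position i of 'to'
toy.recoded <- mapvalues(toy.workclass,
                         from = c("Local-gov", "State-gov", "Self-emp-inc", "?"),
                         to = c(rep("Government", 2), "Self-employed", NA))

# Labels that do not appear in 'from' (here, Private) are left unchanged
levels(toy.recoded)

# table() drops NA by default; useNA = "ifany" makes the missing values visible
table(toy.recoded, useNA = "ifany")
```

Checking the table with useNA = "ifany" after each recoding is a cheap way to confirm that only the intended category ended up as missing.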

Something similar is done with the rest of the variables. The following is the recoding code for each of them:

#Recode marital.status

data.adult$marital.status <- mapvalues(data.adult$marital.status,
       from = c("Married-AF-spouse","Married-civ-spouse",
       "Divorced","Married-spouse-absent", "Separated"),
       to = c(rep("Married",2), rep("Divorced/Separated",3)))

#Recode occupation

data.adult$occupation <- mapvalues(data.adult$occupation, from = "?", to = NA)

#Recode race

data.adult$race <- mapvalues(data.adult$race, from = "Amer-Indian-Eskimo", to = "Other")

After this, education and education.num are merged into education, using the numeric codes in education.num to order the levels of education, and education.num is removed:

unique.education <- unique(data.adult[,c("education.num","education")])
levels.ordered <- as.character(unique.education$education[order(unique.education$education.num)])

data.adult$education <- as.factor(data.adult$education.num)

levels(data.adult$education) <- levels.ordered

data.adult$education.num <- NULL
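
The merge relies on the fact that a factor built from the numeric codes has its levels in numeric order, so the labels only need to be sorted by the same codes before being assigned. The following sketch walks through the same steps on a few made-up rows (toy is an illustrative name):

```r
# Made-up rows: a numeric code and its label, in arbitrary row order
toy <- data.frame(education.num = c(13, 9, 10, 9, 13),
                  education = c("Bachelors", "HS-grad", "Some-college",
                                "HS-grad", "Bachelors"))

# One row per (code, label) pair, then the labels sorted by their numeric code
unique.education <- unique(toy[, c("education.num", "education")])
levels.ordered <- as.character(
  unique.education$education[order(unique.education$education.num)])

# A factor built from the codes has its levels in numeric order ...
toy$education <- as.factor(toy$education.num)

# ... so assigning the sorted labels pairs every code with its own label
levels(toy$education) <- levels.ordered

toy$education.num <- NULL

levels(toy$education)
as.character(toy$education)
```

This only works when every code maps to exactly one label; if a code had two spellings of the same label, unique() would return more pairs than there are levels and the assignment would fail.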

Finally, the dataset is saved in a binary file (ideally, .rda or .rds) for performance and space reasons:

save(data.adult, file = "Path_to_application/rawdata_adult.rda")
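
The two binary formats mentioned here are read back differently. The following is a minimal round-trip sketch on a made-up data frame, written to temporary files instead of the application path (toy, rda.path, rds.path, and restored.env are illustrative names):

```r
toy <- data.frame(age = c(39, 50), sex = factor(c("Male", "Female")))

# .rda: save() stores the object together with its name,
# and load() restores it under that same name
rda.path <- tempfile(fileext = ".rda")
save(toy, file = rda.path)
restored.env <- new.env()
load(rda.path, envir = restored.env)
identical(restored.env$toy, toy)

# .rds: saveRDS() stores a single object without its name,
# and readRDS() returns it so that it can be assigned to any name
rds.path <- tempfile(fileext = ".rds")
saveRDS(toy, file = rds.path)
toy.copy <- readRDS(rds.path)
identical(toy.copy, toy)
```

With .rda, the application has to know the name of the stored object (data.adult in this chapter); with .rds, it chooses the name when reading.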

This will be the complete script:

library(plyr)

#Retrieve Data

data.adult <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", header = F, strip.white = T)

names(data.adult) <- c("age", "workclass", "fnlwgt", "education", "education.num", "marital.status", "occupation", "relationship", "race", "sex", "capital.gain", "capital.loss", "hours.per.week", "native.country","earnings")

data.adult$fnlwgt <- NULL
data.adult$relationship <- NULL
data.adult$native.country <- NULL
data.adult$capital.loss <- NULL
data.adult$capital.gain <- NULL

#Recode workclass

data.adult$workclass <- mapvalues(data.adult$workclass,
                                 from = c("Federal-gov","Local-gov", "State-gov",
                                          "Self-emp-inc", "Self-emp-not-inc",
                                          "Never-worked", "Without-pay", "?"),
                                 to = c(rep("Government",3), rep("Self-employed",2), rep("Others",2), NA))

#Recode marital.status

data.adult$marital.status <- mapvalues(data.adult$marital.status,
                                 from = c("Married-AF-spouse","Married-civ-spouse",
                                          "Divorced","Married-spouse-absent", "Separated"),
                                 to = c(rep("Married",2), rep("Divorced/Separated",3)))

#Recode occupation

data.adult$occupation <- mapvalues(data.adult$occupation, from = "?", to = NA)

#Recode race

data.adult$race <- mapvalues(data.adult$race, from = "Amer-Indian-Eskimo", to = "Other")

#Merge education and education.num

unique.education <- unique(data.adult[,c("education.num","education")])
levels.ordered <- as.character(unique.education$education[order(unique.education$education.num)])

data.adult$education <- as.factor(data.adult$education.num)

levels(data.adult$education) <- levels.ordered

data.adult$education.num <- NULL

#Save data

save(data.adult, file = "Path_to_application/rawdata_adult.rda")