At this stage, we assume that the reader has already acquired enough technical knowledge to code a full application. This is naturally necessary to produce a successful output, but sometimes it is not enough. There are often thousands of different ways to obtain similar results in R or any other programming language. However, some of them are usually better than others in different respects: scalability, clarity, performance, and so on.
In this chapter, an application is developed from scratch so that the reader comes face to face with the typical "real world" challenges of a programmer, where the pros and cons of different approaches are evaluated and decisions are taken with all their implications, that is, taking into account the possible drawbacks of the choices made. In this sense, the most important thing is to be conscious of all the strong points, and especially the weak points, of your code.
This chapter is divided into eight sections that follow the development of the application. The code includes a global.R file; as was already explained, including this leads to significant performance improvements in most cases. UI.R is coded first, without outputs; splitting the UI.R coding is clearer because it respects the information flow that was covered in Chapter 4, Shiny Structure – Reactivity Concepts.

The sample application will be based on the Adult dataset, which can be found in the UCI repository. For this particular case, it is assumed that the source data will have rows added or removed over time, but that the structure (that is, variable names, variable order, and so on) will remain.
Let's imagine that our client or boss comes to us and says that they want us to develop an interactive web application in Shiny for the Adult data source. They expect a fully functional first version from us and leave, in principle, everything in our hands (including design definitions).
Once we have the dataset (or a link to it), the most sensible thing to do is to take a look at the documentation, if there is any. In this case, we have https://archive.ics.uci.edu/ml/datasets/Adult and https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names. These web pages contain summarized context information, the different variable names, and their possible values.
Although it might sound trivial, client briefs, documentation, context information, and so on are very useful pieces of information for our applications, as they give us insights into how the source of information that we have to work with was generated and what purpose it pursued. In this sense, it is always important to keep in mind that a successful interactive application must provide insightful information to its users and, in order to do so, the programmers need to be familiar with the data themselves.
In this case, for instance, we can see that the fnlwgt variable is an estimate calculated from the socio-economic characteristics of the respondents. As there is no clear explanation of what it represents, and we assume that the application we are going to develop must be understood by a non-expert audience, we can take it out. The case of relationship is similar: as there is no documentation, there is no way of establishing what particular relationship this variable is describing. For this reason, we will also take it out of the application.
Apart from reading the documentation, it is also sensible to take a look at the data sources that will be used for the application; not just to know the variable names, but also to look at the data itself. This can reveal, for example, which variables are important, which ones can be used as filter variables, what kind of visualizations can give insightful information, and so on.
In this case, we first have to load the data, assign the corresponding variable names, and call summary for each variable:
# Retrieve data
data.adult <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                       header = FALSE)

# Assign variable names
names(data.adult) <- c("age", "workclass", "fnlwgt", "education",
                       "education.num", "marital.status", "occupation",
                       "relationship", "race", "sex", "capital.gain",
                       "capital.loss", "hours.per.week", "native.country",
                       "earnings")

# Apply summary() to each column in the data
sapply(data.adult, summary)
The last command outputs a summary for each variable according to its class: if it is a factor or a character, a frequency table is returned; in the case of numeric variables, the quartiles and the mean are returned.
In many cases, it is worth checking whether any of the variables in the dataset always has the same value, or a different one every time (for example, IDs). In this particular case, we can see that the age variable has values between 17 and 90. So, for example, if we are planning to include an age filter, it should not accept values lower than 17 (or, more generally, the minimum value of the variable).
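One way to respect this is to derive the filter bounds from the data itself rather than hard-coding them. The following sketch uses a small synthetic vector in place of the real age column; the sliderInput call is only indicated in a comment:

```r
# Derive filter bounds from the data rather than hard-coding them, so the
# limits stay valid if rows are added or removed over time.
# A small synthetic vector stands in for the real age column.
age <- c(17, 25, 42, 90)

age.range <- range(age)   # c(minimum, maximum)
age.range[1]              # lower bound for an age filter

# In Shiny, a slider could then be defined as, for example:
# sliderInput("age", "Age", min = age.range[1], max = age.range[2],
#             value = age.range)
```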
For example, the workclass variable can give very good insights, but some preprocessing is definitely needed, as there are too many categories with an unbalanced distribution, and almost 70 percent of the sample is concentrated in the Private category. This can lead to misleading graphics if we want to include this variable in a visualization; for example, a frequency bar chart would be dominated by the most frequent value.
Summary statistics are also not possible without preprocessing, as they would be calculated on statistically insignificant data. In order to obtain more summarized information, some of the remaining categories can be merged into more general ones. For example, Federal-gov, Local-gov, and State-gov can be re-categorized into Government. Actually, all the factor-type variables that remain in the dataset, except for earnings and sex, need recoding.
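As a sketch of this recoding, the government categories can be merged by reassigning factor levels, since R collapses duplicated levels automatically. The grouping shown here (including a hypothetical Self-employed group) is only illustrative:

```r
# Illustrative recoding of workclass on a synthetic vector; the category
# labels come from the UCI documentation, but the grouping is a judgment call.
workclass <- factor(c("Federal-gov", "Local-gov", "State-gov",
                      "Private", "Self-emp-inc", "Self-emp-not-inc"))

# Assigning the same label to several levels merges them into one category
levels(workclass)[levels(workclass) %in%
                    c("Federal-gov", "Local-gov", "State-gov")] <- "Government"
levels(workclass)[levels(workclass) %in%
                    c("Self-emp-inc", "Self-emp-not-inc")] <- "Self-employed"

table(workclass)   # three Government rows now fall into a single category
```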
At first glance, native.country seems a very interesting variable to include in the visualization. However, its multiplicity of categories, in addition to its marked concentration on United States, makes it almost impossible to include in the analysis without also recoding the remaining categories into, for example, continents.
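Such a recoding could be done with a named lookup vector. The sketch below covers only a few hypothetical categories; a real mapping would have to include every level present in the data:

```r
# Hypothetical country-to-continent lookup; only a few categories shown.
continent.of <- c("United-States" = "America", "Mexico" = "America",
                  "Germany"       = "Europe",  "India"  = "Asia")

# Synthetic sample standing in for the real native.country column
native.country   <- c("United-States", "Germany", "United-States", "India")
native.continent <- unname(continent.of[native.country])
native.continent
```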
capital.loss and capital.gain are also variables with a very particular distribution, with over 90 percent of the sample at 0. So, at first glance, we would suspect that it does not make much sense to include them in the visualization.
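This kind of concentration is quick to check: the share of exact zeros is just the mean of a logical comparison. A synthetic vector is used here; on the real data the share is above 90 percent:

```r
# Share of exact zeros in a variable; synthetic stand-in for capital.gain
capital.gain <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 2174)

mean(capital.gain == 0)   # proportion of observations equal to zero
```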
Finally, education.num and education are exactly the same variable, with the sole difference that the values in education.num correspond to each ordered stage of education (1 for preschool, 2 for 1st to 4th grade, and so on). This is definitely a duplication of information, but expressed differently, and in some way both express valuable information. On one hand, it is important to keep the categories in education because, if we would like to generate a visualization with this variable, it should definitely be done with the labels and not with the corresponding numerical codes in education.num.

On the other hand, education.num contains the order of the different stages. Consequently, the information in both variables is relevant and has to be retained. The best alternative, therefore, is to generate a single factor variable where each numerical value and its label correspond to each other.
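One possible way to do this, sketched on a synthetic subset of the categories, is to extract the unique label/code pairs, sort them by code, and use the sorted labels as the levels of an ordered factor:

```r
# Synthetic subset of the education columns
data.adult <- data.frame(
  education     = c("HS-grad", "Preschool", "Bachelors", "HS-grad"),
  education.num = c(9, 1, 13, 9)
)

# Unique label/code pairs, sorted by the numeric code
lab <- unique(data.adult[, c("education", "education.num")])
lab <- lab[order(lab$education.num), ]

# Single ordered factor: labels carry the names, the order carries the stages
data.adult$education <- factor(data.adult$education,
                               levels  = lab$education,
                               ordered = TRUE)
levels(data.adult$education)   # Preschool < HS-grad < Bachelors
```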
So far, we have made the following seven decisions:

1. Take out fnlwgt because it could be unclear to the general public.
2. Take out relationship because there is no explanation about its meaning.
3. Take out native.country due to its concentration on United States.
4. Take out capital.loss because of its distribution.
5. Take out capital.gain for the same reason as in the previous step.
6. Recode the remaining factor variables, except sex and earnings, to have fewer but statistically more significant categories in each variable.
7. Merge education and education.num into one single variable.

Now that we have a better understanding of the data source, we will move on to the conceptual design.