Chapter 11. From White Paper to a Full Application

At this stage, the reader is assumed to have acquired enough technical knowledge to code a full application. This is naturally necessary to produce a successful output, but sometimes it is not enough. There are often thousands of different ways to obtain similar results in R or any other programming language; however, some of them are usually better than others in terms of scalability, clarity, performance, execution time, and so on.

In this chapter, an application is developed from scratch so that the reader comes face to face with the typical "real world" challenges of a programmer, where the pros and cons of different approaches are evaluated and decisions are taken with all their implications, that is, taking into account the possible drawbacks of the choices made. In this sense, the most important thing is to be conscious of all the strong points and, especially, the weak points of your code.

This chapter will be divided into the following eight sections:

  1. Problem presentation: The task is presented in a very realistic way, as a boss or client would present it.
  2. Conceptual design: The central parts of the application are thought out. At this point, it is not necessary to have a definite idea of all the application's functionalities, but it is recommended to have at least a clear idea of the inputs and outputs, for example, which variables will be used for filtering and which ones for plotting, general layout issues, and so on.
  3. Pre-application processing: Based on the conceptual design, it is possible that some of the necessary processing (for example, aggregations) can be done before coding the application itself. Although this could possibly be avoided and included somewhere else (for example, in step 4 or 6), this step can lead to very significant improvements in terms of performance (some processes are run only once and not in real time, whenever someone calls the application or performs an action in it) and memory usage (the data sources that must be loaded and held in memory by the application are already summarized and, consequently, smaller in size).
  4. global.R coding: Exactly as in step 3, it is generally possible to code an application without a global.R file but, as already explained, including one leads to significant performance improvements in most cases.
  5. UI.R partial coding: In order to follow the input/output process clearly, UI.R is coded first without its outputs. Splitting the coding of UI.R in this way is clearer because it respects the information flow that was covered in Chapter 4, Shiny Structure – Reactivity Concepts.
  6. server.R coding: This is the backend script.
  7. UI.R completion: This involves the inclusion of the outputs.
  8. Styling: This involves the final styling with CSS. A sketch of a possible file layout for these steps follows this list.
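
The following sketch shows one possible way the files of the application could be laid out according to these steps. The file names global.R, UI.R, and server.R follow the conventions used throughout this book, while the names of the pre-processing script and the stylesheet are merely illustrative:

# adult-app/
# |-- preprocessing.R   # step 3: run once, before the application is deployed
# |-- global.R          # step 4: data and objects shared by UI.R and server.R
# |-- UI.R              # steps 5 and 7: inputs first, outputs added later
# |-- server.R          # step 6: the backend script
# `-- www/
#     `-- styles.css    # step 8: final styling with CSS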

The sample application will be based on the Adult dataset that can be found in the UCI repository. For this particular case, it is assumed that the source data will have rows added or removed over time, but that the structure (that is, variable names, variable order, and so on) will remain the same.

Problem presentation

Let's imagine that our client or boss comes to us and says that they want us to develop an interactive web application in Shiny for the Adult data source. They expect a fully functional first version from us and leave, in principle, everything in our hands (including design decisions).

Once we have the dataset (or a link to it), the most sensible thing to do is to take a look at the documentation, if there is any. In this case, we have https://archive.ics.uci.edu/ml/datasets/Adult and https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names. These web pages contain summarized context information, the different variable names, and their possible values.

Although it might sound trivial, client briefs, documentation, context information, and so on are very useful pieces of information for our applications, as they give us insight into how the source of information we have to work on was generated and what purpose it pursued. In this sense, it is always important to keep in mind that a successful interactive application must provide insightful information to its users and, in order to do so, the programmers need to be familiar with the data themselves.

In this case, for instance, we can see that the fnlwgt variable is an estimate calculated from the socio-economic characteristics of the respondents. As there is no clear explanation of what it represents, and we assume that the application we are going to develop must be understood by a non-expert audience, we can leave it out. The case of relationship is similar: as there is no documentation, there is no way of establishing which particular relationship this variable describes. For this reason, we will also leave it out of the application.

Apart from reading the documentation, it is also sensible to take a look at the data sources that will be used for the application; not just to know the variable names, but also to look at the data itself. This can reveal, for example, which variables are important, which ones can be used as filter variables, what kinds of visualizations can give insightful information, and so on.

In this case, we first have to load the data, assign the corresponding variable names, and call summary() for each variable:

#Retrieve the data (strings are read as factors so that summary() returns frequency tables; before R 4.0 this was the default behavior)

data.adult <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", header = FALSE, stringsAsFactors = TRUE)

#Assign the variable names

names(data.adult) <- c("age", "workclass", "fnlwgt", "education", "education.num", "marital.status", "occupation", "relationship", "race", "sex", "capital.gain", "capital.loss", "hours.per.week", "native.country", "earnings")

#Apply summary() to each column of the data

sapply(data.adult, summary)

The last command will output a summary of each variable according to its class: if it is a factor, a frequency table of its values is returned; in the case of numeric variables, the quartiles and the mean are returned.

In many cases, it is worth checking whether any of the variables in the dataset always have the same value or a different value in every row (for example, IDs). In this particular case, we can see that the age variable has values between 17 and 90. So, if we are planning to include an age filter, it should not accept values lower than 17, the minimum value of the variable.
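
A quick sketch of this kind of check, assuming data.adult has been loaded and renamed as shown above, could be the following:

#Number of distinct values per variable (an ID-like variable would show one value per row)
sapply(data.adult, function(x) length(unique(x)))

#Range of age, useful for setting the limits of an age filter
range(data.adult$age)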

The workclass variable, for example, can give very good insights, but some preprocessing is definitely needed: there are too many categories with an unbalanced distribution, and almost 70 percent of the sample is concentrated in the Private category. This can lead to misleading graphics if we include this variable in a visualization; a frequency bar chart, for instance, would be dominated by the most frequent value.

Summary statistics are also not possible without preprocessing, as some categories contain too few observations to be statistically meaningful. In order to obtain more summarized information, some of the remaining categories can be merged into a more general one. For example, Federal-gov, Local-gov, and State-gov can be re-categorized into Government. In fact, all the factor-type variables that remain in the dataset, except for earnings and sex, need recoding.
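
As a sketch of this kind of recoding, the Government example could look as follows. The exact groupings are a modelling choice, and the values in the raw file typically carry a leading blank, hence the trimming:

#Trim the leading blanks that the raw file contains in the categorical values
data.adult$workclass <- trimws(as.character(data.adult$workclass))

#Merge the three government categories into a single one and restore the factor
gov <- c("Federal-gov", "Local-gov", "State-gov")
data.adult$workclass[data.adult$workclass %in% gov] <- "Government"
data.adult$workclass <- as.factor(data.adult$workclass)

#Check the resulting distribution
table(data.adult$workclass)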

native.country seems, at first glance, a very interesting variable to include in the visualization. However, its multiplicity of categories, in addition to its marked concentration on United States, makes it almost impossible to include it in the analysis or even to recode the remaining categories into, for example, continents.
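
This concentration can be verified directly; a minimal sketch, again assuming data.adult is already loaded:

#Share of each country of origin, sorted; United States dominates the distribution
sort(prop.table(table(data.adult$native.country)), decreasing = TRUE)[1:5]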

capital.loss and capital.gain also have very particular distributions, with over 90 percent of the sample at 0. So, at first glance, we would suspect that it does not make much sense to include them in the visualization.
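
The proportion of zeros can be checked in one line per variable, for example:

#Share of observations with no capital gain or loss at all
mean(data.adult$capital.gain == 0)
mean(data.adult$capital.loss == 0)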

Finally, education.num and education contain exactly the same information; the only difference is that the values in education.num correspond to each ordered stage of education (1 for pre-school, 2 for 1st to 4th, and so on). This is definitely a duplication of information, expressed in two different ways, and both forms are valuable. On one hand, it is important to keep the categories in education because, if we want to generate a visualization with this variable, it should definitely be done with the labels and not with the numerical codes of the corresponding education.num.

On the other hand, education.num contains the order of the different stages. Consequently, the information in both variables is relevant and has to be retained. The best alternative, therefore, is to generate a single factor variable whose labels come from education and whose level order follows education.num.
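
One possible sketch of such a merge, using the labels from education and the ordering from education.num, could be the following:

#Unique (label, numeric code) pairs, ordered by the numeric code
edu <- unique(data.adult[, c("education", "education.num")])
edu <- edu[order(edu$education.num), ]

#Single ordered factor that keeps both the labels and their ordering
data.adult$education <- factor(data.adult$education, levels = edu$education, ordered = TRUE)

#education.num can then be dropped, as its information is now contained in education
data.adult$education.num <- NULL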

So far, we have made the following seven decisions:

  1. Eliminate fnlwgt because it could be unclear to the general public.
  2. Eliminate relationship because there is no explanation about its meaning.
  3. Eliminate native.country due to its concentration on United States.
  4. Eliminate capital.loss because of its distribution.
  5. Eliminate capital.gain for the same reason as explained in step 4; a sketch of these column drops follows this list.
  6. Recode all categorical variables except sex and earnings so that each variable has fewer but statistically more meaningful categories.
  7. Merge education and education.num into one single variable.
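
As referenced in the list, decisions 1 to 5 can be sketched as a single column drop; the variable names must match the ones assigned earlier:

#Drop the variables discarded in decisions 1 to 5
drop.vars <- c("fnlwgt", "relationship", "native.country", "capital.loss", "capital.gain")
data.adult <- data.adult[, !(names(data.adult) %in% drop.vars)]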

Now that we have a better understanding of the data source, we will move on to the conceptual design.
