Conceptual design

Once we have a general understanding of the data involved in the application, we can decide which data will feed the input variables and what role each of them will play. Input and output variables are not mutually exclusive; a variable can be used as part of both an input and an output. Input widgets can also play different roles: they can act as filters, aggregation variables, variable selectors for visualizations, and so on. In other words, they can serve as inputs for almost any process that can be programmed within a reactive context.

This flexibility lets us conceive an application with almost no restrictions at this preprogramming stage. Although this is definitely an advantage, it has to be handled carefully. An application that tries to cover every possible aspect of the dataset, although technically possible, can produce a very confusing outcome. Keep in mind that these visualizations have to communicate something to their users, so it does not make sense to provide large amounts of information that will not be understood.

At this stage, developers must also work as analysts. They need to decide what is important and what is not, determine which variables may be causes and which may be consequences, and so on. In other words, the developer needs to form their own hypotheses about the data source to be visualized.

Let's apply this process to our example. Excluding the variables mentioned in the previous section, we can divide the remaining variables into three big groups: demographics, education, and occupation and earnings. They are composed as follows:

  • Demographics: age, sex, race, and marital.status
  • Education: education/education.num
  • Occupation and earnings: occupation, workclass, hours.per.week, and earnings

Of course, there are probably multiple correlations between variables within and across groups. In this case, however, we are going to focus on the influence of the demographic variables on the rest, so we will generate input widgets based exclusively on demographic variables.

The outputs, however, will describe not only the relationship between the demographic variables and the rest, but also the relationships among the demographic variables themselves, as this can provide insightful and easy-to-read information about demographic aspects of the data.

Following the grouping logic established previously, it makes sense to structure the application into an input area and a tab-separated output area, with one tab per group. The inputs will be on the left-hand side and the tabs on the right-hand side.
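Before writing any interface code, this structure can be written down as plain data. The following is only a minimal, language-neutral sketch in Python: the names VARIABLE_GROUPS, AppSchema, and build_schema are hypothetical and chosen for illustration, and the group contents are copied from the list above.

```python
from dataclasses import dataclass

# Variable groups as listed above. The dictionary layout is an assumption
# made for this sketch, not part of any particular framework.
VARIABLE_GROUPS = {
    "Demographics": ["age", "sex", "race", "marital.status"],
    "Education": ["education"],
    "Occupation and earnings": [
        "occupation", "workclass", "hours.per.week", "earnings",
    ],
}


@dataclass
class AppSchema:
    """Hypothetical description of the layout: inputs left, tabs right."""
    sidebar_inputs: list  # variables that drive the input widgets
    tabs: list            # one output tab per variable group


def build_schema(groups, input_group):
    """Use one group for the input widgets and create a tab for every group."""
    return AppSchema(
        sidebar_inputs=list(groups[input_group]),
        tabs=list(groups),
    )


schema = build_schema(VARIABLE_GROUPS, "Demographics")
print(schema.sidebar_inputs)
print(schema.tabs)
```

Keeping the plan in one place like this makes it easy to check that every group has a tab and that only demographic variables feed the inputs.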

The content of each tab will depend on the level of measurement of the variables involved. There are two numeric variables (age and hours.per.week) and one ordinal variable (education); the rest are categorical.
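To illustrate how the level of measurement can drive the choice of summary, here is a small sketch in Python using only the standard library. The MEASUREMENT_LEVEL mapping follows the classification above; the summarize function and its choice of a median for numeric variables and percentages for the others are assumptions made for illustration, and the sample values are made up.

```python
from collections import Counter
from statistics import median

# Level of measurement per variable, following the classification above.
MEASUREMENT_LEVEL = {
    "age": "numeric",
    "hours.per.week": "numeric",
    "education": "ordinal",
    "sex": "categorical",
    "race": "categorical",
    "marital.status": "categorical",
    "occupation": "categorical",
    "workclass": "categorical",
    "earnings": "categorical",
}


def summarize(values, level):
    """Return a simple summary suited to the level of measurement.

    Numeric variables get a median; ordinal and categorical variables get
    the percentage of observations in each category. This choice is an
    assumption for the sketch, not a fixed rule.
    """
    if level == "numeric":
        return {"median": median(values)}
    counts = Counter(values)
    total = len(values)
    return {category: 100 * n / total for category, n in counts.items()}


# Made-up sample values, for illustration only.
print(summarize([40, 50, 20], MEASUREMENT_LEVEL["hours.per.week"]))
print(summarize(["Male", "Female", "Male", "Male"], MEASUREMENT_LEVEL["sex"]))
```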

In this example, we are going to stick to traditional visualizations derived from equally traditional operations on the available data, so that non-expert users can also understand them:

  • Demographics tab: Four outputs are displayed here. The first is a simple bar chart of the male/female proportion, the second is a bar chart of the race variable broken down by gender, and the third is a line chart of age frequency, also broken down by gender. The fourth is a simple percentage table of the marital.status variable.
  • Education tab: Taking advantage of the ordinal level of measurement of education, we can generate a descending line chart with the education categories ordered along the horizontal axis. Each point represents the percentage of people who reached at least that educational level. For instance, if the value of education is HS-grad, the previous categories are assumed to be completed as well. The graph is split by a categorical variable, producing one line per category, which enables easy comparisons.
  • Occupation and earnings tab: This contains two elements: the output of a chi-square test between earnings and any other categorical variable, displayed as text, and the median of hours per week by occupation and workclass, displayed in a table.
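The computation behind the descending line chart of the Education tab can be sketched as follows. This is a minimal illustration in Python with the standard library only: EDUCATION_ORDER is an assumed, shortened ordering of education levels, the records are made up, and the function names attainment_curve and curves_by_group are hypothetical.

```python
from collections import Counter

# Assumed, shortened ordering of education levels from lowest to highest.
# The real dataset contains more categories than these four.
EDUCATION_ORDER = ["11th", "HS-grad", "Bachelors", "Masters"]


def attainment_curve(values, order):
    """Percentage of people who reached at least each level in `order`."""
    counts = Counter(values)
    total = len(values)
    if total == 0:
        return [(level, 0.0) for level in order]
    curve = []
    remaining = total  # people whose highest level is this one or above
    for level in order:
        curve.append((level, 100 * remaining / total))
        remaining -= counts[level]
    return curve


def curves_by_group(records, group_key, order):
    """Build one curve per category of `group_key` (one line per category)."""
    groups = {}
    for record in records:
        groups.setdefault(record[group_key], []).append(record["education"])
    return {name: attainment_curve(vals, order) for name, vals in groups.items()}


# Made-up records, for illustration only.
records = [
    {"sex": "Female", "education": "HS-grad"},
    {"sex": "Female", "education": "Masters"},
    {"sex": "Male", "education": "11th"},
    {"sex": "Male", "education": "HS-grad"},
    {"sex": "Male", "education": "Bachelors"},
]
for name, curve in curves_by_group(records, "sex", EDUCATION_ORDER).items():
    print(name, curve)
```

Because every person counted at a given level is also counted at all lower levels, each curve starts at 100 percent and can only decrease, which is what gives the chart its descending shape.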

This, then, will be the application's schema:

[Figure: Conceptual design]

With this, our idea of what the application will look like is clear, so it is time to make it real. In order to put into practice as many of the topics covered here as possible, this implementation will contain visualizations made with different packages, styles, and so on. Under normal circumstances, this is not recommended, as the result can be visually unattractive and even unclear.

However, first things first! Now that we know what we are going to do, we can arrange our data sources to make the application as efficient as possible. For this reason, we should do some preprocessing on the data source before starting to code the application.
