Scatterplots

Scatterplots are probably among the most common plots, since they are frequently used to display the relationship between two quantitative variables. When two variables are provided, qplot() will make a scatterplot by default. With the experience you have acquired in the previous sections of this chapter, building a scatterplot will be quite straightforward.

For our scatterplot example, we will use the ToothGrowth dataset, which is available in the base R installation. It reports tooth length measurements for 60 guinea pigs, 10 for each combination of three doses of vitamin C (0.5, 1, and 2 mg/day) and two delivery methods: orange juice or ascorbic acid (a compound with vitamin C activity). You can find details on the dataset help page at ?ToothGrowth.
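Before plotting, it can help to look at how the dataset is laid out. The following is a minimal sketch using only base R; the variable name counts is chosen here for illustration and is not part of the dataset.

```r
# Load the dataset shipped with base R (package "datasets")
data(ToothGrowth)

# Three columns: len (tooth length), supp (OJ or VC), dose (0.5, 1, 2)
str(ToothGrowth)

# Number of observations for each combination of supplement and dose
counts <- table(ToothGrowth$supp, ToothGrowth$dose)
print(counts)
```

Each cell of the table corresponds to one delivery method at one dose, which is the grouping used in the plots of this section.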

We are interested in seeing how the length of the teeth changed for each different dose. We are not able to distinguish the different guinea pigs since this information is not contained in the data, so for the moment, we will simply plot the data we have:

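library(ggplot2)
# Scatterplot with the point geometry stated explicitly
qplot(dose, len, data = ToothGrowth, geom = "point")
# Alternative coding: with two variables, "point" is the default geometry
qplot(dose, len, data = ToothGrowth)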

The resulting plot is reproduced in Figure 2.13. As you have seen, when no geom argument is given, qplot() produces a scatterplot, which is its default plot type for two variables. In this plot, we see that the length of the teeth increases as the vitamin C intake increases. On the other hand, since the vitamin C was provided in two different ways, as orange juice or as ascorbic acid, it could be interesting to check whether these two groups behave differently.

Figure 2.13: This shows a scatterplot of the data on tooth length versus the dose in ToothGrowth
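For comparison, the same scatterplot can be sketched with the full ggplot() syntax, which recent ggplot2 releases favor over qplot() (qplot() is marked as deprecated there). The object name p_basic is only an illustrative choice.

```r
library(ggplot2)

# Map dose to the x axis and len to the y axis, then add a point layer
p_basic <- ggplot(ToothGrowth, aes(x = dose, y = len)) +
  geom_point()

# Draw the plot
print(p_basic)
```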

The first approach could be to show the data in two different colors. To do that, we simply need to map the color attribute to the supp column of the data, which records how the vitamin C was given to the guinea pigs:

qplot(dose, len, data = ToothGrowth, geom = "point", col = supp)

The resulting plot is in Figure 2.14. We will discuss later on in the book how the colors are assigned in ggplot2, but for now, we will only focus on the general layout. We can now tell which intake route each data point came from, and it looks like the subgroup that received orange juice shows greater tooth growth than the subgroup that received ascorbic acid. Nevertheless, it is not easy to tell the two groups apart. We could then try facets, so that the data is completely separated into two different subplots. So let's see what happens:

Figure 2.14: This shows a scatterplot of the length of teeth versus the dose in ToothGrowth with data in different colors depending on vitamin C intake

The discussion in the preceding paragraph is encapsulated in this code:

qplot(dose, len, data = ToothGrowth, geom = "point", facets = . ~ supp)

In this new plot, shown in Figure 2.15, we definitely have a better picture of the data, since we can see how the growth of teeth differs for the different intakes.
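The two grouping approaches used so far, color and faceting, can also be sketched with the full ggplot() syntax. This is only one possible translation of the qplot() calls; the object names p_color and p_facet are illustrative.

```r
library(ggplot2)

# Grouping by color: map supp to the colour aesthetic
p_color <- ggplot(ToothGrowth, aes(x = dose, y = len, colour = supp)) +
  geom_point()

# Grouping by facets: one panel per level of supp, laid out in columns
p_facet <- ggplot(ToothGrowth, aes(x = dose, y = len)) +
  geom_point() +
  facet_grid(. ~ supp)

print(p_color)
print(p_facet)
```

The formula . ~ supp places the levels of supp in columns, which matches the facets argument passed to qplot().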

As illustrated in this simple example, the best visualization can differ depending on the data you have. In some cases, grouping a variable with colors or dividing the data with faceting may give you a different idea about the data and its tendency. For instance, with the plot in Figure 2.15, we see that tooth growth increases with the dose and seems to differ for each intake route. However, when studying only the data points, it is difficult to identify any difference in the data behavior:

Figure 2.15: This shows a scatterplot of the length of teeth versus dose in ToothGrowth with faceting

One approach to highlighting the general tendency of the data could be to include a smooth line in the graph. In this case, we can see that the growth after the administration of orange juice does not look linear, so a smooth line could be a nice way to capture this. In order to do that, we simply add "smooth" to the vector of geometries passed to the geom argument of the qplot function. The following code shows this:

qplot(dose, len, data = ToothGrowth, geom = c("point", "smooth"), facets = . ~ supp)

As you can see from the plot in Figure 2.16, we now clearly see not only the two groups separately, thanks to the faceting, but also the tendency of the data with respect to the dose administered. As you have seen, the smooth line in ggplot2 is drawn together with a confidence interval by default. If you don't want the confidence interval, you can simply add the se=FALSE argument. We will cover this topic in more detail in Chapter 4, Advanced Plotting Techniques.
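As a hedged illustration of the se=FALSE option, the faceted plot with a smooth line but no confidence band could be written with the full ggplot() syntax as follows. The object name p_smooth is illustrative, and because dose takes only three distinct values, the default smoother may emit warnings when the plot is drawn.

```r
library(ggplot2)

# Points plus a smooth line per panel; se = FALSE drops the confidence band
p_smooth <- ggplot(ToothGrowth, aes(x = dose, y = len)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  facet_grid(. ~ supp)

print(p_smooth)
```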

Figure 2.16: This shows a scatterplot of the length of teeth versus the dose in ToothGrowth with faceting and a smooth line
