A solution is provided for each exercise in the book. Do not look at the solution until you have made a serious effort to solve the exercise! For many problems, there will be several possible solutions in R. If you come up with a solution different from the one provided, try to see if the two solutions are equivalent—do you get the same answer? Why or why not?
Solutions provided in the chapter.
attach(mtcars) stripchart(mpg ~ cyl, method = "jitter")
This helps to separate the cars a bit. Now we can see how many cars are in each group.
Not surprisingly, cars with fewer cylinders get better gas mileage.
install.packages("plotrix", dependencies=TRUE) library(plotrix) attach(trees) dotplot.mtb(Volume)
A type of jittering is automatic. Even so, some values that are very close still run together. One way to deal with this is to make the plot character smaller:
dotplot.mtb(Volume, pch = 20) # or dotplot.mtb(Volume, pch = ".") # too small! dotplot.mtb(Volume, pch = "/") # Hmm... detach(trees)
dotchart(USArrests$Murder, labels = row.names(USArrests))
The state names are so big, they overwrite and become illegible!
load("Nimrod.rda") # .rda shows it was saved as an R data frame dotchart(Nimrod$time)
Good!
dotchart(Nimrod$time, labels = Nimrod$performer, cex = .5)
Better!
Nimrod2 = Nimrod[order(Nimrod$time),] dotchart(Nimrod2$time, labels = Nimrod2$performer, cex = .5)
Yeah!
# print results to screen library(nlme) attach(MathAchieve) boxplot(SES ~ Minority * Sex) # graph to file pdf("SES.pdf") # opens a device library(nlme) attach(MathAchieve) boxplot(SES ~ Minority * Sex) dev.off() # closes and saves file
Insert the file SES.pdf into a word processor document.
It looks like SES
has the same relationship to Minority
and Sex
that MathAch
has.
par(mfrow = c(1,2)) # show 2 graphs side-by-side attach(mtcars) boxplot(mpg ~ cyl) library(plotrix) ehplot(mpg, cyl) detach(mtcars) par(mfrow=c(1,1)) # reset to one graph/page
The box plot shows reference points (i.e., quartiles); the EH-Plot shows actual points, including jittering.
You can add a box and whiskers to the EH-Plot: ehplot(mpg, cyl, box = T).
library(multcomp) stem(sbp$sbp, scale = .5) stem(sbp$sbp, scale = 3) stem(sbp$sbp, scale = 4)
A scale choice larger than 3 seems to give us worthless charts! This will not necessarily be true for all datasets.
install.packages("aplpack", dependencies = T) library(aplpack) attach(trees) stem.leaf.backback(Height, Volume) detach(trees)
The units of measurement are different. What happens if you standardize the units?
library(car) attach(Baumann) stem.leaf.backback(pretest.1, post.test.1)
It appears that the posttest scores are higher.
library(Sleuth2) attach(case0302) par(mfrow = c(2,2)) # show 4 graphs on one page hist(Dioxin) hist(Dioxin, breaks = 20) hist(Dioxin, breaks = 30) hist(Dioxin, breaks = 40) par(mfrow = c(1,1)) # reset to one graph/page
The distributional shape changes a great deal. With no breaks specified, it looks like most points are clustered around zero. With many breaks, it looks like a nearly symmetrical distribution with a few extreme values added to it. The strip chart emphasizes this idea. It appears that just two points are atypical. Why are they so different? Are they mistakes? Should they be included?
The two histograms look fairly similar, except that the IQbio
group has a few more higher-end IQs.
Studying the salaries of males and females in the Salary
dataset is a little trickier than the previous problem, because there are not separate vectors for “male salary” and “female salary.” So, to use histbackback()
, we must create such vectors. There are several ways to do this. Here is one of them:
attach(Salaries) m = subset(Salaries, sex == "Male") # contains only male data f = subset(Salaries, sex == "Female") # contains female data histbackback(m$salary, f$salary) detach(Salaries)
library(multcomp) eq2 = density(sbp$sbp, bw = 4) # Figure 8-1c hist (sbp$sbp, main="c. Histogram + Kernel Density, bw = 4", col = "maroon", las = 1, cex.axis = .8, freq = F) #freq=F: prob. densities lines(eq2,lwd = 2) # Plot density curve on existing histogram
This gives us sharper bends than when bw = 5.
eq2 = density(sbp$sbp, bw = 2) # Figure 8-1-c hist (sbp$sbp, main="c. Histogram + Kernel Density, bw = 2", col = "maroon", las = 1, cex.axis = .8, freq = F) #freq=F: prob. densities lines(eq2,lwd = 2) # Plot density curve on existing histogram
This produces about a dozen bends; further detail does not really fit the histogram better, but then, the histogram is a very rough approximation.
eq2 = density(sbp$sbp, bw = 20) # Figure 8-1c hist (sbp$sbp, main="c. Histogram + Kernel Density, bw = 20", col = "maroon", las = 1, cex.axis = .8, freq = F) #freq=F: prob. densities lines(eq2,lwd = 2) # Plot density curve on existing histogram
This line is flatter than bw = 10
, but not a radical change.
The sbp
of 125 has a y-coordinate about halfway between 0 and 0.25, or about 0.125. That is to say, the probability of selecting a patient with systolic blood pressure of 125 or less is about 0.125, or 12.5 percent.
By a similar process, we can determine that the probability of a systolic blood pressure less than or equal to 175 is about 90 percent. The probability of selecting a person with systolic blood pressure of 175 or greater is 1 – prob[175 or less] = 1 – 0.9 = 0.1, or 10 percent.
The probability of falling between 125 and 175 is prob[175 or less] – prob[125 or less] = 90% – 12.5% = 77.5%.
The Male
bars are on top in each group, but the legend is reversed in order, making the chart confusing to read. Change sexlab
to c("Male", "Female")
in legend()
.
library(car) attach(Salaries) library(epicalc) salK=salary/10000 pyramid(salK,sex, binwidth = 1, col = "seagreen", main = "Salaries in 10,000s of Dollars", cex.bar.value =.4, cex.axis = .8)
Compare this to the result from “Exercise 7-2”.
install.packages("HistData") library(HistData) attach(Nightingale) deaths = c(sum(Disease), sum(Wounds), sum(Other)) pie(deaths, labels=c("Disease","Wounds","Other"))
The deaths
command found the totals in each of the columns Disease
, Wounds
, and Other
. Those totals were used to make the pie chart. The difference between Disease
and the other causes was very obvious, but the difference between the other variables is not easy to see.
load("Nimrod.rda") x = table(Nimrod$medium) x pie(x, labels = c("Brass band","Concert band","Organ", "Orchestra"), col = c("gold3","deepskyblue","peachpuff3","magenta"))
load("Nimrod.rda") den = density(Nimrod$time) plot(den) rug(Nimrod$time) library(nlme) boxplot(MathAchieve$SES,main="Socioeconomic Status", ylab="SES score") rug(MathAchieve$SES)
The first of these is probably more helpful. The second one is so dense that we cannot see a separation of points over most of the range of the data.
Year = c(2004:2010) Europe = c(7.9, 7.9, 7.9, 7.8, 7.7, 7.1, 7.2) Eurasia = c(8.5, 8.5, 8.7, 8.6, 8.9, 8, 8.4) plot(Year, Europe, type = "l", col = "maroon", ylim = c(7,9), ylab = "Emissions", lwd = 2) # ylim makes graph big enough for Eurasia with values > 7.9 lines(Year, Eurasia, lty = "dashed", col = "steelblue", lwd = 2) legend("topleft", c("Eurasia", "Europe"), text.col = c("steelblue","maroon"), lty = c("dashed","solid"))
library(Sleuth2) library(epade) attach(case0701) plot(Velocity, Distance)
As expected, distance is greater with greater velocity.
scatter.ade(Velocity, Distance, wall = 3, main = "The Big Bang", col="red")
Very cool!
library(nlme) attach(MathAchieve) plot(SES, MathAch)
There may be a near-linear relationship, but it is hard to be sure.
library(nlme) attach(MathAchieve) sunflowerplot(SES, MathAch)
The sunflower plot does not help very much.
library(nlme) attach(MathAchieve) library(hexbin) plot(hexbin(SES, MathAch), colramp = heat.ob)
The relationship is much clearer, nearly vertical; that is to say, there is a similar wide range of math scores for all SES scores from –1 to about 1.5.
detach(MathAchieve)
library(ResearchMethods) load("baplot") # if you saved baplot! data(MFSV) attach(MFSV) baplot(MF,SV)
There is no obvious pattern, but we don’t know, based on the information given in the help file, what the clinically acceptable difference is.
library(reshape2) library(car) attach(tips) tips$ratio = 100*(tip/total_bill) attach(tips) qqnorm(ratio) qqline(ratio)
ratio
is not normally distributed; it is skewed at the upper end.
qq(time ~ ratio)
lunch
and dinner
are very different distributions.
qq(smoker ~ ratio)
smoker
and nonsmoker
are also very different.
library(reshape2) library(car) attach(Vocab) qqnorm(vocabulary) qqline(vocabulary)
vocabulary
does not seem to be normally distributed.
qqnorm(education) qqline(education)
education
is not, either.
qq(sex ~ vocabulary)
It is predominantly female
through most of the range, and male
at the upper extreme.
library(car) attach(Ginzberg) head(Ginzberg) pairs(~ simplicity + fatalism + depression) scatterplotMatrix(~ simplicity + fatalism + depression) library(corrplot) corrplot(y) library(GGally) ggscatmat(Ginzberg, columns = 1:3)
library(scatterplot3d) library(epicalc) data(SO2) scatterplot3d(SO2$smoke, SO2$SO2, SO2$deaths, highlight.3d = T, type = "h")
This is a tricky dataset to work with, because the dataset and a variable have the same name. If you try attach(SO2)
and scatterplot3d(smoke, SO2, deaths)
, it won’t work! Therefore, use the SO2$smoke
name instead. Deaths seem to be positively related to both of the other variables.
library(lattice) library(epicalc) data(SO2) levelplot(SO2$deaths ~ SO2$smoke + SO2$SO2)
library(Sleuth2) attach(ex1123) plot(SO2, Mort)
There appears to be a strong relationship between SO2 and mortality, with the exception of a single, high-mortality point.
coplot(Mort ~ SO2 | Educ)
The coplot shows that there is little SO2, and lower mortality, in cities where education levels are highest! Perhaps there are fewer smokestacks in cities with major financial businesses or research centers.
attach(mtcars) cars = as.matrix(mtcars) h = dist(scale(cars)) h2 = hclust(h, method = "single") plot(h2) h = dist(scale(cars), method = "manhattan") h2 = hclust(h, method = "single") plot(h2)
cars = as.matrix(mtcars) image(scale(cars), col = cm.colors(256)) image(scale(cars), col = rainbow(100)) image(scale(cars), col = terrain.colors(16))
And so on. The number after colors
tells how many colors are in the range.
install.packages("Amelia", dependencies = T) library(Amelia) missmap(airquality) library(epade) missiogram.ade(airquality)
The Wikipedia article on mosaic plots includes a demonstration of the Titanic
dataset. You can also find a mosaic plot of this data at http://bit.ly/1VDPbmp in a different orientation. You can find another one at http://bit.ly/1jdkQKr.
attach(Nimrod) par(bg = "white", mfrow = c(2,2)) # graph 1 x = table(medium) barplot(x, horiz = T, main = "Number of ensembles of each medium", names = c("Brass band", "Concert band", "Organ","Orchestra"), las = 1, cex.names = .8, col = "turquoise", space = 1.5) # graph 2 cols = "cadetblue4" boxplot(time ~ medium, col = c("goldenrod","firebrick"), ylab = "Time", xlab = "Medium", varwidth = T, main = "Performance times by medium", col.main = cols, col.axis = cols, las = 1, col.lab = cols, names = c("","","","")) mtext(text = c("Brass band", Concert band", "Organ", "Orchestra"), side = 1, cex = .6, at = c(1,2,3,4), line = 1) # graph 3 pro = subset(Nimrod, level == "p") am = subset(Nimrod, level == "a") plot(density(pro$time), ylim = c(0,.028), main = "Professional vs. amateur groups", xlab = "Time in seconds", col = "navy", lwd = 2, bty = "n", xlim = c(100,350), family = "HersheyScript", cex.main = 1.4, cex.lab = 1.3) lines(density(am$time), lty = "dotted", col = "lightblue4", lwd = 2) legend("topright", c("Amateur","Professional"), cex = .8, text.col = c("lightblue4","navy"), bty = "n") # graph 4 data2 = Nimrod[order(Nimrod$time),] dotchart(data2$time, labels = data2$performer, cex = .34, main = "Performance Time by Performer", xlab = "Performance time", pch = 19, col = c("violetred1", "violetred4"), lcolor = "gray90", cex.main = 1.9, col.main = "violetred4", cex.lab = 1.4, family = "HersheySerif") detach(Nimrod)