List of Figures

Figure 1.1 Location of R installer.

Figure 1.2 Language selection.

Figure 1.3 With modern versions of Windows, this suggestion can be safely ignored.

Figure 1.4 The license agreement must be acknowledged to use R.

Figure 1.5 It is important to choose a destination folder with no spaces in the name.

Figure 1.6 This dialog is used to choose the destination folder.

Figure 1.7 This is a proper destination, with no spaces in the name.

Figure 1.8 It is best to select everything except 32-bit components.

Figure 1.9 Accept the default startup options, as we recommend using RStudio as the front end and these will not be important.

Figure 1.10 Choose the Start Menu folder where the shortcuts will be installed.

Figure 1.11 We have multiple versions of R installed to allow development and testing with different versions.

Figure 1.12 We recommend saving the version number in the registry and associating R with RData files.

Figure 1.13 A progress bar is displayed during installation.

Figure 1.14 Confirmation that installation is complete.

Figure 1.15 Introductory screen for installation on a Mac.

Figure 1.16 Version selection.

Figure 1.17 The license agreement, which must be acknowledged to use R.

Figure 1.18 The license agreement must also be agreed to.

Figure 1.19 By default R is installed for all users, although there is the option to choose a specific location.

Figure 1.20 The administrator password might be required for installation.

Figure 1.21 A progress bar is displayed during installation.

Figure 1.22 This signals a successful installation.

Figure 2.1 The standard R interface in Windows.

Figure 2.2 The standard R interface on Mac OS X.

Figure 2.3 The general layout of RStudio.

Figure 2.4 Object Name Autocomplete in RStudio.

Figure 2.5 Clicking File >> New Project begins the project creation process.

Figure 2.6 Three options are available to start a new project: a new directory, associating a project with an existing directory or checking out a project from a version control repository.

Figure 2.7 Dialog to choose the location of a new project directory.

Figure 2.8 Dialog to choose an existing directory in which to start a project.

Figure 2.9 Here is the option to choose which type of repository to start a new project from.

Figure 2.10 Enter the URL for a Git repository, as well as the folder where this should be cloned to.

Figure 2.11 Clicking Tools >> Options brings up RStudio options.

Figure 2.12 General options in RStudio.

Figure 2.13 Options for customizing the code editing pane.

Figure 2.14 Options for code appearance.

Figure 2.15 These options control the placement of the various panes in RStudio.

Figure 2.16 Options related to packages. The most important is the CRAN mirror selection.

Figure 2.17 This is where to choose whether to use Sweave or knitr and select the PDF viewer.

Figure 2.18 These are the options for the spelling check dictionary, which allows language selection and the custom dictionaries.

Figure 2.19 This is where to set the location of Git and SVN executables so they can be used by RStudio.

Figure 2.20 The Git pane shows the Git status of files under version control. A blue square with a white M indicates a file has been changed and needs to be committed. A yellow square with a white question mark indicates a new file that is not being tracked by Git.

Figure 2.21 This displays files and the changes made to the files, with green being additions and pink being deletions. The upper right contains a space for writing commit messages.

Figure 3.1 RStudio’s Packages pane.

Figure 3.2 RStudio’s package installation dialog.

Figure 3.3 RStudio’s package installation dialog to install from an archive file.

Figure 7.1 Histogram of diamond carats.

Figure 7.2 Scatterplot of diamond price versus carat.

Figure 7.3 Boxplot of diamond carat.

Figure 7.4 Histogram of diamond carats using ggplot2.

Figure 7.5 Density plot of diamond carats using ggplot2.

Figure 7.6 Simple ggplot2 scatterplot.

Figure 7.7 Scatterplot of diamonds data mapping diamond color to the color aesthetic.

Figure 7.8 Scatterplot faceted by color.

Figure 7.9 Scatterplot faceted by cut and clarity. Notice that cut is aligned vertically while clarity is aligned horizontally.

Figure 7.10 Histogram faceted by color.

Figure 7.11 Boxplot of diamond carats using ggplot2.

Figure 7.12 Boxplot of diamond carats by cut using ggplot2.

Figure 7.13 Violin plot of diamond carats by cut using ggplot2.

Figure 7.14 Violin plots with points. The graph on the left was built by adding the points geom and then the violin geom, while the plot on the right was built in the opposite order. The order in which the geoms are added determines the positioning of the layers.

Figure 7.15 Line plot using ggplot2.

Figure 7.16 Line plot with a separate line for each year.

Figure 7.17 Various themes from the ggthemes package. Starting from top left and going clockwise: The Economist, Excel (for those with bosses who demand Excel output), Edward Tufte and The Wall Street Journal.

Figure 12.1 Plot of foreign assistance by year for each of the programs.

Figure 14.1 Plot of random normal variables and their densities, which results in a bell curve.

Figure 14.2 Area under a normal curve. The plot on the left shows the area to the left of -1, while the plot on the right shows the area between -1 and 1.

Figure 14.3 Normal distribution function.

Figure 14.4 Ten thousand runs of binomial experiments with ten trials each and probability of success of 0.3.

Figure 14.5 Random binomial histograms faceted by trial size. Notice that while not perfect, as the number of trials increases the distribution appears more normal. Also note the differing scales in each pane.

Figure 14.6 Histograms for 10,000 draws from the Poisson distribution at varying levels of λ. Notice how the histograms become more like the normal distribution.

Figure 14.7 Density plots for 10,000 draws from the Poisson distribution at varying levels of λ. Notice how the density plots become more like the normal distribution.

Figure 15.1 Pairs plot of economics data showing the relationship between each pair of variables as a scatterplot with the correlations printed as numbers.

Figure 15.2 Heatmap of the correlation of the economics data. The diagonal has elements with correlation 1 because every element is perfectly correlated with itself. Red indicates highly negative correlation, blue indicates highly positive correlation and white is no correlation.

Figure 15.3 ggpairs plot of tips data using both continuous and categorical variables.

Figure 15.4 t distribution and t-statistic for tip data. The dashed lines are two standard deviations from the mean in either direction. The thick black line, the t-statistic, is so far outside the distribution that we must reject the null hypothesis and conclude that the true mean is not $2.50.

Figure 15.5 Histogram of tip amount by sex. Note that neither distribution appears to be normal.

Figure 15.6 Plot showing the mean and two standard errors of tips broken down by the sex of the server.

Figure 15.7 Density plot showing the difference of heights of fathers and sons.

Figure 15.8 Means and confidence intervals of tips by day. This shows that Sunday tips differ from Thursday and Friday tips.

Figure 16.1 Using fathers’ heights to predict sons’ heights using simple linear regression. The fathers’ heights are the predictors and the sons’ heights are the responses. The blue line running through the points is the regression line and the grey band around it represents the uncertainty in the fit.

Figure 16.2 Regression coefficients and confidence intervals as taken from a regression model and calculated manually. The point estimates for the mean are identical and the confidence intervals are very similar, the difference due to slightly different calculations. The y-axis labels are also different because when dealing with factors lm tacks on the name of the variable to the level value.

Figure 16.3 Histogram of value per square foot for NYC condos. It appears to be bimodal.

Figure 16.4 Histograms of value per square foot. These illustrate structure in the data revealing that Brooklyn and Queens make up one mode and Manhattan makes up the other, while there is not much data on the Bronx and Staten Island.

Figure 16.5 Histograms for total square feet and number of units. The distributions are highly right skewed in the top two graphs, so they were repeated after removing buildings with more than 1,000 units.

Figure 16.6 Scatterplots of value per square foot versus square footage and value versus number of units, both with and without the buildings that have over 1,000 units.

Figure 16.7 Scatterplots of value versus square footage. The plots indicate that taking the log of SqFt might be useful in modeling.

Figure 16.8 Scatterplots of value versus number of units. It is not yet certain whether taking logs will be useful in modeling.

Figure 16.9 Coefficient plot for condo value regression.

Figure 16.10 Coefficient plots for models with interaction terms. (a) includes individual variables and the interaction term, while (b) only includes the interaction term.

Figure 16.11 Coefficient plot for multiple condo models. The coefficients are plotted in the same spot on the y-axis for each model. If a model does not contain a particular coefficient, it is simply not plotted.

Figure 17.1 Density plot of family income with a vertical line indicating the $150,000 mark.

Figure 17.2 Coefficient plot for logistic regression on family income greater than $150,000, based on the American Community Survey.

Figure 17.3 Histogram of the number of children per household from the American Community Survey. The distribution is not perfectly Poisson but it is sufficiently so for modeling with Poisson regression.

Figure 17.4 Coefficient plot for a logistic regression on ACS data.

Figure 17.5 Coefficient plot for Poisson models. The first model, children1, does not account for overdispersion, while children2 does. Because the overdispersion was not too big, the coefficient estimates in the second model have just a bit more uncertainty.

Figure 17.6 Survival curve for Cox proportional hazards model fitted on bladder data.

Figure 17.7 Survival curve for Cox proportional hazards model fitted on bladder data stratified on rx.

Figure 17.8 Andersen-Gill survival curves for bladder2 data.

Figure 18.1 Coefficient plot for condo value data regression in house1.

Figure 18.2 Plot of residuals versus fitted values for house1. This clearly shows a pattern in the data that does not appear to be random.

Figure 18.3 Plot of residuals versus fitted values for house1 colored by Boro. The pattern in the residuals is revealed to be the result of the effect of Boro on the model. Notice that the points sit above the x-axis and the smoothing curve because geom_point was added after the other geoms, meaning it gets layered on top.

Figure 18.4 Base graphics plots for residuals versus fitted values.

Figure 18.5 Q-Q plot for house1. The tails drift away from the ideal theoretical line, indicating that we do not have the best fit.

Figure 18.6 Histogram of residuals from house1. This does not look normally distributed, meaning our model is incomplete.

Figure 18.7 Coefficient plot of various models based on housing data. This shows that only Boro and some condominium types matter.

Figure 18.8 Plots for cross-validation error (raw and adjusted), ANOVA and AIC for housing models. The scales are different, as they should be, but the shapes are identical, indicating that houseG4 truly is the best model.

Figure 18.9 Histogram of the batting average bootstrap. The vertical lines are two standard errors from the original estimate in each direction. They make up the bootstrapped 95% confidence interval.

Figure 19.1 Cross-validation curve for the glmnet fitted on the American Community Survey data. The top row of numbers indicates how many variables (factor levels are counted as individual variables) are in the model for a given value of log(λ). The dots represent the cross-validation error at that point and the vertical lines are the confidence interval for the error. The leftmost vertical line indicates the value of λ where the error is minimized and the rightmost vertical line is the largest value of λ whose error is within one standard error of the minimum.

Figure 19.2 Coefficient profile plot of the glmnet model fitted on the ACS data. Each line represents a coefficient’s value at different values of λ. The leftmost vertical line indicates the value of λ where the error is minimized and the rightmost vertical line is the largest value of λ whose error is within one standard error of the minimum.

Figure 19.3 Cross-validation curve for ridge regression fitted on ACS data.

Figure 19.4 Coefficient profile plot for ridge regression fitted on ACS data.

Figure 19.5 Plot of α versus error for glmnet cross-validation on the ACS data. The lower the error the better. The size of the dot represents the value of lambda. The top pane shows the error using the one standard error methodology (0.0054) and the bottom pane shows the error by selecting the λ (6e-04) that minimizes the error. In the top pane the error is minimized for an α of 0.75 and in the bottom pane the optimal α is 0.9.

Figure 19.6 Cross-validation curve for glmnet with α = 0.75.

Figure 19.7 Coefficient path for glmnet with α = 0.75.

Figure 19.8 Coefficient plot for glmnet on ACS data. This shows that the number of workers in the family and not being on foodstamps are the strongest indicators of having high income, and using coal heat and living in a mobile home are the strongest indicators of having low income. There are no standard errors because glmnet does not calculate them.

Figure 19.9 Plot showing the coefficient for the black level of Race for each of the models. The coefficient for 1964 has a standard error that is orders of magnitude bigger than for the other years. It is so out of proportion that the plot had to be truncated to still see variation in the other data points.

Figure 19.10 Coefficient plot (the secret weapon) for the black level of Race for each of the models with a Cauchy prior. A simple change like adding a prior dramatically changed the point estimate and standard error.

Figure 20.1 Plot of WiFi device position colored by distance from the hotspot. Blue points are closer and red points are farther.

Figure 20.2 Plot of WiFi devices. The hotspot is the large green dot. Its position in the middle of the blue dots indicates a good fit.

Figure 20.3 Diamonds data with a number of different smoothing splines.

Figure 20.4 Scatterplot of price versus carat with a regression fitted on a natural cubic spline.

Figure 20.5 Plot of good credit versus bad based on credit amount, credit history and employment status.

Figure 20.6 Plot of age versus credit amount faceted by credit history and employment status, color coded by credit.

Figure 20.7 The smoother result for fitting a GAM on credit data. The shaded region represents two pointwise standard deviations.

Figure 20.8 Display of decision tree based on credit data. Nodes split to the left meet the criteria while nodes to the right do not. Each terminal node is labeled by the predicted class, either “Good” or “Bad.” The percentage is read from left to right, with the probability of being “Good” on the left.

Figure 21.1 GDP for a number of nations from 1960 to 2011.

Figure 21.2 Time series plot of U.S. Per Capita GDP.

Figure 21.3 ACF and PACF of U.S. Per Capita GDP. These plots are indicative of a time series that is not stationary.

Figure 21.4 Plot of the U.S. Per Capita GDP diffed twice.

Figure 21.5 ACF and PACF plots for the residuals of ideal model chosen by auto.arima.

Figure 21.6 Five year prediction of U.S. GDP. The thick line is the point estimate and the shaded regions represent the confidence intervals.

Figure 21.7 Time series plot of GDP data for all countries in the data. This is the same information as in Figure 21.1a, but this was built using base graphics.

Figure 21.8 Differenced GDP data.

Figure 21.9 Coefficient plots for VAR model of GDP data for Canada and Japan.

Figure 21.10 Time series plot of AT&T ticker data.

Figure 21.11 Series chart for AT&T.

Figure 21.12 Residual plots from GARCH model on AT&T data.

Figure 21.13 Predictions for GARCH model on AT&T data.

Figure 22.1 Plot of wine data scaled into two dimensions and color coded by results of K-means clustering.

Figure 22.2 Plot of wine data scaled into two dimensions and color coded by results of K-means clustering. The shapes indicate the cultivar. A strong correlation between the color and shape would indicate a good clustering.

Figure 22.3 Plot of Hartigan’s Rule for a series of different cluster sizes.

Figure 22.4 Confusion matrix for clustering of wine data by cultivars.

Figure 22.5 Gap curves for wine data. The blue curve is the observed within-cluster dissimilarity, and the green curve is the expected within-cluster dissimilarity. The red curve represents the Gap statistic (expected minus observed) and the error bars are the standard deviation of the gap.

Figure 22.6 Silhouette plot for country clustering. Each line represents an observation, and each grouping of lines is a cluster. Observations that fit the cluster well have large positive lines and observations that do not fit well have small or negative lines. A bigger average width for a cluster means a better clustering.

Figure 22.7 Map of PAM clustering of World Bank data. Gray countries either do not have World Bank information or were not properly matched up between the two datasets.

Figure 22.8 Hierarchical clustering of wine data.

Figure 22.9 Hierarchical clustering of country information data.

Figure 22.10 Wine hierarchical clusters with different linkage methods. Clockwise from top left: single, complete, centroid, average.

Figure 22.11 Hierarchical clustering of wine data split into three groups (red) and 13 groups (blue).

Figure 22.12 Hierarchical clustering of wine data split by the height of cuts.

Figure 23.1 Screenshot of LaTeX and R code in RStudio text editor. Notice that the code section is gray.

Figure 23.2 Simple plot of the numbers 1 through 10.
