Chapter 1. The data science process
Listing 1.1. Building a decision tree
Listing 1.2. Plotting the confusion matrix
Listing 1.3. Plotting the relation between disposable income and loan outcome
Chapter 2. Loading data into R
Listing 2.1. Reading the UCI car data
Listing 2.2. Exploring the car data
Listing 2.3. Loading the credit dataset
Listing 2.4. Setting column names
Listing 2.5. Building a map to interpret loan use codes
Listing 2.6. Transforming the car data
Listing 2.7. Summary of Good.Loan and Purpose
Listing 2.8. PUMS data provenance documentation
Listing 2.9. SQL Screwdriver XML configuration file
Listing 2.10. Loading data with SQL Screwdriver
Listing 2.11. Loading data into R from a relational database
Listing 2.12. Selecting a subset of the Census data
Chapter 3. Exploring data
Listing 3.1. The summary() command
Listing 3.2. Will the variable is.employed be useful for modeling?
Listing 3.3. Examples of invalid values and outliers
Listing 3.4. Looking at the data range of a variable
Listing 3.5. Checking units can prevent inaccurate results later
Listing 3.6. Plotting a histogram
Listing 3.7. Producing a density plot
Listing 3.8. Creating a log-scaled density plot
Listing 3.9. Producing a horizontal bar chart
Listing 3.10. Producing a bar chart with sorted categories
Listing 3.11. Producing a line plot
Listing 3.12. Examining the correlation between age and income
Listing 3.13. Plotting the distribution of health.ins as a function of age
Listing 3.14. Producing a hexbin plot
Listing 3.15. Specifying different styles of bar chart
Chapter 4. Managing data
Listing 4.1. Checking locations of missing data
Listing 4.2. Remapping NA to a level
Listing 4.3. Converting missing numeric data to a level
Listing 4.4. Tracking original NAs with an extra categorical variable
Listing 4.5. Normalizing income by state
Listing 4.6. Converting age into ranges
Listing 4.7. Centering on mean age
Listing 4.9. Splitting into test and training using a random group mark
Listing 4.10. Ensuring test/train split doesn’t split inside a household
Chapter 5. Choosing and evaluating models
Listing 5.1. Building and applying a logistic regression spam model
Listing 5.2. Spam classifications
Listing 5.3. Spam confusion matrix
Listing 5.4. Entering data by hand
Listing 5.5. Plotting residuals
Listing 5.6. Making a double density plot
Listing 5.7. Plotting the receiver operating characteristic curve
Listing 5.8. Calculating log likelihood
Listing 5.9. Computing the null model’s log likelihood
Listing 5.10. Calculating entropy and conditional entropy
Listing 5.11. Clustering random data in the plane
Listing 5.12. Plotting our clusters
Listing 5.13. Calculating the size of each cluster
Listing 5.14. Calculating the typical distance between items in every pair of clusters
Chapter 6. Memorization methods
Listing 6.1. Preparing the KDD data for analysis
Listing 6.2. Plotting churn grouped by variable 218 levels
Listing 6.3. Churn rates grouped by variable 218 codes
Listing 6.4. Function to build single-variable models for categorical variables
Listing 6.5. Applying single-categorical variable models to all of our datasets
Listing 6.6. Scoring categorical variables by AUC
Listing 6.7. Scoring numeric variables by AUC
Listing 6.8. Plotting variable performance
Listing 6.9. Running a repeated cross-validation experiment
Listing 6.10. Empirically cross-validating performance
Listing 6.11. Basic variable selection
Listing 6.12. Selected categorical and numeric variables
Listing 6.13. Building a bad decision tree
Listing 6.14. Building another bad decision tree
Listing 6.15. Building yet another bad decision tree
Listing 6.16. Building a better decision tree
Listing 6.17. Printing the decision tree
Listing 6.18. Plotting the decision tree
Listing 6.19. Running k-nearest neighbors
Listing 6.20. Plotting 200-nearest neighbor performance
Listing 6.21. Plotting the receiver operating characteristic curve
Listing 6.22. Plotting the performance of a logistic regression model
Listing 6.23. Building, applying, and evaluating a Naive Bayes model
Chapter 7. Linear and logistic regression
Listing 7.1. Loading the PUMS data
Listing 7.2. Plotting log income as a function of predicted log income
Listing 7.3. Plotting residuals of log income as a function of predicted log income
Listing 7.4. Computing R-squared
Listing 7.5. Calculating root mean square error
Listing 7.6. Summarizing residuals
Listing 7.7. Loading the CDC data
Listing 7.8. Building the model formula
Listing 7.9. Fitting the logistic regression model
Listing 7.10. Applying the logistic regression model
Listing 7.11. Plotting distribution of prediction score grouped by known outcome
Listing 7.12. Exploring modeling trade-offs
Listing 7.13. Evaluating our chosen model
Listing 7.14. The model coefficients
Listing 7.15. The model summary
Listing 7.16. Calculating deviance residuals
Listing 7.17. Computing deviance
Listing 7.18. Calculating the significance of the observed fit
Chapter 8. Unsupervised methods
Listing 8.1. Reading the protein data
Listing 8.2. Rescaling the dataset
Listing 8.3. Hierarchical clustering
Listing 8.4. Extracting the clusters found by hclust()
Listing 8.5. Projecting the clusters on the first two principal components
Listing 8.6. Running clusterboot() on the protein data
Listing 8.7. Calculating total within sum of squares
Listing 8.8. The Calinski-Harabasz index
Listing 8.9. Evaluating clusterings with different numbers of clusters
Listing 8.10. Running k-means with k=5
Listing 8.11. Plotting cluster criteria
Listing 8.12. Running clusterboot() with k-means
Listing 8.13. A function to assign points to a cluster
Listing 8.14. An example of assigning points to clusters
Listing 8.15. Reading in the book data
Listing 8.16. Examining the transaction data
Listing 8.17. Examining the size distribution
Listing 8.18. Finding the ten most frequent books
Listing 8.19. Finding the association rules
Listing 8.21. Finding rules with restrictions
Chapter 9. Exploring advanced methods
Listing 9.1. Preparing Spambase data and evaluating the performance of decision trees
Listing 9.2. Bagging decision trees
Listing 9.3. Using random forests
Listing 9.4. Computing randomForest variable importance with importance()
Listing 9.5. Fitting with fewer variables
Listing 9.6. Preparing an artificial problem
Listing 9.7. Linear regression applied to our artificial example
Listing 9.8. GAM applied to our artificial example
Listing 9.9. Comparing linear regression and GAM performance
Listing 9.10. Extracting a learned spline from a GAM
Listing 9.11. Applying linear regression (with and without GAM) to health data
Listing 9.12. Plotting GAM results
Listing 9.13. Checking GAM model performance on hold-out data
Listing 9.14. GLM logistic regression
Listing 9.15. GAM logistic regression
Listing 9.16. An artificial kernel example
Listing 9.17. Applying stepwise linear regression to PUMS data
Listing 9.18. Applying an example explicit kernel transform
Listing 9.19. Modeling using the explicit kernel transform
Listing 9.20. Inspecting the results of the explicit kernel model
Listing 9.21. Setting up the spirals data as an example classification problem
Listing 9.22. SVM with a poor choice of kernel
Listing 9.23. SVM with a good choice of kernel
Listing 9.24. Revisiting the Spambase example with GLM
Listing 9.25. Applying an SVM to the Spambase example
Listing 9.26. Printing the SVM results summary
Listing 9.27. Shifting decision point to perform an apples-to-apples comparison
Chapter 10. Documentation and deployment
Listing 10.1. knitr-annotated Markdown
Listing 10.2. knitr LaTeX example
Listing 10.3. Setting knitr dependency options
Listing 10.4. Using the system() command to compute a file hash
Listing 10.5. Calculating model performance
Listing 10.6. Conditionally saving a file
Listing 10.7. Example code comment
Listing 10.9. Worse than useless comment
Listing 10.10. Checking your project status
Listing 10.11. Checking your project history
Listing 10.13. Viewing detailed project history
Listing 10.14. Finding line-based differences between two committed versions
Listing 10.16. Buzz model as an R-based HTTP service
Appendix A. Working with R and other tools
Listing A.1. Trying a few R commands
Listing A.2. Binding values to function arguments
Listing A.3. Demonstrating side effects
Listing A.4. R truth tables for Boolean operators
Listing A.5. Call-by-value effect
Listing A.6. Examples of R indexing operators
Listing A.7. R’s treatment of unexpected factor levels
Listing A.8. Confirming lm() encodes new strings correctly
Listing A.9. Loading UCI car data directly from GitHub using HTTPS
Listing A.10. Reading database data into R
Listing A.11. Loading an Excel spreadsheet
Listing A.12. The hotel reservation and price data
Listing A.13. Using melt() to restructure data
Appendix B. Important statistical concepts
Listing B.1. Plotting the theoretical normal density
Listing B.2. Plotting an empirical normal density
Listing B.3. Working with the normal CDF
Listing B.4. Plotting x < qnorm(0.75)
Listing B.5. Demonstrating some properties of the lognormal distribution
Listing B.6. Plotting the lognormal distribution
Listing B.7. Plotting the binomial distribution
Listing B.8. Working with the theoretical binomial distribution
Listing B.9. Simulating a binomial distribution
Listing B.10. Working with the binomial distribution
Listing B.11. Working with the binomial CDF
Listing B.12. Building simulated A/B test data
Listing B.13. Summarizing the A/B test into a contingency table
Listing B.14. Calculating the observed A and B rates
Listing B.15. Calculating the significance of the observed difference in rates
Listing B.16. Computing frequentist significance
Listing B.17. Bayesian estimate of the posterior tail mass
Listing B.18. Plotting the posterior distribution of the B group
Listing B.19. Sample size estimate
Listing B.20. Exact binomial sample size calculation
Listing B.21. Building a synthetic uncorrelated income example
Listing B.22. Calculating the (non)significance of the observed correlation
Listing B.23. Misleading significance result from biased observations
Listing B.24. Plotting biased view of income and capital gains
Listing B.25. Summarizing our synthetic biological data
Listing B.26. Building data that improves over time