Chapter 1. The data science process
Figure 1.1. The lifecycle of a data science project: loops within loops
Figure 1.3. A decision tree model for finding bad loan applications, with confidence scores
Chapter 2. Loading data into R
Figure 2.1. Car data viewed as a table
Figure 2.2. SQuirreL SQL table explorer
Chapter 3. Exploring data
Figure 3.1. Some information is easier to read from a graph, and some from a summary.
Figure 3.6. Bar charts show the distribution of categorical variables.
Figure 3.8. Sorting the bar chart by count makes it even easier to read.
Figure 3.9. Example of a line plot
Figure 3.10. A scatter plot of income versus age
Figure 3.11. A scatter plot of income versus age, with a linear fit
Figure 3.12. A scatter plot of income versus age, with a smoothing curve
Figure 3.13. Distribution of customers with health insurance, as a function of age
Figure 3.14. Hexbin plot of income versus age, with a smoothing curve superimposed in white
Figure 3.15. Health insurance versus marital status: stacked bar chart
Figure 3.16. Health insurance versus marital status: side-by-side bar chart
Figure 3.17. Health insurance versus marital status: filled bar chart
Figure 3.18. Health insurance versus marital status: filled bar chart with rug
Figure 3.19. Distribution of marital status by housing type: side-by-side bar chart
Figure 3.20. Distribution of marital status by housing type: faceted side-by-side bar chart
Chapter 4. Managing data
Figure 4.1. Variables with missing values
Figure 4.2. Health insurance coverage versus income (log10 scale)
Figure 4.3. Is a 35-year-old young?
Figure 4.4. A nearly lognormal distribution and its log
Figure 4.5. Signed log lets you visualize non-positive data on a logarithmic scale.
Figure 4.6. Example of dataset with customers and households
Figure 4.7. Example of dataset with customers and households
Chapter 5. Choosing and evaluating models
Figure 5.1. Schematic model construction and evaluation
Figure 5.2. Assigning products to product categories
Figure 5.3. Notional example of determining the probability that a transaction is fraudulent
Figure 5.4. Notional example of clustering your customers by purchase pattern and purchase amount
Figure 5.5. Notional example of finding purchase patterns in your data
Figure 5.8. Distribution of score broken up by known classes
Figure 5.9. ROC curve for the email spam example
Chapter 6. Memorization methods
Figure 6.1. Performance of variable 126 on calibration data
Figure 6.2. Graphical representation of a decision tree
Figure 6.3. Performance of 200-nearest neighbors on calibration data
Figure 6.4. ROC of 200-nearest neighbors on calibration data
Chapter 7. Linear and logistic regression
Figure 7.1. Fit versus actuals for y=x^2
Figure 7.2. Building a linear model using the lm() command
Figure 7.3. Making predictions with a linear regression model
Figure 7.4. Plot of actual log income as a function of predicted log income
Figure 7.5. Plot of residual error as a function of prediction
Figure 7.6. The model coefficients
Chapter 8. Unsupervised methods
Figure 8.1. An example of data in three clusters
Figure 8.2. Dendrogram of countries clustered by protein consumption
Figure 8.5. Plot of the Calinski-Harabasz and WSS indices for 1–10 clusters, on protein data
Chapter 9. Exploring advanced methods
Figure 9.1. Plot of the most important variables in the spam model, as measured by accuracy
Figure 9.2. A spline that has been fit through a series of points
Figure 9.8. Notional illustration of SVM
Figure 9.9. The spiral counter-example
Figure 9.10. Identity kernel failing to learn the spiral concept
Figure 9.11. Radial kernel successfully learning the spiral concept
Chapter 10. Documentation and deployment
Figure 10.1. knitr process schematic
Figure 10.2. Simple knitr Markdown result
Figure 10.3. Simple knitr LaTeX result
Figure 10.4. knitr documentation of buzz data load
Figure 10.5. knitr documentation of prepared buzz workspace
Figure 10.6. Version control saving the day
Figure 10.7. RStudio new project pane
Figure 10.8. RStudio Git controls
Figure 10.9. gitk browsing https://github.com/WinVector/zmPDSwR
Figure 10.10. git pull: rebase versus merge
Figure 10.11. Top of HTML form that asks server for buzz classification on submit
Chapter 11. Producing effective presentations
Figure 11.1. Motivation for project
Figure 11.2. Stating the project goal
Figure 11.3. Describing the project and its results
Figure 11.4. Discussing your work in more detail
Figure 11.5. Optional slide on the modeling method
Figure 11.6. Discussing future work
Figure 11.7. Motivation for project
Figure 11.8. User workflow before and after the model
Figure 11.9. Present the model’s benefits from the users’ perspective.
Figure 11.10. Provide technical details that are relevant to the users.
Figure 11.11. Describe how the users will interact with the model.
Figure 11.12. An example instructional slide
Figure 11.13. Ask the users for feedback.
Figure 11.14. Introducing the project
Figure 11.15. Discussing related work
Figure 11.16. Introducing the pilot study
Figure 11.17. Discussing model inputs and modeling approach
Appendix A. Working with R and other tools
Figure A.1. SQuirreL SQL driver configuration
Figure A.2. SQuirreL SQL connection alias
Figure A.3. SQuirreL SQL table commands
Appendix B. Important statistical concepts
Figure B.1. The normal distribution with mean 0 and standard deviation 1
Figure B.3. Illustrating x < qnorm(0.75)
Figure B.5. The 75th percentile of the lognormal distribution with meanlog=0, sdlog=1
Figure B.9. Earned income versus capital gains