List of Figures

Chapter 1. The data science process

Figure 1.1. The lifecycle of a data science project: loops within loops

Figure 1.2. The fraction of defaulting loans by credit history category. The dark region of each bar represents the fraction of loans in that category that defaulted.

Figure 1.3. A decision tree model for finding bad loan applications, with confidence scores

Figure 1.4. Notional slide from an executive presentation

Chapter 2. Loading data into R

Figure 2.1. Car data viewed as a table

Figure 2.2. SQuirreL SQL table explorer

Figure 2.3. Browsing PUMS data using SQuirreL SQL

Figure 2.4. Strings encoded as indicators

Chapter 3. Exploring data

Figure 3.1. Some information is easier to read from a graph, and some from a summary.

Figure 3.2. A unimodal distribution (gray) can usually be modeled as coming from a single population of users. With a bimodal distribution (black), your data often comes from two populations of users.

Figure 3.3. A histogram tells you where your data is concentrated. It also visually highlights outliers and anomalies.

Figure 3.4. Density plots show where data is concentrated. This plot also highlights a population of higher-income customers.

Figure 3.5. The density plot of income on a log10 scale highlights details of the income distribution that are harder to see in a regular density plot.

Figure 3.6. Bar charts show the distribution of categorical variables.

Figure 3.7. A horizontal bar chart can be easier to read when there are several categories with long names.

Figure 3.8. Sorting the bar chart by count makes it even easier to read.

Figure 3.9. Example of a line plot

Figure 3.10. A scatter plot of income versus age

Figure 3.11. A scatter plot of income versus age, with a linear fit

Figure 3.12. A scatter plot of income versus age, with a smoothing curve

Figure 3.13. Distribution of customers with health insurance, as a function of age

Figure 3.14. Hexbin plot of income versus age, with a smoothing curve superimposed in white

Figure 3.15. Health insurance versus marital status: stacked bar chart

Figure 3.16. Health insurance versus marital status: side-by-side bar chart

Figure 3.17. Health insurance versus marital status: filled bar chart

Figure 3.18. Health insurance versus marital status: filled bar chart with rug

Figure 3.19. Distribution of marital status by housing type: side-by-side bar chart

Figure 3.20. Distribution of marital status by housing type: faceted side-by-side bar chart

Chapter 4. Managing data

Figure 4.1. Variables with missing values

Figure 4.2. Health insurance coverage versus income (log10 scale)

Figure 4.3. Is a 35-year-old young?

Figure 4.4. A nearly lognormal distribution and its log

Figure 4.5. Signed log lets you visualize non-positive data on a logarithmic scale.

Figure 4.6. Example of a dataset with customers and households

Figure 4.7. Example of a dataset with customers and households

Chapter 5. Choosing and evaluating models

Figure 5.1. Schematic model construction and evaluation

Figure 5.2. Assigning products to product categories

Figure 5.3. Notional example of determining the probability that a transaction is fraudulent

Figure 5.4. Notional example of clustering your customers by purchase pattern and purchase amount

Figure 5.5. Notional example of finding purchase patterns in your data

Figure 5.6. Look to customers whose movie-watching patterns are similar to JaneB's for her movie recommendations.

Figure 5.7. Scoring residuals

Figure 5.8. Distribution of score broken up by known classes

Figure 5.9. ROC curve for the email spam example

Figure 5.10. Clustering example

Figure 5.11. A notional illustration of overfitting

Chapter 6. Memorization methods

Figure 6.1. Performance of variable 126 on calibration data

Figure 6.2. Graphical representation of a decision tree

Figure 6.3. Performance of 200-nearest neighbors on calibration data

Figure 6.4. ROC of 200-nearest neighbors on calibration data

Chapter 7. Linear and logistic regression

Figure 7.1. Fit versus actuals for y = x²

Figure 7.2. Building a linear model using the lm() command

Figure 7.3. Making predictions with a linear regression model

Figure 7.4. Plot of actual log income as a function of predicted log income

Figure 7.5. Plot of residual error as a function of prediction

Figure 7.6. The model coefficients

Figure 7.7. Model summary

Figure 7.8. Model summary coefficient columns

Figure 7.9. Distribution of score broken up by positive examples (TRUE) and negative examples (FALSE)

Figure 7.10. Enrichment (top) and recall (bottom) plotted as functions of threshold for the training set

Chapter 8. Unsupervised methods

Figure 8.1. An example of data in three clusters

Figure 8.2. Dendrogram of countries clustered by protein consumption

Figure 8.3. Plot of countries clustered by protein consumption, projected onto first two principal components

Figure 8.4. Cluster 5: The Mediterranean cluster. Its members are separated from the other clusters, but also from each other.

Figure 8.5. Plot of the Calinski-Harabasz and WSS indices for 1–10 clusters, on protein data

Figure 8.6. The k-means procedure. The two cluster centers are represented by the outlined star and diamond.

Figure 8.7. Plot of the Calinski-Harabasz and average silhouette width indices for 1–10 clusters, on protein data

Figure 8.8. A density plot of basket sizes

Chapter 9. Exploring advanced methods

Figure 9.1. Plot of the most important variables in the spam model, as measured by accuracy

Figure 9.2. A spline that has been fit through a series of points

Figure 9.3. Linear model’s predictions versus actual response. The solid line is the line of perfect prediction (prediction=actual).

Figure 9.4. GAM’s predictions versus actual response. The solid line is the theoretical line of perfect prediction (prediction=actual).

Figure 9.5. Top: The nonlinear function s(PWGT) discovered by gam(), as output by plot(gam.model). Bottom: The same spline superimposed over the training data

Figure 9.6. Smoothing curves of each of the four input variables plotted against birth weight, compared with the splines discovered by gam(). All curves have been shifted to be zero mean for comparison of shape.

Figure 9.7. Notional illustration of a kernel transform (based on Cristianini and Shawe-Taylor, 2000)

Figure 9.8. Notional illustration of SVM

Figure 9.9. The spiral counter-example

Figure 9.10. Identity kernel failing to learn the spiral concept

Figure 9.11. Radial kernel successfully learning the spiral concept

Chapter 10. Documentation and deployment

Figure 10.1. knitr process schematic

Figure 10.2. Simple knitr Markdown result

Figure 10.3. Simple knitr LaTeX result

Figure 10.4. knitr documentation of buzz data load

Figure 10.5. knitr documentation of prepared buzz workspace

Figure 10.6. Version control saving the day

Figure 10.7. RStudio new project pane

Figure 10.8. RStudio Git controls

Figure 10.9. gitk browsing https://github.com/WinVector/zmPDSwR

Figure 10.10. git pull: rebase versus merge

Figure 10.11. Top of HTML form that asks server for buzz classification on submit

Figure 10.12. One tree from the buzz random forest model

Chapter 11. Producing effective presentations

Figure 11.1. Motivation for project

Figure 11.2. Stating the project goal

Figure 11.3. Describing the project and its results

Figure 11.4. Discussing your work in more detail

Figure 11.5. Optional slide on the modeling method

Figure 11.6. Discussing future work

Figure 11.7. Motivation for project

Figure 11.8. User workflow before and after the model

Figure 11.9. Present the model’s benefits from the users’ perspective.

Figure 11.10. Provide technical details that are relevant to the users.

Figure 11.11. Describe how the users will interact with the model.

Figure 11.12. An example instructional slide

Figure 11.13. Ask the users for feedback.

Figure 11.14. Introducing the project

Figure 11.15. Discussing related work

Figure 11.16. Introducing the pilot study

Figure 11.17. Discussing model inputs and modeling approach

Figure 11.18. Showing model performance

Figure 11.19. Discussing future work

Appendix A. Working with R and other tools

Figure A.1. SQuirreL SQL driver configuration

Figure A.2. SQuirreL SQL connection alias

Figure A.3. SQuirreL SQL table commands

Figure A.4. Hotel data in spreadsheet form

Figure A.5. Hotel data in spreadsheet form

Appendix B. Important statistical concepts

Figure B.1. The normal distribution with mean 0 and standard deviation 1

Figure B.2. The empirical distribution of points drawn from a normal with mean 0 and standard deviation 1. The dotted line represents the theoretical normal distribution.

Figure B.3. Illustrating x < qnorm(0.75)

Figure B.4. Top: The lognormal distribution X such that mean(log(X))=0 and sd(log(X))=1. The dashed line is the theoretical distribution, and the solid line is the distribution of a random lognormal sample. Bottom: The solid line is the distribution of log(X).

Figure B.5. The 75th percentile of the lognormal distribution with meanlog=0, sdlog=1

Figure B.6. The binomial distributions for 50 coin tosses, with coins of various fairnesses (probability of landing on heads)

Figure B.7. The observed distribution of the count of girls in 100 classrooms of size 20, when the population is 50% female. The theoretical distribution is shown with the dashed line.

Figure B.8. Posterior distribution of the B conversion rate. The dashed line is the A conversion rate.

Figure B.9. Earned income versus capital gains

Figure B.10. Biased earned income versus capital gains

Figure B.11. View of rows from the bioavailability dataset
