List of Listings

Chapter 1. The data science process

Listing 1.1. Building a decision tree

Listing 1.2. Plotting the confusion matrix

Listing 1.3. Plotting the relation between disposable income and loan outcome

Chapter 2. Loading data into R

Listing 2.1. Reading the UCI car data

Listing 2.2. Exploring the car data

Listing 2.3. Loading the credit dataset

Listing 2.4. Setting column names

Listing 2.5. Building a map to interpret loan use codes

Listing 2.6. Transforming the car data

Listing 2.7. Summary of Good.Loan and Purpose

Listing 2.8. PUMS data provenance documentation

Listing 2.9. SQL Screwdriver XML configuration file

Listing 2.10. Loading data with SQL Screwdriver

Listing 2.11. Loading data into R from a relational database

Listing 2.12. Selecting a subset of the Census data

Listing 2.13. Recoding variables

Listing 2.14. Summarizing the classifications of work

Chapter 3. Exploring data

Listing 3.1. The summary() command

Listing 3.2. Will the variable is.employed be useful for modeling?

Listing 3.3. Examples of invalid values and outliers

Listing 3.4. Looking at the data range of a variable

Listing 3.5. Checking units can prevent inaccurate results later

Listing 3.6. Plotting a histogram

Listing 3.7. Producing a density plot

Listing 3.8. Creating a log-scaled density plot

Listing 3.9. Producing a horizontal bar chart

Listing 3.10. Producing a bar chart with sorted categories

Listing 3.11. Producing a line plot

Listing 3.12. Examining the correlation between age and income

Listing 3.13. Plotting the distribution of health.ins as a function of age

Listing 3.14. Producing a hexbin plot

Listing 3.15. Specifying different styles of bar chart

Listing 3.16. Plotting data with a rug

Listing 3.17. Plotting a bar chart with and without facets

Chapter 4. Managing data

Listing 4.1. Checking locations of missing data

Listing 4.2. Remapping NA to a level

Listing 4.3. Converting missing numeric data to a level

Listing 4.4. Tracking original NAs with an extra categorical variable

Listing 4.5. Normalizing income by state

Listing 4.6. Converting age into ranges

Listing 4.7. Centering on mean age

Listing 4.8. Summarizing age

Listing 4.9. Splitting into test and training using a random group mark

Listing 4.10. Ensuring test/train split doesn’t split inside a household

Chapter 5. Choosing and evaluating models

Listing 5.1. Building and applying a logistic regression spam model

Listing 5.2. Spam classifications

Listing 5.3. Spam confusion matrix

Listing 5.4. Entering data by hand

Listing 5.5. Plotting residuals

Listing 5.6. Making a double density plot

Listing 5.7. Plotting the receiver operating characteristic curve

Listing 5.8. Calculating log likelihood

Listing 5.9. Computing the null model’s log likelihood

Listing 5.10. Calculating entropy and conditional entropy

Listing 5.11. Clustering random data in the plane

Listing 5.12. Plotting our clusters

Listing 5.13. Calculating the size of each cluster

Listing 5.14. Calculating the typical distance between items in every pair of clusters

Chapter 6. Memorization methods

Listing 6.1. Preparing the KDD data for analysis

Listing 6.2. Plotting churn grouped by variable 218 levels

Listing 6.3. Churn rates grouped by variable 218 codes

Listing 6.4. Function to build single-variable models for categorical variables

Listing 6.5. Applying single-categorical variable models to all of our datasets

Listing 6.6. Scoring categorical variables by AUC

Listing 6.7. Scoring numeric variables by AUC

Listing 6.8. Plotting variable performance

Listing 6.9. Running a repeated cross-validation experiment

Listing 6.10. Empirically cross-validating performance

Listing 6.11. Basic variable selection

Listing 6.12. Selected categorical and numeric variables

Listing 6.13. Building a bad decision tree

Listing 6.14. Building another bad decision tree

Listing 6.15. Building yet another bad decision tree

Listing 6.16. Building a better decision tree

Listing 6.17. Printing the decision tree

Listing 6.18. Plotting the decision tree

Listing 6.19. Running k-nearest neighbors

Listing 6.20. Plotting 200-nearest neighbor performance

Listing 6.21. Plotting the receiver operating characteristic curve

Listing 6.22. Plotting the performance of a logistic regression model

Listing 6.23. Building, applying, and evaluating a Naive Bayes model

Listing 6.24. Using a Naive Bayes package

Chapter 7. Linear and logistic regression

Listing 7.1. Loading the PUMS data

Listing 7.2. Plotting log income as a function of predicted log income

Listing 7.3. Plotting residuals as a function of predicted log income

Listing 7.4. Computing R-squared

Listing 7.5. Calculating root mean square error

Listing 7.6. Summarizing residuals

Listing 7.7. Loading the CDC data

Listing 7.8. Building the model formula

Listing 7.9. Fitting the logistic regression model

Listing 7.10. Applying the logistic regression model

Listing 7.11. Plotting distribution of prediction score grouped by known outcome

Listing 7.12. Exploring modeling trade-offs

Listing 7.13. Evaluating our chosen model

Listing 7.14. The model coefficients

Listing 7.15. The model summary

Listing 7.16. Calculating deviance residuals

Listing 7.17. Computing deviance

Listing 7.18. Calculating the significance of the observed fit

Listing 7.19. Calculating the pseudo R-squared

Listing 7.20. Calculating the Akaike information criterion

Chapter 8. Unsupervised methods

Listing 8.1. Reading the protein data

Listing 8.2. Rescaling the dataset

Listing 8.3. Hierarchical clustering

Listing 8.4. Extracting the clusters found by hclust()

Listing 8.5. Projecting the clusters on the first two principal components

Listing 8.6. Running clusterboot() on the protein data

Listing 8.7. Calculating total within sum of squares

Listing 8.8. The Calinski-Harabasz index

Listing 8.9. Evaluating clusterings with different numbers of clusters

Listing 8.10. Running k-means with k=5

Listing 8.11. Plotting cluster criteria

Listing 8.12. Running clusterboot() with k-means

Listing 8.13. A function to assign points to a cluster

Listing 8.14. An example of assigning points to clusters

Listing 8.15. Reading in the book data

Listing 8.16. Examining the transaction data

Listing 8.17. Examining the size distribution

Listing 8.18. Finding the ten most frequent books

Listing 8.19. Finding the association rules

Listing 8.20. Scoring rules

Listing 8.21. Finding rules with restrictions

Listing 8.22. Inspecting rules

Listing 8.23. Inspecting rules with restrictions

Chapter 9. Exploring advanced methods

Listing 9.1. Preparing Spambase data and evaluating the performance of decision trees

Listing 9.2. Bagging decision trees

Listing 9.3. Using random forests

Listing 9.4. randomForest variable importances

Listing 9.5. Fitting with fewer variables

Listing 9.6. Preparing an artificial problem

Listing 9.7. Linear regression applied to our artificial example

Listing 9.8. GAM applied to our artificial example

Listing 9.9. Comparing linear regression and GAM performance

Listing 9.10. Extracting a learned spline from a GAM

Listing 9.11. Applying linear regression (with and without GAM) to health data

Listing 9.12. Plotting GAM results

Listing 9.13. Checking GAM model performance on hold-out data

Listing 9.14. GLM logistic regression

Listing 9.15. GAM logistic regression

Listing 9.16. An artificial kernel example

Listing 9.17. Applying stepwise linear regression to PUMS data

Listing 9.18. Applying an example explicit kernel transform

Listing 9.19. Modeling using the explicit kernel transform

Listing 9.20. Inspecting the results of the explicit kernel model

Listing 9.21. Setting up the spirals data as an example classification problem

Listing 9.22. SVM with a poor choice of kernel

Listing 9.23. SVM with a good choice of kernel

Listing 9.24. Revisiting the Spambase example with GLM

Listing 9.25. Applying an SVM to the Spambase example

Listing 9.26. Printing the SVM results summary

Listing 9.27. Shifting the decision point to perform an apples-to-apples comparison

Chapter 10. Documentation and deployment

Listing 10.1. knitr-annotated Markdown

Listing 10.2. knitr LaTeX example

Listing 10.3. Setting knitr dependency options

Listing 10.4. Using the system() command to compute a file hash

Listing 10.5. Calculating model performance

Listing 10.6. Conditionally saving a file

Listing 10.7. Example code comment

Listing 10.8. Useless comment

Listing 10.9. Worse than useless comment

Listing 10.10. Checking your project status

Listing 10.11. Checking your project history

Listing 10.12. Annoying work

Listing 10.13. Viewing detailed project history

Listing 10.14. Finding line-based differences between two committed versions

Listing 10.15. git remote

Listing 10.16. Buzz model as an R-based HTTP service

Listing 10.17. Calling the buzz HTTP service

Listing 10.18. Exporting the random forest model

Appendix A. Working with R and other tools

Listing A.1. Trying a few R commands

Listing A.2. Binding values to function arguments

Listing A.3. Demonstrating side effects

Listing A.4. R truth tables for Boolean operators

Listing A.5. Call-by-value effect

Listing A.6. Examples of R indexing operators

Listing A.7. R’s treatment of unexpected factor levels

Listing A.8. Confirming that lm() encodes new strings correctly

Listing A.9. Loading UCI car data directly from GitHub using HTTPS

Listing A.10. Reading database data into R

Listing A.11. Loading an Excel spreadsheet

Listing A.12. The hotel reservation and price data

Listing A.13. Using melt to restructure data

Listing A.14. Assembling many rows using SQL

Listing A.15. Showing our hotel model results

Appendix B. Important statistical concepts

Listing B.1. Plotting the theoretical normal density

Listing B.2. Plotting an empirical normal density

Listing B.3. Working with the normal CDF

Listing B.4. Plotting x < qnorm(0.75)

Listing B.5. Demonstrating some properties of the lognormal distribution

Listing B.6. Plotting the lognormal distribution

Listing B.7. Plotting the binomial distribution

Listing B.8. Working with the theoretical binomial distribution

Listing B.9. Simulating a binomial distribution

Listing B.10. Working with the binomial distribution

Listing B.11. Working with the binomial CDF

Listing B.12. Building simulated A/B test data

Listing B.13. Summarizing the A/B test into a contingency table

Listing B.14. Calculating the observed A and B rates

Listing B.15. Calculating the significance of the observed difference in rates

Listing B.16. Computing frequentist significance

Listing B.17. Bayesian estimate of the posterior tail mass

Listing B.18. Plotting the posterior distribution of the B group

Listing B.19. Sample size estimate

Listing B.20. Exact binomial sample size calculation

Listing B.21. Building a synthetic uncorrelated income example

Listing B.22. Calculating the (non)significance of the observed correlation

Listing B.23. Misleading significance result from biased observations

Listing B.24. Plotting biased view of income and capital gains

Listing B.25. Summarizing our synthetic biological data

Listing B.26. Building data that improves over time

Listing B.27. A bad model (due to omitted variable bias)

Listing B.28. A better model
