Karthik Ramasubramanian and

Abhishek Singh

Machine Learning Using R

Karthik Ramasubramanian

New Delhi, Delhi, India

Abhishek Singh

New Delhi, Delhi, India

Any source code or other supplementary materials referenced by the author in this text are available to readers at www.apress.com. For detailed information about how to locate your book’s source code, go to www.apress.com/source-code/.

ISBN 978-1-4842-2333-8

e-ISBN 978-1-4842-2334-5

DOI 10.1007/978-1-4842-2334-5

Library of Congress Control Number: 2016961515

© Karthik Ramasubramanian and Abhishek Singh 2017

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or visit www.springer.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

To our parents for being the guiding light and a strong pillar of support.

And to our decade-long friendship.

Acknowledgments

We are grateful to our teachers, open source communities, and colleagues for enriching us with the knowledge and confidence to bring out the first edition of this book. The knowledge in this book is the accumulation of years of research work and professional experience gained at our alma mater and in industry. We are grateful to Prof R. Nadarajan and Prof R. Anitha, Department of Applied Mathematics and Computational Sciences, PSG College of Technology, Coimbatore, for their continuous support and encouragement for our efforts in the machine learning field.

In the rapidly changing world, the field of machine learning is evolving very fast and most of the latest developments are driven by the open source platform. We thank all the developers and contributors across the globe who are freely sharing their knowledge about these platforms. We also want to thank our colleagues from Snapdeal, Deloitte, and our current organizations—Hike and Prudential—for providing opportunities to experiment and create cutting-edge data science solutions.

Karthik would especially like to thank his father, Mr. S Ramasubramanian, for always being a source of inspiration in his life. He is immensely thankful to his supervisor, Mr. Nikhil Dwarakanath, director of the data science team at Snapdeal, for creating the opportunities to bring out the best analytics professional in him and providing the motivation to take up challenging projects.

Abhishek would like to thank his father, Mr. Charan Singh, a senior scientist in the India Meteorological Department, for introducing him to the power of data in weather forecasting in his formative years. On a personal front, Abhishek would like to thank his mother Jaya, sister Asweta, and brother Avilash for their continuous moral support.

We want to thank our publisher Apress, specifically Celestine, for providing us with this opportunity, Sanchita and Prachi for managing this project, Poonam and Piyush for their reviews, and everybody involved in the production team.

—Karthik Ramasubramanian

—Abhishek Singh

Contents

  1. Chapter 1:​ Introduction to Machine Learning and R
    1. 1.​1 Understanding the Evolution
      1. 1.​1.​1 Statistical Learning
      2. 1.​1.​2 Machine Learning (ML)
      3. 1.​1.​3 Artificial Intelligence (AI)
      4. 1.​1.​4 Data Mining
      5. 1.​1.​5 Data Science
    2. 1.​2 Probability and Statistics
      1. 1.​2.​1 Counting and Probability Definition
      2. 1.​2.​2 Events and Relationships
      3. 1.​2.​3 Randomness, Probability, and Distributions
      4. 1.​2.​4 Confidence Interval and Hypothesis Testing
    3. 1.​3 Getting Started with R
      1. 1.​3.​1 Basic Building Blocks
      2. 1.​3.​2 Data Structures in R
      3. 1.​3.​3 Subsetting
      4. 1.​3.​4 Functions and Apply Family
    4. 1.​4 Machine Learning Process Flow
      1. 1.​4.​1 Plan
      2. 1.​4.​2 Explore
      3. 1.​4.​3 Build
      4. 1.​4.​4 Evaluate
    5. 1.​5 Other Technologies
    6. 1.​6 Summary
    7. 1.​7 References
  2. Chapter 2:​ Data Preparation and Exploration
    1. 2.​1 Planning the Gathering of Data
      1. 2.​1.​1 Variables Types
      2. 2.​1.​2 Data Formats
      3. 2.​1.​3 Data Sources
    2. 2.​2 Initial Data Analysis (IDA)
      1. 2.​2.​1 Discerning a First Look
      2. 2.​2.​2 Organizing Multiple Sources of Data into One
      3. 2.​2.​3 Cleaning the Data
      4. 2.​2.​4 Supplementing with More Information
      5. 2.​2.​5 Reshaping
    3. 2.​3 Exploratory Data Analysis
      1. 2.​3.​1 Summary Statistics
      2. 2.​3.​2 Moment
    4. 2.​4 Case Study:​ Credit Card Fraud
      1. 2.​4.​1 Data Import
      2. 2.​4.​2 Data Transformation
      3. 2.​4.​3 Data Exploration
    5. 2.​5 Summary
    6. 2.​6 References
  3. Chapter 3:​ Sampling and Resampling Techniques
    1. 3.​1 Introduction to Sampling
    2. 3.​2 Sampling Terminology
      1. 3.​2.​1 Sample
      2. 3.​2.​2 Sampling Distribution
      3. 3.​2.​3 Population Mean and Variance
      4. 3.​2.​4 Sample Mean and Variance
      5. 3.​2.​5 Pooled Mean and Variance
      6. 3.​2.​6 Sample Point
      7. 3.​2.​7 Sampling Error
      8. 3.​2.​8 Sampling Fraction
      9. 3.​2.​9 Sampling Bias
      10. 3.​2.​10 Sampling Without Replacement (SWOR)
      11. 3.​2.​11 Sampling with Replacement (SWR)
    3. 3.​3 Credit Card Fraud:​ Population Statistics
      1. 3.​3.​1 Data Description
      2. 3.​3.​2 Population Mean
      3. 3.​3.​3 Population Variance
      4. 3.​3.​4 Pooled Mean and Variance
    4. 3.​4 Business Implications of Sampling
      1. 3.​4.​1 Features of Sampling
      2. 3.​4.​2 Shortcomings of Sampling
    5. 3.​5 Probability and Non-Probability Sampling
      1. 3.​5.​1 Types of Non-Probability Sampling
    6. 3.​6 Statistical Theory on Sampling Distributions
      1. 3.​6.​1 Law of Large Numbers:​ LLN
      2. 3.​6.​2 Central Limit Theorem
    7. 3.​7 Probability Sampling Techniques
      1. 3.​7.​1 Population Statistics
      2. 3.​7.​2 Simple Random Sampling
      3. 3.​7.​3 Systematic Random Sampling
      4. 3.​7.​4 Stratified Random Sampling
      5. 3.​7.​5 Cluster Sampling
      6. 3.​7.​6 Bootstrap Sampling
    8. 3.​8 Monte Carlo Method:​ Acceptance-Rejection Method
    9. 3.​9 A Qualitative Account of Computational Savings by Sampling
    10. 3.​10 Summary
  4. Chapter 4:​ Data Visualization in R
    1. 4.​1 Introduction to the ggplot2 Package
    2. 4.​2 World Development Indicators
    3. 4.​3 Line Chart
    4. 4.​4 Stacked Column Charts
    5. 4.​5 Scatterplots
    6. 4.​6 Boxplots
    7. 4.​7 Histograms and Density Plots
    8. 4.​8 Pie Charts
    9. 4.​9 Correlation Plots
    10. 4.​10 HeatMaps
    11. 4.​11 Bubble Charts
    12. 4.​12 Waterfall Charts
    13. 4.13 Dendrogram
    14. 4.​14 Wordclouds
    15. 4.​15 Sankey Plots
    16. 4.​16 Time Series Graphs
    17. 4.​17 Cohort Diagrams
    18. 4.​18 Spatial Maps
    19. 4.​19 Summary
    20. 4.​20 References
  5. Chapter 5:​ Feature Engineering
    1. 5.​1 Introduction to Feature Engineering
      1. 5.​1.​1 Filter Methods
      2. 5.​1.​2 Wrapper Methods
      3. 5.​1.​3 Embedded Methods
    2. 5.​2 Understanding the Working Data
      1. 5.​2.​1 Data Summary
      2. 5.​2.​2 Properties of Dependent Variable
      3. 5.​2.​3 Features Availability:​ Continuous or Categorical
      4. 5.​2.​4 Setting Up Data Assumptions
    3. 5.​3 Feature Ranking
    4. 5.​4 Variable Subset Selection
      1. 5.​4.​1 Filter Method
      2. 5.​4.​2 Wrapper Methods
      3. 5.​4.​3 Embedded Methods
    5. 5.​5 Dimensionality Reduction
    6. 5.​6 Feature Engineering Checklist
    7. 5.​7 Summary
    8. 5.​8 References
  6. Chapter 6:​ Machine Learning Theory and Practices
    1. 6.​1 Machine Learning Types
      1. 6.​1.​1 Supervised Learning
      2. 6.​1.​2 Unsupervised Learning
      3. 6.​1.​3 Semi-Supervised Learning
      4. 6.​1.​4 Reinforcement Learning
    2. 6.​2 Groups of Machine Learning Algorithms
    3. 6.​3 Real-World Datasets
      1. 6.​3.​1 House Sale Prices
      2. 6.​3.​2 Purchase Preference
      3. 6.​3.​3 Twitter Feeds and Article
      4. 6.​3.​4 Breast Cancer
      5. 6.​3.​5 Market Basket
      6. 6.​3.​6 Amazon Food Review
    4. 6.​4 Regression Analysis
    5. 6.​5 Correlation Analysis
      1. 6.​5.​1 Linear Regression
      2. 6.​5.​2 Simple Linear Regression
      3. 6.​5.​3 Multiple Linear Regression
      4. 6.​5.​4 Model Diagnostics:​ Linear Regression
      5. 6.​5.​5 Polynomial Regression
      6. 6.​5.​6 Logistic Regression
      7. 6.​5.​7 Logit Transformation
      8. 6.​5.​8 Odds Ratio
      9. 6.​5.​9 Model Diagnostics:​ Logistic Regression
      10. 6.​5.​10 Multinomial Logistic Regression
      11. 6.​5.​11 Generalized Linear Models
      12. 6.​5.​12 Conclusion
    6. 6.​6 Support Vector Machine SVM
      1. 6.​6.​1 Linear SVM
      2. 6.​6.​2 Binary SVM Classifier
      3. 6.​6.​3 Multi-Class SVM
      4. 6.​6.​4 Conclusion
    7. 6.​7 Decision Trees
      1. 6.​7.​1 Types of Decision Trees
      2. 6.​7.​2 Decision Measures
      3. 6.​7.​3 Decision Tree Learning Methods
      4. 6.​7.​4 Ensemble Trees
      5. 6.​7.​5 Conclusion
    8. 6.​8 The Naive Bayes Method
      1. 6.​8.​1 Conditional Probability
      2. 6.​8.​2 Bayes Theorem
      3. 6.​8.​3 Prior Probability
      4. 6.​8.​4 Posterior Probability
      5. 6.​8.​5 Likelihood and Marginal Likelihood
      6. 6.​8.​6 Naive Bayes Methods
      7. 6.​8.​7 Conclusion
    9. 6.​9 Cluster Analysis
      1. 6.​9.​1 Introduction to Clustering
      2. 6.​9.​2 Clustering Algorithms
      3. 6.​9.​3 Internal Evaluation
      4. 6.​9.​4 External Evaluation
      5. 6.​9.​5 Conclusion
    10. 6.​10 Association Rule Mining
      1. 6.​10.​1 Introduction to Association Concepts
      2. 6.​10.​2 Rule-Mining Algorithms
      3. 6.​10.​3 Recommendation Algorithms
      4. 6.​10.​4 Conclusion
    11. 6.​11 Artificial Neural Networks
      1. 6.​11.​1 Human Cognitive Learning
      2. 6.​11.​2 Perceptron
      3. 6.​11.​3 Sigmoid Neuron
      4. 6.​11.​4 Neural Network Architecture
      5. 6.​11.​5 Supervised versus Unsupervised Neural Nets
      6. 6.​11.​6 Neural Network Learning Algorithms
      7. 6.​11.​7 Feed-Forward Back-Propagation
      8. 6.​11.​8 Deep Learning
      9. 6.​11.​9 Conclusion
    12. 6.​12 Text-Mining Approaches
      1. 6.​12.​1 Introduction to Text Mining
      2. 6.​12.​2 Text Summarization
      3. 6.​12.​3 TF-IDF
      4. 6.​12.​4 Part-of-Speech (POS) Tagging
      5. 6.​12.​5 Word Cloud
      6. 6.​12.​6 Text Analysis:​ Microsoft Cognitive Services
      7. 6.​12.​7 Conclusion
    13. 6.​13 Online Machine Learning Algorithms
      1. 6.​13.​1 Fuzzy C-Means Clustering
      2. 6.​13.​2 Conclusion
    14. 6.​14 Model Building Checklist
    15. 6.​15 Summary
    16. 6.​16 References
  7. Chapter 7:​ Machine Learning Model Evaluation
    1. 7.​1 Dataset
      1. 7.​1.​1 House Sale Prices
      2. 7.​1.​2 Purchase Preference
    2. 7.​2 Introduction to Model Performance and Evaluation
    3. 7.​3 Objectives of Model Performance Evaluation
    4. 7.​4 Population Stability Index
    5. 7.​5 Model Evaluation for Continuous Output
      1. 7.​5.​1 Mean Absolute Error
      2. 7.​5.​2 Root Mean Square Error
      3. 7.​5.​3 R-Square
    6. 7.​6 Model Evaluation for Discrete Output
      1. 7.​6.​1 Classification Matrix
      2. 7.​6.​2 Sensitivity and Specificity
      3. 7.​6.​3 Area Under ROC Curve
    7. 7.​7 Probabilistic Techniques
      1. 7.​7.​1 K-Fold Cross Validation
      2. 7.​7.​2 Bootstrap Sampling
    8. 7.​8 The Kappa Error Metric
    9. 7.​9 Summary
    10. 7.​10 References
  8. Chapter 8:​ Model Performance Improvement
    1. 8.​1 Machine Learning and Statistical Modeling
    2. 8.​2 Overview of the Caret Package
    3. 8.​3 Introduction to Hyper-Parameters
    4. 8.​4 Hyper-Parameter Optimization
      1. 8.​4.​1 Manual Search
      2. 8.​4.​2 Manual Grid Search
      3. 8.​4.​3 Automatic Grid Search
      4. 8.​4.​4 Optimal Search
      5. 8.​4.​5 Random Search
      6. 8.​4.​6 Custom Searching
    5. 8.​5 The Bias and Variance Tradeoff
      1. 8.​5.​1 Bagging or Bootstrap Aggregation
      2. 8.​5.​2 Boosting
    6. 8.​6 Introduction to Ensemble Learning
      1. 8.​6.​1 Voting Ensembles
      2. 8.​6.​2 Advanced Methods in Ensemble Learning
    7. 8.​7 Ensemble Techniques Illustration in R
      1. 8.​7.​1 Bagging Trees
      2. 8.​7.​2 Gradient Boosting with a Decision Tree
      3. 8.​7.​3 Blending KNN and Rpart
      4. 8.​7.​4 Stacking Using caretEnsemble
    8. 8.​8 Advanced Topic:​ Bayesian Optimization of Machine Learning Models
    9. 8.​9 Summary
    10. 8.​10 References
  9. Chapter 9: Scalable Machine Learning and Related Technologies
    1. 9.​1 Distributed Processing and Storage
      1. 9.​1.​1 Google File System (GFS)
      2. 9.​1.​2 MapReduce
      3. 9.​1.​3 Parallel Execution in R
    2. 9.​2 The Hadoop Ecosystem
      1. 9.​2.​1 MapReduce
      2. 9.​2.​2 Hive
      3. 9.​2.​3 Apache Pig
      4. 9.​2.​4 HBase
      5. 9.​2.​5 Spark
    3. 9.​3 Machine Learning in R with Spark
      1. 9.​3.​1 Setting the Environment Variable
      2. 9.​3.​2 Initializing the Spark Session
      3. 9.​3.​3 Loading Data and the Running Pre-Process
      4. 9.​3.​4 Creating SparkDataFrame
      5. 9.​3.​5 Building the ML Model
      6. 9.​3.​6 Predicting the Test Data
      7. 9.​3.​7 Stopping the SparkR Session
    4. 9.​4 Machine Learning in R with H2O
      1. 9.​4.​1 Installation of Packages
      2. 9.​4.​2 Initialization of H2O Clusters
      3. 9.​4.​3 Deep Learning Demo in R with H2O
    5. 9.​5 Summary
    6. 9.​6 References
  10. Index

About the Authors and About the Technical Reviewer

About the Authors


Karthik Ramasubramanian works for one of the largest and fastest growing technology unicorns in India, Hike Messenger. He brings the best of business analytics and data science experience to his role at Hike Messenger. In his seven years of research and industry experience, he has worked on cross-industry data science problems in retail, e-commerce, and technology, developing and prototyping data-driven solutions. In his previous role at Snapdeal, one of the largest e-commerce retailers in India, he led core statistical modeling initiatives for customer growth and pricing analytics. Prior to Snapdeal, he was part of the central database team, managing the data warehouses for global business applications of Reckitt Benckiser (RB). He has vast experience working with scalable machine learning solutions for industry, including sophisticated graph networks and self-learning neural networks. He has a Master’s in Theoretical Computer Science from PSG College of Technology, Anna University, and is a certified big data professional. He is passionate about teaching and mentoring future data scientists through different online and public forums. He enjoys writing poems in his leisure time and is an avid traveler.


Abhishek Singh is a data scientist in the advanced data science team of Prudential Financial Inc., the second largest life insurance provider in the United States, and is based out of Ireland. He has five years of professional and academic experience in data science, spanning consulting, teaching, and financial services. At Deloitte Advisory, he led risk analytics initiatives for top U.S. banks in their regulatory risk, credit risk, and balance sheet modeling requirements. In his current role, he is working on scalable machine learning algorithms for the individual life insurance business of Prudential. He has working experience in time series models and has worked with cross-functional teams to implement data science solutions in enterprise infrastructure. He has been an active trainer at Deloitte Professional University and led training and development initiatives for professionals in the areas of statistics, economics, financial risk, and data science tools (SAS and R). He has a B.Tech. in mathematics and computing from the Indian Institute of Technology, Guwahati, and an MBA from the Indian Institute of Management, Bangalore. He speaks at public events on data science and works with leading universities toward bringing data science skills to graduates. He has a keen interest in law and holds a Post Graduate Diploma in Cyber Law from NALSAR University. He enjoys cooking and photography during his free hours.

About the Technical Reviewer


Jojo Moolayil is a data scientist and the author of the book Smarter Decisions – The Intersection of Internet of Things and Decision Science. With over four years of industrial experience in data science, decision science, and IoT, he has worked with industry leaders on high-impact and critical projects across multiple verticals. He is currently associated with General Electric, the pioneer and leader in data science for industrial IoT, and lives in Bengaluru, the Silicon Valley of India.

He was born and raised in Pune, India, and graduated from the University of Pune with a major in information technology engineering. He started his career with Mu Sigma Inc., the world’s largest pure-play analytics provider, and worked with the leaders of many Fortune 50 clients. One of the early enthusiasts to venture into IoT analytics, he brought his problem-solving frameworks and his knowledge of data and decision science to that field.

To cement his foundation in data science for industrial IoT and scale the impact of the problem-solving experiments, he joined a fast-growing IoT analytics startup called Flutura, based in Bangalore and headquartered in the valley. After a short stint with Flutura, Jojo moved on to work with the leaders of industrial IoT—General Electric, in Bangalore, where he focused on solving decision science problems for industrial IoT use cases. As a part of his role at GE, Jojo also focuses on developing data science and decision science products and platforms for industrial IoT.

Apart from authoring books on decision science and IoT, Jojo has also been a technical reviewer for books on machine learning and business analytics with Apress. He is an active data science tutor and also maintains a blog at http://www.jojomoolayil.com/web/blog/.

Profile: http://www.jojomoolayil.com/

https://www.linkedin.com/in/jojo62000

“I would like to thank my family, friends, and mentors for their kind support and constant motivation throughout my life.”

—Jojo Moolayil
