Applied Unsupervised Learning with R

Copyright © 2019 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Authors: Alok Malik and Bradford Tuckfield

Technical Reviewer: Smitha Shivakumar

Managing Editor: Rutuja Yerunkar

Acquisitions Editor: Aditya Date

Production Editor: Nitesh Thakur

Editorial Board: David Barnes, Ewan Buckingham, Shivangi Chatterji, Simon Cox, Manasa Kumar, Alex Mazonowicz, Douglas Paterson, Dominic Pereira, Shiny Poojary, Saman Siddiqui, Erol Staveley, Ankita Thakur, and Mohita Vyas

First Published: March 2019

Production Reference: 1260319

ISBN: 978-1-78995-639-9

Published by Packt Publishing Ltd.

Livery Place, 35 Livery Street

Birmingham B3 2PB, UK

Table of Contents

Preface

Chapter 1: Introduction to Clustering Methods

Introduction

Introduction to Clustering

Uses of Clustering

Introduction to the Iris Dataset

Exercise 1: Exploring the Iris Dataset

Types of Clustering

Introduction to k-means Clustering

Euclidean Distance

Manhattan Distance

Cosine Distance

The Hamming Distance

k-means Clustering Algorithm

Steps to Implement k-means Clustering

Exercise 2: Implementing k-means Clustering on the Iris Dataset

Activity 1: k-means Clustering with Three Clusters

Introduction to k-means Clustering with Built-In Functions

k-means Clustering with Three Clusters

Exercise 3: k-means Clustering with R Libraries

Introduction to Market Segmentation

Exercise 4: Exploring the Wholesale Customer Dataset

Activity 2: Customer Segmentation with k-means

Introduction to k-medoids Clustering

The k-medoids Clustering Algorithm

k-medoids Clustering Code

Exercise 5: Implementing k-medoid Clustering

k-means Clustering versus k-medoids Clustering

Activity 3: Performing Customer Segmentation with k-medoids Clustering

Deciding the Optimal Number of Clusters

Types of Clustering Metrics

Silhouette Score

Exercise 6: Calculating the Silhouette Score

Exercise 7: Identifying the Optimum Number of Clusters

WSS/Elbow Method

Exercise 8: Using WSS to Determine the Number of Clusters

The Gap Statistic

Exercise 9: Calculating the Ideal Number of Clusters with the Gap Statistic

Activity 4: Finding the Ideal Number of Market Segments

Summary

Chapter 2: Advanced Clustering Methods

Introduction

Introduction to k-modes Clustering

Steps for k-Modes Clustering

Exercise 10: Implementing k-modes Clustering

Activity 5: Implementing k-modes Clustering on the Mushroom Dataset

Introduction to Density-Based Clustering (DBSCAN)

Steps for DBSCAN

Exercise 11: Implementing DBSCAN

Uses of DBSCAN

Activity 6: Implementing DBSCAN and Visualizing the Results

Introduction to Hierarchical Clustering

Types of Similarity Metrics

Steps to Perform Agglomerative Hierarchical Clustering

Exercise 12: Agglomerative Clustering with Different Similarity Measures

Divisive Clustering

Steps to Perform Divisive Clustering

Exercise 13: Performing DIANA Clustering

Activity 7: Performing Hierarchical Cluster Analysis on the Seeds Dataset

Summary

Chapter 3: Probability Distributions

Introduction

Basic Terminology of Probability Distributions

Uniform Distribution

Exercise 14: Generating and Plotting Uniform Samples in R

Normal Distribution

Exercise 15: Generating and Plotting a Normal Distribution in R

Skew and Kurtosis

Log-Normal Distributions

Exercise 16: Generating a Log-Normal Distribution from a Normal Distribution

The Binomial Distribution

Exercise 17: Generating a Binomial Distribution

The Poisson Distribution

The Pareto Distribution

Introduction to Kernel Density Estimation

KDE Algorithm

Exercise 18: Visualizing and Understanding KDE

Exercise 19: Studying the Effect of Changing Kernels on a Distribution

Activity 8: Finding the Standard Distribution Closest to the Distribution of Variables of the Iris Dataset

Introduction to the Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov Test Algorithm

Exercise 20: Performing the Kolmogorov-Smirnov Test on Two Samples

Activity 9: Calculating the CDF and Performing the Kolmogorov-Smirnov Test with the Normal Distribution

Summary

Chapter 4: Dimension Reduction

Introduction

The Idea of Dimension Reduction

Exercise 21: Examining a Dataset that Contains the Chemical Attributes of Different Wines

Importance of Dimension Reduction

Market Basket Analysis

Exercise 22: Data Preparation for the Apriori Algorithm

Exercise 23: Passing through the Data to Find the Most Common Baskets

Exercise 24: More Passes through the Data

Exercise 25: Generating Associative Rules as the Final Step of the Apriori Algorithm

Principal Component Analysis

Linear Algebra Refresher

Matrices

Variance

Covariance

Exercise 26: Examining Variance and Covariance on the Wine Dataset

Eigenvectors and Eigenvalues

The Idea of PCA

Exercise 27: Performing PCA

Exercise 28: Performing Dimension Reduction with PCA

Activity 10: Performing PCA and Market Basket Analysis on a New Dataset

Summary

Chapter 5: Data Comparison Methods

Introduction

Hash Functions

Exercise 29: Creating and Using a Hash Function

Exercise 30: Verifying Our Hash Function

Analytic Signatures

Exercise 31: Perform the Data Preparation for Creating an Analytic Signature for an Image

Exercise 32: Creating a Brightness Comparison Function

Exercise 33: Creating a Function to Compare Image Sections to All of the Neighboring Sections

Exercise 34: Creating a Function that Generates an Analytic Signature for an Image

Activity 11: Creating an Image Signature for a Photograph of a Person

Comparison of Signatures

Activity 12: Creating an Image Signature for the Watermarked Image

Applying Other Unsupervised Learning Methods to Analytic Signatures

Latent Variable Models – Factor Analysis

Exercise 35: Preparing for Factor Analysis

Linear Algebra behind Factor Analysis

Exercise 36: More Exploration with Factor Analysis

Activity 13: Performing Factor Analysis

Summary

Chapter 6: Anomaly Detection

Introduction

Univariate Outlier Detection

Exercise 37: Performing an Exploratory Visual Check for Outliers Using R's boxplot Function

Exercise 38: Transforming a Fat-Tailed Dataset to Improve Outlier Classification

Exercise 39: Finding Outliers without Using R's Built-In boxplot Function

Exercise 40: Detecting Outliers Using a Parametric Method

Multivariate Outlier Detection

Exercise 41: Calculating Mahalanobis Distance

Detecting Anomalies in Clusters

Other Methods for Multivariate Outlier Detection

Exercise 42: Classifying Outliers based on Comparisons of Mahalanobis Distances

Detecting Outliers in Seasonal Data

Exercise 43: Performing Seasonality Modeling

Exercise 44: Finding Anomalies in Seasonal Data Using a Parametric Method

Contextual and Collective Anomalies

Exercise 45: Detecting Contextual Anomalies

Exercise 46: Detecting Collective Anomalies

Kernel Density

Exercise 47: Finding Anomalies Using Kernel Density Estimation

Continuing in Your Studies of Anomaly Detection

Activity 14: Finding Univariate Anomalies Using a Parametric Method and a Non-parametric Method

Activity 15: Using Mahalanobis Distance to Find Anomalies

Summary

Appendix
