Applied Unsupervised Learning with R

Copyright © 2019 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Authors: Alok Malik and Bradford Tuckfield

Technical Reviewer: Smitha Shivakumar

Managing Editor: Rutuja Yerunkar

Acquisitions Editor: Aditya Date

Production Editor: Nitesh Thakur

Editorial Board: David Barnes, Ewan Buckingham, Shivangi Chatterji, Simon Cox, Manasa Kumar, Alex Mazonowicz, Douglas Paterson, Dominic Pereira, Shiny Poojary, Saman Siddiqui, Erol Staveley, Ankita Thakur, and Mohita Vyas

First Published: March 2019

Production Reference: 1260319

ISBN: 978-1-78995-639-9

Published by Packt Publishing Ltd.

Livery Place, 35 Livery Street

Birmingham B3 2PB, UK

Table of Contents

Preface

Chapter 1: Introduction to Clustering Methods

Introduction

Introduction to Clustering

Uses of Clustering

Introduction to the Iris Dataset

Exercise 1: Exploring the Iris Dataset

Types of Clustering

Introduction to k-means Clustering

Euclidean Distance

Manhattan Distance

Cosine Distance

The Hamming Distance

k-means Clustering Algorithm

Steps to Implement k-means Clustering

Exercise 2: Implementing k-means Clustering on the Iris Dataset

Activity 1: k-means Clustering with Three Clusters

Introduction to k-means Clustering with Built-In Functions

k-means Clustering with Three Clusters

Exercise 3: k-means Clustering with R Libraries

Introduction to Market Segmentation

Exercise 4: Exploring the Wholesale Customer Dataset

Activity 2: Customer Segmentation with k-means

Introduction to k-medoids Clustering

The k-medoids Clustering Algorithm

k-medoids Clustering Code

Exercise 5: Implementing k-medoids Clustering

k-means Clustering versus k-medoids Clustering

Activity 3: Performing Customer Segmentation with k-medoids Clustering

Deciding the Optimal Number of Clusters

Types of Clustering Metrics

Silhouette Score

Exercise 6: Calculating the Silhouette Score

Exercise 7: Identifying the Optimum Number of Clusters

WSS/Elbow Method

Exercise 8: Using WSS to Determine the Number of Clusters

The Gap Statistic

Exercise 9: Calculating the Ideal Number of Clusters with the Gap Statistic

Activity 4: Finding the Ideal Number of Market Segments

Summary

Chapter 2: Advanced Clustering Methods

Introduction

Introduction to k-modes Clustering

Steps for k-modes Clustering

Exercise 10: Implementing k-modes Clustering

Activity 5: Implementing k-modes Clustering on the Mushroom Dataset

Introduction to Density-Based Clustering (DBSCAN)

Steps for DBSCAN

Exercise 11: Implementing DBSCAN

Uses of DBSCAN

Activity 6: Implementing DBSCAN and Visualizing the Results

Introduction to Hierarchical Clustering

Types of Similarity Metrics

Steps to Perform Agglomerative Hierarchical Clustering

Exercise 12: Agglomerative Clustering with Different Similarity Measures

Divisive Clustering

Steps to Perform Divisive Clustering

Exercise 13: Performing DIANA Clustering

Activity 7: Performing Hierarchical Cluster Analysis on the Seeds Dataset

Summary

Chapter 3: Probability Distributions

Introduction

Basic Terminology of Probability Distributions

Uniform Distribution

Exercise 14: Generating and Plotting Uniform Samples in R

Normal Distribution

Exercise 15: Generating and Plotting a Normal Distribution in R

Skew and Kurtosis

Log-Normal Distributions

Exercise 16: Generating a Log-Normal Distribution from a Normal Distribution

The Binomial Distribution

Exercise 17: Generating a Binomial Distribution

The Poisson Distribution

The Pareto Distribution

Introduction to Kernel Density Estimation

KDE Algorithm

Exercise 18: Visualizing and Understanding KDE

Exercise 19: Studying the Effect of Changing Kernels on a Distribution

Activity 8: Finding the Standard Distribution Closest to the Distribution of Variables of the Iris Dataset

Introduction to the Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov Test Algorithm

Exercise 20: Performing the Kolmogorov-Smirnov Test on Two Samples

Activity 9: Calculating the CDF and Performing the Kolmogorov-Smirnov Test with the Normal Distribution

Summary

Chapter 4: Dimension Reduction

Introduction

The Idea of Dimension Reduction

Exercise 21: Examining a Dataset that Contains the Chemical Attributes of Different Wines

Importance of Dimension Reduction

Market Basket Analysis

Exercise 22: Data Preparation for the Apriori Algorithm

Exercise 23: Passing through the Data to Find the Most Common Baskets

Exercise 24: More Passes through the Data

Exercise 25: Generating Associative Rules as the Final Step of the Apriori Algorithm

Principal Component Analysis

Linear Algebra Refresher

Matrices

Variance

Covariance

Exercise 26: Examining Variance and Covariance on the Wine Dataset

Eigenvectors and Eigenvalues

The Idea of PCA

Exercise 27: Performing PCA

Exercise 28: Performing Dimension Reduction with PCA

Activity 10: Performing PCA and Market Basket Analysis on a New Dataset

Summary

Chapter 5: Data Comparison Methods

Introduction

Hash Functions

Exercise 29: Creating and Using a Hash Function

Exercise 30: Verifying Our Hash Function

Analytic Signatures

Exercise 31: Perform the Data Preparation for Creating an Analytic Signature for an Image

Exercise 32: Creating a Brightness Comparison Function

Exercise 33: Creating a Function to Compare Image Sections to All of the Neighboring Sections

Exercise 34: Creating a Function that Generates an Analytic Signature for an Image

Activity 11: Creating an Image Signature for a Photograph of a Person

Comparison of Signatures

Activity 12: Creating an Image Signature for the Watermarked Image

Applying Other Unsupervised Learning Methods to Analytic Signatures

Latent Variable Models – Factor Analysis

Exercise 35: Preparing for Factor Analysis

Linear Algebra behind Factor Analysis

Exercise 36: More Exploration with Factor Analysis

Activity 13: Performing Factor Analysis

Summary

Chapter 6: Anomaly Detection

Introduction

Univariate Outlier Detection

Exercise 37: Performing an Exploratory Visual Check for Outliers Using R's boxplot Function

Exercise 38: Transforming a Fat-Tailed Dataset to Improve Outlier Classification

Exercise 39: Finding Outliers without Using R's Built-In boxplot Function

Exercise 40: Detecting Outliers Using a Parametric Method

Multivariate Outlier Detection

Exercise 41: Calculating Mahalanobis Distance

Detecting Anomalies in Clusters

Other Methods for Multivariate Outlier Detection

Exercise 42: Classifying Outliers based on Comparisons of Mahalanobis Distances

Detecting Outliers in Seasonal Data

Exercise 43: Performing Seasonality Modeling

Exercise 44: Finding Anomalies in Seasonal Data Using a Parametric Method

Contextual and Collective Anomalies

Exercise 45: Detecting Contextual Anomalies

Exercise 46: Detecting Collective Anomalies

Kernel Density

Exercise 47: Finding Anomalies Using Kernel Density Estimation

Continuing in Your Studies of Anomaly Detection

Activity 14: Finding Univariate Anomalies Using a Parametric Method and a Non-parametric Method

Activity 15: Using Mahalanobis Distance to Find Anomalies

Summary

Appendix
