Chapter 5.  Supervised and Unsupervised Learning by Examples

In Chapter 2, Machine Learning Best Practices, readers learned some theoretical underpinnings of basic machine learning techniques. Chapter 3, Understanding the Problem by Understanding the Data, described basic data manipulation using Spark's APIs such as RDD, DataFrame, and Dataset. Chapter 4, Extracting Knowledge through Feature Engineering, described feature engineering from both the theoretical and practical points of view. In this chapter, the reader will learn the practical know-how needed to apply supervised and unsupervised techniques quickly and effectively to new problems, through some widely used examples that build on the understanding gained in the previous chapters. These examples will be demonstrated from the Spark perspective. In a nutshell, the following topics will be covered throughout this chapter:

  • Machine learning classes
  • Supervised learning
  • Unsupervised learning
  • Recommender system
  • Advanced learning and generalization

Machine learning classes

As stated in Chapter 1, Introduction to Data Analytics with Spark, and Chapter 2, Machine Learning Best Practices, machine learning techniques can be categorized into three major classes of algorithms: supervised learning, unsupervised learning, and recommender systems. Classification and regression algorithms are widely used in supervised learning applications, whereas clustering falls in the category of unsupervised learning. In this section, we will describe some examples of the supervised learning technique.

We will then show how the same examples can be implemented using Spark. An example of the clustering technique is discussed in the Unsupervised learning section. A regression technique models the past relationship between variables to predict their future changes (up or down). In contrast, a classification technique takes a set of data with known labels and learns how to label new records based on that information. Here are two real-life examples of classification and regression, respectively:

  • Example (classification): Gmail uses a machine learning technique called classification to decide whether an e-mail is spam or not, based on the data in the e-mail.
  • Example (regression): Suppose you are an online currency trader working on Forex or Fortrade, and you have two currency pairs in mind to buy or sell, say GBP/USD and USD/JPY. If you look at these two pairs carefully, USD is common to both. By looking at the historical prices of USD, GBP, and JPY, you can predict the future movement and decide whether to open the trade as a buy or a sell. These types of problems can be addressed with supervised learning techniques using regression analysis:

    Figure 1: Classification, clustering, and collaborative filtering: the big picture
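
The regression example above can be sketched in a few lines. The following is a minimal illustration in plain Python rather than Spark, so that it stays self-contained. The price series is invented, and the `fit_line` helper and the choice of a straight least-squares line over the time index are assumptions made purely for illustration, not a trading method.

```python
def fit_line(ys):
    """Ordinary least-squares fit of y = slope * t + intercept for t = 0..n-1."""
    n = len(ys)
    ts = range(n)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    var = sum((t - mean_t) ** 2 for t in ts)
    slope = cov / var
    intercept = mean_y - slope * mean_t
    return slope, intercept

# Hypothetical GBP/USD closing prices (invented numbers, oldest first)
prices = [1.30, 1.31, 1.33, 1.32, 1.35, 1.36]

slope, intercept = fit_line(prices)

# Extrapolate the fitted line one step past the last observation
next_price = slope * len(prices) + intercept
direction = "up" if next_price > prices[-1] else "down"
print(f"predicted next price: {next_price:.4f} ({direction})")
```

The model learns the past relationship between time and price and extrapolates it one step ahead; comparing the prediction with the last known price gives the up or down signal described above.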

On the other hand, clustering and dimensionality reduction are commonly used for unsupervised learning. Here are some examples:

  • Example (clustering): Google News uses a technique called clustering to group news articles into different categories, based on title and content. Clustering algorithms discover groupings that occur in collections of data.
  • Example (collaborative filtering): The collaborative filtering algorithm is often used in recommendation system development. Renowned companies such as Amazon and Netflix use collaborative filtering to determine which products users will like, based on their history and their similarity to other users.
  • Example (dimensionality reduction): Dimensionality reduction is often used to make a high-dimensional dataset more manageable. For example, suppose you have an image of size 2048x1920 and you would like to reduce the dimension to 1080x720 without sacrificing much quality. In this case, popular algorithms such as Principal Component Analysis (PCA) or Singular Value Decomposition (SVD) can be used. PCA can itself be implemented by means of SVD, which is one reason why SVD is more widely used.
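
To make the dimensionality reduction idea concrete, here is a minimal sketch in plain Python that reduces a handful of 2-D points to one dimension with PCA. Note the assumptions: the data points are invented, the helper names `principal_component` and `project` are hypothetical, and the leading component is found with power iteration on the covariance matrix instead of SVD, only to keep the sketch free of external libraries. In practice, a library SVD routine would normally be used.

```python
import math

def principal_component(points, iterations=100):
    """Return (mean, unit vector) for the first principal component of 2-D points."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(2)]
    centered = [[p[0] - mean[0], p[1] - mean[1]] for p in points]
    # 2x2 sample covariance matrix of the centered data
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(2)]
           for i in range(2)]
    # Power iteration: repeatedly multiply by cov and renormalise
    v = [1.0, 0.0]
    for _ in range(iterations):
        w = [cov[0][0] * v[0] + cov[0][1] * v[1],
             cov[1][0] * v[0] + cov[1][1] * v[1]]
        norm = math.hypot(w[0], w[1])
        v = [w[0] / norm, w[1] / norm]
    return mean, v

def project(points, mean, v):
    """Project each point onto the component, giving one coordinate per point."""
    return [(p[0] - mean[0]) * v[0] + (p[1] - mean[1]) * v[1] for p in points]

# Invented points lying roughly along the line y = x
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]

mean, direction = principal_component(points)
reduced = project(points, mean, direction)
print(direction, reduced)
```

Each 2-D point is replaced by a single number, its position along the direction of greatest variance, which is how PCA keeps most of the information while dropping a dimension.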

Supervised learning

As already stated, a supervised learning application makes predictions based on a set of examples, and the goal is to learn general rules that map inputs to outputs in a way that agrees with the real world. For example, a dataset for spam filtering usually contains spam messages as well as non-spam messages, so we know which messages in the training set are spam and which are not. Supervised learning is therefore the machine learning technique of inferring a function from labeled training data. The following steps are involved in supervised learning tasks:

  • Train the ML model with the training dataset
  • Use the test dataset to test the model performance

The dataset used for training the ML model is therefore labeled with the value of interest, and a supervised learning algorithm looks for patterns that relate the data to those labels. After the algorithm has found the required patterns, they can be used to make predictions for unlabeled test data.
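
The two steps above can be sketched as follows. This is a minimal illustration in plain Python, not Spark: the messages are invented, the `fit` and `predict` helpers are hypothetical, and the word-count scoring rule is a deliberately simple stand-in for a real classification algorithm. A fixed split is used here only so that the result is reproducible; a random split is the usual choice.

```python
from collections import Counter

# Hypothetical labeled messages: (text, label), invented for illustration
data = [
    ("win cash prize now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("claim your free prize", "spam"),
    ("lunch on friday", "ham"),
    ("free cash offer now", "spam"),
    ("project status report", "ham"),
    ("win a free offer", "spam"),
    ("monday project meeting", "ham"),
]

# Hold out the last two examples for testing and train on the rest
train, test = data[:6], data[6:]

def fit(examples):
    """Step 1: count how often each word appears under each label."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    return counts

def predict(counts, text):
    """Label a message by which class has seen its words more often."""
    words = text.split()
    spam_score = sum(counts["spam"][w] for w in words)
    ham_score = sum(counts["ham"][w] for w in words)
    return "spam" if spam_score > ham_score else "ham"

model = fit(train)

# Step 2: measure accuracy on the held-out test set
correct = sum(predict(model, text) == label for text, label in test)
accuracy = correct / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

The labels of the test messages are used only to score the predictions, never to build the model, which is what makes the test accuracy a fair estimate of performance on new data.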

Supervised learning has diverse uses and is commonly applied in bioinformatics, cheminformatics, database marketing, handwriting recognition, information retrieval, object recognition in computer vision, optical character recognition, spam detection, pattern recognition, speech recognition, and so on. In most of these applications, the classification technique is used. Supervised learning can also be viewed as a special case of downward causation in biological systems.

Tip

More on how the supervised learning technique works from the theoretical perspective can be found in these books: Vapnik, V. N., The Nature of Statistical Learning Theory (2nd Ed.), Springer Verlag, 2000; and Mohri, M., Rostamizadeh, A., and Talwalkar, A., Foundations of Machine Learning, The MIT Press, 2012, ISBN 9780262018258.

Supervised learning example

Classification is a family of supervised machine learning algorithms that assign an input to one of several pre-defined classes. Some common use cases for classification include:

  • Credit card fraud detection
  • E-mail spam detection

Classification data is labeled, for example, as spam/non-spam or fraud/non-fraud. Machine learning assigns a label or class to new data. You classify something based on pre-determined features. Features are the if questions that you ask, and the label is the answer to those questions. For example, if an object walks, swims, and quacks like a duck, then the label would be duck. Similarly, if a flight's departure or arrival is delayed by more than, say, 1 hour, it would be labeled as a delay; otherwise it would be labeled as not a delay.
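
The relationship between features and labels in the flight example can be sketched as follows. This is a minimal illustration in plain Python: the function name `label_flight`, the parameter names, and the sample delays are all hypothetical. Here the rule is written by hand only to show what a label is; a classification algorithm would instead learn such a rule from examples that are already labeled.

```python
def label_flight(departure_delay_min, arrival_delay_min, threshold_min=60):
    """Return 'delayed' if either delay exceeds the threshold, else 'not delayed'."""
    if departure_delay_min > threshold_min or arrival_delay_min > threshold_min:
        return "delayed"
    return "not delayed"

# Invented (departure delay, arrival delay) pairs in minutes: the features
flights = [(5, 10), (75, 40), (20, 90), (60, 60)]

# The labels that answer the "if" questions asked of each flight
labels = [label_flight(dep, arr) for dep, arr in flights]
print(labels)
```

A flight delayed by exactly 60 minutes is not labeled as delayed, because the rule in the text says more than 1 hour.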
