Chapter 2. Machine Learning Best Practices

The purpose of this chapter is to provide a conceptual introduction to statistical machine learning (ML) techniques for readers who might not have been exposed to such approaches during their typical statistical training. The chapter also aims to take a newcomer from minimal knowledge of machine learning to being a knowledgeable practitioner in a few steps. The second part of the chapter focuses on recommendations for choosing the right machine learning algorithm depending on the application type and requirements, and then walks through some best practices for building large-scale machine learning pipelines. In a nutshell, the following topics will be discussed in this chapter:

  • What is machine learning?
  • Machine learning tasks
  • Practical machine learning problems
  • Large scale machine learning APIs in Spark
  • Practical machine learning best practices
  • Choosing the right algorithm for your application

What is machine learning?

In this section, we will try to define the term machine learning from the computer science, statistics and data analytical perspectives. Then we will show the steps of analytical machine learning applications. Finally, we will discuss some typical and emerging machine learning tasks and then name some practical machine learning problems that need to be addressed.

Machine learning in modern literature

Let's see how Tom Mitchell, a renowned professor of machine learning and Chair of the Machine Learning Department at Carnegie Mellon University, defines the term machine learning (Tom M. Mitchell, The Discipline of Machine Learning, CMU-ML-06-108, July 2006, http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf):

Machine Learning is a natural outgrowth of the intersection of Computer Science and Statistics. We might say the defining question of Computer Science is 'How can we build machines that solve problems, and which problems are inherently tractable/intractable?' The question that largely defines Statistics is 'What can be inferred from data plus a set of modelling assumptions, with what reliability?' The defining question for Machine Learning builds on both, but it is a distinct question. Whereas Computer Science has focused primarily on how to manually program computers, Machine Learning focuses on the question of how to get computers to program themselves (from experience plus some initial structure). Whereas Statistics has focused primarily on what conclusions can be inferred from data, Machine Learning incorporates additional questions about what computational architectures and algorithms can be used to most effectively capture, store, index, retrieve and merge these data, how multiple learning subtasks can be orchestrated in a larger system, and questions of computational tractability.

We believe that Prof. Mitchell's definition is largely self-explanatory. Nevertheless, in the next two sub-sections we will clarify machine learning further from the computer science, statistics, and data analytics perspectives.

Tip

Interested readers should consult other resources to gain more insight into machine learning and its theoretical foundations. Here are some useful links:

  • Machine learning: https://en.wikipedia.org/wiki/Machine_learning
  • Machine learning: what it is and why it matters: http://www.sas.com/en_us/insights/analytics/machine-learning.html
  • A Gentle Introduction To Machine Learning: https://www.youtube.com/watch?v=NOm1zA_Cats
  • What is machine learning, and how does it work: https://www.youtube.com/watch?v=elojMnjn4kk
  • Introduction to Data Analysis using Machine Learning: https://www.youtube.com/watch?v=U4IYsLgNgoY

Machine learning and computer science

Machine learning is a branch of computer science that studies the design of algorithms that can learn from data; it evolved largely from the study of pattern recognition and computational learning theory in artificial intelligence. Alan Turing famously posed an interesting question about machines: Can a machine think? There are, in fact, good reasons to believe that a sufficiently complex machine could one day pass the unrestricted Turing test; let us set that question aside until the Turing test is actually passed. What is certain, however, is that machines can learn. Subsequently, in 1959 Arthur Samuel became the first person to define machine learning, as a field of study that gives computers the ability to learn without being explicitly programmed. Typical machine learning tasks include concept learning, predictive modeling, classification, regression, clustering, dimensionality reduction, recommender systems, deep learning, and finding useful patterns in large-scale datasets.

The ultimate goal is to improve the learning in such a way that it becomes automatic, so that no human interaction is needed any more, or so that the level of human interaction is reduced as much as possible. Although machine learning is sometimes conflated with Knowledge Discovery and Data Mining (KDDM), the latter sub-field focuses more on exploratory data analysis, much of which is known as unsupervised learning - such as clustering analysis, anomaly detection, Artificial Neural Networks (ANN), and so on.

Other machine learning techniques include supervised learning, where a learning algorithm analyzes the training data and produces an inferred function that can be used to map new examples to predictions. Classification and regression analysis are two typical examples of supervised learning. Reinforcement learning, on the other hand, is inspired by behaviorist psychology (see also https://en.wikipedia.org/wiki/Behaviorism) and is typically concerned with how a software agent takes actions in an environment so as to maximize a reward function. Dynamic programming and intelligent agents are two examples of techniques associated with reinforcement learning.
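To make the idea of an inferred function concrete, here is a plain-Python sketch of our own (a toy illustration, not Spark's API): it fits a least-squares line to a handful of labeled training examples, and the inferred function then maps a new, unseen example to a prediction.

```python
# A toy supervised learner: fit y = w*x + b to labeled examples by
# ordinary least squares, then use the inferred function on new data.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    b = mean_y - w * mean_x
    return lambda x: w * x + b  # the inferred function

# Training data: labeled examples (x, y) generated by y = 2x + 1
model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(model(10))  # maps the new example x=10 to the prediction 21.0
```

The learner never sees x=10 during training; the mapping to 21.0 comes entirely from the function inferred from the four labeled examples.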

Typical machine learning applications can be classified into scientific knowledge discovery and more commercial applications, ranging from robotics and Human-Computer Interaction (HCI) to anti-spam filtering and recommender systems.

Machine learning in statistics and data analytics

Machine learning explores the study and construction of algorithms (see also https://en.wikipedia.org/wiki/Algorithm) that can learn (see also https://en.wikipedia.org/wiki/Learning) from data and make meaningful predictions on it. To make data-driven predictions or decisions, such algorithms operate by building a model (see also https://en.wikipedia.org/wiki/Mathematical_model) from training datasets, rather than by following strictly static program instructions. Machine learning is also closely related to, and often overlaps with, computational statistics, an applied field of statistics that focuses on making predictions through computerized approaches. In addition, it has strong ties to mathematical optimization, which supplies methods, theory, and application domains to the field. For tasks where deriving an explicit mathematical solution is not feasible because of the background knowledge it would demand, machine learning is well suited and can be applied as an alternative.

Within the field of data analytics, on the other hand, machine learning is a method used to devise complex models and algorithms that lend themselves to predicting future outcomes. These analytical models allow researchers, data scientists, engineers, and analysts to produce reliable, repeatable, and reproducible results and to mine hidden insights by learning from past relationships and trends in the data. Again we will refer to a famous definition from Prof. Mitchell, in which he explained what learning really means from the computer science perspective (Tom M. Mitchell, The Discipline of Machine Learning, CMU-ML-06-108, July 2006, http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf):

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Therefore, we can conclude that a computer program or machine can:

  • Learn from data and past history
  • Improve with experience
  • Iteratively enhance a model that can be used to predict the outcomes of questions
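Mitchell's definition can be illustrated with a contrived sketch of our own (a hypothetical example, not from the original text). Here the task T is classifying numbers as small or large, the experience E is a growing set of labeled examples, and the performance P is accuracy on a fixed test set, which improves as more experience is accumulated:

```python
def nn_classify(train, x):
    # 1-nearest-neighbour: return the label of the closest training example
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def accuracy(train, tests):  # the performance measure P
    return sum(nn_classify(train, x) == y for x, y in tests) / len(tests)

# Task T, with a hidden concept: x < 50 -> 'small', otherwise 'large'
label = lambda x: 'small' if x < 50 else 'large'
tests = [(x, label(x)) for x in range(0, 100, 7)]  # fixed test set

experience = [(x, label(x)) for x in (5, 60)]      # little experience E
p_before = accuracy(experience, tests)

# more experience E, near the decision boundary
experience += [(x, label(x)) for x in (40, 45, 55)]
p_after = accuracy(experience, tests)

print(p_before < p_after)  # True: performance P improved with experience E
```

With only two labeled examples the learner misplaces the boundary and misclassifies several test points; the three additional examples move its performance P, as measured on the same test set, up to a perfect score.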

Furthermore, the following diagram helps us to understand the whole process of machine learning:


Figure 1: Machine learning at a glance.

Typical machine learning workflow

A typical machine learning application involves several steps, from input through processing to output, that together form a scientific workflow, as shown in Figure 2. The following steps are involved in a typical machine learning application:

  1. Load the sample data.
  2. Parse the data into the input format for the algorithm.
  3. Pre-process the data and handle the missing values.
  4. Split the data into two sets, one for building the model (training dataset) and one for testing the model (test dataset or validation dataset).
  5. Run the algorithm to build or train your ML model.
  6. Make predictions with the training data and observe the results.
  7. Test and evaluate the model with the test data or alternatively validate the model with some cross-validator technique using the third dataset, called the validation dataset.
  8. Tune the model for better performance and accuracy.
  9. Scale-up the model so that it can handle massive datasets in the future.
  10. Deploy the ML model in production.

    Figure 2: Machine learning workflow.

Machine learning algorithms often have some means of handling skewness in datasets; nevertheless, a dataset can sometimes be immensely skewed. In step 4, the experimental dataset is typically split randomly into a training set and a test set, which is called sampling. The training dataset is used to train the model, whereas the test dataset is used to evaluate the performance of the best model at the very end. The better practice is to use as much of the training dataset as you can to improve generalization performance. On the other side, it is recommended to use the test dataset only once, to avoid the overfitting and underfitting problems while computing the prediction error and the related metrics.
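Steps 1 to 7 of the workflow can be sketched end-to-end in plain Python (a toy stand-in of our own, not Spark's MLlib API): load the data, split it randomly (sampling), train a model, check it on the training data, and evaluate it exactly once on the held-out test set.

```python
import random

# Toy end-to-end workflow: the "model" is a single learned threshold,
# standing in for a real ML algorithm.
data = [(x, x >= 50) for x in range(100)]        # 1-3: load, parse, pre-process

random.seed(42)                                  # 4: random sampling into
random.shuffle(data)                             #    training and test sets
cut = int(0.8 * len(data))
train, test = data[:cut], data[cut:]

pos = [x for x, y in train if y]                 # 5: "train" the model by
neg = [x for x, y in train if not y]             #    learning a threshold
threshold = (max(neg) + min(pos)) / 2
predict = lambda x: x >= threshold

train_acc = sum(predict(x) == y for x, y in train) / len(train)  # 6
test_acc = sum(predict(x) == y for x, y in test) / len(test)     # 7
print(train_acc)  # 1.0 - the threshold separates the training set perfectly
print(test_acc)   # evaluated exactly once, on data the model never saw
```

In a real pipeline, steps 5 to 8 would be repeated with different algorithms and hyperparameters, but the test set would still be touched only at the very end.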

Tip

Overfitting occurs when a statistical model describes the random error and noise in the data rather than the underlying relationship. It mostly arises when a model has too many parameters relative to the number of observations or features. Underfitting, on the other hand, refers to a model that can neither fit the training data nor generalize to new data.
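The contrast can be shown with a contrived example of our own (not from the text): a model that simply memorizes noisy training labels scores perfectly on the training data, but the memorized noise prevents it from beating a simple thresholded model on clean, unseen data.

```python
import random

random.seed(0)
truth = lambda x: x >= 50
# Training labels are noisy: roughly 20% are flipped at random
noisy = lambda x: truth(x) if random.random() > 0.2 else not truth(x)

train = [(x, noisy(x)) for x in range(0, 100, 2)]   # even x, noisy labels
test = [(x, truth(x)) for x in range(1, 100, 2)]    # odd x, clean labels

# Overfit model: memorise every training label; fall back to the
# nearest memorised example for unseen inputs
memory = dict(train)
overfit = lambda x: memory.get(x, memory[min(memory, key=lambda k: abs(k - x))])

# Simple model: a single fixed threshold
simple = lambda x: x >= 50

acc = lambda model, data: sum(model(x) == y for x, y in data) / len(data)
print(acc(overfit, train))  # 1.0 by construction: every label is memorised
print(acc(overfit, test))   # no better than the simple model on unseen data
```

The memorizer's perfect training score is exactly the warning sign: it has fit the noise, so its apparent performance does not carry over to new data, whereas the simple model generalizes.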

However, these steps involve several techniques, and we will discuss them in detail in Chapter 5, Supervised and Unsupervised Learning by Examples. Steps 9 and 10 are usually considered advanced steps, and they will consequently be discussed in later chapters.
