Practical machine learning best practices

In this section, we will describe some good machine learning practices that should be followed while developing a machine learning application of particular interest, as depicted in Figure 7:


Figure 7: Machine learning systematic process.

A scalable and accurate ML application demands a systematic approach to its development, from problem definition to presenting the results. The process can be summarized in four steps: problem definition and formulation, data preparation, finding suitable machine learning algorithms, and finally, presenting the results after deployment of the machine learning model. These steps are depicted in Figure 7.

Best practice before developing an ML application

The learning of a machine learning system can be formulated as the sum of representation, evaluation, and optimization. In other words, according to Pedro Domingos (A Few Useful Things to Know about Machine Learning, https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf):

Learning = Representation + Evaluation + Optimization

Taking this formulation into consideration, we will provide some recommendations for practitioners before getting into ML application development.

Good machine learning and data science are worth a lot

So what do we need for effective machine learning application development? We actually need four arsenals before we start developing an ML application, including:

  • The data primitives (or, to be more precise, the experimental data).
  • A pipeline synthesis tool (to understand the data and control flow during the machine learning steps).
  • Effective and robust error analysis tools.
  • A verification or validation tool (to verify or validate the prediction accuracy or performance of the ML model).

Most importantly, however, without a strong theoretical foundation and good data science, which are worth a huge amount, the whole process will be in vain. In fact, many data scientists and machine learning experts often quote a statement like this: if you can pose your problem as a simple optimization problem, then you are almost done (see Data Analytics & R, http://advanceddataanalytics.net/2015/01/31/condensed-news-7/).

This means that, before you start your machine learning voyage, if you can identify whether your problem really is a machine learning problem, you will be able to find suitable algorithms to develop your ML application. Of course, in practice, most machine learning applications cannot be reduced to simple optimization problems. Therefore, it is the duty of a data scientist like you to manage and maintain complex datasets, and then to handle the analytical problems that arise when engineering the machine learning pipeline to tackle the issues we mentioned earlier.

Therefore, the best practice is to use the Spark MLlib, Spark ML, GraphX, and Spark Core APIs, together with sound data science heuristics, when developing your machine learning applications. Now you might be wondering what you would get out of this; well, the benefits are obvious, and they are as follows:

  • Built-in distributed algorithms
  • In-memory and disk-based data computation and processing
  • In-memory capabilities for iterative workloads
  • Algorithmic accuracy and performance
  • Faster data cleaning, feature engineering and feature selection, training, and testing
  • Real-time visualization of the predictive results
  • Tuning towards better performance
  • Adaptability for new datasets
  • Scalability with the increasing datasets

Best practice – feature engineering and algorithmic performance

In best practice, feature engineering should be considered one of the most important parts of machine learning. Non-technically speaking, the task is to find a better representation of the features in your experimental dataset. In parallel to this, the choice of learning algorithms or techniques is also important, as is parameter tuning; however, the final choice is more a matter of experimentation with the ML model you will be developing.

In practice, however, it is straightforward to establish a naive performance baseline by means of an out-of-the-box method (a feature of a product of interest that works straight away after installation or configuration, also referred to as OOTB functionality for short) and good data pre-processing. Therefore, you should do this continually in order to know where the baseline is and whether its performance is at a satisfactory level or good enough for your requirements.
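For example, the following minimal sketch shows what such an OOTB baseline might look like in Spark ML, using a default random forest with no tuning at all (it assumes Spark 2.x, and that training and test are existing DataFrames with label and features columns):

    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

    // An out-of-the-box baseline: default hyperparameters, no tuning at all.
    // `training` and `test` are assumed to be existing DataFrames with
    // "label" and "features" columns.
    val baseline = new RandomForestClassifier().fit(training)
    val predictions = baseline.transform(test)

    val accuracy = new MulticlassClassificationEvaluator()
      .setMetricName("accuracy")
      .evaluate(predictions)
    println(s"OOTB baseline accuracy: $accuracy")

Any more elaborate model you build later should beat this baseline before it earns its extra complexity.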

Once you've trained all of your out-of-the-box methods, it's always a good idea to try bagging them together. Moreover, in order to solve ML problems, you often need to accept the reality that computationally hard problems (such as those shown in section 2) need either domain-specific knowledge, a lot of digging into the data, or both. Consequently, the combination of a widely accepted feature engineering technique and domain-specific knowledge will help your ML algorithm/application/system solve prediction-related problems.

In a nutshell, if you have the required dataset and a robust algorithm that can take advantage of the dataset by learning its complex features, success is almost guaranteed. Furthermore, sometimes domain experts can be wrong in selecting good features; therefore, the incorporation of multiple domain experts (problem domain experts), more well-structured data, and ML expertise is always helpful.

Last but not least, we sometimes recommend considering the error rate, and not only the accuracy. For example, on an imbalanced dataset, an ML system with 99% overall accuracy that misclassifies 50% of the rare positive class is worse than one with 90% overall accuracy that misclassifies only 25% of it.
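The following minimal sketch illustrates this idea with MLlib's MulticlassMetrics; it assumes that predictionAndLabels is an existing RDD of (prediction, label) pairs obtained from a scored test set:

    import org.apache.spark.mllib.evaluation.MulticlassMetrics

    // `predictionAndLabels` is assumed to be an RDD[(Double, Double)] of
    // (predicted label, true label) pairs from a scored test set
    val metrics = new MulticlassMetrics(predictionAndLabels)
    println(s"Overall accuracy: ${metrics.accuracy}")

    // Per-class error rates: on an imbalanced dataset these can be poor even
    // when the overall accuracy looks excellent
    metrics.labels.foreach { label =>
      println(s"Error rate for class $label: ${1.0 - metrics.recall(label)}")
    }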

Beware of overfitting and underfitting

A common mistake often made by novice data scientists is falling prey to overfitting, which can occur when an ML model is built by memorizing the training data without generalizing from it. More technically, if you evaluate your model on the training data instead of on test or validation data, you probably won't be able to tell whether your model is overfitting. The common symptoms are:

  • Predictive accuracy on the data used for training can be suspiciously high (that is, sometimes even 100%)
  • The model may perform only slightly better than random prediction on new data

Sometimes the ML model itself becomes underfit for a particular tuning or dataset, which means the model has become too simplistic. Our recommendations (shared, we believe, by others as well) are as follows:

  • Split the dataset into two sets to detect overfitting situations: the first one, used for training and model selection, is called the training set; the second one is the test set, used for evaluating the model, as stated in the ML workflow section
  • Alternatively, you can also avoid overfitting by using simpler models (for example, linear classifiers in preference to Gaussian kernel SVMs) or by increasing the regularization parameters of your ML model (if available)
  • Tune the model with the correct parameter values to avoid both overfitting and underfitting

Hastie et al. (Trevor Hastie, Robert Tibshirani, Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, 2009), on the other hand, recommend splitting a large-scale dataset into three sets: a training set (roughly 50%), a validation set (25%), and a test set (25%). They also suggest building the model using the training set and calculating the prediction errors using the validation set. The test set is recommended for assessing the generalization error of the final model.
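In Spark, a minimal sketch of this three-way split could look like the following (assuming dataset is an existing DataFrame of labeled examples):

    // Hastie et al.'s rough 50/25/25 split; `dataset` is assumed to be an
    // existing DataFrame of labeled examples
    val Array(training, validation, test) =
      dataset.randomSplit(Array(0.5, 0.25, 0.25), seed = 12345L)

    // Build candidate models on `training`, estimate their prediction errors
    // on `validation`, and touch `test` only once, to assess the
    // generalization error of the final chosen model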

If the amount of labeled data available for supervised learning is small, splitting the dataset this way is not recommended. In that case, use cross-validation or train-validation split techniques (these will be discussed in Chapter 7, Tuning Machine Learning Models, with several examples). More specifically, divide the dataset into 10 parts of (roughly) equal size; then, for each of these ten parts, iteratively train the classifier on the remaining nine parts and use the held-out part to test the model.
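A minimal sketch of this 10-fold procedure, using MLlib's kFold utility and logistic regression as a stand-in classifier (data is assumed to be an existing RDD[LabeledPoint]), might look like this:

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.mllib.util.MLUtils

    // `data` is assumed to be an existing RDD[LabeledPoint]
    val folds = MLUtils.kFold(data, numFolds = 10, seed = 42)

    // Train on nine parts and test on the held-out part, once per fold
    val aucs = folds.map { case (training, validation) =>
      val model = new LogisticRegressionWithLBFGS().run(training)
      model.clearThreshold() // emit raw scores rather than 0/1 labels, for AUC
      val scoreAndLabels = validation.map(lp => (model.predict(lp.features), lp.label))
      new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
    }
    println(s"Mean AUC over 10 folds: ${aucs.sum / aucs.length}")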

Staying tuned and combining Spark MLlib with Spark ML

The first step of pipeline design is to create the building blocks (as a directed or undirected graph consisting of nodes and edges) and to make the links between those blocks. Nevertheless, as a data scientist, you should also focus on scaling and optimizing the nodes (primitives), so that you are able to scale up your application to handle large-scale datasets at a later stage and keep your ML pipeline performing consistently. The pipeline process will also help you make your model adaptive to new datasets. However, some of these primitives might be explicitly defined for particular domains and data types (for example, text, images, video, audio, and spatiotemporal data).

Beyond these types of data, the primitives should also work for general-purpose statistics and mathematics. Casting your ML model in terms of these primitives will make your workflow more transparent, interpretable, accessible, and explainable. A recent example is ML-Matrix, a distributed matrix library that can be used on top of Spark:


Figure 8: Staying tuned and interoperating ML, MLlib, and GraphX.

As we already stated in the previous section, as a developer you can seamlessly combine the implementation techniques of Spark MLlib with the algorithms developed in Spark ML, Spark SQL, GraphX, and Spark Streaming to build hybrid or interoperable ML applications on top of RDDs, DataFrames, and Datasets, as shown in Figure 8. For example, an IoT-based real-time application could be developed using such a hybrid model. Therefore, the recommendation here is to stay tuned to, and synchronized with, the latest technologies around you for the betterment of your ML application.
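As a minimal sketch of this interoperability (assuming Spark 2.x and a tiny hand-made dataset, purely for illustration), the following program builds an RDD of MLlib LabeledPoints, crosses over into the DataFrame world, and trains a DataFrame-based Spark ML estimator on it:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.sql.SparkSession

    object HybridMLlibWithML {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("HybridMLlibWithML")
          .master("local[*]") // assumption: local mode, just for the sketch
          .getOrCreate()

        // The RDD-based world (Spark MLlib): a tiny hand-made dataset
        val rdd = spark.sparkContext.parallelize(Seq(
          LabeledPoint(0.0, Vectors.dense(0.1, 1.2)),
          LabeledPoint(1.0, Vectors.dense(1.5, 0.3)),
          LabeledPoint(0.0, Vectors.dense(0.2, 0.9)),
          LabeledPoint(1.0, Vectors.dense(1.7, 0.1))))

        // Cross over to the DataFrame-based world (Spark ML/Spark SQL),
        // converting the old mllib vectors to the new ml vector type
        val df = MLUtils.convertVectorColumnsToML(spark.createDataFrame(rdd))

        // Train a DataFrame-based estimator on data that began life as an RDD
        val model = new LogisticRegression().setMaxIter(10).fit(df)
        model.transform(df).select("label", "prediction").show()

        spark.stop()
      }
    }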

Making ML applications modular and simplifying pipeline synthesis

Another good and often-used practice when building an ML pipeline is to make the ML system modular. Some supervised learning problems can be solved using very simple models, commonly referred to as generalized linear models; however, this depends on the data you will be using, and other problems simply cannot be solved that way.

Therefore, to combine a series of simple linear binary classifiers, try to employ a lightweight modular architecture. This might be at the workflow level or at the algorithm level. The advantages are obvious, since a modular architecture lets your application handle massive amounts of data flow in a parallel and distributed way. Consequently, we suggest you adopt the three key innovative mechanisms mentioned in the literature: weighted threshold sampling, logistic calibration, and intelligent data partitioning (see, for example, Yu Jin, Nick Duffield, Jeffrey Erman, Patrick Haffner, Subhabrata Sen, Zhi-Li Zhang, A Modular Machine Learning System for Flow-Level Traffic Classification in Large Networks, ACM Transactions on Knowledge Discovery from Data, Volume 6, Issue 1, March 2012). The target is to achieve scalability and high throughput while attaining high accuracy in the predicted results from your ML application/system. While primitives can serve as building blocks, you still need some other tools that enable users to build ML pipelines.

Subsequently, workflow tools have become more common these days, and such tools exist for data engineers, data scientists, and even business analysts; examples include Alteryx, RapidMiner, Alpine Data, and Dataiku. We stress business analysts here because, in the very last phase, your target customer will be a business that values your ML model, right? The latest releases of Spark come with the Spark ML API for building machine learning pipelines, which constitutes a domain-specific language (see https://en.wikipedia.org/wiki/Domain-specific_language) for pipelines.
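For instance, here is a minimal sketch of that pipeline DSL, based on the well-known tokenizer/TF/logistic regression example from the Spark documentation (training is assumed to be an existing DataFrame with text and label columns):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Each stage is a self-contained module with its own parameters
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    // The pipeline wires the modules together; swapping one stage for another
    // does not disturb the rest of the workflow
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // `training` is assumed to be a DataFrame with "text" and "label" columns
    val model = pipeline.fit(training)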

Thinking of an innovative ML system

The viewpoint behind machine learning is to automate the creation of analytical models in order to develop algorithms that learn continuously with the help of the available data. Continuously evolving models produce increasingly positive results and reduce the need for human interaction. This enables ML models to automatically produce reliable and repeatable predictions.

More technically, suppose you are planning to develop a recommender system using ML algorithms. So, what is the target of developing that recommender system? And what are some innovative ideas for product development with machine learning? These are two typical questions that should be considered before you start developing your ML application or system. Consistent innovation can be challenging, and when advancing with new ideas, it can also be tough to comprehend where the greatest benefit lies. Machine learning can support innovation along a variety of paths, such as determining weaknesses in current products, predictive analysis, or identifying previously concealed patterns.

As a result, you will have to think of large-scale computing to train your ML model offline; later on, your recommender system has to be able to serve online recommendations in the manner of a conventional search engine. Thus, your ML application will be valued by a business if your system:

  • Can forecast which items customers will buy using your machine learning application
  • Can perform product analysis
  • Can identify emerging trends in production

Thinking and becoming smarter about Big Data complexities

As shown in Figure 9, new business models are the unavoidable extension of utilizing the available data, so consideration of big data and its business value can make the business analyst's job, life, and thinking smarter, which results in your target company delivering value to its customers. In addition to this, you will also have to investigate (analyze, to be more exact) rival or better-performing companies.

Now the question is, how do you collect and use enterprise data? Big data is not only about size (volume); it is also characterized by its velocity, veracity, variety, and value. Each of these complexities needs its own treatment: velocity, for example, can be addressed using Spark Streaming, since streaming data is also big data that needs a real-time analytical approach. Other dimensions, such as volume and variety, can be handled using Spark Core and Spark MLlib/ML for big data processing.

Well, you will have to manage the data by hook or by crook. If you are able to manage the data, the insights extracted from it can really shake up the way businesses operate by exploiting the useful features of big data:


Figure 9: Machine learning in Big Data best practice.

At this point, data alone is not enough (see Pedro Domingos, A Few Useful Things to Know about Machine Learning, https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf); extracting meaningful features from the data and putting the semantics of the data into the model is more important. This is what tech giants such as LinkedIn are doing with large-scale machine learning frameworks, for example for feature targeting in their community, which is more or less a supervised learning technique. The workflow is as follows (a sketch in code follows the list):

  • Fetch the data, extract the features, and set the target
  • Join the features with the target
  • Create a snapshot from the joined data
  • Partition the snapshot into two parts: a training set and a test set
  • From the training set, prepare sample data using sampling techniques
  • Train the model using the sampled data
  • Score the test set with the trained model
  • Evaluate the model against the previously developed persistent model, as well as on the test data prepared in step 4
  • If the best model is found, deploy the model for the target audience
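The following minimal sketch walks through these steps in Spark; the input paths, column names, sampling fraction, and split ratios are illustrative assumptions, not LinkedIn's actual setup:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.sql.SparkSession

    object FeatureTargetWorkflow {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("FeatureTargetWorkflow")
          .master("local[*]") // assumption: local mode, just for the sketch
          .getOrCreate()

        // Steps 1-2: fetch features and targets (hypothetical paths), then join
        val features = spark.read.parquet("/data/features") // id, features
        val targets = spark.read.parquet("/data/targets")   // id, label
        val snapshot = features.join(targets, "id")         // step 3: snapshot

        // Step 4: partition the snapshot into a training set and a test set
        val Array(training, test) = snapshot.randomSplit(Array(0.8, 0.2), seed = 42L)

        // Step 5: down-sample the (typically huge) training set
        val sampled = training.sample(withReplacement = false, fraction = 0.1, seed = 42L)

        // Steps 6-7: train the model and score the test set
        val model = new LogisticRegression().fit(sampled)
        val scored = model.transform(test)

        // Step 8: evaluate (areaUnderROC is the evaluator's default metric)
        val auc = new BinaryClassificationEvaluator().evaluate(scored)
        println(s"Area under ROC = $auc")

        // Steps 9-10: if this is the best model so far, persist it for deployment
        model.write.overwrite().save("/models/best-model")

        spark.stop()
      }
    }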

So what's next? Your model should also be adaptable to large-scale dynamic data, such as real-time streaming IoT data. In addition, real-time feedback is important so that your ML system can learn from its mistakes. The next subsection discusses this.

Applying machine learning to dynamic data

The reasons are obvious: machine learning brings concrete and dynamic aspects to IoT projects. Recently, machine learning has experienced a surge in popularity among industrial companies hoping to profit from it out of the box. As a result, almost every IT vendor is hastily announcing IoT platforms and consulting services. However, achieving financial benefits through IoT data is not an easy job. Moreover, many businesses have failed to clearly determine which areas will change with the implementation of an IoT strategy.

Considering these positive and negative issues together, your ML model should adapt to large dynamic data, since large-scale data means billions of records, large feature spaces, and low positive rates arising from sparsity. Moreover, the data is dynamic, so the ML models have to be adaptive enough; otherwise, you will have a bad experience or get lost in a black hole.
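As a minimal sketch of an ML model that keeps adapting to dynamic data, the following uses MLlib's streaming linear regression, which updates its weights with every new micro-batch; the directory paths and the feature count are assumptions for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object AdaptiveStreamingModel {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("AdaptiveStreamingModel")
          .setMaster("local[2]") // assumption: local mode, just for the sketch
        val ssc = new StreamingContext(conf, Seconds(10))

        // Hypothetical directories where new training/test files land over time
        val trainingData = ssc.textFileStream("/stream/training").map(LabeledPoint.parse)
        val testData = ssc.textFileStream("/stream/test").map(LabeledPoint.parse)

        // The model starts from zero weights and is updated on every new batch,
        // so it keeps adapting as the dynamic data evolves
        val numFeatures = 3
        val model = new StreamingLinearRegressionWithSGD()
          .setInitialWeights(Vectors.zeros(numFeatures))

        model.trainOn(trainingData)
        model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }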

Best practice after developing an ML application

The typical steps that constitute best practice after an ML model/system has been developed are: visualization for understanding the predictive values, model validation, error and accuracy analysis, model tuning, model adaptation, and scaling up to handle large-scale datasets with ease.

How to enable real-time ML visualization

Visualization provides an interactive interface for keeping the ML model itself tuned. Without visualizing the predictive results, it becomes difficult to improve the performance of an ML application any further. The best practice could be something like this (a Kafka-based sketch follows the list):

  • Incorporate some third-party tools along with GraphX for the visualization of large-scale graph-related data (more on this in Chapter 9, Advanced Machine Learning with Streaming and Graph Data)
  • For non-graph data, use a callback interface for the Spark ML algorithm to send and receive messages, incorporating other tools such as Apache Kafka:
    • Algorithms decide when and what message to send
    • Algorithms don't care how the message is delivered

  • A task channel to handle the message delivery service from the Spark Driver program to the Spark Client or the Spark cluster nodes. The task channel communicates using Spark Core at a lower level of abstraction:
    • It does not care about the content of the message or recipient of the message

  • The message is delivered from Spark Client to the browser or visualization client:
    • We recommend using HTML5 Server-Sent Events (SSE) and HTTP Chunked Response (PUSH) together. Incorporation of Spark with this type of technology will be discussed in Chapter 10, Configuring and Working with External Libraries
    • Pull is possible; however, it requires a message queue

  • Visualization using JavaScript frameworks such as Plot.ly (please refer to https://plot.ly/) and D3.js (please refer to https://d3js.org/)
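As a minimal sketch of the algorithm-side callback described above (the broker address and topic name are assumptions), the training code decides when and what to publish, while Kafka takes care of how the message is delivered:

    import java.util.Properties

    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    // The algorithm decides when and what to send; Kafka handles delivery
    object MetricsPublisher {
      private val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092") // assumption: local broker
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

      private val producer = new KafkaProducer[String, String](props)

      // Called from inside the training loop, for example once per iteration
      def publish(iteration: Int, loss: Double): Unit = {
        val payload = s"""{"iteration": $iteration, "loss": $loss}"""
        producer.send(new ProducerRecord("ml-metrics", "training", payload)) // hypothetical topic
      }

      def close(): Unit = producer.close()
    }

A visualization client (for example, a D3.js page fed over SSE) can then consume the ml-metrics topic without the algorithm knowing anything about the delivery path.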

Do some error analysis

As algorithms become more prevalent, we need better tools for building complex, yet robust and stable, machine learning systems. A popular distributed framework like Apache Spark takes these ideas to extremely large datasets and to a wider audience. Therefore, it would be better if we could bound the approximation errors and convergence rates of the layered pipelines.

Assuming we can compute error bars for nodes, the next step would be to have a mechanism for extracting error bars for these pipelines. In practice, however, when the ML model is deployed to production, we might need tools to confirm that the pipeline will work, will not malfunction or stop halfway through, and can provide some expected measure of its errors.

Keeping your ML application tuned

Devising one or two algorithms that perform solidly well on a simple problem can be considered a good kick-off. However, sometimes you may be hungry for the best accuracy, even at the cost of your valuable time and the available computational resources. Tuning is the smarter way to go about this: it will help you not only to squeeze out extra performance, but also to improve on the accuracy you were getting from the machine learning algorithms you designed previously. In order to do that, when you tune the model and the related algorithm, you essentially must have high confidence in the results.

Obviously, those results will only be available after you have specified the testing and validation procedures. This means you should only use techniques that reduce the variance of the performance measure, so that your assessment of the algorithms is more reliable.

In parallel, like most data practitioners, we also suggest that you use the cross-validation technique (also often called rotation estimation) with a reasonably high number of folds (that is, K-fold cross-validation, where a single subsample is used as the validation dataset for testing the model, and the remaining K-1 subsamples are used to train it). Although the exact number of folds, K, depends on your dataset, 10-fold cross-validation is commonly used; most often, though, the value of K remains unfixed. We will mention three strategies here that you will need to tune your machine learning model (see the tuning sketch after the list):

  • Algorithm tuning: Treat your machine learning algorithm as parameterized; then adjust the values of those parameters (if there are several) to influence the outcome of the overall learning process.
  • Ensembles: Sometimes it is good to be naïve! Therefore, in order to get improved results, keep trying to combine the outcomes of multiple machine learning methods or algorithms.
  • Extreme feature engineering: If your data has complex, multi-dimensional structures embedded in it, engineer your features so that the ML algorithms can find and exploit those structures to make decisions.
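For the first strategy, a minimal sketch of algorithm tuning with Spark ML's ParamGridBuilder and CrossValidator might look like this (training is assumed to be an existing DataFrame with label and features columns, and the grid values are illustrative):

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val lr = new LogisticRegression()

    // Algorithm tuning: parameterize the algorithm and search over the values
    val paramGrid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
      .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
      .build()

    // 10-fold cross-validation picks the parameter combination with the best
    // (lowest-variance) performance estimate
    val cv = new CrossValidator()
      .setEstimator(lr)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(10)

    // `training` is assumed to be a DataFrame with "label"/"features" columns
    val cvModel = cv.fit(training)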

Keeping your ML application adaptive and scaling it up

As shown in Figure 10, adaptive learning combines the previous generations of rule-based, simple machine learning, and deep learning approaches to machine intelligence, according to Rob Munro:


Figure 10: Four generations of machine intelligence (figure courtesy of Rob Munro, The fourth generation of machine learning: Adaptive learning, http://idibon.com/the-fourth-generation-of-machine-learning-adaptive-learning/#comment-175958).

Research also shows that adaptive learning is, for example, 95% accurate in predicting people's intention to purchase a car (please refer to Rob Munro, The fourth generation of machine learning: Adaptive learning, http://idibon.com/the-fourth-generation-of-machine-learning-adaptive-learning/#comment-175958). Moreover, if your ML application is adaptive to new environments and new data, then, given enough infrastructure, your ML system can be scaled up for increasing data loads.
