Scaling the ML pipelines

Data mining and machine learning algorithms pose significant challenges for parallel and distributed computing platforms. Furthermore, parallelizing machine learning algorithms is highly task-specific and often depends on the question at hand. In Chapter 1, Introduction to Data Analytics with Spark, we discussed and showed how to deploy the same machine learning application on top of a cluster or cloud computing infrastructure (that is, Amazon AWS/EC2).

Following that approach, we can handle datasets with enormous batch sizes or in real time. In addition, scaling up a machine learning application involves further trade-offs, such as cost, complexity, run-time, and technical requirements. Furthermore, making task-appropriate algorithm and platform choices for large-scale machine learning requires an understanding of the benefits, trade-offs, and constraints of the available options.

To handle these issues, in this section, we will cover some theoretical aspects of handling big datasets for deploying large-scale machine learning applications. Before going any further, however, we need to answer some questions. For example:

  • How do we collect the big dataset to fulfil our needs?
  • How large are the big datasets and how do we handle them?
  • How much training data is enough to scale up the ML application on a big dataset?
  • What is an alternative approach if we don't have enough training data?
  • What sorts of machine learning algorithms should be used for fulfilling our needs?
  • What platform should be chosen for parallel learning?

Here, we discuss some important aspects of deploying and scaling up a machine learning application to handle the preceding big data challenges, including size, data skewness, cost, and infrastructure.

Size matters

Big data is data whose volume, variety, veracity, velocity, and value are too great to be processed by traditional in-memory computing systems. Scaling up machine learning applications to handle big data involves tasks such as classification, clustering, regression, feature selection, boosted decision trees, and SVMs. How do we handle 1 billion or 1 trillion data instances? Moreover, billions of cell phones and social networks such as Twitter produce big datasets at an unprecedented rate. On the other hand, crowdsourcing is now a reality: labeling 100,000+ data instances within a week is feasible.

In terms of sparsity, big datasets cannot be too sparse; from a content perspective, they tend to be dense. From the machine learning perspective, to justify this claim, let's consider an example of data labeling. For instance, 1M data instances cannot belong to 1M classes, simply because having 1M classes is impractical; instead, more than one data instance belongs to any particular class. Therefore, given the sparsity and size of such a large-scale dataset, making predictive analytics work is another challenge that needs to be considered and handled while scaling up.
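To see why the representation itself matters at this scale, here is a minimal Spark sketch (the vector values are purely illustrative) contrasting a dense vector, which stores every zero, with a sparse one, which stores only the non-zero entries; for high-dimensional but mostly-empty feature spaces, the sparse form saves a great deal of memory:

```scala
import org.apache.spark.ml.linalg.Vectors

// A dense vector stores every value, including the zeros
val dense = Vectors.dense(1.0, 0.0, 0.0, 3.0)

// A sparse vector of the same length stores only the non-zero
// entries (here, indices 0 and 3)
val sparse = Vectors.sparse(4, Array(0, 3), Array(1.0, 3.0))

println(dense)   // [1.0,0.0,0.0,3.0]
println(sparse)  // (4,[0,3],[1.0,3.0])
```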

Size versus skewness considerations

Machine learning also depends on the availability of labeled data, and how far that data can be trusted depends on the learning task, such as supervised, unsupervised, or semi-supervised. You might have a structured dataset, but with extreme skewness. More specifically, suppose you have 1K labeled and 1M unlabeled data points, so the labeled-to-unlabeled ratio is 0.1%.

Do you think that only those 1K labeled points are enough to train a supervised model? As another example, suppose you have 1M labeled and 1B unlabeled data points, where the labeled-to-unlabeled ratio is also 0.1%. Again, the same question arises: is it enough to have only the 1M labels to train a supervised model?
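As a rough illustration only (the file path and the label column name are hypothetical, not from the chapter), here is a minimal sketch of how you might measure that labeled-to-unlabeled ratio on a Spark DataFrame in which unlabeled rows simply carry a null label:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("LabelRatio").getOrCreate()

// Hypothetical dataset with a nullable "label" column: rows with a
// null label form the unlabeled portion of the data
val df = spark.read.parquet("hdfs:///data/instances.parquet")

val total   = df.count()
val labeled = df.filter(col("label").isNotNull).count()
val ratio   = labeled.toDouble / (total - labeled)

println(f"labeled = $labeled, unlabeled = ${total - labeled}, ratio = ${ratio * 100}%.3f%%")
```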

Now the concern is what can be done. One approach is to use the existing labels only as guidance, rather than as a directive, for semi-supervised clustering, classification, or regression. Alternatively, label more data, either manually or with a little help from the crowd. For example, suppose someone wants to run a clustering or classification analysis on a disease. More specifically, suppose we want to classify tweets according to whether a particular tweet indicates an Ebola- or flu-related disease. In this case, we should use a semi-supervised approach for labeling the tweets.
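One widely used way to let the existing labels act as guidance is self-training: fit a model on the small labeled set, pseudo-label the unlabeled set with its high-confidence predictions, and retrain. The sketch below is only illustrative, not the chapter's prescribed method; the input paths, the features and label column names, and the 0.9 confidence threshold are all assumptions, and spark refers to the active SparkSession as in the previous sketch:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Assumed inputs (illustrative paths): both DataFrames have a "features"
// Vector column; "labeled" also has a numeric "label" column
val labeled   = spark.read.parquet("hdfs:///data/tweets_labeled.parquet")
val unlabeled = spark.read.parquet("hdfs:///data/tweets_unlabeled.parquet")

// Extract the highest class probability from the model's probability vector
val maxProb = udf((p: Vector) => p.toArray.max)

// 1. Train an initial model on the small labeled set
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
val model = lr.fit(labeled)

// 2. Predict labels for the unlabeled set
val predicted = model.transform(unlabeled)

// 3. Keep only high-confidence predictions as pseudo-labels
val pseudoLabeled = predicted
  .filter(maxProb(col("probability")) > 0.9)
  .select(col("features"), col("prediction").as("label"))

// 4. Retrain on the union of the original and pseudo-labeled data
//    (in practice, this loop is repeated until the pseudo-labels stabilize)
val augmented  = labeled.select("features", "label").union(pseudoLabeled)
val finalModel = lr.fit(augmented)
```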

However, in this case, the dataset might be very skewed, or the labeling might be biased. Usually, the training data comes from different users, and explicit user feedback can often be misleading.

Therefore, learning from implicit feedback is a better idea; for example, collecting data from clicks on web search results. In these types of large-scale datasets, the skewness of the training data is hard to detect, as discussed in Chapter 4, Extracting Knowledge through Feature Engineering. Therefore, be wary of this skewness in big datasets.
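The chapter does not prescribe a specific check, but one simple way to spot such skewness in practice is to inspect the class distribution of whatever labels you do have, for example on the df DataFrame from the earlier sketch:

```scala
import org.apache.spark.sql.functions.col

// Count the rows per label; a heavily lopsided distribution
// (say, 99% of labeled rows in one class) signals skewed training data
df.groupBy("label")
  .count()
  .orderBy(col("count").desc)
  .show()
```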

Cost and infrastructure

To scale up your machine learning application, you will need better infrastructure and computing power to handle such big datasets. Initially, you might want to utilize a local cluster. However, sometimes the cluster might not be enough to scale up your ML application if the dataset grows exponentially.

As discussed earlier regarding deploying the ML pipeline on powerful infrastructure such as Amazon AWS cloud computing (for example, EC2), you will have to opt for pay-as-you-go pricing to use the cloud as Platform as a Service and Infrastructure as a Service, even though you offer your own ML application as Software as a Service.
