Chapter 3. Understanding the Problem by Understanding the Data

This chapter covers the DataFrame, Dataset, and Resilient Distributed Dataset (RDD) APIs for working with structured data in detail, aiming to provide a basic understanding of machine learning problems through the available data. By the end of the chapter you will be able to apply basic to complex data manipulation with ease. Some comparisons of Spark's basic abstractions will also be made, using RDD-, DataFrame-, and Dataset-based data manipulation to show the gains in terms of both programming and performance. In addition, we will guide you onto the right track so that you will be able to use Spark to persist an RDD or other data objects in memory, allowing them to be reused efficiently across parallel operations at a later stage. In a nutshell, the following topics will be covered throughout this chapter:

  • Analyzing and preparing your data
  • Resilient Distributed Dataset (RDD) basics
  • Dataset basics
  • Dataset from string and typed class
  • Spark and the data scientist's workflow
  • Deeper into Spark

Analyzing and preparing your data

In practice, several factors affect the success of machine learning (ML) applications on a given task. Therefore, the representation and quality of the experimental dataset must be treated as first-class concerns, and it is always advisable to have better data. For example, irrelevant or redundant data, features with null values, and noisy data all result in an unreliable source of information. Such bad properties in a dataset make the knowledge discovery process during the machine learning model training phase more tedious and time-consuming.

As a result, data pre-processing accounts for a considerable amount of the computational time across the ML workflow as a whole. As we stated in the previous chapter, unless you know your available data, it is difficult to understand the problem itself. Moreover, knowing the data will help you formulate your problem. In parallel, and more importantly, before trying to apply an ML algorithm to a problem, you first have to identify whether the problem is really a machine learning problem and whether an ML algorithm could be applied directly to solve it. The next step is to identify the class of machine learning task; more technically, you need to know whether the identified problem falls under classification, clustering, rule extraction, or regression.

For the sake of simplicity, we assume you have a machine learning problem. Now you need to do some data pre-processing, which includes steps such as data cleaning, normalization, transformation, feature extraction, and feature selection. The product of the data pre-processing step is the final training set, which is typically used to build and train the ML model.

In the previous chapter, we also argued that a machine learning algorithm learns from the data and from the activities carried out during model building and feedback. It is critical that you feed your algorithm the right data for the problem you want to solve. Even if you have good data (or, to be more precise, well-structured data), you need to make sure that the data is on an appropriate scale and in a well-known format that can be parsed by your programming language and, most importantly, that the most meaningful features are included.

In this section, you will learn how to prepare your data so that your machine learning algorithm delivers its best performance. Overall data processing is a huge topic; however, we will try to cover the essential techniques needed to build some large-scale machine learning applications in Chapter 6, Building Scalable Machine Learning Pipelines.

Data preparation process

If you are focused and disciplined during the data handling and preparation steps, you are likely to get more consistent and better results in the first place. However, data preparation is a tedious process consisting of several steps. Nevertheless, the process of getting data ready for a machine learning algorithm can be summarized in three steps:

  • Data selection
  • Data pre-processing
  • Data transformation

Data selection

This step focuses on selecting the subset of all available data that you will be using and working with during machine learning application development and deployment. There is always a strong urge to include all the available data, since more data will provide more features; in other words, following the well-known aphorism, more is better. However, this might not be true in all cases. You need to consider what data you actually need before you can answer the question, because the ultimate goal is to provide a solution to a particular hypothesis. You might also make some assumptions about the data in the first place. Although it is difficult, if you are a domain expert for the problem you can make some assumptions to gain at least some insights before applying your ML algorithms. However, be careful to record those assumptions so that you can test them at a later stage when required. We present some common questions to help you think through the data selection process:

  • The first question is: what is the extent of the data you have available? For example, the extent could span time, database tables, connected system files, and so on. Therefore, it is good practice to ensure that you have a clear understanding of the low-level structure of everything that you can use, or at least an informal inventory of the available resources (including, of course, the available data and computational resources).
  • The second question is a little bit strange: what data is not yet available but important for solving the problem? In this case, you might have to wait for the data to become available, or you can at least generate or simulate this kind of data using a generator or other software.
  • The third question might be: what data do you not need to address the problem? This again concerns redundancies, and excluding redundant or unwanted data is almost always easier than including everything. You might wonder whether or not to note down the data you excluded and why; we think the answer is yes, since you might need some of that seemingly trivial data at a later stage.

Moreover, in practice, for small problems, games, or toy competitions, the data will already have been selected for you; therefore, you don't need to worry about this step at all!

Data pre-processing

After you have selected the data you will be working with, you need to consider how you will use it and what preparation is required. This pre-processing step covers the techniques for getting the selected data into a form that you can work with and apply during your model building and validation steps. The three most common data pre-processing steps are formatting, cleaning, and sampling the data, and a short sketch of all three follows the list below:

  • Formatting: The selected data may not be in good shape, so it might not be suitable to work with directly. Very often, your data will be in a raw format (a flat file format such as text, or a less common proprietary format), and if you are lucky the data might already sit in a relational database. If that is the case, it is better to apply some conversion steps (for example, converting the relational database into a flat file format, since Spark does not make that kind of conversion for you). As already stated, the beauty of Spark is its support for diverse file formats, and we will take advantage of this in the following sections.
  • Cleaning: Very often, the data you will be using comes with many unwanted records, or sometimes with missing entries in a record. The cleaning process deals with removing or fixing missing data. There will almost always be data objects that are insignificant or incomplete, and addressing them should be a first priority. Consequently, those instances may need to be removed, ignored, or deleted from the dataset to get rid of the problem. Additionally, if privacy or security is a concern because of sensitive information in some attributes, those attributes need to be anonymized or removed from the data entirely (where appropriate).
  • Sampling: The third step is sampling on top of the formatted and cleaned dataset. Sampling is often required because the available data, or the number of records, can be huge; nevertheless, we argue that you should use as much data as possible. The downside is that more data results in a longer execution time for the whole machine learning process, increases the running times of the algorithms, and requires a more powerful computational infrastructure. In that case, you can take a smaller representative sample of the selected data, which can be much faster for exploring and prototyping the machine learning solution before you consider the whole dataset. Obviously, whatever machine learning tools you apply for application development and commercialization, the data will influence the pre-processing you are required to perform.
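
To make these three steps a little more concrete, here is a minimal sketch using the Spark DataFrame API in Scala. The input path data/transactions.csv, the sensitive customer_ssn column, and the 10 percent sampling fraction are hypothetical placeholders chosen purely for illustration:

import org.apache.spark.sql.SparkSession

object PreprocessingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PreprocessingSketch")
      .master("local[*]") // local mode, for experimentation only
      .getOrCreate()

    // Formatting: read a raw CSV flat file into a structured DataFrame
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/transactions.csv") // hypothetical input path

    // Cleaning: drop rows with missing values and remove a sensitive attribute
    val cleaned = raw.na.drop().drop("customer_ssn") // hypothetical column name

    // Sampling: keep roughly 10% of the rows for quick prototyping
    val sampled = cleaned.sample(withReplacement = false, fraction = 0.1, seed = 42L)

    sampled.show(5)
    spark.stop()
  }
}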

Data transformation

After selecting appropriate data sources and pre-processing the data, the final step is to transform the processed data. Your specific ML algorithm and your knowledge of the problem domain will influence this step. Three common data transformation techniques are attribute scaling, decomposition, and attribute aggregation, and a short sketch of all three follows the list below. This step is also commonly referred to as feature engineering, which will be discussed in more detail in the next chapter:

  • Scaling: The pre-processed data may contain attributes with a mixture of scales and units, for example dollars, kilograms, and sales volume. However, many machine learning methods expect the data attributes to be on the same scale, such as between 0 and 1 for the smallest and largest value of a given feature. Therefore, consider any feature scaling you may need to perform to scale the processed data properly.
  • Decomposition: The data might have features that represent a complex concept, and machine learning algorithms can respond more powerfully when you split such features into their constituent parts. For example, a day is composed of 24 hours, 1,440 minutes, and 86,400 seconds, which in turn could be split out further. Perhaps only specific hours, or only the hour of the day, are relevant to the problem being investigated and resolved. Therefore, consider appropriate feature extraction and selection to perform the proper decomposition of the processed data.
  • Aggregation: Often, segregated or scattered features are trivial on their own. However, such features can be aggregated into a single feature that is more meaningful to the problem you are trying to solve. For example, an online shopping website may record a data instance each time a customer logs on to the site. These data objects can be aggregated into a count of the number of logins, discarding the individual instances. Therefore, consider appropriate feature aggregation to process the data properly.
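
The following is a minimal Scala sketch of these three transformation techniques, using Spark SQL functions together with MLlib's VectorAssembler and MinMaxScaler. The column names amount, event_time, and customer_id, and the shape of the input DataFrame, are assumptions made purely for this example:

import org.apache.spark.ml.feature.{MinMaxScaler, VectorAssembler}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, hour}

object TransformationSketch {
  // Expects a DataFrame with the columns "amount", "event_time", and "customer_id"
  def transform(events: DataFrame): DataFrame = {
    // Scaling: rescale "amount" into the [0, 1] range using MinMaxScaler,
    // which operates on a vector column produced by VectorAssembler
    val assembled = new VectorAssembler()
      .setInputCols(Array("amount"))
      .setOutputCol("amountVec")
      .transform(events)
    val scaled = new MinMaxScaler()
      .setInputCol("amountVec")
      .setOutputCol("amountScaled")
      .fit(assembled)
      .transform(assembled)

    // Decomposition: split the timestamp into the hour of day, which may be
    // the only part of the date that matters for the problem
    val decomposed = scaled.withColumn("hourOfDay", hour(col("event_time")))

    // Aggregation: collapse individual login events into one count per customer
    // (in practice you might join this back onto the feature table)
    decomposed.groupBy("customer_id").agg(count("*").as("loginCount"))
  }
}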

Apache Spark provides distributed data structures, including the RDD, DataFrame, and Dataset, with which you can perform data pre-processing efficiently. These data structures offer different advantages and performance characteristics for processing data. In the next sections, we will describe each of these data structures individually and also show examples of how to process large datasets using them.
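
As a quick preview of those sections, here is a minimal sketch, assuming a local SparkSession and a hypothetical Person case class, that expresses the same small collection as an RDD, a DataFrame, and a typed Dataset:

import org.apache.spark.sql.SparkSession

// A tiny domain type for the typed Dataset API
case class Person(name: String, age: Int)

object AbstractionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AbstractionsSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val people = Seq(Person("Alice", 29), Person("Bob", 35))

    // RDD: a low-level distributed collection of JVM objects, with no schema
    val rdd = spark.sparkContext.parallelize(people)

    // DataFrame: rows with a schema, optimized by the Catalyst query planner
    val df = people.toDF()

    // Dataset: a typed view that keeps compile-time type checking
    val ds = people.toDS()

    println(rdd.count())      // 2
    df.printSchema()          // name: string, age: int
    ds.filter(_.age > 30).show()
    spark.stop()
  }
}

Note how the Dataset filter uses a plain Scala lambda over the Person type, so a typo in a field name is caught at compile time rather than at runtime.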
