Spark and the data scientist's workflow

As already stated, a common task for a data scientist is to select the data, pre-process it (formatting, cleaning, and sampling), and transform the raw data (scaling, decomposition, and aggregation) into a format that can be fed into machine learning models. As the size of experimental datasets grows, traditional single-node databases are no longer feasible for handling them, so you need to switch to a big data processing framework. Fortunately, Spark is an excellent option: a scalable, distributed computing system that can cope with such datasets.
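The following is a minimal sketch of these pre-processing and transformation steps in Spark, assuming a hypothetical CSV input path and hypothetical column names (x1, x2); it is an illustration of the idea rather than the book's exact example:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.{VectorAssembler, StandardScaler}

object PreprocessSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PreprocessSketch")
      .master("local[*]")
      .getOrCreate()

    // Select/read the raw data (hypothetical CSV path and schema).
    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/raw.csv")

    // Cleaning: drop rows with missing values; sampling: keep a 10% sample.
    val cleaned = raw.na.drop()
    val sampled = cleaned.sample(withReplacement = false, fraction = 0.1, seed = 42L)

    // Transformation: assemble the feature columns into a vector and scale them.
    val assembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2"))
      .setOutputCol("features")
    val assembled = assembler.transform(sampled)

    val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
    val scaled = scaler.fit(assembled).transform(assembled)

    scaled.show(5)
    spark.stop()
  }
}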


Figure 19: The data scientist's workflow for using Spark

Now let's get to the point. As a data scientist, you first read the dataset, which may be available in diverse formats. Reading the dataset gives you an RDD, a DataFrame, or a Dataset, concepts we have already described. You can cache the dataset in main memory, and you can transform the data using the DataFrame, SQL, or Dataset APIs. Finally, you perform an action to write your results back to disk or to the cluster's storage. The steps described here essentially form the workflow you will follow for basic data processing with Spark, as shown in Figure 19.
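A minimal sketch of this read-cache-transform-action workflow is shown below. The JSON input path, output path, and column names (name, age) are hypothetical placeholders:

import org.apache.spark.sql.SparkSession

object WorkflowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WorkflowSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // 1. Read the dataset (JSON is one of the diverse supported formats).
    val people = spark.read.json("data/people.json")

    // 2. Cache it in main memory for repeated use.
    people.cache()

    // 3. Transform with the DataFrame API ...
    val adults = people.filter($"age" >= 18).select("name", "age")

    // ... or equivalently with SQL on a temporary view.
    people.createOrReplaceTempView("people")
    val adultsSql = spark.sql("SELECT name, age FROM people WHERE age >= 18")

    // 4. Perform an action: write the result back to disk.
    adults.write.mode("overwrite").parquet("output/adults.parquet")

    spark.stop()
  }
}

Both the DataFrame expression and the SQL query compile to the same optimized execution plan, so which of the two you use is largely a matter of preference.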
