Preparing the data 

Data preparation creates the input data that ML algorithms consume. Raw data from data sources is often not clean, and sometimes it cannot be fed directly into an ML algorithm to create a model. We need to ensure that the raw data is cleaned up and prepared in a format that the ML algorithm accepts as input.

Exploratory data analysis (EDA) is a substep in the process of creating the input data. It uses visual and quantitative aids to understand the data without preconceptions about its contents. EDA gives us deeper insight into the data at hand and helps us identify the required data preparation steps. Insights that EDA can surface include outliers, missing values, and duplicated records. All of these problems are addressed during data cleansing, which is another substep in data preparation. Several techniques may be adopted during data cleansing; the following are some of the popular ones:

Deleting records that are outliers
Deleting redundant columns and irrelevant columns in data
Imputing missing values: filling them with a special value such as NA or a blank, with the median, mean, or mode, or with a regressed value
Scaling the data
Removing stop words such as a, and, and how from unstructured text data
Normalizing words in unstructured text documents with techniques such as stemming and lemmatization
Eliminating non-dictionary words in text data
Correcting misspelled words in text documents
Replacing unrecognized domain-specific acronyms in the text with their full descriptions
Rotation, scaling, and translation of image data
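
Three of the tabular techniques above (removing duplicate records, imputing missing values with the median, and deleting outliers) can be sketched as follows. This is a minimal illustration using only the Python standard library, not a definitive implementation. The sample records, the function names, and the choice of the 1.5 × IQR rule for flagging outliers are all assumptions made for the example; a real project would more likely use a data-frame library and pick thresholds based on what EDA revealed.

```python
import statistics

def drop_duplicates(rows):
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def impute_median(rows, column):
    """Replace missing (None) values in a column with the column median."""
    observed = [r[column] for r in rows if r[column] is not None]
    median = statistics.median(observed)
    return [{**r, column: median if r[column] is None else r[column]}
            for r in rows]

def drop_outliers_iqr(rows, column, k=1.5):
    """Delete records outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    values = [r[column] for r in rows]
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [r for r in rows if low <= r[column] <= high]

# Hypothetical raw records: one duplicate, one missing age, one outlier.
raw = [
    {"id": 1, "age": 25},
    {"id": 2, "age": 30},
    {"id": 2, "age": 30},
    {"id": 3, "age": None},
    {"id": 4, "age": 28},
    {"id": 5, "age": 32},
    {"id": 6, "age": 27},
    {"id": 7, "age": 200},
]

cleaned = drop_outliers_iqr(impute_median(drop_duplicates(raw), "age"), "age")
for record in cleaned:
    print(record)
```

The order of the steps matters in this sketch: duplicates are removed first so that they do not skew the median, and missing values are imputed before the outlier check so that every record has a value to compare.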

Data preparation also includes representing unstructured data as vectors, providing labels for the records when the problem calls for supervised learning, handling class imbalance, feature engineering, and transforming the data with functions such as the log, min-max, square root, and cube transforms.
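
To make the transformation functions concrete, here is a small sketch of the min-max, log, and square root transforms in plain Python. The function names and the sample values are assumptions for illustration. The log transform shown uses log(1 + x) through `math.log1p`, a common variant chosen here only so that a value of zero is handled; the text does not specify which form of the log transform is intended.

```python
import math

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def log_transform(values):
    """Apply log(1 + x); assumes all values are non-negative."""
    return [math.log1p(v) for v in values]

def sqrt_transform(values):
    """Apply the square root; assumes all values are non-negative."""
    return [math.sqrt(v) for v in values]

# Hypothetical skewed feature, for example incomes in thousands.
incomes = [0, 10, 100, 1000]

scaled = min_max_scale(incomes)
logged = log_transform(incomes)
roots = sqrt_transform(incomes)

print(scaled)
print(logged)
print(roots)
```

The min-max transform changes only the range of the feature, while the log and square root transforms also compress large values, which is why they are often applied to heavily skewed features.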

The output of the data preparation step is tabular data that can be fed readily into an ML algorithm as input to create models.

An additional substep typically performed in data preparation is dividing the dataset into training data, validation data, and test data. Each of these datasets serves a specific purpose in the model-building step.
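
The split can be sketched as a shuffle followed by slicing. The function name, the 60/20/20 proportions, and the fixed seed below are assumptions chosen for illustration; the text does not prescribe particular fractions, and suitable values depend on the size of the dataset.

```python
import random

def train_val_test_split(records, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle the records and slice them into train, validation, and test sets."""
    shuffled = list(records)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)

    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)

    test_set = shuffled[:n_test]
    val_set = shuffled[n_test:n_test + n_val]
    train_set = shuffled[n_test + n_val:]
    return train_set, val_set, test_set

# Hypothetical dataset of 100 record identifiers.
train_set, val_set, test_set = train_val_test_split(range(100))
print(len(train_set), len(val_set), len(test_set))
```

Shuffling before slicing guards against any ordering in the source data, such as records sorted by date or by class, leaking into one of the subsets. Each record lands in exactly one of the three sets.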
