Preparing the data 

Data preparation produces the input data that ML algorithms consume. Raw data from data sources is often not clean, and sometimes it cannot be fed directly into an ML algorithm to create a model. The raw data must be cleaned and converted into a format that the ML algorithm accepts as input.

Exploratory data analysis (EDA) is a substep in the process of creating the input data. It uses visual and quantitative aids to understand the data without making prior assumptions about its contents. EDA gives us deeper insight into the data at hand and helps us identify the data preparation steps that are required. Insights obtained during EDA include the presence of outliers, missing values, and duplicate records in the data. All of these problems are addressed during data cleansing, which is another substep in data preparation. Several techniques may be adopted during data cleansing; some of the popular ones are:

  • Deleting records that are outliers
  • Deleting redundant columns and irrelevant columns in data
  • Missing value imputation: filling missing values with a special value such as NA or a blank, with the median, mean, or mode, or with a regressed value
  • Scaling the data
  • Removing stop words, such as "a", "and", and "how", from unstructured text data
  • Normalizing words in unstructured text documents with techniques such as stemming and lemmatization
  • Eliminating non-dictionary words in text data
  • Correcting misspelled words in text documents
  • Replacing non-recognizable domain-specific acronyms in the text with actual word descriptions
  • Rotation, scaling, and translation of image data
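
A few of the techniques listed above can be sketched in a handful of lines. The following is a minimal illustration using only the Python standard library, not a definitive implementation: the function names, the sample `ages` column, the 1.5 × IQR outlier rule, and the tiny stop-word set are all assumptions made for the example. In practice, libraries such as pandas or scikit-learn are commonly used for these steps.

```python
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def drop_outliers_iqr(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR] (an assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "and", "how", "the"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

# Hypothetical numeric column with one missing value and one outlier.
ages = [23, 25, None, 24, 26, 22, 250, 27]
imputed = impute_median(ages)          # the None becomes the median
cleaned = drop_outliers_iqr(imputed)   # the value 250 is removed

tokens = remove_stop_words("How a model learns and the data it needs")
```

Note that the order of the steps matters: here the missing value is imputed before outliers are removed, so the outlier still influences the imputed median. Whether that is acceptable depends on the dataset.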

Data preparation also includes representing unstructured data as vectors, labeling the records when the problem at hand calls for supervised learning, handling class imbalance in the data, feature engineering, and transforming the data with functions such as the log, min-max, square root, and cube transforms.
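
To make the transformation functions concrete, here is a minimal sketch of three of them applied to a numeric column. The function names and the sample `incomes` values are assumptions for illustration. The log transform here uses `math.log1p`, that is log(1 + x), so that zero values are handled; this is one common choice rather than the only one, and it assumes non-negative inputs.

```python
import math

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def log_transform(values):
    """Apply log(1 + x); assumes every value is non-negative."""
    return [math.log1p(v) for v in values]

def sqrt_transform(values):
    """Apply the square root; assumes every value is non-negative."""
    return [math.sqrt(v) for v in values]

# Hypothetical skewed column spanning several orders of magnitude.
incomes = [0, 10, 100, 1000]
scaled = min_max_scale(incomes)
logged = log_transform(incomes)
rooted = sqrt_transform(incomes)
```

The log and square root transforms compress large values and are often applied to right-skewed data, while min-max scaling only changes the range and leaves the shape of the distribution unchanged.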

The output of the data preparation step is tabular data that can be fed directly into an ML algorithm as input to create models.

An additional substep typically performed during data preparation is dividing the dataset into training data, validation data, and test data. Each of these datasets serves a specific purpose in the model-building step.
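
A three-way split can be sketched as follows. This is a minimal illustration, assuming a simple random shuffle with a 60/20/20 split; the function name, the fractions, and the seed are assumptions, and other strategies (for example, stratified or time-based splits) may suit a given dataset better.

```python
import random

def train_val_test_split(records, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle a copy of the records and cut it into three disjoint parts."""
    shuffled = list(records)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical dataset of 100 record identifiers.
records = list(range(100))
train, val, test = train_val_test_split(records)
```

In a typical workflow, the training data is used to fit the model, the validation data to tune it, and the test data to estimate how well the final model performs on unseen records.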
