Data cleansing and preparation

Data cleansing and preparation is commonly described as the work of transforming raw data into a form that data scientists and analysts can more readily run through machine learning algorithms to uncover insights or make predictions from that data.

This process can be complicated by issues such as missing or incomplete records, or simply by extraneous columns of information within a data source.

In the previous example screenshot, we can see that the DataFrame object includes the columns country, description, designation, points, price, province, and so on.

As an exercise designed to demonstrate how easily we can use Python within Watson Studio to prepare data, let's suppose that we wanted to drop one or more columns from the DataFrame. To accomplish this task, we use the following Python statements:

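# Name the column(s) to remove
to_drop = ['points']
# axis=1 targets columns; inplace=True modifies df_data_1 directly
df_data_1.drop(to_drop, inplace=True, axis=1)
# Preview the first rows to confirm the result
df_data_1.head()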

The preceding Python commands define the name of the column to be dropped from the DataFrame (points) and then drop that column from the df_data_1 DataFrame.

Within IBM Watson Studio, using the notebook we created earlier in this chapter, we can enter and run the preceding commands, and then use the head() function, to verify that the column we indicated has actually been dropped.
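For readers who want to try the same check outside the notebook, the following is a minimal, self-contained sketch. It assumes only that pandas is installed. The small DataFrame is a hypothetical stand-in for df_data_1, with invented values and only a few of the columns mentioned above; it is not the actual dataset loaded in Watson Studio.

```python
import pandas as pd

# Hypothetical stand-in for df_data_1; the values are invented for illustration
df_data_1 = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "points": [87, 87, 90],
    "price": [None, 15.0, 14.0],
    "province": ["Sicily & Sardinia", "Douro", "Oregon"],
})

to_drop = ["points"]

# Record the column names before and after so the change is easy to compare
columns_before = list(df_data_1.columns)
df_data_1.drop(to_drop, inplace=True, axis=1)
columns_after = list(df_data_1.columns)

print(columns_before)
print(columns_after)
print(df_data_1.head())
```

Comparing the column lists before and after the call is a quick alternative to scanning the head() output by eye, which can be handy when a DataFrame has many columns.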

Although the preceding demonstration is simplistic and barely scratches the surface of data cleansing and preparation, it does demonstrate how easily Python can be used in Watson Studio to access and manipulate data.
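To give a flavor of the missing-record issue mentioned at the start of this section, here is one possible sketch of how missing values might be counted and incomplete rows filtered out with pandas. The DataFrame named df and its values are hypothetical, and the choice of which columns must be present is an assumption made purely for illustration.

```python
import pandas as pd

# Hypothetical sample with some missing entries (None becomes a missing value)
df = pd.DataFrame({
    "country": ["Italy", None, "US"],
    "price": [None, 15.0, 14.0],
    "province": ["Sicily & Sardinia", "Douro", "Oregon"],
})

# Count the missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Keep only rows where both country and price are present;
# dropna returns a new DataFrame and leaves df unchanged
complete_rows = df.dropna(subset=["country", "price"])
print(complete_rows)
```

Whether dropping incomplete rows is appropriate depends on the analysis; filling in values or leaving them as they are can be equally valid choices.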

Rather than continuing with additional fundamental data manipulations, we'll move on to looking at something a bit more complex.
