Data preprocessing

While using the Data Refinery feature, we noticed that the data file contains a column named Category, which indicates the type of product sold (furniture, office supplies, and several others). In this example, the data scientist is interested only in furniture sales, so the following lines of code filter the data and then verify that there is a reasonable amount of data in that category to perform a proper analysis:

# Keep only the rows in the Furniture category
furniture = df.loc[df['Category'] == 'Furniture']
# Earliest and latest order dates in the filtered data
furniture['Order Date'].min(), furniture['Order Date'].max()
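To see what these two lines return, here is a minimal, self-contained sketch. The miniature DataFrame below is hypothetical stand-in data (the actual sales file is not reproduced here); only the column names Category and Order Date match the chapter's example:

```python
import pandas as pd

# Hypothetical stand-in for the sales data file used in the chapter.
df = pd.DataFrame({
    'Category': ['Furniture', 'Technology', 'Furniture'],
    'Order Date': pd.to_datetime(['2014-01-06', '2015-03-12', '2017-12-30']),
})

# Keep only the rows in the Furniture category.
furniture = df.loc[df['Category'] == 'Furniture']

# The min/max pair gives the earliest and latest order dates, i.e. the
# time span covered by the furniture sales.
print(furniture['Order Date'].min(), furniture['Order Date'].max())
```

With the real data file, this same min/max check is what confirms that the furniture sales span roughly four years.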

The preceding code, executed in Watson Studio, is shown in the following screenshot: it filters the sales data and, by displaying the earliest and latest timestamps, verifies that we have four years of furniture sales in this data file:

In the selected example, the data scientist chose to use raw Python commands to remove (drop) columns of data not needed in the analysis, check for missing values, aggregate (group by) sales transactions by date, and so on. Although using Python scripting to accomplish these tasks is not overly complex, you can alternatively perform all of those tasks (and more) with the drag-and-drop interface of a Watson Data Refinery flow.
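The preprocessing steps just described can be sketched in a few lines of pandas. The DataFrame below is hypothetical sample data for illustration; the column names ('Order Date', 'Category', 'Sales', 'Row ID') are assumptions modeled on a typical retail sales file, not the chapter's actual dataset:

```python
import pandas as pd

# Hypothetical miniature of the sales data file.
df = pd.DataFrame({
    'Row ID': [1, 2, 3, 4],
    'Order Date': pd.to_datetime(
        ['2017-01-06', '2017-01-06', '2017-01-10', '2017-01-10']),
    'Category': ['Furniture', 'Office Supplies', 'Furniture', 'Furniture'],
    'Sales': [2573.82, 76.73, 609.98, 51.94],
})

# Filter to furniture sales; .copy() avoids SettingWithCopyWarning later.
furniture = df.loc[df['Category'] == 'Furniture'].copy()

# Drop columns not needed for the analysis.
furniture = furniture.drop(columns=['Row ID', 'Category'])

# Check for missing values in each remaining column.
print(furniture.isnull().sum())

# Aggregate (group by) sales transactions by date.
daily_sales = furniture.groupby('Order Date')['Sales'].sum()
print(daily_sales)
```

Each of these pandas calls corresponds to an operation (filter, remove columns, aggregate) that can also be added as a step in a Data Refinery flow.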

As we mentioned earlier, a Data Refinery flow is an ordered set of steps to cleanse, shape, and enhance a data asset. As you refine data by applying operations to the data, you are actually dynamically building a customized Data Refinery flow that can be modified in real time and saved for future use as new data becomes available!
