Refining data by using commands

Rather than using scripting to process or refine your data, you can use commands (also known as operations) to preprocess it. To do this, start by entering a command (or operation name) and let Watson Studio's autocomplete feature help you find the right operation and syntax (no programming involved!).

Note that, if you hover over any operation or function name, you'll see both a description and detailed information for completing the command.

When you're ready, you can then click on Apply to add the operation to your data refinery flow.

For example, you might want to sort or reorder a column of data. If you start typing arra, you'll see the following result:

If you then click to select the arrange() function, you can click on the () and select a column from your data file, as shown in the following screenshot:

When you're ready, click on the Apply button to add the command to the data refinery flow, as shown in the following screenshot:

Even before you save the data refinery flow (more on this in a moment), Watson Studio will display a preview of the effect of applying the command to your data, as shown in the following screenshot:

There are numerous commands available to add to the data refinery flow, and each command is added as a step within the flow. If you click on Steps (as shown in the following screenshot), you can view and edit each of the steps you have defined for the flow; you can even delete them, if you wish. In our case, I have added three steps: the first to arrange (sort) the data, the second to convert the first column from a string into an integer, and the third to filter the records down to only those for the years before 2011, as shown in the following screenshot:
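The three steps above can be sketched in pandas for readers who want a runnable analogue outside Data Refinery. The column names and sample rows here are illustrative assumptions, not the book's actual schema:

```python
import io
import pandas as pd

# Hypothetical stand-in for the source file; the "year" column is an assumption.
csv = io.StringIO(
    "year,player,score\n"
    "2012,A,10\n"
    "2009,B,7\n"
    "2010,C,12\n"
)
df = pd.read_csv(csv, dtype=str)

# Step 1: arrange (sort) by a column.
df = df.sort_values("year")

# Step 2: convert the first column from a string into an integer.
df["year"] = df["year"].astype(int)

# Step 3: filter to keep only records for the years before 2011.
df = df[df["year"] < 2011]

print(df["year"].tolist())  # [2009, 2010]
```

Each pandas statement corresponds to one step in the flow, which mirrors how Data Refinery records every applied command as an editable step.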

At any time, you can save and run the data refinery flow by clicking the Save data refinery flow and Run data refinery flow icons shown in the following screenshot:

Again, there are commands for almost all of the preprocessing tasks required in most machine learning projects, such as group, rename, sample_n, and summarize. Perhaps most importantly, the data refinery provides scripting support for many of the dplyr R library's operations, functions, and logical operators. For example, sample_frac and sample_n are supported by the data refinery and are very useful for generating sample datasets from your original data source.
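For comparison, the two sampling operations behave much like pandas' `DataFrame.sample`, shown here on a small illustrative frame (the column name is an assumption):

```python
import pandas as pd

# Toy frame standing in for a real data asset.
df = pd.DataFrame({"year": range(2000, 2020)})

# sample_n analogue: draw a fixed number of rows.
n_sample = df.sample(n=5, random_state=42)

# sample_frac analogue: draw a fraction of the rows (25% of 20 rows = 5).
frac_sample = df.sample(frac=0.25, random_state=42)

print(len(n_sample), len(frac_sample))  # 5 5
```

The distinction carries over directly: sample_n takes an absolute row count, while sample_frac takes a proportion of the source rows.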

We can use the following command to take our original file (combine_.csv) and create a data refinery flow to generate a sample:

sample_n(199, replace=TRUE)

The preceding command will generate the following output:

This will automatically read our original file and create a random sample of data based on the number of rows indicated (in our case, 199). The replace=TRUE parameter indicates that the rows are sampled with replacement; that is, the same source row may be selected more than once for the sample.
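Sampling with replacement is what allows the requested sample size to exceed the number of source rows. A small pandas sketch (the 10-row frame is illustrative) makes the effect visible:

```python
import pandas as pd

# A deliberately small source frame.
df = pd.DataFrame({"row": range(10)})

# replace=True lets the same source row be drawn more than once,
# so 199 rows can be sampled from only 10.
sample = df.sample(n=199, replace=True, random_state=0)

print(len(sample))                       # 199
print(sample["row"].duplicated().any())  # True
```

With replace=False, asking for more rows than the source contains would raise an error instead.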

After saving and running the data refinery flow, we will see a summary of our results (along with a list of our prior runs), as shown in the following screenshot:

We can see that we have created a sample file from the original, named combine.csv_shaped_199.csv, as shown in the following screenshot:
