Loading the data

We now have a file of raw (although somewhat structured) data. A typical first task when preparing data for analysis is to add column headings. If the file is of a reasonable size, you can open it in a programmer's text editor and add a heading row; if not, you can accomplish this directly in your Spark notebook.

Assuming you've loaded the file into your Watson project (using the process shown in previous chapters), you can click on Insert to code and then select Insert pandas DataFrame object, as shown in the following screenshot:

When you click on Insert pandas DataFrame, code is generated and added to the notebook for you. The generated code imports any required packages, accesses the data file (with the appropriate credentials), and loads the data into a DataFrame. You can then modify the pd.read_csv command (within the code) to include the names parameter (as shown in the following code).

This assigns the provided column names to the DataFrame, effectively adding a heading row to the data:

df_data_1 = pd.read_csv(body, sep=',', names=['STATION', 'DATE', 'METRIC', 'VALUE', 'C5', 'C6', 'C7', 'C8'])

The result of running the code in the cell is shown in the following screenshot:
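
Rather than relying only on the screenshot, you can confirm the headings programmatically in the next notebook cell. The following check is purely illustrative (it is not part of the generated code) and assumes the df_data_1 DataFrame created in the cell above:

# Confirm that the column names passed via the names parameter were applied
print(df_data_1.columns.tolist())
# Expected: ['STATION', 'DATE', 'METRIC', 'VALUE', 'C5', 'C6', 'C7', 'C8']

# Display the first few rows under the new headings
print(df_data_1.head())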

The raw data in the base file has the format shown in the following screenshot:

Hopefully, you can see that each row contains a weather station identifier, a date, a metric that was collected (such as precipitation, daily maximum and minimum temperature, temperature at the time of observation, snowfall, or snow depth), its value, and some additional columns (note that missing values may show as NaN, meaning Not a Number).
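
At this point, it can be useful to see which metrics actually appear in the file and where the missing values are concentrated. The following is a minimal illustrative sketch that uses only the column names assigned earlier; the exact metric codes and counts will depend on your particular weather data file:

# List the distinct metric codes and how often each one appears
print(df_data_1['METRIC'].value_counts())

# Count missing (NaN) entries in each column
print(df_data_1.isna().sum())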
