Processing arrays from tabular data

The meat of any data science application is finding an appropriate data handling routine for a given problem. In the case of machine learning, this means choosing a supervised or unsupervised method to predict or classify the data. Even before this step, a good amount of time is spent on data transformation, making the data suitable for these methods.

Usually, data is made available to a data science program in many ways. A data science programmer is faced with the challenge of accessing the data and making it available to the rest of the code using Python data structures. Mastering the various ways of accessing data through Python is very handy when writing a data science program, as it lets you get to the meat of the problem quickly.

Typically, data is available as a text file, with values separated by either a comma or a tab. Python's built-in file object can be used in this case. As we saw earlier, a file object implements the __iter__() and next() methods. This allows us to work on very large files that do not fit into memory, by reading only a small chunk of the file at a time.
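As a minimal sketch of this pattern (the filename data.txt is a placeholder), the following reads a comma-delimited file one line at a time instead of loading the whole file into memory:

# Iterate over a (possibly very large) text file one line at a time;
# data.txt is a placeholder filename and each line is comma-delimited
with open('data.txt') as f:
    for line in f:                          # the file object yields lines lazily
        columns = line.strip().split(",")   # split one row into its fields
        print(columns)                      # placeholder for real processing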

Python machine learning libraries such as scikit-learn are built on top of NumPy. In this section, we will look at ways of efficiently reading external data and converting it to NumPy arrays for downstream data processing.

Getting ready

NumPy provides us with a function called genfromtxt to create NumPy arrays from tabular data. Once the data is available as a NumPy array, it's much easier for downstream systems to process it. Let's look at how we can leverage genfromtxt. The following code was written using NumPy version 1.8.0.

How to do it…

Let's import the necessary libraries to start with. We will proceed to define a sample input. Finally, we will demonstrate how to process tabular data.

# 1. Let us simulate a small tabular input using StringIO
import numpy as np
from StringIO import StringIO
in_data = StringIO("10,20,30\n56,89,90\n33,46,89")

# 2. Read the input using NumPy's genfromtxt to create a NumPy array.
data = np.genfromtxt(in_data, dtype=int, delimiter=",")

# In cases where we may not need to use some columns
in_data = StringIO("10,20,30\n56,89,90\n33,46,89")
data = np.genfromtxt(in_data, dtype=int, delimiter=",", usecols=(0, 1))

# Providing column names
in_data = StringIO("10,20,30\n56,89,90\n33,46,89")
data = np.genfromtxt(in_data, dtype=int, delimiter=",", names="a,b,c")

# Using column names from the data
in_data = StringIO("a,b,c\n10,20,30\n56,89,90\n33,46,89")
data = np.genfromtxt(in_data, dtype=int, delimiter=",", names=True)

How it works…

In step 1, we simulated tabular data using the StringIO utility. We have three rows and three columns. The rows are newline-delimited and the columns are comma-delimited.

In step 2, we used genfromtxt from NumPy to ingest the data as a NumPy array.

The first argument to genfromtxt is the source of the data; in our case, it's the StringIO object. The input is comma-delimited; the delimiter argument allows us to specify this. After running the preceding code, the data value is as follows:

>>> data
array([[10, 20, 30],
       [56, 89, 90],
       [33, 46, 89]])

As you can see, we successfully loaded the data from the string into a NumPy array.

There's more…

The various parameters of the genfromtxt function and their default values are shown here:

genfromtxt(fname, dtype=<type 'float'>, comments='#', delimiter=None, skiprows=0, skip_header=0, skip_footer=0, converters=None, missing='', missing_values=None, filling_values=None, usecols=None, names=None, excludelist=None, deletechars=None, replace_space='_', autostrip=False, case_sensitive=True, defaultfmt='f%i', unpack=None, usemask=False, loose=True, invalid_raise=True)

The only mandatory argument is the source of the data. In our case, we used a StringIO object. It can be a string corresponding to the name of a file, or a file-like object with a read method. It can also be the URL of a remote file.
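As an illustration, any of the following forms work as the first argument (the file names and URL below are placeholders):

# a path on disk
data = np.genfromtxt('local_file.csv', delimiter=',')
# an already opened file-like object
data = np.genfromtxt(open('local_file.csv'), delimiter=',')
# the URL of a remote file
data = np.genfromtxt('http://example.com/data.csv', delimiter=',')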

The first step is to split each line into columns. Once the file is open for reading, genfromtxt splits every non-empty line into a sequence of strings. Empty lines are ignored, and so are commented lines; the comments option tells genfromtxt which lines are comments. The strings are then split into columns based on the delimiter specified by the delimiter option. In our example, we used a comma (,) as the delimiter. A tab (\t) is also a very popular delimiter. By default, the delimiter is None in genfromtxt, which means that it assumes the line is split into columns by whitespace.
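A small sketch of these two defaults, reusing the StringIO setup from this recipe:

# lines starting with '#' are treated as comments and skipped; with
# delimiter=None (the default) the columns are split on whitespace
in_data = StringIO("# a comment line\n10 20 30\n56 89 90")
data = np.genfromtxt(in_data, dtype=int)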

Typically, when lines are converted to a sequence of strings and the columns are subsequently extracted, the individual columns are not stripped of leading or trailing whitespace. This needs to be handled in a later part of the code, especially if some of the values are used as keys in a dictionary: if leading or trailing whitespace is not handled consistently, it may lead to bugs. Setting autostrip=True helps avoid this problem.
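A minimal sketch of the effect, using string-valued fields padded with spaces (purely illustrative data):

# string fields padded with spaces around the delimiter
in_data = StringIO("abc , def , ghi\njkl , mno , pqr")
# without autostrip the fields keep their surrounding spaces;
# with autostrip=True they come back as 'abc', 'def', and so on
data = np.genfromtxt(in_data, dtype="|S5", delimiter=",", autostrip=True)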

Many times, we want to skip, say, the top n rows or the bottom n rows while reading a file, for example due to the presence of headers or footers. Setting skip_header=n will skip the first n lines while reading and, similarly, skip_footer=n will skip the last n lines.
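For example, a sketch with a one-line header and a one-line footer around the same numbers:

in_data = StringIO("header line\n10,20,30\n56,89,90\n33,46,89\nfooter line")
# drop one line at the top and one at the bottom before parsing
data = np.genfromtxt(in_data, dtype=int, delimiter=",",
                     skip_header=1, skip_footer=1)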

Similar to unwanted rows, we may encounter many cases where we may not need to use some columns. The usecols argument is used to specify the list of columns that we are interested in:

in_data = StringIO("10,20,30\n56,89,90\n33,46,89")

data = np.genfromtxt(in_data, dtype=int, delimiter=",", usecols=(0, 1))

As you can see in the preceding example, we selected only two columns, columns 0 and 1. The data object looks as follows:

>>> data
array([[10, 20],
       [56, 89],
       [33, 46]])

Custom column names can be provided using the names argument, which takes a string of comma-separated column names:

in_data = StringIO("10,20,30\n56,89,90\n33,46,89")
data = np.genfromtxt(in_data, dtype=int, delimiter=",", names="a,b,c")

>>> data
array([(10, 20, 30), (56, 89, 90), (33, 46, 89)], 
      dtype=[('a', '<i4'), ('b', '<i4'), ('c', '<i4')])

By setting names to True, the first row of the input data is used as the column header:

in_data = StringIO("a,b,c\n10,20,30\n56,89,90\n33,46,89")
data = np.genfromtxt(in_data, dtype=int, delimiter=",", names=True)

>>> data
array([(10, 20, 30), (56, 89, 90), (33, 46, 89)], 
      dtype=[('a', '<i4'), ('b', '<i4'), ('c', '<i4')])

Another simple method from NumPy to create NumPy arrays from text input is loadtxt:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html

This is less sophisticated than genfromtxt; if you need a simple reader without any sophisticated data handling mechanisms, such as handling missing values, you can opt for loadtxt.
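A minimal loadtxt version of our earlier example, with the same StringIO input, might look as follows:

in_data = StringIO("10,20,30\n56,89,90\n33,46,89")
# loadtxt supports the common arguments (dtype, delimiter, usecols,
# skiprows) but has no machinery for handling missing values
data = np.loadtxt(in_data, dtype=int, delimiter=",")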

However, if we are not interested in loading the data as a NumPy array but want to load it as a list, Python provides us with the csv module in its standard library:

https://docs.python.org/2/library/csv.html
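For instance, a minimal sketch that reads a file into a list of rows (the filename data.csv is a placeholder; the 'rb' mode follows the Python 2 csv documentation):

import csv

with open('data.csv', 'rb') as f:
    reader = csv.reader(f, delimiter=',')
    rows = list(reader)    # a list of rows, each row a list of strings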

An interesting method in the preceding csv library is csv.Sniffer.sniff(). If we have a very large CSV file and want to understand its structure, we can use sniff(). It returns a Dialect subclass that holds most of the detected properties of the CSV file, such as the delimiter.
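A short sketch of how sniff() might be used on a large file (large.csv is a placeholder); only a small sample of the file is read to detect the dialect:

import csv

with open('large.csv', 'rb') as f:
    dialect = csv.Sniffer().sniff(f.read(4096))  # inspect the first few KB
    print(dialect.delimiter)                     # the detected field separator
    f.seek(0)                                    # rewind before actually reading
    reader = csv.reader(f, dialect)              # reuse the detected dialect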
