Chapter 6. Interacting with Databases

Data analysis starts with data. It is therefore beneficial to work with data storage systems that are simple to set up, operate and where the data access does not become a problem in itself. In short, we would like to have database systems that are easy to embed into our data analysis processes and workflows. In this module, we focus mostly on the Python side of the database interaction, and we will learn how to get data into and out of pandas data structures.

There are numerous ways to store data. In this chapter, we are going to learn to interact with three main categories: text formats, binary formats and databases. We will focus on two storage solutions, MongoDB and Redis. MongoDB is a document-oriented database, which is easy to start with, since we can store JSON documents and do not need to define a schema upfront. Redis is a popular in-memory data structure store on top of which many applications can be built. It is possible to use Redis as a fast key-value store, but Redis supports lists, sets, hashes, bit arrays and even advanced data structures such as HyperLogLog out of the box as well.

Interacting with data in text format

Text is a great medium and it's a simple way to exchange information. The following statement is taken from a quote attributed to Doug McIlroy: Write programs to handle text streams, because that is the universal interface.

In this section we will start reading and writing data from and to text files.

Reading data from text format

Normally, the raw data logs of a system are stored in multiple text files, which can accumulate a large amount of information over time. Thankfully, it is simple to interact with these kinds of files in Python.

pandas supports a number of functions for reading data from a text file into a DataFrame object. The most simple one is the read_csv() function. Let's start with a small example file:

$ cat example_data/ex_06-01.txt
Name,age,major_id,sex,hometown
Nam,7,1,male,hcm
Mai,11,1,female,hcm
Lan,25,3,female,hn
Hung,42,3,male,tn
Nghia,26,3,male,dn
Vinh,39,3,male,vl
Hong,28,4,female,dn

Note

The cat is the Unix shell command that can be used to print the content of a file to the screen.

In the preceding example file, each column is separated by comma and the first row is a header row, containing column names. To read the data file into the DataFrame object, we type the following command:

>>> df_ex1 = pd.read_csv('example_data/ex_06-01.txt')
>>> df_ex1
    Name  age  major_id     sex hometown
0    Nam    7         1    male      hcm
1    Mai   11         1  female      hcm
2    Lan   25         3  female       hn
3   Hung   42         3    male       tn
4  Nghia   26         3    male       dn
5   Vinh   39         3    male       vl
6   Hong   28         4  female       dn

We see that the read_csv function uses a comma as the default delimiter between columns in the text file and the first row is automatically used as a header for the columns. If we want to change this setting, we can use the sep parameter to change the separated symbol and set header=None in case the example file does not have a caption row.

See the following example:

$ cat example_data/ex_06-02.txt
Nam     7       1       male    hcm
Mai     11      1       female  hcm
Lan     25      3       female  hn
Hung    42      3       male    tn
Nghia   26      3       male    dn
Vinh    39      3       male    vl
Hong    28      4       female  dn

>>> df_ex2 = pd.read_csv('example_data/ex_06-02.txt',
                         sep = '	', header=None)
>>> df_ex2
       0   1  2       3    4
0    Nam   7  1    male  hcm
1    Mai  11  1  female  hcm
2    Lan  25  3  female   hn
3   Hung  42  3    male   tn
4  Nghia  26  3    male   dn
5   Vinh  39  3    male   vl
6   Hong  28  4  female   dn

We can also set a specific row as the caption row by using the header that's equal to the index of the selected row. Similarly, when we want to use any column in the data file as the column index of DataFrame, we set index_col to the name or index of the column. We again use the second data file example_data/ex_06-02.txt to illustrate this:

>>> df_ex3 = pd.read_csv('example_data/ex_06-02.txt',
                         sep = '	', header=None,
                         index_col=0)
>>> df_ex3
        1  2       3    4
0
Nam     7  1    male  hcm
Mai    11  1  female  hcm
Lan    25  3  female   hn
Hung   42  3    male   tn
Nghia  26  3    male   dn
Vinh   39  3    male   vl
Hong   28  4  female   dn

Apart from those parameters, we still have a lot of useful ones that can help us load data files into pandas objects more effectively. The following table shows some common parameters:

Parameter

Value

Description

dtype

Type name or dictionary of type of columns

Sets the data type for data or columns. By default it will try to infer the most appropriate data type.

skiprows

List-like or integer

The number of lines to skip (starting from 0).

na_values

List-like or dict, default None

Values to recognize as NA/NaN. If a dict is passed, this can be set on a per-column basis.

true_values

List

A list of values to be converted to Boolean True as well.

false_values

List

A list of values to be converted to Boolean False as well.

keep_default_na

Bool, default True

If the na_values parameter is present and keep_default_na is False, the default NaN values are ignored, otherwise they are appended to

thousands

Str, default None

The thousands separator

nrows

Int, default None

Limits the number of rows to read from the file.

error_bad_lines

Boolean, default True

If set to True, a DataFrame is returned, even if an error occurred during parsing.

Besides the read_csv() function, we also have some other parsing functions in pandas:

Function

Description

read_table

Read the general delimited file into DataFrame

read_fwf

Read a table of fixed-width formatted lines into DataFrame

read_clipboard

Read text from the clipboard and pass to read_table. It is useful for converting tables from web pages

In some situations, we cannot automatically parse data files from the disk using these functions. In that case, we can also open files and iterate through the reader, supported by the CSV module in the standard library:

$ cat example_data/ex_06-03.txt
Nam     7       1       male    hcm
Mai     11      1       female  hcm
Lan     25      3       female  hn
Hung    42      3       male    tn      single
Nghia   26      3       male    dn      single
Vinh    39      3       male    vl
Hong    28      4       female  dn

>>> import csv
>>> f = open('data/ex_06-03.txt')
>>> r = csv.reader(f, delimiter='	')
>>> for line in r:
>>>    print(line)
['Nam', '7', '1', 'male', 'hcm']
['Mai', '11', '1', 'female', 'hcm']
['Lan', '25', '3', 'female', 'hn']
['Hung', '42', '3', 'male', 'tn', 'single']
['Nghia', '26', '3', 'male', 'dn', 'single']
['Vinh', '39', '3', 'male', 'vl']
['Hong', '28', '4', 'female', 'dn']

Writing data to text format

We saw how to load data from a text file into a pandas data structure. Now, we will learn how to export data from the data object of a program to a text file. Corresponding to the read_csv() function, we also have the to_csv() function, supported by pandas. Let's see the following example:

>>> df_ex3.to_csv('example_data/ex_06-02.out', sep = ';')

The result will look like this:

$ cat example_data/ex_06-02.out
0;1;2;3;4
Nam;7;1;male;hcm
Mai;11;1;female;hcm
Lan;25;3;female;hn
Hung;42;3;male;tn
Nghia;26;3;male;dn
Vinh;39;3;male;vl
Hong;28;4;female;dn

If we want to skip the header line or index column when writing out data into a disk file, we can set a False value to the header and index parameters:

>>> import sys
>>> df_ex3.to_csv(sys.stdout, sep='	',
                  header=False, index=False)
7       1       male    hcm
11      1       female  hcm
25      3       female  hn
42      3       male    tn
26      3       male    dn
39      3       male    vl
28      4       female  dn

We can also write a subset of the columns of the DataFrame to the file by specifying them in the columns parameter:

>>> df_ex3.to_csv(sys.stdout, columns=[3,1,4],
                  header=False, sep='	')
Nam     male    7       hcm
Mai     female  11      hcm
Lan     female  25      hn
Hung    male    42      tn
Nghia   male    26      dn
Vinh    male    39      vl
Hong    female  28      dn

With series objects, we can use the same function to write data into text files, with mostly the same parameters as earlier.

Writing data to text format
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset