Interacting with data in binary format

We can read and write binary serialization of Python objects with the pickle module, which can be found in the standard library. Object serialization can be useful, if you work with objects that take a long time to create, like some machine learning models. By pickling such objects, subsequent access to this model can be made faster. It also allows you to distribute Python objects in a standardized way.

pandas includes support for pickling out of the box. The relevant methods are the read_pickle() and to_pickle() functions to read and write data from and to files easily. Those methods will write data to disk in the pickle format, which is a convenient short-term storage format:

>>> df_ex3.to_pickle('example_data/ex_06-03.out')
>>> pd.read_pickle('example_data/ex_06-03.out')
        1  2       3    4
0
Nam     7  1    male  hcm
Mai    11  1  female  hcm
Lan    25  3  female   hn
Hung   42  3    male   tn
Nghia  26  3    male   dn
Vinh   39  3    male   vl
Hong   28  4  female   dn

HDF5

HDF5 is not a database, but a data model and file format. It is suited for write-one, read-many datasets. An HDF5 file includes two kinds of objects: data sets, which are array-like collections of data, and groups, which are folder-like containers what hold data sets and other groups. There are some interfaces for interacting with HDF5 format in Python, such as h5py which uses familiar NumPy and Python constructs, such as dictionaries and NumPy array syntax. With h5py, we have high-level interface to the HDF5 API which helps us to get started. However, in this module, we will introduce another library for this kind of format called PyTables, which works well with pandas objects:

>>> store = pd.HDFStore('hdf5_store.h5')
>>> store
<class 'pandas.io.pytables.HDFStore'>
File path: hdf5_store.h5
Empty

We created an empty HDF5 file, named hdf5_store.h5. Now, we can write data to the file just like adding key-value pairs to a dict:

>>> store['ex3'] = df_ex3
>>> store['name'] = df_ex2[0]
>>> store['hometown'] = df_ex3[4]
>>> store
<class 'pandas.io.pytables.HDFStore'>
File path: hdf5_store.h5
/ex3                  frame        (shape->[7,4])
/hometown             series       (shape->[1])
/name                 series       (shape->[1])

Objects stored in the HDF5 file can be retrieved by specifying the object keys:

>>> store['name']
0      Nam
1      Mai
2      Lan
3     Hung
4    Nghia
5     Vinh
6     Hong
Name: 0, dtype: object

Once we have finished interacting with the HDF5 file, we close it to release the file handle:

>>> store.close()
>>> store
<class 'pandas.io.pytables.HDFStore'>
File path: hdf5_store.h5
File is CLOSED

There are other supported functions that are useful for working with the HDF5 format. You should explore ,in more detail, two libraries – pytables and h5py – if you need to work with huge quantities of data.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset