The pandas data structure

Let's first get acquainted with two of pandas' primary data structures: the Series and the DataFrame. They can handle the majority of use cases in finance, statistic, social science, and many areas of engineering.

Series

A Series is a one-dimensional object similar to an array, list, or column in table. Each item in a Series is assigned to an entry in an index:

>>> s1 = pd.Series(np.random.rand(4),
                   index=['a', 'b', 'c', 'd'])
>>> s1
a    0.6122
b    0.98096
c    0.3350
d    0.7221
dtype: float64

By default, if no index is passed, it will be created to have values ranging from 0 to N-1, where N is the length of the Series:

>>> s2 = pd.Series(np.random.rand(4))
>>> s2
0    0.6913
1    0.8487
2    0.8627
3    0.7286
dtype: float64

We can access the value of a Series by using the index:

>>> s1['c']
0.3350
>>>s1['c'] = 3.14
>>> s1['c', 'a', 'b']
c    3.14
a    0.6122
b    0.98096

This accessing method is similar to a Python dictionary. Therefore, pandas also allows us to initialize a Series object directly from a Python dictionary:

>>> s3 = pd.Series({'001': 'Nam', '002': 'Mary',
                    '003': 'Peter'})
>>> s3
001    Nam
002    Mary
003    Peter
dtype: object

Sometimes, we want to filter or rename the index of a Series created from a Python dictionary. At such times, we can pass the selected index list directly to the initial function, similarly to the process in the preceding example. Only elements that exist in the index list will be in the Series object. Conversely, indexes that are missing in the dictionary are initialized to default NaN values by pandas:

>>> s4 = pd.Series({'001': 'Nam', '002': 'Mary',
                    '003': 'Peter'}, index=[
                    '002', '001', '024', '065'])
>>> s4
002    Mary
001    Nam
024    NaN
065    NaN
dtype:   object
ect

The library also supports functions that detect missing data:

>>> pd.isnull(s4)
002    False
001    False
024    True
065    True
dtype: bool

Similarly, we can also initialize a Series from a scalar value:

>>> s5 = pd.Series(2.71, index=['x', 'y'])
>>> s5
x    2.71
y    2.71
dtype: float64

A Series object can be initialized with NumPy objects as well, such as ndarray. Moreover, pandas can automatically align data indexed in different ways in arithmetic operations:

>>> s6 = pd.Series(np.array([2.71, 3.14]), index=['z', 'y'])
>>> s6
z    2.71
y    3.14
dtype: float64
>>> s5 + s6
x    NaN
y    5.85
z    NaN
dtype: float64

The DataFrame

The DataFrame is a tabular data structure coprising a set of ordered columns and rows. It can be thought of as a group of Series objects that share an index (the column names). There are a number of ways to initialize a DataFrame object. Firstly, let's take a look at the common example of creating DataFrame from a dictionary of lists:

>>> data = {'Year': [2000, 2005, 2010, 2014],
         'Median_Age': [24.2, 26.4, 28.5, 30.3],
         'Density': [244, 256, 268, 279]}
>>> df1 = pd.DataFrame(data)
>>> df1
    Density    Median_Age    Year
0  244        24.2        2000
1  256        26.4        2005
2  268        28.5        2010
3  279        30.3        2014

By default, the DataFrame constructor will order the column alphabetically. We can edit the default order by passing the column's attribute to the initializing function:

>>> df2 = pd.DataFrame(data, columns=['Year', 'Density', 
                                      'Median_Age'])
>>> df2
    Year    Density    Median_Age
0    2000    244        24.2
1    2005    256        26.4
2    2010    268        28.5
3    2014    279        30.3
>>> df2.index
Int64Index([0, 1, 2, 3], dtype='int64')

We can provide the index labels of a DataFrame similar to a Series:

>>> df3 = pd.DataFrame(data, columns=['Year', 'Density',  
                   'Median_Age'], index=['a', 'b', 'c', 'd'])
>>> df3.index
Index([u'a', u'b', u'c', u'd'], dtype='object')

We can construct a DataFrame out of nested lists as well:

>>> df4 = pd.DataFrame([
    ['Peter', 16, 'pupil', 'TN', 'M', None],
    ['Mary', 21, 'student', 'SG', 'F', None],
    ['Nam', 22, 'student', 'HN', 'M', None],
    ['Mai', 31, 'nurse', 'SG', 'F', None],
    ['John', 28, 'laywer', 'SG', 'M', None]],
columns=['name', 'age', 'career', 'province', 'sex', 'award'])

Columns can be accessed by column name as a Series can, either by dictionary-like notation or as an attribute, if the column name is a syntactically valid attribute name:

>>> df4.name    # or df4['name'] 
0    Peter
1    Mary
2    Nam
3    Mai
4    John
Name: name, dtype: object

To modify or append a new column to the created DataFrame, we specify the column name and the value we want to assign:

>>> df4['award'] = None
>>> df4
    name age   career province  sex award
0  Peter  16    pupil       TN    M  None
1    Mary  21  student       SG    F  None
2    Nam   22  student       HN  M  None
3    Mai    31    nurse        SG    F    None
4    John    28    lawer        SG    M    None

Using a couple of methods, rows can be retrieved by position or name:

>>> df4.ix[1]
name           Mary
age              21
career      student
province         SG
sex               F
award          None
Name: 1, dtype: object

A DataFrame object can also be created from different data structures such as a list of dictionaries, a dictionary of Series, or a record array. The method to initialize a DataFrame object is similar to the preceding examples.

Another common case is to provide a DataFrame with data from a location such as a text file. In this situation, we use the read_csv function that expects the column separator to be a comma, by default. However, we can change that by using the sep parameter:

# person.csv file
name,age,career,province,sex
Peter,16,pupil,TN,M
Mary,21,student,SG,F
Nam,22,student,HN,M
Mai,31,nurse,SG,F
John,28,lawer,SG,M
# loading person.cvs into a DataFrame
>>> df4 = pd.read_csv('person.csv')
>>> df4
     name   age   career   province  sex
0    Peter    16    pupil       TN        M
1    Mary     21    student     SG       F
2    Nam      22    student     HN       M
3    Mai      31    nurse       SG       F
4    John     28    laywer      SG       M

While reading a data file, we sometimes want to skip a line or an invalid value. As for pandas 0.16.2, read_csv supports over 50 parameters for controlling the loading process. Some common useful parameters are as follows:

  • sep: This is a delimiter between columns. The default is comma symbol.
  • dtype: This is a data type for data or columns.
  • header: This sets row numbers to use as the column names.
  • skiprows: This skips line numbers to skip at the start of the file.
  • error_bad_lines: This shows invalid lines (too many fields) that will, by default, cause an exception, such that no DataFrame will be returned. If we set the value of this parameter as false, the bad lines will be skipped.

Moreover, pandas also has support for reading and writing a DataFrame directly from or to a database such as the read_frame or write_frame function within the pandas module. We will come back to these methods later in this chapter.

The DataFrame
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset