The pandas library is built on the top of the numpy library. Therefore, in order to use pandas, you need to import numpy first. The pandas library introduces many additional data structures and functions. Let's start our pandas tour with the panda Series object. This is a one-dimensional array, like a numpy array; however, you can define an explicitly named index, and refer to those names to retrieve the data, not just to the positional index. Therefore, a pandas Series object already looks like a tuple in the relational model, or a row in a table. The following code imports both packages, numpy and pandas. Then it defines a simple pandas Series, without an explicit index. The Series looks like a simple single-dimensional array, and you can refer to elements through the positional index:
import numpy as np import pandas as pd ser1 = pd.Series([1, 2, 3, 4]) ser1[1:3]
Here are the results. I retrieved the second and the third element, position 1 and 2, with a zero-based positional index, as shown here:
1 2 2 3
Now I will create a Series with an explicitly named index, as shown in the following code:
ser1 = pd.Series([1, 2, 3, 4], index = ['a', 'b', 'c', 'd']) ser1['b':'c']
As you can see from the last example, you can refer to elements using the names of the index, which serve as column names in a SQL Server row. And here is the result:
b 2 c 3
You can create a Series from the dictionary object as well, as the following code shows:
dict1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4} ser1 = pd.Series(dict1)
Imagine you have multiple Series with the same structure stacked vertically. This is the pandas DataFrame object. It looks and behaves like the R data frame. You use the pandas DataFrame object to store and analyze tabular data from relational sources, or to export the result to the tabular destinations, like SQL Server. The following code creates two dictionaries, then a pandas Series from each one, and then a pandas DataFrame from both Series:
age_dict = {'John': 35, 'Mick': 75, 'Diane': 42} age = pd.Series(age_dict) weight_dict = {'John': 88.3, 'Mick': 72.7, 'Diane': 57.1} weight = pd.Series(weight_dict) people = pd.DataFrame({'age': age, 'weight': weight}) people
The resulting data frame is:
age weight Diane 42 57.1 John 35 88.3 Mick 75 72.7
You can see that the Series were joined by the common index values. The result looks like a table and is therefore suitable for exchanging data and results with relational systems.
You can do a lot of manipulations with a data frame. For the beginning, I am extracting some meta-data, and creating a projection by extracting a single column, as shown in the following code:
people.index people.columns people['age']
Here are the results:
Index(['Diane', 'John', 'Mick'], dtype='object') Index(['age', 'weight'], dtype='object') Diane 42 John 35 Mick 75
You can add a calculated column to a data frame. For example, the following code adds a column to the people data frame I am using here:
people['WeightDivAge'] = people['weight'] / people['age'] people
The result is:
age weight WeightDivAge Diane 42 57.1 1.359524 John 35 88.3 2.522857 Mick 75 72.7 0.969333
You can transform columns to rows with the T function of the pandas DataFrame, as shown in the following code:
people.T
You can see the transformed data frame from the result:
Diane John Mick age 42.000000 35.000000 75.000000 weight 57.100000 88.300000 72.700000 WeightDivAge 1.359524 2.522857 0.969333
You can refer to the elements of a data frame by position or by column and index names. A data frame is still a matrix and not a relational table, so the zero-based positional index is also accessible. You can even use expressions in order to filter a data frame. Here are some examples:
people.iloc[0:2, 0:2] people.loc['Diane':'John', 'age':'weight'] people.loc[people.age > 40, ['age', 'WeightDivAge']]
The results of the previous operations are:
age weight Diane 42 57.1 John 35 88.3 age weight Diane 42 57.1 John 35 88.3 age WeightDivAge Diane 42 1.359524 Mick 75 0.969333
From many other operations possible with data frames, let me just expose the joins. You can join two data frames with the pd.merge() method. The following code creates two data frames and joins them using the same index:
df1 = pd.DataFrame({'Person': ['Mary', 'Dejan', 'William', 'Milos'], 'BirthYear': [1978, 1962, 1993, 1982]}) df2 = pd.DataFrame({'Person': ['Mary', 'Milos', 'Dejan', 'William'], 'Group': ['Accounting', 'Development', 'Training', 'Training']}) pd.merge(df1, df2)
Here is the result:
BirthYear Person Group 0 1978 Mary Accounting 1 1962 Dejan Training 2 1993 William Training 3 1982 Milos Development
Note you could always use an explicit key name, or even different key names, one from the left and one from the right data frame involved in the join, as long as the key values match. Here is an example of using an explicit key name:
pd.merge(df1, df2, on = 'Person')
The result is, of course, the same as the previous one. This was a one-to-one join. You can also do one-to-many joins, as the following example shows:
df3 = pd.DataFrame({'Group': ['Accounting', 'Development', 'Training'], 'Supervisor': ['Carol', 'Pepe', 'Shasta']}) pd.merge(df2, df3)
The joined result is:
Group Person Supervisor 0 Accounting Mary Carol 1 Development Milos Pepe 2 Training Dejan Shasta 3 Training William Shasta
Finally, you can also perform many-to-many joins, as the last example in this section shows:
df4 = pd.DataFrame({'Group': ['Accounting', 'Accounting', 'Development', 'Development', 'Training'], 'Skills': ['math', 'spreadheets', 'coding', 'architecture', 'presentation']}) pd.merge(df2, df4)
The result of this example is:
Group Person Skills 0 Accounting Mary math 1 Accounting Mary spreadheets 2 Development Milos coding 3 Development Milos architecture 4 Training Dejan presentation 5 Training William presentation
I guess this chapter was quite terse so far, and also exhaustive. Unfortunately, all of this knowledge is needed to get to the data frames, which were our target from the beginning of the chapter. Therefore, it is finally time to do some more interesting things, namely analyze the data.