© A.J. Henley and Dave Wolf 2018
A.J. Henley and Dave Wolf, Learn Data Analysis with Python, https://doi.org/10.1007/978-1-4842-3486-0_4

4. Finding the Meaning

A. J. Henley(1) and Dave Wolf(2)
(1) Washington, D.C., District of Columbia, USA
(2) Sterling Business Advantage, LLC, Adamstown, Maryland, USA
 

The third stage of data analysis is actually analyzing the data. Finding meaning within your data can be difficult without the right tools. In this section, we look at some of the tools available to the Python user.

With just a few lines of code, you will be able to do the following kinds of analysis:
  • Compute descriptive statistics

  • Run correlations

  • Perform linear regressions

  • Build pivot tables

Computing Aggregate Statistics

As you may have seen in the last chapter, it is easy to get some summary statistics by using describe. Let’s take a look at how we can find those values directly.

First, let’s create some data (Listing 4-1).

import pandas as pd
names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]
GradeList = list(zip(names,grades))  # wrap in list() so pandas gets a concrete sequence
df = pd.DataFrame(data=GradeList,
        columns=['Names','Grades'])
df
Listing 4-1

Creating Dataset for Statistics

Once that is set up, we can take a look at some statistics (Listing 4-2).

df['Grades'].count()  # number of values
df['Grades'].mean()   # arithmetic average
df['Grades'].std()    # standard deviation
df['Grades'].min()    # minimum
df['Grades'].max()    # maximum
df['Grades'].quantile(.25)  # first quartile
df['Grades'].quantile(.5)   # second quartile
df['Grades'].quantile(.75)  # third quartile
Listing 4-2

Computing Aggregate Statistics
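For comparison, the describe method mentioned at the start of this section bundles most of these statistics into a single call; a minimal sketch, using the dataframe from Listing 4-1:

# count, mean, std, min, quartiles, and max in one shot
df['Grades'].describe()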

Note

If you tried to execute the previous code in one cell all at the same time, the only thing you would see is the output of the last line, the final .quantile() call. You have to try them one by one. I simply grouped them all together for reference purposes. OK?
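If you do want to see every value from a single cell, one option is to wrap each statistic in print(); a quick sketch:

print(df['Grades'].count())  # number of values
print(df['Grades'].mean())   # arithmetic average
print(df['Grades'].std())    # standard deviation
print(df['Grades'].min())    # minimum
print(df['Grades'].max())    # maximum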

It is important to note that the mean is not the only measure of central tendency. See Listing 4-3 for other measures.

# computes the arithmetic average of a column
# mean = dividing the sum by the number of values
df['Grades'].mean()
# finds the median of the values in a column
# median = the middle value if they are sorted in order
df['Grades'].median()
# finds the mode of the values in a column
# mode = the most common single value
df['Grades'].mode()
Listing 4-3

Other Measures of Central Tendency

And if you need to compute standard deviation, you might also need variance (Listing 4-4).

# computes the variance of the values in a column
df['Grades'].var()
Listing 4-4

Computing Variance

Finally, you don’t have to specify the column to compute the statistics. If you just run it against the whole dataframe, you will get the function to run on all applicable columns (Listing 4-5).

df.var(numeric_only=True)  # numeric_only skips text columns (required in recent pandas)
Listing 4-5

Computing Variance on All Numeric Columns

Your Turn

Of course, in our dataset we only have one numeric column. Try creating a dataframe and computing summary statistics using the dataset in Listing 4-6.

names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]
bsdegrees = [1,1,0,0,1]
msdegrees = [2,1,0,0,0]
phddegrees = [0,1,0,0,0]
Listing 4-6

Starting Dataset

Computing Aggregate Statistics on Matching Rows

It is possible to compute descriptive statistics on just the rows that match certain criteria. First, let’s create some data (Listing 4-7).

import pandas as pd
names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]
bs = [1,1,0,0,1]
ms = [2,1,0,0,0]
phd = [0,1,0,0,0]
GradeList = list(zip(names,grades,bs,ms,phd))
df = pd.DataFrame(data=GradeList,
        columns=['Name','Grade','BS','MS','PhD'])
df
Listing 4-7

Creating Dataset

OK, we have covered how to find the rows that match a set of criteria, and we have covered how to compute descriptive statistics, both all at once and one by one. If you put those two together, you will be able to find the statistics of the data that matches certain criteria.

So, to count the rows of the people without a PhD, use the code shown in Listing 4-8.

df.loc[df['PhD']==0].count()
Listing 4-8

Code for Computing Aggregate Statistics

You can use any of the aggregate statistics functions on individual columns in the same way. So, to find the average grade of those people without a PhD, use the code in Listing 4-9.

df.loc[df['PhD']==0]['Grade'].mean()
Listing 4-9

Computing Aggregate Statistics on a Particular Column

Your Turn

Using the data from Listing 4-10, what is the average grade for people with master’s degrees?

import pandas as pd
names = ['Bob','Jessica','Mary','John',
        'Mel','Sam','Cathy','Henry','Lloyd']
grades = [76,95,77,78,99,84,79,100,73]
bs = [1,1,0,0,1,1,1,0,1]
ms = [2,1,0,0,0,1,1,0,0]
phd = [0,1,0,0,0,2,1,0,0]
GradeList = list(zip(names,grades,bs,ms,phd))
df = pd.DataFrame(data=GradeList,
        columns=['Names','Grades','BS','MS','PhD'])
df
Listing 4-10

Dataset for Exercise

Sorting Data

Generally, we get data in a random order but need to use it in a completely different order. We can use the sort_values function to rearrange the data to suit our needs (Listing 4-11).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
Listing 4-11

Loading Data from CSV

Sort the dataframe’s rows by age, in descending order (Listing 4-12).

df = df.sort_values(by='age', ascending=False)
df.head()
Listing 4-12

Sorting by Age, Descending

Sort the dataframe’s rows by hours of study and then by exercise, in ascending order (Listing 4-13).

df = df.sort_values(by=['hours', 'exercise'],
        ascending=[True, True])
df.head()
Listing 4-13

Sorting by Hours of Study and Exercise, Ascending

Your Turn

Can you sort the dataframe to order it by name, age, and then grade?

Correlation

Correlation is any of a broad class of statistical relationships involving dependence, though in common usage it most often refers to the extent to which two variables have a linear relationship with each other. Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a product and its price.

Basically, correlation measures how closely two variables move in the same direction. Tall parents have tall kids? Highly correlated. Wear a lucky hat but rarely win at cards? Very slightly correlated. As your standard of living goes up, your level of savings plummets? Highly negatively correlated.

Simple, right?

Well, computing correlation can be a little difficult by hand, but is totally simple in Python (Listing 4-14).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
df.corr(numeric_only=True)  # numeric_only skips the text columns (needed in recent pandas)
Listing 4-14

Running a Correlation

 

           Age        Exercise   Hours      Grade
Age        1.000000  -0.003643  -0.017467  -0.007580
Exercise  -0.003643   1.000000   0.021105   0.161286
Hours     -0.017467   0.021105   1.000000   0.801955
Grade     -0.007580   0.161286   0.801955   1.000000

The intersections with the highest absolute values are the columns that are the most correlated. Positive values are positively correlated, which means they go up together. Negative values are negatively correlated (as one goes up the other goes down). And, of course, each column is perfectly correlated with itself. As you can see, hours of study and grade are highly correlated.
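If you would rather pull the strongest relationships out of the matrix programmatically instead of scanning it by eye, one approach (a sketch, not part of the original listing) is to unstack the correlation matrix and sort by absolute value:

corr = df.corr(numeric_only=True)    # the same matrix as above
pairs = corr.unstack()               # a Series indexed by (column, column) pairs
pairs = pairs[pairs.index.get_level_values(0) != pairs.index.get_level_values(1)]
pairs.abs().sort_values(ascending=False).head(4)   # each pair appears twice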

Your Turn

Load the data from the code in Listing 4-15 and find the correlations.

import pandas as pd
Location = "datasets/tamiami.csv"
df = pd.read_csv(Location)
Listing 4-15

Load Data from CSV

Regression

In statistical modeling, regression analysis is a statistical process for estimating the relationships among variables. This is a fancy way of saying that we use regression to create an equation that explains the value of a dependent variable based on one or several independent variables. Let's get our data (Listing 4-16).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
Listing 4-16

Load Data from CSV

Once we have that, we need to decide what columns we want to perform the regression on and which is the dependent variable. I want to try to predict the grade based on the age, hours of exercise, and hours of study (Listing 4-17).

import statsmodels.formula.api as sm
result = sm.ols(
        formula='grade ~ age + exercise + hours',
        data=df).fit()
result.summary()
Listing 4-17

First Regression

The formula format in line two is one that you need to learn and get comfortable editing. It shows the dependent variable on the left of the tilde (~) and the independent variables we want considered on the right.

If you look at the results you get from the summary, the R-squared represents the percentage of the variation in the data that can be accounted for by the regression. Here it is .664, or 66.4 percent, which is good but not great. The p-value (represented here by the value of P>|t|) represents the probability that the independent variable has no effect on the dependent variable. I like to keep my p-values less than 5 percent, so the only variable that stands out is age, at 59.5 percent.
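If you would rather pull these numbers out of the fitted model than read them off the summary table, the result object exposes them directly; a minimal sketch using the result from Listing 4-17:

print(result.rsquared)   # R-squared of the fit
print(result.pvalues)    # p-value for each term, including the intercept
print(result.params)     # the fitted coefficients

Either way, age is the weak link here. Let's rerun the regression, but leave out age (Listing 4-18).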

import statsmodels.formula.api as sm
result = sm.ols(
        formula='grade ~ exercise + hours',
        data=df).fit()
result.summary()
Listing 4-18

Second Regression

Looking at our new results, we haven't changed our R-squared, but we have eliminated all our high p-values. So, we can now look at our coefficients, and we end up with an equation that looks something like grade = 1.916 * hours of study + 0.989 * hours of exercise + 58.5316.
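Rather than typing that equation out by hand, you can also let statsmodels apply it for you; a minimal sketch, assuming result is the fit from Listing 4-18 (the student values below are made up for illustration):

new_student = pd.DataFrame({'exercise': [3], 'hours': [15]})  # hypothetical student
result.predict(new_student)   # predicted grade from the fitted equation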

Your Turn

Create a new column where you convert gender to numeric values, like 1 for female and 0 for male. Can you now add gender to your regression? Does this improve your R-squared?

Regression without Intercept

Sometimes, your equation works better without an intercept. This can happen even though your p-values indicate otherwise. I always try it both ways, just as a matter of course, to see what the R-squared is. To run your regression without an intercept, simply follow Listing 4-19.

import pandas as pd
import statsmodels.formula.api as sm
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
result = sm.ols(
        formula='grade ~ age + exercise + hours - 1', data=df).fit()
result.summary()
Listing 4-19

Run Regression without Intercept

Note that it is the - 1 at the end of the formula that tells Python that you wish to eliminate the intercept. If you look at the results, you can see we now have a much higher R-squared than we had in the last lesson, and we also have no p-values that cause us concern.

Your Turn

Try running these simple regressions with no intercept:
  1. Test for the relationship between just grade and age.

  2. Test for the relationship between just grade and exercise.

  3. Test for the relationship between just grade and hours of study.

If you had to pick just one, which one do you like best?

Basic Pivot Table

Pivot tables (or crosstabs) have revolutionized how Excel is used to do analysis. However, I like pivot tables in Python better than I do in Excel. Let’s get some data (Listing 4-20).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
Listing 4-20

Load Data from CSV

At its simplest, to get a pivot table we need a dataframe and an index (Listing 4-21).

pd.pivot_table(df, index=['gender'])
Listing 4-21

Get Averages of All Numeric Columns Categorized by Gender

As you can see, pivot_table is smart enough to assume that we want the averages of all the numeric columns. If we wanted to specify just one value, we could do that (Listing 4-22).

pd.pivot_table(df,
        values=['grade'],
        index=['gender'])
Listing 4-22

Average Grade by Gender

Gender    Grade
Female    82.7173
Male      82.3948

Now we see just the average grades, categorized by gender. If we wanted to, however, we could look at the minimum grade instead (Listing 4-23).

pd.pivot_table(df,
        values=['grade'],
        index=['gender'],
        aggfunc='min')
Listing 4-23

Minimum Grade by Gender

Gender    Grade
Female    2
Male      0

We can also add other columns to the index. So, to view the maximum hours of study categorized by gender and age, we simply run the code in Listing 4-24.

pd.pivot_table(df,
        index=['gender','age'],
        aggfunc='max',
        values=['hours'])
Listing 4-24

Max Hours of Study by Gender and Age

Gender    Age    Hours
Female    14     20
          15     20
          16     19
          17     20
          18     20
          19     20
Male      14     19
          15     20
          16     20
          17     20
          18     20
          19     20

We can also have multiple value columns. So, to show the average grade and hours of study by gender, we can run the code in Listing 4-25.

pd.pivot_table(df,
        index=['gender'],
        aggfunc='mean',
        values=['grade','hours'])
Listing 4-25

Average Grade and Hours by Gender

Gender    Grade      Hours
Female    82.7173    10.932
Male      82.3948    11.045

We can also perform pivot tables on subsets of the data. First, select your data, then do a standard pivot on that selection. So, to show the average grade and hours of study by gender for students who are 17 years old, we can run the code in Listing 4-26.

df2 = df.loc[df['age'] == 17]
pd.pivot_table(df2,
        index=['gender'],
        aggfunc='mean',
        values=['grade','hours'])
Listing 4-26

Average Grade and Hours by Gender for 17-Year-Old Students

Gender    Grade        Hours
Female    83.599435    10.943503
Male      82.949721    11.268156

Finally, we can include totals on our Python pivot tables, as shown in Listing 4-27.

df2 = df.loc[df['age'] == 17]
pd.pivot_table(df2,
        index=['gender'],
        aggfunc='mean',
        values=['grade','hours'],
        margins=True)
Listing 4-27

Average Grade and Hours by Gender, with Totals

Gender    Grade        Hours
Female    83.599435    10.943503
Male      82.949721    11.268156
All       83.272753    11.106742

Your Turn

Can you create a pivot table showing the average grade by gender of people who had more than two hours of exercise?
