© A.J. Henley and Dave Wolf 2018
A.J. Henley and Dave Wolf, Learn Data Analysis with Python, https://doi.org/10.1007/978-1-4842-3486-0_4

4. Finding the Meaning

A. J. Henley(1) and Dave Wolf(2)
(1) Washington, D.C., District of Columbia, USA
(2) Sterling Business Advantage, LLC, Adamstown, Maryland, USA
 

The third stage of data analysis is actually analyzing the data. Finding meaning within your data can be difficult without the right tools. In this section, we look at some of the tools available to the Python user.

With just a few lines of code, you will be able to do the following kinds of analysis:
  • Compute descriptive statistics

  • Run correlations

  • Perform linear regressions

  • Build pivot tables

Computing Aggregate Statistics

As you may have seen in the last chapter, it is easy to get some summary statistics by using describe. Let’s take a look at how we can find those values directly.

First, let’s create some data (Listing 4-1).

import pandas as pd
names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]
GradeList = list(zip(names,grades))  # wrap in list() so pandas gets a concrete sequence
df = pd.DataFrame(data=GradeList,
        columns=['Names','Grades'])
df
Listing 4-1

Creating Dataset for Statistics

Once that is set up, we can take a look at some statistics (Listing 4-2).

df['Grades'].count()  # number of values
df['Grades'].mean()   # arithmetic average
df['Grades'].std()    # standard deviation
df['Grades'].min()    # minimum
df['Grades'].max()    # maximum
df['Grades'].quantile(.25)  # first quartile
df['Grades'].quantile(.5)   # second quartile
df['Grades'].quantile(.75)  # third quartile
Listing 4-2

Computing Aggregate Statistics
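For comparison, the describe method mentioned at the start of this section bundles most of these statistics into a single call; a minimal sketch, using the dataframe from Listing 4-1:

# count, mean, std, min, quartiles, and max in one shot
df['Grades'].describe()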

Note

If you tried to execute the previous code in one cell all at the same time, the only thing you would see is the output of the last line, the final .quantile() call. You have to try them one by one. I simply grouped them all together for reference purposes. OK?
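If you do want to see every value from a single cell, one option is to wrap each statistic in print(); a quick sketch:

print(df['Grades'].count())  # number of values
print(df['Grades'].mean())   # arithmetic average
print(df['Grades'].std())    # standard deviation
print(df['Grades'].min())    # minimum
print(df['Grades'].max())    # maximum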

It is important to note that the mean is not the only measure of central tendency. See Listing 4-3 for other measures.

# computes the arithmetic average of a column
# mean = dividing the sum by the number of values
df['Grades'].mean()
# finds the median of the values in a column
# median = the middle value if they are sorted in order
df['Grades'].median()
# finds the mode of the values in a column
# mode = the most common single value
df['Grades'].mode()
Listing 4-3

Other Measures of Central Tendency

And if you need to compute standard deviation, you might also need variance (Listing 4-4).

# computes the variance of the values in a column
df['Grades'].var()
Listing 4-4

Computing Variance

Finally, you don’t have to specify the column to compute the statistics. If you just run it against the whole dataframe, you will get the function to run on all applicable columns (Listing 4-5).

df.var(numeric_only=True)  # numeric_only skips text columns (required in recent pandas)
Listing 4-5

Computing Variance on All Numeric Columns

Your Turn

Of course, in our dataset we only have one numeric column. Try creating a dataframe and computing summary statistics using the dataset in Listing 4-6.

names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]
bsdegrees = [1,1,0,0,1]
msdegrees = [2,1,0,0,0]
phddegrees = [0,1,0,0,0]
Listing 4-6

Starting Dataset

Computing Aggregate Statistics on Matching Rows

It is possible to compute descriptive statistics on just the rows that match certain criteria. First, let’s create some data (Listing 4-7).

import pandas as pd
names = ['Bob','Jessica','Mary','John','Mel']
grades = [76,95,77,78,99]
bs = [1,1,0,0,1]
ms = [2,1,0,0,0]
phd = [0,1,0,0,0]
GradeList = list(zip(names,grades,bs,ms,phd))
df = pd.DataFrame(data=GradeList,
        columns=['Name','Grade','BS','MS','PhD'])
df
Listing 4-7

Creating Dataset

OK, we have covered how to find the rows that match a set of criteria, and we have covered how to compute descriptive statistics, both all at once and one by one. If you put those two together, you will be able to find the statistics of the data that matches certain criteria.

So, to count the rows of the people without a PhD, use the code shown in Listing 4-8.

df.loc[df['PhD']==0].count()
Listing 4-8

Code for Computing Aggregate Statistics

You can use any of the aggregate statistics functions on individual columns in the same way. So, to find the average grade of those people without a PhD, use the code in Listing 4-9.

df.loc[df['PhD']==0]['Grade'].mean()
Listing 4-9

Computing Aggregate Statistics on a Particular Column

Your Turn

Using the data from Listing 4-10, what is the average grade for people with master’s degrees?

import pandas as pd
names = ['Bob','Jessica','Mary','John',
        'Mel','Sam','Cathy','Henry','Lloyd']
grades = [76,95,77,78,99,84,79,100,73]
bs = [1,1,0,0,1,1,1,0,1]
ms = [2,1,0,0,0,1,1,0,0]
phd = [0,1,0,0,0,2,1,0,0]
GradeList = list(zip(names,grades,bs,ms,phd))
df = pd.DataFrame(data=GradeList,
        columns=['Names','Grades','BS','MS','PhD'])
df
Listing 4-10

Dataset for Exercise

Sorting Data

Generally, we get data in a random order but need to use it in a completely different order. We can use the sort_values function to rearrange the data to suit our needs (Listing 4-11).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
Listing 4-11

Loading Data from CSV

Sort the dataframe’s rows by age, in descending order (Listing 4-12).

df = df.sort_values(by='age', ascending=False)
df.head()
Listing 4-12

Sorting by Age, Descending

Sort the dataframe’s rows by hours of study and then by exercise, in ascending order (Listing 4-13).

df = df.sort_values(by=['hours', 'exercise'],
        ascending=[True, True])
df.head()
Listing 4-13

Sorting by Hours of Study and Exercise, Ascending

Your Turn

Can you sort the dataframe to order it by name, age, and then grade?

Correlation

Correlation is any of a broad class of statistical relationships involving dependence, though in common usage it most often refers to the extent to which two variables have a linear relationship with each other. Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a product and its price.

Basically, correlation measures how closely two variables move in the same direction. Tall parents have tall kids? Highly correlated. Wear a lucky hat but rarely win at cards? Very slightly correlated. As your standard of living goes up, your level of savings plummets? Highly negatively correlated.

Simple, right?

Well, computing correlation can be a little difficult by hand, but is totally simple in Python (Listing 4-14).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
df.corr(numeric_only=True)  # numeric_only skips the text columns (needed in recent pandas)
Listing 4-14

Running a Correlation

 

           Age        Exercise   Hours      Grade
Age        1.000000  -0.003643  -0.017467  -0.007580
Exercise  -0.003643   1.000000   0.021105   0.161286
Hours     -0.017467   0.021105   1.000000   0.801955
Grade     -0.007580   0.161286   0.801955   1.000000

The intersections with the highest absolute values are the columns that are the most correlated. Positive values are positively correlated, which means they go up together. Negative values are negatively correlated (as one goes up the other goes down). And, of course, each column is perfectly correlated with itself. As you can see, hours of study and grade are highly correlated.
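If you would rather pull the strongest relationships out of the matrix programmatically instead of scanning it by eye, one approach (a sketch, not part of the original listing) is to unstack the correlation matrix and sort by absolute value:

corr = df.corr(numeric_only=True)    # the same matrix as above
pairs = corr.unstack()               # a Series indexed by (column, column) pairs
pairs = pairs[pairs.index.get_level_values(0) != pairs.index.get_level_values(1)]
pairs.abs().sort_values(ascending=False).head(4)   # each pair appears twice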

Your Turn

Load the data from the code in Listing 4-15 and find the correlations.

import pandas as pd
Location = "datasets/tamiami.csv"
df = pd.read_csv(Location)
Listing 4-15

Load Data from CSV

Regression

In statistical modeling, regression analysis is a statistical process for estimating the relationships among variables. This is a fancy way of saying that we use regression to create an equation that explains the value of a dependent variable based on one or several independent variables. Let's get our data (Listing 4-16).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
Listing 4-16

Load Data from CSV

Once we have that, we need to decide what columns we want to perform the regression on and which is the dependent variable. I want to try to predict the grade based on the age, hours of exercise, and hours of study (Listing 4-17).

import statsmodels.formula.api as sm
result = sm.ols(
        formula='grade ~ age + exercise + hours',
        data=df).fit()
result.summary()
Listing 4-17

First Regression

The formula format in line two is one that you need to learn and get comfortable editing. It shows the dependent variable on the left of the tilde (~) and the independent variables we want considered on the right.

If you look at the results you get from the summary, the R-squared represents the percentage of the variation in the data that can be accounted for by the regression. Here it is .664, or 66.4 percent, which is good but not great. The p-value (represented here by the value of P>|t|) represents the probability that the independent variable has no effect on the dependent variable. I like to keep my p-values less than 5 percent, so the only variable that stands out is age, at 59.5 percent.
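If you would rather pull these numbers out of the fitted model than read them off the summary table, the result object exposes them directly; a minimal sketch using the result from Listing 4-17:

print(result.rsquared)   # R-squared of the fit
print(result.pvalues)    # p-value for each term, including the intercept
print(result.params)     # the fitted coefficients

Either way, age is the weak link here. Let's rerun the regression, but leave out age (Listing 4-18).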

import statsmodels.formula.api as sm
result = sm.ols(
        formula='grade ~ exercise + hours',
        data=df).fit()
result.summary()
Listing 4-18

Second Regression

Looking at our new results, we haven't changed our R-squared, but we have eliminated all our high p-values. So, we can now look at our coefficients, and we end up with an equation that looks something like grade = 1.916 * hours of study + 0.989 * hours of exercise + 58.5316.
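Rather than typing that equation out by hand, you can also let statsmodels apply it for you; a minimal sketch, assuming result is the fit from Listing 4-18 (the student values below are made up for illustration):

new_student = pd.DataFrame({'exercise': [3], 'hours': [15]})  # hypothetical student
result.predict(new_student)   # predicted grade from the fitted equation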

Your Turn

Create a new column where you convert gender to numeric values, like 1 for female and 0 for male. Can you now add gender to your regression? Does this improve your R-squared?

Regression without Intercept

Sometimes, your equation works better without an intercept. This can happen even though your p-values indicate otherwise. I always try it both ways, just as a matter of course, to see what the R-squared is. To run your regression without an intercept, simply follow Listing 4-19.

import pandas as pd
import statsmodels.formula.api as sm
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
result = sm.ols(
        formula='grade ~ age + exercise + hours - 1', data=df).fit()
result.summary()
Listing 4-19

Run Regression without Intercept

Note that it is the - 1 at the end of the formula that tells Python that you wish to eliminate the intercept. If you look at the results, you can see we now have a much higher R-squared than we had in the last lesson, and we also have no p-values that cause us concern.

Your Turn

Try running these simple regressions with no intercept:
  1. Test for the relationship between just grade and age.

  2. Test for the relationship between just grade and exercise.

  3. Test for the relationship between just grade and hours of study.

If you had to pick just one, which one do you like best?

Basic Pivot Table

Pivot tables (or crosstabs) have revolutionized how Excel is used to do analysis. However, I like pivot tables in Python better than I do in Excel. Let’s get some data (Listing 4-20).

import pandas as pd
Location = "datasets/gradedata.csv"
df = pd.read_csv(Location)
df.head()
Listing 4-20

Load Data from CSV

At its simplest, to get a pivot table we need a dataframe and an index (Listing 4-21).

pd.pivot_table(df, index=['gender'])
Listing 4-21

Get Averages of All Numeric Columns Categorized by Gender

As you can see, pivot_table is smart enough to assume that we want the averages of all the numeric columns. If we wanted to specify just one value, we could do that (Listing 4-22).

pd.pivot_table(df,
        values=['grade'],
        index=['gender'])
Listing 4-22

Average Grade by Gender

Gender    Grade
Female    82.7173
Male      82.3948

Now we see just the average grades, categorized by gender. If we wanted to, however, we could look at the minimum grade instead (Listing 4-23).

pd.pivot_table(df,
        values=['grade'],
        index=['gender'],
        aggfunc='min')
Listing 4-23

Minimum Grade by Gender

Gender    Grade
Female    2
Male      0

We can also add other columns to the index. So, to view the maximum hours of study categorized by gender and age, we simply run the code in Listing 4-24.

pd.pivot_table(df,
        index=['gender','age'],
        aggfunc='max',
        values=['hours'])
Listing 4-24

Max Hours of Study by Gender and Age

Gender    Age    Hours
Female    14     20
          15     20
          16     19
          17     20
          18     20
          19     20
Male      14     19
          15     20
          16     20
          17     20
          18     20
          19     20

We can also have multiple value columns. So, to show the average grade and hours of study by gender, we can run the code in Listing 4-25.

pd.pivot_table(df,
        index=['gender'],
        aggfunc='mean',
        values=['grade','hours'])
Listing 4-25

Average Grade and Hours by Gender

Gender    Grade      Hours
Female    82.7173    10.932
Male      82.3948    11.045

We can also perform pivot tables on subsets of the data. First, select your data, then do a standard pivot on that selection. So, to show the average grade and hours of study by gender for students who are 17 years old, we can run the code in Listing 4-26.

df2 = df.loc[df['age'] == 17]
pd.pivot_table(df2,
        index=['gender'],
        aggfunc='mean',
        values=['grade','hours'])
Listing 4-26

Average Grade and Hours by Gender for 17-Year-Old Students

Gender    Grade        Hours
Female    83.599435    10.943503
Male      82.949721    11.268156

Finally, we can include totals on our Python pivot tables, as shown in Listing 4-27.

df2 = df.loc[df['age'] == 17]
pd.pivot_table(df2,
        index=['gender'],
        aggfunc='mean',
        values=['grade','hours'],
        margins=True)
Listing 4-27

Average Grade and Hours by Gender, with Totals

Gender    Grade        Hours
Female    83.599435    10.943503
Male      82.949721    11.268156
All       83.272753    11.106742

Your Turn

Can you create a pivot table showing the average grade by gender of people who had more than two hours of exercise?
