The third stage of data analysis is actually analyzing the data. Finding meaning within your data can be difficult without the right tools. In this section, we look at some of the tools available to the Python user.
Compute descriptive statistics
Correlation
Linear regression
Pivot tables
Computing Aggregate Statistics
As you may have seen in the last chapter, it is easy to get some summary statistics by using describe. Let’s take a look at how we can find those values directly.
First, let’s create some data (Listing 4-1).
Creating Dataset for Statistics
Once that is set up, we can take a look at some statistics (Listing 4-2).
Computing Aggregate Statistics
Note
If you tried to execute the previous code in one cell all at the same time, the only thing you would see is the output of the .quantile() function. You have to try them one by one. I simply grouped them all together for reference purposes. OK?
It is important to note that the mean is not the only measure of central tendency. See Listing 4-3 for other measures.
Other Measures of Central Tendency
And if you need to compute standard deviation, you might also need variance (Listing 4-4).
Computing Variance
Finally, you don’t have to specify the column to compute the statistics. If you just run it against the whole dataframe, you will get the function to run on all applicable columns (Listing 4-5).
Computing Variance on All Numeric Columns
Your Turn
Of course, in our dataset we only have one column. Try creating a dataframe and computing summary statistics using the dataset in Listing 4-6.
Starting Dataset
Computing Aggregate Statistics on Matching Rows
It is possible to compute descriptive statistics on just the rows that match certain criteria. First, let’s create some data (Listing 4-7).
Creating Dataset
Ok, we have covered how to find the rows that match a set of criteria. We have also covered how to compute descriptive statistics, both all at once and one by one. If you put those two together, you will be able to find the statistics of the data that matches certain criteria.
So, to count the rows of the people without a PhD, use the code shown in Listing 4-8.
Code for Computing Aggregate Statistics
You can use any of the aggregate statistics functions on individual columns in the same way. So, to find the average grade of those people without a PhD, use the code in Listing 4-9.
Computing Aggregate Statistics on a Particular Column
Your Turn
Using the data from Listing 4-10, what is the average grade for people with master’s degrees?
Dataset for Exercise
Sorting Data
Generally, we get data in a random order, but need to use it in a completely different order. We can use the sort_values function to rearrange our data to our needs (Listing 4-11).
Loading Data from CSV
Sort the dataframe’s rows by age, in descending order (Listing 4-12).
Sorting by Age, Descending
Sort the dataframe’s rows by hours of study and then by exercise, in ascending order (Listing 4-13).
Sorting by Hours of Study and Exercise, Ascending
Your Turn
Can you sort the dataframe to order it by name, age, and then grade?
Correlation
Correlation is any of a broad class of statistical relationships involving dependence, though in common usage it most often refers to the extent to which two variables have a linear relationship with each other. Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a product and its price.
Basically, correlation measures how closely two variables move in the same direction. Tall parents have tall kids? Highly correlated. Wear lucky hat, but rarely win at cards? Very slightly correlated. As your standard of living goes up, your level of savings plummet? Highly negatively correlated.
Simple, right?
Well, computing correlation can be a little difficult by hand, but is totally simple in Python (Listing 4-14).
Running a Correlation
Age | Exercise | Hours | Grade | |
---|---|---|---|---|
Age | 1.000000 | -0.003643 | -0.017467 | -0.007580 |
Exercise | -0.003643 | 1.000000 | 0.021105 | 0.161286 |
Hours | -0.017467 | 0.021105 | 1.000000 | 0.801955 |
Grade | -0.007580 | 0.161286 | 0.801955 | 1.000000 |
The intersections with the highest absolute values are the columns that are the most correlated. Positive values are positively correlated, which means they go up together. Negative values are negatively correlated (as one goes up the other goes down). And, of course, each column is perfectly correlated with itself. As you can see, hours of study and grade are highly correlated.
Your Turn
Load the data from the code in Listing 4-15 and find the correlations.
Load Data from CSV
Regression
In statistical modeling, regression analysis is a statistical process for estimating the relationships among variables. This is a fancy way of saying that we use regression to create an equation that explains the value of a dependent variable based on one or several independent variables. Let’s get our data (Listing 4-16),
Load Data from CSV
Once we have that, we need to decide what columns we want to perform the regression on and which is the dependent variable. I want to try to predict the grade based on the age, hours of exercise, and hours of study (Listing 4-17).
First Regression
The formula format in line two is one that you need to learn and get comfortable editing. It shows the dependent variable on the left of the tilde (~) and the independent variables we want considered on the right.
If you look at the results you get from the summary, the R-squared represents the percentage of the variation in the data that can be accounted for by the regression. .664, or 66.4 percent, is good, but not great. The p-value (represented here by the value of P>|t|) represents the probability that the independent variable has no effect on the dependent variable. I like to keep my p-values less than 5 percent, so the only variable that stands out is the age with 59.5 percent. Let’s rerun the regression, but leaving out the age (Listing 4-18).
Second Regression
Looking at our new results, we haven’t changed our R-squared, but we have eliminated all our high p-values. So, we can now look at our coefficients, and we will end up with an equation that looks something like grade = 1.916 * hours of study +.989 * hours of exercise + 58.5316.
Your Turn
Create a new column where you convert gender to numeric values, like 1 for female and 0 for male. Can you now add gender to your regression? Does this improve your R-squared?
Regression without Intercept
Sometimes, your equation works better without an intercept. This can happen even though your p-values indicate otherwise. I always try it both ways, just as a matter of course, to see what the R-Squared is. To run your regression without an intercept, simply follow Listing 4-19.
Run Regression without Intercept
Note that it is the - 1 at the end of the formula that tells Python that you wish to eliminate the intercept. If you look at the results, you can see we now have a much higher R-squared than we had in the last lesson, and we also have no p-values that cause us concern.
Your Turn
Try running these simple regressions with no intercept: 1. Tests for the relationship between just grade and age; 2. Tests for the relationship between just grade and exercise; and 3. Tests for the relationship between just grade and study.
If you had to pick just one, which one do you like best?
Basic Pivot Table
Pivot tables (or crosstabs) have revolutionized how Excel is used to do analysis. However, I like pivot tables in Python better than I do in Excel. Let’s get some data (Listing 4-20).
Load Data from CSV
At its simplest, to get a pivot table we need a dataframe and an index (Listing 4-21).
Get Averages of All Numeric Columns Categorized by Gender
As you can see, pivot_table is smart enough to assume that we want the averages of all the numeric columns. If we wanted to specify just one value, we could do that (Listing 4-22).
Average Grade by Gender
Gender | Grade |
---|---|
Female | 82.7173 |
Male | 82.3948 |
Now we see just the average grades, categorized by gender. If we wanted to, however, we could look at minimum hours of study (Listing 4-23).
Minimum Grade by Gender
Gender | Grade |
---|---|
Female | 2 |
Male | 0 |
We can also add other columns to the index. So, to view the maximum grade categorized by gender and age, we simply run the code in Listing 4-24.
Max Grade by Gender and Age
Gender | Age | Hours |
---|---|---|
Female | 14 | 20 |
15 | 20 | |
16 | 19 | |
17 | 20 | |
18 | 20 | |
19 | 20 | |
Male | 14 | 19 |
15 | 20 | |
16 | 20 | |
17 | 20 | |
18 | 20 | |
19 | 20 |
We can also have multiple value columns. So, to show the average grade and hours of study by gender, we can run the code in Listing 4-25.
Average Grade and Hours by Gender
Gender | Grade | Hours |
---|---|---|
Female | 82.7173 | 10.932 |
Male | 82.3948 | 11.045 |
We can also perform pivot tables on subsets of the data. First, select your data, then do a standard pivot on that selection. So, to show the average grade and hours of study by gender for students who are 17 years old, we can run the code in Listing 4-26.
Average Grade and Hours by Gender
Gender | Grade | Hours |
---|---|---|
Female | 83.599435 | 10.943503 |
Male | 82.949721 | 11.268156 |
Finally, we can include totals on our Python pivot tables, as shown in Listing 4-27.
Average Grade and Hours by Gender
Gender | Grade | Hours |
---|---|---|
Female | 83.599435 | 10.943503 |
Male | 82.949721 | 11.268156 |
All | 83.272753 | 11.106742 |
Your Turn
Can you create a pivot table showing the average grade by gender of people who had more than two hours of exercise?