Scatter plot in Seaborn

A scatter plot is one of the most common plots in the scientific and business worlds. It is particularly useful for displaying the relationship between two variables. While we can simply use matplotlib.pyplot.scatter to draw a scatter plot, we can also use Seaborn to build similar plots with more advanced features.

The two functions seaborn.regplot() and seaborn.lmplot() display a linear relationship in the form of a scatter plot, a regression line, plus the 95% confidence interval around that regression line. The main difference between the two functions is that lmplot() combines regplot() with FacetGrid such that we can create color-coded or faceted scatter plots to show the interaction between three or more pairs of variables. We will demonstrate the use of lmplot() later in this chapter and the next chapter.

The simplest form of seaborn.regplot() supports numpy arrays, pandas Series, or pandas DataFrames as input. The regression line and the confidence interval can be removed by specifying fit_reg=False.

We are going to investigate the hypothesis that Big Macs are cheaper in poorer countries, and vice versa, checking whether there is any correlation between the Big Mac index and GDP per capita:

import seaborn as sns
import matplotlib.pyplot as plt

# seaborn.regplot() returns matplotlib.Axes object
ax = sns.regplot(x="Value", y="dollar_price", data=merged_df, fit_reg=False)
ax.set_xlabel("GDP per capita (constant 2000 US$)")
ax.set_ylabel("BigMac index (US$)")

plt.show()

The expected output:

So far so good! It looks like the Big Mac index is positively correlated with GDP per capita. Let's turn the regression line back on and label a few countries that show extreme Big Mac index values:

ax = sns.regplot(x="Value", y="dollar_price", data=merged_df)
ax.set_xlabel("GDP per capita (constant 2000 US$)")
ax.set_ylabel("BigMac index (US$)")

# Label the country code for those who demonstrate extreme BigMac index
for row in merged_df.itertuples():
if row.dollar_price >= 5 or row.dollar_price <= 2:
ax.text(row.Value,row.dollar_price+0.1,row.country)

plt.show()

This is the expected output:

We can see that many countries fall within the confidence interval of the regression line. Given the GDP per capita level for each country, the linear regression model predicts the corresponding Big Mac index. The currency value shows signs of under- or over-valuation if the actual index deviates from the regression model.

By labeling the countries that show extremely high or low values, we can clearly see that the average price of a Big Mac in Brazil and Switzerland is overvalued, while it is undervalued in India, Russia, and Ukraine even if the differences in GDP are considered.

Since Seaborn is not a package for statistical analysis, we would need to rely on other packages, such as scipy.stats or statsmodels, to obtain the parameters of a regression model. In the next example, we are going to get the slope and intercept parameters from the regression model, and apply different colors for points that are above or below the regression line:

from scipy.stats import linregress


ax = sns.regplot(x="Value", y="dollar_price", data=merged_df)
ax.set_xlabel("GDP per capita (constant 2000 US$)")
ax.set_ylabel("BigMac index (US$)")

# Calculate linear regression parameters
slope, intercept, r_value, p_value, std_err = linregress(merged_df.Value, merged_df.dollar_price)

colors = []
for row in merged_df.itertuples():
if row.dollar_price > row.Value * slope + intercept:
# Color markers as darkred if they are above the regression line
color = "darkred"
else:
# Color markers as darkblue if they are below the regression line
color = "darkblue"

# Label the country code for those who demonstrate extreme BigMac index
if row.dollar_price >= 5 or row.dollar_price <= 2:
ax.text(row.Value,row.dollar_price+0.1,row.country)

# Highlight the marker that corresponds to China
if row.country == "CHN":
t = ax.text(row.Value,row.dollar_price+0.1,row.country)
color = "yellow"

colors.append(color)

# Overlay another scatter plot on top with marker-specific color
ax.scatter(merged_df.Value, merged_df.dollar_price, c=colors)

# Label the r squared value and p value of the linear regression model.
# transform=ax.transAxes indicates that the coordinates are given relative
# to the axes bounding box, with 0,0 being the lower left of the axes
# and 1,1 the upper right.
ax.text(0.1, 0.9, "$r^2={0:.3f}, p={1:.3e}$".format(r_value ** 2, p_value), transform=ax.transAxes)

plt.show()

Contrary to popular belief, it looks like China's currency was not significantly under-valued in 2015 since its marker lies well within the 95% confidence interval of the regression line.

To better illustrate the distribution of values, we can combine histograms of x or y values with scatter plots using seaborn.jointplot():

# seaborn.jointplot() returns a seaborn.JointGrid object
g = sns.jointplot(x="Value", y="dollar_price", data=merged_df)

# Provide custom axes labels through accessing the underlying axes object
# We can get matplotlib.axes.Axes of the scatter plot by calling g.ax_joint
g.ax_joint.set_xlabel("GDP per capita (constant 2000 US$)")
g.ax_joint.set_ylabel("BigMac index (US$)")

# Set the title and adjust the margin
g.fig.suptitle("Relationship between GDP per capita and BigMac Index")
g.fig.subplots_adjust(top=0.9)

plt.show()

By additionally specifying the kind parameter in jointplot to reg, resid, hex, or kde, we can quickly change the plot type to regression, residual, hex bin, or KDE contour plot respectively.

Here is a big disclaimer: with the data in our hands, it is still too early to make any conclusions about the valuation of currencies! Different business factors such as labor cost, rent, raw material costs, and taxation can all contribute to the pricing model of Big Mac, but this is beyond the scope of this book.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset