Enhancing your EDA capabilities

Seaborn doesn't just make your charts more beautiful and easily controlled in their aspect; it also provides you with new tools for EDA that helps you discover distributions and relationships between variables.

Before proceeding, let's reload the package and have both the Iris and Boston datasets ready in pandas DataFrame format:

In: import seaborn as sns 
    sns.set()  

    from sklearn.datasets import load_iris 
    iris = load_iris() 
    X_iris, y_iris = iris.data, iris.target 
    features_iris = [a[:-5].replace(' ','_') for a in iris.feature_names] 
    target_labels = {j: flower 
                        for j, flower in enumerate(iris.target_names)} 
    df_iris = pd.DataFrame(X_iris, columns=features_iris) 
    df_iris['target'] = [target_labels[y] for y in y_iris]  

    from sklearn.datasets import load_boston 
    boston = load_boston() 
    X_boston, y_boston = boston.data, boston.target 
    features_boston = np.array(['V'+'_'.join([str(b), a]) 
                                for a,b in zip(boston.feature_names, 
                                range(len(boston.feature_names)))]) 
    df_boston = pd.DataFrame(X_boston, columns=features_boston) 
    df_boston['target'] = y_boston 
    df_boston['target_level'] = pd.qcut(y_boston,3)

As for as the Iris dataset, the target variable has been converted into a descriptive text of the Iris species. For the Boston dataset, the continuous target variable, the median value of owner-occupied homes, has been divided into three equal parts, representing lower, median, and high prices (using the pandas function, qcut).

Seaborn can first help your data exploration with figuring out how discretely valued or categorical variables are related to numeric ones. This is achieved using the catplot function:

In: with sns.axes_style('ticks'): 
        sns.catplot(data=df_boston, x='V8_RAD', y='target', kind='point')

You will find it insightful exploring similar plots, since they explicit the target level and its variance:

In our example, in the Boston dataset, the index of accessibility to radial highways, which is discretely valued, is compared with the target in order to check both the functional form of its relationships and the associated variance at each level.

In the case, instead, the comparison is between numeric variables; Seaborn offers an enhanced scatterplot with a regression fitted curve trend incorporated, which can clue you in to possible data transformations when the relationship is not linear:

In: with sns.axes_style("whitegrid"): 
        sns.regplot(data=df_boston, x='V12_LSTAT', y="target", order=3)

The fitting line is promptly displayed:

regplot in Seaborn can visualize regression plots of any order (we displayed a second-degree polynomial fit). Among the available regression plots, you can use a standard linear regression, a robust regression or even a logistic regression if one of the inspected features is binary.

Where it is necessary to consider distributions too, jointplot will provide additional plots on the side of the scatterplot:

In: with sns.axes_style("whitegrid"): 
        sns.jointplot("V4_NOX", "V7_DIS", 
                      data=df_boston, kind='reg', 
                      order=3)

jointplot produces the following chart:

Ideal for representing bivariate relationships by acting on the kind parameter, jointplot can also represent simple scatterplots or densities (kind=scatter or kind=kde).

When the purpose is to discover what discriminates classes, FacetGrid can arrange different plots in a comparable way and help you understand where there are differences. For instance, we can inspect the scatterplot of Iris species in order to figure out whether they occupy different parts of the feature state:

In: with sns.axes_style("darkgrid"): 
        chart = sns.FacetGrid(df_iris, col="target_level")   
        chart.map(plt.scatter, "sepal_length", "petal_length")

The code will nicely print a panel representing the comparisons based on groups:

Similar comparisons can be made using distributions (sns.distplot) or regression slopes (sns.regplot):

In: with sns.axes_style("darkgrid"):
        chart = sns.FacetGrid(df_iris, col="target")
        chart.map(sns.distplot, "sepal_length")

The first comparison is based on distributions:

The subsequent comparison is based on fitting a linear regression line:

In: with sns.axes_style("darkgrid"):
        chart = sns.FacetGrid(df_boston, col="target_level") 
        chart.map(sns.regplot, "V4_NOX", "V7_DIS")

Here is the regression-based comparison:

As for evaluating data distributions across classes, Seaborn offers an alternative tool, which is the violin plot (https://medium.com/@bioturing/5-reasons-you-should-use-a-violin-graph-31a9cdf2d0c6). A violin plot is simply a boxplot whose box is shaped based on density estimation, thus visually conveying information that is more intuitive:

In: with sns.axes_style("whitegrid"):     
        ax = sns.violinplot(x="target", y="sepal_length", 
                            data=df_iris, palette="pastel") 
        sns.despine(offset=10, trim=True)

The violin plot produced by the previous code can provide interesting insights into the dataset:

Finally, Seaborn offers a much better way of creating a matrix of scatterplots by using the pairplot command and allowing you to define group colors (parameter hue) and how to populate the diagonal row. This is by using the diag_kind parameter, which can be a histogram ('hist') or kernel density estimation ('kde'):

In: with sns.axes_style("whitegrid"): 
        chart = sns.pairplot(data=df_iris, hue="target", diag_kind="hist")

The previous code will output a complete matrix of scatterplots for the dataset:

Table of Contents for Enhancing your EDA capabilities

Create new playlist

Sign In

Sign Up

Table of Contents for
Enhancing your EDA capabilities