Plotting scatterplots

Scatterplots help you see whether two variables are related nonlinearly, and they suggest which transformations might linearize that relationship. If you are using an algorithm based on linear combinations, such as linear or logistic regression, making the relationship more linear usually improves predictive power:

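In: colors_palette = {0: 'red', 1: 'yellow', 2: 'blue'}
# groups holds one class label (0, 1, or 2) per row; map each label to a color
colors = [colors_palette[c] for c in groups]
# x=0 and y=1 select the first and second columns by position
simple_scatterplot = iris_df.plot(kind='scatter', x=0, y=1, c=colors)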

After running the code, a scatterplot appears with each point colored by its group.
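
As a hedged illustration of the linearization idea from the start of this section, the following sketch uses synthetic data rather than the iris dataset (the variable names and the exponential relationship are assumptions made for the example). It compares the Pearson correlation of two variables before and after a log transformation of the second one:

```python
import numpy as np

# Synthetic data (an assumption for this sketch): y grows exponentially with x,
# with a little multiplicative noise.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=200)
y = np.exp(x) * rng.lognormal(mean=0.0, sigma=0.1, size=200)

# Pearson correlation measures how well a straight line describes the pair.
r_raw = np.corrcoef(x, y)[0, 1]          # x against the raw y
r_log = np.corrcoef(x, np.log(y))[0, 1]  # x against the log-transformed y

print(f"correlation with raw y: {r_raw:.3f}")
print(f"correlation with log y: {r_log:.3f}")
```

Here the log transformation brings the correlation closer to 1. In a scatterplot, the same change would show up as a curved cloud of points turning into a straight band.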

Scatterplots can be turned into hexagonal binning plots, which help you visualize point densities, that is, where the points naturally aggregate, thus revealing clusters hidden in your data. To build them, you can use variables originally present in the dataset or the dimensions obtained by PCA or another dimensionality reduction algorithm:

In: hexbin = iris_df.plot(kind='hexbin', x=0, y=1, gridsize=10)

Here is the resulting hexbin plot:

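The gridsize parameter sets the number of hexagons along the x axis; the number along the y axis is chosen so that the hexagons are approximately regular. A larger value produces more, smaller cells, each summarizing fewer data points, whereas a smaller value produces fewer, larger cells.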
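
To see the effect of gridsize concretely, the sketch below draws the same synthetic point cloud (an assumption for the example, standing in for iris_df) with a small and a large gridsize and counts the hexagons that matplotlib drew in each chart. The Agg backend is selected only so that the code runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic two-column data (hypothetical column names "a" and "b").
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({"a": rng.normal(size=1000),
                        "b": rng.normal(size=1000)})

fig, (ax_coarse, ax_fine) = plt.subplots(1, 2, figsize=(10, 4))
demo_df.plot(kind="hexbin", x="a", y="b", gridsize=5, ax=ax_coarse)
demo_df.plot(kind="hexbin", x="a", y="b", gridsize=25, ax=ax_fine)

# Each hexbin chart is a single collection holding one value per drawn hexagon.
n_coarse = len(ax_coarse.collections[0].get_array())
n_fine = len(ax_fine.collections[0].get_array())
print("hexagons with gridsize=5: ", n_coarse)
print("hexagons with gridsize=25:", n_fine)
```

The chart drawn with gridsize=25 contains more hexagons than the one drawn with gridsize=5, so each cell covers a smaller area.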

Scatterplots are bivariate, so you need one plot for every pair of variables. If you do not have too many variables (otherwise, the visualization becomes cluttered), a quick solution is to let pandas draw a matrix of scatterplots automatically, with a kernel density estimate ('kde') of each feature's distribution on the diagonal of the chart:

In: from pandas.plotting import scatter_matrix
colors_palette = {0: "red", 1: "green", 2: "blue"}
colors = [colors_palette[c] for c in groups]
matrix_of_scatterplots = scatter_matrix(iris_df,
                                        alpha=0.2,
                                        figsize=(6, 6),
                                        color=colors,
                                        diagonal='kde')

After running the previous code, you will get a complete matrix of plots (densities on the diagonal):

A few parameters control various aspects of the scatterplot matrix. The alpha parameter controls the amount of transparency, and figsize sets the width and height of the matrix in inches. The color parameter accepts a list with the color of each point, which lets you distinguish different groups in the data. Finally, by setting the diagonal parameter to 'kde' or 'hist', you choose whether density curves or histograms of each variable appear on the diagonal of the scatter matrix.
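
The following sketch shows the 'hist' variant of the diagonal parameter on a small synthetic DataFrame (the column names, data, and group labels are assumptions for the example, standing in for iris_df and groups). Note that diagonal='kde' needs SciPy to be installed, whereas 'hist' does not:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-ins for iris_df and groups (hypothetical names and values).
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(90, 3)), columns=["f1", "f2", "f3"])
demo_groups = rng.integers(0, 3, size=len(demo_df))

colors_palette = {0: "red", 1: "green", 2: "blue"}
colors = [colors_palette[g] for g in demo_groups]

# Histograms on the diagonal instead of density curves.
axes = scatter_matrix(demo_df,
                      alpha=0.5,
                      figsize=(6, 6),
                      color=colors,
                      diagonal="hist")
print(axes.shape)
```

scatter_matrix returns a NumPy array of matplotlib axes with one row and one column per variable, so individual panels can be customized afterwards, for example through axes[0, 1].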
