Histograms

Histograms can effectively represent the distribution of a variable. Here, we will visualize two normal distributions, both characterized by unit standard deviation, one having a mean of 0 and the other a mean of 3.0:

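In: import numpy as np
import matplotlib.pyplot as plt
# Draw 500 samples from each of two normal distributions with unit standard deviation
x = np.random.normal(loc=0.0, scale=1.0, size=500)
z = np.random.normal(loc=3.0, scale=1.0, size=500)
# Plot both samples as stacked bars over 20 shared bins
plt.hist(np.column_stack((x, z)),
         bins=20,
         histtype='bar',
         color=['c', 'b'],
         stacked=True)
plt.grid()
plt.show()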

When the two samples represent the classes of a classification problem, plotting their distributions together can offer a different insight into the data.

There are a few ways to customize this kind of plot and gain further insight into the analyzed distributions. First, changing the number of bins changes how the distributions are discretized (discretization is the process that transforms continuous functions or series of values into a reduced, countable set of numbers: en.wikipedia.org/wiki/Discretization). Generally, 10 to 20 bins offer a good understanding of the distribution, though the right number really depends on the size of the dataset as well as on the distribution itself. For instance, the Freedman-Diaconis rule derives the number of bins from a bin width, h, which is calculated using the interquartile range (IQR) and the number of observations, n:

h = 2 * IQR * n^(-1/3)

Once h, the bin width, has been calculated, the number of bins is obtained by dividing the difference between the maximum and the minimum value by h:

bins = (max - min) / h
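The rule can be sketched in a few lines of NumPy. The helper name freedman_diaconis_bins is hypothetical (it is not part of the original example), and rounding the bin count up to the next integer is an assumption, since the text does not say how to round:

```python
import numpy as np

def freedman_diaconis_bins(values):
    """Return (bin width h, number of bins) following the Freedman-Diaconis rule."""
    values = np.asarray(values, dtype=float)
    # Interquartile range: distance between the 75th and 25th percentiles
    q75, q25 = np.percentile(values, [75, 25])
    iqr = q75 - q25
    # Bin width: h = 2 * IQR * n^(-1/3)
    h = 2.0 * iqr * values.size ** (-1.0 / 3.0)
    if h == 0:
        # Degenerate case (no spread in the central half): fall back to one bin
        return h, 1
    # Number of bins: (max - min) / h, rounded up (assumption)
    n_bins = int(np.ceil((values.max() - values.min()) / h))
    return h, max(n_bins, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=500)
h, n_bins = freedman_diaconis_bins(x)
print(h, n_bins)
```

The resulting n_bins can be passed as the bins argument of plt.hist. NumPy also provides this rule as a built-in option: np.histogram_bin_edges(x, bins='fd') returns the corresponding bin edges.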

We can also change the type of visualization from bars to steps by changing the histtype parameter from 'bar' to 'step'. Setting the stacked Boolean parameter to False keeps the distributions from stacking into a single bar where they overlap, so you can clearly see the separate bars of each one.
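As a minimal sketch of these two changes, the earlier example can be redrawn with step outlines and without stacking. The seeded generator and the use of the Axes object are assumptions made here for reproducibility; they are not part of the original snippet:

```python
import numpy as np
import matplotlib.pyplot as plt

# Seeded generator (assumption) so that the figure is reproducible
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=500)
z = rng.normal(loc=3.0, scale=1.0, size=500)

fig, ax = plt.subplots()
# histtype='step' draws outlines only; stacked=False keeps the two
# distributions separate instead of piling them into a single bar
counts, edges, _ = ax.hist(np.column_stack((x, z)),
                           bins=20,
                           histtype='step',
                           color=['c', 'b'],
                           stacked=False)
ax.grid()
plt.show()
```

Because the outlines are not filled, the region where the two distributions overlap stays readable for both of them.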
