Introducing EDA

Exploratory data analysis (EDA), or data exploration, is the first step in the data science process. John Tukey coined the term in his 1977 book Exploratory Data Analysis, which emphasized the importance of this phase. EDA helps you understand the dataset better, check its features and its shape, validate some initial hypotheses you have in mind, and form a preliminary idea about the next steps to pursue in subsequent data science tasks.

In this section, you will work on the Iris dataset, which was already used in the previous chapter. First, let's load the dataset:

In: import pandas as pd
    iris_filename = 'datasets-uci-iris.csv'
    iris = pd.read_csv(iris_filename, header=None,
                       names=['sepal_length', 'sepal_width',
                              'petal_length', 'petal_width', 'target'])
    iris.head()

Calling the head method will display the first five rows:

Great! Using a few commands, you have already loaded the dataset. Now, the investigation phase starts. Some great insights are provided by the .describe() method, which can be used as follows:

In: iris.describe()

Promptly, a description of the dataset, comprising counts, means, and other descriptive statistics, appears:

For all numerical features, you have the number of observations, their respective average values, standard deviations, minimum and maximum values, and some routinely reported quantiles (at 25 percent, 50 percent, and 75 percent), the so-called quartiles. This provides you with a good idea about the distribution of each feature. If you want to visualize this information, just use the boxplot() method, as follows:

In: boxes = iris.boxplot(return_type='axes')

A boxplot for each variable will appear:

Sometimes, the plots presented in this chapter may differ slightly from the ones obtained on your local computer, because graphical layout initialization uses random parameters.

If you need other quantile values, you can use the .quantile() method. For example, to get the values at the 10th and 90th percentiles of the distribution, try out the following code:

In: iris.quantile([0.1, 0.9])

Here are the values for the required percentiles:

Finally, to calculate the median, you can use the .median() method. Similarly, to obtain the mean and standard deviation, the .mean() and .std() methods are used, respectively. In the case of categorical features, to get information about the levels present in a feature (that is, the different values the feature assumes), you can use the .unique() method, as follows:
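As a minimal sketch of these summary methods, the following snippet applies .median(), .mean(), and .std() to a small inline sample that stands in for the full Iris CSV (the values here are illustrative, not the real dataset):

```python
import pandas as pd

# Small inline sample standing in for the full Iris CSV
iris = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.3, 5.8],
    'petal_width':  [0.2, 0.2, 1.4, 2.5, 1.2],
})

# Per-column summary statistics
print(iris.median())  # middle value of each column
print(iris.mean())    # average of each column
print(iris.std())     # sample standard deviation of each column
```

Each method returns a Series indexed by column name, so you can also pick out a single value, for example iris['petal_width'].median().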

In: iris.target.unique()

Out: array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
dtype=object)

To examine the relationships between features, you can create a co-occurrence matrix or a similarity matrix.
In the following example, we count the number of times the petal_length feature is above its average, against the same count for the petal_width feature (the two thresholds, 3.758667 and 1.198667, are the respective feature means). To do this, you need to use the crosstab function, as follows:

In: pd.crosstab(iris['petal_length'] > 3.758667,
                iris['petal_width'] > 1.198667)

The command produces a two-way table:
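Rather than hard-coding the two mean values, you can compute them directly from the columns; a minimal sketch on a small stand-in sample (not the real Iris data) follows:

```python
import pandas as pd

# Stand-in sample for the two petal columns
iris = pd.DataFrame({
    'petal_length': [1.4, 1.3, 4.7, 5.9, 5.1, 1.5],
    'petal_width':  [0.2, 0.2, 1.4, 2.1, 1.9, 0.1],
})

# Cross-tabulate the "above the mean" events for both features
table = pd.crosstab(iris['petal_length'] > iris['petal_length'].mean(),
                    iris['petal_width'] > iris['petal_width'].mean())
print(table)
```

In this sample the two events co-occur perfectly, so all the counts sit on the diagonal of the two-way table; on real data you would typically see small off-diagonal counts as well.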

As a result, you will notice that the two events almost always occur together. Consequently, you can suppose that there is a strong relationship between them. Graphically, you can check this hypothesis by using the following code:

In: scatterplot = iris.plot(kind='scatter',
                            x='petal_width', y='petal_length',
                            s=64, c='blue', edgecolors='white')

You obtain a scatterplot of the variables you specified as x and y:

The trend is quite marked; we deduce that x and y are strongly related. The last operation that you usually perform during EDA is checking the distribution of each feature. To manage this with pandas, you can approximate the distribution using a histogram, thanks to the following snippet:

In: distr = iris.petal_width.plot(kind='hist', alpha=0.5, bins=20)

As a result, a histogram is displayed:

We chose 20 bins after some trial and error. In other situations, 20 bins might be far too low or too high a value. As a rule of thumb, when drawing a distribution histogram, the starting number of bins is the square root of the number of observations. After the initial visualization, you then adjust the number of bins until you recognize a well-known shape in the distribution.
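The square-root rule of thumb above can be sketched in a couple of lines; here a small stand-in Series replaces the real petal_width column (with the full Iris data, n would be 150 and the rule would suggest roughly 12 bins):

```python
import math
import pandas as pd

# Stand-in sample for the petal_width column
petal_width = pd.Series([0.2, 0.2, 1.4, 2.5, 1.2, 1.8, 0.3, 1.0, 2.0])

# Rule of thumb: start with sqrt(n) bins, then adjust by eye
n_bins = round(math.sqrt(len(petal_width)))
print(n_bins)
```

The computed n_bins would then be passed as the bins parameter of the histogram plot.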

We suggest that you explore all of the features in order to check their relationships and estimate their distributions. In fact, given its distribution, you may decide to treat each feature differently so as to subsequently achieve maximum classification or regression performance.
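One quick numeric way to screen all pairwise relationships at once is a correlation matrix, which pandas computes with the .corr() method; a minimal sketch on a small stand-in sample (not the real Iris data) follows:

```python
import pandas as pd

# Stand-in sample for three of the numeric Iris features
iris = pd.DataFrame({
    'petal_length': [1.4, 1.3, 4.7, 5.9, 5.1],
    'petal_width':  [0.2, 0.2, 1.4, 2.1, 1.9],
    'sepal_length': [5.1, 4.9, 7.0, 6.3, 5.8],
})

# Pairwise Pearson correlations between all numeric columns
corr = iris.corr()
print(corr)
```

Pairs with a correlation close to 1 or -1 are the first candidates for a closer look with crosstab tables and scatterplots like the ones shown above.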
