Skewness

OK, let's move on with our EDA, taking along our first hint about the profits drop. We should now look at the shape of our distribution and, first of all, at whether it is symmetrical or not. Without getting too complex, let me just show you the concept with a sketch:

Within all of the three plots, you find, on the x axis, the value of a variable, and on the y axis, the frequency. It is like a histogram, isn't it?

Ahem, the author here... you should remember histograms, since we met them in Chapter 3, The Data Mining Process - CRISP-DM Methodology, but you can quickly skip back to refresh this.

Nice, so the blue distribution is symmetrical, that is, it mirrors itself around its mean. The green and red plots are not symmetrical, since the green one is biased toward the left and the red one toward the right.

This is the intuition behind the concept of skewness. Different measures of skewness have been proposed in the literature; let's use the one from Sir Arthur Lyon Bowley, since it employs the quartiles and the median, which you should now be familiar with.

Bowley's definition of skewness is shown in the following:

skewness = ((q3 - q2) - (q2 - q1)) / (q3 - q1)

Here, qn is the nth quartile of our vector.

When this number is greater than 0 we are dealing with a positively skewed population, while the opposite is true for a negative value.

Moving a bit closer to the formula, what do you think is in the numerator? As you can see, there we measure the difference between two distances: the distance between the third and the second quartile, and the distance between the second and the first quartile.

In a symmetric distribution, those two quantities are equal, while in a skewed one the first will be greater than the second, or vice versa, depending on the direction of the skew.
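To make the numerator and denominator concrete, here is a minimal sketch of the measure as a small R function. The name bowley_skewness and the two toy vectors are introduced here only for illustration, and the quartiles are assumed to come from fivenum(), as in the rest of this section.

```r
# Bowley's (quartile) skewness: ((q3 - q2) - (q2 - q1)) / (q3 - q1).
# bowley_skewness is a helper name made up for this sketch.
bowley_skewness <- function(x) {
  q <- fivenum(x)[2:4]      # q[1] = q1, q[2] = median (q2), q[3] = q3
  upper <- q[3] - q[2]      # distance between the third and second quartile
  lower <- q[2] - q[1]      # distance between the second and first quartile
  (upper - lower) / (q[3] - q[1])
}

# Toy vectors, invented for illustration only
bowley_skewness(c(1, 2, 3, 4, 5))      # equal distances, so 0
bowley_skewness(c(1, 2, 3, 10, 20))    # upper distance larger, so positive
bowley_skewness(c(1, 11, 18, 19, 20))  # lower distance larger, so negative
```

Note that the function returns NaN when the first and third quartiles coincide, since the denominator is then zero.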

You can get it through a numerical example. Consider, for instance, the following vector:

1
1
2
2
2
2.5
3
3
3
4
4

 

Which is the median, or the second quartile? Right, it is 2.5. The first quartile here is 2 and the third is 3. Let's compute the two distances mentioned previously:

q2 - q1 = 2.5 - 2 = 0.5

q3 - q2 = 3 - 2.5 = 0.5

As you can see, those two quantities are exactly the same. This is consistent with the population being perfectly symmetrical about the mean, which is exactly 2.5:

Number    Frequency
1         2
2         3
2.5       1
3         3
4         2
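If you want to check these numbers yourself, the following sketch reproduces them in base R. The object names x and q are chosen here for illustration; fivenum() is the same function used later in this section for the cash flow data.

```r
# The example vector from the text
x <- c(1, 1, 2, 2, 2, 2.5, 3, 3, 3, 4, 4)

q <- fivenum(x)   # minimum, first quartile, median, third quartile, maximum
q                 # 1.0 2.0 2.5 3.0 4.0

q[3] - q[2]       # q2 - q1 = 0.5
q[4] - q[3]       # q3 - q2 = 0.5

mean(x)           # 2.5, the same as the median
table(x)          # frequency of each distinct value
```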

 

What if we increase the number of 1s, for instance, by replacing a 2 with a 1? We get this:

1
1
1
2
2
2.5
3
3
3
4
4

 

The average moves to about 2.4, and the first quartile drops to 1.5, so the two differences now become 1 (q2 - q1) and 0.5 (q3 - q2). Let's now compute Bowley's skewness for both of the examples:

first skewness = (0.5 - 0.5)/(3-2) = 0


second skewness = (0.5 - 1)/(3 - 1.5) = -0.33 (rounded)

Following Bowley's skewness, we see that the first distribution should look symmetric around the mean, while the second should be negatively skewed, that is, biased towards higher values. We can check this by counting the elements lower and greater than the mean within the first and second population: 5 and 5 for the first (since the median is equal to the mean), and 5 and 6 for the second (since the median here is higher than the mean).

Nice, but what about our cash flow population? Let's compute Bowley's skewness, employing the quartiles we previously obtained from fivenum. We first save the output from this function within a quartiles object:
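The whole comparison can be sketched in a few lines of base R. The helper names quartile_skewness and count_around_mean are made up for this example, and the quartiles are again assumed to come from fivenum().

```r
first  <- c(1, 1, 2, 2, 2, 2.5, 3, 3, 3, 4, 4)
second <- c(1, 1, 1, 2, 2, 2.5, 3, 3, 3, 4, 4)   # one 2 replaced by a 1

# Quartile-based skewness, written out exactly as in the formula
quartile_skewness <- function(x) {
  q <- fivenum(x)[2:4]   # first quartile, median, third quartile
  ((q[3] - q[2]) - (q[2] - q[1])) / (q[3] - q[1])
}

quartile_skewness(first)    # 0: symmetric
quartile_skewness(second)   # -1/3: negatively skewed

# How many elements fall strictly below and strictly above the mean
count_around_mean <- function(x) {
  c(below = sum(x < mean(x)), above = sum(x > mean(x)))
}

count_around_mean(first)    # below 5, above 5
count_around_mean(second)   # below 5, above 6
```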

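library(dplyr)  # provides %>% and select()

# Extract the cash_flow column as a plain vector and summarise it
cash_flow_report %>%
  select(cash_flow) %>%
  unlist() %>%
  fivenum() -> quartiles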

Then, we filter out the quartiles we are not interested in, that is, quartiles 0 and 4 (the minimum and the maximum), which are the first and the fifth elements of our resulting vector. We therefore keep only the elements from the second to the fourth:

q <- quartiles[2:4]

Finally, we apply the formula shown previously:

skewness <- ((q[3] - q[2]) - (q[2] - q[1])) / (q[3] - q[1])

OK, our population seems to be negatively skewed, which means that we are dealing with a historical series biased towards higher values, that is, mainly made up of higher cash flows. What do you think this can tell us? Is it another hint about the drop? It certainly makes us even more suspicious about our drop, and at least you now know about population skewness!
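For reference, the three steps above can also be sketched in base R without the pipe. The toy_report data frame below is invented purely for illustration and does not reproduce the values in cash_flow_report; only the column name cash_flow is taken from the text.

```r
# Toy data, made up for this sketch: mostly high values with a couple of low ones
toy_report <- data.frame(
  cash_flow = c(120, 95, 130, 128, 140, 135, 60, 138)
)

quartiles <- fivenum(toy_report$cash_flow)  # min, q1, median, q3, max
q <- quartiles[2:4]                         # keep q1, median, q3

skewness <- ((q[3] - q[2]) - (q[2] - q[1])) / (q[3] - q[1])
skewness   # negative: the median sits closer to q3 than to q1
```

A negative result here has the same reading as in the text: most observations are relatively high, with the lower ones stretching the bottom half of the distribution.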
