Checking for outliers

Within the boxplot function, outliers are computed following John Tukey's rule: take 1.5 times the interquartile range as a threshold, and mark as an outlier everything that falls outside the range from 1st quartile - 1.5 * interquartile range to 3rd quartile + 1.5 * interquartile range. No, that 1.5 doesn't seem to come out of any statistical derivation, nor a cabalistic one; it is a convention.
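As a minimal sketch of this rule, the snippet below computes the two thresholds by hand on a small made-up vector (toy values, not the cash flow data of this chapter) and compares the result with what boxplot.stats reports. It uses fivenum() for the quartiles, since that is what boxplot.stats relies on internally; quantile() can return slightly different quartiles on small samples.

```r
# Hypothetical toy data, not the cash flow report used in this chapter
x <- c(10, 12, 13, 14, 15, 16, 18, 40)

# fivenum() returns: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
iqr <- five[4] - five[2]

# Tukey's thresholds: 1.5 * IQR below the lower hinge and above the upper one
lower_threshold <- five[2] - 1.5 * iqr
upper_threshold <- five[4] + 1.5 * iqr

# Everything outside the two thresholds is flagged as an outlier
manual_out <- x[x < lower_threshold | x > upper_threshold]
manual_out

# The same values should be found in the out element of boxplot.stats
boxplot.stats(x)$out
```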

To get a closer look at the values marked as outliers, we have to resort to the boxplot.stats function, which is the one called from the boxplot function. This function actually computes the stats behind the plot, and the outliers are included among them.

Let me try to call it, passing the cash_flow attribute as argument:

boxplot.stats(x = cash_flow_report$cash_flow)

OK then, you will find the following as the output:

$stats
[1] 81482.76 95889.92 102593.81 107281.90 117131.77
$n
[1] 84
$conf
[1] 100629.9 104557.7
$out
[1] 132019.06 77958.74 45089.21

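The stats element holds five values: the ends of the lower and upper whiskers (first and last terms), that is, the most extreme observations that still fall inside Tukey's thresholds, followed in between by the first quartile, the median, and the third quartile. The n element just acknowledges the number of records analyzed, and the conf one reports a confidence interval around the median (the notches of a notched boxplot), which is of no interest to us at the moment.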
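The numbers in this output can be checked by hand. The sketch below copies the values printed above, rebuilds Tukey's thresholds from the two quartiles, and recomputes the conf limits with the formula given in the boxplot.stats documentation, median +/- 1.58 * IQR / sqrt(n). Variable names are chosen here for illustration only, and because the inputs are the rounded values from the printout, the results should match up to rounding.

```r
# Values copied from the boxplot.stats output shown above
lower_whisker <- 81482.76
lower_hinge   <- 95889.92
med           <- 102593.81
upper_hinge   <- 107281.90
upper_whisker <- 117131.77
n             <- 84
out           <- c(132019.06, 77958.74, 45089.21)

# Tukey's thresholds, rebuilt from the quartiles
iqr <- upper_hinge - lower_hinge
lower_threshold <- lower_hinge - 1.5 * iqr
upper_threshold <- upper_hinge + 1.5 * iqr
c(lower_threshold, upper_threshold)

# The whisker ends sit inside the thresholds, the outliers outside
lower_whisker >= lower_threshold & upper_whisker <= upper_threshold
out < lower_threshold | out > upper_threshold

# Limits reported in $conf: median +/- 1.58 * IQR / sqrt(n)
conf <- med + c(-1.58, 1.58) * iqr / sqrt(n)
round(conf, 1)
```

Note that the first and last terms of stats are not the thresholds themselves: they are actual observations, the furthest ones that still lie within the thresholds.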

Finally, the out element shows the detected outliers, in the order in which they appear in the data (here that happens to be by decreasing value). Can you see it?

$out
[1] 132019.06 77958.74 45089.21

45089.21, here it is, our first suspect.

But when was this recorded, and where? We will find this out by storing the detected outliers in a vector and filtering our cash_flow_report, employing that 45089.21 as a filter:

Store the outliers:

stats <- boxplot.stats(x = cash_flow_report$cash_flow)
outliers <- stats$out

Filter cash_flow_report based on the value of the cash_flow column, looking for records equal to outliers[3], which is our beloved 45089.21:

cash_flow_report %>%
  filter(cash_flow == outliers[3])

Look, we did it. Here it is, our suspect:

            x          y cash_flow
1 middle_east 2017-07-01  45089.21
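Picking outliers[3] by position works here, but it depends on where the value happens to sit in the vector. As a sketch of a less position-dependent variant, the snippet below keeps every row whose value appears among the detected outliers, using %in%. The toy_report data frame and all its values are made up for illustration and only mimic the column names shown above; base R subsetting is used so that the snippet runs without extra packages.

```r
# Hypothetical toy data frame standing in for cash_flow_report
toy_report <- data.frame(
  x = c("north_america", "europe", "middle_east", "asia",
        "europe", "asia", "middle_east"),
  y = as.Date(c("2016-01-01", "2016-04-01", "2016-07-01", "2016-10-01",
                "2017-01-01", "2017-04-01", "2017-07-01")),
  cash_flow = c(100, 102, 98, 101, 99, 150, 40)
)

# Store the outliers, exactly as done above for the real report
toy_outliers <- boxplot.stats(x = toy_report$cash_flow)$out

# Keep every record whose cash flow is among the outliers
flagged <- toy_report[toy_report$cash_flow %in% toy_outliers, ]
flagged
```

With dplyr loaded, the same selection on the real data would read filter(cash_flow %in% outliers).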

It is the last recorded cash flow from the Middle East. Is it the latest point of a decreasing trend affecting this region? Is there any general trend over time? We are going to discover this by looking at the cash flows together with the other variables, that is, reporting date and geographic area.
