Detecting fraud in e-commerce orders with Benford's law

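Benford's law is an empirical law stating that the first digits of many real-world numeric datasets follow a specific logarithmic distribution: small digits are more frequent than large ones, with 1 appearing as the first digit about 30% of the time.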
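For a first digit d (from 1 to 9), the expected probability under Benford's law is log10(1 + 1/d). As a minimal sketch in base R (the variable names are illustrative and not part of the recipe), the expected distribution can be computed like this:

```r
# Expected first-digit probabilities under Benford's law: P(d) = log10(1 + 1/d)
digits <- 1:9
benford_prob <- log10(1 + 1 / digits)
names(benford_prob) <- digits

# Show the probabilities rounded to three decimals
round(benford_prob, 3)
```

Digit 1 is expected in roughly 30% of cases, while digit 9 is expected in fewer than 5%.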

The pattern was first noted by Simon Newcomb in 1881 and popularized by Frank Benford in 1938. Since then, it has gained increasing popularity as a way to detect anomalous alterations in a population of data.

Testing a population against Benford's law means checking whether the first-digit frequencies of that population match the expected ones. If deviations are discovered, the analyst performs further analysis on the items related to those deviations.

In this recipe, we will test a population of e-commerce orders against the law, focusing on items deviating from the expected distribution.

Getting ready

This recipe will use functions from the well-documented benford.analysis package by Carlos Cinelli.

We therefore need to install and load this package:

install.packages("benford.analysis")
library(benford.analysis)

In our example, we will use a data frame that stores e-commerce orders, provided within the book as an .Rdata file.

To make it available within your environment, load this file by running the following command (assuming the file is in your current working directory):

load("ecommerce_orders_list.Rdata")

How to do it...

  1. Perform the Benford test on the order amounts:
    benford_test <- benford(ecommerce_orders_list$order_amount, number.of.digits = 1)
    
  2. Plot the test analysis:
    plot(benford_test)
    

    This produces a multi-panel plot of the test results (figure not reproduced here).

  3. Highlight digits deviating from the expected distribution:
    suspectsTable(benford_test)
    

    This will produce a table showing, for each digit, the absolute difference between expected and observed frequencies. The table is sorted by decreasing difference, so the digits in the first rows are the most anomalous ones:

    > suspectsTable(benford_test)
       digits absolute.diff
    1:      5     4860.8974
    2:      9     3764.0664
    3:      1     2876.4653
    4:      2     2870.4985
    5:      3     2856.0362
    6:      4     2706.3959
    7:      7     1567.3235
    8:      6     1300.7127
    9:      8      200.4623
    
  4. Define a function to extract the first digit from each amount:
    # Return the first `char` characters of `string` (numbers are coerced to text)
    left <- function(string, char) {
      substr(as.character(string), 1, char)
    }
    
  5. Extract the first digit from each amount:
    ecommerce_orders_list$first_digit <- left(ecommerce_orders_list$order_amount, 1)
    
  6. Filter amounts starting with the suspected digit:
    suspects_orders <- subset(ecommerce_orders_list, first_digit == "5")
    

How it works...

In step 1, we perform the Benford test by applying the benford() function to the order amounts. The second argument, 1, sets the number of leading digits to analyze.

Applying this function means evaluating the distribution of the first digits of amounts against the expected Benford distribution.
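To make the idea concrete, the comparison can be sketched by hand in base R. This is not the package's implementation, only an illustration. The amounts below are synthetic (an assumption standing in for the real orders), and all variable names are hypothetical:

```r
# Synthetic, log-uniform amounts (an assumption standing in for real orders);
# data spread evenly over several orders of magnitude tends to follow Benford's law
set.seed(42)
amounts <- round(exp(runif(5000, log(1), log(1e5))), 2)

# First significant digit, read from scientific notation ("4.2e+02" gives 4)
first_digit <- as.integer(substr(formatC(amounts, format = "e", digits = 10), 1, 1))

# Observed counts per digit versus the counts expected under Benford's law
observed <- as.vector(table(factor(first_digit, levels = 1:9)))
expected <- length(amounts) * log10(1 + 1 / (1:9))

comparison <- data.frame(
  digit = 1:9,
  observed = observed,
  expected = round(expected, 1),
  absolute.diff = abs(observed - expected)
)

# Digits with the largest deviation come first
comparison[order(-comparison$absolute.diff), ]
```

Sorting by the absolute difference mirrors the ordering used in the suspects table shown in step 3.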

The function returns an object of class Benford, which is a list containing the following components:

info

General information, including:

  • data.name: The name of the data used
  • n: The number of observations used
  • n.second.order: The number of observations used for second-order analysis
  • number.of.digits: The number of first digits analyzed

data

A data frame with:

  • lines.used: The original lines of the dataset
  • data.used: The data used
  • data.mantissa: The mantissa of the base-10 logarithm of the data
  • data.digits: The first digits of the data

s.o.data

A data frame with:

  • data.second.order: The differences of the ordered data
  • data.second.order.digits: The first digits of the second-order analysis

bfd

A data frame with:

  • digits: The groups of digits analyzed
  • data.dist: The distribution of the first digits of the data
  • data.second.order.dist: The distribution of the first digits of the second-order analysis
  • benford.dist: The theoretical Benford distribution
  • data.second.order.dist.freq: The frequency distribution of the first digits of the second-order analysis
  • data.dist.freq: The frequency distribution of the first digits of the data
  • benford.dist.freq: The theoretical Benford frequency distribution
  • benford.so.dist.freq: The theoretical Benford frequency distribution of the second-order analysis
  • data.summation: The summation of the data values grouped by first digits
  • abs.excess.summation: The absolute excess summation of the data values grouped by first digits
  • difference: The difference between the data and Benford frequencies
  • squared.diff: The chi-squared difference between the data and Benford frequencies
  • absolute.diff: The absolute difference between the data and Benford frequencies

mantissa

A data frame with:

  • mean.mantissa: The mean of the mantissa
  • var.mantissa: The variance of the mantissa
  • ek.mantissa: The excess kurtosis of the mantissa
  • sk.mantissa: The skewness of the mantissa

MAD

The mean absolute deviation

distortion.factor

The distortion factor

stats

A list of statistics of class htest:

  • chisq: Pearson's chi-squared test
  • mantissa.arc.test: Mantissa Arc Test
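The components listed above can be read directly from the returned list. The following is a minimal sketch, assuming the benford.analysis package is installed and that the component names are lowercase as in the package documentation (info, bfd, stats). The amounts are synthetic and the object name bt is hypothetical:

```r
library(benford.analysis)

# Synthetic amounts (an assumption standing in for the order amounts)
set.seed(1)
amounts <- round(exp(runif(2000, log(1), log(1e5))), 2)
bt <- benford(amounts, number.of.digits = 1)

bt$info$number.of.digits   # number of leading digits analyzed
bt$MAD                     # mean absolute deviation
bt$distortion.factor       # distortion factor
bt$stats$chisq             # htest object for Pearson's chi-squared test

# Per-digit frequencies, converted to a plain data frame for printing
bfd <- as.data.frame(bt$bfd)
bfd[, c("digits", "data.dist.freq", "benford.dist.freq", "absolute.diff")]
```

The absolute.diff column read here is the same quantity that suspectsTable() ranks in step 3.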

In step 2, we plot the test results. Calling plot() on the object returned by benford() produces a set of panels showing the following (from the upper-left corner to the bottom-right corner):

  • First digit distribution
  • Results of the second-order test
  • Summation distribution for each digit
  • Results of the chi-squared test
  • Summation differences

These plots show which digits have an observed frequency that differs markedly from the one expected under Benford's law. To put this on a sounder footing, we need to look at the suspects table, which shows the absolute differences between expected and observed frequencies. This is what we will do in the next step.

In step 3, we highlight the suspect digits. Using suspectsTable(), we can easily discover which digits show the greatest deviation from the expected distribution.

Looking at the suspects table, we can see that digit 5 appears in the first row. In the next steps, we will focus our attention on the orders whose amounts have this digit as the first digit.

In step 4, we define a function to extract the first digit from each amount. This function relies on the substr() function from base R, which works on the text form of the number passed to it and returns its first character.
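Because substr() works on the text form of a number, it returns "0" for amounts below 1 and "-" for negative amounts. If such values may occur, a more defensive approach is to read the digit from scientific notation. The helper below, first_sig_digit(), is hypothetical (it is not part of the recipe or of any package) and is only a sketch:

```r
# Hypothetical helper: first significant digit of each number in x.
# Zero yields 0; the sign is ignored.
first_sig_digit <- function(x) {
  sci <- formatC(abs(x), format = "e", digits = 10)  # e.g. "5.7000000000e-01"
  as.integer(substr(sci, 1, 1))
}

# substr() on an amount below 1 picks up the leading zero
substr(0.57, 1, 1)

# The helper skips leading zeros and the minus sign
first_sig_digit(c(0.57, 1234.5, -42))
```

For order amounts that are always 1 or greater, the simpler left() function from step 4 gives the same digit.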

In step 5, we add a new column to the investigated dataset, containing the first digit extracted from each amount.

In step 6, we filter amounts starting with the suspected digit.

After applying the left function to our sequence of amounts, we can now filter the dataset, retaining only rows whose amounts have 5 as the first digit. We will now be able to perform analytical testing procedures on those items.
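The benford.analysis package also provides getSuspects(), which can be used as an alternative to steps 4 to 6: it returns the rows of a data frame whose leading digits show the largest deviations. The sketch below assumes the package is installed and uses a synthetic data frame, orders, as a stand-in for ecommerce_orders_list. The column names and the injected block of orders are assumptions made for illustration:

```r
library(benford.analysis)

# Synthetic stand-in for ecommerce_orders_list (assumption): mostly log-uniform
# amounts, plus a block of injected orders between 500 and 599.99
set.seed(7)
orders <- data.frame(
  order_id = 1:3600,
  order_amount = round(c(exp(runif(3000, log(1), log(1e5))),
                         runif(600, 500, 599.99)), 2)
)

bt <- benford(orders$order_amount, number.of.digits = 1)

# Rows whose leading digit has the largest absolute difference
suspects <- getSuspects(bt, orders, by = "absolute.diff", how.many = 1)
head(suspects)
```

Setting how.many to a larger value returns the rows for several of the top-ranked digits at once.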
