Measuring customer retention using cohort analysis in R

In e-commerce, customer retention metrics are crucial for several reasons. Chief among them is that online competitors face virtually no barrier to entry, which gives online sellers a strong incentive to build enduring relationships with their customers.

This recipe gives you a straightforward way to compute retention metrics within the R environment.

Of the methods available for this task, we will use one from the family of cohort methods.

In this method, customers are divided into homogeneous groups (that is, cohorts) that share relevant segmentation attributes, such as sex or age.

Purchases made by those groups are monitored monthly over a period of time, and a retention rate is calculated each month using the following formula:

retention rate = (number of customers purchasing in a given month)/(number of customers within the cohort at the starting point)
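
As a quick illustration of the formula, the following minimal sketch applies it to a made-up cohort of 200 customers tracked for four months. The monthly_buyers vector and its values are hypothetical and are not part of the recipe's dataset.

```r
# Hypothetical cohort: number of customers purchasing in each month
monthly_buyers <- c(200, 150, 120, 90)

# Cohort size at the starting point (the first month)
cohort_size <- monthly_buyers[1]

# Vectorized division yields one retention rate per month
retention_rate <- monthly_buyers / cohort_size
print(retention_rate)
```

The first value is always 1, because every customer in the cohort is counted at the starting point.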

Getting ready

This recipe does not rely on any particular package apart from ggplot2, which is used for plotting. We will build our own derived variables using vectorized R code.

Install and load the ggplot2 package:

install.packages("ggplot2")
library(ggplot2)

Our example uses a synthetic dataset made up of four cohorts: one for older people, one for younger people, one for men, and one for women.

Let's create the dataset with the following script:

# monthly counts of purchasing customers, one value per month (12 months)
elder_cohort <- c(10567, 9763, 8327, 8318, 7108, 6280, 6279, 5873, 4986, 3296, 2986, 1357)
younger_cohort <- c(25000, 24500, 24324, 19500, 15078, 11879, 10856, 10543, 10234, 9678, 8542, 6321)
total <- elder_cohort + younger_cohort
# split the monthly totals by sex: 54% women, 46% men
women_cohort <- total - total * 0.46
men_cohort <- total - women_cohort
# one row per cohort, one column per month
cohort_dataset <- data.frame(rbind(elder_cohort, younger_cohort, women_cohort, men_cohort))
colnames(cohort_dataset) <- seq_len(12)

Our dataset will now look like this:

[Screenshot: the cohort_dataset data frame, with one row per cohort and one column per month]
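
To check this layout directly in the console, the sketch below rebuilds a shortened, three-month version of the data frame and inspects its shape. The demo_ names and the counts are illustrative assumptions, not the recipe's data.

```r
# Illustrative three-month cohorts (made-up values, not the recipe's data)
demo_elder   <- c(100, 90, 80)
demo_younger <- c(200, 180, 150)

# Same construction as in the recipe: one row per cohort, one column per month
demo_dataset <- data.frame(rbind(demo_elder, demo_younger))
colnames(demo_dataset) <- seq_len(3)

print(dim(demo_dataset))       # rows = cohorts, columns = months
print(rownames(demo_dataset))  # cohort names come from the rbind() arguments
print(demo_dataset)
```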

How to do it...

  1. Compute retention rates for each cohort:
    # divide each month's count by the cohort size at the starting point (month 1)
    retention_younger <- younger_cohort/younger_cohort[1]
    retention_elder   <- elder_cohort/elder_cohort[1]
    retention_women   <- women_cohort/women_cohort[1]
    retention_men     <- men_cohort/men_cohort[1]
    
  2. Create a unique dataset for all rates:
    retention_rates <- rbind(retention_younger,retention_elder,retention_women,retention_men)
    colnames(retention_rates) <- seq_len(12)
    
  3. Plot retention rates:
    retention_plot <- ggplot() +
      geom_line(aes(x = 1:12, y = retention_younger, colour = "younger")) +
      geom_line(aes(x = 1:12, y = retention_elder, colour = "elder")) +
      geom_line(aes(x = 1:12, y = retention_women, colour = "women")) +
      geom_line(aes(x = 1:12, y = retention_men, colour = "men"))
    retention_plot
    

How it works...

In step 1, because the dataset has a known structure (one vector of monthly customer counts per cohort), it is easy to compute the retention rate for each cohort. The code does this for every month in a single vectorized operation and yields 12 ratios for each cohort.

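In step 2, the four retention rate vectors are joined by row into a single object, retention_rates, with one row per cohort and one column per month. Note that the plot in step 3 reads the individual vectors rather than this object.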
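
Steps 1 and 2 can also be collapsed into a single matrix operation. The following sketch is an alternative, not the recipe's own code. It assumes retention is measured against each cohort's first-month count, as in the formula given earlier, and the counts matrix uses made-up values.

```r
# Made-up monthly counts: one row per cohort, one column per month
counts <- rbind(
  cohort_a = c(100, 80, 50),
  cohort_b = c(400, 300, 100)
)
colnames(counts) <- seq_len(ncol(counts))

# Divide every row by its own first-month value (the starting cohort size)
rates <- sweep(counts, MARGIN = 1, STATS = counts[, 1], FUN = "/")
print(rates)
```

Because sweep() keeps the row and column names, the result already has the shape of the dataset built in step 2.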

In step 3, the retention_plot object is a ggplot2 plot built by starting from an empty ggplot() call and adding four geom_line() layers, one for each cohort.
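
The four nearly identical geom_line() calls can be reduced to one by first reshaping the rates into a long data frame with one row per cohort and month. This is a sketch of a common ggplot2 idiom rather than the recipe's approach; the rates matrix, its made-up values, and the long_rates and long_plot names are illustrative.

```r
# Made-up retention rates: one row per cohort, one column per month
rates <- rbind(
  younger = c(1.00, 0.98, 0.97),
  elder   = c(1.00, 0.92, 0.79)
)

# Reshape to long format: one row per (cohort, month) pair
long_rates <- data.frame(
  cohort    = rep(rownames(rates), times = ncol(rates)),
  month     = rep(seq_len(ncol(rates)), each = nrow(rates)),
  retention = as.vector(rates)  # column-major order matches the rep() calls
)

# Plot only if ggplot2 is installed; a single geom_line() covers all cohorts
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  long_plot <- ggplot(long_rates, aes(x = month, y = retention, colour = cohort)) +
    geom_line() +
    labs(x = "Month", y = "Retention rate", colour = "Cohort")
  print(long_plot)
}
```

With the data in this shape, adding a fifth cohort only requires another row in the matrix, not another layer in the plot.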

Refer to the Adding text to a ggplot2 plot at a custom location recipe in Chapter 3, Basic Visualization Techniques, which provides a good introduction to these plots.
