4.2 Repurchase or Not (Stay or Leave)

4.2.1 Will a Customer Repurchase?

Companies spend significant resources in acquiring potential customers and one of their biggest concerns is whether these newly acquired customers just make a first-time purchase and leave, or make subsequent purchases and stay with companies for an extended period of time. Repurchase or not is usually modeled as a binary outcome where either a repurchase occurs (1) or does not occur (0). The most commonly used method to model this binary outcome is logistic regression. Geoffrey (2006) built a logit model to predict who became an active customer using online activity and surfing behavior as explanatory variables. The author collected clickstream data to track each mouse-click of customers and to analyze customer online behavior. Lemon et al. [18] conducted a study in the television entertainment service subscription industry and estimated customers' keep (repurchase) or drop (churn) decisions. As the study was conducted in a monthly contractual setting, the behavior of the renewal of contract was directly observed and coded (0/1). These authors adopted logistic regression to model the decision of whether to remain in the service relationship as a function of expected future use and satisfaction with the service. Lewis [3] conducted two studies, one in the non-contractual setting of online retailing and the other in the contractual setting of newspaper subscription, to investigate the effects of acquisition promotion discount depth on repeat purchasing and on renewal of contract using logistic regression.

A direct question that managers will ask is whether newly acquired customers will repurchase. This is a binary classification problem that can be tackled with many statistical methods, such as logistic and probit regression. Baesens et al. [19] adopted neural networks (NNs) to solve the problem and used a Bayesian learning paradigm during NN training. These authors used RFM (recency, frequency, monetary value) measures as explanatory variables and compared the predictive results to those of logistic regression and of linear and quadratic discriminant analysis. In addition, Baesens et al. (2004) conducted another study using Bayesian network classifiers to investigate, from initial purchase information, whether newly acquired customers will increase or decrease their future spending. The authors first fitted a linear regression to estimate the slope of each customer's life cycle based on historical contributions. The estimated slope was then discretized into a binary variable (positive/negative) to represent increasing or decreasing spending. This binary variable was used as the dependent variable and customers' past transaction activities were taken as independent variables. In addition to econometric and statistical techniques, we suggest that researchers adopt techniques from machine learning and artificial intelligence to assist customer retention decision modeling.
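The slope-then-discretize step described for Baesens et al. (2004) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: ordinary least squares on the period index is assumed as the linear regression, and the function names and spending figures are hypothetical.

```python
def ols_slope(values):
    """Least-squares slope of values regressed on the period index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two periods to estimate a slope")
    mean_t = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(values))
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    return sxy / sxx


def spending_trend_label(values):
    """Binary target: 1 if the fitted spending slope is positive, 0 otherwise."""
    return 1 if ols_slope(values) > 0 else 0


# Hypothetical spending histories (one value per period) for two customers.
history = {"A": [100, 120, 150, 170], "B": [200, 180, 150, 90]}
labels = {cust: spending_trend_label(spend) for cust, spend in history.items()}
print(labels)
```

The resulting 0/1 label would then serve as the dependent variable of whichever classifier is chosen.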

Bolton et al. [20] conducted a study to investigate the factors that might influence a firm's service contract renewal decision. These authors modeled these decisions as a function of service quality and price and argued that firms assess the value of contract renewal based on their prior service experiences under the old contract. In specifying their model, the authors considered it essential to account for intrafirm association and potential heterogeneity, since firms with different characteristics and demands might assess the value of new contracts differently. Thus, they adopted a random intercept model in their analysis. Following Bolton et al.'s [20] specification, the probability that a firm (N) renews a contract (c) can be modeled as

(4.1) equation

where

(4.2) equation

(4.3) equation

in which the variables associated with the fixed parameters are denoted by vector img. The explanatory variables include both contract-level and firm-level variables. The random intercepts img were assumed to follow a univariate normal distribution across firms (with mean img and estimated variance img). In this way, the amount of intrafirm correlation can be captured by the variance of the random intercept. The authors further assumed that the error term img is independently and identically distributed according to an extreme value distribution. A complementary log–log model is defined as

(4.4) equation

The random intercept model is estimated by marginal maximum likelihood estimation, utilizing a Fisher-scoring solution. We provide an introduction to the random intercept model and its estimation in Appendix J.
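To make the link function concrete, the sketch below evaluates a complementary log-log renewal probability for a few firms whose intercepts are drawn from a normal distribution. It is an illustration under stated assumptions, not Bolton et al.'s model: the standard inverse link P = 1 - exp(-exp(eta)) is used, and the parameter values, covariates and names (`mu`, `sigma`, `beta`, `x_contract`) are hypothetical.

```python
import math
import random


def cloglog_prob(eta):
    """Inverse complementary log-log link: P = 1 - exp(-exp(eta))."""
    return 1.0 - math.exp(-math.exp(eta))


def renewal_prob(firm_intercept, x, beta):
    """Renewal probability for one contract of one firm."""
    eta = firm_intercept + sum(b * xi for b, xi in zip(beta, x))
    return cloglog_prob(eta)


rng = random.Random(0)
mu, sigma = -0.5, 0.8        # hypothetical mean and s.d. of the random intercepts
beta = [0.6, -0.3]           # hypothetical fixed effects (e.g. quality, price)
x_contract = [1.0, 0.5]      # hypothetical covariate values for one contract

# One intercept per firm: identical contracts get different probabilities
# across firms, which is how between-firm heterogeneity enters the model.
firm_intercepts = [rng.gauss(mu, sigma) for _ in range(5)]
probs = [renewal_prob(a, x_contract, beta) for a in firm_intercepts]
for firm, p in enumerate(probs, start=1):
    print(firm, round(p, 3))
```

A larger `sigma` spreads the firm-level probabilities further apart, which corresponds to stronger intrafirm correlation among a firm's own contracts.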

4.2.2 When Will a Customer No Longer Repurchase?

Another important question concerning repurchase behavior is when a customer is likely to leave. Bhattacharya [6] investigated the hazard of lapsing of customers in a paid membership context. The data the author used came from an art museum and contained members' joining, affiliation, and helping characteristics. The author used survival analysis because it can model both the occurrence and the timing of events. The dependent variable in the study was the hazard of lapsing, defined as

(4.5) equation

where img is the instantaneous probability of member img lapsing at time img and img is the probability of an event between time img and img, given that the member is in the sample at risk at time img. In Bhattacharya's [6] study, the hazard rate of lapsing from origin state img (i.e., being a member) to destination state img (i.e., lapsing) can be described as

(4.6) equation

where img is a vector of lagged independent variables observed at time img. The hazard model is usually estimated using maximum likelihood techniques. An introduction to survival analysis and the estimation are provided in Appendix G.
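For orientation, the verbal definition of the hazard given above corresponds to the standard textbook form below. The symbols are assumptions chosen for this note (T_i for member i's lapse time, Δt for the interval width) and are not necessarily the notation of Equation 4.5.

```latex
h_i(t) \;=\; \lim_{\Delta t \to 0}
  \frac{\Pr\left(t \le T_i < t + \Delta t \;\middle|\; T_i \ge t\right)}{\Delta t}
```

The numerator is the probability of an event between t and t + Δt given that the member is still at risk at t; dividing by Δt and taking the limit turns it into an instantaneous rate.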

In a study by Kivetz et al. [21], the authors adopted a discrete-time model in which the hazard model likelihood is decomposed into probabilities of purchase within given time intervals. The full discretized survival function is expressed as a function of the baseline hazard function img, time-varying covariates img, and estimated covariate coefficients img:

(4.7) equation

In the study, the authors decomposed the survival function into day-specific components where the dependent variable is the probability of purchase on a given day, conditional on no purchase having yet occurred:

(4.8) equation

Seetharaman and Chintagunta (2003) give the following likelihood function which can be maximized to estimate the parameters of the discrete-time proportional hazards model at the individual level:

(4.9) equation

where in the authors' context, img is an indicator variable (1/0) that takes the value 1 if the product is purchased by the household on shopping trip img and 0 otherwise, and img is the household's probability of purchasing the product on shopping trip img, given by Equation 4.8. An introduction to the discrete-time hazard models is also provided in Appendix H.
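A discrete-time hazard likelihood of this kind can be sketched as below. This is a minimal illustration under assumptions, not the cited authors' code: the per-interval purchase probability uses the common proportional hazards form 1 - exp(-increment * exp(x'beta)), where `increment` stands for the baseline hazard integrated over the interval, and all data values are hypothetical.

```python
import math


def purchase_prob(baseline_increment, x, beta):
    """P(purchase in an interval | no purchase so far), proportional hazards form."""
    lin = sum(b * xi for b, xi in zip(beta, x))
    return 1.0 - math.exp(-baseline_increment * math.exp(lin))


def log_likelihood(trips, beta):
    """Sum of log p for purchase trips and log(1 - p) for non-purchase trips.

    trips: list of (purchased, baseline_increment, covariates) tuples.
    """
    ll = 0.0
    for purchased, inc, x in trips:
        p = purchase_prob(inc, x, beta)
        ll += math.log(p) if purchased else math.log(1.0 - p)
    return ll


def survival(trips, beta):
    """Probability of no purchase in any of the listed intervals."""
    s = 1.0
    for _, inc, x in trips:
        s *= 1.0 - purchase_prob(inc, x, beta)
    return s


# Hypothetical household: no purchase on trips 1 and 2, purchase on trip 3.
trips = [(0, 0.1, [1.0]), (0, 0.2, [0.0]), (1, 0.3, [1.0])]
print(log_likelihood(trips, [0.5]), survival(trips, [0.5]))
```

In practice `log_likelihood` would be handed to a numerical optimizer to find the coefficients that maximize it.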

In Borle et al.'s (2008) study, the authors adopted a discrete-hazard approach to model the hazard of lifetime img for customer img, which is the risk of leaving in the imgth spell (probability that the customer will leave the company without making the imgth purchase after having made the (img)th purchase):

(4.10) equation

where img is specified as follows:

(4.11) equation

where img and img indexes the purchase occasion. This third-order polynomial expression addressed non-stationarity across purchase occasions. The authors also allow for a heterogeneity structure over the coefficients for the lagged variables as follows:

(4.12) equation

(4.13) equation

The authors also estimated the defection model jointly with the interpurchase time and purchase amount models by assigning appropriate prior distributions to the parameters to be estimated and using a Markov chain Monte Carlo (MCMC) sampling algorithm.

Schweidel, Fader, and Bradlow [7] argued that after customers have been acquired, they churn following a parametric distribution. Incorporating time-varying covariates into the retention modeling, these authors adopted a proportional hazards regression using a baseline hazard function, img. As in the acquisition modeling, three possible baseline hazard specifications were considered for the retention process: the Weibull, log-logistic, and expo-power distributions. The survival function for the retention modeling was

(4.14) equation

where img is the impact of the time-varying marketing activities, denoted img. The duration of service retention img is distributed as

(4.15) equation

where img. An introduction to the proportional hazards model is provided in Appendix I.
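The sketch below shows how a parametric baseline and time-varying covariates can combine in a proportional hazards survival function. It is an illustration under assumptions rather than the authors' specification: the Weibull cumulative hazard is written as lam * t**c, one covariate per period is used, and every parameter value is hypothetical.

```python
import math


def weibull_cum_hazard(t, lam, c):
    """Cumulative baseline hazard H0(t) = lam * t**c (one common Weibull form)."""
    return lam * t ** c


def survival(t, x_by_period, beta, lam, c):
    """P(still a customer after period t) with one covariate value per period."""
    total = 0.0
    for v in range(1, t + 1):
        # Baseline hazard accumulated during period v, scaled by that period's covariate.
        increment = weibull_cum_hazard(v, lam, c) - weibull_cum_hazard(v - 1, lam, c)
        total += increment * math.exp(beta * x_by_period[v - 1])
    return math.exp(-total)


def churn_prob(t, x_by_period, beta, lam, c):
    """P(service ends in period t) = S(t - 1) - S(t)."""
    return survival(t - 1, x_by_period, beta, lam, c) - survival(t, x_by_period, beta, lam, c)


marketing = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]   # hypothetical marketing activity by period
lam, c, beta = 0.05, 1.3, -0.4               # hypothetical parameters
for t in range(1, 7):
    print(t, round(survival(t, marketing, beta, lam, c), 4),
          round(churn_prob(t, marketing, beta, lam, c), 4))
```

With a negative `beta`, periods with marketing activity contribute less accumulated hazard, so survival declines more slowly in those periods.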

4.2.3 Empirical Example: Repurchase or Not (Stay or Leave)

One of the key questions we want to answer with regard to customer retention is whether we can determine which customers have the highest likelihood of repurchase. To do this we first need to know which current customers actually made additional purchases after their first purchase. In the dataset provided for this chapter we have a binary variable which identifies whether or not a customer purchased in a given time period, in this case a quarter. We also provide a set of drivers which are likely to help explain a customer's decision to repurchase. At the end of this example you should be able to do the following:

1. Identify the drivers of customer repurchase behavior.
2. Interpret the parameter estimates from the repurchase model.
3. Predict the number of repeat purchases by customers.
4. Determine the predictive accuracy of the repurchase model.

A B2C firm wants to improve the repurchase rate of customers and reduce the retention spending on customers by better understanding which customers are most likely to repurchase in a given time period. A random sample of 500 customers from a single cohort was taken from the customer database. The information we need for our model includes the following list of variables:

Dependent variable
  Purchase            1 if the customer purchased in the given quarter, 0 if no purchase occurred in that quarter

Independent variables
  Lag_Purchase        1 if the customer purchased in the previous quarter, 0 if no purchase occurred in the previous quarter
  Avg_Order_Quantity  The average dollar value of the purchases in all previous quarters
  Ret_Expense         Dollars spent on marketing efforts to try and retain that customer in the given quarter
  Ret_Expense_SQ      Square of dollars spent on marketing efforts to try and retain that customer in the given quarter
  Gender              1 if the customer is male, 0 if the customer is female
  Married             1 if the customer is married, 0 if the customer is not married
  Income              1 if income < $30 000
                      2 if $30 001 < income < $45 000
                      3 if $45 001 < income < $60 000
                      4 if $60 001 < income < $75 000
                      5 if $75 001 < income < $90 000
                      6 if income > $90 001
  First_Purchase      The value of the first purchase made by the customer in quarter 1
  Loyalty             1 if the customer is a member of the loyalty program, 0 if not

In this case, we have a binary dependent variable (Purchase) which tells us whether the customer did purchase (= 1) or did not purchase (= 0) in a given quarter. We also have nine independent variables that we believe will be drivers of repurchase behavior.

We believe that transaction behavior in the past is likely to explain future purchase behavior. As a result we use several lagged operationalizations of current variables as independent variables in this example. First, we have whether or not the customer purchased in the last quarter (Lag_Purchase). This variable can be obtained by taking the lagged value of the purchase indicator variable, noting that one observation will be lost for each customer for each lag that is taken. In this case we are only using a one-period lag. Second, we have the average past order quantity (Avg_Order_Quantity). In this case the value for average order quantity is the mean of the Order_Quantity variable in all quarters before the current time period. Third, we have how many dollars the firm spent on each customer (Ret_Expense) in each time period and the squared value of that variable (Ret_Expense_SQ). We want to use both the linear and squared terms since we expect that for each additional dollar spent on the retention effort for a given customer, there will be a diminishing return to the value of that dollar. Finally, since the focal firm of this example is a B2C firm, the other five variables are demographic and static variables of the customers. These include the Gender of the customer, whether the customer is Married, the Income of the customer, the value of the customer's first purchase (First_Purchase), and whether the customer is a member of the loyalty program (Loyalty).
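The lagged and squared variables described above can be built as in the sketch below. It is a minimal illustration with hypothetical data for one customer; it assumes that the average is taken over all earlier quarters, including quarters with no purchase, which the text does not state explicitly.

```python
def build_rows(purchase, order_quantity, ret_expense):
    """Turn one customer's quarterly series into model rows.

    Index 0 is quarter 1; it is dropped because it has no lagged value.
    """
    rows = []
    for q in range(1, len(purchase)):
        past_orders = order_quantity[:q]          # all quarters before the current one
        rows.append({
            "Quarter": q + 1,
            "Purchase": purchase[q],
            "Lag_Purchase": purchase[q - 1],      # one-period lag of the purchase indicator
            "Avg_Order_Quantity": sum(past_orders) / len(past_orders),
            "Ret_Expense": ret_expense[q],
            "Ret_Expense_SQ": ret_expense[q] ** 2,
        })
    return rows


# Hypothetical customer observed for four quarters.
rows = build_rows(
    purchase=[1, 0, 1, 1],
    order_quantity=[100.0, 0.0, 60.0, 80.0],
    ret_expense=[0.0, 12.0, 15.0, 10.0],
)
for row in rows:
    print(row)
```

Four quarters of data yield three usable rows, which illustrates the one observation lost per customer for each lag taken.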

First, we need to model the probability that a customer will purchase in a given time period. Since our dependent variable (Purchase) is binary, we select a logistic regression to estimate the model. We could also select a probit model and in general achieve the same results. In this case the y variable is Purchase and the x variables represent the nine independent variables in our database. When we run the logistic regression we get the following result:

[Logistic regression output: table of parameter estimates, image not available]

As we can see from the results, seven of the nine independent variables are significant at a p-value of 5% or better with only Married and First_Purchase being statistically non-significant. First, this means that Lag_Purchase has a positive effect on current purchase, that is, customers who made a purchase in the previous quarter are more likely to make a purchase in the current quarter. Second, since the coefficient on Avg_Order_Quantity is positive and statistically significant, this means that customers who in the past have spent more on average are also more likely to purchase in the current time period. Third, we find a positive, but a diminishing, return on the effect of retention spending (Ret_Expense) on purchasing in the same quarter since the coefficient on Ret_Expense is positive and the coefficient on Ret_Expense_SQ is negative. Fourth, we find a positive effect for females (negative coefficient on Gender) meaning that females are generally more likely to purchase than males. Fifth, we find a positive income effect suggesting that customers who have a higher income are more likely to purchase in the current quarter. Finally, since the coefficient on Loyalty is positive this suggests that customers who are members of the loyalty program are more likely to purchase in the current quarter.
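For readers who want to see the estimation step itself, the sketch below fits a logistic regression by Newton-Raphson in plain Python. It is not the estimation used in this example: the data are simulated, only two of the drivers (Lag_Purchase and Loyalty) are included, and the "true" coefficients are hypothetical values chosen for the simulation.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x


def fit_logit(X, y, iters=25):
    """Maximum likelihood estimates; each row of X must start with a 1 (intercept)."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
            w = p * (1.0 - p)
            for a in range(k):
                grad[a] += (yi - p) * xi[a]
                for c in range(k):
                    hess[a][c] += w * xi[a] * xi[c]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta


# Simulated data: intercept, Lag_Purchase, Loyalty (all values hypothetical).
rng = random.Random(42)
true_beta = [-1.0, 2.0, 0.3]
X, y = [], []
for _ in range(3000):
    row = [1.0, float(rng.random() < 0.5), float(rng.random() < 0.4)]
    X.append(row)
    y.append(1 if rng.random() < sigmoid(sum(b * v for b, v in zip(true_beta, row))) else 0)

beta_hat = fit_logit(X, y)
print([round(b, 3) for b in beta_hat])
```

A statistical package would additionally report standard errors and p-values, which are what the significance statements above rely on.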

It is also important to understand exactly how changes in the drivers of repurchase likelihood lead to increases or decreases in that likelihood. To do this we need to determine the odds ratio for each of the parameter estimates. Because the coefficients of a logistic regression are on the log-odds scale, the odds ratio for a variable is obtained by exponentiating its coefficient. For example, for Lag_Purchase, we want to compare the odds of repurchase when Lag_Purchase = 0 with the odds when Lag_Purchase = 1. For Lag_Purchase = 0, we get the following:

equation

and, for Lag_Purchase = 1,

equation

By dividing the second equation by the first we get:

equation

We then simplify the equation to get the following:

equation
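In generic notation, the derivation just described takes the following form. The symbols are assumptions made for this note: P_0 and P_1 are the purchase probabilities when Lag_Purchase is 0 and 1, z collects the intercept and the contributions of all other variables, and β_Lag is the coefficient on Lag_Purchase.

```latex
\begin{align*}
\frac{P_0}{1 - P_0} &= \exp(z)
  && \text{(odds when Lag\_Purchase} = 0\text{)} \\
\frac{P_1}{1 - P_1} &= \exp(z + \beta_{\text{Lag}})
  && \text{(odds when Lag\_Purchase} = 1\text{)} \\
\frac{P_1 / (1 - P_1)}{P_0 / (1 - P_0)} &= \frac{\exp(z + \beta_{\text{Lag}})}{\exp(z)}
  && \text{(second divided by first)} \\
&= \exp(\beta_{\text{Lag}})
  && \text{(after simplifying)}
\end{align*}
```

Because z cancels, the ratio does not depend on the values of the other variables.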

When we compute the odds ratio for each of the statistically significant variables we get the following results for a 1-unit increase in the independent variable. For binary variables such as Gender, the odds ratio is simply exp(βvariable).

Variable              Odds ratio
Lag_Purchase          8.542
Avg_Order_Quantity    1.018
Ret_Expense           exp(0.105 - 0.004*Ret_Expense)
Gender                0.817
Income                1.141
Loyalty               1.307

We gain the following insights from these ratios. With regard to Lag_Purchase, the odds of purchasing in the current quarter for a customer who purchased in the previous quarter are 8.542 times (that is, 754.2% higher than) the odds for a customer who did not purchase in the previous quarter. With regard to Avg_Order_Quantity, every $1 increase raises the odds of purchase in the current quarter by 1.8%. With regard to Ret_Expense, the odds ratio depends on the level of Ret_Expense, because the model includes both the level and squared terms for Ret_Expense. For example, if we usually spend $15 on a given customer, then by spending $16 we should see the odds of purchase change by a factor of exp(0.105 - 0.004*16) = exp(0.041) = 1.042. This means that by increasing our spending from $15 to $16, we should see the odds of purchase increase by 4.2%. It is important to note that this effect will vary depending on the initial level of Ret_Expense. With regard to Gender, the odds of purchasing in a given period are 18.3% lower for males than for females. With regard to Income, each one-level increase in Income raises the odds of purchase by 14.1%. Finally, with regard to Loyalty, members of the loyalty program have odds of purchasing in a given quarter that are 30.7% higher than those of customers who are not in the loyalty program.
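The arithmetic behind these interpretations can be checked with a few lines of code. The sketch below uses the ratios reported in the table and the Ret_Expense expression quoted in the text. The helper names are hypothetical, and the 30% baseline probability in the last step is an arbitrary illustrative value; that step is included because a ratio of odds is not the same as a ratio of probabilities.

```python
import math


def pct_change_in_odds(ratio):
    """Percentage change in the odds implied by an odds ratio."""
    return (ratio - 1.0) * 100.0


def ret_expense_odds_ratio(ret_expense, b_level=0.105, b_slope=-0.004):
    """Level-dependent ratio quoted in the text: exp(0.105 - 0.004 * Ret_Expense)."""
    return math.exp(b_level + b_slope * ret_expense)


def apply_odds_ratio(baseline_prob, ratio):
    """Probability after multiplying the baseline odds by an odds ratio."""
    odds = ratio * baseline_prob / (1.0 - baseline_prob)
    return odds / (1.0 + odds)


reported = {"Lag_Purchase": 8.542, "Avg_Order_Quantity": 1.018,
            "Gender": 0.817, "Income": 1.141, "Loyalty": 1.307}
for name, ratio in reported.items():
    print(name, round(pct_change_in_odds(ratio), 1))

print(round(ret_expense_odds_ratio(16), 3))        # ratio at the $16 spending level
print(round(apply_odds_ratio(0.30, 8.542), 3))     # hypothetical 30% baseline probability
```

Multiplying the odds by 8.542 raises a 30% probability to roughly 79%, which is a much smaller relative change in probability than the change in odds.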

Now that we have determined the drivers of repurchase behavior by customers we need to use the output of the model to determine our model's predictive accuracy. To do this we need to use the estimates we obtained from the repurchase model to help us determine the predicted probability that each customer will repurchase. We use the parameter estimates from the repurchase model and values for the x variables for each customer in each time period to predict whether a customer is likely to purchase in that time period. For a logistic regression we must apply the proper probability function as noted earlier in the chapter:

equation

Once we compute the probability of repurchase, we need to create a cutoff value to determine at which point we are going to divide the customers into the two groups – predicted to purchase and predicted not to purchase. There is no rule that explicitly tells us what that cutoff number should be. Often by default we select 0.5 since it is equidistant from 0 and 1. However, it is also reasonable to check multiple cutoff values and choose the one that provides the best predictive accuracy for the dataset. By using 0.5 as the cutoff for our example, any customer whose predicted probability of repurchase is greater than or equal to 0.5 is classified as predicted to purchase and the rest are predicted not to purchase. To determine the predictive accuracy we compare the predicted to the actual repurchase values in a 2 × 2 table. For our sample of 500 customers over 11 quarters (we drop quarter 1 since all customers purchased in quarter 1) we get Table 4.2.
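The scoring, cutoff, and cross-tabulation steps can be sketched as follows. This is an illustration with hypothetical predicted probabilities and outcomes, not the chapter's dataset; `predict_prob` applies the standard logistic function to a linear predictor.

```python
import math


def predict_prob(x, beta):
    """Logistic probability 1 / (1 + exp(-x'beta)) for one observation."""
    z = sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))


def confusion(probs, actual, cutoff=0.5):
    """2 x 2 table keyed by (actual, predicted); predicted = 1 if prob >= cutoff."""
    table = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for prob, a in zip(probs, actual):
        table[(a, 1 if prob >= cutoff else 0)] += 1
    return table


def accuracy(table):
    """Share of observations on the diagonal of the 2 x 2 table."""
    return (table[(0, 0)] + table[(1, 1)]) / sum(table.values())


def best_cutoff(probs, actual, candidates):
    """Candidate cutoff with the highest in-sample accuracy (first one wins ties)."""
    return max(candidates, key=lambda c: accuracy(confusion(probs, actual, c)))


# Hypothetical predicted probabilities and observed outcomes.
probs = [0.10, 0.40, 0.45, 0.60, 0.80, 0.55]
actual = [0, 0, 1, 1, 1, 0]

print(accuracy(confusion(probs, actual, 0.5)))
print(best_cutoff(probs, actual, [0.3, 0.4, 0.45, 0.5, 0.6]))
```

A cutoff tuned on the estimation sample can overstate accuracy, so it is prudent to confirm the chosen value on data not used for fitting.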

Table 4.2 Predicted versus actual repurchase.

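(Counts as implied by the figures reported in the text.)

                       Predicted: no purchase   Predicted: purchase   Total
Actual: no purchase                      2494                   328    2822
Actual: purchase                          330                  2348    2678
Total                                    2824                  2676    5500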

As we can see from the table, our in-sample model correctly predicts 88.4% of the observations in which customers chose not to purchase in a given time period (2494/2822) and 87.7% of those in which customers chose to purchase (2348/2678). This is a substantial improvement over a random guess model, which would be only 51.3% accurate for this dataset. Since our model performs much better than the best alternative, in this case a random guess model, we conclude that the predictive accuracy of the model is good. If other benchmark models are available for comparison, the ‘best’ model would be the one that provides the highest accuracy for both predicted purchase and predicted non-purchase, in other words the one with the highest sum of the diagonal. In this case the sum of the diagonal is 4842, so the model is accurate 88.0% of the time (4842/5500).
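The percentages quoted here follow directly from the four counts given in the text, as the short check below shows. The variable names are hypothetical. Note that 51.3% equals the share of no-purchase observations (2822/5500); reading the benchmark as "always predict the larger class" is an interpretation, not something the text states.

```python
actual_no, correct_no = 2822, 2494      # no-purchase observations, correctly predicted
actual_yes, correct_yes = 2678, 2348    # purchase observations, correctly predicted
total = actual_no + actual_yes

rate_no = correct_no / actual_no                  # accuracy among non-purchases
rate_yes = correct_yes / actual_yes               # accuracy among purchases
overall = (correct_no + correct_yes) / total      # sum of the diagonal over all observations
benchmark = max(actual_no, actual_yes) / total    # share of the larger class

for label, value in [("no purchase", rate_no), ("purchase", rate_yes),
                     ("overall", overall), ("benchmark", benchmark)]:
    print(label, f"{value:.1%}")
```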

As a result we now know how changes in retention expense, past customer transactions, and customer characteristics are likely to either increase or decrease our likelihood of repurchase. And we also know that these drivers do a good job in helping us predict whether a customer is going to repurchase or not. This information can provide significant insights to managers who are charged with determining the optimal amount of resources to spend on retention efforts.

4.2.4 How Do You Implement it?

To implement the logistic regression in this example we used the LOGISTIC procedure (PROC LOGISTIC) in SAS. To determine predictive accuracy we used a SAS DATA step and the FREQ procedure (PROC FREQ). While we used SAS to estimate the model and determine predictive accuracy, many other statistical packages are capable of estimating a logistic regression, including (but not limited to) SPSS, MATLAB, and GAUSS.
