Credit risk analysis pipeline with Spark

In this section, we will develop a credit risk pipeline of the kind commonly used in financial institutions such as banks and credit unions. First we will discuss what credit risk analysis is and why it is important, before developing a Spark ML-based pipeline around a Random Forest classifier. Finally, we will provide some suggestions for improving performance.

What is credit risk analysis? Why is it important?

When a bank receives a loan application, it has to decide, based on the applicant's profile, whether to approve or reject the application.

In this regard, there are two types of risk associated with the bank's decision on the loan application:

  • Applicant is a good credit risk: The client or applicant is likely to repay the loan. In this case, if the loan is not approved, the bank potentially suffers a loss of business.
  • Applicant is a bad credit risk: The client or applicant is most likely not going to repay the loan. In this case, approving the loan results in a financial loss to the bank.

Common sense says that the second type of risk is the greater one, as the bank has a higher chance of not being reimbursed for the borrowed amount.

Therefore, most banks and credit unions evaluate the risks associated with lending money to an applicant. In business analytics terms, minimizing this risk is what maximizes the bank's profit; in other words, from a financial perspective, maximizing profit means minimizing loss.

Often, the bank decides on a loan application based on various factors and parameters describing the applicant, for example, their demographic and socio-economic conditions.

Developing a credit risk analysis pipeline with Spark ML

In this section, we will first discuss the credit risk dataset in detail in order to gain some insight. After that, we will look at how to develop a large-scale credit risk pipeline. Finally, we will provide some performance improvement suggestions toward better prediction accuracy.

The dataset exploration

The German Credit dataset was downloaded from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/. Although a detailed description of the dataset is available at that link, we provide some brief insights in Table 3. The dataset contains credit-related information on 21 variables for 1,000 loan applicants, together with a classification of whether each applicant is considered a good or a bad credit risk. Table 3 details each variable:

Entry  Variable            Explanation
1      creditability       Capable of repaying the loan
2      balance             Current balance
3      duration            Duration of the loan applied for
4      history             Is there any bad loan history?
5      purpose             Purpose of the loan
6      amount              Amount applied for
7      savings             Monthly savings
8      employment          Employment status
9      instPercent         Installment rate as a percentage of disposable income
10     sexMarried          Sex and marital status
11     guarantors          Are there any guarantors?
12     residenceDuration   Duration of residence at the current address
13     assets              Net assets
14     age                 Age of the applicant
15     concCredit          Concurrent credit
16     apartment           Residential status
17     credits             Number of existing credits
18     occupation          Occupation
19     dependents          Number of dependents
20     hasPhone            Whether the applicant has a phone
21     foreign             Whether the applicant is a foreign worker

Table 3: German credit dataset properties

Note that the dataset file itself does not contain a header row; Table 3 therefore lists the position, name, and meaning of each variable.

Credit risk pipeline with Spark ML

The pipeline involves several steps: data loading and parsing, data preparation, training and test set preparation, model training, model evaluation, and result interpretation. Let's go through the steps one by one.

Step 1: Load required APIs and libraries

The following is the code for loading the required APIs and libraries:

import org.apache.spark.api.java.JavaRDD; 
import org.apache.spark.api.java.function.Function; 
import org.apache.spark.ml.classification.RandomForestClassificationModel; 
import org.apache.spark.ml.classification.RandomForestClassifier; 
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator; 
import org.apache.spark.ml.feature.StringIndexer; 
import org.apache.spark.ml.feature.VectorAssembler; 
import org.apache.spark.mllib.evaluation.RegressionMetrics; 
import org.apache.spark.sql.Dataset; 
import org.apache.spark.sql.Row; 
import org.apache.spark.sql.SparkSession; 

Step 2: Create a Spark session

The following code creates a Spark session:

  static SparkSession spark = SparkSession.builder() 
      .appName("CreditRiskAnalysis") 
      .master("local[*]") 
      .config("spark.sql.warehouse.dir", "E:/Exp/") 
      .getOrCreate();  

Step 3: Load and parse the credit risk dataset

Note that the dataset is in Comma-Separated Values (CSV) format. Now load and parse the dataset using the Databricks-provided CSV reader and prepare a Dataset of Row, as follows:

String csvFile = "input/german_credit.data"; 
Dataset<Row> df = spark.read().format("com.databricks.spark.csv").option("header", "false").load(csvFile); 
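
Note that on Spark 2.x the CSV reader is built in, so the external Databricks package is not strictly required. The following is a minimal alternative sketch; the column names passed to toDF() are our own addition taken from Table 3, since the file itself has no header:

Dataset<Row> df2 = spark.read() 
    .option("header", "false") // the file has no header row 
    .csv(csvFile) 
    .toDF("creditability", "balance", "duration", "history", "purpose", "amount", 
          "savings", "employment", "instPercent", "sexMarried", "guarantors", 
          "residenceDuration", "assets", "age", "concCredit", "apartment", 
          "credits", "occupation", "dependents", "hasPhone", "foreign"); 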

Now, show the Dataset to get to know the exact structure, as follows:

df.show(); 

Figure 36: A snapshot of the credit risk dataset

Step 4: Create an RDD of type Credit

Create an RDD of typed class Credit, as follows:

JavaRDD<Credit> creditRDD = df.toJavaRDD().map(new Function<Row, Credit>() { 
      @Override 
      public Credit call(Row r) throws Exception { 
        return new Credit(parseDouble(r.getString(0)), parseDouble(r.getString(1)) - 1, 
            parseDouble(r.getString(2)), parseDouble(r.getString(3)), parseDouble(r.getString(4)), 
            parseDouble(r.getString(5)), parseDouble(r.getString(6)) - 1, parseDouble(r.getString(7)) - 1, 
            parseDouble(r.getString(8)), parseDouble(r.getString(9)) - 1, parseDouble(r.getString(10)) - 1, 
            parseDouble(r.getString(11)) - 1, parseDouble(r.getString(12)) - 1, 
            parseDouble(r.getString(13)), parseDouble(r.getString(14)) - 1, 
            parseDouble(r.getString(15)) - 1, parseDouble(r.getString(16)) - 1, 
            parseDouble(r.getString(17)) - 1, parseDouble(r.getString(18)) - 1, 
            parseDouble(r.getString(19)) - 1, parseDouble(r.getString(20)) - 1); 
      } 
    }); 

The preceding code segment creates an RDD of type Credit, converting each column from String to double using the parseDouble() method, which takes a String and returns the corresponding value as a double. The - 1 offsets shift the 1-based categorical codes in the raw data so that they start from zero. The parseDouble() method is as follows:

  public static double parseDouble(String str) { 
    return Double.parseDouble(str); 
  } 
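
If you are compiling with Java 8, the same mapping can be written more compactly as a lambda. The following sketch is equivalent to the anonymous Function above: it parses every column to double and applies the same - 1 re-indexing to the categorical fields:

JavaRDD<Credit> creditRDD2 = df.toJavaRDD().map(r -> { 
    // Parse all 21 columns to double first. 
    double[] v = new double[21]; 
    for (int i = 0; i < v.length; i++) { 
      v[i] = Double.parseDouble(r.getString(i).trim()); 
    } 
    // Apply the same "minus one" re-indexing as the anonymous Function above. 
    return new Credit(v[0], v[1] - 1, v[2], v[3], v[4], v[5], v[6] - 1, v[7] - 1, 
        v[8], v[9] - 1, v[10] - 1, v[11] - 1, v[12] - 1, v[13], v[14] - 1, 
        v[15] - 1, v[16] - 1, v[17] - 1, v[18] - 1, v[19] - 1, v[20] - 1); 
  }); 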

Now let's look at the structure of the Credit class, since it is this typed class that allows us to create the RDD.

The Credit class is a plain Java bean (POJO) that holds the 21 variables from the dataset, initializes them through its constructor, and exposes setter and getter methods for each. Here is the class:

public class Credit { 
  private double creditability; 
  private double balance; 
  private double duration; 
  private double history; 
  private double purpose; 
  private double amount; 
  private double savings; 
  private double employment; 
  private double instPercent; 
  private double sexMarried; 
  private double guarantors; 
  private double residenceDuration; 
  private double assets; 
  private double age; 
  private double concCredit; 
  private double apartment; 
  private double credits; 
  private double occupation; 
  private double dependents; 
  private double hasPhone; 
  private double foreign; 
 
  public Credit(double creditability, double balance, double duration, 
  double history, double purpose, double amount, 
      double savings, double employment, double instPercent, 
      double sexMarried, double guarantors, 
      double residenceDuration, double assets, double age, 
      double concCredit, double apartment, double credits, 
      double occupation, double dependents, double hasPhone, double foreign) { 
    this.creditability = creditability; 
    this.balance = balance; 
    this.duration = duration; 
    this.history = history; 
    this.purpose = purpose; 
    this.amount = amount; 
    this.savings = savings; 
    this.employment = employment; 
    this.instPercent = instPercent; 
    this.sexMarried = sexMarried; 
    this.guarantors = guarantors; 
    this.residenceDuration = residenceDuration; 
    this.assets = assets; 
    this.age = age; 
    this.concCredit = concCredit; 
    this.apartment = apartment; 
    this.credits = credits; 
    this.occupation = occupation; 
    this.dependents = dependents; 
    this.hasPhone = hasPhone; 
    this.foreign = foreign; 
  } 
 
  public double getCreditability() { 
    return creditability; 
  } 
 
  public void setCreditability(double creditability) { 
    this.creditability = creditability; 
  } 
 
  public double getBalance() { 
    return balance; 
  } 
 
  public void setBalance(double balance) { 
    this.balance = balance; 
  } 
 
  public double getDuration() { 
    return duration; 
  } 
 
  public void setDuration(double duration) { 
    this.duration = duration; 
  } 
 
  public double getHistory() { 
    return history; 
  } 
 
  public void setHistory(double history) { 
    this.history = history; 
  } 
 
  public double getPurpose() { 
    return purpose; 
  } 
 
  public void setPurpose(double purpose) { 
    this.purpose = purpose; 
  } 
 
  public double getAmount() { 
    return amount; 
  } 
 
  public void setAmount(double amount) { 
    this.amount = amount; 
  } 
 
  public double getSavings() { 
    return savings; 
  } 
 
  public void setSavings(double savings) { 
    this.savings = savings; 
  } 
 
  public double getEmployment() { 
    return employment; 
  } 
 
  public void setEmployment(double employment) { 
    this.employment = employment; 
  } 
 
  public double getInstPercent() { 
    return instPercent; 
  } 
 
  public void setInstPercent(double instPercent) { 
    this.instPercent = instPercent; 
  } 
 
  public double getSexMarried() { 
    return sexMarried; 
  } 
 
  public void setSexMarried(double sexMarried) { 
    this.sexMarried = sexMarried; 
  } 
 
  public double getGuarantors() { 
    return guarantors; 
  } 
 
  public void setGuarantors(double guarantors) { 
    this.guarantors = guarantors; 
  } 
 
  public double getResidenceDuration() { 
    return residenceDuration; 
  } 
 
  public void setResidenceDuration(double residenceDuration) { 
    this.residenceDuration = residenceDuration; 
  } 
 
  public double getAssets() { 
    return assets; 
  } 
 
  public void setAssets(double assets) { 
    this.assets = assets; 
  } 
 
  public double getAge() { 
    return age; 
  } 
 
  public void setAge(double age) { 
    this.age = age; 
  } 
 
  public double getConcCredit() { 
    return concCredit; 
  } 
 
  public void setConcCredit(double concCredit) { 
    this.concCredit = concCredit; 
  } 
 
  public double getApartment() { 
    return apartment; 
  } 
 
  public void setApartment(double apartment) { 
    this.apartment = apartment; 
  } 
 
  public double getCredits() { 
    return credits; 
  } 
 
  public void setCredits(double credits) { 
    this.credits = credits; 
  } 
 
  public double getOccupation() { 
    return occupation; 
  } 
 
  public void setOccupation(double occupation) { 
    this.occupation = occupation; 
  } 
 
  public double getDependents() { 
    return dependents; 
  } 
 
  public void setDependents(double dependents) { 
    this.dependents = dependents; 
  } 
 
  public double getHasPhone() { 
    return hasPhone; 
  } 
 
  public void setHasPhone(double hasPhone) { 
    this.hasPhone = hasPhone; 
  } 
 
  public double getForeign() { 
    return foreign; 
  } 
 
  public void setForeign(double foreign) { 
    this.foreign = foreign; 
  } 
} 

If you look at the flow of the class, it first declares 21 fields for the 21 variables in the dataset, then initializes them through the constructor. The rest are simple setter and getter methods.

Step 5: Create a Dataset of type Row from the RDD of type Credit

The following code shows how to create a Dataset of type Row:

Dataset<Row> creditData = spark.sqlContext().createDataFrame(creditRDD, Credit.class); 

Now register the Dataset as a temporary view, that is, an in-memory table that can be queried, as follows:

creditData.createOrReplaceTempView("credit"); 
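
With the view registered, you can run ad hoc SQL against it. For instance, a quick sanity check along these lines (our own addition, not part of the original pipeline) counts the applicants per creditability class:

// Count the applicants in each creditability class via the temporary view. 
spark.sql("SELECT creditability, COUNT(*) AS cnt FROM credit GROUP BY creditability").show(); 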

Now let's get to know the schema of the table as follows:

creditData.printSchema(); 

Figure 37: The schema of the Dataset

Step 6: Create the feature vector using the VectorAssembler

Create a feature vector from the 20 predictor variables (all columns except creditability, which will become the label) using Spark's VectorAssembler class, as follows:

VectorAssembler assembler = new VectorAssembler() 
        .setInputCols(new String[] { "balance", "duration", "history", "purpose", "amount", "savings", 
            "employment", "instPercent", "sexMarried", "guarantors", "residenceDuration", "assets", "age", 
            "concCredit", "apartment", "credits", "occupation", "dependents", "hasPhone", "foreign" }) 
        .setOutputCol("features"); 

Step 7: Create a Dataset by applying the assembler

Create a Dataset by transforming the previously created creditData Dataset with the assembler, and print the top 20 rows, as follows:

Dataset<Row> assembledFeatures = assembler.transform(creditData); 
assembledFeatures.show(); 

Figure 38: Newly created featured Credit Dataset

Step 8: Create the label column

Create a label column from the creditability column of the preceding Dataset (Figure 38), as follows:

StringIndexer creditabilityIndexer = new StringIndexer().setInputCol("creditability").setOutputCol("label"); 
Dataset<Row> creditabilityIndexed = creditabilityIndexer.fit(assembledFeatures).transform(assembledFeatures); 

Now let's explore the new Dataset using the show() method as follows:

creditabilityIndexed.show(); 

Figure 39: Dataset with a new label column

From the preceding figure, we can see that there are only two labels in the Dataset, 1.0 and 0.0, which makes this a binary classification problem.
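
You can verify this directly by counting the rows per label; a quick check along these lines (not part of the original code):

// Count good vs. bad credit labels; we expect exactly two groups. 
creditabilityIndexed.groupBy("label").count().show(); 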

Step 9: Prepare the training and test set

Prepare the training and test set as follows:

long splitSeed = 12345L; 
Dataset<Row>[] splits = creditabilityIndexed.randomSplit(new double[] { 0.7, 0.3 }, splitSeed); 
Dataset<Row> trainingData = splits[0]; 
Dataset<Row> testData = splits[1]; 

Here, the split ratio is 70% for the training set and 30% for the test set, with a fixed seed value so that the split is reproducible across runs.
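
Since both splits are reused, once for fitting and once for evaluation, it can be worth caching them. A small optional optimization, not in the original pipeline:

// Keep both splits in memory across the fit and evaluation passes. 
trainingData.cache(); 
testData.cache(); 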

Step 10: Train the Random Forest model

First, configure the Random Forest classifier, as follows:

RandomForestClassifier classifier = new RandomForestClassifier() 
        .setImpurity("gini") 
        .setMaxDepth(3) 
        .setNumTrees(20) 
        .setFeatureSubsetStrategy("auto") 
        .setSeed(splitSeed); 

As previously mentioned, this is a binary classification problem. Therefore, we fit the model on the training data and evaluate it with a binary classification evaluator on the label column (BinaryClassificationEvaluator reports the area under the ROC curve by default), as follows:

RandomForestClassificationModel model = classifier.fit(trainingData); 
BinaryClassificationEvaluator evaluator = new BinaryClassificationEvaluator().setLabelCol("label"); 

Now we need to collect the model's predictions on the test set and, optionally, print the trained trees, as follows:

Dataset<Row> predictions = model.transform(testData); 
System.out.println(model.toDebugString()); // print the trained forest for inspection 
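
Random Forest models also expose per-feature importances, which can help interpret which variables drive the predictions. A short optional sketch:

// Importances are indexed in the column order given to the VectorAssembler in Step 6. 
System.out.println("Feature importances: " + model.featureImportances()); 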

Step 11: Print the performance parameters

We will observe several performance measures: the evaluator score after fitting the model (the area under the ROC curve by default), as well as regression-style metrics such as the Mean Squared Error (MSE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), R squared, and explained variance. Let's do it as follows:

double accuracy = evaluator.evaluate(predictions); 
System.out.println("Accuracy after pipeline fitting: " + accuracy); 
// Note: RegressionMetrics reads the first two columns of the Dataset passed to 
// it as (prediction, observation), so treat the regression-style figures below 
// as illustrative rather than a rigorous assessment of this classifier. 
RegressionMetrics rm = new RegressionMetrics(predictions); 
System.out.println("MSE: " + rm.meanSquaredError()); 
System.out.println("MAE: " + rm.meanAbsoluteError()); 
System.out.println("RMSE: " + rm.rootMeanSquaredError()); 
System.out.println("R Squared: " + rm.r2()); 
System.out.println("Explained Variance: " + rm.explainedVariance() + "\n"); 

The preceding code segment generates the following output:

Accuracy after pipeline fitting: 0.7622000403307129 
MSE: 1.926235109206349E7 
MAE: 3338.3492063492063 
RMSE: 4388.8895055655585 
R Squared: -1.372326447615067 
Explained Variance: 1.1144695981899707E7 
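
If you want the plain classification accuracy rather than the area under the ROC curve reported above, you can use a MulticlassClassificationEvaluator; a sketch (not part of the original pipeline):

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator; 

// Fraction of test rows whose predicted label matches the true label. 
MulticlassClassificationEvaluator accEvaluator = new MulticlassClassificationEvaluator() 
    .setLabelCol("label") 
    .setPredictionCol("prediction") 
    .setMetricName("accuracy"); 
System.out.println("Classification accuracy: " + accEvaluator.evaluate(predictions)); 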

Performance tuning and suggestions

If you look at the performance metrics in Step 11, it is obvious that the credit risk predictions are not yet satisfactory: the evaluator score is only about 0.76 (keep in mind that BinaryClassificationEvaluator reports the area under the ROC curve by default, not plain accuracy). Since lending decisions in the financial sector are sensitive, higher predictive performance is no doubt desirable.

Now, if you want to increase the prediction performance, you should try training your model with classifiers other than the Random Forest, for example, a Logistic Regression or Naive Bayes classifier.
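
For instance, a Logistic Regression classifier can be trained on the same features and scored with the same binary evaluator; a minimal sketch (the setMaxIter value is illustrative):

import org.apache.spark.ml.classification.LogisticRegression; 

// Reuse the training/test splits and the binary evaluator from the pipeline above. 
LogisticRegression lr = new LogisticRegression() 
    .setLabelCol("label") 
    .setFeaturesCol("features") 
    .setMaxIter(100); 
Dataset<Row> lrPredictions = lr.fit(trainingData).transform(testData); 
System.out.println("Logistic Regression areaUnderROC: " + evaluator.evaluate(lrPredictions)); 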

Moreover, you can try an SVM-based classifier or a neural network-based Multilayer Perceptron classifier. In Chapter 7, Tuning Machine Learning Models, we will look at how to tune the hyperparameters in order to select the best model.
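
As a preview of that chapter, the following sketch searches a small grid of Random Forest hyperparameters with 3-fold cross-validation and keeps the best model (the grid values are illustrative, not tuned recommendations):

import org.apache.spark.ml.param.ParamMap; 
import org.apache.spark.ml.tuning.CrossValidator; 
import org.apache.spark.ml.tuning.ParamGridBuilder; 

// Try a few tree depths and forest sizes; the evaluator (areaUnderROC) picks the winner. 
ParamMap[] paramGrid = new ParamGridBuilder() 
    .addGrid(classifier.maxDepth(), new int[] { 3, 5, 10 }) 
    .addGrid(classifier.numTrees(), new int[] { 20, 50 }) 
    .build(); 
CrossValidator cv = new CrossValidator() 
    .setEstimator(classifier) 
    .setEvaluator(evaluator) 
    .setEstimatorParamMaps(paramGrid) 
    .setNumFolds(3); 
RandomForestClassificationModel bestModel = 
    (RandomForestClassificationModel) cv.fit(trainingData).bestModel(); 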
