We can define optimization as the act, process, or method employed to find the "best" element from a set of alternatives, based on one or more objectives (such as maximizing yield), while satisfying all the constraints. Within the context of a business problem, this "best" element can be far-reaching: a system, a decision, or a technical design. "Problems" can include, but are not limited to:
For example, to optimize capital across portfolios, the allocation of credit risk mitigants to credit risk exposures under different regulatory regimes can be performed with a specific objective in mind: the allocation is optimized so that the firm reduces its regulatory capital requirements within the applicable regulatory framework. The regulatory requirements act as the constraints of the optimization problem.
In mathematical terms, optimization refers to minimizing or maximizing a value function subject to constraints, where both the function and the constraints can be linear or nonlinear. Mathematical optimization is integral to, and extensively used in, AI and machine learning (e.g., minimizing a "loss function" subject to constraints). Here, an optimization routine improves the accuracy of the algorithm on the provided training data by reducing the error in its predictions.1
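To make the idea of "reducing the error in predictions" concrete, here is a minimal sketch (not from the chapter, using illustrative data) of gradient descent minimizing a mean-squared-error loss to fit a one-parameter linear model:

```python
# Minimal sketch: fit y ~ w * x by gradient descent on the loss
# L(w) = mean((w*x - y)^2). Data and learning rate are illustrative.

def fit_slope(xs, ys, lr=0.01, steps=2000):
    """Minimize L(w) by repeatedly stepping against its gradient."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # dL/dw = (2/n) * sum((w*x - y) * x)
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad  # step in the direction that reduces the loss
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]   # roughly y = 2x
w = fit_slope(xs, ys)       # converges to the least-squares slope (~1.99)
```

Each iteration reduces the prediction error, which is exactly the role optimization plays inside model training.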
Mathematicians have developed optimization algorithms for many decades. For the modern-day risk practitioner, selecting the best algorithm and specification can be a daunting task. The choice of algorithm impacts the processing time, efficiency, and accuracy of the outcome. But before we describe the use of optimization for machine learning, it is best to describe what is meant by “optimizing a value function that is subject to constraints” as mentioned earlier.
Taking a practical risk management example, a commercial bank may want to minimize the variance of returns (the objective function) while keeping its loss rates within risk appetite over t years (the constraints), and tune the AI or machine learning model accordingly.
Even though the range of optimization applications in AI and machine learning is wide, there are two main classes of optimization problems, depending on whether the loss function is convex or nonconvex.2
Convex optimization has only one "best" solution: a single local optimum exists, and it is also the global optimum. This minimum can be found using a variety of well-tested and validated methods. The search area for convex optimization (Figure 8.1) itself takes on a convex shape, so that following the negative gradient leads to the minimum.
Nonconvex optimization has multiple locally optimal solutions. Historically, most optimization problems in a machine learning context were considered convex; in recent years, however, with the rise of neural networks and deep learning, the use of nonconvex optimization methods has grown. These algorithms operate in a nonconvex landscape (Figure 8.2), where there is more than one local minimum; the goal is to find good local minima efficiently, because proving that a given minimum is global is generally impossible in this setting.
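The difficulty with a nonconvex landscape can be seen in a small sketch (illustrative, not from the chapter): gradient descent on f(x) = x⁴ − 3x² + x lands in different local minima depending on where it starts, so a common remedy is to restart from several points and keep the best result:

```python
# Illustrative nonconvex function with two local minima
# (near x = -1.30, the global one, and x = 1.13, a worse one).

def f(x):
    return x**4 - 3 * x**2 + x

def grad_descent(x, lr=0.01, steps=5000):
    for _ in range(steps):
        x -= lr * (4 * x**3 - 6 * x + 1)  # f'(x)
    return x

# Restart from several starting points; each converges to *some*
# local minimum, and we keep whichever has the lowest f-value.
starts = [-2.0, -0.5, 0.5, 2.0]
minima = [grad_descent(x0) for x0 in starts]
best = min(minima, key=f)   # close to the global minimum near -1.30
```

Starting points on the right of the hump converge to the worse minimum near 1.13; only restarts on the left find the global one, which is why single-start gradient descent is unreliable on nonconvex problems.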
Of course, there are also times when the objective function is concave. These are simply the negative of a convex function where the same principles apply.
When optimizing machine learning models, the model's fit function is approximated by tuning its parameters so that the objective function is either minimized or maximized. In doing so, the input variables may be restricted; these restrictions are constraints. Constraints define the feasible space over which the objective function is optimized. A constraint that must hold exactly is an equality constraint, whereas a constraint that only bounds the variables from one side is an inequality constraint. Applying these concepts together for machine learning optimization, we obtain the standard constrained-optimization problem.
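In standard textbook form (a generic formulation, not specific to this chapter), the constrained problem can be written as:

```latex
\begin{aligned}
\min_{x} \quad & f(x) \\
\text{subject to} \quad & g_i(x) \le 0, \quad i = 1, \dots, m
  && \text{(inequality constraints)} \\
& h_j(x) = 0, \quad j = 1, \dots, p
  && \text{(equality constraints)}
\end{aligned}
```

Here f is the objective function, and the g's and h's together carve out the feasible space over which it is optimized.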
Of course, there are myriad ways to solve convex or nonconvex optimization problems, because the choice of solver depends on the application as well as the analytical method. The intention of this section is not to explain every algorithm available as a solver for convex and nonconvex optimization. By way of an introduction, we focus on machine learning function optimization involving the use of solvers, the tuning of hyperparameter values, and model-fit optimization using stochastic gradient descent.
Hyperparameters broadly fall into two categories: model training parameters that relate to the model architecture (e.g., the number of layers of a neural network and neurons in each layer) and solver parameters such as the learning rate or momentum of an algorithm. The choice of solver and the solver settings will significantly affect the training process.
In the next sections, we will discuss each of the solvers in more detail.
As noted earlier, many solvers exist to optimize an objective function, and the appropriate choice depends on whether the function is convex or nonconvex. In the next sections, we detail selected examples of solvers for each case.
When the objective function is convex, a solver needs to find the single global minimum. Some solvers also satisfy constraints while optimizing the objective function, typically by using either an interior-point method or an active-set method. Interior-point methods start in the interior of the feasible region and use barrier functions to enforce the constraints. Active-set methods instead try to find a solution by quickly guessing the set of active constraints.
If the active constraints are known, the inactive constraints may be discarded and a much simpler problem can be solved efficiently. For this reason, despite a higher worst-case run time (referred to as "worst-case complexity"), active-set methods can outperform their interior-point counterparts. They are often the method of choice when only variable bounds are present (i.e., there are no other general constraints). The method stops when a feasible point is found at which any attempt to further improve the objective would violate at least one constraint. The weights that achieve this balance are known as dual variables: they multiply the constraint gradients so that the objective gradient equals the scaled sum of the constraint gradients. A Lagrangian function is often used to describe this system mathematically, and hence dual variables are often called Lagrange multipliers.
When the problem is nonconvex, there are two options to try:
Mutation operators randomly vary the solution so that the search over the solution space avoids premature convergence at a local minimum.
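A minimal sketch (illustrative, not from the chapter) of this idea is a (1+1)-style evolutionary search: Gaussian mutation randomly perturbs the current best candidate, which lets the search jump out of the basin of a poor local minimum of the same nonconvex f(x) = x⁴ − 3x² + x used earlier:

```python
import random

def f(x):
    return x**4 - 3 * x**2 + x

random.seed(0)          # reproducible illustration
best_x = 1.13           # start near the *worse* of the two local minima
best_f = f(best_x)
for _ in range(5000):
    cand = best_x + random.gauss(0, 0.7)   # mutation: random perturbation
    if f(cand) < best_f:                   # keep only improving mutations
        best_x, best_f = cand, f(cand)
```

Pure gradient descent from this starting point would stay trapped near x = 1.13; the random mutations eventually produce a candidate in the deeper basin near x = -1.30, after which the search refines toward the global minimum.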
There are two broad classes of parameters that require tuning during AI and machine learning model development; they can be optimized during model training or externally to it. Hyperparameters are typically associated with either the model itself (e.g., the number of layers in a neural network and the number of neurons at each layer) or the solvers used for training (e.g., learning rate, momentum).
Settings for both the model training parameters and the solver parameters can significantly influence the accuracy of the models. Table 8.1 lists examples of parameter entities and hyperparameter attributes across both categories for a selection of algorithms.
Table 8.1 Key Algorithms and Their Associated Parameter Entities, Including Model Training Parameters and Solver Parameters of Hyperparameters.

| Algorithm | Parameter entities | Hyperparameter attributes |
|---|---|---|
| Linear regression | | The stepwise regression approach to use (Backward, Elasticnet, Forward, Forwardswap, LAR, LASSO, None, or Stepwise) and the cut-off values for adding/removing terms, such as the significance level for entry and the significance level for removal when the significance level is used as the select or stop criterion. |
| Logistic regression | Coefficients of the logistic regression equation: the intercept term (constant) and the betas (coefficients of the independent variables). The main estimation methods are maximum likelihood or stochastic gradient descent (using both a learning rate and a number of epochs). | The model selection method (Backward, Forward, LASSO, None, or Stepwise), the significance levels for entry/removal, the base regularization parameter for the LASSO methods, and the number of steps for the LASSO method. |
| Quantile regression | For each quantile level, a distinct set of regression coefficients: the intercept and the betas (coefficients of the independent variables). | None |
| Generalized linear model | Coefficients of the linear relationship between the transformed response (in terms of the link function) and the independent variables: the intercept term (constant) and the betas. Maximum likelihood estimation (MLE), rather than ordinary least squares (OLS), estimates the parameters; MLE relies on large-sample approximations. | The stepwise regression approach to use (Backward, Elasticnet, Forward, Forwardswap, LAR, LASSO, None, or Stepwise) and the cut-off values for adding/removing terms, such as the significance levels for entry and removal when the significance level is used as the select or stop criterion. |
| Decision trees | The parameters of decision trees such as classification and regression trees (CART) are those that change the tree structure from the inside. Recursive partitioning parameters that define the tree splits are internal to the algorithm and determine how the tree grows, such as cost-complexity and reduced-error methods. | |
| Random forests | Parameters include how the trees are split internally (the class/interval target criterion to apply), the maximum number of branches, and how missing values are used as attributes. | |
| Gradient boosting machines | Parameters are either tree-specific, boosting, or miscellaneous. | |
| Neural networks | | |
| Support vector machines | Three classes of parameters. | |
| Bayesian network | The parameters depend on the complexity of the network. This complexity is based on the probability distribution over the n variables, that is, the probability of every combination of states that represents the relationships between the variables. | |
Tuning hyperparameter values is a critical aspect of the model training process. Finding the ideal values for hyperparameters (tuning a model to a particular dataset) has traditionally been a manual, and therefore time-consuming, effort. For guidance in setting these values, risk modelers often rely on their experience with machine learning. However, even with expertise in machine learning algorithms and their hyperparameters, the best settings change significantly with different data, so it is difficult to prescribe hyperparameter values based on previous experience alone. The ability to explore alternative configurations in a more guided and automated manner reduces the need for manual effort. Below are common approaches used for automated hyperparameter tuning:
For 9 hyperparameters, each with the same number of levels (say, 3), the grid ends up with 3^9 = 19,683 combinations.
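This combinatorial growth is easy to demonstrate (a sketch with hypothetical hyperparameter names and levels):

```python
# Enumerate a full tuning grid: nine hypothetical hyperparameters,
# each with three candidate levels, yield 3**9 configurations.
import itertools

levels = {f"hyperparam_{i}": ["low", "mid", "high"] for i in range(9)}
grid = list(itertools.product(*levels.values()))
print(len(grid))   # 19683
```

Each element of `grid` is one configuration to train and evaluate, which is why exhaustive grid search quickly becomes impractical and more guided search strategies are attractive.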
Optimization algorithms exist for specific machine learning models, including those for logistic regression and neural networks. In this section, we list each of these algorithms.
Each of the following techniques is for the nonlinear optimization encountered with logistic regression algorithms, where repeated computation is needed for the optimization criterion, the gradient vector (first-order partial derivatives), and, in some techniques, the Hessian matrix (second-order partial derivatives).
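As a minimal sketch of this gradient/Hessian loop (illustrative data, and a single-parameter model for simplicity, not any specific production technique), here is a Newton-Raphson fit of a logistic regression p(x) = sigmoid(b·x):

```python
# Each Newton iteration recomputes the gradient (first derivative) and
# the Hessian (second derivative) of the log-likelihood in b.
import math

xs = [-2.0, -1.0, 1.0, 2.0]   # illustrative feature values
ys = [0, 1, 0, 1]             # illustrative binary outcomes

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

b = 0.0
for _ in range(25):
    ps = [sigmoid(b * x) for x in xs]
    grad = sum((y - p) * x for x, y, p in zip(xs, ys, ps))     # dLL/db
    hess = -sum(p * (1 - p) * x * x for x, p in zip(xs, ps))   # d2LL/db2
    b -= grad / hess   # Newton step (hess < 0 when maximizing LL)
```

Because Newton's method uses curvature (the Hessian), it typically converges in far fewer iterations than first-order methods, at the cost of computing and inverting second-order information at each step.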
We have discussed mathematical optimization and how it is used in algorithms. However, as explained at the beginning of the chapter, optimization refers to an act, process, or method employed to make "something" as fully functional or effective as possible for a given problem. That "something" can be far-reaching in risk management: a system, a decision, a technical design, and of course the machine learning algorithms themselves. What follows are examples of business applications in risk management that require optimization to maximize effectiveness.
Although the policy rules applied to consumer loan applications represent a latent loss function of the individual credit risk profile, they also reflect the risk appetite of an organization. This means that policy rules expand and contract as credit policy is tightened or loosened over time. They also contribute to, or act as, reason-code generators for approving or declining loans. However, policy rules are often applied subjectively, which slows down decision response times.
Furthermore, although policy rules are critical for lending decisions, there are five overlooked aspects of policy rules that create opportunities for optimization:
A decision science optimization tool that uses function-approximation machine learning to simulate what the final decision will be could:
The decision science optimization tool carefully considers the lending process flow of a retail loan application. Policy rules are often implemented into a decision engine, but some banks run credit models (i.e., scorecards) against applications first and then policy rules to ensure that credit declines are not overridden by policy rules.
Irrespective of the sequence of running a policy rule or credit model on the loan application, decisions made are often a blend of models and policy rules, and these can be adjusted by manual assessors. Manual assessors can override automated decisions. In some cases, the override process prolongs the time to decision and can result in non-take-up of approved applications. Excessive amounts of policy overrides are often associated with too many policy rules and complexity.
The design principles shown in Figure 8.3 are explained below:
Importantly, the decision science optimization tool could allow both aspects of the originations decision process, namely the credit models and the policy rules, to be analyzed at the same time. The final decision is typically difficult to model because decisions on loan applications are driven by models, policy rules, and manual assessment interventions, and are thus both objective and subjective in nature.
The tool provides machine learning insights that, importantly, can be used to shorten decision times by issuing final approvals or rejections earlier in the lending process. Furthermore, the strength of the machine learning can be used to optimize revenue by identifying customer segments that are high net worth, have the ability to repay, and should not undergo unnecessarily long decisioning times.
Collateral refers to assets, guarantees, or other securities that a counterparty agrees to transfer to the seller and that equal or exceed the receivables and liabilities provided by the seller. Collateral mitigates the seller's risk that the counterparty does not complete its agreement, for example, if the counterparty defaults.5 Thus, collateral is the asset that the buyer transfers to the seller as security.6 Below are definitions to help further explain the terms:
Managing collateral requires multiple decisions that involve complexity, because the management depends on whether the bank is acting as the seller or counterparty (as is the case in the capital markets). At a general level, these decisions include the following:
When only a few available assets need to be distributed among a handful of exposures, collateral management is simple and intuitive and can easily be done manually.7 When the available assets, exposures, and counterparties grow beyond a few, it is faster to optimize collateral allocation using algorithmic approaches, including machine learning. The optimization exercise is then to find the allocation of collateral that results in the lowest cost to the bank while still meeting the agreements made between the seller and the counterparty and the margin call requirements of the assets.8 As an example, the objective functions of a set of models are listed below:
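At toy scale, the lowest-cost allocation can be found by brute force (a sketch with entirely hypothetical asset values, costs, and exposure requirements):

```python
# Brute-force collateral allocation: assign one asset per exposure so
# that each asset's value covers its exposure's requirement, and keep
# the feasible allocation with the lowest total cost to the bank.
import itertools

assets = {                      # asset: (collateral value, cost of posting)
    "cash":   (100, 5.0),
    "bond_a": (120, 3.0),
    "bond_b": (80,  1.0),
}
exposures = {"swap": 90, "repo": 70}   # required collateral value

best = None
for combo in itertools.permutations(assets, len(exposures)):
    # feasibility: each assigned asset's value must cover its exposure
    if all(assets[a][0] >= req for a, req in zip(combo, exposures.values())):
        cost = sum(assets[a][1] for a in combo)
        if best is None or cost < best[0]:
            best = (cost, dict(zip(exposures, combo)))
```

Here the cheapest feasible allocation reserves the inexpensive `bond_b` for the smaller exposure. Real portfolios have far too many asset-exposure combinations for enumeration, which is exactly where the algorithmic and machine learning approaches described above come in.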
The use of AI, machine learning, and advanced analytics for optimization is a rapidly evolving field. In this chapter, machine learning algorithmic optimization was explained, involving the use of solvers, the automated tuning of hyperparameters, and model training optimization using stochastic gradient descent. In addition, optimization algorithms with embedded machine learning are powerful tools for solving risk-specific business problems. Keep in mind that, although powerful, optimization is mathematically complex and may involve intensive compute resources. The good news is that there are automated ways to apply optimization using machine learning that absorb some of this complexity and help streamline decision-making, so that risk departments can focus on the business value that optimization can generate.