Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

33
Evaluating Optimizers

33.1 Introduction

An optimizer is a procedure for determining the optimum. A classic statement is to determine the value of the decision variable to maximize desirables and minimize undesirables:

(33.1)

The optimization algorithms presented in prior chapters included the analytical method, Newton’s, Golden Section, heuristic direct, Hooke–Jeeves, leapfrogging, etc. The question here is not “What is the result when an optimizer is used?” but “Which is the best optimizer?” Here the DV is not a variable in an engineering model, but the choice of the optimizer, a class variable. The statement is

(33.2)

In order to evaluate an optimizer, we must define the desirables and undesirables, define metrics to assess them, and define a way to either combine them in a single OF or to alternately use a Pareto analysis.

As well as class variables for the DVs, each optimizer has coefficient values (such as the number of players, an acceleration/temper factor, initial step sizes, or thresholds for logic switching or convergence). These may be continuum valued or discrete. Further, there are other diverse category choices and other class variables, such as forward or central difference, surrogate model type, and local exploration pattern. Accordingly, the DV list should include the other optimizer choices. A more comprehensive statement is

(33.3)

As with any optimization application, begin by qualitatively considering the issues.

33.2 Challenges to Optimizers

Classic challenges to optimizers are objective functions that have the following:

Non‐quadratic behavior
Multiple optima
Stochastic responses
Asymptotic approach to optima at infinity
Hard inequality constraints or infeasible regions
Slope discontinuities (sharp valleys)
A gently sagging channel (effectively slope discontinuities)
Level discontinuities (cliffs)
Flat spots
Nearly flat spots
Very thin global optimum in a large surface, pin‐hole optima, improbable to find
Discrete, integer, or class DVs mixed with continuous variables
Underspecified problems with infinite number of equal solutions
Discontinuous response to seemingly continuous DVs because of discretization in a numerical integration
Sensitivity of DV or OF solution to givens

33.3 Stakeholders

Stakeholders, the people who might claim to have a stake in the outcome of the optimization or choice of the optimizer, include the following:

Software user—wants speed of solutions, no execution errors, no decisions/choices related to parameters, robust to surface features
Software programmer—efficient computation, low storage, scalability, few conditionals, and simple logic
Business manager—low number of function evaluations, utility of solution not compromised by complexity or propagation of uncertainty

33.4 Metrics of Optimizer Performance

There are many metrics that can be used to assess optimizer (and coefficient choice) performance. Start out by listing what you think are the OF and DV and constraint considerations, then let this trigger who might be stakeholders, and then look at it from their perspective. We aim to have the following:

Efficiency 1—A minimum number of function evaluations (NOFE). Each function evaluation represents either computer work or experimental cost. Be sure to include either numerical or analytical evaluations of gradients and Hessian elements in the NOFE. These derivative evaluations might be a part of the optimizer, or they could be within a convergence criterion. Count them all.
Efficiency 2—Minimum computer time. Nominally, the OF evaluation is the most time‐consuming aspect, but where the computer must sort through multiple players, invert matrices, or apply convoluted logic, the optimizer logic may take more time than the OF evaluation. When you accumulate optimizer time, be sure to discriminate between the several‐time consumers such as display I/O (especially the choices for your tracking convenience), the function evaluation duration, convergence testing, and the optimizer. Also, be aware that other processes that a computer might be running (such as a virus scan) may reduce space or time that the computer might use for the optimization. In any comparison use equivalent languages (don’t compare an interpreter to a compiled .exe program) and equivalent computing devices (don’t compare a laptop to a main frame).
Efficiency 3—User required level of understanding. How easy is it for the user to understand the optimizer logic and computations? How easy to set up the optimizer and choose optimizer parameter values? How easy is it for the user to establish confidence that the optimizer result is in desired proximity of the global? User ability to adapt the code? Algorithm complexity? This could be measured by the number of choices a user must make or number of lines of code related to the issues.
Efficiency 4—Consider storage requirements needed to compute or invert the algorithm.
Efficiency 5—Consider the complexity of the code, the number of conditionals, the number of alternate treatments or methodologies, and the number of switches in logic.
Interpretability 1—How well will the human understand and interpret the results? Considering the CDF of the OF* or the range of the DV* values, how will a user interpret the probability of finding the global?
Interpretability 2—How will a user make an actionable decision based on the optimization results with respect to uncertainty in the “givens,” the model equations, and the probability of the optimizer results?
Probability of finding the global optima—When there are multiple optima, traps, diversions to a constraint, etc., the likelihood of any particular run of an optimizer in finding the global is important. You many need 1000, or more, runs from random initializations to be able to assess the probability of an optimizer to find the global. Probability could be measured by the fraction of trials that the optimizer identifies the global (converges in the vicinity of the global).
Accuracy—This is the closeness of the optimizer to the true optimum. It is often termed systematic error or bias. The optimizer does not stop exactly at the optimum, but stops in the proximity of it based on the convergence criterion. If the best of N trials is taken as the answer, what is the average deviation from true? If the distribution of converged solutions about the optimum is not symmetric, then an average DV* value will have a bias. Does convergence choice, initialization range, constraint accommodation penalty methods, single/double precision, discretization, etc., affect accuracy?
Precision—This is repeatability. The optimizers are numerical procedures, which have finite convergence stopping criteria. They will not stop exactly at the true optimum. Closeness to the optimum can be assessed either by the OF value or by the DV value deviations from the true optimum. We talk about finding the global optimum, but the reality is that the optimizer finds the proximity of an optimum, not the exact point. Precision could be measured by the rms (root‐mean‐square) deviation (either DV from DV*, or OF from OF*) from those trials that located the global. Precision also is dependent on convergence choice, initialization range, constraint accommodation penalty methods, single/double precision, discretization, etc. Accordingly, to compare optimizers, the impact of these other choices needs to be removed.
Robustness—This is a measure of the optimizer ability to cope with surface aberrations (cliffs, flat spots, slope discontinuities, hard constraints, stochastic OF values, a trial solution that is infeasible or cannot return an OF value). It also includes the optimizer ability to generate a next feasible trial solution regardless of the surface features without execution faults. In general, you want an optimizer to perform well (not necessarily perfectly) on a wide variety of applications with all reasonable DV initializations and optimizer parameter choices (number of players, convergence threshold, acceleration factor, etc.). You would like the procedure to be fully autonomous, not requiring human intervention to solve problems. You want the results to be relatively insensitive to human choices of rules or coefficient values. Perhaps a measure of robustness could be the fraction of times the optimizer can generate a feasible next trial solution. Or the fraction of applications that it can solve. Perhaps robustness should be categorized to application features (discrete, class, discontinuity).
Scalability—As the number of DVs increases, how does the computational time or storage increase? How does the NOFE increase? How do the requirements on the user (such as the number of user‐chosen coefficient values that need to be specified for either the optimizer or the convergence criteria) increase? The burden might be acceptable for low‐order applications but excessive for higher dimension ones. Do the diverse aspects rise linearly with DV dimension, as a quadratic, or exponentially? Is the result independent of user‐selected variable dimensions or scaling?

33.5 Designing an Experimental Test

Since any of these metrics will depend on the initial trial solution(s) and the surface features of a specific objective function, you will need to run many trials and calculate an average value representing individual functions. However, replicate trials from random initializations will not produce exact duplicate results. For simple problems, perhaps 20 trials are fully adequate to have relatively certain values of the statistics. However, you may need 100–10,000 trials to determine representative values for probability statistics associated with other attributes. You should keep running trials until the standard deviation of the statistic of interest is small enough to confidently differentiate between statistic values representing different experimental conditions (optimizers, coefficients, convergence criteria).

Statistical comparisons need to be made on replicate results, such as a t‐test of differences in NOFE. The central limit theorem reveals that the standard error of the average is inversely proportional to the square root of the number of replicate trials. The uncertainty of the average reduces with . The uncertainty on experimental probability is dependent on both N and the probability.

Further, one optimizer (Newton’s, successive quadratic) might jump to the solution in one step if the function is a simple quadratic response, but it might become hopelessly lost on a different function. How do you compare optimizers? You need to choose a set of test functions that represent a diversity of features that your applications will present.

If one is assessing number of function evaluations (or iterations) and the precision of a solution, then it becomes understood that the choice of convergence criterion and the threshold for convergence also affect the results. “The optimizer” involves more choices than just the optimization algorithm choice.

With this understanding, the DV section of the optimization statement might be understood as a 3‐D choice. For each optimizer, the first column in the DV list of Equation (33.4), there are a variety of convergence criteria that can be selected (second column); and for each criterion, there is a numerical threshold (third column):

(33.4)

But there are also the initialization factors such as number of trials of randomized initializations and coefficient values (or initial step sizes, or initial range, or number of players):

(33.5)

And this needs to be considered on a full spectrum of applications:

(33.6)

It can become complicated and confusing.

Now one must devise the test basis:

The choice of test functions needs to span the features that are relevant. One function is not a comprehensive means of testing. Choose a number of functions that include all of the challenges that might be represented by the application.
The number of trials from randomized starts needs to be high enough to eliminate statistical vagaries from misdirecting the answer. Use the t‐statistic to determine the number of trials, N, needed to have adequate precision or discrimination between values.
The initialization choices need to be representative of the application—initialize in the area of the global optimum or in local regions off optimum, or throughout a global DV range.
The criterion for convergence and the threshold values need to be fair. In cases that cannot use the same criterion (such as a multiplayer in comparison to a single‐trial solution procedure), the testing needs to choose values that provide an equivalence. For example, choose the criterion thresholds to generate equivalent precision of the DV* values, and then evaluate NOFE.
Use a t‐test to compare averages.
Use a t‐test to compare probabilities.
Combine metrics for efficiency. For instance, use PNOFE instead of separately p(Global) and ANOFE. . Use a t‐test to compare PNOFE values.
Assess precision by the rms deviation of all solutions at the global best, from the best of N. This could be based on the DV* values or the OF* values. Use an F‐test to compare rms values.
Design the test for equivalence in some performance aspect so that the others can be easily compared. For instance, adjust the convergence threshold so that two optimizers have the same precision, and then compare PNOFE values at equivalent precision.
Use EC factors when combining metrics, additively. Defend the EC values within the application context.
Create an equivalent basis w.r.t. DV* or OF* precision. Don’t choose an excessive precision, but make it compatible with the application.
Use a Pareto optimal selection when not combining competing objectives into one expression.
Rank optimizer by number of functions (and/or performance criteria) on which it is best (3 points) or second (2 points) or third (1 point).

Example 1

As a simple comparison example, I have compared a variety of optimizers on one function (Function 89 in the companion site—enzyme reaction) on a limited number of assessment criteria (ANOFE and p‐Global) with the convergence criterion threshold chosen so that precision was equivalent (a standard deviation of global OF values of about 1E − 6). The labels on the data in Figure 33.1 are the optimizer names, and the average number of function evaluations is plotted w.r.t. the probability of the optimizer finding the global. This is a Pareto‐type analysis. In each axis, better is toward the origin.

Figure 33.1 Pareto comparison of two contrasting performance criteria on diverse optimizers.

In this analysis PS (particle swarm) and LM (Levenberg Marquardt) are dominated by LF, HJ, CHD, and CSLS. Each of those four is better in both of the assessment criteria. ISD (Incremental Steepest Descent) is dominated by HJ and CHD, while CSLS (Cauchy Sequential Line Search) is dominated by LF and HJ. The non‐dominated algorithms in this application are LF (Leapfrogging), HJ (Hooke–Jeeves), and CHD (Cyclic Heuristic Direct). Although LF was worse than HJ for ANOFE, it was better in p(global). This, of course, only represents two of many criteria for assessing optimizer desirables, and only tests optimization on one function representing one set of issues (nonlinear with slight discontinuities due to time discretization), under one convergence precision outcome.

Example 2

Compare several optimizers with a variety of 2‐DV test functions using PNOFE (the probable number of function evaluations needed to be 99% confident that the global will be found in at least 1 of N trials) at equivalent precision on OF* values. In each of the applications, convergence will be based on the DV.

In LF and PS (with an end‐game perching, contraction feature), the convergence is on the 20‐player cluster size. With HJ, the convergence criterion is based on the size of the exploration range. In LM the convergence is based on iteration‐to‐iteration DV changes. As a result, even if the convergence threshold is identical, the precision will be different. So, each test is based on an OF* precision based on rms deviation from the best of all solutions of .

Table 33.1 reveals the results of N = 100 trials on each of 11 different functions. The PNOFE values are calculated from the ANOFE of the 100 trials and the probability of that optimizer finding the global minimum. The PNOFE data is a stochastic response. Another set of 100 optimizer trials from randomized initializations will not exactly replicate the same patterns. The probability of finding the global is stochastic. (Flip a coin 100 times, and you expect 50 trials to be heads. But, obtaining only 47 of the 100 is not unexpected.) Additionally, the number of function evaluations to converge will be a stochastic response to the vagaries of initialization. Accordingly, the PNOFE values in the table have about a 10% coefficient of variation (the rms value, or standard deviation, is about 10% of the PNOFE value).

Table 33.1 PNOFE data for the optimizer comparison.

Function	LF	HJ	LM	PS
Three Springs (#12)	460	410	3.7k	2.1k
Fluid Bed Dryer (#90 in the companion site)	400	260	1.9k	2.5k
Peaks (#30)	1.0k	3.3k	5.8k	2.5k
Mountain Path (#29)	1.6k	13.7k	38.5k	10.4k
Parallel Pumps (#41)	1.2k	11.1k	45.0k	8.1k
Shortest Time (#47)	160	230	220	2.3k
Hose Winder (#46)	1.1k	2.6k	48.0k	2.3k
Classroom Lecture Pattern (#73)	550	3.1k	—x—	1.7k
Rank Model (#84)	490	14.7k	—x—	2.4k
Goddard Problem (#79)	890	4.2k	—x—	2.2k
Reliability (#56)	2.6k	—x—	—x—	8.4k

The numbers in bold indicate statistical winners, based on a t‐test and 95% confidence. The —x— entries indicate optimizers that either encountered execution errors or converged at any initialization and failed to find an optimum.

Using points to score the first place (3 points), second place (2 points), and third place (1 point) optimizers, the scores are LF = 31, PSO = 19, HJ = 16, and LM = 3.

33.6 Takeaway

Design tests to compare optimizers to be a comprehensive and complete assessment of issues that matter to the application situation and its stakeholders.

33.7 Exercises

Figure 33.2 reveals the CDF(OF) of 50 trials of two different optimizers on the same function. Which optimizer is best? Explain your choice.

Figure 33.2 Illustration for Exercise 1.
State the number of function evaluations per iteration for each of these three optimizers. Explain your answer. Define “iteration” as one complete pass through all stages of the optimization algorithm on a 2‐DV problem. (i) Cauchy Successive Line Search. (ii) Newton (Newton–Raphson), (iii) Hooke–Jeeves.
If the model is linear in coefficients, for example, the temperature dependence of specific heat is traditionally modeled as , and then linear regression can be used to determine model coefficient values. First, measure c_p at a variety of temperatures to generate the data. Then set the derivative of the SSD with respect to each model coefficient to zero, and you have the four normal equations. Linear algebra will provide the exact values of the coefficients for that particular experimental data set. Alternately, if you choose a numerical optimization procedure (Newton–Raphson, ISD, LM, HJ, LF, etc.), it will approach the optimum DV values and stop in the proximity of the true DV* values when the convergence criterion is satisfied. Which method is better? Briefly state three (or more) issues.
Here are four algorithms: linear programming, dynamic programming, leapfrogging, and Levenberg–Marquardt. For each, state what aspect the other three have in common that differentiates the one from the others.
With a brief discussion (and possibly supporting illustration), explain PNOFE.
If you do not know either DV* or OF*, explain how could you calculate precision (closeness to the DV* value) for an optimizer.
Explore the desirability of several 2‐D optimizers. Use the file “2‐D Optimization Examples.” Explain how you will determine values of metrics that would assess such desirable attributes and also how you will design optimization experiments to reveal and obtain those values. Select functions with diverse attributes. Explore optimization algorithms that are most commonly used or have widest applicability. You should end up with an opinion about what optimization approach is best. To defend your opinion, you should present a table that summarizes desirability of the optimizers, within the uncertainty that your experimental metrics. Prior to the table, you should explain your experimental results. Prior to revealing experimental results, you need to explain your experimental plan and method of interpreting data. Finally, but this is the first part of your presentation, you need to explain how you will be assessing the attributes of desirability. I think all of this can be presented in five, or so, pages.
Compare your favorite of the gradient‐based searches with your favorite of the direct search procedures. Run them a sufficient number of times on several of the simple and several of the difficult surfaces. Organize the assessment data in a table. Pick a best and defend your choice. The performance differences are not due to unknowable mystery, they are directly the result of the algorithm. Use your knowledge of the algorithms to explain why the results are as you obtained.
With your own selection of functions, optimizers, precision, etc., acquire assessment data to evaluate the optimizers. Justify your choices in the experimental design and evaluation procedures. Explain and defend your conclusions.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 33 Evaluating Optimizers

Create new playlist

Sign In

Sign Up