24
Stochastic Objective Functions

24.1 Introduction

A deterministic function returns the exact same value each time it is given the same inputs. What is the value of 3 times 4 (in base 10 arithmetic)? Whether last year or next year, whether in Barbados or on the Moon, whether by human or by computer, the answer is 12. By contrast a stochastic function does not return the same value when given supposedly identical conditions. If I roll this cubical die, what value will appear? Even if I do it, in one setting in the same location, trying my best to be repeatable, the answer is a 1 or a 2 or a … or a 6. This stochastic function returns one of 6 values with an equal probability. It does not consistently return the same value.
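A minimal illustration in code (the function names are just for this example): the deterministic function returns the same value for the same inputs every time, while the stochastic one does not.

    import random

    def deterministic_product(a, b):
        return a * b                    # 3 * 4 is 12, every time, everywhere

    def roll_die():
        return random.randint(1, 6)     # returns 1-6 with equal probability

    print(deterministic_product(3, 4))  # always 12
    print(roll_die(), roll_die())       # two calls, likely different values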

How does this relate to optimization? Following are several example situations:

Returning to the introductory discussion, stochastic OFs confound an optimizer's search for the best DV value. There are several issues: (i) Near the minimum the stochastic variability might be larger than the DV impact, and the superficial trend might make the optimizer move the DV in the wrong direction. Additionally, (ii) optimizers that seek to use gradient information to direct a search may be wholly confounded by the noise impact on the values. Further, (iii) optimizers that seek a minimum (or maximum) may follow a chance path of fortuitous best points into a DV region that is generally undesired. Finally, (iv) the stopping criteria must not demand a tolerance or precision tighter than the replicate variability of the OF.

To include uncertainty in the OF, first develop the nominal deterministic model. Next, identify all of the “givens” that might have uncertain values. Then, estimate the uncertain range and the distribution pdf for each of those values (see Chapters 25 and 26 for some guidance). Finally, at each call of the function, randomly sample each of the givens to obtain a realization, a particular possible value; and use this value to calculate the OF value. This is method 3 in Section 20.8.2. Also consider methods 4 and 5 to autocorrelate or cross‐correlate perturbations.
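A minimal sketch of this recipe, with an assumed one-DV model and two uncertain "givens" (the names, nominal values, and the 5% Gaussian relative uncertainty are illustrative assumptions, not the book's code):

    import random

    # Nominal deterministic model: OF depends on the DV x and on the givens a and b.
    def deterministic_of(x, a, b):
        return a * (x - 3.0) ** 2 + b

    A_NOMINAL, B_NOMINAL = 2.0, 5.0
    REL_SIGMA = 0.05                    # assumed 5% relative standard deviation

    def stochastic_of(x):
        # Each call draws one realization of the uncertain givens, so repeated
        # calls with the same DV value return different OF values.
        a = random.gauss(A_NOMINAL, REL_SIGMA * A_NOMINAL)
        b = random.gauss(B_NOMINAL, REL_SIGMA * B_NOMINAL)
        return deterministic_of(x, a, b)

    print(stochastic_of(4.0), stochastic_of(4.0))   # two different realizations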

In general, stochastic responses confound an optimizer, and the questions are “How can an optimizer make rational decisions on stochastic inputs?” and “What should be used as the criterion for convergence?” One answer is to “Resample the best, and use the worst of the samplings to characterize the OF” and “Claim convergence when there is no statistical improvement of the OF w.r.t. iteration.”

Now that computers are fast enough and have become a common personal tool for engineers and analysts, it appears that stochastic simulations (Monte Carlo analysis) are gaining traction in the engineering and business community. However, classic optimization techniques were developed for deterministic objective functions, and they get confounded by stochastic response. Accordingly, the human is often left to intuitively guide the DV evolution to optimize the stochastic function. So, procedures that can handle stochastic responses will permit automation of the optimization. This chapter presents one approach that I have found effective and practicable.

24.2 Method Summary for Optimizing Stochastic Functions

This section is based on the publication by Rhinehart, R. R., “Convergence Criterion in Optimization of Stochastic Processes,” Computers & Chemical Engineering, Vol. 68, 2014, pp. 1–6:

  1. Use a multiplayer, direct‐search optimizer, such as leapfrogging (LF), to search for the optimum.
  2. Identify the best player, and replicate the evaluation of its trial solution. Take the worst of the replicate evaluations as its representative value.
  3. If the replicated values reveal it is no longer the best player, use the remaining best player (formerly second best) to guide the search.
  4. Use a steady‐state identifier on the iteration sequence of the OF value of the worst player to determine convergence.

24.2.1 Step 1: Replicate the Apparent Best Player

Leapfrogging (LF) is a multiplayer, direct‐search approach. Trial solutions (players) are randomly initialized throughout decision variable (DV) space. One player will have the worst objective function (OF) value, while another, the best. The worst player leaps over the best and “lands” at a random spot in the reflected DV space on the other side of the best. If the new leap‐over position is infeasible, then the same player leaps from the infeasible spot back over the best player and into a new random location within the new reflected DV space. Normally, for deterministic functions, whether the leap‐to location is better or worse, when the new location is feasible, the move is complete. Normally, an iteration is defined as a number of leap‐overs equal to the DV dimension, N. This means that if leap‐to positions are feasible, there are N function evaluations per iteration.

However, that definition of an iteration is arbitrary. An iteration in the CHD method has one function evaluation per dimension, but HJ has 1 + 1.5N function evaluations per iteration, ISD (with central difference) has 1 + 2N, and CSLS and SQ and Newton types have many more.

If a leap‐to position is up on a cliff, or otherwise worse than the leap‐from position, then that player will be the next to leap. As will be seen, if it jumps on and off a cliff, this will make the local variation seem greater than the stochastic properties at the local optimum and confound the steady‐state identification. Accordingly, for stochastic functions, if the leap‐to position is either worse or infeasible, then immediately leap it back. If it jumped up on a cliff, then the leap back will be an improvement. Call an iteration when there are N leap‐overs with improvement.
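A sketch of one leap-over with these leap-back rules follows. Minimization is assumed, and the box-constraint feasibility test on the scaled 0-10 DV range is an illustrative stand-in for whatever constraints actually apply.

    import random

    def feasible(dv, lo=0.0, hi=10.0):
        # Illustrative box constraints on the scaled 0-10 DV range.
        return all(lo <= v <= hi for v in dv)

    def leap(from_dv, best_dv):
        # Land at a random spot in the window reflected to the other side of the best.
        return [b + random.random() * (b - f) for f, b in zip(from_dv, best_dv)]

    def leap_over(worst_dv, worst_of, best_dv, of):
        # One leap-over of the worst player over the best, for a stochastic OF.
        new_dv = leap(worst_dv, best_dv)
        while not feasible(new_dv):
            # Infeasible: leap back over the best from the infeasible spot.
            new_dv = leap(new_dv, best_dv)
        new_of = of(new_dv)
        if new_of > worst_of:
            # Worse than the leap-from value (e.g., it jumped up on a cliff):
            # immediately leap it back so the spike does not confound SS detection.
            new_dv = leap(new_dv, best_dv)
            while not feasible(new_dv):
                new_dv = leap(new_dv, best_dv)
            new_of = of(new_dv)
        return new_dv, new_of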

In LF, the player with the worst OF value leaps over the player with the best, but the best player might have had a fortuitous realization. Rather than truly being best, it might be a phantom appearance. One would not want that to lead the optimization to move more players into an undesired DV region. To accommodate the stochastic OF attribute, here, the best player is reevaluated at each leap‐over, and the worst of replicate values is taken as its representative OF value.

How many replicates? If you want to be c confident that the number of replicates will have found one of the worst fraction, f, of OF realization values, then

(24.4)  n = ln(1 - c)/ln(1 - f)

For example, if you desire to be 95% confident that there are enough replicates to find 1 of the 75 percentile worst values (1 in the 25% worst fraction), then

(24.5)  n = ln(1 - 0.95)/ln(1 - 0.25) ≈ 10.4

This would indicate roughly 10 samples in total: the current OF value and 9 additional ones. However, if the best remains the best, 9 additional samples at the next leap‐over would represent a total of 19, and 9 more on the next leap‐over would represent 28. The worst of 28 is roughly the 90th percentile worst. I find that 1–3 replicates per leap‐over is adequate. If the player is a phantom best, then after a few leap‐overs, it will have enough replicates to acquire a value representing a local worst outcome.

The greater the number of replicates, the lower the chance that a converged DV* value will represent a bad solution masquerading as a good one. Choose c and f values to represent the situation. If it is a normal business economic analysis, then a modest confidence and worst fraction (and hence a small number of replicates) could be appropriate; however, where there is less tolerance for risk, a higher confidence and a smaller worst fraction (and hence more replicates) are appropriate.
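Assuming the standard reasoning behind Equation (24.4), that each replicate independently has probability f of falling in the worst fraction, the required count can be computed as in this small sketch (the function name and the rounding up are my choices):

    import math

    def replicates_needed(c, f):
        # Smallest n with 1 - (1 - f)**n >= c, i.e. n >= ln(1 - c)/ln(1 - f).
        return math.ceil(math.log(1.0 - c) / math.log(1.0 - f))

    print(replicates_needed(0.95, 0.25))   # 11 when rounded up (the text uses about 10)
    print(replicates_needed(0.99, 0.10))   # 44: lower risk tolerance needs many more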

If reevaluation of what appears to have been the best player reveals that it is not the best, then the second best becomes the lead player and basis for the leap‐over, redirecting the search. This prevents end‐game convergence of the cluster in DV space and motivates the novel approach demonstrated here—stop when there is no improvement in the stochastic response with respect to iteration. Stop at steady state.

24.2.2 Step 2: Steady‐State Detection

Normally, in optimization of deterministic functions, the best trial solution will progressively improve with each iteration, asymptotically approaching its optimum. However, when replicating a stochastic function at the optimum, the value of the best player appears as a noisy value about a steady‐state average. As the optimizer moves toward the optimum on a stochastic surface, this noisy value will approach its noisy steady state (SS).

If one player is initialized in the vicinity of the global, then it may remain the best spot while the others are still gathering, and observing the OF response of the best player might claim SS prior to all players converging. However, if the worst OF is observed, then when it settles to SS, all players will be in the vicinity of the optimum. Accordingly, the method is to observe the OF of the worst player and claim SS when it settles to a noisy SS w.r.t. iteration.

There are many methods to determine steady state (SS) in a signal; Appendix D provides the one used here, a probable steady‐state detection algorithm chosen for simplicity and effectiveness. The method first calculates a filtered value, Xf, of the process measurements, and then the variance in the data is estimated by two methods: deviation d2 is the difference between a measurement and the filtered trend, while deviation d1 is the difference between sequential measurements.

If the process is at SS, Xf is almost in the middle of the data. Then the process variance, σ², estimated from d2 will ideally equal σ² estimated from d1, and the ratio of the two variance estimates, r = σ²(d2)/σ²(d1), will be approximately equal to unity, r ≈ 1. Alternately, if the process is in a transient state, then Xf lags behind the data trend, the variance as measured by d2 will be much larger than the variance as estimated by d1, and the ratio will be much greater than unity, r ≫ 1. Being a ratio of variances, the statistic is scaled by the inherent noise level in the data and is dimensionless. Critical values for the ratio to confidently trigger the claim that the process is at steady state, with independent and identically distributed variation (white noise), were developed. Filter coefficient values of 0.1 and an r‐critical value of 0.85 provide agreement with human perception to claim SS.
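Appendix D gives the exact update equations; the sketch below assumes the common exponentially weighted (first-order filter) form of this ratio statistic, with the 0.1 filter coefficient and 0.85 r-critical value cited above. Treat the class and attribute names as illustrative.

    class SteadyStateDetector:
        # Ratio-of-variances SS identifier for an iteration sequence of OF values.

        def __init__(self, lam=0.1, r_critical=0.85):
            self.lam = lam              # filter coefficient for Xf and both variance filters
            self.r_crit = r_critical    # claim SS when the ratio drops below this value
            self.xf = None              # filtered value Xf
            self.var_d2 = 0.0           # filtered variance from d2 = X - Xf(previous)
            self.var_d1 = 0.0           # filtered variance from d1 = X - X(previous)
            self.x_prev = None

        def update(self, x):
            # Returns True when the sequence appears to be at steady state.
            if self.xf is None:
                self.xf, self.x_prev = x, x
                return False
            lam = self.lam
            self.var_d2 = lam * (x - self.xf) ** 2 + (1 - lam) * self.var_d2
            self.xf = lam * x + (1 - lam) * self.xf
            self.var_d1 = lam * (x - self.x_prev) ** 2 + (1 - lam) * self.var_d1
            self.x_prev = x
            if self.var_d1 <= 0.0:
                return False
            r = (2.0 - lam) * self.var_d2 / self.var_d1   # ~1 at SS, >>1 in a transient
            return r < self.r_crit

In the optimizer, x would be the worst player's OF value after each iteration, and convergence would be claimed the first time update returns True.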

The method presumes no autocorrelation in the “time series” (iteration sequence) of the data. This is ensured by observing the worst player position after each iteration, not the best. The best player makes sequential moves toward the optimum or remains stationary for several iterations, creating autocorrelation in its sequential OF values. By contrast, the worst player represents a new, randomized sampling of the surface at each iteration. As the players converge on the optimum, the worst player moves from the high OF values to the low ones. With iteration, the worst OF value progressively drops until, at the minimum, its stationary value represents independent stochastic OF values from the proximity of the optimum.

24.3 What Value to Report?

Because the surface is stochastic, the converged solution is not unique. Multiple trials that find the vicinity of the global optimum will stop at random locations in that vicinity. The best of those results is not truly the best, but just the fortuitous outcome of the confluence of stochastic events realized at the last sampling. Accordingly, the best of N trials is not the DV* “truth,” but simply a representative value. Reporting only the best of N trials as the optimum will convey the appearance of a single point and high precision, which misrepresents the situation. An alternate could be to report the average of all DV* values that ended in the vicinity of the global. This provides a better location of DV*, but again a single‐point value does not convey the uncertainty or range of possible DV* values. The average reported with some measure of its uncertainty (e.g., DV* range or DV* standard deviation) would better represent the vicinity of equivalent solutions.

However, the average might not be a feasible solution. If the DV must be an integer, an average of 3.14 does not make sense. If the DV* solutions surround a convex infeasible region, then the average may fall within that region and violate the constraint.

Further, if one reports the average outcome of the trials that find the vicinity of the global, one first has to segregate trials with some indication of which subset qualifies to represent the global.

Accordingly, be sure to convey the criteria you use to segregate trials, the range of DV* values that represent the uncertainty associated with identifying the global, the number of replicates, and your understanding of how constraints might affect the interpretation of the representative value.
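A small illustration of this reporting advice follows; the trial results, the segregation rule (closeness to the per-dimension median solution), and the threshold are assumptions for the example.

    import statistics

    # Scaled DV* results from several independent optimization trials (illustrative data).
    dv_star_trials = [(7.9, 2.1), (8.2, 1.8), (8.0, 2.0), (2.5, 9.0), (8.1, 2.2)]

    # Segregate the trials judged to be in the vicinity of the global optimum.
    # Here the rule is closeness to the per-dimension median; report whatever rule you use.
    med = [statistics.median(v) for v in zip(*dv_star_trials)]
    near_global = [dv for dv in dv_star_trials
                   if all(abs(a - b) < 1.0 for a, b in zip(dv, med))]

    # Report the average with a measure of its uncertainty, not just the single best point.
    for i in range(len(med)):
        vals = [dv[i] for dv in near_global]
        print(f"DV{i + 1}: mean {statistics.mean(vals):.2f}, "
              f"range {min(vals):.2f} to {max(vals):.2f}, "
              f"stdev {statistics.pstdev(vals):.2f}")

As cautioned above, also check that the reported average is itself feasible.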

24.4 Application Examples

Several relatively simple examples illustrate the issues to reveal the applicability of the method for a variety of surface features. In each the DVs and the OF are scaled to a common 0–10 range. These are all 2‐DV applications; however, in my testing the method is effective in 1‐D or N‐D applications. First, there is a brief description of each application and then the results.

24.4.1 GMC Control of Hot and Cold Mixing

Function 36 in the 2‐D examples code presents this application. Hot and cold fluids are mixed in line, and the objective is to determine the controller outputs (ideally these are the control valve positions), o1 and o2, to produce the desired mixed temperature and total flow rate.

With no uncertainty on model values, the contour of the 2‐D search for o1 and o2 appears as Figure 24.1a. However, with a relative uncertainty on each model coefficient (Gaussian with a 5% standard deviation), in Figure 24.1b the contours of one realization are obviously irregular, and each realization will generate a different irregular set of contours. The corresponding 3‐D plot of OF w.r.t. DVs (Figure 24.1c) reveals one realization of the irregular surface, and each realization will distribute the spikes and wrinkles differently. In general, Figure 24.1c shows that the minimum is somewhere along a curved path from the front right to the back left.


Figure 24.1 (a) Contour plot of the deterministic background. (b) One realization of the stochastic contours. (c) The surface in 3‐D.

Source: Rhinehart (2014). Reproduced with permission of Elsevier.

24.4.2 MBC of Hot and Cold Mixing

This is similar to the aforementioned physical process; however, it is a single controlled variable application. The objective is to determine the optimum sequence of future control actions to make the temperature best match a linear reference trajectory to the set point. Control is subject to rate of change constraints.

24.4.3 Batch Reaction Management

This simulates an algae farm and is offered as Function 63 in the 2‐D Optimization Examples program. Algae is seeded in a wastewater pond, permitted to grow in response to sunlight and nutrients, and after a time the water is processed to harvest the algae. Profit depends on the mass of algae harvested. The longer the growth period, the more algae in the pond; but the mass asymptotically approaches a maximum as limits of nutrients make the algal death rate approach the growth rate. Waiting a long time to harvest 100% of what is possible reduces the number of batches that can be generated per year. It might be better to wait until the batch is only 80% developed and be able to harvest twice as many batches.

Additionally, the deeper the pond, the more volume there is to grow algae and, consequently, mass that can be harvested. But as the pond gets deeper, less sunlight and CO2 will diffuse to the lower levels. Too deep a pond leads to the cost of processing excess water with little extra product to be harvested.

The objective is to determine the batch growth time and pond depth to maximize annual profit. Figure 24.2 illustrates one realization of the surface. Contrasting Figure 24.1c, in which the smooth area is in the vicinity of the optimum, in Figure 24.2 the fluctuations are most severe in the region of the optimum.


Figure 24.2 Algae farm OF surface. One realization.

Source: Rhinehart (2014). Reproduced with permission of Elsevier.

24.4.4 Reservoir and Stochastic Boot Print

These two simulators represent a water reservoir design problem and are Functions 18 and 20 in the 2‐D Optimization Examples program. The objectives of a reservoir are to trap excessive rain water to prevent downstream floods, to release water downstream to compensate for upstream droughts, and to provide water for human recreational and security needs. The design objectives also include minimizing the cost of the dam and land. The taller it is, the more it costs to build; but with its greater capacity, the probability of flood or drought impact will be lower, and the recreational and security features will be better. The fuller it is kept, the less it can absorb floods, but the drought, security, or recreational performance will be better. On the other hand, if kept nearly empty, it can mitigate any flood, but cannot provide recreation or drought protection. The OF represents the reservoir cost plus the economic risk (probability of an event times the cost of the event). The lower left axis in Figure 24.3 (one realization of Function 20) represents the dam height, while the lower right axis is the set point as a fraction of full capacity.


Figure 24.3 Reservoir. One realization.

Source: Rhinehart (2014). Reproduced with permission of Elsevier.

The OF value in Figure 24.3 depends on a probability of the flood or drought event, and each realization of the contour will yield a slightly different appearance. Note that starting in the middle of the DV space on the planar portion, a downhill optimizer will progressively move toward the far side of the illustration to a less costly reservoir, but into the region with the spikes. In that region for a small reservoir, or one that is kept too full or too empty, there is a probability of encountering a costly flood or drought event that the reservoir cannot mitigate. Moving in the downhill direction the optimizer may, or may not, encounter a costly event in the Monte Carlo simulation. If it does not, it continues to place trial solutions into a region with a high probability of a disastrous event and continues into the bad region as the vagaries of probability generate fortuitous appearing OF values.

Stochastic boot print, Function 18, is a surrogate for the reservoir, a simplified set of equations that computes faster. Its response is similar.

24.4.5 Optimization Results

Figure 24.4a shows the results of a Levenberg–Marquardt search on the GMC control problem with convergence based on sequential changes in the DV, a classical measure. Trial solutions were initialized at random locations within the DV space, 100 times, and converged at 100 wrong places. The pattern is independent of the threshold on the change in DV or whether convergence is based on a change in OF value. The results are similar when other single‐point algorithms are used, even direct‐search optimizers such as Hooke–Jeeves. These algorithms get stuck on a local fortuitous good spot and cannot find a way out of the apparent, phantom minimum. In contrast, Figure 24.4b reveals the results of a particle swarm optimizer search starting with 20 particles, with an end‐game perching attribute to converge. The broad surface exploration and particle attraction toward the best leads the swarm to the vicinity of the global minimum. But, in that vicinity, the swarm converges on a fortuitous best, resulting in a range of phantom solutions.


Figure 24.4 Conventional optimization results on the GMC problem: (a) Levenberg–Marquardt, (b) particle swarm, and (c) leapfrogging.

Source: Rhinehart (2014). Reproduced with permission of Elsevier.

The range of converged solutions is relatively insensitive to several orders of magnitude on the convergence criterion precision. In all figures the threshold value is 0.01 as an rms change on the DVs; however, thresholds over the range 0.001–1 were explored. By contrast, the number of function evaluations to converge is strongly dependent on the convergence threshold. This is an issue for a user, who must specify the threshold values for the convergence criterion, but cannot have rational, a priori criteria for relating criterion thresholds to results. In the 100 trials, PSO required 2180 ANOFE (average number of function evaluations per trial) to converge.

Figure 24.4c reveals similar results for LF. For the same convergence criterion, the ending range of solutions is nearly identical to PSO, but LF averaged 1150 ANOFE, so the computational work is significantly lower.

In contrast, Figure 24.5 illustrates the impact of the method presented here. It is the same problem as that in Figure 24.4c; however, the OF value of the best player is based on the worst of the prior and one replicate value, and convergence is based on the automated, scale‐independent recognition of steady state. The replicates keep fortuitously good spots from continuing to act as a draw, which significantly reduces the range of converged solutions about the global. One might expect that replicates double the optimizer effort, but because of the SS convergence criterion, evaluations stop when continued work is substantially ineffective. The ANOFE for this set of 100 trials is 1090 (about 10% fewer), and the converged solutions are visibly more precise.

Figure 24.5 Method for stochastic optimization. The shaded cluster of results lies near 8 on the x-axis and 2 on the y-axis.

Source: Rhinehart (2014). Reproduced with permission of Elsevier.

These illustrations motivate a choice of dual criteria for evaluating the method: (i) the range of converged solutions indicating precision and (ii) ANOFE indicating data generation cost.

Table 24.1 summarizes findings for each of the stochastic functions described previously, comparing LF with a conventional convergence criterion of DV rms less than 0.01 and LF with a replicate evaluation and SS detection on the worst player value. The table entries are the precision of the cluster of converged solutions (as indicated by the rms value of the scaled DV range) and the work required to obtain the solution (as measured by the average number of function evaluations to converge). The values are based on 100 trials, rounded to reflect run‐to‐run variability. For a similar converged cluster size (DV rms), the proposed convergence criterion finds the optimum with nearly half the ANOFE. Alternatively, for similar ANOFE, the proposed method provides greater precision on the DV optimum.

Table 24.1 Experimental comparison—rms values of cluster scaled DV range/average number of function evaluations to converge.

Source: Rhinehart (2014). Reproduced with permission of Elsevier.

Optimizer | Hot and cold mixing GMC | Hot and cold mixing MPC | Stochastic boot print | Reservoir | Algae farm
LF with rms DV convergence based on 0.1% of DV range | 0.8/1450 | 0.8/1300 | 1.8/930 | 2.2/610 | 1.2/1200
LF with replicate and SSID method | 0.4/910 | 0.4/700 | 1.2/520 | 0.8/515 | 1.2/430

24.5 Takeaway

Whether the responses come from physical experiments or Monte Carlo simulations, stochastic objective functions are broadly applicable and include uncertainty in an optimization application. They are important in forecasting economics, reliability, etc.; but they confound classic optimization algorithms and convergence criteria. Using LF with replicates, and SSID on the worst player's OF value, locates the vicinity of the global with precision and/or fewer function evaluations.

24.6 Exercises

  1. Explain what you would use as convergence criterion when you are optimizing an active process as it “runs.” Describe such a process (economic process, chemical process, natural process, human process). There are usually several DVs and several OFs that are combined into one weighted‐sum OF. For example, if it is the national economy, DVs may be the rates for prime interest and federal spending and OFs may be unemployment and gross national product. A natural process example is wildlife, in which population is controlled by zoning for land use and hunting and fishing licenses. The processes have a stochastic, not deterministic, response.
  2. If OF values come from either physical experiments or a stochastic simulator, then experimental variability adds noise (uncertainty, perturbation, error, variation) to the OF value. Consider both direct and gradient‐based search algorithms, and discuss how noise on the OF value affects the determination of the next trial solution in (i) the search stages that move toward the optimum and (ii) the “end‐game” stage very close to the optimum.
  3. The problem with a stochastic function (such as Function 18, Reservoir) is that optimizers seek the best. If the best represents a low‐cost small reservoir, simulated in a fortuitous year with neither flood nor drought, that fortuitously good trial solution remains the best, the basis from which an optimizer seeks a better solution. When it finds a smaller reservoir size in another fortuitous simulation year, the trial solution moves to a smaller, lower‐cost reservoir. But repeated simulations could reveal that the fortuitous, one‐time best trial solution is actually frequently disastrously expensive. (i) Run several trials of the Leapfrogging optimizer on Function 18, with a convergence criterion of 0.2 on the DVs. (ii) Report on your results. You should see the converged solution penetrate into the region with a high probability of a flood or drought disaster. (iii) Run several trials of the Leapfrogging with replicate optimizer, with the same convergence criterion, and five replicates. (iv) Report results. You should see that not only does the retesting of the best player and accepting the worst value of the replicates keep the converged solution in the disaster‐free region, but also that the NOFE is increased.
  4. I have claimed that the steady‐state identification method is a good approach to detecting convergence in optimization of a stochastic function. “Good” means that it works reasonably well, is scale independent, is user convenient, and is grounded in fundamentals. What do you think? Provide evidence to support your conclusion. What makes one method better than another? If the SSID approach is universal (if it is scale independent, does not require a person to choose threshold values, and does not require adding apples and oranges), then that makes it better than the other approach. If the NOFE is lower for one method than another while giving the same optimum DV value, then the one method is better. How will you define better? What will you measure to quantify better and to defend your conclusion?
  5. Function 52 is a single‐decision variable optimization. The DV is the number of rolls of a six‐sided die. You pay for each roll, so there is a penalty directly related to the DV, N. If you get two or more 5s, you win. But if a 6 follows a 5, then the 5 does not count, and regardless of the number of 5s, if there are three 3s, you lose. Add a rule to the game and implement it in the 2‐D Examples function. Perhaps, if there are two or more 1s, then the 3 rule is discounted. Or if there are four or more 5s, the reward is doubled. (i) Explain your new rule. (ii) Show the code. (iii) Show the surface. (iv) Compare several optimizers, and include LF with replicates (Optimizer #16) and the steady‐state convergence criterion (Convergence #13).
  6. The file 2‐D Optimization Examples includes several stochastic functions. Choose one and explore the results from several optimizers and several convergence criteria.