Finding the optimal partition size with simulated annealing

In the previous recipe, Partitioning Monte Carlo simulations for better pmap performance, we more or less guessed what would make a good partition size. We tried a few different values and saw which gave the best results. However, that's still largely guesswork, since simply making the partitions larger or smaller doesn't give consistently better or worse results.

This is the type of task that computers are good at: searching a complex space to find the function parameters that produce an optimal output value. For this recipe, we'll use a fairly simple optimization algorithm called simulated annealing. Like many optimization algorithms, it's based on a natural process: the way molecules settle into low-energy configurations as the temperature drops to freezing. This is what allows water to form efficient crystal lattices as it freezes.

In simulated annealing, we feed a state to a cost function. At each point, we evaluate a random neighboring state and possibly move to it. As the energy in the system (the temperature) goes down, we are less likely to jump to a new state, especially if that state is worse than the current one according to the cost function. Finally, after either reaching a target output or iterating through a set number of steps, we take the best match found. Similar to many optimization algorithms, this doesn't guarantee that the result will be the absolute best match, but it should be a good one.

For this recipe, we'll take the Monte Carlo pi approximation function that we wrote in the Partitioning Monte Carlo simulations for better pmap performance recipe and use simulated annealing to find a better partition size.

Getting ready

We need to use the same dependencies, uses, imports, and functions as we did in the Partitioning Monte Carlo simulations for better pmap performance recipe. In addition, we'll need the mc-pi-part function from that recipe.
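
If you're coming to this recipe on its own, the following is a rough sketch of the setup being assumed. The namespace name and the body of mc-pi-part are only approximations of what the earlier recipe defines (the details there may differ); the important points are that quick-benchmark comes from Criterium and that mc-pi-part takes a partition (chunk) size and the number of sample points:

    (ns mc-pi.annealing
      (:require [criterium.core :refer [quick-benchmark]]))

    ;; A stand-in for mc-pi-part from the earlier recipe: estimate pi from
    ;; n random points in the unit square, counting hits inside the quarter
    ;; circle in partitions of chunk-size with pmap.
    (defn- in-circle? [[x y]]
      (<= (+ (* x x) (* y y)) 1.0))

    (defn mc-pi-part [chunk-size n]
      (let [hits (->> (repeatedly n #(vector (rand) (rand)))
                      (partition-all chunk-size)
                      (pmap #(count (filter in-circle? %)))
                      (reduce +))]
        (* 4.0 (/ hits n))))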

How to do it…

  1. For this recipe, we'll first define a generic simulated annealing system. Then, we'll define some functions to pass to it as parameters. Everything will be driven by the annealing function, which takes all of the function parameters for the process as arguments (we'll discuss them in more detail in a minute):
    (defn annealing
      [initial max-iter max-cost
       neighbor-fn cost-fn p-fn temp-fn]
      (let [get-cost (memoize cost-fn)
            cost (get-cost initial)]
        (loop [state initial
               cost cost
               k 1
               best-seq [{:state state, :cost cost}]]
          (println '>>> 'sa k '. state '$ cost)
          (if (and (< k max-iter)
                   (or (nil? max-cost)
                       (> cost max-cost)))
            (let [t (temp-fn (/ k max-iter))
                  next-state (neighbor-fn state)
                  next-cost (get-cost next-state)
                  next-place {:state next-state,
                              :cost next-cost}]
              (if (> (p-fn cost next-cost t) (rand))
                (recur next-state next-cost (inc k)
                       (conj best-seq next-place))
                (recur state cost (inc k) best-seq)))
            best-seq))))
  2. For parameters, annealing takes an initial state, a limit on the number of iterations, a maximum cost to stop at (or nil to always run the full number of iterations), and a series of functions. The first function takes the current state and returns a new neighboring state.

    To write this function, we have to decide how best to represent the state for this problem. If the function being evaluated had multiple parameters, we would use a vector for the state and randomly nudge one of its values. However, for this problem, we only have one input value: the partition size.

    So, for this problem, we'll use a single integer between 0 and 20 instead. The actual partition size will be 2 raised to that power. To find a neighbor, we just randomly slide the state's value up or down by at most five units, staying within the range of 0 to 20:

    (defn get-neighbor [state]
      (max 0 (min 20 (+ state (- (rand-int 11) 5)))))
  3. The next function parameter used for annealing is the cost function. This will take the state and return the value that we're trying to minimize. In this case, we benchmark mc-pi-part with the given partition size (2 raised to the state's power) and return the mean run time:
    (defn get-pi-cost [n state]
       (let [chunk-size (long (Math/pow 2 state))]
         (first (:mean (quick-benchmark
                         (mc-pi-part chunk-size n)
                         {})))))
  4. The next function takes the current state's cost, a potential new state's cost, and the current energy in the system (from 0 to 1). It returns the probability that we should move to the new state. Currently, this will always move to an improved state and will move to a worse state 25 percent of the time (both probabilities are scaled by the temperature). For example, halfway through the run (t = 0.5), a better state is accepted with probability 0.5 and a worse one with probability 0.125:
    (defn should-move [c0 c1 t]
      (* t (if (< c0 c1) 0.25 1.0)))
  5. The final function parameter takes the fraction of the iterations completed so far and returns the energy or temperature as a number from 0 to 1. This can use any number of easing functions, but here we'll just use a simple linear one (a non-linear alternative is sketched after this list):
    (defn get-temp [r] (- 1.0 (float r)))
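
As an aside, the easing function doesn't have to be linear. Here's a sketch of one non-linear alternative that isn't used elsewhere in this recipe; the decay constant of 5.0 is an arbitrary choice. You would pass it to annealing in place of get-temp:

    ;; Exponential decay: the temperature drops quickly at first and then
    ;; levels off, so later iterations rarely accept worse states.
    (defn get-temp-exp [r]
      (Math/exp (* -5.0 (float r))))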

That's it. We can let this find a good partition size. We'll start with the value that we used in the Partitioning Monte Carlo simulations for better pmap performance recipe. We'll only allow 10 iterations since the search space is relatively small:

user=> (annealing 12 10 nil get-neighbor
                  (partial get-pi-cost 1000000)
                  should-move get-temp)
>>> sa 1 . 12 $ 0.5805938333333334
>>> sa 2 . 8 $ 0.38975950000000004
>>> sa 3 . 8 $ 0.38975950000000004
>>> sa 4 . 8 $ 0.38975950000000004
>>> sa 5 . 8 $ 0.38975950000000004
>>> sa 6 . 8 $ 0.38975950000000004
>>> sa 7 . 6 $ 0.357514
>>> sa 8 . 6 $ 0.357514
>>> sa 9 . 6 $ 0.357514
>>> sa 10 . 6 $ 0.357514
[{:state 12, :cost 0.5805938333333334}
 {:state 8, :cost 0.38975950000000004}
 {:state 6, :cost 0.357514}]

We can see that a partition size of 64 (2^6) gives the best time, and rerunning the benchmarks verifies this.
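
If you want to double-check the winner yourself, you can call the cost function directly on the winning state and on the original guess. The timings noted here are just the means from the run above; expect different numbers on your machine:

    (get-pi-cost 1000000 6)    ; chunk size of 64; about 0.36s in the run above
    (get-pi-cost 1000000 12)   ; chunk size of 4096; about 0.58s in the run above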

How it works…

In practice, this algorithm won't help if we have to run it over the full input data, since the tuning run would then take at least as long as the job we're trying to speed up. However, if we can get a large enough sample, it can help us process the full dataset more efficiently by taking a lot of the guesswork out of picking the right partition size for the full evaluation.
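
For example, a tuning run over a sample might look something like this sketch. The sample and full sizes are made-up values, and using min-key to pick the winner is just one reasonable way to read the output of annealing:

    ;; Tune the chunk size on a sample, then reuse the winner for the full run.
    (let [sample-n 1000000                        ; points used while tuning
          full-n   100000000                      ; points for the real run
          results  (annealing 12 10 nil get-neighbor
                              (partial get-pi-cost sample-n)
                              should-move get-temp)
          best     (apply min-key :cost results)  ; lowest-cost state seen
          chunk    (long (Math/pow 2 (:state best)))]
      (mc-pi-part chunk full-n))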

  1. What we did was kind of interesting. Let's take the annealing function apart to see how it works. The process is handled by a loop inside the annealing function. Its parameters are a snapshot of the state of the annealing process:
         (loop [state initial
                cost cost
                k 1
                best-seq [{:state state, :cost cost}]]
  2. We only continue if we still have iterations left and we haven't yet beaten the maximum cost (if one was given):
           (if (and (< k max-iter)
                    (or (nil? max-cost)
                        (> cost max-cost)))
  3. If we continue, we calculate the temperature for this iteration and get a potential new state and its cost to evaluate:
    (let [t (temp-fn (/ k max-iter))
          next-state (neighbor-fn state)
          next-cost (get-cost next-state)
          next-place {:state next-state, :cost next-cost}]
  4. If the probability function (in this case, should-move) indicates so, we move to the next state and loop. Otherwise, we stay at the current state and loop:
    (if (> (p-fn cost next-cost t) (rand))
      (recur next-state next-cost (inc k)
             (conj best-seq next-place))
      (recur state cost (inc k) best-seq)))
  5. If we're done, we return the sequence of the best states and costs seen (a small helper for pulling the winner out of this sequence is sketched after this list):
    best-seq))))
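
The return value is just that vector of state/cost maps, so pulling the winner out is a one-liner. This helper isn't part of the recipe proper, but it shows one way to read the result:

    ;; Return the state with the lowest cost seen during the run.
    (defn best-state [best-seq]
      (:state (apply min-key :cost best-seq)))

    ;; For the run shown earlier, this returns 6, that is, a chunk size of
    ;; 2^6 = 64:
    ;; (best-state [{:state 12, :cost 0.5805938333333334}
    ;;              {:state 8, :cost 0.38975950000000004}
    ;;              {:state 6, :cost 0.357514}])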

This provides a systematic way to explore the problem space; in this case, that meant finding a better partition size for the simulation.

There's more…

Simulated annealing is one of a class of algorithms known as metaheuristics. These algorithms are widely applicable for searching a result space, but they do not guarantee finding a globally optimal solution. Instead, they typically approach the problem stochastically, randomly searching for the best solution that they can find within a reasonable amount of time. All of these take a function (the cost-fn function mentioned previously) and try to find the largest or smallest value for it.

Other optimization algorithms include genetic algorithms, ant colony optimization, particle swarm optimization, and many others. This is a broad and interesting field, and being familiar with these algorithms can be helpful for anyone who does data analysis.
