Sequences and code performance

Sequences are, without doubt, an extremely powerful part of the functional programmer's tool belt. However, they are not free of gotchas that may badly hurt performance. It is best to know and avoid them. A few of them are as follows:

  • Unfortunate materialization, which may be either unnecessary/premature materialization of elements or, the other way around, missing materialization of elements.
  • Laziness, combined with the fact that element values are not preserved once consumed, can severely hurt performance in situations where the algorithm requires multiple traversals or where computing elements is expensive. The developer can compensate for these detrimental factors by applying patterns such as caching and/or memoization.
  • When composing data processing pipelines, the developer may carelessly use a library function that unexpectedly enumerates the entire sequence. This is not necessarily a bad thing, but such functions should be used sparingly.
  • For all the aesthetic beauty of sequences of indefinite length: while the misdemeanors mentioned in the previous bullet merely hurt the performance of finite-length sequences, the first such act of negligence upon an indefinite-length sequence simply kills! Beware and proceed with due caution when dealing with unlimited-length sequences!
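The last bullet can be made concrete with a short sketch (not from the book's scripts; the names are mine). Lazy combinators such as Seq.truncate only pull the elements they need, while an eager aggregate such as Seq.length must enumerate to the end, which an indefinite-length sequence never reaches:

    // An indefinite-length sequence of natural numbers
    let naturals = Seq.initInfinite (fun i -> i + 1)

    // Safe: lazy combinators pull only what they need
    let firstTenSum = naturals |> Seq.truncate 10 |> Seq.sum  // 55

    // Fatal: Seq.length must reach the end, which never comes
    // let hangs = naturals |> Seq.length  // never terminates!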

Sequence caching

The F# language creators were nice enough to provide an out-of-the-box caching tool, namely the Seq.cache library function. It should be used in situations where lazy materialization is not a deal-breaker, but element generation is not cheap and repeated enumerations are genuinely required. Let me demonstrate how easy it is to put caching to work.
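Before turning to enumerator counting, here is a minimal sketch (my own names, not from the book's scripts) of the "expensive element generation" case: a mutable counter records how many times an element is actually computed, and Seq.cache ensures each element is computed at most once regardless of how many times the sequence is consumed:

    // Count how many times elements are actually computed
    let mutable computations = 0
    let expensive =
        seq { for i in 1..5 do
                computations <- computations + 1
                yield i * i }
        |> Seq.cache

    let s1 = Seq.sum expensive  // computes all 5 elements: computations = 5
    let s2 = Seq.sum expensive  // served from the cache: computations still 5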

To begin with, I need an indicator of enumerator consumption. This is not complicated for those who already have experience working with sequence guts. Let's slightly modify our good old makeSeq function as follows (Ch6_5.fsx):

let makeSeq f = 
    { new System.Collections.Generic.IEnumerable<'U> with 
        member x.GetEnumerator() = printfn "Fresh enumerator given"; f() 
      interface System.Collections.IEnumerable with 
        member x.GetEnumerator() = 
            (f() :> System.Collections.IEnumerator) } 

Now we are ready to see how the caching works as shown here (Ch6_5.fsx):

//caching 
let nums = (seq {1..100}).GetEnumerator |> makeSeq 
// non-cached - double enumeration 
((nums |> Seq.sum),(nums |> Seq.length)) 
//Fresh enumerator given 
//Fresh enumerator given 
//val it : int * int = (5050, 100) 
 
let cache = nums |> Seq.cache 
// cached - single enumeration 
((cache |> Seq.sum),(cache |> Seq.length)) 
//Fresh enumerator given 
//val it : int * int = (5050, 100) 
// just another time - no enumerations at all 
((cache |> Seq.sum),(cache |> Seq.length)) 
//val it : int * int = (5050, 100) 

First, in the absence of any caching, Seq.sum and Seq.length each imposed an independent sequence traversal, which the presence of two enumerator alerts confirms.

Then, after wrapping the working sequence with Seq.cache, I repeat the calculation using the wrapper sequence. As expected, we notice only a single enumerator alert to populate the cache; the second traversal did not leave any traces as it went through the cache.

To be sure, just reissue the calculation. Now, all the data come from the cache, and no traversals of the original sequence take place at all.
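It is worth noting that Seq.cache is itself lazy: it caches elements only as they are consumed. A brief sketch (not from the book's scripts) shows that this makes it safe even over an indefinite-length sequence, so long as downstream consumption stays bounded:

    // Seq.cache only materializes what is actually consumed,
    // so wrapping an infinite sequence is safe
    let cachedNaturals = Seq.initInfinite (fun i -> i + 1) |> Seq.cache
    let firstThree = cachedNaturals |> Seq.take 3 |> Seq.toList  // [1; 2; 3]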

The fusion of sequence transformations

I want to wrap up this chapter by demonstrating a pattern known as fusion. It is not conceptually difficult: imagine that you have a composition of functions that collectively transform a data sequence. At some point, your implementation requires multiple traversals of the sequence. However, the compiler in principle, or the human in practice, may optimize the transformation so that the multiple traversals fuse into a single one.

Let's perform fusion in practice, reusing our makeSeq implementation as the indicator of obtaining enumerators as shown in the following code (Ch6_5.fsx):

let series = (seq {1..100}).GetEnumerator |> makeSeq 
let average dd = (Seq.sum dd) / (Seq.length dd) 
average series 
//Fresh enumerator given 
//Fresh enumerator given 
//val it : int = 50 

The preceding naive implementation of average traverses the sequence twice, as the two enumeration alerts evidence.

However, rewriting the implementation of average less naively as averageFused ends up in the fusion of these traversals as shown in the following code (Ch6_5.fsx):

let averageFused dd = 
  dd 
  |> Seq.fold (fun acc x -> (fst acc + x, snd acc + 1)) (0,0) 
  |> fun x -> fst x / snd x 
averageFused series 
//Fresh enumerator given 
//val it : int = 50 

The single enumeration alert confirms that the two traversals have indeed been fused into one.
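As an aside, the standard library already ships some fused aggregates. Seq.average, for instance, computes a mean in a single pass; note that it requires an element type supporting division, such as float, rather than int (a library fact, shown here in a small sketch of my own):

    // Seq.average is a single-pass aggregate over a divisible
    // element type such as float
    let mean = seq { 1.0 .. 100.0 } |> Seq.average  // 50.5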
