Chapter 6.  Sequences - The Core of Data Processing Patterns

In this chapter, we will take a deep dive into one of the most essential and utterly important arrangements of functional programming, that is, sequences. The ability to represent any data transformation as a composition of atomic functions applied to the elements of an arbitrary enumerable data container is a must for a functional programmer. The goal of this chapter is to help you acquire this mental skill. The way towards this goal is paved by the following topics covered here:

  • Review the basic data transformations and partition the immense host of standard library data transformation functions by handful of underlying processing patterns
  • Consider the duality of sequence data generators being at once a data and an on-demand calculation
  • Cover how a sequence generalizes arbitrary collections by enumerating them, which represents the pull data transformation pattern
  • Further consider just another pattern of using generated sequences for data drilling
  • Wrap up by practically exploring how the usage of sequences affects the code performance

Basic sequence transformations

Let's revisit the functional solution of the sample problem from Chapter 1, Begin Thinking Functionally. It represents the common functional pattern of finding a given property of the collection as follows:

  • From the given string literal representing 1000 consecutive single digit characters, make a collection of the collections represented by chunks of just five consecutive single digit characters of the original collection. Each chunk takes the inner characters of a five character-wide stencil aligned first with the left-hand side border of the string literal. The stencil then gets moved to the right by a single character before extracting the next sub collection. This sliding of the stencil to the right is continued until the right-hand side borders of both the stencil and the literal get aligned. To be exact, the main sequence consists of 996 such five character sub sequences.
  • Note that the originally sought-for property of the maximal product of five consecutive digits at this point is substituted with a similar property of the sequence of elements, each representing a candidate group from which the sought-for property originates. It is worth pointing out that in order to solve the original problem, all elements of this secondary sequence must be taken into account (other patterns may differ in this respect, for example, finding any sequence element with a given property).
  • Perform a complete scan of the substitute sequence, looking for the maximal value of the sought-for property, which is the product of the constituents of the inner sequence representing an element of the outer sequence that substitutes the original string literal.

Those of you who are attentive to detail may have already spotted the similarity of the preceding solution approach to the MapReduce (https://en.wikipedia.org/wiki/MapReduce) pattern, just without the possible partitioning and parallelization of the map phase for now. This similarity is not coincidental. After implementing a serious amount of F# ETL (https://en.wikipedia.org/wiki/Extract,_transform,_load) tasks, big and small for enterprise Line of Business (LOB) applications, I can conclude that the part of the F# core library covering basic operations upon enumerable sequences, namely the Collections.seq library module of the Microsoft.FSharp.Collections ( https://msdn.microsoft.com/en-us/library/ee353635.aspx ) namespace, has already distilled the typical functional patterns of data sequence processing. Any effective F# developer should be conversant in representing a sought-for data transformation solution at hand into a combination of these library functions from Collections.seq.

Based on my own experience, this set of 70 library functions (for version 4.0 of F#) is hard to grok when you consider it as a list that is just alphabetically ordered by the function name. It is hard to memorize what exactly this or that function is doing without distinguishing their commonalities and differences. This perception can be facilitated if we start seeing a certain data transformation pattern being implemented by each of these functions. These patterns stem from years of accumulated experience in applying functional programming to data processing and are coined into the selection of functions that the F# designers have slated for inclusion into the core library.

I believe that by observing the Collection.seq library constituents from this data processing pattern relationship angle, the following function groups can be distinguished:

  • Aggregates: These functions traverse the sequence in its entirety, returning a single value calculated upon the sequence elements.
  • Generators: These functions produce sequences "out of thin air", or seriously, sequences of special kinds (such as empty typed sequence) and sequences defined either by a quantitative relation between the element and its sequence order number or just by a recurrently defined function deriving the next element based on the previous element(s).
  • Wrappers and Type Converters: These functions either wrap the entire sequence into a useful property (caching is a good example of a wrap) or just convert the sequence to other collection types (lists or arrays).
  • Appliers: These functions just traverse the sequence, applying the given calculation to each element for the sake of a side effect, for example, to print out sequence elements as strings.
  • Recombinators: These functions shuffle the sequence or extract its elements in a type-uniform manner; in other words, for a sequence of type 'T dealing exclusively with seq<'T> or 'T objects. For example, create a new sequence by skipping the first 100 elements of the original one.
  • Filters: These functions are concerned with the selection of elements that conform to an arbitrary condition(s). For example, try to find the first element of a sequence for which a given predicate function returns true.
  • Mappers: These functions change shape and/or the type of the original sequence(s) by producing a transformed sequence(s), for example, a zipping function that takes two input sequences and produces the single result sequence with each element being a tuple that combines elements from both input sequences sharing the same order number.

Equipped with this classification approach, I've partitioned the library functions by the following set of patterns. Under each pattern, all the relevant library functions are listed along with their signatures. I encourage you to explore the signatures in order to spot the commonalities responsible for each group formation.

Additional information for those of you who are eager to dig deeper is given in the Ch6_1.fsx script of this book's accompanying code, where the use of each of the library functions is illustrated by a brief code sample.

The aggregation pattern

average : seq<^T> -> ^T (requires member (+) and member    DivideByInt and member get_Zero)averageBy : ('T -> ^U) -> seq<'T> -> ^U (requires ^U with static    member (+) and ^U with static member DivideByInt and ^U with    static member Zero) 
fold : ('State -> 'T -> 'State) -> 'State -> seq<'T> -> 'State 
length : seq<'T> -> int 
sum : seq<^T> -> ^T (requires member (+) and member get_Zero) 
sumBy : ('T -> ^U) -> seq<'T> -> ^U (requires ^U with static  member (+) and ^U with static member Zero) 
max : seq<'T> -> 'T (requires comparison) 
maxBy : ('T -> 'U) -> seq<'T> -> 'T (requires comparison) 
min : seq<'T> -> 'T (requires comparison) 
minBy : ('T -> 'U) -> seq<'T> -> 'T (requires comparison) 
isEmpty : seq<'T> -> bool 
reduce : ('T -> 'T -> 'T) -> seq<'T> -> 'T 
exactlyOne : seq<'T> -> 'T 
compareWith : ('T -> 'T -> int) -> seq<'T> -> seq<'T> -> int 

The generation pattern

empty : seq<'T>
init : int -> (int -> 'T) -> seq<'T>
initInfinite : (int -> 'T) -> seq<'T>
singleton : 'T -> seq<'T>
unfold : ('State -> 'T * 'State option) -> 'State -> seq<'T>

The wrapping and type conversion pattern

cast : IEnumerable -> seq<'T>
cache : seq<'T> -> seq<'T>
delay : (unit -> seq<'T>) -> seq<'T>
readonly : seq<'T> -> seq<'T>
toArray : seq<'T> -> 'T []
toList : seq<'T> -> 'T
list ofArray : 'T array -> seq<'T>
ofList : 'T list -> seq<'T>

The application pattern

iter : ('T -> unit) -> seq<'T> -> unit
iter2 : ('T1 -> 'T2 -> unit) -> seq<'T1> -> seq<'T2> -> unit
iteri : (int -> 'T -> unit) -> seq<'T> -> unit

The recombination pattern

append : seq<'T> -> seq<'T> -> seq<'T>
collect : ('T -> 'Collection) -> seq<'T> -> seq<'U>
concat : seq<'Collection> -> seq<'T>
head : seq<'T> -> 'T
last : seq<'T> -> 'T
nth : int -> seq<'T> -> 'T
skip : int -> seq<'T> -> seq<'T>
take : int -> seq<'T> -> seq<'T>
sort : seq<'T> -> seq<'T>
sortBy : ('T -> 'Key) -> seq<'T> -> seq<'T>
truncate : int -> seq<'T> -> seq<'T>
distinct : seq<'T> -> seq<'T>
distinctBy : ('T -> 'Key) -> seq<'T> -> seq<'T>

The filtering pattern

choose : ('T -> 'U option) -> seq<'T> -> seq<'U>
exists : ('T -> bool) -> seq<'T> -> bool
exists2 : ('T1 -> 'T2 -> bool) -> seq<'T1> -> seq<'T2> -> bool
filter : ('T -> bool) -> seq<'T> -> seq<'T>
find : ('T -> bool) -> seq<'T> -> 'T
findIndex : ('T -> bool) -> seq<'T> -> int
forall : ('T -> bool) -> seq<'T> -> bool
forall2 : ('T1 -> 'T2 -> bool) -> seq<'T1> -> seq<'T2> -> bool
pick : ('T -> 'U option) -> seq<'T> -> 'U
skipWhile : ('T -> bool) -> seq<'T> -> seq<'T>
takeWhile : ('T -> bool) -> seq<'T> -> seq<'T>
tryFind : ('T -> bool) -> seq<'T> -> 'T option
tryFindIndex : ('T -> bool) -> seq<'T> -> int
option tryPick : ('T -> 'U option) -> seq<'T> -> 'U option
where : ('T -> bool) -> seq<'T> -> seq<'T>

The mapping pattern

countBy : ('T -> 'Key) -> seq<'T> -> seq<'Key * int>
groupBy : ('T -> 'Key) -> seq<'T> -> seq<'Key * seq<'T>>
pairwise : seq<'T> -> seq<'T * 'T>
map : ('T -> 'U) -> seq<'T> -> seq<'U>
map2 : ('T1 -> 'T2 -> 'U) -> seq<'T1> -> seq<'T2> -> seq<'U>
mapi : (int -> 'T -> 'U) -> seq<'T> -> seq<'U>
scan : ('State -> 'T -> 'State) -> 'State -> seq<'T> -> seq<'State> windowed : int -> seq<'T> -> seq<'T []>
zip : seq<'T1> -> seq<'T2> -> seq<'T1 * 'T2>
zip3 : seq<'T1> -> seq<'T2> -> seq<'T3> -> seq<'T1 * 'T2 * 'T3>
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset