Up to this point, all the F# data transformation patterns we covered dealt with in-memory collections. Important data crunching use cases, such as querying data already persisted within the enterprise and ingesting data from outside of it, have not been considered yet.
This chapter covers these data transformation scenarios and related coding patterns:
So far in the book, the sources of data collections were either collection generators or the file system. Let me move toward more realistic enterprise data sources, where data are persisted in databases. To access and transform such data, F# provides query expressions (https://msdn.microsoft.com/visualfsharpdocs/conceptual/query-expressions-%5bfsharp%5d).
Query expressions are a concrete type of computation expression embedded into the language. They allow you to bring data into memory by querying external sources and transforming the incoming data into the desired shape.
F# query expressions are akin to sequence expressions: both produce data sequences. However, before the final projection shapes the produced sequence, a query expression may apply a wide variety of data transformations similar to those found in LINQ (https://en.wikipedia.org/wiki/Language_Integrated_Query). Query expressions can be considered the LINQ support of F#.
Chronologically, query expressions were introduced in F# 3.0. Prior to that, F# allowed you to access the LINQ machinery via the .NET 3.5 Enumerable extension methods (https://msdn.microsoft.com/en-us/library/system.linq.enumerable_methods(v=vs.110).aspx). Let's look at the following script, which finds the last vowel in the sequence of lexicographically ordered letters of the English alphabet (Ch9_1_1.fsx):
let isVowel = function
    | 'A' | 'a' | 'E' | 'e' | 'I' | 'i' | 'O' | 'o' | 'U' | 'u' -> true
    | _ -> false

let alphabet = seq { 'A' .. 'Z' }

alphabet
|> Seq.filter isVowel
|> Seq.sortDescending
|> Seq.head
// val it : char = 'U'
If we recollect that the F# sequence alphabet is IEnumerable, then the task can be achieved with LINQ extension methods (Ch9_1_2.fsx):
open System.Linq

let isVowel = function
    | 'A' | 'a' | 'E' | 'e' | 'I' | 'i' | 'O' | 'o' | 'U' | 'u' -> true
    | _ -> false

let alphabet = seq { 'A' .. 'Z' }

alphabet.Where(isVowel).OrderByDescending(fun x -> x).First()
// val it : char = 'U'
Using the fluent interface of LINQ extension methods as a rough substitute for the F# pipe operator, |>, we have achieved almost one-to-one correspondence between the two definitions. The same result is produced by combining the Seq library functions filter-sortDescending-head and by combining the LINQ extension methods Where-OrderByDescending-First.
You may ask why I pay so much attention to this similarity. That is because query expressions are nothing more than syntactic sugar, similar to what we observed with sequence expressions in Chapter 6, Sequences - The Core of Data Processing Patterns. Query expressions use F# computation expression magic to express a chain of function applications as a linear sequence of SQL-like operations within the built-in computation expression builder query { ... }. This approach is similar to how seq { ... } works for generating F# sequences. The script Ch9_1_2.fsx given in the previous section may be rewritten using a query expression as follows (Ch9_1_3.fsx):
let isVowel = function
    | 'A' | 'a' | 'E' | 'e' | 'I' | 'i' | 'O' | 'o' | 'U' | 'u' -> true
    | _ -> false

let alphabet = seq { 'A' .. 'Z' }

query {
    for letter in alphabet do
    where (isVowel letter)
    sortByDescending letter
    select letter // may be omitted
    head
}
// val it : char = 'U'
When dissecting the preceding query expression, you may spot the already familiar ETL data transformation process considered in Chapter 8, Data Crunching - Data Transformation Patterns: given a collection, perform one or more modifications on its members, eventually projecting the query result. As a computation expression, query provides the magic glue between adjacent lines. It makes the data flow from one query operator to another in a manner similar to functions chained with the >> combinator.
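To make this analogy concrete, here is the vowel-finding chain from Ch9_1_1.fsx expressed as a single function built with the >> combinator; the query expression above performs essentially the same left-to-right flow of data:

```fsharp
// The filter-sortDescending-head chain composed into one function with >>
let isVowel = function
    | 'A' | 'a' | 'E' | 'e' | 'I' | 'i' | 'O' | 'o' | 'U' | 'u' -> true
    | _ -> false

let lastVowel : seq<char> -> char =
    Seq.filter isVowel >> Seq.sortDescending >> Seq.head

let alphabet = seq { 'A' .. 'Z' }
lastVowel alphabet // val it : char = 'U'
```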
Although the number of query operators (https://msdn.microsoft.com/en-us/visualfsharpdocs/conceptual/query-expressions-%5Bfsharp%5D) is substantially less than the number of F# core library functions for collections (there are only around 40 of them), the query operators fit nicely, where applicable, into the hierarchy of data transformation patterns (a similar classification I managed to discover is this one: https://weblogs.asp.net/dixin/understanding-linq-to-objects-2-query-methods-and-query-expressions). The mapping, grouped by data transformation pattern, is given as follows:
Aggregation: count, averageBy, averageByNullable, minBy, maxBy, minByNullable, maxByNullable, sumBy, and sumByNullable
Searching: the find function
Selection and filtering: last, lastOrDefault, head, headOrDefault, nth, exactlyOne, exactlyOneOrDefault, take, takeWhile, skip, skipWhile, distinct, and where
Grouping: groupBy and groupValBy
Sorting: sortBy, sortByDescending, sortByNullable, sortByNullableDescending, thenBy, thenByDescending, thenByNullable, and thenByNullableDescending
Quantification: exists and all
Projection: select
Joining: zip, join, groupJoin, and leftOuterJoin
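As a quick illustration of several of these categories working together, the following sketch (toy data of my own, not one of the chapter's scripts) combines filtering, grouping, aggregation, sorting, and projection in one in-memory query:

```fsharp
// Filtering (where), grouping (groupBy), aggregation, sorting
// (sortByDescending), and projection (select) combined in one query
let words = [ "ant"; "bee"; "cat"; "cow"; "dog"; "crow"; "buffalo" ]

let byFirstLetter =
    query {
        for w in words do
        where (w.Length <= 4)      // filtering: "buffalo" drops out
        groupBy w.[0] into g       // grouping by first letter
        let n = g |> Seq.length    // aggregation over each group
        sortByDescending n         // sorting by group size
        select (g.Key, n)          // projection to (letter, count)
    }
    |> Seq.toList
// the most frequent first letter comes first: ('c', 3)
```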
Nice! However, so far the discussion has revolved around in-memory collections. How, then, do we encompass querying out-of-memory data? F# provides considerable flexibility in this matter, so let's approach it gradually in order to explore the richness and variety of the available schemes.
An important detail of LINQ that often gets missed by casual users is that the query mechanism is agnostic to the nature of the data collection. There is an optional layer that abstracts the details of the concrete data source behind the IQueryable<'T> (https://msdn.microsoft.com/en-us/library/bb351562(v=vs.110).aspx) interface, which we have not touched yet. Without involving this layer, you are on your own with our familiar IEnumerable<'T> interface.
Both interfaces ensure deferred execution. However, IEnumerable<'T> just brings into memory the data collection that matches a query expressed with relevant external means, subject to further LINQ-to-Objects in-memory manipulations.
By comparison, IQueryable<'T> allows LINQ-to-Something (LINQ-to-SQL, LINQ-to-OData, LINQ-to-WMI, to name a few) workings by virtue of a component named the LINQ provider. It ensures the hidden translation of a LINQ query into terms that the concrete substitute for the Something part understands, followed by the execution of the translated query by Something, bringing just the matching data collection back into memory. Those of you interested in what entities may play the role of Something are referred to the representative (although slightly outdated) LINQ-to-Everywhere - List of LINQ Providers (https://blogs.msdn.microsoft.com/knom/2009/04/27/linq-to-everywhere-list-of-linq-providers/).
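The distinction can be observed without any database at hand: wrapping an in-memory collection with AsQueryable() yields an IQueryable<'T> that carries the expression tree a LINQ provider would translate. In this sketch (toy data, not from the chapter's scripts) the provider behind the queryable is just LINQ-to-Objects, so both routes produce the same result:

```fsharp
open System.Linq

let data = [ 1 .. 10 ]

// IEnumerable route: evaluated lazily in memory by LINQ-to-Objects
let viaEnumerable = data |> Seq.filter (fun x -> x % 2 = 0)

// IQueryable route: the query is captured as an expression tree first;
// against a real database, a LINQ provider would translate it to SQL
let viaQueryable : IQueryable<int> =
    query {
        for x in data.AsQueryable() do
        where (x % 2 = 0)
        select x
    }

// Both materialize to the same sequence for in-memory data
viaQueryable |> Seq.toList // [2; 4; 6; 8; 10]
```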
There are two key points in the previous paragraph that must be understood properly. Firstly, the LINQ provider fully abstracts the details of query translation and execution. For the intuitively clear case of LINQ-to-SQL, such translation is quite straightforward: the translated SQL query is executed on the side of the engaged database engine, sending back over the wire only the results of the server-side query execution. For something like LINQ-to-CRM (http://linqtocrm.codeplex.com/), some further digging would be required to find out what exactly that particular LINQ provider does.
Secondly, the to-be-translated LINQ query should not contain elements that cannot be expressed in terms of the translated query's execution engine. Such violations may occur through the selectiveness of features in the provider implementation, or through the unintentional capturing of unrelated elements from the context. This means that if the provider implementation, for example, does not support the sorting operation, a LINQ query containing a sorting piece will be rejected by the underlying provider. Also, the translated query execution engines themselves may have varying abilities: the same LINQ-to-SQL query may be successfully executed by the Microsoft SQL Server engine but fail miserably on a MySQL engine.
With the role of LINQ provider in mind, let's first turn to the case of LINQ provider-free F# querying.
For this use case, let me take something that can be easily reproduced by you. Being on the Microsoft platform myself, I will be using the traditional test database Microsoft supplies to developers, namely Adventureworks 2014 (https://msftdbprodsamples.codeplex.com/releases/view/125550). It has been installed under the localdb Microsoft SQL engine that comes with Visual Studio 2013.
There is a [Person].[Person] table in this database that carries, among other things, the names of people. Let me perform a simple analytical task on it by executing the following query:
select count(distinct [FirstName]) from [Adventureworks2014].[Person].[Person]
This allows me to find out that the database carries 1018 distinct personal first names. Let's find out how these names are distributed by the first letter in the English alphabet.
To access the database, I will be using the simple Reader object of the native .NET System.Data.SqlClient library. The first (and rather simplistic) approach would be to just pull the complete list of first names over the wire into memory on demand. The following script implements this approach (Ch9_1_4.fsx):
open System.Data
open System.Data.SqlClient

let alphabet = seq { 'A' .. 'Z' }

let connStr =
    @"Data Source=(localdb)\projectsv12;Initial Catalog=Adventureworks2014;Integrated Security=true;"
let dbConnection = new SqlConnection(connStr)
dbConnection.Open()

let dbCommand =
    new SqlCommand("select FirstName from [Person].[Person]", dbConnection)

let names = seq {
    printfn "reading from db"
    use reader = dbCommand.ExecuteReader(CommandBehavior.Default)
    while reader.Read() do
        yield reader.GetString(0) }

let distribution = query {
    for letter in alphabet do
    let howMuch =
        query {
            for name in names do
            where (name.StartsWith(string letter))
            distinct
            select name }
        |> Seq.length
    sortBy howMuch
    select (letter, howMuch) }

distribution |> Seq.toList |> printfn "%A"
There are two query expressions here: the first goes over each letter of the alphabet, delegating the trip to the database for the complete set of data to the nested second query, and then filtering out in memory everything except the names starting with the current letter value, throwing away duplicates, and finding the resulting number of names. The outer query puts this number into its place according to the found frequency and returns the sought-for projection (letter, howMuch) as the distribution sequence. Materializing it in FSI, I can observe the target name distribution. The timed results of running the script Ch9_1_4.fsx are presented in the following screenshot, where FSI just takes the source script code from the given file path:
You may notice that while running, the script performed a complete reading of the list of first names from the [Person].[Person] database table 26 times, which is apparent overkill; the approach can be improved significantly.
For example, we may parameterize our SQL command and get back not all names, but just the distinct names for each specific letter, which will dramatically decrease the traffic over the wire with the database. The script refactored to reflect this improvement is given in the following code (Ch9_1_5.fsx):
open System.Data
open System.Data.SqlClient

let alphabet = seq { 'A' .. 'Z' }

let connStr =
    @"Data Source=(localdb)\projectsv12;Initial Catalog=Adventureworks2014;Integrated Security=true;"
let dbConnection = new SqlConnection(connStr)
dbConnection.Open()

let dbCommandR l =
    new SqlCommand(
        (sprintf "%s%s%s"
            "select distinct FirstName from [Person].[Person] where FirstName like '"
            l
            "%'"),
        dbConnection)

let names l = seq {
    printfn "reading from db"
    use reader = (dbCommandR l).ExecuteReader(CommandBehavior.Default)
    while reader.Read() do
        yield reader.GetString(0) }

let distribution = query {
    for letter in alphabet do
    let howMuch = names (string letter) |> Seq.length
    sortBy howMuch
    select (letter, howMuch) }

#time "on"
distribution |> Seq.toList |> printfn "%A"
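A side note on the sprintf-built command text: splicing the letter straight into the SQL string works for this controlled alphabet, but the conventional way to express the same per-letter query is with a SQL parameter, which also guards against injection should the filter value ever come from user input. The following is a hedged sketch of such a variant of dbCommandR (a hypothetical helper of my own, assuming the same schema and open connection as above), not code from the book's script:

```fsharp
open System.Data.SqlClient

// Hypothetical parameterized counterpart of dbCommandR: the LIKE pattern
// travels as a parameter value, not as part of the command text
let dbCommandP (dbConnection: SqlConnection) (l: string) =
    let cmd =
        new SqlCommand(
            "select distinct FirstName from [Person].[Person] where FirstName like @prefix",
            dbConnection)
    // the '%' wildcard is appended on the client; the engine sees one parameter
    cmd.Parameters.AddWithValue("@prefix", l + "%") |> ignore
    cmd
```

The names function would then build its reader from (dbCommandP dbConnection l).ExecuteReader(CommandBehavior.Default) exactly as before.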
You may notice that there is no need for the nested query {...} expressions now, as a significant amount of work is delegated to the SQL engine over the wire. The timed results of running the refactored script are given in the following screenshot:
You may observe an almost four-fold improvement in performance in the second version, due to the significant decrease in network traffic volume.
Pushing this trend of minimizing traffic to the extreme and giving as much work to the SQL server as possible, I may push all the work onto the SQL server side, leaving just the rudimentary task of fetching remote data to the F# query, as in the third version of the script shown here (Ch9_1_6.fsx):
open System.Data
open System.Data.SqlClient

let connStr =
    @"Data Source=(localdb)\projectsv12;Initial Catalog=Adventureworks2014;Integrated Security=true;"
let dbConnection = new SqlConnection(connStr)
dbConnection.Open()

let dbCommandF =
    new SqlCommand(
        "select SUBSTRING(FirstName, 1, 1), count(distinct FirstName) as [count] from [Adventureworks2014].[Person].[Person] group by SUBSTRING(FirstName, 1, 1) order by [count]",
        dbConnection)

let frequences = seq {
    printfn "reading from db"
    use reader = dbCommandF.ExecuteReader(CommandBehavior.Default)
    while reader.Read() do
        yield (reader.GetString(0), reader.GetInt32(1)) }

let distribution = query {
    for freq in frequences do
    select freq }

#time "on"
distribution |> Seq.toList |> printfn "%A"
Note that now all the dirty work is done by the SQL server, which is perfectly fine, as Microsoft SQL Server is a masterpiece of software dedicated to data storing and crunching, and it does its work really well (if you do not detrimentally interfere, of course). The results of running the final refactoring of the script are presented in the following screenshot. Do not miss the evidence there that the whole data exchange over the wire took just a single round trip:
Wow, this optimization is the bomb! Version 3 shows roughly 17.6 times better performance than version 1. Now the take-home lesson is hard to forget.
I hope that after the bold performance results achieved in the previous section there is no need to convince you how important the ability to relay LINQ query execution to the remote party is. However, do not take this ability for granted: this direction may have a steep learning curve, as we will notice shortly.
Let's take as a use case the 100% real task I was recently addressing on the job at Jet.com Inc. (https://jet.com/about-us). I will be building a backend for a dashboard, showing some top-paid partners of Jet.com in real time (merchants with the largest amounts of reimbursement for fulfilled orders that they have shipped to Jet.com customers).
I will be accessing a limited amount of data from the Jet.com Quality Assurance environment, so the numbers will not be that indicative of real top-paid partners.
The data required for the dashboard backend are split between two databases: SQL.Colossus carries the data on payments in the Payments table, while SQL.IronmanData carries the data on partners in the Partner table.
If the data were located on a single SQL engine that supports cross-DB queries, then the T-SQL script bringing me the sought-for data would be something along the following lines (Ch9_2.fsx, commented section at the top):
select top (10) min(r.DisplayName) as Name, sum(p.[Amount]) as Total
from [sql.colossus].[dbo].[Payments] p
join [sql.ironmandata].[dbo].[Partner] r on r.MerchantId = p.MerchantId
where p.[IsDeposited] = 1
group by p.[MerchantId]
order by total desc
After being executed against the target environment in SQL Server Management Studio, this yields the results reflected in the following screenshot:
Let me make the first attempt at expressing a similar T-SQL query using the F# query expression query {...}. In order to get access to LINQ-to-SQL, I will be using a more advanced F# mechanism for strongly typed data access than the ADO.NET I used in the previous section. This mechanism is known as an F# type provider. Specifically, I will be using the SqlDataConnection (LINQ to SQL) type provider (http://fsharp.org/guides/data-access/#sql-data-access), which is part of the standard F# distribution that has been targeting Microsoft Windows since F# v3.0.
Those of you who are totally unfamiliar with the matter can follow this MSDN walkthrough (https://msdn.microsoft.com/visualfsharpdocs/conceptual/walkthrough-accessing-a-sql-database-by-using-type-providers-%5bfsharp%5d) in order to better understand the contents of this section.
The F# script that can be put at the core of the dashboard backend is as follows (Ch9_2.fsx):
#r "FSharp.Data.TypeProviders"
#r "System.Data"
#r "System.Data.Linq"

open Microsoft.FSharp.Data.TypeProviders
open System.Linq

[<Literal>]
let compileTimeCsusCS =
    @"Data Source=(localdb)\projectsv12;Initial Catalog=Colossus.DB;Integrated Security=SSPI"
let runTimeCsusCS =
    @"Data Source=***;Initial Catalog=SQL.Colossus;User ID=***;Password=***"

[<Literal>]
let compileTimeImCS =
    @"Data Source=(localdb)\projectsv12;Initial Catalog=SQL.Ironman;Integrated Security=SSPI"
let runTimeImCS =
    @"Data Source=***;Initial Catalog=SQL.IronmanData;User ID=***;Password=***"

type Colossus = SqlDataConnection<compileTimeCsusCS>
type IronManData = SqlDataConnection<compileTimeImCS>

let pmtContext = Colossus.GetDataContext(runTimeCsusCS)
let imContext = IronManData.GetDataContext(runTimeImCS)

let mostPaid = fun x ->
    query {
        for payment in pmtContext.Payments do
        where (payment.IsDeposited.HasValue && payment.IsDeposited.Value)
        groupBy payment.MerchantId into p
        let total = query { for payment in p do sumBy payment.Amount }
        sortByDescending total
        select (p.Key, total)
        take x }

let active = (mostPaid 10)
let activeIds = active |> Seq.map fst

let mostActiveNames =
    query {
        for merchant in imContext.Partner do
        where (activeIds.Contains(merchant.MerchantId))
        select (merchant.MerchantId, merchant.DisplayName) }
    |> dict

active
|> Seq.map (fun (id, total) -> (mostActiveNames.[id], total))
|> Seq.iter (fun x -> printfn "%s: %.2f" (fst x) (snd x))
Take into account that, in order to adhere to security requirements, I do not reveal any parameters of the Jet.com infrastructure except some (not necessarily coinciding with the real ones) names of databases and tables.
When it comes to the type provider, it is important to realize that the provider itself works at compile time, providing typed access to the fields of the involved SQL tables. In order to do this, it requires access to the SQL schema information at compile time. This access to the structural information in the preceding script is given via the compileTimeCsusCS and compileTimeImCS connection strings for the Colossus.DB and SQL.Ironman databases, respectively.
Note that compile-time access to the local SQL engine from the type provider has nothing to do with application data; it just retrieves system data about SQL schemas. These schemas are structurally similar to the ones carrying application data on the production SQL engine. So, the provided Colossus and IronManData types are built off the localdb SQL engine, while the pmtContext and imContext runtime data contexts are built off the production server(s) with the help of the runTimeCsusCS and runTimeImCS runtime connection strings.
The mostPaid function represents the query used to find any given number of top-paid partners along with their aggregate deposited payments. As we may expect, the signature of this function is mostPaid : x:int -> System.Linq.IQueryable<string * decimal>, and it is translated by the LINQ-to-SQL provider into plain T-SQL to be executed on the SQL server side.
Another interesting point is that in Jet.com's Microsoft Azure production environment, cross-database queries such as the one from the figure SQL Query to feed the top-paid Jet.com Partners dashboard do not work, so I split the access into three phases:
active represents a collection of tuples (merchantId, paymentAmount), from which the list of pertinent partner IDs, activeIds, can be easily projected
The second query, mostActiveNames, retrieves only those partner display names that belong to the top-paid partners and packages them into a dictionary
Finally, active undergoes a transformation where each ID is substituted by mostActiveNames.[id], yielding the final data shape required for the dashboard
The results of running the previous script with FSI are presented in the following screenshot; as expected, they are identical to the previous ones:
Wouldn't it be great to compose smaller F# subqueries into bigger queries? Put differently, it would mean composing multiple queryables into a single LINQ query translated into SQL and executed on the database engine side.
This sounds promising, and it has attracted the attention of some individual developers and groups.
The most sizeable effort was taken by a group at the University of Edinburgh, UK, governed by functional programming authorities such as Philip Wadler. Their results can be found at the FSharpComposableQuery (http://fsprojects.github.io/FSharp.Linq.ComposableQuery/index.html) project home page, offering a NuGet package, source code, a tutorial, and even some theoretical papers on the subject. An introductory video presentation given by Philip Wadler is available on the SkillsMatter website: A practical theory of language-integrated query (https://skillsmatter.com/skillscasts/4486-a-practical-theory-of-language-integrated-query).
Also, a few years ago, an alternative and more lightweight approach to composable queries was suggested in this blog post (http://fpish.net/blog/loic.denuziere/id/3508/2013924-f-query-expressions-and-composability) by Loïc Denuzière (http://fpish.net/profile/loic.denuziere). It is based on splicing partial F# query expressions together in order to build more complex ones. I will make a foray into composable LINQ queries based on the latter approach.
Just before we start writing the code, I must point out a big limitation of F# querying based on LINQ-to-SQL: it is not possible to perform cross-database and cross-engine queries, as all subqueries must share the same LINQ context! This factor might be a show-stopper for enterprises that have a multitude of OLTP and OLAP databases.
In order to refactor into a composable query the T-SQL query from the script Ch9_2.fsx covered above in the dashboard use case discussion, I have moved a copy of the Partner table into the SQL.Colossus database. It can now share the same LINQ context with the Payments table.
The composition method is based on introducing a special PartialQueryBuilder class that:

Inherits from Linq.QueryBuilder, introducing an extra Run method that returns the query as a quotation instead of evaluating it
Is complemented by a type extension of Linq.QueryBuilder with a Source method, which lets such quoted partial queries be spliced into ordinary ones

All these measures allow you to write subqueries with the alternative expression builder pquery; such subqueries get wrapped into quotations instead of being evaluated. These quotations are embedded into ordinary queries and get evaluated uniformly.
In the following script, which relies on these features, I've omitted the separation of compile-time and run-time connections for brevity (Ch9_3.fsx):
#r "FSharp.Data.TypeProviders"
#r "System.Data"
#r "System.Data.Linq"

open Microsoft.FSharp.Data.TypeProviders
open System.Linq

[<Literal>]
let runTimeCsusCS =
    @"Data Source=***;Initial Catalog=SQL.Colossus;User ID=***;Password=***"

type Colossus = SqlDataConnection<runTimeCsusCS>
let pmtContext = Colossus.GetDataContext(runTimeCsusCS)
Then goes the utility part defining pquery:
type PartialQueryBuilder() =
    inherit Linq.QueryBuilder()
    member __.Run(e: Quotations.Expr<Linq.QuerySource<'T, IQueryable>>) = e

let pquery = PartialQueryBuilder()

type Linq.QueryBuilder with
    [<ReflectedDefinition>]
    member __.Source(qs: Linq.QuerySource<'T, _>) = qs
Finally, the composed queries are as follows:
let mostPaid =
    pquery {
        for payment in pmtContext.Payments do
        where (payment.IsDeposited.HasValue && payment.IsDeposited.Value)
        groupBy payment.MerchantId into p
        let total = pquery { for payment in p do sumBy payment.Amount }
        sortByDescending total
        select (p.Key, total)
        take 10 }

let dashboard =
    pquery {
        for merchant in pmtContext.Partner do
        for (id, total) in %mostPaid do
        where (merchant.MerchantId = id)
        select (merchant.DisplayName, total) }

query { for m in %dashboard do select m }
|> Seq.iter (fun x -> printfn "%s: %.2f" (fst x) (snd x))
Note how mostPaid is spliced into dashboard, creating a seamless composition, and in turn, dashboard is spliced into the final query.
Running the script in FSI yields the results shown here:
You may wonder whether there is a way to verify that the query composition really took place. Fortunately, that is not hard to do. All it takes is setting the following property on the LINQ context, as shown here:
pmtContext.Payments.Context.Log <-
    new System.IO.StreamWriter(@"C:\users\gene\downloads\pmtlinq.log", AutoFlush = true)
After running the preceding script again, the LINQ log file now contains the SQL code executed by the SQL engine:
SELECT [t0].[DisplayName] AS [Item1], [t3].[value] AS [Item2]
FROM [dbo].[Partner] AS [t0]
CROSS JOIN (
    SELECT TOP (10) [t2].[MerchantId], [t2].[value]
    FROM (
        SELECT SUM([t1].[Amount]) AS [value], [t1].[MerchantId]
        FROM [dbo].[Payments] AS [t1]
        WHERE ([t1].[IsDeposited] IS NOT NULL) AND (([t1].[IsDeposited]) = 1)
        GROUP BY [t1].[MerchantId]
        ) AS [t2]
    ORDER BY [t2].[value] DESC
    ) AS [t3]
WHERE [t0].[MerchantId] = [t3].[MerchantId]
ORDER BY [t3].[value] DESC
-- Context: SqlProvider(Sql2008) Model: AttributedMetaModel Build: 4.0.30319.33440
Note how all the IQueryable bits and pieces from the script's F# queries get molded into a single SQL statement.