In the SQL Server suite, SQL Server Analysis Services (SSAS) has supported data mining since version 2000. SSAS includes some of the most popular algorithms together with very explanatory visualizations, and SSAS data mining is very simple to use. However, the number of algorithms is limited, and statistical analysis as a whole is missing from the SQL Server suite. By introducing R in SQL Server, Microsoft made a quantum leap forward in statistics, data mining, and machine learning.
Of course, the R language and engine have their own issues. For example, installing packages directly from code might not be in accordance with the security policies of an enterprise. In addition, most of the calculations are not scalable. Scalability might not be an issue for statistical and data mining analyses, because you typically work with samples. However, machine learning algorithms can consume huge amounts of data.
With SQL Server 2016, you get a highly scalable R engine. Not every function and algorithm has been rewritten as a scalable one; nevertheless, you will probably find the one you need for your analysis of a big dataset. You can store an R data mining or machine learning model in a SQL Server table and use it for predictions on new data. You can even store graphs in a binary column and use them in SQL Server Reporting Services (SSRS) reports. Finally, R support is not limited to SQL Server: you can also use R code in Power BI Desktop, the Power BI Service, and Azure Machine Learning (Azure ML) experiments.
Installing packages is not that simple, and must be done by a DBA. In SQL Server, you call R code through a stored procedure. This way, a DBA can apply all SQL Server security to R code as well. In addition, you need a SQL Server or Windows login to run the code that uses SQL Server R Services. This login must also have enough permissions on SQL Server objects: it needs access to the database where you run the R code, permissions to read SQL Server data, and, if you need to store the results in a SQL Server table, permissions to write data.
This section introduces R support in SQL Server, including calling R code from T-SQL, using the scalable functions, and storing and using models for predictions.
Microsoft provides the highly scalable R engine in two flavors: as a standalone Microsoft R Server product, and as SQL Server R Services, installed together with the database engine.
For development, you prepare a client installation. You can download Microsoft R Client from http://aka.ms/rclient/download. This installation includes the open R engine and the scalable packages as well. In addition to the engine, you probably want to also install a development IDE, either RStudio or R Tools for Visual Studio. Of course, you can also download and install the Developer Edition of SQL Server 2016 instead. This way, you get both the R runtime with the scalable packages and the database engine.
Some of the scalable packages shipped with SQL Server R Services include RevoScaleR and RevoPemaR.
The following figure shows how the communication process between SQL Server and R engine works:
Figure 13.6: The communication between SQL Server and R runtime
The components involved and their communications are as follows: you call R code through the sys.sp_execute_external_script system stored procedure, and SQL Server sends the request to the Launchpad service, a new service that supports the execution of external scripts.

Time to test the execution of an R script in SQL Server! First, you need to use the sys.sp_configure system stored procedure to enable external scripts. You can do this with the following code:
USE master;
EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure 'external scripts enabled', 1;
RECONFIGURE;
After that, you can call the sys.sp_execute_external_script system stored procedure. The most important parameters of this procedure include:

- @language: Currently limited to the value R
- @script: The actual script in the external language
- @input_data_1_name: The name of the data frame, as seen in the R code in the @script parameter, for the first input dataset; the default name is InputDataSet
- @input_data_1: The T-SQL query that specifies the first input dataset
- @output_data_1_name: The name of the R object, most probably a data frame, with the output dataset; the default name is OutputDataSet
- WITH RESULT SETS: The option where you specify the column names and data types of the output of the R script, as seen in SQL Server

In the following example, the R script called from SQL Server retrieves the list of installed packages:
EXECUTE sys.sp_execute_external_script
 @language = N'R',
 @script = N'str(OutputDataSet);
             packagematrix <- installed.packages();
             NameOnly <- packagematrix[,1];
             OutputDataSet <- as.data.frame(NameOnly);'
WITH RESULT SETS ( ( PackageName nvarchar(20) ) );
The shortened results are:
PackageName
--------------------
base
boot
class
...
RevoIOQ
revoIpe
RevoMods
RevoPemaR
RevoRpeConnector
RevoRsrConnector
RevoScaleR
RevoTreeView
RevoUtils
RevoUtilsMath
...
spatial
splines
stats
...
You can see that, besides some base packages, there is a set of packages whose names start with the string Revo, including the RevoScaleR and RevoPemaR packages. These are the two packages with the scalable functions, together with their associated packages.
You can use the scalable server resources from the client. You start development in RStudio or R Tools for Visual Studio by setting the execution context to the server. Of course, you must do it with a function from the RevoScaleR
package, which is loaded in memory at the beginning of the following R code. The code defines the execution context on SQL Server, in the context of the AdventureWorksDW2014
database. Remember, the dbo.vTargetMail
view comes from this database. Also note that the RUser
used to connect to SQL Server needs permission to use the sys.sp_execute_external_script
procedure. For the sake of simplicity, I just added the RUser
database user in the AdventureWorksDW2014
database on my local SQL Server instance to the db_owner
database role. The following code changes the execution context to SQL Server:
library(RevoScaleR);
sqlConnStr <- "Driver=SQL Server;Server=SQL2016EIM;Database=AdventureWorksDW2014;Uid=RUser;Pwd=Pa$$w0rd";
sqlShare <- "C:\\SQL2016DevGuide";
chunkSize <- 1000;
srvEx <- RxInSqlServer(connectionString = sqlConnStr,
                       shareDir = sqlShare,
                       wait = TRUE, consoleOutput = FALSE);
rxSetComputeContext(srvEx);
The parameters define the connection string to my SQL Server instance, the shared folder used to exchange data between SQL Server and the R engine, and the chunk size, which is used later when reading the data. Please note that you need to change the server name to that of your own SQL Server instance. The RxInSqlServer() function creates the compute context object for SQL Server. The wait parameter defines whether the execution in SQL Server is blocking, so that control does not return to the client until the execution finishes, or non-blocking. The consoleOutput parameter defines whether the output of the R code started by SQL Server should be returned to the user console. Finally, the rxSetComputeContext() function actually sets the execution context to SQL Server.
After the execution context has been set to SQL Server, you can try to use other scalable functions. For example, rxImport() can be used to import data from a comma-separated values (CSV) file into a data frame. The rowsPerRead parameter makes the function read the data in batches, using the chunk size defined earlier in the code. The batch size of 1,000 rows is quite small, just to show how this import in chunks works. For larger datasets, you should use much larger batches. You should test which size works best for your datasets and the processing power you have:
TMCSV <- rxImport(inData = "C:\\SQL2016DevGuide\\Chapter13_TM.csv",
                  stringsAsFactors = TRUE, type = "auto",
                  rowsPerRead = chunkSize, reportProgress = 3);
The reportProgress
parameter defines a detailed output. The abbreviated result of the previous code is here:
ReadNum=1, StartRowNum=1, CurrentNumRows=1000, TotalRowsProcessed=1000, ReadTime=0.01, ProcessDataTime = 0, LoopTime = 0.01
ReadNum=2, StartRowNum=1001, CurrentNumRows=1000, TotalRowsProcessed=2000, ReadTime=0.007, ProcessDataTime = 0.002, LoopTime = 0.007
...
Overall compute summaries time: 0.133 secs.
Total loop time: 0.132
Total read time for 19 reads: 0.115
Total process data time: 0.041
Average read time per read: 0.00605263
Average process data time per read: 0.00215789
Number of threads used: 2
You can see that the chunk size was really 1,000 rows, how much time was needed for each chunk, the total time, the number of threads used, and more. This confirms that RevoScaleR
functions use parallelism.
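The idea of reading in chunks is simple to illustrate outside of R. The following Python sketch (illustrative only, not part of the SQL Server toolchain) shows how a reader can yield fixed-size batches of rows, which is essentially what the rowsPerRead parameter controls:

```python
import csv
import io

def read_in_chunks(reader, chunk_size):
    """Yield successive batches of rows, mimicking rowsPerRead."""
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:           # final, possibly partial, batch
        yield chunk

# Simulate a small CSV source with 25 rows and a chunk size of 10.
data = "\n".join(f"{i},value{i}" for i in range(25))
reader = csv.reader(io.StringIO(data))

chunk_sizes = [len(chunk) for chunk in read_in_chunks(reader, 10)]
print(chunk_sizes)  # two full batches and one final partial batch: [10, 10, 5]
```

With 25 rows and a chunk size of 10, the reader produces three batches; in the same way, rxImport() reported 19 reads for the 18,484-row dataset with a chunk size of 1,000.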
The next code reads the same data again, this time from SQL Server. Note that an ODBC connection is not needed; the code is already executed on the server side in the context of the AdventureWorksDW2014
database. The RxSqlServerData()
function generates a SQL Server data source object. You can think of it as a proxy object to the SQL Server rowset, which is the result of the query:
TMquery <-
"SELECT CustomerKey, MaritalStatus, Gender,
  TotalChildren, NumberChildrenAtHome,
  EnglishEducation AS Education,
  EnglishOccupation AS Occupation,
  HouseOwnerFlag, NumberCarsOwned, CommuteDistance,
  Region, BikeBuyer, YearlyIncome, Age
FROM dbo.vTargetMail";
sqlTM <- RxSqlServerData(sqlQuery = TMquery,
                         connectionString = sqlConnStr,
                         stringsAsFactors = TRUE,
                         rowsPerRead = chunkSize);
TMSQL <- rxImport(inData = sqlTM, reportProgress = 3);
The sqlTM
object is the pointer to the SQL Server data, and exposes the metadata of the result set of the query to the client R code. Note that the last line creates a new data frame and physically transfers data to the client. Therefore, if you executed the code in this section step-by-step, you should have two data frames—TMCSV
and TMSQL
—with data in local client memory, and the sqlTM
data source connection, which you can use like a data frame. You can see the difference between these objects if you try to get information about them with the rxGetInfo() function:
rxGetInfo(TMCSV);
rxGetInfo(sqlTM);
The previous code returns the following result:
Data frame: TMCSV
Number of observations: 18484
Number of variables: 14

Connection string: Driver=SQL Server;Server=localhost;Database=AdventureWorksDW2014;Uid=RUser;Pwd=Pa$$w0rd
Data Source: SQLSERVER
You get the details about the metadata of the SQL data source connection object with the rxGetVarInfo()
function. You can get summary statistics and different cross-tabulations of the SQL Server data with the rxSummary()
, rxCrossTabs()
, and rxCube()
functions. You can create histograms with the rxHistogram()
function. All these functions use the SQL Server execution context. The following code shows how to use the functions mentioned:
sumOut <- rxSummary(formula = ~ NumberCarsOwned + Occupation + F(BikeBuyer),
                    data = sqlTM);
sumOut;

cTabs <- rxCrossTabs(formula = BikeBuyer ~ Occupation : F(HouseOwnerFlag),
                     data = sqlTM);
print(cTabs, output = "counts");
print(cTabs, output = "sums");
print(cTabs, output = "means");
summary(cTabs, output = "sums");
summary(cTabs, output = "counts");
summary(cTabs, output = "means");

cCube <- rxCube(formula = BikeBuyer ~ Occupation : F(HouseOwnerFlag),
                data = sqlTM);
cCube;

rxHistogram(formula = ~ BikeBuyer | MaritalStatus,
            data = sqlTM);
Note that all of these scalable functions accept data from the SQL Server data source connection object. Because they execute on SQL Server, you cannot feed them with the local data frames; if you used a local data frame, you would get an error. If you want to use the scalable functions with local datasets, you need to switch the execution context back to local.
The following code shows how to set the execution context back to the client machine:
rxSetComputeContext("local");
The RevoScaleR
package includes a function to calculate clusters of similar cases based on the values of the input variables. It uses the K-means clustering algorithm. The rxKmeans()
function in the following code uses a local data frame. It defines two clusters and then assigns each case to one of the clusters. The summary()
function gives you the details of the clustering model:
TwoClust <- rxKmeans(formula = ~ BikeBuyer + TotalChildren + NumberCarsOwned,
                     data = TMSQL, numClusters = 2);
summary(TwoClust);
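K-means itself is simple: it alternates between assigning each case to the nearest centroid and moving each centroid to the mean of its members. The following minimal pure-Python sketch of that loop on one-dimensional data is for illustration only; rxKmeans() implements a scalable version of the same idea over the input variables:

```python
def kmeans(points, k, iterations=20):
    # Seed centroids with the k smallest distinct values
    # (a simple, deterministic choice for this sketch).
    centroids = sorted(set(points))[:k]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two well-separated groups of values, for example "low" and "high" counts.
points = [0, 0, 1, 1, 2, 8, 9, 9, 10, 10]
print(kmeans(points, 2))  # centroids settle near the two groups: [0.8, 9.2]
```

The two centroids converge to the means of the two groups, which is exactly the cluster summary that summary(TwoClust) reports for the SQL Server data.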
You can add the cluster membership to the original data frame and rename the variable to a friendlier name:
TMClust <- cbind(TMSQL, TwoClust$cluster);
names(TMClust)[15] <- "ClusterID";
In order to understand the meaning of the clusters, you need to analyze them. The following code creates a nice graph that consists of three individual small graphs showing the distribution of each input variable in each cluster:
attach(TMClust);
oldpar <- par(no.readonly = TRUE);
par(mfrow = c(1, 3));

# NumberCarsOwned and clusters
nofcases <- table(NumberCarsOwned, ClusterID);
nofcases;
barplot(nofcases,
        main = 'Number of cars owned and cluster ID',
        xlab = 'Cluster Id', ylab = 'Number of Cars',
        legend = rownames(nofcases),
        col = c("black", "blue", "red", "orange", "yellow"),
        beside = TRUE);

# BikeBuyer and clusters
nofcases <- table(BikeBuyer, ClusterID);
nofcases;
barplot(nofcases,
        main = 'Bike buyer and cluster ID',
        xlab = 'Cluster Id', ylab = 'BikeBuyer',
        legend = rownames(nofcases),
        col = c("blue", "yellow"),
        beside = TRUE);

# TotalChildren and clusters
nofcases <- table(TotalChildren, ClusterID);
nofcases;
barplot(nofcases,
        main = 'Total children and cluster ID',
        xlab = 'Cluster Id', ylab = 'Total Children',
        legend = rownames(nofcases),
        col = c("black", "blue", "green", "red", "orange", "yellow"),
        beside = TRUE);

# Clean up
par(oldpar);
detach(TMClust);
You should already be familiar with this code from the examples earlier in this chapter. The next screenshot shows the results:
Figure 13.7: The analysis of the clusters
Once you have created a model, you can deploy it to a SQL Server table and use it later for predictions. You can also do the predictions in R and store just the results in a SQL Server table.
Let's start by creating another model in R. This time, the model uses the Logistic Regression algorithm. The model uses the SQL Server data and the dbo.vTargetMail
view to learn how the values of the NumberCarsOwned
, TotalChildren
, Age
, and YearlyIncome
input variables influence the value of the BikeBuyer
target variable. The following code sets the execution context back to SQL Server, creates the model with the RevoScaleR rxLogit() function, and shows the summary of the model:
rxSetComputeContext(srvEx);
bbLogR <- rxLogit(BikeBuyer ~ NumberCarsOwned + TotalChildren + Age + YearlyIncome,
                  data = sqlTM);
summary(bbLogR);
You can use the model to perform predictions. In the following example, the model is used to make predictions on the same dataset that was used for training the model. In a real-life situation, you would perform predictions on a new dataset. The code stores the predictions, together with the input values used, in a SQL Server table. The RxSqlServerData()
function prepares the connection to the SQL Server database and the target table name. The rxPredict()
function performs the predictions and physically creates the SQL Server table and inserts the data. Of course, the database user used to connect to SQL Server must have appropriate permissions to create a table:
bbLogRPredict <- RxSqlServerData(connectionString = sqlConnStr,
                                 table = "dbo.TargetMailLogR");
rxPredict(modelObject = bbLogR, data = sqlTM, outData = bbLogRPredict,
          predVarNames = "BikeBuyerPredict", type = "response",
          writeModelVars = TRUE);
You can use a T-SQL query to check the results, as shown here:
USE AdventureWorksDW2014;
SELECT *
FROM dbo.TargetMailLogR;
The partial results are shown here. Values above 0.5 mean positive predictions:
BikeBuyerPredict   BikeBuyer NumberCarsOwned TotalChildren Age  YearlyIncome
------------------ --------- --------------- ------------- ---- ------------
0.733910292274223  1         0               2             44   90000
0.540550204813772  1         1               3             40   60000
0.529196837245225  1         1               3             45   60000
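For a response-type prediction, a logistic regression model applies the logistic (sigmoid) function to the linear combination of the inputs and the fitted coefficients, which is why the output falls between 0 and 1 and why 0.5 is the natural cutoff. The following Python sketch uses made-up coefficients (the real ones come from the summary(bbLogR) output) just to show the mechanics:

```python
import math

def predict_logit(coef, intercept, x):
    """Response-scale prediction of a logistic regression model:
    apply the sigmoid to the linear combination of the inputs."""
    linear = intercept + sum(b * v for b, v in zip(coef, x))
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients for
# (NumberCarsOwned, TotalChildren, Age, YearlyIncome).
coef = [-0.18, 0.05, -0.02, 0.00001]
intercept = 0.9

# The same input values as the first row of the batch predictions.
p = predict_logit(coef, intercept, [0, 2, 44, 90000])
print(p > 0.5)  # True means a positive (bike buyer) prediction
```

Whatever the coefficients, the sigmoid maps the linear score to the (0, 1) interval, so comparing the result to 0.5 is equivalent to checking the sign of the linear score.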
As mentioned, you can store a model in a SQL Server table. The following T-SQL code creates a table where the models are going to be stored and a stored procedure that actually inserts a model:
CREATE TABLE dbo.RModels
(Id INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
 ModelName NVARCHAR(50) NOT NULL,
 Model VARBINARY(MAX) NOT NULL);
GO
CREATE PROCEDURE dbo.InsertModel
(@modelname NVARCHAR(50),
 @model NVARCHAR(MAX))
AS
BEGIN
 SET NOCOUNT ON;
 INSERT INTO dbo.RModels (ModelName, Model)
 VALUES (@modelname, CONVERT(VARBINARY(MAX), @model, 2));
END;
The infrastructure is created, and now you can store the model in a SQL Server table from R. However, in order to call a stored procedure, you need an ODBC connection. Therefore, the following code first loads the RODBC library into memory and creates a connection to the SQL Server database. Then it serializes the model to a binary variable and creates a string from the binary variable using the paste() function. The same function is used to prepare a string with the T-SQL command that inserts the model into the table. Finally, the sqlQuery() function sends the T-SQL command to SQL Server. Again, the R user executing this code must have permission to execute the stored procedure:
library(RODBC);
conn <- odbcDriverConnect(sqlConnStr);
modelbin <- serialize(bbLogR, NULL);
modelbinstr <- paste(modelbin, collapse = "");
sqlQ <- paste("EXEC dbo.InsertModel @modelname='bbLogR', @model='",
              modelbinstr, "'", sep = "");
sqlQuery(conn, sqlQ);
close(conn);
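The trick in this code is the representation change: the serialized model is a sequence of raw bytes, which is flattened into a hexadecimal string so that it can travel inside a T-SQL command, and CONVERT(VARBINARY(MAX), @model, 2) turns that hex string back into binary on the SQL Server side. The following Python sketch, with pickle standing in for R's serialize(), illustrates the same round trip:

```python
import pickle

# A stand-in "model" object; in the chapter this is the rxLogit model.
model = {"intercept": 0.9, "coef": [-0.18, 0.05, -0.02, 0.00001]}

# Serialize to bytes, then encode as a hex string, the equivalent of
# serialize() plus paste(collapse = "") in the R code. This is the
# string that CONVERT(VARBINARY(MAX), @model, 2) decodes server-side.
modelbin = pickle.dumps(model)
modelbinstr = modelbin.hex()

# Round trip: decode the hex string back to bytes and deserialize,
# mirroring unserialize(as.raw(model)) in the prediction step.
restored = pickle.loads(bytes.fromhex(modelbinstr))
print(restored == model)  # the model survives the round trip: True
```

Because the hex string contains only the characters 0-9 and a-f, it can be embedded safely inside the quoted T-SQL literal without any escaping.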
The final step is to use the model from T-SQL. The following code uses the sys.sp_execute_external_script
system procedure to use the model and perform a prediction on a single case. First, it creates an input dataset that consists of a single row with four input variables, and retrieves the stored model. The R code is then executed: it unserializes the model and uses the rxPredict()
function again to generate the output dataset, which includes the input variables and the prediction:
DECLARE @input AS NVARCHAR(MAX)
SET @input = N'
 SELECT *
 FROM (VALUES (0, 2, 44, 90000)) AS
      inpQ(NumberCarsOwned, TotalChildren, Age, YearlyIncome);'
DECLARE @mod VARBINARY(MAX) =
 (SELECT Model
  FROM dbo.RModels
  WHERE ModelName = N'bbLogR');
EXEC sys.sp_execute_external_script
 @language = N'R',
 @script = N'
  mod <- unserialize(as.raw(model));
  OutputDataSet <- rxPredict(modelObject = mod, data = InputDataSet,
                             outData = NULL,
                             predVarNames = "BikeBuyerPredict",
                             type = "response", checkFactorLevels = FALSE,
                             writeModelVars = TRUE, overwrite = TRUE);
 ',
 @input_data_1 = @input,
 @params = N'@model VARBINARY(MAX)',
 @model = @mod
WITH RESULT SETS ((
 BikeBuyerPredict FLOAT,
 NumberCarsOwned INT,
 TotalChildren INT,
 Age INT,
 YearlyIncome FLOAT));
The results are:
BikeBuyerPredict       NumberCarsOwned TotalChildren Age  YearlyIncome
---------------------- --------------- ------------- ---- -------------
0.733910292274223      0               2             44   90000
The input values used were the same as the values in the first row of the batch predictions from the dbo.vTargetMail
view used in the previous prediction example. You can see that the predicted value is also the same.