SQL Server R Services

In the SQL Server suite, SQL Server Analysis Services (SSAS) has supported data mining since version 2000. SSAS includes some of the most popular algorithms together with clear, explanatory visualizations, and SSAS data mining is simple to use. However, the number of algorithms is limited, and statistical analysis is missing from the SQL Server suite altogether. By introducing R in SQL Server, Microsoft made a quantum leap forward in statistics, data mining, and machine learning.

Of course, the R language and engine have their own issues. For example, installing packages directly from code might not be in accordance with the security policies of an enterprise. In addition, most of the calculations are not scalable. Scalability might not be an issue for statistical and data mining analyses, because you typically work with samples. However, machine learning algorithms can consume huge amounts of data.

With SQL Server 2016, you get a highly scalable R engine. Not every function and algorithm has been rewritten as a scalable one; nevertheless, you will probably find the ones you need for analyzing a big dataset. You can store an R data mining or machine learning model in a SQL Server table and use it for predictions on new data. You can even store graphs in a binary column and use them in SQL Server Reporting Services (SSRS) reports. Finally, R support is not limited to SQL Server only; you can also use R code in Power BI Desktop, the Power BI Service, and Azure Machine Learning (Azure ML) experiments.

Installing packages is not that simple, and must be done by a DBA. In SQL Server, you call R code through a stored procedure. This way, a DBA can apply all SQL Server security to the R code as well. In addition, you need a SQL Server login or a Windows login to run the code that uses SQL Server R Services. This login must also have enough permissions on SQL Server objects: it needs access to the database where you run the R code, permissions to read SQL Server data, and, if you need to store the results in a SQL Server table, permissions to write data.
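
As an illustration only, the following T-SQL sketch shows the kind of permissions such a login typically needs. The RUser database user and the AdventureWorksDW2014 database are the ones used later in this section; adjust the statements to your own environment and security policy:

USE AdventureWorksDW2014;
-- Allow the user to run external (R) scripts in this database
GRANT EXECUTE ANY EXTERNAL SCRIPT TO RUser;
-- Allow the user to read the source data used by the R code
GRANT SELECT ON dbo.vTargetMail TO RUser;
-- Grant INSERT or CREATE TABLE permissions only if the R code writes results back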

This section introduces R support in SQL Server, including:

  • Architecture
  • Using R code in T-SQL
  • Scalable solutions
  • Security
  • Deploying R models in SQL Server

Discovering SQL Server R Services

Microsoft provides the highly scalable R engine in two flavors:

  • R Services (In-Database): This is the installation that integrates R into SQL Server. It includes a database service that runs outside the SQL Server Database Engine and provides a communication channel between the Database Engine and the R runtime. You install it with the SQL Server setup. The R engine includes the open source R components plus a set of scalable R packages.
  • Microsoft R Server: This is a standalone R server with the same open source and scalable packages; it runs on multiple platforms.

For development, you prepare a client installation. You can download Microsoft R Client from http://aka.ms/rclient/download. This installation includes the open R engine and the scalable packages as well. In addition to the engine, you will probably also want to install a development IDE, either RStudio or R Tools for Visual Studio. Of course, you can also download and install the Developer Edition of SQL Server 2016 instead; this way, you get both the R runtime with the scalable packages and the Database Engine.
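
After the client installation, you can quickly check from R that the scalable packages are in place. This is just a sketch using standard R functions:

# Check whether the RevoScaleR package is available and show its version
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  print(packageVersion("RevoScaleR"));
} else {
  message("RevoScaleR not found - check the Microsoft R Client installation.");
}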

Some of the scalable packages shipped with SQL Server R Services are:

  • RevoScaleR: This is a set of parallelized, scalable R functions for processing data, data overview and preliminary analysis, and machine learning models. The functions in this package work with chunks of data at a time, so they don't need to load all of the data into memory at once.
  • RevoPemaR: This package allows you to write custom parallel external algorithms.
  • MicrosoftML: This is a newer package, released in December 2016, that implements many additional scalable machine learning algorithms.

The following figure shows how the communication process between SQL Server and R engine works:


Figure 13.6: The communication between SQL Server and R runtime

The components involved and their communications are as follows:

  • In SQL Server Database Engine, you run R script with the sys.sp_execute_external_script system stored procedure. SQL Server sends the request to the Launchpad service, a new service that supports the execution of external scripts.
  • The Launchpad service starts the launcher appropriate for the language of your script. Currently, the only launcher available is the RLauncher, and therefore you can launch an external script from SQL Server using the R language only. However, you can see that the infrastructure is prepared to enable the execution of scripts in additional programming languages.
  • The RLauncher starts RTerm, the R terminal application for executing R scripts.
  • The RTerm sends the script to BxlServer. This is a new executable used for communication between SQL Server and the R engine. The scalable R functions are implemented in this executable as well.
  • The BxlServer uses SQL Satellite, a new extensibility API that provides a fast data transfer between SQL Server and external runtime. Again, currently only R runtime is supported.

Time to test the execution of R script in SQL Server! First, you need to use the sys.sp_configure system stored procedure to enable external scripts. You can do this with the following code:

USE master;
EXEC sys.sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure 'external scripts enabled', 1;
RECONFIGURE;
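
You can verify the configuration with the following query. Note that, depending on your SQL Server 2016 build, a restart of the SQL Server service might be needed before the new value takes effect:

SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = N'external scripts enabled';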

After that, you can call the sys.sp_execute_external_script system stored procedure. The most important parameters of this procedure include:

  • @language: Currently limited to the value R
  • @script: The actual script in the external language
  • @input_data_1_name: The name of the data frame, as seen in the R code in the @script parameter for the first input dataset; the default name is InputDataSet
  • @input_data_1: The T-SQL query that specifies the first input dataset
  • @output_data_1_name: The name of the R object, most probably a data frame, with the output dataset; the default name is OutputDataSet
  • WITH RESULT SETS: The option where you specify the column names and data types of the output of the R script, as seen in SQL Server

In the following example, the R script called from SQL Server retrieves the list of installed packages:

EXECUTE sys.sp_execute_external_script 
 @language=N'R', 
 @script =  
 N'str(OutputDataSet); 
   packagematrix <- installed.packages(); 
   NameOnly <- packagematrix[,1]; 
   OutputDataSet <- as.data.frame(NameOnly);' 
WITH RESULT SETS ( ( PackageName nvarchar(20) ) ); 

The shortened results are:

PackageName
--------------------
base
boot
class
...
RevoIOQ
revoIpe
RevoMods
RevoPemaR
RevoRpeConnector
RevoRsrConnector
RevoScaleR
RevoTreeView
RevoUtils
RevoUtilsMath
...
spatial
splines
stats
...

You can see that, besides some base packages, there is a set of packages whose names start with the string Revo. These include the RevoScaleR and RevoPemaR packages with the scalable functions, together with their supporting packages.
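
Before moving on, here is one more minimal sketch that shows how the @input_data_1 parameter and the data set names work together. The Id and Value column names are made up just for this illustration:

EXECUTE sys.sp_execute_external_script
 @language = N'R',
 @script = N'
   # OutData gets the input data frame plus one computed column
   OutData <- InData;
   OutData$ValueDoubled <- OutData$Value * 2;',
 @input_data_1 = N'SELECT 1 AS Id, 10 AS Value UNION ALL SELECT 2, 20;',
 @input_data_1_name = N'InData',
 @output_data_1_name = N'OutData'
WITH RESULT SETS ((Id INT, Value INT, ValueDoubled INT));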

Creating scalable solutions

You can use the scalable server resources from the client. You start development in RStudio or R Tools for Visual Studio by setting the execution context to the server. Of course, you must do it with a function from the RevoScaleR package, which is loaded in memory at the beginning of the following R code. The code defines the execution context on SQL Server, in the context of the AdventureWorksDW2014 database. Remember, the dbo.vTargetMail view comes from this database. Also note that the RUser used to connect to SQL Server needs permission to use the sys.sp_execute_external_script procedure. For the sake of simplicity, I just added the RUser database user in the AdventureWorksDW2014 database on my local SQL Server instance to the db_owner database role. The following code changes the execution context to SQL Server:

library(RevoScaleR);
sqlConnStr <- "Driver=SQL Server;Server=SQL2016EIM;Database=AdventureWorksDW2014;Uid=RUser;Pwd=Pa$$w0rd";
sqlShare <- "C:\\SQL2016DevGuide";
chunkSize <- 1000;
srvEx <- RxInSqlServer(connectionString = sqlConnStr, shareDir = sqlShare,
                       wait = TRUE, consoleOutput = FALSE);
rxSetComputeContext(srvEx);

The parameters define the connection string to my SQL Server instance, the shared folder used to exchange data between SQL Server and the R engine, and the chunk size, which is used later when reading the data. Please note that you need to change the server name to the name of your SQL Server instance. The RxInSqlServer() function creates the compute context object for SQL Server. The wait parameter defines whether the execution in SQL Server is blocking, meaning that control does not return to the client until the execution is finished, or non-blocking. The consoleOutput parameter defines whether the output of the R code started by SQL Server should be returned to the user console. The rxSetComputeContext() function then actually sets the execution context to SQL Server.

After the execution context has been set to SQL Server, you can try other scalable functions. For example, rxImport() can be used to import data from a comma-separated values file into a data frame. The rowsPerRead parameter makes the import read the data in batches, using the chunk size defined earlier in the code. The batch size of 1,000 rows is quite small, just to show how importing in chunks works. For larger datasets, you should use much larger batches; test what size works best for your datasets and the processing power you have:

TMCSV <- rxImport(inData = "C:\\SQL2016DevGuide\\Chapter13_TM.csv",
                  stringsAsFactors = TRUE, type = "auto",
                  rowsPerRead = chunkSize, reportProgress = 3);

The reportProgress parameter set to 3 requests detailed output. The abbreviated result of the previous code is shown here:

ReadNum=1, StartRowNum=1, CurrentNumRows=1000, TotalRowsProcessed=1000, ReadTime=0.01, ProcessDataTime = 0, LoopTime = 0.01
ReadNum=2, StartRowNum=1001, CurrentNumRows=1000, TotalRowsProcessed=2000, ReadTime=0.007, ProcessDataTime = 0.002, LoopTime = 0.007
...
Overall compute summaries time: 0.133 secs.
Total loop time: 0.132
Total read time for 19 reads: 0.115
Total process data time: 0.041
Average read time per read: 0.00605263
Average process data time per read: 0.00215789
Number of threads used: 2

You can see that the chunk size was really 1,000 rows, how much time was needed for each chunk, the total time, the number of threads used, and more. This confirms that RevoScaleR functions use parallelism.
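
If you want to check or influence how many cores the scalable functions may use, RevoScaleR exposes this as an option. A minimal sketch (the value 4 is just an example):

rxGetOption("numCoresToUse");   # show the current setting
rxOptions(numCoresToUse = 4);   # allow up to four cores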

The next code reads the same data again, this time from SQL Server. Note that an ODBC connection is not needed; the code is already executed on the server side in the context of the AdventureWorksDW2014 database. The RxSqlServerData() function generates a SQL Server data source object. You can think of it as a proxy object to the SQL Server rowset, which is the result of the query:

TMquery <-  
"SELECT CustomerKey, MaritalStatus, Gender, 
  TotalChildren, NumberChildrenAtHome, 
  EnglishEducation AS Education, 
  EnglishOccupation AS Occupation, 
  HouseOwnerFlag, NumberCarsOwned, CommuteDistance, 
  Region, BikeBuyer, 
  YearlyIncome, Age 
 FROM dbo.vTargetMail"; 
sqlTM <- RxSqlServerData(sqlQuery = TMquery, 
                         connectionString = sqlConnStr, 
                         stringsAsFactors = TRUE, 
                         rowsPerRead = chunkSize); 
TMSQL <- rxImport(inData = sqlTM, reportProgress = 3); 

The sqlTM object is the pointer to the SQL Server data, and exposes the metadata of the result set of the query to the client R code. Note that the last line creates a new data frame and physically transfers data to the client. Therefore, if you executed the code in this section step-by-step, you should have two data frames—TMCSV and TMSQL—with data in local client memory, and the sqlTM data source connection, which you can use as a data frame. You can see the difference if you try to get the info about all three objects with the rxGetInfo() function:

rxGetInfo(TMCSV);
rxGetInfo(TMSQL);
rxGetInfo(sqlTM);

The previous code returns the following abbreviated results:

Data frame: TMCSV 
Number of observations: 18484 
Number of variables: 14 
Connection string: Driver=SQL Server;Server=localhost;
 Database=AdventureWorksDW2014;Uid=RUser;Pwd=Pa$$w0rd 
Data Source: SQLSERVER 

You get the details about the metadata of the SQL data source connection object with the rxGetVarInfo() function. You can get summary statistics and different cross-tabulations of the SQL Server data with the rxSummary(), rxCrossTabs(), and rxCube() functions. You can create histograms with the rxHistogram() function. All these functions use the SQL Server execution context. The following code shows how to use the functions mentioned:

sumOut <- rxSummary( 
  formula = ~ NumberCarsOwned + Occupation + F(BikeBuyer), 
  data = sqlTM); 
sumOut; 
cTabs <- rxCrossTabs(formula = BikeBuyer ~ 
                     Occupation : F(HouseOwnerFlag),  
                     data = sqlTM); 
print(cTabs, output = "counts"); 
print(cTabs, output = "sums"); 
print(cTabs, output = "means"); 
summary(cTabs, output = "sums"); 
summary(cTabs, output = "counts"); 
summary(cTabs, output = "means"); 
cCube <- rxCube(formula = BikeBuyer ~ 
                Occupation : F(HouseOwnerFlag),  
                data = sqlTM); 
cCube; 
rxHistogram(formula = ~ BikeBuyer | MaritalStatus, 
            data = sqlTM); 
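
As a complement to the code above, this is how you inspect the variable metadata of the sqlTM data source object with the rxGetVarInfo() function mentioned earlier:

rxGetVarInfo(sqlTM);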

Note that all of these scalable functions accept data from the SQL Server data source connection object. Because they execute on SQL Server, you cannot feed them with local data frames; if you used a local data frame, you would get an error. If you want to use the scalable functions with local datasets, you need to switch the execution context back to local.

The following code shows how to set the execution context back to the client machine:

rxSetComputeContext("local"); 

The RevoScaleR package includes a function to calculate clusters of similar cases based on the values of the input variables. It uses the K-means clustering algorithm. The rxKmeans() function in the following code uses a local data frame. It defines two clusters and then assigns each case to one of the clusters. The summary() function gives you the details of the clustering model:

TwoClust <- rxKmeans(formula = ~ BikeBuyer + TotalChildren + NumberCarsOwned, 
                     data = TMSQL, 
                     numClusters = 2); 
summary(TwoClust); 

You can add the cluster membership to the original data frame and rename the variable to a friendlier name:

TMClust <- cbind(TMSQL, TwoClust$cluster); 
names(TMClust)[15] <- "ClusterID"; 

In order to understand the meaning of the clusters, you need to analyze them. The following code creates a nice graph that consists of three individual small graphs showing the distribution of each input variable in each cluster:

attach(TMClust); 
oldpar <- par(no.readonly = TRUE); 
par(mfrow=c(1,3)); 
 
# NumberCarsOwned and clusters 
nofcases <- table(NumberCarsOwned, ClusterID); 
nofcases; 
barplot(nofcases, 
        main='Number of cars owned and cluster ID',     
        xlab='Cluster Id', ylab ='Number of Cars', 
        legend=rownames(nofcases), 
        col=c("black", "blue", "red", "orange", "yellow"), 
        beside=TRUE); 
# BikeBuyer and clusters 
nofcases <- table(BikeBuyer, ClusterID); 
nofcases; 
barplot(nofcases, 
        main='Bike buyer and cluster ID',     
        xlab='Cluster Id', ylab ='BikeBuyer', 
        legend=rownames(nofcases), 
        col=c("blue", "yellow"), 
        beside=TRUE); 
# TotalChildren and clusters 
nofcases <- table(TotalChildren, ClusterID); 
nofcases; 
barplot(nofcases, 
        main='Total children and cluster ID',     
        xlab='Cluster Id', ylab ='Total Children', 
        legend=rownames(nofcases), 
        col=c("black", "blue", "green", "red", "orange", "yellow"), 
        beside=TRUE); 
 
# Clean up 
par(oldpar); 
detach(TMClust); 

You should already be familiar with this code from the examples earlier in this chapter. The next screenshot shows the results:


Figure 13.7: The analysis of the clusters

Deploying R models

Once you have created a model, you can deploy it to a SQL Server table and use it later for predictions. You can also do the predictions in R and store just the results in a SQL Server table.

Let's start by creating another model in R. This time, the model uses the logistic regression algorithm. The model uses the SQL Server data from the dbo.vTargetMail view to learn how the values of the NumberCarsOwned, TotalChildren, Age, and YearlyIncome input variables influence the value of the BikeBuyer target variable. The following code sets the execution context back to SQL Server, creates the model with the RevoScaleR rxLogit() function, and shows the summary of the model:

rxSetComputeContext(srvEx); 
bbLogR <- rxLogit(BikeBuyer ~ 
          NumberCarsOwned + TotalChildren + Age + YearlyIncome, 
          data = sqlTM); 
summary(bbLogR); 

You can use the model to perform predictions. In the following example, the model is used to make predictions on the same dataset that was used for training the model. In a real-life situation, you would perform predictions on a new dataset. The code stores the predictions, together with the input values used, in a SQL Server table. The RxSqlServerData() function prepares the connection to the SQL Server database and the target table name. The rxPredict() function performs the predictions and physically creates the SQL Server table and inserts the data. Of course, the database user used to connect to SQL Server must have appropriate permissions to create a table:

bbLogRPredict <- RxSqlServerData(connectionString = sqlConnStr, 
                                 table = "dbo.TargetMailLogR"); 
rxPredict(modelObject = bbLogR, 
          data = sqlTM, outData = bbLogRPredict, 
          predVarNames = "BikeBuyerPredict",  
          type = "response", writeModelVars = TRUE); 

You can use a T-SQL query to check the results, as shown here:

USE AdventureWorksDW2014; 
SELECT *  
FROM dbo.TargetMailLogR; 

The partial results are shown here. Values above 0.5 mean positive predictions:

BikeBuyerPredict  BikeBuyer NumberCarsOwned TotalChildren Age  YearlyIncome
----------------  --------- --------------- ------------- ---- ------------
0.733910292274223  1          0               2             44   90000
0.540550204813772  1          1               3             40   60000
0.529196837245225  1          1               3             45   60000
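
If you prefer a hard 0/1 prediction instead of the predicted probability, a simple T-SQL sketch like the following applies the 0.5 threshold mentioned above; the BikeBuyerPredictedFlag alias is just an example name:

SELECT *,
       CASE WHEN BikeBuyerPredict > 0.5 THEN 1 ELSE 0 END AS BikeBuyerPredictedFlag
FROM dbo.TargetMailLogR;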

As mentioned, you can store a model in a SQL Server table. The following T-SQL code creates a table where the models are going to be stored and a stored procedure that actually inserts a model:

CREATE TABLE dbo.RModels 
(Id INT NOT NULL IDENTITY(1,1) PRIMARY KEY, 
 ModelName NVARCHAR(50) NOT NULL, 
 Model VARBINARY(MAX) NOT NULL); 
GO 
CREATE PROCEDURE dbo.InsertModel 
(@modelname NVARCHAR(50), 
 @model NVARCHAR(MAX)) 
AS 
BEGIN 
    SET NOCOUNT ON;   
    INSERT INTO dbo.RModels (ModelName, Model) 
   VALUES (@modelname, CONVERT(VARBINARY(MAX), @model, 2)); 
END; 

The infrastructure is created. Now you can store the model in a SQL Server table from R. However, in order to call a stored procedure, you need to use an ODBC connection. Therefore, the following code first loads the RODBC library into memory and creates a connection to the SQL Server database. Then it serializes the model to a binary variable and creates a string from the binary variable using the paste() function. The same function is used to prepare a string with the T-SQL command that inserts the model into the table. Finally, the sqlQuery() function sends the T-SQL command to SQL Server. Again, the user used to execute this code must have permission to execute the stored procedure:

library(RODBC); 
conn <- odbcDriverConnect(sqlConnStr); 
modelbin <- serialize(bbLogR, NULL); 
modelbinstr=paste(modelbin, collapse=""); 
sqlQ <- paste("EXEC dbo.InsertModel @modelname='bbLogR', @model='",  
               modelbinstr,"'", sep=""); 
sqlQuery(conn, sqlQ); 
close(conn); 
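
You can quickly verify from T-SQL that the model was stored; the DATALENGTH() function shows the size of the serialized model in bytes:

USE AdventureWorksDW2014;
SELECT Id, ModelName, DATALENGTH(Model) AS ModelSizeInBytes
FROM dbo.RModels;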

The final step is to use the model from T-SQL. The following code uses the sys.sp_execute_external_script system procedure to perform a prediction on a single case with the stored model. First, it creates an input dataset that consists of a single row with the four input variables. Next, it retrieves the stored model from the table. The R code then unserializes the model and uses the rxPredict() function again to generate the output dataset, which includes the input variables and the prediction:

DECLARE @input AS NVARCHAR(MAX) 
SET @input = N' 
    SELECT *  
    FROM (VALUES  
          (0, 2, 44, 90000)) AS  
          inpQ(NumberCarsOwned, TotalChildren, Age, YearlyIncome);'  
DECLARE @mod VARBINARY(max) = 
 (SELECT Model  
  FROM DBO.RModels 
  WHERE ModelName = N'bbLogR');   
EXEC sys.sp_execute_external_script
 @language = N'R',
 @script = N'
  mod <- unserialize(as.raw(model));
  OutputDataSet <- rxPredict(modelObject = mod, data = InputDataSet,
                             outData = NULL,
                             predVarNames = "BikeBuyerPredict", type = "response",
                             checkFactorLevels = FALSE,
                             writeModelVars = TRUE, overwrite = TRUE);
 ',
 @input_data_1 = @input,
 @params = N'@model VARBINARY(MAX)',
 @model = @mod
WITH RESULT SETS ((
 BikeBuyerPredict FLOAT,
 NumberCarsOwned INT,
 TotalChildren INT,
 Age INT,
 YearlyIncome FLOAT));

The results are:

BikeBuyerPredict       NumberCarsOwned TotalChildren Age  YearlyIncome
---------------------- --------------- ------------- ---- -------------
0.733910292274223      0               2             44   90000

The input values used were the same as the values in the first row of the batch predictions from the dbo.vTargetMail view used in the previous prediction example. You can see that the predicted value is also the same.
