Chapter 13. Supporting R in SQL Server

SQL Server R Services combines the power and flexibility of the open source R language with enterprise-level tools for data storage and management, workflow development, and reporting and visualization. This chapter introduces R Services and the R language. R is developing quite fast, so it is worth mentioning that the R version used in this book is 3.2.2 (2015-08-14).

In the first section, you will learn about the free version of the R language and engine. You will also become familiar with the basic concepts of programming in R.

When developing an advanced analytical solution, you spend the vast majority of time with data. Typically, data is not in a shape useful for statistical and other algorithms. Data preparation is not really glorious but is an essential part of analytical projects. You will learn how to create a new or use an existing dataset and learn about basic data manipulation with R in the second section of this chapter.

The data preparation part of an analytical project is interleaved with the data understanding part. You need to gather in-depth knowledge of your data and data values before you can analyze it. Showing data graphically is a very efficient and popular method of data understanding. Fortunately, R support for data visualization is really comprehensive. However, sometimes numbers tell us more or in a more condensed way than graphs. Introductory statistics, like descriptive statistics, provide you with the numbers you need to understand your data. Again, R support for introductory statistics is astonishing.

Open source products also have some disadvantages. For example, scalability might be an issue. SQL Server 2016 brings R support inside the database engine. With this support, many of your problems are solved. You get many scalable procedures and enhanced security for your R applications.

This chapter will cover the following points:

  • R: basics and concepts
  • Core elements of the R language
  • R data structures
  • Basic data management
  • Simple visualizations
  • Introductory statistics
  • SQL Server R Services architecture
  • Creating and using scalable R solutions in SQL Server

Introducing R

R is the most widely used language for statistics, data mining, and machine learning. Besides the language, R is also the environment and the engine that executes the R code. You need to learn how to develop R programs, just as you need to learn any other programming language you intend to use.

Before going deeper into the R language, let's explain what the terms statistics, data mining, and machine learning mean. Statistics is the study and analysis of data collections, and interpretation and presentation of the results of the analysis. Typically, you don't have all population data, or census data, collected. You have to use samples—often survey samples. Data mining is again a set of powerful analysis techniques used on your data in order to discover patterns and rules that might improve your business. Machine learning is programming to use data to solve a given problem automatically. You can immediately see that all three definitions overlap. There is not a big distinction between them; you can even use them as synonyms. Small differences are visible when you think of the users of each method. Statistics is a science, and the users are scientists. Data mining users are typically business people. Machine learning users are, as the name suggests, often machines. Nevertheless, in many cases, the same algorithms are used, so there is really a lot of overlapping among the three branches of applied mathematics in analytical applications.

Let's not get lost in these formal definitions and start with R immediately. You will learn about:

  • Parts of open source R
  • Basic description of the R language
  • Obtaining help
  • R core syntax elements
  • R variables
  • R vectors
  • R packages

This chapter assumes that you are already familiar with the SQL Server and R tools, including SQL Server Management Studio, RStudio, or R Tools for Visual Studio, so you can start to write the code immediately.

Starting with R

You can download R from the Comprehensive R Archive Network (CRAN) site at http://cran.r-project.org. You can get the R engine for Windows, Linux, or Mac OS X. Microsoft also maintains its own R portal, the Microsoft R Application Network (MRAN) site at https://mran.revolutionanalytics.com/. You can use this site to download Microsoft R Open, the enhanced distribution of open source R from Microsoft. After installation, you start working in an interactive mode. You can use the R console client tool to write code line by line. As you already know, there are many additional tools. Let's just repeat here that the most widely used free tool for writing and executing R code is RStudio. It is free and you can download it from https://www.rstudio.com/. This section assumes you use RStudio for the code examples.

R is a case-sensitive, functional, and interpreted language. The R engine interprets your commands one by one. You don't type commands but rather call functions to achieve results, even a function called q() to quit an R session. As in any programming language, it is good practice to comment the code well. You can start a comment with a hash mark (#) anywhere in the line; any text after the hash mark is not executed and is treated as a comment. You end a command with a semicolon (;) or a new line. Commands finished by a semicolon can be combined in a single line. Similarly, as in T-SQL, the explicit end of a command with a semicolon is a better practice than just entering a new command on a new line. Here is the first example of the R code, showing a comment and using the contributors() function to list the authors and other contributors to the language:

# R Contributors 
contributors(); 

The abbreviated results are:

R is a project which is attempting to provide a modern piece of
statistical software for the GNU suite of software.
The current R is the result of a collaborative effort with
contributions from all over the world.
Authors of R.
R was initially written by Robert Gentleman and Ross Ihaka-also known as "R & R"
of the Statistics Department of the University of Auckland.
Since mid-1997 there has been a core group with write access to the R
source, currently consisting of
...

In R, help is always at your fingertips. With the help() function, you can get onto help first, and then search for the details you need. Using help.start() gives you the links to the free documentation about R. With help("options"), you can examine the global options that are used to affect the way in which R computes and displays the results. You can get help for a specific function. For example, help("exp") displays the help for the exponential function. You can use the shorter version of the help function—just the question mark. For example, ?"exp" also displays help for the exponential function. With example("exp"), you can get examples of usage of the exponential function. You can also search in help using a command—either help.search("topic") or ??"topic". With RSiteSearch("exp"), you can perform an online search for documentation about exponential functions over multiple R sites. The following code summarizes these help options:

# Getting help on help 
help(); 
# General help 
help.start(); 
# Help about global options 
help("options"); 
# Help on the function exp() 
help("exp"); 
# Examples for the function exp() 
example("exp"); 
# Search 
help.search("constants"); 
??"constants"; 
# Online search  
RSiteSearch("exp"); 

Finally, there is also the demo() function. This function runs some more advanced demo scripts, showing you the capabilities of R code. For example, the following call to this function shows you some graphic capabilities:

demo("graphics");

In RStudio, you get the code to execute in the Console window (bottom-left window by default). You need to move the cursor to that window and hit the Enter key to execute the first part of the code. You get the first graph in the bottom-right window. Then you hit the Enter key a couple of times more to scroll through all demo graphs. One part of the code that creates a nice pie chart is shown here:

pie.sales <- c(0.12, 0.3, 0.26, 0.16, 0.04, 0.12); 
names(pie.sales) <- c("Blueberry", "Cherry", "Apple", 
                      "Boston Cream", "Other", "Vanilla Cream"); 
pie(pie.sales, 
    col = c("purple","violetred1","green3","cornsilk","cyan","white")); 
 
title(main = "January Pie Sales", cex.main = 1.8, font.main = 1); 
title(xlab = "(Don't try this at home kids)", cex.lab = 0.8, font.lab = 3); 

The graphical results are shown in the following figure:

Starting with R

Figure 13.1: Demo pie chart

All of the objects created exist in memory. Each session has its own workspace. When you finish a session, you can save the workspace or the objects from memory to disk in an .RData file and load them in the next session. The workspace is saved in the folder from which R reads the source code files, or in a default folder. You can check the objects in the current workspace with the objects() function or with the ls() function. You can check the working folder, or directory, with the getwd() call. You can change it interactively with the setwd(dir) function, where the dir parameter is a string representing the new working directory. You can remove single objects from the workspace and memory with the rm(objectname) function, or a full list of objects using the same function with the list of objects as a parameter (you will learn about lists and other data structures later in this chapter). There are many more possibilities, including saving images, getting the history of last commands, saving the history, reloading the history, and even saving a specific object to a specific file. You will learn about some of them later in this chapter.

R language Basics

You can start investigating R by writing simple expressions that include literals and operators. Here are some examples:

1 + 1;
2 + 3 * 4;
3 ^ 3;
sqrt(81);
pi;

This code evaluates three mathematical expressions first using the basic operators. Check the results and note that R, as expected, evaluates the expressions using operator precedence as we know from mathematics. Then it calls the sqrt() function to calculate the square root of 81. Finally, the code checks the value of the base package built-in constant for the number pi (π). R has some built-in constants. Check them by searching help for them with ??"constants".

It is easy to generate sequences. The following code shows some examples:

rep(1,10); 
3:7;          
seq(3,7); 
seq(5,17,by=3);       

The first command replicates number 1 ten times using the rep() function. The second line generates the sequence of numbers between three and seven. The third line does exactly the same, just this time using the seq() function. This function gives you additional possibilities, as the fourth line shows. This command generates a sequence of numbers between 5 and 17, but this time with an increment of 3.

Writing ad hoc expressions means you need to rewrite them whenever you need them. To reuse the values, you need to store them in variables. You assign a value to a variable with an assignment operator. R supports multiple assignment operators. You can use the left assignment operator (<-), where the variable name is on the left side, or right assignment operator (->), where the variable name is on the right side. You can also use the equals (=) operator. The left assignment operator is the one you will see most commonly in R code. The following code stores the numbers 2, 3, and 4 in variables and then performs a calculation using the variables:

x <- 2; 
y <- 3; 
z <- 4; 
x + y * z; 

The result is 14. Note again that R is case sensitive. For example, the following line of code produces an error, because variables X, Y, and Z are not defined:

X + Y + Z; 

You can separate part of a variable name with a dot. This way, you can organize your objects into namespaces, just as you can in .NET languages. Here is an example:

This.Year <- 2016; 
This.Year; 

You can check whether the equals assignment operator really works:

x = 2; 
y = 3; 
z = 4; 
x + y * z; 

If you executed the last code, you would get the same result as with the code that used the left assignment operator instead of the equals operator.

Besides mathematical operators, R supports logical operators as well. To test the exact equality, use the double equals (==) operator. Other logical operators include <, <=, >, >=, and != to test the inequality. In addition, you can combine two logical expressions into a third one using the logical AND (&) and logical OR (|) operators. The following code checks a variable for exact equality with a number literal:

x <- 2; 
x == 2; 

The result is TRUE.

Every variable in R is actually an object. A simple scalar variable is a vector of length one. A vector is a one-dimensional array of scalars of the same type, or mode: numeric, character, logical, complex (imaginary numbers), and raw (bytes). You use the combine function c() to define the vectors. Here are the ways to assign variable values as vectors. Note that the variables with the same names will be overwritten:

x <- c(2,0,0,4);        
assign("y", c(1,9,9,9));  
c(5,4,3,2) -> z;               
q = c(1,2,3,4); 

The first line uses the left assignment operator. The second assigns the second vector to the variable y using the assign() function. The third line uses the right assignment operator, and the fourth line the equals operator.

You can perform operations on vectors just like you would perform them on scalars (remember, after all, a scalar is just a vector of length one). Here are some examples of vector operations:

x + y; 
x * 4; 
sqrt(x); 

The results of the previous three lines of code are:

3  9  9 13
8  0  0 16
1.414214 0.000000 0.000000 2.000000

You can see that the operations were performed element by element. You can operate on a selected element only as well. You use numerical index values to select specific elements. Here are some examples:

x <- c(2,0,0,4);   
x[1];    
x[-1];    
x[1] <- 3; x;    
x[-1] = 5; x;   

First, the code assigns a vector to a variable. The second line selects the first element of the vector. The third line selects all elements except the first one and returns a vector of three elements. The fourth line of the code assigns a new value to the first element and then shows the vector. The last line assigns new values to all elements but the first one, and then shows the vector. The results are, therefore:

2
0 0 4
3 0 0 4
3 5 5 5

You can also use logical operators on vectors. Here are some examples:

y <- c(1,9,9,9); 
y < 8;     
y[4] = 1; 
y < 8; 
y[y<8] = 2; y; 

The first line assigns a vector to variable y. The second line compares each vector value to a numeric constant 8 and returns TRUE for those elements where the value is lower than the given value. The third line assigns a new value to the fourth element of the vector. The fourth line performs the same comparison of the vector elements to number 8 again and returns TRUE for the first and fourth element. The last line edits the elements of vector y that satisfy the condition in the parenthesis—those elements where the value is less than 8. The result is:

TRUE FALSE FALSE FALSE
TRUE FALSE FALSE  TRUE
2 9 9 2

Vectors and scalars are very basic data structures. You will learn about more advanced data structures in the next section of this chapter. Before that, it is high time to mention a very important concept in R: packages.

Note

The R code shown in this chapter has used only core capabilities so far; capabilities that you get when you install the R engine. Although these capabilities are already very extensive, the real power of R comes with additional packages.

Packages are optional modules you can download and install. Each package brings additional functions, or demo data, in a well-defined format. The number of available packages is growing year by year. At the time of writing this book, in summer 2016, the number of downloadable packages was already nearly 9,000. A small set of standard packages is already included when you install the core engine. You can check out installed packages with the installed.packages() command. Packages are stored in the folder called library. You can get the path to the library with the .libPaths() function (note the dot in the name). You can use the library() function to list the packages in your library.

The most important command to learn is the install.packages("packagename"). This command searches the CRAN sites for the package, downloads it, unzips it, and installs it. Of course, you need a web connection in order to execute it successfully. You can imagine that such a simplistic approach is not very welcome in highly secure environments. Of course, in order to use R Services in SQL Server, the package installation is more secure and more complex, as you will learn later in this chapter.

Once a package is installed, you load it into memory with the library(packagename) command. You can get help on the content of the package with the help(package = "packagename") command. For example, if you want to read the data from a SQL Server database, you need to install the RODBC library. The following code installs this library, loads it, and gets the help for functions and methods available in it:

install.packages("RODBC"); 
library(RODBC); 
help(package = "RODBC"); 

Before reading the data from SQL Server, you need to perform two additional tasks. First, you need to create a login and a database user for the R session and give the user the permission to read the data. Then, you need to create an ODB data source name (DSN) that points to this database. In SQL Server Management Studio, connect to your SQL Server, and then in Object Explorer, expand the Security folder. Right-click on the Logins subfolder. Create a new login and a database user in the WideWorldImportersDW database, and add this user to the db_datareader role.

I created a SQL Server login called RUser with password Pa$$w0rd, and a user with the same name, as the following screenshot shows:

R language Basics

Figure 13.2: Generating the RUser login and database user

After that, I used the ODBC Data Sources tool to create a system DSN called WWIDW. I configured the DSN to connect to my local SQL Server with the RUser SQL Server login and appropriate password, and change the context to the WideWorldImportersDW database. If you've successfully finished both steps, you can execute this R code to read the data from SQL Server:

con <- odbcConnect("WWIDW", uid="RUser", pwd="Pa$$w0rd"); 
sqlQuery(con,  
         "SELECT c.Customer, 
            SUM(f.Quantity) AS TotalQuantity, 
            SUM(f.[Total Excluding Tax]) AS TotalAmount, 
            COUNT(*) AS SalesCount 
          FROM Fact.Sale AS f 
           INNER JOIN Dimension.Customer AS c 
            ON f.[Customer Key] = c.[Customer Key] 
          WHERE c.[Customer Key] <> 0 
          GROUP BY c.Customer 
          HAVING COUNT(*) > 400 
          ORDER BY SalesCount DESC;"); 
close(con); 

The code returns the following (abbreviated) result:

                          Customer TotalQuantity TotalAmount SalesCount
1          Tailspin Toys (Vidrine, LA)      18899    340163.8        455
2 Tailspin Toys (North Crows Nest, IN)      17684    313999.5        443
3            Tailspin Toys (Tolna, ND)      16240    294759.1        443
4      Wingtip Toys (Key Biscayne, FL)      18658    335053.5        441
5   Tailspin Toys (Peeples Valley, AZ)      15874    307461.0        437
6   Wingtip Toys (West Frostproof, FL)      18564    346621.5        436
7         Tailspin Toys (Maypearl, TX)      18219    339721.9        436
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset