Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Beyond relational

Beyond relational is actually only a marketing term. The relational model, used in the relational database management system, is nowhere limited to specific data types or specific languages only. However, with the term "beyond relational," we typically mean specialized and complex data types that might include spatial and temporal data, XML or JSON data, and extending the capabilities of the Transact-SQL language with CLR languages such as Visual C#, or statistical languages such as R. SQL Server in versions before 2016 already supports some of the features mentioned. Here is a quick review of this support that includes:

Spatial data
CLR support
XML data

Defining locations and shapes with Spatial Data

In modern applications, often you want to show your data on a map, using the physical location. You might also want to show the shape of the objects that your data describes. You can use spatial data for tasks like these. You can represent the objects with points, lines, or polygons. From simple shapes, you can create complex geometrical objects or geographical objects, for example, cities and roads. Spatial data appears in many contemporary databases. Acquiring spatial data has become quite simple with the Global Positioning System (GPS) and other technologies. In addition, many software packages and database management systems help you to work with spatial data. SQL Server supports two spatial data types, both implemented as .NET common language runtime data types, from version 2008:

The geometry type represents data in a Euclidean (flat) coordinate system.
The geography type represents data in a round-earth coordinate system.

We need two different spatial data types because of some important differences between them. These differences include units of measurement and orientation.

In the planar, or flat-earth, system, you define the units of measurements. The length of a distance and the surface of an area are given in the same unit of measurement as you use for the coordinates of your coordinate system. You, as the database developer, know what the coordinates mean and what the unit of measure is. In geometry, the distance between the points described with the coordinates (1, 3) and (4, 7) is 5 units regardless of the units used. You, as the database developer who created the database where you are storing this data, know the context. You know what these 5 units mean: 5 kilometers, or 5 inches.

When talking about locations on earth, coordinates are given in degrees of latitude and longitude. This is the round-earth, or ellipsoidal system. Lengths and areas are usually measured in the metric system, in meters and square meters. However, not everywhere in the world is the metric system used for spatial data. The spatial reference identifier (SRID) of the geography instance defines the unit of measure. Therefore, whenever measuring a distance or area in the ellipsoidal system, you should always quote the SRID used, which defines the units.

In the planar system, the ring orientation of a polygon is not an important factor. For example, a polygon described by the points ((0, 0), (10, 0), (0, 5), (0, 0)) is the same as a polygon described by ((0, 0), (5, 0), (0, 10), (0, 0)). You can always rotate the coordinates appropriately to get the same feeling of the orientation. However, in geography, the orientation is needed to completely describe a polygon. Just think of the equator, which divides the earth into two hemispheres. Is your spatial data describing the northern or southern hemisphere?

The Wide World Importers data warehouse includes city locations in the Dimension.City table. The following query retrieves it for cities in the main part of the USA:

SELECT City, 
  [Sales Territory] AS SalesTerritory, 
  Location AS LocationBinary, 
  Location.ToString() AS LocationLongLat 
FROM Dimension.City 
WHERE [City Key] <> 0 
  AND [Sales Territory] NOT IN 
      (N'External', N'Far West');

Here is the partial result of the query.

City        SalesTerritory  LocationBinary       LocationLongLat                
----------- --------------- -------------------- ------------------
Carrollton  Mideast         0xE6100000010C70...  POINT (-78.651695 42.1083969)
Carrollton  Southeast       0xE6100000010C88...  POINT (-76.5605078 36.9468152)
Carrollton  Great Lakes     0xE6100000010CDB...  POINT (-90.4070632 39.3022693)

You can see that the location is actually stored as a binary string. When you use the ToString() method of the location, you get the default string representation of the geographical point, which is the degrees of longitude and latitude.

In SSMS, you send the results of the previous query to a grid, in the results pane you get an additional representation for the spatial data. Click on the Spatial results tab, and you can see the points represented in the longitude-latitude coordinate system, as you can see in the following screenshot:

Defining locations and shapes with Spatial Data

Figure 2.1: Spatial results showing customers' locations

If you executed the query, you might have noticed that the spatial data representation control in SSMS has some limitations. It can show only 5,000 objects. The result displays only the first 5,000 locations. Nevertheless, as you can see from the previous figure, this is enough to realize that these points form a contour of the main part of the USA. Therefore, the points represent the customers' locations for customers from USA.

The following query gives you the details, such as location and population, for Denver, Colorado:

SELECT [City Key] AS CityKey, City, 
  [State Province] AS State, 
  [Latest Recorded Population] AS Population, 
  Location.ToString() AS LocationLongLat 
FROM Dimension.City 
WHERE [City Key] = 114129 
  AND [Valid To] = '9999-12-31 23:59:59.9999999';

Spatial data types have many useful methods. For example, the STDistance() method returns the shortest line between two geography types. This is a close approximate to the geodesic distance, defined as the shortest route between two points on the Earth's surface. The following code calculates this distance between Denver, Colorado, and Seattle, Washington:

DECLARE @g AS GEOGRAPHY; 
DECLARE @h AS GEOGRAPHY; 
DECLARE @unit AS NVARCHAR(50); 
SET @g = (SELECT Location FROM Dimension.City 
          WHERE [City Key] = 114129); 
SET @h = (SELECT Location FROM Dimension.City 
          WHERE [City Key] = 108657); 
SET @unit = (SELECT unit_of_measure  
             FROM sys.spatial_reference_systems 
             WHERE spatial_reference_id = @g.STSrid); 
SELECT FORMAT(@g.STDistance(@h), 'N', 'en-us') AS Distance, 
 @unit AS Unit;

The result of the previous batch is as follows:

Distance      Unit  
------------- ------
1,643,936.69  metre

Note that the code uses the sys.spatial_reference_system catalog view to get the unit of measure for the distance of the SRID used to store the geographical instances of data. The unit is meter. You can see that the distance between Denver, Colorado, and Seattle, Washington, is more than 1,600 kilometers.

The following query finds the major cities within a circle of 1,000 km around Denver, Colorado. Major cities are defined as those with a population larger than 200,000:

DECLARE @g AS GEOGRAPHY; 
SET @g = (SELECT Location FROM Dimension.City 
          WHERE [City Key] = 114129); 
SELECT DISTINCT City, 
  [State Province] AS State, 
  FORMAT([Latest Recorded Population], '000,000') AS Population, 
  FORMAT(@g.STDistance(Location), '000,000.00') AS Distance 
FROM Dimension.City 
WHERE Location.STIntersects(@g.STBuffer(1000000)) = 1 
  AND [Latest Recorded Population] > 200000 
  AND [City Key] <> 114129 
  AND [Valid To] = '9999-12-31 23:59:59.9999999' 
ORDER BY Distance;

Here is the result abbreviated to the 12 closest cities to Denver, Colorado:

City              State       Population  Distance   
----------------- ----------- ----------- -----------
Aurora            Colorado    325,078     013,141.64
Colorado Springs  Colorado    416,427     101,487.28
Albuquerque       New Mexico  545,852     537,221.38
Wichita           Kansas      382,368     702,553.01
Lincoln           Nebraska    258,379     716,934.90
Lubbock           Texas       229,573     738,625.38
Omaha             Nebraska    408,958     784,842.10
Oklahoma City     Oklahoma    579,999     809,747.65
Tulsa             Oklahoma    391,906     882,203.51
El Paso           Texas       649,121     895,789.96
Kansas City       Missouri    459,787     898,397.45
Scottsdale        Arizona     217,385     926,980.71

There are many more useful methods and properties implemented in these two spatial data types. In addition, you can improve the performance of spatial queries with the help of specialized spatial indexes. Refer to the MSDN article Spatial Data (SQL Server) at https://msdn.microsoft.com/en-us/library/bb933790.aspx for more details on spatial data types, their methods, and spatial indexes.

CLR integration

You probably noticed that the two spatial data types are implemented as CLR data types. The spatial data types are shipped with SQL Server; therefore, Microsoft developers created them. However, you also can create your own CLR data types. SQL Server featured CLR inside the Database Engine for the first time in the 2005 version.

You can create the following CLR objects in a SQL Server database:

User-defined functions
Stored procedures
Triggers
User-defined aggregate functions
User-defined data types

You can use CLR objects to extend the functionality of the Transact-SQL language. You should use CLR for objects that you can't create in Transact-SQL, like user-defined aggregates or user-defined data types. For objects that you can also create in Transact-SQL, like functions, stored procedures and triggers, you should use Transact-SQL to manipulate the data, and CLR only in the areas where CLR languages like Visual C# are faster than Transact-SQL, such as complex calculations, and string manipulations.

For example, Transact-SQL language includes only a fistful of aggregate functions. To describe a distribution of a continuous variable, in the descriptive statistics you use the first four population moments, namely the:

Mean, or average value
Standard deviation
Skewness, or tailed-ness
Kurtosis, or peaked-ness

Transact-SQL includes only aggregate functions for calculating the mean and the standard deviation. These two measures might be descriptors good enough to describe the regular normal, or Gaussian distribution, as this figure shows:

Figure 2.2: Normal or Gaussian distribution

However, a distribution in the real world might not follow the normal curve exactly. Often, it is skewed. A typical example is income, which is usually highly skewed to the right, known as a positive skew. The following figure shows a positively skewed distribution, where you have a long tail on the right side of the distribution:

Figure 2.3: Positively skewed distribution

Here is the formula for skewness:

The formula for skewness uses the mean value and the standard deviation in the formula. I don't want to calculate these values in advance. If I calculated these values in advance, I would need to scan through the data twice. I want to have a more efficient algorithm, an algorithm that will scan the data only once.

I use a bit of mathematics for this optimization. First, I expand the formula for the subtraction of the mean from the i^th value on the third degree:

Then I use the fact that the sum is distributive over the product, as shown in the formula for two values only:

This formula can be generalized for all values:

Of course, I can do the same mathematics for the remaining elements of the expanded formula for the subtraction and calculate all the aggregates I need with a single pass through the data, as shown in the following C# code for the user-defined aggregate function that calculates the skewness. The first part of the code declares the namespaces used:

-- C# code for skewness 
using System; 
using System.Data; 
using System.Data.SqlClient; 
using System.Data.SqlTypes; 
using Microsoft.SqlServer.Server;

You represent a user-defined aggregate (UDA) with a class or a structure in CLR. First, you decorate it with attributes that give some information about the UDA's behavior and information for the potential optimization of the queries that use it:

[Serializable] 
[SqlUserDefinedAggregate( 
   Format.Native,                 
   IsInvariantToDuplicates = false,  
   IsInvariantToNulls = true,        
   IsInvariantToOrder = true,      
   IsNullIfEmpty = false)]

The next part of the code for the UDA defines the structure and internal variables used to hold the intermediate results for the elements of the calculation, as I explained in the formula reorganization:

public struct Skew 
{ 
  private double rx;   
  private double rx2;   
  private double r2x;   
  private double rx3;   
  private double r3x2; 
  private double r3x;   
  private Int64 rn;

Structures or classes that represent UDAs must implement four methods. The Init() method initializes the internal variables:

  public void Init() 
  { 
    rx = 0; 
    rx2 = 0; 
    r2x = 0; 
    rx3 = 0; 
    r3x2 = 0; 
    r3x = 0; 
    rn = 0; 
  }

The Accumulate() method does the actual work of aggregating:

  public void Accumulate(SqlDouble inpVal) 
  { 
    if (inpVal.IsNull) 
    { 
      return; 
    } 
    rx = rx + inpVal.Value; 
    rx2 = rx2 + Math.Pow(inpVal.Value, 2); 
    r2x = r2x + 2 * inpVal.Value; 
    rx3 = rx3 + Math.Pow(inpVal.Value, 3); 
    r3x2 = r3x2 + 3 * Math.Pow(inpVal.Value, 2); 
    r3x = r3x + 3 * inpVal.Value; 
    rn = rn + 1; 
  }

The Merge() method accepts another aggregate as the input. It merges two aggregates. Where do two or more aggregates come from? SQL Server might decide to execute the aggregating query in parallel, store the intermediate aggregate results internally, and then merge them by using the Merge()method:

  public void Merge(Skew Group) 
  { 
    this.rx = this.rx + Group.rx; 
    this.rx2 = this.rx2 + Group.rx2; 
    this.r2x = this.r2x + Group.r2x; 
    this.rx3 = this.rx3 + Group.rx3; 
    this.r3x2 = this.r3x2 + Group.r3x2; 
    this.r3x = this.r3x + Group.r3x; 
    this.rn = this.rn + Group.rn; 
  }

The Terminate() method does the final calculations and returns the aggregated value to the calling query. The return value type must be in compliance with SQL Server data types:

  public SqlDouble Terminate() 
  { 
    double myAvg = (rx / rn); 
    double myStDev = Math.Pow((rx2 - r2x * myAvg + rn *      Math.Pow(myAvg, 2)) 
                     / (rn - 1), 1d / 2d); 
    double mySkew = (rx3 - r3x2 * myAvg + r3x * Math.Pow(myAvg, 2) 
                    - rn * Math.Pow(myAvg, 3)) / 
           Math.Pow(myStDev,3) * rn / (rn - 1) / (rn - 2); 
    return (SqlDouble)mySkew; 
  } 
 
}

You can use the C# compiler to compile the code for the UDA. However, in the associated code for the book, a compiled assembly, the .dll file, is provided for your convenience. The code also includes the function that calculates the kurtosis; for the sake of brevity, this code is not explained in detail here.

In order to use CLR objects, you need to enable CLR for your instance. Then, you need to catalog, or deploy, the assembly in the database with the CREATE ASSEMBLY statement. Then, you create the aggregate functions with the CREATE AGGREGATE statement. The following code enables CLR, deploys the assembly provided with the book, and then creates the two aggregate functions. The code assumes that the assembly is stored in the C:SQL2016DevGuide folder:

EXEC sp_configure 'clr enabled', 1; 
RECONFIGURE WITH OVERRIDE; 
 
CREATE ASSEMBLY DescriptiveStatistics  
FROM 'C:SQL2016DevGuideDescriptiveStatistics.dll' 
WITH PERMISSION_SET = SAFE; 
 
CREATE AGGREGATE dbo.Skew(@s float) 
RETURNS float 
EXTERNAL NAME DescriptiveStatistics.Skew; 
 
CREATE AGGREGATE dbo.Kurt(@s float) 
RETURNS float 
EXTERNAL NAME DescriptiveStatistics.Kurt;

Once the assembly is cataloged and UDAs have been created, you can use them just like built-in aggregate functions. The following query calculates the four moments for the sum over customers of the amount ordered without tax. In a CTE, it calculates the sum of the amount per customer, and then in the outer query the average, the standard deviation, the skewness, and the kurtosis for this total:

WITH CustomerSalesCTE AS 
( 
SELECT c.Customer,  
  SUM(f.[Total Excluding Tax]) AS TotalAmount 
FROM Fact.Sale AS f 
  INNER JOIN Dimension.Customer AS c 
    ON f.[Customer Key] = c.[Customer Key] 
WHERE c.[Customer Key] <> 0  
GROUP BY c.Customer 
) 
SELECT ROUND(AVG(TotalAmount), 2) AS Average, 
  ROUND(STDEV(TotalAmount), 2) AS StandardDeviation,  
  ROUND(dbo.Skew(TotalAmount), 6) AS Skewness, 
  ROUND(dbo.Kurt(TotalAmount), 6) AS Kurtosis 
FROM CustomerSalesCTE;

Here is the result:

Average       StandardDeviation Skewness Kurtosis 
------------- ----------------- -------- ---------
270479.220000 38586.08          0.005943 -0.263897

After you have tested the UDAs, you can execute the following code to clean up your database, and potentially disable CLR. Note that you need to drop the UDAs before you drop the assembly:

DROP AGGREGATE dbo.Skew; 
DROP AGGREGATE dbo.Kurt; 
DROP ASSEMBLY DescriptiveStatistics; 
/* 
EXEC sp_configure 'clr enabled', 0; 
RECONFIGURE WITH OVERRIDE; 
*/

XML support in SQL Server

SQL Server in version 2005 also started to feature extended support for XML data inside the database engine, although some basic support was already included in version 2000. The support starts by generating XML data from tabular results. You can use the FOR XML clause of the SELECT statement for this task.

The following query generates an XML document from the regular tabular result set by using the FOR XML clause with AUTO option, to generate an element-centric XML instance, with namespace and inline schema included:

SELECT c.[Customer Key] AS CustomerKey, 
  c.[WWI Customer ID] AS CustomerId, 
  c.[Customer],  
  c.[Buying Group] AS BuyingGroup, 
  f.Quantity, 
  f.[Total Excluding Tax] AS Amount, 
  f.Profit 
FROM Dimension.Customer AS c 
  INNER JOIN Fact.Sale AS f 
    ON c.[Customer Key] = f.[Customer Key] 
WHERE c.[Customer Key] IN (127, 128) 
FOR XML AUTO, ELEMENTS,  
  ROOT('CustomersOrders'), 
  XMLSCHEMA('CustomersOrdersSchema'); 
GO

Here is the partial result of this query. The first part of the result is the inline schema:

<CustomersOrders>
  <xsd:schema targetNamespace="CustomersOrdersSchema" ...
    <xsd:import namespace="http://schemas.microsoft.com/sqlserver
     /2004/sqltypes" ...
<xsd:element name="c">
<xsd:complexType>
    <xsd:sequence>
      <xsd:element name="CustomerKey" type="sqltypes:int" />
      <xsd:element name="CustomerId" type="sqltypes:int" />
      <xsd:element name="Customer">
        <xsd:simpleType>
          <xsd:restriction base="sqltypes:nvarchar" ...
            <xsd:maxLength value="100" />
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:element>
      ...
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
</xsd:schema>
<c xmlns="CustomersOrdersSchema">
<CustomerKey>127</CustomerKey>
<CustomerId>127</CustomerId>
<Customer>Tailspin Toys (Point Roberts, WA)</Customer>
<BuyingGroup>Tailspin Toys</BuyingGroup>
<f>
<Quantity>3</Quantity>
  <Amount>48.00</Amount>
  <Profit>31.50</Profit>
</f>
<f>
<Quantity>9</Quantity>
  <Amount>2160.00</Amount>
  <Profit>1363.50</Profit>
</f>
</c>
<c xmlns="CustomersOrdersSchema">
<CustomerKey>128</CustomerKey>
<CustomerId>128</CustomerId>
<Customer>Tailspin Toys (East Portal, CO)</Customer>
<BuyingGroup>Tailspin Toys</BuyingGroup>
<f>
<Quantity>84</Quantity>
  <Amount>420.00</Amount>
  <Profit>294.00</Profit>
</f>
</c>
...
</CustomersOrders>

You can also do the opposite process: convert XML to tables. Converting XML to relational tables is known as shredding XML. You can do this by using the nodes() method of the XML data type or with the OPENXML() rowset function.

Inside SQL Server, you can also query the XML data from Transact-SQL to find specific elements, attributes, or XML fragments. XQuery is a standard language for browsing XML instances and returning XML, and is supported inside XML data type methods.

You can store XML instances inside a SQL Server database in a column of the XML data type. An XML data type includes five methods that accept XQuery as a parameter. The methods support querying (the query() method), retrieving atomic values (the value() method), existence checks (the exist() method), modifying sections within the XML data (the modify() method), as opposed to overriding the whole thing and shredding XML data into multiple rows in a result set (the nodes() method).

The following code creates a variable of the XML data type to store an XML instance in it. Then, it uses the query() method to return XML fragments from the XML instance. This method accepts XQuery query as a parameter. The XQuery query uses the FLWOR expressions to define and shape the XML returned:

DECLARE @x AS XML; 
SET @x = N' 
<CustomersOrders> 
  <Customer custid="1"> 
    <!-- Comment 111 --> 
    <companyname>CustA</companyname> 
    <Order orderid="1"> 
      <orderdate>2016-07-01T00:00:00</orderdate> 
    </Order> 
    <Order orderid="9"> 
      <orderdate>2016-07-03T00:00:00</orderdate> 
    </Order> 
    <Order orderid="12"> 
      <orderdate>2016-07-12T00:00:00</orderdate> 
    </Order> 
  </Customer> 
  <Customer custid="2"> 
    <!-- Comment 222 -->   
    <companyname>CustB</companyname> 
    <Order orderid="3"> 
      <orderdate>2016-07-01T00:00:00</orderdate> 
    </Order> 
    <Order orderid="10"> 
      <orderdate>2016-07-05T00:00:00</orderdate> 
    </Order> 
  </Customer> 
</CustomersOrders>'; 
SELECT @x.query('for $i in CustomersOrders/Customer/Order 
                 let $j := $i/orderdate 
                 where $i/@orderid < 10900 
                 order by ($j)[1] 
                 return  
                 <Order-orderid-element> 
                  <orderid>{data($i/@orderid)}</orderid> 
                  {$j} 
                 </Order-orderid-element>') 
       AS [Filtered, sorted and reformatted orders with let clause];

Here is the result of the previous query:

<Order-orderid-element>
  <orderid>1</orderid>
<orderdate>2016-07-01T00:00:00</orderdate>
</Order-orderid-element>
<Order-orderid-element>
  <orderid>3</orderid>
<orderdate>2016-07-01T00:00:00</orderdate>
</Order-orderid-element>
<Order-orderid-element>
  <orderid>9</orderid>
<orderdate>2016-07-03T00:00:00</orderdate>
</Order-orderid-element>
<Order-orderid-element>
  <orderid>10</orderid>
<orderdate>2016-07-05T00:00:00</orderdate>
</Order-orderid-element>
<Order-orderid-element>
  <orderid>12</orderid>
<orderdate>2016-07-12T00:00:00</orderdate>
</Order-orderid-element>

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Beyond relational

Create new playlist

Sign In

Sign Up

Beyond relational

Defining locations and shapes with Spatial Data

CLR integration

XML support in SQL Server

Table of Contents for
Beyond relational