
6. Practical Use of Tables and Indexes

Davide Mauri (Redmond, WA, USA), Silvano Coriani (Legnano, Milano, Italy), Anna Hoffman (Sherman Oaks, CA, USA), Sanjay Mishra (Redmond, WA, USA), and Jovan Popovic (Belgrade, Serbia)

When people imagine a table, in most cases they picture something like an Excel spreadsheet: a bunch of cells organized in rows and columns, usually with one column that contains the identifiers of the rows.

Azure SQL enables you to use much more than a plain table. You can configure and optimize your table for specific query patterns, complex analytics, or highly concurrent updates. You can also configure your tables to keep a full history of changes, implement some parts of your domain data-integrity logic, and apply fine-grained security rules. In this chapter, you can find practical advice that can help you leverage the features that Azure SQL provides to create a table that best fits your needs.

Designing good tables

As a first step, we need to understand some best practices for table design. Table design is tightly coupled with the Domain-Driven Design (DDD) theory described in Eric Evans' book. In DDD, we define Domain Model entities (objects or classes) that are grouped into Bounded Contexts. A Bounded Context is a logical boundary within which particular terms, definitions, and rules apply in a consistent way. There are also associations between Domain Model entities, as shown in Figure 6-1.
Figure 6-1. Domain model of application

Defining the entities in the Domain Model is known as domain or logical design. Eventually, we need to decide how to store the Domain Model objects in the database. If you decide to store Domain Model entities in Azure SQL, there is a straightforward way to define your underlying tables. Every Bounded Context (Sales or Support in Figure 6-1) can be represented as a schema. In certain cases, Bounded Contexts can even be represented by individual databases. Every Domain Model entity should be mapped to one or more tables in the associated schema. Some high-level guidelines that you can use to map your Domain Model entities to database tables are:
  • Every Domain Model entity should be mapped to one table with a column that uniquely identifies the domain object (the so-called Primary Key).

  • Every simple property of a domain object should be represented as a column in the table with a matching SQL type.

  • Every collection, array, or list in the domain object should be represented as a separate table. This table should have a column (known as a Foreign Key) that matches the primary key of the row that owns the collection.

  • If there is a relationship from one domain object to another, the underlying table should have a column that contains the primary key of the related object. In some complex cases, there should be a separate mapping table with pairs of primary keys of the related tables.
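
For illustration, here is a minimal T-SQL sketch of these guidelines applied to a hypothetical Customer entity with a collection of phone numbers (the schema and all names are illustrative):
CREATE TABLE Sales.Customer (
     CustomerID int PRIMARY KEY,  -- uniquely identifies the domain object
     CustomerName nvarchar(100)   -- simple property mapped to a typed column
);
CREATE TABLE Sales.CustomerPhone (
     CustomerID int REFERENCES Sales.Customer (CustomerID), -- foreign key to the owning row
     PhoneNumber varchar(20)      -- one element of the collection per row
);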

Does this look complex? It might seem so at the beginning. The process of rearranging complex domain objects into a set of tables with the minimum amount of redundancy is definitely a challenge. We have already discussed this process, called normalization, a bit, as you may recall from Chapter 4.

Normalization enables us to avoid accidental mistakes and prevent data inconsistency. As an example, people use words like “inquiry” and “enquiry” interchangeably (some might consider “enquiry” a misspelling of “inquiry”). There are also lots of words in American and British English that are spelled slightly differently (e.g., colour and color, center and centre). Let’s imagine that you have an entity stored as a single row (or document, in NoSQL document databases) that has a type property with value “inquiry” and an array of tags where one tag is “.NET”. Someone might insert another similar entity with type “enquiry” and “dotnet” as one of the tags. A functionality that searches entities by types and tags will return different results depending on the search criterion, because clients will enter “inquiry” or “enquiry”, “.NET” or “dotnet”, and you probably cannot implement sophisticated rules that understand these synonyms. In these cases, it might be a better idea to store the values of types and tags in a separate table, assign them identifiers, and use these identifiers in the entities. This way, someone can choose a standardized term to be used, maybe with a list of synonyms for that term, and searches will always return consistent results.
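
A minimal sketch of this idea (all names are illustrative): tags live in their own table with a single standardized name, and entities reference them by identifier.
CREATE TABLE dbo.Tag (
     TagID int PRIMARY KEY,
     Name nvarchar(50) UNIQUE  -- the one standardized term, e.g., '.NET'
);
CREATE TABLE dbo.EntityTag (
     EntityID int,
     TagID int REFERENCES dbo.Tag (TagID),  -- only existing tags can be referenced
     PRIMARY KEY (EntityID, TagID)          -- the same tag cannot be attached twice
);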

How can you do this correctly if you have never done it before? Luckily, creating a good model is an iterative application of the normalization rules, so we are not left to improvise here. As normalization is all about avoiding potential data inconsistencies and redundancy, it is worth understanding why we try so hard to avoid those.

Redundancy is not necessarily bad per se, but it surely brings challenges. Let’s say, for example, that an Order Header shows that the order value is $100, but when you sum up all the items in the Order Details table for that specific order, you find a total value of $120. Which one is correct? You don’t know. Even worse, you cannot know the answer just by looking at the data you have, as the data is inconsistent. The data is there, but the information is lost. And this is just one example. There are several well-known data anomalies (for details, see https://aka.ms/sednda) that can be prevented by correct normalization.

The normalization process will help you decompose objects into smaller pieces without losing any information. It is a “non-loss decomposition” process: it gives you the ability to always rebuild the original information when you need it. At the same time, it helps you avoid the perils of redundancy, and thus inconsistency, favoring reuse of the smaller pieces of information created as a result of its application.

We have just scratched the surface of this topic, and detailing the whole normalization process is beyond the scope of this book. You can start on the right foot by keeping in mind that a good model is one where tables have scalar columns that depend only on the primary key and where foreign keys are used to reference and enforce relationships with other tables, so that information is never lost.

Normalization is an amazingly interesting topic, as it is one of the few formal processes that can be applied in informatics: you can find more details about database design in the books referenced at the end of this chapter.

So far, we have focused on the logical aspect of database design, but normalization and mapping of domain classes to tables are not the end of the process. To optimally design a table, you also need to think about the following:
  • What is the best way to define the types and other properties of the columns in your table?

  • How can you optimize operations on your tables using indexes?

  • What rules and constraints do you want to define in your table to ensure that your data is valid?

  • How can you make your tables secure?

The following sections present some guidelines that can help you answer these questions.

Creating tables

Azure SQL enables you to easily create a table without needing graphical tools or generators. You can use a short CREATE TABLE statement to create a new table:
CREATE TABLE Customer (
     CustomerID tinyint,
     CustomerName varchar(max),
     LocationID bigint
);
Although it is easy to create a specification of a table with columns, this is still not table design. Proper table design is a process where you identify the table characteristics that best fit your application. Looking at the table definition in the previous example, we might ask ourselves a few questions that would challenge the design:
  • Could there be two customers with the same CustomerID? Do we need to ensure that no one might insert two customers with the same identifier?

  • Is the tinyint type a proper choice for the customer ID? A tinyint column can contain only 256 different values, so would this be enough for all the customers that we might have?

  • The varchar(max) type enables us to store strings with a size up to 2GB; however, is this overkill? If we know that customer names cannot be longer than 100 characters, could we use a smaller type like varchar(100)?

  • The varchar types enable us to use most of the common characters from English and other Western languages. However, could we have a customer with some less common Latin (e.g., Hungarian ű or ő, or Slavic đ, č, or ć), Cyrillic (ђ, ч, or ћ), or Asian characters? In that case, the nvarchar type might be a better choice.

  • Should we ignore letter casing when we sort or compare customer names? Are there some other language rules that we need to apply?

  • In the LocationID column, we can store 2^64 different values, but do we really have so many locations? Could we use a smaller type?

  • Is the LocationID value an identifier of a location in a separate table? In that case, how do we ensure that the value in LocationID exists in the other table? How do we ensure that a location will not be deleted while there is still a customer that holds its id?

Proper table design should answer these questions and enable you to have the right solution for your table.

You may also have noticed that quite a few questions were about proper data type size. Today there is no shortage of space and resources, so spending time just to spare a bunch of bytes may seem a waste of time. Well, keep in mind that if your solution is successful, your database can easily contain literally billions of rows. Wasting space will not only mean wasting storage space but also more memory needs and higher CPU consumption. In the cloud, where you pay for what you use…well, I guess you figured it out already: more costs for exactly the same performance as an otherwise optimized model. Be a better developer!

Determine the right column types

When you design your table, you need to find a SQL column type for every data type from your application that you want to persist. You need to consider the following to effectively design your tables:

  • Define column types that describe your data.

  • Specify collations to define sorting and comparison rules for textual data.

  • Use computed columns for pre-calculated expressions.

In the following sections, you will see some best practices and guidelines that you can use while you are defining your tables.

Define your column types

Most of the value types in application programming languages have a matching SQL type counterpart, so you can quickly determine which SQL types should store the values from your domain objects. There are also lots of tools that automatically generate table columns and their types based on the types of class properties (the so-called Code First approach). Although you can use tools and deterministic mapping, it would be beneficial to learn how to carefully examine and improve your table design. The main reason is the fact that you know more about the domain model than your tools do, and you should not let an automatic generator create the most essential part of your application using a set of generic rules. Some examples of default mapping rules that might not be the best fit for your design are:
  • String types are by default mapped to the NVARCHAR(MAX) type because it can contain any Unicode text of arbitrary length (up to 2GB). The downside of this decision is that NVARCHAR(MAX) values are stored and handled as large objects, and therefore they are suboptimal in many operations compared with string types with a length of less than 4000 characters. If you know that your text will have some length limit, try to use a more precise type like NVARCHAR(100), which might boost the performance of your queries.

  • Some languages, like C#, do not have separate types for dates and dates with time. You might use the DateTime type for dates even if you do not use the time part. These values are mapped to datetime2. If you know that your DateTime values contain only dates without time, it would be more optimal to use the exact date type.

  • Decimal numbers (float, real types) are mapped to the decimal type. Instead of a plain decimal type, in Azure SQL you can explicitly specify how many digits and decimal places it can have, and that determines its size. The size of a decimal number varies between 5 and 17 bytes per number depending on the precision. If you know that you are using a decimal number to store money with just two decimal places, and you know the maximal theoretical amount of money that your application will use, you can explicitly specify the size.

These examples are minor checks and adjustments that can make you a hero if you spend some time to validate the results of the default rules and improve them with your knowledge of the domain model.
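
As a sketch, the Customer table from the beginning of this chapter could be refined along these lines (the length limits and extra columns are illustrative assumptions):
CREATE TABLE Customer (
     CustomerID int PRIMARY KEY,  -- more room to grow than tinyint's 256 values
     CustomerName nvarchar(100),  -- bounded Unicode instead of varchar(max)
     BirthDate date,              -- date only, since no time part is needed
     CreditLimit decimal(10,2),   -- explicit precision and scale for money
     LocationID int               -- smaller than bigint if 2^31 values suffice
);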

General advice is to make your columns as small as possible and find the minimal SQL type for a given application type. With appropriately sized smaller types, you can store the same amount of application data using a smaller amount of storage. Beyond storage savings, smaller types can positively impact query performance. Azure SQL has an internal memory cache (known as the SQL buffer pool) that contains a part of the data from persistent storage, increasing the speed of retrieval for that data. With unnecessarily large types, you waste precious buffer pool space with unnecessarily padded data. Also, smaller types require fewer disk and network input/output operations, which allows you to load or save data more quickly. Further, big columns can trick the Azure SQL query optimizer into reserving excess memory for the queries on those columns. That memory would not be leveraged, but since it is reserved for those queries, other queries may not get enough memory. In addition, data with a smaller memory footprint can be allocated to smaller and much faster memory cache layers. In some cases, you might find that query performance increases just because the data is not padded with useless 0 values introduced by a larger than necessary data type.

Some recommend using globally unique identifier (GUID) values as the identifiers of your domain model objects, to avoid depending on the database to generate object identifiers. The advantage of a GUID is that you can assign the identity to the Model object in your app without consulting your database. However, the big disadvantage is the fact that GUID values are 16 bytes long, compared to int (4 bytes) or bigint (8 bytes). While storage consumption is not optimal, it is not the main concern here. Finding or joining the rows can be much slower if you unnecessarily use GUID values, as data is spread randomly everywhere, which hurts cache efficiency badly once your table grows after you go to production, especially if the solution is successful and used by a lot of users. Good design might prevent a lot of performance issues that could happen once your application faces a real scenario with millions of rows per table.

Collations

Collation is another important text property in Azure SQL. Collations enable you to define which linguistic rules should be used when Azure SQL compares or sorts text values.

In standard programming languages, you use plain strings, and you need to explicitly specify case-insensitive comparison or sorting rules when you need them. For example, if you need to use locale-specific linguistic rules to sort strings in C#, you need to specify a CultureInfo for the target language. In most programming languages, you specify case insensitivity or culture explicitly in some method. Azure SQL enables you to define language rules and case sensitivity, but it also enables some more advanced comparison and sorting rules like accent sensitivity, Kana sensitivity (to distinguish Japanese Hiragana and Katakana characters), or Variation selector sensitivity (to apply fine-grained sorting rules in modern digital encodings).

You can easily set collation on any string column using a statement like the following one:
ALTER TABLE Warehouse.StockItems
      ALTER COLUMN Brand NVARCHAR(50)
      COLLATE Serbian_Cyrillic_100_CI_AI

This statement specifies that any string operation on the column will use Serbian linguistic rules and ignore casing (CI) and accents (AI). Changing this property of the column will automatically apply to all sorting and comparison operations that use the column.

Azure SQL has collations for most of the world's languages and enables you to turn case, accent, or kana sensitivity rules on and off for all of them. This is a huge benefit if you need to implement applications for global markets.

If you do not speak Japanese and you get a bug that some Japanese words in the application are incorrectly ordered, the last thing that you would like to do is start learning about the difference between Katakana and Hiragana just to fix the bug. This is the time when you would leverage the power of linguistic rules that Azure SQL provides and rely on the fact that Azure SQL can correctly sort results.
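
When needed, you can also apply a collation to a single expression instead of changing the column definition. A minimal sketch, reusing the collation from the earlier example:
SELECT Brand
FROM Warehouse.StockItems
ORDER BY Brand COLLATE Serbian_Cyrillic_100_CI_AI;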

Another important property of text is encoding. Characters can be stored using different binary representations, and the rule that translates a character to its binary form is called an encoding schema. The Unicode standard assigns a unique numeric identifier (known as a code point) to most of the relevant characters currently in use (even emojis have their unique code points!). An encoding schema defines how to serialize characters, or rather their code points, as a series of bytes.

One of the simplest encoding schemas is ASCII, which uses 7 bits to encode the most common characters used in the English language, with one byte per character (e.g., code 0x5B is used for “]”). ASCII supports only the common characters used in English, but not other characters used in Western or Latin languages such as ä, ö, ü, ñ, or ß. Windows-1252 and ISO-8859-1 are extensions of ASCII that use 8 bits to encode a wider character set, covering other common characters in Western languages. However, since these encoding schemas cannot represent the characters used in other languages, other nations derived their own 8-bit encoding schemas for national character sets. Examples are ISCII (India), VSCII (Vietnam), and YUSCII (former Yugoslavia). The single-byte national character sets use code points in the range up to 256 to represent the common characters of their national alphabets.

Azure SQL uses the varchar type to represent characters in single-byte encoding schemas (although there are also some multi-byte codepages). To determine which national character set (codepage) an encoded value belongs to, Azure SQL uses the column collation. If a binary value in a varchar column is 0xC6 and the column collation is one of the Latin collations (codepage Windows-1250), Azure SQL will assume that this is the character “Ć”. If the collation is one of the Serbian collations, Azure SQL will assume that this is the character “Ж” (in Serbian_Cyrillic collations - codepage Windows-1251) or “Æ” (in Serbian_Latin collations - codepage Windows-1252).

Identifying characters based on an 8-bit code point and a national character set is unnecessarily simplistic and error-prone, so the Unicode standard uses a multi-byte encoding space to encode the characters from all languages, with a different code point for every character. There are several encoding schemas for Unicode code points, like UTF-8, where every character is represented using 1, 2, 3, or 4 bytes, and UTF-16, where every character is represented using 2 or 4 bytes. Azure SQL uses the varchar type to represent Unicode characters with UTF-8 encoding, but these values must use a collation ending with UTF8. To represent the UTF-16 encoding schema, Azure SQL uses the nvarchar type, which does not depend on collations.

Always use Unicode unless you are sure that the characters in your string columns can take only a limited, known set of values. The nvarchar type is a good choice in most scenarios. Although it uses 2 bytes even for common characters that could fit into one byte, Azure SQL has optimizations that automatically compress these values. If you are really concerned about performance and want to optimize for common characters, you can use the varchar type with a UTF-8 collation.
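
A minimal sketch of this last option (the table and column names are illustrative):
CREATE TABLE dbo.Products (
     ProductID int PRIMARY KEY,
     -- UTF-8: ASCII characters take one byte, others more only when needed
     Name varchar(100) COLLATE Latin1_General_100_CI_AI_SC_UTF8,
     -- UTF-16: every character takes at least two bytes
     Description nvarchar(400)
);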

Computed columns

In addition to classic columns, you can use special computed columns that represent named expressions. As an example, imagine that you have the quantity of purchased products, the price for each unit, and the tax rate. You could create a function that calculates the profit from the values of these columns, but in some cases, it would be better to create an automatically re-calculated virtual column:
ALTER TABLE Sales.OrderLines
     ADD Profit AS (Quantity*UnitPrice)*(1-TaxRate)
An application that reads this table will get the profit value without needing to know any profit-calculation function. To external applications, this value looks like just another column in the table. This computed column does not occupy additional space, and it is dynamically re-calculated whenever some query uses it. Such a lightweight computed column is an ideal solution if the expression can be calculated quickly. However, if you have some heavy calculation that requires string processing, you might want to permanently store the calculated values until the base columns change. This way, you will use some additional space, but queries will not wait for the calculation to finish for every row that should be returned. These types of computed columns are known as persisted computed columns and can be created by adding the keyword PERSISTED to the preceding example. The resulting code will be this:
ALTER TABLE Sales.OrderLines
     ADD Profit AS (Quantity*UnitPrice)*(1-TaxRate) PERSISTED

The values in persisted computed columns are automatically re-calculated whenever any of the base values is changed.

This may seem very similar to what you can get using a User-Defined Function, as you learned in the previous chapter: in fact, you could decide to use a User-Defined Function to define a computed column if you want to. If you don't plan to use the User-Defined Function anywhere else, you may simply define the logic as an expression for the computed column, saving some code and effort.
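
To illustrate, here is a minimal sketch of that alternative, with a hypothetical scalar function wrapping the same profit logic:
CREATE FUNCTION Sales.ComputeProfit
     (@Quantity int, @UnitPrice decimal(18,2), @TaxRate decimal(18,3))
RETURNS decimal(18,2)
WITH SCHEMABINDING  -- needed (plus determinism) if the column should ever be PERSISTED
AS
BEGIN
     RETURN (@Quantity * @UnitPrice) * (1 - @TaxRate);
END;
GO
ALTER TABLE Sales.OrderLines
     ADD ProfitFromUdf AS Sales.ComputeProfit(Quantity, UnitPrice, TaxRate);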

Complex-type columns

Scalar-type columns are the standard and most used column types in relational databases. In most cases, you will be able to represent your domain model with normalized tables containing only scalar-type columns. However, in some scenarios, you will need to represent your application objects with some non-scalar types. Azure SQL enables you to store the following types in table columns:
  • XML columns enable you to store properly formatted XML documents.

  • Spatial columns contain common elements used to represent geographical and geometrical concepts such as points, lines, polygons, and so on.

  • CLR columns contain serialized binary representations of .NET objects.

These types are not just a binary serialization of their application counterparts. Azure SQL enables you to apply some methods that are specific for these types. The following code example shows a Nearest Neighbor query that returns five cities that are closest to the given point:
DECLARE @g geography = 'POINT(-95 45.5)';
SELECT TOP(5) Location.ToString(), CityName
FROM Application.Cities
ORDER BY Location.STDistance(@g) ASC;

Spatial columns have methods such as STDistance, which calculates the distance between the point and the location provided as an argument, making sure that the correct calculations are applied, for example, taking into account that the Earth is a spheroid and that there are several ways to project the Earth's surface onto a map. Not-so-easy trigonometric calculations are there for you to use for free!

Complex types in Azure SQL enable you to solve very specific problems that need special handling that goes beyond classic table types.

Azure SQL also enables you to store JSON documents; however, there is no special JSON type. Azure SQL uses the standard nvarchar type to store text in JSON format. This means that any programming language can work with it without needing special libraries.
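
A minimal sketch of this approach, using the built-in JSON functions (the table and property names are illustrative):
CREATE TABLE dbo.OrderEvents (
     EventID int PRIMARY KEY,
     Payload nvarchar(max),
     -- last line of defense: only well-formed JSON can be saved
     CONSTRAINT CK_OrderEvents_Payload CHECK (ISJSON(Payload) = 1)
);
SELECT EventID, JSON_VALUE(Payload, '$.status') AS Status
FROM dbo.OrderEvents;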

Declarative constraints

One of the golden rules of application development is that business rules should be implemented in the application layer. The purists may argue that pushing business logic into database Stored Procedures, Functions, and Triggers, or onto the client side, might cause issues and maintenance nightmares. We have already discussed this a bit, clearly describing the need to push compute to data, but let's set all the previously discussed reasons aside for a second and focus just on Business Rules. Business rules must stay in the application layer: is this always true? Let us look at some examples of business rules that you might need to implement:
  • You must ensure that the email address or username of a person is unique. Are you going to query a table to check if there is another user with the same email before you create the new one?

  • You must ensure that a product cannot be saved if it references some customer that does not exist. Are you going to read the customer table every time before you save the product just to ensure that nobody deleted the customer in the meantime?

  • When you delete a customer from the system, all additional information (like documents) should also be deleted, except orders and invoices, where you need to break the relationship with the customer and set a null value in the reference column. If the customer still has some active accounts, they should not be deleted. Are you going to query the Orders and Invoices tables from the application layer to ensure that they will not return any result before you delete the customer? This might be a non-trivial strategy to implement and thoroughly test.

Although you can implement these rules in your application, it is arguable whether you should. The business rules that describe data state and behavior on some actions can easily be declared in the database. Those business rules, in fact, are so closely tied to the data that they really are more a kind of data consistency rule. Surely, they are still part of the big family of “business rules”, but, as always, generalizing too much can be more harmful than beneficial. By applying these rules at the database level, you can be sure that the data will be consistent with the business rules and that the rules will be applied without a line of application code, following the golden principle of keeping things as simple as possible (but not simpler!) and also helping to establish a good and clear separation of concerns. Data is definitely a database concern. In addition, there is no extra performance impact on your action, because you do not need to run additional queries within the main action just to validate additional rules or implement side activities. So, you'll also get better performance. This should help you better understand to which degree business rules should be applied in the application layer and which should be applied in the database.

Azure SQL enables you to enhance your tables and declaratively define the rules that will ensure that the data is valid using the following constraints:
  • Primary keys enable you to specify that a column (or a set of columns) holds a unique identifier of the object.

  • Foreign keys specify that an object (table row) logically belongs to some other object or is somehow related to it. This is known as referential integrity, and it guarantees that the relationship between two rows bound by a foreign key and primary key cannot be accidentally broken.

  • Check constraints are the conditions that should be checked on every object. These constraints guarantee that some condition would always be true on the data that is saved in the table. You can implement these checks as a last line of defense, so even if your application code fails to check some condition or you have multiple services that can update the same data, you can be sure the database will guarantee that the condition must always be true.

  • Unique constraints guarantee that a value is unique in the column. They are almost like primary keys (a big difference is that they allow for null values) and can be used for values like usernames or email addresses.

Azure SQL enables you to declaratively specify these constraints without needing to implement them yourself. The following declarations describe the relations between a customer and their invoices and orders. These declarations also specify what action should be taken on related data once a customer is deleted. When someone deletes a customer, the first relationship will delete all the customer's invoices:
ALTER TABLE Sales.Invoices
ADD CONSTRAINT FK_Cust_Inv FOREIGN KEY (CustomerID)
    REFERENCES Sales.Customers (CustomerID) ON DELETE CASCADE;
The following declaration will stop the customer delete action if there is some order for the customer:
ALTER TABLE Sales.Orders
ADD CONSTRAINT FK_Cust_Orders FOREIGN KEY (CustomerID)
    REFERENCES Sales.Customers (CustomerID) ON DELETE NO ACTION;

With a few lines of code, you can easily implement business rules that might be hard to code (and especially test) in your application layer. Using these built-in constraints, you can rely on Azure SQL to maintain data integrity without any effort in your application logic.
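
For completeness, the other constraint types from the list above can be declared just as easily (the column names here are illustrative):
ALTER TABLE Sales.Customers
ADD CONSTRAINT UQ_Customers_Email UNIQUE (EmailAddress);
ALTER TABLE Sales.Orders
ADD CONSTRAINT CK_Orders_Quantity CHECK (Quantity > 0);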

Making your tables secure

Azure SQL enables you to associate fine-grained permissions with any user or login that can access your database. Let us imagine that you have a Sales microservice that manages information about sales orders. This microservice accesses the Azure SQL database using the login SalesMicroservice, which should be able to read from or insert into the Sales.Orders table but not update or delete data. It should just be able to read information about Customers and Invoices. For some custom actions, it might use only procedures placed in the Sales schema. A minimum set of permissions required for this microservice can be defined using the following rules:
GRANT SELECT, INSERT ON OBJECT::Sales.Orders TO SalesMicroservice
GRANT SELECT ON OBJECT::Sales.Customers TO SalesMicroservice
GRANT SELECT ON OBJECT::Sales.Invoices TO SalesMicroservice
GRANT EXECUTE ON SCHEMA::Sales TO SalesMicroservice

These security rules are easy to write and review. This security model is important in architectures such as Command Query Responsibility Segregation (CQRS), where you have separate processes that are allowed only to read or to update data. Assigning the minimum required set of permissions ensures that you are less vulnerable to various security attacks such as SQL Injection or unauthorized changes. The security rules in an Azure SQL database are the last line of defense of your data that should prevent data breaches even if there is some security hole in your application layer.

If you need to examine the permissions assigned to users, you can log in as the user or impersonate the user and check the permissions using the fn_my_permissions function:
EXECUTE AS USER = 'SalesMicroservice';
SELECT * FROM fn_my_permissions('Sales.Customers', 'OBJECT');
REVERT;

In addition to manual inspection, there is a Vulnerability Assessment tool built into any Azure SQL that automatically inspects permissions assigned to the users and makes recommendations.

Improve performance with indexes

A table, unless you create a clustered index on it (you'll learn about this in a few pages; for now, just keep in mind that by default a primary key also creates a clustered index behind the scenes), is a set of rows physically stored in an area similar to the heap memory in .NET, Java, and other modern programming languages or runtime environments. That kind of table is called a heap table in Azure SQL terminology.

A classic heap table can be useful in an extremely limited set of scenarios (like scanning small tables or appending rows). However, the most common operations in an Azure SQL database are finding a row by identifier (primary key), finding or updating a set of rows by some criterion, and so on. These common operations are extremely inefficient on a plain set of rows organized as a heap. As you will soon learn, even a plain scan of the entire table is more efficient if we use the special Columnstore format.

Azure SQL enables you to use indexes to enhance this basic structure and speed up the access to the rows in various scenarios. Indexes are the structures of row addresses that enable you to organize the rows in your table or build a structure that the queries can use more efficiently than a basic table. There are several kinds of indexes in Azure SQL:
  • B-tree indexes represent multi-level tree-like structures of pointers. These indexes enable you to easily locate one row or a set of rows without scanning the entire table.

  • Columnstore indexes organize data in columnar format instead of default row format. These structures, which will be described in the next chapter, are an excellent solution for speeding up reporting and analytic queries that scan large parts of the table.

  • Domain-specific indexes are special types of indexes built to efficiently support some non-standard structures or data types. Azure SQL supports Spatial indexes, Full-text indexes, and XML indexes. These indexes enable you to run non-standard filters, like checking containment of a coordinate within an area, evaluating XPath expressions, filtering based on the proximity of two words in a text, and so on.

B-Tree indexes are the most common structure that you will find and use in any Azure SQL database (and more generally in any database, relational or not). B-tree indexes and tables are so tightly coupled that we cannot talk about them separately. In Azure SQL terminology, they are called Hobits (HOBT – Heap or B-Tree), and they represent the most common structures in most databases.

Columnstore and domain-specific indexes are advanced concepts that will be explained in future chapters.

B-Tree indexes

A B-Tree index is a hierarchical structure of nodes divided by the values present in the table columns. It's very common in all relational and non-relational databases, as it is extremely efficient and computationally simple. It follows the Composite design pattern described in the famous Design Patterns book written by Erich Gamma et al. In a B-Tree index, every leaf node has references to table rows, and non-leaf nodes have references to other nodes. These multi-level tree-like structures enable you to easily locate one or a set of rows. Every B-Tree index has one or more so-called key columns. The values from these columns are used when searching the rows for desired values.

Figure 6-2 shows an index that divides pointers to rows using the Price column.
Figure 6-2. Dividing rows into groups using an index on the Price column

Under the root node, you have a set of groups representing the ranges of rows based on the value of the Price column. Each group might contain references to the rows with these values or references to other nodes with more granular ranges. Imagine that we need to find a set of rows with Price equal to 73. Starting from the root, we can locate the node that represents a range of references to the rows with Price values between 50 and 100. This node would lead us to the node that contains the references to the rows with Price column in the range between 70 and 100. This is the leaf node that contains the addresses of the rows, and we can search the list of references in this node to find the ones with desired value. The number of references in the leaf node is not big, so the rows are found quickly.

Leaf nodes in the index are lists of rows or row addresses, and these lists should not be big, to make sure performance is always optimal. If the number of references goes above a defined limit, Azure SQL will split the node into two nodes (an action known as a page split).

Leaf nodes might contain either the references (addresses) of table rows or the actual table rows themselves. Therefore, we have the following types of indexes:
  • Clustered indexes hold actual rows in their leaf nodes. When you navigate to the leaf node using some criterion, you will find the actual row. Clustered indexes physically organize table rows. Their structure defines how the rows would be physically organized. Since there can be only one structure of table rows, you can have only one clustered index per table.

  • Nonclustered indexes hold the addresses of the rows in their leaf nodes. When you navigate to the leaf node using some criterion, you will need one additional read to get the data from the actual row. Nonclustered indexes are the structures built on top of table rows as an additional secondary set of references that should enable you to locate rows by some criterion.

General advice is to add a clustered index on the primary key column, because finding rows by primary key is the most common filtering pattern and you want the most optimal index to cover this case. If you know that you have some other dominant pattern, you can choose another column for the clustered index (e.g., a foreign key or some date column).
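
A minimal sketch of this default choice (the constraint name is illustrative):
ALTER TABLE Sales.Customers
ADD CONSTRAINT PK_Customers PRIMARY KEY CLUSTERED (CustomerID);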

Nonclustered indexes are additional utility structures that optimize the queries that are common but not dominant. You can add several nonclustered indexes, and the query optimizer will choose the one that is the best fit for the query.

Let’s see in detail when to use what.

When to use B-Tree indexes?

B-Tree indexes are the most common type of index that you will find in Azure SQL databases. Their structure can help us in the following scenarios:
  • Index seek operations that locate a single row or a set of rows with the specified column values.

  • Range scan operations that locate a set of rows that have values in some range.

Indexes might be a helpful performance booster in queries like the following:
SELECT c.CustomerName, c.CreditLimit,
       o.CustomerPurchaseOrderNumber, o.DeliveryInstructions
FROM Sales.Customers AS c
INNER JOIN Sales.Orders AS o
ON c.CustomerID = o.CustomerID
WHERE c.PostalCityID = 42
Azure SQL would need to read the Customers table to find all customers with the PostalCityID value 42. Then, it would need to read the Orders table to find the orders whose CustomerID column matches the CustomerID values fetched in the first read. It might be very inefficient to scan entire tables just to fetch a few rows. Let us see what happens if we create the following indexes:
CREATE INDEX ix_Customer_PostalCityID
       ON Sales.Customers(PostalCityID);
CREATE INDEX ix_Orders_CustomerID
       ON Sales.Orders(CustomerID);

With an index on the Customers table that includes the PostalCityID column, Azure SQL will use the created index to quickly locate the Customers rows with the given PostalCityID. Then it will find the orders through the second index based on the CustomerID values. As you can imagine, on large tables this can be a huge performance benefit, not to mention the lower CPU, RAM, and I/O usage, which in turn means better scalability at lower costs.

When to create indexes?

People are sometimes tempted to optimize a database by adding various indexes whenever they believe they might help. Some people even believe that automatically created indexes on every column might be the holy grail of database design. However, you need to be aware of the price of indexes before you create them.

Indexes are not silver bullets that will always boost the performance of your queries. They might be a perfect solution in read-only workloads, because they give more choices for better execution plans, while their worst impact is just the additional space they occupy. However, for write operations, they represent a burden, because every index that contains an updated value must be updated together with the table. If you have a table with five indexes, there is a chance that every insert, delete, and many update statements will need to update not just the primary table but also all the indexes.

You should not add indexes blindly on every column or foreign key because this might hurt your database performance. Some generic guidelines that you can use to choose where to add the indexes are:
  • You can start by creating clustered indexes on primary key columns, because the most common operation on a table is finding a row by primary key value.

  • Tables might need nonclustered indexes on foreign keys because you might expect that queries would need to read all rows with a given foreign key value.

  • If you find some important query for your application that is slow, you might need to create an index on the columns that the query uses.

As the complexity of your database and solution increases, you may find a very common query that returns all the orders placed in the last day using the OrderDate column. In such cases, it may be useful to create the clustered index on that column, so all the orders placed on the same date will be physically close together. Remember, in fact, that the clustered index holds rows in its leaf level and that the rows are ordered by the columns on which the index is created. This means that Azure SQL can start to read from the first order placed in the day and can stop once the last one is found, reducing the reads to the minimum possible. Of course, as there is only one clustered index, you need to carefully evaluate this strategy based on your knowledge of how the application works and how users mostly use it.
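
A sketch of that alternative (in such a design, the primary key would have to be declared NONCLUSTERED, since a table can have only one clustered index):
CREATE CLUSTERED INDEX ix_Orders_OrderDate
       ON Sales.Orders (OrderDate);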

As you can see, determining the correct indexes is not an easy task. For this reason, there are various tools, like the Performance Dashboard in SQL Server Management Studio, the Database Tuning Advisor, or open source scripts like the Tiger Toolbox or the First Responder Kit (references to both in the links at the end of the chapter), that can provide a list of indexes that you might need to create. Always use tools to prove that you need an index before you create it. Over-indexing and rarely used indexes are two of the most common causes of performance issues.

Azure SQL has a set of intelligent features, like Automatic tuning, that provide recommendations about the indexes you might need to create. This is the best way to identify missing indexes, because the recommendations are powered by machine learning algorithms trained on a substantial number of databases in Azure SQL. Even better, you can let the Azure SQL Automatic tuning capability create indexes for you automatically. The main benefit of the Automatic tuning capability is that it verifies that the created index actually improves the performance of your queries. Azure SQL has built-in machine learning models that measure the performance of the database workload before and after index creation, identify the queries that are affected by the index, and check whether query performance improved or degraded. If the performance improvement is not big enough, or if the new index degrades performance, the change is reverted. In addition, Automatic tuning scans the usage of your indexes and recommends indexes that should be dropped. Another important benefit is that Automatic tuning combines index recommendations to provide the best indexes. As an example, if there are two recommendations, one to create an index on column A and another to create an index on columns A and B, Azure SQL will recommend only the second one, because it can also cover the queries that need only column A.

Maintaining the indexes

Indexes are updated whenever the data in the indexed columns changes. The freshly created index is a perfectly balanced tree that provides optimal performance. However, on an actively used database, the underlying data changes over time, making the index unbalanced and leaving some nodes half-populated. Index maintenance operations regenerate the indexes and return them to a better state. There are two maintenance operations that you can perform on indexes:
  • REBUILD is an operation that fully re-creates the index, drops the old structure, and re-creates a new perfectly balanced index. This operation guarantees that your index will be in the perfect shape at the cost of the resources that it uses.

  • REORGANIZE is an operation that reshuffles the data in parts of the index, trying to fill the unused space and balance parts of the index. This is a very lightweight operation that usually doesn't affect your application performance or consume a lot of resources, but on the other hand, it might not be able to fix all problems.

The SQL statement that re-creates the index is shown in the following example:
ALTER INDEX FK_Sales_Customers_DeliveryCityID ON Sales.Customers
      REBUILD WITH ( ONLINE = ON);
If you want to reorganize the index, you just replace the REBUILD keyword with REORGANIZE. One interesting option in this statement is ONLINE. There are two ways to rebuild the indexes:
  • OFFLINE is the default option that is the fastest way to rebuild your index, but it will block any query that might touch the data in indexed columns.

  • ONLINE enables your workload to change the data, since the rebuild will pick up any changes that are made in the meantime. This is a more flexible operation, but it might consume more resources than the OFFLINE version and might affect performance until it finishes.

These options are available only in REBUILD actions. REORGANIZE is always ONLINE.

The most important questions in index maintenance are which option to use and on which indexes. Rebuilding or reorganizing all indexes would just spend resources that should be used for queries, and most of the indexes might not even need any maintenance. The criterion commonly used to determine the health state of an index is fragmentation: the percentage of index pages that are out of order or only partially filled.

The following query returns the fragmentation of all indexes for a provided database and object:
DECLARE @db_id int = DB_ID('WideWorldImporters');
DECLARE @object_id int = OBJECT_ID('Sales.Customers');
SELECT index_id, partition_number, avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(@db_id, @object_id, NULL, NULL , 'LIMITED');

In the result of the query, you can find index_id, partition_number, and fragmentation, and you can then find the names of the fragmented indexes using system views such as sys.indexes. Notice the last parameter in the function – LIMITED. Investigating the fragmentation of indexes is not a lightweight operation, especially on large indexes, so you can specify whether you would like to get approximate results by scanning a limited part of the index, calculate statistics from a sample, or analyze the entire index.
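
As a sketch, the fragmentation results can be joined with sys.indexes to get the index names directly (the 5% cutoff is an illustrative threshold):
SELECT i.name, ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('Sales.Customers'), NULL, NULL, 'LIMITED') AS ps
INNER JOIN sys.indexes AS i
      ON ps.object_id = i.object_id AND ps.index_id = i.index_id
WHERE ps.avg_fragmentation_in_percent > 5;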

The action that you need to take on the index depends on the fragmentation. Some general rules of thumb are as follows:
  • If the fragmentation of the index is between 5% and 30%, use the REORGANIZE operation.

  • If fragmentation is greater than 30% and you can find some period when the table will not be used (a.k.a. a maintenance period), use the OFFLINE REBUILD operation.

  • If your table is frequently used and you cannot block the application workload, use the ONLINE REBUILD operation.

You can also use proven open source scripts that can help you identify candidates for maintenance and apply recommended actions. The most commonly used scripts for index maintenance are the Ola Hallengren scripts and the SQL Tiger Toolbox.

Resumability is another important index maintenance feature. If you cannot avoid a potentially expensive REBUILD action and you don't have a maintenance period, Azure SQL enables you to temporarily pause a rebuild and resume it later. This is an excellent choice in cases where you find that index maintenance is affecting the performance of your application and you need to hold off the index rebuild, let your application finish some job, and then continue with maintenance without losing the work that was already done.
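
A sketch of a resumable online rebuild, reusing the index from the earlier example:
ALTER INDEX FK_Sales_Customers_DeliveryCityID ON Sales.Customers
      REBUILD WITH (ONLINE = ON, RESUMABLE = ON, MAX_DURATION = 30 MINUTES);
-- pause it if it starts to get in the way of the workload...
ALTER INDEX FK_Sales_Customers_DeliveryCityID ON Sales.Customers PAUSE;
-- ...and continue later from where it stopped
ALTER INDEX FK_Sales_Customers_DeliveryCityID ON Sales.Customers RESUME;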

Retiring the indexes

Creating indexes is easy, but the more important question is “are they used, or are they causing more harm than benefit?” Over-indexing might cause big performance issues. Even if you have an ideal set of indexes, your application patterns and database workload will change over time, and some of the indexes may need to be removed because they are not used anymore. Eventually, you will need to retire and drop some indexes.

Experienced database administrators can easily identify unnecessary indexes by looking at system views and query execution plans, or by examining the number of logical reads with the STATISTICS IO option set. If deep expertise in this area is not one of your career goals, you can again rely on the benefits that Azure SQL provides. In addition to index creation recommendations, Azure SQL provides drop index recommendations. Azure SQL scans the usage of your indexes to identify unused indexes. It can also identify duplicate indexes that should be dropped. You can either take the scripts and drop the indexes yourself or let Azure SQL clean up the indexes for you. Azure SQL will not just drop and forget the index. Automatic tuning uses the same algorithm that is used for index creation and monitors whether the drop index operation decreased performance. Azure SQL will re-create the dropped index if it detects performance issues.

If you want to know more

The knowledge that you gained in this chapter enables you to professionally create proper database designs and understand the issues that you might find in databases. The concepts explained in this chapter represent the core knowledge required for database design. In the next chapter, we will go one step further – you will understand how to use some advanced concepts and types of tables and indexes that can improve the performance of your database.

To know more about the topics mentioned in this chapter, you can start from these resources: