Physical Model Implementation Case Study
The whole difference between construction and creation is exactly this: that a thing constructed can only be loved after it is constructed; but a thing created is loved before it exists.
—Gilbert K. Chesterton, writer and philosopher
In some respects, the hardest part of the database project is when you actually start to create code. If you really take the time to do the design well, you begin to get attached to the design, largely because you have created something that has not existed before. Once the normalization task is complete, you have pretty much everything ready for implementation, but a few tasks still must be performed to complete the transformation from the logical model to the physical, relational model. We are now ready for the finishing touches that will turn the designed model into something that users (or at least developers) can start using. At a minimum, between normalization and actual implementation, take plenty of time to review the model to make sure you are completely happy with it.
In this chapter, we’ll take the normalized model and convert it into the final blueprint for the database implementation. Even starting from the same logical model, different people tasked with implementing the relational database will take a subtly (or even dramatically) different approach to the process. The final physical design will always be, to some extent, a reflection of the person/organization who designed it, although usually the reasonable solutions “should” resemble one another at their core.
The model we have discussed so far in the book is pretty much implementation agnostic and unaffected by whether the final implementation would be on Microsoft SQL Server, Microsoft Access, Oracle, Sybase, or any other relational database management system. (You should expect a lot of changes if you end up implementing with a nonrelational engine, naturally.) However, during this stage, in terms of the naming conventions that are defined, the datatypes chosen, and so on, the design is geared specifically for implementation on SQL Server 2016. Each of the relational engines has its own intricacies and quirks, so it is helpful to understand how to implement on the system you are tasked with. In this book, we will stick with SQL Server 2016, noting where you would need to adjust if using one of the recent earlier versions of SQL Server, such as SQL Server 2012 or SQL Server 2014.
We will go through the following steps to transform the database from a blueprint into an actual functioning database:
Finally, we’ll work on a complete (if really small) database example in this chapter, rather than continue with any of the examples from previous chapters. The example database is tailored to keeping the chapter simple and to avoiding difficult design decisions, which we will cover in the next few chapters.
Note For this and subsequent chapters, I’ll assume that you have SQL Server 2016 installed on your machine. For the purposes of this book, I recommend you use the Developer edition, which is (as of printing time) available for free as a part of the Visual Studio Dev Essentials from www.visualstudio.com/products/visual-studio-dev-essentials. The Developer Edition gives you all of the functionality of the Enterprise edition of SQL Server for developing software, which is considerable. (The Enterprise Evaluation Edition will also work just fine if you don’t have any money to spend. Bear in mind that licensing changes are not uncommon, so your mileage may vary. In any case, there should be a version of SQL Server available to you to work through the examples.)
Another possibility is Azure SQL Database (https://azure.microsoft.com/en-us/services/sql-database/), which I will also mention from time to time. Features are added to Azure SQL Database constantly, faster than a book can keep up with, and it will often get features before the boxed product that I will focus on. With the downloads for this book, I will provide scripts for this chapter that will run on Azure SQL Database; most other examples will run on Azure SQL Database as well.
The main example in this chapter is based on a simple messaging database that a hypothetical company is building for its hypothetical upcoming conference. Any similarities to other systems are purely coincidental, and the model is specifically created not to be overly functional but to be very, very small. The following are the simple requirements for the database:
Figure 6-1 shows the logical database design for this application, on which I’ll base the physical design.
Figure 6-1. Simple logical model of conferencing message database
The following is a brief documentation of the tables and columns in the model. To keep things simple, I will expound on each need individually as we get to it.
UserHandle: The name the user wants to be known as. Initially preloaded with a value based on the person’s first and last name, plus an integer value, changeable by the user.
Choosing a Physical Model for Your Tables
Prior to SQL Server 2014, there was simply one relational database engine housed in SQL Server. Every table worked the same way (and we liked it, consarn it!). In 2014, a second engine, in-memory OLTP (also referred to as “memory optimized”), was introduced; it works very differently than the original engine internally but, for SQL programmers, works in basically the same, declarative manner. In this section I will introduce some of the differences at a high level and will cover them in more detail in later chapters. For this chapter, I will also provide a script that builds the tables and code as much as possible using the in-memory engine, to show some of the differences.
Briefly, there are two engines that you can choose for your objects:
What is awesome about the engine choice is that you make the setting at a table level, so you can have tables in each model, and those tables can interact in joins, as well as interpreted T-SQL (as opposed to natively compiled T-SQL objects, which will be noted later in this section). The in-memory engine is purpose-built for much higher performance scenarios than the on-disk engine can handle, because it does not use blocking operations for concurrency control. It has two major issues for most common immediate usage:
These differences mean that in-memory is not a simple “go faster” button.
Since objects using both engines can reside in the same database, you can take advantage of both as it makes sense. For example, you could have a table of products that uses the on-disk model because it is a very large table with very little write contention, but the table of orders and their line items may need to support tens of thousands of write operations per second. This is just one simple scenario where it may be useful for you to use the in-memory model.
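To make the scenario concrete, the following is a minimal sketch of the two engines coexisting in one database. The table and column names are hypothetical and not part of this chapter’s example model, and a memory-optimized table additionally requires the database to have a MEMORY_OPTIMIZED_DATA filegroup:

```sql
--An on-disk table; this is the default behavior
CREATE TABLE dbo.Product
(
    ProductId   int NOT NULL CONSTRAINT PKProduct PRIMARY KEY,
    ProductName nvarchar(60) NOT NULL
);

--A memory-optimized table for the high-contention order line items
CREATE TABLE dbo.OrderLineItem
(
    OrderLineItemId int NOT NULL
        CONSTRAINT PKOrderLineItem PRIMARY KEY
            NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    ProductId       int NOT NULL,
    Quantity        int NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
```

The two tables can then be joined in ordinary interpreted T-SQL, which is what makes mixing the models practical.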
In addition to tables being in-memory, there are stored procedures, functions, and triggers that are natively compiled at create time (written using T-SQL DDL, compiled to native code) that can reference the in-memory objects (but not on-disk ones). They are limited in programming language surface but can certainly be worth it for many scenarios. The implementation in SQL Server 2016 is less limited than that of 2014, but both support far less syntax than normal interpreted T-SQL objects.
There are several scenarios in which you might apply the in-memory model. The Microsoft In-Memory OLTP web page (msdn.microsoft.com/en-us/library/dn133186.aspx) recommends it for the following:
So when you are plotting out your physical database design, it is useful to consider which mode is needed for your design. In most cases, the on-disk model will be the one that you will want to use (and if you are truly unsure, start with on-disk and adjust from there). It has supported some extremely large data sets in a very efficient manner and will handle an amazing amount of throughput when you design your database properly. However, as memory becomes cheaper, and the engine approaches coding parity with the interpreted model we have had for 20 years, this model may become the default model. For now, databases that need very high throughput (for example, like ticket brokers when the Force awakened) are going to be candidates for employing this model.
If you start out with the on-disk model for your tables and during testing determine that there are hot spots in your design that could use in-memory, SQL Server provides tools to help you decide if the in-memory engine is right for your situation, which are described in the page named “Determining if a Table or Stored Procedure Should Be Ported to In-Memory OLTP” (https://msdn.microsoft.com/en-us/library/dn205133.aspx).
The bottom line at this early part of the implementation chapters of the book is that you should be aware that there are two query processing models available to you as a data architect and should understand their basic differences, so that when those differences are explained in more depth throughout the rest of the book, you will have the fundamentals in hand.
The examples in this chapter will be done solely in the on-disk model, because it would be the most typical place to start, considering the requirements will be for a small user community. If the user community was expanded greatly, the in-memory model would likely be useful, at least for the highest-contention areas such as creating new messages.
Note Throughout the book there will be small asides to how things may be affected if you were to employ in-memory tables, but the subject is not a major one throughout the book. The topic is covered in greater depth in Chapter 10, “Index Structures and Application,” because the internals of how these tables are indexed is important to understand; Chapter 11, “Matters of Concurrency,” because concurrency is the biggest difference; and somewhat again in Chapter 13, “Architecting Your System,” as we will discuss a bit more about when to choose the engine, and coding using it.
Choosing Names
The target database for our model is (obviously) SQL Server, so our table and column naming conventions must adhere to the rules imposed by this database system, as well as being consistent and logical. In this section, I’ll briefly cover some of the different concerns when naming tables and columns. All of the system constraints on names have been the same for the past few versions of SQL Server, going back to SQL Server 7.0.
Names of columns, tables, procedures, and so on are referred to technically as identifiers. Identifiers in SQL Server are stored in a system datatype of sysname, which is defined as a 128-character (or less, of course) string using Unicode characters. SQL Server’s rules for identifiers consist of two distinct naming methods:
If you need to put a closing bracket (]) or even a double quote character in the name, you have to include two closing brackets (]]), just like when you need to include a single quote within a string. So, the name fred]olicious would have to be delimited as [fred]]olicious]. However, if you find yourself needing to include special characters of any sort in your names, take a good long moment to consider whether you really do need this (or if you need to consider alternative employment opportunities). If you determine after some thinking that you do, please ask someone else for help naming your objects, or e-mail me at [email protected]. This is a pretty horrible thing to do to your fellow human and will make working with your objects very cumbersome. Even just including space characters is a bad enough practice that you and your users will regret for years. Note too that [name] and [name ] are treated as different names in some contexts (see the embedded space), as will [ name].
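A quick illustration of the two identifier styles (these names exist purely for demonstration):

```sql
--a regular identifier; no delimiters needed
CREATE TABLE dbo.NormalName (NormalNameId int NOT NULL);

--legal, but regrettable: a table literally named fred]olicious
CREATE TABLE dbo.[fred]]olicious] ([fred]]oliciousId] int NOT NULL);

--every subsequent reference now needs the delimiters
SELECT [fred]]oliciousId] FROM dbo.[fred]]olicious];
```

The second table works, but every person and every tool that touches it pays the delimiter tax forever.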
Note Using policy-based management, you can create naming standard checks for whenever a new object is created. Policy-based management is a management tool rather than a design one, though it could pay to create naming standard checks to make sure you don’t accidentally create objects with names you won’t accept. In general, I find doing things that way too restrictive, because there are always exceptions to the rules and automated policy enforcement only works with a dictator’s hand. (Have you met Darth Vader, development manager? He is nice!)
While the rules for creating an object name are pretty straightforward, the more important question is, “What kind of names should be chosen?” The answer I generally give is: “Whatever you feel is best, as long as others can read it and it follows the local naming standards.” This might sound like a cop-out, but there are more naming standards than there are data architects. (On the day this paragraph was first written, I actually had two independent discussions about how to name several objects and neither person wanted to follow the same standard.) The standard I generally go with is the standard that was used in the logical model, that being Pascal-cased names, little if any abbreviation, and as descriptive as necessary. With space for 128 characters, there’s little reason to do much abbreviating. A Pascal-cased name is of the form PartPartPart, where words are concatenated with nothing separating them. Camel-cased names do not start with a capital letter, such as partPartPart.
Caution Because most companies have existing systems, it’s a must to know the shop standard for naming objects so that it matches existing systems and so that new developers on your project will be more likely to understand your database and get up to speed more quickly. The key thing to make sure of is that you keep your full logical names intact for documentation purposes.
As an example, let’s consider the name of the UserConnection table we will be building later in this chapter. The following list shows several different ways to build the name of this object:
One specific place where abbreviations do make sense is when the abbreviation is very standard in the organization. As an example, if you were writing a purchasing system and you were naming a purchase-order table, you could name the object PO, because this is widely understood. Often, users will desire this, even if some abbreviations don’t seem that obvious. Just be 100% certain, so you don’t end up with PO also representing disgruntled customers along with purchase orders.
Choosing names for objects is ultimately a personal choice but should never be made arbitrarily and should be based first on existing corporate standards, then existing software, and finally legibility and readability. The most important thing to try to achieve is internal consistency. Your goal as an architect is to ensure that your users can use your objects easily and with as little thinking about structure as possible. Even most pretty bad naming conventions will be better than having ten different good ones being implemented by warring architect/developer factions.
A particularly hideous practice that is somewhat common with people who have grown up working with procedural languages (particularly interpreted languages) is to include something in the name to indicate that a table is a table, such as tblSchool or tableBuilding. Please don’t do this (really…I beg you). It’s clear by the context what is a table. This practice, just like the other Hungarian-style notations, makes good sense in a procedural programming language where the type of object isn’t always clear just from context, but this practice is never needed with SQL tables. Note that this dislike of prefixes is just for names that are used by users. We will quietly establish prefixes and naming patterns for non-user-addressable objects as the book continues.
Note There is something to be said about the quality of corporate standards as well. If you have an archaic standard, like one that was based on the mainframe team’s standard back in the 19th century, you really need to consider trying to change the standards when creating new databases so you don’t end up with names like HWWG01_TAB_USR_CONCT_T just because the shop standards say so (and yes, I do know when the 19th century was).
The naming rules for columns are the same as for tables as far as SQL Server is concerned. As for how to choose a name for a column—again, it’s one of those tasks for the individual architect, based on the same sorts of criteria as before (shop standards, best usage, and so on). This book follows this set of guidelines:
Some example names might be:
Many possible classwords could be used, and this book is not about giving you all the standards to follow at that level. Too many variances from organization to organization make that too difficult. The most important thing is that if you can establish a standard, make it work for your organization and follow it.
Note Just as with tables, avoid prefixes like col to denote a column as it is a really horrible practice.
I’ll use the same naming conventions for the implementation model as I did for the logical model: Pascal-cased names with a few abbreviations (mostly in the classwords, like “id” for “identifier”). Later in the book I will use a Hungarian-style prefix for objects other than tables, such as constraints, and for coded objects, such as procedures. This is mostly to keep the names unique and avoid clashes with the table names, plus it is easier to read in a list that contains multiple types of objects (the tables are the objects with no prefixes). Tables and columns are commonly used directly by users. They write queries and build reports directly using database object names and shouldn’t need to change the displayed name of every column and table.
In our demonstration model, the first thing we will do is to rename the User table to MessagingUser because “User” is a SQL Server reserved keyword. While User is the more natural name than MessagingUser, it is one of the trade-offs we have to make because of the legal values of names. In rare cases, when an unsuitable name can’t be created, I may use a bracketed name, but even if it took me four hours to redraw graphics and undo my original choice of User as a table name, I don’t want to give you that as a typical practice. If you find you have used a reserved keyword in your model (and you are not writing a chapter in a book that is 80+ pages long about it), it is usually a very minor change.
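The trade-off can be sketched as follows (the columns here are simplified placeholders):

```sql
--using the reserved keyword forces delimiters on every reference
CREATE TABLE dbo.[User] (UserId int NOT NULL);
SELECT UserId FROM dbo.[User];

--renaming the table avoids the delimiters entirely
CREATE TABLE dbo.MessagingUser (MessagingUserId int NOT NULL);
SELECT MessagingUserId FROM dbo.MessagingUser;
```

Both work, but the renamed version keeps queries clean for everyone who uses the table afterward.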
In the model snippet in Figure 6-2, I have made that change.
Figure 6-2. Table User has been changed to MessagingUser
The next change we will make will be to a few of the columns in this table. We will start off with the TypeOfAttendee column. The standard we discussed was to use a classword at the end of the column name. In this case, Type will make an acceptable classword; when you see AttendeeType, it will be clear what it means. The implementation will be a value of up to 20 characters.
The second change will be to the AccessKey column. Key itself would be acceptable as a classword, but it will give the implication that the value is a key in the database (a standard I have used in my data warehousing dimensional database designs). So suffixing Value to the name will make the name clearer and distinctive. Figure 6-3 reflects the change in name.
Figure 6-3. MessagingUser table after change to AccessKey column name
The next step in the process is to choose how to implement the keys for the table. In the model at this point, it has one key identified for each table, in the primary key. In this section, we will look at the issues surrounding key choice and, in the end, will set the keys for the demonstration model. We will look at choices for implementing primary keys and then note the choices for creating alternate keys as needed.
Primary Key
Choosing the style of implementation for primary keys is an important choice. Depending on the style you go with, the look and feel of the rest of the database project will be affected, because whatever method you go with, the primary key value will be migrated to other tables as a reference to the particular row. Choosing a primary key style is one of the most argued about topics on the forums and occasionally over dinner after a SQL Saturday event. In this book, I’ll be reasonably agnostic about the whole thing, and I’ll present several methods for choosing the implemented primary key throughout the book. In this chapter, I will use a very specific method for all of the tables, of course.
Presumably, during the logical phase, you’ve identified the different ways to uniquely identify a row. Hence, there should be several choices for the primary key, including the following:
Each of these choices has pros and cons for the implementation. I’ll look at them in the following sections.
Basing a Primary Key on Existing Columns
In many cases, a table will have an obvious, easy-to-use primary key. This is especially true when talking about independent entities. For example, take a table such as product. It would often have a productNumber defined. A person usually has some sort of identifier, either government or company issued. (For example, my company has an employeeNumber that I have to put on all documents, particularly when the company needs to write me a check.)
The primary keys for dependent tables can often take the primary key of the independent tables, add one or more attributes, and—presto!—primary key.
For example, I used to have a Ford SVT Focus, made by the Ford Motor Company, so to identify this particular model, I might have a row in the Manufacturer table for Ford Motor Company (as opposed to GM, for example). Then, I’d have an automobileMake with a key of manufacturerName = ’Ford Motor Company’ and makeName = ’Ford’ (as opposed to Lincoln), style = ’SVT’, and so on, for the other values. This can get a bit messy to deal with, because the key of the automobileModelStyle table would be used in many places to describe which products are being shipped to which dealership. Note that this isn’t about the size in terms of the performance of the key, just the number of values that make up the key. Performance will also be better the smaller the key is, but that depends not only on the number of columns but also on the size of the value or values that make up the key. Using three 2-byte values could be better than one 15-byte key, though it is a lot more cumbersome to join on three columns.
Note that the complexity in a real system such as this would be compounded by the realization that you have to be concerned with model year, possibly body style, different prebuilt packages of options, and so on. The key of the table may frequently have many parts, particularly in tables that are the child of a child of a child, and so on.
Basing a Primary Key on a New, Surrogate Value
The other common key style is to use only a single column for the primary key, regardless of the size of the other keys. In this case, you’d specify that every table will have a single artificially generated primary key column and implement alternate keys to protect the uniqueness of the natural keys in your tables, as shown in Figure 6-4.
Figure 6-4. Single-column key example
Note that in this scenario, all of your relationships will be implemented in the database as nonidentifying type relationships, though you will implement them to all be required values (no NULLs). Functionally, this is the same as if the parentKeyValue was migrated from parent through child and down to grandChild, though it makes it harder to see in the model.
In the model in Figure 6-4, the most important thing you should notice is that each table has not only the primary key but also an alternate key. The term “surrogate” has a very specific meaning, even outside of computer science, and that is that it serves as a replacement. So the surrogate key for the parent object of parentKeyValue can be used as a substitute for the defined key, in this case otherColumnsForAltKey.
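In DDL, the parent/child portion of the Figure 6-4 pattern might be sketched like this (the datatypes are placeholders I have chosen purely for illustration):

```sql
CREATE TABLE dbo.parent
(
    parentKeyValue int IDENTITY(1,1) NOT NULL
        CONSTRAINT PKparent PRIMARY KEY,
    otherColumnsForAltKey varchar(10) NOT NULL
        CONSTRAINT AKparent UNIQUE
);

CREATE TABLE dbo.child
(
    childKeyValue int IDENTITY(1,1) NOT NULL
        CONSTRAINT PKchild PRIMARY KEY,
    parentKeyValue int NOT NULL  --required, but nonidentifying
        CONSTRAINT FKchild$ref$parent REFERENCES dbo.parent (parentKeyValue),
    childAltKeyColumn varchar(10) NOT NULL,
    CONSTRAINT AKchild UNIQUE (parentKeyValue, childAltKeyColumn)
);
```

Note how each table carries both its surrogate primary key and a UNIQUE constraint standing in for the natural key.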
This method does have some useful advantages:
There are also disadvantages to this method, such as always having to join to a table to find out the meaning of the surrogate key value. In our example table in Figure 6-4, you would have to join from the grandChild table through the child table to get key values from parent. Another issue is that some parts of the self-documenting nature of relationships are obviated, because using only single-column keys eliminates the obviousness of all identifying relationships. So in order to know that the logical relationship between parent and grandChild is identifying, you will have to trace the relationship and look at the uniqueness constraints and foreign keys carefully.
Assuming you have chosen to use a surrogate key, the next choice is to decide what values to use for the key. Let’s look at two methods of implementing these keys, either by deriving the key from some other data or by using a meaningless surrogate value.
A popular way to define a primary key is to simply use a meaningless surrogate key like we’ve modeled previously, such as using a column with the IDENTITY property or generated from a SEQUENCE object, which automatically generates a unique value. In this case, you rarely let the user have access to the value of the key but use it primarily for programming.
It’s exactly what was done for most of the entities in the logical models worked on in previous chapters: simply employing the surrogate key while we didn’t know what the actual value for the primary key would be. This method has one nice property:
You never have to worry about what to do when the key value changes.
Once the key is generated for a row, it never changes, even if all the data changes. This is an especially nice property when you need to do analysis over time. No matter what any of the other values in the table have been changed to, as long as the row’s surrogate key value (as well as the row) represents the same thing, you can still relate it to its usage in previous times. (This is something you have to be clear about with the DBA/programming staff as well. Sometimes, they may want to delete all data and reload it, but if the surrogate changes, your link to the unchanging nature of the surrogate key is likely broken.) Consider the case of a row that identifies a company. If the company is named Bob’s Car Parts and it’s located in Topeka, Kansas, but then it hits it big, moves to Detroit, and changes the company name to Car Parts Amalgamated, only one row is touched: the row where the name is located. Just change the name, address, etc. and it’s done. Keys may change, but not primary keys. Also, if the method of determining uniqueness changes for the object, the structure of the database needn’t change beyond dropping one UNIQUE constraint and adding another.
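The two common ways to generate such a meaningless value look like the following sketch (the Company tables here are hypothetical stand-ins, not part of the chapter’s model):

```sql
--option 1: the IDENTITY property
CREATE TABLE dbo.Company
(
    CompanyId   int IDENTITY(1,1) NOT NULL
        CONSTRAINT PKCompany PRIMARY KEY,
    CompanyName varchar(60) NOT NULL
        CONSTRAINT AKCompany UNIQUE
);

--option 2: a SEQUENCE object used as a default
CREATE SEQUENCE dbo.OtherCompanyIdSequence AS int
    START WITH 1 INCREMENT BY 1;

CREATE TABLE dbo.OtherCompany
(
    OtherCompanyId int NOT NULL
        CONSTRAINT DFLTOtherCompanyId
            DEFAULT (NEXT VALUE FOR dbo.OtherCompanyIdSequence)
        CONSTRAINT PKOtherCompany PRIMARY KEY,
    CompanyName varchar(60) NOT NULL
        CONSTRAINT AKOtherCompany UNIQUE
);
```

A SEQUENCE is shared across tables and callers, which gives a bit more flexibility than IDENTITY at the cost of a little more setup.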
Using a surrogate key value doesn’t in any way prevent you from creating additional single part keys, like we did in the previous section. In fact, it generally demands it. For most tables, having a small code value is likely going to be a desired thing. Many clients hate long values, because they involve “too much typing.” For example, say you have a value such as “Fred’s Car Mart.” You might want to have a code of “FREDS” for it as the shorthand value for the name. Some people are even so programmed by their experiences with ancient database systems that had arcane codes that they desire codes such as “XC10” to refer to “Fred’s Car Mart.”
In the demonstration model, I set all of the keys to use natural keys based on how one might do a logical model, so in a table like MessagingUser in Figure 6-5, it uses a key of the entire handle of the user.
Figure 6-5. MessagingUser table before changing model to use surrogate key
This value is the most logical, but this name, based on the requirements (current and future), can change. Changing this to a surrogate value will make it easier to make the name change and not have to worry about existing data in this table and related tables. Making this change to the model results in the change shown in Figure 6-6, and now, the key is a value that is clearly recognizable as being associated with the MessagingUser, no matter what the uniqueness of the row may be. Note that I made the UserHandle an alternate key as I switched it from primary key.
Figure 6-6. MessagingUser table after changing model to use surrogate key
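In DDL terms, the change shown in Figure 6-6 would eventually look something like the following sketch. The datatypes are assumptions at this stage of the process, since implementation details are deliberately being deferred:

```sql
CREATE TABLE dbo.MessagingUser
(
    MessagingUserId int IDENTITY(1,1) NOT NULL
        CONSTRAINT PKMessagingUser PRIMARY KEY,
    UserHandle      varchar(20) NOT NULL
        CONSTRAINT AKMessagingUser_UserHandle UNIQUE,
    AttendeeNumber  char(8) NOT NULL
        CONSTRAINT AKMessagingUser_AttendeeNumber UNIQUE
);
```

The point to notice is that UserHandle did not disappear; it simply moved from being the primary key to being an enforced alternate key.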
Next up, we will take a look at the Message table shown in Figure 6-7. Note that the two columns that were named UserHandle and SentToUserHandle have had their role names changed to indicate the change in names from when the key of MessagingUser was UserHandle.
Figure 6-7. Message table before changing model to use surrogate key
We will transform this table to use a surrogate key by moving all three columns to nonkey columns, placing them in a uniqueness constraint, and adding the new MessageId column. Notice, too, in Figure 6-8 that the table is no longer modeled with rounded corners, because the primary key no longer is modeled with any migrated keys in the primary key.
Figure 6-8. Message table after changing model to use surrogate key
Having a common pattern for every table is useful for programming with the tables as well. Because every table has a single-column key that isn’t updatable and is the same datatype, it’s possible to exploit this in code, making code generation a far more straightforward process. Note once more that nothing should be lost when you use surrogate keys, because a surrogate of this style only stands in for an existing natural key. Many of the object relational mapping (ORM) tools that are popular (if controversial in the database community) require a single-column integer key as their primary implementation pattern. I don’t favor forcing the database to be designed in any manner to suit client tools, but sometimes, what is good for the database is the same as what is good for the tools, making for a relatively happy ending, at least.
By implementing tables using this pattern, I’m covered in two ways: I always have a single primary key value, but I always have a key that cannot be modified, which eases the difficulty for loading a secondary copy like a data warehouse. No matter the choice of human-accessible key, surrogate keys are the style of key that I use for nearly all tables in databases I create (and always for tables with user-modifiable data, which I will touch on when we discuss “domain” tables later in this chapter). In Figure 6-9, I have completed the transformation to using surrogate keys.
Figure 6-9. Messaging Database Model progression after surrogate key choices
Keep in mind that I haven’t specified any sort of implementation details for the surrogate key at this point, and clearly, in a real system, I would already have done this during the transformation. For this chapter example, I am using a deliberately detailed process to separate each individual step, so I will put off that discussion until the DDL section of this book, where I will present code to deal with this need along with creating the objects.
Alternate Keys
In the model so far, we have already identified alternate keys as part of the model creation (MessagingUser.AttendeeNumber was our only initial alternate key), but I wanted to just take a quick stop on the model and make it clear in case you have missed it. Every table should have a minimum of one natural key; that is, a key that is tied to the meaning of what the table is modeling. This step in the modeling process is exceedingly important if you have chosen to do your logical model with surrogates, and if you chose to implement with single part surrogate keys, you should at least review the keys you specified.
A primary key that’s manufactured or even meaningless in the logical model shouldn’t be your only defined key. One of the ultimate mistakes made by people using such keys is to ignore the fact that two rows whose only difference is a system-generated value are not different. With only an artificially generated value as your key, it becomes more or less impossible to tell one row from another.
For example, take Table 6-1, a snippet of a Part table, where PartID is an IDENTITY column and is the primary key for the table.
Table 6-1. Sample Data to Demonstrate How Surrogate Keys Don’t Make Good Logical Keys
| PartID | PartNumber | Description |
|---|---|---|
| 1 | XXXXXXXX | The X part |
| 2 | XXXXXXXX | The X part |
| 3 | YYYYYYYY | The Y part |
How many individual items are represented by the rows in this table? There seem to be three, but are the rows with PartIDs 1 and 2 actually the same row, duplicated? Or are they two different rows that should be unique but were keyed in incorrectly? At every step along the way, you need to ask whether a human being could pick a desired row from the table without knowledge of the surrogate key. This is why there should be a key of some sort on the table to guarantee uniqueness, in this case likely on PartNumber.
Caution As a rule, each of your tables should have a natural key that means something to the user and that can uniquely identify each row in your table. In the very rare event that you cannot find a natural key (perhaps, for example, in a table that provides a log of events that could occur in the same .000001 of a second), then it is acceptable to make up some artificial key, but usually, it is part of a larger key that helps you tell two rows apart.
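The fix for the Part table in Table 6-1 is simply a uniqueness constraint on the natural key. A minimal sketch follows; the table and constraint names are my own assumptions, not part of any real schema:

```sql
-- Declare the natural key alongside the surrogate; the UNIQUE constraint
-- makes it impossible to insert both rows 1 and 2 from Table 6-1.
CREATE TABLE dbo.Part
(
    PartID      int IDENTITY(1,1) NOT NULL
        CONSTRAINT PKPart PRIMARY KEY,
    PartNumber  varchar(10) NOT NULL
        CONSTRAINT AKPart_PartNumber UNIQUE,
    Description varchar(100) NOT NULL
);
```

With the alternate key in place, the surrogate remains convenient for joins while PartNumber carries the real identity of the row.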
In a well-designed logical model, there should be nothing left to do at this point with respect to keys that protect the uniqueness of the data from a requirements standpoint. The architect (probably you) has already determined some manner of uniqueness that can be implemented. For example, in Figure 6-10, a MessagingUser row can be identified by either the UserHandle or the AttendeeNumber.
Figure 6-10. MessagingUser table for review
A bit more interesting is the Message table, shown in Figure 6-11. Its key comprises the RoundedMessageTime (the message time, rounded to the hour), the text of the message, and the UserId.
Figure 6-11. Message table for review
In the business rules, it was declared that a user could not post the same message more than once an hour. Constraints such as this are not easy to implement directly, but breaking the rule down to the data needed to enforce it makes things easier. In our case, putting a key on the message text, the user, and the time rounded to the hour (which we will find some way to implement later in the process) makes configuring the structures quite easy.
Of course, by putting this key on the table, if the UI sends the same data twice, an error will be raised when a duplicate message is sent. This error will need to be dealt with at the client side, typically by translating the error message to something nicer.
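Expressed in T-SQL, the business rule becomes an alternate key on those three columns. The sketch below assumes hypothetical schema and column names; the exact names are settled later in the chapter:

```sql
-- One row per user, per message text, per hour bucket; inserting a
-- duplicate within the same hour violates the constraint.
ALTER TABLE Messages.Message
    ADD CONSTRAINT AKMessage_TimeUserAndText
        UNIQUE (RoundedMessageTime, MessagingUserId, Text);
```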
The last table I will cover here is the MessageTopic table, shown in Figure 6-12.
Figure 6-12. MessageTopic table for review
What is interesting about this table is the optional UserDefinedTopicName value. Later, when we create this table, we will load some seed data indicating which TopicId value is ‘UserDefined’, meaning the UserDefinedTopicName column can be used; I will use a surrogate key value of 0 for that row. Along with this seed data, the table will get a CHECK constraint, based on whether the TopicId value represents the user-defined topic, to make sure that all data fits the required criteria.
At this point, to review, we have the model at the point in Figure 6-13.
Figure 6-13. Messaging model for review
Determining Domain Implementation
In logical modeling, the concept of domains is used to specify a template for datatypes and column properties that are used over and over again. In physical modeling, domains are used to choose the datatype to use and give us a guide as to the validations we will need to implement.
For example, in the logical modeling phase, domains are defined for such columns as name and description, which occur regularly across a database/enterprise. The reason for defining domains might not have been completely obvious at the time of logical design (it can seem like work to be a pompous data architect, rather than a programmer), but it becomes clear during physical modeling if it has been done well up front. During implementation domains serve several purposes:
Domains aren’t a requirement of logical or physical database design, nor does SQL Server actually make it easy for you to use them, but even if you just use them in a spreadsheet or design tool, they can enable easy and consistent design and are a great idea. Of course, consistent modeling is always a good idea regardless of whether you use a tool to do the work for you. I personally have seen a particular column type implemented in four different ways in five different columns when proper domain definitions were not available. So, tool or not, having a data dictionary that identifies columns that share a common type by definition is extremely useful.
For example, for the TopicName domain that’s used often in the Topic and MessageTopic tables in our ConferenceMessage model, the domain may have been specified by the contents of Table 6-2.
Table 6-2. Sample Domain: TopicName
| Property | Setting |
|---|---|
| Name | TopicName |
| Optional | No |
| Datatype | Unicode text, 30 characters |
| Value Limitations | Must not be empty string or only space characters |
| Default Value | n/a |
I’ll defer the CHECK constraint and DEFAULT bits until later in this chapter, where I discuss implementation in more depth. Several tables will have a TopicName column, and you’ll use this template to build every one of them, which will ensure that every time you build one of these columns it will have a type of nvarchar(30). Note that we will discuss data types and their usages later in this chapter.
A second domain that is used very often in our model is SurrogateKey, shown in Table 6-3.
Table 6-3. Sample Domain: SurrogateKey
| Property | Setting |
|---|---|
| Name | SurrogateKey |
| Optional | When used for primary key, not optional, typically auto-generated. When used as a nonkey, foreign key reference, optionality determined by utilization in the relationship. |
| Datatype | int |
| Value Limitations | N/A |
| Default Value | N/A |
This domain is a bit different, in that it will be implemented exactly as specified for a primary key attribute, but when it is migrated for use as a foreign key, some of the properties will change. First, if using identity columns for the surrogate, the migrated key won’t have the IDENTITY property set. Second, an optional relationship will allow NULLs in the migrated key, whereas the primary key will not allow them. Finally, let’s add one more domain definition to our sample, the UserHandle domain, shown in Table 6-4.
Table 6-4. Sample Domain: UserHandle
| Property | Setting |
|---|---|
| Name | UserHandle |
| Optional | No |
| Datatype | Basic character set, 20 characters maximum |
| Value Limitations | Must be 5–20 simple alphanumeric characters and must start with a letter |
| Default Value | n/a |
In the next four subsections, I’ll discuss several topics concerning the implementation of domains:
Getting the domain of a column implemented correctly is an important step in getting the implementation correct. Too many databases end up with every column having the same datatype and size, allowing NULLs (except for primary keys, if they have them), and thereby lose the integrity that properly sized and constrained columns provide.
Enforce Domain in the Column, or With a Table?
Although many domains have only minimal limitations on values, often a domain will specify a fixed set of named values that a column might have that is less than can be fit into one of the base datatypes. For example, in the demonstration table MessagingUser shown in Figure 6-14, a column AttendeeType has a domain of AttendeeType.
Figure 6-14. MessageUser table for reference
This domain might be specified as in Table 6-5.
Table 6-5. AttendeeType Domain
| Property | Setting |
|---|---|
| Name | AttendeeType |
| Optional | No |
| Datatype | Basic character set, maximum 20 characters |
| Value Limitations | Regular, Volunteer, Speaker, Administrator |
| Default Value | Regular |
The value limitation limits the values to a fixed list of values. We could choose to implement the column using a declarative control (a CHECK constraint, which we will cover in more detail later in the chapter) with a predicate of AttendeeType IN ('Regular', 'Volunteer', 'Speaker', 'Administrator') and a literal default value of 'Regular'. There are a couple of minor annoyances with this form:
I nearly always include tables for all domains that are essentially “lists” of items, as it is just far easier to manage, even if it requires more tables. The choice of key for a domain table can be a bit different than for most tables. Sometimes, I use a surrogate key for the actual primary key, and other times, I use a natural key. The general difference is whether or not the values are user manageable, and if the programming tools require the integer/GUID approach (for example, if the front-end code uses an enumeration that is being reflected in the table values). In the model, I have two examples of such types of domain implementations. In Figure 6-15, I have added a table to implement the domain for attendee types, and for this table, I will use the natural key.
Figure 6-15. AttendeeType domain implemented as a table
This lets an application treat the value as a simple value, just as if it were implemented without the domain table. So if the application wants to manage the value as a simple string, it doesn’t have to know about the domain table from the database standpoint. I still get the validation that the table implementation affords me, plus the ability to have a Description column describing what each of the values actually means (which really comes in handy at 12:10 AM on December the 25th when the system is crashing and needs to be fixed, all while you are really thinking about the bicycle you haven’t finished putting together).
In the original model, we had the Topic table, shown in Figure 6-16, which is a domain similar to the AttendeeType but is designed to allow a user to make changes to the topic list.
Figure 6-16. Topic table for reference
The Topic entity has the special case that it can be added to by the application managers, so it will be implemented as a numeric surrogate value. We will initialize the table with a row that represents the user-defined topic that allows the user to enter their own topic in the MessageTopic table.
Choosing the Datatype
Choosing proper datatypes to match the domain chosen during logical modeling is an important task. One datatype might be more efficient than another of a similar type. For example, you can store integer data in an integer datatype, a numeric datatype, a floating-point datatype, or even a varchar(10) type, but these datatypes are certainly not alike in implementation or performance.
Note I have broken up the discussion of datatypes into two parts. First, there is this and other sections in this chapter in which I provide some basic guidance on the types of datatypes that exist for SQL Server and some light discussion on what to use. Appendix A at the end of this book is an expanded look at all of the datatypes and is dedicated to giving examples and example code snippets with all the types.
It’s important to choose the best possible datatype when building the column. The following list contains the intrinsic datatypes (built-in types that are installed when you install SQL Server) and a brief explanation of each of them. As you translate domains to implementation, the first step will be to see which of these types best matches the need; then we will look to constrain the data even further with additional techniques.
Choice of datatype is a tremendously important part of the process, but if you have defined the domain well, it is not that difficult of a task. In the following sections, we will look at a few of the more important parts of the choice. A few of the considerations we will include are
I didn’t use too many of the different datatypes in the sample model, because my goal was to keep the model very simple and not try to be an AdventureWorks-esque model that tries to show every possible type of SQL Server in one model (or even the newer WideWorldImporters database, which is less unrealistic in its complexity than AdventureWorks and will be used in several chapters later in the book).
Deprecated or Bad Choice Types
I didn’t include several datatypes in the previous list because they have been deprecated for quite some time, and it wouldn’t be surprising if they are completely removed from the version of SQL Server after 2016 (even though I said the same thing in the previous few versions of the book, so be sure to stop using them as soon as possible). Their use was common in versions of SQL Server before 2005, but they’ve been replaced by types that are far easier to use:
If you have ever tried to use the text datatype in SQL code, you know it is not a pleasant thing. Few of the common text operators were implemented to work with it, and in general, it just doesn’t work like the other native types for storing string data. The same can be said with image and other binary types. Changing from text to varchar(max), and so on, is definitely a no-brainer choice.
The second types that are generally advised against being used are the two money types:
In general, the money datatype sounds like a good idea, but using it has some confusing consequences. In Appendix A, I spend a bit more time covering these consequences, but here are two problems:
Hence, it’s generally accepted that it’s best to store monetary data in decimal datatypes. This also gives you the ability to assign the numeric types to sizes that are reasonable for the situation. For example, in a grocery store, having the maximum monetary value of a grocery item over 200,000 dollars is probably unnecessary, even figuring for a heck of a lot of inflation. Note that in the appendix I will include a more thorough example of the types of issues you could see.
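One of those consequences can be shown in a few lines: money keeps only four decimal places in intermediate results, so chained arithmetic loses precision that a decimal type retains. A quick sketch:

```sql
DECLARE @money   money         = 10.00,
        @decimal decimal(10,4) = 10.00;

-- money rounds the intermediate @money / 3 to four decimal places,
-- so multiplying back by 3 yields 9.9999. The decimal division keeps
-- many more fractional digits in its intermediate result, so far less
-- precision is lost.
SELECT @money / 3 * 3   AS MoneyResult,
       @decimal / 3 * 3 AS DecimalResult;
```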
Common Datatype Configurations
In this section, I will briefly cover concerns and issues relating to Boolean/logical values, large datatypes, and complex types and then summarize datatype concerns in order to discuss the most important thing you need to know about choosing a datatype.
Boolean values (TRUE or FALSE) are another of the hotly debated choices that are made for SQL Server data. SQL Server has no Boolean datatype (every type must support NULL, and a NULL Boolean makes life far more difficult for the people who implement SQL), so a suitable datatype needs to be chosen through which to represent Boolean values. Truthfully, though, what we really want from a Boolean is the ability to say that the property of the modeled entity “is” or “is not” for some basic setting.
There are three common choices to implement a value of this sort:
As an example of a Boolean column in our messaging database, I’ll add a simple flag to the MessagingUser table that tells whether the account has been disabled, as shown in Figure 6-17. As before, we are keeping things simple, and in simple cases, a simple flag might do it. But of course, in a sophisticated system, you would probably want to have more information, like who did the disabling, why they did it, when it took place, and perhaps even when it takes effect (these are all questions for design time, but it doesn’t hurt to be thorough).
Figure 6-17. MessagingUser table with DisabledFlag bit column
In SQL Server 2005, dealing with large datatypes changed quite a bit (and hopefully someday Microsoft will kill the text and image types for good). By using the max specifier on varchar, nvarchar, and varbinary types, you can store far more data than was possible in previous versions using a “normal” type, while still being able to deal with the data using the same functions and techniques you can on a simple varchar(10) column, though performance will differ slightly.
As with all datatype questions, use the varchar(max) types only when they’re required; always use the smallest types possible. The larger the datatype, the more data it can hold, and the harder it becomes to achieve optimal storage and retrieval times. In cases where you know you need large amounts of data, or where you sometimes need more than 8,000 bytes in a column, the max specifier is a fantastic thing.
Note Keep on the lookout for uses that don’t meet the normalization needs, as you start to implement. Most databases have a “comments” column somewhere that morphed from comments to a semistructured mess that your DBA staff then needs to dissect using SUBSTRING and CHARINDEX functions.
There are two special concerns when using these types:
The first issue is pretty simple, but it can be a bit confusing at times. For example, concatenate '12345' + '67890'. You’ve taken two char(5) values, and the result will be contained in a value that is automatically recast as a char(10). But if you concatenate two varchar(8000) values, you don’t get a varchar(16000) value, and you don’t get a varchar(max) value. The values get truncated to a varchar(8000) value. This isn’t always intuitively obvious. For example, consider the following code:
SELECT LEN(CAST(REPLICATE('a',8000) AS varchar(8000))
           + CAST(REPLICATE('a',8000) AS varchar(8000))
          );
It returns 8000, as the two values concatenate to a type of varchar(8000). If you cast one of the varchar(8000) values to varchar(max), then the result will be 16,000:
SELECT LEN(CAST(REPLICATE('a',8000) AS varchar(max))
           + CAST(REPLICATE('a',8000) AS varchar(8000))
          );
Second, because the size of columns stored using the varchar(max) datatype can be so huge, it wouldn’t be favorable to always pass around these values just like you do with smaller values. Because the maximum size of a varchar(max) value is 2GB, imagine having to update a value of this size in its entirety. Such an update would be pretty nasty, because the client would need to get the whole value, make its changes, and then send the value back to the server. Most client machines may only have 2GB of physical RAM, so paging would likely occur on the client machine, and the whole process would crawl and more than likely crash occasionally. So, you can do what are referred to as chunked updates. These are done using the .WRITE clause in the UPDATE statement. For example:
UPDATE TableName
SET    VarcharMaxCol.WRITE('the value', <offset>, <expression>)
WHERE  ...
One important thing to note is that varchar(max) values will easily cause the size of rows to be quite large. In brief, a row that fits into 8,060 bytes can be stored in one physical unit. If the row is larger than 8,060 bytes, string data can be placed on what are called overflow pages. Overflow pages are not terribly efficient because SQL Server has to go fetch extra pages that will not be in line with other data pages. (Physical structures, including overflow pages, are covered in more detail in Chapter 10.)
I won’t go over large types in any more detail at this point. Just understand that you might have to treat the data in the (max) columns differently if you’re going to allow huge quantities of data to be stored.
The main point to understand here is that having a datatype with virtually unlimited storage comes at a price. Versions of SQL Server starting in 2008 allow you some additional freedom when dealing with varbinary(max) data by placing it in the file system using what is called filestream storage. I will discuss large object storage in Chapter 8 in more detail, including file tables.
One excellent-sounding feature that you can use to help make your code cleaner is a user-defined type (UDT), which is really an alias for a type. You can use a datatype alias to specify a commonly used datatype configuration that’s used in multiple places using the following syntax:
CREATE TYPE <typeName>
FROM <intrinsic type> --any type that can be used as a column of a
--table, with precision and scale or length,
--as required by the intrinsic type
[NULL | NOT NULL];
When declaring a table, if nullability isn’t specified, then NULL or NOT NULL is based on the setting of ANSI_NULL_DFLT_ON, except when using an alias type (variables will always be nullable). In general, it is best to always specify the nullability in the table declaration, something I will always do in this book, though I do sometimes forget in real life.
For example, consider the UserHandle column. Earlier, we defined its domain as being varchar(20), not optional, alphanumeric, with the data required to be between 5 and 20 characters. The datatype alias would allow us to specify
CREATE TYPE UserHandle FROM varchar(20) NOT NULL;
Then, in the CREATE TABLE statement, we could specify
CREATE TABLE MessagingUser
...
UserHandle UserHandle,
By declaring the UserHandle type as varchar(20), you can ensure that every time the type is used, in table declarations and in variable declarations, it will be a varchar(20), as long as you don’t override the NULL or NOT NULL specification. It is not possible to implement the requirement that data be between 5 and 20 characters, or any other constraint, on the alias type itself beyond the NULL specification.
For another example, consider an SSN type. If it’s declared as char(11), you cannot put in a 12-character value, sure. But what if the user entered 234433432 without the dashes? The datatype would have allowed it, but it isn’t what’s desired. The data will still have to be checked by other methods, such as CHECK constraints.
I am personally not a user of these types, because you cannot do anything with them other than simply alias a type. Any change to the type also requires first removing all references to it, making a change to the type a two-step process.
I will note, however, that I have a few architect friends who make extensive use of them to help keep data storage consistent. I have found that using domains and a data modeling tool serves me better, but I do want to make sure that you have at least heard of them and know the pros and cons.
In SQL Server 2005 and later, we can build our own datatypes using the SQL CLR (Common Language Runtime). Unfortunately, they are quite cumbersome, and their implementation does not lend itself to the types behaving like the intrinsic types. Using CLR types also requires installing the type on the client for the client to get the benefit of the type.
For the most part, you should use CLR types only where there is a very compelling reason to do so. There are a few different possible scenarios where you could reasonably use user-defined types to extend the SQL Server type system with additional scalar types or different ranges of data of existing datatypes. Some potential uses of UDTs might be
Although the possibilities are virtually unlimited, I suggest that CLR UDTs be considered only for specialized circumstances that make the database design significantly more robust and easier to work with. CLR UDTs are a nice addition to the DBA’s and developer’s toolkit, but they should be reserved for those times when adding a new scalar datatype solves a complex business problem.
Microsoft has provided several intrinsic types based on the CLR to implement hierarchies and spatial datatypes. I point this out here to note that if Microsoft is using the CLR to implement complex types (and the spatial types at the very least are pretty darn complex), the sky is the limit. I should note that the spatial and hierarchyId types push the limits of what should be in a type, and some of the data stored (like a polygon) is more or less an array of connected points.
Choosing the Right Datatype
SQL Server gives you a wide range of datatypes, and many of them can be declared in a wide variety of sizes. I never cease to be amazed by the number of databases around in which every single column is either an integer or a varchar(N) (where N is the same for every single string column, sometimes in the 8000 range) and varchar(max). One particular example I’ve worked with had everything, including GUID-based primary keys, all stored in NVARCHAR(200) columns! It is bad enough to store your GUIDs in a varchar column at all, since it can be stored as a 16-byte binary value, whereas if you use a varchar column, it will take 36 bytes; however, store it in an nvarchar (Unicode) column, and now it takes 72 bytes! What a terrible waste of space. Even worse, someone could put in a non-GUID value up to 200 characters wide. Now, people using the data will feel like they need to allow for 200 characters on reports and such for the data. Time wasted, space wasted, money wasted.
As another example, say you want to store a person’s name and date of birth. You could choose to store the name in a varchar(max) column and the date of birth in a varchar(max) column. In all cases, these choices would certainly store the data that the user wanted, but they wouldn’t be good choices at all. The name should be in something such as an nvarchar(50) column and the date of birth in a date column. Notice that I used a variable-sized type for the name. This is because you don’t know the length, and not all names are the same size. Because most names aren’t nearly 50 bytes, using a variable-sized type will save space in your database. I used a Unicode type because people’s names often do contain characters outside the typical Latin set.
Of course, in reality, seldom would anyone make such poor choices of a datatype as putting a date value in a varchar(max) column. Most choices are reasonably easy. However, it’s important to keep in mind that the datatype is the first level of domain enforcement. Thinking back to our domain for UserHandle, we had the datatype definition and value limitations specified in Table 6-6.
Table 6-6. Sample Domain:UserHandle
| Property | Setting |
|---|---|
| Name | UserHandle |
| Optional | No |
| Datatype | Basic character set, maximum 20 characters |
| Value Limitations | Must be 5–20 simple alphanumeric characters and start with a letter |
| Default Value | n/a |
You can enforce the first part of this at the database level by declaring the column as a varchar(20). A column of type varchar(20) won’t even allow a 21-character or longer value to be entered. It isn’t possible to enforce the rule of greater than or equal to five characters using only a datatype. I’ll discuss more about how to enforce simple domain requirements later in this chapter, and in Chapter 7 I’ll discuss patterns of integrity enforcement that are more complex. In this case, I do use a simple ASCII character set because the requirements called for simple alphanumeric data.
Initially, we had the model in Figure 6-18 for the MessagingUser table.
Figure 6-18. MessagingUser table before choosing exact datatypes
Choosing types, we will use an int for the surrogate key, replacing items of the SurrogateKey domain with int types (in the DDL section, we will implement the rest of the optionality rules set in the domain: not optional and auto-generated for keys, optionality determined by utilization for nonkeys). The UserHandle was discussed earlier in this section. In Figure 6-19, I chose some other basic types for the Name, AccessKeyValue, and AttendeeType columns.
Figure 6-19. MessagingUser after datatype choice
Sometimes, you won’t have any real domain definition, and you will use common sizes. For these, I suggest either using a standard type (if you can find one, for example on the Internet) or looking through the data you already have in your system. Until the system gets into production, changing types is fairly easy from a database standpoint, but the more code that accesses the structures, the more difficult changes become.
For the Message table in Figure 6-20, we will choose types.
Figure 6-20. Message table before datatype choice
The Text column isn’t of datatype text but holds the text of the message, limited to 200 characters. For the time columns in Figure 6-21, I chose datetime2(0) for MessageTime, since the requirements specified time down to the second. For RoundedMessageTime, which will be rounded to the hour, I also expect datetime2(0), though it will be a calculated column based on the MessageTime value. Hence, MessageTime and RoundedMessageTime are two views of the same data value.
Figure 6-21. Message table after datatype choice, with calculated column denoted
So, I am going to use a calculated column as shown in Figure 6-21. I will specify the type of RoundedMessageTime as a nonexistent datatype (so if I try to create the table, it will fail). A calculated column is a special type of column that isn’t directly modifiable, as it is based on the result of an expression.
Later in this chapter, we will specify the actual implementation, but for now, we basically just set a placeholder. Of course, in reality, I would specify the implementation immediately, but again, for this first learning process, I am doing things in this deliberate manner to keep things orderly. So, in Figure 6-22, I have the model with all of the datatypes set.
Figure 6-22. Messaging system model after datatype choices
The next step in the process is to set nullability of columns. In our domains, we specified if the columns were optional, so this will generally be a simple task. For the Message table in Figure 6-23, I have chosen the following nullability settings for the columns.
Figure 6-23. Message table for review
The interesting choice was for the two MessagingUserId columns. In Figure 6-24, you can see the full model, but note the relationships from MessagingUser to Message. The relationship for the user that sent the message (MessagingUserId) is NOT NULL, because every message is sent by a user. However, the relationship representing the user the message was sent to is nullable, since not every message needs to be sent to a user.
Figure 6-24. Messaging system model, with NULLs chosen
At this point, our model is very nearly done and very much resembles a database that could be built and employed in an application. Just a bit more information is needed to finish out the model.
The collation of a string value determines how one string is compared with another, as well as how strings are sorted. Many character sets are used by the many different cultures around the world. While you can choose a Unicode datatype if you need to store the characters for almost any character set, there is still the question of how data is sorted (case sensitive or not) and compared (accent sensitive or not). SQL Server and Windows provide a tremendous number of collation types to choose from. The collation is specified at many levels, starting with the server. The server collation determines how the system metadata is stored and compared. Then the database has a collation, and finally, each column may have a different collation.
It’s a somewhat uncommon need for the average database to change the collation from the default, which is usually chosen to be the most typical for all users of a system. This is usually a case-insensitive collation, meaning that when doing comparisons and sorts, ’A’ = ’a’. I’ve only used an alternative collation a few times, for columns where case sensitivity was desired (one time it was so that a client could fit more distinct four-character codes than a case-insensitive collation would allow!).
To see the current collation type for the server and database, you can execute the following commands:
SELECT SERVERPROPERTY('collation');
SELECT DATABASEPROPERTYEX('DatabaseName','collation');
On most systems installed in English-speaking countries, the default collation type is SQL_Latin1_General_CP1_CI_AS, where Latin1_General represents the normal Latin alphabet, CP1 refers to code page 1252 (the SQL Server default Latin 1 ANSI character set), and the last parts represent case insensitive and accent sensitive, respectively. You can find full coverage of all collation types in the SQL Server documentation. However, this default is rarely the desired collation to use; it remains the default for backward compatibility. Ideally, you will use a Windows collation instead. I will use Latin1_General_100_CI_AS for my example code, which is the collation my server is installed with. The 100 indicates a newer collation that supports later Unicode characters.
In addition to the normal collations, there are binary collations that sort and compare data based on its raw binary value. There are two types of binary collations: the older ones are suffixed _BIN, and the newer ones _BIN2. The _BIN2 collations perform what are referred to as pure code-point comparisons, meaning they compare data only on the binary value of each character (the _BIN collations compare only the first character as a WCHAR, an OLE DB datatype, and the remainder byte by byte). Note that a binary sort order differs from a case-sensitive one, so it will behoove you to take some time to understand the collation that you end up using.
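As a quick sanity check of the difference (assuming a case-insensitive default, as discussed earlier), you can compare the same pair of strings under a case-insensitive and a binary collation:

```sql
--compare 'A' and 'a' under a case-insensitive and a binary collation
SELECT CASE WHEN 'A' = 'a' COLLATE Latin1_General_100_CI_AS
            THEN 'equal' ELSE 'not equal' END AS CaseInsensitive,
       CASE WHEN 'A' = 'a' COLLATE Latin1_General_100_BIN2
            THEN 'equal' ELSE 'not equal' END AS BinaryComparison;
--returns: equal, not equal
```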
To list all the sort orders installed in a given SQL Server instance, you can execute the following statement:
SELECT *
FROM sys.fn_helpcollations();
On the computer on which I do testing, this query returned more than 3,800 rows, but usually, you don’t need to change from the default that the database administrator initially chooses. To set the collation sequence for a char, varchar, text, nchar, nvarchar, or ntext column when creating a column, you specify it using the COLLATE clause of the column definition, like so:
CREATE SCHEMA alt;
CREATE TABLE alt.OtherCollate
(
OtherCollateId int IDENTITY
CONSTRAINT PKAlt_OtherCollate PRIMARY KEY ,
Name nvarchar(30) NOT NULL,
FrenchName nvarchar(30) COLLATE French_CI_AS_WS NULL,
SpanishName nvarchar(30) COLLATE Modern_Spanish_CI_AS_WS NULL
);
Now, when you sort output by FrenchName, it’s case insensitive, but arranges the rows according to the order of the French language. The same applies with Spanish, regarding the SpanishName column. For this chapter we will stick with the default in almost all cases, and I would suggest taking a look at Books Online if you have the need to store data in multiple languages. The only “normal” variation from the default collation of the database is if you need a different case sensitivity for a column, in which case you might use a binary collation, or a case sensitive one.
One quick note: you can specify the collation in a WHERE clause using the COLLATE keyword:
SELECT Name
FROM alt.OtherCollate
WHERE Name COLLATE Latin1_General_CS_AI
      LIKE '[A-Z]%' COLLATE Latin1_General_CS_AI; --case sensitive and
                                                  --accent insensitive
It is important to be careful when choosing a collation that is different from the default, because at the server level it is extremely hard to change, and at the database level it is no picnic. You can change the collation of a column with an ALTER command, but it can’t have constraints or indexes referencing it, and you may need to recompile all of your objects that reference the tables.
If you do find yourself on a server with multiple collations, a handy collation setting to use can be database_default, which uses the default for the context of the database you are executing from.
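A minimal sketch of where this helps: temporary tables are created in tempdb, which may have a different collation than your database, so comparisons between a #temp column and a permanent column can behave unexpectedly unless you use database_default:

```sql
--the column follows the collation of the current database, not tempdb's,
--so comparisons against permanent tables behave consistently
CREATE TABLE #NameList
(
    Name nvarchar(30) COLLATE database_default NOT NULL
);
```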
It is important to choose your collation wisely when setting up a server, database, table, etc. Changing the collation of a database can be done using a simple ALTER statement. However, it will not change any objects in the database. You will have to change every column’s collation individually, and while you can use a simple ALTER TABLE…ALTER COLUMN statement to change the collation, you will have to drop all indexes, constraints, and schema-bound objects that reference the column first. It is a pretty painful task.
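As a sketch of that column-by-column process (using the alt.OtherCollate table from earlier, and assuming no constraints or indexes reference the column):

```sql
--change a single column's collation; any index, constraint, or schema-bound
--object referencing the column must be dropped first
ALTER TABLE alt.OtherCollate
    ALTER COLUMN FrenchName nvarchar(30) COLLATE Latin1_General_100_CI_AS NULL;
```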
A schema is a namespace: a container where database objects are contained, all within the confines of a database. We will use schemas to group our tables and, eventually, views, procedures, functions, etc. into functional groups. Naming schemas is a bit different from naming tables or columns. Schema names should sound right, so sometimes plural names make sense and other times singular names do, depending on how they are being used. I find myself using plural names most of the time because they sound better, and because otherwise you can sometimes end up with a table named the same thing as its schema if both are singular.
In our model in Figure 6-25, we will put the tables that are used to represent messages in a Messages schema, and the ones that represent Attendees and their relationships to one another in a schema we will name Attendees.
Figure 6-25. Messages model with schemas assigned
Note, too, that I often set up schemas late in the process, though it might seem more correct to start there. I find that the different functional areas of the implementation become easier to discover as the design progresses; areas come and go until I get to the final solution, so schemas aren't necessarily easy to settle on at the start. Sometimes, assigning schemas becomes a necessity because you have multiple tables with the same name, though that can be a sign of a bad design. In this manufactured solution, I simply did it last to make the point that it could be done last.
What makes schemas so nice is that you can deal with permissions on a schema level, rather than on an object-by-object level. Schemas also give you a logical grouping of objects when you view them within a list, such as in Management Studio.
I’m not going to go any further into the security aspects of using schemas at this point in the book, but understand that schemas are not just for aesthetics. Throughout this book, I’ll always name the schema that a table is in when doing examples. Schemas will be part of any system I design in this book, simply because it’s going to be best practice to do so. On a brief trip back to the land of reality, I said in the previous editions of this book that beginning to use schemas in production systems will be a slow process, and it still can be jarring to some users 11+ years later. Chapter 9 will discuss using schemas for security in more detail.
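For a taste of what schema-level permissions look like (security is covered in Chapter 9; the user name here is hypothetical):

```sql
--grant read access to every object in the Messages schema
--in a single statement, rather than object by object
GRANT SELECT ON SCHEMA::Messages TO AppReader;
```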
Finally, I will add one more thing to the database: columns to support the implementation of the code only (and not to support a user requirement directly). A very common use is to have columns that indicate when the row was created, when it was updated, and perhaps by whom. In our model, I will stick to the simple case of the times mentioned and will demonstrate how to implement this in the database. A lot of implementers like to leave these values to the client, but I very much prefer using the database code because then I have one clock managing times, rather than multiples. (I once managed a system that used two clocks to set row times, and occasionally a row was created years after it was last updated!)
So in Figure 6-26, I add two NOT NULL columns to every table for the RowCreateTime and RowLastUpdateTime, except for the AttendeeType table, which we specified to be not user manageable, and so I chose not to include the modified columns for that table. Of course, you might want to do this to let your development team know when the row was first available. I also left off columns to denote who changed the row for simplicity.
Figure 6-26. Message model after adding RowCreateTime and RowLastUpdateTime to tables
As a final note, it is generally best to only use these implementation columns strictly for metadata purposes. For example, consider the Messages.Message table. If you need to know when the message was created, you should use the MessageTime column as that value may represent the time when the user clicked the create button, captured from a different clock source, even if it took five minutes to actually store the data. Plus, if you need to load the data into a new table, the row may have been created in 2016, but the data in 2000, so not using these columns as user data means you can truthfully reflect when the row was created.
That is why I use such clunky names for the implementation columns. Many tables will include the creation time, but that data may be modifiable. I don't want users changing the time when the row was created, so the name notes that the time of creation is strictly for the row, and I don't allow this column to be modified by anyone.
Sometimes, I will use these columns in concurrency control to denote when a row was changed, but when I have control over the design, I will use a rowversion type if the client can (and will) make use of it. Concurrency control is a very important topic that I will spend a full chapter on in Chapter 11.
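As a sketch of the rowversion alternative just mentioned (the table and column names are hypothetical, reusing the alt schema from earlier), SQL Server updates a rowversion column automatically every time the row changes, giving the client a token to check for lost updates:

```sql
--the RowVersion value changes automatically on every update to the row
CREATE TABLE alt.ConcurrencyDemo
(
    ConcurrencyDemoId int NOT NULL CONSTRAINT PKAlt_ConcurrencyDemo PRIMARY KEY,
    Value             nvarchar(30) NOT NULL,
    RowVersion        rowversion NOT NULL
);
```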
Using DDL to Create the Database
So far, we have been molding the model to make it fit our needs to implement. We added columns, added tables, and specified constraints. Now, in this latter half of the chapter, we move toward the mechanical bits of the process, in that all that’s left is to implement the tables we have spent so much time designing. The blueprints have been drawn up, and now, we can finally grab a hammer and start driving nails.
Just like in the rest of this book, I’ll do this work manually using DDL, because it will help you understand what a tool is building for you. It’s also a good exercise for any database architect or DBA to review the SQL Server syntax; I personally wouldn’t suggest building a database with 300 tables without a data modeling tool, but I definitely do know people who do and wouldn’t consider using a tool to create any of their database objects. On the other hand, the same data modeling tools that could be used to do the logical modeling can usually create the tables and often some of the associated code, saving your fingers from added wear and tear, plus giving you more time to help Mario save the princess who always seems to get herself captured. No matter how you do the work, you need to make sure that you end up with scripts of DDL that you or the tool uses to create objects in some manner in the file system, because they’re invaluable tools for the DBA to apply changes to production, test, development, QA, or whatever environments have been set up to allow developers, users, and DBAs to coexist throughout the process.
Make sure that your scripts are in a source control system, too, or at the very least backed up. In SQL Server, we have two tools to work in: Management Studio and SQL Server Data Tools in Visual Studio. Data Tools is the development-oriented tool that allows a developer to work much as a .NET developer would. Management Studio is more administration oriented in its toolset, but has tools to view and edit objects directly.
In this book, I am going to stick to the DDL that any of the tools will use to construct objects, using an online paradigm where I create a database directly, one command at a time. Such scripts can be executed using any SQL Server tool, but I will generally just use a query window in Management Studio, or sometimes the SQLCMD.exe command-line tool when executing multiple scripts. You can download Management Studio or Data Tools from the Microsoft web site for free at msdn.microsoft.com/library/mt238290.aspx or msdn.microsoft.com/en-us/library/mt204009.aspx, respectively, though these locations are certainly apt to change in the years after this book is released.
Before starting to build anything else, you’ll need a database. I’ll create this database using all default values, and my installation is very generic on a Hyper-V VM on my laptop (most any VM technology will do, or you can install on most modern versions of Windows; and as of this writing, a version of SQL Server running on Linux is being previewed, and the crust of the earth has not frozen over). I use the Developer Edition, and I used most of the default settings when installing (other than setting up mixed mode for security to allow for some security testing later, and the collation, which I will set to Latin1_General_100_CI_AS), which will not be optimal when setting up a server for real multiuser use, or if you want to test features like Always On, Replication, etc. All users of a server for the material I am presenting can be managed and created in SQL Server code alone.
If you are using a shared server, such as a corporate development server, you’ll need to do this with an account that has rights to create a database. If you install your server yourself, part of the process will be to set up users so the server will be accessible.
Choosing a database name is just as important as naming other objects, and I take the same naming stance: keep it as simple as possible while differentiating it from all other databases, and follow the naming standards in place for your organization. I would also be careful to standardize names across instances of SQL Server to allow moving databases from server to server. In the code downloads, I will name the database ConferenceMessaging.
The steps I’ll take along the way are as follows:
I will use the following statement to create a small database:

CREATE DATABASE ConferenceMessaging;

You can see where the database files were placed by running the following statement (note that size is presented in 8KB pages—more on the internal structures of the database storage in Chapter 10):
SELECT type_desc, size*8/1024 AS [size (MB)],physical_name
FROM sys.master_files
WHERE database_id = DB_ID('ConferenceMessaging');
This returns
type_desc size (MB) physical_name
------------ ----------- ----------------------------------------------
ROWS 8 C:\Program Files\Microsoft...SQL\DATA\ConferenceMessaging.mdf
LOG 8 C:\Program Files\Microsoft...SQL\DATA\ConferenceMessaging_log.ldf
Next, we want to deal with the owner of the database. The database is owned by the user who created the database, as you can see from the following query:
USE ConferenceMessaging;
--determine the login that is linked to the dbo user in the database
SELECT SUSER_SNAME(sid) AS databaseOwner
FROM sys.database_principals
WHERE name = 'dbo';
On my instance, I created the database using a user named louis with a machine named WIN-8F59BO5AP7D:
databaseOwner
---------------------
WIN-8F59BO5AP7D\louis
You can see the owner of all databases on an instance using the following query:
--Get the login of the owner of each database on the instance
SELECT SUSER_SNAME(owner_sid) AS databaseOwner, name
FROM sys.databases;
On a typical corporate production server, I almost always will set the owner of the database to be the system administrator account so that all databases are owned by the same user. The only reason not to do this is when you are sharing databases or when you have implemented cross-database security that needs to be different for multiple databases (more information about security in Chapter 9). You can change the owner of the database by using the ALTER AUTHORIZATION statement:
ALTER AUTHORIZATION ON DATABASE::ConferenceMessaging TO SA;
Going back and using the code to see the database owner, you will see that the owner is now SA.
Tip Placing a semicolon at the end of every statement in your T-SQL is fast becoming a standard that will, in a future version of SQL Server, be required.
Creating the Basic Table Structures
The next step is to create the basic tables of the database. In this section, we will form CREATE TABLE statements to create the tables. The following is the basic syntax for the CREATE TABLE statement:
CREATE TABLE [<database>.][<schema>.]<tablename>
(
<column specification>
);
If you look in Books Online, you will see a lot of additional settings that allow you to use either an on-disk or in-memory configuration, place the table on a filegroup, partition the table onto multiple filegroups, control where maximum/overflow data is placed, and so on. Some of this will be discussed in Chapter 10 on table structures and indexing. As is typical for most databases, we will be using on-disk tables for most of the examples in this book.
Tip Don't make this your only source of information about DDL in SQL Server. Books Online is another great place to get exhaustive coverage of DDL, and other sorts of books cover the physical aspects of table creation in great detail. In this book, we focus largely on the relational aspects of database design, with enough of the physical implementation to start you in the right direction. Many of the remaining chapters will delve into more complex usage patterns, but even then we will not cover every possible, even useful, setting that exists.
The base CREATE clause is straightforward:
CREATE TABLE [<database>.][<schema>.]<tablename>
I’ll expand on the items between the angle brackets (< and >). Anything in square brackets ([ and ]) is optional.
For the table name, if the first character is a single # symbol, the table is a local temporary table; if the first two characters are ##, it's a global temporary table. Temporary tables are a mechanism for holding intermediate results in complex queries rather than a part of database design, so they won't be covered further here. You can also declare a table variable, which has the same scope as a variable, by using an @ in front of the name; it can be used to hold small sets of data.
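The three temporary forms just described look like this (the names are purely illustrative):

```sql
CREATE TABLE #LocalTemp (Value int);       --visible only to the creating connection
CREATE TABLE ##GlobalTemp (Value int);     --visible to all connections
DECLARE @TableVariable table (Value int);  --scoped like a variable
```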
The combination of schema and tablename must be unique in a database. In fact, the tablename must be unique among all other objects in its schema, including tables, views, procedures, constraints, and functions, among other things. This is why I will suggest a prefix or naming pattern for objects other than tables. Some things that look like objects are not, such as indexes. This will all become clearer as we progress through the book.
As discussed in the earlier section where we defined schemas for our database, a schema is a namespace: a container where database objects are contained, all within the confines of a database. One thing that is nice is that because the schema isn’t tightly tied to a user, you can drop the user without changing the exposed name of the object. Changing owners of the schema changes owners of the objects within the schema.
In SQL Server 2000 and earlier, the table was owned by a user, which made using schemas difficult. Without getting too deep into security, objects owned by the same user are easier to handle with security due to ownership chaining. If one object references another and they are owned by the same user, the ownership chain isn’t broken. So we had every object owned by the same user.
Starting with SQL Server 2005, a schema is owned by a user, and tables are contained in a schema. You can access objects using the following naming method, just like when using the CREATE TABLE statement:
[<databaseName>.][<schemaName>.]objectName
The <databaseName> defaults to the current database. The <schemaName> defaults to the user’s default schema.
Schemas are of great use to segregate objects within a database for clarity of use. In our database, we have already specified two schemas earlier: Messages and Attendees. The basic syntax is simple, just CREATE SCHEMA <schemaName> (it must be the first statement in the batch). So, I will create them using the following commands:
CREATE SCHEMA Messages; --tables pertaining to the messages being sent
GO
CREATE SCHEMA Attendees; --tables pertaining to the attendees and how they can send messages
GO
The CREATE SCHEMA statement has another, rarely used variation in which you create the objects contained within the schema at the same time, as in the following example:
CREATE SCHEMA Example
CREATE TABLE ExampleTableName ... ; --no schema name on the object
The schema will be created, as well as any objects you include in the script, which will be members of that schema. If you need to drop a schema (like if you created the Example schema to try out the syntax like I did!), use DROP SCHEMA Example;.
You can view the schemas that have been created using the sys.schemas catalog view:
SELECT name, USER_NAME(principal_id) AS principal
FROM sys.schemas
WHERE name <> USER_NAME(principal_id); --don't list user schemas
This returns
name principal
-------------- ----------------------
Messages dbo
Attendees dbo
Sometimes, schemas end up owned by a user other than dbo, like when a developer without db_owner privileges creates a schema. Or sometimes a user will get a schema with the same name as their user built by some of the SQL Server tools (which is the purpose of WHERE name <> USER_NAME(principal_id) in the schema query). You can change the ownership using the ALTER AUTHORIZATION statement much like for the database:
ALTER AUTHORIZATION ON SCHEMA::Messages TO DBO;
As a note, I suggest always specifying the two-part name for objects in code. It is safer, because you know which schema is being used, and SQL Server doesn't need to check the user's default schema on every execution. However, for ad hoc access, it can be annoying to type the schema if you commonly use a certain one. You can set a default schema for a user in the CREATE and ALTER USER statements, like this:
CREATE USER <schemaUser>
FOR LOGIN <schemaUser>
WITH DEFAULT SCHEMA = schemaname;
The ALTER USER command allows the changing of default schema for existing users (and in SQL Server 2012 and later, it works for Windows Group–based users as well; for 2005-2008R2, it only worked for standard users).
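For example, to change the default schema of an existing user (the user name here is hypothetical):

```sql
--the user Fred will now resolve one-part object names
--in the Messages schema before the dbo schema
ALTER USER Fred WITH DEFAULT SCHEMA = Messages;
```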
The next part of the CREATE TABLE statement is for the column specifications:
CREATE TABLE [<database>.][<schema>.]<tablename>
(
<columnName> <datatype> [<NULL specification>]
    [IDENTITY [(seed,increment)]]
--or
<columnName> AS <computed definition>
);
The <columnName> placeholder is where you specify the name of the column.
There are two types of columns:

- Implemented: a column that physically stores data entered into the table
- Computed (or virtual): a column whose value is calculated from an expression involving other columns in the table
Most of the columns in any database will be implemented columns, but computed columns have some pretty cool uses, so don’t think they’re of no use just because they aren’t talked about much. You can avoid plenty of code-based denormalizations by using computed columns. In our example tables, we specified one computed column, shown in Figure 6-27.
Figure 6-27. Message table with computed column highlighted
So the basic columns (other than the computed column) are fairly simple, just name and datatype:
MessageId int,
SentToMessagingUserId int,
MessagingUserId int,
Text nvarchar(200),
MessageTime datetime2(0),
RowCreateTime datetime2(0),
RowLastUpdateTime datetime2(0)
The requirements called for the person not to send the same message more than once an hour. So we construct an expression that takes the MessageTime value, which the datetime2(0) datatype stores to a precision of seconds, and rounds it to the hour. To build the expression, I start with a variable of the same type as the column we are deriving from and load it with the time from SYSDATETIME():
DECLARE @pointInTime datetime2(0);
SET @pointInTime = SYSDATETIME();
Next, I write the following expression:
DATEADD(HOUR,DATEPART(HOUR,@pointInTime),CAST(CAST(@PointInTime AS date) AS datetime2(0)) )
which can be broken down fairly simply: it takes the number of hours since midnight and adds that to the date-only value, produced by casting the time to date and then back to datetime2(0), which allows hours to be added to it. Once the expression is tested, replace the variable with the MessageTime column, and we define our computed column as follows:
,RoundedMessageTime AS DATEADD(HOUR,DATEPART(HOUR,MessageTime),
CAST(CAST(MessageTime AS date) AS datetime2(0)) ) PERSISTED
The PERSISTED specification indicates that the value will be calculated and saved as part of the definition and storage of the table, just like a fully implemented column, meaning that using the column will not require the value to be calculated at runtime. In order to be persisted, the expression must be deterministic, which basically means that for the same input, you will always get the same output (much like we covered in normalization back in Chapter 5). You can also use a computed column based on a deterministic expression as a column in an index (we will use it as part of the uniqueness constraint for this table). So an expression like GETDATE() is possible to use in a computed column, but you could not persist or index it, since the value would change for every execution.
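To verify the expression behaves as intended, you can test it with a literal value; any time within an hour rounds down to the top of that hour:

```sql
DECLARE @pointInTime datetime2(0) = '2016-07-20 14:23:45';
SELECT DATEADD(HOUR,DATEPART(HOUR,@pointInTime),
               CAST(CAST(@pointInTime AS date) AS datetime2(0))) AS RoundedTime;
--returns 2016-07-20 14:00:00
```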
In the column-create phrase, simply change the <NULL specification> in your physical model to NULL to allow NULLs, or NOT NULL not to allow NULLs:
<columnName> <data type> [<NULL specification>]
There’s nothing particularly surprising here. For the noncomputed columns in the Messages.Message table back in Figure 6-27, we will specify the following nullabilities:
MessageId int NOT NULL,
SentToMessagingUserId int NULL ,
MessagingUserId int NOT NULL ,
Text nvarchar(200) NOT NULL ,
MessageTime datetime2(0) NOT NULL ,
RowCreateTime datetime2(0) NOT NULL ,
RowLastUpdateTime datetime2(0) NOT NULL
Note If you leave off the NULL specification altogether, the SQL Server default is used, which is governed by the ANSI_NULL_DFLT_OFF and ANSI_NULL_DFLT_ON database properties (see https://msdn.microsoft.com/en-us/library/ms187356.aspx for more details). It is very much a best practice to always specify the nullability of a column, but I won't attempt to demonstrate how those settings work, as they are fairly confusing.
Managing Non-natural Primary Keys
Finally, before getting too excited and completing the table creation script, there's one more thing to discuss. Earlier in this chapter, I discussed the basics of using a surrogate key in your table. In this section, I'll present the method that I typically use. I break surrogate key values down into the two types that I use:

- Manually managed, where the designer chooses permanent values, typically for domain tables that users cannot modify
- Automatically generated, using the IDENTITY property or a DEFAULT constraint
Of course, if your tables don’t use any sort of surrogate values, you can move on to the next sections.
In the example model, I have one such situation in which I set up a domain table where I won’t allow users to add or subtract rows from the table. Changes to the rows in the table could require changes to the code of the system and application layers of the application. Hence, instead of building tables that require code to manage, as well as user interfaces, we simply choose a permanent value for each of the surrogate values. This gives you control over the values in the key and allows usage of the key directly in code if desired (likely as a constant construct in the host language). It also allows a user interface to cache values from this table or to even implement them as constants, with confidence that they won’t change without the knowledge of the programmer who is using them (see Figure 6-28).
Figure 6-28. AttendeeType table for reference
Note that it's generally expected that once you manually create a value for a table that is built to implement a domain, the meaning of this value will never change. For example, you might have a row ('SPEAKER', 'Persons who are speaking at the conference and have special privileges'). In this case, it would be fine to change the Description but not the AttendeeType value itself.
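Loading such a table is then just a set of hand-chosen INSERT statements (the column names here are assumptions based on Figure 6-28, and the full data load appears later in the chapter):

```sql
--each key value is chosen by the designer and treated as permanent
INSERT INTO Attendees.AttendeeType (AttendeeType, Description)
VALUES ('SPEAKER','Persons who are speaking at the conference and have special privileges'),
       ('REGULAR','Regular conference attendees');
```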
Generation Using the IDENTITY Property
Most of the time, tables are created to allow users to create new rows. Implementing a surrogate key on these tables can be done using (what are commonly referred to as) IDENTITY columns. For any of the precise numeric datatypes, there's an option to create an automatically incrementing (or decrementing, depending on the increment value) column. The identity value increments automatically and operates outside of the normal transaction (like an autonomous transaction), so it is extremely fast and doesn't block other connections on anything other than the generation of the next sequential value. The column that implements this IDENTITY property should also be defined as NOT NULL. From our initial section on columns, I had this for the column specification:
<columnName> <data type> [<NULL specification>] IDENTITY [(SEED,INCREMENT)]
The SEED portion specifies the number that the column values will start with, and the INCREMENT is how much the next value will increase. For example, take the Message table we have been working with, this time implementing the IDENTITY-based surrogate key:
MessageId int NOT NULL IDENTITY(1,1) ,
SentToMessagingUserId int NULL ,
MessagingUserId int NOT NULL ,
Text nvarchar(200) NOT NULL ,
MessageTime datetime2(0) NOT NULL ,
RowCreateTime datetime2(0) NOT NULL ,
RowLastUpdateTime datetime2(0) NOT NULL
To the column declaration for the MessageId column of the Message table we have been using in the past few sections, I've added the IDENTITY property. The seed of 1 indicates that the values will start at 1, and the increment of 1 says that the second value will be 1 greater, in this case 2, the next 3, and so on. You can set the seed and increment to any value that fits the datatype of the column the property is applied to. For example, you could declare the column as IDENTITY(1000,-50), and the first value would be 1000, the second 950, the third 900, and so on.
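A quick throwaway example demonstrates the decrementing variety (the table is purely illustrative, placed in the alt schema created earlier):

```sql
CREATE TABLE alt.IdentityDemo
(
    IdentityDemoId int NOT NULL IDENTITY(1000,-50),
    Value varchar(10) NOT NULL
);
INSERT INTO alt.IdentityDemo (Value)
VALUES ('first'),('second'),('third');

SELECT IdentityDemoId, Value
FROM alt.IdentityDemo;  --returns 1000, 950, 900
```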
The IDENTITY property is useful for creating a surrogate primary key that's small and fast. The int datatype requires only 4 bytes and is good because most tables will have fewer than 2 billion rows. There are, however, a couple of major caveats that you have to understand about IDENTITY values:

- Values are generated outside of the transaction, so a rollback or failed insert still consumes the value, leaving gaps in the sequence.
- You cannot update an identity column value, and inserting an explicit value requires SET IDENTITY_INSERT.
- The property does nothing to guarantee uniqueness by itself; that is the job of a key constraint.
Keep in mind the fact (I hope I’ve said this enough) that the surrogate key should not be the only key on the table or that the only uniqueness is a more or less random value!
Generation Using a Default Constraint
Using identity values, you get a very strict key management system, where you have to use special syntax (SET IDENTITY_INSERT) to provide an explicit key value for a new row. Instead of such a strict key generation tool, there are a couple of things you can use in a DEFAULT constraint to set values when a value isn't provided.
First, if using GUIDs for a key, you can simply default the column to NEWID(), which will generate a random GUID value. Or you can use NEWSEQUENTIALID(), which will generate a GUID with a higher value each time it is called. Having a monotonically increasing sequence of values has benefits to indexes, something we will discuss more in Chapter 10.
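A sketch of the GUID approach (the table name is hypothetical; note that NEWSEQUENTIALID() may only be used in a DEFAULT constraint):

```sql
CREATE TABLE alt.GuidKeyDemo
(
    GuidKeyDemoId uniqueidentifier NOT NULL
        CONSTRAINT DFLTGuidKeyDemo_GuidKeyDemoId DEFAULT (NEWSEQUENTIALID())
        CONSTRAINT PKAlt_GuidKeyDemo PRIMARY KEY,
    Name nvarchar(30) NOT NULL
);
```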
Another method of generating an integer value for a surrogate is to use a SEQUENCE object to generate new values for you. Like an identity column, it operates outside the enclosing transaction, so it is really fast, but a rollback will not recover a value that has been used, leaving gaps on errors/rollbacks.
For our database, I will use a SEQUENCE object with a DEFAULT constraint, instead of an identity column, as the key generator for the Topic table. Users can add new general topics, but special topics will be added manually with a specific value. The only real concern with using a SEQUENCE-based surrogate key is that you are not limited to the values the SEQUENCE object generates. So if someone enters a value of 10 in the column without getting it from the SEQUENCE object, you may get a key violation when the sequence reaches 10. SEQUENCE objects have techniques to let you allocate (sometimes referred to as burn) sets of values in advance (which I will cover in a few pages).
I will start the user-generated key values at 10000, since it is unlikely that 10,000 specially coded topics will be needed:
CREATE SEQUENCE Messages.TopicIdGenerator
AS int
MINVALUE 10000   --minimum value the sequence allows
NO MAXVALUE      --technically will max out at max int
START WITH 10000 --value where the sequence will start; can differ from MINVALUE,
                 --for example when the CYCLE property is used
INCREMENT BY 1   --number that is added to the previous value
NO CYCLE         --if set to CYCLE, when the max value is reached it starts over
CACHE 100;       --adjusts the number of values SQL Server preallocates in memory;
                 --cached values are lost if the server is restarted, but keeping
                 --them in RAM makes access faster
You can get the first two values using the NEXT VALUE FOR expression for sequence objects:
SELECT NEXT VALUE FOR Messages.TopicIdGenerator AS TopicId;
SELECT NEXT VALUE FOR Messages.TopicIdGenerator AS TopicId;
This returns
TopicId
-----------
10000
TopicId
-----------
10001
Note The default datatype for a SEQUENCE object is bigint, and the default starting point is the smallest number the SEQUENCE supports. So if you declared CREATE SEQUENCE dbo.test and fetched the first value, you would get -9223372036854775808, which is an annoying starting place for most usages. As with almost any DDL you will write in T-SQL, it is generally desirable to specify most settings explicitly, especially those that control how the object behaves for you.
You can then reset the sequence to the START WITH value using the ALTER SEQUENCE statement with a RESTART clause:
--To restart at a specific number, add WITH <starting value literal>
ALTER SEQUENCE Messages.TopicIdGenerator RESTART;
For the Topic table, I will use the following column declaration to apply the SEQUENCE object in a default. This is the first time I have used a default, so note that the name I gave the default object starts with the prefix DFLT, followed by the table name, an underscore, and then the column the default pertains to. This will be sufficient to keep the names unique and to identify the object in a query of the system catalog.
TopicId int NOT NULL CONSTRAINT DFLTTopic_TopicId
DEFAULT(NEXT VALUE FOR Messages.TopicIdGenerator),
In the final section of this chapter, I will load some data for the table to give an idea of how all the parts work together. One additional super-nice property of SEQUENCE objects is that you can preallocate values to allow for bulk inserts. So if you want to load 100 topic rows, you can get the values for use, build your set, and then do the insert. The allocation is done using a system stored procedure:
DECLARE @range_first_value sql_variant, @range_last_value sql_variant,
@sequence_increment sql_variant;
EXEC sp_sequence_get_range @sequence_name = N'Messages.TopicIdGenerator'
, @range_size = 100
, @range_first_value = @range_first_value OUTPUT
, @range_last_value = @range_last_value OUTPUT
, @sequence_increment = @sequence_increment OUTPUT;
SELECT CAST(@range_first_value AS int) AS firstTopicId,
CAST(@range_last_value AS int) AS lastTopicId,
CAST(@sequence_increment AS int) AS increment;
Since our object was just reset, the first 100 values are returned, along with the increment (which you should use rather than assume, if you want to follow the rules of the object):
firstTopicId lastTopicId increment
------------ ----------- -----------
10000 10099 1
If you want to get metadata about the SEQUENCE objects in the database, you can use the sys.sequences catalog view:
SELECT start_value, increment, current_value
FROM sys.sequences
WHERE SCHEMA_NAME(schema_id) = 'Messages'
AND name = 'TopicIdGenerator';
For the TopicIdGenerator object we set up, this returns
start_value increment current_value
--------------- -------------- ---------------------
10000 1 10099
Sequences can be a great improvement on identities, especially whenever you have any need to control the values in the surrogate key (like having unique values across multiple tables). They are a bit more work than identity values, but the flexibility is worth it when you need it. I expect identity columns will remain the standard way of creating surrogate keys for most purposes, as their inflexibility offers some protection against having to manage data in the surrogate key, since you have to go out of your way (using SET IDENTITY_INSERT ON) to insert a value other than the next identity value.
The Actual DDL to Build Tables
We have finally reached the point where we are going to create the basic table structures we have specified, including generating the primary keys and the calculated column that we created. Note that we have already created the SCHEMA and SEQUENCE objects earlier in the chapter. I will start the script with a statement to drop the objects if they already exist, which lets you create and test the code repeatedly. At the end of the chapter, I will discuss strategies for versioning your code and keeping the database creatable from a clean script, either by dropping the objects or by dropping and re-creating the database.
--DROP TABLE IF EXISTS is new in SQL Server 2016. The download will have a method for older
--versions of SQL Server demonstrated as well. The order of the tables is actually set based
--on the order they will need to be dropped due to later foreign key constraint order
DROP TABLE IF EXISTS Attendees.UserConnection,
Messages.MessageTopic,
Messages.Topic,
Messages.Message,
Attendees.AttendeeType,
Attendees.MessagingUser;
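For reference, the sort of pattern the download alludes to for older versions is sketched here (a common idiom, not necessarily the exact code in the download), shown for one table:

```sql
--Works on versions prior to SQL Server 2016, where DROP TABLE IF EXISTS is unavailable;
--the 'U' type filter restricts the check to user tables
IF OBJECT_ID('Attendees.UserConnection','U') IS NOT NULL
    DROP TABLE Attendees.UserConnection;
```

The same IF OBJECT_ID check would be repeated for each table, in the same dependency order as the DROP TABLE IF EXISTS list above.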
Now we will create all of the objects:
CREATE TABLE Attendees.AttendeeType (
AttendeeType varchar(20) NOT NULL ,
Description varchar(60) NOT NULL
);
--As this is a non-editable table, we load the data here to
--start with
INSERT INTO Attendees.AttendeeType
VALUES ('Regular', 'Typical conference attendee'),
('Speaker', 'Person scheduled to speak'),
('Administrator','Manages System');
CREATE TABLE Attendees.MessagingUser (
MessagingUserId int NOT NULL IDENTITY ( 1,1 ) ,
UserHandle varchar(20) NOT NULL ,
AccessKeyValue char(10) NOT NULL ,
AttendeeNumber char(8) NOT NULL ,
FirstName nvarchar(50) NULL ,
LastName nvarchar(50) NULL ,
AttendeeType varchar(20) NOT NULL ,
DisabledFlag bit NOT NULL ,
RowCreateTime datetime2(0) NOT NULL ,
RowLastUpdateTime datetime2(0) NOT NULL
);
CREATE TABLE Attendees.UserConnection
(
UserConnectionId int NOT NULL IDENTITY ( 1,1 ) ,
ConnectedToMessagingUserId int NOT NULL ,
MessagingUserId int NOT NULL ,
RowCreateTime datetime2(0) NOT NULL ,
RowLastUpdateTime datetime2(0) NOT NULL
);
CREATE TABLE Messages.Message (
MessageId int NOT NULL IDENTITY ( 1,1 ) ,
RoundedMessageTime as (DATEADD(hour,DATEPART(hour,MessageTime),
CAST(CAST(MessageTime AS date) AS datetime2(0)) ))
PERSISTED,
SentToMessagingUserId int NULL ,
MessagingUserId int NOT NULL ,
Text nvarchar(200) NOT NULL ,
MessageTime datetime2(0) NOT NULL ,
RowCreateTime datetime2(0) NOT NULL ,
RowLastUpdateTime datetime2(0) NOT NULL
);
CREATE TABLE Messages.MessageTopic (
MessageTopicId int NOT NULL IDENTITY ( 1,1 ) ,
MessageId int NOT NULL ,
UserDefinedTopicName nvarchar(30) NULL ,
TopicId int NOT NULL ,
RowCreateTime datetime2(0) NOT NULL ,
RowLastUpdateTime datetime2(0) NOT NULL
);
CREATE TABLE Messages.Topic (
TopicId int NOT NULL CONSTRAINT DFLTTopic_TopicId
DEFAULT(NEXT VALUE FOR Messages.TopicIdGenerator),
Name nvarchar(30) NOT NULL ,
Description varchar(60) NOT NULL ,
RowCreateTime datetime2(0) NOT NULL ,
RowLastUpdateTime datetime2(0) NOT NULL
);
After running this script, you are getting pretty far down the path, but there are still quite a few steps before we are finished. Sometimes this is as far as people go when building a "small" system, but it is important to complete all of the steps in this chapter for almost every database you create, to maintain a reasonable level of data integrity.
Note If you are trying to create a table (or any object, setting, security principal [user or login], etc.) in SQL Server, Management Studio will almost always have tooling to help. For example, right-clicking the Tables node under Databases > ConferenceMessaging in Management Studio will give you a New Table menu (as well as some other specific table types that we will look at later in the book). If you want to see the script to build an existing table, right-click the table, select Script Table As and then CREATE To, and choose how you want it scripted. I use these options very frequently to see the structure of a table, and I expect you will too.
Adding Uniqueness Constraints
As I've mentioned several (or perhaps too many) times, it's important that every table have at least one constraint that prevents duplicate rows from being created. In this section, I'll cover adding PRIMARY KEY constraints and alternate key (UNIQUE) constraints, plus a topic (indexes) that inevitably comes to mind when I start talking about keys that are implemented with indexes:
Both PRIMARY KEY and UNIQUE constraints are implemented on top of unique indexes to do the enforcing of uniqueness. It’s conceivable that you could use unique indexes instead of constraints, but I specifically use constraints because of the meaning that they suggest: constraints are intended to semantically represent and enforce some limitation on data, whereas indexes (which are covered in detail in Chapter 10) are intended to speed access to data.
In actuality, it doesn't matter how the uniqueness is implemented, but it is necessary to have either unique indexes or UNIQUE constraints in place. Usually the index will be useful for performance as well, because when you need to enforce uniqueness, it's usually also the case that a user or process will be searching for a reasonably small number of values in the table.
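For comparison, enforcing uniqueness with a plain index rather than a constraint is a one-statement affair; this is a hypothetical sketch (the table and index names are mine, and in this book we will use constraints instead, for the semantic reasons just discussed):

```sql
--Functionally equivalent uniqueness enforcement, but semantically "just" an index:
--duplicate values are still rejected with an error on INSERT/UPDATE
CREATE UNIQUE NONCLUSTERED INDEX UIXUniqueDemo_Value
    ON Examples.UniqueDemo (Value);
```

The engine enforces both mechanisms identically; the choice of a constraint simply documents that the rule is part of the data's integrity contract, not a tuning artifact.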
Adding Primary Key Constraints
The first set of constraints we will add to the tables will be PRIMARY KEY constraints. The syntax of the PRIMARY KEY declaration is straightforward:
[CONSTRAINT constraintname] PRIMARY KEY [CLUSTERED | NONCLUSTERED]
As with all constraints, the constraint name is optional, but you should never omit it. I'll name PRIMARY KEY constraints using a name such as PK<tablename>. Generally, you will want to make the primary key clustered for the table, as normally the columns of the primary key will be the most frequently used for accessing rows. This is definitely not always the case, and the right choice is often discovered only later, during the testing phase of the project. In Chapter 10, I will describe the physical/internal structures of the database and give more indications of when you might deviate from the clustered primary key path.
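When you do deviate, the syntax is simply a matter of specifying NONCLUSTERED on the key and placing the clustered index elsewhere. A hypothetical sketch (this table is not part of our model):

```sql
--Hypothetical: the surrogate key stays unique, but the clustered index is placed on a
--column better suited to range access patterns
CREATE TABLE Examples.ClusterChoice
(
    ClusterChoiceId int NOT NULL IDENTITY(1,1)
        CONSTRAINT PKClusterChoice PRIMARY KEY NONCLUSTERED,
    SaleDate date NOT NULL
);
CREATE CLUSTERED INDEX IXClusterChoice_SaleDate
    ON Examples.ClusterChoice (SaleDate);
```

Note the clustered index here is not unique; this is a physical-layout choice only, and the PRIMARY KEY constraint still enforces uniqueness of the key.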
Tip The primary key and other constraints of the table will be members of the table’s schema, so you don’t need to name your constraints for uniqueness over all objects, just those in the schema.
You can specify the PRIMARY KEY constraint when creating the table, just like we did the default for the SEQUENCE object. If it is a single-column key, you could add it to the statement like this:
CREATE TABLE Messages.Topic (
TopicId int NOT NULL CONSTRAINT DFLTTopic_TopicId
DEFAULT(NEXT VALUE FOR Messages.TopicIdGenerator)
CONSTRAINT PKTopic PRIMARY KEY,
Name nvarchar(30) NOT NULL ,
Description varchar(60) NOT NULL ,
RowCreateTime datetime2(0) NOT NULL ,
RowLastUpdateTime datetime2(0) NOT NULL
);
Or, if it is a multiple-column key, you can specify it as a separate, table-level element in the column list, like the following example:
CREATE TABLE Examples.ExampleKey
(
ExampleKeyColumn1 int NOT NULL,
ExampleKeyColumn2 int NOT NULL,
CONSTRAINT PKExampleKey
PRIMARY KEY (ExampleKeyColumn1, ExampleKeyColumn2)
)
Another common method is to use the ALTER TABLE statement and simply alter the table to add the constraint, like the following, which is the code in the downloads that adds the primary keys (CLUSTERED is the default for a PRIMARY KEY, but is included for emphasis):
ALTER TABLE Attendees.AttendeeType
ADD CONSTRAINT PKAttendeeType PRIMARY KEY CLUSTERED (AttendeeType);
ALTER TABLE Attendees.MessagingUser
ADD CONSTRAINT PKMessagingUser PRIMARY KEY CLUSTERED (MessagingUserId);
ALTER TABLE Attendees.UserConnection
ADD CONSTRAINT PKUserConnection PRIMARY KEY CLUSTERED (UserConnectionId);
ALTER TABLE Messages.Message
ADD CONSTRAINT PKMessage PRIMARY KEY CLUSTERED (MessageId);
ALTER TABLE Messages.MessageTopic
ADD CONSTRAINT PKMessageTopic PRIMARY KEY CLUSTERED (MessageTopicId);
ALTER TABLE Messages.Topic
ADD CONSTRAINT PKTopic PRIMARY KEY CLUSTERED (TopicId);
Tip Although the CONSTRAINT <constraintName> part of any constraint declaration is optional, it's a very good idea to always name your constraints. Otherwise, SQL Server will assign a name for you, and it will be ugly and different each and every time you execute the statement (making it far harder to compare multiple databases created from the same script, such as Dev, Test, and Prod). For example, create the following object in tempdb:
CREATE TABLE TestConstraintName (TestConstraintNameId int PRIMARY KEY);
Look at the object name with this query:
SELECT constraint_name
FROM information_schema.table_constraints
WHERE table_schema = 'dbo'
AND table_name = 'TestConstraintName';
You see the name chosen is something hideous like PK__TestCons__BA850E1F645CD7F4.
Adding Alternate Key Constraints
Alternate key creation is an important task of implementation modeling. Enforcing these keys is arguably more important than enforcing primary keys, especially when using an artificial key. When implementing alternate keys, it's best to use a UNIQUE constraint. These are very similar to PRIMARY KEY constraints (other than that they can include nullable columns) and can even be used as the target of a relationship (relationships are covered later in the chapter).
The syntax for their creation is as follows:
[CONSTRAINT constraintname] UNIQUE [CLUSTERED | NONCLUSTERED] [(ColumnList)]
Just like the PRIMARY KEY, you can declare a UNIQUE constraint during table creation or in an ALTER TABLE statement. I usually use an ALTER statement for code I am managing, because keeping the CREATE TABLE statement separate seems cleaner, but as long as the constraints get implemented, either way is fine.
ALTER TABLE Messages.Message
ADD CONSTRAINT AKMessage_TimeUserAndText UNIQUE
(RoundedMessageTime, MessagingUserId, Text);
ALTER TABLE Messages.Topic
ADD CONSTRAINT AKTopic_Name UNIQUE (Name);
ALTER TABLE Messages.MessageTopic
ADD CONSTRAINT AKMessageTopic_TopicAndMessage UNIQUE
(MessageId, TopicId, UserDefinedTopicName);
ALTER TABLE Attendees.MessagingUser
ADD CONSTRAINT AKMessagingUser_UserHandle UNIQUE (UserHandle);
ALTER TABLE Attendees.MessagingUser
ADD CONSTRAINT AKMessagingUser_AttendeeNumber UNIQUE
(AttendeeNumber);
ALTER TABLE Attendees.UserConnection
ADD CONSTRAINT AKUserConnection_Users UNIQUE
(MessagingUserId, ConnectedToMessagingUserId);
The only really interesting tidbit here is in the Messages.Message constraint. Remember that RoundedMessageTime was declared as a computed column in the table declaration, so by adding this constraint, we have prevented the same user from sending the same message text more than once per hour. This should show you that you can implement some fairly complex constraints using the basic building blocks we have covered so far. I will note again that a computed column must be based on a deterministic expression to be used in an index. Declaring the column as PERSISTED is a good way to find out whether it is deterministic.
In the next few chapters, we will cover many different patterns for using these building blocks in very interesting ways. Now, we have covered all of the uniqueness constraints that are needed in our ConferenceMessaging database.
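One caveat worth noting, since the AKMessageTopic constraint includes the nullable UserDefinedTopicName column: SQL Server (unlike the ANSI specification) treats NULL as a single value for uniqueness purposes, so a UNIQUE constraint permits only one row per combination that includes NULL. When you instead need "unique among non-NULL values" semantics, the usual tool is a filtered unique index. A hypothetical sketch (this table is not part of our model):

```sql
--Hypothetical table to demonstrate; NULLable column with filtered uniqueness
CREATE TABLE Examples.FilteredUniqueDemo (Value int NULL);

CREATE UNIQUE NONCLUSTERED INDEX UIXFilteredUniqueDemo_Value
    ON Examples.FilteredUniqueDemo (Value)
    WHERE Value IS NOT NULL; --any number of NULL rows allowed;
                             --non-NULL values must be unique
```

For our MessageTopic table, the plain UNIQUE constraint behavior is acceptable, since UserDefinedTopicName participates in a multicolumn key.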
What About Indexes?
The topic of indexes is one that generally begins to be discussed before the first row of data is loaded into the first table. Indexes that you add to a table beyond uniqueness constraints generally have a singular responsibility: increasing performance. At the same time, they have to be maintained, so they decrease the performance of modifications too, though hopefully considerably less than they help. This conundrum is the foundation of the "science" of performance tuning. Hence, it is best to leave any discussion of adding indexes until data is loaded into tables and queries are executed that show the need for indexes.
In the previous section, we created uniqueness constraints whose purpose is to constrain the data in some form to make sure integrity is met. These uniqueness constraints we have just created are actually built using unique indexes and will also incur some performance penalty just like any index will. To see the indexes that have been created for your constraints, you can use the sys.indexes catalog view:
SELECT CONCAT(OBJECT_SCHEMA_NAME(object_id),'.',
OBJECT_NAME(object_id)) AS object_name,
name, is_primary_key, is_unique_constraint
FROM sys.indexes
WHERE OBJECT_SCHEMA_NAME(object_id) <> 'sys'
AND (is_primary_key = 1
OR is_unique_constraint = 1)
ORDER BY object_name, is_primary_key DESC, name;
which, for the constraints we have created so far, returns
object_name name is_primary_key is_unique_constraint
--------------------------- ------------------------------ -------------- --------------------
Attendees.AttendeeType PKAttendeeType 1 0
Attendees.MessagingUser PKMessagingUser 1 0
Attendees.MessagingUser AKMessagingUser_AttendeeNumber 0 1
Attendees.MessagingUser AKMessagingUser_UserHandle 0 1
Attendees.UserConnection PKUserConnection 1 0
Attendees.UserConnection AKUserConnection_Users 0 1
Messages.Message PKMessage 1 0
Messages.Message AKMessage_TimeUserAndText 0 1
Messages.MessageTopic PKMessageTopic 1 0
Messages.MessageTopic AKMessageTopic_TopicAndMessage 0 1
Messages.Topic PKTopic 1 0
Messages.Topic AKTopic_Name 0 1
As you start to do index tuning, one of the major tasks is to determine whether indexes are being used and to eliminate the indexes that are never (or very rarely) used to optimize queries, but you will not want to remove any indexes that show up in the results of the previous query, because they are there for data integrity purposes.
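When that tuning time comes, the usual starting point is the index usage statistics DMV. The following is a hedged sketch of the sort of query used to spot candidates for removal; note that sys.dm_db_index_usage_stats is reset when the server restarts, so treat its numbers as indicative only:

```sql
--Indexes with no seeks/scans/lookups since the last restart are review candidates,
--EXCLUDING those backing PRIMARY KEY and UNIQUE constraints, which exist for integrity
SELECT CONCAT(OBJECT_SCHEMA_NAME(i.object_id),'.',
              OBJECT_NAME(i.object_id)) AS object_name,
       i.name AS index_name,
       us.user_seeks, us.user_scans, us.user_lookups, us.user_updates
FROM   sys.indexes AS i
       LEFT JOIN sys.dm_db_index_usage_stats AS us
           ON us.object_id = i.object_id
          AND us.index_id = i.index_id
          AND us.database_id = DB_ID()
WHERE  OBJECT_SCHEMA_NAME(i.object_id) <> 'sys'
  AND  i.is_primary_key = 0
  AND  i.is_unique_constraint = 0;
```

A NULL in the usage columns means the index has not been touched at all since the statistics were last reset.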
Adding Basic Default Constraints
If a user doesn't know what value to enter into a column, the value can be omitted, and the DEFAULT constraint sets it to a valid, predetermined value. This helps users avoid having to make up illogical, inappropriate values when they don't know what to put in a column but need to create a row. However, the true value of defaults is lost in most applications, because for a DEFAULT constraint to matter, the user interface must honor the default by not referencing the column in the INSERT operation (or by using the DEFAULT keyword for the column value).
We used a DEFAULT constraint earlier to implement the primary key generation, but here, I will spend a bit more time describing how it works. The basic syntax for the default constraint is
[CONSTRAINT constraintname] DEFAULT (<simple scalar expression>)
The scalar expression must be a literal or a function, even a user-defined one that accesses a table. Table 6-7 has sample literal values that can be used as defaults for a few datatypes.
Table 6-7. Sample Default Values

Datatype        Possible Default Value
-------------   ----------------------
int             1
varchar(10)     'Value'
binary(2)       0x0000
datetime        '20080101'
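To make the two INSERT behaviors concrete, here is a small hypothetical table (not part of our model) showing both ways a DEFAULT constraint can fire:

```sql
--Hypothetical table; the names here are illustrative only
CREATE TABLE Examples.DefaultDemo
(
    DefaultDemoId int NOT NULL IDENTITY(1,1)
        CONSTRAINT PKDefaultDemo PRIMARY KEY,
    Name varchar(10) NOT NULL,
    Status varchar(10) NOT NULL
        CONSTRAINT DFLTDefaultDemo_Status DEFAULT ('New')
);

--Either omit the defaulted column from the column list entirely...
INSERT INTO Examples.DefaultDemo (Name) VALUES ('Row1');
--...or reference it explicitly with the DEFAULT keyword
INSERT INTO Examples.DefaultDemo (Name, Status) VALUES ('Row2', DEFAULT);
--Both rows end up with Status = 'New'
```

Any INSERT that supplies its own literal for Status bypasses the default, which is why the user interface has to cooperate for defaults to be of value.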
As an example in our sample database, we have the DisabledFlag on the Attendees.MessagingUser table. I’ll set the default value to 0 for this column here:
ALTER TABLE Attendees.MessagingUser
ADD CONSTRAINT DFLTMessagingUser_DisabledFlag
DEFAULT (0) FOR DisabledFlag;
Beyond literals, you can use system functions to implement DEFAULT constraints. In our model, we will use a default on all of the tables' RowCreateTime and RowLastUpdateTime columns. To create these constraints, I will demonstrate one of the most useful tools in a DBA's toolbox: using the system views to generate code. Since we have to write the same sort of code over and over, I will query the metadata in the INFORMATION_SCHEMA.COLUMNS view and put together a query that generates the DEFAULT constraints (you will need to set your output to text, not grid, in SSMS to use this code):
SELECT CONCAT('ALTER TABLE ',TABLE_SCHEMA,'.',TABLE_NAME,CHAR(13),CHAR(10),
' ADD CONSTRAINT DFLT', TABLE_NAME, '_' ,
COLUMN_NAME, CHAR(13), CHAR(10),
' DEFAULT (SYSDATETIME()) FOR ', COLUMN_NAME,';')
FROM INFORMATION_SCHEMA.COLUMNS
WHERE COLUMN_NAME IN ('RowCreateTime', 'RowLastUpdateTime')
AND TABLE_SCHEMA IN ('Messages','Attendees')
ORDER BY TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME;
This query generates the code for ten constraints:
ALTER TABLE Attendees.MessagingUser
ADD CONSTRAINT DFLTMessagingUser_RowCreateTime
DEFAULT (SYSDATETIME()) FOR RowCreateTime;
ALTER TABLE Attendees.MessagingUser
ADD CONSTRAINT DFLTMessagingUser_RowLastUpdateTime
DEFAULT (SYSDATETIME()) FOR RowLastUpdateTime;
ALTER TABLE Attendees.UserConnection
ADD CONSTRAINT DFLTUserConnection_RowCreateTime
DEFAULT (SYSDATETIME()) FOR RowCreateTime;
ALTER TABLE Attendees.UserConnection
ADD CONSTRAINT DFLTUserConnection_RowLastUpdateTime
DEFAULT (SYSDATETIME()) FOR RowLastUpdateTime;
ALTER TABLE Messages.Message
ADD CONSTRAINT DFLTMessage_RowCreateTime
DEFAULT (SYSDATETIME()) FOR RowCreateTime;
ALTER TABLE Messages.Message
ADD CONSTRAINT DFLTMessage_RowLastUpdateTime
DEFAULT (SYSDATETIME()) FOR RowLastUpdateTime;
ALTER TABLE Messages.MessageTopic
ADD CONSTRAINT DFLTMessageTopic_RowCreateTime
DEFAULT (SYSDATETIME()) FOR RowCreateTime;
ALTER TABLE Messages.MessageTopic
ADD CONSTRAINT DFLTMessageTopic_RowLastUpdateTime
DEFAULT (SYSDATETIME()) FOR RowLastUpdateTime;
ALTER TABLE Messages.Topic
ADD CONSTRAINT DFLTTopic_RowCreateTime
DEFAULT (SYSDATETIME()) FOR RowCreateTime;
ALTER TABLE Messages.Topic
ADD CONSTRAINT DFLTTopic_RowLastUpdateTime
DEFAULT (SYSDATETIME()) FOR RowLastUpdateTime;
Generating code with the system metadata is obviously not the point of this section, but it is a very useful skill to have, particularly when you need to apply the same sort of code over and over.
Adding Relationships (Foreign Keys)
Adding relationships is perhaps the trickiest of the constraint tasks, because both the parent and child tables need to exist to create the constraint. Hence, it is most common to add foreign keys using the ALTER TABLE statement, though you can also do it in the CREATE TABLE statement.
The typical foreign key is formed from the primary key of one table migrated to the child table that references it. You can also reference a UNIQUE constraint, but that is a fairly rare implementation. An example would be a table with an identity surrogate key and a unique textual code: you could migrate the textual code to a child table to make the data easier to read, if user requirements dictated it and you failed to win the argument against doing something that will confuse everyone for years to come.
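A minimal sketch of that rare case, referencing a UNIQUE constraint rather than the primary key (hypothetical tables; our ConferenceMessaging model does not do this):

```sql
--Hypothetical: parent has both a surrogate key and a unique textual code
CREATE TABLE Examples.Color
(
    ColorId int NOT NULL IDENTITY(1,1) CONSTRAINT PKColor PRIMARY KEY,
    ColorCode char(3) NOT NULL CONSTRAINT AKColor_ColorCode UNIQUE
);
CREATE TABLE Examples.Product
(
    ProductId int NOT NULL IDENTITY(1,1) CONSTRAINT PKProduct PRIMARY KEY,
    ColorCode char(3) NOT NULL
        CONSTRAINT FKColor$Categorizes$Examples_Product
        FOREIGN KEY REFERENCES Examples.Color(ColorCode) --targets the UNIQUE constraint
);
```

The FOREIGN KEY simply names the alternate key columns in the REFERENCES clause; the engine resolves it to the unique index backing AKColor_ColorCode.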
The syntax of the statement for adding FOREIGN KEY constraints is pretty simple:
ALTER TABLE TableName [WITH CHECK | WITH NOCHECK]
ADD [CONSTRAINT <constraintName>]
FOREIGN KEY (<columnName(s)>) REFERENCES <referenceTable> (<referenceColumns>)
[ON DELETE <NO ACTION | CASCADE | SET NULL | SET DEFAULT> ]
[ON UPDATE <NO ACTION | CASCADE | SET NULL | SET DEFAULT> ];
The components of this syntax are as follows:
- WITH CHECK | WITH NOCHECK: Controls whether existing data is validated against the new constraint (more on this shortly).
- <constraintName>: The optional (but strongly recommended) name of the constraint.
- <columnName(s)>: The referencing column or columns in the table being altered.
- <referenceTable> (<referenceColumns>): The table and key columns being referenced.
- ON DELETE and ON UPDATE: The action taken when a referenced row is deleted or its key value is updated; NO ACTION is the default.
If you are using surrogate keys, you will very rarely need either of the ON UPDATE options, since the value of a surrogate is rarely edited. For deletes, 98.934% of the time you will use NO ACTION, because most of the time you will simply want to make the user delete the children first, to avoid accidentally removing a lot of data. Lots of NO ACTION FOREIGN KEY constraints also make it much harder for an accidental DELETE FROM <tableName> to succeed when you didn't highlight the WHERE clause in SSMS. The most common of the cascading actions is ON DELETE CASCADE, which is frequently useful for table sets where the child table is, in essence, just a part of the parent table; for example, invoice <-- invoiceLineItem. Usually, if you are going to delete the invoice, you are doing so because it is bad, and you will want the line items to go away too. On the other hand, you want to avoid it for relationships like customer <-- invoice. Deleting a customer who has invoices is, as a general rule, probably not desired, so you will want the client code to specifically delete the invoices before deleting the customer on the rare occasion that this is desired.
Note too the optional [WITH CHECK | WITH NOCHECK] specification. The WITH NOCHECK setting gives you the opportunity to create the constraint without checking existing data; WITH CHECK is the default when adding a new constraint (WITH NOCHECK is the default only when re-enabling an existing constraint). Using NOCHECK and leaving the values unchecked is generally a bad thing to do, because if you try to resave the exact same data that already existed in a row, you could get an error. Also, if the constraint is built using WITH CHECK, it is marked as trusted, and the query optimizer can make use of this fact when building a query plan.
Thinking back to our modeling, there were optional and required relationships such as the one in Figure 6-29.
Figure 6-29. Optional parent-to-child relationship requires NULL on the migrated key
The child.parentId column needs to allow NULLs (which it does on the model). For a required relationship, the child.parentId would not be null, like in Figure 6-30.
Figure 6-30. Required parent-to-child relationship requires NOT NULL on the migrated key
This is all you need to do, because SQL Server knows that when the referencing key allows a NULL, the relationship value is optional. In our model, represented in Figure 6-31, we have seven relationships modeled.
Figure 6-31. Messaging model for reference
Recall from Figure 6-9 that we had given the relationships a verb phrase, which is used to read the name. For example, in the relationship between User and Message, we have two relationships. One of them was verb phrased as "Is Sent" as in User-Is Sent-Message. In order to get interesting usage of these verb phrases, I will use them as part of the name of the constraint, so that constraint will be named:
FKMessagingUser$IsSent$Messages_Message
Doing this greatly improves the value of the names of the constraints, particularly when you have more than one foreign key between the same two tables. Now, let's go through the seven constraints and decide which options to use on each foreign key relationship. First up is the relationship between AttendeeType and MessagingUser. Since it uses a natural key, it is a candidate for the ON UPDATE CASCADE option. However, note that if you have a lot of MessagingUser rows, this operation can be very costly, so it should be done during off-hours; and if it turns out to be done very often, the choice of a volatile natural key ought to be reconsidered. We will use ON DELETE NO ACTION, because we don't usually want to cascade a delete from a table that exists strictly to implement a domain.
ALTER TABLE Attendees.MessagingUser
ADD CONSTRAINT FKAttendeeType$Categorizes$Attendees_MessagingUser
FOREIGN KEY (AttendeeType) REFERENCES Attendees.AttendeeType(AttendeeType)
ON UPDATE CASCADE
ON DELETE NO ACTION;
Next, let's consider the two relationships between the MessagingUser and UserConnection tables. Since we modeled both of the relationships as required, if a user is deleted (as opposed to being disabled), we would want to delete all connections to and from that user. Hence, you might consider implementing both of these as ON DELETE CASCADE. However, if you execute the following statements:
ALTER TABLE Attendees.UserConnection
ADD CONSTRAINT
FKMessagingUser$ConnectsToUserVia$Attendees_UserConnection
FOREIGN KEY (MessagingUserId) REFERENCES Attendees.MessagingUser(MessagingUserId)
ON UPDATE NO ACTION
ON DELETE CASCADE;
ALTER TABLE Attendees.UserConnection
ADD CONSTRAINT
FKMessagingUser$IsConnectedToUserVia$Attendees_UserConnection
FOREIGN KEY (ConnectedToMessagingUserId)
REFERENCES Attendees.MessagingUser(MessagingUserId)
ON UPDATE NO ACTION
ON DELETE CASCADE;
you will receive the following error:
Introducing FOREIGN KEY constraint 'FKMessagingUser$IsConnectedToUserVia$Attendees_UserConnection' on table 'UserConnection' may cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or ON UPDATE NO ACTION, or modify other FOREIGN KEY constraints.
Basically, this message states that you cannot have two cascade paths onto the same table. This further limits the value of CASCADE operations. Instead, we will use NO ACTION for the DELETE and will have to implement the cascade in client code or with a trigger (which I will do as an example).
I will also note that, in many ways, having these limitations on cascading operations is probably a good thing. Too much automatically executing code is going to make developers antsy about what is going on with the data, and if you accidentally delete a user, having NO ACTION specified can actually be a good thing to stop dumb mistakes.
I will change the constraints to NO ACTION and re-create them (dropping the first one, which was created successfully):
ALTER TABLE Attendees.UserConnection
DROP CONSTRAINT
FKMessagingUser$ConnectsToUserVia$Attendees_UserConnection;
GO
ALTER TABLE Attendees.UserConnection
ADD CONSTRAINT
FKMessagingUser$ConnectsToUserVia$Attendees_UserConnection
FOREIGN KEY (MessagingUserId) REFERENCES Attendees.MessagingUser(MessagingUserId)
ON UPDATE NO ACTION
ON DELETE NO ACTION;
ALTER TABLE Attendees.UserConnection
ADD CONSTRAINT
FKMessagingUser$IsConnectedToUserVia$Attendees_UserConnection
FOREIGN KEY (ConnectedToMessagingUserId)
REFERENCES Attendees.MessagingUser(MessagingUserId)
ON UPDATE NO ACTION
ON DELETE NO ACTION;
GO
And the following INSTEAD OF trigger will go ahead and delete the rows before the actual operation that the user tried is executed:
CREATE TRIGGER MessagingUser$InsteadOfDeleteTrigger
ON Attendees.MessagingUser
INSTEAD OF DELETE AS
BEGIN
DECLARE @msg varchar(2000), --used to hold the error message
--use inserted for insert or update triggers, deleted for update or delete triggers
--count instead of @@rowcount due to merge behavior that sets @@rowcount to a number
--that is equal to number of merged rows, not rows being checked in trigger
@rowsAffected int = (SELECT COUNT(*) FROM deleted);
--no need to continue on if no rows affected
IF @rowsAffected = 0 RETURN;
SET NOCOUNT ON; --to avoid the rowcount messages
SET ROWCOUNT 0; --in case the client has modified the rowcount
BEGIN TRY
--[validation section]
--[modification section]
--implement multi-path cascade delete in trigger
DELETE FROM Attendees.UserConnection
WHERE MessagingUserId IN (SELECT MessagingUserId FROM DELETED);
DELETE FROM Attendees.UserConnection
WHERE ConnectedToMessagingUserId IN (SELECT MessagingUserId FROM DELETED);
--<perform action>
DELETE FROM Attendees.MessagingUser
WHERE MessagingUserId IN (SELECT MessagingUserId FROM DELETED);
END TRY
BEGIN CATCH
IF @@trancount > 0
ROLLBACK TRANSACTION;
THROW;
END CATCH;
END;
For the two relationships between MessagingUser and Message, it may seem like we would want cascade operations, but since we implemented a disabled indicator in the MessagingUser table, we probably would not. If a MessagingUser had not created any message rows yet, it could be deleted; otherwise, it would usually be disabled instead (and the row would not be deleted, since the FOREIGN KEY constraint from Message would cause the previous trigger to fail, leaving the connection rows alone too).
If a system administrator wants to remove the user completely, a module would likely be created to manage this operation, as it would be the exception rather than the rule. So for this example, we will implement NO ACTION on DELETE:
ALTER TABLE Messages.Message
ADD CONSTRAINT FKMessagingUser$Sends$Messages_Message FOREIGN KEY
(MessagingUserId) REFERENCES Attendees.MessagingUser(MessagingUserId)
ON UPDATE NO ACTION
ON DELETE NO ACTION;
ALTER TABLE Messages.Message
ADD CONSTRAINT FKMessagingUser$IsSent$Messages_Message FOREIGN KEY
(SentToMessagingUserId) REFERENCES Attendees.MessagingUser(MessagingUserId)
ON UPDATE NO ACTION
ON DELETE NO ACTION;
Whether to use cascading operations should not be considered lightly. NO ACTION relationships prevent mistakes from causing calamity. CASCADE operations (including SET NULL and SET DEFAULT, which give you additional ways to control the cascading behavior) are powerful tools that make coding easier, but they can wipe out a lot of data. For example, if you cascade deletes from MessagingUser to Message and UserConnection, a single DELETE against MessagingUser can make a lot of data disappear in a hurry.
The next relationship we will deal with is between Topic and MessageTopic. We don’t want Topic rows to be deleted once they have been set up and used, other than perhaps by an administrator as a special operation with its own requirements, not as a normal part of processing. Hence, we use ON DELETE NO ACTION:
ALTER TABLE Messages.MessageTopic
ADD CONSTRAINT
FKTopic$CategorizesMessagesVia$Messages_MessageTopic FOREIGN KEY
(TopicId) REFERENCES Messages.Topic(TopicId)
ON UPDATE NO ACTION
ON DELETE NO ACTION;
The next-to-last relationship to implement is the MessageTopic to Message relationship. Just like the Topic to MessageTopic relationship, there is no need to automatically delete the MessageTopic rows when a Message row is deleted:
ALTER TABLE Messages.MessageTopic
ADD CONSTRAINT FKMessage$IsCategorizedVia$Messages_MessageTopic FOREIGN KEY
(MessageId) REFERENCES Messages.Message(MessageId)
ON UPDATE NO ACTION
ON DELETE NO ACTION;
One of the primary limitations of constraint-based foreign keys is that the tables participating in the relationship cannot span different databases (or different engine models, in that a memory-optimized table’s foreign keys cannot reference an on-disk table, or vice versa). When this situation occurs, these relationships need to be enforced via triggers on the on-disk table, or via interpreted T-SQL for memory-optimized tables.
While cross-container (as the memory-optimized and on-disk storage/transaction engines are commonly referred to) relationships may be useful at times (more on this in later chapters), it is definitely considered a bad idea to design databases with cross-database relationships. A database should be considered a unit of related tables that are always kept in sync. When designing solutions that extend over different databases or even servers, carefully consider how spreading around references to data that isn’t within the scope of the database will affect your solution. You need to understand that SQL Server cannot guarantee the existence of the value, because SQL Server uses databases as its “container,” and another user could restore a database with improper values, even an empty database, and the cross-database RI would be invalidated. Of course, as is almost always the case with anything that isn’t best-practice material, there are times when cross-database relationships are unavoidable, and I’ll demonstrate building triggers to support this need in the next chapter on data protection.
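As a small preview of the trigger approach, a minimal sketch might look like the following. The database, schema, and table names here are hypothetical assumptions, not part of our conference messaging model:

```sql
--Hypothetical example: Sales.Invoice in this database references Customer
--rows that live in a separate CustomerDb database, so a FOREIGN KEY
--constraint cannot be used; a trigger checks for existence instead.
CREATE TRIGGER Sales.Invoice$InsertUpdateTrigger
ON Sales.Invoice
AFTER INSERT, UPDATE
AS
BEGIN
   SET NOCOUNT ON;
   IF EXISTS (SELECT *
              FROM   inserted
              WHERE  NOT EXISTS (SELECT *
                                 FROM   CustomerDb.Customers.Customer
                                 WHERE  Customer.CustomerId =
                                                       inserted.CustomerId))
   BEGIN
      ROLLBACK TRANSACTION;
      THROW 50000, 'CustomerId does not exist in CustomerDb', 1;
   END;
END;
```

Note that this check only runs when the child rows are modified; as the text warns, nothing stops someone from restoring or emptying CustomerDb afterward, which is why cross-database RI is inherently weaker than a constraint.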
In the security chapter (Chapter 9), we will discuss more about how to secure cross-database access, but it is generally considered a less than optimal practice. With features such as contained databases (introduced in SQL Server 2012) and Azure SQL Database, the ability to cross database boundaries is changing in ways that will generally be helpful for building secure databases that exist on the same server.
Adding Basic CHECK Constraints
In our database, we have specified a couple of domains that need to be implemented a bit more strictly. In most cases, we can implement validation routines using simple CHECK constraints. CHECK constraints are simple, single-row predicates that can be used to validate the data in a row. The basic syntax is
ALTER TABLE <tableName> [WITH CHECK | WITH NOCHECK]
ADD [CONSTRAINT <constraintName>]
CHECK (<BooleanExpression>);
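For example, a simple constraint using this syntax might look like the following. The table, column, and constraint here are hypothetical, purely to illustrate the pieces of the syntax:

```sql
--Hypothetical example of the basic CHECK constraint syntax:
--prevent negative values in a Balance column.
ALTER TABLE Demo.Account WITH CHECK
   ADD CONSTRAINT CHKAccount$BalanceNonNegative
      CHECK (Balance >= 0);
```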
One interesting thing about CHECK constraints is how the <BooleanExpression> is evaluated. The <BooleanExpression> component is similar to the WHERE clause of a typical SELECT statement, but with the caveat that no subqueries are allowed. (Subqueries are allowed in standard SQL but not in T-SQL; in T-SQL, you must use a function to access other tables.)
CHECK constraints can reference system and user-defined functions and use the name or names of any columns in the table. However, they cannot access any other table, and they cannot access any row other than the current row being modified (except through a function, and the row values you will be checking will already exist in the table). If multiple rows are modified, each row is checked against this expression individually.
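As a sketch of the function approach, consider the following. The table, function, and limit are hypothetical assumptions, used only to show the pattern:

```sql
--Hypothetical example of reaching other rows from a CHECK constraint
--through a user-defined function: limit each customer to 10 Widget rows.
CREATE FUNCTION Demo.Widget$CountForCustomer
(
   @CustomerId int
)
RETURNS int
AS
BEGIN
   RETURN (SELECT COUNT(*)
           FROM   Demo.Widget
           WHERE  Widget.CustomerId = @CustomerId);
END;
GO
ALTER TABLE Demo.Widget
   ADD CONSTRAINT CHKWidget$MaxTenPerCustomer
      CHECK (Demo.Widget$CountForCustomer(CustomerId) <= 10);
```

Because the function runs after the modification is reflected in the table, the row being checked is included in the count, just as the text describes.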
The interesting thing about this expression is that, unlike a WHERE clause, the condition is checked for falseness rather than for truth: a row is rejected only when the expression evaluates to FALSE. If the result of a comparison is UNKNOWN because of a NULL comparison, the row will pass the CHECK constraint and be entered. Even if this isn’t immediately confusing, it often is when you are figuring out why an operation on a row did or did not work as you expected. For example, consider the Boolean expression value <> ’fred’. If value is NULL, the row is accepted, because NULL <> ’fred’ evaluates to UNKNOWN. If value is ’fred’, it fails, because ’fred’ <> ’fred’ is FALSE. CHECK constraints treat Booleans this way because, if the column is defined as NULL, it is assumed that you wish to allow a NULL value for the column. You can look for NULL values by explicitly checking for them using IS NULL or IS NOT NULL. This is useful when you want to ensure that a column that technically allows NULLs does not allow them when another column has a given value. As an example, if you have a column defined name varchar(10) NULL, a CHECK constraint that says name = ’fred’ effectively says name = ’fred’ OR name IS NULL. If you want to ensure the name is not NULL when the column NameIsNotNullFlag = 1, you would state ((NameIsNotNullFlag = 1 AND Name IS NOT NULL) OR (NameIsNotNullFlag = 0)).
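A quick demonstration of this NULL behavior follows; the table is hypothetical and exists only for the demonstration:

```sql
--Hypothetical table demonstrating that UNKNOWN passes a CHECK constraint
CREATE TABLE Demo.CheckDemo
(
   name varchar(10) NULL
      CONSTRAINT CHKCheckDemo$nameNotFred CHECK (name <> 'fred')
);
INSERT INTO Demo.CheckDemo (name)
VALUES (NULL);     --succeeds: NULL <> 'fred' is UNKNOWN, so the row passes
INSERT INTO Demo.CheckDemo (name)
VALUES ('barney'); --succeeds: the expression evaluates to TRUE
INSERT INTO Demo.CheckDemo (name)
VALUES ('fred');   --fails: the expression evaluates to FALSE
```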
Note that the [WITH CHECK | WITH NOCHECK] specification works just like it did for FOREIGN KEY constraints.
In our model, we had two domain predicates specified in the text that we will implement here. The first is the TopicName, which called for us to make sure that the value is not an empty string or all space characters. I repeat it here in Table 6-8 for review.
Table 6-8. Domain: TopicName
Property | Setting
---|---
Name | TopicName
Optional | No
Datatype | Unicode text, 30 characters
Value Limitations | Must not be an empty string or only space characters
Default Value | n/a
The maximum length of 30 characters was handled by the nvarchar(30) datatype we used, but now we will implement the rest of the value limitations. The method I will use is to do an LTRIM on the value and then check the length: if it is 0, the value is either all spaces or empty. We used the TopicName domain for two columns: the Name column of the Messages.Topic table, and the UserDefinedTopicName column of the Messages.MessageTopic table: