Azure SQL and its underlying SQL Server query engine are well known in the industry for the low administrative effort typically required for applications to achieve good performance and reliability under most conditions, across different data sizes and shapes.
1. Understanding how the system is behaving under a given workload: this means looking at major performance and availability metrics during your load tests or production periods to make sure the system is up and running and can cope with a given workload.
2. Investigating deeply into specific activities that may be happening, or have happened, when the results of the previous point show some critical behaviors.
The former is a continuous, long-term monitoring scenario that is crucial to make sure your solution will “keep the lights on”; it usually requires specific capabilities, like the ability to store and analyze vast amounts of data points and to trigger alerts when key metrics cross certain thresholds. On the Azure platform, there is a common backbone across all services to collect diagnostic information emitted by all components of a given solution, called Azure Monitor. All services can be configured to asynchronously emit logs and metrics to Azure Monitor, which can then redirect these to both near real-time options (visualizing metrics in the Azure Portal or pushing data into an Azure Event Hubs instance for further processing and alerting) and longer-term analytical options (like Azure Blob Storage or Log Analytics), depending on your needs.
You can get more details on this specific scenario by looking at official documentation at this link: https://aka.ms/asdbmto.
The other major scenario is instead related to immediate diagnostic and troubleshooting investigations that you can execute while a specific issue is happening, thanks to the extensive instrumentation capabilities exposed by the Azure SQL engine. Every internal subsystem of the service, from connection and session management down to query processing and the storage engine, in fact emits a rich and detailed set of diagnostic information that we, as developers, can take advantage of to understand how a single query is executed or how subsystems like memory management or resource governance are behaving.
As developers, we don’t necessarily need to understand every single aspect of how the internals of the service are working, but it is quite important to know the basic tools and techniques available to make sure we can investigate how our queries and workloads are performing and how we can make our applications more efficient when working with Azure SQL.
Dynamic Management Views (DMVs)
Dynamic management views (and functions) are a way to execute T-SQL queries against an Azure SQL endpoint and return state information that can be used to monitor the health of an Azure SQL server or database instance, diagnose problems, and tune performance.
Azure SQL enables a subset of dynamic management views to diagnose performance problems, which might be caused by blocked or long-running queries, resource bottlenecks, poor query plans, and so on. In an instance of SQL Server and in Azure SQL Managed Instance, dynamic management views return server state information. In Azure SQL Database, they return information regarding your current logical database only.
- Database-related dynamic management views (prefixed by sys.dm_db_*)
- Execution-related dynamic management views (prefixed by sys.dm_exec_*)
- Transaction-related dynamic management views (prefixed by sys.dm_tran_*)
Basically, this view returns key metrics around resource consumption for your database instance over the last hour, with a granularity of 15 seconds. If any of these metrics gets close to 100%, you usually have the choice of scaling up your database service or compute tier or, more likely, drilling deeper into resource utilization to understand whether there are ways to make your workload more efficient.
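The description above matches the sys.dm_db_resource_stats view; a minimal query against it (the column selection here is ours) could look like this:

```sql
-- Resource consumption for the current database over roughly the last hour,
-- one row per 15-second interval; values are percentages of the service
-- objective's limits.
SELECT  end_time,
        avg_cpu_percent,
        avg_data_io_percent,
        avg_log_write_percent,
        avg_memory_usage_percent
FROM    sys.dm_db_resource_stats
ORDER BY end_time DESC;
```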
Let’s say, as an example, that you’re maxing out CPU utilization: a good next step would be to look at the top CPU-consuming queries to understand if there are ways to improve them.
In detail, we’re getting a binary value identifying all queries with the same “shape” (differing only in a parameter value or the like), an indication of how many times that query has been executed, followed by the query text itself. From there, a long list of very important metrics indicating total, min, average, and max usage of critical resources like CPU (worker) time, data page reads and writes, and much more. By changing the ORDER BY column, we can look at which queries are executed most often (by execution_count), or which queries read or write the most data pages, and so on.
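A query along these lines, sketched here against sys.dm_exec_query_stats (the exact column selection and TOP count are our choices), can surface the top CPU consumers:

```sql
-- Top 10 cached query statements by total CPU (worker) time.
-- sys.dm_exec_sql_text resolves the sql_handle back to the query text.
SELECT TOP (10)
       qs.query_hash,            -- binary value identifying the query "shape"
       qs.execution_count,
       qs.total_worker_time,     -- CPU time, in microseconds
       qs.min_worker_time,
       qs.max_worker_time,
       qs.total_logical_reads,
       qs.total_logical_writes,
       st.text AS query_text
FROM   sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
```

Changing the ORDER BY column (for example, to execution_count or total_logical_reads) reorients the same query toward the other scenarios mentioned above.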
This example is just scratching the surface of what we can do with DMVs, but there is much more. We can investigate and identify performance bottlenecks due to poorly designed table or index structures, monitor database and object sizes, or diagnose locking and blocking issues between concurrent users, and so on. For more details on what is possible, we recommend taking a look at the official Azure SQL documentation here: https://aka.ms/asdbmwd.
Execution plans
This is the graphical representation of the query execution plan used to resolve that particular query, and it shows all the detailed steps the Azure SQL engine took to produce the result returned to the client. Azure SQL has a pool of memory that is used to store both execution plans and data buffers. When any Transact-SQL statement is executed, the Database Engine first looks through the plan cache to verify whether an execution plan for the same Transact-SQL statement already exists. The Transact-SQL statement qualifies as existing only if it literally matches a previously executed Transact-SQL statement with a cached plan, character for character. Azure SQL reuses any existing plan it finds, saving the overhead of recompiling the Transact-SQL statement. If no execution plan exists, Azure SQL generates a new execution plan for the query, trying to find the one with the lowest cost in terms of resource utilization, within a reasonable time interval.
Understanding how this execution plan is created and executed is critical to make sure that the workload generated by our application is optimized for our data model and indexing strategy or instead requires some optimizations.
Many modern databases use the more technical term DAG – Directed Acyclic Graph – instead of the user-friendlier term Execution Plan. If you already have experience with Apache Spark, for example, a DAG and an Execution Plan are basically the same thing.
As a rule of thumb, the steps of an execution plan visualized in Management Studio should be read from right to left and from top to bottom to determine the order in which they have been executed. In our preceding example, the first step is a Seek operator using a nonclustered index created on the CustomerID column of the Orders table to find one or more rows with a value of 832. As the index the query engine is using likely does not contain all the columns in the select list (where we mistakenly used an asterisk instead of the full column list), the next step in the execution plan is to loop over all rows retrieved by the first operator and, for each of them, execute a Lookup operator that uses the clustered index key retrieved by the Seek operator to read all the other columns from the clustered index (PK_Sales_Orders) created on the Orders table.
For example, we can see that the Key Lookup operator accounts for 99% of the cost of the entire query, and it is executed 127 times (which matches the number of rows filtered by the first Seek operator).
A logical page is nothing more than the page concept we discussed in previous chapters. Every time Azure SQL needs to read some data to process a query or to return results to the end user, it uses one I/O operation to read an entire page, 8KB. A physical read is a read operation done on a page that was not already in memory – and therefore quite expensive. A logical read is a read operation done on a page that was already in the buffer pool, the in-memory cache. If a page is not already in the buffer pool, it is read from disk and added to the pool. For that reason, the number of logical reads is always greater than or equal to the number of physical reads. As a page may be read more than once, the number of logical page reads is a good indicator of how much I/O your query is doing overall. As I/O is the most expensive operation in a database, you generally want to reduce it as much as possible, finding the balance between read and write performance that works for you.
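To see logical and physical reads for your own queries, you can enable I/O statistics for the session; the counts are reported in the client tool's messages pane. The table and filter below are illustrative, echoing the earlier Orders example:

```sql
-- Report per-table logical, physical, and read-ahead page reads
-- for every statement executed while the option is on.
SET STATISTICS IO ON;

SELECT OrderID, OrderDate   -- hypothetical columns, for illustration only
FROM   Sales.Orders
WHERE  CustomerID = 832;

SET STATISTICS IO OFF;
```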
Using DMVs to extract the execution plans of the most expensive queries hitting our Azure SQL database is a great way to investigate the performance of our application. In client tools like SQL Server Management Studio or Azure Data Studio, you can also obtain the same information while running your queries by selecting the “Include Actual Execution Plan” toolbar button (or using the Ctrl+M shortcut) in Management Studio or clicking the “Explain” button in Data Studio.
What we have described here represents the foundational approach you can take to understand how your queries are executed by the Azure SQL engine and how you can optimize your workload to reduce useless resource consumption and improve overall performance.
The way the Azure SQL engine is creating and executing query plans is a complex and fascinating subject; if you want to learn more, please read the official documentation here: https://aka.ms/qpag.
Query store
We previously mentioned that all the internal data structures representing the state of our Azure SQL databases, which source the Dynamic Management Views and Functions, are kept in database process memory, so when an instance is restarted or there’s a planned or unplanned failover, all that diagnostic information gets lost. Following the introduction of the Query Store feature back in SQL Server 2016, Azure SQL can now persist most of this diagnostic information across restarts by default, as Query Store has been enabled on the entire cloud database fleet. This feature provides you with insight into query plan choice and performance and simplifies performance troubleshooting by helping you quickly find performance differences caused by query plan changes. Query Store automatically captures a history of queries, plans, and runtime statistics and retains these for your review. It separates data by time windows so you can see database usage patterns and understand when query plan changes happened on the server. Query Store works effectively like a flight data recorder, constantly collecting compile and runtime information related to queries and plans. Query-related data is persisted in internal tables and presented to users through a set of views.
- Plan store, containing execution plan details
- Runtime stats store, where execution statistics are persisted
- Wait stats store, with historical data around wait statistics
As we discussed previously, once created, the query execution plan for any specific query in Azure SQL can change over time due to statistics changes, schema changes, index changes, and so on. The procedure cache stores the latest execution plan, and plans can also get evicted from cache under memory pressure. If a newly created execution plan, for whatever reason, turns out to be suboptimal, it is usually quite challenging to understand what caused that change. By keeping multiple versions of an execution plan for a given query, Query Store can help figure out what happened, and it is also possible to enforce a policy to direct the Query Processor to use a specific execution plan. This is referred to as plan forcing, where a mechanism like the USE PLAN query hint is applied without requiring any change to the query syntax in your app.
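Plan forcing itself is exposed through a pair of system stored procedures; the id values below are placeholders that you would read from the Query Store catalog views:

```sql
-- Force a known-good plan for a given query; the query_id and plan_id
-- values come from sys.query_store_query and sys.query_store_plan.
EXEC sp_query_store_force_plan @query_id = 42, @plan_id = 7;

-- Remove the forcing when it is no longer needed.
EXEC sp_query_store_unforce_plan @query_id = 42, @plan_id = 7;
```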
Query Store collects plans for DML Statements such as SELECT, INSERT, UPDATE, DELETE, MERGE, and BULK INSERT.
Another great source of information during performance troubleshooting sessions is the availability of statistics related to wait states of the system, basically a collection of the underlying reasons why Azure SQL is taking a given amount of time to respond to user queries. Before the Query Store, wait statistics were usually available at the database instance level, and it was not trivial to correlate them to a specific query.
Let’s look at how Query Store collects its data: query text and the initial plan are sent to the Query Store when a query gets compiled for the first time, and it is updated in case the query gets recompiled. In case a new plan is created, this is added as a new entry for the query, and previous ones are kept along with their runtime execution stats. Runtime statistics are sent to the Query Store for each query execution and are aggregated at plan level within the currently active time interval.
As mentioned, Query Store is enabled by default in Azure SQL and cannot be switched off. The default configuration is optimized for continuous data collection, but you can still control some configuration knobs, like the maximum store size (default 100MB) or the interval length used to aggregate statistics for query executions (default 60 minutes). Unless you have specific needs, like short troubleshooting sessions where you want to speed up data collection, we recommend leaving the settings at their default values.
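Should you need to tweak those knobs, the syntax is the standard ALTER DATABASE ... SET QUERY_STORE option list; the values below are purely illustrative:

```sql
-- Shorten the aggregation interval and raise the size cap, e.g. during a
-- troubleshooting session. INTERVAL_LENGTH_MINUTES accepts a fixed set of
-- values (1, 5, 10, 15, 30, 60, 1440).
ALTER DATABASE CURRENT
SET QUERY_STORE (
    INTERVAL_LENGTH_MINUTES = 15,
    MAX_STORAGE_SIZE_MB = 500
);
```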
- sys.query_store_query_text reports the unique query texts executed against the database, where every statement in a batch generates a separate query text entry.
- sys.query_context_settings presents the unique combinations of plan-affecting settings under which queries are executed.
- sys.query_store_query shows the query entries that are tracked and forced separately in Query Store.
- sys.query_store_plan returns the estimated plan for a query, with compile-time statistics. The stored plan is equivalent to the one you get by using SET SHOWPLAN_XML ON.
- sys.query_store_runtime_stats_interval shows runtime statistics aggregated in automatically generated time windows (intervals) for every executed plan. We can control the size of the interval using INTERVAL_LENGTH_MINUTES in the ALTER DATABASE SET statement.
- sys.query_store_runtime_stats reports aggregated runtime statistics for executed plans. Captured metrics are expressed through four statistical functions: Average, Minimum, Maximum, and Standard Deviation.
By querying these views, you can quickly get detailed information on how your workload is executing; here are some examples.
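As one such example, here is a sketch that joins the main Query Store views to list the slowest plans by average duration (the TOP count and ordering are our choices):

```sql
-- Slowest 10 plan/interval combinations by average duration (microseconds),
-- joined back to the query text they belong to.
SELECT TOP (10)
       qt.query_sql_text,
       q.query_id,
       p.plan_id,
       rs.count_executions,
       rs.avg_duration
FROM   sys.query_store_query_text AS qt
JOIN   sys.query_store_query         AS q  ON qt.query_text_id = q.query_text_id
JOIN   sys.query_store_plan          AS p  ON q.query_id = p.query_id
JOIN   sys.query_store_runtime_stats AS rs ON p.plan_id = rs.plan_id
ORDER BY rs.avg_duration DESC;
```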
You’re probably starting to see what possibilities this feature opens up in terms of monitoring and troubleshooting, right? Let’s now look at some more complex scenarios where the Query Store feature can help during our performance investigations.
Query Store at work
Let’s say your application’s performance has degraded over the last week, and you want to understand if this can be related to some changes in the database.
This query basically calculates which queries have introduced additional duration compared to the previous execution period and returns information like recent and historical execution counts and total duration.
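A sketch of a query with that shape, with an illustrative boundary between the “recent” and “historical” periods (48 hours here), could look like the following:

```sql
-- Queries whose average duration over the last 48 hours is worse than in
-- the preceding period. Durations are in microseconds.
DECLARE @cutoff datetimeoffset = DATEADD(hour, -48, SYSDATETIMEOFFSET());

WITH recent AS (
    SELECT p.query_id, AVG(rs.avg_duration) AS avg_duration
    FROM   sys.query_store_plan AS p
    JOIN   sys.query_store_runtime_stats AS rs ON p.plan_id = rs.plan_id
    JOIN   sys.query_store_runtime_stats_interval AS rsi
           ON rs.runtime_stats_interval_id = rsi.runtime_stats_interval_id
    WHERE  rsi.start_time >= @cutoff
    GROUP BY p.query_id
),
history AS (
    SELECT p.query_id, AVG(rs.avg_duration) AS avg_duration
    FROM   sys.query_store_plan AS p
    JOIN   sys.query_store_runtime_stats AS rs ON p.plan_id = rs.plan_id
    JOIN   sys.query_store_runtime_stats_interval AS rsi
           ON rs.runtime_stats_interval_id = rsi.runtime_stats_interval_id
    WHERE  rsi.start_time < @cutoff
    GROUP BY p.query_id
)
SELECT r.query_id,
       r.avg_duration AS recent_avg_duration,
       h.avg_duration AS historical_avg_duration
FROM   recent  AS r
JOIN   history AS h ON r.query_id = h.query_id
WHERE  r.avg_duration > h.avg_duration         -- regressed queries only
ORDER BY r.avg_duration - h.avg_duration DESC; -- biggest regressions first
```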
It’s important to note that, for this kind of optimization, you won’t need to change your application code in any way.
Another interesting scenario for investigating overall performance of your application is looking at Query Wait Statistics for a given database.
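As a sketch, the wait statistics collected by Query Store can be aggregated by category like this (sys.query_store_wait_stats is available in Azure SQL and SQL Server 2017 onward):

```sql
-- Total wait time by wait category across all collected intervals,
-- highest first. Categories group individual wait types (CPU, Lock, etc.).
SELECT ws.wait_category_desc,
       SUM(ws.total_query_wait_time_ms) AS total_wait_time_ms
FROM   sys.query_store_wait_stats AS ws
GROUP BY ws.wait_category_desc
ORDER BY total_wait_time_ms DESC;
```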
Other similar scenarios are related to Top Resource Consuming Queries or Queries With The Highest Variation, where you can follow similar paths.
To get more details on these interesting capabilities of Azure SQL, we recommend taking a look at the official documentation here: https://aka.ms/daipr.
Raising and catching exceptions in SQL
Error handling in Transact-SQL is similar to exception handling in traditional programming languages. You can wrap a group of Transact-SQL statements in a TRY block, and if an error occurs, control is passed to the following CATCH block, where other statements will be executed. Each error in Azure SQL is associated with a given severity level, and the TRY...CATCH construct catches all execution errors that have a severity higher than 10 and that do not close the database connection. If no error happens in the TRY block, control passes to the statement immediately after the associated END CATCH statement. Within the CATCH block, you can use several system functions to get details about the error that caused the CATCH block to be executed: these function names are pretty self-explanatory (ERROR_NUMBER(), ERROR_SEVERITY(), ERROR_STATE(), ERROR_PROCEDURE(), ERROR_LINE(), ERROR_MESSAGE()). They return NULL if they are called outside the scope of the CATCH block.
When we trap errors within a CATCH block, they are not returned to the calling application. If instead we want to capture error details within the CATCH block, but also report all or part of those details to the calling application, we can call RAISERROR or THROW to bubble up the exception to the caller, or simply return a resultset through a SELECT statement.
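A minimal sketch of this pattern, using a deliberate divide-by-zero to trigger the CATCH block and THROW to rethrow the original error to the caller:

```sql
BEGIN TRY
    SELECT 1 / 0 AS result;   -- raises error 8134 (severity 16)
END TRY
BEGIN CATCH
    -- Report the error details back to the caller as a resultset...
    SELECT ERROR_NUMBER()   AS error_number,
           ERROR_SEVERITY() AS error_severity,
           ERROR_STATE()    AS error_state,
           ERROR_LINE()     AS error_line,
           ERROR_MESSAGE()  AS error_message;

    -- ...and/or rethrow the original error to the calling application.
    THROW;
END CATCH;
```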
We can create complex error management logics by nesting multiple TRY...CATCH constructs. When an error happens within a CATCH block, it is treated like any other error, so if the block contains a nested TRY...CATCH construct, any error in the nested TRY block will pass control to the nested CATCH block. If there is no nested TRY...CATCH construct, the error is passed back to the caller.
If our code is calling other Stored Procedures or Triggers, errors raised by those external modules can be trapped in their own code (if they contain TRY blocks) or can be trapped by TRY...CATCH constructs in the calling code. User-defined functions, though, cannot contain TRY...CATCH constructs.
- Errors with a severity of 10 or lower won’t be trapped by our TRY...CATCH block (they are considered informational messages or warnings, not errors).
- Some errors with a severity of 20 or higher will cause Azure SQL to stop processing other tasks on that session, so TRY...CATCH won’t trap these errors either.
- Compile errors, such as syntax errors, that prevent a batch from running won’t be trapped by our TRY...CATCH block, as they happen at compile time, not at runtime.
- Statement-level recompilation or object name resolution errors (e.g., trying to use a view that has been dropped) will likewise not be trapped.
When an error happening in a TRY block invalidates the state of the current transaction, the transaction is classified as uncommittable. An uncommittable transaction can only perform read operations or a ROLLBACK TRANSACTION. We can call the XACT_STATE function to verify whether the current transaction has been classified as uncommittable: if the function returns –1, that is the case. At the end of the batch, Azure SQL rolls back uncommittable transactions and sends an error message to the application.
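A small sketch of how XACT_STATE can drive the decision inside a CATCH block:

```sql
BEGIN TRY
    BEGIN TRANSACTION;
    -- ... data modification statements ...
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF XACT_STATE() = -1
        ROLLBACK TRANSACTION;  -- uncommittable: rollback is the only option
    ELSE IF XACT_STATE() = 1
        COMMIT TRANSACTION;    -- still committable: we may choose to commit
    -- XACT_STATE() = 0 means there is no active user transaction
END CATCH;
```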
As you have seen, Azure SQL provides complete and powerful exception handling features, which are needed in any modern application, as managing exceptions properly is essential to provide a great user experience.
Keep it simple!
Keep in mind that besides T-SQL, .NET or Python (or any of your preferred languages) code also has great exception support, and the best user experience is usually obtained when they work together as a team.
With this in mind, and with the idea of keeping our solution as simple as possible, a very common pattern where your programming language of choice and T-SQL work very well together is the one that uses SET XACT_ABORT ON.
With XACT_ABORT ON, anything that is in a transaction must be correctly (exactly) executed, or Azure SQL will abort the transaction and terminate the current code execution.
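A minimal sketch of the pattern (table and column names are illustrative):

```sql
-- With XACT_ABORT ON, any runtime error inside the transaction aborts it,
-- rolls it back automatically, and terminates the batch.
SET XACT_ABORT ON;

BEGIN TRANSACTION;

    INSERT INTO dbo.Orders (CustomerID, OrderDate)
    VALUES (832, SYSUTCDATETIME());

    UPDATE dbo.Customers
    SET    LastOrderDate = SYSUTCDATETIME()
    WHERE  CustomerID = 832;

COMMIT TRANSACTION;
```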
Thanks to XACT_ABORT being set to ON in the preceding code, either both the INSERT and the UPDATE will run without any errors, so that the COMMIT will be executed, or, if there is any error during the execution of the INSERT or UPDATE, the entire transaction will be automatically rolled back (even if there is no ROLLBACK TRAN in the code). Execution of the code will also be interrupted, and the raised error will be returned to the caller (the application code in our sample).
As you can see, you have the full spectrum of options when deciding how to deal with exceptions and errors. You can decide that it is better to handle it inside Azure SQL or you can bubble it up to the application. In both cases, you are in control so that you can implement the best option for your solution.
Integration with Application Insights
As application developers creating cloud-based solutions, it is critical to understand that, to troubleshoot and debug our apps’ issues, in most cases we cannot just connect to a specific server as we would have done in a traditional on-premises context. This is especially true for applications leveraging Platform as a Service components and services, where you do not even have the notion of a physical or virtual server to connect to. That is why it is so important to build proper instrumentation into our codebase to emit all the diagnostic information required to remotely drill down into our application’s behavior. Building all this infrastructure ourselves can be a challenging task; that’s why several native and third-party solutions that focus solely on solving the instrumentation challenge have become quite successful in recent years.
Application Insights, a feature of Azure Monitor, is an extensible Application Performance Management (APM) service that developers and DevOps professionals can use to monitor their live applications deployed on the Azure platform. It helps detect performance anomalies and includes powerful analytics tools to help you diagnose issues and understand what users do with your app. It is designed to help you continuously improve performance and usability. It works for apps on a wide variety of platforms including .NET, Node.js, Java, and Python, hosted on-premises, hybrid, or in any public cloud. It integrates with your DevOps process and has connection points to a variety of development tools and services.
Application Insights will let you understand all sorts of insights related to how the components of your application are performing, things like request rates, response times, and failure rates. It also captures dependencies and interactions with Azure services like Azure SQL, plus a bunch of other bonuses. You can then create live dashboards on top of collected data to be used as part of regular production-level monitoring, but also drill down into specific issues or exceptions happening or export diagnostic data to other services.
This is a very complete and useful tool that will assist you on both development and production phases for monitoring, debugging, and troubleshooting purposes.
You can get all details of what we discussed here and much more on the official Application Insights documentation here: https://aka.ms/amaio.
If you want to know more
Query Store for SQL Server 2019: Identify and Fix Poorly Performing Queries – www.amazon.com/Query-Store-SQL-Server-2019/dp/1484250036
Expert Performance Indexing in SQL Server 2019: Toward Faster Results and Lower Maintenance – www.amazon.com/Expert-Performance-Indexing-Server-2019/dp/1484254635
Pro SQL Server 2019 Wait Statistics: A Practical Guide to Analyzing Performance in SQL Server – www.amazon.com/Pro-Server-2019-Wait-Statistics/dp/1484249151
SQL Server 2017 Query Performance Tuning: Troubleshoot and Optimize Query Performance – www.amazon.com/Server-2017-Query-Performance-Tuning-ebook/dp/B07H49LN75
Glenn Berry’s DMVs – https://glennsqlperformance.com/resources/
sp_WhoIsActive – http://whoisactive.com/
sp_Blitz – www.brentozar.com/blitz/
Tiger toolbox – https://github.com/Microsoft/tigertoolbox