In the last section of this chapter, you will learn how to manage a clustered columnstore index (CCI). Besides optimizing query performance, you will also learn how to add a regular B-tree nonclustered index (NCI) to a CCI and how to enforce primary key and unique constraints on such a table. When creating the NCCI in the previous section, the LZ77, or archive, compression was not used yet. You will use it with a CCI in this section.
Let's start by dropping both indexes, the NCCI and the CI, from the demo fact table so that it becomes a heap again:
USE WideWorldImportersDW;
-- Drop the NCCI
DROP INDEX NCCI_FactTest
 ON dbo.FactTest;
-- Drop the CI
DROP INDEX CL_FactTest_DateKey
 ON dbo.FactTest;
GO
Now let's create the clustered columnstore index (CCI):
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactTest
 ON dbo.FactTest;
GO
And, of course, the next step is to recheck the space used by the test fact table:
EXEC sys.sp_spaceused N'dbo.FactTest', @updateusage = N'TRUE';
GO
The result is as follows:
Name         rows    reserved  data      index_size unused
------------ ------- --------- --------- ---------- -------
dbo.FactTest 2279810 23560 KB  23392 KB  0 KB       168 KB
The CCI even uses slightly less space than the NCCI. You can check the number of segments with the following query:
SELECT ROW_NUMBER() OVER (ORDER BY s.column_id, s.segment_id) AS rn,
 COL_NAME(p.object_id, s.column_id) AS column_name,
 s.segment_id, s.row_count,
 s.min_data_id, s.max_data_id,
 s.on_disk_size
FROM sys.column_store_segments AS s
INNER JOIN sys.partitions AS p
 ON s.hobt_id = p.hobt_id
INNER JOIN sys.indexes AS i
 ON p.object_id = i.object_id
WHERE i.name = N'CCI_FactTest'
ORDER BY s.column_id, s.segment_id;
This time, the number of segments is 44. The CCI does not show a dictionary segment per rowgroup; it uses a global dictionary that apparently covers the whole table. How does that influence the queries?
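If you want to look at the dictionaries themselves, the sys.column_store_dictionaries catalog view exposes them. The following query is only a sketch, modeled on the segments query above; a dictionary_id of 0 denotes a global dictionary shared by all rowgroups of a column, while higher values denote local, per-rowgroup dictionaries:

SELECT COL_NAME(p.object_id, d.column_id) AS column_name,
 d.dictionary_id, d.type,
 d.entry_count, d.on_disk_size
FROM sys.column_store_dictionaries AS d
INNER JOIN sys.partitions AS p
 ON d.hobt_id = p.hobt_id
INNER JOIN sys.indexes AS i
 ON p.object_id = i.object_id
WHERE i.name = N'CCI_FactTest'
ORDER BY d.column_id, d.dictionary_id;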
Again, the first test is with the simple query:
SET STATISTICS IO ON;
SELECT f.StockItemKey,
 SUM(f.TotalAmount) AS Sales
FROM dbo.FactTest AS f
WHERE f.StockItemKey < 30
GROUP BY f.StockItemKey
ORDER BY f.StockItemKey;
From the statistics IO output, you can see that SQL Server needed only 82 logical reads. This is a really impressive result. Note that the execution plan again used the Columnstore Index Scan operator, only this time it scanned the CCI and read only the rowgroups and segments needed to satisfy the query. You can see the execution plan in the following figure:
Figure 10-16: CCI scan execution plan
The next query to test is the complex query:
SELECT f.SaleKey,
 f.CustomerKey, f.Customer, cu.[Buying Group],
 f.CityKey, f.City, ci.Country,
 f.DateKey, d.[Calendar Year],
 f.StockItemKey, f.Product, f.Quantity,
 f.TotalAmount, f.Profit
FROM dbo.FactTest AS f
INNER JOIN Dimension.Customer AS cu
 ON f.CustomerKey = cu.[Customer Key]
INNER JOIN Dimension.City AS ci
 ON f.CityKey = ci.[City Key]
INNER JOIN Dimension.[Stock Item] AS s
 ON f.StockItemKey = s.[Stock Item Key]
INNER JOIN Dimension.Date AS d
 ON f.DateKey = d.Date;
This time, SQL Server needed 6,101 LOB logical reads. On my virtual machine, SQL Server used a serial plan that mixed batch mode operators (for the CCI scan and the hash joins) with row mode operators (for the other index scans). This is only slightly better than with the NCCI. How about the point query?
SELECT CustomerKey, Profit
FROM dbo.FactTest
WHERE CustomerKey = 378;
SET STATISTICS IO OFF;
The point query used 484 LOB logical reads this time. Better, but still not optimal.
You might remember that there is still one columnar storage compression option left: archive compression. Let's turn it on with the following code:
ALTER INDEX CCI_FactTest
 ON dbo.FactTest
 REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);
GO
You can imagine what comes next: rechecking the space used by the test fact table:
EXEC sys.sp_spaceused N'dbo.FactTest', @updateusage = N'TRUE';
GO
The result is as follows:
Name         rows    reserved  data      index_size unused
------------ ------- --------- --------- ---------- -------
dbo.FactTest 2279810 19528 KB  19336 KB  0 KB       192 KB
The LZ77 algorithm added some additional compression. Compare the current data size with the initial one: it was 498,528 KB and is now only 19,336 KB, a compression ratio of more than 25:1. This is really impressive. Of course, you can expect the test queries to be even more efficient now. For example, here is the simple query:
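If you want to verify which compression setting is actually in effect, the sys.partitions catalog view reports it per partition. The following sketch should show COLUMNSTORE_ARCHIVE in the data_compression_desc column for the rebuilt CCI:

SELECT OBJECT_NAME(p.object_id) AS table_name,
 p.index_id, p.partition_number,
 p.data_compression_desc
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID(N'dbo.FactTest');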
SET STATISTICS IO ON;
SELECT f.StockItemKey,
 SUM(f.TotalAmount) AS Sales
FROM dbo.FactTest AS f
WHERE f.StockItemKey < 30
GROUP BY f.StockItemKey
ORDER BY f.StockItemKey;
This time, SQL Server needed only 23 LOB logical reads. The next query to test is the complex query:
SELECT f.SaleKey,
 f.CustomerKey, f.Customer, cu.[Buying Group],
 f.CityKey, f.City, ci.Country,
 f.DateKey, d.[Calendar Year],
 f.StockItemKey, f.Product, f.Quantity,
 f.TotalAmount, f.Profit
FROM dbo.FactTest AS f
INNER JOIN Dimension.Customer AS cu
 ON f.CustomerKey = cu.[Customer Key]
INNER JOIN Dimension.City AS ci
 ON f.CityKey = ci.[City Key]
INNER JOIN Dimension.[Stock Item] AS s
 ON f.StockItemKey = s.[Stock Item Key]
INNER JOIN Dimension.Date AS d
 ON f.DateKey = d.Date;
This time, SQL Server needed 4,820 LOB logical reads in the test fact table. It can't get much better than this when all of the data has to be scanned. And what about the point query?
SELECT CustomerKey, Profit
FROM dbo.FactTest
WHERE CustomerKey = 378;
This time, it used 410 LOB logical reads. This number can still be improved.
There is still one query, the point query, that needs additional optimization. In SQL Server 2016, you can create regular rowstore B-tree nonclustered indexes on a table that is organized as columnar storage, that is, on a table with a CCI. The following code adds a nonclustered index with an included column, an index that covers the point query:
CREATE NONCLUSTERED INDEX NCI_FactTest_CustomerKey
 ON dbo.FactTest(CustomerKey)
 INCLUDE(Profit);
GO
Before executing the queries, let's check the space used by the demo fact table:
EXEC sys.sp_spaceused N'dbo.FactTest', @updateusage = N'TRUE';
GO
The result is as follows:
Name         rows    reserved  data      index_size unused
------------ ------- --------- --------- ---------- -------
dbo.FactTest 2279810 90256 KB  19344 KB  70192 KB   720 KB
You can see that the rowstore index uses much more space than the columnar storage. However, a regular NCI is very efficient for seeks. Let's test the queries, starting with the simple query:
SET STATISTICS IO ON;
SELECT f.StockItemKey,
 SUM(f.TotalAmount) AS Sales
FROM dbo.FactTest AS f
WHERE f.StockItemKey < 30
GROUP BY f.StockItemKey
ORDER BY f.StockItemKey;
This query still needed 23 LOB logical reads. If you check the execution plan, you can see that SQL Server still uses the columnstore index scan; the NCI is of no use for this query. How about the complex query?
SELECT f.SaleKey,
 f.CustomerKey, f.Customer, cu.[Buying Group],
 f.CityKey, f.City, ci.Country,
 f.DateKey, d.[Calendar Year],
 f.StockItemKey, f.Product, f.Quantity,
 f.TotalAmount, f.Profit
FROM dbo.FactTest AS f
INNER JOIN Dimension.Customer AS cu
 ON f.CustomerKey = cu.[Customer Key]
INNER JOIN Dimension.City AS ci
 ON f.CityKey = ci.[City Key]
INNER JOIN Dimension.[Stock Item] AS s
 ON f.StockItemKey = s.[Stock Item Key]
INNER JOIN Dimension.Date AS d
 ON f.DateKey = d.Date;
Again, SQL Server needed 4,820 LOB logical reads in the test fact table. The NCI didn't improve this query; it is already optimized. Finally, let's check the point query:
SELECT CustomerKey, Profit
FROM dbo.FactTest
WHERE CustomerKey = 378;
SET STATISTICS IO OFF;
This time, the query needed only 13 logical reads. The SQL Server query optimizer decided to use the covering NCI, as you can see in the following figure, which shows the execution plan for this execution of the point query:
Figure 10-17: Execution plan for the point query that uses the nonclustered covering index
We don't need the nonclustered index anymore, so let's drop it:
DROP INDEX NCI_FactTest_CustomerKey
 ON dbo.FactTest;
GO
You can check the physical status of the rowgroups of the CCI using the sys.dm_db_column_store_row_group_physical_stats
dynamic management view, like the following query shows:
SELECT OBJECT_NAME(object_id) AS table_name,
 row_group_id, state, state_desc,
 total_rows, deleted_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID(N'dbo.FactTest')
ORDER BY row_group_id;
Here is the result:
table_name row_group_id state state_desc total_rows deleted_rows
---------- ------------ ----- ---------- ---------- ------------
FactTest   0            3     COMPRESSED 1048576    0
FactTest   1            3     COMPRESSED 343592     0
FactTest   2            3     COMPRESSED 444768     0
FactTest   3            3     COMPRESSED 442874     0
You can see that all rowgroups are closed and compressed.
In SQL Server 2016, you can also add primary key and unique constraints to a CCI table. The following code adds a unique constraint to the test fact table. Note that you cannot add a primary key constraint because the SaleKey column is nullable:
ALTER TABLE dbo.FactTest
 ADD CONSTRAINT U_SaleKey UNIQUE (SaleKey);
GO
You can check in Object Explorer that the unique constraint is enforced with the help of a unique rowstore nonclustered index. The following screenshot of the Object Explorer window also shows that the SaleKey column is nullable:
Figure 10-18: Unique constraint on a CCI table
Let's test the constraint. The following command tries to insert 75,993 rows that already exist in the test fact table:
INSERT INTO dbo.FactTest
(SaleKey, CustomerKey,
 Customer, CityKey, City,
 DateKey, StockItemKey, Product,
 Quantity, TotalAmount, Profit)
SELECT 10 * 1000000 + f.[Sale Key] AS SaleKey,
 cu.[Customer Key] AS CustomerKey,
 cu.Customer,
 ci.[City Key] AS CityKey,
 ci.City,
 f.[Delivery Date Key] AS DateKey,
 s.[Stock Item Key] AS StockItemKey,
 s.[Stock Item] AS Product,
 f.Quantity,
 f.[Total Excluding Tax] AS TotalAmount,
 f.Profit
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS cu
 ON f.[Customer Key] = cu.[Customer Key]
INNER JOIN Dimension.City AS ci
 ON f.[City Key] = ci.[City Key]
INNER JOIN Dimension.[Stock Item] AS s
 ON f.[Stock Item Key] = s.[Stock Item Key]
INNER JOIN Dimension.Date AS d
 ON f.[Delivery Date Key] = d.Date
WHERE f.[Sale Key] % 3 = 0;
If you execute the code, you get error 2627, a violation of the unique constraint. Let's recheck the status of the rowgroups:
SELECT OBJECT_NAME(object_id) AS table_name,
 row_group_id, state, state_desc,
 total_rows, deleted_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID(N'dbo.FactTest')
ORDER BY row_group_id;
This time, the result differs slightly:
table_name row_group_id state state_desc total_rows deleted_rows
---------- ------------ ----- ---------- ---------- ------------
FactTest   0            3     COMPRESSED 1048576    0
FactTest   1            3     COMPRESSED 343592     0
FactTest   2            3     COMPRESSED 444768     0
FactTest   3            3     COMPRESSED 442874     0
FactTest   4            1     OPEN       0          0
Although the insert was rejected, SQL Server did not close or remove the open delta store rowgroup. Of course, this makes sense, since this storage might be needed again soon for subsequent data modifications. You can rebuild the index to get rid of this delta store. The following command rebuilds the CCI, this time without archive compression:
ALTER INDEX CCI_FactTest
 ON dbo.FactTest
 REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE);
GO
You can check the rowgroup's status again:
SELECT OBJECT_NAME(object_id) AS table_name,
 row_group_id, state, state_desc,
 total_rows, deleted_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID(N'dbo.FactTest')
ORDER BY row_group_id;
Here is the result:
table_name row_group_id state state_desc total_rows deleted_rows
---------- ------------ ----- ---------- ---------- ------------
FactTest   0            3     COMPRESSED 1048576    0
FactTest   1            3     COMPRESSED 343592     0
FactTest   2            3     COMPRESSED 444768     0
FactTest   3            3     COMPRESSED 442874     0
So far, we have tested only an unsuccessful insert. Of course, you can also insert some valid data. Before that, let's drop the constraint, since it is no longer needed for the rest of the demonstration:
ALTER TABLE dbo.FactTest
 DROP CONSTRAINT U_SaleKey;
GO
Next, you can insert some valid rows. The following statement inserts 113,990 rows into the test fact table. Note that this is more than the 102,400-row threshold for trickle inserts; therefore, you should expect the insert to be treated as a bulk load:
INSERT INTO dbo.FactTest
(SaleKey, CustomerKey,
 Customer, CityKey, City,
 DateKey, StockItemKey, Product,
 Quantity, TotalAmount, Profit)
SELECT 11 * 1000000 + f.[Sale Key] AS SaleKey,
 cu.[Customer Key] AS CustomerKey,
 cu.Customer,
 ci.[City Key] AS CityKey,
 ci.City,
 f.[Delivery Date Key] AS DateKey,
 s.[Stock Item Key] AS StockItemKey,
 s.[Stock Item] AS Product,
 f.Quantity,
 f.[Total Excluding Tax] AS TotalAmount,
 f.Profit
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS cu
 ON f.[Customer Key] = cu.[Customer Key]
INNER JOIN Dimension.City AS ci
 ON f.[City Key] = ci.[City Key]
INNER JOIN Dimension.[Stock Item] AS s
 ON f.[Stock Item Key] = s.[Stock Item Key]
INNER JOIN Dimension.Date AS d
 ON f.[Delivery Date Key] = d.Date
WHERE f.[Sale Key] % 2 = 0;
You can check whether this was a bulk insert by checking the rowgroups again:
SELECT OBJECT_NAME(object_id) AS table_name,
 row_group_id, state, state_desc,
 total_rows, deleted_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID(N'dbo.FactTest')
ORDER BY row_group_id;
The result shows you that you have only compressed rowgroups:
table_name row_group_id state state_desc total_rows deleted_rows
---------- ------------ ----- ---------- ---------- ------------
FactTest   0            3     COMPRESSED 1048576    0
FactTest   1            3     COMPRESSED 343592     0
FactTest   2            3     COMPRESSED 444768     0
FactTest   3            3     COMPRESSED 442874     0
FactTest   4            3     COMPRESSED 113990     0
Although all rowgroups are compressed, notice that the last rowgroup has far fewer rows than the others. Bulk loads are more efficient when each one fills a rowgroup closer to the 1,048,576-row maximum. Now let's insert a smaller number of rows. This time, you can also turn on the graphical execution plan:
INSERT INTO dbo.FactTest
(SaleKey, CustomerKey,
 Customer, CityKey, City,
 DateKey, StockItemKey, Product,
 Quantity, TotalAmount, Profit)
SELECT 12 * 1000000 + f.[Sale Key] AS SaleKey,
 cu.[Customer Key] AS CustomerKey,
 cu.Customer,
 ci.[City Key] AS CityKey,
 ci.City,
 f.[Delivery Date Key] AS DateKey,
 s.[Stock Item Key] AS StockItemKey,
 s.[Stock Item] AS Product,
 f.Quantity,
 f.[Total Excluding Tax] AS TotalAmount,
 f.Profit
FROM Fact.Sale AS f
INNER JOIN Dimension.Customer AS cu
 ON f.[Customer Key] = cu.[Customer Key]
INNER JOIN Dimension.City AS ci
 ON f.[City Key] = ci.[City Key]
INNER JOIN Dimension.[Stock Item] AS s
 ON f.[Stock Item Key] = s.[Stock Item Key]
INNER JOIN Dimension.Date AS d
 ON f.[Delivery Date Key] = d.Date
WHERE f.[Sale Key] % 3 = 0;
In the plan, you can see that only row mode operators were used; SQL Server 2016 does not use batch mode operators for data modification (DML) operations. The following screenshot shows a portion of the execution plan, with one Hash Match operator highlighted to show that row mode processing was used:
Figure 10-19: DML operations are processed in row mode
Let's recheck the status of the rowgroups:
SELECT OBJECT_NAME(object_id) AS table_name,
 row_group_id, state, state_desc,
 total_rows, deleted_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID(N'dbo.FactTest')
ORDER BY row_group_id;
This time, another open rowgroup is in the result:
table_name row_group_id state state_desc total_rows deleted_rows
---------- ------------ ----- ---------- ---------- ------------
FactTest   0            3     COMPRESSED 1048576    0
FactTest   1            3     COMPRESSED 343592     0
FactTest   2            3     COMPRESSED 444768     0
FactTest   3            3     COMPRESSED 442874     0
FactTest   4            3     COMPRESSED 113990     0
FactTest   5            1     OPEN       75993      0
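A full rebuild is not the only way to get rid of an open rowgroup. As a lighter-weight alternative, SQL Server 2016 lets you reorganize the index and explicitly ask for all open rowgroups to be closed and compressed. The following sketch shows the syntax; we use a rebuild in this demo because it also redistributes the rows across rowgroups:

-- Alternative to a full rebuild: close and compress open rowgroups
ALTER INDEX CCI_FactTest
 ON dbo.FactTest
 REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);
GO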
Let's rebuild the index to get only compressed rowgroups again:
ALTER INDEX CCI_FactTest
 ON dbo.FactTest
 REBUILD;
GO
After the rebuild, let's see what happened to the rowgroups:
SELECT OBJECT_NAME(object_id) AS table_name,
 row_group_id, state, state_desc,
 total_rows, deleted_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID(N'dbo.FactTest')
ORDER BY row_group_id;
The result shows you that you have only compressed rowgroups:
table_name row_group_id state state_desc total_rows deleted_rows
---------- ------------ ----- ---------- ---------- ------------
FactTest   0            3     COMPRESSED 1048576    0
FactTest   1            3     COMPRESSED 428566     0
FactTest   2            3     COMPRESSED 495276     0
FactTest   3            3     COMPRESSED 497375     0
During the rebuild, SQL Server redistributed the rows from the trickle insert into the other rowgroups. Let's now select the rows from the last trickle insert:
SELECT *
FROM dbo.FactTest
WHERE SaleKey >= 12000000
ORDER BY SaleKey;
Let's test what happens when you delete rows from a CCI with the following DELETE command. First, check the estimated execution plan, so don't execute the command yet:
DELETE
FROM dbo.FactTest
WHERE SaleKey >= 12000000;
The following screenshot shows the estimated execution plan. Again, you can see that DML operations use row mode processing operators only:
Figure 10-20: Estimated execution plan for a DELETE
And here is the final check of the state of the rowgroups:
SELECT OBJECT_NAME(object_id) AS table_name,
 row_group_id, state, state_desc,
 total_rows, deleted_rows
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID(N'dbo.FactTest')
ORDER BY row_group_id;
This time, the result shows deleted rows in the last rowgroup:
table_name row_group_id state state_desc total_rows deleted_rows
---------- ------------ ----- ---------- ---------- ------------
FactTest   0            3     COMPRESSED 1048576    0
FactTest   1            3     COMPRESSED 428566     0
FactTest   2            3     COMPRESSED 495276     0
FactTest   3            3     COMPRESSED 497375     75993
You can see that one of the rowgroups has deleted rows. Although the total number of rows in this rowgroup did not change, you can no longer access the deleted rows; a delete bitmap, a B-tree structure associated with the rowgroup, defines which rows are deleted. You can try to retrieve the deleted rows:
SELECT *
FROM dbo.FactTest
WHERE SaleKey >= 12000000
ORDER BY SaleKey;
This time, no rows are returned.
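The deleted rows still occupy space in the compressed rowgroup until the index is rebuilt. If you wanted to reclaim that space before dropping the table, a rebuild, as sketched below, would physically remove the rows flagged in the delete bitmap:

ALTER INDEX CCI_FactTest
 ON dbo.FactTest
 REBUILD;
GO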
Finally, let's clean up the WideWorldImportersDW database with the following code:

USE WideWorldImportersDW;
GO
DROP TABLE dbo.FactTest;
GO
Before finishing this chapter, the following table summarizes the space needed for different versions of row and columnar storage:
Storage                          Rows       Reserved    Data        Index
-------------------------------- ---------- ----------- ----------- ----------
CI                               227,981    49,672 KB   48,528 KB   200 KB
CI row compression               227,981    25,864 KB   24,944 KB   80 KB
CI page compression              227,981    18,888 KB   18,048 KB   80 KB
CI (10 times more rows)          2,279,810  502,152 KB  498,528 KB  2,072 KB
CI with NCCI                     2,279,810  529,680 KB  498,528 KB  29,432 KB
CCI                              2,279,810  23,560 KB   23,392 KB   0 KB
CCI archive compression          2,279,810  19,528 KB   19,336 KB   0 KB
CCI archive compression and NCI  2,279,810  90,256 KB   19,344 KB   70,192 KB