Clustered columnstore indexes

In the last section of this chapter, you will learn how to manage a clustered columnstore index (CCI). Besides optimizing query performance, you will also learn how to add a regular B-tree nonclustered index (NCI) to a CCI and use it to support primary key and unique constraints. When you created the NCCI in the previous section, the LZ77 archive compression was not used yet; you will use it with a CCI in this section. Altogether, you will learn how to do the following:

  • Create clustered columnstore indexes
  • Use archive compression
  • Add B-tree NCI to a CCI
  • Use B-tree NCI for a constraint
  • Update data in a CCI

Compression and query performance

Let's start by dropping both indexes from the demo fact table, the NCCI and the CI, to make a heap again:

USE WideWorldImportersDW; 
-- Drop the NCCI 
DROP INDEX NCCI_FactTest 
  ON dbo.FactTest; 
-- Drop the CI 
DROP INDEX CL_FactTest_DateKey 
  ON dbo.FactTest; 
GO 

Now let's create the clustered columnstore index (CCI):

CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactTest 
  ON dbo.FactTest; 
GO 

And, of course, the next step is to recheck the space used by the test fact table:

EXEC sys.sp_spaceused N'dbo.FactTest', @updateusage = N'TRUE'; 
GO 

The result is as follows:

Name          rows     reserved   data       index_size  unused 
------------  -------  ---------  ---------  ----------  -------
dbo.FactTest  2279810   23560 KB   23392 KB        0 KB   168 KB

The CCI even uses slightly less space than the NCCI. You can check the number of segments with the following query:

SELECT ROW_NUMBER() OVER (ORDER BY s.column_id, s.segment_id) AS rn, 
 COL_NAME(p.object_id, s.column_id) AS column_name, 
 s.segment_id, s.row_count,  
 s.min_data_id, s.max_data_id, 
 s.on_disk_size 
FROM sys.column_store_segments AS s    
INNER JOIN sys.partitions AS p    
    ON s.hobt_id = p.hobt_id    
INNER JOIN sys.indexes AS i    
    ON p.object_id = i.object_id   
WHERE i.name = N'CCI_FactTest' 
ORDER BY s.column_id, s.segment_id; 

This time, the number of segments is 44: the test fact table has 11 columns and the data is split into 4 rowgroups, so you get 11 x 4 = 44 segments. Unlike the NCCI, the CCI does not show a dictionary segment per rowgroup; it has a global dictionary that apparently covers the whole table. If you want to inspect the dictionaries yourself, you can query the sys.column_store_dictionaries catalog view, as in the following minimal sketch, where a dictionary_id of 0 denotes the global dictionary for a column:
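
-- Dictionaries of the CCI; dictionary_id = 0 is the global dictionary
SELECT COL_NAME(p.object_id, d.column_id) AS column_name,
 d.dictionary_id, d.type, d.entry_count, d.on_disk_size
FROM sys.column_store_dictionaries AS d
INNER JOIN sys.partitions AS p
    ON d.hobt_id = p.hobt_id
INNER JOIN sys.indexes AS i
    ON p.object_id = i.object_id
WHERE i.name = N'CCI_FactTest'
ORDER BY d.column_id, d.dictionary_id;

How does that influence the queries?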

Testing the clustered columnstore index

Again, the first test is with the simple query:

SET STATISTICS IO ON; 
SELECT f.StockItemKey, 
 SUM(f.TotalAmount) AS Sales 
FROM dbo.FactTest AS f 
WHERE f.StockItemKey < 30 
GROUP BY f.StockItemKey 
ORDER BY f.StockItemKey; 

From the statistics IO output, you can see that SQL Server used only 82 logical reads, which is really impressive this time. Note that the execution plan again uses the Columnstore Index Scan operator, only this time it scans the CCI and selects only the rowgroups and segments needed to satisfy the query. You can see the execution plan in the following figure:


Figure 10-16: CCI scan execution plan

The next query to test is the complex query:

SELECT f.SaleKey, 
  f.CustomerKey, f.Customer, cu.[Buying Group], 
  f.CityKey, f.City, ci.Country, 
  f.DateKey, d.[Calendar Year], 
  f.StockItemKey, f.Product,  
  f.Quantity, f.TotalAmount, f.Profit 
FROM dbo.FactTest AS f 
  INNER JOIN Dimension.Customer AS cu 
    ON f.CustomerKey = cu.[Customer Key] 
  INNER JOIN Dimension.City AS ci 
    ON f.CityKey = ci.[City Key] 
  INNER JOIN Dimension.[Stock Item] AS s 
    ON f.StockItemKey = s.[Stock Item Key] 
  INNER JOIN Dimension.Date AS d 
    ON f.DateKey = d.Date; 

This time, SQL Server needed 6,101 LOB logical reads. On my virtual machine, SQL Server used a serial plan that mixed batch mode operators (for the CCI scan and the hash joins) with row mode operators (for the other index scans). This is only slightly better than with the NCCI. How about the point query?

SELECT CustomerKey, Profit 
FROM dbo.FactTest 
WHERE CustomerKey = 378; 
SET STATISTICS IO OFF; 

The point query this time used 484 LOB logical reads. Better, but still not the best possible.

Using archive compression

You might remember that there is still one option left for the columnar storage compression—the archive compression. Let's turn it on with the following code:

ALTER INDEX CCI_FactTest 
 ON dbo.FactTest 
 REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE); 
GO
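
If you are curious where the savings come from, you can check the on-disk size of each rowgroup with the size_in_bytes column of the sys.dm_db_column_store_row_group_physical_stats dynamic management view; here is a minimal sketch:

-- Per-rowgroup on-disk size after the archive rebuild
SELECT row_group_id, total_rows,
 CAST(size_in_bytes / 1024.0 AS DECIMAL(10, 1)) AS size_kb
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID(N'dbo.FactTest')
ORDER BY row_group_id;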

You can imagine what comes next: rechecking the space used by the test fact table:

EXEC sys.sp_spaceused N'dbo.FactTest', @updateusage = N'TRUE'; 
GO 

The result is as follows:

Name          rows     reserved   data       index_size  unused 
------------  -------  ---------  ---------  ----------  -------
dbo.FactTest  2279810   19528 KB   19336 KB        0 KB   192 KB

The LZ77 algorithm added some additional compression. Compare the current data size with the initial one: the data size dropped from 498,528 KB to only 19,336 KB, a compression ratio of more than 25 times! This is really impressive. Of course, you can expect the test queries to be even more efficient now. For example, here is the simple query:

SET STATISTICS IO ON; 
SELECT f.StockItemKey, 
 SUM(f.TotalAmount) AS Sales 
FROM dbo.FactTest AS f 
WHERE f.StockItemKey < 30 
GROUP BY f.StockItemKey 
ORDER BY f.StockItemKey; 

This time, SQL Server needed only 23 LOB logical reads. The next query to test is the complex query:

SELECT f.SaleKey, 
  f.CustomerKey, f.Customer, cu.[Buying Group], 
  f.CityKey, f.City, ci.Country, 
  f.DateKey, d.[Calendar Year], 
  f.StockItemKey, f.Product,  
  f.Quantity, f.TotalAmount, f.Profit 
FROM dbo.FactTest AS f 
  INNER JOIN Dimension.Customer AS cu 
    ON f.CustomerKey = cu.[Customer Key] 
  INNER JOIN Dimension.City AS ci 
    ON f.CityKey = ci.[City Key] 
  INNER JOIN Dimension.[Stock Item] AS s 
    ON f.StockItemKey = s.[Stock Item Key] 
  INNER JOIN Dimension.Date AS d 
    ON f.DateKey = d.Date; 

This time, SQL Server needed 4,820 LOB logical reads in the test fact table. It can't get much better than this for scanning all of the data. And what about the point query?

SELECT CustomerKey, Profit 
FROM dbo.FactTest 
WHERE CustomerKey = 378; 

This time, it used 410 LOB logical reads. This number can still be improved.

Adding B-Tree indexes and constraints

There is still one query, the point query, that needs additional optimization. In SQL Server 2016, you can create regular rowstore B-tree nonclustered indexes on a clustered columnstore index, that is, on a table organized as columnar storage. The following code adds a nonclustered index with an included column, an index that covers the point query:

CREATE NONCLUSTERED INDEX NCI_FactTest_CustomerKey 
 ON dbo.FactTest(CustomerKey) 
 INCLUDE(Profit); 
GO 
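
Such a B-tree index can also be filtered, so that the rowstore duplicates only a hot subset of the rows. The following is a hypothetical variation; the customer range in the predicate is purely illustrative, and the sketch drops the index again so that the measurements below stay unchanged:

-- Hypothetical filtered covering index for a "hot" customer range
CREATE NONCLUSTERED INDEX NCI_FactTest_CustomerKey_Hot
 ON dbo.FactTest(CustomerKey)
 INCLUDE(Profit)
 WHERE CustomerKey BETWEEN 300 AND 400;
GO
-- Drop it again to keep the space measurements below unchanged
DROP INDEX NCI_FactTest_CustomerKey_Hot
 ON dbo.FactTest;
GO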

Before executing the queries, let's check the space used by the demo fact table:

EXEC sys.sp_spaceused N'dbo.FactTest', @updateusage = N'TRUE'; 
GO 

The result is as follows:

Name          rows     reserved   data       index_size  unused 
------------  -------  ---------  ---------  ----------  -------
dbo.FactTest  2279810   90256 KB   19344 KB    70192 KB   720 KB

You can see that the row storage uses much more space than the columnar storage. However, a regular NCI is very efficient for seeks. Let's test the queries, starting with the simple query:

SET STATISTICS IO ON; 
SELECT f.StockItemKey, 
 SUM(f.TotalAmount) AS Sales 
FROM dbo.FactTest AS f 
WHERE f.StockItemKey < 30 
GROUP BY f.StockItemKey 
ORDER BY f.StockItemKey; 

This query still needed 23 LOB logical reads. If you check the execution plan, you can see that SQL Server is still using the columnstore index scan. Of course, the NCI is not very useful for this query. How about the complex query?

SELECT f.SaleKey, 
  f.CustomerKey, f.Customer, cu.[Buying Group], 
  f.CityKey, f.City, ci.Country, 
  f.DateKey, d.[Calendar Year], 
  f.StockItemKey, f.Product,  
  f.Quantity, f.TotalAmount, f.Profit 
FROM dbo.FactTest AS f 
  INNER JOIN Dimension.Customer AS cu 
    ON f.CustomerKey = cu.[Customer Key] 
  INNER JOIN Dimension.City AS ci 
    ON f.CityKey = ci.[City Key] 
  INNER JOIN Dimension.[Stock Item] AS s 
    ON f.StockItemKey = s.[Stock Item Key] 
  INNER JOIN Dimension.Date AS d 
    ON f.DateKey = d.Date; 

Again, SQL Server needed 4,820 LOB logical reads in the test fact table. The NCI didn't improve this query; it is already optimized. Finally, let's check the point query:

SELECT CustomerKey, Profit 
FROM dbo.FactTest 
WHERE CustomerKey = 378; 
SET STATISTICS IO OFF; 

This time, the query needed only 13 logical reads. The SQL Server query optimizer decided to use the covering NCI, as the following figure with the execution plan for the point query shows:


Figure 10-17: Execution plan for the point query that uses the nonclustered covering index
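
If you want to verify the difference yourself, you can force the optimizer back to the columnstore index with a table hint and compare the number of reads; a minimal sketch:

SET STATISTICS IO ON;
-- Force the CCI to compare its LOB reads with the 13 logical reads of the NCI
SELECT CustomerKey, Profit
FROM dbo.FactTest WITH (INDEX(CCI_FactTest))
WHERE CustomerKey = 378;
SET STATISTICS IO OFF;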

We don't need the nonclustered index anymore, so let's drop it:

DROP INDEX NCI_FactTest_CustomerKey 
 ON dbo.FactTest; 
GO 

You can check the physical status of the rowgroups of the CCI using the sys.dm_db_column_store_row_group_physical_stats dynamic management view, as the following query shows:

SELECT OBJECT_NAME(object_id) AS table_name, 
 row_group_id, state, state_desc, 
 total_rows, deleted_rows 
FROM sys.dm_db_column_store_row_group_physical_stats 
WHERE object_id = OBJECT_ID(N'dbo.FactTest') 
ORDER BY row_group_id; 

Here is the result:

table_name  row_group_id  state  state_desc  total_rows deleted_rows
----------  ------------  -----  ----------  ---------- ------------
FactTest               0      3  COMPRESSED     1048576            0
FactTest               1      3  COMPRESSED      343592            0
FactTest               2      3  COMPRESSED      444768            0
FactTest               3      3  COMPRESSED      442874            0

You can see that all rowgroups are closed and compressed.
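
If you are also curious about the internal rowsets that support a CCI, such as delta stores and the deleted bitmap, you can query the sys.internal_partitions catalog view; a minimal sketch:

-- Internal rowsets behind the CCI (delta stores, deleted bitmap, and so on)
SELECT OBJECT_NAME(object_id) AS table_name,
 internal_object_type_desc, rows
FROM sys.internal_partitions
WHERE object_id = OBJECT_ID(N'dbo.FactTest');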

In SQL Server 2016, you can also add primary key and unique constraints to a CCI table. The following code adds a unique constraint to the test fact table. Note that you cannot add a primary key constraint because the SaleKey column is nullable:

ALTER TABLE dbo.FactTest 
 ADD CONSTRAINT U_SaleKey UNIQUE (SaleKey); 
GO 
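
If the SaleKey column had been declared NOT NULL when the table was created, a primary key would work in the same way; here is a purely hypothetical sketch (it fails on the current table, precisely because SaleKey is nullable):

-- Hypothetical: works only if SaleKey were defined as NOT NULL
ALTER TABLE dbo.FactTest
 ADD CONSTRAINT PK_FactTest PRIMARY KEY NONCLUSTERED (SaleKey);
GO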

You can check in Object Explorer that the unique constraint is enforced with the help of a unique rowstore nonclustered index. The following screenshot of the Object Explorer window also shows that the SaleKey column is nullable:


Figure 10-18: Unique constraint on a CCI table

Let's test the constraint. The following command tries to insert 75,993 rows that already exist in the test fact table:

INSERT INTO dbo.FactTest 
(SaleKey, CustomerKey,  
 Customer, CityKey, City, 
 DateKey, StockItemKey, 
 Product, Quantity, 
 TotalAmount, Profit) 
SELECT 10 * 1000000 + f.[Sale Key] AS SaleKey, 
  cu.[Customer Key] AS CustomerKey, cu.Customer, 
  ci.[City Key] AS CityKey, ci.City,  
  f.[Delivery Date Key] AS DateKey, 
  s.[Stock Item Key] AS StockItemKey, s.[Stock Item] AS Product,  
  f.Quantity, f.[Total Excluding Tax] AS TotalAmount, f.Profit 
FROM Fact.Sale AS f 
  INNER JOIN Dimension.Customer AS cu 
    ON f.[Customer Key] = cu.[Customer Key] 
  INNER JOIN Dimension.City AS ci 
    ON f.[City Key] = ci.[City Key] 
  INNER JOIN Dimension.[Stock Item] AS s 
    ON f.[Stock Item Key] = s.[Stock Item Key] 
  INNER JOIN Dimension.Date AS d 
    ON f.[Delivery Date Key] = d.Date 
WHERE f.[Sale Key] % 3 = 0; 

If you execute the code, you get error 2627, a violation of the unique constraint. In a real load, you might prefer to trap such violations instead of failing the whole batch; here is a minimal sketch using TRY...CATCH, where re-inserting any existing row violates the U_SaleKey constraint:
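
BEGIN TRY
  -- Re-inserting an existing row must violate the U_SaleKey constraint
  INSERT INTO dbo.FactTest
  (SaleKey, CustomerKey, Customer, CityKey, City,
   DateKey, StockItemKey, Product, Quantity, TotalAmount, Profit)
  SELECT TOP (1) SaleKey, CustomerKey, Customer, CityKey, City,
   DateKey, StockItemKey, Product, Quantity, TotalAmount, Profit
  FROM dbo.FactTest;
END TRY
BEGIN CATCH
  SELECT ERROR_NUMBER() AS error_number,
         ERROR_MESSAGE() AS error_message;
END CATCH;

Now, let's recheck the status of the rowgroups: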

SELECT OBJECT_NAME(object_id) AS table_name, 
 row_group_id, state, state_desc, 
 total_rows, deleted_rows 
FROM sys.dm_db_column_store_row_group_physical_stats 
WHERE object_id = OBJECT_ID(N'dbo.FactTest') 
ORDER BY row_group_id; 

This time, the result differs slightly:

table_name  row_group_id  state  state_desc  total_rows deleted_rows
----------  ------------  -----  ----------  ---------- ------------
FactTest               0      3  COMPRESSED     1048576            0
FactTest               1      3  COMPRESSED      343592            0
FactTest               2      3  COMPRESSED      444768            0
FactTest               3      3  COMPRESSED      442874            0
FactTest               4      1  OPEN                 0            0

Although the insert was rejected, SQL Server did not close or delete the delta storage. Of course, this makes sense, since this storage might soon become useful for subsequent updates of the data. You can rebuild the index to get rid of this delta storage. The following command rebuilds the CCI, this time without archive compression:

ALTER INDEX CCI_FactTest 
 ON dbo.FactTest 
 REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE); 
GO 
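
As a lighter-weight alternative to a full rebuild, you could also force the contents of the delta storage into compressed rowgroups with a reorganize; a minimal sketch, not used in this chapter's flow:

-- Alternative to a rebuild: compress all open delta rowgroups
ALTER INDEX CCI_FactTest
 ON dbo.FactTest
 REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);
GO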

You can check the rowgroup's status again:

SELECT OBJECT_NAME(object_id) AS table_name, 
 row_group_id, state, state_desc, 
 total_rows, deleted_rows 
FROM sys.dm_db_column_store_row_group_physical_stats 
WHERE object_id = OBJECT_ID(N'dbo.FactTest') 
ORDER BY row_group_id; 

Here is the result:

table_name  row_group_id  state  state_desc  total_rows  deleted_rows
----------  ------------  -----  ----------  ----------  ------------
FactTest               0      3  COMPRESSED     1048576             0
FactTest               1      3  COMPRESSED      343592             0
FactTest               2      3  COMPRESSED      444768             0
FactTest               3      3  COMPRESSED      442874             0

Note

Note that your results for the number of rows in each rowgroup can vary slightly.

Updating a clustered columnstore index

So far, only an unsuccessful insert was tested. Of course, you can also insert some valid data. Before that, let's drop the constraint, since it is no longer needed:

ALTER TABLE dbo.FactTest 
 DROP CONSTRAINT U_SaleKey; 
GO 

Next, you can insert some valid rows. The following statement inserts 113,990 rows into the test fact table. Note that this is more than the 102,400-row threshold for trickle inserts; therefore, you should expect the insert to be treated as a bulk insert:

INSERT INTO dbo.FactTest 
(SaleKey, CustomerKey,  
 Customer, CityKey, City, 
 DateKey, StockItemKey, 
 Product, Quantity, 
 TotalAmount, Profit) 
SELECT 11 * 1000000 + f.[Sale Key] AS SaleKey, 
  cu.[Customer Key] AS CustomerKey, cu.Customer, 
  ci.[City Key] AS CityKey, ci.City,  
  f.[Delivery Date Key] AS DateKey, 
  s.[Stock Item Key] AS StockItemKey, s.[Stock Item] AS Product,  
  f.Quantity, f.[Total Excluding Tax] AS TotalAmount, f.Profit 
FROM Fact.Sale AS f 
  INNER JOIN Dimension.Customer AS cu 
    ON f.[Customer Key] = cu.[Customer Key] 
  INNER JOIN Dimension.City AS ci 
    ON f.[City Key] = ci.[City Key] 
  INNER JOIN Dimension.[Stock Item] AS s 
    ON f.[Stock Item Key] = s.[Stock Item Key] 
  INNER JOIN Dimension.Date AS d 
    ON f.[Delivery Date Key] = d.Date 
WHERE f.[Sale Key] % 2 = 0; 

You can check whether this was a bulk insert by checking the rowgroups again:

SELECT OBJECT_NAME(object_id) AS table_name, 
 row_group_id, state, state_desc, 
 total_rows, deleted_rows 
FROM sys.dm_db_column_store_row_group_physical_stats 
WHERE object_id = OBJECT_ID(N'dbo.FactTest') 
ORDER BY row_group_id; 

The result shows you that you have only compressed rowgroups:

table_name  row_group_id  state  state_desc  total_rows  deleted_rows
----------  ------------  -----  ----------  ----------  ------------
FactTest               0      3  COMPRESSED     1048576             0
FactTest               1      3  COMPRESSED      343592             0
FactTest               2      3  COMPRESSED      444768             0
FactTest               3      3  COMPRESSED      442874             0
FactTest               4      3  COMPRESSED      113990             0
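
The last rowgroup holds the 113,990 freshly inserted rows, so the insert was indeed treated as a bulk insert. You can also ask SQL Server why a compressed rowgroup ended up below the 1,048,576-row maximum by querying the trim reason columns of the same DMV; a minimal sketch:

-- Why were compressed rowgroups trimmed below the 1,048,576-row maximum?
SELECT row_group_id, total_rows,
 trim_reason, trim_reason_desc
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID(N'dbo.FactTest')
ORDER BY row_group_id;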

Although all rowgroups are compressed, you will notice that the last rowgroup has fewer rows than the others. It would be more efficient if you could use bulk inserts with more rows, as close as possible to the 1,048,576-row maximum. Now let's try to insert a smaller number of rows. This time, you can also turn on the graphical execution plan:

INSERT INTO dbo.FactTest 
(SaleKey, CustomerKey,  
 Customer, CityKey, City, 
 DateKey, StockItemKey, 
 Product, Quantity, 
 TotalAmount, Profit) 
SELECT 12 * 1000000 + f.[Sale Key] AS SaleKey, 
  cu.[Customer Key] AS CustomerKey, cu.Customer, 
  ci.[City Key] AS CityKey, ci.City,  
  f.[Delivery Date Key] AS DateKey, 
  s.[Stock Item Key] AS StockItemKey, s.[Stock Item] AS Product,  
  f.Quantity, f.[Total Excluding Tax] AS TotalAmount, f.Profit 
FROM Fact.Sale AS f 
  INNER JOIN Dimension.Customer AS cu 
    ON f.[Customer Key] = cu.[Customer Key] 
  INNER JOIN Dimension.City AS ci 
    ON f.[City Key] = ci.[City Key] 
  INNER JOIN Dimension.[Stock Item] AS s 
    ON f.[Stock Item Key] = s.[Stock Item Key] 
  INNER JOIN Dimension.Date AS d 
    ON f.[Delivery Date Key] = d.Date 
WHERE f.[Sale Key] % 3 = 0; 

In the plan, you can see that only row mode operators were used. SQL Server does not use batch mode operators for data modification (DML) operations. The following screenshot shows a portion of the execution plan, with one Hash Match operator highlighted to show that row mode processing was used:


Figure 10-19: DML operations are processed in row mode

Let's recheck the status of the rowgroups:

SELECT OBJECT_NAME(object_id) AS table_name, 
 row_group_id, state, state_desc, 
 total_rows, deleted_rows 
FROM sys.dm_db_column_store_row_group_physical_stats 
WHERE object_id = OBJECT_ID(N'dbo.FactTest') 
ORDER BY row_group_id; 

This time, another open rowgroup is in the result:

table_name  row_group_id  state  state_desc  total_rows  deleted_rows
----------  ------------  -----  ----------  ----------  ------------
FactTest               0      3  COMPRESSED     1048576             0
FactTest               1      3  COMPRESSED      343592             0
FactTest               2      3  COMPRESSED      444768             0
FactTest               3      3  COMPRESSED      442874             0
FactTest               4      3  COMPRESSED      113990             0
FactTest               5      1  OPEN             75993             0

Let's rebuild the index to get only compressed rowgroups again:

ALTER INDEX CCI_FactTest 
 ON dbo.FactTest REBUILD; 
GO 

After the rebuild, let's see what happened to the rowgroups:

SELECT OBJECT_NAME(object_id) AS table_name, 
 row_group_id, state, state_desc, 
 total_rows, deleted_rows 
FROM sys.dm_db_column_store_row_group_physical_stats 
WHERE object_id = OBJECT_ID(N'dbo.FactTest') 
ORDER BY row_group_id; 

The result shows you that you have only compressed rowgroups:

table_name  row_group_id  state  state_desc  total_rows  deleted_rows
----------  ------------  -----  ----------  ----------  ------------
FactTest               0      3  COMPRESSED     1048576             0
FactTest               1      3  COMPRESSED      428566             0
FactTest               2      3  COMPRESSED      495276             0
FactTest               3      3  COMPRESSED      497375             0

During the rebuild, SQL Server distributed the rows from the trickle insert across the other rowgroups. Let's now select the rows from the last trickle insert:

SELECT * 
FROM dbo.FactTest 
WHERE SaleKey >= 12000000 
ORDER BY SaleKey; 

Deleting from a clustered columnstore index

Let's test what happens when you delete rows from a CCI. Before executing the following DELETE command, check its estimated execution plan; that is, don't execute the command yet:

DELETE 
FROM dbo.FactTest 
WHERE SaleKey >= 12000000; 

The following screenshot shows the execution plan. Again, you can see that the DML operations use row mode processing operators only:


Figure 10-20: Estimated execution plan for a DELETE

After you have examined the estimated plan, execute the DELETE statement. And here is the final check of the state of the rowgroups:

SELECT OBJECT_NAME(object_id) AS table_name, 
 row_group_id, state, state_desc, 
 total_rows, deleted_rows 
FROM sys.dm_db_column_store_row_group_physical_stats 
WHERE object_id = OBJECT_ID(N'dbo.FactTest') 
ORDER BY row_group_id; 

Here is the result:

table_name  row_group_id  state  state_desc  total_rows  deleted_rows
----------  ------------  -----  ----------  ----------  ------------
FactTest               0      3  COMPRESSED     1048576             0
FactTest               1      3  COMPRESSED      428566             0
FactTest               2      3  COMPRESSED      495276             0
FactTest               3      3  COMPRESSED      497375         75993

You can see that one of the rowgroups has deleted rows. Although the total number of rows in this rowgroup did not change, you cannot access the deleted rows; the deleted bitmap, a B-tree structure maintained for this rowgroup, defines which rows are deleted. You can try to retrieve the deleted rows:

SELECT * 
FROM dbo.FactTest 
WHERE SaleKey >= 12000000 
ORDER BY SaleKey; 

This time, no rows are returned.
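
In a real system, deleted rows accumulate and waste space until you rebuild the index. You can quantify the impact per rowgroup before deciding on a rebuild; a minimal sketch:

-- Share of deleted rows per rowgroup; large shares suggest a rebuild
SELECT row_group_id, total_rows, deleted_rows,
 CAST(100.0 * deleted_rows / total_rows AS DECIMAL(5, 2)) AS deleted_pct
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID(N'dbo.FactTest')
ORDER BY deleted_pct DESC;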

Finally, let's clean up the WideWorldImportersDW database with the following code:

USE WideWorldImportersDW; 
GO 
DROP TABLE dbo.FactTest; 
GO 

Before finishing this chapter, let's look at the following table, which summarizes the space needed by the different variants of row and columnar storage:

Storage                          Rows       Reserved    Data        Index
-------------------------------  ---------  ----------  ----------  ----------
CI                                 227,981   49,672 KB   48,528 KB      200 KB
CI row compression                 227,981   25,864 KB   24,944 KB       80 KB
CI page compression                227,981   18,888 KB   18,048 KB       80 KB
CI (10 times more rows)          2,279,810  502,152 KB  498,528 KB    2,072 KB
CI with NCCI                     2,279,810  529,680 KB  498,528 KB   29,432 KB
CCI                              2,279,810   23,560 KB   23,392 KB       0 KB
CCI archive compression          2,279,810   19,528 KB   19,336 KB       0 KB
CCI archive compression and NCI  2,279,810   90,256 KB   19,344 KB   70,192 KB
