Chapter 3 – Clever Features of Vertica

“Always remember that you are unique just like everyone else.”

- Anonymous

Super Projections

image

Vertica creates a default superprojection for each table in the database so that all SQL queries can be answered. A superprojection consists of all columns in the table and is created automatically when the data is first loaded or inserted. Notice that the superprojections above are either replicated across all nodes or segmented so that the rows are hashed to different nodes.
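
You can see every superprojection Vertica has created by querying the system catalog. Here is a minimal sketch (is_super_projection and is_segmented are columns of the v_catalog.projections system view; whether a projection is segmented tells you if its rows are hashed across nodes or replicated):

SELECT projection_schema, anchor_table_name, projection_name, is_segmented
FROM v_catalog.projections
WHERE is_super_projection = true
ORDER BY anchor_table_name;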

Vertica Projections

image

Vertica stores data physically in structures called projections, which behave much like views. Each projection contains a subset of the table's columns, and each subset can be sorted differently. Projections can even contain columns from multiple tables, like a materialized view. Every data element in a table appears in at least one projection. Tables occupy no physical storage! Only the projections are stored. This allows Vertica to group the columns used most often right next to each other in physical storage.
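
Because only projections occupy disk, storage is reported against projections rather than tables. A minimal sketch of how to see this, using the v_monitor.projection_storage system view:

SELECT anchor_table_name, projection_name, used_bytes
FROM v_monitor.projection_storage
ORDER BY used_bytes DESC;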

The Five Advantages of Projections

1. The Vertica query optimizer automatically picks the best projections to use for any query, so no user interaction is required.

2. Projections compress and encode data to greatly reduce the space required for storing data.

3. Vertica operates on the encoded data whenever it can in order to avoid the cost of decoding.

4. Because Vertica uses a combination of compression and encoding, it ensures the smallest possible disk footprint while still maximizing query performance.

5. Projections also provide high availability and recovery by duplicating table columns on at least K+1 nodes within the cluster. If a node fails, the database continues to operate by using the duplicate data on a buddy node.

Vertica projections store data in an encoded format designed for automatic performance tuning. Think of projections as similar to join indexes in Teradata or materialized views in Oracle. Projections are really result sets that are stored on disk. Instead of being computed each time, these results can simply be used by each query. Projection results are automatically refreshed whenever data values are inserted, deleted, updated, or copied.

Creating a Projection

image

Vertica projections store data in an encoded format designed for automatic performance tuning. Think of projections as similar to join indexes in Teradata or materialized views in Oracle. Projections are really result sets that are stored on disk. Above, we created a projection using all columns in the table, but we determined how we wanted the data sorted.
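
Below is a minimal sketch of such a CREATE PROJECTION statement. The table sales_table and its columns are hypothetical; the sort order is set with ORDER BY and the data distribution with SEGMENTED BY:

CREATE PROJECTION sales_proj_by_date
AS SELECT sale_id, customer_id, sale_date, sale_amount
FROM sales_table
ORDER BY sale_date, customer_id
SEGMENTED BY HASH(sale_id) ALL NODES;

If the table already contains data, the new projection must be refreshed (for example, SELECT REFRESH('sales_table');) before the optimizer will use it.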

Read-Optimized Store (ROS)/Write-Optimized Store (WOS)

image

Vertica caches all updates in a main-memory area called the Write-Optimized Store (WOS), which is itself queryable. The WOS places the data into projections in collection buckets that are uncompressed and unsorted but kept in update order. The Tuple Mover then periodically migrates the recent updates to permanent disk storage in the Read-Optimized Store (ROS). The data in the ROS is sorted, compressed, and packed into variable-length disk blocks.
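
If you do not want to wait for the Tuple Mover's scheduled interval, it can be invoked manually with Vertica's built-in DO_TM_TASK function. A minimal sketch ('moveout' migrates WOS data to the ROS):

SELECT DO_TM_TASK('moveout');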

Write-Optimized Store (WOS) is Memory Resident

image

Vertica's Write-Optimized Store (WOS) is always memory-resident, and it is a buffer for INSERT, UPDATE, DELETE, and COPY operations.

To support very fast data load speeds, the WOS stores records without data compression or indexing. A projection in the WOS is sorted only when it is queried, and it remains sorted as long as no further data is inserted into it. The WOS organizes data by epoch and holds both committed and uncommitted transaction data. Both the Read-Optimized Store (ROS) and the Write-Optimized Store (WOS) are arranged by projection. This technique allows for continuous loading throughout the day without having a major impact on read queries.
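
Small trickle loads go through the WOS by default, while very large loads can bypass it and write straight to the ROS by adding the DIRECT keyword to the COPY statement. A minimal sketch with a hypothetical table and file path:

COPY sales_table FROM '/data/sales.csv' DELIMITER ',' DIRECT;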

Updates are collected in Time-Based Buckets called Epochs

image

Vertica caches all updates in a main-memory area called the Write-Optimized Store (WOS), which is itself queryable. The WOS is designed so updates can be collected in time-based buckets. At fixed intervals, Vertica closes the current epoch and begins a new epoch. The non-current epochs are queryable and are marked for migration by the Tuple Mover to the permanent disk storage called the Read-Optimized Store (ROS). This design gives the majority of users, who only need to read data, an open gateway, while it also allows for near real-time data warehouses with high append data volumes.
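
You can see the epoch that is currently open, along with the last epoch fully written to disk, by querying the system view below (current_epoch and last_good_epoch are columns of v_monitor.system):

SELECT current_epoch, last_good_epoch FROM v_monitor.system;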

Vertica Does Not Support In-Place Updates

image

Vertica's Tuple Mover performs updates by deleting and re-inserting rows. Appended data is added to the end of a column-store block, while updated data in the middle of a block is deleted and re-inserted. Vertica does not support in-place updates.
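
The deleted row versions are tracked as delete vectors until they are purged. A minimal sketch of how to watch them accumulate, using the v_monitor.delete_vectors system view:

SELECT projection_name, deleted_row_count
FROM v_monitor.delete_vectors;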

K-Safety

image

The K in K-Safety indicates how many duplicate copies of the data are stored. In this example, K = 1. This example is not designed to represent mirroring, but in effect each node has a buddy node that keeps a backup copy of its data in case of a failure. Node 1 holds the backup for node 2, node 2 holds the backup for node 3, etc.

You can view a list of critical nodes in your database by running the query below from your Nexus Chameleon.

SELECT * FROM v_monitor.critical_nodes;

Any of the nodes in the cluster example above could fail, and the database would still be able to continue performing. The performance would be lower, because one node would have to handle its own workload and the workload of the failed node.
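
The K-safety level of a database design is declared with a built-in function. A minimal sketch for the K = 1 example above:

SELECT MARK_DESIGN_KSAFE(1);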

K-Safety of 2

image

To see the K-Safety numbers just run the query below from your Nexus Chameleon.

SELECT current_fault_tolerance FROM system;

Any two nodes in the cluster example above could fail, and the database would still be able to continue performing. Each node is a buddy to the nodes before and after it.

The Five Data Isolation Modes

1. Snapshot – Queries and updates do not interfere with each other, so read-only queries do not require locking.

2. Serializable – Transactions run in serial order. Locks are acquired for both read and write operations, which ensures that any successive SELECT commands within a single transaction always produce the same results.

3. Repeatable Read – Auto-converts to SERIALIZABLE.

4. Read Committed – A SELECT query sees a snapshot of the committed data at the start of the transaction, plus the results of any updates run within its own transaction, even if they have not been committed.

5. Read Uncommitted (Read Without Integrity)

By default, Vertica uses the READ COMMITTED isolation level.

Vertica supports all of these types of database isolation. Database isolation refers to how the concurrent users of data affect each other as they read and change data in the database. The key question comes down to integrity of data vs. concurrency. Although the optimizer understands all five isolation levels listed above, internally Vertica uses only two isolation levels: Read Committed and Serializable. So you may not get the exact isolation level you request. Vertica automatically translates Read Uncommitted to Read Committed and Repeatable Read to Serializable.
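
To change the isolation level of your session from the READ COMMITTED default, you can run a statement like the sketch below:

SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL SERIALIZABLE;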

Import/Export between Multiple Vertica Systems

image

Entire databases, or certain portions of databases, can be moved from one Vertica system to another by using a simple SQL statement. Notice that the instances do not need to be the same size or have the same storage requirements. The data is automatically re-segmented to match the new configuration, and projections are re-sorted based on the queries being run.
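
Here is a minimal sketch of copying a single table to another Vertica system. The database name target_db, the host, and the table names are hypothetical:

CONNECT TO VERTICA target_db USER dbadmin PASSWORD 'secret' ON 'target_host', 5433;
EXPORT TO VERTICA target_db.public.sales_table FROM public.sales_table;
DISCONNECT target_db;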

Roles

image

Roles simplify database administration by assigning access rights to tables and other objects so that groups of people with similar job functions (or roles) can access those objects. It is as simple as creating different roles for different job functions and responsibilities, granting specific privileges (access rights) on database objects to those roles, and then granting a role or roles to the users who share the same privileges. Vertica database security supports roles conforming to the SQL:2008 specification. This type of security is essential for managing data access across large organizations.
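
A minimal sketch of that pattern, with hypothetical role, table, and user names:

CREATE ROLE analyst;
GRANT SELECT ON TABLE public.sales_table TO analyst;
GRANT analyst TO mary;
-- Mary then enables the role in her session (or it can be made a default role)
SET ROLE analyst;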

Compression

Vertica compresses data in order to save space. Here are the facts:

Vertica can utilize over twelve different compression options.

The compression depends on the data.

Vertica will choose which compression option to apply.

NULLs take up no space on Vertica because they are compressed.

Vertica will compress data by about 70% on average.

HP Vertica queries data in encoded form.

When similar data is grouped, you have even more options.

One of the key advantages of columnar storage is the ability to compress column data. When column stores are compressed, they can store more data, provide more projections, and use less hardware. This can allow up to 50% more historical data to be stored and queried. The following pages show just some of the compression techniques utilized. Encoding is the process of converting data into a standard format. Encoded data can be processed directly, but compressed data cannot. Vertica operates on encoded data when it can to avoid the heavy costs of decoding.
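
You can see which encoding Vertica chose for each column by querying the catalog (encoding_type is a column of the v_catalog.projection_columns system view):

SELECT table_name, table_column_name, encoding_type
FROM v_catalog.projection_columns;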

Runlength Encoding

image

Runlength encoding replaces a value that is repeated consecutively with a token that consists of the value and a count of the number of consecutive occurrences (the length of the run). This is where the name Runlength comes from. This encoding is best suited to columns in which data values are often repeated consecutively, for example, when the table is sorted by those values.
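
Because runlength encoding pays off most when a low-cardinality column is sorted, a projection will often sort on the RLE column. A minimal sketch with hypothetical table and column names (ENCODING RLE is the Vertica keyword for this technique):

CREATE PROJECTION employee_by_state
( emp_state ENCODING RLE, emp_id, emp_name )
AS SELECT emp_state, emp_id, emp_name
FROM employee_table
ORDER BY emp_state;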

LZO Encoding

image

Designed to work best with CHAR and VARCHAR data that store long character strings

Is a portable lossless data compression library written in ANSI C

Offers fast compression and extremely fast decompression

Includes slower compression levels that achieve a quite competitive compression ratio while still decompressing at this very high speed

Often implemented with a tool called LZOP

Lempel–Ziv–Oberhumer (LZO) is a lossless data compression algorithm that is focused on decompression speed. LZO encoding provides a high compression ratio with good performance. LZO encoding is designed to work well with character data. It is especially good for CHAR and VARCHAR columns that store very long character strings, especially free form text such as product descriptions, user comments or JSON strings.

Delta Encoding

image

The Delta encodings are very useful for date and time columns. Delta encoding compresses data by recording the difference between values that follow each other in the column. These differences are recorded in a separate dictionary for each block of column values on disk. If the column contains 10 integers in sequence from 1 to 10, the first will be stored as a 4-byte integer (plus a 1-byte flag), and the next 9 will each be stored as a byte with the value 1, indicating that it is one greater than the previous value. Delta encoding comes in two variations. DELTA records the differences as 1-byte values (8-bit integers), and DELTA32K records differences as 2-byte values (16-bit integers).
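
In Vertica DDL, a delta-style encoding is requested per column with the ENCODING clause. A minimal sketch on a hypothetical table (DELTAVAL is one of Vertica's delta-family encoding types and suits integer and date/time columns):

CREATE TABLE web_hits
( hit_id INT ENCODING DELTAVAL,
  hit_time TIMESTAMP ENCODING DELTAVAL,
  page_url VARCHAR(500) );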

Block Based Dictionary Encoding for Character Data

image

Block Based Dictionary Encoding utilizes a separate dictionary of unique values for each block of column values on disk. Remember, each Vertica disk block occupies 1 MB. The dictionary contains up to 256 unique values, and one-byte indexes into the dictionary are stored in place of the original data values. If more than 256 values are stored in a single block, the extra values are written into the block in raw, uncompressed form. The process repeats for each disk block. This encoding is very effective when a column contains a limited number of unique values, and it is especially optimal when there are fewer than 256 unique values.
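
Vertica's keyword for this technique is BLOCK_DICT. A minimal sketch on a hypothetical low-cardinality column:

CREATE TABLE customer_table
( customer_id INT,
  customer_state CHAR(2) ENCODING BLOCK_DICT,
  customer_name VARCHAR(100) );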
