How it works...

Let's spend some time understanding the Parquet format at a deeper level. The following is some sample data represented in a table format:

First_Name	Last_Name	Age
Barack	Obama	55
George	Bush	70
Bill	Clinton	70

In the row format, the data will be stored like this:

Barack	Obama	55	George	Bush	70	Bill	Clinton	70

In the columnar layout, the data will be stored like this:

Barack	George	Bill	Obama	Bush	Clinton	55	70	70
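The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration of the ordering of values, not of Parquet's actual on-disk bytes; the function names `row_layout` and `columnar_layout` are made up for this example.

```python
# Sample table from the text: (First_Name, Last_Name, Age)
rows = [
    ("Barack", "Obama", 55),
    ("George", "Bush", 70),
    ("Bill", "Clinton", 70),
]


def row_layout(rows):
    """Flatten row by row: all fields of one record sit next to each other."""
    return [value for row in rows for value in row]


def columnar_layout(rows):
    """Flatten column by column: all values of one column sit next to each other."""
    # zip(*rows) transposes the table, turning rows into columns.
    return [value for column in zip(*rows) for value in column]


row_major = row_layout(rows)
column_major = columnar_layout(rows)

print(row_major)
print(column_major)
```

In the columnar result, values of the same type and similar content end up adjacent, which is what makes per-column encoding and compression effective.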

The following diagram illustrates how the data is laid out in a file:

Here's a brief description of the different parts:

Row group: A row group is a horizontal partition of the data into a set of rows. A row group consists of one column chunk for each column.
Column chunk: A column chunk has data for a given column in a row group. A column chunk is always physically contiguous. A row group has only one column chunk per column.
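Page header: Each page in a column chunk begins with a page header. All pages share a common header structure, which lets readers skip over pages they do not need.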
Page: A column chunk is divided into pages. A page is a unit of storage and cannot be further divided. Pages are written back to back in a column chunk. The data on a page can be compressed.
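The nesting of row groups, column chunks, and pages can be modeled roughly as follows. This is a simplified sketch: the helper `to_row_groups` and its parameters are hypothetical, and it splits by row and value counts, whereas real Parquet writers size row groups and pages by bytes and also write headers and metadata.

```python
def to_row_groups(columns, rows_per_group, values_per_page):
    """Split a table into row groups -> column chunks -> pages.

    columns: dict mapping column name to a list of values (all the same length).
    Returns a list of row groups; each row group maps a column name to its
    column chunk, represented here as a list of pages (lists of values).
    """
    num_rows = len(next(iter(columns.values())))
    row_groups = []
    for start in range(0, num_rows, rows_per_group):
        group = {}
        for name, values in columns.items():
            # One column chunk per column per row group.
            chunk = values[start:start + rows_per_group]
            # Pages are written back to back inside the chunk.
            group[name] = [
                chunk[i:i + values_per_page]
                for i in range(0, len(chunk), values_per_page)
            ]
        row_groups.append(group)
    return row_groups


table = {
    "First_Name": ["Barack", "George", "Bill"],
    "Last_Name": ["Obama", "Bush", "Clinton"],
    "Age": [55, 70, 70],
}

layout = to_row_groups(table, rows_per_group=2, values_per_page=1)
for number, group in enumerate(layout):
    print("row group", number, group)
```

With these tiny, artificial sizes the three sample rows produce two row groups, each holding exactly one column chunk per column.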

Null values are not stored in the data. Nullity is instead recorded in the definition levels, which are run-length encoded, so a column chunk that consists mostly of nulls takes up very little space.
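One way to picture how nulls can be left out of the stored data is to keep a separate list of flags that records which positions hold a value; Parquet calls these definition levels. The sketch below assumes a flat, optional column, where a level of 1 means "value present" and 0 means "null". The function names are hypothetical, and nested schemas, which use higher levels, are not covered.

```python
def encode_optional_column(values):
    """Separate a column into definition levels and the non-null values."""
    definition_levels = [0 if value is None else 1 for value in values]
    non_null_values = [value for value in values if value is not None]
    return definition_levels, non_null_values


def decode_optional_column(definition_levels, non_null_values):
    """Rebuild the original column, reinserting None wherever the level is 0."""
    remaining = iter(non_null_values)
    return [next(remaining) if level == 1 else None for level in definition_levels]


ages = [55, None, 70, None, None]
levels, stored = encode_optional_column(ages)

print(levels)  # which positions hold a value
print(stored)  # only the values that are actually written
print(decode_optional_column(levels, stored) == ages)
```

Only two values are kept for the five-row column, and the list of levels, being a sequence of small repeated integers, is itself a good candidate for run-length encoding.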

Data in Parquet is encoded as well as compressed, which significantly reduces disk usage, network I/O, and memory footprint. The cost is a slight increase in CPU cycles, because data must be encoded and compressed on write and decompressed and decoded on read.
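To give a feel for what encoding means here, the following sketch applies two techniques that Parquet is known to use in some form, dictionary encoding and run-length encoding, to the Age column from the sample table. It is a toy illustration with hypothetical function names; the real format uses a hybrid of run-length encoding and bit-packing and a binary layout that this code does not attempt to reproduce.

```python
def dictionary_encode(values):
    """Replace each value with an index into a dictionary of distinct values."""
    dictionary = []
    index_of = {}
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


ages = [55, 70, 70]
dictionary, indices = dictionary_encode(ages)
runs = run_length_encode(indices)

print(dictionary)
print(indices)
print(runs)
```

Because a columnar layout places all the ages together, repeated values form runs that collapse well. The same values scattered between names in a row layout would not.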
