Let's spend some time understanding the Parquet format at a deeper level. The following is some sample data represented in a table format:
First_Name | Last_Name | Age |
Barack | Obama | 55 |
George | Bush | 70 |
Bill | Clinton | 70 |
In the row format, the data will be stored like this:
Barack | Obama | 55 | George | Bush | 70 | Bill | Clinton | 70 |
In the columnar layout, the data will be stored like this:
Barack | George | Bill | Obama | Bush | Clinton | 55 | 70 | 70 |
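The two orderings can be reproduced with a few lines of Python. This is only a sketch of the ordering of values, not of how any real file format serializes them; the function names `row_layout` and `columnar_layout` are made up for illustration.

```python
# Sample rows from the table above: (First_Name, Last_Name, Age).
rows = [
    ("Barack", "Obama", 55),
    ("George", "Bush", 70),
    ("Bill", "Clinton", 70),
]

def row_layout(rows):
    """Store one complete row after another."""
    return [value for row in rows for value in row]

def columnar_layout(rows):
    """Store all values of the first column, then the second, and so on."""
    return [value for column in zip(*rows) for value in column]

print(" | ".join(str(v) for v in row_layout(rows)))
print(" | ".join(str(v) for v in columnar_layout(rows)))
```

In the columnar ordering, values of the same type and meaning sit next to each other, which is what makes the compression and encoding discussed later effective.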
The following diagram illustrates how the data is laid out in a file:
Here's a brief description of the different parts:
- Row group: A horizontal partition of the data into a set of rows. A row group consists of column chunks.
- Column chunk: A column chunk has data for a given column in a row group. A column chunk is always physically contiguous. A row group has only one column chunk per column.
- Page header: The pages in a column chunk share a common header format, which lets readers skip over pages they are not interested in.
- Page: A column chunk is divided into pages. A page is a unit of storage and cannot be further divided. Pages are written back to back in a column chunk. The data on a page can be compressed.
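The nesting of row groups, column chunks, and pages can be sketched as a toy model in Python. This is an illustration of the hierarchy only: real Parquet writers size row groups and pages by bytes rather than by counts, and the names `layout`, `rows_per_group`, and `values_per_page` are hypothetical.

```python
def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def layout(rows, column_names, rows_per_group, values_per_page):
    """Toy model: row groups -> one column chunk per column -> pages."""
    row_groups = []
    for group_rows in chunk(rows, rows_per_group):
        column_chunks = {}
        for index, name in enumerate(column_names):
            # All values of one column within this row group.
            values = [row[index] for row in group_rows]
            # A column chunk is a sequence of pages written back to back.
            column_chunks[name] = chunk(values, values_per_page)
        row_groups.append(column_chunks)
    return row_groups

rows = [
    ("Barack", "Obama", 55),
    ("George", "Bush", 70),
    ("Bill", "Clinton", 70),
]
groups = layout(rows, ["First_Name", "Last_Name", "Age"],
                rows_per_group=2, values_per_page=1)

for number, column_chunks in enumerate(groups):
    for name, pages in column_chunks.items():
        print(f"row group {number}, column chunk {name}: pages {pages}")
```

With two rows per group, the three sample rows fall into two row groups, and each row group holds exactly one column chunk per column.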
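Null values are not stored in the column data. Instead, whether a value is present is recorded separately in the definition levels, which are run-length encoded, so a column chunk that contains only nulls takes up very little space.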
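One way to picture how nulls can be left out of the data is to separate "is a value present?" from the values themselves. The sketch below does this for a flat, optional column, where a level of 1 means a value is present and 0 means null. It is a simplification (nested columns need more levels), and the function names `split_nulls` and `restore_nulls` are made up for illustration.

```python
def split_nulls(values):
    """Return (levels, data): a presence flag per row plus the non-null values."""
    levels = [0 if value is None else 1 for value in values]
    data = [value for value in values if value is not None]
    return levels, data

def restore_nulls(levels, data):
    """Rebuild the original column from the presence flags and stored values."""
    remaining = iter(data)
    return [next(remaining) if level == 1 else None for level in levels]

ages = [55, None, 70]
levels, data = split_nulls(ages)
print(levels, data)
print(restore_nulls(levels, data))
```

Only the two non-null ages are kept as data; the null costs a single flag.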
Data in Parquet is not only compressed but also encoded. This leads to a significant reduction in disk footprint, network I/O footprint, and memory footprint. There is a slight increase in CPU cycles as data needs to be compressed/decompressed and encoded/decoded.
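To give a feel for what encoding means here, the following sketch applies two techniques of the kind Parquet uses, dictionary encoding and run-length encoding, to the Age column. It is a simplified illustration in plain Python, not Parquet's actual on-disk format, and the function names are hypothetical.

```python
def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []
    positions = {}
    indices = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices

def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

ages = [55, 70, 70]
dictionary, indices = dictionary_encode(ages)
runs = run_length_encode(indices)
print(dictionary, indices, runs)
```

Columns with few distinct values or long runs of repeats shrink the most, which is why the columnar layout and encoding work well together.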