How it works...

Let's spend some time understanding the Parquet format at a deeper level. The following is some sample data represented in a table format:

First_Name   Last_Name   Age
Barack       Obama       55
George       Bush        70
Bill         Clinton     70

In the row format, the data will be stored like this:

Barack Obama 55 George Bush 70 Bill Clinton 70

In the columnar layout, the data will be stored like this:

Barack George Bill Obama Bush Clinton 55 70 70
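To make the difference concrete, here is a minimal Python sketch that flattens the sample table both ways. The function names row_layout and columnar_layout are made up for this illustration; the sketch only shows the ordering of values, not how Parquet actually writes bytes.

```python
# Sample records from the table above: (first name, last name, age).
rows = [
    ("Barack", "Obama", 55),
    ("George", "Bush", 70),
    ("Bill", "Clinton", 70),
]

def row_layout(records):
    """Flatten record by record: all fields of row 1, then row 2, and so on."""
    return [value for record in records for value in record]

def columnar_layout(records):
    """Flatten column by column: all first names, then last names, then ages."""
    # zip(*records) transposes the list of rows into a sequence of columns.
    return [value for column in zip(*records) for value in column]

print(row_layout(rows))
print(columnar_layout(rows))
```

In the columnar result, values of the same type and meaning sit next to each other, which is what makes the encoding and compression described later in this section effective.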

The following diagram illustrates how the data is laid out in a file:

Here's a brief description of the different parts:

  • Row group: A horizontal partition of the data into a set of rows. A row group consists of column chunks.
  • Column chunk: A column chunk has data for a given column in a row group. A column chunk is always physically contiguous. A row group has only one column chunk per column.
  • Page header: The pages in a column chunk share a common header, which lets readers skip over pages they do not need.
  • Page: A column chunk is divided into pages. A page is a unit of storage and cannot be further divided. Pages are written back to back in a column chunk. The data on a page can be compressed.
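The hierarchy described above can be sketched as a toy model in Python. The class names, the build_row_groups function, and the rows_per_group and values_per_page parameters are all assumptions made for this illustration; a real Parquet writer sizes row groups and pages by bytes and also writes headers and metadata, which this sketch leaves out.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class ColumnChunk:
    """Data for one column within one row group, split into pages."""
    name: str
    pages: List[List[Any]] = field(default_factory=list)

@dataclass
class RowGroup:
    """A horizontal slice of the table: exactly one column chunk per column."""
    chunks: Dict[str, ColumnChunk] = field(default_factory=dict)

def build_row_groups(columns, rows_per_group, values_per_page):
    """Split column-oriented data into row groups, column chunks, and pages.

    rows_per_group and values_per_page are hypothetical knobs used only
    to keep this toy model small.
    """
    num_rows = len(next(iter(columns.values())))
    groups = []
    for start in range(0, num_rows, rows_per_group):
        group = RowGroup()
        for name, values in columns.items():
            chunk_values = values[start:start + rows_per_group]
            # Pages are laid out back to back inside the column chunk.
            pages = [chunk_values[i:i + values_per_page]
                     for i in range(0, len(chunk_values), values_per_page)]
            group.chunks[name] = ColumnChunk(name, pages)
        groups.append(group)
    return groups

columns = {
    "First_Name": ["Barack", "George", "Bill"],
    "Last_Name": ["Obama", "Bush", "Clinton"],
    "Age": [55, 70, 70],
}
groups = build_row_groups(columns, rows_per_group=2, values_per_page=1)
for number, group in enumerate(groups):
    for chunk in group.chunks.values():
        print(number, chunk.name, chunk.pages)
```

With three rows and two rows per group, the sketch produces two row groups, each holding one column chunk per column.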

Null values are not written to the data itself. Parquet records which values are null separately, in the definition levels, so a column chunk that contains only nulls stores no values at all.
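Here is a simplified Python sketch of how nulls can be left out of the stored data for a flat, optional column. It assumes a definition level of 1 for a present value and 0 for a null; the function names are hypothetical, and real Parquet additionally run-length encodes the levels and handles nested columns with deeper levels.

```python
def encode_nullable(values):
    """Return (definition_levels, data) for a flat optional column.

    Nulls appear only as a 0 in the definition levels; they take no
    slot in the data list.
    """
    definition_levels = [0 if v is None else 1 for v in values]
    data = [v for v in values if v is not None]
    return definition_levels, data

def decode_nullable(definition_levels, data):
    """Rebuild the original column, reinserting None where the level is 0."""
    remaining = iter(data)
    return [next(remaining) if level == 1 else None
            for level in definition_levels]

ages = [55, None, 70]
levels, data = encode_nullable(ages)
print(levels, data)
print(decode_nullable(levels, data))
```

For a column that is entirely null, the data list comes back empty and only the definition levels remain.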

Data in Parquet is not only compressed but also encoded. This leads to a significant reduction in disk footprint, network I/O footprint, and memory footprint. There is a slight increase in CPU cycles as data needs to be compressed/decompressed and encoded/decoded. 
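The difference between encoding and compression can be illustrated with a small Python sketch. Dictionary encoding and run-length encoding are two of the encodings Parquet uses, but the functions below are toy versions written for this example, and zlib stands in for whichever compression codec (such as Snappy or gzip) a real writer would be configured with.

```python
import zlib

def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary, index_of, indices = [], {}, []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]

ages = [55, 70, 70]
dictionary, indices = dictionary_encode(ages)
print(dictionary, indices)
print(run_length_encode(indices))

# Compression works on bytes and is applied in addition to encoding.
# The column is repeated here only to get enough data to compress.
raw = ",".join(str(age) for age in ages * 1000).encode("utf-8")
compressed = zlib.compress(raw)
print(len(raw), len(compressed))
```

Reading the data back means reversing both steps (decompress, then decode), which is where the extra CPU cycles go.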
