Overview

This section describes a high-level overview of parquet and what we will do (roughly) to parse it.

Parquet File Format

Parquet is a columnar file format; unlike traditional row-based formats, parquet stores column data close together.

column data in parquet is grouped together

Along with the actual column data, a parquet file also contains metadata such as column position, column type, encoding, etc. These provide enough information to the parser for extracting all the column data. This means the parser must understand both the metadata and the column data to completely parse a parquet file.

metadata contains enough information to extract column data

Metadata

In parquet, metadata includes file metadata, column metadata, row group metadata, column chunk metadata, etc. There are many of them, and writing a parser to parse them is a daunting task. Luckily, the spec comes with a full Thrift definition for all the metadata, which can be used to generate all parsing code.

The starter code includes a ready-to-use function for this: read_thrift_metadata, which takes a Bytes, and returns a corresponding metadata with the remaining bytes, based on the template argument.

let (metadata, remaining) = read_thrift_metadata::<MetaData>(data);

Column data

The main focus in the book will be parsing column data. The flow can be simplified like this:

Read the metadata
Parse the column data
Merge all the column data together

Implementation

The book has several sections building on top of each other. Below is a rough guideline:

Parse the file metadata
Parse a single column with Plain Encoding
Parse many columns
Parse a complete file
Parse files with RLE Bit-packing Hybrid Encoding
Parse files with null values
Parse files with Dictionary Encoding
Parse files with compression

Keyboard shortcuts

Write yourself a parquet parser

Overview

Parquet File Format

Metadata

Column data

Implementation