Row Group

From the Understand File Format section, we know that a Parquet file has multiple row groups, and each row group has multiple column chunks. In this step, we will read all of them.

Figure: Row groups and column chunks data.

In the metadata spec, the same relationship looks like this:

Figure: Metadata relationship from the spec. The file metadata contains multiple row groups, and each row group contains multiple columns.

As you might expect, we use a Polars DataFrame to represent the data of both a single row group and the whole Parquet file.
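The nesting described above can be sketched with plain Rust types. This is only an illustration: `FileModel`, `RowGroupModel`, and `ColumnChunkModel` are hypothetical, heavily simplified stand-ins for the real `FileMetaData`, `RowGroup`, and column chunk metadata, which carry many more fields.

```rust
// Hypothetical, simplified stand-ins for the Parquet metadata types.
// They exist only to show the nesting: file -> row groups -> column chunks.
struct ColumnChunkModel {
    name: String,
}

struct RowGroupModel {
    num_rows: usize,
    columns: Vec<ColumnChunkModel>,
}

struct FileModel {
    row_groups: Vec<RowGroupModel>,
}

/// List every column chunk in read order: row group by row group,
/// then column by column within each row group.
fn chunk_names(file: &FileModel) -> Vec<(usize, &str)> {
    let mut visited = Vec::new();
    for (group_index, row_group) in file.row_groups.iter().enumerate() {
        for chunk in &row_group.columns {
            visited.push((group_index, chunk.name.as_str()));
        }
    }
    visited
}

/// The file's total row count is the sum of the row counts of its row groups.
fn total_rows(file: &FileModel) -> usize {
    file.row_groups.iter().map(|group| group.num_rows).sum()
}
```

Reading a file follows the same two loops: the outer loop produces one DataFrame per row group, and the inner loop produces one column per column chunk.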

Task

read_row_group

Implement the read_row_group function in src/row_group.rs. It takes the entire file data as Bytes, along with the metadata of one row group, and returns a DataFrame.

pub fn read_row_group(data: Bytes, row_group: &RowGroup) -> Result<DataFrame> {
    todo!("step07: implement read row group")
}

You can use DataFrame::new_infer_height to group multiple columns together into a single DataFrame.
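To make "grouping columns" concrete, here is a std-only sketch of the idea. It does not use Polars and is not a description of how `DataFrame::new_infer_height` is implemented. `ColumnModel`, `FrameModel`, and `frame_from_columns` are hypothetical names, and the rule that the height comes from the first column and every other column must match it is an assumption of this model.

```rust
/// Hypothetical stand-in for a decoded column: a name plus its values.
#[derive(Debug, Clone, PartialEq)]
struct ColumnModel {
    name: String,
    values: Vec<i64>,
}

/// Hypothetical stand-in for a data frame: equally long columns plus a height.
#[derive(Debug, PartialEq)]
struct FrameModel {
    height: usize,
    columns: Vec<ColumnModel>,
}

/// Group columns into one frame. The height is taken from the first column
/// (0 if there are no columns), and any column of a different length is rejected.
fn frame_from_columns(columns: Vec<ColumnModel>) -> Result<FrameModel, String> {
    let height = columns.first().map_or(0, |column| column.values.len());
    for column in &columns {
        if column.values.len() != height {
            return Err(format!(
                "column `{}` has {} values, expected {}",
                column.name,
                column.values.len(),
                height
            ));
        }
    }
    Ok(FrameModel { height, columns })
}
```

In a well-formed row group, every column chunk decodes to the same number of rows, so the columns returned by read_column should line up in this way.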

read_row_groups

Implement the read_row_groups function in src/row_group.rs. It takes the entire file data as Bytes, along with the file metadata, and returns a single DataFrame containing the rows of every row group.

pub fn read_row_groups(data: Bytes, file_metadata: &FileMetaData) -> Result<DataFrame> {
    todo!("step07: implement read row groups")
}

You can use concat to concatenate the DataFrames of all row groups into a single DataFrame.
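As a rough mental model of vertical concatenation, the std-only sketch below stacks the rows of several frames in order. It does not use Polars. `TableModel` and `concat_tables` are hypothetical names, and the requirement that all frames share the same column names in the same order is this sketch's assumption about what a strict concatenation implies.

```rust
/// Hypothetical stand-in for a data frame: an ordered list of named columns.
type TableModel = Vec<(String, Vec<i64>)>;

/// Stack tables vertically, in the order given. Every table must have the
/// same column names in the same order; otherwise an error is returned.
fn concat_tables(tables: Vec<TableModel>) -> Result<TableModel, String> {
    let mut tables = tables.into_iter();
    // The first table defines the schema and seeds the result.
    let mut result = match tables.next() {
        Some(first) => first,
        None => return Err("no tables to concatenate".to_string()),
    };
    for table in tables {
        if table.len() != result.len() {
            return Err("schema mismatch: different number of columns".to_string());
        }
        for ((name, values), (other_name, other_values)) in result.iter_mut().zip(table) {
            if *name != other_name {
                return Err(format!("schema mismatch: `{}` vs `{}`", name, other_name));
            }
            // Append this table's rows below the rows collected so far.
            values.extend(other_values);
        }
    }
    Ok(result)
}
```

Because the row groups of one file share the file's schema, the per-group frames are expected to stack without a mismatch.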

Test

The test case for this step is step07_row_group.

Hints and Solution

Hint (How to concatenate multiple data frames)

Convert the DataFrame into a LazyFrame, then use the concat function.

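// convert each `DataFrame` into a `LazyFrame`
// (`lazy()` comes from the `IntoLazy` trait in `polars::prelude`)
let lazyframes: Vec<LazyFrame> = dataframes.into_iter().map(|df| df.lazy()).collect();

// concatenate the `LazyFrame`s and collect them into a single `DataFrame`
let df = concat(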
    lazyframes,
    UnionArgs {
        strict: true,
        ..Default::default()
    },
)?
.collect()?;

Solution

read_row_group:

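pub fn read_row_group(data: Bytes, row_group: &RowGroup) -> Result<DataFrame> {
    // Decode every column chunk of this row group into one column.
    let mut columns = Vec::with_capacity(row_group.columns.len());
    for column_chunk in &row_group.columns {
        // Cloning `Bytes` is cheap: it shares the underlying buffer instead of copying it.
        let column = read_column(data.clone(), column_chunk)?;
        columns.push(column);
    }
    // Group the columns into a single DataFrame.
    let df = DataFrame::new_infer_height(columns)?;
    Ok(df)
}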

read_row_groups:

pub fn read_row_groups(data: Bytes, file_metadata: &FileMetaData) -> Result<DataFrame> {
    // Read each row group into its own DataFrame, kept as a LazyFrame for `concat`.
    let mut dfs = Vec::with_capacity(file_metadata.row_groups.len());
    for row_group in &file_metadata.row_groups {
        let df = read_row_group(data.clone(), row_group)?;
        dfs.push(df.lazy());
    }
    // Concatenate all row groups, in file order, into a single DataFrame.
    let df = concat(
        dfs,
        UnionArgs {
            strict: true,
            ..Default::default()
        },
    )?
    .collect()?;
    Ok(df)
}