Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Bit-packed Integers

We have implemented the dictionary decoder for 2 values and 1 value. This time, we make it work with any number of unique values. This means the RLE Bit-packing Hybrid decoder must be able to decode data with arbitrary bit-width.

In this step, we handle the bit-packing case, the next step will be the RLE case.

Encode

Encoding is pretty similar to 1-bit width: encode the data and pack them into groups of bytes. One issue with larger bit-width is that the encoded bits might cross group boundaries. Let’s visualize it using the example from the spec with 3-bit width.

Bit-packed arbitrary encoding raw data example

Encoded data is packed into 8-bit groups using different colors.

Bit-packed arbitrary encoding animation

Decode

In 1-bit width, we fetch and decode an 8-bit group at a time. To handle the encoded bits crossing group boundaries in larger bit-width, we fetch groups into a buffer and decode it instead of the actual groups.

Let’s see an example. This is a packed data with 3-bit width.

Bit-packed arbitrary decoding start

At the beginning, the decoder fetches the first 8-bit group into the buffer and decodes the first 3 bits (LSB).

Bit-packed decoder decodes the first 3 bits

Then the next 3 bits.

Bit-packed decoder decodes the next 3 bits

Now, the buffer only has 2 bits (the value spans a group boundary), so the decoder fetches the next group from the packed encoded data.

Bit-packed decoder fetches the next group

And decodes the next 3 bits.

Bit-packed decoder decodes 3 bits after fetching group

This is the full animation:

Bit-packed arbitrary decoding animation

The algorithm can be optimized by fetching more than 8-bit at a time.

Task

Update the bit_packed_decode function in src/decoder/bit_packed.rs to handle arbitrary bit-width.

pub fn bit_packed_decode(
    encoded_data: Bytes,
    parquet_type: Type,
    bit_width: u8,
    num_values: usize,
) -> Result<Vec<Scalar>> {
    // ...
}

The data type for indexes should be Type::INT32, this must be passed by the caller.

Test

Test case for this step is step12_03_dictionary_decoder_bit_packed.

Hints and Solution

Hint (how to fetch groups)

Groups should be fetched to the left of the buffer, this can be done using bit-shift and bit-or, you also need to keep track of how many bits the buffer contains.

let group = encoded_data.get_u8() as u64;
buffer |= group << buffer_bits;
Hint (how to extract values out of the buffer)

A single value can be extracted using bit-and with the bit-width mask.

let mask = u64::MAX >> (64 - bit_width);
let value = buffer & mask;
Solution
pub fn bit_packed_decode(
    encoded_data: Bytes,
    parquet_type: Type,
    bit_width: u8,
    num_values: usize,
) -> Result<Vec<Scalar>> {
    let mut encoded_data = encoded_data;
    let mut scalars = Vec::with_capacity(num_values);

    let mask = u64::MAX >> (64 - bit_width);
    let mut buffer = 0;
    let mut buffer_bits = 0;

    while scalars.len() < num_values {
        // Buffer needs more bits
        while buffer_bits < bit_width {
            let group = encoded_data.get_u8() as u64;
            // put the group data to the left of the current buffer
            buffer |= group << buffer_bits;
            buffer_bits += 8;
        }

        let scalar = match parquet_type {
            Type::BOOLEAN => Scalar::from(buffer & 1 == 1),
            Type::INT32 => Scalar::from((buffer & mask) as i32),
            _ => unimplemented!("bit_packed_decode: unsupported type: {:?}", parquet_type),
        };
        scalars.push(scalar);

        buffer = buffer.checked_shr(bit_width as u32).unwrap_or(0);
        buffer_bits -= bit_width;
    }

    Ok(scalars)
}