Bonus: Interactive Testing
This section introduces an interactive way to test the parser using the CLI.
CLI
The starter code comes with a CLI that has several useful commands:
Usage: parquet-parser <COMMAND>
Commands:
  read      Read parquet file
  write     Write a csv file to a parquet file
  metadata  Print the metadata for a parquet file
  verify    Verify the current parser output with the official parquet parser
  help      Print this message or the help of the given subcommand(s)
Options:
  -h, --help  Print help
At the moment, the parser can only read parquet files with plain encoding, no compression, etc.;
such files are rare in the wild. The write command is a convenient way to convert a CSV file into parquet
(with the default arguments, it creates a parquet file with plain encoding and no compression).
Write a csv file to a parquet file
Usage: parquet-parser write [OPTIONS] <CSV> <PARQUET>
Arguments:
  <CSV>      The input csv file
  <PARQUET>  The output parquet file
Options:
      --author <AUTHOR>
          The author [default: "Hello parquet!"]
      --dictionary
          Whether to enable dictionary encoding
      --encodings <ENCODINGS>
          Encoding for each column. Syntax: `--encodings <column_name>=<encoding>`. Supported encodings: [rle]
      --compression <COMPRESSION>
          Compression for the output parquet [default: uncompressed] [possible values: uncompressed, snappy]
      --rows-per-page <ROWS_PER_PAGE>
          The number of row per page in a column chunk
      --rows-per-group <ROWS_PER_GROUP>
          The number of row per groups in a row group
      --dtypes <DTYPES>
          Data type for each column. Syntax: `--dtypes <column_name>=<data_type>`. Supported data types: [boolean, int32, int64, float, double, string]
  -h, --help
          Print help
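To make the `--dtypes <column_name>=<data_type>` syntax concrete, here is a minimal sketch of how one such argument could be split and mapped to a Parquet physical type. It is not the starter code's implementation: `PhysicalType` and `parse_dtype_arg` are hypothetical names, and the mapping (for example `string` to BYTE_ARRAY) is an assumption based on the Parquet physical types.

```rust
/// Parquet physical types that the supported CLI data types are assumed to map to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PhysicalType {
    Boolean,
    Int32,
    Int64,
    Float,
    Double,
    ByteArray,
}

/// Parse one `<column_name>=<data_type>` pair (hypothetical helper).
fn parse_dtype_arg(arg: &str) -> Result<(String, PhysicalType), String> {
    // Split on the first '=' only; everything before it is the column name.
    let (name, dtype) = arg
        .split_once('=')
        .ok_or_else(|| format!("expected <column_name>=<data_type>, got `{}`", arg))?;
    if name.is_empty() {
        return Err(format!("missing column name in `{}`", arg));
    }
    let physical = match dtype {
        "boolean" => PhysicalType::Boolean,
        "int32" => PhysicalType::Int32,
        "int64" => PhysicalType::Int64,
        "float" => PhysicalType::Float,
        "double" => PhysicalType::Double,
        "string" => PhysicalType::ByteArray,
        other => return Err(format!("unsupported data type `{}`", other)),
    };
    Ok((name.to_string(), physical))
}
```

Under these assumptions, `parse_dtype_arg("ip_address_mask=int32")` yields the column name together with `PhysicalType::Int32`, while a pair without `=` or with an unlisted type is rejected with an error message.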
Try it out
Let’s convert this csv file to parquet, and read it using our parser.
# download csv file
curl -L -o public-cloud-provider-ip-ranges.csv https://raw.githubusercontent.com/tobilg/public-cloud-provider-ip-ranges/bda4bc1ac501f8bab9cd618b47eb336328e732cc/data/providers/all.csv
# convert to parquet
cargo run write public-cloud-provider-ip-ranges.csv public-cloud-provider-ip-ranges.parquet
# read it
cargo run read public-cloud-provider-ip-ranges.parquet
Result
This is the result:
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.34s
Running `target/debug/parquet-parser read public-cloud-provider-ip-ranges.parquet`
shape: (62_689, 6)
┌────────────────┬─────────────────┬──────────────┬─────────────────┬────────────────┬────────────────┐
│ cloud_provider ┆ cidr_block      ┆ ip_address   ┆ ip_address_mask ┆ ip_address_cnt ┆ region         │
│ ---            ┆ ---             ┆ ---          ┆ ---             ┆ ---            ┆ ---            │
│ str            ┆ str             ┆ str          ┆ i64             ┆ i64            ┆ str            │
│ AWS            ┆ 1.178.1.0/24    ┆ 1.178.1.0    ┆ 24              ┆ 256            ┆ us-west-2      │
│ AWS            ┆ 1.178.10.0/24   ┆ 1.178.10.0   ┆ 24              ┆ 256            ┆ eu-central-1   │
│ AWS            ┆ 1.178.100.0/24  ┆ 1.178.100.0  ┆ 24              ┆ 256            ┆ us-west-1      │
│ AWS            ┆ 1.178.101.0/24  ┆ 1.178.101.0  ┆ 24              ┆ 256            ┆ ap-northeast-3 │
│ AWS            ┆ 1.178.102.0/24  ┆ 1.178.102.0  ┆ 24              ┆ 256            ┆ ap-southeast-5 │
│ …              ┆ …               ┆ …            ┆ …               ┆ …              ┆ …              │
│ Vultr          ┆ 95.179.208.0/20 ┆ 95.179.208.0 ┆ 20              ┆ 4096           ┆ FR-93          │
│ Vultr          ┆ 95.179.224.0/20 ┆ 95.179.224.0 ┆ 20              ┆ 4096           ┆ GB-LND         │
│ Vultr          ┆ 95.179.240.0/20 ┆ 95.179.240.0 ┆ 20              ┆ 4096           ┆ DE-HE          │
│ Vultr          ┆ 96.30.192.0/20  ┆ 96.30.192.0  ┆ 20              ┆ 4096           ┆ US-GA          │
│ Vultr          ┆ 96.30.208.0/20  ┆ 96.30.208.0  ┆ 20              ┆ 4096           ┆ US-FL          │
└────────────────┴─────────────────┴──────────────┴─────────────────┴────────────────┴────────────────┘
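The `str` and `i64` columns in the table above are both stored with plain encoding. As a rough sketch of what reading such values involves: the Parquet format stores PLAIN INT64 values as 8 little-endian bytes each, and PLAIN BYTE_ARRAY values as a 4-byte little-endian length followed by that many bytes. The function names below are hypothetical and are not taken from the starter code; treating every BYTE_ARRAY value as UTF-8 is an assumption that fits string columns only.

```rust
/// Decode `count` PLAIN-encoded INT64 values (8 bytes each, little-endian).
/// Returns `None` if the buffer is too short.
fn decode_plain_i64(buf: &[u8], count: usize) -> Option<Vec<i64>> {
    let needed = count.checked_mul(8)?;
    let bytes = buf.get(..needed)?;
    let mut values = Vec::new();
    for chunk in bytes.chunks_exact(8) {
        let mut raw = [0u8; 8];
        raw.copy_from_slice(chunk);
        values.push(i64::from_le_bytes(raw));
    }
    Some(values)
}

/// Decode `count` PLAIN-encoded BYTE_ARRAY values as UTF-8 strings.
/// Each value is a 4-byte little-endian length followed by the bytes.
/// Returns `None` on truncated input or invalid UTF-8.
fn decode_plain_byte_array(buf: &[u8], count: usize) -> Option<Vec<String>> {
    let mut values = Vec::new();
    let mut pos = 0usize;
    for _ in 0..count {
        // Read the length prefix.
        let len_bytes = buf.get(pos..pos.checked_add(4)?)?;
        let mut raw = [0u8; 4];
        raw.copy_from_slice(len_bytes);
        let len = u32::from_le_bytes(raw) as usize;
        pos += 4;
        // Read the value itself.
        let data = buf.get(pos..pos.checked_add(len)?)?;
        values.push(String::from_utf8(data.to_vec()).ok()?);
        pos += len;
    }
    Some(values)
}
```

Both functions return `None` instead of panicking on malformed input, which is convenient when experimenting with hand-written files.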
Metadata
You can also inspect the metadata using the metadata command:
cargo run metadata public-cloud-provider-ip-ranges.parquet
The output is quite verbose, but we can see that every column stores its values using plain encoding
(you can ignore RLE here: it is only used for the definition levels).
...
version: 1
num of rows: 62689
created by: Hello parquet!
...
column 0:
--------------------------------------------------------------------------------
column type: BYTE_ARRAY
column path: "cloud_provider"
encodings: PLAIN RLE
...
column 1:
--------------------------------------------------------------------------------
column type: BYTE_ARRAY
column path: "cidr_block"
encodings: PLAIN RLE
...
column 2:
--------------------------------------------------------------------------------
column type: BYTE_ARRAY
column path: "ip_address"
encodings: PLAIN RLE
...
column 3:
--------------------------------------------------------------------------------
column type: INT64
column path: "ip_address_mask"
encodings: PLAIN RLE
...
column 4:
--------------------------------------------------------------------------------
column type: INT64
column path: "ip_address_cnt"
encodings: PLAIN RLE
...
column 5:
--------------------------------------------------------------------------------
column type: BYTE_ARRAY
column path: "region"
encodings: PLAIN RLE
...
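The RLE entry listed next to PLAIN in the output above refers to Parquet's RLE/bit-packing hybrid encoding, which is used for the definition levels. For the curious, here is a minimal sketch of a decoder for that format, assuming the input starts right after the 4-byte length prefix that precedes the levels in a v1 data page. The names `read_uvarint` and `decode_levels` are hypothetical and not part of the starter code. For a flat optional column the maximum definition level is 1, so `bit_width` would be 1.

```rust
/// Read an unsigned LEB128 varint; returns (value, bytes consumed).
fn read_uvarint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in buf.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Decode `num_values` levels from RLE/bit-packed hybrid data.
/// Returns `None` if the input is truncated or `bit_width` is out of range.
fn decode_levels(buf: &[u8], bit_width: u32, num_values: usize) -> Option<Vec<u32>> {
    if bit_width == 0 || bit_width > 32 {
        return None;
    }
    let width = bit_width as usize;
    let value_bytes = (width + 7) / 8;
    let mut out = Vec::new();
    let mut pos = 0usize;
    while out.len() < num_values {
        let (header, used) = read_uvarint(buf.get(pos..)?)?;
        pos += used;
        if header & 1 == 0 {
            // RLE run: (header >> 1) copies of one value stored in `value_bytes` bytes.
            let run_len = (header >> 1) as usize;
            let raw = buf.get(pos..pos.checked_add(value_bytes)?)?;
            pos += value_bytes;
            let mut value = 0u32;
            for (i, &b) in raw.iter().enumerate() {
                value |= u32::from(b) << (8 * i);
            }
            let take = run_len.min(num_values - out.len());
            out.extend(std::iter::repeat(value).take(take));
        } else {
            // Bit-packed run: (header >> 1) groups of 8 values, packed LSB first.
            let groups = (header >> 1) as usize;
            let byte_len = groups.checked_mul(width)?;
            let raw = buf.get(pos..pos.checked_add(byte_len)?)?;
            pos += byte_len;
            for i in 0..groups * 8 {
                if out.len() == num_values {
                    break; // trailing values in the last group are padding
                }
                let mut value = 0u32;
                for bit in 0..width {
                    let idx = i * width + bit;
                    let b = (raw[idx / 8] >> (idx % 8)) & 1;
                    value |= u32::from(b) << bit;
                }
                out.push(value);
            }
        }
    }
    Some(out)
}
```

For example, the two bytes `0x0A 0x01` with `bit_width = 1` describe an RLE run of five levels that are all 1, which is what a page with no null values in an optional column would typically contain.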