Benchmark Parquet indices for duckdb-wasm performance #5

jdangerx · 2025-01-13T23:13:44Z

Overview

DuckDB will send a bunch of requests (~10-20 for most tables, but up to several hundred for the VCE RARE data) to download data. It looks like our Parquet files do not have any column indices, so let's see if adding some helps.

Success Criteria

How will we know that we're done?

We have compared the DuckDB-wasm initial load times of several tables, with and without indices: a very long but narrow one (VCE) and a very wide but only moderately long one.

Next steps

identify good tables
make test parquet files and upload them to S3
grab logs from network tab and investigate
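
One way to investigate the network-tab logs, as a rough sketch: export them from the browser as a HAR file (JSON) and count requests, range requests, bytes, and time per URL. The field names follow the HAR 1.2 format; the example URLs and numbers below are invented, and a real export would be loaded with `json.load` rather than built inline.

```python
from collections import defaultdict


def summarize_har(har: dict) -> dict[str, dict[str, float]]:
    """Aggregate request count, range-request count, bytes and time per URL."""
    summary: dict[str, dict[str, float]] = defaultdict(
        lambda: {"requests": 0, "range_requests": 0, "bytes": 0, "time_ms": 0.0}
    )
    for entry in har["log"]["entries"]:
        request = entry["request"]
        response = entry["response"]
        stats = summary[request["url"]]
        stats["requests"] += 1
        # Header names are case-insensitive, so normalize before looking up Range.
        headers = {h["name"].lower(): h["value"] for h in request.get("headers", [])}
        if "range" in headers:
            stats["range_requests"] += 1
        # HAR uses -1 for an unknown body size; count that as 0 bytes.
        stats["bytes"] += max(response.get("bodySize", 0), 0)
        stats["time_ms"] += entry.get("time", 0.0)
    return dict(summary)


# Invented example data; the URLs are placeholders, not real endpoints.
url_a = "https://example.com/no_index.parquet"
url_b = "https://example.com/page_index.parquet"
example_har = {
    "log": {
        "entries": [
            {
                "time": 10,
                "request": {"method": "HEAD", "url": url_a, "headers": []},
                "response": {"status": 200, "bodySize": 0},
            },
            {
                "time": 20,
                "request": {
                    "method": "GET",
                    "url": url_a,
                    "headers": [{"name": "Range", "value": "bytes=0-99"}],
                },
                "response": {"status": 206, "bodySize": 100},
            },
            {
                "time": 30,
                "request": {
                    "method": "GET",
                    "url": url_a,
                    "headers": [{"name": "range", "value": "bytes=100-299"}],
                },
                "response": {"status": 206, "bodySize": 200},
            },
            {
                "time": 5,
                "request": {"method": "GET", "url": url_b, "headers": []},
                "response": {"status": 200, "bodySize": -1},
            },
        ]
    }
}

summary = summarize_har(example_har)
```

Running this over one HAR export per test file would give a per-file request count and total transfer size to compare.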

zaneselvans · 2025-01-13T23:23:34Z

Have you looked at the monolithic EPA CEMS hourly parquet file? I think it might be the only one with logical row-groups, since we write in year-state chunks. But anyway if we can add a few meaningful indices to the Parquet files and make them faster and easier to query that sounds great!

jdangerx · 2025-01-13T23:40:17Z

I'm curious to see if the indices improve performance without having to also create meaningful partitions (which is what I think you meant by the CEMS logical row-groups thing? Maybe I misunderstood.)

jdangerx · 2025-01-13T23:41:11Z

But no, I haven't looked at CEMS since its metadata wasn't in the pudl database on datasette 😅 . Maybe we do #3 first...

zaneselvans · 2025-01-14T02:41:09Z

Oh, indices were also one potential explanation @bendnorman mentioned for why pudl.duckdb was so huge -- basically no smaller than pudl.sqlite despite the fact that the Parquet outputs are only a couple of GB.

jdangerx added this to Catalyst Megaproject Jan 13, 2025

github-project-automation bot moved this to New in Catalyst Megaproject Jan 13, 2025
