Benchmark Parquet indices for duckdb-wasm performance #5

Open
3 tasks
jdangerx opened this issue Jan 13, 2025 · 4 comments

Comments

@jdangerx
Member

Overview

DuckDB sends many requests to download data: roughly 10-20 for most tables, but up to several hundred for the VCE RARE data. It looks like our Parquet files do not have any column indices, so let's see whether adding some helps.

Success Criteria

How will we know that we're done?

  • compared the DuckDB-wasm initial load times of several tables, with and without indices: a very long but narrow one (VCE) and a very wide but only moderately long one.
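
One way the comparison above could be organized is sketched below. This is only a minimal timing harness, not the project's benchmark: `time_load`, `compare_variants`, and the variant names in the usage note are hypothetical, and each loader is assumed to be a zero-argument callable that performs one full table load. Because the criterion is about initial load time, each run would need to start from a cold cache for the numbers to be meaningful.

```python
import statistics
import time
from typing import Callable, Dict


def time_load(load: Callable[[], object], repeats: int = 5,
              clock: Callable[[], float] = time.perf_counter) -> float:
    """Return the median wall-clock seconds taken by `load` over `repeats` runs."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    for _ in range(repeats):
        start = clock()
        load()  # hypothetical: one complete table load
        durations.append(clock() - start)
    return statistics.median(durations)


def compare_variants(variants: Dict[str, Callable[[], object]], repeats: int = 5,
                     clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Time every named variant and return {name: median seconds}."""
    return {name: time_load(load, repeats, clock) for name, load in variants.items()}
```

A call might look like `compare_variants({"vce_no_index": load_plain, "vce_with_index": load_indexed})`, where both loaders are placeholders for whatever actually triggers the table load. The median is used so that a single slow network round trip does not dominate the result.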

Next steps

@zaneselvans
Member

Have you looked at the monolithic EPA CEMS hourly Parquet file? I think it might be the only one with logical row groups, since we write it in year-state chunks. But anyway, if we can add a few meaningful indices to the Parquet files and make them faster and easier to query, that sounds great!
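
To illustrate why year-state row groups might matter, here is a toy model in pure Python. It does not read Parquet at all: it only mimics the per-row-group min/max statistics that Parquet stores for each column, and counts how many row groups a filter on year and state could not rule out. The data, `RowGroupStats`, `chunk_stats`, and `groups_to_read` are all made up for this sketch; whether DuckDB-wasm actually issues fewer requests in practice is what the benchmark would have to show.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple

Row = Tuple[int, str]  # (year, state)


@dataclass
class RowGroupStats:
    min_year: int
    max_year: int
    min_state: str
    max_state: str


def chunk_stats(rows: List[Row], chunk_size: int) -> List[RowGroupStats]:
    """Split rows into fixed-size row groups and record min/max per group."""
    stats = []
    for i in range(0, len(rows), chunk_size):
        chunk = rows[i:i + chunk_size]
        years = [y for y, _ in chunk]
        states = [s for _, s in chunk]
        stats.append(RowGroupStats(min(years), max(years), min(states), max(states)))
    return stats


def groups_to_read(stats: List[RowGroupStats], year: int, state: str) -> int:
    """Count row groups whose min/max ranges could contain (year, state)."""
    return sum(
        1 for g in stats
        if g.min_year <= year <= g.max_year and g.min_state <= state <= g.max_state
    )


# Toy data: 3 years x 4 states, 100 rows per (year, state) pair.
years = [2019, 2020, 2021]
states = ["CA", "CO", "NY", "TX"]
# Written in year-state chunks: each row group holds exactly one pair.
sorted_rows = [(y, s) for y, s in product(years, states) for _ in range(100)]
# Same rows, interleaved so every row group mixes all years and states.
interleaved_rows = [(y, s) for _ in range(100) for y, s in product(years, states)]

year_state_groups = chunk_stats(sorted_rows, 100)
mixed_groups = chunk_stats(interleaved_rows, 100)

print(groups_to_read(year_state_groups, 2020, "CO"))  # prints 1
print(groups_to_read(mixed_groups, 2020, "CO"))       # prints 12
```

In this model the year-state layout lets the filter skip 11 of 12 row groups, while the interleaved layout skips none, because every group's min/max range spans all years and states.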

@jdangerx
Member Author

I'm curious to see if the indices improve performance without having to also create meaningful partitions (which is what I think you meant by the CEMS logical row-groups thing? Maybe I misunderstood.)

@jdangerx
Member Author

But no, I haven't looked at CEMS since its metadata wasn't in the pudl database on datasette 😅. Maybe we do #3 first...

@zaneselvans
Member

Oh, indices were also one potential explanation @bendnorman mentioned for why pudl.duckdb was so huge -- basically no smaller than pudl.sqlite despite the fact that the Parquet outputs are only a couple of GB.
