
Dask tests could use optimization #56

Open
shughes-uk opened this issue Nov 5, 2023 · 3 comments

Comments

@shughes-uk

Hi there,

At Coiled we have better-optimized versions of the tests:

https://github.com/coiled/benchmarks/blob/main/tests/benchmarks/test_h2o.py

Any chance of things being updated to give dask a fairer shot?

@jangorecki

Thanks, the code looks very promising and is definitely worth adding. Any idea about q10? Even if there is no optimized version for that one, we should still provide valid syntax. We can also define an exception comment that explains the issue and mentions the GH issue number.
I would also suggest filing a feature request (FR) for dask to add an argument to its CSV reader function so that it can do all those optimizations related to Parquet/PyArrow file splitting automatically.

@shughes-uk
Author

I'm traveling today but may have some time to poke around and make some tweaks.

@fjetter

fjetter commented Nov 7, 2023

Apart from the code itself, I suspect a big-ish issue with the current setup is that our default deployment configuration is not ideal for the machine the benchmarks run on. IIUC the benchmark server is a c6id.metal machine with 128 cores and 246GiB RAM. By default that launches 16 workers (processes) with 8 threads each. That's not necessarily a sweet spot for dask, since too many threads per process cause GIL contention, and it's something we can improve on. It doesn't explain why we're running out of memory, though.
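To make the worker/thread split concrete, here is a rough sketch. `default_like_layout` is only a heuristic that reproduces the 16 x 8 split for 128 cores; I'm assuming it matches what the default does, so don't treat it as dask's actual code. `layout_for` and `start_cluster` are hypothetical helper names, and 2 threads per worker is an illustrative value, not a measured optimum.

```python
from __future__ import annotations

import math


def default_like_layout(cores: int) -> tuple[int, int]:
    """Heuristic (assumption) that yields 16 workers x 8 threads for 128 cores.

    Picks the smallest divisor of `cores` that is >= sqrt(cores) as the
    number of worker processes, and splits the threads evenly among them.
    """
    if cores <= 4:
        return cores, 1
    workers = min(
        f for f in range(1, cores + 1)
        if cores % f == 0 and f >= math.sqrt(cores)
    )
    return workers, cores // workers


def layout_for(cores: int, threads_per_worker: int) -> tuple[int, int]:
    """Worker count for a fixed (smaller) number of threads per process."""
    if threads_per_worker < 1 or cores % threads_per_worker:
        raise ValueError("threads_per_worker must evenly divide cores")
    return cores // threads_per_worker, threads_per_worker


def start_cluster(cores: int = 128, threads_per_worker: int = 2):
    """Start a local cluster with an explicit layout instead of the default."""
    # Imported lazily so the helpers above work without dask installed.
    from dask.distributed import Client, LocalCluster

    n_workers, threads = layout_for(cores, threads_per_worker)
    cluster = LocalCluster(n_workers=n_workers, threads_per_worker=threads)
    return Client(cluster)
```

With fewer threads per process, each worker also gets a smaller slice of the machine's memory, so the right value would need to be checked against the larger datasets.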

Looking over the code briefly, I guess the most critical problem right now is that we are not pushing down column projections automatically. This is where https://github.com/dask-contrib/dask-expr should make a big difference.
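Until projections are pushed down automatically, they can be done by hand. A minimal sketch for a q1-style query (sum of `v1` by `id1`), assuming the data is already stored as Parquet; the file path and the `needed_columns` and `q1` helpers are hypothetical and not taken from the benchmark code:

```python
def needed_columns(by, aggs):
    """Columns a group-by query touches: the keys plus the aggregated values."""
    seen = []
    for col in list(by) + list(aggs):
        if col not in seen:
            seen.append(col)
    return seen


def q1(path="G1_1e7_1e2_0_0.parquet"):  # hypothetical path
    # Imported lazily so needed_columns works without dask installed.
    import dask.dataframe as dd

    by, aggs = ["id1"], {"v1": "sum"}
    # Read only the columns the query uses instead of the whole table.
    ddf = dd.read_parquet(path, columns=needed_columns(by, aggs))
    return ddf.groupby(by).agg(aggs).compute()
```

Here only `id1` and `v1` are loaded, which should cut memory use noticeably on the wide h2o tables.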

Regarding the missing features, I don't think there is anything missing. I opened #58 to re-enable those missing queries. A follow-up PR can go over the existing code and clean that up.


3 participants