Dask tests could use optimization #56

shughes-uk · 2023-11-05T04:38:41Z

Hi there,

At Coiled we have better optimized versions of the tests

https://github.com/coiled/benchmarks/blob/main/tests/benchmarks/test_h2o.py

Any chance of things being updated to give dask a fairer shot?

jangorecki · 2023-11-05T07:52:48Z

Thanks, code looks very promising. Definitely worth to add. Any idea about q10? Even if there is no optimized version for that one, we should still provide valid syntax. We can as well define exception comment that is used to explain the issue, even mentioning gh issue number.
I would also suggest to file a FR for dask to have an argument in csv reader function so it can do all those optimizations related to parquet parrow file splitting automatically.

shughes-uk · 2023-11-06T19:59:53Z

Im traveling today but may have some time to poke around and make some tweaks.

fjetter · 2023-11-07T10:44:41Z

Apart from the code itself, I think a big-ish issue with the current setup is that I suspect our default deployment configuration is not ideal for the machine the benchmarks are running on. IIUC the benchmark server is a c6id.metal machine with 128 cores and 246GiB RAM. That will launch 16 workers (processes) with 8 threads each. That's not necessarily a sweet spot for dask and something we can improve on. (too many threads per process cause GIL contention). This doesn't explain why we're running out of memory.

Looking over the code briefly, I guess the most critical problem right now is that we are not pushing down column projections automatically. This is where https://github.com/dask-contrib/dask-expr should make a big difference.

Regarding the missing features, I don't think there is anything missing. I opened #58 to re-enable those missing queries. A follow up PR can go over the existing code and clean that up

Tmonster mentioned this issue Nov 8, 2023

Dask: Enable Q6 to Q10 again #58

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dask tests could use optimization #56

Dask tests could use optimization #56

shughes-uk commented Nov 5, 2023

jangorecki commented Nov 5, 2023

shughes-uk commented Nov 6, 2023

fjetter commented Nov 7, 2023

Dask tests could use optimization #56

Dask tests could use optimization #56

Comments

shughes-uk commented Nov 5, 2023

jangorecki commented Nov 5, 2023

shughes-uk commented Nov 6, 2023

fjetter commented Nov 7, 2023