Difference in access time for shorter vs longer kerchunk files #538
Can you please run something like snakeviz so that we can see where most of the time is being spent?
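(For reference, a minimal sketch of one way to capture such a profile for snakeviz; `ds`, the variable name, and the selection are placeholders rather than code from this thread.)

```python
import cProfile

# Profile the slow read (placeholder selection) and dump stats for snakeviz
with cProfile.Profile() as prof:
    _ = ds["temp"].sel(time=slice("2016-01", "2016-12")).values
prof.dump_stats("read_timing.prof")

# Then inspect interactively with: snakeviz read_timing.prof
```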
Yes, I will work on this.
So it looks a lot like it's dask struggling to cope with the increased number of tasks. It being a high-level graph, I don't know why the number of chunks should be material during graph processing in the client, but there it is.
@martindurant Are you saying that both of the snakeviz plots indicate that dask is struggling, or just one of them? Both scenarios (default chunks and subchunks) have slower read speeds from this 24 year kerchunk file as compared to the same reads from a 1 year kerchunk file.
No - the second one is the one spending all its time in dask blockwise/graph stuff. How big is the sub-chunked reference file in terms of reference count or size?
The subchunked reference file (34MB) is quite a bit bigger than the default chunk reference file (1.2MB). My goal with subchunking is to see if I can create a reference file that has better access times for the access patterns that are slow for the default chunking reference file. In this case, I am trying to make time series access faster with subchunking (but without rechunking). It has been pretty successful, except that the times (for both the default reference files and the subchunked reference file) are typically quite a bit longer for the 24 year reference file as compared to the 1 year reference file, despite accessing the same amount of model output. (I had noticed this before when comparing just default reference files - the one that doesn't appear to be dask-limited - and my solution has been to create master reference files on the fly that are as long as needed for the application, but this won't work so well for serving the model output, which is my present goal.)
Feel free to send me the .prof files associated with the snakeviz output and I can dig around. One thing you might want to try: turning off dask optimization. I don't imagine it has too much to do, and it might cut out some/most of that 100s wait time. This can be a dask config option, or if you do it manually: `.compute(optimize_graph=False)`. Another possibility: if you can `.sel` your dataset early, perhaps there's some `region=` option? Or open with dask off and later call `.chunk` after selection? Worth trying a few things like that, but I am not so good on the xarray API, so you are experimenting on your own.
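(A minimal sketch of the two routes suggested above, assuming an already-opened dask-backed dataset `ds`; the variable name, selection, and config key are illustrative, and config key names can vary across dask versions.)

```python
import dask

# Placeholder selection standing in for the timed read
subset = ds["temp"].sel(time=slice("2000-01", "2023-12"))

# Option 1: skip dask's graph optimization pass explicitly at compute time
values = subset.compute(optimize_graph=False)

# Option 2: turn off low-level task fusion (one of the optimization passes) via config
with dask.config.set({"optimization.fuse.active": False}):
    values = subset.compute()
```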
It seems unlikely that kerchunk/referenceFS is adding significant overhead. You will, I suppose, have 24x as many dask tasks for the long data versus the short, and the task count is also multiplied by whatever your subchunking factor was (a guess of ~30 from the reference set file size). You may want to ensure that there are not too many parquet reference files, i.e. that the number of references per file is sufficiently large. At 30MB, there isn't really much point in splitting them up at all.
In terms of order of selecting and such within my timing script, I don't think there's much to do on the xarray end at least. I basically open the dataset (could be missing a useful flag there), access the variable in some way, and load the values, without even saving them, e.g.:
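(The snippet itself did not come through, so here is a sketch of the kind of open-and-read pattern being described; the paths, protocol, and variable/dimension names are placeholders, and the exact backend kwargs can differ between xarray/zarr versions.)

```python
import xarray as xr

# Open a parquet kerchunk reference set lazily via the zarr engine
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "model_refs.parq",       # directory of parquet reference files
            "remote_protocol": "s3",       # wherever the netCDF files actually live
            "remote_options": {"anon": True},
        },
    },
)

# Load a time series at one grid point (variable and dimension names are hypothetical)
values = ds["temp"].isel(xi_rho=100, eta_rho=100).values
```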
In the subchunked case, the 4D variables have 505 files (I assume these are the *.parq files in the variable directory in the reference file directory) and the 3D variables have 38. Is this too many? I see the default chunks file has 17 for 4D and 3 for 3D. Is this controlled by the subchunk factors I used?
Opening any parquet file has a fixed cost to read and parse the footer, and these could add up. The loading of the arrays may be comparable or even smaller, for very small files. For 34MB of references, I don't see why it should be split into more than a small handful of files. The number of references per file is controlled when creating the lazy reference mapper.
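(As a sketch of where that knob lives when writing parquet references, assuming the fsspec `LazyReferenceMapper` workflow from the kerchunk docs; the path and `record_size` value are illustrative.)

```python
import fsspec
from fsspec.implementations.reference import LazyReferenceMapper

fs = fsspec.filesystem("file")

# record_size sets how many references go into each parquet file,
# so a larger value means fewer, bigger .parq files
out = LazyReferenceMapper.create(
    root="combined.parq",
    fs=fs,
    record_size=100_000,   # illustrative; tune to the total reference count
)

# ... use `out` as the output target of the combine step
# (e.g. MultiZarrToZarr(..., out=out)), then write any buffered references
out.flush()
```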
cc @phofl
Is that controlled by the
I've read through a bunch of GitHub issues about related problems, but I'm sure I'm missing plenty of flags here and there that could make this better! I'd love any guidance.
Can you share a reproducer and your Dask / Xarray versions?
I reran my original test for the 24 year reference file with dask graph optimization turned off, as suggested above. Now for the 24 year case I have better performance for the subchunked case across the board. There is still a pretty big difference between the 1 year and 24 year kerchunk file cases, but it is smaller now. Is this expected behavior? It sounds like I might get better behavior by reducing the number of parquet reference files (so each one holds more references), but @martindurant I am not sure how to do that. I would appreciate any tips.

@phofl I'm willing, but I'm not sure what would be useful to share in this case, where using the 24 years of model output is what causes the problem I want to solve. Currently the model output is pretty difficult/brittle to access remotely, and I'm trying to improve that using kerchunk. Is there a specific thing that would be useful to try to recreate with fake model output, or with a smaller amount of the real model output?
Dask still struggles to make a graph with lots of tasks; I have no idea why.
@martindurant Ok, I'll email you, thanks.
You can also post it here (in a .zip) or use gist.
Here is a profile run for each of the 24 year cases:
The "defaults" one didn't work, it seems to have profiled ipython's event loop instead. The subchunks one shows only time in referenceFS and parquet loading, and in the latter, as much time is spent allocating memory as reading, so the files really are tiny (especially for local, where bandwidth is huge). Strangely, cat_file is showing as having taken 33000% of the time, so something is very weird in this profile! |
Hi! I use kerchunk all the time and love it, so thank you! I have a question: I am finding that access times get quite a bit longer the longer the kerchunk file is.
1 year kerchunk file:
24 year kerchunk file (but same number of times and values accessed as in the 1 year kerchunk file):
(The comparisons I am running in each plot are trying out subchunking since these files are uncompressed netCDF4 files.)
Is this expected behavior? Is there anything I can do to counteract this effect? For example, would it be better if I used 24 1-year kerchunk files? I've read through a bunch of issues here and I wonder if there is a flag that would help with this.
Thank you for any help!
Edited to add: These are parquet kerchunk files.
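(To make the comparison concrete, a rough sketch of the kind of timing loop being described, assuming two parquet reference sets; the paths, storage options, and variable/selection are placeholders, not names from this issue.)

```python
import time
import xarray as xr

def open_refs(parq_dir):
    # Open a parquet kerchunk reference set lazily (placeholder storage options)
    return xr.open_dataset(
        "reference://",
        engine="zarr",
        backend_kwargs={
            "consolidated": False,
            "storage_options": {"fo": parq_dir, "remote_protocol": "s3"},
        },
    )

for parq_dir in ["refs_1year.parq", "refs_24year.parq"]:
    ds = open_refs(parq_dir)
    t0 = time.perf_counter()
    # Identical read in both cases: one year of a single point's time series
    _ = (
        ds["temp"]
        .sel(time=slice("2016-01-01", "2016-12-31"))
        .isel(xi_rho=100, eta_rho=100)
        .values
    )
    print(parq_dir, f"{time.perf_counter() - t0:.1f} s")
```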