Difference in access time for shorter vs longer kerchunk files #538

Open
kthyng opened this issue Jan 21, 2025 · 19 comments

kthyng commented Jan 21, 2025

Hi! I use kerchunk all the time and love it, so thank you! I have a question: I find that access times get noticeably longer the longer the kerchunk file is.

1 year kerchunk file:

[timing plot: 1 year kerchunk file]

24 year kerchunk file (but same number of times and values accessed as in the 1 year kerchunk file):

[timing plot: 24 year kerchunk file]

(The comparisons I am running in each plot are trying out subchunking since these files are uncompressed netCDF4 files.)

Is this expected behavior? Is there anything I can do to counteract this effect? For example, would it be better if I used 24 1-year kerchunk files? I've read through a bunch of issues here and I wonder if there is a flag that would help with this.

Thank you for any help!

Edited to add: These are parquet kerchunk files.

martindurant (Member) commented:

Can you please run something like snakeviz so that we can see where most of the time is being spent?
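
For example, something like this (just a sketch; the access line is a placeholder for whatever read you are timing):

import cProfile

# Profile one of the slow reads and dump the stats to a file that snakeviz can open.
cProfile.run('ds["temp"].isel(ocean_time=slice(0, 24)).values', "access_test.prof")

# Then, from a shell:
#   snakeviz access_test.prof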

kthyng commented Jan 21, 2025

Yes I will work on this.

kthyng commented Jan 22, 2025

I am not sure how to interpret the output in snakeviz, so I'm happy to take some guidance on what to show.

For the full 24 year kerchunk file, running all of the access tests except the 4D cross section (which took the most time), we have the following for the "default" chunks:

[snakeviz profile: default chunks]

and for the subchunk case:

[snakeviz profile: subchunked]

martindurant (Member) commented:

So it looks a lot like dask struggling to cope with the increased number of tasks. Since it is a high-level graph, I don't know why the number of chunks should matter during graph processing on the client, but there it is.

kthyng commented Jan 22, 2025

@martindurant Are you saying that both of the snakeviz plots indicate that dask is struggling, or just one of them? Both scenarios (default chunks and subchunks) have slower read speeds from this 24 year kerchunk file as compared to the same reads from a 1 year kerchunk file.

martindurant (Member) commented:

No - the second one is the one spending all its time in dask blockwise/graph stuff.

How big is the sub-chunked reference file in terms of reference count or size?

kthyng commented Jan 22, 2025

The subchunked reference file (34MB) is quite a bit bigger than the default chunk reference file (1.2MB).

My goal with subchunking is to see if I can create a reference file with better access times for the access patterns that are slow with the default-chunking reference file. In this case, I am trying to make time series access faster with subchunking (but without rechunking). It has been pretty successful, except that the times (for both the default reference file and the subchunked reference file) are typically quite a bit longer for the 24 year reference file than for the 1 year reference file, despite accessing the same amount of model output. (I had noticed this before when comparing just default reference files, the ones that don't appear to be dask-limited, and my solution has been to create the master reference files on the fly, only as long as needed for the application. That won't work so well for serving the model output, which is my present goal.)

martindurant (Member) commented:

Feel free to send me the .prof files associated with the snakeviz output and I can dig around.

One thing you might want to try: turning off dask optimization. I don't imagine it has too much to do, and it might cut out some/most of that 100s wait time. This can be a dask config option, or, if you do it manually: .compute(optimize_graph=False)
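
For example (a minimal sketch; ds and the selection are placeholders for your dask-backed dataset):

# Skip dask's graph optimization pass for this one computation.
result = ds["temp"].isel(ocean_time=slice(0, 24)).compute(optimize_graph=False)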

Another possibility: if you can .sel your dataset early, perhaps there's some region= option? Or open with dask off and later call .chunk after selection? Worth trying a few things like that, but I am not so good on the xarray API, so you'll be experimenting on your own.
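
Untested sketch of that last idea (the reference path, protocol, variable names, and chunk sizes are all placeholders for whatever you currently use):

import fsspec
import xarray as xr

# Open the references without dask (chunks=None), select first, then chunk only the subset.
fs = fsspec.filesystem(
    "reference",
    fo="ciofs_refs.parq",           # placeholder: your parquet reference directory
    remote_protocol="s3",           # placeholder: wherever the netCDF files live
    remote_options={"anon": True},  # placeholder
)
ds = xr.open_dataset(
    fs.get_mapper(""), engine="zarr", chunks=None,
    backend_kwargs={"consolidated": False},
)
ts = ds["temp"].isel(eta_rho=100, xi_rho=200, s_rho=-1)  # lazy selection, no dask graph yet
ts = ts.chunk({"ocean_time": 24 * 30})                   # only now build a (much smaller) graph
result = ts.values                                       # compute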

> Edited to add: These are parquet kerchunk files.

It seems unlikely that kerchunk/referenceFS is adding significant overhead. You will, I suppose, have 24x as many dask tasks for the long data versus the short, and the task count is also multiplied by whatever your subchunking factor was (a guess of ~30 from the reference file sizes).

You may want to ensure that there are not too many parquet reference files, i.e., that the number of references per file is sufficiently large. At ~30MB of references, there isn't really much point in splitting them up at all.
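
A quick way to check the file count (a sketch; the directory name is a placeholder for your reference directory):

import glob

# Each variable has its own subdirectory of .parq partitions; count them all.
n_files = len(glob.glob("ciofs_refs.parq/**/*.parq", recursive=True))
print(n_files)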

kthyng commented Jan 23, 2025

In terms of the order of selecting and such within my timing script, I don't think there's much to do on the xarray end, at least. I basically open the dataset (could be missing a useful flag there), access the variable in some way, and load the values without even saving them, e.g.:

ds["temp"].isel(eta_rho=100, xi_rho=200, s_rho=-1, ocean_time=slice(0,365*24)).values

> You may want to ensure that there are not too many parquet reference files

In the subchunked case, the 4D variables have 505 files (I assume these are the *.parq files in the variable directory within the reference file directory) and the 3D variables have 38. Is this too many? I see the default-chunks file has 17 for 4D and 3 for 3D. Is this controlled by the subchunk factors I used?

martindurant (Member) commented:

Opening any parquet file has a fixed cost to read and parse the footer, and these could add up. The loading of the arrays may be a comparable or even smaller cost for very small files. For 34MB of references, I don't see why it should be split into more than a small handful of files.

The number of references per file is controlled when creating the lazy reference mapper.
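
Roughly like this (a sketch; the path is a placeholder): a larger record_size puts more references into each parquet partition, so fewer .parq files get written, at the cost of more memory while building them.

from fsspec.implementations.reference import LazyReferenceMapper

# More references per partition -> fewer, larger .parq files.
out = LazyReferenceMapper.create("ciofs_refs.parq", fs=None, record_size=200_000)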

dcherian (Contributor) commented:

cc @phofl

kthyng commented Jan 23, 2025

> The number of references per file is controlled when creating the lazy reference mapper.

Is that controlled by record_size? If so, I played around with that and had to use a smaller number to keep the memory use under control on the machine I was running on. Even so, I was barely able to finish running it before blowing out the memory, and that was after first combining the yearly JSONs into 4-year JSON files before finally doing the following to make the final parquet reference file:

from fsspec.implementations.reference import LazyReferenceMapper
from kerchunk.combine import MultiZarrToZarr

out = LazyReferenceMapper.create(fname_overall_kerchunk_parq, fs=None, record_size=50000)
# out = LazyReferenceMapper.create(fname_overall_kerchunk_parq, fs=None, record_size=100000)  # maxed out memory on server

json_list = ["ciofs_kerchunk_subchunk0to3.json", "ciofs_kerchunk_subchunk4to7.json", "ciofs_kerchunk_subchunk8to11.json", "ciofs_kerchunk_subchunk12to15.json", "ciofs_kerchunk_subchunk16to19.json", "ciofs_kerchunk_subchunk20to23.json"]

# Combine the 4-year JSON reference sets along ocean_time, writing directly into
# the lazy parquet mapper.
mzz = MultiZarrToZarr(
    json_list,
    concat_dims=["ocean_time"],
    identical_dims=['lat_rho', 'lon_rho', "lon_psi", "lat_psi",
                    "lat_u", "lon_u", "lat_v", "lon_v",
                    "Akk_bak","Akp_bak","Akt_bak","Akv_bak","Cs_r","Cs_w",
                    "FSobc_in","FSobc_out","Falpha","Fbeta","Fgamma","Lm2CLM",
                    "Lm3CLM", "LnudgeM2CLM", "LnudgeM3CLM", "LnudgeTCLM",
                    "LsshCLM", "LtracerCLM", "LtracerSrc", "LuvSrc",
                    "LwSrc", "M2nudg", "M2obc_in", "M2obc_out", "M3nudg",
                    "M3obc_in", "M3obc_out", "Tcline", "Tnudg", "Tobc_in", "Tobc_out",
                    "Vstretching", "Vtransform", "Znudg", "Zob", "Zos", "angle",
                    "dstart", "dt", "dtfast", "el", "f", "gamma2", "grid", "h",
                    "hc", "mask_psi", "mask_rho", "mask_u", "mask_v", "nHIS", "nRST",
                    "nSTA", "ndefHIS", "ndtfast", "ntimes", "pm", "pn", "rdrg",
                    "rdrg2", "rho0", "spherical", "theta_b", "theta_s", "xl",
                    ],
    coo_map={"ocean_time": "cf:ocean_time"},
    # postprocess=postprocess,
    out=out,
).translate()
out.flush()

I've read through a bunch of related GitHub issues, but I'm sure I'm missing plenty of flags here and there that would make this better! I'd love any guidance.

phofl commented Jan 23, 2025

Can you share a reproducer and your Dask / Xarray versions?

kthyng commented Jan 24, 2025

I reran my original test for the 24 year reference file with chunks=None to avoid dask for the subchunked file. (I had tested both on the 1 year case and thought I had applied what I learned there, so I hadn't used that originally for the 24 year case, but the behavior is pretty different between the two.)

Now for the 24 year case I have better performance for the subchunked case across the board:

[timing plot: 24 year kerchunk file, subchunked, chunks=None]

There is still a pretty big difference between the 1 year and 24 year kerchunk file cases, but it is smaller now. Is this expected behavior?

It sounds like I might get better behavior by reducing the number of parquet reference files (i.e., increasing the number of references per file), but @martindurant I am not sure how to do that. I would appreciate any tips.

@phofl I'm willing, but I'm not sure what would be useful to share in this case, where using the full 24 years of model output is what causes the problem I want to solve. Currently the model output is pretty difficult/brittle to access remotely, and I'm trying to improve that using kerchunk. Is there a specific thing that would be useful to try to recreate with fake model output, or with a smaller amount of the real model output?

martindurant (Member) commented:

> a pretty big difference between the 1 year and 24 year kerchunk file cases, but it is smaller now. Is this expected behavior?

Dask still struggles to make a graph with lots of tasks; I have no idea why.
I would love to see your .prof files (profiles!), which you can get with %prun -T <filename> in IPython; this is what snakeviz does internally. Then we can assign blame!

kthyng commented Jan 24, 2025

@martindurant Ok I'll email you, thanks.

martindurant (Member) commented:

You can also post it here (in a .zip) or use a gist.

kthyng commented Jan 24, 2025

Here is a profile run for each of the 24 year cases:
profs.zip

martindurant (Member) commented:

The "defaults" one didn't work, it seems to have profiled ipython's event loop instead.

The subchunks one shows only time in referenceFS and parquet loading, and in the latter as much time is spent allocating memory as reading, so the files really are tiny (especially for local access, where bandwidth is huge). Strangely, cat_file shows as having taken 33000% of the time, so something is very weird in this profile!
