
Improve performance #13

Open
nikola-rados opened this issue Feb 24, 2021 · 2 comments

@nikola-rados
Contributor

While the script is working as intended so far, performance may become a concern for its viability. With this issue we will seek out ways to improve its speed.

@nikola-rados nikola-rados self-assigned this Feb 24, 2021
@nikola-rados
Contributor Author

Examining the snakeviz output for a request of size 571 MB (this is the size reported by `Dataset.nbytes / 2`), we get a pretty clear picture of what is holding back the performance:

(snakeviz profile screenshot)

Note: given the exact same parameters, I've seen this time vary quite a bit, anywhere from the high 200s to the low 400s of seconds.
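For anyone wanting to reproduce a profile like the one above, here is a minimal stdlib-only sketch using `cProfile`; the `process` function is a stand-in for the real workload, not orca code, and the same `.prof` dump can be opened with `snakeviz`:

```python
import cProfile
import io
import pstats

def process():
    # Stand-in for the real workload (e.g. scripts/process.py);
    # any callable can be profiled the same way.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
result = process()
profiler.disable()

# Print the top entries by cumulative time; dumping with
# profiler.dump_stats("profile.prof") feeds `snakeviz profile.prof`.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
stats_text = stream.getvalue()
```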

The `Dataset.to_netcdf()` method takes up essentially the entire runtime of the program. If we follow the call stack to the bottom, we see that the method already uses some threading to handle its execution:

(snakeviz call-stack screenshot)

Despite this, it doesn't seem to do things particularly quickly (at least it feels that way). @cairosanders and I have already tried to incorporate asyncio to load the individual requests simultaneously, but xarray's support for asynchronous tasks is pretty limited. The main bottleneck of `to_netcdf` also still exists, unfortunately.
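Since asyncio support in xarray is limited, one alternative worth discussing is plain threads: the blocking OPeNDAP reads can be dispatched with `concurrent.futures`. A sketch follows; `fetch` is a hypothetical placeholder for the real dataset read (e.g. an `xr.open_dataset` call), not orca's actual code, and the URLs are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Placeholder for the real blocking OPeNDAP read; echoing the URL
    # keeps the sketch self-contained and runnable.
    return f"dataset-for-{url}"

urls = [
    "https://example.org/thredds/dodsC/file.nc?tasmax[0:1:7500][0:1:91][0:1:206]",
    "https://example.org/thredds/dodsC/file.nc?tasmax[7501:1:15000][0:1:91][0:1:206]",
]

# Each blocking download runs in its own worker thread; map() returns
# results in submission order, so the merge step stays deterministic.
with ThreadPoolExecutor(max_workers=2) as pool:
    datasets = list(pool.map(fetch, urls))
```

This only parallelizes the downloads; as noted above, the `to_netcdf` write itself would remain the dominant cost.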

I don't know what the performance requirements/expectations are for orca, but I get the feeling this may be a little too slow. As such, I was hoping to open up some discussion about how we might go about speeding this up.

@nikola-rados
Contributor Author

To add some more detail: the results above were achieved by running `make performance`, which runs a test case that splits a single request into two. Here is a look at the parameters passed into the script:

```
scripts/process.py -u tasmax_day_BCCAQv2_bcc-csm1-1-m_historical-rcp26_r1i1p1_19500101-21001231_Canada -v tasmax[0:1:15000] -t [0:1:91] -n [0:1:206] -l DEBUG
```
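The bracketed selectors appear to follow OPeNDAP's `[start:stride:stop]` hyperslab syntax. As a small illustration (this is a hypothetical parser sketch, not a function in orca), the three selectors above can be decoded like so:

```python
import re

def parse_hyperslab(text):
    """Parse an OPeNDAP-style selector like '[0:1:15000]' into
    (start, stride, stop) integers."""
    match = re.fullmatch(r"\[(\d+):(\d+):(\d+)\]", text)
    if match is None:
        raise ValueError(f"not a hyperslab selector: {text!r}")
    return tuple(int(group) for group in match.groups())

# The three dimension selectors from the command above.
dims = [parse_hyperslab(s) for s in ("[0:1:15000]", "[0:1:91]", "[0:1:206]")]
```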

The original request is split into these two requests:

```
'https://docker-dev03.pcic.uvic.ca/twitcher/ows/proxy/thredds/dodsC/datasets/storage/data/climate/downscale/BCCAQ2/bccaqv2_with_metadata/tasmax_day_BCCAQv2+ANUSPLIN300_bcc-csm1-1-m_historical+rcp26_r1i1p1_19500101-21001231.nc?tasmax[0:1:7500][0:1:91][0:1:206]'
'https://docker-dev03.pcic.uvic.ca/twitcher/ows/proxy/thredds/dodsC/datasets/storage/data/climate/downscale/BCCAQ2/bccaqv2_with_metadata/tasmax_day_BCCAQv2+ANUSPLIN300_bcc-csm1-1-m_historical+rcp26_r1i1p1_19500101-21001231.nc?tasmax[7501:1:15000][0:1:91][0:1:206]'
```

These are split in half on the time variable such that both requests are under the threshold.
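The halving step described above can be sketched as follows; `split_interval` is a hypothetical helper for illustration, not orca's actual function, and mirrors how `tasmax[0:1:15000]` becomes `[0:1:7500]` and `[7501:1:15000]`:

```python
def split_interval(start, stop):
    """Split an inclusive [start, stop] index range into two halves,
    mirroring how the time dimension [0:1:15000] is divided into
    [0:1:7500] and [7501:1:15000]."""
    mid = (start + stop) // 2
    return (start, mid), (mid + 1, stop)

halves = split_interval(0, 15000)
```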

Here is the full set of logs from the run:

```
2021-02-26 13:08:15 INFO: Processing data file request
2021-02-26 13:08:15 DEBUG: Starting db session
2021-02-26 13:08:15 DEBUG: Got filepath: /storage/data/climate/downscale/BCCAQ2/bccaqv2_with_metadata/tasmax_day_BCCAQv2+ANUSPLIN300_bcc-csm1-1-m_historical+rcp26_r1i1p1_19500101-21001231.nc
2021-02-26 13:08:15 DEBUG: Initial url: https://docker-dev03.pcic.uvic.ca/twitcher/ows/proxy/thredds/dodsC/datasets/storage/data/climate/downscale/BCCAQ2/bccaqv2_with_metadata/tasmax_day_BCCAQv2+ANUSPLIN300_bcc-csm1-1-m_historical+rcp26_r1i1p1_19500101-21001231.nc?tasmax[0:1:15000][0:1:91][0:1:206]
2021-02-26 13:08:15 INFO: Downloading data file(s)
2021-02-26 13:08:16 DEBUG: Splitting, request over threshold: 571358088.0
2021-02-26 13:08:16 DEBUG: URL(s) for downloading: ['https://docker-dev03.pcic.uvic.ca/twitcher/ows/proxy/thredds/dodsC/datasets/storage/data/climate/downscale/BCCAQ2/bccaqv2_with_metadata/tasmax_day_BCCAQv2+ANUSPLIN300_bcc-csm1-1-m_historical+rcp26_r1i1p1_19500101-21001231.nc?tasmax[0:1:7500][0:1:91][0:1:206]', 'https://docker-dev03.pcic.uvic.ca/twitcher/ows/proxy/thredds/dodsC/datasets/storage/data/climate/downscale/BCCAQ2/bccaqv2_with_metadata/tasmax_day_BCCAQv2+ANUSPLIN300_bcc-csm1-1-m_historical+rcp26_r1i1p1_19500101-21001231.nc?tasmax[7501:1:15000][0:1:91][0:1:206]']
2021-02-26 13:08:16 DEBUG: Downloading and merging 2 split files
2021-02-26 13:13:57 DEBUG: File writing complete
2021-02-26 13:13:57 INFO: Complete
```
