Benchmarking of GeoDataset for a paper result #81
I think this will require a significant rework of our |
I think we can also consider the following I/O strategies:
Merging should happen after the fact so that (tile 1, tile 2, tile 1 + 2) don't end up being 3 different entries in the cache. I don't think we need to consider situations in which we:
These strategies make sense for tile-based raster images, but are slightly more complicated for vector geometries or static regional maps. We may need to change the default behavior based on the dataset.
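A minimal sketch of the "merge after the fact" idea, assuming a per-file LRU cache: each tile is cached on its own and the merged result is never cached, so requesting tile 1, tile 2, and tile 1 + 2 leaves only two cache entries. `load_tile`, `merge`, and `fetch` are hypothetical names for illustration, and the loader is a stand-in for a real raster read.

```python
from functools import lru_cache

# Records every real (uncached) load so cache hits can be observed.
LOAD_CALLS = []


@lru_cache(maxsize=4)  # hypothetical cache size, counted in tiles
def load_tile(path: str) -> tuple:
    """Stand-in for an expensive raster read (hypothetical loader)."""
    LOAD_CALLS.append(path)
    return (path,)  # placeholder payload instead of a real array


def merge(tiles) -> list:
    """Stand-in merge step: concatenates the placeholder payloads."""
    merged = []
    for tile in tiles:
        merged.extend(tile)
    return merged


def fetch(paths) -> list:
    """Look up each tile in the cache, then merge without caching the result."""
    return merge(load_tile(p) for p in sorted(paths))


fetch(["tile1"])
fetch(["tile2"])
fetch(["tile1", "tile2"])  # served from the two existing cache entries
print(load_tile.cache_info())
```

Caching the merged array instead would store the same pixels up to three times, which matters when the cache is limited by RAM.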
For timing, we should choose some arbitrary epoch size, then experiment with various batch sizes and see how long it takes to load an entire epoch.
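A rough sketch of that timing loop, assuming a fixed epoch size and a callable that loads one batch. `fake_load` is a hypothetical placeholder for the real data loading, and the epoch and batch sizes are arbitrary.

```python
import time


def batches(epoch_size: int, batch_size: int):
    """Yield the size of each batch in one epoch; the last may be smaller."""
    for start in range(0, epoch_size, batch_size):
        yield min(batch_size, epoch_size - start)


def time_epoch(load_batch, epoch_size: int, batch_size: int):
    """Return (samples loaded, seconds elapsed) for one full epoch."""
    start = time.perf_counter()
    loaded = 0
    for size in batches(epoch_size, batch_size):
        load_batch(size)
        loaded += size
    return loaded, time.perf_counter() - start


def fake_load(n: int) -> list:
    """Hypothetical loader; replace with real dataset sampling."""
    return [0] * n


EPOCH_SIZE = 1024  # arbitrary, as suggested above
for batch_size in (16, 32, 64):
    loaded, seconds = time_epoch(fake_load, EPOCH_SIZE, batch_size)
    print(f"batch_size={batch_size}: {loaded} samples in {seconds:.6f} s")
```

Keeping the epoch size fixed means every batch size loads the same number of samples, so the timings are directly comparable.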
Here's where I'm currently stuck to remind myself when I next pick this up: Our process right now is:
Steps 1 and 2 don't actually do anything and are almost instantaneous. It isn't until you actually try to |
Another hurdle: the size of each array depends greatly on the dataset, but most are around 0.5 GB per file. We can't really assume users have >8 GB of RAM, which greatly limits our LRU cache size. We could use something like |
For now, I think we can rely on GDAL's internal caching behavior. When I read a VRT the second time around, it seems to be significantly faster. Still not as fast as reading the raw data or as indexing from a loaded array, but good enough for a first round of benchmarking. GDAL also lets you configure the cache size.
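One way to configure it, sketched under assumptions: GDAL reads the `GDAL_CACHEMAX` configuration option, which can be supplied as an environment variable, and small values are interpreted as megabytes. The 8 GB machine and the 25% share below are illustrative choices, not recommendations, and `gdal_cachemax_mb` is a hypothetical helper.

```python
import os


def gdal_cachemax_mb(ram_gb: float, fraction: float) -> int:
    """Convert a share of total RAM into a whole number of megabytes."""
    return int(ram_gb * 1024 * fraction)


# Assumption: budget a quarter of an 8 GB machine for GDAL's block cache.
cache_mb = gdal_cachemax_mb(ram_gb=8, fraction=0.25)

# Set this before any dataset is opened so GDAL picks it up.
os.environ["GDAL_CACHEMAX"] = str(cache_mb)
print(os.environ["GDAL_CACHEMAX"])
```

It is probably worth reporting the cache size alongside any benchmark numbers, since second reads of a VRT depend on it.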
@adamjstewart, sketch of the full experiment:
@calebrob6 the above proposal covers the matrix of:
There are a lot of additional constraints that we're currently skipping:
Do you think it's fine to skip these for the sake of time? I doubt reviewers would straight up reject us for not including one of these permutations, and can always ask us to perform additional experiments if they want. Also, we should definitely benchmark not only |
Also, do we want to compare with different batch_sizes or different num_workers?
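If we do sweep both, the combinations can be enumerated up front. A small sketch, where the specific values are placeholders rather than an agreed experiment plan:

```python
from itertools import product

# Hypothetical values; the real sweep is still to be decided.
batch_sizes = [16, 32, 64]
num_workers = [0, 2, 4]

# One configuration per (batch_size, num_workers) pair.
grid = [
    {"batch_size": b, "num_workers": w}
    for b, w in product(batch_sizes, num_workers)
]

for config in grid:
    print(config)
```

The grid grows multiplicatively with every extra axis, so each added constraint multiplies the number of runs.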
I'd do the first matrix as quickly as possible because the results of that are going to be very informative. If that all works out then you can repeat the same with a vectordataset.
I don't think this is important right now. I.e. we can just assume the data is in a good format (COG and shapefile/geopackage)
In the above sketch you can repeat the experiments with the manually aligned versions of the dataset to test the "already in correct CRS/res" case. The first set of experiments is with "change CRS and res". It might be interesting to see if warping or resampling is more expensive, but not interesting for the paper I think.
Sure! These experiments should be very quick to run once you have a script for them.
Some things to discuss soon:
We're following up on this discussion in #1330 (comment)
Datasets
We want to test several popular image sources, as well as both raster and vector labels.
There is also a question of which file formats to test. For example, sampling from GeoJSON can take 3 min per getitem, whereas ESRI Shapefile only takes 1 sec per getitem (#69 (comment)).
Experiments
For the warping strategy, we should test the following possibilities:
What is the upfront cost of these pre-processing steps?
Example notebook: https://gist.github.com/calebrob6/d9bc5609ff638d601e2c35a1ab0a2dec