TileDatasets #1353
calebrob6 commented May 20, 2023
- Introduces a new class of datasets called TileDatasets that are indexed by filename, xoffset, yoffset, and patch size.
- Implements samplers for these
- Implements an L7IrishDataModule using this scheme
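For readers skimming the thread, the indexing scheme described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `TileIndex`, `grid_tile_indices`, and the file names are hypothetical, and image sizes are passed in as a plain dict rather than read from disk.

```python
from typing import Dict, Iterator, NamedTuple, Tuple


class TileIndex(NamedTuple):
    """Hypothetical index for one patch: which file, where, and how big."""

    filename: str
    xoffset: int
    yoffset: int
    patch_size: int


def grid_tile_indices(
    image_sizes: Dict[str, Tuple[int, int]], patch_size: int, stride: int
) -> Iterator[TileIndex]:
    """Yield an index for every patch that fits entirely inside each image.

    image_sizes maps filename -> (height, width) in pixels (an assumption
    made for this sketch; a real dataset would read sizes from the files).
    """
    for filename, (height, width) in image_sizes.items():
        for yoffset in range(0, height - patch_size + 1, stride):
            for xoffset in range(0, width - patch_size + 1, stride):
                yield TileIndex(filename, xoffset, yoffset, patch_size)


# Hypothetical file names and sizes, purely for illustration.
sizes = {"scene_a.tif": (512, 768), "scene_b.tif": (256, 256)}
indices = list(grid_tile_indices(sizes, patch_size=256, stride=256))
```

A dataset built on this idea would take one such index in its item lookup and read only that window from the named file, so no geospatial index is needed.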
Thanks for opening this proof of concept! There are a few questions here:
My current opinions:
Maybe -- although it feels like we should be able to get there by making RasterDataset and the Samplers more complex. I think I sketched out a method for allowing RasterDataset
Depends on the first question
We've talked about this a bunch of times before -- I almost never want to search a file system and keep things that match a regex. The nice thing about this is that you can rename
else:
    experiment_name = f"{model}_{lr}_{loss}_{wd}_{weights}_{seed}"

config_file = os.path.join("conf", "l7irishtile.yaml")
Should we make one run_{downstream_task}.py script that has the config file name as an additional variable at the beginning of the file? Then we can use this one script for all the different downstream tasks, or are there some differences beyond the config files that we need to account for?
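A rough sketch of what such a shared script could look like, assuming the tasks really do differ only by config file. `CONFIG_NAME`, `build_runs`, and the hyperparameter values are hypothetical; only the `os.path.join("conf", ...)` pattern and the f-string naming style come from the diff.

```python
import itertools
import os

# The only line that would change between downstream tasks (hypothetical).
CONFIG_NAME = "l7irishtile.yaml"


def build_runs(config_name, models, lrs, seeds):
    """Return one dict per run: the shared config path plus a unique name."""
    config_file = os.path.join("conf", config_name)
    runs = []
    for model, lr, seed in itertools.product(models, lrs, seeds):
        # Shortened version of the naming scheme used in the diff.
        experiment_name = f"{model}_{lr}_{seed}"
        runs.append(
            {"config_file": config_file, "experiment_name": experiment_name}
        )
    return runs


# Illustrative values only; the real sweep lists live in the script itself.
runs = build_runs(CONFIG_NAME, models=["unet"], lrs=[1e-3, 1e-4], seeds=[0, 1])
```

If a task needs different sweep lists as well, those could be lifted to the top of the file next to the config name.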
I'd like to revive this one, but maybe we can also make the samplers work with non-georeferenced images. Some of the datasets we have, like GID-15 and LEVIR-CD, consist of large images that I may want to sample smaller patches from for training. I don't want to manually preprocess them to a specific patch size beforehand, in case I want to run an ablation by varying the patch size.
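To make the use case concrete, here is a minimal sketch of random patch sampling in pixel coordinates, where the patch size is just an argument and no preprocessing is needed. `random_patch_indices` and the file names are hypothetical and not part of TorchGeo; image sizes are supplied directly for the sake of the example.

```python
import random
from typing import Dict, List, Tuple


def random_patch_indices(
    image_sizes: Dict[str, Tuple[int, int]],
    patch_size: int,
    length: int,
    seed: int = 0,
) -> List[Tuple[str, int, int, int]]:
    """Draw `length` random (filename, xoffset, yoffset, patch_size) tuples.

    image_sizes maps filename -> (height, width) in pixels. Images smaller
    than the requested patch are skipped.
    """
    rng = random.Random(seed)
    candidates = [
        name
        for name, (height, width) in image_sizes.items()
        if height >= patch_size and width >= patch_size
    ]
    if not candidates:
        raise ValueError("no image is large enough for this patch size")
    indices = []
    for _ in range(length):
        name = rng.choice(candidates)
        height, width = image_sizes[name]
        # randint is inclusive, so the patch always stays inside the image.
        xoffset = rng.randint(0, width - patch_size)
        yoffset = rng.randint(0, height - patch_size)
        indices.append((name, xoffset, yoffset, patch_size))
    return indices


# Hypothetical large images; a patch-size ablation is just a loop.
sizes = {"large_image_1.png": (6800, 7200), "large_image_2.png": (1024, 1024)}
ablation = {ps: random_patch_indices(sizes, ps, length=50) for ps in (128, 256, 512)}
```

Each tuple would then drive a windowed read of the named file at training time.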
@adamjstewart -- do you have any more recent thoughts on this? I use these types of datasets somewhat frequently in my day-to-day work and would love to have them in TorchGeo, but also don't have a burning passion to merge it.
Overall, I think I would be okay with this approach. It's kind of like what other libraries did before TorchGeo existed. I'm very okay with this for datasets like GID-15 and LEVIR-CD where the images are large but there are no geocoordinates. I'm slightly okay with this for datasets like L7 Irish where the images are large and there are geocoordinates. Fancy is cool, but fast is better. Especially if you're already regularly using datasets like this. I want to make sure TorchGeo is actually useful to its intended audience, including its maintainers. If the maintainers aren't using the builtin datasets, we're doing something wrong.
Well, this did not age well...
I don't think this is possible. If we subclass from GeoDataset, we are declaring that the dataset can interoperate with all other GeoDatasets. TileDataset really does deserve its own base class and samplers. I'm not even sure if samplers are the right approach here. This requires a lot of rethinking. How do people normally use these kinds of datasets for ML? Are there other ideas from high-res imagery we can borrow from? Presumably we are not the only ones using large images.
We now support this, yay!
Note that I converted L7 Irish to an IntersectionDataset in 0.6.0. This significantly cuts down on the code (no more getitem). Just an aside, not really relevant to the discussion.
Note that this would fix all issues with ChesapeakeCVPR as ChesapeakeCVPR should be a tile dataset!