-
Notifications
You must be signed in to change notification settings - Fork 386
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GeoDataset: avoid unnecessary reprojection #409
Comments
Just to give my two cents as I've been following #278 closely and trying to figure out how to use
Ok with this, but instead of storing
Just a minor (but important) detail. I insist that the x and y in
Could you elaborate more on this? By 'mode', do you mean having a flag where the default is reproject to the R-tree index's CRS (as with torchgeo v0.2.0), and a Either way, I'm happy to help review and/or open a PR to implement this in torchgeo! |
Thanks for the feedback @weiji14!
A named tuple could work. Proper type hints weren't added until Python 3.7 but we could use
I actually proposed this exact thing to @calebrob6 because I was worried that we would end up with more distortion if we use a corner instead of a centroid. However, the problem is that you can run into off-by-one errors if the width/height is odd vs even. It also adds a bit of complexity to the code if you want to specify a bounding box that rasterio can actually read. Again, I'm hoping that most users won't need to think about
I mean "mode" in the mathematical sense. If I'm sampling from an |
Actually, that's for |
We might have a problem with this. I was hoping to contribute our intersection/union logic directly to rtree: Toblerity/rtree#210. However, it isn't clear what the behavior should be when computing an intersection for general rtree indices. For example, if an rtree stores general objects, the only possible way to compute an intersection would be to store the tuple of objects from both indices. So a dataset like |
Is torchgeo still supporting Python 3.6? Considering that Python 3.6 is EOL already and many Python scientific/geospatial packages are moving towards adopting NEP29 (e.g. rasterio/rasterio#1813, Toblerity/Fiona#1034) which is technically Python 3.8+, I really don't feel the need to do the backporting. I did find b6925f6 and the closed PR #326 though, so if there's a reason to have Python 3.6 support for a few more months, that's ok too.
Right, I'm thinking of large full scene satellite images several km wide that would definitely get distorted if we choose a corner vs a centroid. Perhaps raise a warning if the user sets an odd number width/height, saying that there will be some offset or resampling going on? Let's try to draft an implementation and see how it goes, like you say, most of this should be backend math calculations casual users shouldn't have to worry about.
Ah ok, that statistical 'mode'! Personally, I'd prefer the sampling to be more deterministic by having an explicit flag/parameter to control what CRS is being returned, and whether or not any reprojection should happen. To your point on tensor concatenation shape mismatches (mentioned in alternative point 2 #409 (comment)), I think this will become a non-issue with #294, so we would be left with the problem on how to sample from overlapping regions, in which a 'mode' based sampling or 'first-raster' based sampling approach would be ok. |
For the time being, yes. I would prefer to stick to Python's EOL timeline, but there are many systems that still use Python 3.6. I strongly disagree with numpy's aggressive deprecation timeline, as Google Colab still runs Python 3.7 by default. I think we'll likely compromise and stick to whatever versions of Python that PyTorch supports (currently 3.6–3.9 in the latest release, soon to be 3.7–3.9 in the next release). Unless I can convince @calebrob6 to drop 3.6 support since PyTorch has already dropped it in master.
This sounds like alternative 1 (our current behavior). If we do this, we require all samples to be projected to the same CRS, which leads to extreme distortion and slow data loading times. For example, consider a dataset composed of files from 4 different UTM zones. If a sample contains files in zones A and B, we want to return a sample in one of those zones. If another sample contains files in zones C and D, we don't want to warp both files to zone A or B, we want to choose from C and D in order to minimize reprojection.
This sounds like alternative 3. As I mentioned above, there are many issues with this. If I specify |
Yep, I think keeping Python 3.7 is definitely good since Pytorch still supports it. Let's continue this discussion at #413 since
Consider a semantic segmentation task (X -> Y) with:
I would say that the training input X should not be reprojected. The sampler should ideally take be taking 256x256 crops of the UTM zone 11N image and UTM zone 12N image, both in their native projections. The land use classification mask (which is of lower quality) in EPSG:3857 would then be reprojected to either UTM zone 11N or UTM zone 12N, depending on where the sampler is sampling from. The key is that the training image X should remain in the native UTM projection, because we will always be running inference on the native UTM projection of Sentinel-2 images (X), and not the projection of the mask file (EPSG:3857). My initial idea of doing this is as follows, note that this assumes pixel sampling mode (c.f. #294): ## Land use classification (LUC) dataset stored in EPSG:3857
mask_dataset = RasterDataset(filepath="landuseclassification.tif")
# mask_sample = {
# "image": torch.tensor(mask), # shape: (1, 256, 256)
# "crs": self.crs, # in EPSG:3857
# "bbox": query, # in EPSG:3857
# }
## Sentinel-2 image with 12 bands across UTM zone 11N and 12N
image_dataset11 = RasterDataset(filepath="S2_UTM_Zone_11N.tif")
image_dataset12 = RasterDataset(filepath="S2_UTM_Zone_12N.tif")
# image_sample = {
# "image": torch.tensor(image), # shape: (12, 256, 256)
# "crs": self.crs, # in UTM projection
# "bbox": query, # bounds in UTM projection
# }
## Pair Sentinel-2 dataset with LUC dataset
utm11N_mask_dataset = IntersectionGeoDataset(dataset1=image_dataset11, dataset2=mask_dataset)
utm12N_mask_dataset = IntersectionGeoDataset(dataset1=image_dataset12, dataset2=mask_dataset)
# image_mask_sample1 = {
# "image": torch.tensor(image_mask), # shape: (13, 256, 256)
# "crs": self.crs, # in UTM projection 11N
# "bbox": query, # bounds in UTM projection 11N
# }
# image_mask_sample2 = {
# "image": torch.tensor(image_mask), # shape: (13, 256, 256)
# "crs": self.crs, # in UTM projection 12N
# "bbox": query, # bounds in UTM projection 12N
# }
## Combine UTM Zone 11N and Zone 12N GeoDatasets
full_image_dataset = UnionGeoDataset(
dataset1=utmzone11N_mask_dataset, dataset2=utmzone12N_mask_dataset
)
# full_image_sample = {
# "image": torch.tensor(image), # shape: (13, 256, 256)
# "crs": self.crs, # in UTM projection 11N or 12N, depending on area
# "bbox": query, # bounds in what projection?? UTM 11N, 12N, or EPSG:3857?
# } So the problem is with the last Union operation - what projection should the Rtree index be in? UTM Zone 11N, 12N, or EPSG:3857? Maybe you're right and what I'm after might not be a job for IntersectionGeoDataset/UnionGeoDataset, but just a VisionDataset, or something else entirely like a ConcatGeoDataset or CombinedGeoDataset (for lack of a better term) that does not do reprojection at all (i.e. the user has to handle it manually). It did occur to me that there is an inherent limitation with the |
What if the training dataset X contains a file A in UTM zone 11N and a file B in UTM zone 12N and both A and B intersect with the sampler bounding box? Do you a) crash, or b) only return one of the two images? How do you decide which image to return? You do make a good point that we should avoid reprojecting float images and prefer reprojecting int label masks when possible, although it's going to be difficult to assert that preference in my proposed solution. |
How about taking the image where the sampled bounding box overlaps with most? Or if the box happens to be in the middle and the overlap areas are the same, I would just go with returning the first one in the case of RandomGeoSampler (since it is random anyway). Not sure how to handle it for GridGeoSampler though, probably should return both images, but not sure if that is possible looking at the code.
Taking a step back, I think we can both agree that implementation details 1, 2 and 3 are good enough, and can be done right now just with some code refactoring. Most of the debate appears to be on implementation detail 4 (how GeoSampler should behave), and given that there are probably a whole lot of combinations we haven't even though about, maybe we should come up with the simplest method that makes sense (e.g. taking the 'mode' projection, or projection of the 'first' GeoDataset), and make it easier somehow for people to subclass or adapt GeoSampler to fit their own specific problem. |
The problem is that implementation details 1, 2, and 3 are only necessary if we decide to go with this proposal. Alternative 1 (not changing anything, requiring the user to specify the CRS to reproject to) means not changing anything. Let's decide if this proposal is the behavior we want to achieve before we worry about making any changes. I'm still interested in hearing other alternative proposals, especially since you seem interested in alternatives 1 and 3. |
Well, I would argue that implementation details 1 and 2 is actually quite useful regardless of what happens. At the very least, having the CRS information in the Rtree index would allow me to create a custom Dataset/DataModule containing data from different projections more easily, instead of having to subclass RasterDataset and rewrite these 70 lines of logic: torchgeo/torchgeo/datasets/geo.py Lines 313 to 383 in fa6e4e1
|
When we make this switch, we should also switch from |
If anyone ever tries to implement this, we should use a dict, not a tuple. There are a lot of things we would like to optionally store in the R-tree that aren't always available but could be used by the sampler. For example, we could store percent cloud cover and have a sampler weighted by cloud cover. |
Sorry to resurrect this old issue, but re: Raster Vision: the above is incorrect. The RV approach (which has not changed since this issue was created) is to have one You can even take it a step further and stack bands from files with different projections. Needless to say, the labels can come from a source with different projection than the imagery as well. The concatenation is possible because RV Happy to provide more details about RV's approach, if needed. *An exception is if you create a |
Thanks for the clarification @AdeelH! If I understand correctly, RV never does reprojection unless you stack bands or combine images and masks from different CRS? I think this is similar to what I'm proposing here. |
Yes, if by reprojection you mean transforming the bounding box being read between pixel and map coordinates. E.g. when sampling from a
Each |
By reprojection, I mean using |
In that case, no, RV never does reprojection nor requires users to do it. |
If users want to combine images and masks in the same location but two different CRSs, who does the reprojection? The user? |
It's supposed to work the same way as when stacking bands from different sources that I described above. Since RV samples windows in pixel coords e.g. If the masks are coming from a vector data source (e.g. GeoJSON), the geometries are transformed to pixel coords in-memory, as shown here. Edit: Note: It's entirely possible that this approach is not fully general and does not work for all CRS combinations. If you can think of cases where it might fail, I'd be eager to know. |
If the images are in different coordinate systems, don't you end up with rotated/warped images like in Figure 2 of https://arxiv.org/abs/2111.08872? How do you even compute pixel coordinates if the images don't have the same bounds? |
Example noteboook showing stacking of bands from NAIP, Sentinel-2, and Landsat with geospatial alignment without reprojection: https://colab.research.google.com/drive/1trjhV2ohQN9AkJb889XuOymr-Z9F7YLt?usp=sharing |
This issue is meant to serve as a proposal for how to fix #278. Please try to poke holes in this proposal and find ways in which it won't work, or suggest alternative solutions.
Rationale
Geospatial data files are stored in a particular CRS. In order to build a single R-tree index of a geospatial dataset, the bounds of all files must be reprojected to the same CRS. During sampling, we also reproject the file itself to a common CRS so that we can stitch together multiple files from the same layer, or combine multiple layers in an
IntersectionDataset
.Unfortunately, reprojection/resampling is very slow, and can negatively affect data loader performance (see section 4.2 of https://arxiv.org/abs/2111.08872). It can also lead to large distortions of images in polar regions. We would like to be able to avoid reprojection if possible.
Proposal
I propose to only reproject files during sampling when absolutely necessary. This includes any situation in which we sample from multiple files at once and all files are not in the same CRS, such as:
GeoDataset
containing overlapping files on the border of two UTM zonesIntersectionDataset
with datasets in different CRSesUnionDataset
with overlapping files and datasets in different CRSesThe details of how this can be done is left to the "Implementation" section below, but the basic idea is to add an additional parameter for
__getitem__
that contains the CRS to project to. The sampler is then responsible for setting this CRS in a way that minimizes reprojection.There is one issue that arises with this implementation. If we choose our bounding box in one CRS and reproject it to a different CRS, the bounding box will be rotated, and the enclosing bounding box may be larger than the original. This can result in tensors of different size that cannot be concatenated without padding. The "Implementation" section describes how this problem can be avoided.
Implementation
The details of what will need to be changed are as follows:
GeoDataset.index
: instead of storing{bbox: filename}
, store{bbox: [(filename, CRS), ...]}
{Intersection|Union}Dataset.index
: instead of storing{bbox}
, store{bbox: [(filename, CRS), ...]}
GeoDataset.__getitem__
: instead of taking aBoundingBox(minx, maxx, miny, maxy, mint, maxt)
, accept aBoundingBox(x, y, t, width, height, time)
and CRS, then project (x, y) to that CRS before samplingGeoSampler
: look at the native CRS of all files in the R-tree and select the mode as the CRS to returnBy reprojecting only (x, y) instead of (minx, maxx, miny, maxy), we avoid issues with inconsistent tensor sizes during concatenation. The width/height can be applied directly in the new CRS without distortion.
Note that the R-tree index will still use
(minx, maxx, miny, maxy, mint, maxt)
as its index, while__getitem__
will take aBoundingBox(x, y, t, width, height, time)
. This is quite different from the previous approach that used the former for both.Alternatives
There have been a few other alternatives proposed, each with their own pros/cons:
ChesapeakeCVPR
. It is the fastest possible solution with the least distortion, but it disallows datasets with overlapping images in different CRS, so it's somewhere betweenVisionDataset
andGeoDataset
. It also suffers from the aforementioned bug with tensor concatenation shape mismatches, although we could employ the same solution as proposed here to fix that.@RitwikGupta
The text was updated successfully, but these errors were encountered: