Radiant MLHub Crop Type Datasets #512

Closed
wants to merge 6 commits into from

nilsleh
Collaborator

@nilsleh nilsleh commented Apr 21, 2022

This PR supersedes #511, because I found that Radiant MLHub has multiple crop type datasets that follow almost exactly the same format. Only the label.geojson files differ: they hold the crop type label under different keys. This PR adds four crop type segmentation datasets under one abstract class.

Dataset Format:

  • separate Sentinel-2 bands as TIF files, plus a cloud probability layer or a cloud mask (images in EPSG:32736)
  • stac.json files for each input image tile (bboxes in EPSG:4326)
  • GeoJSON files with polygon annotations and labels (polygon coordinates in EPSG:32736), except for the South Africa dataset, which uses GeoTIFF labels
  • stac.json for labels

These issues still persist across datasets:

  • I am not creating the correct dummy data, something is off with the bounds
  • there aren't many annotations, and the ones that exist are all very small (see below for an example), which makes me question whether I am messing up the indexing when creating a segmentation mask from the polygon annotations
  • there is also a design choice regarding the datetime included in the stac.json files: if this datetime is used to populate the index, then RandomGeoSampler will also sample along the time dimension, and a given bounding box query will not return all timesteps for a given geographical XY location, as one might expect
  • I do not get full code coverage
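To make the mask-indexing concern above concrete, here is a minimal pure-Python sketch of burning one polygon into a mask. It is not the code from this PR: the function names, the grid parameters, and the example ring are all hypothetical, and in practice a library routine such as rasterio.features.rasterize would do this work. The point it illustrates is that rows run downwards from the top edge (maxy) while columns run rightwards from minx, so swapping the two silently puts the labels in the wrong place.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(x: float, y: float, ring: Sequence[Point]) -> bool:
    """Even-odd ray casting test against a single exterior ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize_polygon(
    ring: Sequence[Point],
    label: int,
    minx: float,
    maxy: float,
    res: float,
    height: int,
    width: int,
) -> List[List[int]]:
    """Burn `label` into every pixel whose centre lies inside `ring`.

    The mask is indexed as mask[row][col]; row 0 is the top of the image.
    """
    mask = [[0] * width for _ in range(height)]
    for row in range(height):
        # y decreases as the row index increases (north-up raster).
        y = maxy - (row + 0.5) * res
        for col in range(width):
            x = minx + (col + 0.5) * res
            if point_in_polygon(x, y, ring):
                mask[row][col] = label
    return mask


# Hypothetical field: 30 m wide, 10 m tall, in the raster's own CRS.
ring = [(10.0, 80.0), (40.0, 80.0), (40.0, 90.0), (10.0, 90.0)]
mask = rasterize_polygon(ring, label=3, minx=0.0, maxy=100.0, res=10.0, height=10, width=10)
```

With this grid, the field covers a single row and three columns, so a transposed mask is easy to spot. The sketch assumes the polygon is already in the raster's CRS; the stac.json bounding boxes in EPSG:4326 would need reprojecting first.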

@github-actions github-actions bot added datasets Geospatial or benchmark datasets testing Continuous integration testing labels Apr 21, 2022
@nilsleh nilsleh changed the title Radiant crop type datasets Radiant MLHub Crop Type Datasets Apr 21, 2022
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Apr 26, 2022
@nilsleh
Collaborator Author

nilsleh commented Apr 29, 2022

After spending some time on this, I have a more general question about these types of datasets. Since, to my knowledge, there is not yet a GeoDataset that takes time-series rasters as input with a corresponding mask, I thought I would raise it here.

In what follows, I am assuming that the desired behavior for such a time-series raster dataset is a __getitem__ method that returns all time-series steps for a given geographical location. This is inspired by the CV4A_Crop_Type_Dataset, which returns all time-series steps for each label but is a VisionDataset and therefore does not deal with bounding boxes. For the datasets added in this PR, the relationship between label and input is one-to-many.

The following outlines the different approaches I have considered and the observations I have made:

  1. Populating the index from each individual time-series image makes all spatio-temporal information available to the sampler. However, when the sampler is used in the default way with the dataset's bounds, it samples not just XY coordinates but also the time dimension, so returned samples will not include all time-series steps for a specific region. This approach can also be slow: there can be many thousands of input images to go through, so instantiating the dataset takes a long time.
  2. To speed up instantiation, one could instead populate the index with the spatio-temporal information of the single label, although gathering all the time information might be trickier because it is not necessarily included in the label. This would still have the same "issue" as above: the sampler also samples the time dimension and does not return all time-series steps for each label.
  3. Another approach would be to ignore the time dimension altogether and set it as RasterDataset does, with mint: float = 0 and maxt: float = sys.maxsize. With the index populated that way, the sampler would return all time-series steps for each label, since the time range would be the same for everything. The downside is that a user who wants some control over the returned time dimension would have to apply it themselves after the sample or batch has already been returned.
  4. As an add-on to 3, one could add start_date and end_date parameters to the constructor and filter the files to that time range when a sample is gathered, without using the supplied date information in the index.
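The difference between approaches 1, 3 and 4 can be sketched with a toy index in plain Python. This is only an illustration under stated assumptions, not torchgeo code: Entry, build_index, query, and the tile_*.tif paths are hypothetical names, the index is a plain list rather than an R-tree, and the (minx, maxx, miny, maxy, mint, maxt) tuple simply mirrors the ordering of a spatio-temporal bounding box.

```python
import sys
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple

# (minx, maxx, miny, maxy, mint, maxt)
BBox = Tuple[float, float, float, float, float, float]


@dataclass(frozen=True)
class Entry:
    """One indexed image: file path, bounds, and acquisition date."""

    path: str
    bounds: BBox
    acquired: datetime


def intersects(a: BBox, b: BBox) -> bool:
    """True if two boxes overlap in x, y, and time."""
    return (
        a[0] <= b[1] and a[1] >= b[0]
        and a[2] <= b[3] and a[3] >= b[2]
        and a[4] <= b[5] and a[5] >= b[4]
    )


def build_index(dates: List[datetime], use_time: bool) -> List[Entry]:
    """Index one image per date over the same 100 x 100 footprint.

    use_time=True  -> approach 1: each entry carries its own timestamp.
    use_time=False -> approach 3: every entry spans 0 .. sys.maxsize.
    """
    index = []
    for i, d in enumerate(dates):
        if use_time:
            mint = maxt = d.timestamp()
        else:
            mint, maxt = 0.0, float(sys.maxsize)
        index.append(Entry(f"tile_{i}.tif", (0.0, 100.0, 0.0, 100.0, mint, maxt), d))
    return index


def query(
    index: List[Entry],
    bbox: BBox,
    start_date: Optional[datetime] = None,
    end_date: Optional[datetime] = None,
) -> List[Entry]:
    """Return intersecting entries; optionally filter by date (approach 4)."""
    hits = [e for e in index if intersects(e.bounds, bbox)]
    if start_date is not None:
        hits = [e for e in hits if e.acquired >= start_date]
    if end_date is not None:
        hits = [e for e in hits if e.acquired <= end_date]
    return sorted(hits, key=lambda e: e.acquired)


dates = [datetime(2021, 1, 1), datetime(2021, 2, 1), datetime(2021, 3, 1)]
# A query box whose time range covers only part of the series.
window = (10.0, 20.0, 10.0, 20.0,
          datetime(2021, 1, 15).timestamp(), datetime(2021, 2, 15).timestamp())

partial = query(build_index(dates, use_time=True), window)    # approach 1
full = query(build_index(dates, use_time=False), window)      # approach 3
recent = query(build_index(dates, use_time=False), window,
               start_date=datetime(2021, 2, 1))               # approach 4
```

Here the time-aware index returns only the February image for the sampled window, the time-agnostic index returns all three, and the date filter narrows the result without touching the index.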

Another observation is that not all labels cover the same time horizon. While some labels have, let's say, 40 corresponding images, others might have 70. Consider the case where a bounding box from the sampler covers a region that intersects two or more such labels. What is the proper way of merging the varying time dimensions of the rasters into one sample, in addition to merging the individual bands of each sample like RasterDataset does?
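One possible way to merge time series of different lengths, offered only as a sketch and not as an answer the thread settled on, is to align the stacks on the union of their acquisition dates and mark missing timesteps explicitly. The function name align_by_date and the string placeholders standing in for raster arrays are hypothetical.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple


def align_by_date(
    stacks: Dict[str, Dict[date, str]],
) -> Tuple[List[date], Dict[str, List[Optional[str]]]]:
    """Align per-label time series on the sorted union of their dates.

    `stacks` maps a label id to {acquisition date: raster}. Missing
    timesteps are filled with None so every series has the same length
    and a validity mask can be derived downstream.
    """
    all_dates = sorted(set().union(*(s.keys() for s in stacks.values())))
    aligned = {
        label: [series.get(d) for d in all_dates]
        for label, series in stacks.items()
    }
    return all_dates, aligned


d1, d2, d3, d4 = date(2021, 1, 1), date(2021, 1, 11), date(2021, 1, 21), date(2021, 1, 31)
stacks = {
    "label_a": {d1: "a1", d2: "a2", d3: "a3"},
    "label_b": {d2: "b2", d4: "b4"},
}
all_dates, aligned = align_by_date(stacks)
```

Aligning on the intersection of dates instead would avoid gaps but can discard most of the series when the labels share few acquisitions, so the choice depends on the model consuming the sample.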

Maybe I am also thinking about this wrong or missing something. Either way, I would welcome suggestions/comments.

@weiji14
Contributor

weiji14 commented Apr 29, 2022

Great points. Maybe move this #512 (comment) to a new issue. There's a lot to digest here, and I can see different use cases depending on time sensitivity which would require different indexing styles. E.g. time-sensitive flood mapping where you want to map 1 label mask: 1 time-slice, and landcover classifications where you could have 1 label mask: N time-slices (though land cover could change over longer periods of time).

@nilsleh nilsleh mentioned this pull request Apr 30, 2022
@nilsleh nilsleh marked this pull request as draft May 3, 2022 18:06
@adamjstewart adamjstewart added this to the 0.3.0 milestone May 5, 2022
@adamjstewart
Collaborator

Tried to answer some of your time-series handling questions in #640, happy to iterate on ideas for that.

Where is this PR at? Should I give it a full review, or is it still a WIP? Should we close #511 or is it better to merge that for now and continue hacking on this?

@nilsleh
Collaborator Author

nilsleh commented Jul 2, 2022

Yes, I think #511 can be closed. But this PR still needs some work, since all of the datasets are time series and three of them are also veeery label-sparse, so I am not sure I want to add them. However, depending on how we decide to handle time-series data, I will add the South Africa Crop Type dataset and then convert the CV4A_Crop_Type dataset to a GeoDataset.

@adamjstewart adamjstewart modified the milestones: 0.3.0, 0.4.0 Jul 9, 2022
@adamjstewart adamjstewart removed this from the 0.4.0 milestone Jan 24, 2023
@yichiac
Contributor

yichiac commented Feb 15, 2024

#1840 is adding South Africa Crop Type Competition

@adamjstewart
Collaborator

I think we can close this. Radiant MLHub doesn't even exist anymore. Most (not all) datasets were moved to Source Cooperative, but the file hierarchy and file formats are completely different, so most of these datasets would have to be rewritten from scratch anyway.

Labels
datasets Geospatial or benchmark datasets documentation Improvements or additions to documentation testing Continuous integration testing
4 participants