Radiant MLHub Crop Type Datasets #512

nilsleh · 2022-04-21T21:20:16Z

This PR supersedes #511, because I found that Radiant MLHub has multiple crop type datasets that follow almost exactly the same format. Only the label.geojson files differ: they store the crop type label under different keys. This PR adds the following four crop type segmentation datasets under one abstract class:

Dataset Format:

Separate Sentinel-2 bands as TIF files, plus a cloud probability layer or a cloud mask (images in EPSG:32736)
stac.json files for each input image tile (bounding boxes in EPSG:4326)
GeoJSON files with polygon annotations and labels (polygon coordinates in EPSG:32736), except for the South Africa dataset, which has GeoTIFF instead
stac.json files for the labels
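
Since only the label key differs between datasets, the shared logic can sit in a base class while each subclass states where its label lives. The following is a minimal sketch of that idea, not the code in this PR: the class names, the `parse_labels` method, and the keys `crop_id` and `Crop` are all made up for illustration.

```python
import json
from abc import ABC
from typing import Any


class CropTypeLabelReader(ABC):
    """Hypothetical base class: subclasses only declare the label key."""

    # Property key that stores the crop type in label.geojson (set by subclasses).
    label_key: str = ""

    def parse_labels(self, geojson_text: str) -> list[tuple[Any, dict]]:
        """Return (crop_label, geometry) pairs from a GeoJSON FeatureCollection."""
        collection = json.loads(geojson_text)
        return [
            (feature["properties"][self.label_key], feature["geometry"])
            for feature in collection["features"]
        ]


class DatasetA(CropTypeLabelReader):
    label_key = "crop_id"  # made-up key for illustration


class DatasetB(CropTypeLabelReader):
    label_key = "Crop"  # made-up key for illustration


# A tiny, made-up label file with a single polygon.
example = json.dumps(
    {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "properties": {"crop_id": 3},
                "geometry": {
                    "type": "Polygon",
                    "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]],
                },
            }
        ],
    }
)

labels = DatasetA().parse_labels(example)
```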

These issues still persist across datasets:

I am not creating the correct dummy data; something is off with the bounds.
There aren't many annotations, and the ones that exist are all very small (see below for example), which makes me question whether I am messing up the indexing when creating a segmentation mask from the polygon annotations.
There is also a design choice regarding the datetime included in the stac.json files: if this datetime is used to populate the index, then RandomGeoSampler will also sample along the time dimension, so a given bounding box query will not return all timesteps for a given geographical XY location, as one might expect.
I do not get full code coverage.
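
On the mask-indexing concern, one thing that may be worth checking is the conversion from projected coordinates to array indices. Whether this is the actual cause is not established here. The sketch below only shows the conversion for a north-up raster; the function name `world_to_pixel` and the coordinates are made up for illustration.

```python
import math


def world_to_pixel(
    x: float, y: float, minx: float, maxy: float, res: float
) -> tuple[int, int]:
    """Map projected coordinates to (row, col) for a north-up raster.

    Rows grow downward from the top edge (maxy);
    columns grow rightward from the left edge (minx).
    """
    col = math.floor((x - minx) / res)
    row = math.floor((maxy - y) / res)
    return row, col


# A 10 m tile whose top-left corner is at (500000, 9000000); values are made up.
row, col = world_to_pixel(
    500025.0, 8999985.0, minx=500000.0, maxy=9000000.0, res=10.0
)
```

Swapping `row` and `col`, or measuring rows from the bottom edge instead of the top, would place polygons in the wrong part of the mask.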

nilsleh · 2022-04-29T13:50:00Z

After spending some time on this, I have a more general question about these types of datasets. Since, to my knowledge, there is not yet a GeoDataset that includes time-series rasters as input and a corresponding mask, I thought I would raise it here.

I am assuming from here on that the desired behavior for such a time-series raster dataset is a __getitem__ method that returns all time-series steps for a given geographical location. This is inspired by the CV4A_Crop_Type_Dataset, which returns all time-series steps for each label but is a VisionDataset and therefore does not deal with bounding boxes. In the case of the datasets added in this PR, the relationship between label and input is one-to-many.

The following outlines the different approaches I have considered and the observations I have made:

1. Using the information from each individual time-series image allows one to populate the index so that all spatio-temporal information is available to the sampler. However, when using the sampler in the default way and passing the dataset's bounds to it, the sampler samples not just XY coordinates but also the time dimension, meaning that returned samples will not include all time-series steps for a specific region. Additionally, this approach can be slow: there can be many thousands of input images to go through to populate the index, so instantiating the dataset takes a long time.
2. To address the slow instantiation in 1, a faster option could be to populate the index with the spatiotemporal information from the single label, although it might be trickier to gather all the time information because it is not necessarily included in the label. However, this would lead to the same "issue" as above: the sampler would also sample the time dimension and not return all time-series steps for each label.
3. Another approach could be to ignore the time dimension altogether and set it as RasterDataset does, with mint: float = 0 and maxt: float = sys.maxsize. Populating the index that way means the sampler would return all time-series steps for each label, since the time extent would be the same for everything. The downside is that users who want some control over the returned time dimension would have to apply it themselves after the sample or batch is returned.
4. Another approach, which could be an add-on to 3, would be to add start_date and end_date parameters to the constructor and filter the files to this time range when a sample is gathered, without using the supplied date information in the index.
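
To make the difference between these approaches concrete, here is a small pure-Python sketch. It does not use the real index or sampler: `Box`, `query`, the file names, and the timestamps are simplified stand-ins invented for illustration. It compares the first approach (per-image timestamps in the index), the third (every entry spans all time), and the fourth (a date filter applied after the query).

```python
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Simplified stand-in for a spatio-temporal bounding box."""

    minx: float
    maxx: float
    miny: float
    maxy: float
    mint: float
    maxt: float

    def intersects(self, other: "Box") -> bool:
        return (
            self.minx <= other.maxx
            and self.maxx >= other.minx
            and self.miny <= other.maxy
            and self.maxy >= other.miny
            and self.mint <= other.maxt
            and self.maxt >= other.mint
        )


def query(index: list[tuple[Box, str]], box: Box) -> list[str]:
    """Return the names of all indexed files whose box intersects the query."""
    return [name for item, name in index if item.intersects(box)]


# Three acquisitions of the same tile at made-up timestamps.
files = {"image_100.tif": 100.0, "image_200.tif": 200.0, "image_300.tif": 300.0}

# First approach: each file is indexed with its own timestamp.
timed_index = [(Box(0, 10, 0, 10, t, t), name) for name, t in files.items()]

# Third approach: every file spans all of time, as with mint=0, maxt=sys.maxsize.
untimed_index = [(Box(0, 10, 0, 10, 0, sys.maxsize), name) for name in files]

# A query that was also sampled in time covers only part of the time range.
sampled = Box(2, 4, 2, 4, 150, 250)

partial = query(timed_index, sampled)  # misses two of the three timesteps
full = query(untimed_index, sampled)  # returns every timestep

# Fourth approach: filter by a date range after the query, outside the index.
start_date, end_date = 150.0, 300.0
windowed = [name for name in full if start_date <= files[name] <= end_date]
```

With the timed index, the query returns only the acquisition that falls inside the sampled time window. With the untimed index it returns all three, and the date filter then narrows them down without involving the index.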

Another observation is that not all labels span the same time horizon. While some labels have, let's say, 40 corresponding images, others might have 70. Consider the case where a bounding box from the sampler covers a region that intersects two or more such labels. What is the proper way to merge the varying time dimensions of the rasters into one sample, in addition to merging the individual bands of each sample as RasterDataset does?
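
One possible option, offered only as an illustration since this question is still open, is to pad every series to the longest length and carry a validity mask alongside it. The function name `pad_series` and the toy values below are hypothetical; each number stands in for a whole image.

```python
def pad_series(
    series: list[list[float]], fill: float = 0.0
) -> tuple[list[list[float]], list[list[bool]]]:
    """Pad each time series to the longest length and return a validity mask."""
    max_len = max(len(s) for s in series)
    padded = [s + [fill] * (max_len - len(s)) for s in series]
    mask = [[True] * len(s) + [False] * (max_len - len(s)) for s in series]
    return padded, mask


# Two labels with 2 and 3 timesteps respectively.
padded, mask = pad_series([[1.0, 2.0], [3.0, 4.0, 5.0]])
```

This keeps all timesteps but ignores whether the acquisition dates actually line up, which a real solution would probably need to consider.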

Maybe I am also thinking about this wrong or missing something. Either way, I would welcome suggestions/comments.

weiji14 · 2022-04-29T14:32:43Z

Great points. Maybe move this #512 (comment) to a new issue. There's a lot to digest here, and I can see different use cases depending on time sensitivity which would require different indexing styles. E.g. time-sensitive flood mapping where you want to map 1 label mask: 1 time-slice, and landcover classifications where you could have 1 label mask: N time-slices (though land cover could change over longer periods of time).

adamjstewart · 2022-07-01T20:38:08Z

Tried to answer some of your time-series handling questions in #640, happy to iterate on ideas for that.

Where is this PR at? Should I give it a full review, or is it still a WIP? Should we close #511 or is it better to merge that for now and continue hacking on this?

nilsleh · 2022-07-02T07:56:39Z

Yes, I think #511 can be closed. But this PR still needs some work, since the datasets are all time-series and, additionally, three of them are veeery label sparse, so I am not sure I want to add them. However, depending on how we decide to handle time-series data, I will add the South Africa Crop Type dataset and then convert the CV4A_Crop_Type dataset to a GeoDataset.

yichiac · 2024-02-15T13:22:15Z

#1840 is adding South Africa Crop Type Competition

adamjstewart · 2024-08-06T12:10:39Z

I think we can close this. Radiant MLHub doesn't even exist anymore. Most (not all) datasets were moved to Source Cooperative, but the file hierarchy and file formats are completely different, so most of these datasets would have to be rewritten from scratch anyway.

radiant crop type datasets

d42b919

github-actions bot added the datasets (Geospatial or benchmark datasets) and testing (Continuous integration testing) labels Apr 21, 2022

nilsleh changed the title ~~Radiant crop type datasets~~ Radiant MLHub Crop Type Datasets Apr 21, 2022

nilsleh added 4 commits April 22, 2022 10:40

increase test coverage

f717ffe

correct md5

b51d3dc

read first crs from image band and not label

dafd35c

add and adapt south africa data set

859a3be

github-actions bot added the documentation (Improvements or additions to documentation) label Apr 26, 2022

nilsleh mentioned this pull request Apr 30, 2022

Time-Series GeoDatasets #518

Closed

other approach

5cc5670

nilsleh marked this pull request as draft May 3, 2022 18:06

adamjstewart added this to the 0.3.0 milestone May 5, 2022

adamjstewart mentioned this pull request Jul 2, 2022

Add Great African Food Company Crop Type Tanzania Dataset #511

Closed

3 tasks

adamjstewart modified the milestones: 0.3.0, 0.4.0 Jul 9, 2022

adamjstewart removed this from the 0.4.0 milestone Jan 24, 2023

adamjstewart closed this Aug 6, 2024
