Time-Series GeoDatasets #640

nilsleh · 2022-04-30T07:50:27Z

nilsleh
Apr 30, 2022
Maintainer

As suggested, the following comment from #512 has been moved to its own issue:

After spending some time on the CropType datasets in #512, I have a more general question about these types of time-series raster datasets. Since, to my knowledge, there is not yet a GeoDataset that includes time-series rasters as input and a corresponding mask, I thought I would raise it here.

I am hereafter assuming that the desired behavior for such time-series raster datasets is a __getitem__ method that returns all time-series steps for a given geographical location. This is inspired by the CV4A_Crop_Type_Dataset, which returns all time-series steps for each label but is a VisionDataset and therefore does not deal with bounding boxes. In the case of the datasets added in this PR, the relationship between label and input is one-to-many. However, it was already pointed out that different geospatial datasets might require different behavior.

The following outlines the different approaches and observations I have made:

1. Using the information from each individual time-series image allows one to populate the index so that all spatiotemporal information is available to the sampler. However, when the sampler is used in the default way and passed the dataset's bounds, it samples not just XY coordinates but also the time dimension, meaning that returned samples will not include all time-series steps for a specific region. Additionally, this approach can be slow: there can be many thousands of input images to go through to populate the index, so it takes a long time to instantiate the dataset.
2. In response to the last point, a faster way to instantiate the dataset could be to populate the index with the spatiotemporal information from the single label, although it might be trickier to gather all the time information because it is not necessarily included in the label. However, this would yield the same "issue" as above: the sampler will also sample the time dimension and not return all time-series steps for each label.
3. Another approach could be to ignore the time dimension altogether and just set it as is done in RasterDataset, with mint: float = 0 and maxt: float = sys.maxsize. The index is populated that way, and the sampler then returns all time-series steps for each label, since the time dimension is the same for everything. The downside is that if users want some control over the time dimension that is returned, they have to handle it themselves after the sample or batch has already been returned.
4. Another approach, which could be an add-on to 3, would be to add start_date and end_date parameters to the constructor and, when a sample is gathered, filter the files so that they fall within this time range, without using the supplied date information in the index.
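
To make the difference between these approaches concrete, here is a minimal sketch using a toy in-memory index. The Entry class and query function are hypothetical stand-ins written for illustration, not torchgeo's actual index or sampler API, and timestamps are plain numbers.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entry:
    """One raster file in a toy spatiotemporal index (hypothetical)."""

    path: str
    minx: float
    maxx: float
    miny: float
    maxy: float
    mint: float  # acquisition start, e.g. a POSIX timestamp
    maxt: float  # acquisition end


def query(
    index: List[Entry],
    minx: float,
    maxx: float,
    miny: float,
    maxy: float,
    mint: float = float("-inf"),
    maxt: float = float("inf"),
) -> List[Entry]:
    """Return all entries intersecting the space-time box, ordered by time."""
    hits = [
        e
        for e in index
        if e.minx <= maxx and e.maxx >= minx
        and e.miny <= maxy and e.maxy >= miny
        and e.mint <= maxt and e.maxt >= mint
    ]
    return sorted(hits, key=lambda e: e.mint)


# Four acquisitions of the same tile at different times.
index = [Entry(f"scene_{t}.tif", 0, 10, 0, 10, t, t + 1) for t in (0, 10, 20, 30)]

# Approaches 1 and 2 with a default sampler: the time slice is sampled as well,
# so only part of the series comes back.
partial = query(index, 2, 4, 2, 4, mint=9, maxt=12)

# Approach 3: the time dimension is left unbounded, so every step comes back.
full = query(index, 2, 4, 2, 4)

# Approach 4: an explicit start/end range restricts the returned steps.
windowed = query(index, 2, 4, 2, 4, mint=10, maxt=25)
```

In this toy setup the only difference between the approaches is which time bounds reach the query, which is why the choice could live in either the dataset or the sampler.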

Another observation is that not all labels span the same time horizon. While some labels have, let's say, 40 corresponding images, others might have 70. Consider the case where a bounding box from the sampler covers a region that intersects two or more such labels. What is the proper way to merge the varying time dimensions of the rasters into one sample, in addition to merging the individual bands of each sample as RasterDataset does?
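
For illustration only, one possible option (not something settled in this thread) is to align all series onto the union of their timestamps and carry a validity mask for the missing steps. The sketch below uses single float values as stand-ins for raster patches; align_series and its fill parameter are hypothetical names.

```python
from typing import Dict, List, Tuple


def align_series(
    series: Dict[str, Dict[int, float]], fill: float = 0.0
) -> Tuple[List[int], Dict[str, List[float]], Dict[str, List[bool]]]:
    """Align per-label series onto the union of their timestamps.

    Missing steps are filled with `fill` and flagged False in the mask.
    """
    timeline = sorted({t for steps in series.values() for t in steps})
    values = {k: [steps.get(t, fill) for t in timeline] for k, steps in series.items()}
    valid = {k: [t in steps for t in timeline] for k, steps in series.items()}
    return timeline, values, valid


# Two labels observed at different, partially overlapping time steps.
series = {
    "label_a": {1: 0.2, 2: 0.3, 4: 0.5},
    "label_b": {2: 0.7, 3: 0.8},
}
timeline, values, valid = align_series(series)
```

Whether padding with a mask, truncating to the common steps, or resampling is the right choice likely depends on the downstream model.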

Maybe I am also thinking about this wrong or missing something. Either way, I would welcome suggestions/comments.

adamjstewart · 2022-07-01T20:33:19Z

adamjstewart
Jul 1, 2022
Maintainer

Sorry for taking so long to respond to this! Grad school/internships have kept me busy.

I think it's important to think about this not just from the perspective of curated benchmark datasets like CV4A Crop Type Dataset but also from the perspective of uncurated collections of geospatial data (e.g., decades of Landsat and CDL data). The latter is obviously a harder challenge, so if we can solve the latter, the former should be solved for free.

Specific comments on your 4 proposals with the above in mind:

1. I believe this is the right approach. We want to store this spatiotemporal info in the rtree so the sampler can make use of it. As you've noted, the default samplers don't make use of this in the way we want, so we'll need to implement new samplers for this. Some suggestions for common use cases are below.
2. This only works for datasets with both images and labels. For uncurated datasets, there may be multiple ways you might want to combine time-series images and labels. For example, you might want to use an entire year's worth of Landsat to make a single class prediction, or you may only want to use 2 Landsat images to do change detection.
3. This doesn't work for images with irregularly spaced timestamps, which is common with Landsat images with partial overlap.
4. Not a bad idea, but something we'll probably want to integrate into the sampler, not the dataset.

Let's consider an example use case. A user has a decade's worth of Landsat imagery downloaded and CDL data for each year in the same time range. There are multiple possible ways in which they may want to use this data:

Use all Landsat data for each year to predict CDL data for each year
Use only Landsat data during the growing season (spring to fall, or spring to early summer) to predict CDL
Compute the average of Landsat scenes to make a single prediction
Use historical CDL data and current Landsat data to predict location of different crops (maybe the same field always grows corn, or use predictable crop rotation)
Use pairs of Landsat images to try to detect when each field is harvested and how that corresponds to the crop type (change detection)
Only look at winter imagery to try to detect cover crops used in the offseason to add nutrients back to the soil
Look at the average image for a specific week or month for multiple years and make predictions of how different an image is from the typical mean image at that time of year

As you can see, these use cases are way too complex to handle with a single sampler. Some are more common than others, so we can focus on the common use cases and try to make things useful while still generic enough to handle many use cases. For example, I can imagine something like:

CyclicGeoSampler: specify a time range (defaults to first and last image in index) and a cycle (daily, annual, etc.) and return patches of all images/labels during each cycle
ForecastingGeoSampler: specify a time range, and a duration for training and testing (e.g., train on one month of data, predict next week), then return pairs of inputs in an ordered sequential fashion
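
As a rough sketch of the temporal logic such samplers might need, the two generators below produce only the time windows; the spatial sampling is left out. The function names are hypothetical, neither sampler exists in this form, and fixed-length timedelta cycles are an assumption (calendar-aware cycles such as "annual" would need extra care for leap years).

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple

Window = Tuple[datetime, datetime]


def cyclic_windows(start: datetime, end: datetime, cycle: timedelta) -> Iterator[Window]:
    """Yield consecutive windows of length `cycle` covering [start, end)."""
    t = start
    while t < end:
        # The last window is clipped so it never extends past `end`.
        yield t, min(t + cycle, end)
        t += cycle


def forecasting_windows(
    start: datetime, end: datetime, train: timedelta, test: timedelta
) -> Iterator[Tuple[Window, Window]]:
    """Yield ordered (train_window, test_window) pairs, sliding forward by `test`."""
    t = start
    while t + train + test <= end:
        yield (t, t + train), (t + train, t + train + test)
        t += test


start, end = datetime(2020, 1, 1), datetime(2020, 1, 29)

# Weekly cycles over four weeks of data.
cycles = list(cyclic_windows(start, end, timedelta(days=7)))

# Train on two weeks, predict the following week.
pairs = list(forecasting_windows(start, end, timedelta(days=14), timedelta(days=7)))
```

A sampler built on top of this would combine each time window with a spatial patch to form the query passed to the dataset.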

We can keep adding to this list as we think of things that people might want to do. Point is, I think the only way to handle these complicated use cases is to consider it from the perspective of uncurated datasets. We should store all spatiotemporal info in the index and then let the sampler handle the complexity of deciding when and where to sample from. This allows us to do all of the above ideas with the same dataset implementation just by swapping out different samplers.

Let me know if this makes sense!

1 reply

nilsleh Oct 7, 2022
Maintainer Author

Sorry that it took me so long to respond; I had forgotten about this, but I would like to pick it up again. I had not thought about new GeoSamplers doing much of the work, but it makes a lot of sense. The GeoDataset class does have a time dimension when populating the index, so that seems sufficient, although I suppose it has not been tested with time-series indexing by a sampler? So maybe the first step would be to implement the ForecastingGeoSampler and CyclicGeoSampler and test their use with some existing uncurated datasets?

hfangcat · 2023-12-14T20:50:58Z

hfangcat
Dec 14, 2023

I would be happy to hear if there are any updates on this topic! I have been working on multi-temporal change detection for a while, and I have found that there are still few datasets/models available, which makes research in this area a pain...


sfalkena · 2024-09-18T13:26:31Z

sfalkena
Sep 18, 2024

Hi, I am interested in using SITS (satellite image time series) datasets as well and am happy to contribute my input here. I have an initial version of this on my end that follows approach 1, and it seems to work well for what I need. Let me sketch the outline here:

I am populating the index with all geospatial hits.
GeoDatasets get an extra parameter called return_as_ts. If it is set to true, the __getitem__ method returns an extra dimension for time.
The sampler reads the return_as_ts parameter of the dataset. If it is false, it samples as normal. If it is true, the min_t and max_t from the dataset are used by default.
Since the dataset index is now populated with all files, the sampler filters the hits based on their geospatial location, so that the hits across time do not get resampled (related to "GridGeoSampler resamples same image repeatedly with separate_files and multiple dates", #2221).
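
The filtering step in the outline above could look roughly like the following sketch. It is not the actual implementation: hits are plain tuples, group_by_footprint is a hypothetical helper, and it assumes that acquisitions of the same tile share exactly the same bounds, which partially overlapping scenes would violate.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (minx, maxx, miny, maxy, timestamp, path): a hypothetical stand-in for an index hit.
Hit = Tuple[float, float, float, float, float, str]
Bounds = Tuple[float, float, float, float]


def group_by_footprint(hits: List[Hit]) -> Dict[Bounds, List[Hit]]:
    """Group hits by spatial bounds so each location is sampled once, not once per date."""
    groups: Dict[Bounds, List[Hit]] = defaultdict(list)
    for hit in hits:
        groups[hit[:4]].append(hit)
    # Within each location, order the hits along the time dimension.
    return {bounds: sorted(group, key=lambda h: h[4]) for bounds, group in groups.items()}


hits = [
    (0, 10, 0, 10, 1.0, "a_t1.tif"),
    (0, 10, 0, 10, 0.0, "a_t0.tif"),
    (10, 20, 0, 10, 0.0, "b_t0.tif"),
    (0, 10, 0, 10, 2.0, "a_t2.tif"),
]
groups = group_by_footprint(hits)

# With return_as_ts enabled, a sample for tile "a" would stack these files along time.
series_a = [path for *_, path in groups[(0, 10, 0, 10)]]
```

A sampler iterating over the keys of groups visits two locations here instead of four files.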

@adamjstewart regarding the way the sampler gets parts of the temporal range of the dataset: wouldn't your CyclicGeoSampler and ForecastingGeoSampler just be combining ROIs along the temporal dimension? Just as we have roi_split for spatially discontinuous data, we could use it for temporally discontinuous data too?
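
To illustrate what combining ROIs along the temporal dimension might mean, here is a minimal sketch that selects index entries by a list of possibly discontinuous time intervals. The helper temporal_roi_split is hypothetical and only loosely analogous to the spatial roi_split mentioned above; it is not torchgeo's function. Intervals are treated as half-open so that adjacent ROIs do not share entries on their boundary.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # half-open [lo, hi)
Entry = Tuple[float, float, str]  # (mint, maxt, path)


def temporal_roi_split(entries: List[Entry], rois: List[List[Interval]]) -> List[List[str]]:
    """For each temporal ROI, return the paths of entries overlapping any of its intervals."""
    splits = []
    for intervals in rois:
        selected = [
            path
            for mint, maxt, path in entries
            if any(mint < hi and maxt > lo for lo, hi in intervals)
        ]
        splits.append(selected)
    return splits


# Two years of imagery, timestamps in days since the start of the first year.
entries = [(t, t + 1, f"img_{t}.tif") for t in (0, 100, 200, 365, 465, 565)]

# First part of each year versus the remainder of each year.
early = [(0, 150), (365, 515)]
late = [(150, 365), (515, 730)]
early_paths, late_paths = temporal_roi_split(entries, [early, late])
```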

Should I maybe just open a PR so you guys can have a look and we just get started somewhere? I feel it is better to at least have partial support than trying to cover all aspects right from the start.

