Add spatial stratification algorithm for splitting datasets into training and testing #433

Open
RaczeQ opened this issue Apr 12, 2024 · 4 comments

RaczeQ commented Apr 12, 2024

Add an algorithm for splitting the dataset based on spatial location instead of random sampling.

sabman commented Apr 14, 2024

@RaczeQ Thanks for creating this issue. I'd like to see if I can contribute. I assume this refers to the training loop for the embedding models? If so, could you also point to the code module where it would be used? I'm guessing it's this one:

def _get_random_negative_df_loc(self, input_df_loc: int) -> int:

RaczeQ commented Apr 21, 2024

Hello @sabman, thank you for showing interest in expanding the library 😊

I created this issue specifically with end tasks in mind, and I was planning to leave the training of the embedding models (hex2vec, geovex, etc.) unchanged: those will still be fitted on the whole provided dataset.

However, now that you've mentioned it, I can see a potential use case in combination with an existing embedder, purely for benchmarking purposes:

  1. Prepare regions / features geodataframes.
  2. Split them into training and validation data.
  3. Train embedder on training data.
  4. Transform validation data (with both encoder and decoder) and calculate the loss between the decoded and original values.
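
Steps 3 and 4 could look roughly like the sketch below. It assumes the split from step 2 is already done and uses a hypothetical `ToyMeanEmbedder` as a stand-in for a real embedder; neither that class nor `reconstruction_mse` is part of srai, and the rows are made-up feature vectors.

```python
from statistics import fmean


class ToyMeanEmbedder:
    """Hypothetical stand-in for an embedder with an encoder and a decoder.

    It "embeds" a row as its first `dim` features and decodes by padding the
    remaining features with their training-set means. Not part of srai.
    """

    def __init__(self, dim):
        self.dim = dim
        self.feature_means = []

    def fit(self, rows):
        # Learn per-feature means from the training rows only.
        self.feature_means = [fmean(column) for column in zip(*rows)]
        return self

    def encode(self, row):
        return list(row[: self.dim])

    def decode(self, code):
        return list(code) + self.feature_means[self.dim :]


def reconstruction_mse(embedder, rows):
    """Mean squared error between rows and their encode -> decode round trip."""
    errors = [
        (original - decoded) ** 2
        for row in rows
        for original, decoded in zip(row, embedder.decode(embedder.encode(row)))
    ]
    return fmean(errors)


# Made-up feature vectors standing in for the already split data.
train_rows = [[1.0, 2.0, 3.0], [3.0, 2.0, 5.0]]
validation_rows = [[2.0, 2.0, 6.0]]

embedder = ToyMeanEmbedder(dim=2).fit(train_rows)
validation_loss = reconstruction_mse(embedder, validation_rows)
```

The point is only that the embedder never sees the validation rows during fitting, so the loss measures how well it generalises to held-out regions.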

Currently we don't have any specific examples with downstream tasks in the documentation, but there is one in our dedicated tutorial repository (https://github.com/kraina-ai/srai-tutorial).
I think of this functionality as a future utility that takes a given geodataframe and assigns a stratification class based on geometry (or, in a more sophisticated scenario, based on a class column AND geometry).
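
To make the geometry-only case concrete, here is a minimal sketch of one possible approach: bucket points into square grid cells and assign whole cells to classes, so that neighbouring points end up in the same class. It works on plain `(x, y)` tuples (for example geometry centroids in planar coordinates) rather than a GeoDataFrame, and the function name, the grid approach and the greedy assignment are all assumptions for illustration, not an agreed design.

```python
import math
from collections import Counter


def assign_spatial_classes(points, cell_size, fractions):
    """Assign a class index to every point so that each grid cell has one class.

    points: list of (x, y) tuples, e.g. geometry centroids (assumed planar).
    cell_size: side length of a square grid cell, in the units of the points.
    fractions: target share of points per class, expected to sum to 1.
    """
    # Bucket every point into a square grid cell.
    cells = [(math.floor(x / cell_size), math.floor(y / cell_size)) for x, y in points]
    cell_counts = Counter(cells)

    targets = [fraction * len(points) for fraction in fractions]
    filled = [0] * len(fractions)
    cell_to_class = {}

    # Greedy assignment: place the most populated cells first, each into the
    # class that is currently furthest below its target size.
    ordered_cells = sorted(cell_counts.items(), key=lambda item: (-item[1], item[0]))
    for cell, count in ordered_cells:
        best = max(range(len(fractions)), key=lambda i: targets[i] - filled[i])
        cell_to_class[cell] = best
        filled[best] += count

    return [cell_to_class[cell] for cell in cells]


points = [
    (0.1, 0.1), (0.2, 0.5), (0.9, 0.9), (0.5, 0.5), (0.3, 0.7), (0.8, 0.2),
    (5.1, 5.2), (5.9, 5.5),
    (9.5, 0.5), (9.1, 0.9),
]
labels = assign_spatial_classes(points, cell_size=1.0, fractions=[0.6, 0.4])
```

Because whole cells are assigned at once, the resulting class sizes only approximate the requested fractions; a smaller `cell_size` gives a closer match at the cost of less spatial separation between classes.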

My previous comment lists the materials I've gathered on this topic. If there is a good off-the-shelf solution for this use case, we can just add it as a dependency and wrap it within the srai API. If you have more ideas, examples, or sources, I'd be grateful if you shared them 🙇🏻.

RaczeQ commented Apr 21, 2024

# just pseudo-coding here
from typing import Optional

import pandas as pd
from geopandas import GeoDataFrame


def spatial_stratification(
    regions_gdf: GeoDataFrame,
    no_output_classes: int = 2,
    split_values: Optional[list[float]] = None,
    class_column: Optional[str] = None,
) -> pd.Series:
    """
    Generate a Pandas Series of stratification class values, indexed like the provided GeoDataFrame.

    Args:
        regions_gdf (GeoDataFrame): The regions that are being stratified.
        no_output_classes (int, optional): How many classes should be in the result series.
            Defaults to 2.
        split_values (Optional[list[float]], optional): The fractions between classes. When not
            provided, rows will be stratified equally. Defaults to None.
        class_column (Optional[str], optional): Name of the column to additionally take into
            consideration when stratifying geometries. Defaults to None.
    """
    if no_output_classes < 1:
        raise ValueError("Number of output classes should be positive.")

    if not split_values:
        # Equal share per class when no fractions are given.
        split_values = [1 / no_output_classes for _ in range(no_output_classes)]
    elif len(split_values) != no_output_classes:
        # Assumed check: one fraction is expected per output class.
        raise ValueError("Number of split values should match the number of output classes.")

    total = sum(split_values)
    normalized_split_values = [
        split_value / total for split_value in split_values
    ]  # normalize to 1
    ...
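
As a side note on the split values: the weights eventually have to become whole numbers of rows per class. The sketch below shows one way to do that with the largest remainder method, so the counts always sum to the number of rows. The helper name `split_counts` is hypothetical and this is not meant as the elided part of the function above.

```python
import math


def split_counts(n_items, split_values):
    """Turn split weights into integer per-class counts that sum to n_items."""
    if n_items < 0:
        raise ValueError("n_items must be non-negative.")
    if not split_values or any(value < 0 for value in split_values):
        raise ValueError("split_values must be a non-empty list of non-negative numbers.")
    total = sum(split_values)
    if total <= 0:
        raise ValueError("split_values must have a positive sum.")

    # Exact (fractional) share of items for each class.
    exact = [n_items * value / total for value in split_values]
    counts = [math.floor(share) for share in exact]

    # Give the leftover items to the classes with the largest fractional remainders.
    leftover = n_items - sum(counts)
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts
```

For example, `split_counts(7, [3, 1])` gives 5 rows to the first class and 2 to the second, since the second class has the larger fractional remainder (1.75 versus 5.25).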
