Add spatial stratification algorithm for splitting datasets into training and testing #433
Comments
@RaczeQ Thanks for creating this issue. I'd like to see if I can contribute. I am assuming this is in reference to the training loop for the embedding models? If so, could you also point to the code module where this might be used? I am guessing it's this
Hello @sabman, thank you for showing interest in expanding the library 😊 I created this issue specifically with end-tasks in mind, and I was planning on leaving the training of the embedding models (hex2vec, geovex, etc.) unchanged: those will still be fitted on the whole provided dataset. However, now that you've mentioned it, I can see a potential use case in combination with an existing embedder, just for benchmarking purposes:
Currently we don't have any specific examples with downstream tasks in the documentation, but there is one in our dedicated tutorial repository (https://github.com/kraina-ai/srai-tutorial). My previous comment lists the materials I've gathered on this topic; if there is a good off-the-shelf solution for this use case, we can just add it as a dependency and wrap it within
# just pseudo-coding here
from typing import Optional

import pandas as pd
from geopandas import GeoDataFrame


def spatial_stratification(
    regions_gdf: GeoDataFrame,
    no_output_classes: int = 2,
    split_values: Optional[list[float]] = None,
    class_column: Optional[str] = None,
) -> pd.Series:
    """
    Generates a Pandas Series with stratification class value and an index from provided GeoDataFrame.

    Args:
        regions_gdf (GeoDataFrame): The regions that are being stratified.
        no_output_classes (int, optional): How many classes should be in the result series.
            Defaults to 2.
        split_values (Optional[list[float]], optional): The fraction between classes. When not provided,
            rows will be stratified equally. Defaults to None.
        class_column (Optional[str], optional): Name of the column used to additionally take into
            consideration when stratifying geometries. Defaults to None.
    """
    if no_output_classes < 1:
        raise ValueError("Number of output classes should be positive.")

    if not split_values:
        # No fractions given: split equally between all classes.
        split_values = [1 / no_output_classes for _ in range(no_output_classes)]
    elif len(split_values) != no_output_classes:
        # Guard added for consistency between the two arguments.
        raise ValueError("Length of split_values should match no_output_classes.")

    total = sum(split_values)
    normalized_split_values = [
        split_value / total for split_value in split_values
    ]  # normalize to 1

    ...
Add an algorithm for splitting the dataset based on spatial location instead of random sampling.
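To make the difference from random sampling concrete, here is a minimal sketch of one common approach, a grid-based block split: points are grouped into square cells, and whole cells are assigned to the test set until the requested fraction is reached, so neighbouring points stay on the same side of the split. This is only an illustration and not the library's API. The function name `spatial_block_split`, its parameters, and the use of plain `(x, y)` tuples (for example region centroids in a projected coordinate system) are all assumptions made for this example.

```python
import math
import random
from collections import defaultdict


def spatial_block_split(points, cell_size, test_fraction=0.2, seed=0):
    """Label each point "train" or "test", keeping whole grid cells together.

    Hypothetical helper for illustration only.
    points: list of (x, y) tuples, e.g. centroids of regions.
    cell_size: side length of the square grid cells, in the same units as points.
    """
    if cell_size <= 0:
        raise ValueError("cell_size should be positive.")
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction should be between 0 and 1.")

    # Group point indices by the grid cell that contains them.
    blocks = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        blocks[cell].append(idx)

    # Shuffle the cells (not the points) in a reproducible way.
    cells = sorted(blocks)
    random.Random(seed).shuffle(cells)

    # Move whole cells to the test set until the target size is reached.
    target = test_fraction * len(points)
    labels = ["train"] * len(points)
    n_test = 0
    for cell in cells:
        if n_test >= target:
            break
        for idx in blocks[cell]:
            labels[idx] = "test"
        n_test += len(blocks[cell])
    return labels
```

Because whole cells are moved at once, the resulting test share can overshoot the requested fraction when cells are large or unevenly populated. Handling a class column or more than two output classes, as in the pseudocode above, would need additional balancing logic that this sketch leaves out.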