example or documentation for StratifiedKFold use #370

cpoerschke · 2023-06-08T15:50:04Z

Is your feature request related to a problem? Please describe.

Using plain StratifiedKFold, e.g. in a GridSearchCV, conceptually might not work because the event-versus-no-event class cannot be determined from the structured y that is passed to fit. In practice this can surface as a not very obvious exception: ValueError: n_splits=5 cannot be greater than the number of members in each class.

Describe the solution you'd like

Information or an example in the documentation somewhere on how StratifiedKFold can be used would be great.

Describe alternatives you've considered

An example would be a nice-to-have, i.e. the alternative is that users figure out an approach on their own.

References and existing implementations

none and/or unknown

Code snippets

Something like this illustrates the issue and one possible approach.

import numpy as np
import sklearn.model_selection
import sklearn.tree
import sksurv.tree
import sksurv.util

n = 100

feature1 = np.arange(0, n, 1)
feature2 = n - feature1

times = np.arange(0,n) + 1
events = (times <= n/10)

X = np.vstack((feature1, feature2)).T
y = sksurv.util.Surv.from_arrays(time=times, event=events)


def train_test_generator(X, y, stratified=None, n_splits=2):
    if stratified is not None:
        return sklearn.model_selection.StratifiedKFold(n_splits=n_splits).split(X, y[stratified])
    else:
        return sklearn.model_selection.KFold(n_splits=n_splits).split(X)


for stratified in [None, "event"]:
    print(f"stratified={stratified}")
    for i, (train_index, test_index) in enumerate(train_test_generator(X, y, stratified=stratified)):
        print(f'split {i+1}: {y[train_index]["event"].sum()} + {y[test_index]["event"].sum()} = {y["event"].sum()}')

        
# this does not work: StratifiedKFold receives the structured y and cannot tell the event class apart
# gcv = sklearn.model_selection.GridSearchCV(estimator=sksurv.tree.SurvivalTree(random_state=0),
#                                            param_grid={ "min_samples_leaf": [3, 6, 9] },
#                                            cv=sklearn.model_selection.StratifiedKFold(n_splits=5))
# gcv.fit(X, y)

# this does work
gcv = sklearn.model_selection.GridSearchCV(estimator=sksurv.tree.SurvivalTree(random_state=0),
                                           param_grid={ "min_samples_leaf": [3, 6, 9] },
                                           cv=train_test_generator(X, y, stratified="event", n_splits=5))
gcv.fit(X, y)

gcv.cv_results_


sebp · 2023-06-10T08:29:44Z

Like in regression, stratifying by event time is not very meaningful. As you pointed out, it only makes sense to stratify by the event indicator. The stratified option appears in several of sklearn's model_selection classes, so I'm not sure whether there can be a one-size-fits-all approach.

In any case, explaining the situation in the documentation would be great.
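To illustrate the point about stratifying by the event indicator, here is a minimal sketch that compares plain KFold with StratifiedKFold on the toy data from the issue. It assumes a hand-built NumPy structured array with "event" and "time" fields as a stand-in for the output of Surv.from_arrays, so that only numpy and scikit-learn are needed. The helper events_per_test_fold is a hypothetical name used only here.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

n = 100
times = np.arange(n) + 1.0
events = times <= n / 10  # the first 10 samples are events

# Assumption: a hand-built structured array standing in for Surv.from_arrays output.
y = np.empty(n, dtype=[("event", bool), ("time", float)])
y["event"] = events
y["time"] = times

X = np.column_stack((np.arange(n), n - np.arange(n)))


def events_per_test_fold(splits):
    """Count the events that land in each test fold (hypothetical helper)."""
    return [int(y["event"][test_index].sum()) for _, test_index in splits]


# Unshuffled KFold cuts contiguous blocks, so the events can pile up in one fold.
plain = events_per_test_fold(KFold(n_splits=5).split(X))

# Passing only the boolean event field gives StratifiedKFold a usable class label.
stratified = events_per_test_fold(StratifiedKFold(n_splits=5).split(X, y["event"]))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

On this toy data the unshuffled KFold puts every event into the first test fold, while stratifying on y["event"] spreads the events evenly across folds.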

Lamgayin · 2024-08-12T08:40:42Z

import numpy as np
from sklearn.model_selection import StratifiedKFold

class SurvivalStratifiedKFold(StratifiedKFold):
    def __init__(self, n_splits=5, shuffle=True, random_state=None):
        super().__init__(n_splits=n_splits, shuffle=shuffle, random_state=random_state)

    def split(self, X, y, groups=None):
        # stratify on the first field of each structured record,
        # which is the event indicator when y comes from Surv.from_arrays
        y_stratify = np.array([label[0] for label in y])
        return super().split(X, y_stratify, groups)

I hope this method helps you solve the issue.
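As a variation on the subclass above, the following sketch selects the stratification field by name instead of by position, which may be more robust if the structured y uses a different field order. The class name EventStratifiedKFold and the event_field parameter are hypothetical and not part of scikit-learn or scikit-survival. The structured array is again built by hand as a stand-in for Surv.from_arrays output.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


class EventStratifiedKFold(StratifiedKFold):
    """Hypothetical splitter that stratifies on one named field of a structured y."""

    def __init__(self, n_splits=5, *, shuffle=False, random_state=None, event_field="event"):
        super().__init__(n_splits=n_splits, shuffle=shuffle, random_state=random_state)
        self.event_field = event_field  # assumed name of the boolean event field

    def split(self, X, y, groups=None):
        # Hand only the event indicator to the parent class for stratification.
        return super().split(X, y[self.event_field], groups)


n = 100
times = np.arange(n) + 1.0

# Assumption: a hand-built structured array standing in for Surv.from_arrays output.
y = np.empty(n, dtype=[("event", bool), ("time", float)])
y["event"] = times <= n / 10
y["time"] = times

X = np.column_stack((np.arange(n), n - np.arange(n)))

cv = EventStratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_event_counts = [int(y["event"][test_index].sum()) for _, test_index in cv.split(X, y)]
fold_sizes = [len(test_index) for _, test_index in cv.split(X, y)]
print(fold_event_counts, fold_sizes)
```

Because the splitter accepts the structured y directly, an instance like cv should be usable as the cv argument of GridSearchCV in the same way as the generator in the original snippet, although that combination is not exercised here.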

sebp added the documentation label Jun 27, 2023
