
Cross validation makes duplicate samples #790

Open
ov3rfit opened this issue Jan 22, 2025 · 2 comments
ov3rfit commented Jan 22, 2025

Problem Description

My dataset has 59 samples (confirmed in data_info.json with "rows": 59), but I've observed the following issues:

  1. fold_i_validation_indices.npy files in the folds directory contain indices 59–63, which exceed the dataset size (valid indices are 0–58)
  2. predictions_out_of_folds.csv files for all models contain 64 true labels and predictions (5 extra samples; see the check sketched after this list)
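
For reference, this is roughly how the mismatch shows up when inspecting the results directory. The directory name is a placeholder; the file layout (data_info.json with a "rows" field, fold_*_validation_indices.npy under folds) is taken from the description above.

```python
# Hedged sketch of the check described above; "AutoML_1" is a placeholder
# results directory name.
import glob
import json

import numpy as np

results_dir = "AutoML_1"

with open(f"{results_dir}/data_info.json") as f:
    n_rows = json.load(f)["rows"]  # 59 in this case

val_indices = np.concatenate(
    [np.load(p) for p in glob.glob(f"{results_dir}/folds/fold_*_validation_indices.npy")]
)

print("rows in data_info.json:", n_rows)              # 59
print("total validation samples:", val_indices.size)  # 64
print("out-of-range indices:", np.sort(val_indices[val_indices >= n_rows]))  # [59 60 61 62 63]
```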

Environment

Version: 1.1.15

Models in my run:

  • Baseline
  • Neural Network
  • Random Forest
  • XGBoost
  • Decision Tree
  • Logistic Regression

Investigation

  • Tested with sklearn's StratifiedKFold, which works correctly with the same dataset (see the sketch after this list)
  • The total number of validation samples is 64 (2^6), which might suggest oversampling or padding for computational efficiency
  • I spent over two hours investigating the mljar-supervised source code but was unable to identify where or why the additional indices are generated
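
The sklearn comparison mentioned above looks roughly like this; the features and target are placeholders with the same shape as the reported dataset.

```python
# Sketch of the StratifiedKFold comparison: with 59 rows, sklearn never
# produces validation indices outside [0, 59).
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(59, 4)
y = np.array([0] * 44 + [1] * 15)  # e.g. a minority class with 15 samples

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
val_indices = np.concatenate([val_idx for _, val_idx in skf.split(X, y)])

assert val_indices.size == 59   # every sample appears exactly once
assert val_indices.max() == 58  # no out-of-range indices
```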

Impact

  1. Individual fold models cannot be properly evaluated because the origin of the extra indices is unknown
  2. Validation set effectiveness is compromised by the duplicate samples

Questions

  1. Is this intended behavior for specific models?
  2. If oversampling/padding is required (e.g., for neural network batch size), how can we identify and remove those extra samples?
  3. How can we obtain the correct mapping between predictions and original data indices?
ov3rfit commented Jan 23, 2025

I tracked down the problem. It is in base_automl.py, in the function _handle_drastic_imbalance: line 629 defines min_samples_per_class = 20.

It tries to handle imbalance by sampling additional data without any notification.
This is problematic behavior: at the very least, it should emit a warning at the start and record what it did somewhere, such as data_info.json, rather than doing so silently.
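
For anyone else hitting this, the behaviour is roughly equivalent to the following. This is an illustration of the logic described above, not the actual _handle_drastic_imbalance code.

```python
# Illustration (NOT the actual mljar-supervised code) of the described
# effect: every class with fewer than min_samples_per_class rows gets
# duplicated rows appended until it reaches that threshold.
import numpy as np
import pandas as pd

MIN_SAMPLES_PER_CLASS = 20  # the hard-coded value reported above

def upsample_small_classes(X: pd.DataFrame, y: pd.Series):
    X, y = X.reset_index(drop=True), y.reset_index(drop=True)
    for cls in y.unique():
        cls_idx = y.index[y == cls]
        missing = MIN_SAMPLES_PER_CLASS - cls_idx.size
        if missing > 0:
            extra = np.random.choice(cls_idx, size=missing, replace=True)
            X = pd.concat([X, X.loc[extra]], ignore_index=True)
            y = pd.concat([y, y.loc[extra]], ignore_index=True)
    return X, y

# With 59 rows and a class of 15 samples, this appends 5 duplicated rows,
# matching the 64 out-of-fold predictions and indices 59-63 reported above.
```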

I understand that many models in mljar, such as tree ensembles, require hyper-parameters like min_samples_leaf, so the existence of min_samples_per_class is understandable. However, it should not be tied to imbalance handling. What happens if a binary classification problem has 10 samples per class? That is not an imbalance problem, so this logic should be moved elsewhere.

An even more serious issue is that all confusion matrices and ensembles built on the appended predictions are based on an incorrect distribution. This is a significant loss of correctness; they should be calculated without the appended samples. The fewer samples there are, the more critical this issue becomes. A possible workaround is sketched below.
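
A workaround sketch for recomputing metrics on the original distribution only. It assumes the rows of predictions_out_of_folds.csv follow the concatenated order of the fold_*_validation_indices.npy files; the "target" / "prediction" column names and the paths are placeholders, not confirmed mljar names.

```python
# Hedged workaround: drop out-of-fold rows whose validation index points
# past the original data before computing metrics.
import glob

import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

n_rows = 59  # "rows" from data_info.json

val_indices = np.concatenate(
    [np.load(p) for p in sorted(glob.glob("AutoML_1/folds/fold_*_validation_indices.npy"))]
)
oof = pd.read_csv("AutoML_1/1_Baseline/predictions_out_of_folds.csv")

keep = val_indices < n_rows                          # False for the appended samples
oof_original = oof.loc[keep].copy()
oof_original["original_index"] = val_indices[keep]   # mapping back to the source rows

print(confusion_matrix(oof_original["target"], oof_original["prediction"]))
```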

A better solution would be to make it an argument of AutoML. With an argument defaulting to 20, users can recognize it and infer its behavior just by reading the initializer signature, which would also reduce the documentation burden. Additionally, users could disable the feature by setting it to 0, or adjust the value to their specific needs.
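
Concretely, the proposal would look something like this. This is hypothetical: min_samples_per_class is not an existing AutoML parameter, it is the change being suggested.

```python
# Hypothetical sketch of the proposed interface; min_samples_per_class is
# NOT an existing AutoML argument, it is the suggested addition.
from supervised.automl import AutoML

automl = AutoML(
    results_path="AutoML_1",   # existing argument
    min_samples_per_class=0,   # proposed: 0 would disable the silent upsampling
)
```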

pplonski (Contributor) commented
Hi @ov3rfit,

Thank you for creating an issue for this. Sorry I didn't respond earlier. You are right, it is there to handle datasets with classes that have a small number of samples.

I like the idea of adding an additional argument to control the upsampling, but I need to think about it.
