
Cross validation makes duplicate samples #790

Open
ov3rfit opened this issue Jan 22, 2025 · 2 comments
ov3rfit commented Jan 22, 2025

Problem Description

My dataset has 59 samples (confirmed in data_info.json with "rows": 59), but I've observed the following issues:

  1. fold_i_validation_indices.npy files in the folds directory contain indices 59–63, which exceed the dataset size (valid indices are 0–58)
  2. predictions_out_of_folds.csv files for all models contain 64 true labels and predictions (5 extra samples; see the check sketched after this list)
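
For reference, this is roughly how the mismatch shows up when inspecting the results directory. The directory name is a placeholder; the file layout (data_info.json with a "rows" field, fold_*_validation_indices.npy under folds) is taken from the description above.

```python
# Hedged sketch of the check described above; "AutoML_1" is a placeholder
# results directory name.
import glob
import json

import numpy as np

results_dir = "AutoML_1"

with open(f"{results_dir}/data_info.json") as f:
    n_rows = json.load(f)["rows"]  # 59 in this case

val_indices = np.concatenate(
    [np.load(p) for p in glob.glob(f"{results_dir}/folds/fold_*_validation_indices.npy")]
)

print("rows in data_info.json:", n_rows)              # 59
print("total validation samples:", val_indices.size)  # 64
print("out-of-range indices:", np.sort(val_indices[val_indices >= n_rows]))  # [59 60 61 62 63]
```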

Environment

Version: 1.1.15

Models in my run:

  • Baseline
  • Neural Network
  • Random Forest
  • XGBoost
  • Decision Tree
  • Logistic Regression

Investigation

  • Tested with sklearn's StratifiedKFold, which works correctly with the same dataset (see the sketch after this list)
  • The total number of validation samples is 64 (2^6), which might suggest oversampling or padding for computational efficiency
  • I spent over two hours investigating the mljar-supervised source code but was unable to identify where or why the additional indices are generated
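
The sklearn comparison mentioned above looks roughly like this; the features and target are placeholders with the same shape as the reported dataset.

```python
# Sketch of the StratifiedKFold comparison: with 59 rows, sklearn never
# produces validation indices outside [0, 59).
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(59, 4)
y = np.array([0] * 44 + [1] * 15)  # e.g. a minority class with 15 samples

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
val_indices = np.concatenate([val_idx for _, val_idx in skf.split(X, y)])

assert val_indices.size == 59   # every sample appears exactly once
assert val_indices.max() == 58  # no out-of-range indices
```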

Impact

  1. Individual fold models cannot be properly evaluated because the origin of the extra indices is unknown
  2. Validation set effectiveness is compromised by the duplicate samples

Questions

  1. Is this intended behavior for specific models?
  2. If oversampling/padding is required (e.g., for neural network batch size), how can we identify and remove those extra samples?
  3. How can we obtain the correct mapping between predictions and original data indices?
ov3rfit commented Jan 23, 2025

I tracked down the problem. It is in base_automl.py, in the function _handle_drastic_imbalance: line 629 defines min_samples_per_class = 20.

It tries to handle imbalance by sampling additional data without any notification.
This is problematic behavior: at the very least, it should emit a warning at the start and record what it did somewhere, such as data_info.json, rather than doing so silently.
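
For anyone else hitting this, the behaviour is roughly equivalent to the following. This is an illustration of the logic described above, not the actual _handle_drastic_imbalance code.

```python
# Illustration (NOT the actual mljar-supervised code) of the described
# effect: every class with fewer than min_samples_per_class rows gets
# duplicated rows appended until it reaches that threshold.
import numpy as np
import pandas as pd

MIN_SAMPLES_PER_CLASS = 20  # the hard-coded value reported above

def upsample_small_classes(X: pd.DataFrame, y: pd.Series):
    X, y = X.reset_index(drop=True), y.reset_index(drop=True)
    for cls in y.unique():
        cls_idx = y.index[y == cls]
        missing = MIN_SAMPLES_PER_CLASS - cls_idx.size
        if missing > 0:
            extra = np.random.choice(cls_idx, size=missing, replace=True)
            X = pd.concat([X, X.loc[extra]], ignore_index=True)
            y = pd.concat([y, y.loc[extra]], ignore_index=True)
    return X, y

# With 59 rows and a class of 15 samples, this appends 5 duplicated rows,
# matching the 64 out-of-fold predictions and indices 59-63 reported above.
```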

I understand that many models in mljar, such as tree ensembles, require hyper-parameters like min_samples_leaf, so the existence of min_samples_per_class is understandable. However, it should not be tied to imbalance handling. What happens if a binary classification problem has 10 samples per class? That is not an imbalance problem, so this logic should be moved elsewhere.

An even more serious issue is that all confusion matrices and ensembles built on the appended predictions are based on an incorrect distribution. This is a significant loss of correctness; they should be calculated without the appended samples. The fewer samples there are, the more critical this issue becomes. A possible workaround is sketched below.
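
A workaround sketch for recomputing metrics on the original distribution only. It assumes the rows of predictions_out_of_folds.csv follow the concatenated order of the fold_*_validation_indices.npy files; the "target" / "prediction" column names and the paths are placeholders, not confirmed mljar names.

```python
# Hedged workaround: drop out-of-fold rows whose validation index points
# past the original data before computing metrics.
import glob

import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

n_rows = 59  # "rows" from data_info.json

val_indices = np.concatenate(
    [np.load(p) for p in sorted(glob.glob("AutoML_1/folds/fold_*_validation_indices.npy"))]
)
oof = pd.read_csv("AutoML_1/1_Baseline/predictions_out_of_folds.csv")

keep = val_indices < n_rows                          # False for the appended samples
oof_original = oof.loc[keep].copy()
oof_original["original_index"] = val_indices[keep]   # mapping back to the source rows

print(confusion_matrix(oof_original["target"], oof_original["prediction"]))
```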

A better solution would be to make it an argument of AutoML. With an argument defaulting to 20, users can recognize it and infer its behavior just by reading the initializer signature, which would also reduce the documentation burden. Additionally, users could disable the feature by setting it to 0, or adjust the value to their specific needs.
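
Concretely, the proposal would look something like this. This is hypothetical: min_samples_per_class is not an existing AutoML parameter, it is the change being suggested.

```python
# Hypothetical sketch of the proposed interface; min_samples_per_class is
# NOT an existing AutoML argument, it is the suggested addition.
from supervised.automl import AutoML

automl = AutoML(
    results_path="AutoML_1",   # existing argument
    min_samples_per_class=0,   # proposed: 0 would disable the silent upsampling
)
```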

pplonski (Contributor) commented
Hi @ov3rfit,

Thank you for creating an issue for this. Sorry I didn't respond earlier. You are right, it is there to handle datasets with classes that have a small number of samples.

I like the idea of adding an additional argument to control the upsampling, but I need to think about it.
