Cross validation makes duplicate samples #790
I tracked down the problem: the validation code tries to handle class imbalance by sampling more data, without any notification. I understand that many models in mljar, such as tree ensembles, have hyper-parameters that assume a minimum number of samples per class, so the up-sampling is presumably meant to satisfy them. An even more serious issue is that all confusion matrices, and all ensembles built from the appended predictions, are computed on an incorrect distribution. This is a significant loss of correctness: they should be calculated without the appended samples. The fewer samples the dataset has, the more critical this issue becomes. A better solution would be to expose this behavior as an explicit argument.
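To make the correctness point concrete, here is a minimal sketch of the post-hoc filtering being argued for: recompute the confusion matrix only on real rows, dropping any out-of-fold predictions that come from appended (up-sampled) indices. The five-fold setup, the assumption that OOF rows follow fold order, and the column names `target` and `prediction` are all assumptions based on this report, not the library's documented schema.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

n_rows = 59  # true dataset size, from data_info.json ("rows": 59)

# Concatenate validation indices in fold order; assumes the OOF predictions
# are written in the same order (an assumption, not verified in the source).
indices = np.concatenate(
    [np.load(f"folds/fold_{i}_validation_indices.npy") for i in range(5)]
)
oof = pd.read_csv("predictions_out_of_folds.csv")  # 64 rows in this report

# Indices >= n_rows point at appended duplicates; keep only real samples.
mask = indices < n_rows
clean = oof.loc[mask]

# Column names are hypothetical; adjust them to the actual OOF file schema.
print(confusion_matrix(clean["target"], clean["prediction"]))
```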
Hi @ov3rfit, thank you for creating an issue for this, and sorry I didn't respond earlier. You are right: it is there to handle datasets with classes that have a small number of samples. I like the idea of adding an additional argument to control up-sampling, but I need to think about it.
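One shape such an argument could take (purely a hypothetical sketch; mljar-supervised does not expose this flag today) is an extra key in the existing `validation_strategy` dictionary:

```python
from supervised import AutoML

automl = AutoML(
    validation_strategy={
        "validation_type": "kfold",
        "k_folds": 5,
        "shuffle": True,
        "stratify": True,
        # Proposed, non-existent flag: instead of silently up-sampling rare
        # classes to fill folds, raise an error (or at least warn) when a
        # class has fewer samples than k_folds.
        "allow_upsampling": False,
    }
)
```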
Problem Description

My dataset has 59 samples (confirmed in `data_info.json` with `"rows": 59`), but I've observed the following issues (a verification sketch follows this list):

- The `fold_i_validation_indices.npy` files in the `folds` directory contain indices 59-63, which exceed the dataset size.
- The `predictions_out_of_folds.csv` files for all models contain 64 true labels and predictions (5 extra samples).
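A short way to confirm the first point (a sketch; it assumes only the file names reported above):

```python
import glob
import numpy as np

n_rows = 59  # from data_info.json

for path in sorted(glob.glob("folds/fold_*_validation_indices.npy")):
    idx = np.load(path)
    out_of_range = idx[idx >= n_rows]  # indices that cannot exist in the data
    if out_of_range.size:
        print(path, "contains out-of-range indices:", out_of_range)
```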
Environment

Version: 1.1.15
Models in my run:
Investigation

I reviewed the `mljar-supervised` source code but was unable to identify where or why the additional indices are generated.

Impact
Questions