Random Sampling for imbalanced datasets #1004

ZahirBilal · 2022-08-24T19:34:23Z

ZahirBilal
Aug 24, 2022

Hello guys,

I am researching the performance of classifiers on extremely imbalanced datasets. For this use case, I tested almost all classifiers in River and also in MOA (https://github.com/Waikato/moa).

So far, the only algorithms that performed acceptably were the Random Sampling algorithms (both over- and under-sampling) and the UOB algorithms in MOA.

My question concerns the Random Sampling algorithms: were they developed based on some literature (a paper)? If so, please point me to the publication.

Also, could someone please explain what the sampling_rate parameter of class RandomSampler(ClassificationSampler) represents?

Best wishes,
Zahir

MaxHalford · 2022-08-24T20:40:09Z

MaxHalford
Aug 24, 2022
Maintainer

Hey there.

> I am researching the performance of classifiers on extremely imbalanced datasets. For this use case, I tested almost all classifiers in River and also in MOA (https://github.com/Waikato/moa).

Some of the people behind MOA are the same ones behind scikit-multiflow, which merged with creme to become River. It's a small world :)

> My question concerns the Random Sampling algorithms: were they developed based on some literature (a paper)? If so, please point me to the publication.

No, not really. I got the idea from reading Oza and Russell's online bagging and boosting paper. I then realized that rejection sampling was the mechanism I needed to sample data online with a desired target distribution. I wrote a blog post about it, which led to the RandomUnderSampler implementation. Then it wasn't difficult to implement RandomOverSampler.
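To make the rejection sampling idea concrete, here is a minimal sketch of online under-sampling towards a desired class distribution. It uses only the standard library. The class name `RejectionUnderSampler`, the `accept` method and the `simulate` helper are made up for illustration, and the code is not presented as River's actual implementation. It also assumes every label in the stream appears in `desired_dist`.

```python
import collections
import random


class RejectionUnderSampler:
    """Illustrative helper (hypothetical name), not River's implementation."""

    def __init__(self, desired_dist, seed=None):
        self.desired_dist = desired_dist  # target class proportions
        self.seen = collections.Counter()  # class counts observed so far
        self.rng = random.Random(seed)

    def accept(self, y):
        """Return True if a sample with label y should be passed to the model."""
        self.seen[y] += 1
        f, g = self.desired_dist, self.seen
        # Rejection sampling constant: the largest target/observed ratio
        # among the classes seen so far.
        m = max(f[c] / g[c] for c in g)
        # The class attaining the maximum is essentially always kept;
        # the other classes are kept with a probability below 1.
        return self.rng.random() < f[y] / (m * g[y])


def simulate(n=20_000, positive_rate=0.05, seed=42):
    """Feed a synthetic imbalanced label stream through the sampler."""
    label_rng = random.Random(seed)
    # A different seed keeps the accept/reject draws independent of the labels.
    sampler = RejectionUnderSampler({True: 0.5, False: 0.5}, seed=seed + 1)
    kept = collections.Counter()
    for _ in range(n):
        y = label_rng.random() < positive_rate
        if sampler.accept(y):
            kept[y] += 1
    return kept


if __name__ == "__main__":
    print(simulate())
```

On this synthetic stream with about 5% positives, the sampler should keep roughly as many negatives as positives while discarding most of the data.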

I can't find any of my notes on RandomSampler, but I vaguely remember doing the math with pen and paper. The sampling_rate parameter simply indicates how much of the data to learn on. If sampling_rate=0.5, the model wrapped by RandomSampler will be trained on about 50% of the data.
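As a rough illustration of what sampling_rate means, the sketch below simulates one way such a sampler could behave. It assumes each incoming sample is shown to the model a Poisson-distributed number of times, in the spirit of online bagging, with a rate scaled by `sampling_rate` and by the ratio of the desired to the observed class share. That mechanism is an assumption made for this example and not necessarily how River implements RandomSampler. The function names are hypothetical.

```python
import collections
import math
import random


def poisson(rate, rng):
    """Draw from a Poisson distribution using Knuth's multiplication method."""
    limit = math.exp(-rate)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate_sampler(sampling_rate, desired_dist, n=20_000,
                     positive_rate=0.05, seed=7):
    """Count how many times each class would be shown to the wrapped model."""
    label_rng = random.Random(seed)
    sample_rng = random.Random(seed + 1)  # independent of the label stream
    seen = collections.Counter()
    trained = collections.Counter()
    for i in range(1, n + 1):
        y = label_rng.random() < positive_rate
        seen[y] += 1
        observed_share = seen[y] / i  # running estimate of the class share
        # Assumed rule: rare classes get a rate above 1 (over-sampling),
        # frequent classes a rate below 1 (under-sampling).
        rate = sampling_rate * desired_dist[y] / observed_share
        trained[y] += poisson(rate, sample_rng)
    return trained


if __name__ == "__main__":
    counts = simulate_sampler(0.5, {True: 0.5, False: 0.5})
    print(counts, sum(counts.values()) / 20_000)
```

Under this assumption, the total number of learning steps comes out close to `sampling_rate` times the number of samples seen, split between the classes according to `desired_dist`.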

I hope that helps.

1 reply

ZahirBilal Sep 8, 2022
Author

Thanks, Max, for your help and the information. I understand it completely now :)
