[WIP] ENH: Class Sensitive Scaling #416
Conversation
Looking forward to it. I'll try to check it asap. In the meantime, could you check why the tests are not passing?
Hey, thanks for the lightning-fast response! Regarding Travis:
Regarding AppVeyor, in addition:
@glemaitre How do you think we should tackle 3. (error by design)? Thank you so much for your efforts!
A couple of comments, without looking at the tests for the moment.
I am not sure regarding the new base class sampler.
PCA.py (Outdated)

    @@ -0,0 +1,2468 @@
Do we need this file?
imblearn/scaling/css.py (Outdated)

    def __init__(self,
                 mode='linear',
                 target='minority',
This should be the first parameter, and it should be renamed sampling_strategy.
Most probably check #411 and the PR which is ongoing for more details.
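Concretely, the reviewer seems to be asking for something along these lines (a sketch only; the parameter names besides sampling_strategy are taken from the current diff):

    def __init__(self,
                 sampling_strategy='auto',  # replaces `target`, first position
                 mode='linear',
                 minority_class_value=None,
                 shuffle=True,
                 random_state=None):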
imblearn/scaling/css.py (Outdated)

                 minority_class_value=None,
                 shuffle=True,
                 random_state=None):
        super(CSS, self).__init__(ratio=1)
ratio will become sampling_strategy
imblearn/scaling/css.py (Outdated)

        self.minority_class_value = minority_class_value
        self.shuffle = shuffle

    def _validate_estimator(self):
you can remove that
imblearn/scaling/css.py (Outdated)

        super(CSS, self).fit(X, y)

        self._validate_estimator()
remove this
imblearn/scaling/css.py
Outdated
' corresponding to the value of the label in y') | ||
|
||
|
||
mcv = self.minority_class_value |
be explicit: use minority_class instead of mcv
imblearn/scaling/css.py (Outdated)

        least_common = counts.most_common()[:-1-1:-1]
        mcv = least_common[0][0]

        # in the following _a is majority, _i is minority
be explicit even if it is more verbose
imblearn/scaling/css.py (Outdated)

        # in the following _a is majority, _i is minority
        if self.target is "majority" or self.target is "both":
            col_means_a = np.mean(X[(y != mcv)], axis=0)
col is not a good name, because this is not actually a column. It is fine to call it mean_majority_class or something similar.
imblearn/scaling/css.py
Outdated
col_means_a = np.mean(X[(y != mcv)], axis=0) | ||
if self.mode is "linear": | ||
distances_a = abs(np.subtract(X[y != mcv], col_means_a)) | ||
if self.target is "minority" or self.target is "both": |
we need to extend the method to the multiclass case
Right now, it only works in a one-vs-all fashion. Is this required for acceptance?
Whatever was shown in the paper is fine. This is usually the way that people transform a balancing method for multiclass. The only thing is that sampling_strategy (which will replace target) should be a string which can be 'auto', 'all', 'minority', 'majority', 'not minority', or 'not majority'. In short, by calling the BaseSampler it will return a sampling_strategy dict with the classes that should be scaled, depending on the string. Then your scaling should loop over the keys.
I imagine that we should also accept a list mentioning the classes which should be scaled.
But anyway, we need to merge #413 first, so I would address all the other issues after that.
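A rough sketch of the resolution step described above. This is illustrative only, not imblearn's actual helper; the function name is made up:

    from collections import Counter

    def resolve_classes_to_scale(sampling_strategy, y):
        """Map a strategy string to the classes to scale (sketch only)."""
        counts = Counter(y)
        minority = min(counts, key=counts.get)
        majority = max(counts, key=counts.get)
        if sampling_strategy == 'minority':
            targets = [minority]
        elif sampling_strategy == 'majority':
            targets = [majority]
        elif sampling_strategy == 'not minority':
            targets = [c for c in counts if c != minority]
        elif sampling_strategy == 'not majority':
            targets = [c for c in counts if c != majority]
        else:  # 'all' or 'auto'
            targets = list(counts)
        return {c: counts[c] for c in targets}

    # The scaler would then loop over the keys:
    # for class_label in resolve_classes_to_scale(self.sampling_strategy, y):
    #     ...scale the samples of `class_label` toward their class mean...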
imblearn/scaling/css.py (Outdated)

        # merge scaled and non scaled stuff
        if self.target is "majority":
            X_scaled = np.concatenate([X_scaled_a, X[y == mcv]], axis=0)
You need to use safe_indexing from scikit-learn to be able to support sparse matrices.
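For illustration, a minimal sketch of the suggestion (safe_indexing was a public scikit-learn utility at the time; recent releases renamed it _safe_indexing):

    import numpy as np
    from scipy import sparse
    from sklearn.utils import safe_indexing

    X = sparse.random(6, 3, density=0.5, format='csr', random_state=0)
    y = np.array([0, 1, 0, 1, 1, 1])

    # Integer indices work for dense arrays, sparse matrices, and dataframes
    # alike, unlike the boolean-mask indexing used in the diff above.
    majority_rows = np.flatnonzero(y != 0)
    X_majority = safe_indexing(X, majority_rows)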
We can skip the test if it does not apply. I have to check a bit more what the test was doing.
It is strange that this is passing in Travis. I don't think that your code can manage sparse matrices for the moment (you need to use safe_indexing). Actually, this is also in Travis.
It is happening with sparse matrices :) I think that we should refer to https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/utils/sparsefuncs.py#L65 to compute the mean vectors.
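Something along these lines could cover both the dense and the sparse case (a sketch using sklearn.utils.sparsefuncs, not the PR's code):

    import numpy as np
    from scipy import sparse
    from sklearn.utils.sparsefuncs import mean_variance_axis

    def column_means(X_sub):
        """Column-wise means that work for dense arrays and CSR/CSC matrices."""
        if sparse.issparse(X_sub):
            means, _ = mean_variance_axis(X_sub, axis=0)
            return means
        return np.mean(X_sub, axis=0)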
Thanks for the detailed feedback! I'll try to fix everything you pointed out and report back as soon as I'm done or have questions. Please keep me updated regarding the "we can skip that test" / error-by-design issue. Thanks a lot!
We are also missing some documentation. We need to add an example to illustrate how to use the class and how it works. On the other hand, we need to add a section to the user guide. We have a new type of "sampler" which is more of a scaler, so we would need a new section.
Just a quick comment: I would use the name ClassSensitiveScaling instead of CSS.
@chkoar Since this method does not over- or under-sample, it only scales. In some way, I have the impression that it should inherit from TransformerMixin and should use fit_transform, since y is not modified. We could still use the sampling_strategy utils inside the class. What are your thoughts on that?
@glemaitre I agree. We could place that class in the preprocessing module/package to stay in line with scikit-learn.
May I ask if there is anything I could help with?
Sorry, I forgot to mention: since we actually do not sample here but scale, we think that it would be better to derive from TransformerMixin. Could you try to migrate to a full transformer?
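A minimal sketch of what such a migration could look like, assuming the linear "shrink toward the class mean" behaviour discussed above; the class name and the factor c are illustrative, not the final API:

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.utils import check_array, check_X_y

    class ClassSensitiveScaler(BaseEstimator, TransformerMixin):
        """Shrink each sample toward its class mean by a factor `c` (sketch)."""

        def __init__(self, c=0.25):
            self.c = c

        def fit(self, X, y):
            X, y = check_X_y(X, y)
            self.class_means_ = {label: X[y == label].mean(axis=0)
                                 for label in np.unique(y)}
            return self

        def transform(self, X, y=None):
            X = check_array(X)
            if y is None:
                # Without labels there is no class center to scale toward.
                return X
            X_out = X.astype(float, copy=True)
            for label, mean in self.class_means_.items():
                mask = y == label
                X_out[mask] += self.c * (mean - X[mask])
            return X_out

Note that transform needs y to pick the right class center, which is exactly the friction between this method and the plain transformer contract.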
Sorry for the late reply. You're saying:
Thanks in advance!
(force-pushed from bbf2b12 to 513203c)
@BernhardSchlegel actually, you should inherit from TransformerMixin.
(force-pushed from eae6c6b to ffdde80)
@BernhardSchlegel are you willing to finish this PR? It would be a nice addition to the imbalanced-learn package.
Yeah sure, just tell me what to do :) Last time I tried, I wasted my time.
This was the last proposal still standing: it will be a better candidate for a transformer than a sampler.
We will then need to review the internal changes that could have occurred since your last push. Looking at the PR, we also need user guide documentation to show what to expect from the method and when to use it, and to add it to the API documentation.
@glemaitre can we close this PR?
What does this implement/fix? Explain your changes.
This implements a new technique called "class sensitive scaling" that removes borderline noise by scaling samples toward their corresponding class center. It is very fast to compute, eases the identification of a decision boundary, and is an alternative to Tomek-link-based concepts like CNN or OSS that supposedly reveal the decision boundary by removing noisy borderline samples. For details, please refer to:
B. Schlegel, and B. Sick. "Dealing with class imbalance the scalable way: Evaluation of various techniques based on classification grade and computational complexity." 2017 IEEE International Conference on Data Mining Workshops, 2017.
Any other comments?
This is my first pull request!
If you want a simple visualization, you can use a snippet like the one sketched below.
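(The snippet below is an illustrative reconstruction: it assumes a simple linear shrink of each sample toward its class mean, which matches the description above but is not necessarily the PR's exact code.)

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.datasets import make_classification

    # Toy imbalanced 2D data.
    X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                               n_redundant=0, n_clusters_per_class=1,
                               weights=[0.9, 0.1], random_state=42)

    # Linear class-sensitive scaling: move each sample a fraction `c`
    # toward its class mean.
    c = 0.5
    X_scaled = X.copy()
    for label in np.unique(y):
        mask = y == label
        X_scaled[mask] += c * (X[mask].mean(axis=0) - X[mask])

    fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
    for ax, data, title in [(axes[0], X, 'original'),
                            (axes[1], X_scaled, 'class sensitive scaled')]:
        ax.scatter(data[y == 0, 0], data[y == 0, 1], s=10, label='majority')
        ax.scatter(data[y == 1, 0], data[y == 1, 1], s=10, label='minority')
        ax.set_title(title)
    axes[0].legend()
    plt.show()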
Thanks for the great library, by the way!
edit: updated sample code to match code changes.