I have a dataset with a feature that takes only integer values (a performance rating, for example). SMOTE generates each synthetic point along the line segment between two existing minority points, so if the two points differ in this feature, the synthetic point gets a non-integer value. A decision tree quickly learns that these non-integer values are a good predictor of the minority class. For example, I end up with a split on performance < 4 followed by a split on performance > 3, leading to a leaf that contains only minority-class data (all of it generated by SMOTE).
For example, a leaf of all SMOTE data when Performance Rating is 3 < x < 4.
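To make the effect concrete, here is a minimal sketch of the interpolation step, written with plain NumPy rather than the library's own code. The helper `smote_interpolate` and the two sample points are hypothetical and only illustrate why an integer-valued column ends up with fractional values.

```python
import numpy as np

def smote_interpolate(x, neighbor, rng):
    """Return one synthetic point on the segment between x and neighbor."""
    gap = rng.uniform(0.0, 1.0)  # random position along the segment
    return x + gap * (neighbor - x)

rng = np.random.default_rng(0)

# Hypothetical minority samples: column 0 is an integer performance rating.
a = np.array([3.0, 0.20])
b = np.array([4.0, 0.90])

synthetic = np.array([smote_interpolate(a, b, rng) for _ in range(5)])
ratings = synthetic[:, 0]
print(ratings)  # values strictly between 3 and 4, none of them integers
```

Every synthetic rating falls in the open interval (3, 4), which is exactly the region the tree isolates.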
In the original SMOTE paper, the authors suggested a variant, SMOTE-NC, for data with non-continuous features, which sets each such feature to the value occurring most often among the k nearest neighbors instead of interpolating it. Is there any will to implement this feature? I suppose the user would need to pass an index list of the non-continuous features in addition to what is currently passed to the algorithm.
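As a rough illustration of the idea, a SMOTE-NC-style step can take the value of a non-continuous feature from the neighbors rather than interpolating it. The sketch below uses the most frequent value among the k nearest minority samples. It is a simplification under stated assumptions: the function names `most_frequent` and `neighbor_vote` are hypothetical, and the neighbor search uses plain Euclidean distance rather than the adjusted distance described in the paper.

```python
import numpy as np

def most_frequent(values):
    """Return the most common value in a 1-D array (ties go to the smallest)."""
    uniq, counts = np.unique(values, return_counts=True)
    return uniq[np.argmax(counts)]

def neighbor_vote(X_min, i, k, discrete_idxs):
    """For minority sample i, return the majority value of each discrete
    feature among its k nearest minority neighbors."""
    dists = np.linalg.norm(X_min - X_min[i], axis=1)
    dists[i] = np.inf  # exclude the sample itself
    nn = np.argsort(dists)[:k]
    return np.array([most_frequent(X_min[nn, j]) for j in discrete_idxs])

# Hypothetical minority samples: column 0 is the integer rating.
X_min = np.array([[3.0, 0.1],
                  [4.0, 0.2],
                  [4.0, 0.3],
                  [5.0, 0.9]])

voted = neighbor_vote(X_min, i=0, k=3, discrete_idxs=[0])
print(voted)  # the rating assigned to a synthetic sample built from row 0
```

The continuous columns would still be interpolated as usual; only the listed columns take the voted value.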
@jtsmith2 I'm not sure we ought to include the SMOTE-NC algorithm, considering that we don't support SMOTE for categorical data either. So, at least for now, we should keep it simple. For your specific problem, I'd use the following workaround:
```python
import numpy as np

# regular SMOTE processing here,
# resulting in X, y ...

# indices of the non-continuous (integer-valued) features
nc_feat_idxs = [0, 1, 5]
X[:, nc_feat_idxs] = np.round(X[:, nc_feat_idxs])
```
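For anyone who wants to try the rounding step in isolation, here is a self-contained sketch. The helper `round_discrete_features` and the small array standing in for resampled data are hypothetical; in practice `X_res` would be the output of the SMOTE step.

```python
import numpy as np

def round_discrete_features(X, discrete_idxs):
    """Return a copy of X with the listed columns rounded to whole numbers."""
    X_out = np.array(X, dtype=float, copy=True)  # leave the input untouched
    X_out[:, discrete_idxs] = np.round(X_out[:, discrete_idxs])
    return X_out

# Stand-in for resampled data: column 0 is the integer rating,
# column 1 is a genuinely continuous feature.
X_res = np.array([[3.4, 0.12],
                  [3.6, 0.50],
                  [4.0, 0.70]])

rounded = round_discrete_features(X_res, [0])
print(rounded[:, 0])  # ratings snapped back to whole numbers
print(rounded[:, 1])  # continuous column left as it was
```

Note that `np.round` rounds exact halves to the nearest even value (2.5 becomes 2.0), and that rounding can make several synthetic samples share the same rating, so it is worth checking the resulting distribution.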
![smote-nc](https://cloud.githubusercontent.com/assets/43162/18696663/ea0dd88a-7f88-11e6-8956-9a5310d9afbd.png)