Indexing for subsequent feature selection #7
Could you please show your modifications to the script?
Hi @jijo7, I copied mutual_info_.py from sklearn 0.18.dev0 (https://github.com/scikit-learn/scikit-learn/blob/0f2a00f/sklearn/feature_selection/mutual_info_.py) and placed it in the same folder as mifs.py. You will have to fix the imports at the top from:
to:
To get the first feature, change line 207 in mifs.py from:
to:
Change line 225 in mifs.py from:
to:
I was only using MRMR with continuous features and continuous targets, so I didn't need the branches inside the mi.get_mi_vector function that handle JMI/JMIM and categorical values (I didn't test JMI or JMIM). This should make MRMR work for continuous features and continuous or categorical targets. mutual_info_regression and mutual_info_classif also check the sparsity of the input features to determine whether they are discrete or continuous (you can override this via the 'discrete_features' flag). I can't say for certain whether this works properly, as I haven't compared it against any other implementations. Based on the documentation, the method of MI estimation implemented in sklearn appears to be similar to the one here (kNN).
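A minimal usage sketch of the two sklearn estimators discussed above (the data here is made up for illustration): both use a kNN-based MI estimator and return one non-negative score per feature.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

# Synthetic data: feature 0 drives the continuous target,
# feature 1 drives the categorical target.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y_cont = X[:, 0] + 0.1 * rng.normal(size=200)  # continuous target
y_disc = (X[:, 1] > 0).astype(int)             # categorical target

mi_cont = mutual_info_regression(X, y_cont, random_state=0)  # continuous y
mi_disc = mutual_info_classif(X, y_disc, random_state=0)     # categorical y
```

With this setup, feature 0 should receive the highest score for the continuous target and feature 1 for the categorical one.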
@limi44
Hi @limi44 |
@limi44 Sorry, may I ask a question about dividing the data into training and test sets for feature selection?
@shahlaebrahimi Sure, I'll do my best to answer your question. Regarding passing the number of features for MRMR: at least based on the original paper by Peng, all features are assigned a ranking by MRMR, and then the number of features is selected by computing the accuracy on each subsequent subset of features. You could define a stopping condition to end the search early, such as what Daniel has done, but you would have to select a threshold empirically. With a smaller subset of features, you can then apply an exhaustive feature selection (or whatever your choice is) with a much lower computational burden than using the entire feature set. The stopping condition defined in this code appears to be for JMI/JMIM, so I don't think it applies to MRMR.
@limi44 Thanks for sharing your knowledge and time. 1- Load and prepare data. Now, I have read some points regarding how to divide data into train and test sets when doing feature selection here, here, and here, so I am not sure whether I am doing it right. In particular, if the filter method is supposed to decrease the computational burden, should the filter approach be applied to the whole data set or just the training data?
@shahlaebrahimi Unfortunately I'm not familiar with NSGA-II, so I can't comment on that. For dividing your data, it would make more sense to use k-fold cross-validation, so that you can guarantee that every data point is part of the test set exactly once. The websites you've linked are just saying that you must be careful not to include data from your test set in your training set when estimating the error, as that can bias the results. What is often done is to search for the optimal number of features by running cross-validation for a different number of features each time, and then selecting the number of features that gives the lowest cross-validation error. I'm not sure I understand your last question - filter methods will generally have a lower computational burden, since they compute some information (e.g. some metric of separability) about your data set that ideally leads to the selection of features with good performance. It is assumed that computing this information is less expensive than actually training the model and measuring its performance. You apply the filter method only on your training data to select features, and then choose the same features from your test set when evaluating performance. Hope that helps.
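The cross-validation advice above can be sketched in plain Python (a toy helper, not any library's API): split the samples into k folds so each point lands in the test set exactly once, and fit the filter only on the training portion of each fold.

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(10, 3)
# Each fold serves as the test set once; the filter method would be fit
# on the remaining indices only, and the chosen feature columns reused
# unchanged on the held-out fold.
all_test = [i for fold in folds for i in fold]
```

In practice you would use sklearn's KFold for this; the sketch just shows the guarantee that every sample is tested exactly once.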
@limi44 Thank you very much for all your exquisite responses.
to:
However, this error is reported:
In fact, I load my dataset as follows:
When I change
to:
I encounter another error:
Thanks in advance.
@shahlaebrahimi You should be changing line 255 to:
You don't need to check self.categorical, because when you choose the features after the first feature, you are computing the mutual information between selected features and candidate features (both of which are continuous, hence mutual_info_regression), not between a continuous feature and a discrete target. When you try to use mutual_info_classif with continuous targets, you get an error that the target type is not valid (it must be int or str). For the DataConversionWarning, use np.ravel(y). The attribute 'values' only exists for pandas DataFrames, not for numpy arrays.
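A small illustration of the np.ravel fix mentioned above: the estimators expect a 1-D target array, and a column vector of shape (n, 1) is what triggers the DataConversionWarning.

```python
import numpy as np

# A (3, 1) column vector, e.g. what you get from a single-column
# DataFrame via .values:
y = np.array([[1.0], [2.0], [3.0]])

# np.ravel flattens it to shape (3,), which is what the sklearn MI
# functions expect for the target:
y_flat = np.ravel(y)
```

np.ravel works on both numpy arrays and pandas objects, which is why it is preferable to relying on the pandas-only .values attribute chain.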
Hi @limi44
It should be noted that in both cases "auto" is passed as the number of features. However, I use Python 3.5, so I changed line 222 from:
to:
It seems to produce an error when feature 0 is selected (the error above shows that 'auto' selected feature #6: 0).
@shahlaebrahimi To get JMI/JMIM to use the sklearn versions of MI, instead of changing line 225, you will have to modify mi.get_mi_vector (which calls _get_mi). If you change _mi_dc to mutual_info_classif and _mi_cc to mutual_info_regression, that should work (although you'll have to play around with the syntax to make sure it runs properly). As for the error you've shown above, there's a problem with your loop condition. You have an infinite loop if self.n_features == 'auto'. The only way to break from the loop is if the threshold condition is met at line 250:
What is likely happening is that when you have 60 features, the decay rate of JMI stabilizes over the last 5 selected so the condition is met, and the loop is exited before you reach 60 features selected. When you have 14 features, over the last 5 selected features, the JMI is still changing. So the loop continues, and you get an error because you've already selected all the features. To fix this, change your loop condition so that you don't continue selecting after all features are selected. Try this:
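A minimal sketch of the stopping-logic fix described above (simplified stand-in names, not the repo's exact code): the greedy loop must also terminate once every candidate has been selected, otherwise n_features='auto' loops forever when the threshold condition is never met.

```python
def select_features(n_total, n_features='auto', threshold_met=lambda scores: False):
    """Toy greedy selection loop illustrating the corrected condition."""
    S = []                      # indices of selected features
    F = list(range(n_total))    # indices of remaining candidates
    scores = []
    while F:  # stop when no candidates remain, even for n_features='auto'
        f = F.pop(0)            # stand-in for the argmax over MI scores
        S.append(f)
        scores.append(1.0)
        if n_features == 'auto':
            if threshold_met(scores):   # e.g. the JMI decay-rate check
                break
        elif len(S) >= n_features:
            break
    return S
```

With 14 features and a threshold that never triggers, 'auto' now stops after selecting all 14 instead of indexing past the end of the candidate list.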
@limi44 Hi. |
For "JMI" and "JMIM", I change line 225 to:
Unfortunately, it displays this error:
Isn't it necessary to change line 207?
@limi44 Sorry, again in Python 3.5, I changed line 206 from:
to
Should it be changed to this?
Best regards, |
@shahlaebrahimi Line 207 needs to be changed as I previously mentioned. Line 225 should be kept as it was originally:
You need to modify mi.py and change the calls to the helper functions _mi_dc and _mi_cc to the functions from mutual_info_ (you will have to import them into mi.py as well); do not use mutual_info._compute_mi_cd. You'll have to debug the changes for Python 3 yourself, as I am only using Python 2. As for the toolbox you linked, it appears to have JMI and MRMR, but you'll have to look at their documentation to figure out how to use it.
@limi44 Hi. Thanks a lot.
In mi.py:
Line 34: from
to
Line 37: from
to
Line 41: from
to
Line 65: from
to
Line 68: from
to
Line 77:
Line 118:
In mifs.py:
Line 207: from
to
Line 225: However, the result was:
Sorry again, Kind regards, |
Regarding using sklearn functions for JMI/JMIM --- it can't be done, because they were designed to compute MI between univariate random variables, and JMI/JMIM requires computing MI involving a 2-variate variable. This is not a limitation of the algorithms used; it is just how it was agreed to introduce MI estimation into sklearn (to match its univariate feature selection scheme). As for negative MI --- the algorithms in sklearn may compute negative values, in which case they are replaced by 0 (which is reasonable); in this code they are replaced by NaN. Generally the code in sklearn is more polished and tested, but I can't claim that it necessarily does a better job than the code in this repo (although I would trust it more). @shahlaebrahimi I suggest you try the code from this repo, but
But if I were you, I would probably dig more deeply into the algorithms and implement a version I was certain about, or thoroughly check the provided implementation.
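The bivariate-MI point above can be made concrete. For discrete data, the JMI relevance term I((x_j, x_s); y) treats a pair of features as a single 2-variate variable, something the univariate sklearn API has no entry point for. A self-contained sketch (toy code, not from either library):

```python
from collections import Counter
from math import log

def joint_mi(xj, xs, y):
    """I((xj, xs); y) for discrete sequences, in nats:
    sum over (z, t) of p(z, t) * log(p(z, t) / (p(z) * p(t))),
    where z = (xj, xs) is treated as one joint variable."""
    n = len(y)
    pzy = Counter(zip(xj, xs, y))   # joint counts of ((xj, xs), y)
    pz = Counter(zip(xj, xs))       # marginal counts of the pair
    py = Counter(y)                 # marginal counts of the target
    return sum(c / n * log((c / n) / ((pz[a, b] / n) * (py[t] / n)))
               for (a, b, t), c in pzy.items())
```

This is only the discrete, plug-in estimator; the kNN estimators used for continuous data need the analogous multivariate treatment, which is exactly what the univariate sklearn interface does not expose.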
@nmayorov Thanks for the tip regarding JMI/JMIM! And thanks for implementing the MI methods in sklearn - I'd love to see MI-based feature selection eventually integrated as well. You are also correct about the change to line 225; I made a typo in my previous comment, which I have now corrected. From what I understand, due to the approximations made in computing MI, it is sometimes possible to get negative values, even though MI is non-negative by definition. With the MI methods in this repo, I was getting many NaN values (negative MIs), which is why I switched to the sklearn methods, which at least gave me positive values. @shahlaebrahimi I agree with @nmayorov: you should try to understand how JMI/JMIM work before debugging the code, otherwise you won't be able to determine whether the problem is syntax-related or a problem with the implementation. Regarding your comment that the first and last features are not selected by MRMR, keep in mind that Python data structures are zero-indexed, so the selected features are numbered 0-13, not 1-14.
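The two negative-MI conventions discussed above can be shown side by side (toy helpers, not either library's actual functions): sklearn-style clipping to 0 versus the NaN replacement reported for this repo.

```python
import math

def clip_mi(mi_estimate):
    """sklearn-style handling: a negative MI estimate becomes 0."""
    return max(0.0, mi_estimate)

def nan_mi(mi_estimate):
    """mifs-style handling (as described in this thread): a negative
    MI estimate becomes NaN, which then propagates through the ranking."""
    return mi_estimate if mi_estimate >= 0 else float('nan')
```

Clipping keeps the downstream max/argmax operations well defined, whereas NaNs silently poison comparisons, which matches the symptom of "many NaN values" reported above.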
Hi everyone, very sorry for my long absence - life got a bit busy. This thread exploded and there's no way I can read it all through now. @nmayorov I tried to get the sklearn guys to incorporate JMI-based filter methods into the feature_selection module, and the author of a very comprehensive review paper on the topic also chimed in, but I'm not sure they were convinced. See here: scikit-learn/scikit-learn#6313 (comment) I'd be happy to work with you on integrating JMI-based FS into an sklearn module, if you're interested. The methods behind these algorithms seem really well established and studied.
Can you please try the latest version of the code and report back if you still encounter the bug? Thanks! |
Hi Daniel,
Thanks for sharing your code! I'm trying to use MRMR and running into some problems. With continuous features, the mutual information keeps producing negative values. I was able to work around this by using another MI implementation (see the aside below); however, the code used for selecting subsequent features does not match what I expected based on Peng's paper (see the incremental algorithm, equation (7)). Here is the relevant code snippet from mifs.py, lines 222-239:
The s that is passed to mi.get_mi_vector() should be the index of the previously selected feature, according to the comment in get_mi_vector, but in this case s is just the number of currently selected features minus 1 (which is meaningless when used to index X). Shouldn't the arguments be mi.get_mi_vector(self, F, S[-1]), so that you pass the index of the last selected feature?
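The indexing bug described above is easy to see with concrete values (illustrative variable names, following the thread's convention that S holds selected feature indices and F the remaining candidates):

```python
S = [7, 2, 11]        # features selected so far, in selection order
F = [0, 1, 3, 4]      # remaining candidate features

# Buggy argument: the *count* of selected features minus 1 -- this is
# just a position in S, not a column index of X:
s_buggy = len(S) - 1

# Intended argument: the column index of the feature selected last,
# i.e. what get_mi_vector's comment says it expects:
s_fixed = S[-1]
```

Here s_buggy is 2 (a meaningless column unless feature 2 happens to be involved), while s_fixed is 11, the feature actually selected last.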
Aside: I found implementations of MI for continuous and discrete features and outputs in the newest development version (0.18.dev0) of sklearn (https://github.com/scikit-learn/scikit-learn/blob/0f2a00f/sklearn/feature_selection/mutual_info_.py#L290). Using mutual_info_regression, I don't get negative values for MI, so perhaps there are specific modifications that make it more stable.
Cheers,
Michael