
Should testing data be used for unsupervised feature transformation or selection? #23

Closed · dhimmel opened this issue Aug 1, 2016 · 2 comments

dhimmel (Member) commented Aug 1, 2016

Imagine splitting the data as follows, where X is the complete feature matrix and y is the outcome array (train_test_split doc):

```python
import sklearn.cross_validation  # renamed to sklearn.model_selection in scikit-learn 0.18

X_train, X_test, y_train, y_test = sklearn.cross_validation.train_test_split(X, y)
```

The goal of this discussion is to decide whether we should fit any unsupervised feature selection/transformation on the entire X (the union of X_train and X_test). @htcai cautioned against selection/transformation on the entire X: #18 (comment).

What are the drawbacks and advantages of performing selection/transformation on an X that includes X_test?
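
For concreteness, here is a minimal sketch of the two options (my illustration, not code from #18; StandardScaler, PCA, and n_components=10 are arbitrary placeholders):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Option A: fit the unsupervised transformation on X_train only.
# X_test is transformed using statistics learned from X_train alone,
# so the test set stays truly unseen.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=10).fit(scaler.transform(X_train))
X_test_transformed = pca.transform(scaler.transform(X_test))

# Option B: fit the transformation on the full X (X_train plus X_test).
# The test set's means, variances, and principal directions now shape
# the transformation, so information leaks from X_test into training.
scaler_full = StandardScaler().fit(X)
pca_full = PCA(n_components=10).fit(scaler_full.transform(X))
X_test_leaky = pca_full.transform(scaler_full.transform(X_test))
```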

htcai (Member) commented Aug 1, 2016

k-fold cross-validation might be useful for reducing or eliminating bias in the estimate of the model's performance. This method is also available in sklearn. I'm not sure whether this has already been taken into account; just in case.
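
A minimal sketch of what I mean (the classifier and scoring metric are placeholders):

```python
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in scikit-learn >= 0.18
from sklearn.linear_model import LogisticRegression

# Score the model on 5 held-out folds instead of a single split;
# averaging across folds gives a more stable performance estimate.
scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5, scoring='roc_auc')
print(scores.mean(), scores.std())
```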

dhimmel (Member, Author) commented Aug 1, 2016

@htcai my pull request #18 uses GridSearchCV, which performs cross-validation behind the scenes. For reference, that cross-validation occurs entirely inside X_train as defined in the comment above.
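
For illustration, a sketch of that pattern (not the exact code in #18; the pipeline steps and parameter grid are placeholders). Putting the transformation inside a Pipeline that GridSearchCV cross-validates means the transformation is refit on each training fold of X_train, so held-out folds and X_test never influence it:

```python
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in scikit-learn >= 0.18
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA()),
    ('clf', LogisticRegression()),
])
param_grid = {'pca__n_components': [5, 10, 20], 'clf__C': [0.1, 1, 10]}

# GridSearchCV cross-validates inside X_train only; every transformer in
# the pipeline is refit on the training folds of each split, so neither
# the held-out folds nor X_test leak into the fitted transformation.
grid = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```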
