
Standardization of test data in Lab 6 should use training mean and standard deviation #11

Open
covuworie opened this issue Jul 21, 2018 · 2 comments

Comments


covuworie commented Jul 21, 2018

Observed behavior

Hi, there are bugs in classification-and-pca-lab.ipynb for Lab 6, in the do_classify and classify_from_dataframe functions. When standardizing the testing data, the test set's own mean and standard deviation are used. This is incorrect for several reasons:

  • No information from the testing data should be used in fitting the model or its preprocessing; doing so is a form of data snooping, and the testing dataset has been contaminated by it.
  • The training and testing sets are not standardized with the same parameters, so the transformed variables are not on a common scale.

Expected behavior

The training data mean and standard deviation should be used for standardizing the testing data like so:

dftest=(subdf.iloc[itest] - subdf.iloc[itrain].mean())/subdf.iloc[itrain].std()
Xte = (subdf.iloc[itest] - subdf.iloc[itrain].mean())/subdf.iloc[itrain].std()
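The same train-only pattern can be sketched with scikit-learn's StandardScaler, which stores the fitted parameters for reuse on any later data (illustrative random data and shapes below, not the lab's; note StandardScaler uses the population std, ddof=0, whereas pandas' .std() defaults to ddof=1):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(354, 5))  # made-up data, not the lab's
Xtr, Xte_raw = X[:212], X[212:]                    # mimic the 212/142 split

scaler = StandardScaler().fit(Xtr)  # parameters estimated from the training data only
Xtr_s = scaler.transform(Xtr)       # standardized training set
Xte_s = scaler.transform(Xte_raw)   # same training mean/std applied to the test set
```

Because the scaler object carries the training mean and std, the test set (or a single new observation) is always transformed with the training parameters, never its own.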

I think this was mentioned in one of the earlier lectures, and here are some more references:


pavlosprotopapas commented Jul 21, 2018 via email

@covuworie (Author)

Hi Pavlos,

Thanks for the response. Am I missing something here? As you say, "It is obvious that we should not use the test set mean and std". However, this is precisely the bug I am reporting (notice the use of the itest indices), since it is exactly what is being done in cell 18 in the do_classify function:

itrain, itest = train_test_split(range(subdf.shape[0]), train_size=train_size)
if standardize:
    dftrain=(subdf.iloc[itrain] - subdf.iloc[itrain].mean())/subdf.iloc[itrain].std()
    dftest=(subdf.iloc[itest] - subdf.iloc[itest].mean())/subdf.iloc[itest].std()

The same is also done in cell 20 in the classify_from_dataframe function.

Now, referring to whether it is correct to use the mean and standard deviation of the whole dataset: as the Sebastian Raschka link above says:

'Note that in practice, if the dataset is sufficiently large, we wouldn’t notice any substantial difference between the scenarios 1-3 because we assume that the samples have all been drawn from the same distribution.'

In this case there are only 212 observations in the training set and 142 in the test set, which is not a lot (especially compared with 63 predictors).

I think the main point the various authors are making is one of data leakage / data snooping when the entire dataset's mean and std are used. The example used in the article mentioned above makes a lot of sense:

'Again, why Scenario 3? The reason is that we want to pretend that the test data is “new, unseen data.” We use the test dataset to get a good estimate of how our model performs on any new data. Now, in a real application, the new, unseen data could be just 1 data point that we want to classify. (How do we estimate mean and standard deviation if we have only 1 data point?) That’s an intuitive case to show why we need to keep and use the training data parameters for scaling the test set.'
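The quoted single-observation case is easy to make concrete. A minimal sketch, with made-up training parameters (the values are hypothetical, purely for illustration):

```python
import numpy as np

# Hypothetical training-set parameters, stored at fit time
train_mean = np.array([2.0, 10.0])
train_std = np.array([0.5, 4.0])

# A single "new, unseen" observation: its own mean is just itself and its
# own std is undefined, so only the stored training parameters can be used.
x_new = np.array([2.5, 6.0])
x_scaled = (x_new - train_mean) / train_std
print(x_scaled)  # → [ 1. -1.]
```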

Yes, I agree that in practice it may not make much of a difference compared to using the training set mean and standard deviation, provided the sample size is large and the observations are drawn independently from the same distribution. Yes, we could check this before deciding. But why even take the chance?

I think the answer to this question provides a great explanation and also links to further reputable resources which discuss the issue:

https://stats.stackexchange.com/questions/174823/how-to-apply-standardization-normalization-to-train-and-testset-if-prediction-i

Chuk
