
Transform-then-split or split-then-transform? #13

Open
dipetkov opened this issue Sep 14, 2021 · 0 comments

Section "Discovering the built-in frameworks in Amazon SageMaker" in ch. 7 illustrates how to use XGBoost to predict houses prices.

In the example, the housing dataset is first transformed and then split into training and validation subsets.

This isn't best practice; it's better to split first and fit the transformation on the training subset only. More importantly, the one-hot encoding isn't saved anywhere, so how can we use the trained model to make predictions for new houses? (A sketch of the split-then-transform approach follows the snippet below.)

Here is the relevant code snippet:

# One-hot encode
data = pd.get_dummies(data)

# Move labels to first column, which is what XGBoost expects
data = data.drop(['y_no'], axis=1)
data = pd.concat([data['y_yes'], data.drop(['y_yes'], axis=1)], axis=1)

# Shuffle and split into training and validation (95%/5%)
data = data.sample(frac=1, random_state=123)
train_data, val_data = train_test_split(data, test_size=0.05)
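
For comparison, here is a minimal split-then-transform sketch. This is not the book's code: it assumes the same data DataFrame as above, uses the same pandas/scikit-learn stack, omits the label-column handling, and the file name train_columns.json is only an illustration of persisting the encoding:

import json
import pandas as pd
from sklearn.model_selection import train_test_split

# Split the raw (untransformed) data first, keeping the book's 95%/5% ratio
train_data, val_data = train_test_split(data, test_size=0.05, random_state=123)

# One-hot encode the training split only, then align the validation split to
# the training columns: categories unseen in training are dropped and any
# missing dummy columns are filled with 0
train_data = pd.get_dummies(train_data)
val_data = pd.get_dummies(val_data).reindex(columns=train_data.columns, fill_value=0)

# Persist the encoded column layout so new data can be encoded consistently
# at prediction time
with open("train_columns.json", "w") as f:
    json.dump(train_data.columns.tolist(), f)

At prediction time the same idea applies: call pd.get_dummies on the new rows and reindex them with the saved column list before passing them to the trained model.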