
Transform-then-split or split-then-transform? #13

Open
dipetkov opened this issue Sep 14, 2021 · 0 comments

Section "Discovering the built-in frameworks in Amazon SageMaker" in ch. 7 illustrates how to use XGBoost to predict houses prices.

In the example, the housing dataset is first transformed and then split into training and validation subsets.

This isn't best practice; it's better to split first and fit the transformation on the training subset only. More importantly, the one-hot encoding isn't saved anywhere, so how can we use the trained model to make predictions for new houses? (A sketch of the split-then-transform approach follows the snippet below.)

Here is the relevant code snippet:

# One-hot encode
data = pd.get_dummies(data)

# Move labels to first column, which is what XGBoost expects
data = data.drop(['y_no'], axis=1)
data = pd.concat([data['y_yes'], data.drop(['y_yes'], axis=1)], axis=1)

# Shuffle and split into training and validation (95%/5%)
data = data.sample(frac=1, random_state=123)
train_data, val_data = train_test_split(data, test_size=0.05)
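
For comparison, here is a minimal split-then-transform sketch. This is not the book's code: it assumes the same data DataFrame as above, uses the same pandas/scikit-learn stack, omits the label-column handling, and the file name train_columns.json is only an illustration of persisting the encoding:

import json
import pandas as pd
from sklearn.model_selection import train_test_split

# Split the raw (untransformed) data first, keeping the book's 95%/5% ratio
train_data, val_data = train_test_split(data, test_size=0.05, random_state=123)

# One-hot encode the training split only, then align the validation split to
# the training columns: categories unseen in training are dropped and any
# missing dummy columns are filled with 0
train_data = pd.get_dummies(train_data)
val_data = pd.get_dummies(val_data).reindex(columns=train_data.columns, fill_value=0)

# Persist the encoded column layout so new data can be encoded consistently
# at prediction time
with open("train_columns.json", "w") as f:
    json.dump(train_data.columns.tolist(), f)

At prediction time the same idea applies: call pd.get_dummies on the new rows and reindex them with the saved column list before passing them to the trained model.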