Length: 01:35
Notebook: lesson1-rf.ipynb
ln -s ../../fastai ./
where `./` is the current directory
Evaluation metric: root mean squared log error (RMSLE)
sqrt{ mean{ [ln(pred) - ln(act)]^2 } }
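The metric can be sketched in a few lines of NumPy (a minimal implementation for illustration; the notebook instead takes the log of `SalePrice` up front and then uses plain RMSE, which is equivalent):

```python
import numpy as np

def rmsle(pred, act):
    """Root mean squared log error: sqrt(mean((ln(pred) - ln(act))^2))."""
    pred, act = np.asarray(pred, dtype=float), np.asarray(act, dtype=float)
    return np.sqrt(np.mean((np.log(pred) - np.log(act)) ** 2))
```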
- we need all of our columns to be numbers
- use the function `add_datepart` to replace a date variable with all of its date parts
- use the function `train_cats` to convert strings to pandas categories (notice: the data type is not `string`, but `category`)
- use the function `set_categories` to re-order categories
- use the function `proc_df` to replace categories with their numeric codes, handle missing continuous values, and split the dependent variable into a separate variable: `df, y, nas = proc_df(df_raw, 'SalePrice')`
- for continuous variables, missing values were replaced with the median
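A rough pandas-only sketch of what these steps amount to (not fastai's actual implementation; the toy column names and values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'saledate': pd.to_datetime(['2011-05-01', '2012-07-15']),
    'UsageBand': ['High', 'Low'],
    'MachineHours': [120.0, np.nan],
    'SalePrice': [9500, 14000],
})

# add_datepart: replace the date column with its date parts
df['saleYear'] = df.saledate.dt.year
df['saleMonth'] = df.saledate.dt.month
df = df.drop(columns='saledate')

# train_cats: strings -> pandas categories
df['UsageBand'] = df['UsageBand'].astype('category')

# proc_df: median-fill missing continuous values, categories -> codes,
# and split off the dependent variable
df['MachineHours'] = df['MachineHours'].fillna(df['MachineHours'].median())
df['UsageBand'] = df['UsageBand'].cat.codes
y = df.pop('SalePrice').values
```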
- if you get an R^2 that is negative, it means your model is worse than predicting the mean
- R^2 is not necessarily what you're trying to optimize
- R^2 answers: how good is your model versus the naive model that always predicts the mean?
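R^2 is easy to write yourself (a minimal version from the definition, 1 minus the ratio of the model's squared error to the mean-model's squared error):

```python
import numpy as np

def r2(pred, act):
    """R^2 = 1 - SS_residual / SS_total. Exactly 0 for a model that always
    predicts the mean; negative for a model that is worse than that."""
    pred, act = np.asarray(pred, dtype=float), np.asarray(act, dtype=float)
    ss_res = np.sum((act - pred) ** 2)
    ss_tot = np.sum((act - act.mean()) ** 2)
    return 1 - ss_res / ss_tot
```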
- Creating a validation set is the most important thing you'll do in machine learning.
- Validation Set (first hold out set): use this to determine what hyperparameters to use
- Testing (second hold out set): I've done modeling, now I'll see how it works
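Since this dataset is time-ordered, the lesson notebook builds the validation set from the most recent rows rather than a random sample; roughly:

```python
import numpy as np

def split_vals(a, n):
    """First n rows for training, the rest held out for validation."""
    return a[:n], a[n:]

# toy stand-in for the real (time-sorted) data
df = np.arange(100).reshape(50, 2)
y = np.arange(50)

n_valid = 10                 # the notebook uses 12000, matching the test set size
n_trn = len(df) - n_valid
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)
```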
`n_estimators` = number of trees
`n_jobs=-1` --> create a separate job for each CPU that you have
m = RandomForestRegressor(n_estimators=20, n_jobs=-1)
[training RMSE, validation RMSE, training R^2, validation R^2]
[0.1026724559118164, 0.33553753413792303, 0.9786895444439101, 0.79893791069374753]
- statistical technique to create a random forest
- Bag of Little Bootstraps, Michael Jordan
- create 5 different models which are not correlated --> they offer different insights
- build 1,000 trees on 10 separate data points each --> individual trees will not be predictive, but combined they will be
- pick out n rows with replacement
- very useful when we have only a small dataset
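The row sampling behind bagging is just "draw n indices with replacement"; the rows a tree never sees are its out-of-bag (OOB) rows, which is what `oob_score=True` below scores on. A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10  # rows in the dataset

# one bootstrap sample: n row indices drawn *with replacement*
idx = rng.integers(0, n, size=n)

# rows this tree never saw are its "out-of-bag" rows; on average a tree
# sees about 63% of the unique rows, so ~37% are OOB
oob = np.setdiff1d(np.arange(n), idx)
```

Averaging the predictions of many such trees gives the forest's prediction, and each tree can be evaluated on its own OOB rows, i.e. on data it never trained on.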
m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)
[training RMSE , validation RMSE, training R^2, validation R^2, OOB R^2]
[0.10198464613020647, 0.2714485881623037, 0.9786192457999483, 0.86840992079038759, 0.84831537630038534]
- to tune, pass in a list of the hyperparameters we want to adjust and the values we want to try
- The basic idea is this: rather than limit the total amount of data that our model can access, let's instead limit it to a different random subset per tree. That way, given enough trees, the model can still see all the data, but for each individual tree it'll be just as fast as if we had cut down our dataset as before.
- no dataset is too big for this technique (e.g., the 120-million-row grocery store dataset from a Kaggle competition)
- need to set `oob_score=False` if using the subsample approach of `set_rf_samples(20000)`
- to turn subsampling off, call `reset_rf_samples()`
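`set_rf_samples` is a fastai helper that patches scikit-learn's internals; modern scikit-learn (0.22+) exposes the same idea directly through the `max_samples` argument, so an equivalent sketch without the fastai helper is:

```python
from sklearn.ensemble import RandomForestRegressor

# each tree bootstraps from at most 20,000 rows instead of the whole dataset
m = RandomForestRegressor(n_estimators=40, n_jobs=-1,
                          max_samples=20_000, oob_score=False)
```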
- very few people in industry or academia do this
- most people run all of their models on all of their data all of the time using their best parameters
- instead, do most of your modeling on a sample that is large enough for reasonable accuracy but takes only a few seconds to train
`min_samples_leaf=1` is the default
`min_samples_leaf=3` stops splitting a node further once a leaf would have 3 or fewer samples; the values 1, 3, 5, 10, and 25 work well
`max_features=None` is the default (use all the features)
`max_features=0.5` randomly chooses half the features at each split; the less correlated your trees are with each other, the better
in practice, good `max_features` values range from 0.5 down to `log2` or `sqrt`
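Putting the suggested values together, one of the combinations worth trying looks like this (a sketch; the lesson suggests sweeping `min_samples_leaf` over {1, 3, 5, 10, 25} and `max_features` over {0.5, 'sqrt', 'log2', None}):

```python
from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                          max_features=0.5, n_jobs=-1, oob_score=True)
```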
- hard to screw it up
- great for out of box, even without tuning hyperparameters
- tends to work on most datasets most of the time
df_raw.fiProductClassDesc.cat.categories
df_raw.fiProductClassDesc.cat.codes
--> this is what the random forest sees
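A tiny example of the categories/codes distinction, and of re-ordering with `set_categories` (toy values, not the real `fiProductClassDesc` column):

```python
import pandas as pd

s = pd.Series(['Low', 'High', 'Medium', 'High'], dtype='category')
print(s.cat.categories)    # alphabetical by default: High, Low, Medium
print(list(s.cat.codes))   # the integers the random forest sees

# re-order so the codes follow the natural ordering High < Medium < Low
s = s.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)
print(list(s.cat.codes))
```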
- experiment
- draw the trees
- plot the errors
- try different datasets
- write your own R2
- write your own versions of the datasets