Length: 01:35
Notebook: lesson1-rf.ipynb
ln -s ../../fastai ./
where `./` is the current directory
Evaluation metric: root mean squared log error (RMSLE)
sqrt{ mean{ [ln(pred) - ln(act)]^2 } }
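The metric can be sketched in a few lines of NumPy (a minimal implementation for illustration; the notebook instead takes the log of `SalePrice` up front and then uses plain RMSE, which is equivalent):

```python
import numpy as np

def rmsle(pred, act):
    """Root mean squared log error: sqrt(mean((ln(pred) - ln(act))^2))."""
    pred, act = np.asarray(pred, dtype=float), np.asarray(act, dtype=float)
    return np.sqrt(np.mean((np.log(pred) - np.log(act)) ** 2))
```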
- we need all of our columns to be numbers
- use the function `add_datepart` to replace a date variable with all of its date parts
- use the function `train_cats` to convert strings to pandas categories (notice: the data type is not `string`, but `category`)
- use the function `set_categories` to re-order categories
- use the function `proc_df` to replace categories with their numeric codes, handle missing continuous values, and split the dependent variable into a separate variable: `df, y, nas = proc_df(df_raw, 'SalePrice')`
- for continuous variables, missing values were replaced with the median
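A rough pandas-only sketch of what these steps amount to (not fastai's actual implementation; the toy column names and values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'saledate': pd.to_datetime(['2011-05-01', '2012-07-15']),
    'UsageBand': ['High', 'Low'],
    'MachineHours': [120.0, np.nan],
    'SalePrice': [9500, 14000],
})

# add_datepart: replace the date column with its date parts
df['saleYear'] = df.saledate.dt.year
df['saleMonth'] = df.saledate.dt.month
df = df.drop(columns='saledate')

# train_cats: strings -> pandas categories
df['UsageBand'] = df['UsageBand'].astype('category')

# proc_df: median-fill missing continuous values, categories -> codes,
# and split off the dependent variable
df['MachineHours'] = df['MachineHours'].fillna(df['MachineHours'].median())
df['UsageBand'] = df['UsageBand'].cat.codes
y = df.pop('SalePrice').values
```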
- if you get an R^2 that is negative, it means your model is worse than predicting the mean
- R^2 is not necessarily what you're trying to optimize
- R^2 answers: how good is your model versus the naive model that always predicts the mean?
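R^2 is easy to write yourself (a minimal version from the definition, 1 minus the ratio of the model's squared error to the mean-model's squared error):

```python
import numpy as np

def r2(pred, act):
    """R^2 = 1 - SS_residual / SS_total. Exactly 0 for a model that always
    predicts the mean; negative for a model that is worse than that."""
    pred, act = np.asarray(pred, dtype=float), np.asarray(act, dtype=float)
    ss_res = np.sum((act - pred) ** 2)
    ss_tot = np.sum((act - act.mean()) ** 2)
    return 1 - ss_res / ss_tot
```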
- Creating a validation set is the most important thing you'll do in machine learning.
- Validation Set (first hold out set): use this to determine what hyperparameters to use
- Testing (second hold out set): I've done modeling, now I'll see how it works
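Since this dataset is time-ordered, the lesson notebook builds the validation set from the most recent rows rather than a random sample; roughly:

```python
import numpy as np

def split_vals(a, n):
    """First n rows for training, the rest held out for validation."""
    return a[:n], a[n:]

# toy stand-in for the real (time-sorted) data
df = np.arange(100).reshape(50, 2)
y = np.arange(50)

n_valid = 10                 # the notebook uses 12000, matching the test set size
n_trn = len(df) - n_valid
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)
```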
`n_estimators` = number of trees
`n_jobs=-1` --> create a separate job for each CPU that you have
m = RandomForestRegressor(n_estimators=20, n_jobs=-1)
[training RMSE, validation RMSE, training R^2, validation R^2]
[0.1026724559118164, 0.33553753413792303, 0.9786895444439101, 0.79893791069374753]
- statistical technique to create a random forest
- Bag of Little Bootstraps, Michael Jordan
- create 5 different models which are not correlated --> they offer different insights
- build 1,000 trees on 10 separate data points each --> individual trees will not be predictive, but combined they will be
- pick out n rows with replacement
- very useful when we have only a small dataset
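The row sampling behind bagging is just "draw n indices with replacement"; the rows a tree never sees are its out-of-bag (OOB) rows, which is what `oob_score=True` below scores on. A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10  # rows in the dataset

# one bootstrap sample: n row indices drawn *with replacement*
idx = rng.integers(0, n, size=n)

# rows this tree never saw are its "out-of-bag" rows; on average a tree
# sees about 63% of the unique rows, so ~37% are OOB
oob = np.setdiff1d(np.arange(n), idx)
```

Averaging the predictions of many such trees gives the forest's prediction, and each tree can be evaluated on its own OOB rows, i.e. on data it never trained on.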
m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)
[training RMSE , validation RMSE, training R^2, validation R^2, OOB R^2]
[0.10198464613020647, 0.2714485881623037, 0.9786192457999483, 0.86840992079038759, 0.84831537630038534]
- to tune, pass in a list of the hyperparameters we want to adjust and the values we want to try
- The basic idea is this: rather than limit the total amount of data that our model can access, let's instead limit it to a different random subset per tree. That way, given enough trees, the model can still see all the data, but for each individual tree it'll be just as fast as if we had cut down our dataset as before.
- no dataset is too big for this technique (e.g., the 120-million-row grocery store dataset from a Kaggle competition)
- need to set `oob_score=False` if using the subsample approach of `set_rf_samples(20000)`
- to turn subsampling off, call `reset_rf_samples()`
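`set_rf_samples` is a fastai helper that patches scikit-learn's internals; modern scikit-learn (0.22+) exposes the same idea directly through the `max_samples` argument, so an equivalent sketch without the fastai helper is:

```python
from sklearn.ensemble import RandomForestRegressor

# each tree bootstraps from at most 20,000 rows instead of the whole dataset
m = RandomForestRegressor(n_estimators=40, n_jobs=-1,
                          max_samples=20_000, oob_score=False)
```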
- very few people in industry or academia do this
- most people run all of their models on all of their data all of the time using their best parameters
- instead, do most of your modeling on a sample that is large enough for reasonable accuracy but takes only a few seconds to train
`min_samples_leaf=1` is the default
`min_samples_leaf=3` stops splitting a node further once a leaf would have 3 or fewer samples; the values 1, 3, 5, 10, and 25 work well
`max_features=None` is the default (use all the features)
`max_features=0.5` randomly chooses half the features at each split; the less correlated your trees are with each other, the better
in practice, good `max_features` values range from 0.5 down to `log2` or `sqrt`
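Putting the suggested values together, one of the combinations worth trying looks like this (a sketch; the lesson suggests sweeping `min_samples_leaf` over {1, 3, 5, 10, 25} and `max_features` over {0.5, 'sqrt', 'log2', None}):

```python
from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                          max_features=0.5, n_jobs=-1, oob_score=True)
```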
- hard to screw it up
- great for out of box, even without tuning hyperparameters
- tends to work on most datasets most of the time
df_raw.fiProductClassDesc.cat.categories
df_raw.fiProductClassDesc.cat.codes
--> this is what the random forest sees
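A tiny example of the categories/codes distinction, and of re-ordering with `set_categories` (toy values, not the real `fiProductClassDesc` column):

```python
import pandas as pd

s = pd.Series(['Low', 'High', 'Medium', 'High'], dtype='category')
print(s.cat.categories)    # alphabetical by default: High, Low, Medium
print(list(s.cat.codes))   # the integers the random forest sees

# re-order so the codes follow the natural ordering High < Medium < Low
s = s.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)
print(list(s.cat.codes))
```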
- experiment
- draw the trees
- plot the errors
- try different datasets
- write your own R2
- write your own versions of the datasets