Forgive me if this is not the right forum or if this question has been answered previously. I pored through old issues to see if my question was answered, but none quite got to the bottom of things.
I have a binary classification problem with severe class imbalance, and have therefore been exploring how LightGBM can apply weighting to classes or samples. The documentation points to several possibilities (I will stick to the Python scikit-learn API, although as far as I'm aware these features are available in the other APIs too):
Set sample weights (in the fit function)
Set class weights (explicitly not recommended in the docs)
Set is_unbalance=True
Set scale_pos_weight
Now, I have produced a minimal example that shows these four options in action, setting the weights for each class to the same quantity but supplying them through these different mechanisms.
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight
import lightgbm as lgb
# Make a basic classification dataset with roughly 5% belonging to positive class, and split into train/test
n = 100000
X, y = make_classification(n_samples=n, weights=[0.95], random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X,y)
# Calculate scale_pos_weight
train_counts = np.bincount(y_train)
scale_weight = train_counts[0]/train_counts[1]
# Compute sample weights (same values as class weights)
sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)
# Instantiate dict of models
models = {
    "ClassWeightsBalanced": lgb.LGBMClassifier(class_weight='balanced', n_estimators=500),
    "SampleWeights": lgb.LGBMClassifier(n_estimators=500),
    "ScalePosWeight": lgb.LGBMClassifier(scale_pos_weight=scale_weight, n_estimators=500),
    "IsUnbalanced": lgb.LGBMClassifier(is_unbalance=True, n_estimators=500),
}
for model, instance in models.items():
    if model == "SampleWeights":
        instance.fit(X_train, y_train, eval_set=[(X_test, y_test), (X_train, y_train)], sample_weight=sample_weights)
    else:
        instance.fit(X_train, y_train, eval_set=[(X_test, y_test), (X_train, y_train)])
    lgb.plot_metric(instance, title=model)
From the resulting loss curve plots, I can see that the class-weight and sample-weight runs are exactly the same, as are the scale_pos_weight and is_unbalance runs. Interestingly, past the initial 50-60 iterations or so, the losses of all four are almost equivalent, and by the end the predictions produce very similar scores under almost any metric.
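As a sanity check on the first equivalence, I verified that 'balanced' class weights map to exactly the numbers compute_sample_weight produces. My understanding (an assumption on my part, not something I traced through the source) is that the sklearn wrapper converts class_weight into per-sample weights internally, which would explain identical runs. Continuing from the script above:

from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
cw = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
manual_sample_weights = cw[y_train]                        # map each label (0/1) to its class weight
print(np.allclose(manual_sample_weights, sample_weights))  # True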
The verbose output from the model training points to a primary difference. The class/sample weight methods say:
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000
whereas the scale_pos_weight/is_unbalance methods output
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.054960 -> initscore=-2.844622
[LightGBM] [Info] Start training from score -2.844622
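Both numbers are consistent with my assumption that initscore is simply the log-odds of the (weighted) average label, i.e. log(pavg / (1 - pavg)), where pavg uses whatever sample weights are in play. A quick check under that assumption, continuing from the script above:

p_unweighted = y_train.mean()                             # ~0.055 for this dataset
print(np.log(p_unweighted / (1 - p_unweighted)))          # ~ -2.84, like scale_pos_weight / is_unbalance
p_weighted = np.average(y_train, weights=sample_weights)  # exactly 0.5 with 'balanced' weights
print(np.log(p_weighted / (1 - p_weighted)))              # 0.0, like the class/sample weight runs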
So it would appear that one difference is in the initial score. Accordingly, I started playing around with init_score in the fit function as well as the boost_from_average parameter, but wasn't able to get one approach to completely duplicate the other (close, but not exact; a sketch of that attempt is included after the questions below). With all that said, my questions are:
Am I correct that the main difference between these methods is how they initialize the score? Or are there other differences?
Why are class weights not recommended in the binary case, with is_unbalance advised instead? Based on the (admittedly simple) examples I've been looking at, there's not much of a difference after a certain number of iterations, and the loss curve from the class/sample weight approach actually seems better behaved.
Is there some parameter or setting such that I can make one approach exhibit the exact same behavior as the other (i.e. produce the exact same tree at all iterations)?
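For question 3, this is the kind of init_score experiment I mean, continuing from the script above. The helper names (init_train, init_test) are just placeholders of mine, and this only matches the reported starting score; it did not make the trees identical for me:

# Keep the 'balanced' sample weights but start boosting from the unweighted log-odds,
# i.e. the same starting point that scale_pos_weight / is_unbalance report.
p = y_train.mean()
base_score = np.log(p / (1 - p))
init_train = np.full(len(y_train), base_score)
init_test = np.full(len(y_test), base_score)

clf = lgb.LGBMClassifier(n_estimators=500)
clf.fit(
    X_train, y_train,
    sample_weight=sample_weights,
    init_score=init_train,
    eval_set=[(X_test, y_test), (X_train, y_train)],
    eval_init_score=[init_test, init_train],
)
lgb.plot_metric(clf, title="SampleWeights + init_score")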
Apologies for such a long question, I wanted to make sure it was clear what I was asking. Please let me know if I've misunderstood anything about how LGBM works.