
Class/Sample weights vs is_unbalance/scale_pos_weight #6807

Open · bking124 opened this issue Jan 30, 2025 · 0 comments
Forgive me if this is not the right forum or if this question has been answered previously. I pored through old issues to see if my question was answered, but none quite got to the bottom of things.

I have a binary classification problem with severe class imbalance, and have been experimenting with the different ways LightGBM can weight classes or samples. The documentation points to several possibilities (I will stick to the Python scikit-learn API, although as far as I'm aware these features are available in the other APIs too):

  • Set sample weights (in the fit function)
  • Set class weights (explicitly not recommended in the docs)
  • Set is_unbalance=True
  • Set scale_pos_weight

I have put together a minimal example that exercises all four options, assigning each class the same effective weight in every case, just expressed through the different mechanisms.

import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight
import lightgbm as lgb

# Make a basic classification dataset with roughly 5% in the positive class,
# and split into train/test
n = 100000
X, y = make_classification(n_samples=n, weights=[0.95], random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Calculate scale_pos_weight as the ratio of negative to positive samples
train_counts = np.bincount(y_train)
scale_weight = train_counts[0] / train_counts[1]

# Compute sample weights (same values as 'balanced' class weights)
sample_weights = compute_sample_weight(class_weight='balanced', y=y_train)

# Instantiate dict of models, one per weighting approach
models = {
    "ClassWeightsBalanced": lgb.LGBMClassifier(class_weight='balanced', n_estimators=500),
    "SampleWeights": lgb.LGBMClassifier(n_estimators=500),
    "ScalePosWeight": lgb.LGBMClassifier(scale_pos_weight=scale_weight, n_estimators=500),
    "IsUnbalanced": lgb.LGBMClassifier(is_unbalance=True, n_estimators=500),
}

for name, instance in models.items():
    if name == "SampleWeights":
        instance.fit(X_train, y_train, eval_set=[(X_test, y_test), (X_train, y_train)],
                     sample_weight=sample_weights)
    else:
        instance.fit(X_train, y_train, eval_set=[(X_test, y_test), (X_train, y_train)])

    lgb.plot_metric(instance, title=name)
From the resulting loss curve plots, we can see that class weights and sample weights behave identically, as do scale_pos_weight and is_unbalance. Interestingly, past the first 50-60 iterations or so, the losses of all four methods are almost equivalent, and by the end the predictions score very similarly on almost any metric.

[Image: training and validation loss curves for the four weighting approaches]

The verbose output from the model training points to a primary difference. The class/sample weight methods say:

[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Start training from score 0.000000

whereas the scale_pos_weight/is_unbalance methods output

[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.054960 -> initscore=-2.844622
[LightGBM] [Info] Start training from score -2.844622
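
As far as I can tell, the initscore in these logs is just the log-odds (logit) of pavg, which is easy to check by hand (a quick sketch based on my reading of the log output, not taken from the LightGBM source):

```python
import math

# The logged initscore appears to be the log-odds of pavg.
pavg = 0.054960  # pavg reported by the scale_pos_weight/is_unbalance runs
initscore = math.log(pavg / (1 - pavg))
print(initscore)  # close to the logged -2.844622 (pavg is rounded in the log)

# The inverse direction: a sigmoid maps the score back to pavg.
assert abs(1 / (1 + math.exp(-initscore)) - pavg) < 1e-12
```

The same formula explains the other pair of runs: pavg=0.5 gives log(0.5/0.5) = 0, matching their "Start training from score 0.000000".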

So it would appear that one difference is the initial score. Accordingly, I experimented with init_score in the fit function as well as the boost_from_average parameter, but wasn't able to make one approach exactly duplicate the other (close, but not exact). With all that said, my questions are:

  1. Am I correct that the main difference between these methods is how they initialize the score? Or are there other differences?
  2. Why is class_weight not recommended for the binary case, with the docs instead advising is_unbalance? Based on the (admittedly simple) examples I've been looking at, there's not much of a difference after a certain number of iterations, and the loss curve from the class/sample weight approach seems better behaved.
  3. Is there some parameter or setting such that I can make one approach exhibit the exact same behavior as the other (i.e. produce the exact same tree at all iterations)?
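
For what it's worth, here is my working hypothesis for question 1 in miniature (pure NumPy, with the 'balanced' weights computed by hand to mirror compute_sample_weight; the claim about what BoostFromScore averages is my guess, not something I've confirmed in the source):

```python
import numpy as np

# Guess: with class/sample weights, BoostFromScore takes the *weighted* mean
# of the labels, while is_unbalance/scale_pos_weight leave the labels
# unweighted and fold the imbalance into the gradients instead.
rng = np.random.default_rng(123)
y = (rng.random(100_000) < 0.055).astype(int)  # ~5.5% positives

# 'balanced' weights by hand: n / (n_classes * class_count)
n, n_pos = len(y), y.sum()
w = np.where(y == 1, n / (2 * n_pos), n / (2 * (n - n_pos)))

def initscore(p):
    return np.log(p / (1 - p))

pavg_weighted = np.average(y, weights=w)  # exactly 0.5 with balanced weights
pavg_plain = y.mean()

print(initscore(pavg_weighted))  # ~ 0, matching the class/sample-weight log
print(initscore(pavg_plain))     # ~ -2.84, matching the other log
```

If that guess is right, the two families start from different constants but optimize the same weighted loss, which would explain why the curves converge after enough iterations.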

Apologies for such a long question; I wanted to make sure it was clear what I was asking. Please let me know if I've misunderstood anything about how LightGBM works.
