
Discrepancies between ONNX and sklearn probabilities with isotonic CalibratedClassifierCV #1151

Open
cja-halfspace opened this issue Jan 7, 2025 · 3 comments


cja-halfspace commented Jan 7, 2025

Hello, and thank you for your work on this great library!
I'm seeing a pretty big difference in probabilities when using CalibratedClassifierCV with isotonic regression together with RandomForestClassifier.
It only seems to happen when the max_depth parameter is set high enough.

I've provided a small snippet to reproduce the issue, with the following versions of libraries:

  • scikit-learn==1.6.0
  • skl2onnx==1.18.0
  • onnxruntime==1.20.1

import numpy as np
import onnxruntime as ort
from numpy.testing import assert_almost_equal
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=400_000,
    n_features=15,
    n_informative=15,
    n_redundant=0,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=30,
)
X = X.astype(np.float32)

rf = RandomForestClassifier(
    max_depth=10,
    n_jobs=-1,
    random_state=1234,
).fit(X, y)

model = CalibratedClassifierCV(rf, method="isotonic", cv="prefit").fit(
    X, y
)

model_onnx = convert_sklearn(
    model,
    initial_types=[("input", FloatTensorType([None, X.shape[1]]))],
    target_opset=15,
    options={"zipmap": False},
)

session = ort.InferenceSession(model_onnx.SerializeToString())

output = session.run(
    ["probabilities"],
    {"input": X},
)
onnx_probs = output[0][:, 1]
model_probs = model.predict_proba(X)[:, 1].astype(np.float32)

assert_almost_equal(onnx_probs, model_probs, decimal=5)

The result is:

> Mismatched elements: 4485 / 400000 (1.12%)
> Max absolute difference among violations: 0.01261032
> Max relative difference among violations: 0.11618411

I see that IsotonicRegression is not listed on https://onnx.ai/sklearn-onnx/supported.html, but I would expect CalibratedClassifierCV to be supported with both calibration methods.


xadupre commented Jan 8, 2025

It is supported; otherwise you would see a much larger number of mismatches. The issue probably comes from the use of float in the trees instead of double. You can read this to understand where it comes from: https://onnx.ai/sklearn-onnx/auto_tutorial/plot_ebegin_float_double.html. We should implement the latest ONNX standard to fix that.
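To illustrate how a float32 cast can flip a tree branch, here is a minimal sketch (the threshold value is hypothetical, chosen so that it and a nearby sample round to the same float32):

```python
import numpy as np

# Hypothetical split threshold, as learned by sklearn in double precision.
threshold = 0.30000001
x = 0.3  # feature value of a sample that sits just below the split

# sklearn compares in float64: the sample goes left.
print(x < threshold)  # True

# TreeEnsembleClassifier stores the threshold as float32; both values
# round to the same float32, so the comparison flips and the sample
# ends up in a different leaf, hence a different probability.
print(np.float32(x) < np.float32(threshold))  # False
```

A handful of near-threshold samples flipping branches like this, then passing through the steep steps of the isotonic map, is consistent with a small fraction of mismatches with differences up to ~1e-2.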

@cja-halfspace
Author

Thanks for the fast reply! Do I understand correctly that this issue would be fixed by the switch to TreeEnsemble?


xadupre commented Jan 13, 2025

Yes
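For context: if I read the spec correctly, the newer TreeEnsemble operator (ai.onnx.ml opset 5) can store split thresholds in double precision, which avoids the float32 rounding flips. A minimal sketch with a hypothetical near-threshold value:

```python
import numpy as np

# Hypothetical split: the threshold and a nearby sample round to the
# same float32, so a float32 comparison flips the branch.
threshold, x = 0.30000001, 0.3

print(np.float32(x) < np.float32(threshold))  # float32 op: False (flipped)
print(np.float64(x) < np.float64(threshold))  # double op:  True (matches sklearn)
```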
