Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SimpleImputer: conversion raises error when setting fill_value for imputers on string data. #1122

Closed
bodnarbm opened this issue Aug 7, 2024 · 1 comment

Comments

@bodnarbm
Copy link
Contributor

bodnarbm commented Aug 7, 2024

While attempting to convert a scikit-learn pipeline to an onnx model, a teammate discovered that setting the fill_value to an abritary string value raises an error during the conversion.

For example, when running the below code

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from skl2onnx.convert import to_onnx
from skl2onnx.common.data_types import StringTensorType
from onnxruntime import InferenceSession


model = SimpleImputer(strategy="constant", missing_values="", fill_value="missing")
data = pd.DataFrame(
    [["s1", "s2"], ["s1", "s2"], ["", "s3"], ["s7", "s6"], ["s8", ""]]
)
model.fit(data)

sklearn_transformed = model.transform(data)
print("SKLearn: ", sklearn_transformed.tolist())

model_onnx = to_onnx(model, initial_types=[("input", StringTensorType([None, 2]))], target_opset=21)

input = {"input": np.array(data).astype(np.str_)}
sess = InferenceSession(model_onnx.SerializeToString(), providers=["CPUExecutionProvider"])
onnx_transformed = sess.run(None, input)[0]
print("Onnx: ", onnx_transformed.tolist())

print("Matches: ", sklearn_transformed.tolist() == onnx_transformed.tolist())

The to_onnx call raises the following error

RuntimeError: Imputer cannot fill missing values with a string 'missing'.

I believe this is due to a leftover guard statement that existed before String support was added in #694.

Assuming the guard statement is still needed when handling numeric data, I believe it just needs to be moved into the else block after the check on if this is an imputer for string or numeric data.

I'm working on a PR right now to add unit tests for this and to move the guard statement.

@bodnarbm
Copy link
Contributor Author

Thanks! This can be closed with the merge of #1123.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant