Help converting and running basic DictVectorizer #1068

Open
addisonklinke opened this issue Feb 2, 2024 · 4 comments

@addisonklinke

addisonklinke commented Feb 2, 2024

I want to convert the example DictVectorizer from the sklearn docs to ONNX. Despite following the documented type constraints for OnnxDictVectorizer, every approach I've tried fails with a different error. Can someone please advise?

Here is the base script I've been modifying:

import onnxruntime
from sklearn.feature_extraction import DictVectorizer
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import (
    DictionaryType,
    FloatTensorType,
    StringType,
    StringTensorType,
    Int64Type,
    Int64TensorType,
)

# Initialize and fit
v = DictVectorizer(sparse=False)
d = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
output = v.fit_transform(d)

# Convert
dict_type = DictionaryType(
    key_type=...,
    value_type=...,
)
v_onnx = convert_sklearn(v, initial_types=[("input", dict_type)])

# Run in ONNX
sess = onnxruntime.InferenceSession(v_onnx.SerializeToString())
inputs = {"input": d}
output_onnx = sess.run(None, inputs)
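
For comparison with whatever the ONNX session eventually returns, the dense matrix that fit_transform produces for d can be reproduced in plain Python. This is a simplified sketch, not scikit-learn's implementation: it assumes numeric values only, feature names in sorted order, and float output, and dict_vectorize is a hypothetical helper name.

```python
def dict_vectorize(rows):
    """Pure-Python stand-in for DictVectorizer(sparse=False).fit_transform."""
    # Feature names are the sorted union of all keys seen across the rows.
    feature_names = sorted({key for row in rows for key in row})
    # Keys missing from a row become 0.0; present values are stored as floats.
    matrix = [[float(row.get(name, 0)) for name in feature_names] for row in rows]
    return feature_names, matrix


d = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
names, dense = dict_vectorize(d)
print(names)  # ['bar', 'baz', 'foo']
print(dense)  # [[2.0, 0.0, 1.0], [0.0, 1.0, 3.0]]
```

Note that the values come out as floats even though the inputs are Python ints, which is relevant to the type errors described in this issue.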

Below is a summary of the approaches and their errors. For notation, I show {key_type : value_type} as passed to the DictionaryType constructor.

Approach 1: non-tensor types

The most direct translation of the Python types in d should be {StringType([None, 1]) : Int64Type([None, 1])}.
However, convert_sklearn() raises

TypeError: data_type is not a tensor type but '<class 'onnxconverter_common.data_types.Int64Type'>'

Approach 2: tensor types

As indicated by the type error, I refactored to {StringTensorType([None, 1]) : Int64TensorType([None, 1])}.
Now the sess = ... line raises

onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : 
Type Error: Type (tensor(float)) of output arg (variable) of node (DictVectorizer) 
does not match expected type (tensor(int64)).
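
One way to read this message: the runtime derives the expected output type of the DictVectorizer node from the map's value type (int64 here), while the converted graph declares a float output. The toy check below only illustrates that reading. The rule it encodes is an assumption drawn from the message itself, and expected_output_type and check_declared_output are hypothetical names, not onnxruntime APIs.

```python
def expected_output_type(map_value_type: str) -> str:
    """Assumed rule: the output tensor element type mirrors the map's value type."""
    return f"tensor({map_value_type})"


def check_declared_output(map_value_type: str, declared_output: str) -> str:
    """Return 'ok' or a mismatch message shaped like the one above."""
    expected = expected_output_type(map_value_type)
    if declared_output == expected:
        return "ok"
    return (
        f"Type ({declared_output}) of output arg does not match "
        f"expected type ({expected})"
    )


# Approach 2: int64 map values, but the converted graph declares a float output.
approach_2 = check_declared_output("int64", "tensor(float)")
# Float map values would agree with the declared float output.
float_values = check_declared_output("float", "tensor(float)")
print(approach_2)
print(float_values)
```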

Approach 3: replace int64 with float

ONNX appears to treat the values as floats even though they are ints in Python.
To remedy this, I tried {StringTensorType([None, 1]) : FloatTensorType([None, 1])}.
Now sess.run() raises

RuntimeError: /Users/runner/work/1/s/onnxruntime/python/onnxruntime_pybind_mlvalue.cc:963 
void onnxruntime::python::CreateGenericMLValue(
    const onnxruntime::InputDefList *, 
    const AllocatorPtr &, 
    const std::string &, 
    const py::object &, 
    OrtValue *, 
    bool,
    bool,
    MemCpyFunc
) type_proto.tensor_type().has_elem_type() was false. 
The graph is missing type information needed to construct the ORT tensor

I also see the same approach/error patterns if I wrap DictVectorizer in a Pipeline. Given that map(string, int64) is a supported type, I'm unsure what else to try.

@xadupre
Collaborator

xadupre commented Feb 8, 2024

Did you try with something like DictionaryType(StringTensorType([1]), FloatTensorType([1]))?

@addisonklinke
Author

Thanks for the suggestion @xadupre. I believe I tried that already but forgot to document it in the issue. In that case, sess.run() gives me exactly the same runtime error as approach 3 above.

@xadupre
Collaborator

xadupre commented Feb 21, 2024

Following https://github.com/onnx/sklearn-onnx/blob/main/tests/test_sklearn_dict_vectorizer_converter.py#L35, is it possible to replace the integer values with floats? Integers might not be supported in onnxruntime.

@addisonklinke
Author

@xadupre it could be reasonable to replace the integers with their float equivalents. I tried

d = [{'foo': 1.0, 'bar': 2.0}, {'foo': 3.0, 'baz': 1.0}]
dict_type = DictionaryType(
    key_type=StringTensorType([1]), 
    value_type=FloatTensorType([1]),
)

but this still hits the "graph is missing type information needed to construct the ORT tensor" error during sess.run(...) from approach 3. Just to be sure the floats weren't being implicitly cast back to int, I tried adding some significant digits after the decimal point

d = [{'foo': 1.1, 'bar': 2.1}, {'foo': 3.1, 'baz': 1.1}]

However, this still has the same issue. Any other thoughts?

I also see the same error if I add a sess.run(...) call to the test case you linked, so that test appears to confirm only that convert_sklearn() works, not the rest of my original code snippet.
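
One thing not yet tried in this thread is feeding the session a single dict per call instead of the whole list. The sketch below rests on an unconfirmed assumption: that a map-typed input corresponds to one Python dict, and that a list is instead interpreted as tensor data, which would be consistent with the "missing type information needed to construct the ORT tensor" message. run_row_by_row is a hypothetical helper, and it has not been verified against onnxruntime here.

```python
def run_row_by_row(session, rows, input_name="input"):
    """Call session.run once per dict and collect the first output of each call."""
    outputs = []
    for row in rows:
        # Cast values to float so they match a FloatTensorType value_type.
        feed = {input_name: {key: float(value) for key, value in row.items()}}
        outputs.append(session.run(None, feed)[0])
    return outputs


# Intended usage with the objects from the base script (assumption, untested):
# sess = onnxruntime.InferenceSession(v_onnx.SerializeToString())
# per_row = run_row_by_row(sess, d)
```

If this works, stacking the per-row outputs should give something comparable to the output of v.fit_transform(d).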
