Help converting and running basic DictVectorizer #1068

addisonklinke · 2024-02-02T20:34:53Z

I want to convert the example DictVectorizer from the sklearn docs to ONNX. I've looked at the documented type constraints for OnnxDictVectorizer, but every approach I've tried fails with a different error. Can someone please advise?

Here is the base script I've been modifying:

import onnxruntime
from sklearn.feature_extraction import DictVectorizer
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import (
    DictionaryType,
    FloatTensorType,
    StringType,
    StringTensorType,
    Int64Type,
    Int64TensorType,
)

# Initialize and fit
v = DictVectorizer(sparse=False)
d = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
output = v.fit_transform(d)

# Convert
dict_type = DictionaryType(
    key_type=...,
    value_type=...,
)
v_onnx = convert_sklearn(v, initial_types=[("input", dict_type)])

# Run in ONNX
sess = onnxruntime.InferenceSession(v_onnx.SerializeToString())
inputs = {"input": d}
output_onnx = sess.run(None, inputs)

Below is a summary of the approaches and their errors. For notation, I write {key_type : value_type} for the arguments passed to the DictionaryType constructor.

Approach 1: non-tensor types

The most direct translation of the Python types in d should be {StringType([None, 1]) : Int64Type([None, 1])}.
However, convert_sklearn() raises

TypeError: data_type is not a tensor type but '<class 'onnxconverter_common.data_types.Int64Type'>'

Approach 2: tensor types

As indicated by the type error, I switched to {StringTensorType([None, 1]) : Int64TensorType([None, 1])}.
Now the sess = ... line raises

onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : 
Type Error: Type (tensor(float)) of output arg (variable) of node (DictVectorizer) 
does not match expected type (tensor(int64)).

Approach 3: replace int64 with float

ONNX appears to treat the values as floats even though in Python they are ints.
To remedy this, I tried {StringTensorType([None, 1]) : FloatTensorType([None, 1])}.
Now sess.run() raises

RuntimeError: /Users/runner/work/1/s/onnxruntime/python/onnxruntime_pybind_mlvalue.cc:963 
void onnxruntime::python::CreateGenericMLValue(
    const onnxruntime::InputDefList *, 
    const AllocatorPtr &, 
    const std::string &, 
    const py::object &, 
    OrtValue *, 
    bool,
    bool,
    MemCpyFunc
) type_proto.tensor_type().has_elem_type() was false. 
The graph is missing type information needed to construct the ORT tensor

I also see the same approach/error patterns if I wrap DictVectorizer in a Pipeline. Given that map(string, int64) is a supported type, I'm unsure what else to try.
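One way to narrow down errors like these is to print what the loaded graph actually declares for its inputs and outputs before calling sess.run(). The sketch below assumes only that the session exposes get_inputs() and get_outputs(), each returning objects with name, type, and shape attributes, as onnxruntime.InferenceSession does. The names describe_io and print_io are hypothetical helpers, not part of onnxruntime or skl2onnx.

```python
def describe_io(sess):
    """Return (kind, name, type, shape) tuples for a session's inputs and outputs.

    `sess` is assumed to behave like onnxruntime.InferenceSession.
    describe_io is a hypothetical helper, not a library function.
    """
    rows = []
    for kind, args in (("input", sess.get_inputs()), ("output", sess.get_outputs())):
        for arg in args:
            rows.append((kind, arg.name, arg.type, arg.shape))
    return rows


def print_io(sess):
    """Print one line per declared input and output of the session."""
    for kind, name, type_str, shape in describe_io(sess):
        print(f"{kind:6s} {name}: type={type_str} shape={shape}")
```

Calling print_io(sess) right after the session is created shows whether the input is declared as a map type or as a tensor, which can then be compared with the Python object placed in the feed dictionary.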


xadupre · 2024-02-08T13:30:15Z

Did you try with something like DictionaryType(StringTensorType([1]), FloatTensorType([1]))?

addisonklinke · 2024-02-13T20:43:29Z

Thanks for the suggestion @xadupre. I believe I tried that already but forgot to document it in the issue. In that case, sess.run() gives me the exact same runtime error as approach 3 above.

xadupre · 2024-02-21T18:50:00Z

Following https://github.com/onnx/sklearn-onnx/blob/main/tests/test_sklearn_dict_vectorizer_converter.py#L35, is it possible to replace the integer values with floats? Integer values might not be supported in onnxruntime.
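The replacement suggested here can be done without editing the data by hand. A minimal sketch, assuming every value in the dictionaries is numeric; to_float_values is a hypothetical helper name:

```python
def to_float_values(rows):
    """Return a copy of `rows` (a list of dicts) with every value cast to float.

    Hypothetical helper; assumes all values are numeric.
    """
    return [{key: float(value) for key, value in row.items()} for row in rows]


d = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
d_float = to_float_values(d)
```

The original list is left unchanged, so the integer version is still available for fit_transform().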

addisonklinke · 2024-05-14T21:08:27Z

@xadupre it could be reasonable to replace integers with their float equivalent. I tried

d = [{'foo': 1.0, 'bar': 2.0}, {'foo': 3.0, 'baz': 1.0}]
dict_type = DictionaryType(
    key_type=StringTensorType([1]), 
    value_type=FloatTensorType([1]),
)

but this still hits the "graph is missing type information needed to construct the ORT tensor" error from approach 3 during sess.run(...). To be sure the floats weren't being implicitly cast back to int, I tried adding nonzero digits after the decimal point

d = [{'foo': 1.1, 'bar': 2.1}, {'foo': 3.1, 'baz': 1.1}]

However, this still has the same issue. Any other thoughts?

I also see the same error if I add a sess.run(...) call to the test case you linked, so that test appears to confirm only that convert_sklearn() works, not the remaining steps of my original code snippet.
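A variation not tried above is passing one dictionary per call instead of the whole list. Whether onnxruntime accepts a plain dict for a map-typed input in this setup is an untested assumption, so treat the sketch as an experiment to run and not as a confirmed fix. run_per_row is a hypothetical helper; it relies only on sess.get_inputs() and sess.run(output_names, input_feed).

```python
def run_per_row(sess, rows):
    """Call sess.run() once per dictionary and collect the first output of each call.

    Hypothetical helper. ASSUMPTION (untested): the runtime accepts a single
    dict as the value for a map-typed input.
    """
    input_name = sess.get_inputs()[0].name
    results = []
    for row in rows:
        # Passing None requests every output; [0] keeps the first one.
        results.append(sess.run(None, {input_name: row})[0])
    return results
```

If this runs, each element of the returned list corresponds to one input dictionary and can be compared against the matching row of the sklearn output.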
