-
Notifications
You must be signed in to change notification settings - Fork 125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add incipient wasm_udf support #1654
base: main
Are you sure you want to change the base?
Conversation
Thanks. Can you review @chubei ? |
Sure. |
@tachyonicbytes what do you mean you could not find the parsing of |
@snork-alt I checked #1632 while developing. Is the definition you refer to the
part of the config? In that case, that is precisely what I didn't get. It's not the parsing itself that gets me @chubei, but I don't know what I should import to get the module path |
@chubei, can you help @tachyonicbytes? Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @tachyonicbytes, this is great work and your PR summary is very helpful. I see two places where we can improve in this PR. Please let me know what you think.
- Instead of using environment variables and
wasm_
prefix, we should use a designated config file section to define the UDFs, as @snork-alt suggested.
I understand that you're taking the Python UDF implementation as a reference. The truth is we've learnt more since we merged the Python UDF and we have a clearer understanding of the UDF usage pattern now so we decided to put the UDF definitions in the config file.
You mentioned the problem is that, you don't know how you can get access to the wasm module path defined in the config file. I believe we can add a parameter udfs
to statement_to_pipeline
, passing it all the way down to parse_sql_function
, and there we can look up the function name to determine if it's a UDF, and if it is, which kind of UDF.
- We need to implement double direction data type conversion.
As you mentioned, natively few types are matched. The goal here is to support as many types as possible.
For example, if a wasm UDF returns i32
, we should convert it to i64
to use it in Dozer. Here the principle is no data loss, so we won't convert Dozer i128
to wasm, which we just don't support.
We should also write string
, timestamp
, etc. conversions, so authors of the UDF gets a easy to use type.
I suggest we think about data type conversion thoroughly and produce a design document outlining what we'll do. Once we agree on the design we can get into implementation.
As a reference, you can check #1200 and #1514 to see how we handle json
type conversion between Dozer and Python, Dozer and Arrow, Dozer and Protobuf, etc.
That's all my comments! Thank you for your great work again!
@tachyonicbytes are you planning to implement those changes and close the PR ? |
Yes, I am in the process of writing a bigger message |
Thank you so much for the review, @chubei and @snork-alt. Ok, so I understand that, in order to pass the config variables to the Yeah, the Now, for the data type conversion: So, for WebAssembly, this is what we work with:
Now, there is a catch. In theory, If this is solved, I don't know how
The last one, Of course, there is also the possibility of splitting the data type conversion by parameter or return value. We can decide to implement So, please let me know which of the types above sound doable for this PR. Please let me know of the I'll also repair the |
@tachyonicbytes we just merged a PR that allows return types of SQL UDFs to be specified as |
Perfect. Less matching on strings for |
@tachyonicbytes are you still working on this ? |
Yes. So the config part seems to be more complicated than I anticipated. The problem is that a new section has to be created for wasm udfs. I added it to Another problem is that, in the Changing that type also means going back to the The problem is that the free format
is hard to serialize to simple types, if you don't store it in a string, for example, in an ad-hoc format, like Is there a discord server or some synchronous communication mechanism for outsiders that contribute to dozer? I think some of the problems that I face can be easily solved that way. |
Hi @tachyonicbytes , let's talk on Discord. https://discord.gg/3eWXBgJaEQ And we've added |
@tachyonicbytes any plan for.this ? |
Hi @tachyonicbytes is this ready for review ? If yes please ask @chubei . Thanks |
@snork-alt I talked with @chubei, and the agreed strategy is to wait for this #1838 to be merged and then rebase wasm udfs upon it. Otherwise it would contain a lot of duplicated work that would need to be resolved in conflicts. |
The PR is functional now. I usually check it with the
|
@tachyonicbytes is it possible to infer return type and parameter types from the wasm module itself, rather than having to declare it during the function call ? |
I am 90% sure that you can infer it, but I will have to thoroughly check the docs for that. The other thing is that |
@chubei can you review ? |
I'm leaving it for later of this week. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR looks good in general. We need to improve a little in error handling but that should be trivial.
I think there're still two things missing now: validation and casting.
Validation
As you know, wasm function is statically typed, and every Rust type we support right now can be mapped uniquely to a wasm type. So it's necessary to validate:
- Does the input schema match wasm function params?
- Does the wasm function result have exactly one output?
And we should infer return type instead of asking user to write it in the function signature.
Casting
To support as many types as possible, casting should be performed. If dozer type is U128
or I128
, we should cast it to wasm i64
. It should emit a precision loss warning during validation, and a runtime error if the cast cannot be performed due to overflow at runtime.
Let's skip the support for non-primitive types in this PR.
Onnx module has all these validation and casting functions. You can take that as a reference.
|
||
#[cfg(feature = "wasm")] | ||
#[test] | ||
fn standard_wasm() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test doesn't seem to pass
@tachyonicbytes can we fix the last things so we can merge ? Thanks |
@tachyonicbytes shall we close this or you plan to wrap it up so it can be merged and your bounty paid? |
Yeah, sorry for the delay, I'm wrapping it up. |
Any progress @tachyonicbytes ? |
Yes, I've updated to the new json schemas, as well as working on the new errors, like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great! I left some comments in the code, but they are all minor issues.
Questions:
- Looks like we're missing a
wasm/utils.rs
file? - Let's change the
todo!
s toErr
s. - Let's also validate the input argument conversion before execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like wasm/utils.rs
is missing?
We'll need at least one test to show that the UDF actually runs.
Besides, is it possible to use deno
instead of wasmtime
? As we are already depending on deno
.
dozer-sql/Cargo.toml
Outdated
@@ -19,6 +19,7 @@ enum_dispatch = "0.3.12" | |||
linked-hash-map = { version = "0.5.6", features = ["serde_impl"] } | |||
metrics = "0.21.0" | |||
multimap = "0.9.0" | |||
wasmtime = { version = "9.0.4", optional = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need this dependency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the wasm runtime that we use. This runs the actual wasm functions
dozer-sql/expression/src/wasm/udf.rs
Outdated
let engine = Engine::default(); | ||
let module = Module::from_file(&engine, config).unwrap(); | ||
let mut store = Store::new(&engine, ()); | ||
let instance = Instance::new(&mut store, &module, &[]).unwrap(); | ||
|
||
let wasm_udf_func; | ||
match instance.get_func(&mut store, name) { | ||
Some(func) => { | ||
wasm_udf_func = func; | ||
} | ||
None => { | ||
return Err(Wasm(WasmFunctionMissing( | ||
name.to_string(), | ||
config.to_string(), | ||
))); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These code should go into the compilation phase? Just like the onnx UDF is passing in a Session
instead of the model path.
dozer-sql/expression/src/wasm/udf.rs
Outdated
.collect::<Result<Vec<_>, Error>>()?; | ||
|
||
let engine = Engine::default(); | ||
let module = Module::from_file(&engine, config).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cannot unwrap
dozer-sql/expression/src/wasm/udf.rs
Outdated
let engine = Engine::default(); | ||
let module = Module::from_file(&engine, config).unwrap(); | ||
let mut store = Store::new(&engine, ()); | ||
let instance = Instance::new(&mut store, &module, &[]).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cannot unwrap
This PR adds incipient support for wasm_udf, as defined in #1653. A new feature is added, so compile with
--features wasm
, if you want to test it.I used Wasmtime as the wasm engine to run the wasm code. Wasmer or WasmEdge or some other engine can be chosen as well, with not many modifications, in theory.
The most important part is the type conversion from Dozer types to wasm types, which natively are very few (only i64 and f64 could be used for now). Wasm doesn't natively support i128, and Dozer does not support i32, so there were not many type that matched.
More types can of course be added (Boolean for example is trivial to add, even if it wastes an entire i32). For composite types, like String or Text, binding generation support has to be added.
I could not find the parsing of the
dozer-config.yml
, so I was inspired by the python_udf module to use environment variables. You can defineDOZER_WASM_UDF
to point to a.wasm
file with the exported functions.Testplan:
An AssemblyScript example has been added to
./dozer-tests/wasm_udf/assemblyscript
. The simples way to test the feature is to use thedozer-samples
sql join test.Change the sql to include a simple wasm_udf call:
Follow the rest of the instructions in that example.
To build the AssemblyScript that contains the
wasm_addf
function:The module is now at
./dozer-tests/wasm_udf/assemblyscript/build/debug.wasm
. SetDOZER_WASM_UDF
before you run the example:/claim #1653