Add incipient wasm_udf support #1654

Open
wants to merge 16 commits into
base: main

Conversation

@tachyonicbytes tachyonicbytes commented Jun 19, 2023

This PR adds incipient support for wasm_udf, as defined in #1653. The support sits behind a new feature, so compile with --features wasm if you want to test it.

I used Wasmtime as the wasm engine to run the wasm code. In theory, Wasmer, WasmEdge, or another engine could be swapped in with few modifications.

The most important part is the type conversion from Dozer types to wasm types, of which there are natively very few (only i64 and f64 could be used for now). Wasm doesn't natively support i128, and Dozer does not support i32, so there were not many types that matched.

More types can of course be added (Boolean for example is trivial to add, even if it wastes an entire i32). For composite types, like String or Text, binding generation support has to be added.
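As a rough sketch of the mapping described above, the snippet below pairs a few Dozer types with wasm value types. `DozerType`, `WasmType`, and `to_wasm_type` are hypothetical names made up for illustration; they are not Dozer's or Wasmtime's actual types, and the Boolean arm is the suggested extension rather than something the PR already does.

```rust
/// Hypothetical stand-in for a few Dozer field types (illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum DozerType {
    Int,
    Float,
    Boolean,
    I128,
    String,
}

/// Hypothetical stand-in for the wasm value types used here.
#[derive(Debug, Clone, Copy, PartialEq)]
enum WasmType {
    I32,
    I64,
    F64,
}

/// Returns the wasm type a Dozer type maps to directly,
/// or `None` when there is no direct match.
fn to_wasm_type(ty: DozerType) -> Option<WasmType> {
    match ty {
        DozerType::Int => Some(WasmType::I64),
        DozerType::Float => Some(WasmType::F64),
        // Assumption: a Boolean would occupy a whole i32.
        DozerType::Boolean => Some(WasmType::I32),
        // No native i128 in wasm; strings need binding generation.
        DozerType::I128 | DozerType::String => None,
    }
}
```

Returning `Option` keeps the "no match" case explicit, so callers have to decide what to do with unsupported types.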

I could not find where dozer-config.yml is parsed, so, following the python_udf module, I used environment variables. You can define DOZER_WASM_UDF to point to a .wasm file with the exported functions.

Testplan:

An AssemblyScript example has been added to ./dozer-tests/wasm_udf/assemblyscript. The simplest way to test the feature is to use the dozer-samples sql join test.

Change the sql to include a simple wasm_udf call:

sql: |
  SELECT t.tpep_pickup_datetime AS pickup_time, z.Zone AS zone, wasm_addf(9.5, 10.7, 'float') as WASM
  INTO pickup
  FROM trips t JOIN taxi_zone_lookup z ON t.PULocationID = z.LocationID;

Follow the rest of the instructions in that example.

To build the AssemblyScript that contains the wasm_addf function:

cd ./dozer-tests/wasm_udf/assemblyscript
npm install && npm run asbuild

The module is now at ./dozer-tests/wasm_udf/assemblyscript/build/debug.wasm. Set DOZER_WASM_UDF before you run the example:

DOZER_WASM_UDF=./path/to/debug.wasm dozer

/claim #1653

@algora-pbc algora-pbc bot mentioned this pull request Jun 19, 2023
@CLAassistant

CLAassistant commented Jun 19, 2023

CLA assistant check
All committers have signed the CLA.

@snork-alt
Contributor

Thanks. Can you review @chubei ?

@chubei chubei self-requested a review June 20, 2023 05:12
@chubei
Contributor

chubei commented Jun 20, 2023

> Thanks. Can you review @chubei ?

Sure.

@snork-alt
Contributor

@tachyonicbytes what do you mean you could not find the parsing of dozer-config.yml? @chubei can guide you here. We are trying to centralize the UDF definitions. Also check out #1632 for the UDF definitions.

@tachyonicbytes
Author

@snork-alt I checked #1632 while developing. Is the definition you refer to the

udfs:
  wasm:
    my_function: module.wasm

part of the config?

In that case, that is precisely what I didn't get. It's not the parsing itself that gets me @chubei, but I don't know what I should import to get the module path "path/to/debug.wasm" in wasm_udf.rs. The python module didn't have it, it used env vars, and the onnx module is not ready yet, so I couldn't see how they do it.

@snork-alt
Contributor

@chubei, can you help @tachyonicbytes? Thanks

Contributor

@chubei chubei left a comment

Hi @tachyonicbytes, this is great work and your PR summary is very helpful. I see two places where we can improve in this PR. Please let me know what you think.

  1. Instead of using environment variables and wasm_prefix, we should use a designated config file section to define the UDFs, as @snork-alt suggested.

I understand that you're taking the Python UDF implementation as a reference. The truth is we've learnt more since we merged the Python UDF and we have a clearer understanding of the UDF usage pattern now so we decided to put the UDF definitions in the config file.

You mentioned that the problem is that you don't know how to get access to the wasm module path defined in the config file. I believe we can add a parameter udfs to statement_to_pipeline and pass it all the way down to parse_sql_function. There we can look up the function name to determine whether it's a UDF and, if it is, which kind of UDF.

  2. We need to implement bidirectional data type conversion.

As you mentioned, natively few types are matched. The goal here is to support as many types as possible.

For example, if a wasm UDF returns i32, we should convert it to i64 to use it in Dozer. Here the principle is no data loss, so we won't convert Dozer i128 to wasm, which we just don't support.

We should also write string, timestamp, etc. conversions, so that UDF authors get easy-to-use types.

I suggest we think about data type conversion thoroughly and produce a design document outlining what we'll do. Once we agree on the design we can get into implementation.

As a reference, you can check #1200 and #1514 to see how we handle json type conversion between Dozer and Python, Dozer and Arrow, Dozer and Protobuf, etc.
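The "no data loss" principle for values coming back from wasm could look roughly like the sketch below: every wasm scalar is widened into a Dozer value that can hold it exactly. `WasmVal`, `DozerField`, and `wasm_to_dozer` are hypothetical stand-ins chosen for illustration, not the real Wasmtime or Dozer types.

```rust
/// Hypothetical stand-in for a wasm scalar value (illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum WasmVal {
    I32(i32),
    I64(i64),
    F32(f32),
    F64(f64),
}

/// Hypothetical stand-in for the two Dozer fields involved.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DozerField {
    Int(i64),
    Float(f64),
}

/// Widens a wasm value into a Dozer field without losing information.
fn wasm_to_dozer(value: WasmVal) -> DozerField {
    match value {
        // i32 -> i64 and f32 -> f64 are exact, so `From` is available.
        WasmVal::I32(x) => DozerField::Int(i64::from(x)),
        WasmVal::I64(x) => DozerField::Int(x),
        WasmVal::F32(x) => DozerField::Float(f64::from(x)),
        WasmVal::F64(x) => DozerField::Float(x),
    }
}
```

Using `From` rather than `as` means the compiler rejects any conversion that could lose data, which matches the stated principle.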

That's all my comments! Thank you for your great work again!

@snork-alt
Contributor

@tachyonicbytes are you planning to implement those changes and close the PR ?

@tachyonicbytes
Author

Yes, I am in the process of writing a bigger message

@tachyonicbytes
Author

tachyonicbytes commented Jun 22, 2023

Thank you so much for the review, @chubei and @snork-alt.

Ok, so I understand that, in order to pass the config variables to the wasm_udf module, we need to propagate those values through the necessary functions. I thought that the project had something akin to a global config module that I import and use. That's what I was looking for, and could not find.
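A minimal sketch of that propagation idea, assuming the UDF definitions arrive as a map from function name to UDF kind. `UdfKind`, `FunctionKind`, and `classify` are hypothetical names for illustration; the real signatures of statement_to_pipeline and parse_sql_function are not shown in this thread.

```rust
use std::collections::HashMap;

/// Hypothetical UDF kinds read from the config (illustration only).
#[derive(Debug, Clone, PartialEq)]
enum UdfKind {
    Wasm { path: String },
    Onnx { path: String },
}

/// Result of looking a function name up in the configured UDFs.
#[derive(Debug, PartialEq)]
enum FunctionKind<'a> {
    Udf(&'a UdfKind),
    Other,
}

/// Decides whether `name` is a configured UDF and, if so, which kind.
/// The map is passed in explicitly instead of being read from a global.
fn classify<'a>(udfs: &'a HashMap<String, UdfKind>, name: &str) -> FunctionKind<'a> {
    match udfs.get(name) {
        Some(kind) => FunctionKind::Udf(kind),
        None => FunctionKind::Other,
    }
}
```

Passing the map as a parameter keeps the parser free of global state, at the cost of threading one more argument through the call chain.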

Yeah, the python_udf module has been a semi-reference for me, because when I looked at it in more depth, it seemed incomplete. But it was still a nice starting point.

Now, for the data type conversion:

So, for WebAssembly, this is what we work with: i32, i64, f32, f64, v128, and linear memory for arbitrary amounts of memory. In Dozer, we have (please correct me if I am wrong with any of it):

  • UInt
  • U128
  • Int
  • I128
  • Float
  • Boolean
  • String
  • Text
  • Binary
  • Decimal
  • Timestamp
  • Date
  • Json

Int and Float are done. Boolean is trivial, albeit it wastes 31 bits for every value. UInt should be doable, even though we may have to impose certain restrictions on its usage. String, Text, and Binary all involve the linear memory in some capacity, so you kinda have to solve them all at once; it's only one data type's worth of work, plus idiosyncrasies.
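To make the linear-memory point concrete, here is a small sketch that treats a byte slice as the module's memory and passes a string as a (pointer, length) pair. It does not use any wasm engine; `write_str` and `read_str` are hypothetical helpers, and a real implementation would also need the guest to allocate the buffer.

```rust
/// Copies `s` into `memory` at `offset` and returns its (ptr, len) pair,
/// or `None` if the string does not fit.
fn write_str(memory: &mut [u8], offset: usize, s: &str) -> Option<(u32, u32)> {
    let bytes = s.as_bytes();
    let end = offset.checked_add(bytes.len())?;
    // `get_mut` performs the bounds check for us.
    memory.get_mut(offset..end)?.copy_from_slice(bytes);
    Some((u32::try_from(offset).ok()?, u32::try_from(bytes.len()).ok()?))
}

/// Reads `len` bytes starting at `ptr` and decodes them as UTF-8.
fn read_str(memory: &[u8], ptr: u32, len: u32) -> Option<String> {
    let start = ptr as usize;
    let end = start.checked_add(len as usize)?;
    let bytes = memory.get(start..end)?;
    String::from_utf8(bytes.to_vec()).ok()
}
```

The same pair-of-integers scheme would cover Text and Binary, which is why these types tend to be solved together.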

Now, there is a catch. In theory, U128 and I128 are easily solvable with v128, which, even though it is a SIMD type, can support the usual scalar operations that we do on I128 and U128 (with the exception of division, which can be done more slowly or restricted). But the wasmtime crate does not implement v128. AssemblyScript has it, but we can't use it because of that. The wasmer crate, on the other hand, supports it.

If this is solved, UInt can be half of a v128, even though that is, again, not fully efficient and I128 and U128 have the caveat above, no or slow division.

I don't know how Decimal works inside Dozer. Is it an Int type that you don't fully use? That's how it was implemented in intro CS classes, if I recall correctly. If so, that is an easy conversion, but it's not natively supported in any language, so users will have to implement their own semantics. If it is implemented as a String, then it falls in the linear memory category.

Timestamp and Date should be strings, I guess, so linear memory for them as well.

The last one, Json. This is the one that I would personally suggest skipping at least for now. The problem with WebAssembly and json is not that people don't need it, but it is kinda hard to do without proper reflection. AssemblyScript has no native json support. It has a json package that is great in its own way, but it doesn't support exception handling, so it's kinda hard to validate your json on the udf end. TinyGo does not support the standard library json, although it has some non-reflective packages.

Of course, there is also the possibility of splitting the data type conversion by parameter or return value. We can decide to implement Json for parameters, but not for return values, for example, so you always have valid jsons in your udf code.

So, please let me know which of the types above sound doable for this PR. Please let me know of the wasmtime <-> wasmer decision. And please further explain what is needed of Timestamp and Date.

I'll also repair the python_udf module and add a TinyGo wasm example, if you don't mind.

@snork-alt
Contributor

@tachyonicbytes we just merged a PR that allows return types of SQL UDFs to be specified as function<Type>. This can be useful for WASM functions.

@tachyonicbytes
Author

Perfect. Less matching on strings for wasm_udfs as well

@snork-alt
Contributor

@tachyonicbytes are you still working on this ?

@tachyonicbytes
Author

Yes. So the config part seems to be more complicated than I anticipated. The problem is that a new section has to be created for wasm udfs. I added it to api_config, because it seems to have more in common with the api_config options than with the other sections.

Another problem is that, in the ExecutorOptions at dozer-core/src/executor.rs the wasm udfs need to be added. I opted for this: pub udfs: HashMap<String, HashMap<String, String>>. But the HashMap is not serializable with prost, so another solution has to be found, as all the other types are simple types.

Changing that type also means going back to the UdfOptions struct I put in dozer-types/src/models/api_config.rs and changing the type there as well.

The problem is that the free format

wasm_udfs:
    - module1: function1
    - module1: function2
    - module2: function3

is hard to serialize to simple types, if you don't store it in a string, for example, in an ad-hoc format, like "module1|function1|module1|function2" etc. But maybe that's an option as well.
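For what it's worth, the ad-hoc string format mentioned above could be sketched as follows. `encode_udfs` and `decode_udfs` are hypothetical helpers, and the sketch assumes that neither module nor function names contain the `|` separator.

```rust
/// Flattens (module, function) pairs into "module|function|module|function".
/// Assumption: names never contain '|'.
fn encode_udfs(pairs: &[(String, String)]) -> String {
    let mut parts = Vec::with_capacity(pairs.len() * 2);
    for (module, function) in pairs {
        parts.push(module.as_str());
        parts.push(function.as_str());
    }
    parts.join("|")
}

/// Parses the flat string back into pairs.
/// Returns `None` when the number of segments is odd.
fn decode_udfs(s: &str) -> Option<Vec<(String, String)>> {
    if s.is_empty() {
        return Some(Vec::new());
    }
    let parts: Vec<&str> = s.split('|').collect();
    if parts.len() % 2 != 0 {
        return None;
    }
    Some(
        parts
            .chunks(2)
            .map(|pair| (pair[0].to_string(), pair[1].to_string()))
            .collect(),
    )
}
```

The format fits in a single string field, but it pushes validation (odd segment counts, separator characters in names) onto the decoder.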

Is there a discord server or some synchronous communication mechanism for outsiders that contribute to dozer? I think some of the problems that I face can be easily solved that way.

@chubei
Contributor

chubei commented Aug 15, 2023

Hi @tachyonicbytes , let's talk on Discord. https://discord.gg/3eWXBgJaEQ

And we've added udfs config section in #1831. You can follow format in that PR.

@snork-alt
Contributor

@tachyonicbytes any plan for this?

@snork-alt
Contributor

Hi @tachyonicbytes is this ready for review ? If yes please ask @chubei . Thanks

@tachyonicbytes
Author

@snork-alt I talked with @chubei, and the agreed strategy is to wait for #1838 to be merged and then rebase wasm udfs onto it. Otherwise this PR would contain a lot of duplicated work that would have to be resolved as conflicts.

@tachyonicbytes
Author

The PR is functional now. I usually check it with the dozer-samples/sql/join sample:

version: 1
app_name: sql-join-sample
connections:
  - config : !LocalStorage
      details:
        path: data
      tables:
        - !Table
          name: taxi_zone_lookup
          config: !CSV
            path: zones
            extension: .csv
        - !Table
          name: trips
          config: !Parquet
            path: trips
            extension: .parquet
    name: ny_taxi

sql: |
  SELECT t.tpep_pickup_datetime AS pickup_time, z.Zone AS zone, fib<Int>(10) as WASM
  INTO pickup
  FROM trips t JOIN taxi_zone_lookup z ON t.PULocationID = z.LocationID;

sources:
  - name: taxi_zone_lookup
    table_name: taxi_zone_lookup
    connection: ny_taxi
  - name: trips
    table_name: trips
    connection: ny_taxi

endpoints:
  - name: pickup
    path: /pickup
    table_name: pickup

udfs:
  - config: !Wasm
      path: /path/to/dozer/dozer-tests/wasm_udf/assemblyscript/build/debug.wasm
    name: fib

@snork-alt
Contributor

@tachyonicbytes is it possible to infer return type and parameter types from the wasm module itself, rather than having to declare it during the function call ?

@snork-alt snork-alt requested a review from chubei September 3, 2023 18:15
@tachyonicbytes
Author

I am 90% sure that you can infer it, but I will have to thoroughly check the docs for that. The other thing is that return_type is necessary there so that Dozer requests from the wasm module the actual type it wants, so that there is less confusion during the conversion.

@snork-alt
Contributor

@chubei can you review ?

@chubei
Contributor

chubei commented Sep 5, 2023

> @chubei can you review ?

I'll leave it for later this week.

Contributor

@chubei chubei left a comment

This PR looks good in general. We need to improve a little in error handling but that should be trivial.

I think there're still two things missing now: validation and casting.

Validation

As you know, wasm functions are statically typed, and every Rust type we support right now can be mapped uniquely to a wasm type. So it's necessary to validate:

  • Does the input schema match wasm function params?
  • Does the wasm function result have exactly one output?

And we should infer return type instead of asking user to write it in the function signature.
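The two checks above, plus return type inference, might be sketched like this. `WasmValType`, `ValidationError`, and `validate` are hypothetical names for illustration; the sketch assumes the input schema has already been mapped to wasm types and that the function's parameter and result types were read from the module.

```rust
/// Hypothetical stand-in for wasm value types (illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum WasmValType {
    I32,
    I64,
    F32,
    F64,
}

#[derive(Debug, PartialEq)]
enum ValidationError {
    ArityMismatch { expected: usize, actual: usize },
    ParamMismatch { index: usize, expected: WasmValType, actual: WasmValType },
    ResultCount(usize),
}

/// Checks the call arguments against the function's parameters and
/// returns the inferred return type when exactly one result exists.
fn validate(
    params: &[WasmValType],
    results: &[WasmValType],
    args: &[WasmValType],
) -> Result<WasmValType, ValidationError> {
    if params.len() != args.len() {
        return Err(ValidationError::ArityMismatch {
            expected: params.len(),
            actual: args.len(),
        });
    }
    for (index, (expected, actual)) in params.iter().zip(args).enumerate() {
        if expected != actual {
            return Err(ValidationError::ParamMismatch {
                index,
                expected: *expected,
                actual: *actual,
            });
        }
    }
    match results {
        [single] => Ok(*single),
        other => Err(ValidationError::ResultCount(other.len())),
    }
}
```

Because the function's own signature supplies the result type, the caller would no longer need to spell it out in the SQL.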

Casting

To support as many types as possible, casting should be performed. If dozer type is U128 or I128, we should cast it to wasm i64. It should emit a precision loss warning during validation, and a runtime error if the cast cannot be performed due to overflow at runtime.
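A sketch of the runtime half of that rule, using the standard library's checked integer conversions. `CastError`, `cast_i128`, and `cast_u128` are hypothetical names; the validation-time precision warning is not shown.

```rust
/// Hypothetical runtime error for a failed narrowing cast.
#[derive(Debug, PartialEq)]
enum CastError {
    I128Overflow(i128),
    U128Overflow(u128),
}

/// Narrows a Dozer I128 to a wasm i64, failing if the value does not fit.
fn cast_i128(value: i128) -> Result<i64, CastError> {
    i64::try_from(value).map_err(|_| CastError::I128Overflow(value))
}

/// Narrows a Dozer U128 to a wasm i64, failing if the value does not fit.
fn cast_u128(value: u128) -> Result<i64, CastError> {
    i64::try_from(value).map_err(|_| CastError::U128Overflow(value))
}
```

`try_from` never truncates silently, so an out-of-range value surfaces as an `Err` that can be reported as the runtime error described above.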

Let's skip the support for non-primitive types in this PR.

Onnx module has all these validation and casting functions. You can take that as a reference.


#[cfg(feature = "wasm")]
#[test]
fn standard_wasm() {
Contributor

This test doesn't seem to pass

@snork-alt
Contributor

@tachyonicbytes can we fix the last things so we can merge ? Thanks

@snork-alt
Contributor

@tachyonicbytes shall we close this or you plan to wrap it up so it can be merged and your bounty paid?

@tachyonicbytes
Author

Yeah, sorry for the delay, I'm wrapping it up.

@snork-alt
Contributor

Any progress @tachyonicbytes ?

@tachyonicbytes
Author

Yes, I've updated to the new json schemas, and I'm working on the new errors, like onnx does. With testing and all it should be ready in maybe a couple of days, maybe Sunday.

Contributor

@chubei chubei left a comment

This looks great! I left some comments in the code, but they are all minor issues.

Questions:

  • Looks like we're missing a wasm/utils.rs file?
  • Let's change the todo!s to Errs.
  • Let's also validate the input argument conversion before execution.

Contributor

@chubei chubei left a comment

Looks like wasm/utils.rs is missing?

We'll need at least one test to show that the UDF actually runs.

Besides, is it possible to use deno instead of wasmtime? As we are already depending on deno.

@@ -19,6 +19,7 @@ enum_dispatch = "0.3.12"
linked-hash-map = { version = "0.5.6", features = ["serde_impl"] }
metrics = "0.21.0"
multimap = "0.9.0"
wasmtime = { version = "9.0.4", optional = true }
Contributor

We don't need this dependency?

Author

This is the wasm runtime that we use. This runs the actual wasm functions

Comment on lines 22 to 38
let engine = Engine::default();
let module = Module::from_file(&engine, config).unwrap();
let mut store = Store::new(&engine, ());
let instance = Instance::new(&mut store, &module, &[]).unwrap();

let wasm_udf_func;
match instance.get_func(&mut store, name) {
    Some(func) => {
        wasm_udf_func = func;
    }
    None => {
        return Err(Wasm(WasmFunctionMissing(
            name.to_string(),
            config.to_string(),
        )));
    }
}
Contributor

This code should go into the compilation phase? Just like the onnx UDF is passing in a Session instead of the model path.

.collect::<Result<Vec<_>, Error>>()?;

let engine = Engine::default();
let module = Module::from_file(&engine, config).unwrap();
Contributor

Cannot unwrap

let engine = Engine::default();
let module = Module::from_file(&engine, config).unwrap();
let mut store = Store::new(&engine, ());
let instance = Instance::new(&mut store, &module, &[]).unwrap();
Contributor

Cannot unwrap
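One way to address the unwrap comments, sketched without depending on any wasm engine: wrap the fallible loading step so its error is mapped into a UDF error instead of panicking. `WasmError`, `load_checked`, and `find_func` are hypothetical names for illustration; the `loader` closure stands in for whatever engine call actually loads the module.

```rust
use std::fmt::Display;

/// Hypothetical error type for wasm UDF setup (illustration only).
#[derive(Debug, PartialEq)]
enum WasmError {
    ModuleLoad { path: String, reason: String },
    FunctionMissing { name: String, path: String },
}

/// Runs a fallible loader and converts its error instead of unwrapping.
fn load_checked<M, E: Display>(
    path: &str,
    loader: impl FnOnce(&str) -> Result<M, E>,
) -> Result<M, WasmError> {
    loader(path).map_err(|e| WasmError::ModuleLoad {
        path: path.to_string(),
        reason: e.to_string(),
    })
}

/// Turns a failed function lookup into an error.
fn find_func<F>(lookup: Option<F>, name: &str, path: &str) -> Result<F, WasmError> {
    lookup.ok_or_else(|| WasmError::FunctionMissing {
        name: name.to_string(),
        path: path.to_string(),
    })
}
```

With both steps returning `Result`, the caller can chain them with `?` and surface a descriptive error when the module or function is missing.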

4 participants