Add incipient wasm_udf support #1654

tachyonicbytes · 2023-06-19T23:47:00Z

This PR adds incipient support for wasm_udf, as defined in #1653. A new feature is added, so compile with --features wasm, if you want to test it.

I used Wasmtime as the wasm engine to run the wasm code. Wasmer or WasmEdge or some other engine can be chosen as well, with not many modifications, in theory.

The most important part is the type conversion from Dozer types to wasm types, of which there are natively very few (only i64 and f64 could be used for now). Wasm doesn't natively support i128, and Dozer does not support i32, so there were not many types that matched.

More types can of course be added (Boolean for example is trivial to add, even if it wastes an entire i32). For composite types, like String or Text, binding generation support has to be added.
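A minimal sketch of the conversion described above. Both enums are hypothetical, simplified stand-ins (not Dozer's actual field type and not Wasmtime's value type), and the function name is made up for illustration:

```rust
/// Hypothetical, simplified stand-in for a Dozer field value.
#[derive(Debug, Clone, PartialEq)]
pub enum Field {
    Int(i64),
    Float(f64),
    Boolean(bool),
    String(String),
}

/// Hypothetical stand-in for the numeric values a wasm function accepts.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum WasmVal {
    I32(i32),
    I64(i64),
    F64(f64),
}

/// Convert a field to a wasm value, or report that the type is unsupported.
pub fn field_to_wasm(field: &Field) -> Result<WasmVal, String> {
    match field {
        Field::Int(v) => Ok(WasmVal::I64(*v)),
        Field::Float(v) => Ok(WasmVal::F64(*v)),
        // A boolean occupies a whole i32 on the wasm side.
        Field::Boolean(b) => Ok(WasmVal::I32(i32::from(*b))),
        // Composite types would need binding generation, so they are rejected here.
        other => Err(format!("unsupported type for wasm UDF: {other:?}")),
    }
}
```

Returning an error for unsupported types keeps the match exhaustive, so a newly added field variant cannot silently fall through.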

I could not find the parsing of the dozer-config.yml, so I was inspired by the python_udf module to use environment variables. You can define DOZER_WASM_UDF to point to a .wasm file with the exported functions.
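For illustration, reading that variable could look roughly like this. The helper names are hypothetical and this is not necessarily how the PR does it; only `DOZER_WASM_UDF` comes from the description above:

```rust
use std::env;
use std::ffi::OsString;
use std::path::PathBuf;

/// Turn a raw environment value into a module path, treating an empty
/// value the same as an unset one.
pub fn wasm_udf_path_from(value: Option<OsString>) -> Option<PathBuf> {
    value.filter(|v| !v.is_empty()).map(PathBuf::from)
}

/// Read the module path from the DOZER_WASM_UDF environment variable.
pub fn wasm_udf_path() -> Option<PathBuf> {
    wasm_udf_path_from(env::var_os("DOZER_WASM_UDF"))
}
```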

Testplan:

An AssemblyScript example has been added to ./dozer-tests/wasm_udf/assemblyscript. The simplest way to test the feature is to use the dozer-samples sql join test.

Change the sql to include a simple wasm_udf call:

sql: |
  SELECT t.tpep_pickup_datetime AS pickup_time, z.Zone AS zone, wasm_addf(9.5, 10.7, 'float') as WASM
  INTO pickup
  FROM trips t JOIN taxi_zone_lookup z ON t.PULocationID = z.LocationID;

Follow the rest of the instructions in that example.

To build the AssemblyScript that contains the wasm_addf function:

cd ./dozer-tests/wasm_udf/assemblyscript
npm install && npm run asbuild

The module is now at ./dozer-tests/wasm_udf/assemblyscript/build/debug.wasm. Set DOZER_WASM_UDF before you run the example:

DOZER_WASM_UDF=./path/to/debug.wasm dozer

/claim #1653

CLAassistant · 2023-06-19T23:47:05Z

All committers have signed the CLA.

snork-alt · 2023-06-20T05:11:43Z

Thanks. Can you review @chubei ?

chubei · 2023-06-20T05:12:45Z

Thanks. Can you review @chubei ?

Sure.

snork-alt · 2023-06-20T07:31:14Z

@tachyonicbytes what do you mean you could not find the parsing of dozer-config.yml? @chubei can guide you here. We are trying to centralize the UDF definitions. Also check out #1632 for the UDF definitions.

tachyonicbytes · 2023-06-20T08:20:54Z

@snork-alt I checked #1632 while developing. Is the definition you refer to the

udfs:
  wasm:
    my_function: module.wasm

part of the config?

In that case, that is precisely what I didn't get. It's not the parsing itself that gets me @chubei, but I don't know what I should import to get the module path "path/to/debug.wasm" in wasm_udf.rs. The python module didn't have it, it used env vars, and the onnx module is not ready yet, so I couldn't see how they do it.

snork-alt · 2023-06-20T13:17:15Z

@chubei, can you help @tachyonicbytes? Thanks

chubei

Hi @tachyonicbytes, this is great work and your PR summary is very helpful. I see two places where we can improve in this PR. Please let me know what you think.

Instead of using environment variables and wasm_prefix, we should use a designated config file section to define the UDFs, as @snork-alt suggested.

I understand that you're taking the Python UDF implementation as a reference. The truth is we've learnt more since we merged the Python UDF and we have a clearer understanding of the UDF usage pattern now so we decided to put the UDF definitions in the config file.

You mentioned that the problem is you don't know how to get access to the wasm module path defined in the config file. I believe we can add a parameter udfs to statement_to_pipeline, passing it all the way down to parse_sql_function, and there we can look up the function name to determine if it's a UDF, and if it is, which kind of UDF.
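Roughly, the lookup at the end of that chain could look like the sketch below. All type and function names here are hypothetical placeholders, not existing Dozer APIs; the point is only that the map built from the config is consulted by function name:

```rust
use std::collections::HashMap;

/// Hypothetical description of where a UDF comes from.
#[derive(Debug, Clone, PartialEq)]
pub enum UdfSource {
    Wasm { path: String },
    Onnx { path: String },
}

/// What a function name found in the SQL resolves to.
#[derive(Debug, PartialEq)]
pub enum Resolved<'a> {
    Udf(&'a UdfSource),
    Builtin,
}

/// Hypothetical lookup done where the SQL function call is parsed:
/// `udfs` is the map passed down from the config file section.
pub fn resolve_function<'a>(
    name: &str,
    udfs: &'a HashMap<String, UdfSource>,
) -> Resolved<'a> {
    match udfs.get(name) {
        Some(source) => Resolved::Udf(source),
        // Not declared in the config, so treat it as a regular SQL function.
        None => Resolved::Builtin,
    }
}
```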

We need to implement bidirectional data type conversion.

As you mentioned, natively few types are matched. The goal here is to support as many types as possible.

For example, if a wasm UDF returns i32, we should convert it to i64 to use it in Dozer. Here the principle is no data loss, so we won't convert Dozer i128 to wasm, which we just don't support.
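As a small illustration of the "no data loss" principle (the function names are hypothetical), widening uses the infallible `From` conversions, which only exist in the lossless direction:

```rust
/// Widen a wasm i32 result to the 64-bit integer Dozer works with.
pub fn widen_i32(value: i32) -> i64 {
    i64::from(value)
}

/// Widen a wasm f32 result to the 64-bit float Dozer works with.
pub fn widen_f32(value: f32) -> f64 {
    f64::from(value)
}
```

There is deliberately no `i128` counterpart here, since that direction cannot be lossless.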

We should also write string, timestamp, etc. conversions, so UDF authors get easy-to-use types.

I suggest we think about data type conversion thoroughly and produce a design document outlining what we'll do. Once we agree on the design we can get into implementation.

As a reference, you can check #1200 and #1514 to see how we handle json type conversion between Dozer and Python, Dozer and Arrow, Dozer and Protobuf, etc.

That's all my comments! Thank you for your great work again!

snork-alt · 2023-06-22T15:27:32Z

@tachyonicbytes are you planning to implement those changes and close the PR ?

tachyonicbytes · 2023-06-22T15:30:40Z

Yes, I am in the process of writing a bigger message

tachyonicbytes · 2023-06-22T16:47:59Z

Thank you so much for the review, @chubei and @snork-alt.

Ok, so I understand that, in order to pass the config variables to the wasm_udf module, we need to propagate those values through the necessary functions. I thought that the project had something akin to a global config module that I import and use. That's what I was looking for, and could not find.

Yeah, the python_udf module has been a semi-reference for me, because when I looked at it in more depth, it seemed incomplete. But it was still a nice starting point.

Now, for the data type conversion:

So, for WebAssembly, this is what we work with: i32, i64, f32, f64, v128, and linear memory for arbitrary amounts of memory. In Dozer, we have (please correct me if I am wrong with any of it):

Int and Float are done. Boolean is trivial, albeit it wastes 31 bits for every value. UInt should be doable, even though we may have to impose certain restrictions on its usage. String, Text, Binary all involve the linear memory in some capacity, so you kinda have to solve them all at once, so it's only one data type's worth of work, plus idiosyncrasies.
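To make the linear memory part concrete, here is a rough sketch of the usual (pointer, length) convention. `LinearMemory` is a hypothetical stand-in backed by a `Vec<u8>`, not a real wasm memory, and it assumes UTF-8 (AssemblyScript strings are UTF-16 internally, so a real binding would have to convert):

```rust
/// Hypothetical stand-in for a wasm module's linear memory.
pub struct LinearMemory {
    bytes: Vec<u8>,
    next_free: usize,
}

impl LinearMemory {
    pub fn new(size: usize) -> Self {
        Self { bytes: vec![0; size], next_free: 0 }
    }

    /// Copy a string in and return its (pointer, length) pair as i32s,
    /// which is how a wasm32 function would receive it.
    /// Returns None if the string does not fit.
    pub fn write_str(&mut self, s: &str) -> Option<(i32, i32)> {
        let end = self.next_free.checked_add(s.len())?;
        if end > self.bytes.len() {
            return None;
        }
        let ptr = i32::try_from(self.next_free).ok()?;
        let len = i32::try_from(s.len()).ok()?;
        self.bytes[self.next_free..end].copy_from_slice(s.as_bytes());
        self.next_free = end;
        Some((ptr, len))
    }

    /// Read a UTF-8 string back from a (pointer, length) pair.
    /// Returns None for out-of-bounds ranges or invalid UTF-8.
    pub fn read_str(&self, ptr: i32, len: i32) -> Option<&str> {
        let start = usize::try_from(ptr).ok()?;
        let end = start.checked_add(usize::try_from(len).ok()?)?;
        std::str::from_utf8(self.bytes.get(start..end)?).ok()
    }
}
```

Binary would use the same pair without the UTF-8 check, which is why these types are really one piece of work.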

Now, there is a catch. In theory, U128 and I128 are easily solvable with v128, which, even though it is a SIMD type, can support the usual scalar operations that we do on I128 and U128 (with the exception of division, which can be done slower or it can be restricted). But the wasmtime crate does not implement v128. It is in AssemblyScript, but we can't use it because of that. The wasmer crate, on the other hand, supports it.

If this is solved, UInt can be half of a v128, even though that is, again, not fully efficient and I128 and U128 have the caveat above, no or slow division.

I don't know how Decimal works inside Dozer. Is it an Int type that you don't fully use? That's how it was implemented in intro CS classes, if I recall correctly. If so, that is an easy conversion, but it's not natively supported in any language, so users will have to implement their own semantics. If it is implemented as a String, then it falls in the linear memory category.

Timestamp and Date should be strings, I guess, so linear memory for them as well.

The last one, Json. This is the one that I would personally suggest skipping at least for now. The problem with WebAssembly and json is not that people don't need it, but it is kinda hard to do without proper reflection. AssemblyScript has no native json support. It has a json package that is great in its own way, but it doesn't support exception handling, so it's kinda hard to validate your json on the udf end. TinyGo does not support the standard library json, although it has some non-reflective packages.

Of course, there is also the possibility of splitting the data type conversion by parameter or return value. We can decide to implement Json for parameters, but not for return values, for example, so you always have valid jsons in your udf code.

So, please let me know which of the types above sound doable for this PR. Please let me know of the wasmtime <-> wasmer decision. And please further explain what is needed of Timestamp and Date.

I'll also repair the python_udf module and add a TinyGo wasm example, if you don't mind.

dozer-sql/src/pipeline/expression/builder.rs

snork-alt · 2023-06-27T10:21:14Z

@tachyonicbytes we just merged a PR that allows return types of SQL UDFs to be specified as function<Type>. This can be useful for WASM functions.

tachyonicbytes · 2023-06-27T11:37:10Z

Perfect. Less matching on strings for wasm_udfs as well

snork-alt · 2023-08-14T16:18:16Z

@tachyonicbytes are you still working on this ?

tachyonicbytes · 2023-08-15T00:08:49Z

Yes. So the config part seems to be more complicated than I anticipated. The problem is that a new section has to be created for wasm udfs. I added it to api_config, because it seems to have more in common with the other api_config options than the others.

Another problem is that, in the ExecutorOptions at dozer-core/src/executor.rs the wasm udfs need to be added. I opted for this: pub udfs: HashMap<String, HashMap<String, String>>. But the HashMap is not serializable with prost, so another solution has to be found, as all the other types are simple types.

Changing that type also means going back to the UdfOptions struct I put in dozer-types/src/models/api_config.rs and changing the type there as well.

The problem is that the free format

wasm_udfs:
    - module1: function1
    - module1: function2
    - module2: function3

is hard to serialize to simple types, if you don't store it in a string, for example, in an ad-hoc format, like "module1|function1|module1|function2" etc. But maybe that's an option as well.
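Just to show what I mean by the ad-hoc format, a rough sketch (function names are made up; it assumes module and function names never contain `|`):

```rust
/// Encode (module, function) pairs as "module1|function1|module1|function2".
/// Assumes no name contains the '|' separator.
pub fn encode_udfs(pairs: &[(String, String)]) -> String {
    pairs
        .iter()
        .flat_map(|(module, function)| [module.as_str(), function.as_str()])
        .collect::<Vec<_>>()
        .join("|")
}

/// Decode the string back into (module, function) pairs.
pub fn decode_udfs(encoded: &str) -> Result<Vec<(String, String)>, String> {
    if encoded.is_empty() {
        return Ok(Vec::new());
    }
    let parts: Vec<&str> = encoded.split('|').collect();
    if parts.len() % 2 != 0 {
        return Err(format!("odd number of fields in {encoded:?}"));
    }
    Ok(parts
        .chunks(2)
        .map(|pair| (pair[0].to_string(), pair[1].to_string()))
        .collect())
}
```

It works, but it pushes validation to decode time, which is part of why I am not fond of it.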

Is there a discord server or some synchronous communication mechanism for outsiders that contribute to dozer? I think some of the problems that I face can be easily solved that way.

chubei · 2023-08-15T02:18:08Z

Hi @tachyonicbytes , let's talk on Discord. https://discord.gg/3eWXBgJaEQ

And we've added udfs config section in #1831. You can follow format in that PR.

snork-alt · 2023-08-25T04:09:04Z

@tachyonicbytes any plan for this?

snork-alt · 2023-08-30T16:58:06Z

Hi @tachyonicbytes is this ready for review ? If yes please ask @chubei . Thanks

tachyonicbytes · 2023-09-01T09:44:12Z

@snork-alt I talked with @chubei, and the agreed strategy is to wait for #1838 to be merged and then rebase wasm udfs upon it. Otherwise it would contain a lot of duplicated work that would need to be resolved in conflicts.

.github/workflows/coverage.yaml

tachyonicbytes · 2023-09-02T17:54:20Z

The PR is functional now. I usually check it with the dozer-samples/sql/join sample:

version: 1
app_name: sql-join-sample
connections:
  - config : !LocalStorage
      details:
        path: data
      tables:
        - !Table
          name: taxi_zone_lookup
          config: !CSV
            path: zones
            extension: .csv
        - !Table
          name: trips
          config: !Parquet
            path: trips
            extension: .parquet
    name: ny_taxi

sql: |
  SELECT t.tpep_pickup_datetime AS pickup_time, z.Zone AS zone, fib<Int>(10) as WASM
  INTO pickup
  FROM trips t JOIN taxi_zone_lookup z ON t.PULocationID = z.LocationID;

sources:
  - name: taxi_zone_lookup
    table_name: taxi_zone_lookup
    connection: ny_taxi
  - name: trips
    table_name: trips
    connection: ny_taxi

endpoints:
  - name: pickup
    path: /pickup
    table_name: pickup

udfs:
  - config: !Wasm
      path: /path/to/dozer/dozer-tests/wasm_udf/assemblyscript/build/debug.wasm
    name: fib

snork-alt · 2023-09-03T18:14:29Z

@tachyonicbytes is it possible to infer return type and parameter types from the wasm module itself, rather than having to declare it during the function call ?

tachyonicbytes · 2023-09-03T18:45:16Z

I am 90% sure that you can infer it, but I will have to thoroughly check the docs for that. The other thing is that return_type is necessary there so that Dozer requests from the wasm module the actual type it wants, so that there is less confusion during the conversion.

snork-alt · 2023-09-05T13:22:54Z

@chubei can you review ?

chubei · 2023-09-05T13:23:35Z

@chubei can you review ?

I'll leave it for later this week.

chubei

This PR looks good in general. We need to improve a little in error handling but that should be trivial.

I think there're still two things missing now: validation and casting.

Validation

As you know, wasm functions are statically typed, and every Rust type we support right now can be mapped uniquely to a wasm type. So it's necessary to validate:

Does the input schema match wasm function params?
Does the wasm function result have exactly one output?

And we should infer return type instead of asking user to write it in the function signature.
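A rough sketch of what these checks could look like. The enums and function names are hypothetical simplifications, not the Onnx module's or Wasmtime's actual types, and the type mapping shown is an assumption for illustration:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum WasmType { I32, I64, F32, F64 }

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FieldType { Int, Float, Boolean }

/// Assumed mapping: each supported Dozer type maps to exactly one wasm type.
pub fn expected_wasm_type(field_type: FieldType) -> WasmType {
    match field_type {
        FieldType::Int => WasmType::I64,
        FieldType::Float => WasmType::F64,
        FieldType::Boolean => WasmType::I32,
    }
}

/// Infer the Dozer return type from the single wasm result type.
pub fn infer_return_type(result: WasmType) -> FieldType {
    match result {
        WasmType::I32 | WasmType::I64 => FieldType::Int,
        WasmType::F32 | WasmType::F64 => FieldType::Float,
    }
}

/// Check the input schema against the wasm params, require exactly one
/// result, and return the inferred Dozer return type.
pub fn validate_signature(
    args: &[FieldType],
    params: &[WasmType],
    results: &[WasmType],
) -> Result<FieldType, String> {
    if args.len() != params.len() {
        return Err(format!("expected {} arguments, got {}", params.len(), args.len()));
    }
    for (index, (arg, param)) in args.iter().zip(params).enumerate() {
        let expected = expected_wasm_type(*arg);
        if expected != *param {
            return Err(format!(
                "argument {index}: {arg:?} maps to {expected:?}, but the function takes {param:?}"
            ));
        }
    }
    match results {
        [single] => Ok(infer_return_type(*single)),
        _ => Err(format!("expected exactly one result, got {}", results.len())),
    }
}
```

With the real engine, `params` and `results` would come from the exported function's type rather than being passed in by hand.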

Casting

To support as many types as possible, casting should be performed. If dozer type is U128 or I128, we should cast it to wasm i64. It should emit a precision loss warning during validation, and a runtime error if the cast cannot be performed due to overflow at runtime.
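For the runtime half of this, a minimal sketch using the standard `TryFrom` conversions (function names are hypothetical; the validation-time warning is not shown):

```rust
/// Cast a Dozer I128 value to wasm i64, failing if it does not fit.
pub fn cast_i128_to_i64(value: i128) -> Result<i64, String> {
    i64::try_from(value).map_err(|_| format!("{value} overflows wasm i64"))
}

/// Cast a Dozer U128 value to wasm i64, failing if it does not fit.
pub fn cast_u128_to_i64(value: u128) -> Result<i64, String> {
    i64::try_from(value).map_err(|_| format!("{value} overflows wasm i64"))
}
```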

Let's skip the support for non-primitive types in this PR.

Onnx module has all these validation and casting functions. You can take that as a reference.

dozer-sql/src/pipeline/errors.rs

dozer-sql/src/pipeline/expression/wasm_udf.rs

chubei · 2023-09-08T02:21:28Z

dozer-types/src/tests/udf_yaml_deserialize.rs

+
+#[cfg(feature = "wasm")]
+#[test]
+fn standard_wasm() {


This test doesn't seem to pass

snork-alt · 2023-09-16T05:37:28Z

@tachyonicbytes can we fix the last things so we can merge ? Thanks

snork-alt · 2023-09-25T03:06:21Z

@tachyonicbytes shall we close this or you plan to wrap it up so it can be merged and your bounty paid?

tachyonicbytes · 2023-09-25T07:12:53Z

Yeah, sorry for the delay, I'm wrapping it up.

snork-alt · 2023-09-28T19:03:30Z

Any progress @tachyonicbytes ?

tachyonicbytes · 2023-09-28T21:31:52Z

Yes, I've updated to the new json schemas, as well as working on the new errors, like onnx does. With testing and all it should be ready in maybe a couple of days, maybe Sunday.

chubei

This looks great! I left some comments in the code, but they are all minor issues.

Questions:

Looks like we're missing a wasm/utils.rs file?
Let's change the todo!s to Errs.
Let's also validate the input argument conversion before execution.

dozer-sql/expression/src/builder.rs

dozer-sql/src/errors.rs

dozer-sql/src/pipeline/expression/mod.rs

dozer-sql/Cargo.toml

chubei

Looks like wasm/utils.rs is missing?

We'll need at least one test to show that the UDF actually runs.

Besides, is it possible to use deno instead of wasmtime? As we are already depending on deno.

chubei · 2023-11-07T02:09:29Z

dozer-sql/Cargo.toml

@@ -19,6 +19,7 @@ enum_dispatch = "0.3.12"
 linked-hash-map = { version = "0.5.6", features = ["serde_impl"] }
 metrics = "0.21.0"
 multimap = "0.9.0"
+wasmtime = { version = "9.0.4", optional = true }


We don't need this dependency?

This is the wasm runtime that we use. This runs the actual wasm functions

chubei · 2023-11-07T02:14:06Z

dozer-sql/expression/src/wasm/udf.rs

+    let engine = Engine::default();
+    let module = Module::from_file(&engine, config).unwrap();
+    let mut store = Store::new(&engine, ());
+    let instance = Instance::new(&mut store, &module, &[]).unwrap();
+
+    let wasm_udf_func;
+    match instance.get_func(&mut store, name) {
+        Some(func) => {
+            wasm_udf_func = func;
+        }
+        None => {
+            return Err(Wasm(WasmFunctionMissing(
+                name.to_string(),
+                config.to_string(),
+            )));
+        }
+    }


This code should go into the compilation phase? Just like how the onnx UDF passes in a Session instead of the model path.
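Something along these lines, as a sketch only: `WasmSession` and `WasmLoadError` are hypothetical names, and a standard-library file read plus a magic-number check stands in for Wasmtime's module compilation. The idea is that loading happens once and returns a `Result` instead of unwrapping:

```rust
use std::fmt;
use std::fs;
use std::path::Path;

/// Hypothetical error type for loading a wasm module.
#[derive(Debug)]
pub enum WasmLoadError {
    Io(std::io::Error),
    NotWasm,
}

impl fmt::Display for WasmLoadError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            WasmLoadError::Io(err) => write!(f, "cannot read wasm module: {err}"),
            WasmLoadError::NotWasm => write!(f, "file is not a wasm binary"),
        }
    }
}

/// Hypothetical handle created once while the pipeline is built and then
/// passed to the expression, instead of the module path.
pub struct WasmSession {
    pub bytes: Vec<u8>,
}

impl WasmSession {
    /// Load the module from disk, propagating I/O errors to the caller.
    pub fn load(path: &Path) -> Result<Self, WasmLoadError> {
        let bytes = fs::read(path).map_err(WasmLoadError::Io)?;
        Self::from_bytes(bytes)
    }

    /// Accept only buffers that start with the wasm magic number "\0asm".
    pub fn from_bytes(bytes: Vec<u8>) -> Result<Self, WasmLoadError> {
        if !bytes.starts_with(b"\0asm") {
            return Err(WasmLoadError::NotWasm);
        }
        Ok(Self { bytes })
    }
}
```

Evaluation would then only borrow the session, so a missing or invalid module is reported when the pipeline is built rather than on the first record.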

chubei · 2023-11-07T02:14:18Z

dozer-sql/expression/src/wasm/udf.rs

+        .collect::<Result<Vec<_>, Error>>()?;
+
+    let engine = Engine::default();
+    let module = Module::from_file(&engine, config).unwrap();


Cannot unwrap

chubei · 2023-11-07T02:14:24Z

dozer-sql/expression/src/wasm/udf.rs

+    let engine = Engine::default();
+    let module = Module::from_file(&engine, config).unwrap();
+    let mut store = Store::new(&engine, ());
+    let instance = Instance::new(&mut store, &module, &[]).unwrap();


Cannot unwrap

algora-pbc bot mentioned this pull request Jun 19, 2023

WASM UDF support #1653

Open

chubei self-requested a review June 20, 2023 05:12

chubei reviewed Jun 21, 2023

snork-alt requested changes Jun 23, 2023

tachyonicbytes force-pushed the main branch from a85fc63 to 151eac4 Compare August 26, 2023 12:13

github-actions bot added the doc-update-needed label Sep 2, 2023

tachyonicbytes force-pushed the main branch from df214de to e112279 Compare September 2, 2023 17:33

tachyonicbytes commented Sep 2, 2023

snork-alt requested a review from chubei September 3, 2023 18:15

chubei reviewed Sep 8, 2023

chubei reviewed Oct 27, 2023

chubei suggested changes Nov 7, 2023

tachyonicbytes added 16 commits November 27, 2023 00:19

Adds incipient support for wasm_udfs

424790e

Passing udfs to parse_wasm_udf

8539585

Make wasm_udfs work with the onnx base.

4b35147

Clippy + fmt

37e929c

fix: commit Cargo.lock

daa81e8

Add wasm_udfs to the new config

8591f5e

Removed commented code

8868be4

Bring wasm udf up to date

67cc8ad

Wasm udf now work again, with the latest changes

b77ee05

Wasm UDFs now have type validation, and the return type is inferred.

ee5a5bf

Added return type validation

91d085a

Removed the last expect from wasm udfs

f469343

Reformatting

a03d0e6

Casting is now being done. Reverted back to todo macro

033a533

Incorporated feedback

b5d6b30

Better instantiation

95b36c5

tachyonicbytes force-pushed the main branch from 922d281 to 95b36c5 Compare November 26, 2023 23:36
