High level framework for ergonomic UDF programming #4856

Open
sunng87 opened this issue Oct 18, 2024 · 3 comments
Labels
C-enhancement Category Enhancements

Comments

@sunng87
Member

sunng87 commented Oct 18, 2024

What type of enhancement is this?

Refactor

What does the enhancement do?

The idea is to create a high-level framework for UDF development (not UDAF) that
removes boilerplate code and improves ergonomics.

The core responsibility of this framework is to provide:

  • Automatic function argument extraction and coercion
  • Declarative validation
  • Document generation

Current status

At the moment, a typical implementation of UDF looks like this one:
https://github.com/GreptimeTeam/greptimedb/blob/main/src/common/function/src/scalars/geo/h3.rs#L95

Basically, we take the following steps to generate the result vector:

  1. Validate input columns: the number of columns and the length of each column
  2. Initialise the result vector
  3. Extract values by row from columns
  4. Call the Rust function and handle the error, if any
  5. Fill the result vector
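
The five steps above can be sketched as follows. This is a minimal illustration, not the code from h3.rs: `Column` is a hypothetical stand-in for GreptimeDB's vector types, and `add_checked` and `eval_manual` are made-up names for the wrapped Rust function and the UDF body.

```rust
// Hypothetical stand-in for a column vector: one nullable value per row.
type Column = Vec<Option<f64>>;

/// Hypothetical per-row Rust function that the UDF wraps.
fn add_checked(a: f64, b: f64) -> Result<f64, String> {
    let sum = a + b;
    if sum.is_finite() {
        Ok(sum)
    } else {
        Err(format!("non-finite result adding {} and {}", a, b))
    }
}

/// Hand-written evaluation that follows the five steps from the list.
fn eval_manual(columns: &[Column]) -> Result<Column, String> {
    // 1. Validate input columns: the column count, then each column's length.
    if columns.len() != 2 {
        return Err(format!("expected 2 columns, got {}", columns.len()));
    }
    let rows = columns[0].len();
    if columns[1].len() != rows {
        return Err("columns have different lengths".to_string());
    }

    // 2. Initialise the result vector.
    let mut result = Vec::with_capacity(rows);

    for i in 0..rows {
        // 3. Extract values by row from the columns.
        let value = match (columns[0][i], columns[1][i]) {
            // 4. Call the Rust function and propagate its error, if any.
            (Some(a), Some(b)) => Some(add_checked(a, b)?),
            // Assumption: a null input yields a null output.
            _ => None,
        };
        // 5. Fill the result vector.
        result.push(value);
    }
    Ok(result)
}
```

Only step 4 is specific to the function being implemented; everything else would be repeated in each UDF.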

Desired state

Every implementation has to perform steps 1, 2, 3, and 5, so an ergonomic solution
is to provide a declarative way to extract Rust data types from column vectors
and let the user focus on calling the Rust function. The implementation of a UDF
should be stateless, so until we have a real use case, we don't need to provide any
execution context other than the original FunctionContext.

Inspired by how axum designs its web handlers, the API looks like this:

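// `Sized` is required because `extract` returns `Option<Self>`.
trait Extract: Sized {
    // Convert a single value into the argument type.
    fn extract(v: Value) -> Option<Self>;

    // Check constraints on the extracted value.
    fn validate(&self) -> Result<()>;
}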

struct Coordinate(f64);

impl Extract for Coordinate {

    fn extract(v: Value) -> Option<Self> {
        ...
    }

    fn validate(&self) -> Result<()> {
        Ok(())
    }
}

struct Resolution(i8);

impl Extract for Resolution {

    fn extract(v: Value) -> Option<Self> {
        ...
    }

    fn validate(&self) -> Result<()> {
        ensure!(self.0 >=0 && self.0 < 18)
    }
}

// `R` is the return type, which is still to be decided (see the TODO below).
trait FunctionExt1: Function {
    type T0: Extract;
    fn call(_ctx: FunctionContext, arg0: Self::T0) -> R;
}

trait FunctionExt2: Function {
    type T0: Extract;
    type T1: Extract;
    fn call(_ctx: FunctionContext, arg0: Self::T0, arg1: Self::T1) -> R;
}

trait FunctionExt3: Function {
    type T0: Extract;
    type T1: Extract;
    type T2: Extract;
    fn call(_ctx: FunctionContext, arg0: Self::T0, arg1: Self::T1, arg2: Self::T2) -> R;
}



...

FunctionExtN will provide a default implementation of Function::eval.
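
As a rough sketch of what such a default implementation could look like for two arguments: everything below is a self-contained stand-in, not GreptimeDB code. `Value`, `FunctionContext`, the `Vec<Value>` columns, the `String` error type, and `DemoFn` are hypothetical simplifications. Returning `Result<Value, String>` from `call` and mapping unextractable inputs to null are assumptions, since the treatment of R is still open.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int64(i64),
    Float64(f64),
}

#[derive(Default)]
struct FunctionContext;

trait Extract: Sized {
    fn extract(v: Value) -> Option<Self>;
    fn validate(&self) -> Result<(), String> {
        Ok(())
    }
}

struct Coordinate(f64);

impl Extract for Coordinate {
    fn extract(v: Value) -> Option<Self> {
        match v {
            Value::Float64(f) => Some(Coordinate(f)),
            _ => None,
        }
    }
}

struct Resolution(i8);

impl Extract for Resolution {
    fn extract(v: Value) -> Option<Self> {
        match v {
            Value::Int64(i) => i8::try_from(i).ok().map(Resolution),
            _ => None,
        }
    }

    fn validate(&self) -> Result<(), String> {
        if (0..18).contains(&self.0) {
            Ok(())
        } else {
            Err(format!("invalid resolution: {}", self.0))
        }
    }
}

trait FunctionExt2 {
    type T0: Extract;
    type T1: Extract;

    // The only method a UDF author writes.
    fn call(ctx: &FunctionContext, arg0: Self::T0, arg1: Self::T1) -> Result<Value, String>;

    // Default implementation covering validation, extraction and result filling.
    fn eval(ctx: &FunctionContext, columns: &[Vec<Value>]) -> Result<Vec<Value>, String> {
        if columns.len() != 2 {
            return Err(format!("expected 2 columns, got {}", columns.len()));
        }
        if columns[0].len() != columns[1].len() {
            return Err("columns have different lengths".to_string());
        }
        let mut out = Vec::with_capacity(columns[0].len());
        for (v0, v1) in columns[0].iter().zip(&columns[1]) {
            let a = <Self::T0 as Extract>::extract(v0.clone());
            let b = <Self::T1 as Extract>::extract(v1.clone());
            match (a, b) {
                (Some(a), Some(b)) => {
                    a.validate()?;
                    b.validate()?;
                    out.push(Self::call(ctx, a, b)?);
                }
                // Assumption: values that cannot be extracted produce null.
                _ => out.push(Value::Null),
            }
        }
        Ok(out)
    }
}

/// Hypothetical function: adds the resolution to the coordinate.
struct DemoFn;

impl FunctionExt2 for DemoFn {
    type T0 = Coordinate;
    type T1 = Resolution;

    fn call(_ctx: &FunctionContext, lat: Coordinate, res: Resolution) -> Result<Value, String> {
        Ok(Value::Float64(lat.0 + f64::from(res.0)))
    }
}
```

With this shape, `DemoFn::eval(&ctx, &columns)` needs no hand-written loop; the author only supplies the two associated types and `call`.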

TODO: think about how to deal with R

Limitation

  • Extractors have to be defined manually. However, most extractors can be shared
    when they don't carry a particular meaning the way Coordinate or Resolution
    does. In some cases, they can be generic numbers or strings.
  • Variadic arguments are not supported by this design
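
To illustrate the first point, shared extractors could be implemented once for plain Rust types and reused by any function. The sketch below assumes a simplified stand-in `Value` enum and a `String` error type; widening `Int64` to `f64` is shown only as one possible coercion rule, not a decided behaviour.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int64(i64),
    Float64(f64),
    Utf8(String),
}

trait Extract: Sized {
    fn extract(v: Value) -> Option<Self>;
    // Generic extractors have no domain constraints, so the default accepts all.
    fn validate(&self) -> Result<(), String> {
        Ok(())
    }
}

impl Extract for i64 {
    fn extract(v: Value) -> Option<Self> {
        match v {
            Value::Int64(i) => Some(i),
            _ => None,
        }
    }
}

impl Extract for f64 {
    fn extract(v: Value) -> Option<Self> {
        match v {
            Value::Float64(f) => Some(f),
            // Assumed coercion: integers widen to floats (lossy for very large values).
            Value::Int64(i) => Some(i as f64),
            _ => None,
        }
    }
}

impl Extract for String {
    fn extract(v: Value) -> Option<Self> {
        match v {
            Value::Utf8(s) => Some(s),
            _ => None,
        }
    }
}
```

A function whose arguments are plain numbers or strings could then use `f64`, `i64`, or `String` directly as its argument types, while newtypes such as Coordinate remain available where validation matters.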

Documentation

A procedural macro is preferred in this case, for two types of usage:

  • As a marker for compile-time tools to extract Rust docstrings into Markdown
    files that can be hosted on docs.greptime.com
  • As a code generation macro that generates a doc function returning the
    docstring at runtime, so that a SQL statement like SHOW DOC function
    can return the docstring.
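
The second usage can be approximated in a single file with a declarative macro. This is only a sketch of the idea: it uses `macro_rules!` instead of the procedural macro proposed here, and the `Documented` trait, the `documented_udf!` macro, `DemoAdd`, and `show_doc` are all hypothetical names.

```rust
/// Hypothetical trait exposing a function's docstring at runtime.
trait Documented {
    fn doc() -> &'static str;
}

// Captures the `///` doc comments (seen by the macro as `#[doc = "..."]`
// attributes), re-emits them on the struct, and generates a `doc` function.
macro_rules! documented_udf {
    ($(#[doc = $doc:literal])* struct $name:ident;) => {
        $(#[doc = $doc])*
        struct $name;

        impl Documented for $name {
            fn doc() -> &'static str {
                // One line of output per doc comment line.
                concat!($($doc, "\n"),*)
            }
        }
    };
}

documented_udf! {
    /// Adds two numbers.
    /// Returns null if either input is null.
    struct DemoAdd;
}

/// Hypothetical lookup that could back a statement like `SHOW DOC function`.
fn show_doc(function_name: &str) -> Option<&'static str> {
    match function_name {
        "demo_add" => Some(DemoAdd::doc()),
        _ => None,
    }
}
```

A procedural macro would additionally allow the docstrings to be collected by compile-time tooling, which this declarative version does not attempt.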

Implementation challenges

No response

@sunng87 sunng87 added the C-enhancement Category Enhancements label Oct 18, 2024
@evenyag
Contributor

evenyag commented Oct 21, 2024

@sunng87 We are going to remove the wrapper layer of our UDF/UDAF and use datafusion's UDF API in the future. Not sure if this issue can benefit from it.

@sunng87
Member Author

sunng87 commented Oct 23, 2024

@evenyag If we use datafusion's API, are we still using our own Vector as input?

@evenyag
Contributor

evenyag commented Oct 25, 2024

> @evenyag If we use datafusion's API, are we still using our own Vector as input?

No, we process arrow's arrays directly. Writing a simple UDF should be easy.
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/simple_udf.rs
