Unify Feature Transformations and Feature Views #4584

franciscojavierarceo · 2024-09-28T09:44:53Z

Problem

In Feast, the current On Demand Feature View executes feature transformations at read time. However, this behavior is not immediately obvious from the name "On Demand Feature View."

The terms "On Demand" don't explicitly convey when the transformation occurs, leading to potential confusion for users trying to understand the timing and execution flow of feature transformations. Additionally, there's inconsistency in how different types of feature views (batch, streaming, on-demand) are declared and used, which can make the codebase less intuitive and harder to maintain.

Proposed Solution

I propose we refactor and rename the way transformations are defined in Feast to make the execution timing explicit. Introducing a @transform decorator with a type parameter that accepts a FeastTransformation enum can achieve this clarity. The enum would define the transformation types:

from enum import Enum

class FeastTransformation(Enum):
    NONE = 0         # No transformations happening using Feast
    ON_READ = 1      # Equivalent to the current On Demand Feature View
    ON_WRITE = 2     # New: Transformations executed at write time
    BATCH = 3        # New: Batch processing transformations
    STREAMING = 4    # Migrates the existing stream_feature_view decorator

@transform(
    type=FeastTransformation.ON_WRITE,
    schema=[...],
    entity=[...],
    sources=[...],
    mode='python',
)
def my_feature_view(inputs: List[Dict[str, float]]) -> List[Dict[str, float]]:
    # Transformation logic here
    return outputs

By explicitly declaring when the transformation should occur {'ON_READ', 'ON_WRITE', 'BATCH', 'STREAMING'}, the code becomes more readable, maintainable, and, more importantly, we will be able to remove a lot of duplicate, confusing code that obfuscates what is actually happening (particularly during the execution of FeatureStore.apply()).

*This approach unifies the declaration of feature views and transformations, making it easier for users to understand and for developers to extend functionality in the future.

Alternatives

Enhancing Documentation: Improving the existing documentation to better explain the behavior of On Demand Feature View without changing the code structure.
- This doesn't solve the inherent ambiguity in the naming.
Renaming Existing Decorators: Simply renaming on_demand_feature_view to something like on_read_feature_view to make the execution timing clearer.
- While this improves clarity, it doesn't address the inconsistency in how different feature views are declared.
Using Configuration Parameters: Adding parameters to existing decorators to specify execution timing.
- This could make the decorators more complex and doesn't provide a unified approach.

Additional Context

Consistency and Extensibility: This change promotes a consistent method of declaring feature transformations, regardless of when they execute. It also sets a foundation for adding more transformation types in the future.

Clarity for New Users: Making the execution timing explicit in the code helps new users understand the flow of data and transformations in Feast, reducing the learning curve.

Impact on Existing Codebase: Careful consideration and planning would be needed to migrate existing code to the new structure without breaking functionality. Deprecation warnings and migration guides could facilitate this transition.

Community Feedback: Engaging with the community to gather feedback on this proposed change could provide additional insights and help refine the solution.

Migration: We can make the code backwards compatible and even provide some helper scripts to migrate to the new syntax but I think this is the right long term proposal that settles the several discussions we've had.

#4376
#4277
#3639

The text was updated successfully, but these errors were encountered:

franciscojavierarceo · 2024-09-28T10:00:40Z

FYI @tokoko @HaoXuAI @shuchu

tokoko · 2024-09-28T10:55:54Z

@franciscojavierarceo thanks. I like the simplicity of it, but wouldn't this be a problem when different types of FVs require different parameters? We can enumerate all the differences, but the first one that comes to mind is that in case of BatchFeatureViews sources is actually a list of DataSource objects, but for OnDemandFeatureViews sources is a list of FeatureView or RequestSource objects. We might end up with a less user-friendly interface, because we will no longer be able to convey this kind of information through typing.

I'm at the moment leaning more towards "better documentation + better naming" alternative because of these reasons.

franciscojavierarceo · 2024-09-28T11:05:38Z

We can enumerate all the differences, but the first one that comes to mind is that in case of BatchFeatureViews sources is actually a list of DataSource objects, but for OnDemandFeatureViews sources is a list of FeatureView or RequestSource objects.

This is just a DAG and we can express this thoughtfully and it ends up having a lot of pros, much like how dbt can compile a visual DAG to highlight behavior.

We might end up with a less user-friendly interface, because we will no longer be able to convey this kind of information through typing.

I think this can be thought through.

I'm at the moment leaning more towards "better documentation + better naming" alternative because of these reasons.

I think the long term approach of making it clear to data scientists and engineers what is actually happening is worth this effort. Making users learn weird jargon instead of using obvious industry patterns is a confusing choice.

Also, better naming is part of the proposal.

franciscojavierarceo · 2024-09-28T12:39:53Z

This means the following implementations will have to happen:

Rename and migrate OnDemandFeatureView
Rename and migrate StreamFeatureView
Implement the batch transformation as a subset of transform
Update lots of documentation

So I think I will complete the with writes and batch transform work then migrate it for completeness.

HaoXuAI · 2024-09-28T16:44:05Z

Hey I know you really want to optimize the ondemandfeatureciew, but I would recommend to implement the batch and stream feature view with transformation first, then come back to refactor and optimization. We need more use case and experiment to support the refactoring

franciscojavierarceo · 2024-09-28T17:58:23Z

Yes, let me clarify. I intend on doing the following:

Finish implementing transformations on writes (what I'm doing now)
Implement Batch transformations (next)
Rename and migrate Stream, On Demand, and Batch Feature Views

I should have ordered the above that way. So the full refactor and optimization is last.

HaoXuAI · 2024-09-28T21:47:58Z

that sounds good. i can help with the batch or stream transformation as well.

franciscojavierarceo · 2024-09-28T22:14:33Z

Sounds good. Want to take on the batch transformation?

You can tag me in the PR if u want feedback/help

franciscojavierarceo · 2024-09-28T22:15:38Z

There's already a spec written and maybe some code in place (take a look at aggregations.py) but I'm not sure what the status was.

HaoXuAI · 2024-09-29T05:36:50Z

Sure, I can play with it and share some thoughts, gonna finish some left work for the vector db

dandawg · 2024-10-01T15:13:01Z

I'm really glad to see the effort towards clarity in the API. I like the idea of getting rid of "on-demand". Can we expand on the idea of "when" to include "where/how" a transformation occurs? I've seen the pattern of "source" and "destination" in other transformation utilities (I know GCP and I think AWS uses this).

As a data scientist, I often choose to have transformations occur in specific locations (this is just my DS work outside of Feast). For example, I may do initial transformations in-database (server-side), and then pull the transformed data client side (this is great for aggregations). Or, I may choose to pull the data first, and then transform client-side, or send it to a remote job (on say a spark cluster), and pull the output from there. Or, I may just have the output written back to the original DB.

Given these sorts of complexities, I'd love to have some flexibility to make these decisions in a Feast transformation. There are a few ways we could tackle the "where". One would be through separate methods, something like:

db_transform(query=..., destination=table_b)  #in-db transformation
spark_transform(source=..., destination=..., spark_engine_endpoint=...)

Or, perhaps even more simplified:

transform(source=..., dest=..., config=...) 
# config could include info like 'spark engine' etc.

^ these ideas are just brainstorming--they would obviously need to be fleshed out more.

franciscojavierarceo · 2024-10-02T12:56:32Z

Can we expand on the idea of "when" to include "where/how" a transformation occurs? I've seen the pattern of "source" and "destination" in other transformation utilities (I know GCP and I think AWS uses this).

Ultimately this maps to the "engine" that's used for transformation mentioned briefly in this document: https://docs.feast.dev/master/getting-started/architecture/feature-transformation#feature-transformation-engines

We have to think through how we orchestrate the transformation because Feast itself is not an orchestrator. I think Kubeflow Pipelines is a natural fit here and we could offer tutorials using Airflow as well.

So the transform decorator really just needs to store the transformation that can be executed on the engine and, thus, the task becomes to ensure that the decorator is compatible with the engine.

tokoko · 2024-10-21T11:30:07Z

I should have ordered the above that way. So the full refactor and optimization is last.

+1 on this, let's get all the core functionality in and then decide on the interface.

P.S. I'm thinking of taking on #4277, I think no one has started on it yet, right?

franciscojavierarceo · 2024-10-21T11:34:56Z

Yeah, that sounds great! Do you want to tag me early in the PR so we can make sure we're on the same page for the refactor? I imagine we'll end up coming up with the same conclusion but would be good to collaborate on it.

tokoko · 2024-10-21T11:35:50Z

sure, I will start a draft PR here

robhowley · 2024-10-21T15:27:50Z

This will be a really nice improvement on the api. As it stands it's currently quite unintuitive, but a @transform decorator is a very clear and straightforward. Good readability change.

emgeee · 2024-10-24T20:33:25Z

I think this is a great idea, the current naming is confusing. I also think a diagram would help clarify the things

franciscojavierarceo added the kind/feature New feature or request label Sep 28, 2024

franciscojavierarceo self-assigned this Sep 28, 2024

franciscojavierarceo mentioned this issue Oct 6, 2024

feat: Adding write capability to online store to on demand feature views #4585

Merged

franciscojavierarceo mentioned this issue Oct 29, 2024

Refactor Vector Retrieval into Feature View #4716

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unify Feature Transformations and Feature Views #4584

Unify Feature Transformations and Feature Views #4584

franciscojavierarceo commented Sep 28, 2024 •

edited

Loading

franciscojavierarceo commented Sep 28, 2024 •

edited

Loading

tokoko commented Sep 28, 2024

franciscojavierarceo commented Sep 28, 2024 •

edited

Loading

franciscojavierarceo commented Sep 28, 2024 •

edited

Loading

HaoXuAI commented Sep 28, 2024

franciscojavierarceo commented Sep 28, 2024

HaoXuAI commented Sep 28, 2024

franciscojavierarceo commented Sep 28, 2024

franciscojavierarceo commented Sep 28, 2024

HaoXuAI commented Sep 29, 2024

dandawg commented Oct 1, 2024 •

edited

Loading

franciscojavierarceo commented Oct 2, 2024

tokoko commented Oct 21, 2024

franciscojavierarceo commented Oct 21, 2024

tokoko commented Oct 21, 2024

robhowley commented Oct 21, 2024

emgeee commented Oct 24, 2024

Unify Feature Transformations and Feature Views #4584

Unify Feature Transformations and Feature Views #4584

Comments

franciscojavierarceo commented Sep 28, 2024 • edited Loading

Problem

Proposed Solution

Alternatives

Additional Context

franciscojavierarceo commented Sep 28, 2024 • edited Loading

tokoko commented Sep 28, 2024

franciscojavierarceo commented Sep 28, 2024 • edited Loading

franciscojavierarceo commented Sep 28, 2024 • edited Loading

HaoXuAI commented Sep 28, 2024

franciscojavierarceo commented Sep 28, 2024

HaoXuAI commented Sep 28, 2024

franciscojavierarceo commented Sep 28, 2024

franciscojavierarceo commented Sep 28, 2024

HaoXuAI commented Sep 29, 2024

dandawg commented Oct 1, 2024 • edited Loading

franciscojavierarceo commented Oct 2, 2024

tokoko commented Oct 21, 2024

franciscojavierarceo commented Oct 21, 2024

tokoko commented Oct 21, 2024

robhowley commented Oct 21, 2024

emgeee commented Oct 24, 2024

franciscojavierarceo commented Sep 28, 2024 •

edited

Loading

franciscojavierarceo commented Sep 28, 2024 •

edited

Loading

franciscojavierarceo commented Sep 28, 2024 •

edited

Loading

franciscojavierarceo commented Sep 28, 2024 •

edited

Loading

dandawg commented Oct 1, 2024 •

edited

Loading