Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify Feature Transformations and Feature Views #4584

Open
franciscojavierarceo opened this issue Sep 28, 2024 · 17 comments
Open

Unify Feature Transformations and Feature Views #4584

franciscojavierarceo opened this issue Sep 28, 2024 · 17 comments
Assignees
Labels
kind/feature New feature or request

Comments

@franciscojavierarceo
Copy link
Member

franciscojavierarceo commented Sep 28, 2024

Problem

In Feast, the current On Demand Feature View executes feature transformations at read time. However, this behavior is not immediately obvious from the name "On Demand Feature View."

The terms "On Demand" don't explicitly convey when the transformation occurs, leading to potential confusion for users trying to understand the timing and execution flow of feature transformations. Additionally, there's inconsistency in how different types of feature views (batch, streaming, on-demand) are declared and used, which can make the codebase less intuitive and harder to maintain.

Proposed Solution

I propose we refactor and rename the way transformations are defined in Feast to make the execution timing explicit. Introducing a @transform decorator with a type parameter that accepts a FeastTransformation enum can achieve this clarity. The enum would define the transformation types:

from enum import Enum

class FeastTransformation(Enum):
    NONE = 0         # No transformations happening using Feast
    ON_READ = 1      # Equivalent to the current On Demand Feature View
    ON_WRITE = 2     # New: Transformations executed at write time
    BATCH = 3        # New: Batch processing transformations
    STREAMING = 4    # Migrates the existing stream_feature_view decorator

@transform(
    type=FeastTransformation.ON_WRITE,
    schema=[...],
    entity=[...],
    sources=[...],
    mode='python',
)
def my_feature_view(inputs: List[Dict[str, float]]) -> List[Dict[str, float]]:
    # Transformation logic here
    return outputs

By explicitly declaring when the transformation should occur {'ON_READ', 'ON_WRITE', 'BATCH', 'STREAMING'}, the code becomes more readable, maintainable, and, more importantly, we will be able to remove a lot of duplicate, confusing code that obfuscates what is actually happening (particularly during the execution of FeatureStore.apply()).

*This approach unifies the declaration of feature views and transformations, making it easier for users to understand and for developers to extend functionality in the future.

Alternatives

  • Enhancing Documentation: Improving the existing documentation to better explain the behavior of On Demand Feature View without changing the code structure.

    • This doesn't solve the inherent ambiguity in the naming.
  • Renaming Existing Decorators: Simply renaming on_demand_feature_view to something like on_read_feature_view to make the execution timing clearer.

    • While this improves clarity, it doesn't address the inconsistency in how different feature views are declared.
  • Using Configuration Parameters: Adding parameters to existing decorators to specify execution timing.

    • This could make the decorators more complex and doesn't provide a unified approach.

Additional Context

Consistency and Extensibility: This change promotes a consistent method of declaring feature transformations, regardless of when they execute. It also sets a foundation for adding more transformation types in the future.

Clarity for New Users: Making the execution timing explicit in the code helps new users understand the flow of data and transformations in Feast, reducing the learning curve.

Impact on Existing Codebase: Careful consideration and planning would be needed to migrate existing code to the new structure without breaking functionality. Deprecation warnings and migration guides could facilitate this transition.

Community Feedback: Engaging with the community to gather feedback on this proposed change could provide additional insights and help refine the solution.

Migration: We can make the code backwards compatible and even provide some helper scripts to migrate to the new syntax but I think this is the right long term proposal that settles the several discussions we've had.

#4376
#4277
#3639

@franciscojavierarceo franciscojavierarceo added the kind/feature New feature or request label Sep 28, 2024
@franciscojavierarceo franciscojavierarceo self-assigned this Sep 28, 2024
@franciscojavierarceo
Copy link
Member Author

franciscojavierarceo commented Sep 28, 2024

FYI @tokoko @HaoXuAI @shuchu

@tokoko
Copy link
Collaborator

tokoko commented Sep 28, 2024

@franciscojavierarceo thanks. I like the simplicity of it, but wouldn't this be a problem when different types of FVs require different parameters? We can enumerate all the differences, but the first one that comes to mind is that in case of BatchFeatureViews sources is actually a list of DataSource objects, but for OnDemandFeatureViews sources is a list of FeatureView or RequestSource objects. We might end up with a less user-friendly interface, because we will no longer be able to convey this kind of information through typing.

I'm at the moment leaning more towards "better documentation + better naming" alternative because of these reasons.

@franciscojavierarceo
Copy link
Member Author

franciscojavierarceo commented Sep 28, 2024

We can enumerate all the differences, but the first one that comes to mind is that in case of BatchFeatureViews sources is actually a list of DataSource objects, but for OnDemandFeatureViews sources is a list of FeatureView or RequestSource objects.

This is just a DAG and we can express this thoughtfully and it ends up having a lot of pros, much like how dbt can compile a visual DAG to highlight behavior.

We might end up with a less user-friendly interface, because we will no longer be able to convey this kind of information through typing.

I think this can be thought through.

I'm at the moment leaning more towards "better documentation + better naming" alternative because of these reasons.

I think the long term approach of making it clear to data scientists and engineers what is actually happening is worth this effort. Making users learn weird jargon instead of using obvious industry patterns is a confusing choice.

Also, better naming is part of the proposal.

@franciscojavierarceo
Copy link
Member Author

franciscojavierarceo commented Sep 28, 2024

This means the following implementations will have to happen:

  • Rename and migrate OnDemandFeatureView
  • Rename and migrate StreamFeatureView
  • Implement the batch transformation as a subset of transform
  • Update lots of documentation

So I think I will complete the with writes and batch transform work then migrate it for completeness.

@HaoXuAI
Copy link
Collaborator

HaoXuAI commented Sep 28, 2024

Hey I know you really want to optimize the ondemandfeatureciew, but I would recommend to implement the batch and stream feature view with transformation first, then come back to refactor and optimization. We need more use case and experiment to support the refactoring

@franciscojavierarceo
Copy link
Member Author

Yes, let me clarify. I intend on doing the following:

  1. Finish implementing transformations on writes (what I'm doing now)
  2. Implement Batch transformations (next)
  3. Rename and migrate Stream, On Demand, and Batch Feature Views

I should have ordered the above that way. So the full refactor and optimization is last.

@HaoXuAI
Copy link
Collaborator

HaoXuAI commented Sep 28, 2024

that sounds good. i can help with the batch or stream transformation as well.

@franciscojavierarceo
Copy link
Member Author

Sounds good. Want to take on the batch transformation?

You can tag me in the PR if u want feedback/help

@franciscojavierarceo
Copy link
Member Author

There's already a spec written and maybe some code in place (take a look at aggregations.py) but I'm not sure what the status was.

@HaoXuAI
Copy link
Collaborator

HaoXuAI commented Sep 29, 2024

Sure, I can play with it and share some thoughts, gonna finish some left work for the vector db

@dandawg
Copy link
Contributor

dandawg commented Oct 1, 2024

I'm really glad to see the effort towards clarity in the API. I like the idea of getting rid of "on-demand". Can we expand on the idea of "when" to include "where/how" a transformation occurs? I've seen the pattern of "source" and "destination" in other transformation utilities (I know GCP and I think AWS uses this).

As a data scientist, I often choose to have transformations occur in specific locations (this is just my DS work outside of Feast). For example, I may do initial transformations in-database (server-side), and then pull the transformed data client side (this is great for aggregations). Or, I may choose to pull the data first, and then transform client-side, or send it to a remote job (on say a spark cluster), and pull the output from there. Or, I may just have the output written back to the original DB.

Given these sorts of complexities, I'd love to have some flexibility to make these decisions in a Feast transformation. There are a few ways we could tackle the "where". One would be through separate methods, something like:

db_transform(query=..., destination=table_b)  #in-db transformation
spark_transform(source=..., destination=..., spark_engine_endpoint=...)

Or, perhaps even more simplified:

transform(source=..., dest=..., config=...) 
# config could include info like 'spark engine' etc.

^ these ideas are just brainstorming--they would obviously need to be fleshed out more.

@franciscojavierarceo
Copy link
Member Author

Can we expand on the idea of "when" to include "where/how" a transformation occurs? I've seen the pattern of "source" and "destination" in other transformation utilities (I know GCP and I think AWS uses this).

Ultimately this maps to the "engine" that's used for transformation mentioned briefly in this document: https://docs.feast.dev/master/getting-started/architecture/feature-transformation#feature-transformation-engines

We have to think through how we orchestrate the transformation because Feast itself is not an orchestrator. I think Kubeflow Pipelines is a natural fit here and we could offer tutorials using Airflow as well.

So the transform decorator really just needs to store the transformation that can be executed on the engine and, thus, the task becomes to ensure that the decorator is compatible with the engine.

@tokoko
Copy link
Collaborator

tokoko commented Oct 21, 2024

I should have ordered the above that way. So the full refactor and optimization is last.

+1 on this, let's get all the core functionality in and then decide on the interface.

P.S. I'm thinking of taking on #4277, I think no one has started on it yet, right?

@franciscojavierarceo
Copy link
Member Author

Yeah, that sounds great! Do you want to tag me early in the PR so we can make sure we're on the same page for the refactor? I imagine we'll end up coming up with the same conclusion but would be good to collaborate on it.

@tokoko
Copy link
Collaborator

tokoko commented Oct 21, 2024

sure, I will start a draft PR here

@robhowley
Copy link
Contributor

This will be a really nice improvement on the api. As it stands it's currently quite unintuitive, but a @transform decorator is a very clear and straightforward. Good readability change.

@emgeee
Copy link
Contributor

emgeee commented Oct 24, 2024

I think this is a great idea, the current naming is confusing. I also think a diagram would help clarify the things

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants