Unify Feature Transformations and Feature Views #4584
Comments
@franciscojavierarceo thanks. I like the simplicity of it, but wouldn't this be a problem when different types of FVs require different parameters? We can enumerate all the differences, but the first one that comes to mind is that in case of BatchFeatureViews I'm at the moment leaning more towards "better documentation + better naming" alternative because of these reasons.
This is just a DAG and we can express this thoughtfully and it ends up having a lot of pros, much like how dbt can compile a visual DAG to highlight behavior.
I think this can be thought through.
I think the long-term approach of making it clear to data scientists and engineers what is actually happening is worth this effort. Making users learn weird jargon instead of using obvious industry patterns is a confusing choice. Also, better naming is part of the proposal.
This means the following implementations will have to happen:
So I think I will complete the with writes and batch transform work then migrate it for completeness.
Hey, I know you really want to optimize the OnDemandFeatureView, but I would recommend implementing the batch and stream feature views with transformations first, then coming back to the refactoring and optimization. We need more use cases and experiments to support the refactoring.
Yes, let me clarify. I intend on doing the following:
I should have ordered the above that way. So the full refactor and optimization is last.
That sounds good. I can help with the batch or stream transformation as well.
Sounds good. Want to take on the batch transformation? You can tag me in the PR if you want feedback/help.
There's already a spec written and maybe some code in place (take a look at aggregations.py) but I'm not sure what the status was.
Sure, I can play with it and share some thoughts. I'm going to finish some remaining work for the vector DB first.
I'm really glad to see the effort towards clarity in the API. I like the idea of getting rid of "on-demand". Can we expand on the idea of "when" to include "where/how" a transformation occurs? I've seen the pattern of "source" and "destination" in other transformation utilities (I know GCP and I think AWS use this). As a data scientist, I often choose to have transformations occur in specific locations (this is just my DS work outside of Feast). For example, I may do initial transformations in-database (server-side), and then pull the transformed data client-side (this is great for aggregations). Or, I may choose to pull the data first and then transform it client-side, or send it to a remote job (on, say, a Spark cluster) and pull the output from there. Or, I may just have the output written back to the original DB. Given these sorts of complexities, I'd love to have some flexibility to make these decisions in a Feast transformation. There are a few ways we could tackle the "where". One would be through separate methods, something like:
Or, perhaps even more simplified:
^ these ideas are just brainstorming--they would obviously need to be fleshed out more.
Ultimately this maps to the "engine" that's used for transformation mentioned briefly in this document: https://docs.feast.dev/master/getting-started/architecture/feature-transformation#feature-transformation-engines We have to think through how we orchestrate the transformation because Feast itself is not an orchestrator. I think Kubeflow Pipelines is a natural fit here and we could offer tutorials using Airflow as well. So the
+1 on this, let's get all the core functionality in and then decide on the interface. P.S. I'm thinking of taking on #4277, I think no one has started on it yet, right?
Yeah, that sounds great! Do you want to tag me early in the PR so we can make sure we're on the same page for the refactor? I imagine we'll end up coming up with the same conclusion but would be good to collaborate on it.
sure, I will start a draft PR here |
This will be a really nice improvement on the api. As it stands it's currently quite unintuitive, but a |
I think this is a great idea; the current naming is confusing. I also think a diagram would help clarify things.
Problem
In Feast, the current On Demand Feature View executes feature transformations at read time. However, this behavior is not immediately obvious from the name "On Demand Feature View."
The term "On Demand" doesn't explicitly convey when the transformation occurs, which can confuse users trying to understand the timing and execution flow of feature transformations. Additionally, the different types of feature views (batch, streaming, on-demand) are declared and used inconsistently, which makes the codebase less intuitive and harder to maintain.
Proposed Solution
I propose we refactor and rename the way transformations are defined in Feast to make the execution timing explicit. Introducing a @transform decorator with a type parameter that accepts a FeastTransformation enum can achieve this clarity. The enum would define the transformation types: {'ON_READ', 'ON_WRITE', 'BATCH', 'STREAMING'}.
By explicitly declaring when the transformation should occur, the code becomes more readable, maintainable, and, more importantly, we will be able to remove a lot of duplicate, confusing code that obfuscates what is actually happening (particularly during the execution of FeatureStore.apply()).
This approach unifies the declaration of feature views and transformations, making it easier for users to understand and for developers to extend functionality in the future.
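To make the proposal concrete, here is a minimal sketch of what the enum and decorator could look like. It is not existing Feast code: the TransformedView container, the decorator body, and the example function trip_cost_per_km are assumptions made up for illustration. Only the names @transform, FeastTransformation, and the four enum members come from the proposal itself.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class FeastTransformation(Enum):
    """The four execution timings named in the proposal."""
    ON_READ = "ON_READ"
    ON_WRITE = "ON_WRITE"
    BATCH = "BATCH"
    STREAMING = "STREAMING"


@dataclass
class TransformedView:
    """Hypothetical container; stands in for whatever object Feast would register."""
    name: str
    type: FeastTransformation
    udf: Callable


def transform(type: FeastTransformation) -> Callable[[Callable], TransformedView]:
    """Hypothetical decorator factory that records when the transformation runs."""
    if not isinstance(type, FeastTransformation):
        raise TypeError("type must be a FeastTransformation member")

    def decorator(udf: Callable) -> TransformedView:
        # Wrap the user function together with its declared execution timing.
        return TransformedView(name=udf.__name__, type=type, udf=udf)

    return decorator


# Illustrative usage: the timing is visible at the declaration site.
@transform(type=FeastTransformation.ON_READ)
def trip_cost_per_km(row: dict) -> dict:
    return {"cost_per_km": row["fare"] / row["distance_km"]}
```

With something like this, a single code path could branch on the type field instead of each feature view class carrying its own registration logic, which is the deduplication the proposal is after.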
Alternatives
Enhancing Documentation: Improving the existing documentation to better explain the behavior of On Demand Feature View without changing the code structure.
Renaming Existing Decorators: Simply renaming on_demand_feature_view to something like on_read_feature_view to make the execution timing clearer.
Using Configuration Parameters: Adding parameters to existing decorators to specify execution timing.
Additional Context
Consistency and Extensibility: This change promotes a consistent method of declaring feature transformations, regardless of when they execute. It also sets a foundation for adding more transformation types in the future.
Clarity for New Users: Making the execution timing explicit in the code helps new users understand the flow of data and transformations in Feast, reducing the learning curve.
Impact on Existing Codebase: Careful consideration and planning would be needed to migrate existing code to the new structure without breaking functionality. Deprecation warnings and migration guides could facilitate this transition.
Community Feedback: Engaging with the community to gather feedback on this proposed change could provide additional insights and help refine the solution.
Migration: We can make the code backwards compatible and even provide some helper scripts to migrate to the new syntax but I think this is the right long term proposal that settles the several discussions we've had.
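As a rough illustration of the backwards-compatibility point, the old-style decorator could be kept as a thin alias that emits a deprecation warning and forwards to the new one. Everything below is a sketch under assumptions: legacy_on_demand_feature_view is a simplified stand-in (the real on_demand_feature_view decorator takes arguments that are omitted here), and this transform simply tags the function rather than building a feature view.

```python
import warnings
from enum import Enum
from typing import Callable


class FeastTransformation(Enum):
    ON_READ = "ON_READ"
    ON_WRITE = "ON_WRITE"
    BATCH = "BATCH"
    STREAMING = "STREAMING"


def transform(type: FeastTransformation) -> Callable[[Callable], Callable]:
    """Hypothetical new-style decorator: tags the function with its timing."""
    def decorator(udf: Callable) -> Callable:
        udf.transformation_type = type
        return udf
    return decorator


def legacy_on_demand_feature_view(udf: Callable) -> Callable:
    """Simplified stand-in for the old decorator, kept only as a deprecated alias."""
    warnings.warn(
        "this decorator is deprecated; use @transform(type=FeastTransformation.ON_READ)",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller's declaration
    )
    return transform(type=FeastTransformation.ON_READ)(udf)


# Existing declarations keep working and resolve to the ON_READ timing.
@legacy_on_demand_feature_view
def doubled(row: dict) -> dict:
    return {"doubled": row["value"] * 2}
```

A helper script for migration could then be little more than a search for the old decorator name and a rewrite to the explicit form.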
#4376
#4277
#3639