Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[flytekit] Polish - Cache #6143

Open
2 tasks done
wild-endeavor opened this issue Jan 5, 2025 · 7 comments
Open
2 tasks done

[flytekit] Polish - Cache #6143

wild-endeavor opened this issue Jan 5, 2025 · 7 comments
Assignees
Labels
backlogged For internal use. Reserved for contributor team workflow.

Comments

@wild-endeavor
Copy link
Contributor

wild-endeavor commented Jan 5, 2025

Caching

This is a series of tickets to improve the flytekit authoring experience. If any changes are not possible to make in a backwards-compatible way, split it out into a separate ticket.

Merge options into a dataclass and rename argument

Currently there are three configurations for cache control: cache_serialize, cache_version, cache_ignore_input_vars. We should rename these to serialize (bool) , version (remains a str), and ignore (update type to list[str]|str|None=None).

This object should then be the input to the cache argument in the task decorator, which should be of type bool | Cache, and is False by default.

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@wild-endeavor wild-endeavor added documentation Improvements or additions to documentation untriaged This issues has not yet been looked at by the Maintainers labels Jan 5, 2025
@wild-endeavor wild-endeavor added backlogged For internal use. Reserved for contributor team workflow. and removed documentation Improvements or additions to documentation labels Jan 5, 2025
@thomasjpfan
Copy link
Member

Before moving forward with this, I think we need to consider how this interacts with flyteorg/flytekit#2971

@granthamtaylor
Copy link

Before moving forward with this, I think we need to consider how this interacts with flyteorg/flytekit#2971

Yes! I think, ideally, the Cache object should include all of the configurations of the auto-caching and the general behavior around caching (serialize, version, ignore).

Perhaps, instead of combining these two objects into one, we could simply add a policy argument to Cache, which could include the more granular auto-caching configurations.

@eapolinario eapolinario removed the untriaged This issues has not yet been looked at by the Maintainers label Jan 16, 2025
@wild-endeavor
Copy link
Contributor Author

wild-endeavor commented Jan 31, 2025

@granthamtaylor @thomasjpfan @eapolinario @dansola

I like grantham's idea of just moving the CachePolicy object into the new Cache object.

Can we just make it

Cache(
  version: str | CachePolicy,
  serialize: bool,
  ignored_inputs: list[str] | str | None = None,  # Or should we just use the "ignore" in the original above, i thought "ignored_inputs" might be more clear
)

The cache argument to the @task decorator should then be Optional[Cache] rather than bool | Cache right?

What does everyone think? If we're all good, I say we go for it. And yeah let's resolve and merge dan's pr first.

@granthamtaylor
Copy link

@wild-endeavor I have been thinking about this for some while.

I would still like the cache argument to be cache: bool | Cache = False. That way, you can enable automatic caching, where we define the automatic behavior according to best practices by simply setting cache=True.

However, I think that the Cache object should be defined as such:

class Cache:
    salt: Optional[str]=None
    serialize: bool=False
    ignored_inputs: Optional[list[str] | str]=None
    policies: Optional[list[CachePolicy] | CachePolicy]=None

I say this for the following reasons: we have currently been calling salt the cache_version. This name is slightly confusing (does it define the version of the caching mechanism?). Additionally, it should be Optional such that if it isn't included then we can not use it to modify the caching version.

Secondly, all of the parameters here should have some default arguments.

@dansola

Lastly, the policies and salt should be defined separately.

@eapolinario eapolinario self-assigned this Feb 4, 2025
@eapolinario
Copy link
Contributor

eapolinario commented Feb 4, 2025

I would still like the cache argument to be cache: bool | Cache = False. That way, you can enable automatic caching, where we define the automatic behavior according to best practices by simply setting cache=True

We're not abandoning the existing behavior, right? How users are going to migrate? We can't simply assume that if cache=True then auto-caching is enabled, unless we're saying that users will remove the cache_version argument and that's enough signal for us to assume auto-caching. I don't love this, but can be convinced. Right now I'm thinking we simply provide a default Cache object that will be used by 90% of the users.

we have currently been calling salt the cache_version

this is not entirely correct as the salt is used to produce the final version, which is what cache_version represents currently, but it's not the final version (nor it's part of the cache key itself).

This name is slightly confusing (does it define the version of the caching mechanism?)

that's reasonable, but this confusion is alleviated by docs.

Two comments about flyteorg/flytekit#2971:

  1. The design of the auto-caching policies as they currently exist in the PR separates the auto-caching policies into a separate flytekit plugin.
    i. I'd be ok with moving 1. to flytekit core
  2. Each auto-caching policy requires a salt, which is assumed to be used in the invocation of AutoCache.get_version.
    i. does it make sense to force all cache policies to share the same salt? IMO, no, especially if we set a reasonable default in each cache policy.

This leads me to propose a slightly modified version of Cache:

class Cache:
    policies: list[CachePolicy]=[ <reasonable-list-of-policies> ]  # this is for illustration purposes only, we shouldn't have mutable objects as part of an object definition
    serialize: bool=False
    ignored_inputs: Optional[list[str]]=None

@granthamtaylor
Copy link

I agree with that!

It is a small thing, but I would also like to see that single item ignore inputs are supported (ignored_inputs: Optional[list[str] | str])

I think that leaves us with the following impl for Cache ?

class Cache:
    version: Optional[str]=None
    policies: list[CachePolicy]=[ <reasonable-list-of-policies> ]  # this is for illustration purposes only, we shouldn't have mutable objects as part of an object definition
    serialize: bool=False
    ignored_inputs: Optional[list[str] | str]=None

@eapolinario
Copy link
Contributor

100%. As we spoke offline, not listing version in the Cache object was a mistake caused by my sleep-deprived brain.

Let me get the gears cranking on flyteorg/flytekit#2971. Summary of the proposed changes:

  • define a toplevel Cache object as discussed above
  • move cache policies to flytekit core, but keep the protocol, so that people can implement their policies if they so wish
  • define a reasonable default for salt
  • expose a default cache object that can be reused to indicate sane defaults and doesn't involve creating an empty Cache object
    • we can use cache=True and cache_version being unset as indication that auto-caching is enabled.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backlogged For internal use. Reserved for contributor team workflow.
Projects
Status: Backlog
Development

No branches or pull requests

4 participants