Auto Cache Plugin #2971

Closed
wants to merge 20 commits into from

dansola
Contributor

@dansola dansola commented Dec 2, 2024

Why are the changes needed?

Make caching easier to use in flytekit by reducing the cognitive burden of specifying cache versions.

What changes were proposed in this pull request?

To use the caching mechanism in a Flyte task, you can define a CachePolicy that combines multiple caching strategies. Here’s an example of how to set it up:

from flytekit import task
from flytekit.core.auto_cache import CachePolicy
from flytekitplugins.auto_cache import CacheFunctionBody, CachePrivateModules

cache_policy = CachePolicy(
    auto_cache_policies=[
        CacheFunctionBody(),
        CachePrivateModules(root_dir="../my_package"),
        ...,
    ],
    salt="my_salt",
)

@task(cache=cache_policy)
def task_fn():
    ...

Salt Parameter

The salt parameter in the CachePolicy adds uniqueness to the generated hash. It can be used to differentiate between different versions of the same task. This ensures that even if the underlying code remains unchanged, the hash will vary if a different salt is provided. This feature is particularly useful for invalidating the cache for specific versions of a task.
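
As a rough illustration of how a salt can change the resulting version, the sketch below folds several per-policy version strings and a salt into one SHA-256 digest. The helper name `combine_versions`, the separator, and the placeholder version strings are assumptions made for this example; the plugin's actual combination scheme may differ.

```python
import hashlib
from typing import Iterable


def combine_versions(versions: Iterable[str], salt: str = "") -> str:
    """Hypothetical helper: fold several version strings and a salt into one hash."""
    # A separator reduces accidental collisions such as ("ab", "c") vs ("a", "bc").
    task_hash = "|".join([*versions, salt])
    return hashlib.sha256(task_hash.encode()).hexdigest()


# Placeholder strings standing in for the output of individual cache policies.
policy_versions = ["function-body-hash", "private-modules-hash"]
v1 = combine_versions(policy_versions, salt="my_salt")
v2 = combine_versions(policy_versions, salt="my_salt_v2")
```

With identical policy versions, changing only the salt yields a different digest, which is what invalidates the cache.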

Cache Implementations

Users can add any number of cache policies that implement the AutoCache protocol defined in auto_cache.py. Below are the implementations available so far:

1. CacheFunctionBody

This implementation hashes the contents of the function of interest, ignoring any formatting or comment changes. It ensures that the core logic of the function is considered for versioning.
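
One common way to ignore formatting and comments is to hash a function's abstract syntax tree instead of its raw text. The sketch below shows that idea; `hash_function_source` is a hypothetical name, and this is not necessarily how CacheFunctionBody is implemented. Note that docstrings are part of the AST, so in this sketch a docstring edit would still change the hash.

```python
import ast
import hashlib
import textwrap


def hash_function_source(source: str) -> str:
    """Hypothetical helper: hash a function's AST rather than its raw text."""
    tree = ast.parse(textwrap.dedent(source))
    # ast.dump drops comments and whitespace and, by default, omits line and
    # column attributes, so only the code structure contributes to the hash.
    normalized = ast.dump(tree)
    return hashlib.sha256(normalized.encode()).hexdigest()


original = """
def add(a, b):
    return a + b
"""

reformatted = """
def add(a,   b):
    # comments and spacing are not part of the AST
    return a + b
"""

changed = """
def add(a, b):
    return a - b
"""
```

Here `original` and `reformatted` hash to the same value, while `changed` does not.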

2. CacheImage

This implementation includes the hash of the container_image object passed. If the image is specified as a name, that string is hashed. If it is an ImageSpec, the parametrization of the ImageSpec is hashed, allowing for precise versioning of the container image used in the task.
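
The two cases described above (a plain image name versus a parametrized spec) can be sketched as follows. `ExampleImageSpec` is a hypothetical stand-in defined only for this example, not flytekit's ImageSpec, and the JSON serialization is an assumption about one reasonable way to hash a set of parameters deterministically.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List, Union


@dataclass
class ExampleImageSpec:
    """Hypothetical stand-in for an image specification; not flytekit's ImageSpec."""

    name: str
    base_image: str
    packages: List[str] = field(default_factory=list)


def hash_image(image: Union[str, ExampleImageSpec]) -> str:
    """Hash either an image name or the parameters of a spec."""
    if isinstance(image, str):
        payload = image
    else:
        # Sort keys so the same parameters always serialize identically.
        payload = json.dumps(asdict(image), sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


spec_a = ExampleImageSpec(name="img", base_image="python:3.11", packages=["numpy"])
spec_b = ExampleImageSpec(name="img", base_image="python:3.11", packages=["numpy", "pandas"])
```

Adding a package to the spec changes its hash, just as changing the tag of a named image does.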

3. CachePrivateModules

This implementation recursively searches the task of interest for all callables and constants used. The contents of any callable (function or class) utilized by the task are hashed, ignoring formatting or comments. The values of the literal constants used are also included in the hash.

It accounts for both import and from-import statements at the global and local levels within a module or function. Any callables that are within site-packages (i.e., external libraries) are ignored.
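
The discovery step can be sketched in a much simplified form: start from the task function, look up the names it references, and recurse into any plain Python function that does not live under site-packages. The helper names below are hypothetical, and this sketch does not follow `module.attr` calls, classes, or local imports, all of which the description above says the real implementation handles.

```python
import inspect
import types
from typing import Callable, Dict, Optional


def is_external(func: types.FunctionType) -> bool:
    """Treat anything whose code lives under site-packages as external."""
    return "site-packages" in func.__code__.co_filename


def collect_private_callables(
    func: Callable, seen: Optional[Dict[str, Callable]] = None
) -> Dict[str, Callable]:
    """Hypothetical helper: gather the functions reachable from `func`."""
    seen = {} if seen is None else seen
    if not isinstance(func, types.FunctionType) or is_external(func):
        return seen
    key = f"{func.__module__}.{func.__qualname__}"
    if key in seen:
        return seen  # already visited; also guards against call cycles
    seen[key] = func
    # getclosurevars resolves the global and nonlocal names the body refers to.
    closure_vars = inspect.getclosurevars(func)
    for target in {**closure_vars.globals, **closure_vars.nonlocals}.values():
        collect_private_callables(target, seen)
    return seen


def helper():
    return 1


def unused():
    return 2


def task_fn():
    return helper() + 1


found = collect_private_callables(task_fn)
```

In this example `found` contains `task_fn` and `helper` but not `unused`, mirroring the requirement that functions not used by the task are not hashed.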

4. CacheExternalDependencies

Like CachePrivateModules, this implementation recursively searches through all the callables, but when an external package is found, it records the version of that package and includes it in the hash. This ensures that changes in external dependencies are reflected in the task's versioning.
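
Looking up an installed package's version can be done with the standard library's `importlib.metadata`. The sketch below is a minimal version of that lookup; `get_package_version` and `collect_versions` are hypothetical names, and the sketch assumes it is given distribution names. Import names can differ from distribution names (for example `PIL` is provided by `pillow`), so a real implementation needs a mapping step that is not shown here.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def get_package_version(distribution_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


def collect_versions(distribution_names: Iterable[str]) -> Dict[str, str]:
    """Hypothetical helper: map each installed distribution to its version."""
    versions: Dict[str, str] = {}
    # Sort so the resulting dict, and any hash built from it, is deterministic.
    for name in sorted(distribution_names):
        version = get_package_version(name)
        if version is not None:
            versions[name] = version
    return versions
```

Packages that are not installed are simply skipped, so they do not contribute to the version dictionary.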

How was this patch tested?

Unit tests for the following:

  • verify that a function hash changes only when the function contents change, not when formatting or comments are added
  • verify that a dummy repository can be recursively searched when various import statements are used
  • verify that functions not used by the task of interest are not hashed
  • verify that all constants used by a task and by any of the functions it calls are identified
  • verify that in a new python environment, the correct external libraries are identified
  • verify that the correct dependency versions can be identified

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Summary by Bito

This PR refactors Flytekit's caching mechanism by introducing a comprehensive auto-cache plugin that implements multiple strategies including function body hashing, container image versioning, and dependency tracking. The implementation migrates CachePolicy to a dedicated plugin, simplifying the core auto_cache module and streamlining cache parameter types while maintaining seamless integration with existing task and workflow decorators.

Unit tests added: True

Estimated effort to review (1-5, lower is better): 5

Comment on lines 65 to 67
self.cache_serialize = cache_serialize
self.cache_version = cache_version
self.cache_ignore_input_vars = cache_ignore_input_vars
Contributor

what's the purpose of saving this state here? aren't these just forwarded to the underlying TaskMetadata?

Contributor Author

The idea with this is the user could use the CachePolicy to define all the arguments relating to caching. This simplifies the UX a bit as opposed to having a CachePolicy and a cache_ignore_input_vars, cache_serialize, etc.

Contributor

I think this is a little confusing:

  • cache_version should not be exposed, since the AutoCache protocol is meant to produce this value automatically, and salt is meant to fulfill the need of manually bumping the cache.
  • I think it makes sense to keep cache_serialize and cache_ignore_input_vars as options to specify in the @task decorator as opposed to introducing this redundancy here.

Contributor Author

Sounds good. Sounds like there is a separate effort aimed at collecting all of the caching arguments here: flyteorg/flyte#6143

Happy to use that instead and simplify the arguments here!


codecov bot commented Jan 3, 2025

Codecov Report

Attention: Patch coverage is 47.82609% with 24 lines in your changes missing coverage. Please review.

Project coverage is 77.96%. Comparing base (3b7cb3c) to head (18f253e).
Report is 51 commits behind head on master.

Files with missing lines | Patch % | Lines
flytekit/core/auto_cache.py | 45.16% | 17 Missing ⚠️
flytekit/core/task.py | 53.33% | 6 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2971      +/-   ##
==========================================
+ Coverage   76.49%   77.96%   +1.46%     
==========================================
  Files         200      202       +2     
  Lines       20901    21324     +423     
  Branches     2689     2739      +50     
==========================================
+ Hits        15989    16625     +636     
+ Misses       4195     3904     -291     
- Partials      717      795      +78     


@flyte-bot
Contributor

flyte-bot commented Jan 4, 2025

Code Review Agent Run #bc105b

Actionable Suggestions - 9
  • flytekit/core/auto_cache.py - 1
    • Return type inconsistency in get_version method · Line 95-95
  • plugins/flytekit-auto-cache/tests/requirements-test.txt - 1
    • Consider flexible version pinning for dependencies · Line 1-20
  • plugins/flytekit-auto-cache/tests/verify_identified_packages.py - 1
    • Missing assert keyword in test validation · Line 11-11
  • plugins/flytekit-auto-cache/flytekitplugins/auto_cache/cache_private_modules.py - 1
    • Consider splitting long method into smaller ones · Line 108-221
  • flytekit/core/task.py - 1
    • Consider implications of looser type hints · Line 136-136
  • plugins/flytekit-auto-cache/flytekitplugins/auto_cache/cache_function_body.py - 2
    • Consider validating empty salt parameter · Line 23-23
    • Consider adding type check for func · Line 32-35
  • plugins/flytekit-auto-cache/tests/my_package/module_a.py - 1
    • Consider handling sum function result · Line 12-12
  • plugins/flytekit-auto-cache/flytekitplugins/auto_cache/cache_external_dependencies.py - 1
    • Too broad exception handling · Line 106-107
Additional Suggestions - 7
  • plugins/flytekit-auto-cache/flytekitplugins/auto_cache/cache_private_modules.py - 3
  • flytekit/core/task.py - 1
    • Consider extracting cache policy handling logic · Line 353-362
  • plugins/flytekit-auto-cache/flytekitplugins/auto_cache/cache_image.py - 1
    • Consider extracting duplicate hash logic · Line 46-51
  • plugins/flytekit-auto-cache/flytekitplugins/auto_cache/cache_external_dependencies.py - 1
    • Consider breaking down version lookup logic · Line 89-126
  • plugins/flytekit-auto-cache/tests/my_package/main.py - 1
    • Consider splitting function responsibilities · Line 14-22
Review Details
  • Files reviewed - 27 · Commit Range: 2786c5b..18f253e
    • flytekit/core/auto_cache.py
    • flytekit/core/task.py
    • flytekit/core/workflow.py
    • plugins/flytekit-auto-cache/flytekitplugins/auto_cache/__init__.py
    • plugins/flytekit-auto-cache/flytekitplugins/auto_cache/cache_external_dependencies.py
    • plugins/flytekit-auto-cache/flytekitplugins/auto_cache/cache_function_body.py
    • plugins/flytekit-auto-cache/flytekitplugins/auto_cache/cache_image.py
    • plugins/flytekit-auto-cache/flytekitplugins/auto_cache/cache_private_modules.py
    • plugins/flytekit-auto-cache/setup.py
    • plugins/flytekit-auto-cache/tests/dummy_functions/dummy_function.py
    • plugins/flytekit-auto-cache/tests/dummy_functions/dummy_function_comments_formatting_change.py
    • plugins/flytekit-auto-cache/tests/dummy_functions/dummy_function_logic_change.py
    • plugins/flytekit-auto-cache/tests/my_package/main.py
    • plugins/flytekit-auto-cache/tests/my_package/module_a.py
    • plugins/flytekit-auto-cache/tests/my_package/module_b.py
    • plugins/flytekit-auto-cache/tests/my_package/module_c.py
    • plugins/flytekit-auto-cache/tests/my_package/module_d.py
    • plugins/flytekit-auto-cache/tests/my_package/my_dir/__init__.py
    • plugins/flytekit-auto-cache/tests/my_package/my_dir/module_in_dir.py
    • plugins/flytekit-auto-cache/tests/my_package/utils.py
    • plugins/flytekit-auto-cache/tests/requirements-test.txt
    • plugins/flytekit-auto-cache/tests/test_external_dependencies.py
    • plugins/flytekit-auto-cache/tests/test_function_body.py
    • plugins/flytekit-auto-cache/tests/test_image.py
    • plugins/flytekit-auto-cache/tests/test_recursive.py
    • plugins/flytekit-auto-cache/tests/verify_identified_packages.py
    • plugins/flytekit-auto-cache/tests/verify_versions.py
  • Files skipped - 1
    • plugins/flytekit-auto-cache/README.md - Reason: Filter setting
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

AI Code Review powered by Bito

@flyte-bot
Contributor

flyte-bot commented Jan 4, 2025

Changelist by Bito

This pull request implements the following key changes.

Key Change: New Feature - Auto Cache Core Implementation
Files Impacted:
  • auto_cache.py - Added core auto cache protocol and version parameters definitions
  • task.py - Updated task decorator to support auto cache functionality
  • workflow.py - Modified workflow definitions to support auto cache integration

Key Change: Feature Improvement - Cache Plugin Implementations
Files Impacted:
  • __init__.py - Created plugin package structure and exposed cache implementations
  • cache_external_dependencies.py - Implemented external dependency version tracking
  • cache_function_body.py - Added function content hashing implementation
  • cache_image.py - Created container image versioning mechanism
  • cache_policy.py - Implemented combined cache policy handler
  • cache_private_modules.py - Added recursive module dependency tracking

Key Change: Testing - Comprehensive Test Suite
Files Impacted:
  • test_function_body.py - Added tests for function body hashing functionality
  • test_image.py - Added tests for container image versioning
  • test_recursive.py - Added tests for recursive module dependency tracking
  • verify_identified_packages.py - Added verification tests for package identification
  • verify_versions.py - Added tests for version verification
  • dummy_function.py - Added test utility functions
  • main.py - Added test package structure
  • requirements-test.txt - Added test dependencies

hash_obj = hashlib.sha256(task_hash.encode())
return hash_obj.hexdigest()

return None
Contributor

Return type inconsistency in get_version method

Consider returning an empty string instead of None for consistency in return types. The method signature indicates it returns str but can return None.

Code suggestion
Check the AI-generated fix before applying
Suggested change
- return None
+ return ""

Code Review Run #bc105b


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Comment on lines +1 to +20
numpy==1.24.3
pandas==2.0.3
requests==2.31.0
matplotlib==3.7.2
pillow==10.0.0
scipy==1.11.2
pytest==7.4.0
urllib3==2.0.4
cryptography==41.0.3
setuptools==68.0.0
flask==2.3.2
django==4.2.4
scikit-learn==1.3.0
beautifulsoup4==4.12.2
pyyaml==6.0
fastapi==0.100.0
sqlalchemy==2.0.36
tqdm==4.65.0
pytest-mock==3.11.0
jinja2==3.1.2
Contributor

Consider flexible version pinning for dependencies

Consider pinning dependencies to compatible versions using ~= or >= instead of == to allow for minor version updates that include security patches while maintaining compatibility. This helps keep dependencies up-to-date with security fixes.

Code suggestion
Check the AI-generated fix before applying
Suggested change
- numpy==1.24.3
- pandas==2.0.3
- requests==2.31.0
- matplotlib==3.7.2
- pillow==10.0.0
- scipy==1.11.2
- pytest==7.4.0
- urllib3==2.0.4
- cryptography==41.0.3
- setuptools==68.0.0
- flask==2.3.2
- django==4.2.4
- scikit-learn==1.3.0
- beautifulsoup4==4.12.2
- pyyaml==6.0
- fastapi==0.100.0
- sqlalchemy==2.0.36
- tqdm==4.65.0
- pytest-mock==3.11.0
- jinja2==3.1.2
+ numpy~=1.24.3
+ pandas~=2.0.3
+ requests~=2.31.0
+ matplotlib~=3.7.2
+ pillow~=10.0.0
+ scipy~=1.11.2
+ pytest~=7.4.0
+ urllib3~=2.0.4
+ cryptography~=41.0.3
+ setuptools~=68.0.0
+ flask~=2.3.2
+ django~=4.2.4
+ scikit-learn~=1.3.0
+ beautifulsoup4~=4.12.2
+ pyyaml~=6.0
+ fastapi~=0.100.0
+ sqlalchemy~=2.0.36
+ tqdm~=4.65.0
+ pytest-mock~=3.11.0
+ jinja2~=3.1.2

Code Review Run #bc105b


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged
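
For context on what the `~=` operator in the suggestion above accepts: under PEP 440, `~=1.24.3` is equivalent to `>=1.24.3, ==1.24.*`, so it admits newer patch releases but not `1.25.0`. The sketch below is a simplified check for purely numeric versions only; the function names are hypothetical, and real tooling should rely on a proper version parser rather than this approximation.

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Parse a purely numeric version such as '1.24.3' (no pre/post releases)."""
    return tuple(int(part) for part in version.split("."))


def matches_compatible_release(candidate: str, pinned: str) -> bool:
    """Simplified check for `~=pinned`: the candidate must be at least `pinned`
    and share its prefix once the last component of `pinned` is dropped."""
    cand, pin = parse(candidate), parse(pinned)
    prefix = pin[:-1]
    return cand >= pin and cand[: len(prefix)] == prefix


patch_ok = matches_compatible_release("1.24.9", "1.24.3")
minor_rejected = not matches_compatible_release("1.25.0", "1.24.3")
older_rejected = not matches_compatible_release("1.24.2", "1.24.3")
```

With a two-component pin such as `~=6.0`, the same rule accepts `6.1` but rejects `7.0`.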

packages = cache.get_version_dict().keys()

expected_packages = {'PIL', 'bs4', 'numpy', 'pandas', 'scipy', 'sklearn'}
set(packages) == expected_packages, f"Expected keys {expected_packages}, but got {set(packages)}"
Contributor

Missing assert keyword in test validation

The assertion statement appears to be missing the assert keyword, which means this comparison won't actually validate the test condition. Consider adding the assert keyword.

Code suggestion
Check the AI-generated fix before applying
Suggested change
- set(packages) == expected_packages, f"Expected keys {expected_packages}, but got {set(packages)}"
+ assert set(packages) == expected_packages, f"Expected keys {expected_packages}, but got {set(packages)}"

Code Review Run #bc105b


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged
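
To see why the missing keyword matters, the small sketch below uses made-up package sets (assumptions for illustration only). Without `assert`, the line is an expression statement that builds a tuple and throws it away, so a mismatch never fails the test.

```python
packages = ["numpy", "pandas"]
expected_packages = {"numpy", "scipy"}

# Without `assert`, this only builds a (bool, str) tuple and discards it,
# so the mismatch between the two sets goes unnoticed.
set(packages) == expected_packages, "this message is never raised"


def check() -> bool:
    """Return True if the real assertion fires on the mismatch."""
    try:
        assert set(packages) == expected_packages, (
            f"Expected keys {expected_packages}, but got {set(packages)}"
        )
    except AssertionError:
        return True
    return False


assertion_fired = check()
```

Only the second form raises `AssertionError` when the sets differ.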

Comment on lines +32 to +35
def get_version(self, params: VersionParameters) -> str:
    if params.func is None:
        raise ValueError("Function-based cache requires a function parameter")
    return self._get_version(func=params.func)
Contributor

Consider adding type check for func

The get_version method could benefit from type checking params.func before accessing it to provide a more descriptive error message.

Code suggestion
Check the AI-generated fix before applying
Suggested change
- def get_version(self, params: VersionParameters) -> str:
-     if params.func is None:
-         raise ValueError("Function-based cache requires a function parameter")
-     return self._get_version(func=params.func)
+ def get_version(self, params: VersionParameters) -> str:
+     if params.func is None:
+         raise ValueError("Function-based cache requires a function parameter")
+     if not callable(params.func):
+         raise TypeError("params.func must be a callable function")
+     return self._get_version(func=params.func)

Code Review Run #bc105b


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

module_b.another_helper()
result = norm([1, 2, 3])
print(result)
sum([SOME_CONSTANT, utils.THIRD_CONSTANT])
Contributor

Consider handling sum function result

The sum() function call result is not being stored or used, which may indicate meaningless executed code. Consider either storing the result or removing if not needed.

Code suggestion
Check the AI-generated fix before applying
Suggested change
- sum([SOME_CONSTANT, utils.THIRD_CONSTANT])
+ result = sum([SOME_CONSTANT, utils.THIRD_CONSTANT])

Code Review Run #bc105b


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

Comment on lines +106 to +107
except Exception as e:
    click.secho(f"Could not get version for {package_name} using importlib.metadata: {str(e)}", fg="yellow")
Contributor

Too broad exception handling

Catching a broad 'Exception' may hide bugs. Consider catching specific exceptions instead.

Code suggestion
Check the AI-generated fix before applying
Suggested change
- except Exception as e:
-     click.secho(f"Could not get version for {package_name} using importlib.metadata: {str(e)}", fg="yellow")
+ except (ImportError, AttributeError) as e:
+     click.secho(f"Could not get version for {package_name} using importlib.metadata: {str(e)}", fg="yellow")

Code Review Run #bc105b


Is this a valid issue, or was it incorrectly flagged by the Agent?

  • it was incorrectly flagged

...


class CachePolicy:
Contributor

can CachePolicy live in the plugin? It makes sense for the abstract AutoCache protocol to be defined in flytekit core, but any implementation of it should be in the plugin.

Contributor Author

good point, thanks. i refactored this!

Signed-off-by: Daniel Sola <[email protected]>
Signed-off-by: Daniel Sola <[email protected]>
@flyte-bot
Contributor

flyte-bot commented Jan 13, 2025

Code Review Agent Run #fa2b0b

Actionable Suggestions - 1
  • plugins/flytekit-auto-cache/flytekitplugins/auto_cache/cache_policy.py - 1
    • Consider initializing cache_version in init · Line 33-34
Review Details
  • Files reviewed - 3 · Commit Range: 18f253e..121c06f
    • flytekit/core/auto_cache.py
    • flytekit/core/task.py
    • plugins/flytekit-auto-cache/flytekitplugins/auto_cache/cache_policy.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

AI Code Review powered by Bito

4 participants