Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Future: Add check_exists flag to init of Step class #31

Open
evamaxfield opened this issue Jan 31, 2020 · 8 comments
Open

Future: Add check_exists flag to init of Step class #31

evamaxfield opened this issue Jan 31, 2020 · 8 comments
Assignees

Comments

@evamaxfield
Copy link
Contributor

Use Case

Please provide a use case to help us understand your request in context
Instead of always pulling the data down prior to run check if the data already exists locally.

@donovanr
Copy link
Contributor

i think use_cached_data might be more clear than check_exists.

anyway, should use_cached_data check whether the data for this step exists? or the upstream tasks? i think just having it check the current step would work in a prefect-style workflow model, since it would sort of recurse upwards. e.g.

raw = steps.Raw(use_cached_data=True)
qc = steps.Qc(use_cached_data=False)

raw_data = raw.run()
qc_data = qc.run(raw_data)

in this situation, qc would always recompute it's output, since use_cached_data=False. But raw_data would load from disk and not run the computation, since use_cached_data=True.

Does that make sense???

@evamaxfield
Copy link
Contributor Author

I think Greg was requesting that it checks current step data existing or not. And I like the name change. So agree on this for sure.

@donovanr donovanr self-assigned this Jan 31, 2020
@donovanr
Copy link
Contributor

donovanr commented Feb 1, 2020

ok.

i see this minorly conflicting with the clean_before_run kwarg. shouldn't be too tough to handle with some conditionals, but i might need some extra vigilant eyes on the tests for that.

@donovanr
Copy link
Contributor

donovanr commented Feb 1, 2020

ok, thinking about it a little more, a use_cached_data=True feature would require a step to be able to know what data it's going to produce.

we could ignore everything complicated about this and just do

if use_cached_data:
    pass
else:
   run code here

that's obviously... bad. it assumes that cached data == whatever's in the directory, ignores branches and worse.

one way forward might be by leveraging the git commit <--> quilt syncing. so use_cached_data=True could check to see if the git commit corresponds to a quilt hash? somehow?

@donovanr
Copy link
Contributor

donovanr commented Feb 1, 2020

does this bring us full circle to a sync lock mechanism? is there a good way to keep the quilt hash and the git hash synced? i guess the sync is only one way, i.e every quilt hash gets a git hash but not vice versa.

@donovanr
Copy link
Contributor

donovanr commented Feb 1, 2020

maybe stashing the git commit hash somewhere easy to check on push / in quilt (not just in the message). then on run, if use_cached_data==True`, pull that one file from quilt into memory with

p["git_commit_hash"]()

and if that hash matches, we can skip running the computation and just use step.checkout().

Is there a way to make checkout super fast if the files are already present? some quilt-side thing where it just compares hashes?

@donovanr
Copy link
Contributor

donovanr commented Feb 1, 2020

if so that might meet most of the needs here. maaaaybe we could think about a use_cached_data="total_free_for_all" mode that just uses whatever's there without any checking? or is there something more reasonable as an intermediate between checking for data that corresponds to a quilt hash and checking nothing?

@evamaxfield
Copy link
Contributor Author

evamaxfield commented Feb 14, 2020

Tossing comments from conversation on issue for tracking:

This can and should be done at a later point. Probably 0.1.5.

Current plan:
use_cached_data and cached_data_check_integrity would be added as a "protected parameter" in the similar fashion that clean is.
use_cached_data is simply a boolean option
cached_data_check_integrity would be a string from available constants: constants.DataIntegrityLevels.GitCommit and constants.DataIntegrityLevels.ManifestHash or similar.

With implementations of data integrity checks occurring during the run wrapper.

@evamaxfield evamaxfield changed the title Add check_exists flag to init of Step class Future: Add check_exists flag to init of Step class Feb 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants