Future: Add `check_exists` flag to init of `Step` class #31

evamaxfield · 2020-01-31T20:34:38Z

Use Case

Please provide a use case to help us understand your request in context
Instead of always pulling the data down prior to run check if the data already exists locally.

donovanr · 2020-01-31T22:39:31Z

i think use_cached_data might be more clear than check_exists.

anyway, should use_cached_data check whether the data for this step exists? or the upstream tasks? i think just having it check the current step would work in a prefect-style workflow model, since it would sort of recurse upwards. e.g.

raw = steps.Raw(use_cached_data=True)
qc = steps.Qc(use_cached_data=False)

raw_data = raw.run()
qc_data = qc.run(raw_data)

in this situation, qc would always recompute it's output, since use_cached_data=False. But raw_data would load from disk and not run the computation, since use_cached_data=True.

Does that make sense???

evamaxfield · 2020-01-31T22:41:33Z

I think Greg was requesting that it checks current step data existing or not. And I like the name change. So agree on this for sure.

donovanr · 2020-02-01T00:13:12Z

ok.

i see this minorly conflicting with the clean_before_run kwarg. shouldn't be too tough to handle with some conditionals, but i might need some extra vigilant eyes on the tests for that.

donovanr · 2020-02-01T05:20:32Z

ok, thinking about it a little more, a use_cached_data=True feature would require a step to be able to know what data it's going to produce.

we could ignore everything complicated about this and just do

if use_cached_data:
    pass
else:
   run code here

that's obviously... bad. it assumes that cached data == whatever's in the directory, ignores branches and worse.

one way forward might be by leveraging the git commit <--> quilt syncing. so use_cached_data=True could check to see if the git commit corresponds to a quilt hash? somehow?

donovanr · 2020-02-01T05:23:07Z

does this bring us full circle to a sync lock mechanism? is there a good way to keep the quilt hash and the git hash synced? i guess the sync is only one way, i.e every quilt hash gets a git hash but not vice versa.

donovanr · 2020-02-01T05:33:28Z

maybe stashing the git commit hash somewhere easy to check on push / in quilt (not just in the message). then on run, if use_cached_data==True`, pull that one file from quilt into memory with

p["git_commit_hash"]()

and if that hash matches, we can skip running the computation and just use step.checkout().

Is there a way to make checkout super fast if the files are already present? some quilt-side thing where it just compares hashes?

donovanr · 2020-02-01T05:37:55Z

if so that might meet most of the needs here. maaaaybe we could think about a use_cached_data="total_free_for_all" mode that just uses whatever's there without any checking? or is there something more reasonable as an intermediate between checking for data that corresponds to a quilt hash and checking nothing?

evamaxfield · 2020-02-14T20:01:29Z

Tossing comments from conversation on issue for tracking:

This can and should be done at a later point. Probably 0.1.5.

Current plan:
use_cached_data and cached_data_check_integrity would be added as a "protected parameter" in the similar fashion that clean is.
use_cached_data is simply a boolean option
cached_data_check_integrity would be a string from available constants: constants.DataIntegrityLevels.GitCommit and constants.DataIntegrityLevels.ManifestHash or similar.

With implementations of data integrity checks occurring during the run wrapper.

donovanr self-assigned this Jan 31, 2020

evamaxfield changed the title ~~Add check_exists flag to init of Step class~~ Future: Add check_exists flag to init of Step class Feb 15, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Future: Add `check_exists` flag to init of `Step` class #31

Future: Add `check_exists` flag to init of `Step` class #31

evamaxfield commented Jan 31, 2020

donovanr commented Jan 31, 2020

evamaxfield commented Jan 31, 2020

donovanr commented Feb 1, 2020

donovanr commented Feb 1, 2020

donovanr commented Feb 1, 2020

donovanr commented Feb 1, 2020

donovanr commented Feb 1, 2020 •

edited

Loading

evamaxfield commented Feb 14, 2020 •

edited

Loading

Future: Add check_exists flag to init of Step class #31

Future: Add check_exists flag to init of Step class #31

Comments

evamaxfield commented Jan 31, 2020

Use Case

donovanr commented Jan 31, 2020

evamaxfield commented Jan 31, 2020

donovanr commented Feb 1, 2020

donovanr commented Feb 1, 2020

donovanr commented Feb 1, 2020

donovanr commented Feb 1, 2020

donovanr commented Feb 1, 2020 • edited Loading

evamaxfield commented Feb 14, 2020 • edited Loading

Future: Add `check_exists` flag to init of `Step` class #31

Future: Add `check_exists` flag to init of `Step` class #31

donovanr commented Feb 1, 2020 •

edited

Loading

evamaxfield commented Feb 14, 2020 •

edited

Loading