Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Flow decorator #547

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

davidwaroquiers
Copy link
Contributor

Summary

This is a WIP PR trying to develop the flow decorator (see idea #416, also linked to #450). Seems like we could have another option to do what we want. Currently in a (very) proof-of-concept phase (previous attempt with abstract syntax trees failed).

Currently works for "simple" flows (i.e. flows of jobs), and only if the script creates only one flow. I think it should be generalizable, but before spending more time, I'd like to get some feedback from @utf, @Andrew-S-Rosen, @janosh, and maybe others. Basically, it works by adding a reference to the created jobs and flows into a "global" variable (I used the singleton from monty, not a standard python global variable, not exactly sure if it is needed and/or if it helps and/or what it adds in this specific case).

Using what @Andrew-S-Rosen did for having something more user friendly (see #450).

Currently breaks @janosh magic methods.

Another addition should be that the flow output is stored in the database somehow ? (some things to be made sure here though, we don't want to duplicate outputs both in jobs and flows).

Example in test_flows.py:test_flow_decorator

TODO (if any)

If this is a work-in-progress, write something about what else needs to be done.

  • Generalize to complex flows
  • Flow outputs in the results database ?
  • Tests

Checklist

Work-in-progress pull requests are encouraged, but please put [WIP] in the pull request
title.

Before a pull request can be merged, the following items must be checked:

  • Code is in the standard Python style.
    The easiest way to handle this is to run the following in the correct sequence on
    your local machine. Start with running black on your new code. This will
    automatically reformat your code to PEP8 conventions and removes most issues. Then run
    pycodestyle, followed by flake8.
  • Docstrings have been added in theNumpy docstring format.
    Run pydocstyle on your code.
  • Type annotations are highly encouraged. Run mypy to
    type check your code.
  • Tests have been added for any new functionality or bug fixes.
  • All linting and tests pass.

Note that the CI system will run all the above checks. But it will be much more
efficient if you already fix most errors prior to submitting the PR. It is highly
recommended that you use the pre-commit hook provided in the repository. Simply
pip install pre-commit and then pre-commit install and a check will be run
prior to allowing commits.

@davidwaroquiers davidwaroquiers marked this pull request as draft February 19, 2024 14:10
Copy link

codecov bot commented Feb 19, 2024

Codecov Report

Attention: 3 lines in your changes are missing coverage. Please review.

Comparison is base (eda2a65) 99.42% compared to head (b246ce6) 97.48%.
Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #547      +/-   ##
==========================================
- Coverage   99.42%   97.48%   -1.94%     
==========================================
  Files          21       21              
  Lines        1564     1592      +28     
  Branches      425      429       +4     
==========================================
- Hits         1555     1552       -3     
- Misses          9       36      +27     
- Partials        0        4       +4     
Files Coverage Δ
src/jobflow/__init__.py 100.00% <100.00%> (ø)
src/jobflow/core/job.py 97.78% <95.23%> (-2.22%) ⬇️
src/jobflow/core/flow.py 89.77% <84.61%> (-10.23%) ⬇️

@Andrew-S-Rosen
Copy link
Member

Thank you for kicking this off (again), @davidwaroquiers! As you already know, I'm a huge proponent of this idea in general.

I'll let @utf take the lead on any logistical comments, but from me, I'll just say this is exciting to see. Thanks for baking in my WIP PR in #450 as well!

In general, the UX here is pretty much exactly what I would expect to see.

@janosh
Copy link
Member

janosh commented Feb 20, 2024

the API this enables looks very appealing! if all edge cases that might negatively affect UX can be addressed, this looks like a big win.

Currently breaks @janosh magic methods.

weird... do you know why?

@janosh janosh added ux User experience api Application programming interface needs discussion Needs discussion labels Feb 20, 2024
@utf
Copy link
Member

utf commented Feb 20, 2024

Hi @davidwaroquiers, thanks for this PR. This is an interesting idea and implementation.

As I mentioned in #450, I have some concerns with the indexing features. While these can definitely be addressed, I think it would be better to remove those features from this PR so we can focus only on the @flow decorator. I've copied my concerns below.

Regarding the specific implementation, it looks promising. I would like to spend a little more time thinking it over and testing it locally.

Incompatibility with output attributes

One potential confusion is that you aren't able to do something like:

job2 = capitalize(job1.hello)

We could add that functionality in, but it then raises an issue with job1.output e.g., if your output document has an attribute called output this will return the OutputReference and not OutputReference.output.

A potential fix would be to rename all of the Job attributes and functions to attribute_, E.g., job.output_, job.graph_, job.update_maker_kwargs_ etc. This would be a breaking change but not too onerous to fix.

Incompatibility with jobs without a dictionary output

Another issue is with jobs that output a python primitive directly. E.g., your capitalize function. Consider these jobs.

@job
def capitalize(s):
    return s.upper()

@job
def decapitalize(s):
    return s.lower()

This change might think users than can do:

job1 = capitalize("hello")
job2 = decapitalize(job1)
flow = Flow([job1, job2])

But that will fail since jobs can only accept OutputReferences (or other python objects) as inputs.

@Andrew-S-Rosen
Copy link
Member

@davidwaroquiers: is there anything here that you would like me to prototype or add onto to help keep the ball rolling on this? I can't make any guarantees of course but am happy to try and help where I can.

@davidwaroquiers
Copy link
Contributor Author

@davidwaroquiers: is there anything here that you would like me to prototype or add onto to help keep the ball rolling on this? I can't make any guarantees of course but am happy to try and help where I can.

Not sure I can catch up on this at this stage. It's just that I had some time and I had thought about it so I decided to put it in a WIP PR for discussion. Also, before going on with this implementation, we should sit down and see what we want to have, what's "nice to have" and what's not on the table. While I agree with @utf that we should separate concerns, probably we should still first discuss the user experience. Basically "write down" how we would like to write our workflows, i.e. all cases we are thinking about, and see which are already handled, which would need the flow decorator and which would need the indexing features. I'm open to have a meeting on this if it could make things move.

@Andrew-S-Rosen
Copy link
Member

Andrew-S-Rosen commented Aug 14, 2024

Note to self:

Ignoring the changes to __getitem__ here, which are for a separate discussion, the changes implemented by @davidwaroquiers "work" for the simple case of one @flow in a file, but the fact that Steps is a global variable really throws a wrench in things.

For instance:

from jobflow import flow, job, run_locally


@job
def add(a, b):
    return a + b


@job
def multiply(a, b):
    return a * b


@flow
def add_multiply(a, b, c):
    addition = add(a, b)
    return multiply(addition.output, c)


@flow
def double_add(a, b):
    first = add(a, b)
    return add(first.output, first.output)


f = add_multiply(1, 2, 3)
outputs = run_locally(f)
results = [v[1].output for _, v in outputs.items()]
assert results == [3, 9]


f = double_add(1, 2)
outputs = run_locally(f)
results = [v[1].output for _, v in outputs.items()]
assert results == [3, 6]

The call to double_add will fail with ValueError: Job add (6e639178-2deb-4f2c-b28f-e8e47efd2fac) already belongs to another flow, which we can easily understand when investigating Steps. A similar issue would occur if you tried to run the same @flow twice in an interactive Jupyter notebook.

In short, Steps just continues to get populated over and over without getting reset anywhere. Just to be clear, resetting Steps in run_locally(), for instance, will not solve the problem.

It's not immediately clear to me how to handle this when using a global variable, and the previous ast-based approach (while more flexible) seems like it might be too fragile. It's not immediately clear to me how to proceed with a convenient @flow decorator, but it is worth continued thought.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api Application programming interface needs discussion Needs discussion ux User experience
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants