-
Notifications
You must be signed in to change notification settings - Fork 659
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Full execution provenance resolution #5639
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
✅ Deploy Preview for nextflow-docs-staging canceled.
|
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
3e21298
to
2b4c5b3
Compare
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
I ripped the code from nf-prov to generate a mermaid diagram of the task graph using your provenance method. Your rnaseq-nf toy pipeline works fine: I tried to run against fetchngs, but the run hangs at the very end 😞 You should be able to reproduce it with: make pack
./build/releases/nextflow-24.11.0-edge-dist run nf-core/fetchngs -r 1.12.0 -profile test,conda --outdir results |
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Well done. I'll check fetchngs asap |
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
import groovy.util.logging.Slf4j | ||
import nextflow.prov.OperatorRun | ||
/** | ||
* Implements an operator context that binds a new run to the current thread |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth commenting here that a "thread" here refers to a DataflowProcessor (i.e. input channel).
My custom tests on join
and mix
are working correctly, so I'm assuming that each DataflowProcessor uses the same thread for all of its runs (as long as maxForks is 1).
For operators like collate
and groupTuple
that use only one DP, the run-per-thread context essentially allows you to manually override how the provenance links are recorded. They are also working correctly with my tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even better would be to replace the thread-local with a Map<DataflowProcessor,OperatorLink>
, but not a big deal if that would be too complicated
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
…i fast] Signed-off-by: adamrtalbot <[email protected]> Signed-off-by: Paolo Di Tommaso <[email protected]> Co-authored-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]> Signed-off-by: Ben Sherman <[email protected]> Co-authored-by: Ben Sherman <[email protected]> Co-authored-by: Chris Hakkaart <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Almost complete. One the following operators needs to be reviewed
Theres' also a the following non-deterministic error to be investigated:
|
This PR implements the ability to trace the full provenance of a Nextflow pipeline, so that once a task execution is completed, it reports the set of direct upstream tasks that have originated one or more inputs.
How it works
Each output value that's emitted by a task or an operator is wrapped with an object instance. This makes it possible to assign to each emitted value a unique identity based on the underlying Java object identity.
Each object is associated with the corresponding task or operator run (i.e.
TaskRun
andOperatorRun
).Once the output value is received as an input by task, the upstream task is determined by inspecting the output-run association table.
Required changes
This approach requires enclosing each output value with a wrapper object, and "unwrap" it once it is received by the downstream task or operator, so that the corresponding operation is not altered.
The input unwrapping can be automated easily both for tasks and operators because they have a common message receive interface.
However the output wrapping requires modifying all nextflow operators because each of them of a custom logic to produce the outputs
Possible problems
It should be assessed the impact of creating an object instance for each output value generated by the workflow execution on the underlying Java heap.
Similarity, keeping a heap reference for each task and operator run may determine memory pressure on large workflow graphs.
Current state and next steps
The current implementation demonstrates that this approach is viable. The solution already supports any tasks and the operators:
branch
,map
,flatMap
,collectFile
.Tests are available in this case.
The remaining operators should be added to fully support existing workflow applications.
Alternative solution
A simpler solution is possible using the output file paths as the identity value to track the tasks provenance using a logic very similar to the above proposal.
However, the path approach is limited to the case in which all workflow tasks and operator produce file values. The provenance can be tracked for task having one or more non-file input/output values.