Full execution provenance resolution #5639

Draft
wants to merge 71 commits into
base: master

pditommaso
Member

This PR implements the ability to trace the full provenance of a Nextflow pipeline: once a task execution completes, it reports the set of direct upstream tasks that produced one or more of its inputs.

How it works

Each output value that's emitted by a task or an operator is wrapped with an object instance. This makes it possible to assign to each emitted value a unique identity based on the underlying Java object identity.

Each object is associated with the corresponding task or operator run (i.e. TaskRun and OperatorRun).

Once the output value is received as an input by a task, the upstream task is determined by inspecting the output-run association table.
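
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `Msg`, `ProvTable`, `emit` and `upstreamOf` are hypothetical, and run identifiers are plain strings standing in for TaskRun and OperatorRun instances.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical wrapper: one instance per emitted value.
final class Msg {
    final Object value;                 // the original, unwrapped output value
    Msg(Object value) { this.value = value; }
}

// Hypothetical output-run association table.
final class ProvTable {
    // Keys are compared by reference (==), so two equal values emitted
    // by different runs still map to different entries.
    private final Map<Msg, String> producedBy = new IdentityHashMap<>();

    // Wrap an output value and record the run that emitted it.
    synchronized Msg emit(String runId, Object value) {
        Msg msg = new Msg(value);
        producedBy.put(msg, runId);
        return msg;
    }

    // Resolve the upstream run of a received message (null if unknown).
    synchronized String upstreamOf(Msg msg) {
        return producedBy.get(msg);
    }
}

class ProvTableDemo {
    public static void main(String[] args) {
        ProvTable table = new ProvTable();
        Msg a = table.emit("task-1", "hello");
        Msg b = table.emit("task-2", "hello");   // same value, different run
        System.out.println(table.upstreamOf(a)); // prints "task-1"
        System.out.println(table.upstreamOf(b)); // prints "task-2"
    }
}
```

The identity-keyed map is what lets two tasks that emit the same value (for example the same string) be told apart downstream.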

Required changes

This approach requires enclosing each output value in a wrapper object and unwrapping it once it is received by the downstream task or operator, so that the corresponding operation is not altered.

The input unwrapping can be automated easily both for tasks and operators because they have a common message receive interface.

However, the output wrapping requires modifying all Nextflow operators, because each of them has custom logic to produce its outputs.
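
To illustrate the asymmetry, here is a rough sketch under assumed names (`Wrapped`, `Unwrap.unwrap` and `Unwrap.mapOp` are hypothetical and not taken from the PR). Unwrapping fits in a single shared helper on the receive path, while wrapping has to be written into each operator's own output logic, shown here for a map-like operator.

```java
import java.util.function.Function;

// Hypothetical wrapper around an emitted value.
final class Wrapped {
    final Object value;
    Wrapped(Object value) { this.value = value; }
}

final class Unwrap {
    // Shared receive path: strip the wrapper if present,
    // pass plain values through untouched.
    static Object unwrap(Object received) {
        return received instanceof Wrapped ? ((Wrapped) received).value : received;
    }

    // A map-like operator: the user function only ever sees the raw value,
    // and the operator itself is responsible for wrapping what it emits.
    static Wrapped mapOp(Object received, Function<Object, Object> fn) {
        Object raw = unwrap(received);
        return new Wrapped(fn.apply(raw));
    }
}

class UnwrapDemo {
    public static void main(String[] args) {
        Wrapped out = Unwrap.mapOp(new Wrapped(21), x -> ((Integer) x) * 2);
        System.out.println(out.value); // prints "42"
    }
}
```

Because the user-supplied function receives the unwrapped value, existing pipeline code behaves the same with or without the wrapper.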

Possible problems

The impact on the Java heap of creating an object instance for each output value generated by the workflow execution should be assessed.

Similarly, keeping a heap reference for each task and operator run may cause memory pressure on large workflow graphs.

Current state and next steps

The current implementation demonstrates that this approach is viable. The solution already supports all tasks and the following operators: branch, map, flatMap, collectFile.

Tests are available for these cases.

The remaining operators should be added to fully support existing workflow applications.

Alternative solution

A simpler solution is possible using the output file paths as the identity value to track task provenance, with logic very similar to the above proposal.

However, the path approach is limited to the case in which all workflow tasks and operators produce file values. The provenance cannot be tracked for tasks having one or more non-file input/output values.
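
A minimal sketch of the path-based alternative, again with hypothetical names (`PathProv`, `recordOutput`, `upstreamOf`) and string task identifiers, shows both the idea and its limitation: non-file inputs have no path to look up, so their upstream task is simply not found.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical path-keyed provenance table.
final class PathProv {
    private final Map<Path, String> producer = new HashMap<>();

    // Record that a task produced the given output file.
    void recordOutput(String taskId, Path out) {
        producer.put(out.toAbsolutePath().normalize(), taskId);
    }

    // Resolve upstream tasks for a task's inputs. Inputs that are not
    // paths cannot be matched and contribute nothing to the result.
    Set<String> upstreamOf(List<Object> inputs) {
        Set<String> result = new LinkedHashSet<>();
        for (Object in : inputs) {
            if (in instanceof Path) {
                String id = producer.get(((Path) in).toAbsolutePath().normalize());
                if (id != null) result.add(id);
            }
        }
        return result;
    }
}

class PathProvDemo {
    public static void main(String[] args) {
        PathProv prov = new PathProv();
        prov.recordOutput("index", Paths.get("/work/ab/12/index.fa"));
        // The string and the integer inputs leave no trace in the result.
        List<Object> inputs = List.of(Paths.get("/work/ab/12/index.fa"), "sample-1", 42);
        System.out.println(prov.upstreamOf(inputs)); // prints "[index]"
    }
}
```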

Signed-off-by: Paolo Di Tommaso <[email protected]>
@pditommaso pditommaso requested review from jorgee and bentsherman and removed request for jorgee January 5, 2025 13:06
netlify bot commented Jan 5, 2025

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit 06a2ac3
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/67a8cb581868b700084fa1e8

@pditommaso pditommaso marked this pull request as draft January 5, 2025 13:22
Signed-off-by: Paolo Di Tommaso <[email protected]>
@bentsherman
Member

bentsherman commented Jan 18, 2025

I ripped the code from nf-prov to generate a mermaid diagram of the task graph using your provenance method.

Your rnaseq-nf toy pipeline works fine:

[image: Mermaid diagram of the rnaseq-nf task graph]

I tried to run against fetchngs, but the run hangs at the very end 😞

You should be able to reproduce it with:

make pack
./build/releases/nextflow-24.11.0-edge-dist run nf-core/fetchngs -r 1.12.0 -profile test,conda --outdir results

@pditommaso
Member Author

Well done. I'll check fetchngs asap

import groovy.util.logging.Slf4j
import nextflow.prov.OperatorRun
/**
* Implements an operator context that binds a new run to the current thread
Member

Worth commenting that a "thread" here refers to a DataflowProcessor (i.e. input channel).

My custom tests on join and mix are working correctly, so I'm assuming that each DataflowProcessor uses the same thread for all of its runs (as long as maxForks is 1).

For operators like collate and groupTuple that use only one DP, the run-per-thread context essentially allows you to manually override how the provenance links are recorded. They are also working correctly with my tests.

Member

Even better would be to replace the thread-local with a Map<DataflowProcessor,OperatorLink>, but not a big deal if that would be too complicated
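
As a rough sketch of that suggestion (the class names `Processor`, `Run` and `RunContext` are hypothetical stand-ins for DataflowProcessor, OperatorRun/OperatorLink and the context class), the current run could be keyed on the processor instead of the calling thread:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a dataflow processor; uses default
// identity-based equals/hashCode, so each instance is its own key.
final class Processor {
    final String name;
    Processor(String name) { this.name = name; }
}

// Hypothetical stand-in for an operator run.
final class Run {
    final String id;
    Run(String id) { this.id = id; }
}

// Binds the current run to a processor rather than to the current thread,
// so the lookup does not depend on which thread performs it.
final class RunContext {
    private final Map<Processor, Run> current = new ConcurrentHashMap<>();

    void bind(Processor p, Run r) { current.put(p, r); }
    Run get(Processor p)          { return current.get(p); }
    void unbind(Processor p)      { current.remove(p); }
}

class RunContextDemo {
    public static void main(String[] args) {
        RunContext ctx = new RunContext();
        Processor left = new Processor("join-left");
        Processor right = new Processor("join-right");
        ctx.bind(left, new Run("run-1"));
        ctx.bind(right, new Run("run-2"));
        System.out.println(ctx.get(left).id);  // prints "run-1"
        System.out.println(ctx.get(right).id); // prints "run-2"
    }
}
```

The trade-off is that callers must have the processor instance at hand wherever the run is looked up, which a thread-local avoids.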

pditommaso and others added 17 commits January 31, 2025 19:29
Signed-off-by: Paolo Di Tommaso <[email protected]>
…i fast]


Signed-off-by: adamrtalbot <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Co-authored-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Co-authored-by: Ben Sherman <[email protected]>
Co-authored-by: Chris Hakkaart <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
@pditommaso
Member Author

Almost complete. Only the following operators need to be reviewed:

  • cross
  • dump
  • merge
  • subscribe
  • transpose

There's also the following non-deterministic error to be investigated:

OperatorImplTest > testFilterWithValue FAILED
    org.spockframework.runtime.SpockTimeoutError at OperatorImplTest.groovy:85


4 participants