This tutorial helps you set up a preprocessing workflow for structural data of the studyforrest project with fmriprep. You can find the (retrievable, recomputable) results of this processing routine at github.com/psychoinformatics-de/processing-workflow-tutorial. For this workflow, you will need:
- DataLad version 0.14 or higher
- The DataLad extension datalad-container
- Singularity
- flock
- HTCondor
- a freesurfer license file (free registration required)
Please make sure that you have a configured Git identity (see instructions here).
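If you have not yet configured a Git identity, a minimal setup looks like this (substitute your own name and email address):
$ git config --global user.name "Your Name"
$ git config --global user.email "[email protected]"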
Please place a freesurfer license file in your home directory.
Clone this repository to your compute cluster.
git clone [email protected]:psychoinformatics-de/processing-workflow.git
The relevant file for this tutorial is `bootstrap_forrest_fmriprep.sh`.
Open it in an editor of your choice, and adjust the following fields:
- `output_store` and `input_store`: Please provide RIA URLs to a place where an input and an output store can be created. These locations should be writable by you, and take the form `ria+ssh://[user@]hostname:/absolute/path/to/ria-store` or `ria+file:///absolute/path/to/ria-store`. More information on RIA stores and RIA URLs is at handbook.datalad.org/r.html?RIA.
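For illustration, the edited variables in the script could look like this; both paths are placeholders that you need to substitute with writable locations of your own:
input_store="ria+file:///data/project/my_input_store"
output_store="ria+file:///data/project/my_output_store"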
No other adjustments should be necessary. Optionally, you can
- adjust the variable `source_ds` to a name of your choice. If you do not modify it, the workflow will set up a temporary analysis dataset called `forrest`.
- if your freesurfer license is not located in your home directory, replace the line `cp ~/license.txt code/license.txt` with an alternative copy command that transfers the freesurfer license from a different place into `code/license.txt` (see the example after this list).
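For example, if the license file lives in a project directory instead (the path below is a placeholder you would substitute with your own), the adjusted line could read:
cp /data/project/freesurfer/license.txt code/license.txt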
Execute `bootstrap_forrest_fmriprep.sh` by running
$ bash bootstrap_forrest_fmriprep.sh
This command will:
- Create a temporary analysis dataset called "forrest" under your current directory
- Link openly available input data from GitHub (studyforrest data) and an openly available container collection (ReproNim containers) to your analysis
- Create a temporary input RIA store and an output RIA store under the paths you have supplied, and register them in the analysis dataset
- Create an HTCondor-based job submission setup for structural preprocessing on a per-subject level
- Push the analysis dataset into both RIA stores
If the script finishes with "SUCCESS", you're good to go.
Navigate into the newly created analysis dataset, and submit the Condor DAG that was created automatically:
$ cd forrest
$ condor_submit_dag code/process.condor_dag
-----------------------------------------------------------------------
File for submitting this DAG to HTCondor : code/process.condor_dag.condor.sub
Log of DAGMan debugging messages : code/process.condor_dag.dagman.out
Log of HTCondor library output : code/process.condor_dag.lib.out
Log of HTCondor library error messages : code/process.condor_dag.lib.err
Log of the life of condor_dagman itself : code/process.condor_dag.dagman.log
Submitting job(s).
1 job(s) submitted to cluster 413143.
You can monitor the execution of the jobs via standard HTCondor commands such as `condor_q -nobatch`, or by checking the log files that will be collected in the `logs` directory and in the `code/process.condor_dag*` files.
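For instance, to follow the DAGMan debugging log (one of the files listed in the submission output above) while jobs are running:
$ tail -f code/process.condor_dag.dagman.out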
When the jobs have finished, make sure that all of them completed successfully, for example by querying the log files for the word "SUCCESS".
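A quick sketch of such a check, assuming the per-job logs end up as files in the `logs` directory: `grep -L` lists files that do not contain the pattern, so the command should print nothing if all jobs succeeded.
$ grep -L SUCCESS logs/*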
After successful completion of all jobs, the results exist in individual branches in the dataset in the output store. To consolidate these results, all branches need to be merged.
This is done in a temporary dataset clone from the RIA store. First, get the dataset ID in order to find its address in the RIA store. In your analysis dataset, run
$ datalad -f '{infos[dataset][id]}' wtf -S dataset
2758dbcb-fa39-40fb-8e1e-6b30d9103549
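If you are scripting this step, you can capture the ID in a shell variable and splice it into the clone URL used below:
$ dsid=$(datalad -f '{infos[dataset][id]}' wtf -S dataset)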
Navigate into any temporary location, and clone the dataset from the output store:
$ cd /tmp
$ datalad clone 'ria+file:///data/group/psyinf/myoutputstore#2758dbcb-fa39-40fb-8e1e-6b30d9103549' merger
[INFO ] Scanning for unlocked files (this may take some time)
[INFO ] Configure additional publication dependency on "output-storage"
configure-sibling(ok): . (sibling)
install(ok): /data/group/psyinf/scratch/merger (dataset)
action summary:
configure-sibling (ok: 1)
install (ok: 1)
Check that the expected number of branches is present:
$ cd merger
$ git branch -a | grep job- | sort | wc -l
21
Perform a further sanity check that each branch has a new commit. To do this, find the most recent commit on the `master` branch (or the `main` branch, if your default branch is called `main`).
$ git show-ref master | cut -d ' ' -f1
609e5395596b9fbc8534f9c175dbf95d631c633c
Plug this hash into the command below. If it returns 0, this means that every job branch has a new commit on top of the reference commit the analysis started from.
$ for i in $(git branch -a | grep job- | sort); do [ x"$(git show-ref $i \
| cut -d ' ' -f1)" = x"609e5395596b9fbc8534f9c175dbf95d631c633c" ] && \
echo $i; done | tee /tmp/nores.txt | wc -l
0
As the number of result branches is very small, you can merge them in one go with the following command:
$ git merge -m "Merge results" $(git branch -al | grep 'job-' | tr -d ' ')
Fast-forwarding to: remotes/origin/job-01.410198
Trying simple merge with remotes/origin/job-02.410204
Trying simple merge with remotes/origin/job-03.410193
Trying simple merge with remotes/origin/job-04.410194
Trying simple merge with remotes/origin/job-05.410203
Trying simple merge with remotes/origin/job-06.410197
Trying simple merge with remotes/origin/job-07.410189
Trying simple merge with remotes/origin/job-08.410205
Trying simple merge with remotes/origin/job-09.410192
Trying simple merge with remotes/origin/job-10.410191
Trying simple merge with remotes/origin/job-11.410206
Trying simple merge with remotes/origin/job-12.410200
Trying simple merge with remotes/origin/job-13.410188
Trying simple merge with remotes/origin/job-14.410190
Trying simple merge with remotes/origin/job-15.410196
Trying simple merge with remotes/origin/job-16.410202
Trying simple merge with remotes/origin/job-17.410195
Trying simple merge with remotes/origin/job-18.410201
Trying simple merge with remotes/origin/job-19.410187
Trying simple merge with remotes/origin/job-20.410199
Trying simple merge with remotes/origin/job-21.410186
Merge made by the 'octopus' strategy.
fmriprep/sub-01/anat/sub-01_desc-brain_mask.json | 1 +
fmriprep/sub-01/anat/sub-01_desc-brain_mask.nii.gz | 1 +
fmriprep/sub-01/anat/sub-01_desc-preproc_T1w.json | 1 +
fmriprep/sub-01/anat/sub-01_desc-preproc_T1w.nii.gz | 1 +
fmriprep/sub-01/anat/sub-01_dseg.nii.gz | 1 +
[...]
This works well because different jobs never modified the same file. If you run a full fmriprep workflow, head over to handbook.datalad.org/r.html?runhcp for information on how to handle merge conflicts in the `CITATION.md` file.
First, check that everything is in order, for example by checking that the expected directories and files are present:
$ tree -d fmriprep
fmriprep
├── sub-01
│ ├── anat
│ ├── figures
│ └── log
│ └── 20210318-155553_c8ea2b0c-f557-429f-a598-faf34876ba5b
├── sub-02
│ ├── anat
│ ├── figures
│ └── log
│ └── 20210318-155635_65694134-71b9-40e2-aaee-57910bfa24f9
├── sub-03
│ ├── anat
│ ├── figures
│ └── log
│ └── 20210318-155550_949f06a2-d45d-445a-a07b-62f309765ea6
[...]
If you view your revision history with a tool like `tig`, you should see a colorful merge operation.
Now you can push the merge back into the output store:
$ git push
Enumerating objects: 23, done.
Counting objects: 100% (23/23), done.
Delta compression using up to 32 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 1.26 KiB | 431.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0)
To /data/group/psyinf/myoutputstore/275/8dbcb-fa39-40fb-8e1e-6b30d9103549
609e539..ca3b612 master -> master
The name of the storage remote in the output store is "output-storage". We can see it listed in a `datalad siblings` call:
$ datalad siblings
.: here(+) [git]
.: origin(-) [/data/group/psyinf/myoutputstore/275/8dbcb-fa39-40fb-8e1e-6b30d9103549 (git)]
.: output-storage(+) [ora]
A "git annex fs-check", done with the git annex fsck
command, checks what
data is available from output-storage
, and links it to the correct files in
your dataset. Its important to do the operation with a --fast
flag for big
datasets!
$ git annex fsck --fast -f output-storage
fsck fmriprep/sub-01/anat/sub-01_desc-brain_mask.json (fixing location log) ok
fsck fmriprep/sub-01/anat/sub-01_desc-brain_mask.nii.gz (fixing location log) ok
fsck fmriprep/sub-01/anat/sub-01_desc-preproc_T1w.json (fixing location log) ok
[...]
fsck fmriprep/sub-21/figures/sub-21_desc-summary_T1w.html (fixing location log) ok
fsck fmriprep/sub-21/figures/sub-21_dseg.svg (fixing location log) ok
fsck fmriprep/sub-21/figures/sub-21_space-MNI152NLin2009cAsym_T1w.svg (fixing location log) ok
fsck fmriprep/sub-21/log/20210318-155652_f2c4be32-546f-4ba1-803e-08ae2c587d15/fmriprep.toml (fixing location log) ok
(recording state in git...)
Make sure that each file has associated content - the command below should not return any output.
$ git annex find --not --in output-storage
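As a complementary check, you can count how many annexed files have their content registered in `output-storage`:
$ git annex find --in output-storage | wc -l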
As the dataset clone is a temporary clone used only for merging and restoring file availability information, we do not want to record it as a known data location in the distributed network of dataset clones:
$ git annex dead here
dead here ok
(recording state in git...)
Finally, do a `datalad push` without data to propagate the file availability information back into the dataset in the store.
$ datalad push --data nothing
publish(ok): . (dataset) [refs/heads/git-annex->origin:refs/heads/git-annex 9f5319f..fc7819a]
To make cloning of the result dataset easier, create an alias for it. First, create a directory called `alias` in the root of your RIA store:
$ mkdir /data/group/psyinf/myoutputstore/alias
Then, place a symlink inside of it that points to the dataset, with a name of your choice:
$ ln -s /data/group/psyinf/myoutputstore/275/8dbcb-fa39-40fb-8e1e-6b30d9103549 /data/group/psyinf/myoutputstore/alias/structural-forrest
$ tree /data/group/psyinf/myoutputstore/alias
alias
└── structural-forrest -> ../275/8dbcb-fa39-40fb-8e1e-6b30d9103549
The dataset can now be cloned with its alias:
$ datalad clone 'ria+file:///data/group/psyinf/myoutputstore#~structural-forrest'
[INFO ] Scanning for unlocked files (this may take some time)
[INFO ] Configure additional publication dependency on "output-storage"
configure-sibling(ok): . (sibling)
install(ok): /tmp/structural-forrest (dataset)
action summary:
configure-sibling (ok: 1)
install (ok: 1)
Data is retrieved with `datalad get`:
$ cd structural-forrest
$ datalad get fmriprep/sub-08/anat/sub-08_desc-brain_mask.nii.gz
get(ok): fmriprep/sub-08/anat/sub-08_desc-brain_mask.nii.gz (file) [from output-storage...]
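`datalad get` also accepts directories, so you can retrieve all results of a subject at once:
$ datalad get fmriprep/sub-08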
Check a provenance record of a subject:
$ git log fmriprep/sub-13/anat/
commit e2eb2592583acf16edfa17565ff2bd90fa0a7070 (origin/job-13.410188)
Author: Adina Wagner <[email protected]>
Date: Thu Mar 18 19:09:23 2021 +0100
[DATALAD RUNCMD] Compute sub-13
=== Do not change lines below ===
{
"chain": [],
"cmd": "singularity run -B {pwd} --cleanenv code/pipeline/images/bids/bids-fmriprep--20.2.0.sing /var/lib/cond
"dsid": "2758dbcb-fa39-40fb-8e1e-6b30d9103549",
"exit": 0,
"extra_inputs": [
"code/pipeline/images/bids/bids-fmriprep--20.2.0.sing"
],
"inputs": [
"inputs/data/sub-13/anat/",
"code/license.txt"
],
"outputs": [
"fmriprep/sub-13"
],
"pwd": "."
}
^^^ Do not change lines above ^^^
Rerun a computation for a single subject:
$ datalad rerun e2eb2592583acf16edfa17565ff2bd90fa0a7070
[INFO ] run commit e2eb259; (Compute sub-13)
[INFO ] Making sure inputs are available (this may take some time)
[INFO ] Scanning for unlocked files (this may take some time)
[INFO ] Remote origin not usable by git-annex; setting annex-ignore
get(ok): inputs/data/sub-13/anat/sub-13_SWI.json (file) [from mddatasrc...]
get(ok): inputs/data/sub-13/anat/sub-13_SWI_defacemask.nii.gz (file) [from mddatasrc...]
get(ok): inputs/data/sub-13/anat/sub-13_SWImag.nii.gz (file) [from mddatasrc...]
get(ok): inputs/data/sub-13/anat/sub-13_SWIphase.nii.gz (file) [from mddatasrc...]
get(ok): inputs/data/sub-13/anat/sub-13_T1w.json (file) [from mddatasrc...]
get(ok): inputs/data/sub-13/anat/sub-13_T1w.nii.gz (file) [from mddatasrc...]
get(ok): inputs/data/sub-13/anat/sub-13_T1w_defacemask.nii.gz (file) [from mddatasrc...]
get(ok): inputs/data/sub-13/anat/sub-13_T2w.json (file) [from mddatasrc...]
get(ok): inputs/data/sub-13/anat/sub-13_T2w.nii.gz (file) [from mddatasrc...]
get(ok): inputs/data/sub-13/anat/sub-13_T2w_defacemask.nii.gz (file) [from mddatasrc...]
get(ok): inputs/data/sub-13/anat (directory)
[...]
[INFO ] Scanning for unlocked files (this may take some time)
[INFO ] Remote origin not usable by git-annex; setting annex-ignore
get(ok): code/pipeline/images/bids/bids-fmriprep--20.2.0.sing (file) [from datalad...]
[...]
[INFO ] == Command start (output follows) =====
[...] FMRIPREP [...]
fMRIPrep finished successfully!
210416-09:31:48,560 nipype.workflow IMPORTANT:
Works derived from this fMRIPrep execution should include the boilerplate text found in <OUTPUT_PATH>/fmriprep/logs/CITATION.md.
Sentry is attempting to send 0 pending error messages
Waiting up to 2 seconds
Press Ctrl-C to quit
[INFO ] == Command exit (modification check follows) =====
[14 similar messages have been suppressed]
delete(ok): fmriprep/sub-10/log/20210415-084510_4d00340f-c56c-4c70-86bc-a11c90434e47/fmriprep.toml (file)
add(ok): fmriprep/.bidsignore (file)
add(ok): fmriprep/dataset_description.json (file)
add(ok): fmriprep/sub-10.html (file)
add(ok): fmriprep/sub-10/log/20210416-080524_22eae3d2-4b4c-469c-ad69-c32719e59b54/fmriprep.toml (file)
add(ok): fmriprep/sub-10/anat/sub-10_desc-brain_mask.json (file)
add(ok): fmriprep/sub-10/anat/sub-10_desc-brain_mask.nii.gz (file)
add(ok): fmriprep/sub-10/anat/sub-10_desc-preproc_T1w.json (file)
add(ok): fmriprep/sub-10/anat/sub-10_desc-preproc_T1w.nii.gz (file)
add(ok): fmriprep/sub-10/anat/sub-10_dseg.nii.gz (file)
add(ok): fmriprep/sub-10/anat/sub-10_from-MNI152NLin2009cAsym_to-T1w_mode-image_xfm.h5 (file)
[17 similar messages have been suppressed]
save(ok): . (dataset)
action summary:
add (ok: 27)
delete (ok: 1)
get (notneeded: 3, ok: 12)
run.remove (ok: 24)
save (notneeded: 2, ok: 1)
datalad rerun bb75237bd2ffaca78153 335.76s user 79.67s system 5% cpu 2:04:41.81 total
If you want to recompute the complete sample instead of only individual subjects, resubmit the DAG from the analysis dataset (see the sketch below). Afterwards, repeat the merge operation. If results have changed, you will see them summarized in the Git history.
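A minimal sketch of such a resubmission, assuming the analysis dataset from the bootstrapping step still exists under `forrest`:
$ cd forrest
$ condor_submit_dag code/process.condor_dag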