Quickstart scripts

Scripts are available to install the dependencies, run the test datasets, troubleshoot issues, and, for researchers based in Sweden, run Cactus on NAISS resources (including rackham, bianca, miarka, snowy). Note that the script to install dependencies is provided for convenience, but its commands may not be up to date; it is recommended to follow the installation guidelines of the respective tools (links are provided in the Dependencies section of the Cactus documentation).

Dependencies and profiles

Cactus needs two pieces of software to be installed in order to run: Nextflow and one of SingularityCE, Docker, conda or Mamba. Please read the Dependencies section for details on the versions and installation of these tools. The -profile argument should then be used to specify which tool manager to use. In general, it is recommended to use SingularityCE on HPC systems, since Singularity containers can be run without sudo and Singularity images are immutable, which ensures a high level of reproducibility and verification of images (see also here). Users should check with their administrator which of these four options are available and recommended.

Install and run

The first step is to create the global configuration file .cactus.config in the user's home folder. This file must indicate the paths where the references and the Singularity containers should be downloaded. Here is an example of a .cactus.config file when using Singularity with a Tower token to monitor the pipeline runs on Nextflow Tower (which is highly recommended):

params.references_dir        = "${HOME}/workspace/cactus/references"
params.singularity_cache_dir = "${HOME}/workspace/singularity_containers"
params.tower_token           = "*"
params.enable_tower          = true
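
A minimal shell sketch for creating this file without overwriting an existing one. The function name `write_cactus_config` is hypothetical (it is not part of Cactus), and the Tower token is left as the `*` placeholder from the example above, to be replaced with a real token.

```shell
# Hypothetical helper (not part of Cactus): write a starter .cactus.config.
write_cactus_config() {
  local target="${1:-${HOME}/.cactus.config}"
  # Refuse to overwrite an existing configuration file.
  if [ -e "${target}" ]; then
    echo "Refusing to overwrite existing file: ${target}" >&2
    return 1
  fi
  # The quoted 'EOF' keeps ${HOME} literal, so the file matches the example above verbatim.
  cat > "${target}" <<'EOF'
params.references_dir        = "${HOME}/workspace/cactus/references"
params.singularity_cache_dir = "${HOME}/workspace/singularity_containers"
params.tower_token           = "*"
params.enable_tower          = true
EOF
}
```

Calling `write_cactus_config` with no argument targets `${HOME}/.cactus.config`; a different path can be passed as the first argument to inspect the result before putting it in place.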

One can then download references and test datasets for C. elegans with this command:

NXF_VER=22.10.8 nextflow run jsalignon/cactus/scripts/download/download.nf -r main -latest --test_datasets --references -profile singularity --species worm

This should download the references and the containers/virtual environments to the user-specified locations (i.e., params.references_dir for references and params.singularity_cache_dir for Singularity images), and the test dataset to the current folder, in the form of three folders: parameters, design and data. Please read the Inputs section for details on what these folders contain.

Note: Test datasets are also available for the species D. melanogaster (fly), M. musculus (mouse), and H. sapiens (human). They can be tested by changing the --species argument.
Note: On some platforms, Nextflow should be called with a './' prefix (i.e., ./nextflow run...).
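
To preview the download command for each species before running anything, a loop such as the following sketch can be used. It only prints the commands. The function name is hypothetical, and the species values `fly`, `mouse` and `human` are assumed from the names given in parentheses in the note above; they should be checked against the Cactus documentation.

```shell
# Print (do not execute) the download command for each test species.
# Species names other than "worm" are assumptions taken from the note above.
print_download_commands() {
  local species
  for species in worm fly mouse human; do
    echo "NXF_VER=22.10.8 nextflow run jsalignon/cactus/scripts/download/download.nf -r main -latest --test_datasets --references -profile singularity --species ${species}"
  done
}

print_download_commands
```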

Then, one can run Cactus and download all container images/virtual environments using this command:

NXF_VER=22.10.8 nextflow run jsalignon/cactus -profile singularity -params-file parameters/full_test.yml -r main -latest

One can update the pipeline using this command:

nextflow pull jsalignon/cactus

Results are stored in the folder results/Cactus_v${cactus_version} (this path can be changed with the parameter params.res_dir).

It is recommended to use either the worm or the fly test dataset when testing Cactus on a laptop, to reduce runtime. With 8 cores and 16 GB of RAM, the worm and fly test datasets can be run in ~27 and ~56 minutes, respectively, using this command:

NXF_VER=22.10.8 nextflow run jsalignon/cactus -r main -latest -params-file parameters/full_test.yml -profile singularity --executor_local_cpus 8 --executor_local_memory '16G' --res_dir 'results/almost_full_test'  --split__peak_assignment ['all'] --split__threshold_values [200]

Note: The run parameters can be set in a .yml file or on the command line (as shown just above), with the latter taking priority over the former. When setting parameters on the command line, one dash indicates Nextflow's internal parameters (e.g. -profile) and two dashes indicate Cactus' own parameters (e.g. --res_dir).

Note: A minimum of 6 cores is required to run Cactus.
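
A small pre-flight check for this requirement could look like the sketch below. The function name is hypothetical, and the default relies on `nproc` (GNU coreutils), which may not be available on every platform; the core count can also be passed explicitly.

```shell
# Hypothetical pre-flight check: Cactus needs at least 6 cores (see the note above).
# Usage: has_enough_cores [available_cores]; defaults to the count reported by nproc.
has_enough_cores() {
  local min_cores=6
  local available="${1:-$(nproc)}"
  if [ "${available}" -lt "${min_cores}" ]; then
    echo "Only ${available} cores available; Cactus requires at least ${min_cores}." >&2
    return 1
  fi
  echo "${available} cores available: OK"
}
```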

Additional details

It is recommended to use Nextflow Tower to easily monitor pipeline progress.

Nextflow handles job submissions on SLURM or other environments, and supervises the running jobs. Thus, the Nextflow process must keep running until the pipeline is finished. We recommend running the process in the background through screen, tmux or a similar tool. Alternatively, you can run Nextflow within a cluster job submitted to your job scheduler.
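
One way to keep the Nextflow process alive after logging out is a detached tmux session. The sketch below assumes tmux is installed; the function name and the default session name `cactus` are illustrative only.

```shell
# Hypothetical helper: start the test run inside a detached tmux session.
run_cactus_in_tmux() {
  local session="${1:-cactus}"
  # -d starts the session detached, -s sets its name; the quoted string is the command it runs.
  tmux new-session -d -s "${session}" \
    "NXF_VER=22.10.8 nextflow run jsalignon/cactus -profile singularity -params-file parameters/full_test.yml -r main -latest"
  echo "Started tmux session '${session}'; reattach with: tmux attach -t ${session}"
}
```

Running `run_cactus_in_tmux` from the folder that contains the parameters, design and data folders starts the run; the session can then be reattached at any time to follow progress.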

It is recommended to limit the memory of the Nextflow Java virtual machine. We recommend adding the following line to your environment (typically in ~/.bashrc or ~/.bash_profile):

export NXF_OPTS='-Xms1g -Xmx4g'
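
The sketch below adds this setting to a startup file only if it is not already there, so it can be run repeatedly. The function name is hypothetical. The line is written with `export` so that the variable is visible to processes started from the shell, such as the Nextflow launcher.

```shell
# Hypothetical helper: add the NXF_OPTS line to a shell startup file only once.
add_nxf_opts() {
  local rc_file="${1:-${HOME}/.bashrc}"
  local line="export NXF_OPTS='-Xms1g -Xmx4g'"
  # -x matches whole lines, -F treats the pattern as a fixed string.
  if grep -qxF "${line}" "${rc_file}" 2>/dev/null; then
    echo "NXF_OPTS already set in ${rc_file}"
  else
    echo "${line}" >> "${rc_file}"
    echo "Added NXF_OPTS to ${rc_file}"
  fi
}
```

Here `-Xms1g` sets the initial Java heap size to 1 GB and `-Xmx4g` caps it at 4 GB.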

Parameters

Global parameters can be changed in the nextflow.config file. These include output folder names, resources (memory and CPU usage), and the type of machine on which to run the script (local or cloud). It is up to the user to set up the optimal Nextflow environment for their analysis platform. Help can be found here for that.

Analysis parameters can be changed in the yml input file. See the Parameters section for more details on parameter files and on available parameters.

Reproducibility

It's a good idea to specify a release version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since. One can specify the version of the pipeline using the -r argument this way:

NXF_VER=22.10.8 nextflow run jsalignon/cactus -r release_tag -profile {singularity,docker,conda,mamba} -params-file parameter_file

For instance:

NXF_VER=22.10.8 nextflow run jsalignon/cactus -r v1.0.0 -profile singularity -params-file parameters/full_test.yml
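
The template above can be wrapped in a small function that rejects profile names other than the four listed. This is only a sketch: the function name is hypothetical, and it prints the command rather than running it.

```shell
# Hypothetical helper: print a version-pinned Cactus command after checking the profile name.
pinned_cactus_command() {
  local release_tag="$1" profile="$2" params_file="$3"
  case "${profile}" in
    singularity|docker|conda|mamba) ;;
    *) echo "Unknown profile: ${profile}" >&2; return 1 ;;
  esac
  echo "NXF_VER=22.10.8 nextflow run jsalignon/cactus -r ${release_tag} -profile ${profile} -params-file ${params_file}"
}

pinned_cactus_command v1.0.0 singularity parameters/full_test.yml
```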

Troubleshooting

In general, scrolling through Nextflow’s documentation can help resolve most issues.

The general process to resolve a crashing pipeline is to go to the folder indicated in the crash report, launch the appropriate container, and run the lines of code indicated in the crash report. This way one can try to identify and solve the issue. For finer inspection of the code and analysis, a good idea is to run Cactus in the background to get a detailed log file. Note that the -dump-channels argument can also be used to explore channel contents.

The -bg argument can be used to run cactus in the background like this:

NXF_VER=22.10.8 nextflow run jsalignon/cactus -r main -latest -profile singularity -params-file parameters/full_test.yml -bg > nf_log.txt

This creates a .nextflow.pid file that contains the master PID to kill to stop the run in the background. However, this does not always work. A workaround to kill all running processes from the current run folder is to use this function:

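kill_nextflow_processes() {
  # Force-kill every process of the current user whose command line mentions this run's work folder
  kill -9 $(ps aux | grep "$(whoami)" | grep "${PWD}/work" | awk '{print $2}')
}
kill_nextflow_processes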
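
Before resorting to the forceful workaround above, the PID file can be tried first. The sketch below is a hypothetical helper that reads .nextflow.pid from the run folder (or a path given as argument) and sends the default termination signal.

```shell
# Hypothetical helper: ask the background Nextflow run to stop via its PID file.
stop_background_run() {
  local pid_file="${1:-.nextflow.pid}"
  if [ ! -f "${pid_file}" ]; then
    echo "No PID file found at ${pid_file}" >&2
    return 1
  fi
  # Plain kill sends SIGTERM, which gives the process a chance to shut down, unlike kill -9.
  kill "$(cat "${pid_file}")"
}
```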

Then, one can inspect/grep the nf_log.txt file to find the folder to inspect in more detail. Once in the appropriate folder, the following function can be used, if one uses Singularity, to open a shell in the container with the same settings as in Cactus and to display the set of commands that were run (in the .command.sh file):

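load_singularity_container() {
  # Extract the container image path from the Singularity call recorded in .command.run
  container=$(grep SINGULARITY .command.run | sed 's/.*\/dev\/shm //g' | sed 's/.img.*/.img/g')
  # -B binds a host folder into the container; replace /home/jersal with your own home folder
  singularity shell -B /home/jersal --containall --cleanenv --home "$PWD" --workdir /dev/shm "$container"
}
# Example task folder; use the one indicated in the crash report
cd work/59/8a6fb9*
load_singularity_container
cat .command.sh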

Then each command can be run to try to find the error or to inspect the code.
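
To locate the task folder in nf_log.txt in the first place, a helper along these lines may be convenient. It is a sketch that assumes the Nextflow error report prints a "Work dir:" line followed by the path on the next line; the function name is hypothetical.

```shell
# Hypothetical helper: list the task work directories mentioned in a Nextflow log.
# Assumes the error report prints "Work dir:" followed by the path on the next line.
list_failed_work_dirs() {
  local log_file="${1:-nf_log.txt}"
  # Print the line after each "Work dir:" marker, with leading whitespace removed.
  awk '/Work dir:/ { getline; sub(/^[ \t]+/, ""); print }' "${log_file}"
}
```

The printed path can then be passed to `cd` before opening the container.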

Note: The first line of the .command.sh file indicates whether the script should be run in bash or in R.
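
This check can be scripted as in the sketch below, which reads the first line of the script and reports the interpreter. The function name is hypothetical, and the patterns assume that R scripts name `Rscript` and bash scripts name `bash` on that line.

```shell
# Hypothetical helper: report the interpreter named on the first line of a task script.
script_interpreter() {
  local script="${1:-.command.sh}"
  local first_line=""
  # Read only the first line of the script.
  IFS= read -r first_line < "${script}" || true
  case "${first_line}" in
    '#!'*Rscript*) echo "R" ;;
    '#!'*bash*)    echo "bash" ;;
    *)             echo "unknown: ${first_line}" ;;
  esac
}
```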

If none of these work, an issue can be created on the Cactus GitHub page to report the problem.