diff --git a/.nojekyll b/.nojekyll
index f1e7790a..c31efbe9 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-6e8a44e8
\ No newline at end of file
+ea827d25
\ No newline at end of file
diff --git a/index.html b/index.html
index c274ec3e..d0185398 100644
--- a/index.html
+++ b/index.html
@@ -143,7 +143,7 @@

How to work reproducibly with control and structuring of project code, environment and workflow management


-

Updated: 10-10-2024 at 09:14:09 .

+

Updated: 15-10-2024 at 16:39:31 .

diff --git a/pages/conda.html b/pages/conda.html
new file mode 100644
index 00000000..c184c8cd
--- /dev/null
+++ b/pages/conda.html
@@ -0,0 +1,947 @@
+
+ +
+ +
+
+
+

Controlling your environment with Conda

+

How to keep track of your software packages and their versions

+
+
+ + +
+ + +
+
Published
+
+

15-Oct-2024

+
+
+ + +
+ + +
+ + + + +
+ + + + + + +
+

1 Introduction

+

Conda is a package and environment manager. As a package manager it enables you to install a wide range of software and tools using one simple command: conda install. As an environment manager it allows you to create and manage multiple different environments, each with their own set of packages.

+

What are the benefits of using an environment manager? Some examples include the ability to easily run different versions of the same package, have different cross-package dependencies that are otherwise incompatible with each other and, last but not least, easy installation of all the software needed for an analysis.

+

Environments are of particular relevance when making bioinformatics projects reproducible. Full reproducibility requires the ability to recreate the system that was originally used to generate the results. This can, to a large extent, be accomplished by using Conda to make a project environment with specific versions of the packages that are needed in the project. You can read more about Conda here.

+

A Conda package is a compressed tarball containing system-level libraries, Python or other modules, executable programs or other components. Conda keeps track of the dependencies between packages and platforms - this means that when installing a given package, all necessary dependencies will also be installed.

+

Conda packages are typically hosted and downloaded from remote so-called channels. Some widely used channels for general-purpose and bioinformatics packages are conda-forge and Bioconda, respectively. Both of these are community-driven projects, so if you’re missing some package you can contribute to the channel by adding the package to it. When installing a Conda package you specify the package name, version (optional) and channel to download from.
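As a small illustration (the version is just an example), installing a specific FastQC version from the Bioconda channel looks like this:

conda install -c bioconda fastqc=0.12.1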

+

A Conda environment is essentially a directory that is added to your PATH and that contains a specific collection of packages that you have installed. Packages are symlinked between environments to avoid unnecessary duplication.

+
+

Different Conda flavours: You may come across several flavours of Conda. There's Miniconda, which is a minimal installer for Conda. Then there's Anaconda, which is a distribution of not only Conda, but also over 150 scientific Python packages curated by the company of the same name (Anaconda). It's generally better to stick with the Miniconda installation rather than installing 3 GB worth of packages you may not even use. Lastly, there's the Miniforge flavour that we're using here, a community-driven Conda installer that's highly popular within the scientific community.

+

The difference between Miniconda and Miniforge is that the former points to Anaconda's default channel by default (which requires an Anaconda license for commercial purposes), while the latter points to the community-maintained conda-forge channel by default. While Conda is created and owned by Anaconda the company, Conda itself is open source - it's the default channel that is proprietary. The conda-forge and bioconda channels (two of the largest channels outside of default) are community-driven. Confusing? Yes. If you want this information more in-depth you can read this blog post by Anaconda.

+
+
+
+

2 The basics

+

This tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to set it up, if you haven’t done so already. Then open up a terminal and go to workshop-reproducible-research/tutorials/conda. Instructions below assume that you are standing in workshop-reproducible-research/tutorials/conda/ unless otherwise specified (e.g. if it says “create a file”, it means save it in workshop-reproducible-research/tutorials/conda/).

+

Let’s assume that you are just about to start a new exciting research project called Project A.

+
+

2.1 Creating Conda environments

+
    +
  • Let’s make our first Conda environment:
  • +
+
conda create -n project_a -c bioconda fastqc
+

This will create an environment called project_a, containing FastQC from the Bioconda channel. Conda will list the packages that will be installed and ask for your confirmation.

+
    +
  • Once it is done, you can activate the environment:
  • +
+
conda activate project_a
+

By default, Conda will add information to your prompt telling you which environment is active.

+
    +
  • To see all your environments you can run:
  • +
+
conda info --envs
+

The active environment will be marked with an asterisk.

+
    +
  • To see the installed packages and their versions in the active environment, run:
  • +
+
conda list
+
    +
  • To save the installed packages to a file, run:
  • +
+
conda env export --from-history > environment.yml
+

Here, --from-history means that only the packages you explicitly requested are exported, not their dependencies. A caveat is that if no version was originally specified, then it is not included in the export file either.
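To compare the two behaviours you can export both ways; a quick sketch (the file names are just suggestions):

conda env export > environment-full.yml             # everything, with exact versions and builds
conda env export --from-history > environment.yml   # only the packages you explicitly asked for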

+
    +
  • Now, deactivate the environment by running conda deactivate.
  • +
  • List all environments again. Which environment is now marked as active?
  • +
  • Try to run FastQC:
  • +
+
fastqc --version
+
    +
  • Did it work? Activate your project_a environment and run the fastqc --version command again. Does it work now?
  • +
+

Hopefully the FastQC software was not found in your base environment (unless you had installed it previously), but worked once your environment was activated.
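If you want to see where the activated environment actually lives, which and echo $PATH are handy (a quick check, assuming project_a is active):

which fastqc   # should point to something like .../envs/project_a/bin/fastqc
echo $PATH     # the environment's bin directory comes first in the search path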

+
+
+

2.2 Adding more packages

+
    +
  • Now, let’s add another package (MultiQC) to our environment using conda install. Make sure that project_a is the active environment first.
  • +
+
conda install -c bioconda multiqc
+
    +
  • If we don’t specify the package version, the latest available version will be installed. What version of MultiQC got installed?
  • +
  • Run the following to see what versions are available:
  • +
+
conda search -c bioconda multiqc
+
    +
  • Now try to install a different version of MultiQC, e.g.:
  • +
+
conda install -c bioconda multiqc=1.13
+

Read the information that Conda displays in the terminal. It probably asks if you want to downgrade the initial MultiQC installation to the one specified here (1.13 in the example). You can only have one version of a given package in a given environment.

+

Let’s assume that you will have sequencing data in your Project A, and want to use the latest BBMap software to align your reads.

+
    +
  • Find out what versions of BBMap are available in the Bioconda channel using conda search -c bioconda bbmap.
  • +
  • Now install the latest available version of BBMap in your project_a environment.
  • +
+

Let’s further assume that you have an old project (called Project Old) where you know you used BBMap 37.10. You just got back reviewer comments and they want you to include some alignment statistics. Unfortunately, you haven’t saved that information so you will have to rerun the alignment. Now, it is essential that you use the same version of BBMap that your results are based on, otherwise the alignment statistics will be misleading. Using Conda environments this becomes simple. You can just have a separate environment for your old project where you have an old version of BBMap without interfering with your new Project A where you want the latest version.

+
    +
  • Make a new environment for your old project:
  • +
+
conda create -n project_old -c bioconda bbmap=37.10
+
    +
  • List your environments (do you remember the command?).
  • +
  • Activate project_old and check the BBMap version (bbmap.sh --version).
  • +
  • Activate project_a again and check the BBMap version.
  • +
+
+
+

2.3 Removing packages

+

Now let’s try to remove an installed package from the active environment:

+
conda remove multiqc
+
    +
  • Run conda deactivate to exit your active environment.
  • +
  • Now, let’s remove an environment:
  • +
+
conda env remove -n project_old
+

After making a few different environments and installing a bunch of packages, Conda can take up some disk space. You can remove unnecessary files with the command:

+
conda clean -a
+

This will remove package tar-balls that are left from package installations, unused packages (i.e. those not present in any environments), and cached data.
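If you first want to see what would be removed without actually deleting anything, conda clean has a dry-run option (a cautious sketch, assuming a reasonably recent Conda version):

conda clean -a --dry-run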

+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
    +
  • How to use conda install for installing packages on the fly.
  • +
  • How to create, activate and change between environments.
  • +
  • How to remove packages or environments and clean up.
  • +
+
+
+
+
+
+

3 Working with environments

+

We have up until now specified which Conda packages to install directly on the command line using the conda create and conda install commands. For working in projects this is not the recommended way. Instead, for increased control and reproducibility, it is better to use an environment file (in YAML format) that specifies the packages, versions and channels needed to create the environment for a project.

+

Throughout these tutorials we will use a case study where we analyse an RNA-seq experiment with the multi-resistant bacteria MRSA (see intro). You will now start to make a Conda YAML file for this MRSA project. The file will contain a list of the software and versions needed to execute the analysis code.

+

In this Conda tutorial, all code for the analysis is available in the script code/run_qc.sh. This code will download the raw FASTQ-files and subsequently run quality control on these using the FastQC software.

+
+

3.1 Working with environments

+

We will start by making a Conda YAML-file that contains the required packages to perform these two steps. Later in the course, you will update the Conda YAML-file with more packages, as the analysis workflow is expanded.

+
    +
  • Let’s get going! Make a YAML file called environment.yml looking like this, and save it in the current directory (which should be workshop-reproducible-research/tutorials/conda):
  • +
+
channels:
+  - conda-forge
+  - bioconda
+dependencies:
+  - fastqc=0.12.1
+
    +
  • Now, make a new Conda environment from the YAML file (note that here the command is conda env create as opposed to conda create that we used before):
  • +
+
conda env create -n project_mrsa -f environment.yml
+
+
+
+ +
+
+Tip +
+
+
+

You can also specify exactly which channel a package should come from inside the environment file, using the channel::package=version syntax.
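The same channel::package=version spec can, as far as we know, also be used directly on the command line, e.g.:

conda install bioconda::fastqc=0.12.1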

+
+
+
+
+
+ +
+
+Tip +
+
+
+

Instead of the -n flag you can use the -p flag to set the full path to where the Conda environment should be installed. In that way you can contain the Conda environment inside the project directory, which does make sense from a reproducibility perspective, and makes it easier to keep track of what environment belongs to what project. If you don’t specify -p the environment will be installed in the envs/ directory inside your Conda installation path.
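For example, to keep the environment inside the project directory you could do something along these lines (the paths are just an illustration):

conda env create -p ./envs/project_mrsa -f environment.yml
conda activate ./envs/project_mrsa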

+
+
+
    +
  • Activate the environment!

  • +
  • Now we can run the code for the MRSA project found in code/run_qc.sh, either by running bash code/run_qc.sh or by opening the run_qc.sh file and executing each line in the terminal one by one. Do this!

  • +
+

This should download the project FASTQ files and run FastQC on them (as mentioned above).

+
    +
  • Check your directory contents (ls -Rlh, or in your file browser). It should now have the following structure:
  • +
+
   conda/
+    |
+    |- code/
+    |   |- run_qc.sh
+    |
+    |- data/
+    |   |- SRR935090.fastq.gz
+    |   |- SRR935091.fastq.gz
+    |   |- SRR935092.fastq.gz
+    |
+    |- results/
+    |   |- fastqc/
+    |       |- SRR935090_fastqc.html
+    |       |- SRR935090_fastqc.zip
+    |       |- SRR935091_fastqc.html
+    |       |- SRR935091_fastqc.zip
+    |       |- SRR935092_fastqc.html
+    |       |- SRR935092_fastqc.zip
+    |
+    |- environment.yml
+

Note that all that was needed to carry out the analysis and generate these files and results was environment.yml (that we used to create a Conda environment with the required packages) and the analysis code in code/run_qc.sh.
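In other words, on a fresh copy of the project the whole analysis can, in principle, be reproduced with just:

conda env create -n project_mrsa -f environment.yml
conda activate project_mrsa
bash code/run_qc.sh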

+
+
+

3.2 Keeping track of dependencies

+

Projects can often be quite large and require lots of dependencies; it can feel daunting to try to capture all of that in a single Conda environment, especially when you consider potential incompatibilities that may arise. It can therefore be a good idea to start new projects with an environment file with each package you know that you will need to use, but without specifying exact versions (except for those packages where you know you need a specific version). This will install the latest compatible versions of all the specified software, making the start-up and installation part of new projects easier. You can then add the versions that were installed to your environment file afterwards, ensuring future reproducibility.

+

There is one command that can make this easier: conda env export. This allows you to export a list of the packages you’ve already installed, including their specific versions, meaning you can easily add them after the fact to your environment file. If you use the --no-builds flag, you’ll get a list of the packages minus their OS-specific build specifications, which is more useful for making the environment portable across systems. This way, you can start with an environment file with just the packages you need (without versions), which will install the most up-to-date versions possible, and then add the resulting versions back into the environment file using the export command!
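A possible workflow for this, sketched with made-up file names:

# start from a loose environment file and create the environment
conda env create -n project_a -f environment.yml
conda activate project_a
# export the resolved versions (without OS-specific build strings)
conda env export --no-builds > environment-resolved.yml
# copy the relevant pinned versions back into environment.yml by hand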

+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
    +
  • How to define our Conda environment using a YAML-file.
  • +
  • How to use conda env create to make a new environment from a YAML-file.
  • +
  • How to use conda env export to get a list of installed packages.
  • +
  • How to work in a project-like setting.
  • +
+
+
+
+
+
+

4 Extra material

+

The following extra material contains some more advanced things you can do with Conda and the command line in general, which is not part of the main course materials. All the essential skills are covered by the previous sections: the material here should be considered tips and tricks from people who use Conda as part of their daily work. You thus don’t need to use these things unless you want to, and you can even skip this part of the lesson if you like!

+
+

4.1 Configuration

+

The behaviour of your Conda installation can be changed using an optional configuration file .condarc. On a fresh Conda install no such file is included but it’s created in your home directory as ~/.condarc the first time you run conda config.

+

You can edit the .condarc file either using a text editor or by way of the conda config command. To list all config parameters and their settings run:

+
conda config --show
+

Similar to Conda environment files, the configuration file is in YAML syntax. This means that the config file is structured in the form of key:value pairs where the key is the name of the config parameter (e.g. auto_update_conda) and the value is the parameter setting (e.g. True).

+

Adding the name of a config parameter to conda config --show will show only that parameter, e.g. conda config --show channels.

+

You can change parameters with the --set, --add, --append and --remove flags to conda config.

+

If you for example want to enable the ‘Always yes’ behaviour which makes Conda automatically choose the yes option, such as when installing, you can run:

+
conda config --set always_yes True
+

To see details about a config parameter you can run conda config --describe parameter. Try running it on the channels parameter:

+
conda config --describe channels
+

In the beginning of this tutorial we added Conda channels to the .condarc file using conda config --add channels. To remove one of the channels from the configuration file you can run:

+
conda config --remove channels conda-forge
+

Check your .condarc file to see the change. To add the conda-forge channel back to the top of the channels simply run:

+
conda config --add channels conda-forge
+

To completely remove a parameter and all its values run:

+
conda config --remove-key parameter
+

For a list of Conda configuration parameters see the Conda configuration page.

+
+
+

4.2 Managing Python versions

+

With Conda environments it’s possible to keep several different versions of Python on your computer at the same time, and switching between these versions is very easy. However, a single Conda environment can only contain one version of Python.

+
+

4.2.1 Your current Python installation

+

The base environment has its own version of Python installed. When you open a terminal (after having installed Conda on your system) this base environment is activated by default (as evidenced by (base) prepended to your prompt). You can check what Python version is installed in this environment by running python --version. To see the exact path to the Python executable type which python.

+

In addition to this your computer may already have Python installed in a separate (system-wide) location outside of the Conda installation. To see if that is the case type conda deactivate until your prompt is not prepended with a Conda environment name. Then type which python. If a path was printed to the terminal (e.g. /usr/bin/python) that means some Python version is already installed in that location. Check what version it is by typing python --version.

+

Now activate the base environment again by typing conda activate (or the equivalent conda activate base) then check the Python installation path and version using which and python --version as above. See the difference? When you activate an environment your $PATH variable is updated so that when you call python (or any other program) the system first searches the directory of the currently active environment.
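The comparison described above boils down to something like this:

conda deactivate             # repeat until no environment name is shown in the prompt
which python && python --version
conda activate base
which python && python --version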

+
+
+

4.2.2 Different Python versions

+

When you create a new Conda environment you can choose to install a specific version of Python in that environment as well. As an example, create an environment containing Python version 3.5 by running:

+
conda create -n py35 python=3.5
+

Here we name the environment py35 but you can choose whatever name you want.

+

To activate the environment run:

+
conda activate py35
+

You now have a completely separate environment with its own Python version.

+

Let’s say you instead want an environment with Python version 2.7 installed. You may for instance want to run scripts or packages that were written for Python 2.x and are thus incompatible with Python 3.x. Simply create the new Conda environment with:

+
conda create -n py27 python=2.7
+

Activate this environment with:

+
conda activate py27
+

Now, switching between Python versions is as easy as typing conda activate py35 / conda activate py27.

+
+
+
+ +
+
+Note +
+
+
+

If you create an environment where none of the packages require Python, and you don’t explicitly install the python package then that new environment will use the Python version installed in your base environment.

+
+
+
+
+
+

4.3 Decorating your prompt

+

By default, the name of the currently activated environment is added to your command line prompt. This is a good thing, as it makes it easier to keep track of what environment and packages you have access to. The default implementation becomes an issue when using absolute paths for environments (i.e. specifying conda env create -p path/to/environment), though, as the entire path will be added to the prompt. This can take up a lot of unnecessary space on your screen, but can be solved in a number of ways.

+

The most straightforward way to solve this is to change the Conda configuration file, specifically the settings of the env_prompt configuration value which determines how Conda modifies your command line prompt. For more information about this setting you can run conda config --describe env_prompt and to see your current setting you can run conda config --show env_prompt.

+

By default env_prompt is set to ({default_env}) which modifies your prompt with the active environment name if it was installed using the -n flag or if the environment folder has a parent folder named envs/. Otherwise the full environment path (i.e. the ‘prefix’) is displayed.

+

If you instead set env_prompt to ({name}) Conda will modify your prompt with the folder name of the active environment. You can change the setting by running conda config --set env_prompt '({name}) '

+

If you wish to keep the ({default_env}) behaviour, or just don’t want to change your Conda config, an alternative is to keep Conda environment folders within a parent folder called envs/. This will make Conda only add the folder name of the Conda environment to your prompt when you activate it.

+

As an example, say you have a project called project_a with the project path ~/myprojects/project_a. You could then install the environment for project_a into a folder ~/myprojects/project_a/envs/project_a_environment. Activating the environment by pointing Conda to it (e.g. conda activate ~/myprojects/project_a/envs/project_a_environment) will only cause your prompt to be modified with project_a_environment.
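Using the example above, the commands could look like this (an environment.yml file is assumed to exist in the project):

conda env create -p ~/myprojects/project_a/envs/project_a_environment -f environment.yml
conda activate ~/myprojects/project_a/envs/project_a_environment
# the prompt now shows only (project_a_environment)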

+
+
+

4.4 Bash aliases for conda

+

Some programmers like to have aliases (i.e. shortcuts) for common commands. Two aliases that might be useful for you are alias coac='conda activate' and alias code='conda deactivate'. Don’t forget to add them to your ~/.bash_profile if you want to use them!
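Adding them could be as simple as this (assuming you use ~/.bash_profile):

echo "alias coac='conda activate'" >> ~/.bash_profile
echo "alias code='conda deactivate'" >> ~/.bash_profile
source ~/.bash_profile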

+
+
+

4.5 Rolling back to an earlier version of the environment

+

The history of the changes to an environment is automatically tracked. You can see revisions to an environment by using:

+
conda list --revisions
+

This shows each revision (numbered) and what was installed in it.

+

You can revert to a particular revision using:

+
conda install --revision 5
+
+
+

4.6 Mamba, the drop-in Conda replacement

+

There is another piece of software that is built on top of Conda as a drop-in replacement for it: Mamba. The reason for Mamba’s existence is that it used to have a better solver algorithm for the dependency tree than Conda did. These days, however, this algorithm is included in Conda as the default. There are still a few minor reasons you might want to use Mamba, however. The first is that Mamba re-implements Conda in C++, which runs slightly faster than the Python-based Conda. This only yields a minor speed increase compared to the dependency-tree algorithm, though, so don’t expect major differences in execution time between Conda and Mamba. Another reason is that Mamba colours its output, which is nice if you care about that sort of thing. If you installed Conda as described in the pre-course material you’ll, conveniently, already have installed Mamba as well!
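Since it is a drop-in replacement, using Mamba is mostly a matter of swapping the command name, e.g.:

mamba create -n project_a -c bioconda fastqc
mamba env create -n project_mrsa -f environment.yml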

+ + +
+
+ +
+ +
\ No newline at end of file
diff --git a/pages/containers.html b/pages/containers.html
new file mode 100644
index 00000000..38e81ac5
--- /dev/null
+++ b/pages/containers.html
@@ -0,0 +1,1230 @@
+
+ +
+ +
+
+
+

Controlling your environment with Containers

+

How to keep track of both your environment and operating system

+
+
+ + +
+ + +
+
Published
+
+

15-Oct-2024

+
+
+ + +
+ + +
+ + + + +
+ + + + + + +
+

1 Introduction

+

Container-based technologies are designed to make it easier to create, deploy, and run applications by isolating them in self-contained software units (hence their name). The idea is to package software and/or code together with everything it needs (other packages it depends on, various environment settings, etc.) into one unit, i.e. a container. This way we can ensure that the software or code functions in exactly the same way regardless of where it’s executed. Containers are in many ways similar to virtual machines but more lightweight. Rather than starting up a whole new operating system, containers can use the same kernel (usually Linux) as the system that they’re running on. This makes them much faster and smaller compared to virtual machines. While this might sound a bit technical, actually using containers is quite smooth and very powerful.

+

Containers have also proven to be a very good solution for packaging, running and distributing scientific data analyses. Some applications of containers relevant for reproducible research are:

+
    +
  • When publishing, package your analyses in a container image and let it accompany the article. This way interested readers can reproduce your analysis at the push of a button.
  • +
  • Packaging your analysis in a container enables you to develop on e.g. your laptop and seamlessly move to cluster or cloud to run the actual analysis.
  • +
  • Say that you are collaborating on a project and you are using Mac while your collaborator is using Windows. You can then set up a container image specific for your project to ensure that you are working in an identical environment.
  • +
+

One of the largest and most widely used container-based technologies is Docker. Just as with Git, Docker was designed for software development but is rapidly becoming widely used in scientific research. Another container-based technology is Apptainer (and the related Singularity), which was developed to work well in computer cluster environments such as Uppmax. We will cover both Docker and Apptainer in this course, but the focus will be on the former (since that is the most widely used and runs on all three operating systems).

+

This tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to install Docker if you haven’t done so already, then open up a terminal and go to workshop-reproducible-research/tutorials/containers.

+
+
+
+ +
+
+Dockage and storage +
+
+
+

Docker images tend to take up quite a lot of space. In order to do all the exercises in this tutorial you need to have ~10 GB available.

+
+
+
+
+

2 The basics

+

We’re almost ready to start, just one last note on nomenclature. You might have noticed that we sometimes refer to “Docker images” and sometimes to “Docker containers”. We use images to start containers, so a container is simply a running instance of an image. You can have an image containing, say, a certain Linux distribution, and then start multiple containers running that same OS.
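To make the distinction concrete, the two commands below would start two separate, independent containers from the very same ubuntu image (a trivial example; the image is pulled in the next section, but Docker will also fetch it on the fly):

docker run --rm ubuntu hostname
docker run --rm ubuntu hostname   # a second container, with a different hostname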

+
+
+
+ +
+
+Root privileges are required +
+
+
+

If you don’t have root privileges you have to prepend all Docker commands with sudo.

+
+
+
+

2.1 Downloading images

+

Docker containers typically run Linux, so let’s start by downloading an image containing Ubuntu (a popular Linux distribution that is based on only open-source tools) through the command line.

+
docker pull ubuntu:latest
+

You will notice that it downloads different layers with weird hashes as names. This represents a very fundamental property of Docker images that we’ll get back to in just a little while. The process should end with something along the lines of:

+
Status: Downloaded newer image for ubuntu:latest
+docker.io/library/ubuntu:latest
+

Let’s take a look at our new and growing collection of Docker images:

+
docker image ls
+

The Ubuntu image should show up in this list, with something looking like this:

+
REPOSITORY       TAG              IMAGE ID            CREATED             SIZE
+ubuntu           latest           d70eaf7277ea        3 weeks ago         72.9MB
+
+
+

2.2 Running containers

+

We can now start a container from the image we just downloaded. We can refer to the image either by “REPOSITORY:TAG” (“latest” is the default so we can omit it) or “IMAGE ID”. The syntax for docker run is docker run [OPTIONS] IMAGE [COMMAND] [ARG...]. To see the available options run docker run --help. The COMMAND part is any command that you want to run inside the container, it can be a script that you have written yourself, a command line tool or a complete workflow. The ARG part is where you put optional arguments that the command will use.

+

Let’s run uname -a to get some info about the operating system. In this case, uname is the COMMAND and -a the ARG. This command will display some general info about your system, and the -a argument tells uname to display all possible information.

+

First run it on your own system (use systeminfo if you are on Windows):

+
uname -a
+

This should print something like this to your command line:

+
Darwin liv433l.lan 15.6.0 Darwin Kernel Version 15.6.0: Mon Oct  2 22:20:08 PDT 2017; root:xnu-3248.71.4~1/RELEASE_X86_64 x86_64
+

Seems like I’m running the Darwin version of MacOS. Then run it in the Ubuntu Docker container:

+
docker run ubuntu uname -a
+

Here I get the following result:

+
Linux 24d063b5d877 5.4.39-linuxkit #1 SMP Fri May 8 23:03:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
+

And now I’m running on Linux! What happens is that we use the downloaded ubuntu image to run a container that has Ubuntu as the operating system, and we instruct Docker to execute uname -a to print the system info within that container. The output from the command is printed to the terminal.

+

Try the same thing with whoami instead of uname -a.

+
+
+

2.3 Running interactively

+

So, seems we can execute arbitrary commands on Linux. This looks useful, but maybe a bit limited. We can also get an interactive terminal with the flags -it.

+
docker run -it ubuntu
+

Your prompt should now look similar to:

+
root@1f339e929fa9:/#
+

You are now using a terminal inside a container running Ubuntu. Here you can do whatever; install, run, remove stuff. Anything you do will be isolated within the container and never affect your host system.

+

Now exit the container with exit.

+
+
+

2.4 Containers inside scripts

+

Okay, so Docker lets us work in any OS in a quite convenient way. That would probably be useful on its own, but Docker is much more powerful than that. For example, let’s look at the shell part of the index_genome rule in the Snakemake workflow for the MRSA case study:

+
shell:
+    """
+    bowtie2-build tempfile results/bowtie2/{wildcards.genome_id} > {log}
+    """
+

You may have seen that one can use containers through both Snakemake and Nextflow if you’ve gone through their tutorial’s extra material, but we can also use containers directly inside scripts in a very simple way. Let’s imagine we want to run the above command using containers instead. How would that look? It’s quite simple, really: first we find a container image that has bowtie2 installed, and then prepend the command with docker run <image>.

+

First of all we need to download the genome to index though, so run:

+
curl -o NCTC8325.fa.gz ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz
+gunzip -c NCTC8325.fa.gz > tempfile
+

This downloads and prepares the input for Bowtie2.

+

Now try running the following Bash code:

+
docker run -v $(pwd):/analysis quay.io/biocontainers/bowtie2:2.5.1--py39h3321a2d_0 bowtie2-build /analysis/tempfile /analysis/NCTC8325
+

Docker will automatically download the container image for Bowtie2 version 2.5.1 from the remote repository https://quay.io/repository/biocontainers/bowtie2 and subsequently run the command! This is the docker run [OPTIONS] IMAGE [COMMAND] [ARG...] syntax just like before. In this case quay.io/biocontainers/bowtie2:2.5.1--py39h3321a2d_0 is the IMAGE but instead of first downloading and then running it we point to its remote location directly, which will cause Docker to download it on the fly. The bowtie2-build part is the COMMAND followed by the ARG (the input tempfile and the output index)

+

The -v $(pwd):/analysis part is the OPTIONS which we use to mount the current directory inside the container in order to make the tempfile input available to Bowtie2. More on these so-called “Bind mounts” in Section 4 of this tutorial.

+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
    +
  • How to use docker pull for downloading remotely stored images
  • +
  • How to use docker image ls for getting information about the images we have on our system.
  • +
  • How to use docker run for starting a container from an image.
  • +
  • How to use the -it flag for running in interactive mode.
  • +
  • How to use Docker inside scripts.
  • +
+
+
+
+
+
+

3 Building images

+

In the previous section we downloaded a Docker image of Ubuntu and noticed that it was based on layers, each with a unique hash as id. An image in Docker is based on a number of read-only layers, where each layer contains the differences to the previous layers. If you’ve done the Git tutorial this might remind you of how a Git commit contains the difference to the previous commit. The great thing about this is that we can start from one base layer, say containing an operating system and some utility programs, and then generate many new images based on this, say 10 different project-specific images. This dramatically reduces the storage space requirements. For example, Bioconda (see the Conda tutorial) has one base image and then one individual layer for each of the more than 3000 packages available in Bioconda.
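You can inspect the layers of any image you have locally with docker history, e.g. for the Ubuntu image we pulled earlier:

docker history ubuntu:latest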

+

Docker provides a convenient way to describe how to go from a base image to the image we want by using a “Dockerfile”. This is a simple text file containing the instructions for how to generate each layer. Docker images are typically quite large, often several GBs, while Dockerfiles are small and serve as blueprints for the images. It is therefore good practice to have your Dockerfile in your project Git repository, since it allows other users to exactly replicate your project environment.

+

We will be looking at a Dockerfile called Dockerfile_slim that is located in your containers directory (where you should hopefully be standing already). We will now go through that file and discuss the different steps and what they do. After that we’ll build the image and test it out. Lastly, we’ll start from that image and make a new one to reproduce the results from the Conda tutorial.

+
+

3.1 Understanding Dockerfiles

+

Here are the first few lines of Dockerfile_slim. Each line in the Dockerfile will typically result in one layer in the resulting image. The format for Dockerfiles is INSTRUCTION arguments. A full specification of the format, together with best practices, can be found here.

+
FROM condaforge/miniforge3
+
+LABEL description = "Minimal image for the NBIS reproducible research course."
+MAINTAINER "John Sundh" john.sundh@scilifelab.se
+

Here we use the instructions FROM, LABEL and MAINTAINER. While LABEL and MAINTAINER are just metadata that can be used for organizing your various Docker components, the important one is FROM, which specifies the base image we want to start from. Because we want to use conda to install packages we will start from an image from the conda-forge community that has conda pre-installed. This image was in turn built using a Dockerfile as a blueprint and then uploaded to Dockerhub. The conda-forge community keeps the Dockerfile in a git repository and you can view the file here. You will see that it starts from an official Ubuntu image (check the first line with the FROM instruction), followed by code to install various packages including conda.

+
+
+
+ +
+
+There are many roads to Rome +
+
+
+

When it comes to choosing the best image to start from there are multiple routes you could take. Say you want to run RStudio in a Conda environment through a Jupyter notebook. You could then start from one of the rocker images for R, a Condaforge image, or a Jupyter image. Or you just start from one of the low-level official images and set up everything from scratch.

+
+
+

Let’s take a look at the next section of Dockerfile_slim.

+
# Use bash as shell
+SHELL ["/bin/bash", "--login", "-c"]
+
+# Set workdir
+WORKDIR /course
+
+# Set time zone
+ENV TZ="Europe/Stockholm"
+ENV DEBIAN_FRONTEND=noninteractive
+

SHELL simply sets which shell to use and WORKDIR determines the directory the container should start in. The ENV instruction is used to set environmental variables and here we use it to set the time zone by declaring a TZ variable. The DEBIAN_FRONTEND=noninteractive line means that we force the subsequent installation to not prompt us to set the time zone manually.

+

The next few lines introduce the important RUN instruction, which is used for executing shell commands:

+
# Install package for setting time zone
+RUN apt-get update && apt-get install -y tzdata && apt-get clean
+
+# Configure Conda
+RUN conda init bash && conda config --set channel_priority strict && \
+    conda config --append channels bioconda && \
+    conda config --append channels r && \
+    conda config --set subdir linux-64
+

The first RUN command installs the tzdata package for managing local time settings in the container. This may not always be required for your Dockerfile but it’s added here because some R packages used in the course require it.

+
+
+
+ +
+
+Note +
+
+
+

While installing things with apt-get inside Dockerfiles is relatively common practice, it’s important to note that this may affect reproducibility, since it’s not common to specify an exact version. The packages installed in this manner are, however, usually not important for the actual analyses performed, but rather help in the building of the container image itself. While not critical, it’s important to note this from a reproducibility perspective.

+
+
+

Next, we run conda init bash to initialize the bash shell inside the image, meaning we can use conda activate in containers that run from the image. In the same RUN statement we also configure the strict channel priority and add appropriate channels with conda config. You’ll probably recognize this from the pre-course-setup. The last part sets the somewhat obscure subdir config parameter pointing to the linux-64 architecture of conda channels.

+

As a general rule, you want each layer in an image to be a “logical unit”. For example, if you want to install a program the RUN command should both retrieve the program, install it and perform any necessary clean up. This is due to how layers work and how Docker decides what needs to be rerun between builds. More on this later.

+

Next up is:

+
# Open port for running Jupyter Notebook
+EXPOSE 8888
+
+# Start Bash shell by default
+CMD /bin/bash
+

EXPOSE opens up the port 8888, so that we can later run a Jupyter Notebook server on that port. CMD is an interesting instruction. It sets what a container should run when nothing else is specified, i.e. if you run docker run [OPTIONS] [IMAGE] without the additional [COMMAND] [ARG]. It can be used for example for printing some information on how to use the image or, as here, start a Bash shell for the user. If the purpose of your image is to accompany a publication then CMD could be to run the workflow that generates the paper figures from raw data, e.g. CMD snakemake -s Snakefile -c 1 generate_figures.

+
+
+

3.2 Building from Dockerfiles

+

Now we understand how a Dockerfile works. Constructing the image itself from the Dockerfile can be done as follows - try it out:

+
+
+
+ +
+
+Image platforms on newer Macs +
+
+
+

If your computer is a Mac with an Apple silicon chip (M1 or later), you may have to add --platform linux/x86_64 to the docker build command.

+
+
+
docker build -f Dockerfile_slim -t my_docker_image .
+

This should result in something similar to this:

+
 [+] Building 2.2s (7/7) FINISHED
+ => [internal] load build definition from Dockerfile_slim                                                                                                                                             0.0s
+ => => transferring dockerfile: 667B                                                                                                                                                                  0.0s
+ => [internal] load .dockerignore                                                                                                                                                                     0.0s
+ => => transferring context: 2B                                                                                                                                                                       0.0s
+ => [internal] load metadata for docker.io/condaforge/miniforge3:latest                                                                                                                               0.0s
+ => [1/3] FROM docker.io/condaforge/miniforge3                                                                                                                                                        0.0s
+ => CACHED [2/3] WORKDIR /course                                                                                                                                                                      0.0s
+ => [3/3] RUN conda init bash && conda config --set channel_priority strict &&     conda config --append channels bioconda &&     conda config --append channels r &&     conda config --set subdir   2.1s
+ => exporting to image                                                                                                                                                                                0.0s
+ => => exporting layers                                                                                                                                                                               0.0s
+ => => writing image sha256:53e6efeaa063eadf44c509c770d887af5e222151f08312e741aecc687e6e8981                                                                                                          0.0s
+ => => naming to docker.io/library/my_docker_image
+

Exactly how the output looks depends on which version of Docker you are using. The -f flag sets which Dockerfile to use and -t tags the image with a name. This name is how you will refer to the image later. Lastly, the . is the path to where the image should be built (. means the current directory). This has no real impact in this case, but matters if you want to import files. Validate with docker image ls that you can see your new image.

+
+
+

3.3 Creating your own Dockerfile

+

Now it’s time to make your own Dockerfile to reproduce the results from the Conda tutorial. If you haven’t done the tutorial, it boils down to creating a Conda environment file, setting up that environment, downloading three RNA-seq data files, and running FastQC on those files. We will later package and run the whole RNA-seq workflow in a Docker container, but for now we keep it simple to reduce the size and time required.

+

The Conda tutorial uses a shell script, run_qc.sh, for downloading and running the analysis. A copy of this file should also be available in your current directory. If we want to use the same script we need to include it in the image. A basic outline of what we need to do is:

+
    +
  1. Create a file called Dockerfile_conda.
  2. Start the image from the my_docker_image we just built.
  3. Install the package fastqc which is required for the analysis.
  4. Add the run_qc.sh script to the image.
  5. Set the default command of the image to run the run_qc.sh script.
+

We’ll now go through these steps in more detail. Try to add the corresponding code to Dockerfile_conda on your own, and if you get stuck you can click to reveal the solution below under “Click to show solution”.

+

Set image starting point

+

To set the starting point of the new image, use the FROM instruction and point to my_docker_image that we built in the previous Building from Dockerfiles step.

+

Install packages

+

Use the RUN instruction to install the package fastqc=0.11.9 with conda. Here there are several options available. For instance we could add an environment file, e.g. environment.yml from the Conda tutorial, and use conda env create to create an environment from that file. Or we could create an environment directly with conda create. We’ll try the latter option here, so add a line that will create an environment named project_mrsa containing the fastqc package, and also clean up packages and cache after installation. Use the -y flag to conda create to avoid the prompt that expects an interaction from the user.

+

In order to have the project_mrsa environment activated upon start-up we need to add two more lines to the Dockerfile. First we need to use a RUN instruction to run echo "source activate project_mrsa" >> ~/.bashrc, and then we need to use the ENV instruction to set the $PATH variable inside the image to /opt/conda/envs/project_mrsa/bin:$PATH.

+

Add the analysis script

+

Use the COPY instruction to add run_qc.sh to the image. The syntax is COPY SOURCE TARGET. In this case SOURCE is the run_qc.sh script and TARGET is a path inside the image; for simplicity it can be specified with ./.

+

Set default command

+

Use the CMD instruction to set the default command for the image to bash run_qc.sh.

+
+ +
+
+
FROM my_docker_image
+
+RUN conda create -y -n project_mrsa -c bioconda fastqc=0.11.9 && conda clean -a
+
+RUN echo "source activate project_mrsa" >> ~/.bashrc
+
+ENV PATH=/opt/conda/envs/project_mrsa/bin:$PATH
+
+COPY run_qc.sh .
+
+CMD bash run_qc.sh
+
+
+
+

Build the image and tag it my_docker_conda (remember to add --platform linux/x86_64 to the build command if you are using a Mac with the Apple chip).

+
docker build -t my_docker_conda -f Dockerfile_conda .
+

Verify that the image was built using docker image ls.

+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
    +
  • How the keywords FROM, LABEL, MAINTAINER, RUN, ENV, SHELL, WORKDIR, and CMD can be used when writing a Dockerfile.
  • +
  • How to use docker build to construct and tag an image from a Dockerfile.
  • +
  • How to create your own Dockerfile.
  • +
+
+
+
+
+
+

4 Managing containers

+

When you start a container with docker run it is given a unique id that you can use for interacting with the container. Let’s try to run a container from the image we just created:

+
docker run my_docker_conda
+

If everything worked run_qc.sh is executed and will first download and then analyse the three samples. Once it’s finished you can list all containers, including those that have exited.

+
docker container ls --all
+

This should show information about the container that we just ran. Similar to:

+
CONTAINER ID   IMAGE            COMMAND                  CREATED         STATUS          PORTS      NAMES
+b6f7790462c4   my_docker_conda   "tini -- /bin/bash -…"  3 minutes ago   Up 24 seconds   8888/tcp   sad_maxwell
+

If we run docker run without any flags, your local terminal is attached to the container. This enables you to see the output of run_qc.sh, but also prevents you from doing anything else in the meantime. We can start a container in detached mode with the -d flag. Try this out and run docker container ls to validate that the container is running.
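Trying out detached mode could look like this:

docker run -d my_docker_conda
docker container ls    # the container should show up as running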

+

By default, Docker keeps containers after they have exited. This can be convenient for debugging or if you want to look at logs, but it also consumes huge amounts of disk space. It’s therefore a good idea to always run with --rm, which will remove the container once it has exited.
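If you already have a pile of exited containers lying around, they can be cleaned up in one go (this removes all stopped containers, so use with care):

docker container prune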

+

If we want to enter a running container, there are two related commands we can use, docker attach and docker exec. docker attach will attach local standard input, output, and error streams to a running container. This can be useful if your terminal closed down for some reason or if you started a terminal in detached mode and changed your mind. docker exec can be used to execute any command in a running container. It’s typically used to peek at what is happening by opening up a new shell. Here we start the container in detached mode and then start a new interactive shell so that we can see what happens. If you use ls inside the container you can see how the script generates files in the data and results directories. Note that you will be thrown out when the container exits, so you have to be quick.

+
docker run -d --rm --name my_container my_docker_conda
+docker exec -it my_container /bin/bash
+
+

4.1 Bind mounts

+

There are obviously some advantages to isolating and running your data analysis in containers, but at some point you need to be able to interact with the rest of the host system (e.g. your laptop) to actually deliver the results. This is done via bind mounts. When you use a bind mount, a file or directory on the host machine is mounted into a container. That way, when the container generates a file in such a directory it will appear in the mounted directory on your host system.

+
+
+
+ +
+
+Tip +
+
+
+

Docker also has a more advanced way of data storage called volumes. Volumes provide added flexibility and are independent of the host machine’s file system having a specific directory structure available. They are particularly useful when you want to share data between containers.
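A minimal sketch of using a named volume instead of a bind mount (the volume name is arbitrary):

docker volume create course_data
docker run --rm -v course_data:/course/data my_docker_conda
docker volume ls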

+
+
+

Say that we are interested in getting the resulting HTML reports from FastQC out of the container. We can do this by mounting a directory called, say, fastqc_results in your current directory to the /course/results/fastqc directory in the container. Try this out by running:

+
docker run --rm -v $(pwd)/fastqc_results:/course/results/fastqc my_docker_conda
+

Here the -v flag to docker run specifies the bind mount in the form of directory/on/your/computer:/directory/inside/container. $(pwd) simply evaluates to the working directory on your computer.

+

Once the container finishes validate that it worked by opening one of the html reports under fastqc_results/.

+

We can also use bind mounts for getting files into the container rather than out. We’ve mainly been discussing Docker in the context of packaging an analysis pipeline to allow someone else to reproduce its outcome. Another application is as a kind of very powerful environment manager, similarly to how we’ve used Conda before. If you’ve organized your work into projects, then you can mount the whole project directory in a container and use the container as the terminal for running stuff while still using your normal OS for editing files and so on. Let’s try this out by mounting our current directory and start an interactive terminal. Note that this will override the CMD command, so we won’t start the analysis automatically when we start the container.

+
docker run -it --rm -v $(pwd):/course/ my_docker_conda /bin/bash
+

If you run ls you will see that all the files in the containers/ directory are there.

+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
    +
  • How to use docker run for starting a container and how the flags -d and --rm work.
  • +
  • How to use docker container ls for displaying information about the containers.
  • +
  • How to use docker attach and docker exec to interact with running containers.
  • +
  • How to use bind mounts to share data between the container and the host system.
  • +
+
+
+
+
+
+

5 Sharing images

+

There would be little point in going through all the trouble of making your analyses reproducible if you can’t distribute them to others. Luckily, sharing Docker containers is extremely easy, and can be done in several ways. One of the more common ways to share Docker images is through container registries and repositories.

+

For example, a Docker registry is a service that stores Docker images, which could be hosted by a third party, publicly or privately. One of the most common registries is Docker Hub, which is a registry hosted by Docker itself. A repository, on the other hand, is a collection of container images with the same name but different tags (i.e. versions), for example ubuntu:latest or ubuntu:20.04. Repositories are stored in registries.
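Tags are what you use to pick a specific image out of a repository, e.g.:

docker pull ubuntu:20.04    # a specific version
docker pull ubuntu:latest   # whatever 'latest' currently points to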

+
+
+
+ +
+
+Note +
+
+
+

Remember that we now have some clashing nomenclature between Git repositories (which we covered in the Git tutorial) and container repositories, so be aware of which one you’re talking about!

+
+
+

There are many registries out there, but here are some that might be of interest to you who are taking this course:

+ +

The most common registry is probably Docker Hub, which lets you host unlimited public images and one private image for free (beyond that they charge a small fee). The GitHub Container Registry is also quite handy if you’re already using GitHub. Let’s see how it’s done using Docker Hub!

+
    +
  1. Register for an account on Docker Hub.
  2. Use docker login -u your_dockerhub_id to login to the Docker Hub registry. Or use the Sign in button in Docker Desktop.
  3. When you build an image, tag it with -t your_dockerhub_id/image_name, rather than just image_name.
  4. Once the image has been built, upload it to Docker Hub with docker push your_dockerhub_id/image_name.
  5. If another user runs docker run your_dockerhub_id/image_name the image will automatically be retrieved from Docker Hub. You can use docker pull for downloading without running.
+

If you want to refer to a Docker image in for example a publication, it’s very important that it’s the correct version of the image. This is handled via the ‘tags’ (e.g. docker build -t your_dockerhub_id/image_name:tag_name) that we introduced in Containers 2: The basics and used when building images in Containers 3: Building images.
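Putting the steps above together, sharing the Conda image from before might look something like this (your_dockerhub_id is a placeholder):

docker login -u your_dockerhub_id
docker build -t your_dockerhub_id/my_docker_conda:1.0 -f Dockerfile_conda .
docker push your_dockerhub_id/my_docker_conda:1.0
# anyone can now run it with:
docker run your_dockerhub_id/my_docker_conda:1.0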

+
+
+
+ +
+
+Tip +
+
+
+

On Docker Hub it is also possible to link to your Bitbucket or GitHub account and select repositories from which you want to automatically build and distribute Docker images. The Docker Hub servers will then build an image from the Dockerfile in your Git repository and make it available for download using docker pull. That way, you don’t have to bother manually building and pushing using docker push. The GitHub repository for this course is linked to Docker Hub and the Docker images are built automatically from Dockerfile and Dockerfile_slim, triggered by changes made to the GitHub repository. You can take a look at the course on Docker Hub here.

+
+
+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
    +
  • How container registries and repositories work
  • +
  • How to use Docker Hub to share Docker images
  • +
+
+
+
+
+

6 Packaging the case study

+

During these tutorials we have been working on a case study about the multi-resistant bacteria MRSA. Here we will build and run a Docker container that contains all the work we’ve done so far.

+
    +
  • We’ve set up a GitHub repository for version control and for hosting our project.
  • +
  • We’ve defined a Conda environment that specifies the packages we’re depending on in the project.
  • +
  • We’ve constructed a Snakemake workflow that performs the data analysis and keeps track of files and parameters.
  • +
  • We’ve written a Quarto document that takes the results from the Snakemake workflow and summarizes them in a report.
  • +
+

The workshop-reproducible-research/tutorials/containers directory contains the final versions of all the files we’ve generated in the other tutorials: environment.yml, Snakefile, config.yml and code/supplementary_material.qmd. The only difference compared to the other tutorials is that we have also included the rendering of the Supplementary Material HTML file into the Snakemake workflow as the rule make_supplementary. Running all of these steps will take some time to execute (around 20 minutes or so), in particular if you’re on a slow internet connection.

+

Now take a look at Dockerfile. Everything should look quite familiar to you, since it’s basically the same steps as in the image we constructed in the Building images section, although with some small modifications. The main difference is that we add the project files needed for executing the workflow (mentioned in the previous paragraph), and install the conda packages using environment.yml. If you look at the CMD command you can see that it will run the whole Snakemake workflow by default.
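To give a rough idea of the structure being described, a minimal sketch of such a Dockerfile could look something like the following (this is not the actual course Dockerfile; the base image, paths and cleanup step are assumptions, and only the default command is taken from what’s described later in this tutorial):

FROM condaforge/miniforge3

# Add the project files needed to execute the workflow
WORKDIR /course
COPY environment.yml Snakefile config.yml ./
COPY code/ ./code/

# Install the Conda packages listed in environment.yml
RUN conda env update -n base -f environment.yml && conda clean --all --yes

# Run the whole Snakemake workflow by default
CMD ["snakemake", "-rp", "-c", "1", "--configfile", "config.yml"]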

+

Now run docker build as before, tag the image with my_docker_project (remember the --platform linux/x86_64 flag if you’re on a new Mac with the Apple chip):

+
docker build -t my_docker_project -f Dockerfile .
+

Go get a coffee while the image builds (or you could use docker pull nbisweden/workshop-reproducible-research which will download the same image).

+

Validate with docker image ls. Now all that remains is to run the whole thing with docker run. We just want to get the results, so mount the directory /course/results/ to, say, results/ in your current directory. Click below to see how to write the command.

+
+ +
+
+

If building your own image:

+
docker run -v $(pwd)/results:/course/results my_docker_project
+

If you pulled the image from DockerHub:

+
docker run -v $(pwd)/results:/course/results nbisweden/workshop-reproducible-research
+
+
+
+

Well done! You now have an image that allows anyone to exactly reproduce your analysis workflow (if you first docker push to Dockerhub that is).

+
+
+
+ +
+
+Tip +
+
+
+

If you’ve done the Jupyter tutorial, you know that Jupyter Notebook runs as a web server. This makes it very well suited for running in a Docker container, since we can just expose the port Jupyter Notebook uses and redirect it to one of our own. You can then work with the notebooks in your browser just as you’ve done before, while it’s actually running in the container. This means you could package your data, scripts and environment in a Docker image that also runs a Jupyter Notebook server. If you make this image available, say on Dockerhub, other researchers could then download it and interact with your data/code via the fancy interactive Jupyter notebooks that you have prepared for them. We haven’t made any fancy notebooks for you, but we have set up a Jupyter Notebook server. Try it out if you want to (replace the image name with your version if you’ve built it yourself):

+
docker run -it nbisweden/workshop-reproducible-research jupyter notebook --allow-root --no-browser
+
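To actually reach the notebook server from the browser on your host you also need to publish the port that Jupyter listens on. A hedged variant of the command above could look like this (the port number and the --ip flag are additions not shown in the original command):

docker run -it -p 8888:8888 nbisweden/workshop-reproducible-research \
    jupyter notebook --allow-root --no-browser --ip=0.0.0.0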
+
+
+
+

7 Apptainer

+

Apptainer is a container software alternative to Docker. It was originally developed as Singularity by researchers at Lawrence Berkeley National Laboratory (read more about this below) with focus on security, scientific software, and HPC clusters. One of the ways in which Apptainer is more suitable for HPC is that it very actively restricts permissions so that you do not gain access to additional resources while inside the container. Apptainer also, unlike Docker, stores images as single files using the Singularity Image Format (SIF). A SIF file is self-contained and can be moved around and shared like any other file, which also makes it easy to work with on an HPC cluster.

+
+
+
+ +
+
+Apptainer and Singularity +
+
+
+

The open source Singularity project was renamed to Apptainer in 2021. The company Sylabs still keeps its commercial branch of the project under the Singularity name, and offers a free ‘Community Edition’ version. The name change was done in order to clarify the distinction between the open source project and the various commercial versions. At the moment there is virtually no difference to you as a user whether you use Singularity or Apptainer, but eventually it’s very likely that the two will diverge.

+
+
+

While it is possible to define and build Apptainer images from scratch, in a manner similar to what you’ve already learned for Docker, this is not something we will cover here (but feel free to read more about this in e.g. the Apptainer docs).

+

The reasons for not covering Apptainer more in-depth are varied, but it basically boils down to it being more or less Linux-only, unless you use Virtual Machines (VMs). Even then you’ll run into issues of incompatibility of various kinds, and these issues are further compounded if you’re on one of the new ARM64 Macs. You also need root (admin) access in order to actually build Apptainer images regardless of platform, meaning that you can’t build them on e.g. Uppmax, even though Apptainer is already installed there. You can, however, use the --remote flag, which runs the build on Apptainer’s own servers. In practice this often doesn’t work, though, since most scientists work in private Git repositories so that their research and code are not publicly available, while the --remote flag requires that e.g. the environment.yml file is publicly accessible.

+

There are very good reasons to use Apptainer, however, the major one being that you aren’t allowed to use Docker on most HPC systems! One of the nicer features of Apptainer is that it can convert Docker images directly for use within Apptainer, which is highly useful for cases when you have already built your Docker image or are using a remotely available image stored on e.g. DockerHub. For a lot of scientific work based in R and/or Python, however, it is most often the case that you build your own images, since you have a complex dependency tree of software packages not readily available in existing images. So, we now have another problem for building our own images:

+
  1. Only Apptainer is allowed on HPC systems, but you can’t build images there due to not having root access.
  2. You can build Apptainer images locally and transfer them to HPCs, but this is problematic unless you’re running Linux natively.
+

Seems like a “catch 22”-problem, right? There are certainly workarounds (some of which we have already mentioned) but most are roundabout or difficult to get working for all use-cases. Funnily enough, there’s a simple solution: run Apptainer locally from inside a Docker container! Conceptually very meta, yes, but it works very well in practice. What we are basically advocating for is that you stick with Docker for most of your container-based work, but convert your Docker images using Apptainer-in-Docker whenever you need to work on an HPC. This is of course not applicable to Linux users, or to those of you who are fine with working through VMs and managing any issues that arise from doing so.

+
+
+
+ +
+
+Summary +
+
+
+

Apptainer is a great piece of software that is easiest to use if you’re working on a Linux environment. Docker is, however, easier to use from a cross-platform standpoint and covers all use-cases except running on HPCs. Running on HPCs can be done by converting existing Docker images at runtime, while building images for use on HPCs can be done using local Docker images and Apptainer-in-Docker.

+
+
+
+

7.1 Apptainer-in-Docker

+

By creating a bare-bones, Linux-based Docker image with Apptainer you can build Apptainer images locally on non-Linux operating systems. There is already a good image set up for just this, and it is defined in this GitHub repository. Looking at the instructions there we can see that we need to do the following:

+
docker run \
+    --rm \
+    -v /var/run/docker.sock:/var/run/docker.sock \
+    -v $(pwd):/work \
+    kaczmarj/apptainer \
+    build <IMAGE>.sif docker-daemon://<IMAGE>:<TAG>
+

You already know about docker run, the --rm flag and bind mounts using -v. The /var/run/docker.sock part is the Unix socket that the Docker daemon listens to by default; mounting it is what lets Apptainer access the local Docker image we want to convert to a SIF file. The kaczmarj/apptainer part after the bind mounts is the image location hosted at DockerHub, while the last line is the Apptainer command that actually does the conversion. All we need to do is to replace the <IMAGE> part with the Docker image we want to convert, e.g. my_docker_image.

+
  • Replace <IMAGE> and <TAG> with one of your locally available Docker images and one of its tags and run the command - remember that you can use docker image ls to check what images you have available.
+

In the end you’ll have a SIF file (e.g. my_docker_image.sif) that you can transfer to an HPC such as Uppmax and run whatever analyses you need. If you want to be able to do this without having to remember all the code you can check out this script.
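As a sketch of what that could look like in practice (the host name and target directory below are placeholders for your own HPC account and project):

# Copy the SIF file to the cluster (placeholder host and path)
scp my_docker_image.sif your_username@rackham.uppmax.uu.se:/proj/your_project/

# ... and then, logged in on the cluster:
apptainer run my_docker_image.sif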

+
+
+

7.2 Running Apptainer

+

The following exercises assume that you have a login to the Uppmax HPC cluster in Uppsala, but will also work for any other system that has Apptainer installed - like if you managed to install Apptainer on your local system or have access to some other HPC cluster. Let’s try to convert the Docker image for this course directly from DockerHub:

+
apptainer pull mrsa_proj.sif docker://nbisweden/workshop-reproducible-research
+

This should result in a SIF file called mrsa_proj.sif.

+

In the Docker image we included the code needed for the workflow in the /course directory of the image. These files are of course also available in the Apptainer image. However, an Apptainer image is read-only. This will be a problem if we try to run the workflow within the /course directory, since the workflow will produce files and Snakemake will create a .snakemake directory. Instead, we need to provide the files externally from our host system and simply use the Apptainer image as the environment to execute the workflow in (i.e. all the software and dependencies).

+

In your current working directory (workshop-reproducible-research/tutorials/containers/) the vital MRSA project files are already available (Snakefile, config.yml and code/supplementary_material.qmd). Since Apptainer bind mounts the current working directory we can simply execute the workflow and generate the output files using:

+
apptainer run mrsa_proj.sif
+

This executes the default run command, which is snakemake -rp -c 1 --configfile config.yml (as defined in the original Dockerfile). Once completed you should see a bunch of directories and files generated in your current working directory, including the results/ directory containing the final HTML report.
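If you want to run something other than the default command you can use apptainer exec instead; for example, to run the same workflow on four cores (a minimal sketch, assuming your allocation actually gives you four cores):

apptainer exec mrsa_proj.sif snakemake -rp -c 4 --configfile config.yml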

+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
  • How to build an Apptainer image using Apptainer inside Docker.
  • How to convert Docker images to Apptainer images.
  • How to run Apptainer images.
+
+
+
+
+
+

8 Extra material

+

Containers can be large and complicated, but once you start using them regularly you’ll find that you begin to understand these complexities. There are lots of different things you can do with images and containers in general, especially when it comes to optimising build time or final image size. Here are some small tips and tricks to take inspiration from!

+

If you want to read more about containers in general you can check out these resources:

+
  • A “Get started with Docker” guide at the Docker website.
  • An early paper on the subject of using Docker for reproducible research.
+
+

8.1 Building for multiple platforms

+

With the newer ARM64 architectures introduced by Apple one often runs into the problem of not having an architecture-native image to run with. This is sometimes okay since the Rosetta2 software can emulate the old AMD64 architecture on newer ARM64 computers, but results in a performance hit. One could just build for ARM64 using --platform=linux/arm64 instead, but then somebody who doesn’t have the new architecture can’t run it. There is a way around this, however: multi-platform builds. We can build for multiple platforms at the same time and push those to e.g. DockerHub and anybody using those images will automatically pull the one appropriate for their computer. Here’s how to do it:

+
  • Start by checking the available builders using docker buildx ls.
+

You should only see the default builder, which does not have access to multi-platform builds. Let’s create a new builder that does have access to it:

+
  • Run the following: docker buildx create --name mybuilder --driver docker-container --bootstrap.
  • Switch to using the new builder with docker buildx use mybuilder and check that it worked with docker buildx ls.
+

All that’s needed now is to build and push the images! The following command assumes that you have an account with <username> at DockerHub and you’re pushing the <image> image:

+
docker buildx build --platform linux/amd64,linux/arm64 -t <username>/<image>:latest --push .
+
  • Execute the above command with your username and your image.
+

That’s it! Now anybody who does e.g. docker pull <username>/<image> will get an image appropriate for their architecture whether they are on AMD64 or ARM64!
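If you want to double-check that both platforms really ended up in the pushed image you can, for example, inspect its manifest with the buildx tooling (same placeholders as above):

docker buildx imagetools inspect <username>/<image>:latest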

+
+
+
+ +
+
+An alias to buildx +
+
+
+

You can type docker buildx install to make docker build an alias for docker buildx, allowing you to run multi-platform builds with docker build directly. Use docker buildx uninstall to remove this alias.

+
+
+ + +
+
+ +
+ +
[New figure files added in this commit, contents omitted: pages/images/dag_mrsa.svg, pages/images/jobgraph.svg, pages/images/rulegraph.svg, pages/images/rulegraph_complex.svg and pages/images/rulegraph_mrsa.svg (Snakemake rule graphs and job DAGs), plus the screenshots pages/images/jupyter_basic_update.png, jupyter_dashboard.png, jupyter_empty_nb.png, jupyter_widget.png and jupyterlab_dashboard.png.]

diff --git a/pages/jupyter.html b/pages/jupyter.html
new file mode 100644
+
+ +
+ +
+
+
+

Working with Jupyter

+

How to generate reproducible reports and computational notebooks

+
+
+ + +
+ + +
+
Published
+
+

15-Oct-2024

+
+
+ + +
+ + +
+ + + + +
+ + + + + + +
+

1 Introduction

+

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain code, equations, visualizations and text. The functionality is partly overlapping with Quarto (see the tutorial), in that they both use markdown and code chunks to generate reports that integrate results of computations with the code that generated them. Jupyter Notebook comes from the Python community while Quarto was developed by Posit (who also created R Markdown and RStudio), but you could use most common programming languages in either alternative. In practice though, it’s quite common that R developers use Jupyter but probably not very common that Python developers use RStudio. Some reasons to use Jupyter include:

+
  • Python is lacking a really good IDE for doing exploratory scientific data analysis, like RStudio or Matlab. Some people use Jupyter simply as an alternative for that.
  • The Jupyter Project community is large and dynamic, and there are lots of tools for sharing, displaying or interacting with notebooks.
  • An early ambition with Jupyter notebooks (and its predecessor IPython notebooks) was to be analogous to the lab notebook used in a wet lab. It would allow the data scientist to document his or her day-to-day work and interweave results, ideas, and hypotheses with the code. From a reproducibility perspective, this is one of the main advantages.
  • Jupyter notebooks can be used, just like Quarto, to provide a tighter connection between your data and your results by integrating results of computations with the code that generated them. They can also do this in an interactive way that makes them very appealing for sharing with others.
+

As always, the best way is to try it out yourself and decide what to use it for!

+

This tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to set it up if you haven’t done so already. Then open up a terminal and go to workshop-reproducible-research/tutorials/jupyter and activate your jupyter-env Conda environment.

+
+
+
+ +
+
+A note on nomenclature +
+
+
+
  • Jupyter: a project to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages. Lives at jupyter.org.
  • Jupyter Notebook: A web application that you use for creating and managing notebooks. One of the outputs of the Jupyter project.
  • Jupyter lab: A more powerful and feature-rich interface that also includes a terminal, debugger, tabs etc.
  • Jupyter notebook: The actual .ipynb file that constitutes your notebook.
+
+
+
+
+

2 The basics

+

One thing that sets Jupyter Notebook apart from what you might be used to is that it’s a web application, i.e. you edit and run your code from your browser. But first you have to start the Jupyter Notebook server. At this point you may either try the classic notebook interface by running:

+
jupyter notebook --allow-root
+

Or give the more feature-rich Jupyter lab interface a try by running:

+
jupyter lab --allow-root
+

Whichever interface you choose you should see something similar to this printed to your terminal:

+
[I 18:02:26.722 NotebookApp] Serving notebooks from local directory: /Users/john/workshop-reproducible-research/tutorials/jupyter
+[I 18:02:26.723 NotebookApp] 0 active kernels
+[I 18:02:26.723 NotebookApp] The Jupyter Notebook is running at:
+[I 18:02:26.723 NotebookApp] http://localhost:8888/?token=e03f10ccb40efc3c6154358593c410a139b76acf2cae000
+[I 18:02:26.723 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
+[C 18:02:26.724 NotebookApp]
+
+    Copy/paste this URL into your browser when you connect for the first time,
+    to login with a token:
+        http://localhost:8888/?token=e03f10ccb40efc3c6154358593c410a139b76acf2cae785c
+[I 18:02:27.209 NotebookApp] Accepting one-time-token-authenticated connection from ::1
+
+
+
+ +
+
+A note for Windows users +
+
+
+

If you see the error message Start : This command cannot be run due to the error: The system cannot find the file specified. ... then try starting Jupyter with jupyter notebook --no-browser then copy the URL given into the browser directly.

+
+
+
+
+
+ +
+
+Jupyter notebook versions +
+
+
+

Depending on what version of the notebook conda package you have installed, the interface may look slightly different. The screenshots in this tutorial are from version 7, an update which has brought the ‘classic’ Jupyter notebook closer to the Jupyter lab interface. Read more about this update at the Jupyter homepage.

+
+
+

The Jupyter Notebook/Lab interface probably opened up a web browser for you automatically, otherwise go to the address specified in the message in the terminal. Note that the server is running locally (as http://localhost:8888) so this does not require that you have an active internet connection. Also note that it says:

+
Serving notebooks from local directory: </some/local/path/workshop-reproducible-research/tutorials/jupyter>
+

Everything you do in your Notebook session will be stored in this directory, so you won’t lose any work if you shut down the server.

+
+

+
+

What you’re looking at is the Notebook dashboard. This is where you manage your files, notebooks, and kernels. The Files tab shows the files in your directory. The Running tab keeps track of all your processes.

+

The Jupyter lab dashboard should look something like this:

+
+

+
+

Let’s start by creating an empty notebook. You can do this by selecting the Files tab and clicking New > Notebook. When the notebook opens, select the suggested Python 3 kernel from the drop-down menu.

+

This will open up a new tab or window looking like this:

+
+

+
+

Start by giving your notebook a name by clicking on the text “Untitled” at the top of the page. Enter “jupyter-tutorial.ipynb”.

+

Note that for most of this tutorial we will describe how you work in the actual notebook and not devote a lot of time to the extra features available in the Jupyter lab interface.

+
+
+
+ +
+
+Tip +
+
+
+

If you want to start Jupyter Notebooks on a cluster that you SSH to (e.g. Uppmax) see the section in the Extra material

+
+
+

Jupyter notebooks are made up of cells, and you are currently standing in the first cell in your notebook. Your cursor should be blinking in this cell, indicating that you are in “Edit mode” meaning that you can type text in the cell. Pressing the Esc key puts you in “Command mode” which allows you to manipulate the notebook as a whole, more on this later.

+

Cells in Jupyter notebooks can be of two types: markdown or code.

+
  • Markdown:
+

These cells contain static material such as captions, text, lists, images and so on. You express this using Markdown, which is a lightweight markup language. Markdown documents can then be converted to other formats for viewing (the document you’re reading now is written in Markdown and then converted to HTML). The format is discussed in a little more detail in the Quarto tutorial. Jupyter Notebook uses a dialect of Markdown called GitHub Flavoured Markdown, which is described here.

+
  • Code:
+

These are the cells that actually do something, just as code chunks do in Quarto/R Markdown. You can write code in dozens of languages and do all kinds of clever tricks. You then run the code cell and any output the code generates, such as text or figures, will be displayed beneath the cell. We will get back to this in much more detail, but for now it’s enough to understand that code cells are for executing code that is interpreted by a kernel (in this case the Python version in your Conda environment).

+

Before we continue, here are some shortcuts that can be useful. Note that they are only applicable when in “Command mode”. Most of them are also available from the menus. You can also view this list of shortcuts from the Help menu under “Show Keyboard Shortcuts”.

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Shortcut        Effect
enter           Enter Edit mode
escape          Enter Command mode
ctrl-enter      Run the cell
shift-enter     Run the cell and select the cell below
alt-enter       Run the cell and insert a new cell below
s               Save the notebook
tab             For code completion or indentation
m, y            Toggle between Markdown and Code cells
d-d             Delete a cell
a               Insert cells above current cell
b               Insert cells below current cell
x               Cut currently selected cells
v               Paste cell below
o               Toggle output of current cell
+
+
+

3 Writing markdown

+

Let’s use our first cell to create a header. Change the format from Code to Markdown using the drop-down list in the Notebook Toolbar, or by pressing the m key when in command mode. Double click on the cell, or hit enter to enter editing mode and input “# My notebook” (“#” is used in Markdown for header 1). Run the cell with ctrl-enter (cmd-enter on Mac).

+

Markdown is a simple way to structure your notebook into sections with descriptive notes, lists, links, images etc.

+

Below are some examples of what you can do in markdown. Paste all or parts of it into one or more cells in your notebook to see how it renders. Make sure you set the cell type to Markdown.

+
## Introduction
+In this notebook I will try out some of the **fantastic** concepts of Jupyter
+Notebooks.
+
+## Markdown basics
+Examples of text attributes are:
+
+- *italics*
+- **bold**
+- `monospace`
+
+Sections can be separated by horizontal lines.
+
+---
+
+Blockquotes can be added, for instance to insert a Monty Python quote:
+
+> Spam!
+> Spam!
+> Spam!
+> Spam!
+
+See [here](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html) for more information.
+
+
+

4 Writing code

+

Now let’s write some code! Since we chose a Python kernel, Python would be the native language to run in a cell. Enter this code in the second cell and run it:

+
print("Hello world!")
+

Note how the output is directly displayed below the cell. This interactive way of working is one of the things that sets Jupyter Notebook apart from RStudio and Quarto. In RStudio/Quarto, documents are typically rendered top-to-bottom in one run, while you work in a Jupyter notebook in a different way. This requires some special attention when it comes to reproducibility, which we will get back to in the reproducibility section.

+

What is a Jupyter notebook? Let’s take a closer look at the notebook we’re currently working in. Jupyter Notebooks are autosaved every minute or so, so you will already have it available. We can be a little meta and do this from within the notebook itself, by running some shell commands in a code cell. This very handy functionality is possible by prepending the command with !. Try adding !ls to a code cell and run it. This will list the files in the current directory.

+

Aha, we have a new file called jupyter-tutorial.ipynb! This is our notebook. Look at the first ten lines of the file by using !head jupyter-tutorial.ipynb. Seems like it’s just a plain old JSON file. Since it’s a text file it’s suitable for version control with for example Git. There are however some special considerations to take into account for Notebooks which we will cover in the reproducibility section of this tutorial.

+

Variables defined in cells become variables in the global namespace. You can therefore share information between cells. Try to define a function or variable in one cell and use it in the next. For example, add the following to a new cell and run it:

+
def print_me(str):
+    print(str)
+

Now create a new cell and add:

+
print_me("Hi!")
+

Your notebook should now look something like this.

+
+

+
+

The focus of this tutorial is not on how to write Markdown or Python; you can make really pretty notebooks with Markdown and you can code whatever you want with Python. Rather, we will focus on the Jupyter Notebook features that allow you to do a little more than that.

+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
  • That a Jupyter notebook consists of a series of cells, and that they can be either markdown or code cells.
  • That we execute the code in a code cell with the kernel that we chose when opening the notebook.
  • We can run shell commands by prepending them with !.
  • A Jupyter notebook is simply a text file in JSON format.
+
+
+
+
+

5 Magics

+

Magics constitute a simple command language that significantly extends the power of Jupyter notebooks. There are two types of magics:

+
  • Line magics: Commands that are prepended by %, and whose arguments only extend to the end of the line.
  • Cell magics: Commands that start with %% and then apply to the whole cell. Must be written on the first line of a cell.
+

Now list all available magics with %lsmagic (which itself is a magic). You add a question mark to a magic to show the help (e.g. %lsmagic?). Some of them act as shortcuts for commonly used shell commands (%ls, %cp, %cat, ..). Others are useful for debugging and optimizing your code (%timeit, %debug, %prun, ..). For more information see the magics documentation.
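As a minimal example of a line magic, you could time a small expression directly in a code cell (any small piece of Python works here):

%timeit sum(range(1000))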

+

A very useful magic, in particular when using shell commands a lot in your work, is %%capture. This will capture the stdout/stderr of any code cell and store them in a Python object. Run %%capture? to display the help and try to understand how it works. Try it out with either some Python code, other magics or shell commands. Here is an example of how you can make it work:

+
%%capture output
+%%bash
+echo "Print to stdout"
+echo "Print to stderr" >&2
+

… And in another cell:

+
print("stdout:" + output.stdout)
+print("stderr:" + output.stderr)
+
+

Tip
You can capture the output of some magics directly like this: my_dir = %pwd.

+
+

The %%script magic is used for specifying a program (Bash, Perl, Ruby, ..) with which to run the code (similar to a shebang). For some languages it’s possible to use these shortcuts:

+
  • %%ruby
  • %%perl
  • %%bash
  • %%html
  • %%latex
  • %%R
+
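For instance, a cell using the %%bash shortcut could look something like this (a trivial example just to show the syntax):

%%bash
echo "Hello from bash"
ls | head -n 3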
+
+
+ +
+
+A note on R code +
+
+
+

In order to use the %%R magic you need to install the rpy2 extension, for example with Conda. This package is already installed in the jupyter-env environment you’re using for this tutorial. However, you also have to load it by running %load_ext rpy2.ipython in a cell.

+
+
+

Try this out if you know any of the languages above. Otherwise you can always try to print the quadratic formula with LaTeX!

+
\begin{array}{*{20}c} {x = \frac{{ - b \pm \sqrt {b^2 - 4ac} }}{{2a}}} & {{\rm{when}}} & {ax^2 + bx + c = 0} \\ \end{array}
+

Another useful magic is %precision which sets the floating point precision in the notebook. As a quick example, add the following to a cell and run it:

+
float(100/3)
+

Next set the precision to 4 decimal points by running a cell with:

+
%precision 4
+

Now run the cell with float(100/3) again to see the difference.

+

Running %precision without additional arguments will restore the default.

+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
  • The basics of Jupyter magics and the difference between line magics and cell magics
  • How to capture and use output from notebook cells with %%capture
  • How to use magics to run non-Python code in notebooks
+
+
+
+
+

6 Plotting

+

An essential feature of Jupyter Notebooks is of course the ability to visualize data and results via plots. A full guide to plotting in Python is beyond the scope of this course, but we’ll offer a few glimpses into the plotting landscape of Python.

+

First of all, Python has a library for plotting called matplotlib, which comes packed with functionality for creating high-quality plots. Below is an example of how to generate a line plot of a sine wave.

+
# Import packages
+import numpy as np
+import matplotlib.pyplot as plt
+# Generate a set of evenly spaced numbers between 0 and 100
+x = np.linspace(0,3*np.pi,100)
+# Use the sine function to generate y-values
+y = np.sin(x)
+# Plot the data
+line, = plt.plot(x, y, color='red', linestyle="-")
+

By default plots are rendered in the notebook as rasterised images which can make the quality poor. To render in scalable vector graphics format use the set_matplotlib_formats from the matplotlib_inline package:

+
import matplotlib_inline
+matplotlib_inline.backend_inline.set_matplotlib_formats('pdf', 'svg')
+

Now try running the code for the sine wave plot again.

+
+
+

7 Other packages for plotting

+

As we mentioned Matplotlib comes with a lot of functionality which is great because it allows you to create all sorts of plots and modify them exactly to your liking. However, this can also mean that creating very basic plots might involve a lot of cumbersome coding, when all you want is a simple bar chart!

+

Fortunately there are a number of Python packages that build upon matplotlib but with a much simplified interface. One such popular package is Seaborn. Below we’ll see how to generate a nice looking bar plot with error bars.

+

First import the Seaborn package (using an abbreviated name to simplify typing):

+
import seaborn as sns
+

Next we’ll load some example data of penguins collected at the Palmer Station, in Antarctica.

+
penguins = sns.load_dataset("penguins")
+# Look at first 5 lines of the data
+penguins.head(5)
+

The most basic way to generate a bar plot of this data with Seaborn is:

+
sns.barplot(data=penguins)
+

Simple right? Yes, but maybe not very informative. Here Seaborn simply calculates the mean of all numeric variables for the penguins and plots them with error bars representing a 95% confidence interval.

+

Let’s say that instead we want to plot the mean value of the body mass of the penguins at the different islands where they were examined.

+
sns.barplot(data=penguins, x="island", y="body_mass_g", errorbar="sd");
+

Here we specified to use values in the ‘island’ column as categories for the x-axis, and values in the ‘body_mass_g’ column as values for the y-axis. The barplot function of Seaborn will then calculate the mean body mass for each island and plot the bars. With errorbar="sd" we tell the function to draw the standard deviation as error bars, instead of computing a confidence interval.

+

If we instead want to visualize the data as a scatterplot we can use the sns.scatterplot function. Let’s plot the body mass vs bill length for all penguins and colour the data points by species. We’ll also move the legend outside of the plotting area and modify the x and y-axis labels:

+
# Store the matplotlib axes containing the plot in a variable called 'ax'
+ax = sns.scatterplot(data=penguins, x="bill_length_mm", y="body_mass_g",
+                     hue="species")
+# Modify the labels of the plot
+ax.set_xlabel("Bill length (mm)")
+ax.set_ylabel("Body mass (g)")
+# Set legend position outside of plot
+ax.legend(bbox_to_anchor=(1,1));
+

If you want to save a plot to file you can use the plt.savefig function. Add the following to the bottom of the cell with the scatterplot code:

+
plt.savefig("scatterplot.pdf", bbox_inches="tight")
+

The bbox_inches="tight" setting ensures that the figure is not clipped when saved to file.

+

The Seaborn website contains great tutorials and examples of other ways to plot data!

+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
  • How to generate simple plots with matplotlib
  • How to import and use the Seaborn package for plotting
  • How to save plots from notebooks to a file
+
+
+
+
+

8 Widgets

+

Since we’re typically running our notebooks in a web browser, they are quite well suited for also including more interactive elements. A typical use case could be that you want to communicate some results to a collaborator or to a wider audience, and that you would like them to be able to modify how the results are displayed. It could, for example, be to select which gene to plot, or to see how some parameter value affects a clustering. Jupyter notebooks have great support for this in the form of widgets.

+

Widgets are eventful Python objects that have a representation in the browser, often as a control like a slider, text box, etc. These are implemented in the ipywidgets package.

+

The easiest way to get started with widgets is via the interact and interactive functions. These functions auto-generate widgets from functions that you define, and then call those functions when you manipulate the widgets. This might sound abstract so let’s look at an example.

+

Let’s take the scatterplot of the penguins dataset that we generated in the previous section and add widgets that lets us choose variables to plot as well as coloring of the points.

+

First we’ll import the interactive function from ipywidgets. Let’s also import the widgets module, which we’ll use later. Add the following code to a cell and run it:

+
from ipywidgets import interactive, widgets
+

Now, in a new cell, define a function called scatterplot with the code to generate the plot itself. Also add a palette argument to the function so that we can specify the colour palette to use for the plot. The function should look like this:

+
def scatterplot(x, y, hue, palette):
+    ax = sns.scatterplot(data=penguins, x=x, y=y, hue=hue, palette=palette)
+

Run the cell and create a new cell below it.

+

Next, we’ll use the interactive function to generate a widget to control the x, y, hue and palette arguments. The interactive function takes a function as its first argument, and then keyword arguments for each of the arguments in the function. The returned value is a widget which we will store in a variable called interactive_scatterplot. Add the following to a cell and run it:

+
interactive_scatterplot = interactive(scatterplot,
+            x=["bill_length_mm","bill_depth_mm","flipper_length_mm","body_mass_g"],
+            y=["body_mass_g","bill_length_mm","bill_depth_mm","flipper_length_mm"],
+            hue=["species","island","sex"],
+            palette=["Set1","Set2","Dark2","Paired2"])
+

Importantly, all parameters defined in the scatterplot function must be given in the interactive call. The interactive_scatterplot widget is now tied to the scatterplot function. However, we still haven’t displayed the widget itself. To do that, simply add interactive_scatterplot to a new cell and run it:

+
interactive_scatterplot
+

This should show the scatterplot with drop-down menus for each of the arguments. Try changing the x and y variables to plot by selecting from the respective drop-downs. The hue drop-down now lets you change what variable to use for colouring the points and the palette drop-down changes the colour palette. As you can see, the available options in the drop-downs are the ones we specified in the interactive call.

+

Depending on the type of the passed argument different types of widgets will be created by interactive. For instance:

+
  • int or float arguments will generate a slider
  • bool arguments (True/False) will generate checkbox widgets
  • list arguments will generate a drop-down
  • str arguments will generate a text-box
+

Let’s add a slider to control the size of the points. In the Seaborn package this is controlled by the s argument to the scatterplot function. Modify the cell with your scatterplot function so it looks like this (remember to run the cell in order to update the function definition):

+
def scatterplot(x, y, hue, palette, size=50):
+    ax = sns.scatterplot(data=penguins, x=x, y=y, hue=hue, palette=palette, s=size)
+

Note that we added a size argument to the function and supplied it to the Seaborn scatterplot call with s=size. Setting size=50 in the function definition means that the default size of the points will be 50.

+

Now we need to add a slider for the size argument. Update the cell where we call the interactive function so that it looks like this, then run it:

+
interactive_scatterplot = interactive(scatterplot,
+            x=["bill_length_mm","bill_depth_mm","flipper_length_mm","body_mass_g"],
+            y=["body_mass_g","bill_length_mm","bill_depth_mm","flipper_length_mm",],
+            hue=["species","island","sex"],
+            palette=["Set1","Set2","Dark2","Paired2"],
+            size=(20,100,10))
+

Here the size argument is defined as a tuple which sets the minimum value of the slider to 20, the maximum value to 100 and the step size to 10.

+

Finally, re-run the cell where we displayed the interactive_scatterplot widget. You should now see a slider for the size argument (starting at 50). Try changing the size of the points by moving the slider.

+

This is how it should look if everything works.

+

+

There are lots of widgets, e.g.:

+
  • Drop-down menus
  • Toggle buttons
  • Range sliders
  • File uploader
+

… And much, much more. Here is a list of all available widgets together with documentation and examples. Some of these widgets cannot be auto-generated by interactive, but fear not! Instead of relying on auto-generation we can define the widget and supply it directly to interactive.

+

To see this in practice, we’ll modify the scatterplot function to display a title and add a color picker widget that lets us set the color of the title text.

+

First, update the scatterplot function so that it looks like this:

+
def scatterplot(x, y, hue, palette, size, color):
+    ax = sns.scatterplot(data=penguins, x=x, y=y, hue=hue, palette=palette, s=size)
+    ax.set_title("Penguin scatterplot", color=color)
+

Then run the cell to update the function definition.

+

Next, we’ll define the colour picker widget. Add the definition to the cell where you defined the interactive_scatterplot then supply the widget to the interactive call. The cell should look like this:

+
colorpicker = widgets.ColorPicker(
+    concise=False,
+    description='Title color',
+    value='blue',
+    disabled=False
+)
+interactive_scatterplot = interactive(scatterplot,
+            x=["bill_length_mm","bill_depth_mm","flipper_length_mm","body_mass_g"],
+            y=["body_mass_g","bill_length_mm","bill_depth_mm","flipper_length_mm"],
+            hue=["species","island","sex"],
+            palette=["Set1","Set2","Dark2","Paired2"],
+            size=(20, 100, 10),
+            color=colorpicker)
+

Run the cell to update the widgets.

+

Finally, re-run the cell where we displayed the interactive_scatterplot. The plot should now have a title and you should see a new color picker below the slider for the point size. Try changing the title colour by clicking on the new color picker.

+
+
+
+ +
+
+Caution +
+
+
+

Note that you may have to close the colour picker once you’ve made your choice in order to make the plot update.

+
+
+
+
+

9 Other interactive plots

+

Jupyter widgets, like the ones we used here, are the most vanilla way of getting interactive graphs in Jupyter notebooks. Some other alternatives are:

+
  • altair is a plotting library that uses Vega-Lite grammar which is reminiscent of ggplot2 in R. The syntax is different from what we’ve shown here, but it’s very powerful once you get the hang of it.
  • Plotly is actually an API to a web service that renders your graph and returns it for display in your Jupyter notebook. Generates very visually appealing graphs, but from a reproducibility perspective it’s maybe not a good idea to be so reliant on a third party.
  • Bokeh is another popular tool for interactive graphs. Most plotting packages for Python are built on top of matplotlib, but Bokeh has its own library. This can give a steeper learning curve if you’re used to the standard packages.
  • mpld3 tries to integrate matplotlib with Javascript and the D3js package. It doesn’t scale well for very large datasets, but it’s easy to use and works quite seamlessly.
+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
  • How to implement interactive widgets in notebooks
+
+
+
+
+

10 Extensions

+

Jupyter Notebook extensions are add-ons that can increase the functionality of your notebooks. Extensions include themes, editors, git support, renderers and much more. The most user-friendly way of managing extensions is via the Extension Manager available in the Jupyter lab interface. You can access it by clicking the puzzle piece icon in the left sidebar.

+
+
+
+ +
+
+Caution +
+
+
+

Note that the extension manager contains a disclaimer warning you that these third-party extensions are not reviewed for vulnerabilities which means that you should be careful about what extensions you install.

+
+
+

You can use the search field to perform a free text search for available extensions, then click ‘Install’ to install an extension. Note that in some cases you will be prompted to install additional packages.

+

While an in-depth listing of available extensions is well beyond the scope of this tutorial we offer this list of a few extensions that are of particular relevance to this course:

+ +
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
  • What Jupyter extensions are and how to manage them
+
+
+
+
+

11 Reproducibility

+

Now that you have a feeling for what Jupyter can do, we’ll spend a little time on things to consider specifically from a reproducibility point of view when it comes to Jupyter notebooks.

+
+
+

12 Version control of Jupyter notebooks

+

As we’ve seen, Jupyter notebooks are plain-text JSON files. This means that they can be version controlled with Git just like any other text file. However, because of the way Jupyter notebooks store their content, the diffs produced by Git can be difficult to interpret. Luckily, there are tools that can provide content-aware diffs and merge functionality for Jupyter notebooks.

+

One such tool is nbdime. nbdime is built to understand the structure of Jupyter notebooks and can therefore generate diffs that are easier to read. It can also be used to merge changes made to notebooks, which is great especially when collaborating on notebooks with others.

+

nbdime is already installed in the jupyter-env Conda environment you are using for this tutorial. To try it in action, create a new notebook and name it Analysis.ipynb. Add the following code to the first cell, then run it:

+
import numpy as np
+import seaborn as sns
+penguins = sns.load_dataset("penguins")
+

This simply imports some python modules and loads a dataset.

+

Save the notebook. Now we’ll add and commit the new notebook to the Git repository:

+
git add Analysis.ipynb
+git commit -m "Add Analysis notebook"
+

So far so good. And nothing new here compared to what we’ve already learned about version control. Now let’s make some changes to the notebook. First we’ll replace one of the loaded modules. Update the first cell of the notebook so that it reads:

+
import pandas as pd
+import seaborn as sns
+penguins = sns.load_dataset("penguins")
+

Then create a new cell where we’ll calculate the mean of each numeric value grouped by species. In the new cell, add the following code:

+
penguins.groupby("species").mean(numeric_only=True)
+

Run the cell and save the notebook.

+

Now use git diff to view the changes we’ve made to the notebook. Run:

+
git diff Analysis.ipynb
+

Even with very minor modifications to the notebook the diff will contain numerous lines that are difficult to interpret. This is because the notebook not only contains the code, but also cell metadata and output (in this case a table produced by the second cell).

+

Now let’s generate a more easy-to-read diff. Run:

+
nbdiff -s Analysis.ipynb
+

This will use the nbdiff tool that comes with nbdime to show an inline diff of the notebook. The -s flag tells nbdiff to only show differences for the actual code changes, ignoring changes in metadata and output. There are a number of flags you can use here to customise the diff. The uppercase version of each flag will ignore the respective change type. For example, to see the diff but ignore changes to the output of cells you can run:

+
nbdiff -O Analysis.ipynb
+

nbdime also comes with a graphical web-based diff viewer. To try it, run:

+
nbdiff-web Analysis.ipynb
+

This will open up a tab in your web browser showing you changes made to the notebook side-by-side for each cell, including also cell output. This makes it easy to see changes made both to code and outputs such as tables and plots.

+
+

12.1 Other tools for version control of notebooks

+
  • You can also install the nbdime jupyter lab extension to get access to the diff functionality directly from the Jupyter lab interface. If you also install the jupyterlab-git extension you can both view diffs and commit changes directly from Jupyter lab.
  • VS Code actually comes with built-in support for both Jupyter notebooks and Git, so that you can view informative diffs directly from the editor.
+
+
+
+

13 Making sure notebooks work as expected

+

One of the great things with Jupyter notebooks is the ability to do data exploration in an interactive way. Because loaded data, defined variables and functions remain in the notebook until you restart the kernel, you can easily make changes to your analysis and re-run cells to see the effect of the changes immediately. However, this can also be a source of errors and inconsistencies if you, during your work, modify or use variables in cells upstream of their initial definition.

+

The nbval package can help you catch these types of errors. nbval is a plugin for the pytest testing framework that can be used to test Jupyter notebooks. It works by executing each cell in the notebook and comparing the output to the output stored in the notebook. If the output is the same, the test passes. If the output differs, the test fails. nbval is also pre-installed in the jupyter-env Conda environment you’re using for this tutorial.

+

As an example, we’ll keep working with the Analysis.ipynb notebook we’ve created.

+

Let’s say we want to estimate the size of the bill of penguins using the bill_length_mm and bill_depth_mm columns. We’ll do this by adding a new cell to our notebook with the following code:

+
penguins["bill_size"] = (penguins["bill_length_mm"] * penguins["bill_depth_mm"])
+

Run the cell and add a new one below it. In the new cell, output the mean of each column grouped by island using the following code:

+
penguins.groupby("island").mean(numeric_only=True)
+

Run the cell to see the output. Looks good. Now we have a very simple example of some exploratory analyses on a dataset.

+

Save the notebook and try running nbval on it to see if it works as expected. From the command line, run:

+
pytest --nbval Analysis.ipynb
+

nbval tests each cell in your notebook by executing it and comparing the output to the output stored in the notebook. If the output is the same, the test passes. The output of the test should look something like this:

+
collected 4 items
+
+Analysis.ipynb ....                                                                                                   [100%]
+
+========== 4 passed in 1.93s ==========
+

Now let’s say we realize that we want to normalize the bill_size values by the body mass of the penguins. We’ll just modify the cell where we calculated this value, introducing a small piece of code to divide by the body_mass_g column.

+

Change the third cell of the notebook so that it reads:

+
penguins["bill_size"] = (penguins["bill_length_mm"] * penguins["bill_depth_mm"]) / penguins["body_mass_g"]
+sns.scatterplot(data=penguins, x="bill_size", y="flipper_length_mm", hue="island")
+

Re-run the cell and save the notebook. So far so good! Let’s test the notebook again with nbval. Just like before, run it from the command line with:

+
pytest --nbval Analysis.ipynb
+

If you’ve followed the instructions, this second run of nbval should generate a FAILED test, showing something like:

+
==================== short test summary info ====================
+FAILED Analysis.ipynb::Cell 3
+================== 1 failed, 3 passed in 1.83s ==================
+

What happened here was that we modified the cell where we calculated the bill_size value, but we didn’t re-run the cell where we output the mean of each column grouped by island. This means that the output of the last cell in the notebook now differs from what is actually stored in the notebook variables. This type of error can be difficult to spot, especially if you have a large notebook with many cells. Luckily, nbval can help us here.

+
+
+
+ +
+
+Note +
+
+
+

Note that nbval reports cell numbers using 0-based numbering, so when the test fails on Cell 3 it actually refers to the 4th cell in the notebook.

+
+
+

This problem could have been solved if we had re-run the cell where we output the mean of each column grouped by island. In fact, it is good practice to re-run all cells in a notebook before saving it. If you also restart the kernel before re-running, you make sure that you haven’t introduced any ‘hidden states’.

+
+
+
+ +
+
+Ignoring specific cells +
+
+
+

One caveat of nbval is that it doesn’t work well with cells that generate plots. You can tell nbval to ignore the output of specific cells by adding # NBVAL_IGNORE_OUTPUT to the top of a cell.
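For example, to make nbval skip the output comparison for the scatterplot cell we added earlier, that cell could start like this (a minimal sketch; the marker comment is all that nbval needs):

# NBVAL_IGNORE_OUTPUT
sns.scatterplot(data=penguins, x="bill_size", y="flipper_length_mm", hue="island")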

+
+
+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

  • How to use nbdime to view diffs of Jupyter notebooks
  • How to use nbval to test that notebooks work as expected

+
+
+
+
+

14 Converting notebooks

+

Notebooks can be converted to various output formats such as HTML, PDF, LaTeX etc. directly from the File -> Save and Export Notebook As… menu.

+

Conversion can also be performed on the command line using the jupyter nbconvert command, which is installed together with the jupyter Conda package.

+

The syntax for converting a Jupyter notebook is:

+
jupyter nbconvert --to <FORMAT> notebook.ipynb
+

Here <FORMAT> can be any of asciidoc, custom, html, latex, markdown, notebook, pdf, python, rst, script, slides. Converting to some output formats (e.g. PDF) may require you to install separate software such as Pandoc or a TeX environment.

+

Try converting the jupyter-tutorial.ipynb notebook that you have been working on for this tutorial to HTML using jupyter nbconvert.

+
+
+
+ +
+
+Tip +
+
+
+

If the plots in the HTML-rendered version of your notebook are not displayed properly, try changing the matplotlib_inline.backend_inline.set_matplotlib_formats('pdf', 'svg') line to matplotlib_inline.backend_inline.set_matplotlib_formats('retina').

+
+
+

nbconvert can also be used to run a Jupyter notebook from the command line by running:

+
jupyter nbconvert --execute --to <FORMAT> notebook.ipynb
+

nbconvert executes the cells in a notebook, captures the output and saves the results in a new file. Try running it on the jupyter-tutorial.ipynb notebook.

+

You can also specify a different output file with --output <filename>.

+

So in order to execute your jupyter-tutorial.ipynb notebook and save it to a file named report.html you could run:

+
jupyter nbconvert --to html --output report.html --execute jupyter-tutorial.ipynb
+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
    +
  • How to convert Jupyter notebooks to various other formats
  • +
  • How to use nbconvert to convert notebooks on the command line
  • +
+
+
+
+
+

15 Notebooks and Quarto

+

You may have noticed that a lot of the functionality in Jupyter is overlapping with Quarto. And you may be wondering which one to use. This is a difficult question to answer as it will depend on your use-case and personal preference. As such, any answer will be subjective, but we’ll try to give you some pointers on how to get the best out of both worlds.

+

While similar in some ways Jupyter and Quarto are not completely overlapping. Quarto is great for generating high-quality reports and manuscripts, and is agnostic to the programming language used. Jupyter on the other hand is great for interactive data analysis and exploration with a more direct connection between code and output. While Jupyter is also somewhat agnostic to programming language, it is most commonly used with Python and with both the Jupyter and Python ecosystem at its back it can be customized with a lot of different extensions and plugins.

+

The good news is that the two can be used together allowing you to get the best of both. For example, you may like the professional look of rendered Quarto documents but really like the interactive and exploratory nature of Jupyter. Well you can simply work as you normally do in Jupyter and then use Quarto to render the notebook to a high-quality report or manuscript.

+

To give you an example, take a look at the supplementary_material.ipynb file in the jupyter/ tutorial directory. Open this notebook in the Jupyter lab interface (make sure you have activated the jupyter-env Conda environment).

+

As you can see this notebook contains some brief descriptions in Markdown and code to generate a few plots. It uses the output from the MRSA case-study Snakemake workflow you worked on in the Snakemake tutorial. This is a common use-case for Jupyter notebooks; to generate summary statistics and plots from the results of a workflow run. (A real-world example could of course include a lot more in-depth exploratory analyses).

+

Now, let’s say you want to share the results of this notebook with your PI or collaborators. We could simply share the notebook file, or, as we saw in the previous section, convert it to HTML or PDF via jupyter nbconvert.

+

Let’s do that first so we have something to compare with. Run the following:

+
jupyter nbconvert --to html --output supplementary_material.nbconvert.html supplementary_material.ipynb
+

Open the supplementary_material.nbconvert.html file in a browser to see that it looks like you expect. This looks more or less like the original notebook.

+

Now let’s go one step further and render the notebook to a high-quality report using Quarto. We can actually add a YAML header to the notebook with some document options that Quarto understands. Create a new cell in the notebook (from the Jupyter lab interface) and move it to the top. In this cell, add the following:

+
---
+title: Supplementary material
+subtitle: Supplementary tables and plots for the MRSA study
+format:
+    html:
+        embed-resources: true
+        code-fold: true
+        code-tools: true
+language:
+  code-summary: Click to show code
+bibliography: references.bib
+---
+

Set the cell type to Markdown, then run the cell. Most likely that cell will look rather weird but that’s OK. We’ll fix that in a bit.

+

Save the notebook and now render the document with Quarto from the command line:

+
quarto render supplementary_material.ipynb
+

Open up the supplementary_material.html file in a browser and compare it to the supplementary_material.nbconvert.html file. You should see that the Quarto version looks a lot better. The fact that Quarto supports rendering of Jupyter notebooks means you can keep editing your notebooks as you normally would and use Quarto for rendering the final document. Also there’s very little we had to change in the notebook to make it work with Quarto. If you look closely at the code cells used to generate the plots and table you’ll see that they contain code-chunk options in the same form we used in the Quarto tutorial. These options do not impact the notebook when run in Jupyter, making it easy to use the two tools in combination.
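As a sketch of what such cell options can look like, here is a hypothetical plot cell reusing the penguins data from earlier in this tutorial (the label and caption are made up for illustration and are not part of the supplementary material notebook):

#| label: fig-bill-size
#| fig-cap: "Bill size versus flipper length, coloured by island"
sns.scatterplot(data=penguins, x="bill_size", y="flipper_length_mm", hue="island")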

+

Let’s go back to the YAML header cell and fix how it looks in the Jupyter notebook. The reason it looks weird is that Jupyter doesn’t understand the syntax. But luckily there’s a Jupyter lab Quarto extension you can install to fix this. Click the extension icon in the left sidebar and search for quarto. Install the jupyterlab-quarto extension and then reload the page. Now the YAML header should look a lot better.

+

Try adding more options to the header to customize the look of the rendered document. For instance you could:

+
    +
  • add a table of contents (toc: true)
  • +
  • try out different themes
  • +
  • add your name as author (author: Your Name)
  • +
  • add a date (date: last-modified)
  • +
+

and much more.

+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
    +
  • How to use Quarto to render Jupyter notebooks to high-quality reports.
  • +
+
+
+
+
+

16 Extra material

+

The following material contains some additional tips and tricks on how to use Jupyter notebooks. This is not part of the core of the Jupyter material and you can choose what you want to go through, or skip it entirely.

+

Here are some useful resources if you want to read more about Jupyter in general:

+ +
+
+

17 Running Jupyter notebooks on a cluster

+
    +
  • Log in to Uppmax, making sure to use a specific login node, e.g. rackham1:
  • +
+
ssh <your-user-name>@rackham1.uppmax.uu.se
+
    +
  • Create/activate a Conda environment containing jupyter, e.g.:
  • +
+
conda create -n jupyter -c conda-forge jupyter
+
    +
  • activate the environment, then run:
  • +
+
jupyter notebook --no-browser
+

When the Jupyter server starts up you should see something resembling:

+
[I 2023-11-13 22:15:36.944 ServerApp] Serving notebooks from local directory: <path-to-your-directory>
+[I 2023-11-13 22:15:36.944 ServerApp] Jupyter Server 2.10.0 is running at:
+[I 2023-11-13 22:15:36.944 ServerApp] http://localhost:8888/tree?token=25fa07e89b7c0bc2e518f259ba79c67847ca813cdf4eeed6
+[I 2023-11-13 22:15:36.944 ServerApp]     http://127.0.0.1:8888/tree?token=25fa07e89b7c0bc2e518f259ba79c67847ca813cdf4eeed6
+[I 2023-11-13 22:15:36.944 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
+

Now a Jupyter notebook server is running on the Uppmax end. The line that says:

+
[I 2023-11-13 22:15:36.944 ServerApp] http://localhost:8888/tree?token=25fa07e89b7c0bc2e518f259ba79c67847ca813cdf4eeed6
+

This line contains information on the port used on the server side (8888 in this case) and the token required to use the server (25fa07e89b7c0bc2e518f259ba79c67847ca813cdf4eeed6).

+

The next step is to use this information to log in to the server from your local computer.

+

On your local computer

+

In a terminal, run the following command to start port forwarding of port 8080 on your local computer to the remote port on the Uppmax side. Replace <remote-port> with the port given when you started the server on Uppmax, and <your-user-name> with your user name on Uppmax.

+
ssh -N -L localhost:8080:localhost:<remote-port> <your-user-name>@rackham1.uppmax.uu.se
+

The port forwarding is active for as long as this process is running. To disable it, simply interrupt it with CTRL + C.

+

Connect to the Jupyter server by opening localhost:8080 in your browser. When prompted, paste the token you got when starting the server on Uppmax and set a new password.

+
+
+

18 Using Binder to share interactive notebooks

+

Binder is a service that allows you to share Jupyter notebooks with others, while also allowing them to run the notebooks in the browser. This is great if you wish to share an analysis and have others interact with the code and results, without them having to install anything locally. What you will need is:

+
    +
  1. A public GitHub repository containing the notebooks you want to share.
  2. +
  3. An environment.yml file in the repository containing the Conda environment required to run the notebooks.
  4. +
  5. Data files (if any) required to run the notebook(s).
  6. +
+

Binder will then create a Docker image containing the Conda environment and the notebooks, and run a Jupyter server on this image. The Docker image is then hosted on the Binder server and can be used by anyone with the link to the repository to run the notebooks interactively in their browser.
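The exact contents of the environment.yml file depend on what your notebooks need, but a minimal sketch for a notebook using pandas and seaborn could look something like this (the Python version and package list here are only an example):

channels:
  - conda-forge
dependencies:
  - python=3.11
  - jupyter
  - pandas
  - seaborn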

+

To show you an example we’ve created a basic GitHub repository containing the supplementary_material.ipynb notebook from the previous section. If you go to the repository you will see a badge saying “launch binder”, click this to start the Binder server. This will take a few minutes the first time you do it, but after that it should be faster. When the server is ready you will be presented with the now familiar Jupyter interface. Go ahead and open up the supplementary_material.ipynb notebook and run it.

+

You can now interact with the notebook as you would if you had it running on a local Jupyter server. You can change the code, run it, and see the results. You can also add new cells and write new code. However, you cannot save the changes you make to the notebook.

+

To read more about Binder and how to use it, see the Binder documentation. For pointers on how to make data available to the notebooks you share via Binder, see this guide on Accessing data in your Binder.

Working with Nextflow

How to create reproducible workflows and computational pipelines

Published

15-Oct-2024

1 Introduction

+

Nextflow is a workflow management system (WfMS), and is one of the most common such systems within the bioinformatics and academic communities. These systems are important for scientific reproducibility in that they greatly facilitate keeping track of which files have been processed in what way throughout an entire project.

+

Nextflow is built from the ground up to be portable, scalable, reproducible and platform-agnostic. This means that any workflow you write in Nextflow can be run locally on your laptop, on a computer cluster or on a cloud service (as long as your platform has the necessary computational resources). You can also define the compute environment in which each task is carried out on a per-task basis. You might thus develop your workflow on your local computer using a minimal test dataset, but run the full analyses with all samples on e.g. a computer cluster. Nextflow can work on both files and arbitrary values, oftentimes connected in useful and advanced ways.

+

Nextflow can easily work with dynamic inputs where the exact output is unknown, e.g. the exact number of files or which samples pass some arbitrary quality control threshold. While Nextflow is based on the Groovy language, you don’t need to know how to code Groovy to be able to write good Nextflow workflows. Nextflow has a large community centred around it, including the nf-core curated collection of high quality pipelines used by e.g. the National Genomics Infrastructure.

+

This tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to set it up if you haven’t done so already, then open up a terminal and go to workshop-reproducible-research/tutorials/nextflow and activate your nextflow-env Conda environment.

+
+
+

2 The basics

+

We’ll start by creating a very simple workflow from scratch, to show how Nextflow works: it will take two input files and convert them to UPPERCASE letters.

+
    +
  • Start by running the following commands:
  • +
+
touch main.nf
+echo "This is a.txt" > a.txt
+echo "This is b.txt" > b.txt
+

Open the main.nf file with an editor of your choice. This is the main workflow file used in Nextflow, where workflows and their processes are defined.

+
    +
  • Copy the following code into your main.nf file:
  • +
+
// Workflow definition
+workflow {
+    // Define input files
+    ch_input = Channel.fromPath( "a.txt" )
+
+    // Run workflow
+    CONVERT_TO_UPPER_CASE( ch_input )
+}
+
+// Process definition
+process CONVERT_TO_UPPER_CASE {
+    publishDir "results/",
+        mode: "copy"
+
+    input:
+    path(file)
+
+    output:
+    path("a.upper.txt")
+
+    script:
+    """
+    tr [a-z] [A-Z] < ${file} > a.upper.txt
+    """
+}
+

Here we have two separate parts. The first is the workflow definition, while the second is a process definition. Let’s go through them both in more detail!

+
+
+
+ +
+
+Nextflow comments +
+
+
+

Double-slashes (//) are used for comments in Nextflow.

+
+
+
+
+
+ +
+
+Nextflow and whitespace +
+
+
+

Nextflow is not indentation-sensitive. In fact, Nextflow doesn’t care at all about whitespace, so go ahead and use it in whatever manner you think is easiest to read and work with! Do keep in mind that indentation and other types of whitespace do improve readability, so it’s generally not a good idea to forego them entirely, even though you can.

+
+
+
+

2.1 Workflow definitions

+
workflow {
+    // Define input files
+    ch_input = Channel.fromPath( "a.txt" )
+
+    // Run workflow
+    CONVERT_TO_UPPER_CASE( ch_input )
+}
+

The workflow definition here has two parts, each doing an important job for any Nextflow workflow. The first part defines a channel, which is an asynchronous first-in-first-out stream of data that connects a workflow’s various inputs and outputs. In simpler terms, channels contain the data that you want to process with the workflow and can be passed between the various parts of the workflow.

+

Channels can be created in various different ways using channel factories, depending on what type of data you want to put into them and where this data is stored. In this particular case we define our ch_input channel using the .fromPath channel factory, which takes a file path as input - here we use the a.txt file. You can thus read ch_input = Channel.fromPath("a.txt") as “create the channel ch_input and send the file a.txt into it”.
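Channels are not limited to files; as a small aside (not needed for this workflow), the of channel factory creates a channel from plain values given directly:

// A channel containing three string values
ch_letters = Channel.of( "x", "y", "z" )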

+
+
+
+ +
+
+Naming channels +
+
+
+

A channel can be named anything you like, but it is good practice to prepend them with ch_, as that makes it clear which variables are channels and which are just normal variables.

+
+
+

How do we use these channels then? Channels pass data to and from processes through our workflow. By providing channels as arguments to processes, we describe how we want data to flow. This is exactly what we do in the second part: we call our CONVERT_TO_UPPER_CASE process with the ch_input as input argument - this is very similar to functional programming.

+

This is our entire workflow, for now: the creation of a channel followed by using the contents of that channel as input to a single process. Let’s look at how processes themselves are defined!

+
+
+

2.2 Process definitions

+
process CONVERT_TO_UPPER_CASE {
+    publishDir "results/",
+        mode: "copy"
+
+    input:
+    path(file)
+
+    output:
+    path("a.upper.txt")
+
+    script:
+    """
+    tr [a-z] [A-Z] < ${file} > a.upper.txt
+    """
+}
+

Looking at the process in the code above, we can see several parts. The process block starts with its name, in this case CONVERT_TO_UPPER_CASE, followed by several sections, or directives as Nextflow calls them: publishDir, input, output and script.

+
+
+
+ +
+
+Naming processes +
+
+
+

A process can be named using any case, but a commonly used convention is to use UPPERCASE letters for processes to visually distinguish them in the workflow. You do not have to follow this if you don’t want to, but we do so here.

+
+
+

Let’s start with the first directive: publishDir. This tells Nextflow where the output of the process should be placed when it is finished. Setting mode to "copy" just means that we want to copy the output files to the publishing directory, rather than using a symbolic link (which is the default).

+

The input and output directives describe the data expected to come through this specific process. Each line of input describes the data expected for each process argument, in the order used in the workflow. In this case, CONVERT_TO_UPPER_CASE expects a single channel (one line of input), and expects the data to be filenames ( i.e. of type path). The script directive is where you put the code that the process should execute.

+

Notice that there is a difference between how the inputs and outputs are declared: the output is an explicit string (i.e. surrounded by quotes), while the input is a variable named file. This means inputs can be referenced in the process without naming the data explicitly, unlike the output, where the name needs to be explicit. We’ll get back to exactly how this works in just a moment. While the name of the input variable here is chosen to be the descriptive file, we could also have chosen something completely different, e.g. banana (we’d also have to change its reference in the script directive).

+
+
+

2.3 Executing workflows

+

Let’s try running the workflow we just created!

+
    +
  • Type the following in your terminal:
  • +
+
nextflow run main.nf
+

This will make Nextflow run the workflow specified in your main.nf file. You should see something along these lines:

+
N E X T F L O W  ~  version 22.10.6
+Launching `./main.nf` [mad_legentil] - revision: 87f0c253ed
+executor >  local (1)
+[32/9124a1] process > CONVERT_TO_UPPER_CASE (1) [100%] 1 of 1 ✔
+

The first few lines are information about this particular run, including the Nextflow version used, which workflow definition file was used, a randomly generated run name (an adjective and a scientist), the revision ID as well as where the processes were executed (locally, in this case, as opposed to e.g. SLURM or AWS).

+

What follows next is a list of all the various processes for this particular workflow. The order does not necessarily reflect the order of execution (depending on each process’ input and output dependencies), but they are in the order they were defined in the workflow file - there’s only the one process here, of course. The first part (e.g. [32/9124a1]) is the process ID, which is also the first part of the subdirectory in which the process is run (the full subdirectory will be something like 32/9124a1dj56n2346236245i2343, so just a longer hash). We then get the process and its name. Lastly, we get how many instances of each process are currently running or have finished. Here we only have the one process, of course, but this will soon change.

+
    +
  • Let’s check that everything worked: type ls results/ and see that it contains the output we expected.

  • +
  • Let’s explore the working directory: change into whatever directory is specified by the process ID (your equivalent to work/32/9124a1[...]).

  • +
+

What do you see when you list the contents of this directory? You should see a symbolic link named a.txt pointing to the real location of this file, plus a normal file a.upper.txt, which is the output of the process that was run in this directory. You generally only move into these work directories when debugging errors in your workflow, and Nextflow has some tricks to make this process a lot easier - more on this later.

+

So, in summary: we have three components: a set of inputs stored in a channel, a set of processes and a workflow that defines which processes should be run in what order. We tell Nextflow to push the inputs through the entire workflow, so to speak.

+
    +
  • Now it’s your turn! Move back to the workflow root and make it use only the b.txt input file and give you the b.upper.txt instead.

  • +
  • Run your workflow and make sure it works before you move on; check below if you’re having trouble.

  • +
+
+ +
+
+
ch_input = Channel.fromPath( "b.txt" )
+
+
+
+
+
+

2.4 Viewing channel contents

+

Something that’s highly useful during development of Nextflow workflows is to view the contents of channels, which can be done with the view() operator.

+
    +
  • Add the following to your workflow definition (on a new line) and execute the workflow: ch_input.view(). What do you see?

  • +
  • Remove the view() operator once you’re done.

  • +
+

It can be quite helpful to view the channel contents whenever you’re unsure of what a channel contains or if you’ve run into some kind of bug or error, or even just when you’re adding something new to your workflow. Remember to view the channel contents whenever you need to during the rest of this tutorial!
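For reference, the view() call can go on any line after the channel has been created, for example (a sketch of the workflow definition with the operator added):

workflow {
    // Define input files
    ch_input = Channel.fromPath( "a.txt" )

    // Print the channel contents, e.g. /path/to/a.txt
    ch_input.view()

    // Run workflow
    CONVERT_TO_UPPER_CASE( ch_input )
}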

+
+
+

2.5 Files and sample names

+

One powerful feature of Nextflow is that it can handle complex data structures as input, and not only filenames. One of the more useful things this allows us to do is to couple sample names with their respective data files inside channels.

+
    +
  • Change the channel definition to the following:
  • +
+
ch_input = Channel
+    .fromPath ( "a.txt" )
+    .map      { file -> tuple(file.getBaseName(), file) }
+

Here we use the map operator to create a tuple (something containing multiple parts) from the base name of the file (a) and the file path (a.txt). The statement .map{ file -> tuple(file.getBaseName(), file) } can thus be read as “replace the channel’s contents with a tuple containing the base name and the file path”. The contents of the channel thus change from [a.txt] to [a, a.txt]. Passing the sample name or ID together with the sample data in this way is extremely useful in a workflow context and can greatly simplify downstream processes.

+

Before this will work, however, we have to change the process itself to make use of this new information contained in the ch_input channel.

+
    +
  • Change the process definition to the following:
  • +
+
process CONVERT_TO_UPPER_CASE {
+    publishDir "results/",
+        mode: "copy"
+
+    input:
+    tuple val(sample), path(file)
+
+    output:
+    path("${sample}.upper.txt")
+
+    script:
+    """
+    tr [a-z] [A-Z] < ${file} > ${sample}.upper.txt
+    """
+}
+

Notice how the input now is aware that we’re passing a tuple as input, which allows us to use both the file variable (as before) and the new sample variable. All that’s left now is to change the input to our pipeline!

+
    +
  • Change the channel definition line from .fromPath ( "a.txt" ) to .fromPath ( ["a.txt", "b.txt"] ) and try running the pipeline. Make sure it works before you move on! Remember to use the view() operator if you want to inspect the channel contents in detail.
  • +
+
+
+

2.6 Input from samplesheets

+

So far we’ve been specifying inputs using strings inside the workflow itself, but hard-coding inputs like this is not ideal. A better solution is to use samplesheets instead, e.g. comma- or tab-separated data files; this is standard for many pipelines, including nf-core. Take, for example, the following CSV file:

+
a,a.txt
+b,b.txt
+

This specifies the samples and their respective files on each row. Using such a file is much more portable, scalable and overall easier to use than simply hard-coding things in the workflow definition itself. We might also include an arbitrary number of additional metadata columns, useful for downstream processing and analyses. Using the contents of files as input can be done using the .splitCsv() and .map{} operators, like so:

+
ch_input = Channel
+    .fromPath ( "first_samplesheet.csv" )
+    .splitCsv ( )
+    .map      { row -> tuple(row[0], file(row[1])) }
+

The .splitCsv() operator lets the channel know the input is a CSV file, while the .map{} operator turns each CSV row into a tuple of its first and second elements.

+
    +
  • Change the input channel definition to the code above and create the first_samplesheet.csv file as shown above.

  • +
  • Add the .view() operator somewhere to show the contents of ch_input.

  • +
  • Execute the pipeline. Do you see what you expect? Remove the .view() operator before moving on.

  • +
+
+
+
+ +
+
+Note +
+
+
+

While we are still hard-coding the name of the samplesheet, it is much better to edit a samplesheet than to edit the pipeline itself - there are also convenient ways to work around this using parameters, which we’ll talk more about later in this tutorial.

+
+
+

We can also specify a header in our samplesheet like so: .splitCsv(header: true). This will allow us to reference the columns using their names instead of their index, e.g. row.col1 instead of row[0].

+
    +
  • Add an appropriate header to your samplesheet, make sure your workflow can read it and execute. Use .view() to see what’s going on, if needed.
  • +
+
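If you get stuck, one possible solution (assuming you name the header columns sample and file; any names work as long as the map{} operator uses the same ones) could look like this:

// first_samplesheet.csv now starts with a header line:
// sample,file
// a,a.txt
// b,b.txt
ch_input = Channel
    .fromPath ( "first_samplesheet.csv" )
    .splitCsv ( header: true )
    .map      { row -> tuple(row.sample, file(row.file)) }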
+
+

2.7 Adding more processes

+

It’s time to add more processes to our workflow! We have the two files a.upper.txt and b.upper.txt; the next part of the workflow is a step that concatenates the content of all these UPPERCASE files.

+

We already have a channel containing the two files we need: the output of the CONVERT_TO_UPPER_CASE process called CONVERT_TO_UPPER_CASE.out. We can use this output as input to a new process using the syntax: CONVERT_TO_UPPER_CASE.out.collect(). The collect() operator groups all the outputs in the channel into a single data object for the next process. This is a many-to-one type of operation: a stream with several files (many) is merged into a lone list of files (one). If collect() was not used, the next process would try to run a task for each file in the output channel.
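If you want to see this difference for yourself, you could temporarily add view() calls (optional, using only operators we’ve already covered):

CONVERT_TO_UPPER_CASE.out.view()            // emits one item per file, e.g. a.upper.txt and b.upper.txt
CONVERT_TO_UPPER_CASE.out.collect().view()  // emits a single item: [a.upper.txt, b.upper.txt]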

+

Let’s put this in use by adding a new process to the workflow definition. We’ll call this process CONCATENATE_FILES and it will take the output from CONVERT_TO_UPPER_CASE as input, grouped using the collect() operator.

+
    +
  • Add a line to your workflow definition for this new process with the appropriate input - remember that you can use .view() to check channel contents; click below if you’re having trouble.
  • +
+
+ +
+
+
CONCATENATE_FILES( CONVERT_TO_UPPER_CASE.out.collect() )
+
+
+
+

Now all we have to do is define the actual CONCATENATE_FILES process in the process definition section.

+
    +
  • Copy the following code as a new process into your workflow:
  • +
+
process CONCATENATE_FILES {
+    publishDir "results/",
+        mode: "copy"
+
+    input:
+    path(files)
+
+    output:
+    path("*.txt")
+
+    script:
+    """
+    cat ${files} > concat.txt
+    """
+}
+
    +
  • Run your workflow again and check the results/ directory. At this point you should have three files there: a.upper.txt, b.upper.txt and concat.txt.

  • +
  • Inspect the contents of concat.txt - do you see everything as you expected?

  • +
+

Note the use of path(files) as input. Although we pass a list of files as input, the list is considered a single object, and so the files variable references a list. Each file in that list can be individually accessed using an index e.g. ${files[0]}, or as we do here, use the variable without an index to list all the input files.

+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learnt:

+
    +
  • How to create, execute and extend workflows
  • +
  • How to explore the work directory and channel contents
  • +
  • How to couple sample names to sample data files
  • +
  • How to use samplesheets as input
  • +
  • How to collect multiple files as single inputs for processes
  • +
+
+
+
+
+
+

3 Executing workflows

+

It’s time to start working with a more realistic workflow using the MRSA case study of this course! We’ve created a bare-bones version of this pipeline for you, but we’ll work our way through it as we go along and learn more about Nextflow’s features and functionality. The MRSA workflow looks like this:

+
workflow {
+
+    // Workflow for generating count data for the MRSA case study
+
+    // Get input files from a samplesheet
+    ch_input = Channel
+        .fromPath ( "samplesheet.csv" )
+        .splitCsv ( header: true)
+
+    // Define the workflow
+    DOWNLOAD_FASTQ_FILES (
+        ch_input
+    )
+    RUN_FASTQC (
+        DOWNLOAD_FASTQ_FILES.out
+    )
+    RUN_MULTIQC (
+        RUN_FASTQC.out[1].collect()
+    )
+    GET_GENOME_FASTA ()
+    INDEX_GENOME (
+        GET_GENOME_FASTA.out.fasta
+    )
+    ALIGN_TO_GENOME (
+        DOWNLOAD_FASTQ_FILES.out,
+        INDEX_GENOME.out.index
+    )
+    SORT_BAM (
+        ALIGN_TO_GENOME.out.bam
+    )
+    GET_GENOME_GFF3 ()
+    GENERATE_COUNTS_TABLE (
+        SORT_BAM.out.bam.collect(),
+        GET_GENOME_GFF3.out.gff
+    )
+}
+

The workflow has one input channel named ch_input, which reads input from the samplesheet.csv file. We then define the processes to be executed by this workflow, nine in total. The first process (DOWNLOAD_FASTQ_FILES) takes the ch_input channel as input, while the rest of the processes take the output of previous processes as input. Before we go into more detail regarding the ins-and-outs of this workflow, let’s start with some specifics of how workflows are executed and what you can get from them.

+
+

3.1 Reports and visualisations

+

Let’s start with running the workflow plus getting some reports and visualisation while we’re at it!

+
    +
  • Run the workflow using the following command: nextflow run main_mrsa.nf -with-report report.html -with-timeline timeline.html -with-dag dag.png.
  • +
+

After successful execution, you will find three more files in your current directory: report.html, timeline.html and dag.png. The first file contains a workflow report, which includes various information regarding execution such as runtime, resource usage and details about the different processes. The second file contains a timeline showing how long each individual process took to execute, while the last contains a visualisation of the workflow itself.

+

Take a few minutes to browse these files for yourself. When running a workflow you can of course choose which of these additional files you want to include by picking which ones are important or interesting to you - or don’t include any!

+
+
+

3.2 Logs

+

Nextflow keeps a log of all the workflows that have been executed. Let’s check it out!

+
    +
  • Type nextflow log to get a list of all the executions.
  • +
+

Here we get information about when the workflow was executed, how long it ran, its run name, whether it succeeded or not and what command was used to run it. You can also use nextflow log <run name> to show the directory of each task that was executed for that run, and supply the -f (or -fields) flag to specify additional fields to show.

+
    +
  • Run nextflow log <run name> -f hash,name,exit,status
  • +
+

This shows us not only the beginning of each task’s working directory, but also its name, exit code and status (i.e. if it completed successfully or failed in some manner).

+
+
+
+ +
+
+Listing fields +
+
+
+

If you want to see a complete list of all the fields you might explore using the log, just type nextflow log -l or nextflow log -list-fields. This is highly useful for debugging when there’s some specific information about a run you’re particularly interested in!

+
+
+

We can also get even more detailed information about the latest run by looking into the .nextflow.log file!

+
    +
  • Look into the latest log by typing less .nextflow.log.
  • +
+

You’ll be greeted by a wealth of debugging information, which may even seem a bit overkill at this point! This level of detail is, however, quite useful both as a history of what you’ve attempted and as an additional help when you run into errors! Also, it helps with advanced debugging - which we’ll get into later.

+
+
+

3.3 Re-running workflows

+

Something you often want to do in Nextflow (or any WfMS for that matter) is to re-run the workflow when you changed some input files or some of the code for its analyses, but you don’t want to re-run the entire workflow from start to finish. Let’s find out how this works in Nextflow!

+
    +
  • Run the same nextflow run main_mrsa.nf command again.
  • +
+

You’ll notice that Nextflow actually re-ran the entire workflow from scratch, even though we didn’t change anything. This is the default behaviour of Nextflow.

+
    +
  • Let’s try that again: nextflow run main_mrsa.nf -resume instead.
  • +
+

Now you can see that Nextflow didn’t actually re-run anything. The -resume flag instructed Nextflow to use the cached results from the previous run!

+

Nextflow automatically keeps track of not only changes to input files, but also changes to code, process definitions and scripts. You can thus change anything relating to your workflow and just re-run with the -resume flag and be sure that only processes relevant to your changes are executed again!

+
    +
  • Use tree work/ to list the contents of the work directory.
  • +
+

Because Nextflow keeps track of all the runs, we’ve now got two sets of files in the work directory. One set from the first run, and another from the second run. This can take up valuable space, so let’s clean that up.

+
    +
  • Use nextflow clean -n -before <run_name> to show which work directories will be cleaned up (use nextflow log to find the run name if you don’t remember it). Then delete those directories by changing -n (dry-run) to -f (force).
  • +
+

Nextflow’s clean subcommand can be used to clean up failed tasks and unused processes. Here we used the -before flag, meaning that any runs before the specified run are removed; use nextflow help clean to see other options for cleaning. This is the preferred way to clean up the working directory.

+
    +
  • Remove the results directory and re-run the workflow again using the -resume flag.
  • +
+

We removed all the results we used before, but we still managed to resume the workflow and use its cache - how come? Remember that Nextflow uses the work directory to run all of its tasks, while the results directory is just where we have chosen to publish our outputs. We can thus delete the results directory as often as we like (a necessity when output filenames are changed) and still get everything back without having to re-run anything. If we were to delete the work directory, however…

+
    +
  • Delete the work directory and re-run the workflow using the -resume flag.
  • +
+

There is no longer any cache for Nextflow to use, so it re-runs from the start! This is good to keep in mind: you can always delete the output directories of your workflow, but if you mess with work you’ll lose, well… work!

+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learnt:

+
    +
  • How to get automatic reports and visualisations
  • +
  • How to check the Nextflow logs
  • +
  • How to re-run workflows
  • +
  • How to clean the Nextflow cache
  • +
+
+
+
+
+
+

4 Working with processes

+

Now that we’ve gone through the specifics of executing workflows in a bit more detail, let’s go through working with processes. While there are numerous process directives that can be used, we’ll go through some of the more commonly used ones here.

+
+

4.1 Tags

+

Let’s look at the command line output we got during the workflow’s execution, which should look something like this:

+
N E X T F L O W  ~  version 22.10.6
+Launching `./main.nf` [friendly_bhaskara] - revision: b4490b9201
+executor >  local (17)
+[c9/e5f818] process > DOWNLOAD_FASTQ_FILES (SRR935092) [100%] 3 of 3 ✔
+[d5/b5f24e] process > RUN_FASTQC (SRR935092)           [100%] 3 of 3 ✔
+[91/2cea54] process > RUN_MULTIQC                      [100%] 1 of 1 ✔
+[e0/b4fd37] process > GET_GENOME_FASTA                 [100%] 1 of 1 ✔
+[87/32ce10] process > INDEX_GENOME                     [100%] 1 of 1 ✔
+[56/e9a460] process > ALIGN_TO_GENOME (SRR935092)      [100%] 3 of 3 ✔
+[ed/d8c223] process > SORT_BAM (SRR935092)             [100%] 3 of 3 ✔
+[e7/4a6bda] process > GET_GENOME_GFF3                  [100%] 1 of 1 ✔
+[e9/84f093] process > GENERATE_COUNTS_TABLE            [100%] 1 of 1 ✔
+

Have you noticed that there are SRA IDs after some of the processes? Well, if you look at which processes show these SRA IDs you might see that it’s only those processes that are executed three times, i.e. once per SRA ID. This doesn’t happen automatically, however, and comes from something called tags. Let’s look at the DOWNLOAD_FASTQ_FILES process:

+
process DOWNLOAD_FASTQ_FILES {
+
+    // Download a single-read FASTQ file from the SciLifeLab Figshare remote
+
+    tag "${sra_id}"
+    publishDir "results/data",
+        mode: "copy"
+
+    input:
+    tuple val(sra_id), val(figshare_link)
+
+    output:
+    tuple val(sra_id), path("*.fastq.gz")
+
+    script:
+    """
+    wget ${figshare_link} -O ${sra_id}.fastq.gz
+    """
+}
+

You can see the tag directive at the very top of the process definition. Tags can be used to e.g. show information about the sample currently being analysed by the process. This is useful both during run-time (allowing you to see which sample is being processed) but also for debugging or finding problematic samples in case of errors or odd output. There is, naturally, no need to use tags for processes which are only run once.

+
    +
  • Comment out (prefix with //) the tag directive from the DOWNLOAD_FASTQ_FILES process and run the workflow again. What do you see?
  • +
+

Without the tag directive you should instead see the numbers 1 through 3, representing the input files (of which there are three). Nextflow still tells us that it’s working on one of the input files, but it’s generally much more useful to actually see the sample name or ID, rather than just a number.

+
    +
  • Uncomment the tag directive before you move on.
  • +
+
+
+

4.2 Named outputs

+

Let’s move on to the next process! It looks like this:

+
process RUN_FASTQC {
+
+    // Run FastQC on a FASTQ file.
+
+    tag "${sample}"
+    publishDir "results/",
+        mode: "copy"
+
+    input:
+    tuple val(sample), path(fastq)
+
+    output:
+    path("*.html")
+    path("*.zip")
+
+    script:
+    """
+    fastqc ${fastq} -q
+    """
+}
+

Here is a process with two output channels! One contains all the .html files, while the other contains all the .zip files. How is this handled in the workflow definition of downstream processes that use the outputs? The RUN_MULTIQC process uses this output, and its part in the workflow definition looks like this:

+
RUN_MULTIQC (
+    RUN_FASTQC.out[1].collect()
+)
+

We already know about .out and .collect(), but we have something new here: the RUN_MULTIQC process is taking the second channel of the output from the RUN_FASTQC process - [1] is the index for the second channel, as Groovy is zero-based (the first channel is indexed by [0]).

+

This comes with some issues, however. What if we accidentally changed the order of the outputs in the process, or added a new one? Using positions like this is easy to mess up, but there is a better solution: named outputs! This can be achieved by adding the emit option for some or all of the outputs, like so:

+
output:
+path("*.txt"), emit: text
+

Instead of referring to the output by its position in an array as before, we refer to the channel with a label we choose (.out.text) instead. This benefits us in that we can infer more about the channel contents from a label like text than from an index like [1], and it also makes us less error-prone when rewriting parts of a workflow.

+
    +
  • Your turn! Add named outputs to the RUN_FASTQC process and make RUN_MULTIQC use those outputs. You’ll have to change both the output section of the RUN_FASTQC process, and the workflow definition section for RUN_MULTIQC. If you need help, see the hint below.
  • +
+
+ +
+
+
// Workflow definition for RUN_MULTIQC
+RUN_MULTIQC (
+    RUN_FASTQC.out.zip.collect()
+)
+
+// Output section of RUN_FASTQC
+output:
+path("*.html"), emit: html
+path("*.zip"),  emit: zip
+
+
+
+

Check if it works by executing the workflow.

+
+
+

4.3 Advanced publishing

+

So far we’ve only used the publishDir directive in a very simple way: specifying a directory and the mode to use when publishing (to copy the files rather than symbolically link them). There are more things you can do, however, especially for processes with more than one output. For example, we can publish outputs in separate directories, like so:

+
publishDir "results/tables",
+    pattern: "*.tsv",
+    mode: "copy"
+publishDir "results/logs",
+    pattern: "*.log",
+    mode: "copy"
+

In this example, *.tsv files are copied to the folder results/tables/, while *.log files are copied to the folder results/logs. The publishDir directive can be used multiple times in a single process, allowing one to separate output as above, or publish the same output to multiple folders.

+
    +
  • Edit the RUN_FASTQC process to place the HTML and compressed files in separate directories. Remove the results directory and re-run the workflow to check that it worked - click below if you’re having trouble.
  • +
+
+ +
+
+
process RUN_FASTQC {
+
+    (...)
+
+    publishDir "results/fastqc/html",
+        pattern: "*.html",
+        mode: "copy"
+    publishDir "results/fastqc/zip",
+        pattern: "*.zip",
+        mode: "copy"
+
+    (...)
+}
+
+
+
+
+

Note that an output and a published output are different things: something can be an output of a process without being published. In fact, the RUN_FASTQC process is a prime example of this! Think about the compressed output: this output is only used by the downstream process RUN_MULTIQC and is never meant to be viewed by a human or used by a human in some downstream task not part of the pipeline itself. We would thus like to keep the compressed files as an output, but not publish said output. How do we do this? Just remove the corresponding publishDir directive!

+

The MRSA workflow we’ve made here was refactored directly from its original version in the Snakemake tutorial of this course, which means that its output structure is not fully taking advantage of some of Nextflow’s functionality. The compressed output we’ve already talked about above is one example.

+
    +
  • See if you can find any other processes in the current implementation of the MRSA workflow that you could optimise like this!
  • +
+

Think about whether all processes actually need to have published outputs. Make sure you test executing the workflow after you’ve made any changes; click below if you want a hint.

+
+ +
+
+

The GET_GENOME_FASTA and GET_GENOME_GFF3 processes both download reference files which are only needed by the workflow itself and do not need to be published; the same goes for the genome index generated by the INDEX_GENOME process.

+

One could argue that neither of the BAM files generated by the ALIGN_TO_GENOME and SORT_BAM processes is needed by the user if only the final counts table is of interest, but BAM files can also be useful for exploring the alignments in e.g. IGV. Publishing both BAM files is, however, definitely not needed: only the sorted one should be published if one is interested in BAM files.

+
+
+
+
+
+

4.4 Debugging

+

It is, sadly, inevitable that we all make mistakes while coding - nobody’s perfect! Nextflow helps you quite a bit when this happens, not just with its logs but also with informative error messages. Let’s introduce an error and look at what we get:

+
    +
  • Change the final output line in the RUN_MULTIQC process to the following and re-run the workflow: path("multiqc_general_stats.csv") - notice the usage of .csv rather than .txt as before.
  • +
+

We got an error! We get a number of things, actually, including (in order from the top) the name of the process that gave the error, the likely cause, the command that was executed, along with its exit status, output, error and the work directory that the task was run in. Let’s focus on the Caused by: part at the top, which should look something like this:

+
Caused by:
+  Missing output file(s) `multiqc_general_stats.csv` expected by process `RUN_MULTIQC`
+

We can also see that the command’s exit status is 0, which means that the command was successful; any exit status other than 0 means there was an error of some kind. We can thus infer that the command (1) worked, but (2) failed to give us the output expected by Nextflow. Thankfully, Nextflow graciously prints the work directory for us so that we may check out what happened in more detail.

+
    +
  • Copy the working directory path, cd into it and list its contents using ls.
  • +
+

You might already have spotted the error in the message above; the error we introduced here was that the expected output file has a .csv extension, rather than the correct .txt. Nextflow is expecting the .csv output, but the process script directive is (correctly) giving us the .txt file, which we can see inside the process’ work directory.

+
    +
  • Go back to the root directory, revert the error you introduced and re-run the workflow to make sure it works again.
  • +
+

This might have seemed like a trivial error, but a lot of errors in Nextflow can be solved in the same manner, i.e. by just following the debugging output reported by Nextflow and inspecting the specific subdirectory in question.

+
+
+
+ +
+
+A note about Bash +
+
+
+

If you are using Bash variables inside the script directive you have to be careful to prepend them with a backslash, e.g. \${BASH_VARIABLE}. This is because the dollar sign is used by Nextflow for its own variables, so you have to tell Nextflow explicitly when you’re using a Bash variable. This is a common source of errors, so keeping it in mind can save you some debugging time!
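As a minimal sketch (a hypothetical process, not part of the MRSA workflow), note the difference between the escaped Bash variable and the Nextflow variable below:

process COUNT_LINES {

    input:
    path(infile)

    output:
    path("linecount.txt")

    script:
    """
    # N_LINES is a Bash variable, so its dollar signs are escaped;
    # ${infile} is a Nextflow variable and is left as-is
    N_LINES=\$(wc -l < ${infile})
    echo "${infile} has \${N_LINES} lines" > linecount.txt
    """
}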

+
+
+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learnt:

+
    +
  • How to use the tag directive
  • +
  • How to use named output with emit
  • +
  • How to publish outputs into different directories
  • +
  • How to debug errors and mistakes
  • +
+
+
+
+
+
+

5 Workflow configuration

+

We’ve so far been working with a relatively non-generalised workflow: it’s got hard-coded inputs, paths and genome references. This is perfectly fine for a project that is purely aimed at getting reproducible results (which is the full extent of what you want in a lot of cases), but it can be made a lot more generalisable. Let’s go through the MRSA workflow and see what can be improved!

+
+

5.1 Parameters

+

One of the things that allow generalisability of Nextflow workflows is parameters, which hold information and values that can be changed directly on the command-line at the time of execution. One use of parameters in our MRSA workflow is to remove the hard-coded results output directory, for example. Parameters can be written in the following form:

+
params {
+    parameter_1 = "some/data/path"      // A string parameter
+    parameter_2 = 42                    // A value parameter
+    parameter_3 = ["a", "b", "c", "d"]  // A list parameter
+}
+

You would then refer to these parameters using e.g. params.parameter_1 anywhere you need to in the workflow. Although parameters can be defined in main_mrsa.nf, it is preferable to define them in a separate configuration file. The default name of this file is nextflow.config and if such a file is present it will be used automatically by Nextflow (to supply a config file with another name use nextflow -c <path-to-config-file> run main_mrsa.nf)

+
    +
  • Create a configuration file and add a parameter for the results output directory.

  • +
  • Use your newly created parameter in the publishDir directive of a process. Run your workflow to see if it worked; click below if you need help.

  • +
+
+ +
+
+
// Configuration file
+params {
+    outdir = "results"
+}
+
+// A publishDir directive in a process
+publishDir "${params.outdir}",
+    mode: "copy"
+
+
+
+
+
+

5.2 Command line parameters

+

Workflow parameters can be assigned on the command-line by executing workflows like so: nextflow run main_mrsa.nf --parameter_name 'some_value'. The workflow parameter parameter_name is prefixed by a double dash -- to tell Nextflow this is a parameter to the workflow (a single dash is a parameter to Nextflow, e.g. -resume). The value is also quoted (this is important for parameters that take file paths as values).

+
    +
  • Run your workflow using the parameter you previously created, but pick something other than the default value!
  • +
+

You should now have a new directory containing all the results! This is highly useful if you want to keep track of separate runs of a workflow with different software parameters, for example: nextflow run main.nf --important_param 'value1' --resultsdir 'results-value1', or simply want to keep the results of separate versions of the same workflow. You can also change parameters by using the -params-file option or by using another configuration file (and using -c), rather than on the command line!
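As a sketch of the -params-file option (the file name and contents here are just an example, reusing the outdir parameter from above), you could put the parameters in a small YAML file:

# params.yml
outdir: "results-value1"

and then run:

nextflow run main_mrsa.nf -params-file params.yml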

+
+
+

5.3 Configuring inputs

+

Remember the input for the MRSA workflow, the ch_input channel? This input (the samplesheet.csv file) is hard-coded inside the main_mrsa.nf file. This could also be made into a parameter!

+
    +
  • Change the definition of the ch_input channel to take the value of a new parameter of your choice, defined in the configuration file.
  • +
+

You should now have a more generalised input to your workflow! Try to run it to make sure it works - look below if you need some help.

+
+ +
+
+
// Channel definition
+ch_input = Channel
+    .fromPath ( params.input )
+    .splitCsv ( header: true )
+
+// Configuration file
+params {
+    input = "samplesheet.csv"
+}
+
+
+
+

By specifying inputs from sample sheets like this we can change inputs of a workflow execution by creating another sample sheet and specifying e.g., --input samplesheet-2.csv on the command line. This is highly useful when you want to run a single sample e.g., when testing a workflow, or when you want to keep track of all the different inputs you’ve used historically.

+
+
+

5.4 Other configuration scopes

+

There are lots of things that you might want to add to your configuration, not just parameters! The workflow manifest, for example, which might look like this:

+
manifest {
+    name        = "My Workflow"
+    description = "My awesome workflow, created by me"
+    author      = "Me"
+    mainScript  = "main.nf"
+    version     = "1.0.0"
+}
+
    +
  • Go ahead and add a workflow manifest to your nextflow.config file!
  • +
+

The manifest is useful when you’re publishing or sharing the workflow through e.g. GitHub or similar. There are many more such configuration scopes that you might want to use - read more about them in the documentation.
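One commonly used example (a sketch, not something this tutorial requires) is the process scope, which can set defaults such as the number of CPUs and the amount of memory for all processes in the workflow:

process {
    cpus   = 2
    memory = 4.GB
}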

+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we learnt:

+
    +
  • How to create parameters in a configuration file
  • +
  • How to specify parameters on the command line
  • +
  • How to add workflow manifest and other configuration scopes
  • +
+
+
+
+
+
+

6 Optimising the MRSA workflow

+

We just added several parameters and configurations to our MRSA workflow, but we didn’t do anything about the reference genomes: those are still hard-coded. The current MRSA workflow is, in fact, not very well-optimised for Nextflow at all, being a refactor from the Snakemake tutorial of this course.

+

All of the processes are basically unchanged, excluding some minor alterations. For example, the run_fastqc rule in Snakemake used the -o flag to specify that the results should be in the current directory, followed by moving the output files to their respective output directory. The first part is not needed in Nextflow (as everything is run in its own subdirectory), and the second part is done by the publishDir directive. These are just minor alterations, though, but we can do much more if we fully utilise Nextflow’s features!

+
+

6.1 Remote files

+

One of these features is the ability to automatically download remote files, without needing to explicitly do so! The path input type can handle either file paths (like we’ve done so far) or URIs with a supported protocol (such as http://, s3://, ftp://, etc.). This would be highly useful for e.g. the GET_GENOME_FASTA process - in fact, we don’t even need that process at all! All we need to do is to change the input to the INDEX_GENOME and ALIGN_TO_GENOME processes.

+
    +
  • Create a new input channel using the fromPath() channel factory and the absolute path (the FTP address) to the genome FASTA.

  • +
  • Make the INDEX_GENOME process use that input channel instead of the previously used output of the GET_GENOME_FASTA process.

  • +
  • Remove the GET_GENOME_FASTA process, as it is not needed anymore.

  • +
+

Re-run the workflow to see if it worked. Check the code below for an example if you’re stuck:

+
+ +
+
+
// Channel creation
+ch_genome_fasta = Channel.fromPath( "ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna/Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz" )
+
+// Workflow definition
+INDEX_GENOME (
+    ch_genome_fasta
+)
+
+
+
+

We could also do this using parameters from our config file, of course!

+
    +
  • Now change the input to the GENERATE_COUNTS_TABLE to use the remote GFF3 file and remove the GET_GENOME_GFF3 in the same manner as above, but using a new parameter instead.
  • +
+

Re-run the workflow again to make sure it worked; check below if you’re stuck.

+
+ +
+
+
// [ nextflow.config ]
+params {
+    genome_gff3 = "ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gff3/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.37.gff3.gz"
+}
+
+// [ main.nf ]
+// Channel creation
+ch_genome_gff3 = Channel.fromPath ( params.genome_gff3 )
+
+// Workflow definition
+GENERATE_COUNTS_TABLE (
+    SORT_BAM.out.bam.collect(),
+    ch_genome_gff3
+)
+
+
+
+

If we want to get even more detailed we can also change the hard-coded “NCTC8325” naming in e.g. the INDEX_GENOME process and put that in another parameter, or grab the baseName() from the channel and make a [prefix, file] tuple using the map{} operator like we did previously; check below if you’re curious about how this could be done.

+
+ +
+
+
// Channel definition
+ch_genome_fasta = Channel
+    .fromPath( "ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna/Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz" )
+    .map     { file -> tuple(file.getBaseName(), file) }
+
+// INDEX_GENOME process definition
+process INDEX_GENOME {
+
+    publishDir "results/bowtie2/",
+        mode: "copy"
+
+    input:
+    tuple val(fasta_name), path(fasta)
+
+    output:
+    path("*.b2t"), emit: index
+
+    script:
+    """
+    # Bowtie2 cannot use .gz, so unzip to a temporary file first
+    gunzip -c ${fasta} > tempfile
+    bowtie2-build tempfile ${fasta_name}
+    """
+}
+
+
+
+
+
+

6.2 Subworkflows

+

The DSL2 allows highly modular workflow design, where a workflow may contain multiple subworkflows. A subworkflow is just like a normal workflow, but it can be called inside other workflows, similar to a process. There is thus no formal difference between a subworkflow and a workflow; the difference lies only in how you use them in practice. Let’s take a look at a toy example:

+
workflow {
+    ch_input = Channel.fromPath ( params.input )
+    SUBWORKFLOW (
+        ch_input
+    )
+}
+
+workflow SUBWORKFLOW {
+
+    take:
+    input_file
+
+    main:
+    ALIGN_READS( input_file )
+
+    emit:
+    bam = ALIGN_READS.out.bam
+}
+

Here we have an unnamed, main workflow like before, plus a named subworkflow. A workflow can have inputs specified by the take directive, which is the equivalent of process input for workflows. The main part is the workflow body, which contains how to run which processes in which order. The last part, emit, also works the same as for processes, in that we name the different outputs of the workflow so that we may use them in other workflows or processes. Nextflow will run the unnamed workflow by default, unless the -entry flag is specified, like so:

+
nextflow run main.nf -entry SUBWORKFLOW
+

This will run the workflow named SUBWORKFLOW, but nothing else. You can also store subworkflows in separate files, so that everything doesn’t have to be crammed into a single main.nf file. A subworkflow named SUBWORKFLOW contained in the file subworkflow.nf can be loaded into a main.nf file like so:

+
include { SUBWORKFLOW } from "./subworkflow.nf"
+

If you have a complex workflow with several subworkflows you might thus store them in a separate directory, e.g. subworkflows/. This allows you to have fine-grained control over the general architecture of your Nextflow workflows, organising them in a manner that is easy to code and maintain. A process can also be treated in the same manner, and defined separately in another file.
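For example, assuming you keep processes in a processes/ directory (a hypothetical layout), including one could look like this:

// Include a single process from a separate file
include { RUN_FASTQC } from "./processes/run_fastqc.nf"

// The same definition can be included again under another name
include { RUN_FASTQC as RUN_FASTQC_TRIMMED } from "./processes/run_fastqc.nf"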

+
    +
  • Now it’s your turn! Separate the RUN_FASTQC and RUN_MULTIQC processes out of the main workflow and into a subworkflow. Check below if you’re having trouble.
  • +
+
+ +
+
+
// [ main.nf ]
+// Include subworkflow
+include { QUALITY_CONTROLS } from "./subworkflows/quality_controls.nf"
+
+// Main workflow
+QUALITY_CONTROLS (
+    DOWNLOAD_FASTQ_FILES.out
+)
+
+// [ subworkflows/quality_controls.nf ]
+// Quality controls subworkflow
+workflow QUALITY_CONTROLS {
+
+    take:
+    fastq
+
+    main:
+    RUN_FASTQC (
+        fastq
+    )
+    RUN_MULTIQC (
+        RUN_FASTQC.out.zip.collect()
+    )
+
+    emit:
+    html          = RUN_MULTIQC.out.html
+    general_stats = RUN_MULTIQC.out.general_stats
+}
+
+// [ Include RUN_FASTQC and RUN_MULTIQC processes here ]
+
+
+
+

If you want to challenge yourself, try to do the same with the INDEX_GENOME, ALIGN_TO_GENOME and SORT_BAM processes! Be careful of where you get your inputs and outputs; check below if you want one of the ways in which you can do this:

+
+ +
+
+
// [ main.nf ]
+// Include subworkflow
+include { ALIGNMENT } from "./subworkflows/alignment.nf"
+
+// Main workflow
+ALIGNMENT (
+    ch_genome_fasta,
+    DOWNLOAD_FASTQ_FILES.out
+)
+
+// [ subworkflows/alignment.nf ]
+// Alignment subworkflow
+workflow ALIGNMENT {
+
+    take:
+    fasta
+    fastq
+
+    main:
+    INDEX_GENOME (
+        fasta
+    )
+    ALIGN_TO_GENOME (
+        fastq,
+        INDEX_GENOME.out.index
+    )
+    SORT_BAM (
+        ALIGN_TO_GENOME.out.bam
+    )
+
+    emit:
+    bam = SORT_BAM.out.bam
+}
+
+// [ Include INDEX_GENOME, ALIGN_TO_GENOME and SORT_BAM processes here ]
+
+
+
+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we learnt:

+
    +
  • How to automatically download remote files
  • +
  • How to create and work with subworkflows
  • +
+
+
+
+
+
+

7 Extra material

+

There are many more things you can do with Nextflow than covered here. If you are interested to learn more details about Nextflow, we will briefly show some of its advanced features in this section. But first, here are some links to additional resources on Nextflow:

+ +
+

7.1 Using containers in Nextflow

+

Nextflow has built-in support for using both Docker and Apptainer containers (and others too), either with a single container for the workflow as a whole or separate containers for each individual process. The simplest way to do it is to have a single container for your entire workflow, in which case you simply run the workflow and specify the image you want to use, like so:

+
# Run with docker
+nextflow run main.nf -with-docker image-name
+
+# Run with Apptainer
+nextflow run main.nf -with-apptainer image.sif
+

If you don’t want to supply this at every execution, you can also add it directly to your configuration file:

+
// Docker configuration
+process.container = 'image-name'
+docker.enabled = true
+
+// Apptainer configuration
+process.container = 'path/to/image.sif'
+apptainer.enabled = true
+

If you instead would like to have each process use a different container you can use the container directive in your processes:

+
process PROCESS_01 {
+    (...)
+    container 'image_01'
+    (...)
+}
+
+process PROCESS_02 {
+    (...)
+    container 'image_02'
+    (...)
+}
+

Regardless of which solution you go for, Nextflow will execute all the processes inside the specified container. In practice, this means that Nextflow will automatically wrap your processes and run them by executing the Docker or Apptainer command with the image you have provided.

+
+
+

7.2 Using Conda in Nextflow

+

While you can execute Nextflow inside Conda environments just like you would any other type of software, you can also use Conda with Nextflow in the same way as for Docker and Apptainer above. You can either supply an environment.yml file, the path to an existing environment or the packages and their versions directly in the conda directive, like so:

+
process PROCESS_01 {
+    (...)
+    conda 'mrsa-environment.yml'
+    (...)
+}
+process PROCESS_02 {
+    (...)
+    conda 'path/to/mrsa-env'
+    (...)
+}
+process PROCESS_03 {
+    (...)
+    conda 'bioconda::bwa=0.7.17 bioconda::samtools=1.13'
+    (...)
+}
+

You can use either of the methods described above with your configuration file as well, here exemplified using an environment.yml file:

+
process.conda = 'mrsa-environment.yml'
+
+
+

7.3 Running Nextflow on Uppmax

+

A lot of researchers in Sweden are using the Uppmax computer cluster in Uppsala, which is easily handled by Nextflow. What you need to do is to add the following profile to your nextflow.config file:

+
profiles {
+    // Uppmax general profile
+    uppmax {
+        params {
+            account        = null
+        }
+        process {
+            executor       = 'slurm'
+            clusterOptions = "-A '${params.account}'"
+            memory         = { 6.GB * task.attempt }
+            cpus           = { 1 * task.attempt }
+            time           = { 10.h * task.attempt }
+            scratch        = '$SNIC_TMP'
+            errorStrategy  = 'retry'
+            maxRetries     = 1
+        }
+    }
+}
+

This will add a profile to your workflow, which you can access by running the workflow with -profile uppmax. You will also have to supply an extra parameter account which corresponds to your SNIC project account, but the rest you can leave as-is, unless you want to tinker with e.g. compute resource specifications. That’s all you need! Nextflow will take care of communications with SLURM (the system used by Uppmax, specified by the executor line) and will send off jobs to the cluster for you, and everything will look exactly the same way as if you were executing the pipeline locally.
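Running the workflow with this profile could then look something like the following (the account string is just a placeholder for your own compute project):

nextflow run main_mrsa.nf -profile uppmax --account "snic2022-X-XXX"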

+

The memory, cpus and time lines define the various resources Nextflow will use, as well as how much to automatically increase them by when retrying failed tasks; this retry behaviour is, in turn, specified by the errorStrategy and maxRetries variables. The scratch variable defines where each node’s local storage is situated, which gives Nextflow optimal access to the Uppmax file system for temporary files.

+
+
+

7.4 Advanced channel creation

+

The input data shown in the MRSA example workflow is not that complex, but Nextflow channels can do much more than that. A common scenario in high-throughput sequencing is that you have pairs of reads for each sample. Nextflow has a special, built-in way to create channels for this data type: the fromFilePairs channel factory:

+
ch_raw_reads = Channel
+    .fromFilePairs ( "data/*_R{1,2}.fastq.gz" )
+

This will create a channel containing all the reads in the data/ directory in the format <sample>_R1.fastq.gz and <sample>_R2.fastq.gz and will pair them together into a nested tuple looking like this:

+
[sample, [data/sample_R1.fastq.gz, data/sample_R2.fastq.gz]]
+

The first element of the tuple ([0]) thus contains the value sample, while the second element ([1]) contains another tuple with paths to both read files. This nested tuple can be passed into processes for e.g. read alignment, and it makes the entire procedure of going from read pairs (i.e. two separate files, one sample) into a single alignment file (one file, one sample) very simple. For more methods of reading in data see the Nextflow documentation on Channel Factories.
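To make this concrete, here is a minimal sketch of a process that could consume such a channel; the process name and the aligner command inside it are hypothetical placeholders:

// A hypothetical process consuming the [sample, [R1, R2]] tuples
process ALIGN_READS {

    input:
    tuple val(sample), path(reads)

    output:
    path("${sample}.sam"), emit: sam

    script:
    """
    # 'my_aligner' is a placeholder, not a real tool
    my_aligner --prefix ${sample} --reads ${reads} > ${sample}.sam
    """
}

// In the workflow block it could be called like so:
// ALIGN_READS ( ch_raw_reads )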

+

We can also do quite advanced things to manipulate data in channels, such as this:

+
samples_and_treatments = Channel
+    .fromPath ( params.metadata )
+    .splitCsv ( sep: "\t", header: true )
+    .map      { row -> tuple("${row.sample_id}", "${row.treatment}") }
+    .filter   { id, treatment -> treatment != "DMSO" }
+    .unique   ( )
+

That’s a bit of a handful! But what does it do? The first line specifies that we want to read some data from a file specified by the metadata parameter, and the second line actually reads that data using tab as delimiter, including a header. The map operator takes each entire row and subsets it to only two columns: the sample_id and treatment columns (discarding the other columns). This subset is stored as a tuple. The filter operator is then used to remove any tuples where the second entry (treatment) is not equal to the string "DMSO" (i.e. untreated cells, in this example). Finally, we only keep unique tuple values. Let’s say that this is the metadata we’re reading:

+
sample_id     dose    group     treatment
+sample_1      0.1     control   DMSO
+sample_1      1.0     control   DMSO
+sample_1      2.0     control   DMSO
+sample_2      0.1     case      vorinostat
+sample_2      1.0     case      vorinostat
+sample_2      2.0     case      vorinostat
+sample_3      0.1     case      fulvestrant
+sample_3      1.0     case      fulvestrant
+sample_3      2.0     case      fulvestrant
+

Given the channel creation strategy above, we would get the following result:

+
[sample_2, vorinostat]
+[sample_3, fulvestrant]
+

In this way, you can perform complex operations on input files or input metadata and send the resulting content to your downstream processes in a simple way. Composing data manipulations in Nextflow like this can be half the fun of writing the workflow. Check out Nextflow’s documentation on Channel operators to see the full list of channel operations at your disposal.

+
+
+

7.5 Using Groovy in processes

+

You don’t have to limit yourself to Bash or external scripts inside your processes: Nextflow is based on Groovy, which allows you to mix Groovy and Bash in the same process. For example, have a look at this:

+
process index_fasta {
+    tag "${fasta_name}"
+
+    input:
+    tuple val(fasta), path(fasta_file)
+
+    output:
+    path("${fasta_name}.idx"), emit: fasta
+
+    script:
+    fasta_name = fasta.substring(0, fasta.lastIndexOf("."))
+    """
+    index --ref ${fasta_file},${fasta_name}
+    """
+}
+

Here we have some command index that, for whatever reason, requires both the path to a FASTA file and the name of that file without the .fasta extension. We can use Groovy in the script directive together with normal Bash, mixing and matching as we like. The first line of the script directive gets the name of the FASTA file without the extension by removing anything after the dot, while the second calls the index command like normal using bash.

+
+
+

7.6 The nf-core pipeline collection

+

You may have heard of the nf-core pipeline collection previously, which is a large, collaborative bioinformatics community dedicated to building, developing and maintaining Nextflow workflows. In fact, if you have sequenced data at e.g. the National Genomics Infrastructure (NGI), you can be sure that the data processing has been run using one of the nf-core pipelines! While the community only started in 2018 (with a Nature Biotechnology paper in 2020), it already has over 30 production-ready pipelines covering everything from genomics and transcriptomics to proteomics and metagenomics - and more are being developed all the time.

+

The nf-core pipelines all work in the same way, in that they have the same exact base for inputs, parameters and arguments, making them all highly similar to run. Since you’ve already learnt the basics of Nextflow in this course, you should now be able to also run the nf-core pipelines! It might be that you have a data type that you can analyse using one of the pipelines in nf-core, meaning you don’t need to do anything other than find out what parameters you should run it with.

+

Each pipeline comes with extensive documentation and test datasets that you can use to practice on, and can be run on HPCs like Uppmax, cloud services like AWS, or locally on your own computer. All pipelines support both Conda and Docker/Apptainer, and you can additionally run specific versions of the pipelines, allowing for full reproducibility of your analyses. If you want to check nf-core out, simply head over to their list of pipelines and see what’s available! Who knows, you might even write your own nf-core pipeline in the future?
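Running an nf-core pipeline is typically just a one-liner; for example, something like the following could run the nf-core/rnaseq pipeline on its bundled test data (the pinned version is just an example - check the pipeline’s documentation for current releases and required parameters):

nextflow run nf-core/rnaseq -r 3.14.0 -profile test,docker --outdir results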

+ + +
+
+ +
+ +
+ + + + + + + \ No newline at end of file diff --git a/pages/quarto.html b/pages/quarto.html new file mode 100644 index 00000000..ca87f6b3 --- /dev/null +++ b/pages/quarto.html @@ -0,0 +1,1352 @@ + + + + + + + + + +Working with Quarto + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ +
+ +
+
+
+

Working with Quarto

+

How to generate reproducible reports and computational notebooks

+
+
+ + +
+ + +
+
Published
+
+

15-Oct-2024

+
+
+ + +
+ + +
+ + + + +
+ + + + + + +
+

1 Introduction

+

The Quarto format (.qmd) is a multi-functional format, which is especially useful for scientific coding and analyses. Quarto documents can be used both to save and execute code as well as to generate reports in various output formats. This is done by mixing markdown and so-called code chunks in the same document (we have course materials for markdown if you are unfamiliar with this format). The code itself as well as the output it generates can be included in the final report. Not only does Quarto work great for scientific coding, but it can also be used for things such as presentations and websites - this entire workshop website is, in fact, created using only Quarto!

+

Quarto makes your analysis more reproducible by connecting your code, figures and descriptive text. You can use it to make reproducible reports, rather than e.g. copy-pasting figures into a Word document. You can also use it as a notebook, in the same way as lab notebooks are used in a wet lab setting (or as we utilise Jupyter notebooks in the tutorial after this one). Quarto itself does not require any particular programming language to be installed - any language you want to use can be installed separately. The currently supported languages are R, Python, Julia and Observable. Quarto is fully compatible with both R Markdown and Jupyter documents.

+

This tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to set it up if you haven’t done so already. Place yourself in the workshop-reproducible-research/tutorials/quarto/ directory, activate your quarto-env Conda environment and start your text editor or IDE of choice.

+
+
+
+ +
+
+A note on R Markdown +
+
+
+

Quarto is an evolution of the R Markdown format, which was previously used in this course. While R Markdown is a widely-used and excellent tool for code and reports, Quarto is most easily thought of as “R Markdown 2.0”. If you’re familiar with R Markdown, you will find Quarto to be highly similar. The creators of both Quarto and R Markdown (Posit) have stated that R Markdown is not going to be deprecated, but most newer features will only come to Quarto. This means that if you’ve used R Markdown in the past now is a good time to make the switch, but you don’t have to. You can check out the Quarto website for more in-depth discussions regarding Quarto/R Markdown (dis-)similarities.

+
+
+
+
+

2 The basics

+

Let’s start by creating a basic Quarto document that we can work with.

+
+

2.1 Creating Quarto documents

+

Quarto documents are just plain text files with the .qmd extension. Create a new file called e.g. quarto-tutorial.qmd and copy the following into it:

+
---
+title: "Untitled Quarto Document"
+author: "Jane Doe"
+format: html
+---
+

This is a so-called YAML header, which is where we specify the general settings of the document in the form of key: value. The title and author are just what they sound like, while the format field specifies what type of output you want the final report to be in (alternatives include pdf, revealjs and many others). Here we have specified that we want HTML output, which is perhaps the most useful for scientific computing.

+
    +
  • Change the title to My first Quarto document and the author to your name.
  • +
+

Let’s add some actual content to the document, starting with some basic markdown:

+
    +
  • Add some text into your Quarto document (including an empty line between the YAML header and the text), e.g. the following:
  • +
+
This is my first Quarto document!
+
+# This is a header
+
+This is where I'll soon add some *code* related to the first header.
+

Let’s see what this document looks like when it’s rendered into HTML by Quarto:

+
    +
  • Go to the command line and type quarto render quarto-tutorial.qmd.
  • +
+
+
+
+ +
+
+Rendering +
+
+
+

If you’re using e.g. RStudio or VSCode to edit your Quarto document you might have access to a render button, which means you don’t have to run the above command from the command line if you prefer.

+
+
+

Open your new quarto-tutorial.html file that was created and see what it looks like. It’s only markdown content so far, so let’s add some R code using a code chunk:

+
```{r}
+Sys.Date()
+```
+

Notice that we delimit the code chunk from the rest of the document’s contents using three backticks (```) and specify the R language using curly brackets ({r}). The code itself just prints the current date.

+
    +
  • Render the document again and see what it looks like.
  • +
+

You can also name chunks by adding the name after the language:

+
```{r date}
+Sys.Date()
+```
+

This is useful for debugging when something has gone wrong, since it’ll be easier to see exactly in which code chunk an error happened (instead of the chunk just being referred to by a number).

+

We can also get in-line code using {r} <R CODE>, like so:

+
The current date is `{r} Sys.Date()`.
+
    +
  • Add the example above and render the document again to make sure it worked.
  • +
+
+
+

2.2 Previewing documents

+

Quarto has a highly useful command for when you’re working on a document: preview. It’s essentially a live preview of the document you’re working on that will automatically render when you introduce changes to the document.

+
    +
  • Type quarto preview quarto-tutorial.qmd in the command line.
  • +
+

Your default web browser should now have opened a new window with your rendered document, while your command line should say something like the following:

+
Watching files for changes
+Browse at http://localhost:4175/
+

You can’t type new commands at the moment, because the Quarto Preview command is still running - it’s watching for any new changes to the Quarto document you specified.

+
    +
  • Change or add some markdown text to your Quarto document, e.g. This is a code chunk instead of the previous text under the first header. Make sure you save the document.
  • +
+

The HTML document in your browser should have updated to reflect your newest changes automatically. Previewing documents is great when you want to have continuous feedback to the changes you make and can make the process of writing more seamless, since you don’t have to manually render all the time. Previewing will still render the entire document, however, meaning that if you have some heavy computations you might not want to re-render on every single save. For those cases you might instead prefer to stick with manual rendering when you are satisfied with multiple changes. You can abort a preview like any on-going command, e.g. using Ctrl-C.

+

In the rest of the tutorial it’s up to you whether you want to use preview or not - the tutorial will just mention when it’s time to render, you decide how that’s done.

+
+
+

2.3 Rendering to PDF

+

So far we’ve only rendered to HTML, but sometimes you prefer a PDF. This entails changing the format option in the YAML header:

+
    +
  • Change the format to pdf in the header and render your document.
  • +
+

You can add any raw LaTeX commands you want to your document when you’re rendering to PDF, e.g. \footnotesize to change the font size. You also have LaTeX-specific settings, such as setting the geometry for the whole document or specifying a citation method. While the details of LaTeX are outside the scope of this course, it’s useful to be aware of this functionality of Quarto so that you may use it if you already know LaTeX or if you want to learn it.

+

Switch back to HTML rendering before you move on.

+
+
+

2.4 Languages

+

The examples so far have been using R, but we could just as easily have used Python. All we have to do is to change our code chunk to specify {python} as language and its content to be the equivalent Python code:

+
```{python}
+from datetime import date
+print(date.today())
+```
+
    +
  • Change the code chunk to the above Python chunk instead and render your document again.
  • +
+
+
+
+ +
+
+A note on Python in-line code +
+
+
+

Quarto support for in-line Python code was added in version 1.4, so if you’re using an older version of Quarto simply remove the in-line code example. You can check your Quarto version by running quarto --version on the command line. As of this writing, the 1.4 version of Quarto can be obtained from the pre-release page: https://quarto.org/docs/download/prerelease

+

If you’re using Quarto version 1.4 or higher and want to try the in-line code example above for Python, change the line to:

+
The current date is `{python} date.strftime(date.today(), format="%Y-%m-%d")`
+
+
+

So far we’ve had Quarto automatically determine which language engine should be used, which it detects through the code chunks we’ve written. We can also do this explicitly by adding engine: knitr or engine: jupyter to the YAML header.

+
    +
  • Explicitly add engine: jupyter to your YAML header and render the document.
  • +
+
+
+
+ +
+
+Making sure your Jupyter engine is recognised +
+
+
+

Quarto attempts to identify a suitable Jupyter engine for your system when you include Python code chunks. However, if you want to use Jupyter available in a specific conda environment (e.g. your quarto-env environment) you need to take some extra steps. Please visit this link and follow steps 1-4. In the final step, check for the name of the kernel matching your quarto-env conda environment, e.g.

+
$ jupyter kernelspec list
+Available kernels:
+...
+ conda-env-quarto-env-py    /Users/<your-user-name>/Library/Jupyter/kernels/conda-env-quarto-env-py
+

Using the example output from above we can add the following to the YAML header of our Quarto document:

+
jupyter:
+ kernelspec:
+   display_name: Python 3
+   language: python
+   name: conda-env-quarto-env-py
+
+
+

It can be useful to explicitly set the language for the document, as it makes it clearer from just the YAML header what language will be used. There are also more language-related options for Quarto, but we’ll save those for later in the tutorial.

+
+
+
+ +
+
+Quick recap +
+
+
+

In this section you learned how to create, edit and render basic Quarto documents using different languages.

+
+
+
+
+
+

3 Code chunks

+

Sometimes you want to add chunk options to the code chunks in your Quarto documents. They are also in YAML format and are prefixed with a special type of comment (#|). It can look something like this:

+
```{python}
+#| echo: false
+from datetime import date
+print(date.today())
+```
+
    +
  • Add the chunk option above to your document and render the document again.
  • +
+

Notice how we no longer see the code itself, just the output? This is because the echo option specifies just that: whether we see the code or not. There are a number of such chunk options that are useful to know about:

+ ++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Chunk optionEffect
echoInclude the chunk code in the output.
evalEvaluate the code chunk.
outputInclude the results of executing the code in the output.
warningInclude warnings in the output.
errorInclude errors in the output (note that this implies that errors executing code will not halt processing of the document).
includePrevent both code and output from being included.
+
    +
  • Check what happens if you change echo: false to eval: false.
  • +
+

Now the code in the code chunk is not run, which means that if you previously added the python inline code it will no longer work because it depends on date from the datetime module that we import in the code chunk. Remove the inline code snippet if you added it. Then try rendering again. Now you should see the code itself but it won’t be run and therefore has no output.

+
+

3.1 Figure options

+

There are also options related to figures, but for that we need to actually have some code that produces a figure.

+
    +
  • Change the YAML header to use R instead of Python, remove the Python code chunk and replace it with the following (don’t worry if you don’t understand the R code itself, it’s just an example):
  • +
+
```{r}
+library("ggplot2")
+library("palmerpenguins")
+data(penguins, package = "palmerpenguins")
+ggplot(penguins, aes(x      = bill_length_mm,
+                     y      = body_mass_g,
+                     colour = species)) +
+    geom_point(size = 2) +
+    theme_bw() +
+    labs(x      = "Bill length (mm)",
+         y      = "Body mass (g)",
+         colour = "Species") +
+    ggtitle("Penguin weight and bill length") +
+    theme(plot.title = element_text(hjust = 0.5)) +
+    scale_colour_manual(values = c("#c1dea0", "#85be42", "#425f21"))
+```
+

When you’ve rendered the document you should see both the code and a figure using the Palmer Penguins dataset. You should also see a warning along the lines of Removed 2 rows containing missing values.

+
    +
  • Suppress the warning by adding #| warning: false as a chunk option and render.
  • +
+

There are two chunk options related to figure sizes: fig-width and fig-height (expressed in inches). These allow you to experiment with your figures and make them look the way you want.

+
    +
  • Add both the fig-width: 10 and fig-height: 5 chunk options and render.
  • +
+
+
+
+ +
+
+Note +
+
+
+

These two chunk options are only available when using the Knitr engine, not for Jupyter. There is a way to set these for the whole document with Jupyter, though, which we’ll talk more about in the next section of the tutorial.

+
+
+

You can also add captions and alt text using fig-cap and fig-alt, respectively.

+
    +
  • Add a suitable caption and alt text to the figure and render.
  • +
+

If you want to place the caption in the margin of your document you can use the cap-location chunk option.

+
    +
  • Add cap-location: margin to your chunk options and render.
  • +
+
+
+
+ +
+
+Note +
+
+
+

On some Quarto versions the cap-location option may not work as expected. If you experience this, try also adding #| label: fig-penguins to the chunk.

+
+
+
+
+

3.2 Cross-references

+

A convenient way to be able to refer to figures in text is by adding a figure label, which will automatically add a figure number before your caption.

+
    +
  • Add a suitable label, e.g. label: fig-penguins to the chunk options.
  • +
+

Cross-references use the @ symbol and the corresponding label. You can thus write some markdown outside of a code chunk and refer to e.g. @fig-penguins, as per the example here. This is extremely useful if you’re writing a paper or a report where you want to refer to figures and content in the markdown text. Quarto even adds a clickable link to the figure itself as well!

+
+
+

3.3 Sub-figures

+

It’s also possible to create sub-figures using Quarto, instead of using whatever plotting library you created the figures with.

+
    +
  • Add the following (almost identical) code at the bottom of the chunk you already have:
  • +
+
ggplot(penguins, aes(x      = bill_depth_mm,
+                     y      = body_mass_g,
+                     colour = species)) +
+    geom_point(size = 2) +
+    theme_bw() +
+    labs(x      = "Bill depth (mm)",
+         y      = "Body mass (g)",
+         colour = "Species") +
+    scale_colour_manual(values = c("#c1dea0", "#85be42", "#425f21"))
+
    +
  • Also add the following to the chunk options:
  • +
+
#| fig-subcap:
+#|     - Bill length vs. body mass
+#|     - Bill depth vs. body mass
+

You should now see that we have two figures with separate sub-captions as well as the overall figure caption we previously added. We can also control the layout of these figures using the layout-ncol chunk option.

+
    +
  • Add a layout-ncol: 2 chunk option and render the document.
  • +
+

We now have a different, two-column layout instead, but whether you prefer this or just a one-column layout is up to you.

+
+
+

3.4 Tables

+

Tables work much in the same way as figures. It might, in our example, be nice to add a table with the data we previously plotted.

+
    +
  • Add the following code chunk to your document and render it:
  • +
+
```{r}
+#| label: tbl-penguins
+#| tbl-cap: Palmer penguins bill length, width and body mass.
+#| tbl-cap-location: margin
+knitr::kable(
+    penguins[1:10, c("species", "bill_length_mm", "bill_depth_mm", "body_mass_g")],
+    col.names = c("Species", "Bill length (mm)", "Bill depth (mm)", "Body mass (g)")
+)
+```
+
+
+
+ +
+
+Quick recap +
+
+
+

In this section you learned several chunk, figure and table options, how cross-referencing works and how to add sub-figures.

+
+
+
+
+
+

4 Document options

+

So far we’ve mostly worked with chunk options, which are specific to the chunk they appear in. You can set many of these at the global document level, however, and there are also some options specifically for tailoring the document as a whole, regardless of chunk content.

+

We’ve already looked at some global options, such as title, author, format and engine. Something that would go nicely with the first two is the date option. You could just write the actual date if you like, or you can use the today option:

+
    +
  • Add the following to the options: date: today
  • +
+
+

4.1 Code folding

+

A useful option we haven’t touched already is the code-fold option. This and similar global options are specified nested inside the format option, like so:

+
format:
+    html:
+        code-fold: true
+
    +
  • Add the code-fold option to your document and render it.
  • +
+

This can be a nice default to use in scientific reports, as it hides the code by default but is always there for those who want to inspect it. You can also use the code-summary chunk option to specify a different text to show with the folded code instead of the default Code, e.g. code-summary: Click to show code.

+

If you want to add the code-summary option to all chunks you can add the following to the YAML header:

+
language:
+  code-summary: Click to show code
+

You can also add the code-tools option, which will add a drop-down menu to toggle visibility of all code as well as the ability to view the source of the document.

+
    +
  • Add the code-tools: true option and render the document.
  • +
+
+
+

4.2 Table of contents

+

Another useful document option is to add a table of contents, which can be done with the toc option. This will automatically populate the table of contents using the headers from your document.

+
    +
  • Add some more headings and/or sub-headings to your document.

  • +
  • Add the toc: true option to the html format and render.

  • +
+

The table of contents is to the right of the document by default, but you can change it using toc-location. The toc-depth allows you to control how many sub-heading levels are included in the table of contents.

+
    +
  • Add toc-location: left and toc-depth: 2 to your document and render it.
  • +
+

Having the table of contents on the left can be useful if you are using the margins for something, such as we are doing in this tutorial. You can similarly add section numbering using number-sections and number-depth. Smooth scrolling is not enabled by default, but you can add it using smooth-scroll: true. You can change the title of the table of contents using toc-title.

+
    +
  • Add section numbers, depth, smooth scrolling and a different table of contents title to your document and render it.
  • +
+
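Taken together, the table of contents-related part of the YAML header could end up looking something like this sketch (the values are arbitrary examples):

format:
    html:
        toc: true
        toc-location: left
        toc-depth: 2
        toc-title: Contents
        number-sections: true
        number-depth: 3
        smooth-scroll: true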
+
+

4.3 Themes

+

Quarto has a lot of themes available for it.

+
    +
  • Add theme: flatly under the HTML format option and render.
  • +
+

If you want to get real advanced you can play around with lots of details regarding the themes and adjust as you see fit, or even just create your own theme. This is a bit too advanced to go through here, but you can read about it more in the official documentation.

+
+
+

4.4 Global chunk options

+

The chunk options we learnt about in the previous section of this tutorial can also be specified on the global document level. Instead of specifying e.g. warning: false or fig-height: 5 in individual chunks we can add it to the main YAML header in the same manner as for e.g. code folding or table of contents. We’ll still have to specify options like labels or captions at the chunk-level, though.

+
    +
  • Add warning: false to your document header and remove it from the penguin figure chunk you already have.
  • +
+
+
+

4.5 Embedding HTML resources

+

When rendering HTML documents you get any figures and other resources in a <document-name>_files/ directory, which is not always desirable. It’s easier to move the HTML around if all figures etc. are embedded directly in the HTML itself, which can be done by specifying embed-resources: true in the HTML format options. This option is false by default, meaning that you’ll also have to include the previously mentioned directory if you want to share the HTML with anybody.

+
    +
  • Remove the <document-name>_files/ directory, refresh the rendered document and see what happens.

  • +
  • Add the embed-resources option and render your document again.

  • +
+

What happened first is that your figures should have disappeared when you deleted the resources directory. Embedding resources and rendering again should not re-create this directory, so now you’ll just have a stand-alone HTML file that is more portable than before.

+
+
+

4.6 Multiple formats

+

So far we’ve mostly been working with HTML output, but you don’t need to limit yourself to a single output format if you don’t want to.

+
    +
  • Add the docx: default line in the format: part of your YAML header and render your document.
  • +
+

You should have gotten two separate output files now: an HTML and a DOCX (Word) file. You can specify further options for any of the formats you include, instead of just using the default settings as in this example.

+
    +
  • Render your document again, but supply the --to html flag.
  • +
+

This will only render to the specified output format, which is highly useful when you want to write a Quarto document with more than one format but not always render them all.

+
+
+

4.7 Parameters

+

The last document-wide option we’ll touch on is parameters. This is useful for when you want to be able to run the same document with different parameters or options for some computations. How parameters are specified depends on which engine you’re using. With Knitr you can specify parameters using the params option:

+
    +
  • Add the following code to your YAML header:
  • +
+
params:
+    point_size: 2
+
    +
  • Also change the hard-coded geom_point(size = 2) to geom_point(size = params$point_size) in the two ggplot calls in the first code chunk.
  • +
+

We have thus specified a parameter called point_size in the YAML header and referred to it in the code using params$point_size. You can now change this parameter at run-time by supplying the -P <param>:<value> (or --execute-param) flag to quarto render.

+

Notice that this won’t work if you want to use a parameter to control e.g. a chunk option like layout-ncol. For this we need to use an in-line code expression: #| layout-ncol: !expr params$ncols.

+
    +
  • Add a parameter for the layout-ncol chunk option to the YAML header
  • +
  • Also add the layout-ncol chunk option to the figure chunk using the syntax above and render to make sure it works.
  • +
+

Note that to modify multiple parameters at run-time you have to use the -P param:value flag multiple times, like so:

+
quarto render quarto-tutorial.qmd -P point_size:4 -P ncols:1
+

If you’re using the Jupyter engine you can instead specify parameters by designating a single cell as a parameter cell, like so:

+
```{python}
+#| tags: [parameters]
+point_size = 2
+```
+

You can also specify parameters in a params.yml file and instruct quarto to use them with the --execute-params params.yml flag when rendering. Note that the parameters must be defined in the document (in the YAML header when using the knitr engine, or in a cell when using the jupyter engine). Pointing quarto to a params.yml file with --execute-params only overrides them when rendering.
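As a sketch, such a parameter file for the examples above and the corresponding render command could look like this (the file name and values are placeholders):

# params.yml
point_size: 4
ncols: 1

quarto render quarto-tutorial.qmd --execute-params params.yml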

+

Using parameters is extremely useful when you’re using a workflow manager system (e.g. Snakemake or Nextflow), since you can easily specify sample-specific parameters from the command line directly from your workflow manager.

+
+
+
+ +
+
+Quick recap +
+
+
+

In this sections we covered a number of document-wide options, including code-folding, table of contents, theming, HTML portability, using multiple output formats and parameters.

+
+
+
+
+
+

5 Presentations

+

Quarto can also be used to create presentations in multiple formats such as reveal.js (HTML), beamer (PDF) and pptx (PowerPoint) - the most powerful of these formats by far is the first one. Creating presentations with Quarto is quite similar to creating general Quarto documents, with some added features to keep in mind.

+
+

5.1 Slides

+

The first thing that’s needed for creating a presentation is deciding what constitutes a slide. The default is that slides are delimited by a document’s header levels.

+
    +
  • Render your document using the --to revealjs flag and open it.
  • +
+

You should now have the same document we’ve been working on for this tutorial in presentation format! You can step through the slides using the arrow keys, press F to go into full-screen mode, S to view speaker notes, M for the menu (you can also click in the lower left corner to get this menu) and ESC to go back.

+

If you’ve followed along you should have one level-1 header (#) and two level-2 headers (##). Notice that the level-1 header here will render as a blank page with just the header content on it, while the level-2 headers will render as normal slide headers. This all looks quite nice, and we didn’t even have to change a thing! Disregard that the table on the last slide doesn’t fit for now, we’ll get back to it later. Another method of delimiting slides is using a horizontal rule, ---, which allows you more fine-grained control over slides and their content (and is especially useful if you want to have a slide without a title).
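For example, a title-less slide delimited by a horizontal rule could look like this:

## A slide with a title

Some content.

---

This slide has no title, only content.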

+
+
+

5.2 Divisions

+

There are many ways you can add presentation-specific content to your slides, some of which you’d recognise from e.g. PowerPoint functionality.

+

So called “divisions” or “divs” allow you to control the appearance of content in your slides.

+

Let’s fix that issue with the table that was larger than the page. The problem here is one of content overflow, which can be fixed by adding a special {.smaller} div.

+
+
+
+ +
+
+Note +
+
+
+

Divs do not work for level-1 headings (starting with a single #).

+
+
+
    +
  • Add the {.smaller} div to the table header (it should read something like ## A table {.smaller}) and render.
  • +
+

That should have automatically re-sized the table to fit into the slide. Another way to solve this is to make slide content scrollable.

+
    +
  • Change the {.smaller} div to a {.scrollable} div and render.
  • +
+

Instead of re-sizing the table we now get the ability to scroll down it instead; whichever solution you prefer is up to you.

+

Adding divisions of various types like this is a common thing for Quarto presentations. Another common presentation-functionality is incremental lists, which can also be achieved with divisions. When adding a division to slide content we specify the division’s content in a manner similar to a code chunk, like in the following example:

+
## Penguin species
+
+::: {.incremental}
+ - Adelie
+ - Chinstrap
+ - Gentoo
+:::
+
    +
  • Add the code above to your document and render it.
  • +
+

Stepping through incremental content works the same as for stepping through slides, i.e. using the arrow keys.

+
    +
  • Render your document to html instead of revealjs.
  • +
+

Notice that Quarto rendered the HTML document just fine, even though you now have some presentation-specific code? This allows you to switch between the formats on-demand without having much overhead or format-specific code, which is great when you want to present your work without having to whip out a full-fledged presentation and all the work that goes into that!

+

There are other useful divisions as well, including {.notes} (speaker notes), {.aside} (additional commentary similar to footnotes), {.footer} (slide footers), which you can add in the same way as we did for the incremental list above.
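For example, speaker notes could be added to a slide like this (the slide title and note text are just examples):

## About the data

::: {.notes}
Remember to mention where the penguin data come from.
:::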

+
    +
  • Pick one of the above-mentioned divisions to add to your presentation and render it.
  • +
+
+
+
+ +
+
+Note +
+
+
+

The notes and footer divisions will appear as normal Markdown text when rendering to HTML, while asides will appear in the margin. These divisions thus represent cases you might want to avoid if you want to be completely format-agnostic.

+
+
+
+
+

5.3 Presentation options

+

Just like the other formats you can specify presentation-specific options at the document-level using the YAML header. You could, for example, add the {.scrollable} or {.smaller} div to the entire document.

+
    +
  • Add the revealjs format to the YAML header as well as a scrollable: true option to it.
  • +
+

You can also specify one of the built-in themes here.

+
    +
  • Add theme: simple to your YAML header and render.
  • +
+

You can find the entire list of themes at the Quarto website.

+
+
+

5.4 Multiple columns

+

Sometimes you’ll want to have more than one column in your presentation, which is done with the {.columns} and {.column} divisions. The former specifies that a section with multiple columns is starting, while the second specifies when each column starts, like so:

+
:::: {.columns}
+
+::: {.column}
+Left column
+:::
+
+::: {.column}
+Right column
+:::
+
+::::
+
    +
  • Add multiple columns with some content to your presentation and render it.
  • +
+

You can also control the widths of these columns using e.g. {.column width="40%"}.

+
+
+
+ +
+
+Note +
+
+
+

The {.columns} div also works for a normal HTML render, so it’ll look the same regardless of whether you output as a document or a presentation.

+
+
+
+
+

5.5 Fragments

+

We’ve already learnt how to get incremental lists working, but what about general content we want to incrementally step through? This is done with the {.fragment} div.

+
    +
  • Add a {.fragment} div to some slide content and render.
  • +
+

Fragments are similar to “animations” from PowerPoint and come with lots of built-in variations, e.g. fade-out, grow, strike and several others.

+
    +
  • Add a fragment variant to your content, e.g. {.fragment .grow} and render your document.
  • +
+

You can also control the order in which fragments appear using the fragment-index=<NUMBER> option.

+
    +
  • Create a new slide and add some content with a different order of appearance than the order of the code. If you need help or inspiration, click below.
  • +
+
+ +
+
+
## Why Palmer Penguins?
+
+::: {.fragment fragment-index=2}
+![](https://allisonhorst.github.io/palmerpenguins/logo.png){fig-align="center"}
+:::
+
+::: {.fragment fragment-index=1}
+The goal of `palmerpenguins` is to provide a good dataset for data exploration
+and visualization, as an alternative to `iris`.
+:::
+
+
+
+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we covered how to create presentations using Quarto, including how to add various divisions, global slide-options, multiple columns and fragments.

+
+
+
+
+
+

6 Extra material

+

The following material contains some more advanced things that you can do with Quarto but are not really part of the core of the Quarto material. It’s a mix of various functionalities, and you don’t have to go through it if you don’t want to.

+

If you’re interested in learning more about Quarto in general, here are some reading tips:

+ +
+

6.1 Tabsets

+

Sometimes you’ll want to present the same content in different ways, e.g. the equivalent code in different languages. Look at the following toy example:

+
::: {.panel-tabset}
+## R
+```{r}
+words <- c("Foo", "bar")
+print(paste(words, collapse = ' '))
+```
+
+## Python
+```{python}
+words = ["Foo", "bar"]
+print(' '.join(words))
+```
+:::
+

Try adding that to a document and see that you’ll get a set of tabs that change the content of the code chunk to the respective language. This is not only useful for showing different languages, but can be used for other situations as well. For example, you might want to run different analyses and show them in different tabs, or even show different interactive elements in separate tabs.

+
+
+

6.2 Callouts

+

If you’re writing some sort of documentation, tutorial or just want to draw special attention to something, callouts are here for you. They render as a coloured block with a header and content. There are five types of callouts: note, tip, warning, caution, and important. As with lots of Quarto things they are specified using a division, like so:

+
::: {.callout-note}
+This is a note callout.
+:::
+

The different callouts come with appropriate colours by default, which you can change in the theme. You can also have collapsible callouts by adding the collapse=true option, where true will have the callout collapsed by default. You can also specify titles in the same way using the title=<TITLE> option or by adding the title directly to the callout content, like so:

+
::: {.callout-note}
+## This is the callout title
+
+This is a note callout.
+:::
+

You can change the overall appearance of callouts by using the appearance option or the callout-appearance global option. Valid values are default, simple and minimal, with decreasing usage of colours and weights. You can also suppress the callout icons using icon=false or callout-icon: false in a similar manner.
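Here is a small sketch combining some of these options:

::: {.callout-tip collapse="true" appearance="simple" icon=false}
## A collapsed tip with a custom title

This content is hidden until the reader expands the callout.
:::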

+
+
+

6.3 Mixing R and Python

+

Earlier in the tutorial we showed how to change the language using the engine global option, but there is actually a way to use both R and Python in the same Quarto document. This is done via the Knitr engine and the reticulate R package, which allows communication between any variables and data you store in either R or Python code chunks. While this may not be that common of a use-case, it’s still great that it’s there for those that want access to it. We won’t go through the details of how this works here, but you’re welcome to go and check out the official reticulate website for yourself.

+

If you just want to mix R and Python in a single Quarto document without the interoperability between the languages it’s a lot simpler, though. You can either just install the reticulate package (r-reticulate in Conda) or add the python.reticulate=FALSE chunk option to the Python chunks.

+
+
+

6.4 Citations

+

You can actually write whole articles in Quarto! For that purpose, it’s also great that you can cite things from a bibliography as well. Specifying the bibliography file(s) is done using the bibliography global option; specifying the citation style can be done using a csl (Citation Style Language) file and the csl global option. Citation itself is similar to cross-referencing (@cross-ref), but is surrounded by square brackets: [@citation]. You can read more details about citations at the Quarto website.
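A minimal sketch of what this could look like, where the bibliography file, CSL file and citation key are all placeholders:

---
title: "My article"
bibliography: references.bib
csl: nature.csl
---

As shown previously [@doe2020], reproducibility matters.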

+ + +
+
+ +
+ +
+ + + + + + + \ No newline at end of file diff --git a/pages/snakemake.html b/pages/snakemake.html new file mode 100644 index 00000000..c83b9278 --- /dev/null +++ b/pages/snakemake.html @@ -0,0 +1,2052 @@ + + + + + + + + + +Working with Snakemake + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+
+ +
+ +
+
+
+

Working with Snakemake

+

How to create reproducible workflows and computational pipelines

+
+
+ + +
+ + +
+
Published
+
+

15-Oct-2024

+
+
+ + +
+ + +
+ + + + +
+ + + + + + +
+

1 Introduction

+

A workflow management system (WfMS) is a piece of software that sets up, performs and monitors a defined sequence of computational tasks (i.e. “a workflow”). Snakemake is a WfMS that was developed in the bioinformatics community, and as such it has a number of features that make it particularly well-suited for creating reproducible and scalable data analyses.

+

First of all the language you use to formulate your workflows is based on Python, which is a language with strong standing in academia. However, users are not required to know how to code in Python to work efficiently with Snakemake. Workflows can easily be scaled from your desktop to server, cluster, grid or cloud environments. This makes it possible to develop a workflow on your laptop, maybe using only a small subset of your data, and then run the real analysis on a cluster. Snakemake also has several features for defining the environment with which each task is carried out. This is important in bioinformatics, where workflows often involve running a large number of small third-party tools.

+

Snakemake is primarily intended to work on files (rather than for example streams, reading/writing from databases or passing variables in memory). This fits well with many fields of bioinformatics, notably next-generation sequencing, that often involve computationally expensive operations on large files. It’s also a good fit for a scientific research setting, where the exact specifications of the final workflow aren’t always known at the beginning of a project.

+

Lastly, a WfMS is a very important tool for making your analyses reproducible. By keeping track of when each file was generated, and by which operation, it is possible to ensure that there is a consistent “paper trail” from raw data to final results. Snakemake also has features that allow you to package and distribute the workflow, and any files it involves, once it’s done.

+

This tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to set it up if you haven’t done so already, then open up a terminal and go to workshop-reproducible-research/tutorials/snakemake and activate your snakemake-env Conda environment.

+
+
+

2 The basics

+

In this part of the tutorial we will create a very simple workflow from scratch, in order to show the fundamentals of how Snakemake works. The workflow will take two files as inputs, a.txt and b.txt, and the purpose is to convert the text in the files to upper case and then to concatenate them.

+

Run the following shell commands. The first one will make an empty file named Snakefile, which will later contain the workflow. The second and third commands generate two files containing some arbitrary text.

+
touch Snakefile
+echo "This is a.txt" > a.txt
+echo "This is b.txt" > b.txt
+

Then open Snakefile in your favourite text editor. A Snakemake workflow is based on rules, which take some file(s) as input, perform some type of operation on them, and generate some file(s) as output. Here is a very simple rule that produces a.upper.txt as an output, using a.txt as input. Copy this rule to your Snakefile and save it.

+
rule convert_to_upper_case:
+    output:
+        "a.upper.txt"
+    input:
+        "a.txt"
+    shell:
+        """
+        tr [a-z] [A-Z] < {input} > {output}
+        """
+
+
+
+ +
+
+Caution +
+
+
+

Indentation is important in Snakefiles, so make sure that you have the correct number of spaces before input/output/shell and their respective subsections. The number of spaces per level doesn’t matter as long as you’re consistent. Here we use four, but you could just as well use two for a more compact look. Don’t use tabs (unless your editor automatically converts them to spaces).

+
+
+

Rules can be given names; here it’s convert_to_upper_case. While rule names are not strictly necessary we encourage you to use them and to make an effort to name your rules in a way that makes it easy to understand the purpose of the rule, as rule names are one of the main ways to interact with the workflow. The shell section (or directive) contains the shell commands that will convert the text in the input file to upper case and send it to the output file. In the shell command string, we can refer to elements of the rule via curly brackets. Here, we refer to the output file by specifying {output} and to the input file by specifying {input}. If you’re not very familiar with Bash, this particular command can be read as “send the contents of a.txt to the program tr, which will convert all characters in the set [a-z] to the corresponding character in the set [A-Z], and then send the output to a.upper.txt”.

+

Now let’s run our first Snakemake workflow. When a workflow is executed Snakemake tries to generate a set of target files. Target files can be specified via the command line (or, as you will see later, in several other ways). Here we ask Snakemake to make the file a.upper.txt. We can specify the file containing our rules with -s but since the default behaviour of Snakemake is to look for a file called Snakefile in either the working directory or in a subdirectory called workflow/ we don’t need to specify that here. It’s good practice to first run with the flag -n (or --dry-run), which will show what Snakemake plans to do without actually running anything, and you also need to specify how many cores to be used for the workflow with --cores or -c. For now, you only need 1 so set -c 1. You can also use the flag -p, for showing the shell commands that it will execute, and the flag -r for showing the reason for running a specific rule. snakemake --help will show you all available flags.

+
$ snakemake -n -c 1 -r -p a.upper.txt
+
+Building DAG of jobs...
+Job stats:
+job                      count    min threads    max threads
+---------------------  -------  -------------  -------------
+convert_to_upper_case        1              1              1
+total                        1              1              1
+
+
+[Mon Oct 25 16:48:43 2021]
+rule convert_to_upper_case:
+    input: a.txt
+    output: a.upper.txt
+    jobid: 0
+    reason: Missing output files: a.upper.txt
+    resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T
+
+
+        tr [a-z] [A-Z] < a.txt > a.upper.txt
+
+Job stats:
+job                      count    min threads    max threads
+---------------------  -------  -------------  -------------
+convert_to_upper_case        1              1              1
+total                        1              1              1
+
+This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
+

You can see that Snakemake plans to run one job: the rule convert_to_upper_case with a.txt as input and a.upper.txt as output. The reason for doing this is that it’s missing the file a.upper.txt. Now execute the workflow without the -n flag and check that the contents of a.upper.txt is as expected. Then try running the same command again. What do you see? It turns out that Snakemake only reruns jobs if there have been changes to either the input files, or the workflow itself. This is how Snakemake ensures that everything in the workflow is up to date. We will get back to this shortly.

+

What if we ask Snakemake to generate the file b.upper.txt?

+
$ snakemake -n -c 1 -r -p b.upper.txt
+
+Building DAG of jobs...
+MissingRuleException:
+No rule to produce b.upper.txt (if you use input functions make sure that they don't raise unexpected exceptions).
+

That didn’t work well. We could copy the rule to make a similar one for b.txt, but that would be a bit cumbersome. This is where named wildcards come in: one of the most powerful features of Snakemake. Simply change the input from input: "a.txt" to input: "{some_name}.txt" and the output to output: "{some_name}.upper.txt". Now try asking for b.upper.txt again.
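For reference, the modified rule would then look something like this:

rule convert_to_upper_case:
    output:
        "{some_name}.upper.txt"
    input:
        "{some_name}.txt"
    shell:
        """
        tr [a-z] [A-Z] < {input} > {output}
        """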

+

Tada! What happens here is that Snakemake looks at all the rules it has available (actually only one in this case) and tries to assign values to all wildcards so that the targeted files can be generated. In this case it was quite simple (you can see that it says wildcards: some_name=b), but for large workflows and multiple wildcards it can get much more complex. Named wildcards are what enable a workflow (or single rules) to be efficiently generalized and reused between projects or shared between people.

+

It seems we have the first part of our workflow working; now it’s time to make the second rule for concatenating the outputs from convert_to_upper_case. The rule structure will be similar; the only difference is that here we have two inputs instead of one. This can be expressed in two ways, either with named inputs like this:

+
input:
+    firstFile="...",
+    secondFile="..."
+shell:
+    """
+    some_function {input.firstFile} {input.secondFile}
+    """
+

Or with indexes like this:

+
input:
+    "...",
+    "..."
+shell:
+    """
+    some_function {input[0]} {input[1]}
+    """
+
+
+
+ +
+
+Caution +
+
+
+

If you have multiple inputs or outputs they need to be delimited with a comma (as seen above). This is a very common mistake when writing Snakemake workflows. The parser will complain, but sometimes the error message can be difficult to interpret.

+
+
+

Now try to construct this rule yourself and name it concatenate_a_and_b. The syntax for concatenating two files in Bash is cat first_file.txt second_file.txt > output_file.txt. Call the output c.txt. Run the workflow in Snakemake and validate that the output looks as expected.

+

Wouldn’t it be nice if our workflow could be used for any files, not just a.txt and b.txt? We can achieve this by using named wildcards (or in other ways as we will discuss later). As we’ve mentioned, Snakemake looks at all the rules it has available and tries to assign values to all wildcards so that the targeted files can be generated. We therefore have to name the output file in a way that also contains information about which input files it should be based on. Try to figure out how to do this yourself. If you’re stuck you can look at the spoiler below, but spend some time on it before you look. Also rename the rule to concatenate_files to reflect its new, more general use.

+
+ +
+
+
rule concatenate_files:
+    output:
+        "{first}_{second}.txt"
+    input:
+        "{first}.upper.txt",
+        "{second}.upper.txt"
+    shell:
+        """
+        cat {input[0]} {input[1]} > {output}
+        """
+
+
+
+

We can now control which input files to use by the name of the file we ask Snakemake to generate. Run the workflow without the flag -n (or --dry-run) to execute both rules, providing one core with -c 1 (or --cores 1):

+
$ snakemake a_b.txt -c 1
+
+Building DAG of jobs...
+Using shell: /bin/bash
+Provided cores: 1 (use --cores to define parallelism)
+Rules claiming more threads will be scaled down.
+Job stats:
+job                      count    min threads    max threads
+---------------------  -------  -------------  -------------
+concatenate_files            1              1              1
+convert_to_upper_case        2              1              1
+total                        3              1              1
+
+Select jobs to execute...
+
+[Mon Oct 25 16:51:52 2021]
+rule convert_to_upper_case:
+    input: b.txt
+    output: b.upper.txt
+    jobid: 2
+    wildcards: some_name=b
+    resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T
+
+[Mon Oct 25 16:51:53 2021]
+Finished job 2.
+1 of 3 steps (33%) done
+Select jobs to execute...
+
+[Mon Oct 25 16:51:53 2021]
+rule convert_to_upper_case:
+    input: a.txt
+    output: a.upper.txt
+    jobid: 1
+    wildcards: some_name=a
+    resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T
+
+[Mon Oct 25 16:51:53 2021]
+Finished job 1.
+2 of 3 steps (67%) done
+Select jobs to execute...
+
+[Mon Oct 25 16:51:53 2021]
+rule concatenate_files:
+    input: a.upper.txt, b.upper.txt
+    output: a_b.txt
+    jobid: 0
+    wildcards: first=a, second=b
+    resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T
+
+[Mon Oct 25 16:51:53 2021]
+Finished job 0.
+3 of 3 steps (100%) done
+

Neat!

+
+
+
+ +
+
+Tip +
+
+
+

You can name a file whatever you want in a Snakemake workflow, but you will find that everything falls into place much more nicely if the filename reflects the file’s path through the workflow, e.g. sample_a.trimmed.deduplicated.sorted.bam.

+
+
+

The input to Snakemake rules has to be strings or lists of strings; however, you don’t have to specify these strings directly in the input: section of rules. Instead, you can specify Python functions that return strings or lists of strings. This allows you to supply input to rules that can vary depending on the wildcards being used. We’ll get to why that’s useful in a bit, but first let’s put it to use for the concatenate_files rule. Because Snakemake is based on Python we can mix rule definitions with standard Python code in the same file. Add a function just above the concatenate_files rule that looks like this:

+
def concat_input(wildcards):
+    files = [wildcards.first + ".upper.txt", wildcards.second + ".upper.txt"]
+    return files
+

This is the syntax to define a function in Python. The def concat_input(wildcards): line shows the name of the function (concat_input) and the variable passed to the function (the wildcards object). In the second line we create a list called files with two items, appending the ‘.upper.txt’ suffix to the value of each wildcard. Finally, the function returns the list. Because the concatenate_files rule has two wildcards {first} and {second} we can access the actual strings in the wildcards object using wildcards.first and wildcards.second. When we ask for the file a_b.txt then wildcards.first == 'a' and wildcards.second == 'b'. This means that the files list returned by the function will be ['a.upper.txt', 'b.upper.txt']. To see for yourself you can add the following line to the function, just before the return statement: print(wildcards.first, wildcards.second, files). This way the wildcard values and the list will be printed to the terminal when you run Snakemake.
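With the suggested print statement added, the function would look like this:

def concat_input(wildcards):
    files = [wildcards.first + ".upper.txt", wildcards.second + ".upper.txt"]
    # Print the wildcard values and the resulting list when the DAG is built
    print(wildcards.first, wildcards.second, files)
    return files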

+

Now that we’ve defined the function to use as input, we can use it in the concatenate_files rule. Update the rule so that it looks like this:

+
rule concatenate_files:
+    output:
+        "{first}_{second}.txt"
+    input:
+        concat_input
+    shell:
+        """
+        cat {input[0]} {input[1]} > {output}
+        """
+

You see that the name of the function concat_input is added in place of the input strings. When using the wildcards object in input functions like this we have to call the function without any arguments (simply concat_input) and the function has to be defined to accept a single argument (here def concat_input(wildcards):). Let’s run the workflow with the updated rule. Remove the file a_b.txt or add -f to the Snakemake command to force a re-run:

+
snakemake a_b.txt -c 1 -f
+

If you added the print statement to the function you should see the following printed to your terminal:

+
Building DAG of jobs...
+a b ['a.upper.txt', 'b.upper.txt']
+

Followed by the rest of the workflow output.

+

There are a number of possible use-cases for input functions. For example, say that you have an experiment where you’ve sequenced three samples: sample1, sample2 and sample3, with the corresponding FASTQ files under data/, and you want to write a rule that outputs the statistics of all sequences within each sample. However, samples sample1 and sample2 have been sequenced with single-end technology while sample3 has paired-end reads. The single-end samples will have only one FASTQ file whereas the paired-end sample will have two (one for each sequenced end). Thus, depending on the name of the sample, the input to the rule will be either one file or two. With input functions we can write a generalized rule that can handle both types:

+
def fastq_input(wildcards):
+    if wildcards.sample_id in ["sample1", "sample2"]:
+        return "data/" + wildcards.sample_id + ".fastq.gz"
+    else:
+        return ["data/" + wildcards.sample_id + ".R1.fastq.gz",
+                "data/" + wildcards.sample_id + ".R2.fastq.gz"]
+
+rule fastq_stats:
+    output:
+        "{sample_id}.stats.txt"
+    input:
+        fastq_input
+    shell:
+        """
+        seqtk comp {input} > {output}
+        """
+

As you can see, the fastq_stats rule outputs one file {sample_id}.stats.txt and takes as input the value returned from the fastq_input function. In this function the sample id is evaluated and if it is either sample1 or sample2 (our single-end samples) then the function returns a single string which is the path to the FASTQ file for that sample. Otherwise, the function returns a list containing both the R1 and R2 files for the sample. In the shell: directive of the rule the seqtk comp command is run on the input and the output is sent to the output file.

+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
    +
  • How a simple Snakemake rule looks.
  • +
  • How to define target files when executing a workflow.
  • +
  • How to use named wildcards for writing generic and flexible rules.
  • +
  • How to use input functions in rules
  • +
+
+
+
+
+

3 Visualising workflows

+

All that we’ve done so far could quite easily be done in a simple shell script that takes the input files as parameters. Let’s now take a look at some of the features where a WfMS like Snakemake really adds value compared to a more straightforward approach. One such feature is the possibility to visualize your workflow. Snakemake can generate three types of graphs, one that shows how the rules are connected, one that shows how the jobs (i.e. an execution of a rule with some given inputs/outputs/settings) are connected, and finally one that shows rules with their respective input/output files.

+

First we look at the rule graph. The following command will generate a rule graph in the dot language and pipe it to the program dot, which in turn will save a visualization of the graph as a PNG file (if you’re having trouble displaying PNG files you could use SVG or JPG instead).

+
+
+
+ +
+
+Caution +
+
+
+

If you added the print(wildcards.first,wildcards.second,files) statement to the concat_input function in the previous section you need to remove that line before running the commands below.

+
+
+
snakemake --rulegraph a_b.txt | dot -Tpng > rulegraph.png
+

+

This looks simple enough, the output from the rule convert_to_upper_case will be used as input to the rule concatenate_files.

+

For a more typical bioinformatics project it can look something like this when you include all the rules from processing of the raw data to generating figures for the paper.

+

+

While saying that it’s easy to read might be a bit of a stretch, it definitely gives you a better overview of the project than you would have without a WfMS.

+

The second type of graph is based on the jobs, and looks like this for our little workflow (use --dag instead of --rulegraph).

+
snakemake --dag a_b.txt | dot -Tpng > jobgraph.png
+

+

The main difference here is that now each node is a job instead of a rule. You can see that the wildcards used in each job are also displayed. Another difference is the dotted lines around the nodes. A dotted line is Snakemake’s way of indicating that this rule doesn’t need to be rerun in order to generate a_b.txt. Validate this by running snakemake -n -r a_b.txt and it should say that there is nothing to be done.

+

We’ve discussed before that one of the main purposes of using a WfMS is that it automatically makes sure that everything is up to date. This is done by recursively checking that outputs are always newer than inputs for all the rules involved in the generation of your target files. Now try to change the contents of a.txt to some other text and save it. What do you think will happen if you run snakemake -n -r a_b.txt again?

+
+ +
+
+
$ snakemake -n -r a_b.txt
+
+Building DAG of jobs...
+Job stats:
+job                      count    min threads    max threads
+---------------------  -------  -------------  -------------
+concatenate_files            1              1              1
+convert_to_upper_case        1              1              1
+total                        2              1              1
+
+
+[Mon Oct 25 17:00:02 2021]
+rule convert_to_upper_case:
+    input: a.txt
+    output: a.upper.txt
+    jobid: 1
+    reason: Updated input files: a.txt
+    wildcards: some_name=a
+    resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T
+
+
+[Mon Oct 25 17:00:02 2021]
+rule concatenate_files:
+    input: a.upper.txt, b.upper.txt
+    output: a_b.txt
+    jobid: 0
+    reason: Input files updated by another job: a.upper.txt
+    wildcards: first=a, second=b
+    resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T
+
+Job stats:
+job                      count    min threads    max threads
+---------------------  -------  -------------  -------------
+concatenate_files            1              1              1
+convert_to_upper_case        1              1              1
+total                        2              1              1
+
+This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
+
+
+
+

Were you correct? Also generate the job graph and compare to the one generated above. What’s the difference? Now rerun without -n and validate that a_b.txt contains the new text (don’t forget to specify -c 1). Note that Snakemake doesn’t look at the contents of files when trying to determine what has changed, only at the timestamp for when they were last modified.
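You can convince yourself of this by updating only the timestamp of a.txt, without changing its contents, and doing another dry-run:

touch a.txt
snakemake -n -r a_b.txt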

+

We’ve seen that Snakemake keeps track of whether files in the workflow have changed, and automatically makes sure that any results depending on such files are regenerated. What if the rules themselves are changed? It turns out that since version 7.8.0 Snakemake keeps track of this automatically.

+

Let’s say that we want to modify the rule concatenate_files to also include which files were concatenated.

+
rule concatenate_files:
+    output:
+        "{first}_{second}.txt"
+    input:
+        "{first}.upper.txt",
+        "{second}.upper.txt"
+    shell:
+        """
+        echo 'Concatenating {input}' | cat - {input[0]} {input[1]} > {output}
+        """
+
+
+
+ +
+
+Note +
+
+
+

It’s not really important for the tutorial, but the shell command used here first outputs “Concatenating” followed by a space delimited list of the files in input. This string is then sent to the program cat where it’s concatenated with input[0] and input[1] (the parameter - means that it should read from standard input). Lastly, the output from cat is sent to {output}.

+
+
+

If you now run the workflow as before you should see:

+
rule concatenate_files:
+    input: a.upper.txt, b.upper.txt
+    output: a_b.txt
+    jobid: 0
+    reason: Code has changed since last execution
+    wildcards: first=a, second=b
+

This is because, although no files involved in the workflow have changed, Snakemake recognizes that the workflow code itself has been modified, and this triggers a re-run.

+

Snakemake is aware of changes to four categories of such “rerun-triggers”: “input” (changes to rule input files), “params” (changes to the rule params section), “software-env” (changes to Conda environment files specified by the conda: directive) and “code” (changes to code in the shell:, run:, script: and notebook: directives).

+

Prior to version 7.8.0, only changes to the modification time of input files would trigger automatic re-runs. To run Snakemake with this previous behaviour you can use the setting --rerun-triggers mtime at the command line. Change the shell: section of the concatenate_files rule back to the previous version, then try running: snakemake -n -r a_b.txt --rerun-triggers mtime and you should again see Nothing to be done (all requested files are present and up to date).

+

You can also export information on how all files were generated (when, by which rule, which version of the rule, and by which commands) to a tab-delimited file like this:

+
snakemake a_b.txt -c 1 -D > summary.tsv
+

The content of summary.tsv is shown in the table below:

+
| output_file | date | rule | version | log-file(s) | input-file(s) | shellcmd | status | plan |
|---|---|---|---|---|---|---|---|---|
| a_b.txt | Mon Oct 25 17:01:46 2021 | concatenate_files | - | | a.upper.txt,b.upper.txt | cat a.upper.txt b.upper.txt > a_b.txt | rule implementation changed | update pending |
| a.upper.txt | Mon Oct 25 17:01:46 2021 | convert_to_upper_case | - | | a.txt | tr [a-z] [A-Z] < a.txt > a.upper.txt | ok | no update |
| b.upper.txt | Mon Oct 25 17:01:46 2021 | convert_to_upper_case | - | | b.txt | tr [a-z] [A-Z] < b.txt > b.upper.txt | ok | no update |
+
+

You can see in the second-to-last column that the rule implementation for a_b.txt has changed. The last column shows whether Snakemake plans to regenerate the files when it’s next executed. You can see that for the concatenate_files rule the plan is update pending, because we generated the summary with the default behaviour of using all rerun-triggers.

+

You might wonder where Snakemake keeps track of all these things. It stores all this information in a hidden subdirectory called .snakemake. This is convenient since it’s easy to delete if you don’t need it anymore, and everything is contained in the project directory. Just be sure to add it to .gitignore so that you don’t end up tracking it with Git.
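One simple way to do that is to append it to your .gitignore from the command line:

echo ".snakemake/" >> .gitignore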

+

By now you should be familiar with the basic functionality of Snakemake, and you can build advanced workflows with only the features we have discussed here. There’s a lot we haven’t covered though, in particular when it comes to making your workflow more reusable. In the following section we will start with a workflow that is fully functional but not very flexible. We will then gradually improve it, and at the same time showcase some Snakemake features we haven’t discussed yet. Note that this can get a little complex at times, so if you felt that this section was a struggle then you could move on to one of the other tutorials instead.

+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
    +
  • How to use --dag and --rulegraph for visualizing the job and rule graphs, respectively.
  • +
  • How Snakemake reruns relevant parts of the workflow after there have been changes.
  • +
  • How Snakemake tracks changes to files and code in a workflow
  • +
+
+
+
+
+

4 The MRSA workflow

+

As you might remember from the intro, we are attempting to understand how lytic bacteriophages can be used as a future therapy for the multi-resistant bacteria MRSA (methicillin-resistant Staphylococcus aureus). In order to do this we have performed RNA-seq of three strains, one test and two controls. We have already set up a draft Snakemake workflow for the RNA-seq analysis and it seems to be running nicely. The rest of the Snakemake tutorial will be spent improving and making this workflow more flexible!

+
+
+
+ +
+
+Tip +
+
+
+

This section will leave a little more up to you compared to the previous one. If you get stuck at some point the final workflow after all the modifications is available in tutorials/git/Snakefile.

+
+
+

You are probably already in your snakemake-env environment, otherwise activate it (use conda info --envs if you are unsure).

+
+
+
+ +
+
+Tip +
+
+
+

Here we have one Conda environment for executing the whole Snakemake workflow. Snakemake also supports using explicit Conda environments on a per-rule basis, by specifying something like conda: rule-specific-env.yml in the rule definition and running Snakemake with the --use-conda flag. The given rule will then be run in the Conda environment specified in rule-specific-env.yml, which will be created and activated on the fly by Snakemake. Note that by default Snakemake uses mamba to generate the rule-specific environments. This behaviour can be changed by running with --conda-frontend conda, which will force Snakemake to use conda instead. A minimal sketch of a rule with its own Conda environment is shown below this box.

+
+
+
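As a minimal sketch (with placeholder names, not a rule from this workflow), a rule using its own Conda environment could look like this, and the workflow would then be run with snakemake --use-conda:

rule some_rule:
    output:
        "..."
    input:
        "..."
    conda:
        "rule-specific-env.yml"
    shell:
        """
        some_program {input} > {output}
        """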

Let’s start by generating the rule graph so that we get an overview of the workflow. Here we have to specify the file with the rules using the -s flag to Snakemake since the path to the file differs from the default.

+
snakemake -s snakefile_mrsa.smk --rulegraph | dot -T png > rulegraph_mrsa.png
+

There’s another difference in this command compared to the one we’ve used before, namely that we don’t define a target. In the toy example we used a_b.txt as a target, and the wildcards were resolved based on that. How come we don’t need to do that here? It turns out that by default Snakemake targets the first rule in a workflow. By convention, we call this rule all and let it serve as a rule for aggregating the main outputs of the workflow.
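In the MRSA workflow, the all rule only has an input: directive listing the main outputs, roughly like this:

rule all:
    input:
        "results/tables/counts.tsv",
        "results/multiqc/multiqc.html",
        "results/rulegraph.png"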

+

+

Now take some time and look through the workflow file and try to understand how the rules fit together. Use the rule graph as aid. The rules represent a quite standard, although somewhat simplified, workflow for RNA-seq analysis. If you are unfamiliar with the purpose of the different operations (index genome, FastQC and so on), then take a look at the intro.

+

Also generate the job graph in the same manner. Here you can see that three samples will be downloaded: SRR935090, SRR935091, and SRR935092. The original sample files contain tens of millions of reads but for the purpose of this course we have sub-sampled them to 100,000 reads per sample, so that they are easy to manage, and made them available at the SciLifeLab Data Repository. These FASTQ files will then be quality controlled with FastQC and aligned to a genome. The QC output will be aggregated with MultiQC and the alignments will be used to generate a count table, i.e. a table that shows how many reads map to each gene for each sample. This count table is then what the downstream analysis will be based on.

+

+

Now try to run the whole workflow. Hopefully you see something like this.

+
Building DAG of jobs...
+Using shell: /bin/bash
+Provided cores: 1 (use --cores to define parallelism)
+Rules claiming more threads will be scaled down.
+Job stats:
+job                     count    min threads    max threads
+--------------------  -------  -------------  -------------
+align_to_genome             3              1              1
+all                         1              1              1
+fastqc                      3              1              1
+generate_count_table        1              1              1
+generate_rulegraph          1              1              1
+get_SRA_by_accession        3              1              1
+get_genome_fasta            1              1              1
+get_genome_gff3             1              1              1
+index_genome                1              1              1
+multiqc                     1              1              1
+sort_bam                    3              1              1
+total                      19              1              1
+
+Select jobs to execute...
+
+[Mon Oct 25 17:13:47 2021]
+rule get_genome_fasta:
+    output: data/ref/NCTC8325.fa.gz
+    jobid: 6
+    resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T
+
+--2021-10-25 17:13:48--  ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz
+           => ‘data/ref/NCTC8325.fa.gz’
+Resolving ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)... 193.62.197.75
+Connecting to ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)|193.62.197.75|:21... connected.
+Logging in as anonymous ... Logged in!
+==> SYST ... done.    ==> PWD ... done.
+.
+.
+[lots of stuff]
+.
+.
+localrule all:
+    input: results/tables/counts.tsv, results/multiqc/multiqc.html, results/rulegraph.png
+    jobid: 0
+    resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T
+
+[Mon Oct 25 17:14:38 2021]
+Finished job 0.
+19 of 19 steps (100%) done
+

After everything is done, the workflow will have resulted in a bunch of files in the directories data/ and results/. Take some time to look through the structure, in particular the quality control reports in results/multiqc/ and the count table in results/tables/.

+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
    +
  • How the MRSA workflow looks.
  • +
  • How to run the MRSA workflow.
  • +
  • Which output files the MRSA workflow produces.
  • +
+
+
+
+
+

5 Parameters

+

In a typical bioinformatics project, considerable effort is spent on tweaking parameters for the various programs involved. It would be inconvenient if you had to edit the shell commands themselves every time you wanted to run with a new setting. Luckily, there is a better option for this: the params keyword.

+
rule some_rule:
+    output:
+        "..."
+    input:
+        "..."
+    params:
+        cutoff=2.5
+    shell:
+        """
+        some_program --cutoff {params.cutoff} {input} {output}
+        """
+

Most of the programs are run with default settings in the MRSA workflow and don’t use the params: directive. However, the get_SRA_by_accession rule is an exception. Here the remote address for each of the files to download is passed to the shell directive via:

+
def get_sample_url(wildcards):
+    samples = {
+        "SRR935090": "https://figshare.scilifelab.se/ndownloader/files/39539767",
+        "SRR935091": "https://figshare.scilifelab.se/ndownloader/files/39539770",
+        "SRR935092": "https://figshare.scilifelab.se/ndownloader/files/39539773"
+    }
+    return samples[wildcards.sample_id]
+
+rule get_SRA_by_accession:
+    """
+    Retrieve a single-read FASTQ file
+    """
+    output:
+        "data/{sample_id}.fastq.gz"
+    params:
+        url = get_sample_url
+    shell:
+        """
+        wget -O - {params.url} | seqtk sample - 25000 | gzip -c > {output[0]}
+        """
+

You may recognize this from page 2 of this tutorial where we used input functions to generate strings and lists of strings for the input: section of a rule. Using a function to return values based on the wildcards also works for params:. Here sample_id is a wildcard which in this specific workflow can be SRR935090, SRR935091 or SRR935092. The wildcards object is passed to the function get_sample_url and, depending on what output the rule is supposed to generate, wildcards.sample_id will take the value of one of the three sample ids. The samples variable defined in the function is a Python dictionary that has the URLs for each sample_id hard-coded. This dictionary is used to convert the value of the sample_id wildcard to a URL, which is returned by the function. Finally, in the shell: directive we access the url parameter with {params.url}. (We could have written three separate rules to download the samples, but it’s easy to see how that can become impractical.)

+

Let’s add another parameter to the get_SRA_by_accession rule. As you can see in the shell command, the FASTQ file downloaded by wget gets piped directly (the -O - part means send contents to STDOUT) to the seqtk sample command, which reads from STDIN and outputs 25000 randomly sampled reads (out of the 100,000 contained in the example FASTQ file). Change the rule to use a parameter called max_reads instead and set its value to 20000. If you need help, click to show the solution below.

+
+ +
+
+
rule get_SRA_by_accession:
+    """
+    Retrieve a single-read FASTQ file
+    """
+    output:
+        "data/{sample_id}.fastq.gz"
+    params:
+        url = get_sample_url,
+        max_reads = 20000
+    shell:
+        """
+        wget -O - {params.url} | seqtk sample - {params.max_reads} | gzip -c > {output[0]}
+        """
+
+
+
+

Now run through the workflow. Because there have been changes to the get_SRA_by_accession rule, this will trigger a re-run of the rule for all three accessions. In addition, all downstream rules that depend on output from get_SRA_by_accession are re-run.
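For example, you could rerun the whole workflow with:

snakemake -s snakefile_mrsa.smk -c 1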

+

As you can see, the parameter values we set in the params section don’t have to be static; they can be any Python expression. In particular, Snakemake provides a global dictionary of configuration parameters called config. Let’s modify get_SRA_by_accession to look something like this in order to make use of this dictionary:

+
rule get_SRA_by_accession:
+    """
+    Retrieve a single-read FASTQ file
+    """
+    output:
+        "data/{sample_id}.fastq.gz"
+    params:
+        url = get_sample_url,
+        max_reads = config["max_reads"]
+    shell:
+        """
+        wget -O - {params.url} | seqtk sample - {params.max_reads} | gzip -c > {output[0]}
+        """
+

Note that Snakemake now expects there to be a key named max_reads in the config dictionary. If we don’t populate the dictionary somehow it will be empty, so if you were to run the workflow now it would trigger a KeyError (try running snakemake -s snakefile_mrsa.smk -n to see for yourself). In order to populate the config dictionary with data for the workflow we could use the snakemake --config KEY=VALUE syntax directly from the command line (e.g. snakemake --config max_reads=20000 -s snakefile_mrsa.smk). However, from a reproducibility perspective, it’s not optimal to set parameters from the command line, since it’s difficult to keep track of which parameter values were used.

+

A much better alternative is to use the --configfile FILE option to supply a configuration file to Snakemake. In this file we can collect all the project-specific settings, sample ids and so on. This also enables us to write the Snakefile in a more general manner so that it can be better reused between projects. Like several other files used in these tutorials, this file should be in YAML format. Create the file below and save it as config.yml.

+
max_reads: 25000
+

If we now run Snakemake with --configfile config.yml, it will parse this file to form the config dictionary. If you want to overwrite a parameter value, e.g. for testing, you can still use the --config KEY=VALUE flag, as in --config max_reads=1000.
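For example, a dry-run using the config file could look like this:

snakemake -s snakefile_mrsa.smk --configfile config.yml -n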

+
+
+
+ +
+
+Tip +
+
+
+

Rather than supplying the config file from the command line you could also add the line configfile: "config.yml" to the top of your Snakefile. Keep in mind that with such a setup Snakemake will complain if the file config.yml is not present.

+
+
+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
    +
  • How to set parameter values with the params directive.
  • +
  • How to run Snakemake with the config variable and with a configuration file.
  • +
+
+
+
+
+

6 Logs

+

As you probably noticed it was difficult to follow how the workflow progressed since some rules printed a lot of output to the terminal. In some cases this also contained important information, such as statistics on the sequence alignments or genome indexing. This could be valuable for example if you later in the project get weird results and want to debug. It’s also important from a reproducibility perspective that the “paper trail” describing how the outputs were generated is saved. Luckily, Snakemake has a feature that can help with this. Just as we define input and output in a rule we can also define log.

+
rule some_rule:
+    output:
+        "..."
+    input:
+        "..."
+    log:
+        "..."
+    shell:
+        """
+        echo 'Converting {input} to {output}' > {log}
+        """
+

A log file is no different from any other output file, but it’s dealt with a little differently by Snakemake. For example, it’s shown in the file summary when using -D and, unlike other output files, it’s not deleted if a job fails, which of course is necessary for debugging purposes. It’s also a good way to clarify the purpose of the file. We probably don’t need to save logs for all the rules, only the ones with interesting output.

+
    +
  • get_genome_fasta and get_genome_gff3 would be good to log since they are dependent on downloading files from an external server.
  • +
  • multiqc aggregates quality control data for all the samples into one html report, and the log contains information about which samples were aggregated.
  • +
  • index_genome outputs some statistics about the genome indexing.
  • +
  • align_to_genome outputs important statistics about the alignments. This is probably the most important log to save.
  • +
+

Now add a log file to some or all of the rules above. A good place to save them to would be results/logs/rule_name/. To prevent multiple jobs from writing to the same files, Snakemake requires that all output and log files contain the same wildcards, so be sure to include any wildcards used in the rule in the log name as well, e.g. {some_wildcard}.log.

+

You also have to specify in the shell section of each rule what you want the log to contain. Some of the programs we use send their log information to standard out, some to standard error and some let us specify a log file via a flag.

+

For example, in the align_to_genome rule, it could look like this (Bowtie2 writes log info to standard error):

+
rule align_to_genome:
+    """
+    Align a fastq file to a genome index using Bowtie 2.
+    """
+    output:
+        "results/bam/{sample_id,\w+}.bam"
+    input:
+        "data/{sample_id}.fastq.gz",
+        "results/bowtie2/NCTC8325.1.bt2",
+        "results/bowtie2/NCTC8325.2.bt2",
+        "results/bowtie2/NCTC8325.3.bt2",
+        "results/bowtie2/NCTC8325.4.bt2",
+        "results/bowtie2/NCTC8325.rev.1.bt2",
+        "results/bowtie2/NCTC8325.rev.2.bt2"
+    log:
+        "results/logs/align_to_genome/{sample_id}.log"
+    shell:
+        """
+        bowtie2 -x results/bowtie2/NCTC8325 -U {input[0]} > {output} 2>{log}
+        """
+

To save some time you can use the info below.

+
# wget has a -o flag for specifying the log file
+wget remote_file -O output_file -o {log}
+
+# MultiQC and featureCounts write to standard error so we redirect with "2>"
+multiqc -n output_file input_files 2> {log}
+featureCounts -t gene -g gene_id -a gff_file -o output_file input_files 2>{log}
+
+# Bowtie2-build redirects to standard out so we use ">"
+bowtie2-build input_file index_dir > {log}
+

Now rerun the whole workflow. Do the logs contain what they should? Note how much easier it is to follow the progression of the workflow when the rules write to logs instead of to the terminal.

+
+
+
+ +
+
+Tip +
+
+
+

If you have a rule with a shell directive in which several commands are run and you want to save stdout and stderr for all commands into the same log file, you can add exec &> {log} as the first line of the shell directive (see the sketch below this box).

+
+
+
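A minimal sketch of that pattern, with hypothetical command names:

rule some_rule:
    output:
        "..."
    input:
        "..."
    log:
        "..."
    shell:
        """
        # Redirect stdout and stderr of all following commands to the log file
        exec &> {log}
        first_command {input}
        second_command > {output}
        """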

If you run with -D (or -S for a simpler version) you will see that the summary table now also contains the log file for each of the files in the workflow.

+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
    +
  • How to redirect output to log files with the log directive.
  • +
+
+
+
+
+

7 Temporary files

+

It’s not uncommon that workflows contain temporary files that should be kept for some time and then deleted once they are no longer needed. A typical case could be that some operation generates a file, which is then compressed to save space or indexed to make searching faster. There is then no need to save the original output file. Take a look at the job graph for our workflow again. The output from align_to_genome is a BAM file, which contains information about all the reads for a sample and where they map in the genome. For downstream processing we need this file to be sorted by genome coordinates. This is what the rule sort_bam is for. We therefore end up with both results/bam/{sample_id}.bam and results/bam/{sample_id}.sorted.bam.

+

In Snakemake we can mark an output file as temporary like this:

+
output: temp("...")
+

The file will then be deleted as soon as all jobs where it’s an input have finished. Now do this for the output of align_to_genome. We have to rerun the rule for it to trigger, so use -R align_to_genome. It should look something like this:

+
.
+.
+rule sort_bam:
+    input: results/bam/SRR935090.bam
+    output: results/bam/SRR935090.sorted.bam
+    jobid: 2
+    wildcards: sample_id=SRR935090
+
+Removing temporary output file results/bam/SRR935090.bam.
+Finished job 2.
+.
+.
+
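For reference, marking the BAM file as temporary is just a matter of wrapping the output path of align_to_genome in temp(), something like:

output:
    temp("results/bam/{sample_id,\w+}.bam")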
+
+
+ +
+
+Tip +
+
+
+

Sometimes you may want to trigger removal of temporary files without actually rerunning the jobs. You can then use the --delete-temp-output flag. In some cases you may instead want to run only parts of a workflow and therefore want to prevent files marked as temporary from being deleted (because the files are needed for other parts of the workflow). In such cases you can use the --notemp flag.

+
+
+

Snakemake has a number of options for marking files:

+
    +
  • temp("..."): The output file should be deleted once it’s no longer needed by any rules.
  • +
  • protected("..."): The output file should be write-protected. Typically used to protect files that require a huge amount of computational resources from being accidentally deleted.
  • +
  • ancient("..."): The timestamp of the input file is ignored and it’s always assumed to be older than any of the output files.
  • +
  • touch("..."): The output file should be “touched”, i.e. created or updated, when the rule has finished. Typically used as “flag files” to enforce some rule execution order without real file dependencies.
  • +
  • directory("..."): The output is a directory rather than a file.
  • +
+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
    +
  • How to mark an output file as temporary for automatic removal.
  • +
+
+
+
+
+

8 Targets

+

We’ve mentioned that Snakemake rules take either strings or a list of strings as input, and that we can use any Python expression in Snakemake workflows. Here we’ll show how these features help us condense the code of rules.

+

Consider the rule align_to_genome below.

+
rule align_to_genome:
+    """
+    Align a fastq file to a genome index using Bowtie 2.
+    """
+    output:
+        "results/bam/{sample_id}.bam"
+    input:
+        "data/{sample_id}.fastq.gz",
+        "results/bowtie2/NCTC8325.1.bt2",
+        "results/bowtie2/NCTC8325.2.bt2",
+        "results/bowtie2/NCTC8325.3.bt2",
+        "results/bowtie2/NCTC8325.4.bt2",
+        "results/bowtie2/NCTC8325.rev.1.bt2",
+        "results/bowtie2/NCTC8325.rev.2.bt2"
+    shell:
+        """
+        bowtie2 -x results/bowtie2/NCTC8325 -U {input[0]} > {output}
+        """
+

Here we have seven inputs: the FASTQ file with the reads and six files with similar file names from the Bowtie2 genome indexing. Instead of writing out all the filenames, we can tidy this up by using a Python expression to generate the list of index files. If you’re familiar with Python you could do this with a list comprehension like this:

+
input:
+    "data/{sample_id}.fastq.gz",
+    [f"results/bowtie2/NCTC8325.{substr}.bt2" for
+        substr in ["1", "2", "3", "4", "rev.1", "rev.2"]]
+

This will take the elements of the list of substrings one by one and insert each element in the place of {substr}. Since this type of aggregation is quite common, Snakemake also has a more compact way of achieving the same thing.

+
input:
+    "data/{sample_id}.fastq.gz",
+    expand("results/bowtie2/NCTC8325.{substr}.bt2",
+        substr = ["1", "2", "3", "4", "rev.1", "rev.2"])
+
+
+
+ +
+
+Caution +
+
+
+

When using expand() like this, substr is not a wildcard because it is resolved to the values explicitly given inside the expand expression.

+
+
+

Now change the rules index_genome and align_to_genome to use the expand() expression.

+

In the workflow we decide which samples to run by including the SRR ids in the names of the inputs to the rules multiqc and generate_count_table:

+
rule generate_count_table:
+    output:
+        "results/tables/counts.tsv"
+    input:
+        bams = ["results/bam/SRR935090.sorted.bam",
+                "results/bam/SRR935091.sorted.bam",
+                "results/bam/SRR935092.sorted.bam"],
+...
+rule multiqc:
+    output:
+        html = "results/multiqc/multiqc.html",
+        stats = "results/multiqc/multiqc_general_stats.txt"
+    input:
+        "results/fastqc/SRR935090_fastqc.zip",
+        "results/fastqc/SRR935091_fastqc.zip",
+        "results/fastqc/SRR935092_fastqc.zip"
+

The output files from these two rules, results/multiqc/multiqc.html and results/tables/counts.tsv, are in turn specified as input to the all rule at the top of the file. Because the first rule is targeted by default when we run Snakemake on the command line (as we mentioned in snakemake-4-the-mrsa-workflow), this is what triggers the rules to run on each of the three samples.

+

However, this is a potential source of errors since it’s easy to change the ids in one place and forget to change them in the other. Because we can use Python code “everywhere”, let’s instead define a list of sample ids and put it at the very top of the Snakefile, just before the rule all:

+
SAMPLES = ["SRR935090", "SRR935091", "SRR935092"]
+

Now use expand() in multiqc and generate_count_table to use SAMPLES for the sample ids. For the multiqc rule it could look like this:

+
input:
+    expand("results/fastqc/{sample_id}_fastqc.zip", sample_id = SAMPLES)
+

See if you can update the generate_count_table rule in the same manner!
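If you want to check your solution, the bams input of generate_count_table could look something like this:

bams = expand("results/bam/{sample_id}.sorted.bam", sample_id = SAMPLES)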

+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
    +
  • How to use the expand() expression to create a list with file names, inserting all provided wildcard values.
  • +
+
+
+
+
+

9 Shadow rules

+

Take a look at the index_genome rule below:

+
rule index_genome:
+    """
+    Index a genome using Bowtie 2.
+    """
+    output:
+        index = expand("results/bowtie2/NCTC8325.{substr}.bt2",
+           substr = ["1", "2", "3", "4", "rev.1", "rev.2"])
+    input:
+        "data/NCTC8325.fa.gz"
+    log:
+        "results/logs/index_genome/NCTC8325.log"
+    shell:
+        """
+        # Bowtie2 cannot use .gz, so unzip to a temporary file first
+        gunzip -c {input} > tempfile
+        bowtie2-build tempfile results/bowtie2/NCTC8325 >{log}
+
+        # Remove the temporary file
+        rm tempfile
+        """
+

There is a temporary file here called tempfile which is the uncompressed version of the input, since Bowtie2 cannot use compressed files. There are a number of drawbacks with having files that aren’t explicitly part of the workflow as input/output files to rules:

+
    +
  • Snakemake cannot clean up these files if the job fails, as it would do for normal output files.
  • +
  • If several jobs are run in parallel there is a risk that they write to tempfile at the same time. This can lead to very scary results.
  • +
  • Sometimes we don’t know the names of all the files that a program can generate. It is, for example, not unusual that programs leave some kind of error log behind if something goes wrong.
  • +
+

All of these issues can be dealt with by using the shadow option for a rule. With the shadow option, each execution of the rule is run in an isolated temporary directory (located in .snakemake/shadow/ by default). There are a few options for shadow (for the full list of these options see the Snakemake docs). The simplest is shadow: "minimal", which means that the rule is executed in an empty directory that the input files to the rule have been symlinked into. For the rule below, that means that the only file available would be input.txt. The shell commands would generate the files some_other_junk_file and output.txt. Lastly, Snakemake will move the output file (output.txt) to its “real” location and remove the whole shadow directory. We therefore never have to think about manually removing some_other_junk_file.

+
rule some_rule:
+    output:
+        "output.txt"
+    input:
+        "input.txt"
+    shadow: "minimal"
+    shell:
+        """
+        touch some_other_junk_file
+        cp {input} {output}
+        """
+

Try this out for the rules where we have to “manually” deal with files that aren’t tracked by Snakemake (multiqc, index_genome). Also remove the shell commands that remove temporary files from those rules, as they are no longer needed. Now rerun the workflow and validate that the temporary files don’t show up in your working directory.

+
+
+
+ +
+
+Tip +
+
+
+

Some people use the shadow option for almost every rule and some never use it at all. One thing to keep in mind is that it leads to some extra file operations when the outputs are moved to their final location. This is no issue when the shadow directory is on the same disk as the output directory, but if you’re running on a distributed file system and generate very many or very large files it might be worth considering other options (see e.g. the --shadow-prefix flag).

+
+
+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
    +
  • How to use the shadow option to handle files that are not tracked by Snakemake.
  • +
+
+
+
+
+

10 Generalising workflows

+

It’s a good idea to separate project-specific parameters from the actual implementation of the workflow. This allows anyone using the workflow to modify its behaviour without changing the underlying code, making the workflow more general.

+

In order to generalize our RNA-seq analysis workflow we should move all project-specific information to config.yml. This means that we want the config file to:

+
    +
  • Specify which samples to run.
  • +
  • Specify which genome to align to and where to download its sequence and annotation files.
  • +
  • (Contain any other parameters we might need to make it into a general workflow, e.g. to support both paired-end and single-read sequencing)
  • +
+
+
+
+ +
+
+Note +
+
+
+

Putting all configuration in config.yml will break the generate_rulegraph rule. You can fix it either by replacing --config max_reads=0 with --configfile=config.yml in the shell command of that rule in the Snakefile, or by adding configfile: "config.yml" to the top of the Snakefile (as mentioned in a previous tip).

+
+
+

The first point is straightforward; rather than using SAMPLES = ["..."] in the Snakefile we define it as a parameter in config.yml. You can either add it as a list similar to the way it was expressed before by adding:

+
SAMPLES: ["SRR935090", "SRR935091", "SRR935092"]
+

To config.yml, or you can use this YAML notation (whether you choose SAMPLES or sample_ids as the name of the entry doesn’t matter, you will just have to reference the same name in the config dictionary inside the workflow):

+
sample_ids:
+  - SRR935090
+  - SRR935091
+  - SRR935092
+

Change the workflow to reference config["sample_ids"] (if using the latter example) instead of SAMPLES, as in:

+
expand("results/fastqc/{sample_id}_fastqc.zip",
+            sample_id = config["sample_ids"])
+

Remove the line with SAMPLES = ["SRR935090", "SRR935091", "SRR935092"] that we added to the top of snakefile_mrsa.smk in Snakemake 8: Targets.

+

Do a dry-run afterwards to make sure that everything works as expected.

+

You may remember from the snakemake-5-parameters part of this tutorial that we’re using a function to return the URL of the FASTQ files to download for each sample:

+
def get_sample_url(wildcards):
+    samples = {
+        "SRR935090": "https://figshare.scilifelab.se/ndownloader/files/39539767",
+        "SRR935091": "https://figshare.scilifelab.se/ndownloader/files/39539770",
+        "SRR935092": "https://figshare.scilifelab.se/ndownloader/files/39539773"
+    }
+    return samples[wildcards.sample_id]
+

Here the URL of each sample_id is hard-coded in the samples dictionary inside the function. To generalize this function we can move the definition to the config file, placing it for example under an entry that we call sample_urls like this:

+
sample_urls:
+  SRR935090: "https://figshare.scilifelab.se/ndownloader/files/39539767"
+  SRR935091: "https://figshare.scilifelab.se/ndownloader/files/39539770"
+  SRR935092: "https://figshare.scilifelab.se/ndownloader/files/39539773"
+

This is what’s called ‘nested’ key/value pairs, meaning that each sample_id -> URL pair becomes nested under the config key sample_urls. So in order to access the URL of e.g. SRR935090 we would use config["sample_urls"]["SRR935090"]. This means that you will have to update the get_sample_url function to:

+
def get_sample_url(wildcards):
+    return config["sample_urls"][wildcards.sample_id]
+

Now the function uses the global config dictionary to return URLs for each sample_id. Again, do a dry-run to see that the new implementation works.

+
+
+
+ +
+
+Tip +
+
+
+

If you were to scale up this workflow with more samples it could become impractical to have to define the URLs by hand in the config file. A tip then is to have a separate file where samples are listed in one column and the URLs (or file paths) in another column. With a few lines of Python code you could then read that list at the start of the workflow and add each sample to the config dictionary (see the sketch below this box).

+
+
+
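A minimal sketch of that idea, assuming a hypothetical tab-separated file samples.tsv with sample ids in the first column and URLs in the second:

# At the top of the Snakefile: read the (hypothetical) samples.tsv file
# and add each sample_id -> URL pair to the config dictionary
config["sample_urls"] = {}
with open("samples.tsv") as fh:
    for line in fh:
        if not line.strip():
            continue
        sample_id, url = line.strip().split("\t")
        config["sample_urls"][sample_id] = url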

Now let’s take a look at the genome reference used in the workflow. In the get_genome_fasta and get_genome_gff3 rules we have hard-coded FTP paths to the FASTA file and the GFF3 annotation file for the genome NCTC8325. We can generalize this in a similar fashion to what we did with the get_SRA_by_accession rule. Let’s add a nested entry called genomes to the config file that will hold the genome id and FTP paths to the FASTA and GFF3 files:

+
genomes:
+  NCTC8325:
+    fasta: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz
+    gff3: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gff3/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.37.gff3.gz
+  ST398:
+    fasta: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection//staphylococcus_aureus_subsp_aureus_st398/dna/Staphylococcus_aureus_subsp_aureus_st398.ASM958v1.dna.toplevel.fa.gz
+    gff3: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gff3/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_st398//Staphylococcus_aureus_subsp_aureus_st398.ASM958v1.37.gff3.gz
+

As you can see this is very similar to what we did with sample_urls, except that we have one more nested level. Now, to access the FTP path to the FASTA file for genome id NCTC8325 we can use config["genomes"]["NCTC8325"]["fasta"].

+

Let’s now look at how to do the mapping from genome id to FASTA path in the rule get_genome_fasta. This is how the rule currently looks (if you have added the log section as previously described).

+
rule get_genome_fasta:
+    """
+    Retrieve the sequence in fasta format for a genome.
+    """
+    output:
+        "data/raw_external/NCTC8325.fa.gz"
+    log:
+        "results/logs/get_genome_fasta/NCTC8325.log"
+    shell:
+        """
+        wget -o {log} ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz -O {output}
+        """
+

We don’t want the hard-coded genome id NCTC8325, so replace that with a wildcard, say {genome_id} (remember to add the wildcard to the log: directive as well). We now need to supply the remote paths to the FASTA file for a given genome id. Because we’ve added this information to the config file we just need to pass it to the rule in some way, and just like in the get_SRA_by_accession rule we’ll use a function to do the job:

+
def get_fasta_path(wildcards):
+    return config["genomes"][wildcards.genome_id]["fasta"]
+
+rule get_genome_fasta:
+    """
+    Retrieve the sequence in fasta format for a genome.
+    """
+    output:
+        "data/ref/{genome_id}.fa.gz"
+    log:
+        "results/logs/get_genome_fasta/{genome_id}.log"
+    params:
+        fasta_path = get_fasta_path
+    shell:
+        """
+        wget -o {log} {params.fasta_path} -O {output}
+        """
+

Now change the get_genome_gff3 rule in a similar manner. Click to see the solution below if you’re having trouble.

+
+ +
+
+
def get_gff_path(wildcards):
+    return config["genomes"][wildcards.genome_id]["gff3"]
+
+rule get_genome_gff3:
+    """
+    Retrieve annotation in gff3 format for a genome.
+    """
+    output:
+        "data/ref/{genome_id}.gff3.gz"
+    log:
+        "results/logs/get_genome_gff3/{genome_id}.log"
+    params:
+        gff3_path = get_gff_path
+    shell:
+        """
+        wget -o {log} {params.gff3_path} -O {output}
+        """
+
+
+
+

Also change in index_genome to use a wildcard rather than a hard-coded genome id. Here you will run into a complication if you have followed the previous instructions and use the expand() expression. We want the list to expand to ["results/bowtie2/{genome_id}.1.bt2", "results/bowtie2/{genome_id}.2.bt2", ...], i.e. only expanding the wildcard referring to the Bowtie2 index. To keep the genome_id wildcard from being expanded we have to “mask” it with double curly brackets: {genome_id}. In addition, we need to replace the hard-coded results/bowtie2/NCTC8325 in the shell directive of the rule with the genome id wildcard. Inside the shell directive the wildcard object is accessed with this syntax: {wildcards.genome_id}, so the Bowtie2-build command should be:

+
bowtie2-build tempfile results/bowtie2/{wildcards.genome_id} > {log}
+

Note that this will only work if the {genome_id} wildcard can be resolved to something defined in the config (currently NCTC8325 or ST398). If you try to generate a FASTA file for a genome id not defined in the config, Snakemake will complain, even at the dry-run stage.

+
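
If you want to catch such typos early, one option is to constrain the genome_id wildcard to the genome ids actually defined in the config. This is not required for the tutorial, but a minimal sketch (placed in the Snakefile after the config has been read) could look like this:

+
wildcard_constraints:
+    genome_id = "|".join(config["genomes"].keys())
+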

Finally, remember that any wildcards need to be present in both the output: and log: directives? This means we have to update the log: directive in index_genome as well. The final rule should look like this:

+
rule index_genome:
+    """
+    Index a genome using Bowtie 2.
+    """
+    output:
+        expand("results/bowtie2/{{genome_id}}.{substr}.bt2",
+            substr = ["1", "2", "3", "4", "rev.1", "rev.2"])
+    input:
+        "data/ref/{genome_id}.fa.gz"
+    log:
+        "results/logs/index_genome/{genome_id}.log"
+    shadow: "minimal"
+    shell:
+        """
+        # Bowtie2 cannot use .gz, so unzip to a temporary file first
+        gunzip -c {input} > tempfile
+        bowtie2-build tempfile results/bowtie2/{wildcards.genome_id} > {log}
+        """
+

Good job! The rules get_genome_fasta, get_genome_gff3 and index_genome can now download and index any genome as long as we provide valid links in the config file.
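
You can convince yourself of this with a dry-run that targets an output file for the genome that was previously not used anywhere, for example (assuming the output paths shown above):

+
snakemake -n -c 1 data/ref/ST398.fa.gz
+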

+

However, we need to define somewhere which genome id we actually want to use when running the workflow. This needs to be done both in align_to_genome and generate_count_table. Do this by introducing a parameter in config.yml called "genome_id" (you can set it to either NCTC8325 or ST398), e.g.:

+
genome_id: "NCTC8325"
+

Now we can resolve the genome_id wildcard from the config. See below for an example for align_to_genome. Here the substr wildcard gets expanded from a list while genome_id gets expanded from the config file.

+
input:
+    "data/{sample_id}.fastq.gz",
+    index = expand("results/bowtie2/{genome_id}.{substr}.bt2",
+           genome_id = config["genome_id"],
+           substr = ["1", "2", "3", "4", "rev.1", "rev.2"])
+

Also change the hard-coded genome id in the generate_count_table input in a similar manner:

+
rule generate_count_table:
+    """
+    Generate a count table using featureCounts.
+    """
+    output:
+        "results/tables/counts.tsv",
+        "results/tables/counts.tsv.summary"
+    input:
+        bams=expand("results/bam/{sample_id}.sorted.bam",
+                    sample_id = config["sample_ids"]),
+        annotation=expand("data/ref/{genome_id}.gff3.gz",
+                    genome_id = config["genome_id"])
+    log:
+        "results/logs/generate_count_table.log"
+    shell:
+        """
+        featureCounts -t gene -g gene_id -a {input.annotation} -o {output[0]} {input.bams} 2>{log}
+        """
+

In general, we want the rules as far downstream as possible in the workflow to be the ones that determine what the wildcards should resolve to. In our case this is align_to_genome and generate_count_table. You can think of it like the rule that really “needs” the file asks for it, and then it’s up to Snakemake to determine how it can use all the available rules to generate it. Here the align_to_genome rule says “I need this genome index to align my sample to” and then it’s up to Snakemake to determine how to download and build the index.

+

One last thing is to change the hard-coded NCTC8325 in the shell: directive of align_to_genome. Bowtie2 expects the index name supplied with the -x flag to be without the “.*.bt2” suffix, so we can’t use -x {input.index}. Instead we’ll insert the genome_id directly from the config like this:

+
shell:
+    """
+    bowtie2 -x results/bowtie2/{config[genome_id]} -U {input[0]} > {output} 2>{log}
+    """
+
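
If you prefer to keep config lookups out of the shell command, an alternative (functionally equivalent) sketch is to expose the index prefix via the params: directive instead:

+
params:
+    index_prefix = "results/bowtie2/" + config["genome_id"]
+shell:
+    """
+    bowtie2 -x {params.index_prefix} -U {input[0]} > {output} 2>{log}
+    """
+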
+
+
+ +
+
+Summary +
+
+
+

Well done! You now have a complete Snakemake workflow with a number of excellent features:

+
    +
  • A general RNA-seq pipeline which can easily be reused between projects, thanks to clear separation between code and settings.
  • Great traceability due to logs and summary tables.
  • A clearly defined software environment for the workflow using Conda.
  • A neat workflow free from temporary files, thanks to temp() and shadow.
  • A logical directory structure which makes it easy to separate data and results of different software packages.
  • A project set up in a way that makes it very easy to distribute and reproduce, either via Git, Snakemake’s --archive option or a Docker image.
+
+
+
+
+

11 Reading samples from a file instead of hard-coding them

+

So far we’ve specified the samples to use in the workflow either as a hard-coded list in the Snakefile, or as a list in the configuration file. This is of course impractical for large real-world examples. Here we’ll just quickly show how you could instead supply the samples via a tab-separated file. For example, you could create a file called samples.tsv with the following content:

+
SRR935090   https://figshare.scilifelab.se/ndownloader/files/39539767
+SRR935091   https://figshare.scilifelab.se/ndownloader/files/39539770
+SRR935092   https://figshare.scilifelab.se/ndownloader/files/39539773
+

The first column holds the sample id and the second column the URL to the FASTQ file. In order to read this into the workflow we need a few lines of Python code. Since you can mix Python code with rule definitions in Snakemake, we’ll just add the following lines to the top of the Snakefile:

+
# define an empty 'samples' dictionary
+samples = {}
+# read the sample list file and populate the dictionary
+with open("samples.tsv", "r") as fhin:
+    for line in fhin:
+        # strip the newline character from the end of the line
+        # then split by tab character to get the sample id and url
+        sample_id, url = line.strip().split("\t")
+        # store the url in the dictionary with the sample id as key
+        samples[sample_id] = url
+

Now we can use the samples dictionary in the workflow. For example, to get the URL for SRR935090 we can use samples["SRR935090"].

+

For example, the get_sample_url function can now be written as:

+
def get_sample_url(wildcards):
+    return samples[wildcards.sample_id]
+

We can also use the samples dictionary in expand(), for example in the multiqc rule:

+
rule multiqc:
+    """
+    Aggregate all FastQC reports into a MultiQC report.
+    """
+    output:
+        html="results/multiqc/multiqc.html",
+        stats="results/multiqc/multiqc_general_stats.txt"
+    input:
+        expand("results/fastqc/{sample_id}_fastqc.zip", sample_id = samples.keys())
+    log:
+        "results/logs/multiqc/multiqc.log"
+    shadow: "minimal"
+    shell:
+        """
+        # Run multiQC and keep the html report
+        multiqc -n multiqc.html {input} 2> {log}
+        mv multiqc.html {output.html}
+        mv multiqc_data/multiqc_general_stats.txt {output.stats}
+        """
+

Now this depends on there being a samples.tsv file in the working directory. To make this a configurable parameter we can add it to the config file:

+
sample_list: "samples.tsv"
+

and update the code for populating the samples dictionary:

+
# define an empty 'samples' dictionary
+samples = {}
+# read the sample list file and populate the dictionary
+with open(config["sample_list"], "r") as fhin:
+    for line in fhin:
+        # strip the newline character from the end of the line
+        # then split by tab character to get the sample id and url
+        sample_id, url = line.strip().split("\t")
+        # store the url in the dictionary with the sample id as key
+        samples[sample_id] = url
+

This way, anyone can take our Snakefile and just update the path to their own sample_list using the config file.
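
If your sample list grows to contain more columns (e.g. sample metadata, or separate URLs for forward and reverse reads) it may be more convenient to read it with pandas. The sketch below assumes that pandas is available in your Snakemake environment and that the file has a header line with the columns sample_id and url:

+
import pandas as pd
+
+# read the tab-separated sample sheet, indexing rows by the sample_id column
+df = pd.read_csv(config["sample_list"], sep="\t", index_col="sample_id")
+# turn the 'url' column into a dictionary, equivalent to the 'samples' dict above
+samples = df["url"].to_dict()
+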

+
+
+
+ +
+
+Quick recap +
+
+
+

In this section we’ve learned:

+
    +
  • How to generalize a Snakemake workflow.
+
+
+
+
+

12 Extra material

+

If you want to read more about Snakemake in general you can find several resources here:

+ +
+

12.1 Using containers in Snakemake

+

Snakemake also supports defining an Apptainer or Docker container for each rule (you will have time to work on the Containers tutorial later during the course). Analogous to using a rule-specific Conda environment, specify container: "docker://some-account/rule-specific-image" in the rule definition. Instead of a link to a container image, it is also possible to provide the path to a *.sif file (= a Singularity image file). When executing Snakemake, add the --software-deployment-method apptainer (or the shorthand --sdm apptainer) flag to the command line. For the given rule, Snakemake will then create an Apptainer container on the fly from the image or file provided in the rule definition, and the rule will be run inside this container.

+

You can find pre-made Apptainer or Docker images for many tools on https://biocontainers.pro/ (bioinformatics-specific) or on https://hub.docker.com/.

+

Here is an example of a rule and its execution:

+
rule align_to_genome:
+    output:
+        temp("results/bam/{sample_id,\w+}.bam")
+    input:
+        fastq = "data/{sample_id}.fastq.gz",
+        index = expand("results/bowtie2/{genome_id}.{substr}.bt2",
+            genome_id=config["genome_id"],
+            substr=["1", "2", "3", "4", "rev.1", "rev.2"])
+    log:
+        expand("results/logs/align_to_genome/{{sample_id}}_{genome_id}.log",
+            genome_id = config["genome_id"])
+    container: "docker://quay.io/biocontainers/bowtie2:2.5.0--py310h8d7afc0_0"
+    shell:
+        """
+        bowtie2 -x results/bowtie2/{config[genome_id]} -U {input.fastq} > {output} 2>{log}
+        """
+

Start your Snakemake workflow with the following command:

+
snakemake --software-deployment-method apptainer
+

Feel free to modify the MRSA workflow according to this example. Since Apptainer is container software developed for HPC clusters (the Mac version, for example, is still in beta), running your updated Snakemake workflow with Apptainer locally on your own computer might not work. In the next section we explain how you can run Snakemake workflows on UPPMAX, where Apptainer is pre-installed.

+
+
+

12.2 Running Snakemake workflows on HPC clusters

+

If you need to run a Snakemake workflow on a high-performance computing (HPC) cluster you have a wide range of options at your disposal. Via the plugin catalog you can find plugins that will add support for various HPC schedulers to Snakemake.

+

Here we will focus on how to run Snakemake workflows on clusters with SLURM, a workload manager commonly used on HPC clusters in Sweden such as Rackham, Tetralith and Dardel.

+
+
+
+ +
+
+Tip +
+
+
+

When running on remote clusters we highly recommend using a session manager like tmux or screen, so that your workflow can keep running in a background session while you do other things on the cluster or even log out.

+
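
For example, with tmux (if it is available on the cluster) you could do something like this:

+
# start a named session
+tmux new -s snakemake_run
+# ...run your Snakemake workflow inside the session...
+# detach with Ctrl-b followed by d, log out if you like, then re-attach later with:
+tmux attach -t snakemake_run
+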
+
+
+

12.2.1 Option 1: Run the entire workflow as a single job

+

For short workflows with only a few rules that need the same compute resources in terms of CPU (cores) and memory, you can submit the entire workflow as a job directly to the SLURM scheduler, or start an interactive job (in your tmux or screen session) and run your Snakemake workflow there as you would on your local machine. Make sure to give your job enough time to finish running all rules of your Snakemake workflow.

+

If you choose this option, you don’t need to install anything from the plugin catalog. However, your workflow may not run as efficiently as it could if you were to add SLURM support to Snakemake.

+
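
As an illustration, a batch script for this option could look something like the sketch below. The account, partition and resource values are placeholders that you need to adapt to your cluster, and you may need to activate the Conda environment containing Snakemake inside the script:

+
#!/bin/bash
+#SBATCH -A naiss-2023-01-001   # your SLURM account/project id (placeholder)
+#SBATCH -p shared              # partition name (cluster-specific)
+#SBATCH -c 4                   # cores for the whole workflow
+#SBATCH -t 04:00:00            # enough time for all rules to finish
+
+# activate the environment with Snakemake here, if needed, then run the workflow
+snakemake -c 4
+

You would then submit the script with sbatch, e.g. sbatch run_workflow.sh if that is what you named it.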
+
+

12.2.2 Option 2: Use built-in SLURM support

+

For workflows with long run times and/or where the rules require different compute resources, Snakemake comes with built-in functionality for interacting with the SLURM workload manager, sending each rule as a separate job to the SLURM queue and tracking the status of each job.

+

In this case, you can start the workflow on the login node and let it run there until all jobs have finished. Given that workflows often consist of many rules, some of which may be highly resource demanding, this is the option we recommend when running most Snakemake workflows on HPC clusters.

+

To add SLURM support to Snakemake you first need to install the SLURM plugin from the plugin catalog. This can be done with conda:

+
conda install -c conda-forge snakemake-executor-plugin-slurm
+

Once installed, adding the --executor slurm flag to your Snakemake command line call will enable the plugin. You also need to specify how many jobs Snakemake can submit to the SLURM queue at the same time with the -j flag. For example, to allow up to 100 jobs to be put into the queue at any given time, you would run Snakemake with the following command:

+
snakemake --executor slurm -j 100 <other flags>
+
+
+
+

12.3 Specifying resources for SLURM

+

Depending on the cluster you are using, you will need to specify some resource requirements for the rules in your workflow, such as the number of CPUs, memory, runtime and account id. This can be done either:

+
    +
  1. directly on the command line with the --default-resources flag, which sets default resource settings for all rules,
  2. in the rule definition of your workflow using the resources: directive, or
  3. in a configuration profile, a folder with a config.yaml file that contains the resource settings.
+

You can also use a combination of these methods. For example, the SLURM account id (e.g. naiss-2023-01-001), which will most likely be the same for all rules, can be set with --default-resources:

+
snakemake --executor slurm -j 100 --default-resources slurm_account=naiss-2023-01-001
+

Rule-specific resources such as runtime, memory and number of CPUs can be set in the rule definition, for example:

+
rule testrule:
+    output:
+        "results/output.txt"
+    resources:
+        runtime = 60,
+        mem_mb = 16000,
+        cpus_per_task = 4
+    shell:
+        """
+        uname -a > {output}
+        """
+

This rule uses the standard resource runtime to set the maximum allowed time (in minutes) for the rule, sets the memory requirement with mem_mb and the number of requested CPUs with cpus_per_task. In this example the rule will have a time limit of 60 minutes and will request 16 GB of RAM and 4 CPUs.

+
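
Resource values can also be referenced from within the rule itself, and the threads: directive is handy for tools that accept a thread/CPU argument. A small, purely hypothetical sketch (some_tool is a stand-in for whatever program the rule runs):

+
rule some_rule:
+    output:
+        "results/output.txt"
+    threads: 4
+    resources:
+        runtime = 60,
+        mem_mb = 16000
+    shell:
+        """
+        some_tool --threads {threads} --mem-mb {resources.mem_mb} > {output}
+        """
+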

Some clusters also require you to specify the partition you want to run your job on. The partition name will differ between clusters, for example the Rackham cluster uses core and node partitions, while Dardel uses e.g. shared and main. See the documentation for the cluster you are using for more information.

+

The partition can be set with the slurm_partition resource, for example like so:

+
rule testrule:
+    output:
+        "results/output.txt"
+    resources:
+        runtime = 60,
+        mem_mb = 16000,
+        cpus_per_task = 4,
+        slurm_partition = "shared"
+    shell:
+        """
+        uname -a > {output}
+        """
+

To make it easy to adapt your workflow to different compute clusters it is recommended to define resource settings in a configuration profile. A configuration profile is a folder with a config.yaml file that contains values for Snakemake command line arguments, allowing you to modify the behavior of Snakemake without changing the workflow code. For example, you could create a dardel folder (e.g. in the root of your workflow) with a config.yaml file that contains the following:

+
executor: "slurm"
+jobs: 100
+default-resources:
+  slurm_account: "naiss-2023-01-001"
+  slurm_partition: "shared"
+  mem_mb: 16000
+  cpus_per_task: 4
+  runtime: 60
+

This yaml-formatted file contains Snakemake command line arguments that will be used when running the workflow. You can then run Snakemake with the --profile flag pointing to the folder containing the config.yaml file:

+
snakemake --profile dardel
+

This greatly simplifies running the workflow on different clusters, and makes the command line call much more succinct.

+

To set rule-specific resources in the configuration profile, you can add a set_resources: section to the config.yaml file:

+
executor: "slurm"
+jobs: 100
+default-resources:
+  slurm_account: "naiss-2023-01-001"
+  slurm_partition: "shared"
+  mem_mb: 16000
+  cpus_per_task: 4
+  runtime: 60
+set_resources:
+  index_genome:
+    runtime: 240
+    mem_mb: 32000
+    cpus_per_task: 8
+  align_to_genome:
+    runtime: 120
+    mem_mb: 24000
+    cpus_per_task: 6
+

In this example, the index_genome rule will have a runtime of 240 minutes and request 32 GB of RAM and 8 CPUs, while the align_to_genome rule will have a runtime of 120 minutes and request 24 GB of RAM and 6 CPUs. Both rules will use the slurm_account and slurm_partition settings from the default-resources section, unless these are overridden in the rule-specific settings.

+

You can still define resources in the rule definition, but the values in the configuration profile will take precedence.

+

Now, when you run your Snakemake workflow with:

+
snakemake --profile dardel
+

Snakemake will submit each job to the SLURM queue and inform you about both the local jobid and the SLURM jobid by writing something similar to this to your terminal:

+
Job 0 has been submitted with SLURM jobid 37099380 (log: .snakemake/slurm_logs/rule_name/37099380.log).
+

In this example the log output from the job will be in .snakemake/slurm_logs/rule_name/37099380.log.
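
While the workflow is running you can keep an eye on your jobs with the usual SLURM commands (assuming a standard SLURM setup), for example:

+
# list your queued and running jobs
+squeue -u $USER
+# show the status of a specific job, e.g. the one from the example above
+sacct -j 37099380
+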

+

You can read more details about running Snakemake on compute clusters in the Snakemake docs.

+ + +
+
+ +
+ +
+ + + + + + + \ No newline at end of file diff --git a/search.json b/search.json index 14c02dc2..bc78337d 100644 --- a/search.json +++ b/search.json @@ -41,6 +41,62 @@ "section": "5 Docker", "text": "5 Docker\nDocker is infamous for quickly taking up huge amounts of space, and some maintenance is necessary every now and then. Here is how to uninstall Docker completely. Let’s start by removing individual images and containers:\n# Remove unused images\ndocker image prune\n\n# Remove stopped containers\ndocker container prune\n\n# Remove unused volumes (not used here, but included for reference)\ndocker volume prune\n\n# Stop and remove ALL containers\ndocker container rm $(docker container ls -a -q)\n\n# Remove ALL images\ndocker image rm $(docker image ls -a -q)\nRemoving Docker itself works differently on the three operating systems, which is described below:\n\n5.1 MacOS\nClick the Docker icon in the menu bar (upper right part of the screen) and select “Preferences”. In the upper right corner, you should find a little bug icon. Click on that icon and select “Reset to factory defaults”. You may have to fill in your password. Then select “Uninstall”. Once it’s done uninstalling, drag the Docker app from Applications to Trash.\n\n\n5.2 Linux\nIf you’ve installed Docker with apt-get, uninstall it like this:\napt-get purge docker-ce\nImages, containers, and volumes are not automatically removed. To delete all of them:\nrm -rf /var/lib/docker\n\n\n5.3 Windows\nUninstall Docker for Windows (on Windows 10) or Docker Toolbox (on Windows 7) via Control Panel > Programs > Programs and Features. Docker Toolbox will also have installed Oracle VM VirtualBox, so uninstall that as well if you’re not using it for other purposes." }, + { + "objectID": "pages/quarto.html", + "href": "pages/quarto.html", + "title": "Working with Quarto", + "section": "", + "text": "The Quarto format (.qmd) is a multi-functional format, which is especially useful for scientific coding and analyses. Quarto documents can be used both to save and execute code as well as generating reports in various output formats. This is done by mixing markdown and so-called code chunks in the same document (we have course materials for markdown if you are unfamiliar with this format). The code itself as well as the output it generates can be included in the final report. Not only can Quarto work great for scientific coding, but can also be used for things such as presentation and websites - this entire workshop website is, in fact, created using only Quarto!\nQuarto makes your analysis more reproducible by connecting your code, figures and descriptive text. You can use it to make reproducible reports, rather than e.g. copy-pasting figures into a Word document. You can also use it as a notebook, in the same way as lab notebooks are used in a wet lab setting (or as we utilise Jupyter notebooks in the tutorial after this one). Quarto itself does not require any particular programming language to be installed - any language you want to use can be installed separately. The currently supported languages are R, Python, Julia and Observable. Quarto is fully compatible with both R Markdown and Jupyter documents.\nThis tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to set it up if you haven’t done so already. 
Place yourself in the workshop-reproducible-research/tutorials/quarto/ directory, activate your quarto-env Conda environment and start your text editor or IDE of choice.\n\n\n\n\n\n\nA note on R Markdown\n\n\n\nQuarto is an evolution of the R Markdown format, which was previously used in this course. While R Markdown is a widely-used and excellent software for code and reports, Quarto is most easily thought of as “R Markdown 2.0”. If you’re familiar with R Markdown, you will find Quarto to be highly similar. The creators of both Quarto and R Markdown (Posit) have stated that R Markdown is not going to be deprecated, but most newer features will only come to Quarto. This means that if you’ve used R Markdown in the past now is a good time to make the switch, but you don’t have to. You can check out the Quarto website for more in-depth discussions regarding Quarto/R Markdown (dis-)similarities." + }, + { + "objectID": "pages/quarto.html#introduction", + "href": "pages/quarto.html#introduction", + "title": "Working with Quarto", + "section": "", + "text": "The Quarto format (.qmd) is a multi-functional format, which is especially useful for scientific coding and analyses. Quarto documents can be used both to save and execute code as well as generating reports in various output formats. This is done by mixing markdown and so-called code chunks in the same document (we have course materials for markdown if you are unfamiliar with this format). The code itself as well as the output it generates can be included in the final report. Not only can Quarto work great for scientific coding, but can also be used for things such as presentation and websites - this entire workshop website is, in fact, created using only Quarto!\nQuarto makes your analysis more reproducible by connecting your code, figures and descriptive text. You can use it to make reproducible reports, rather than e.g. copy-pasting figures into a Word document. You can also use it as a notebook, in the same way as lab notebooks are used in a wet lab setting (or as we utilise Jupyter notebooks in the tutorial after this one). Quarto itself does not require any particular programming language to be installed - any language you want to use can be installed separately. The currently supported languages are R, Python, Julia and Observable. Quarto is fully compatible with both R Markdown and Jupyter documents.\nThis tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to set it up if you haven’t done so already. Place yourself in the workshop-reproducible-research/tutorials/quarto/ directory, activate your quarto-env Conda environment and start your text editor or IDE of choice.\n\n\n\n\n\n\nA note on R Markdown\n\n\n\nQuarto is an evolution of the R Markdown format, which was previously used in this course. While R Markdown is a widely-used and excellent software for code and reports, Quarto is most easily thought of as “R Markdown 2.0”. If you’re familiar with R Markdown, you will find Quarto to be highly similar. The creators of both Quarto and R Markdown (Posit) have stated that R Markdown is not going to be deprecated, but most newer features will only come to Quarto. This means that if you’ve used R Markdown in the past now is a good time to make the switch, but you don’t have to. You can check out the Quarto website for more in-depth discussions regarding Quarto/R Markdown (dis-)similarities." 
+ }, + { + "objectID": "pages/quarto.html#the-basics", + "href": "pages/quarto.html#the-basics", + "title": "Working with Quarto", + "section": "2 The basics", + "text": "2 The basics\nLet’s start with creating basic Quarto document that we can work with.\n\n2.1 Creating Quarto documents\nQuarto documents are just plain text files with the .qmd extension. Create a new file called e.g. quarto-tutorial.qmd and copy the following into it:\n---\ntitle: \"Untitled Quarto Document\"\nauthor: \"Jane Doe\"\nformat: html\n---\nThis is a so-called YAML header, which is where we specify the general settings of the document in the form of key: value. The title and author are just what they sound like, while the format field specifies what type of output you want the final report to be in (alternatives include pdf, revealjs and many others). Here we have specified that we want HTML output, which is perhaps the most useful for scientific computing.\n\nChange the title to My first Quarto document and the author to your name.\n\nLet’s add some actual content to the document, starting with some basic markdown:\n\nAdd some text into your Quarto document (including an empty line between the YAML header and the text), e.g. the following:\n\nThis is my first Quarto document!\n\n# This is a header\n\nThis is where I'll soon add some *code* related to the first header.\nLet’s see what this document looks like when it’s rendered into HTML by Quarto:\n\nGo to the command line and type quarto render quarto-tutorial.qmd.\n\n\n\n\n\n\n\nRendering\n\n\n\nIf you’re using e.g. RStudio or VSCode to edit your Quarto document you might have access to a render button, which means you don’t have to run the above command from the command line if you prefer.\n\n\nOpen your new quarto-tutorial.html file that was created and see what it looks like. It’s only markdown content so far, so let’s add some R code using a code chunk:\n```{r}\nSys.Date()\n```\nNotice that we delimit the code chunk from the rest of the document’s contents using three backticks (```) and specify the R language using curly brackets ({r}). The code itself just prints the current date.\n\nRender the document again and see what it looks like.\n\nYou can also name chunks by adding it after the language:\n```{r}\nSys.Date()\n```\nThis is useful for debugging when something has gone wrong, since it’ll be easier to see exactly which code chunk an error happened (instead of just showing the chunk as a number).\nWe can also get in-line code using {r} <R CODE>, like so:\nThe current date is `{r} Sys.Date()`.\n\nAdd the example above and render the document again to make sure it worked.\n\n\n\n2.2 Previewing documents\nQuarto has a highly useful command for when you’re working on a document: preview. It’s essentially a live preview of the document you’re working on that will automatically render when you introduce changes to the document.\n\nType quarto preview quarto-tutorial.qmd in the command line.\n\nYour default web browser should now have opened a new window with your rendered document, while your command line should say something like the following:\nWatching files for changes\nBrowse at http://localhost:4175/\nYou can’t type new commands at the moment, because the Quarto Preview command is still running - it’s watching for any new changes to the Quarto document you specified.\n\nChange or add some markdown text to your Quarto document, e.g. This is a code chunk instead of the previous text under the first header. 
Make sure you save the document.\n\nThe HTML document in your browser should have updated to reflect your newest changes automatically. Previewing documents is great when you want to have continuous feedback to the changes you make and can make the process of writing more seamless, since you don’t have to manually render all the time. Previewing will still render the entire document, however, meaning that if you have some heavy computations you might not want to re-render on every single save. For those cases you might instead prefer to stick with manual rendering when you are satisfied with multiple changes. You can abort a preview like any on-going command, e.g. using Ctrl-C.\nIn the rest of the tutorial it’s up to you whether you want to use preview or not - the tutorial will just mention when it’s time to render, you decide how that’s done.\n\n\n2.3 Rendering to PDF\nSo far we’ve only rendered to HTML, but sometimes you prefer a PDF. This entails changing the format option in the YAML header:\n\nChange the format to pdf in the header and render your document.\n\nYou can add any raw LaTeX commands you want to your document when you’re rendering to PDF, e.g. \\footnotsize to change the font size. You also have LaTeX-specific settings, such as setting the geometry for the whole document or specifying a citation method. While the details of LaTeX are outside the scope of this course, it’s useful to be aware of this functionality of Quarto so that you may use it if you already know LaTeX or if you want to learn it.\nSwitch back to HTML rendering before you move on.\n\n\n2.4 Languages\nThe examples so far have been using R, but we could just as easily have used Python. All we have to do is to change our code chunk to specify {python} as language and its content to be the equivalent Python code:\n```{python}\nfrom datetime import date\nprint(date.today())\n```\n\nChange the code chunk to the above Python chunk instead and render your document again.\n\n\n\n\n\n\n\nA note on Python in-line code\n\n\n\nQuarto support for in-line python code was added in version 1.4, so if you’re using an older version of Quarto simply remove the in-line code example. You can check your quarto version by running quarto --version on the commandline. As of this writing, the 1.4 version of Quarto can be obtained from the pre-release page: https://quarto.org/docs/download/prerelease\nIf you’re using Quarto version 1.4 or higher and want to try the in-line code example above for Python, change the line to:\nThe current date is `{python} date.strftime(date.today(), format=\"%Y-%m-%d\")`\n\n\nSo far we’ve had Quarto automatically determine which language engine should be used, which it detects through the code chunks we’ve written. We can also do this explicitly by adding engine: knitr or engine: jupyter to the YAML header.\n\nExplicitly add engine: jupyter to your YAML header and render the document.\n\n\n\n\n\n\n\nMaking sure your Jupyter engine is recognised\n\n\n\nQuarto attempts to identify a suitable Jupyter engine for your system when you include Python code chunks. However, if you want to use Jupyter available in a specific conda environment (e.g. your quarto-env environment) you need to take some extra steps. Please visit this link and follow steps 1-4. 
In the final step, check for the name of the kernel matching your quarto-env conda environment, e.g.\n$ jupyter kernelspec list\nAvailable kernels:\n...\n conda-env-quarto-env-py /Users/<your-user-name/Library/Jupyter/kernels/conda-env-quarto-env-py\nUsing the example output from above we can add the following to the YAML header of our Quarto document:\njupyter:\n kernelspec:\n display_name: Python 3\n language: python\n name: conda-env-quarto-env-py\n\n\nIt can be useful to explicitly set the language for the document, as it makes it clearer from just the YAML header what language will be used. There are also more language-related options for Quarto, but we’ll save those for later in the tutorial.\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section you learned how to create, edit and render basic Quarto documents using different languages." + }, + { + "objectID": "pages/quarto.html#code-chunks", + "href": "pages/quarto.html#code-chunks", + "title": "Working with Quarto", + "section": "3 Code chunks", + "text": "3 Code chunks\nSometimes you want to add chunk options to the code chunks in your Quarto documents. They are also in YAML format and are prefixed with a special type of comment (#|). It can look something like this:\n```{python}\n#| echo: false\nfrom datetime import date\nprint(date.today())\n```\n\nAdd the chunk option above to your document and render the document again.\n\nNotice how we no longer see the code itself, just the output? This is because the echo option specifies just that: whether we see the code or not. There are a number of such chunk options that are useful to know about:\n\n\n\n\n\n\n\nChunk option\nEffect\n\n\n\n\necho\nInclude the chunk code in the output.\n\n\neval\nEvaluate the code chunk.\n\n\noutput\nInclude the results of executing the code in the output.\n\n\nwarning\nInclude warnings in the output.\n\n\nerror\nInclude errors in the output (note that this implies that errors executing code will not halt processing of the document).\n\n\ninclude\nPrevent both code and output from being included.\n\n\n\n\nCheck what happens if you change echo: False to eval: False.\n\nNow the code in the code chunk is not run, which means that if you previously added the python inline code it will no longer work because it depends on date from the datetime module that we import in the code chunk. Remove the inline code snippet if you added it. Then try rendering again. Now you should see the code itself but it won’t be run and therefore has no output.\n\n3.1 Figure options\nThere are also options related to figures, but for that we need to actually have some code that produces a figure.\n\nChange the YAML header to use R instead of Python, remove the Python code chunk and replace it with the following (don’t worry if you don’t understand the R code itself, it’s just as example):\n\n```{r}\nlibrary(\"ggplot2\")\nlibrary(\"palmerpenguins\")\ndata(penguins, package = \"palmerpenguins\")\nggplot(penguins, aes(x = bill_length_mm,\n y = body_mass_g,\n colour = species)) +\n geom_point(size = 2) +\n theme_bw() +\n labs(x = \"Bill length (mm)\",\n y = \"Body mass (g)\",\n colour = \"Species\") +\n ggtitle(\"Penguin weight and bill length\") +\n theme(plot.title = element_text(hjust = 0.5)) +\n scale_colour_manual(values = c(\"#c1dea0\", \"#85be42\", \"#425f21\"))\n```\nWhen you’ve rendered the document you should see both the code and a figure using the Palmer Penguins dataset. 
You should also see a warning along the lines of Removed 2 rows containing missing values.\n\nSuppress the warning by adding #| warning: false as a chunk option and render.\n\nThere are two chunk options related to figure sizes: fig-width and fig-height (expressed in inches). These allow you to experiment with your figures and make them look the way you want.\n\nAdd both the fig-width: 10 and fig-height: 5 chunk options and render.\n\n\n\n\n\n\n\nNote\n\n\n\nThese two chunk options are only available when using the Knitr engine, not for Jupyter. There is a way to set these for the whole document with Jupyter, though, which we’ll talk more about in the next section of the tutorial.\n\n\nYou can also add captions and alt text using fig-cap and fig-alt, respectively.\n\nAdd a suitable caption and alt text to the figure and render.\n\nIf you want to place the caption in the margin of your document you can use the cap-location chunk option.\n\nAdd cap-location: margin to your chunk options and render.\n\n\n\n\n\n\n\nNote\n\n\n\nOn some quarto versions the cap-location: option may not work as expected. If you experience this, try also adding #| label: fig-penguins to the chunk.\n\n\n\n\n3.2 Cross-references\nA convenient way to be able to refer to figures in text is by adding a figure label, which will automatically add a figure number before your caption.\n\nAdd a suitable label, e.g. label: fig-penguins to the chunk options.\n\nCross-references use the @ symbol and the corresponding label. You can thus write some markdown outside of a code chunk and refer to e.g. @fig-penguins, as per the example here. This is extremely useful if you’re writing a paper or a report where you want to refer to figures and content in the markdown text. Quarto even adds a clickable link to the figure itself as well!\n\n\n3.3 Sub-figures\nIt’s also possible to create sub-figures using Quarto, instead of using whatever plotting library that your created the figures with.\n\nAdd the following (almost identical) code at the bottom of the chunk you already have:\n\nggplot(penguins, aes(x = bill_depth_mm,\n y = body_mass_g,\n colour = species)) +\n geom_point(size = 2) +\n theme_bw() +\n labs(x = \"Bill depth (mm)\",\n y = \"Body mass (g)\",\n colour = \"Species\") +\n scale_colour_manual(values = c(\"#c1dea0\", \"#85be42\", \"#425f21\"))\n\nAlso add the following to the chunk options:\n\n#| fig-subcap:\n#| - Bill length vs. body mass\n#| - Bill depth vs. body mass\nYou should now see that we have two figures with separate sub-captions as well as the overall figure caption we previously added. We can also control the layout of these figures using the layout-ncol chunk option.\n\nAdd a layout-ncol: 2 chunk option and render the document.\n\nWe now have a different, two-column layout instead, but whether you prefer this or just a one-column layout is up to you.\n\n\n3.4 Tables\nTables work much in the same way as figures. 
It might, in our example, be nice to add a table with the data we previously plotted.\n\nAdd the following code chunk to your document and render it:\n\n```{r}\n#| label: tbl-penguins\n#| tbl-cap: Palmer penguins bill length, width and body mass.\n#| tbl-cap-location: margin\nknitr::kable(\n penguins[1:10, c(\"species\", \"bill_length_mm\", \"bill_depth_mm\", \"body_mass_g\")],\n col.names = c(\"Species\", \"Bill length (mm)\", \"Bill depth (mm)\", \"Body mass (g)\")\n)\n```\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section you learned several chunk, figure and table options, how cross-referencing works and how to add sub-figures." + }, + { + "objectID": "pages/quarto.html#document-options", + "href": "pages/quarto.html#document-options", + "title": "Working with Quarto", + "section": "4 Document options", + "text": "4 Document options\nSo far we’ve mostly worked with chunk options, which are specific to the chunk they appear in. You can set many of these at the global document level, however, and there are also some options specifically for tailoring the document as a whole, regardless of chunk content.\nWe’ve already looked at some global options, such as title, author, format and engine. Something that would go nicely with the first two is the date option. You could just write the actual date if you like, or you can use the today option:\n\nAdd the following to the options: date: today\n\n\n4.1 Code folding\nA useful option we haven’t touched already is the code-fold option. This and similar global options are specified nested inside the format option, like so:\nformat:\n html:\n code-fold: true\n\nAdd the code-fold option to your document and render it.\n\nThis can be a nice default to use in scientific reports, as it hides the code by default but is always there for those who want to inspect it. You can also use the code-summary chunk option to specify a different text to show with the folded code instead of the default Code, e.g. code-summary: Click to show code.\nIf you want to add the code-summary option to all chunks you can add the following to the yaml header:\nlanguage:\n code-summary: Click to show code\nYou can also add the code-tools option, which will add a drop-down menu to toggle visibility of all code as well as the ability to view the source of the document.\n\nAdd the code-tools: true option and render the document.\n\n\n\n4.2 Table of contents\nAnother useful document option is to add a table of contents, which can be done with the toc option. This will automatically populate the table of contents using the headers from your document.\n\nAdd some more headings and/or sub-headings to your document.\nAdd the toc: true option to the html format and render.\n\nThe table of contents is to the right of the document by default, but you can change it using toc-location. The toc-depth allows you to control how many sub-heading levels are included in the table of contents.\n\nAdd toc-location: left and toc-depth: 2 to your document and render it.\n\nHaving the table of contents on the left can be useful if you are using the margins for something, such as we are doing in this tutorial. You can similarly add section numbering using number-sections and number-depth. Smooth scrolling is not enabled by default, but you can add it using smooth-scroll: true. 
You can change the title of the table of contents using toc-title.\n\nAdd section numbers, depth, smooth scrolling and a different table of contents title to your document and render it.\n\n\n\n4.3 Themes\nQuarto has a lot of themes available for it.\n\nAdd theme: flatly under the HTML format option and render.\n\nIf you want to get real advanced you can play around with lots of details regarding the themes and adjust as you see fit, or even just create your own theme. This is a bit too advanced to go through here, but you can read about it more in the official documentation.\n\n\n4.4 Global chunk options\nThe chunk options we learnt about in the previous section of this tutorial can also be specified on the global document level. Instead of specifying e.g. warning: false or fig-height: 5 in individual chunks we can add it to the main YAML header in the same manner as for e.g. code folding or table of contents. We’ll still have to specify options like labels or captions at the chunk-level, though.\n\nAdd warning: false to your document header and remove it from the penguin figure chunk you already have.\n\n\n\n4.5 Embedding HTML resources\nWhen rendering HTML documents you get any figures and other resources in a <document-name>_files/ directory, which is not always desirable. It’s easier to move the HTML around if all figures etc. are embedded directly in the HTML itself, which can be done by specifying embed-resources: true in the HTML format options. This option is false by default, meaning that you’ll also have to include the previously mentioned directory if you want to share the HTML with anybody.\n\nRemove the <document-name>_files/ directory, refresh the rendered document and see what happens.\nAdd the embed_resources option and render your document again.\n\nWhat happened first is that your figures should have disappeared when you deleted the resources directory. Embedding resources and rendering again should not re-create this directory, so now you’ll just have a stand-alone HTML file that is more portable than before.\n\n\n4.6 Multiple formats\nSo far we’ve mostly been working with HTML output, but you don’t need to limit yourself to a single output format if you don’t want to.\n\nAdd the docx: default line in the format: part of your YAML header and render your document.\n\nYou should have gotten two separate output files now: a HTML and a DOCX (Word) file. You can specify further options for any of the formats you include, instead of just using the default settings as in this example.\n\nRender your document again, but supply the --to html flag.\n\nThis will only render to the specified output format, which is highly useful when you want to write a Quarto document with more than one format but not always render them all.\n\n\n4.7 Parameters\nThe last document-wide option we’ll touch on is parameters. This is useful for when you want to be able to run the same document with different parameters or options for some computations. How parameters are specified depends on which engine you’re using. With Knitr you can specify parameters using the params option:\n\nAdd the following code to your YAML header:\n\nparams:\n point_size: 2\n\nAlso change the hard-coded geom_point(size = 2) to geom_point(size = params$point_size) in the two ggplot calls in the first code chunk.\n\nWe have thus specified a parameter called point_size in the YAML header and referred to it in the code using params$point_size. 
You can now change this parameter at run-time by supplying the -P <param>:<value> (or --execute-param) flag to quarto render.\nNotice that this won’t work if you want to use a parameter to control e.g. a chunk option like layout-ncol. For this we need to use an in-line code expression: #| layout-ncol: !expr params$ncols.\n\nAdd a parameter for the layout-ncol chunk option to the YAML header\nAlso add the layout-ncol chunk option to the figure chunk using the syntax above and render to make sure it works.\n\nNote that to modify multiple parameters at run-time you have to use the -P param:value flag multiple times, like so:\nquarto render quarto-tutorial.qmd -P point_size:4 -P ncols:1\nIf you’re using the Jupyter engine you can instead specify parameters by designating a single cell as a parameter cell, like so:\n```{python}\n#| tags: [parameters]\npoint_size = 2\n```\nYou can also specify parameters in a params.yml file and instruct quarto to use them with the --execute-params params.yml flag when rendering. Note that the parameters must be defined in the document (in the YAML header when using the knitr engine, or in a cell when using the jupyter engine). Pointing quarto to a params.yml file with --execute-params only overrides them when rendering.\nUsing parameters is extremely useful when you’re using a workflow manager system (e.g. Snakemake or Nextflow), since you can easily specify sample-specific parameters from the command line directly from your workflow manager.\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this sections we covered a number of document-wide options, including code-folding, table of contents, theming, HTML portability, using multiple output formats and parameters." + }, + { + "objectID": "pages/quarto.html#presentations", + "href": "pages/quarto.html#presentations", + "title": "Working with Quarto", + "section": "5 Presentations", + "text": "5 Presentations\nQuarto can also be used to create presentations in multiple formats such as reveal.js (HTML), beamer (PDF) and pptx (PowerPoint) - the most powerful of these formats by far is the first one. Creating presentations with Quarto is quite similar to creating general Quarto documents, with some added features to keep in mind.\n\n5.1 Slides\nThe first thing that’s needed for creating a presentation is deciding what constitutes a slide. The default is that slides are delimited by a document’s header levels.\n\nRender your document using the --to revealjs flag and open it.\n\nYou should now have the same document we’ve been working on for this tutorial in presentation format! You can step through the slides using the arrow keys, press F to go into full-screen mode, S to view speaker notes, M for the menu (you can also click in the lower left corner to get this menu) and ESC to go back.\nIf you’ve followed along you should have one level-1 header (#) and two level-2 headers (##). Notice that the level-1 header here will render as a blank page with just the header content on it, while the level-2 headers will render as normal slide headers. This all looks quite nice, and we didn’t even have to change a thing! Disregard that the table on the last slide doesn’t fit for now, we’ll get back to it later. 
Another method of delimiting slides is using a horizontal rule, ---, which allows you more fine-grained control over slides and their content (and is especially useful if you want to have a slide without a title).\n\n\n5.2 Divisions\nThere are many ways you can add presentation-specific content to your slides, some of which you’d recognise from e.g. PowerPoint functionality.\nSo called “divisions” or “divs” allow you to control the appearance of content in your slides.\nLet’s fix that issue with the table that was larger than the page. The problem here is one of content overflow, which can be fixed by adding a special {.smaller} div.\n\n\n\n\n\n\nNote\n\n\n\nDivs do not work for level1 headings (starting with a single #).\n\n\n\nAdd the {.smaller} div to the table header (it should read something like ## A table {.smaller}) and render.\n\nThat should have automatically re-sized the table to fit into the slide. Another way to solve this is to make slide content scrollable.\n\nChange the {.smaller} div to a {.scrollable} div and render.\n\nInstead of re-sizing the table we now get the ability to scroll down it instead; whichever solution you prefer is up to you.\nAdding divisions of various types like this is a common thing for Quarto presentations. Another common presentation-functionality is incremental lists, which can also be achieved with divisions. When adding a division to slide content we specify the division’s content in a manner similar to a code chunk, like in the following example:\n## Penguin species\n\n::: {.incremental}\n - Adelie\n - Chinstrap\n - Gentoo\n:::\n\nAdd the code above to your document and render it.\n\nStepping through incremental content works the same as for stepping through slides, i.e. using the arrow keys.\n\nRender your document to html instead of revealjs.\n\nNotice that Quarto rendered the HTML document just fine, even though you now have some presentation-specific code? This allows you to switch between the formats on-demand without having much overhead or format-specific code, which is great when you want to present your work without having to whip out a full-fledged presentation and all the work that goes into that!\nThere are other useful divisions as well, including {.notes} (speaker notes), {.aside} (additional commentary similar to footnotes), {.footer} (slide footers), which you can add in the same way as we did for the incremental list above.\n\nPick one of the above-mentioned divisions to add to your presentation and render it.\n\n\n\n\n\n\n\nNote\n\n\n\nThe notes and footer divisions will appear as normal Markdown text when rendering to HTML, while asides will appear in the margin. These divisions thus represent cases you might want to avoid if you want to be completely format-agnostic.\n\n\n\n\n5.3 Presentation options\nJust like the other formats you can specify presentation-specific options at the document-level using the YAML header. You could, for example, add the {.scrollable} or {.smaller} div to the entire document.\n\nAdd the revealjs format to the YAML header as well as a scrollable: true option to it.\n\nYou can also specify one of the built-in themes here.\n\nAdd theme: simple to your YAML header and render.\n\nYou can find the entire list of themes at the Quarto website.\n\n\n5.4 Multiple columns\nSometimes you’ll want to have more than one column in your presentation, which is done with the {.columns} and {.column} divisions. 
The former specifies that a section with multiple columns is starting, while the second specifies when each column starts, like so:\n:::: {.columns}\n\n::: {.column}\nLeft column\n:::\n\n::: {.column}\nRight column\n:::\n\n::::\n\nAdd multiple columns with some content to your presentation and render it.\n\nYou can also control the widths of these columns using e.g. {.column width=\"40%\"}.\n\n\n\n\n\n\nNote\n\n\n\nThe {.columns} div also works for a normal HTML render, so it’ll look the same regardless of whether you output as a document or a presentation.\n\n\n\n\n5.5 Fragments\nWe’ve already learnt how to get incremental lists working, but what about general content we want to incrementally step through? This is done with the {.fragment} div.\n\nAdd a {.fragment} div to some slide content and render.\n\nFragments are similar to “animations” from PowerPoint and come with lots of built-in variations, e.g. fade-out, grow, strike and several others.\n\nAdd a fragment variant to your content, e.g. {.fragment .grow} and render your document.\n\nYou can also control the order in which fragments appear using the fragment-index=<NUMBER> option.\n\nCreate a new slide and add some content with a different order of appearance than the order of the code. If you need help or inspiration, click below.\n\n\n\n\n\n\n\nClick to show\n\n\n\n\n\n## Why Palmer Penguins?\n\n::: {.fragment fragment-index=2}\n![](https://allisonhorst.github.io/palmerpenguins/logo.png){fig-align=\"center\"}\n:::\n\n::: {.fragment fragment-index=1}\nThe goal of `palmerpenguins` is to provide a good dataset for data exploration\nand visualization, as an alternative to `iris.`\n:::\n\n\n\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we covered how to create presentations using Quarto, including how to add various divisions, global slide-options, multiple columns and fragments." + }, + { + "objectID": "pages/quarto.html#extra-material", + "href": "pages/quarto.html#extra-material", + "title": "Working with Quarto", + "section": "6 Extra material", + "text": "6 Extra material\nThe following material contains some more advanced things that you can do with Quarto but are not really part of the core of the Quarto material. It’s a mix of various functionalities, and you don’t have to go through it if you don’t want to.\nIf you’re interested in learning more about Quarto in general, here are some reading tips:\n\nThe Quarto documentation\nA gallery of Quarto examples\nAn awesome list of Quarto content\n\n\n6.1 Tabsets\nSometimes you’ll want to present the same content in different ways, e.g. the equivalent code in different languages. Look at the following toy example:\n::: {.panel-tabset}\n## R\n```{r}\nwords <- c(\"Foo\", \"bar\")\nprint(paste(words), collapse = ' ')\n```\n\n## Python\n```{python}\nwords = [\"Foo\", \"bar\"]\nprint(' '.join(words))\n```\n:::\nTry adding that to a document and see that you’ll get a set of tabs that change the content of the code chunk to the respective language. This is not only useful for showing different languages, but can be used for other situations as well. For example, you might want to run different analyses and show them in different tabs, or even show different interactive elements in separate tabs.\n\n\n6.2 Callouts\nIf you’re writing some sort of documentation, tutorial or just want to draw special attention to something, callouts are here for you. They render as a coloured block with a header and content. There are five types of callouts: note, tip, warning, caution, and important. 
As with lots of Quarto things they are specified using a division, like so:\n::: {.callout-note}\nThis is a note callout.\n:::\nThe different callouts come with appropriate colours by default, which you can change in the theme. You can also have collapsible callouts by adding the collapse=true option, where true will have the callout collapsed by default. You can also specify titles in the same way using the title=<TITLE> option or by adding the title directly to the callout content, like so:\n::: {.callout-note}\n## This is the callout title\n\nThis is a note callout.\n:::\nYou can change the overall appearance of callouts by using the appearance option or the callout-appearance global option. Valid values are default, simple and minimal, with decreasing usage of colours and weights. You can also suppress the callout icons using icon=false or callout-icon: false in a similar manner.\n\n\n6.3 Mixing R and Python\nEarlier in the tutorial we showed how to change the language using the engine global option, but there is actually a way to use both R and Python in the same Quarto document. This is done via the Knitr engine and the reticulate R package, which allows communication between any variables and data you store in either R or Python code chunks. While this may not be that common of a use-case, it’s still great that it’s there for those that want access to it. We won’t go through the details of how this works here, but you’re welcome to go and check out the official reticulate website for yourself.\nIf you just want to mix R and Python in a single Quarto document without the interoperability between the languages it’s a lot simpler, though. You can either just install the reticulate package (r-reticulate in Conda) or add the python.reticulate=FALSE chunk option to the Python chunks.\n\n\n6.4 Citations\nYou can actually write whole articles in Quarto! For that purpose, it’s also great that you can cite things from a bibliography as well. Specifying the bibliography file(s) is done using the bibliography global option; specifying the citation style can be done using a csl (Citation Style Language) file and the csl global option. Citation itself is similar to cross-referencing (@cross-ref), but is surrounded by square brackets: [@citation]. You can read more details about citations at the Quarto website." + }, + { + "objectID": "pages/markdown.html", + "href": "pages/markdown.html", + "title": "Markdown", + "section": "", + "text": "A markup language is a system for annotating text documents in order to e.g. define formatting. HTML, if you are familiar with that, is an example of a markup language. HTML uses tags, such as:\n<h1> Heading </h1>\n<h2> Sub-heading </h2>\n<a href=\"www.webpage.com\"> Link </a>\n<ul>\n <li> List-item1 </li>\n <li> List-item2 </li>\n <li> List-item3 </li>\n</ul>\nMarkdown is a lightweight markup language which uses plain-text syntax in order to be as unobtrusive as possible, so that a human can easily read it. Look at the following toy example:\n# A header\n\nA [link](http://example.com).\n\n## A sub-header\n\nText attributes _italic_, *italic*, **bold**, `monospace`.\n\n### A deeper sub-header\n\nBullet list:\n\n - Apples\n - Oranges\n - Pears\nThis would render to something like this:\n\nA markdown document can be converted to other formats, such as HTML or PDF, for viewing in a browser or a PDF reader; in fact, the page you are reading right now is written in markdown. 
Markdown is somewhat ill-defined, and as a consequence of that there exist many implementations and extensions. They share most of the syntax, however, with various additions on top.\nThere are a lot more things you can do with markdown than what we show here. Indeed, this entire course is mostly written in markdown! You can read more about markdown here." + }, { "objectID": "pages/introduction.html", "href": "pages/introduction.html", @@ -55,6 +111,69 @@ "section": "The case study", "text": "The case study\nWe will be running a small bioinformatics project as a case study, and use that to exemplify the different steps of setting up a reproducible research project. To give you some context, the study background and analysis steps are briefly described below.\n\nBackground\nThe data is taken from Osmundson, Dewell, and Darst (2013), who have studied methicillin-resistant Staphylococcus aureus (MRSA). MRSA is resistant to broad spectrum beta-lactam antibiotics and lead to difficult-to-treat infections in humans. Lytic bacteriophages have been suggested as potential therapeutic agents, or as the source of novel antibiotic proteins or peptides. One such protein, gp67, was identified as a transcription-inhibiting transcription factor with an antimicrobial effect. To identify S. aureus genes repressed by gp67, the authors expressed gp67 in S. aureus cells. RNA-seq was then performed on three S. aureus strains:\n\nRN4220 with pRMC2 with gp67\nRN4220 with empty pRMC2\nNCTC8325-4\n\n\n\nAnalysis\nThe graph below shows the different steps of the analysis that are included in this project:\n\nThe input files are:\n\nRNA-seq raw data (FASTQ files) for the three strains\nS. aureus genome sequence (a FASTA file)\nS. aureus genome annotation (a GFF file)\n\nThe workflow itself will perform the following tasks:\n\nDownloading and indexing of the reference genome using Bowtie2\nDownloading the raw FASTQ data from the Sequence Read Archive (SRA)\nRun some quality controls on the data using FastQC and MultiQC\nAlign the raw data to the genome and calculate the gene expression using featureCounts\nProduce supplementary materials using data from quality controls, gene expression and the workflow figure shown above" }, + { + "objectID": "pages/containers.html", + "href": "pages/containers.html", + "title": "Controlling your environment with Containers", + "section": "", + "text": "Container-based technologies are designed to make it easier to create, deploy, and run applications by isolating them in self-contained software units (hence their name). The idea is to package software and/or code together with everything it needs (other packages it depends, various environment settings, etc.) into one unit, i.e. a container. This way we can ensure that the software or code functions in exactly the same way regardless of where it’s executed. Containers are in many ways similar to virtual machines but more lightweight. Rather than starting up a whole new operating system, containers can use the same kernel (usually Linux) as the system that they’re running on. This makes them much faster and smaller compared to virtual machines. While this might sound a bit technical, actually using containers is quite smooth and very powerful.\nContainers have also proven to be a very good solution for packaging, running and distributing scientific data analyses. Some applications of containers relevant for reproducible research are:\n\nWhen publishing, package your analyses in a container image and let it accompany the article. 
This way interested readers can reproduce your analysis at the push of a button.\nPackaging your analysis in a container enables you to develop on e.g. your laptop and seamlessly move to cluster or cloud to run the actual analysis.\nSay that you are collaborating on a project and you are using Mac while your collaborator is using Windows. You can then set up a container image specific for your project to ensure that you are working in an identical environment.\n\nOne of the largest and most widely used container-based technologies is Docker. Just as with Git, Docker was designed for software development but is rapidly becoming widely used in scientific research. Another container-based technology is Apptainer (and the related Singularity), which was developed to work well in computer cluster environments such as Uppmax. We will cover both Docker and Apptainer in this course, but the focus will be be on the former (since that is the most widely used and runs on all three operating systems).\nThis tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to install Docker if you haven’t done so already, then open up a terminal and go to workshop-reproducible-research/tutorials/containers.\n\n\n\n\n\n\nDockage and storage\n\n\n\nDocker images tend to take up quite a lot of space. In order to do all the exercises in this tutorial you need to have ~10 GB available." + }, + { + "objectID": "pages/containers.html#introduction", + "href": "pages/containers.html#introduction", + "title": "Controlling your environment with Containers", + "section": "", + "text": "Container-based technologies are designed to make it easier to create, deploy, and run applications by isolating them in self-contained software units (hence their name). The idea is to package software and/or code together with everything it needs (other packages it depends, various environment settings, etc.) into one unit, i.e. a container. This way we can ensure that the software or code functions in exactly the same way regardless of where it’s executed. Containers are in many ways similar to virtual machines but more lightweight. Rather than starting up a whole new operating system, containers can use the same kernel (usually Linux) as the system that they’re running on. This makes them much faster and smaller compared to virtual machines. While this might sound a bit technical, actually using containers is quite smooth and very powerful.\nContainers have also proven to be a very good solution for packaging, running and distributing scientific data analyses. Some applications of containers relevant for reproducible research are:\n\nWhen publishing, package your analyses in a container image and let it accompany the article. This way interested readers can reproduce your analysis at the push of a button.\nPackaging your analysis in a container enables you to develop on e.g. your laptop and seamlessly move to cluster or cloud to run the actual analysis.\nSay that you are collaborating on a project and you are using Mac while your collaborator is using Windows. You can then set up a container image specific for your project to ensure that you are working in an identical environment.\n\nOne of the largest and most widely used container-based technologies is Docker. Just as with Git, Docker was designed for software development but is rapidly becoming widely used in scientific research. 
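If you want to check that your Docker installation works before diving into the exercises, a quick sanity check might look like the following sketch (hello-world is a tiny test image provided by Docker and is not part of the course material):

```bash
# Print the installed Docker version
docker --version

# Download and run a minimal test image, removing the container afterwards
docker run --rm hello-world
```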
Another container-based technology is Apptainer (and the related Singularity), which was developed to work well in computer cluster environments such as Uppmax. We will cover both Docker and Apptainer in this course, but the focus will be be on the former (since that is the most widely used and runs on all three operating systems).\nThis tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to install Docker if you haven’t done so already, then open up a terminal and go to workshop-reproducible-research/tutorials/containers.\n\n\n\n\n\n\nDockage and storage\n\n\n\nDocker images tend to take up quite a lot of space. In order to do all the exercises in this tutorial you need to have ~10 GB available." + }, + { + "objectID": "pages/containers.html#the-basics", + "href": "pages/containers.html#the-basics", + "title": "Controlling your environment with Containers", + "section": "2 The basics", + "text": "2 The basics\nWe’re almost ready to start, just one last note on nomenclature. You might have noticed that we sometimes refer to “Docker images” and sometimes to “Docker containers”. We use images to start containers, so containers are simply an instances of an image. You can have an image containing, say, a certain Linux distribution, and then start multiple containers running that same OS.\n\n\n\n\n\n\nRoot privileges are required\n\n\n\nIf you don’t have root privileges you have to prepend all Docker commands with sudo.\n\n\n\n2.1 Downloading images\nDocker containers typically run Linux, so let’s start by downloading an image containing Ubuntu (a popular Linux distribution that is based on only open-source tools) through the command line.\ndocker pull ubuntu:latest\nYou will notice that it downloads different layers with weird hashes as names. This represents a very fundamental property of Docker images that we’ll get back to in just a little while. The process should end with something along the lines of:\nStatus: Downloaded newer image for ubuntu:latest\ndocker.io/library/ubuntu:latest\nLet’s take a look at our new and growing collection of Docker images:\ndocker image ls\nThe Ubuntu image should show up in this list, with something looking like this:\nREPOSITORY TAG IMAGE ID CREATED SIZE\nubuntu latest d70eaf7277ea 3 weeks ago 72.9MB\n\n\n2.2 Running containers\nWe can now start a container from the image we just downloaded. We can refer to the image either by “REPOSITORY:TAG” (“latest” is the default so we can omit it) or “IMAGE ID”. The syntax for docker run is docker run [OPTIONS] IMAGE [COMMAND] [ARG...]. To see the available options run docker run --help. The COMMAND part is any command that you want to run inside the container, it can be a script that you have written yourself, a command line tool or a complete workflow. The ARG part is where you put optional arguments that the command will use.\nLet’s run uname -a to get some info about the operating system. In this case, uname is the COMMAND and -a the ARG. This command will display some general info about your system, and the -a argument tells uname to display all possible information.\nFirst run it on your own system (use systeminfo if you are on Windows):\nuname -a\nThis should print something like this to your command line:\nDarwin liv433l.lan 15.6.0 Darwin Kernel Version 15.6.0: Mon Oct 2 22:20:08 PDT 2017; root:xnu-3248.71.4~1/RELEASE_X86_64 x86_64\nSeems like I’m running the Darwin version of MacOS. 
Then run it in the Ubuntu Docker container:\ndocker run ubuntu uname -a\nHere I get the following result:\nLinux 24d063b5d877 5.4.39-linuxkit #1 SMP Fri May 8 23:03:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux\nAnd now I’m running on Linux! What happens is that we use the downloaded ubuntu image to run a container that has Ubuntu as the operating system, and we instruct Docker to execute uname -a to print the system info within that container. The output from the command is printed to the terminal.\nTry the same thing with whoami instead of uname -a.\n\n\n2.3 Running interactively\nSo, seems we can execute arbitrary commands on Linux. This looks useful, but maybe a bit limited. We can also get an interactive terminal with the flags -it.\ndocker run -it ubuntu\nYour prompt should now look similar to:\nroot@1f339e929fa9:/#\nYou are now using a terminal inside a container running Ubuntu. Here you can do whatever; install, run, remove stuff. Anything you do will be isolated within the container and never affect your host system.\nNow exit the container with exit.\n\n\n2.4 Containers inside scripts\nOkay, so Docker lets us work in any OS in a quite convenient way. That would probably be useful on its own, but Docker is much more powerful than that. For example, let’s look at the shell part of the index_genome rule in the Snakemake workflow for the MRSA case study:\nshell:\n \"\"\"\n bowtie2-build tempfile results/bowtie2/{wildcards.genome_id} > {log}\n \"\"\"\nYou may have seen that one can use containers through both Snakemake and Nextflow if you’ve gone through their tutorial’s extra material, but we can also use containers directly inside scripts in a very simple way. Let’s imagine we want to run the above command using containers instead. How would that look? It’s quite simple, really: first we find a container image that has bowtie2 installed, and then prepend the command with docker run <image>.\nFirst of all we need to download the genome to index though, so run:\ncurl -o NCTC8325.fa.gz ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz\ngunzip -c NCTC8325.fa.gz > tempfile\nTo download and prepare the input for Bowtie2.\nNow try running the following Bash code:\ndocker run -v $(pwd):/analysis quay.io/biocontainers/bowtie2:2.5.1--py39h3321a2d_0 bowtie2-build /analysis/tempfile /analysis/NCTC8325\nDocker will automatically download the container image for Bowtie2 version 2.5.1 from the remote repository https://quay.io/repository/biocontainers/bowtie2 and subsequently run the command! This is the docker run [OPTIONS] IMAGE [COMMAND] [ARG...] syntax just like before. In this case quay.io/biocontainers/bowtie2:2.5.1--py39h3321a2d_0 is the IMAGE but instead of first downloading and then running it we point to its remote location directly, which will cause Docker to download it on the fly. The bowtie2-build part is the COMMAND followed by the ARG (the input tempfile and the output index)\nThe -v $(pwd):/analysis part is the OPTIONS which we use to mount the current directory inside the container in order to make the tempfile input available to Bowtie2. 
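To tie this back to the idea of using containers inside scripts, the same command can simply be placed in a small shell script. This is just a sketch: the script name run_bowtie2_build.sh is made up, but the image and command are the ones used above.

```bash
#!/bin/bash
# run_bowtie2_build.sh: index a genome with Bowtie2 via Docker instead of a local install
set -euo pipefail

# Mount the current directory so the container can read tempfile and write the index files
docker run -v "$(pwd)":/analysis \
    quay.io/biocontainers/bowtie2:2.5.1--py39h3321a2d_0 \
    bowtie2-build /analysis/tempfile /analysis/NCTC8325
```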
More on these so-called “Bind mounts” in Section 4 of this tutorial.\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nHow to use docker pull for downloading remotely stored images\nHow to use docker image ls for getting information about the images we have on our system.\nHow to use docker run for starting a container from an image.\nHow to use the -it flag for running in interactive mode.\nHow to use Docker inside scripts." + }, + { + "objectID": "pages/containers.html#building-images", + "href": "pages/containers.html#building-images", + "title": "Controlling your environment with Containers", + "section": "3 Building images", + "text": "3 Building images\nIn the previous section we downloaded a Docker image of Ubuntu and noticed that it was based on layers, each with a unique hash as id. An image in Docker is based on a number of read-only layers, where each layer contains the differences to the previous layers. If you’ve done the Git tutorial this might remind you of how a Git commit contains the difference to the previous commit. The great thing about this is that we can start from one base layer, say containing an operating system and some utility programs, and then generate many new images based on this, say 10 different project-specific images. This dramatically reduces the storage space requirements. For example, Bioconda (see the Conda tutorial) has one base image and then one individual layer for each of the more than 3000 packages available in Bioconda.\nDocker provides a convenient way to describe how to go from a base image to the image we want by using a “Dockerfile”. This is a simple text file containing the instructions for how to generate each layer. Docker images are typically quite large, often several GBs, while Dockerfiles are small and serve as blueprints for the images. It is therefore good practice to have your Dockerfile in your project Git repository, since it allows other users to exactly replicate your project environment.\nWe will be looking at a Dockerfile called Dockerfile_slim that is located in your containers directory (where you should hopefully be standing already). We will now go through that file and discuss the different steps and what they do. After that we’ll build the image and test it out. Lastly, we’ll start from that image and make a new one to reproduce the results from the Conda tutorial.\n\n3.1 Understanding Dockerfiles\nHere are the first few lines of Dockerfile_slim. Each line in the Dockerfile will typically result in one layer in the resulting image. The format for Dockerfiles is INSTRUCTION arguments. A full specification of the format, together with best practices, can be found here.\nFROM condaforge/miniforge3\n\nLABEL description = \"Minimal image for the NBIS reproducible research course.\"\nMAINTAINER \"John Sundh\" john.sundh@scilifelab.se\nHere we use the instructions FROM, LABEL and MAINTAINER. While LABEL and MAINTAINER is just meta-data that can be used for organizing your various Docker components the important one is FROM, which specifies the base image we want to start from. Because we want to use conda to install packages we will start from an image from the conda-forge community that has conda pre-installed. This image was in turn built using a Dockerfile as a blueprint and then uploaded to Dockerhub. The conda-forge community keeps the Dockerfile in a git repository and you can view the file here. 
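As a small aside before looking at that file: you can inspect the layer structure of any image you already have locally with docker history. A sketch, using the Ubuntu image pulled earlier in this tutorial:

```bash
# Show the layers that make up the ubuntu:latest image, including their sizes
docker history ubuntu:latest
```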
You will see that it starts from an official Ubuntu image (check the first line with the FROM instruction), followed by code to install various packages including conda.\n\n\n\n\n\n\nThere are many roads to Rome\n\n\n\nWhen it comes to choosing the best image to start from there are multiple routes you could take. Say you want to run RStudio in a Conda environment through a Jupyter notebook. You could then start from one of the rocker images for R, a Condaforge image, or a Jupyter image. Or you just start from one of the low-level official images and set up everything from scratch.\n\n\nLet’s take a look at the next section of Dockerfile_slim.\n# Use bash as shell\nSHELL [\"/bin/bash\", \"--login\", \"-c\"]\n\n# Set workdir\nWORKDIR /course\n\n# Set time zone\nENV TZ=\"Europe/Stockholm\"\nENV DEBIAN_FRONTEND=noninteractive\nSHELL simply sets which shell to use and WORKDIR determines the directory the container should start in. The ENV instruction is used to set environmental variables and here we use it to set the time zone by declaring a TZ variable. The DEBIAN_FRONTEND=noninteractive line means that we force the subsequent installation to not prompt us to set the time zone manually.\nThe next few lines introduce the important RUN instruction, which is used for executing shell commands:\n# Install package for setting time zone\nRUN apt-get update && apt-get install -y tzdata && apt-get clean\n\n# Configure Conda\nRUN conda init bash && conda config --set channel_priority strict && \\\n conda config --append channels bioconda && \\\n conda config --append channels r && \\\n conda config --set subdir linux-64\nThe first RUN command installs the tzdata package for managing local time settings in the container. This may not always be required for your Dockerfile but it’s added here because some R packages used in the course require it.\n\n\n\n\n\n\nNote\n\n\n\nWhile installing things with apt-get inside Dockerfiles is relatively common practice, it’s important to note that this may affect reproducibility, since it’s not common to specify an exact version. The packages installed in this manner are, however, usually not important for the actual analyses performed, but rather help in the building of the container image itself. While not critical, it’s important to note this from a reproducibility perspective.\n\n\nNext, we run conda init bash to initialize the bash shell inside the image, meaning we can use conda activate in containers that run from the image. In the same RUN statement we also configure the strict channel priority and add appropriate channels with conda config. You’ll probably recognize this from the pre-course-setup. The last part sets the somewhat obscure subdir config parameter pointing to the linux-64 architecture of conda channels.\nAs a general rule, you want each layer in an image to be a “logical unit”. For example, if you want to install a program the RUN command should both retrieve the program, install it and perform any necessary clean up. This is due to how layers work and how Docker decides what needs to be rerun between builds. More on this later.\nNext up is:\n# Open port for running Jupyter Notebook\nEXPOSE 8888\n\n# Start Bash shell by default\nCMD /bin/bash\nEXPOSE opens up the port 8888, so that we can later run a Jupyter Notebook server on that port. CMD is an interesting instruction. It sets what a container should run when nothing else is specified, i.e. if you run docker run [OPTIONS] [IMAGE] without the additional [COMMAND] [ARG]. 
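In other words, once this image is built (see the next section), the difference might look like the following sketch, where my_docker_image is the name we will give the image and conda is simply a program known to exist in it:

```bash
# No COMMAND given: the CMD from the Dockerfile (a Bash shell) is executed
docker run -it my_docker_image

# A COMMAND is given: it overrides CMD for this particular run
docker run my_docker_image conda --version
```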
It can be used for example for printing some information on how to use the image or, as here, to start a Bash shell for the user. If the purpose of your image is to accompany a publication, then CMD could be to run the workflow that generates the paper figures from raw data, e.g. CMD snakemake -s Snakefile -c 1 generate_figures.\n\n\n3.2 Building from Dockerfiles\nNow we understand how a Dockerfile works. Constructing the image itself from the Dockerfile can be done as follows - try it out:\n\n\n\n\n\n\nImage platforms on newer Macs\n\n\n\nIf your computer is a Mac with the M1 chip, you may have to add --platform linux/x86_64 to the docker build command.\n\n\ndocker build -f Dockerfile_slim -t my_docker_image .\nThis should result in something similar to this:\n [+] Building 2.2s (7/7) FINISHED\n => [internal] load build definition from Dockerfile_slim 0.0s\n => => transferring dockerfile: 667B 0.0s\n => [internal] load .dockerignore 0.0s\n => => transferring context: 2B 0.0s\n => [internal] load metadata for docker.io/condaforge/miniforge3:latest 0.0s\n => [1/3] FROM docker.io/condaforge/miniforge3 0.0s\n => CACHED [2/3] WORKDIR /course 0.0s\n => [3/3] RUN conda init bash && conda config --set channel_priority strict && conda config --append channels bioconda && conda config --append channels r && conda config --set subdir 2.1s\n => exporting to image 0.0s\n => => exporting layers 0.0s\n => => writing image sha256:53e6efeaa063eadf44c509c770d887af5e222151f08312e741aecc687e6e8981 0.0s\n => => naming to docker.io/library/my_docker_image\nExactly how the output looks depends on which version of Docker you are using. The -f flag sets which Dockerfile to use and -t tags the image with a name. This name is how you will refer to the image later. Lastly, the . is the path to where the image should be built (. means the current directory). This has no real impact in this case, but it matters if you want to import files. Validate with docker image ls that you can see your new image.\n\n\n3.3 Creating your own Dockerfile\nNow it’s time to make your own Dockerfile to reproduce the results from the Conda tutorial. If you haven’t done the tutorial, it boils down to creating a Conda environment file, setting up that environment, downloading three RNA-seq data files, and running FastQC on those files. We will later package and run the whole RNA-seq workflow in a Docker container, but for now we keep it simple to reduce the size and time required.\nThe Conda tutorial uses a shell script, run_qc.sh, for downloading and running the analysis. A copy of this file should also be available in your current directory. If we want to use the same script we need to include it in the image. A basic outline of what we need to do is:\n\nCreate a file called Dockerfile_conda\nStart the image from the my_docker_image we just built\nInstall the package fastqc which is required for the analysis.\nAdd the run_qc.sh script to the image\nSet the default command of the image to run the run_qc.sh script.\n\nWe’ll now go through these steps in more detail. Try to add the corresponding code to Dockerfile_conda on your own, and if you get stuck you can click to reveal the solution below under “Click to show solution”.\nSet image starting point\nTo set the starting point of the new image, use the FROM instruction and point to my_docker_image that we built in the previous Building from Dockerfiles step.\nInstall packages\nUse the RUN instruction to install the package fastqc=0.11.9 with conda. Here there are several options available. 
For instance we could add an environment file e.g. environment.yml from the Conda tutorial and use conda env create to create an environment from that file. Or we could create an environment directly with conda create. We’ll try this later option here, so add a line that will create an environment named project_mrsa containing the fastqc package, and also clean up packages and cache after installation. Use the -y flag to conda create to avoid the prompt that expects an interaction from the user.\nIn order to have the project_mrsa environment activated upon start-up we need to add two more lines to the Dockerfile. First we need to use a RUN instruction to run echo \"source activate project_mrsa\" >> ~/.bashrc, and then we need to use the ENV instruction to set the $PATH variable inside the image to /opt/conda/envs/project_mrsa/bin:$PATH.\nAdd the analysis script\nUse the COPY instruction to Add run_qc.sh to the image. The syntax is COPY SOURCE TARGET. In this case SOURCE is the run_qc.sh script and TARGET is a path inside the image, for simplicity it can be specified with ./.\nSet default command\nUse the CMD instruction to set the default command for the image to bash run_qc.sh.\n\n\n\n\n\n\nClick to show\n\n\n\n\n\nFROM my_docker_image\n\nRUN conda create -y -n project_mrsa -c bioconda fastqc=0.11.9 && conda clean -a\n\nRUN echo \"source activate project_mrsa\" >> ~/.bashrc\n\nENV PATH=/opt/conda/envs/project_mrsa/bin:$PATH\n\nCOPY run_qc.sh .\n\nCMD bash run_qc.sh\n\n\n\nBuild the image and tag it my_docker_conda (remember to add --platform linux/x86_64 to the build command if you are using a Mac with the Apple chip).\ndocker build -t my_docker_conda -f Dockerfile_conda .\nVerify that the image was built using docker image ls.\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nHow the keywords FROM, LABEL, MAINTAINER, RUN, ENV, SHELL, WORKDIR, and CMD can be used when writing a Dockerfile.\nHow to use docker build to construct and tag an image from a Dockerfile.\nHow to create your own Dockerfile." + }, + { + "objectID": "pages/containers.html#managing-containers", + "href": "pages/containers.html#managing-containers", + "title": "Controlling your environment with Containers", + "section": "4 Managing containers", + "text": "4 Managing containers\nWhen you start a container with docker run it is given an unique id that you can use for interacting with the container. Let’s try to run a container from the image we just created:\ndocker run my_docker_conda\nIf everything worked run_qc.sh is executed and will first download and then analyse the three samples. Once it’s finished you can list all containers, including those that have exited.\ndocker container ls --all\nThis should show information about the container that we just ran. Similar to:\nCONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES\nb6f7790462c4 my_docker_conda \"tini -- /bin/bash -…\" 3 minutes ago Up 24 seconds 8888/tcp sad_maxwell\nIf we run docker run without any flags, your local terminal is attached to the container. This enables you to see the output of run_qc.sh, but also disables you from doing anything else in the meantime. We can start a container in detached mode with the -d flag. Try this out and run docker container ls to validate that the container is running.\nBy default, Docker keeps containers after they have exited. This can be convenient for debugging or if you want to look at logs, but it also consumes huge amounts of disk space. 
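To see what has accumulated, and to clean it up in one go, something like this sketch can be used (docker container prune will ask for confirmation before deleting anything):

```bash
# List only containers that have exited
docker container ls --all --filter status=exited

# Remove all stopped containers
docker container prune
```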
It’s therefore a good idea to always run with --rm, which will remove the container once it has exited.\nIf we want to enter a running container, there are two related commands we can use, docker attach and docker exec. docker attach will attach local standard input, output, and error streams to a running container. This can be useful if your terminal closed down for some reason or if you started a container in detached mode and changed your mind. docker exec can be used to execute any command in a running container. It’s typically used to peek in at what is happening by opening up a new shell. Here we start the container in detached mode and then start a new interactive shell so that we can see what happens. If you use ls inside the container you can see how the script generates files in the data and results directories. Note that you will be thrown out when the container exits, so you have to be quick.\ndocker run -d --rm --name my_container my_docker_conda\ndocker exec -it my_container /bin/bash\n\n4.1 Bind mounts\nThere are obviously some advantages to isolating and running your data analysis in containers, but at some point you need to be able to interact with the rest of the host system (e.g. your laptop) to actually deliver the results. This is done via bind mounts. When you use a bind mount, a file or directory on the host machine is mounted into a container. That way, when the container generates a file in such a directory it will appear in the mounted directory on your host system.\n\n\n\n\n\n\nTip\n\n\n\nDocker also has a more advanced way of data storage called volumes. Volumes provide added flexibility and are independent of the host machine’s file system having a specific directory structure available. They are particularly useful when you want to share data between containers.\n\n\nSay that we are interested in getting the resulting html reports from FastQC in our container. We can do this by mounting a directory called, say, fastqc_results in your current directory to the /course/results/fastqc directory in the container. Try this out by running:\ndocker run --rm -v $(pwd)/fastqc_results:/course/results/fastqc my_docker_conda\nHere the -v flag to docker run specifies the bind mount in the form of directory/on/your/computer:/directory/inside/container. $(pwd) simply evaluates to the working directory on your computer.\nOnce the container finishes, validate that it worked by opening one of the html reports under fastqc_results/.\nWe can also use bind mounts for getting files into the container rather than out. We’ve mainly been discussing Docker in the context of packaging an analysis pipeline to allow someone else to reproduce its outcome. Another application is as a kind of very powerful environment manager, similarly to how we’ve used Conda before. If you’ve organized your work into projects, then you can mount the whole project directory in a container and use the container as the terminal for running stuff while still using your normal OS for editing files and so on. Let’s try this out by mounting our current directory and starting an interactive terminal. 
Note that this will override the CMD command, so we won’t start the analysis automatically when we start the container.\ndocker run -it --rm -v $(pwd):/course/ my_docker_conda /bin/bash\nIf you run ls you will see that all the files in the container/ directory are there.\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nHow to use docker run for starting a container and how the flags -d and --rm work.\nHow to use docker container ls for displaying information about the containers.\nHow to use docker attach and docker exec to interact with running containers.\nHow to use bind mounts to share data between the container and the host system." + }, + { + "objectID": "pages/containers.html#sharing-images", + "href": "pages/containers.html#sharing-images", + "title": "Controlling your environment with Containers", + "section": "5 Sharing images", + "text": "5 Sharing images\nThere would be little point in going through all the trouble of making your analyses reproducible if you can’t distribute them to others. Luckily, sharing Docker containers is extremely easy, and can be done in several ways. One of the more common ways to share Docker images is through container registries and repositories.\nFor example, a Docker registry is a service that stores Docker images, which could be hosted by a third party, publicly or privately. One of the most common registries is Docker Hub, which is a registry hosted by Docker itself. A repository, on the other hand, is a collection of container images with the same name but different tags (i.e. versions), for example ubuntu:latest or ubuntu:20.04. Repositories are stored in registries.\n\n\n\n\n\n\nNote\n\n\n\nRemember that we now have some clashing nomenclature between Git repositories (which we covered in the Git tutorial) and container repositories, so be aware of which one you’re talking about!\n\n\nThere are many registries out there, but here are some that might be of interest to you who are taking this course:\n\nDocker Hub\nGitHub Container Registry\nQuay\nBiocontainers\nRocker\nJupyter containers\n\nThe most common registry is probably Docker Hub, which lets you host unlimited public images and one private image for free (after which they charge a small fee). The GitHub Container Registry is also quite handy if you’re already using GitHub. Let’s see how it’s done using Docker Hub!\n\nRegister for an account on Docker Hub.\nUse docker login -u your_dockerhub_id to login to the Docker Hub registry. Or use the Sign in button in Docker Desktop.\nWhen you build an image, tag it with -t your_dockerhub_id/image_name, rather than just image_name.\nOnce the image has been built, upload it to Docker Hub with docker push your_dockerhub_id/image_name.\nIf another user runs docker run your_dockerhub_id/image_name the image will automatically be retrieved from Docker Hub. You can use docker pull for downloading without running.\n\nIf you want to refer to a Docker image in for example a publication, it’s very important that it’s the correct version of the image. This is handled via the ‘tags’ (e.g. docker build -t your_dockerhub_id/image_name:tag_name) that we introduced in Containers 2: The basics and used when building images in Containers 3: Building images.\n\n\n\n\n\n\nTip\n\n\n\nOn Docker Hub it is also possible to link to your Bitbucket or GitHub account and select repositories from which you want to automatically build and distribute Docker images. 
The Docker Hub servers will then build an image from the Dockerfile in your Git repository and make it available for download using docker pull. That way, you don’t have to bother manually building and pushing using docker push. The GitHub repository for this course is linked to Docker Hub and the Docker images are built automatically from Dockerfile and Dockerfile_slim, triggered by changes made to the GitHub repository. You can take a look at the course on Docker Hub here.\n\n\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nHow container registries and repositories work\nHow to use Docker Hub to share Docker images" + }, + { + "objectID": "pages/containers.html#packaging-the-case-study", + "href": "pages/containers.html#packaging-the-case-study", + "title": "Controlling your environment with Containers", + "section": "6 Packaging the case study", + "text": "6 Packaging the case study\nDuring these tutorials we have been working on a case study about the multi-resistant bacteria MRSA. Here we will build and run a Docker container that contains all the work we’ve done so far.\n\nWe’ve set up a GitHub repository for version control and for hosting our project.\nWe’ve defined a Conda environment that specifies the packages we’re depending on in the project.\nWe’ve constructed a Snakemake workflow that performs the data analysis and keeps track of files and parameters.\nWe’ve written a Quarto document that takes the results from the Snakemake workflow and summarizes them in a report.\n\nThe workshop-reproducible-research/tutorials/containers directory contains the final versions of all the files we’ve generated in the other tutorials: environment.yml, Snakefile, config.yml and code/supplementary_material.qmd. The only difference compared to the other tutorials is that we have also included the rendering of the Supplementary Material HTML file into the Snakemake workflow as the rule make_supplementary. Running all of these steps will take some time to execute (around 20 minutes or so), in particular if you’re on a slow internet connection.\nNow take a look at Dockerfile. Everything should look quite familiar to you, since it’s basically the same steps as in the image we constructed in the Building images section, although with some small modifications. The main difference is that we add the project files needed for executing the workflow (mentioned in the previous paragraph), and install the conda packages using environment.yml. If you look at the CMD command you can see that it will run the whole Snakemake workflow by default.\nNow run docker build as before, tag the image with my_docker_project (remember the --platform linux/x86_64 flag if you’re on a new Mac with the Apple chip):\ndocker build -t my_docker_project -f Dockerfile .\nGo get a coffee while the image builds (or you could use docker pull nbisweden/workshop-reproducible-research which will download the same image).\nValidate with docker image ls. Now all that remains is to run the whole thing with docker run. We just want to get the results, so mount the directory /course/results/ to, say, results/ in your current directory. Click below to see how to write the command.\n\n\n\n\n\n\nClick to show\n\n\n\n\n\nIf building your own image:\ndocker run -v $(pwd)/results:/course/results my_docker_project\nIf you pulled the image from DockerHub:\ndocker run -v $(pwd)/results:/course/results nbisweden/workshop-reproducible-research\n\n\n\nWell done! 
You now have an image that allows anyone to exactly reproduce your analysis workflow (if you first docker push to Dockerhub that is).\n\n\n\n\n\n\nTip\n\n\n\nIf you’ve done the Jupyter tutorial, you know that Jupyter Notebook runs as a web server. This makes it very well suited for running in a Docker container, since we can just expose the port Jupyter Notebook uses and redirect it to one of our own. You can then work with the notebooks in your browser just as you’ve done before, while it’s actually running in the container. This means you could package your data, scripts and environment in a Docker image that also runs a Jupyter Notebook server. If you make this image available, say on Dockerhub, other researchers could then download it and interact with your data/code via the fancy interactive Jupyter notebooks that you have prepared for them. We haven’t made any fancy notebooks for you, but we have set up a Jupyter Notebook server. Try it out if you want to (replace the image name with your version if you’ve built it yourself):\ndocker run -it nbisweden/workshop-reproducible-research jupyter notebook -allow-root --no-browser" + }, + { + "objectID": "pages/containers.html#apptainer", + "href": "pages/containers.html#apptainer", + "title": "Controlling your environment with Containers", + "section": "7 Apptainer", + "text": "7 Apptainer\nApptainer is a container software alternative to Docker. It was originally developed as Singularity by researchers at Lawrence Berkeley National Laboratory (read more about this below) with focus on security, scientific software, and HPC clusters. One of the ways in which Apptainer is more suitable for HPC is that it very actively restricts permissions so that you do not gain access to additional resources while inside the container. Apptainer also, unlike Docker, stores images as single files using the Singularity Image Format (SIF). A SIF file is self-contained and can be moved around and shared like any other file, which also makes it easy to work with on an HPC cluster.\n\n\n\n\n\n\nApptainer and Singularity\n\n\n\nThe open source Singularity project was renamed to Apptainer in 2021. The company Sylabs still keeps their commercial branch of the project under the Singularity name, and offer a free ‘Community Edition’ version. The name change was done in order to clarify the distinction between the open source project and the various commercial versions. At the moment there is virtually no difference to you as a user whether you use Singularity or Apptainer, but eventually it’s very likely that the two will diverge.\n\n\nWhile it is possible to define and build Apptainer images from scratch, in a manner similar to what you’ve already learned for Docker, this is not something we will cover here (but feel free to read more about this in e.g. the Apptainer docs.\nThe reasons for not covering Apptainer more in-depth are varied, but it basically boils down to it being more or less Linux-only, unless you use Virtual Machines (VMs). Even with this you’ll run into issues of incompatibility of various kinds, and these issues are further compounded if you’re on one of the new ARM64-Macs. You also need root (admin) access in order to actually build Apptainer images regardless of platform, meaning that you can’t build them on e.g. Uppmax, even though Apptainer is already installed there. You can, however, use the --remote flag, which runs the build on Apptainer’s own servers. 
This doesn’t work in practice a lot of the time, though, since most scientists work in private Git repositories so that their research and code are not publicly available, and the --remote flag requires that e.g. the environment.yml file is publicly available.\nThere are, however, very good reasons to use Apptainer, the major one being that you aren’t allowed to use Docker on most HPC systems! One of the nicer features of Apptainer is that it can convert Docker images directly for use within Apptainer, which is highly useful for the cases when you have already built your Docker image or if you’re using a remotely available image stored on e.g. DockerHub. For a lot of scientific work based in R and/or Python, however, it is most often the case that you build your own images, since you have a complex dependency tree of software packages not readily available in existing images. So, we now have another problem for building our own images:\n\nOnly Apptainer is allowed on HPC systems, but you can’t build images there due to not having root access.\nYou can build Apptainer images locally and transfer them to HPCs, but this is problematic unless you’re running Linux natively.\n\nSeems like a “catch 22”-problem, right? There are certainly workarounds (some of which we have already mentioned) but most are roundabout or difficult to get working for all use-cases. Funnily enough, there’s a simple solution: run Apptainer locally from inside a Docker container! Conceptually very meta, yes, but it works very well in practice. What we are basically advocating for is that you stick with Docker for most of your container-based work, but convert your Docker images using Apptainer-in-Docker whenever you need to work on an HPC. This is of course not applicable to Linux users or those of you who are fine with working through VMs and managing any issues that arise from doing that.\n\n\n\n\n\n\nSummary\n\n\n\nApptainer is a great piece of software that is easiest to use if you’re working on a Linux environment. Docker is, however, easier to use from a cross-platform standpoint and covers all use-cases except running on HPCs. Running on HPCs can be done by converting existing Docker images at runtime, while building images for use on HPCs can be done using local Docker images and Apptainer-in-Docker.\n\n\n\n7.1 Apptainer-in-Docker\nBy creating a bare-bones, Linux-based Docker image with Apptainer you can build Apptainer images locally on non-Linux operating systems. There is already a good image set up for just this, and it is defined in this GitHub repository. Looking at the instructions there we can see that we need to do the following:\ndocker run \\\n --rm \\\n -v /var/run/docker.sock:/var/run/docker.sock \\\n -v $(pwd):/work \\\n kaczmarj/apptainer \\\n build <IMAGE>.sif docker-daemon://<IMAGE>:<TAG>\nYou already know about docker run, the --rm flag and bind mounts using -v. The /var/run/docker.sock part is the Unix socket that the Docker daemon listens to by default, meaning that it is needed for us to be able to specify the location of the Docker container we want to convert to a SIF file. The kaczmarj/apptainer part after the bind mounts is the image location hosted at DockerHub, while the last line is the Apptainer command that actually does the conversion. All we need to do is to replace the <IMAGE> part with the Docker image we want to convert, e.g. 
my_docker_image.\n\nReplace <IMAGE> and <TAG> with one of your locally available Docker images and one of its tags and run the command - remember that you can use docker image ls to check what images you have available.\n\nIn the end you’ll have a SIF file (e.g. my_docker_image.sif) that you can transfer to an HPC such as Uppmax and run whatever analyses you need. If you want to be able to do this without having to remember all the code you can check out the this script.\n\n\n7.2 Running Apptainer\nThe following exercises assume that you have a login to the Uppmax HPC cluster in Uppsala, but will also work for any other system that has Apptainer installed - like if you managed to install Apptainer on your local system or have access to some other HPC cluster. Let’s try to convert the Docker image for this course directly from DockerHub:\napptainer pull mrsa_proj.sif docker://nbisweden/workshop-reproducible-research\nThis should result in a SIF file called mrsa_proj.sif.\nIn the Docker image we included the code needed for the workflow in the /course directory of the image. These files are of course also available in the Apptainer image. However, a Apptainer image is read-only. This will be a problem if we try to run the workflow within the /course directory, since the workflow will produce files and Snakemake will create a .snakemake directory. Instead, we need to provide the files externally from our host system and simply use the Apptainer image as the environment to execute the workflow in (i.e. all the software and dependencies).\nIn your current working directory (workshop-reproducible-research/tutorials/containers/) the vital MRSA project files are already available (Snakefile, config.yml and code/supplementary_material.qmd). Since Apptainer bind mounts the current working directory we can simply execute the workflow and generate the output files using:\napptainer run mrsa_proj.sif\nThis executes the default run command, which is snakemake -rp -c 1 --configfile config.yml (as defined in the original Dockerfile). Once completed you should see a bunch of directories and files generated in your current working directory, including the results/ directory containing the final HTML report.\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nHow to build a Apptainer image using Apptainer inside Docker.\nHow to convert Docker images to Apptainer images.\nHow to run Apptainer images." + }, + { + "objectID": "pages/containers.html#extra-material", + "href": "pages/containers.html#extra-material", + "title": "Controlling your environment with Containers", + "section": "8 Extra material", + "text": "8 Extra material\nContainers can be large and complicated, but once you start using them regularly you’ll find that you start understand these complexities. There are lots of different things you can do with images and containers in general, especially when it comes to optimising build time or final image size. Here is some small tips and tricks that you can be inspired from!\nIf you want to read more about containers in general you can check out these resources:\n\nA “Get started with Docker” at the Docker website.\nAn early paper on the subject of using Docker for reproducible research.\n\n\n8.1 Building for multiple platforms\nWith the newer ARM64 architectures introduced by Apple one often runs into the problem of not having an architecture-native image to run with. 
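If you are unsure which platform a local image was actually built for, docker image inspect can tell you. A sketch, using the Ubuntu image from earlier as an example:

```bash
# Print the OS and CPU architecture an image was built for, e.g. linux/amd64 or linux/arm64
docker image inspect --format '{{.Os}}/{{.Architecture}}' ubuntu:latest
```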
This is sometimes okay since the Rosetta2 software can emulate the old AMD64 architecture on newer ARM64 computers, but results in a performance hit. One could just build for ARM64 using --platform=linux/arm64 instead, but then somebody who doesn’t have the new architecture can’t run it. There is a way around this, however: multi-platform builds. We can build for multiple platforms at the same time and push those to e.g. DockerHub and anybody using those images will automatically pull the one appropriate for their computer. Here’s how to do it:\n\nStart by checking the available builders using docker buildx ls.\n\nYou should only see the default builder, which does not have access to multi-platform builds. Let’s create a new builder that does have access to it:\n\nRun the following: docker buildx create --name mybuilder --driver docker-container --bootstrap.\nSwitch to using the new builder with docker buildx use mybuilder and check that it worked with docker buildx ls.\n\nAll that’s needed now is to build and push the images! The following command assumes that you have an account with <username> at DockerHub and you’re pushing the <image> image:\ndocker buildx build --platform linux/amd64,linux/arm64 -t <username>/<image>:latest --push .\n\nExecute the above command with your username and your image.\n\nThat’s it! Now anybody who does e.g. docker pull <username>/<image> will get an image appropriate for their architecture whether they are on AMD64 or ARM64!\n\n\n\n\n\n\nAn alias to buildx\n\n\n\nYou can type docker buildx install to make the docker build into an alias for docker buildx, allowing you to run multi-platform builds using docker build. Use docker buildx uninstall to remove this alias." + }, { "objectID": "lectures/snakemake/snakemake.html#why-do-we-need-workflow-managers", "href": "lectures/snakemake/snakemake.html#why-do-we-need-workflow-managers", @@ -606,14 +725,14 @@ "href": "index.html", "title": "", "section": "", - "text": "How to work reproducibly with control and structuring of project code, environment and workflow management\n\n\nUpdated: 10-10-2024 at 09:14:09 ." + "text": "How to work reproducibly with control and structuring of project code, environment and workflow management\n\n\nUpdated: 15-10-2024 at 16:39:31 ." }, { "objectID": "index.html#tools-for-reproducible-research", "href": "index.html#tools-for-reproducible-research", "title": "", "section": "", - "text": "How to work reproducibly with control and structuring of project code, environment and workflow management\n\n\nUpdated: 10-10-2024 at 09:14:09 ." + "text": "How to work reproducibly with control and structuring of project code, environment and workflow management\n\n\nUpdated: 15-10-2024 at 16:39:31 ." }, { "objectID": "home_schedule.html", @@ -1105,6 +1224,41 @@ "section": "Quarto vs. R Markdown", "text": "Quarto vs. R Markdown\n\n\nQuarto is a command line tool\nQuarto \\(\\thickapprox\\) R Markdown 2.0\nQuarto is language-agnostic (does not depend on R)\nQuarto has all functionality built-in (you don’t need to install another package to create e.g. presentations)\nThe Quarto format is similar to R Markdown\nQuarto can render R Markdown documents\nR Markdown will continue to be supported, but Quarto is the focus of new functionality and major development" }, + { + "objectID": "pages/conda.html", + "href": "pages/conda.html", + "title": "Controlling your environment with Conda", + "section": "", + "text": "Conda is a package and environment manager. 
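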
As a package manager it enables you to install a wide range of software and tools using one simple command: conda install. As an environment manager it allows you to create and manage multiple different environments, each with their own set of packages.\nWhat are the benefits of using an environment manager? Some examples include the ability to easily run different versions of the same package, have different cross-package dependencies that are otherwise incompatible with each other and, last but not least, easy installation of all the software needed for an analysis.\nEnvironments are of particular relevance when making bioinformatics projects reproducible. Full reproducibility requires the ability to recreate the system that was originally used to generate the results. This can, to a large extent, be accomplished by using Conda to make a project environment with specific versions of the packages that are needed in the project. You can read more about Conda here.\nA Conda package is a compressed tarball (system-level libraries, Python or other modules, executable programs or other components). Conda keeps track of the dependencies between packages and platforms - this means that when installing a given package, all necessary dependencies will also be installed.\nConda packages are typically hosted and downloaded from remote so-called channels. Some widely used channels for general-purpose and bioinformatics packages are conda-forge and Bioconda, respectively. Both of these are community-driven projects, so if you’re missing some package you can contribute to the channel by adding the package to it. When installing a Conda package you specify the package name, version (optional) and channel to download from.\nA Conda environment is essentially a directory that is added to your PATH and that contains a specific collection of packages that you have installed. Packages are symlinked between environments to avoid unnecessary duplication.\n\nDifferent Conda flavours You may come across several flavours of Conda. There’s Miniconda, which is the installer for Conda. The second is Anaconda, which is a distribution of not only Conda, but also over 150 scientific Python packages curated by the company by the same name (Anaconda). It’s generally better to stick with the Miniconda installation rather than installing 3 GB worth of packages you may not even use. Then, lastly, there’s the Miniforge flavour that we’re using here, which is a community-driven version of Conda that’s highly popular within the scientific community.\nThe difference between Miniconda and Miniforge is that the former points to points to the default channel by default (which requires an Anaconda license for commercial purposes), while the latter points to the community-maintained conda-forge channel by default. While Conda is created and owned by Anaconda the company, Conda itself is open source - it’s the default channel that is proprietary. The conda-forge and bioconda channels (two of the largest channels outside of default) are community-driven. Confusing? Yes. If you want this information more in-depth you can read this blog post by Anaconda." + }, + { + "objectID": "pages/conda.html#introduction", + "href": "pages/conda.html#introduction", + "title": "Controlling your environment with Conda", + "section": "", + "text": "Conda is a package and environment manager. As a package manager it enables you to install a wide range of software and tools using one simple command: conda install. 
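For example, installing a pinned version of a tool from a specific channel might look like the following sketch (FastQC and the Bioconda channel are the same examples used later in this tutorial):

```bash
# Install a specific version of FastQC from the bioconda channel
conda install -c bioconda fastqc=0.12.1
```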
As an environment manager it allows you to create and manage multiple different environments, each with their own set of packages.\nWhat are the benefits of using an environment manager? Some examples include the ability to easily run different versions of the same package, have different cross-package dependencies that are otherwise incompatible with each other and, last but not least, easy installation of all the software needed for an analysis.\nEnvironments are of particular relevance when making bioinformatics projects reproducible. Full reproducibility requires the ability to recreate the system that was originally used to generate the results. This can, to a large extent, be accomplished by using Conda to make a project environment with specific versions of the packages that are needed in the project. You can read more about Conda here.\nA Conda package is a compressed tarball (system-level libraries, Python or other modules, executable programs or other components). Conda keeps track of the dependencies between packages and platforms - this means that when installing a given package, all necessary dependencies will also be installed.\nConda packages are typically hosted and downloaded from remote so-called channels. Some widely used channels for general-purpose and bioinformatics packages are conda-forge and Bioconda, respectively. Both of these are community-driven projects, so if you’re missing some package you can contribute to the channel by adding the package to it. When installing a Conda package you specify the package name, version (optional) and channel to download from.\nA Conda environment is essentially a directory that is added to your PATH and that contains a specific collection of packages that you have installed. Packages are symlinked between environments to avoid unnecessary duplication.\n\nDifferent Conda flavours You may come across several flavours of Conda. There’s Miniconda, which is the installer for Conda. The second is Anaconda, which is a distribution of not only Conda, but also over 150 scientific Python packages curated by the company by the same name (Anaconda). It’s generally better to stick with the Miniconda installation rather than installing 3 GB worth of packages you may not even use. Then, lastly, there’s the Miniforge flavour that we’re using here, which is a community-driven version of Conda that’s highly popular within the scientific community.\nThe difference between Miniconda and Miniforge is that the former points to points to the default channel by default (which requires an Anaconda license for commercial purposes), while the latter points to the community-maintained conda-forge channel by default. While Conda is created and owned by Anaconda the company, Conda itself is open source - it’s the default channel that is proprietary. The conda-forge and bioconda channels (two of the largest channels outside of default) are community-driven. Confusing? Yes. If you want this information more in-depth you can read this blog post by Anaconda." + }, + { + "objectID": "pages/conda.html#the-basics", + "href": "pages/conda.html#the-basics", + "title": "Controlling your environment with Conda", + "section": "2 The basics", + "text": "2 The basics\nThis tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to set it up if, you haven’t done so already. Then open up a terminal and go to workshop-reproducible-research/tutorials/conda. 
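Before starting, it can be worth verifying that your Conda installation is picked up by the terminal at all. A quick sketch:

```bash
# Print the installed Conda version
conda --version

# Show general information about the installation, including the configured channels
conda info
```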
Instructions below assume that you are standing in workshop-reproducible-research/tutorials/conda/ unless otherwise specified (e.g. if it says “create a file”, it means save it in workshop-reproducible-research/tutorials/conda/).\nLet’s assume that you are just about to start a new exciting research project called Project A.\n\n2.1 Creating Conda environments\n\nLet’s make our first Conda environment:\n\nconda create -n project_a -c bioconda fastqc\nThis will create an environment called project_a, containing FastQC from the Bioconda channel. Conda will list the packages that will be installed and ask for your confirmation.\n\nOnce it is done, you can activate the environment:\n\nconda activate project_a\nBy default, Conda will add information to your prompt telling you which environment is active.\n\nTo see all your environments you can run:\n\nconda info --envs\nThe active environment will be marked with an asterisk.\n\nTo see the installed packages and their versions in the active environment, run:\n\nconda list\n\nTo save the installed packages to a file, run:\n\nconda env export --from-history > environment.yml\nWhere --from-history only reports the packages requested to be installed and not additional dependencies. A caveat is that if no version was originally specified, then it is not included in the export file either.\n\nNow, deactivate the environment by running conda deactivate.\nList all environments again. Which environment is now marked as active?\nTry to run FastQC:\n\nfastqc --version\n\nDid it work? Activate your project_a environment and run the fastqc --version command again. Does it work now?\n\nHopefully the FastQC software was not found in your base environment (unless you had installed it previously), but worked once your environment was activated.\n\n\n2.2 Adding more packages\n\nNow, let’s add another package (MultiQC) to our environment using conda install. Make sure that project_a is the active environment first.\n\nconda install -c bioconda multiqc\n\nIf we don’t specify the package version, the latest available version will be installed. What version of MultiQC got installed?\nRun the following to see what versions are available:\n\nconda search -c bioconda multiqc\n\nNow try to install a different version of MultiQC, e.g.:\n\nconda install -c bioconda multiqc=1.13\nRead the information that Conda displays in the terminal. It probably asks if you want to downgrade the initial MultiQC installation to the one specified here (1.13 in the example). You can only have one version of a given package in a given environment.\nLet’s assume that you will have sequencing data in your Project A, and want to use the latest BBMap software to align your reads.\n\nFind out what versions of BBMap are available in the Bioconda channel using conda search -c bioconda bbmap.\nNow install the latest available version of BBMap in your project_a environment.\n\nLet’s further assume that you have an old project (called Project Old) where you know you used BBMap 37.10. You just got back reviewer comments and they want you to include some alignment statistics. Unfortunately, you haven’t saved that information so you will have to rerun the alignment. Now, it is essential that you use the same version of BBMap that your results are based on, otherwise the alignment statistics will be misleading. Using Conda environments this becomes simple. 
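As a small aside (not part of the original exercise), Conda also accepts version constraints rather than exact pins, which can be handy when any sufficiently recent version will do; the MultiQC versions below are just reused from the example above for illustration:
conda search -c bioconda 'multiqc>=1.13'
conda install -c bioconda 'multiqc>=1.13,<1.15'
Quoting the version spec stops your shell from treating the > and < characters as redirections. Back to the old-project scenario: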
You can just have a separate environment for your old project where you have an old version of BBMap without interfering with your new Project A where you want the latest version.\n\nMake a new environment for your old project:\n\nconda create -n project_old -c bioconda bbmap=37.10\n\nList your environments (do you remember the command?).\nActivate project_old and check the BBMap version (bbmap.sh --version).\nActivate project_a again and check the BBMap version.\n\n\n\n2.3 Removing packages\nNow let’s try to remove an installed package from the active environment:\nconda remove multiqc\n\nRun conda deactivate to exit your active environment.\nNow, let’s remove an environment:\n\nconda env remove -n project_old\nAfter making a few different environments and installing a bunch of packages, Conda can take up some disk space. You can remove unnecessary files with the command:\nconda clean -a\nThis will remove package tar-balls that are left from package installations, unused packages (i.e. those not present in any environments), and cached data.\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nHow to use conda install for installing packages on the fly.\nHow to create, activate and change between environments.\nHow to remove packages or environments and clean up." + }, + { + "objectID": "pages/conda.html#working-with-environments", + "href": "pages/conda.html#working-with-environments", + "title": "Controlling your environment with Conda", + "section": "3 Working with environments", + "text": "3 Working with environments\nWe have up until now specified which Conda packages to install directly on the command line using the conda create and conda install commands. For working in projects this is not the recommended way. Instead, for increased control and reproducibility, it is better to use an environment file (in YAML format) that specifies the packages, versions and channels needed to create the environment for a project.\nThroughout these tutorials we will use a case study where we analyse an RNA-seq experiment with the multi-resistant bacteria MRSA (see intro). You will now start to make a Conda YAML file for this MRSA project. The file will contain a list of the software and versions needed to execute the analysis code.\nIn this Conda tutorial, all code for the analysis is available in the script code/run_qc.sh. This code will download the raw FASTQ-files and subsequently run quality control on these using the FastQC software.\n\n3.1 Working with environments\nWe will start by making a Conda YAML-file that contains the required packages to perform these two steps. Later in the course, you will update the Conda YAML-file with more packages, as the analysis workflow is expanded.\n\nLet’s get going! Make a YAML file called environment.yml looking like this, and save it in the current directory (which should be workshop-reproducible-research/tutorials/conda):\n\nchannels:\n - conda-forge\n - bioconda\ndependencies:\n - fastqc=0.12.1\n\nNow, make a new Conda environment from the YAML file (note that here the command is conda env create as opposed to conda create that we used before):\n\nconda env create -n project_mrsa -f environment.yml\n\n\n\n\n\n\nTip\n\n\n\nYou can also specify exactly which channel a package should come from inside the environment file, using the channel::package=version syntax.\n\n\n\n\n\n\n\n\nTip\n\n\n\nInstead of the -n flag you can use the -p flag to set the full path to where the Conda environment should be installed. 
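A minimal sketch of what that could look like, assuming an example path of envs/project_mrsa inside the current directory (any path works):
conda env create -p ./envs/project_mrsa -f environment.yml
conda activate ./envs/project_mrsa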
In that way you can contain the Conda environment inside the project directory, which does make sense from a reproducibility perspective, and makes it easier to keep track of what environment belongs to what project. If you don’t specify -p the environment will be installed in the envs/ directory inside your Conda installation path.\n\n\n\nActivate the environment!\nNow we can run the code for the MRSA project found in code/run_qc.sh, either by running bash code/run_qc.sh or by opening the run_qc.sh file and executing each line in the terminal one by one. Do this!\n\nThis should download the project FASTQ files and run FastQC on them (as mentioned above).\n\nCheck your directory contents (ls -Rlh, or in your file browser). It should now have the following structure:\n\n conda/\n |\n |- code/\n | |- run_qc.sh\n |\n |- data/\n | |- SRR935090.fastq.gz\n | |- SRR935091.fastq.gz\n | |- SRR935092.fastq.gz\n |\n |- results/\n | |- fastqc/\n | |- SRR935090_fastqc.html\n | |- SRR935090_fastqc.zip\n | |- SRR935091_fastqc.html\n | |- SRR935091_fastqc.zip\n | |- SRR935092_fastqc.html\n | |- SRR935092_fastqc.zip\n |\n |- environment.yml\nNote that all that was needed to carry out the analysis and generate these files and results was environment.yml (that we used to create a Conda environment with the required packages) and the analysis code in code/run_qc.sh.\n\n\n3.2 Keeping track of dependencies\nProjects can often be quite large and require lots of dependencies; it can feel daunting to try to capture all of that in a single Conda environment, especially when you consider potential incompatibilities that may arise. It can therefore be a good idea to start new projects with an environment file with each package you know that you will need to use, but without specifying exact versions (except for those packages where you know you need a specific version). This will install the latest compatible versions of all the specified software, making the start-up and installation part of new projects easier. You can then add the versions that were installed to your environment file afterwards, ensuring future reproducibility.\nThere is one command that can make this easier: conda env export. This allows you to export a list of the packages you’ve already installed, including their specific versions, meaning you can easily add them after the fact to your environment file. If you use the --no-builds flag, you’ll get a list of the packages minus their OS-specific build specifications, which is more useful for making the environment portable across systems. This way, you can start with an environment file with just the packages you need (without version), which will install the most up-to-date version possible, and then add the resulting version back in to the environment file using the export command!\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nHow to define our Conda environment using a YAML-file.\nHow to use conda env create to make a new environment from a YAML-file.\nHow to use conda env export to get a list of installed packages.\nHow to work in a project-like setting." + }, + { + "objectID": "pages/conda.html#extra-material", + "href": "pages/conda.html#extra-material", + "title": "Controlling your environment with Conda", + "section": "4 Extra material", + "text": "4 Extra material\nThe following extra material contains some more advanced things you can do with Conda and the command line in general, which is not part of the main course materials. 
All the essential skills are covered by the previous sections: the material here should be considered tips and tricks from people who use Conda as part of their daily work. You thus don’t need to use these things unless you want to, and you can even skip this part of the lesson if you like!\n\n4.1 Configuration\nThe behaviour of your Conda installation can be changed using an optional configuration file .condarc. On a fresh Conda install no such file is included but it’s created in your home directory as ~/.condarc the first time you run conda config.\nYou can edit the .condarc file either using a text editor or by way of the conda config command. To list all config parameters and their settings run:\nconda config --show\nSimilar to Conda environment files, the configuration file is in YAML syntax. This means that the config file is structured in the form of key:value pairs where the key is the name of the config parameter (e.g. auto_update_conda) and the value is the parameter setting (e.g. True).\nAdding the name of a config parameter to conda config --show will show only that parameter, e.g. conda config --show channels.\nYou can change parameters with the --set, --add, --append and --remove flags to conda config.\nIf you for example want to enable the ‘Always yes’ behaviour which makes Conda automatically choose the yes option, such as when installing, you can run:\nconda config --set always_yes True\nTo see details about a config parameter you can run conda config --describe parameter. Try running it on the channels parameter:\nconda config --describe channels\nIn the beginning of this tutorial we added Conda channels to the .condarc file using conda config --add channels. To remove one of the channels from the configuration file you can run:\nconda config --remove channels conda-forge\nCheck your .condarc file to see the change. To add the conda-forge channel back to the top of the channels simply run:\nconda config --add channels conda-forge\nTo completely remove a parameter and all its values run:\nconda config --remove-key parameter\nFor a list of Conda configuration parameters see the Conda configuration page.\n\n\n4.2 Managing Python versions\nWith Conda environments it’s possible to keep several different versions of Python on your computer at the same time, and switching between these versions is very easy. However, a single Conda environment can only contain one version of Python.\n\n4.2.1 Your current Python installation\nThe base environment has its own version of Python installed. When you open a terminal (after having installed Conda on your system) this base environment is activated by default (as evidenced by (base) prepended to your prompt). You can check what Python version is installed in this environment by running python --version. To see the exact path to the Python executable type which python.\nIn addition to this your computer may already have Python installed in a separate (system-wide) location outside of the Conda installation. To see if that is the case type conda deactivate until your prompt is not prepended with a Conda environment name. Then type which python. If a path was printed to the terminal (e.g. /usr/bin/python) that means some Python version is already installed in that location. Check what version it is by typing python --version.\nNow activate the base environment again by typing conda activate (or the equivalent conda activate base) then check the Python installation path and version using which and python --version as above. 
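A minimal sketch of that comparison (the example paths below are made up and will differ on your system):
conda deactivate              # leave all Conda environments
which python                  # might print e.g. /usr/bin/python, or nothing at all
conda activate base           # back to the base environment
which python                  # now points into the Conda installation, e.g. ~/miniforge3/bin/python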
See the difference? When you activate an environment your $PATH variable is updated so that when you call python (or any other program) the system first searches the directory of the currently active environment.\n\n\n4.2.2 Different Python versions\nWhen you create a new Conda environment you can choose to install a specific version of Python in that environment as well. As an example, create an environment containing Python version 3.5 by running:\nconda create -n py35 python=3.5\nHere we name the environment py35 but you can choose whatever name you want.\nTo activate the environment run:\nconda activate py35\nYou now have a completely separate environment with its own Python version.\nLet’s say you instead want an environment with Python version 2.7 installed. You may for instance want to run scripts or packages that were written for Python 2.x and are thus incompatible with Python 3.x. Simply create the new Conda environment with:\nconda create -n py27 python=2.7\nActivate this environment with:\nconda activate py27\nNow, switching between Python versions is as easy as typing conda activate py35 / conda activate py27.\n\n\n\n\n\n\nNote\n\n\n\nIf you create an environment where none of the packages require Python, and you don’t explicitly install the python package then that new environment will use the Python version installed in your base environment.\n\n\n\n\n\n4.3 Decorating your prompt\nBy default, the name of the currently activated environment is added to your command line prompt. This is a good thing, as it makes it easier to keep track of what environment and packages you have access to. The way this is done in the default implementation becomes an issue when using absolute paths for environments (specifying conda env create -p path/to/environment), though, as the entire path will be added to the prompt. This can take up a lot of unnecessary space on your screen, but can be solved in a number of ways.\nThe most straightforward way to solve this is to change the Conda configuration file, specifically the settings of the env_prompt configuration value which determines how Conda modifies your command line prompt. For more information about this setting you can run conda config --describe env_prompt and to see your current setting you can run conda config --show env_prompt.\nBy default env_prompt is set to ({default_env}) which modifies your prompt with the active environment name if it was installed using the -n flag or if the environment folder has a parent folder named envs/. Otherwise the full environment path (i.e. the ‘prefix’) is displayed.\nIf you instead set env_prompt to ({name}) Conda will modify your prompt with the folder name of the active environment. You can change the setting by running conda config --set env_prompt '({name}) '\nIf you wish to keep the ({default_env}) behaviour, or just don’t want to change your Conda config, an alternative is to keep Conda environment folders within a parent folder called envs/. This will make Conda only add the folder name of the Conda environment to your prompt when you activate it.\nAs an example, say you have a project called project_a with the project path ~/myprojects/project_a. You could then install the environment for project_a into a folder ~/myprojects/project_a/envs/project_a_environment. Activating the environment by pointing Conda to it (e.g. 
conda activate ~/myprojects/project_a/envs/project_a_environment) will only cause your prompt to be modified with project_a_environment.\n\n\n4.4 Bash aliases for conda\nSome programmers like to have aliases (i.e. shortcuts) for common commands. Two aliases that might be useful for you are alias coac='conda activate' and alias code='conda deactivate'. Don’t forget to add them to your ~/.bash_profile if you want to use them!\n\n\n4.5 Rolling back to an earlier version of the environment\nThe history of the changes to an environment are automatically tracked. You can see revisions to an environment by using:\nconda list --revisions\nWhich shows each revision (numbered) and what’s installed.\nYou can revert back to particular revision using:\nconda install --revision 5\n\n\n4.6 Mamba, the drop-in Conda replacement\nThere is another piece of software that is built on top of Conda as a drop-in replacement for it: Mamba. The reason for Mamba’s existence is that it used to have a better solver algorithm for the dependency tree than Conda did. These days, however, this algorithm is included in Conda as the default. There is still some minor reasons you might want to use Mamba, however, the first of which being that Mamba re-implements Conda in C++, which runs slightly faster than the Python-based Conda. This only yields a minor speed increase compared to the dependency-tree algorithm, though, so don’t expect major differences in execution time between Conda and Mamba. Another reason is that Mamba colours its output, which is nice if you care about that sort of thing. If you installed Conda as described in the pre-course material you’ll, conveniently, already have installed Mamba as well!" + }, { "objectID": "pages/git.html", "href": "pages/git.html", @@ -1176,10 +1330,283 @@ "text": "9 Extra material\nThe following extra material contains some more advanced things you can do with Git and the command line in general, which is not part of the main course materials. All the essential skills of Git are covered by the previous sections; the material here should be considered tips and tricks from people who use Git every day. You thus don’t need to use these things unless you want to, and you can even skip this part of the lesson if you like!\nIf you are interested in learning more about Git in general, here are some reading tips for you:\n\nGit cheat-sheet\nA simple Git guide\nResources to learn Git\nGit reference manual\n\n\n9.1 Forking\nWhen you want to work on an Open Source project that is available on e.g. GitHub, you usually don’t have permission to directly push code to the project’s repository - this is so that the project’s maintainers are the only ones that can directly change anything in their codebase. How do you then contribute to projects that don’t allow you to push your code to their repository? Simple: use forking!\nForking is when you make your own copy of a repository on your GitHub account, which you will then have permissions to change as you see fit. You can then create pull requests from your fork to the original repository, rather than pushing code to a new branch and making a pull request from that. 
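On the command line, a fork-based contribution might look something like the following sketch (the user and repository names are made up, and the fork itself is created beforehand via the GitHub web interface):
git clone git@github.com:your-username/some-project.git    # clone your fork, not the original repository
cd some-project
git switch --create my-contribution
# ...edit files, then git add and git commit as usual...
git push --set-upstream origin my-contribution
The pull request from your fork to the original repository is then opened in the GitHub web interface.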
Working with forks just adds an additional step to the whole workflow: instead of being “clone; code and commit changes on a new branch; push branch to remote; pull request from branch” it becomes “fork; clone; code and commit changes; push code to fork; pull request from fork”.\nYou might also want to do a fork of a project simply because you want to have your own copy of it as well, without ever having the intention of changing it. This is, of course, perfectly fine as well, but do keep in mind that developers are usually quite happy to incorporate new changes from contributors if they are reasonable and fulfil a purpose and add functionality to the project. It is quite common that you have a use-case the maintainer didn’t think of before, and that you’ve helped the project grow by contributing your code!\n\n\n9.2 Amending commits\nOnce in a while you’ll have just committed something to your Git repo and immediately remembered that you forgot to add something small, or perhaps you saw an error somewhere. While you can certainly just add that and make a new commit, wouldn’t it be nicer if you could just make the change as if it was already a part of the first commit? Well, you can! Just make the change, stage it and then commit together with the --amend flag, like so:\ngit add <file>\ngit commit --amend\nThis will add the staged changes to the previous commit as if they had always been there. Be careful, though! This will actually rewrite history, meaning that it only works if you only amended local changes. If you had already pushed the first commit to a remote repository you would run into trouble: you will be able to make the amend without issue, but you’ll get an error when you try to push your new changes, since the remote already contains the first version of the commit and can’t simply rewrite what it already has.\nAmending changes is thus a good way to fix small mistakes you realise you made just after committing them, as long as you only amend local changes!\n\n\n9.3 Rebasing\nThe git rebase command is an alternative to git merge in that it solves the same problem: getting changes in one branch into another branch. We’ve already gone through merging extensively, so how is rebasing different? Let’s look at a common case: a feature-branch which we want to get into the main branch.\n\nRecall that a merge creates a merge commit, something akin to Merge branch 'feature-branch' into main or similar. This is a new commit that didn’t exist before, which brings the changes on feature-branch into main, but it contains no actual work itself. This is both a good and a bad thing: good, because merging is a safe, non-destructive operation (it doesn’t alter history); bad, because it can make the history itself look quite messy. These are the commands used and what the history will look like afterwards:\ngit switch main\ngit merge feature-branch\n\n(The commit with the dashed border is the merge commit.)\nRebasing, on the other hand, does not create merge commits. Indeed, what rebase does is to “re-base” one branch on the other, i.e. pretend that new changes were done on a different base than what actually happened (hence the name). 
Getting our feature-branch onto main using rebase actually entails two separate steps: first the rebase itself, followed by a fast-forward merge:\ngit switch feature-branch\ngit rebase main\n\nThis step rebases our feature-branch on top of main, meaning that we pretend that the commits on feature-branch were done based on the latest commits on main - you can also think of it as moving the entire feature-branch to the tip of the main branch. The commits with the dashed borders here indicate brand new commits; rebasing can’t somehow move the commits to the new base, rather it has to “replay” those commits as if they were done on the new base.\ngit switch main\ngit merge feature-branch\n\nWe’ve now got our feature-branch commits onto main with a single, linear history without any merge commits! We did have to rewrite history, though, when we did the rebase itself. As with amending (see above), this is fine if we’re only working locally, but we’ll quickly run into trouble if we try to rebase things that have already been pushed. We can rebase on top of remote things, of course, since we’re not changing any remote history, only the local history. Be careful when you rebase!\n\n\n9.4 Rebasing as clean-up\nIf the above section felt scary, don’t worry! There’s another highly useful use-case for git rebase that doesn’t risk destroying any history, namely local clean-up!\nLet’s imagine you’ve worked on your local feature-branch for some time, and you have a number of commits on it. Some are highly related to each other and might actually be better suited as a single commit. You’ve also spotted a spelling error in one commit message, and realised that you missed important information in another. We can actually solve all of these issues with an interactive rebase! If you have 4 commits on your branch you can type the following:\ngit rebase -i HEAD~4\nThe -i flag means interactive, while HEAD~4 means 4 commits back from HEAD. This will open your default text editor and give you a selection looking something like this:\npick 0abf162 First feature commit\npick befc682 A minor change on the first commit\npick c9d1426 A commit with an uncomplete commit message\npick 2e0cb97 A commit with a spelling mitake\n\n# Rebase 879ddcc..0abf162 onto 879ddcc (4 commands)\n#\n# Commands:\n# p, pick <commit> = use commit\n# r, reword <commit> = use commit, but edit the commit message\n# e, edit <commit> = use commit, but stop for amending\n# s, squash <commit> = use commit, but meld into previous commit\n\n(... more instructions ...)\nThe commits are ordered with the most recent one at the bottom. The commented instructions (all of which are not shown here) show you what alternatives you have to work with; all you have to do is to change the pick keyword next to the commit hashes to whatever keyword you need from the list, save and exit.\nIn order to solve the toy example here we might decide that the four keywords should be pick, squash, reword and reword, from top to bottom. Once that’s done simply save and exit, and another instance of your default text editor will open for you to complete the specified changes. In the case above we’d get two separate new instances where we can change the commit message - these work the same as any normal commit.\nInteractive rebasing is thus well-suited for fixing and cleaning of local changes you have yet to push anywhere, even if you don’t use rebasing as an alternative to merging! 
This can make your Git history both cleaner and more concise, which is great when you’re collaborating with others.\n\n\n9.5 Resetting\nSometimes you’ll want to simply discard changes you’ve already committed. This should, however, be something that you rarely have to do. Completely moving back to a previous commit is something called a hard reset, which can be accomplished like so:\ngit reset --hard 5b83463\nYou specify the commit you wish to return to, discarding all other changes, including any changes done to the working directory. It goes without saying that this command is among the most dangerous commands available in Git and should be used with caution.\n\n\n9.6 The reflog\nWe have shown many ways to work with Git and its various commands, and it occasionally happens that errors are introduced - especially when you’re not careful with using git commit --amend, git rebase or git reset on remote changes. This is where the reflog comes in. Think of the reflog as Git’s “safety net”: it stores almost every change you make to a Git repository (regardless of whether you commit the change) in a chronological manner. The following is an example of what the output of the git reflog command might show:\n58deba6 HEAD@{0}: merge: feature-branch: Fast-forward\n8c80c88 HEAD@{1}: checkout: moving from feature-branch to main\n555544a HEAD@{2}: commit: feature development 2\n4c92630 HEAD@{3}: commit: feature development 1\n8c80c88 HEAD@{4}: checkout: moving from main to feature-branch\nIt shows the most recent change at the top, notified by HEAD@{0}. We thus have a merging of feature-branch into main, a checkout (switch) into main, two commits on feature-branch and a checkout into feature-branch - reading it backwards we get a chronological log of what has happened.\nThe reflog is incredibly useful for when you’ve lost something you later realise you want to access again, such as when you’ve just used git reset. The reflog might look like this, for example:\nbc3641f HEAD@{0}: reset: moving to HEAD~2\ncaf9321 HEAD@{1}: commit: More work on the feature\n1bc36af HEAD@{2}: commit: Work on a new feature\nWe see two commits related to some new feature and a reset to HEAD~2 (two commits back from HEAD). If we realise that we actually liked the work we just threw away we can move around in the reflog in a similar manner we do normal commits:\ngit reset HEAD@{1}\nThis will put us back to the state we were in before we used git reset. We here refer to the reflog using the HEAD@{N} notation, which differs from the usual HEAD~N notation so that it is clear if it is the commit history or the reflog that is intended. While the reflog is hopefully not something you’ll have to use often it’s quite useful to know it exists, if only to be able to search the internet for more details regarding a problem you’ve encountered!\n\n\n9.7 Decorating your prompt\nWhen you are working on the command line interface (CLI), you will usually have some small pieces of information relating to your current directory, the name of the computer or host you’re working on, and so forth. You’ve probably already seen your prompt while working with Git throughout this lesson, but here’s an example of what one might look like:\nerikfmbp:~/teaching/workshop-reproducible-research erik.fasterius $\nThe above prompt contains the name of the computer, a colon, the current working directory, the username and a dollar-sign; it is stored in the variable PS1. 
You can type echo $PS1 to see what variables your prompt is made up of; the above example contains \\h:\\W \\u\\$, where \\h is the hostname, \\W the working directory and \\u the username.\n\n\n\n\n\n\nNote\n\n\n\nIf you’re using zsh instead of bash you’ll have to replace the backslashes (\\) in the commands with percent signs (%).\n\n\nSome people like to also show the current branch on their prompt, thus avoiding having to type git branch continuously. There are several ways you might do this, and we’re only presenting one of them here: a bash function.\ngit_branch() {\n git branch 2> /dev/null | sed -e '/^[^*]/d' -e 's/* \\(.*\\)/ (\\1)/'\n}\nThis function does a number of things:\n\nEjects the error message from Git if the current directory isn’t a part of a Git repository into /dev/null (i.e. into nothing).\nFind the current branch by searching for a line that starts with * (i.e. the current branch) using the command line program sed.\nPut the current branch into parentheses with a space before it.\n\nWe can then build our new prompt by adding this function into it:\n# The first part of the old prompt\nPS1='\\h:\\W \\u'\n\n# Add the Git branch\nPS1=$PS1'$(git_branch)'\n\n# Add the last part of the old prompt\nPS1=$PS1' \\$'\nNow you should see the current Git branch on your prompt! The only problem now is that this only works for your current session: once you restart your CLI you’ll have to re-define your prompt again. This can be circumvented, though. What you need to do is to add the code defining your prompt into your so-called bash profile: ~/.bash_profile. Every time you load a new CLI session this file is read and any code inside it is executed. You might already have this file, so make sure you don’t overwrite it!\n\n\n9.8 Bash aliases for git\nSome Git commands are used over and over again when working with git, such as git status. Some people like to have aliases (i.e. shortcuts) for these common commands. Here is a small list of such aliases that you may find useful or, even better, might inspire you to create your own! Add them to your ~/.bash_profile as above, so that they’re available across sessions.\n# Basic Git commands\nalias ga='git add'\nalias gb='git branch'\nalias gc='git commit'\nalias gd='git diff'\nalias gl='git log'\nalias gm='git merge'\nalias gp='git push'\nalias gt='git tag'\nalias gu='git pull'\nalias gw='git switch'\n\n# Git status in short format\nalias gs='git status --short'\n\n# Show diff of staged files\nalias gds='git diff --staged'\n\n# Add and commit all tracked and modified files\nalias gca='git commit --all'\n\n# Create and switch to a new branch\nalias gwc='git switch --create'\n\n# Git log with one line per commit\nalias glo='git log --oneline'\n\n\n9.9 Pretty logs\nIf you want to customise e.g. the format and the colours of the logs you can use the gitconfig file (the same one we added things to using git config --global user.name \"Mona Lisa\" in the pre-course setup). You can read more about exactly what you can do at the documentation for Git configs and pretty formats, but we’ll provide two examples here:\n[format]\n pretty = format:%C(yellow)commit %H %C(auto)%d %nAuthor: %C(cyan)%aN %C(italic reset)(%ae) %nDate: %C(blue)%ar %C(italic reset)(%ai) %n%n%C(bold reset)%w(0,6,6)%s%n%C(reset)%+b\nThis first example alters the format of the default git log command. It looks similar to what you’d be used to seeing with that command, except his has some colour highlights and adds the relative date (e.g. 
“1 hour ago” and similar relative times).\n[pretty]\n line = format:%C(yellow)%h %C(blue)%>(12)%ar %C(cyan)%aN%C(auto)%d %C(reset)%s\nThis second example is a custom format that can be called using git log --pretty=<format-name>, and is similar to the built-in --oneline flag, but also contains nicer colours, the relative date as well as the author name; the format name line here is used for its similarity to oneline. You can add any number of custom formats you like using such config specifications. If you’re using aliases as in the section above you might change the glo alias to be git log --pretty=line instead, which will give you the nicer log on one line." }, { "objectID": "pages/jupyter.html", "href": "pages/jupyter.html", "title": "Working with Jupyter", "section": "", "text": "The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain code, equations, visualizations and text. The functionality is partly overlapping with Quarto (see the tutorial), in that they both use markdown and code chunks to generate reports that integrate results of computations with the code that generated them. Jupyter Notebook comes from the Python community while Quarto was developed by Posit (who also created R Markdown and RStudio), but you could use most common programming languages in either alternative. In practice though, it’s quite common that R developers use Jupyter but probably not very common that Python developers use RStudio. Some reasons to use Jupyter include:\n\nPython is lacking a really good IDE for doing exploratory scientific data analysis, like RStudio or Matlab. Some people use Jupyter simply as an alternative for that.\nThe Jupyter Project community is large and dynamic, and there are lots of tools for sharing, displaying or interacting with notebooks.\nAn early ambition with Jupyter notebooks (and its predecessor IPython notebooks) was to be analogous to the lab notebook used in a wet lab. 
It would allow the data scientist to document his or her day-to-day work and interweave results, ideas, and hypotheses with the code. From a reproducibility perspective, this is one of the main advantages.\nJupyter notebooks can be used, just like Quarto, to provide a tighter connection between your data and your results by integrating results of computations with the code that generated them. They can also do this in an interactive way that makes them very appealing for sharing with others.\n\nAs always, the best way is to try it out yourself and decide what to use it for!\nThis tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to set it up if you haven’t done so already. Then open up a terminal and go to workshop-reproducible-research/tutorials/jupyter and activate your jupyter-env Conda environment.\n\n\n\n\n\n\nA note on nomenclature\n\n\n\n\nJupyter: a project to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages. Lives at jupyter.org.\nJupyter Notebook: A web application that you use for creating and managing notebooks. One of the outputs of the Jupyter project.\nJupyter lab: A more powerful and feature-rich interface that also includes a terminal, debugger, tabs etc.\nJupyter notebook: The actual .ipynb file that constitutes your notebook." + }, + { + "objectID": "pages/jupyter.html#introduction", + "href": "pages/jupyter.html#introduction", + "title": "Working with Jupyter", + "section": "", + "text": "The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain code, equations, visualizations and text. The functionality is partly overlapping with Quarto (see the tutorial), in that they both use markdown and code chunks to generate reports that integrate results of computations with the code that generated them. Jupyter Notebook comes from the Python community while Quarto was developed by Posit (who also created R Markdown and RStudio), but you could use most common programming languages in either alternative. In practice though, it’s quite common that R developers use Jupyter but probably not very common that Python developers use RStudio. Some reasons to use Jupyter include:\n\nPython is lacking a really good IDE for doing exploratory scientific data analysis, like RStudio or Matlab. Some people use Jupyter simply as an alternative for that.\nThe Jupyter Project community is large and dynamic, and there are lots of tools for sharing, displaying or interacting with notebooks.\nAn early ambition with Jupyter notebooks (and its predecessor IPython notebooks) was to be analogous to the lab notebook used in a wet lab. It would allow the data scientist to document his or her day-to-day work and interweave results, ideas, and hypotheses with the code. From a reproducibility perspective, this is one of the main advantages.\nJupyter notebooks can be used, just like Quarto, to provide a tighter connection between your data and your results by integrating results of computations with the code that generated them. They can also do this in an interactive way that makes them very appealing for sharing with others.\n\nAs always, the best way is to try it out yourself and decide what to use it for!\nThis tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to set it up if you haven’t done so already. 
Then open up a terminal and go to workshop-reproducible-research/tutorials/jupyter and activate your jupyter-env Conda environment.\n\n\n\n\n\n\nA note on nomenclature\n\n\n\n\nJupyter: a project to develop open-source software, open-standards, and services for interactive computing across dozens of programming languages. Lives at jupyter.org.\nJupyter Notebook: A web application that you use for creating and managing notebooks. One of the outputs of the Jupyter project.\nJupyter lab: A more powerful and feature-rich interface that also includes a terminal, debugger, tabs etc.\nJupyter notebook: The actual .ipynb file that constitutes your notebook." + }, + { + "objectID": "pages/jupyter.html#the-basics", + "href": "pages/jupyter.html#the-basics", + "title": "Working with Jupyter", + "section": "2 The basics", + "text": "2 The basics\nOne thing that sets Jupyter Notebook apart from what you might be used to is that it’s a web application, i.e. you edit and run your code from your browser. But first you have to start the Jupyter Notebook server. At this point you may either try the classic notebook interface by running:\njupyter notebook --allow-root\nOr give the more feature-rich Jupyter lab interface a try by running:\njupyter lab --allow-root\nWhichever interface you choose you should see something similar to this printed to your terminal:\n[I 18:02:26.722 NotebookApp] Serving notebooks from local directory: /Users/john/workshop-reproducible-research/tutorials/jupyter\n[I 18:02:26.723 NotebookApp] 0 active kernels\n[I 18:02:26.723 NotebookApp] The Jupyter Notebook is running at:\n[I 18:02:26.723 NotebookApp] http://localhost:8888/?token=e03f10ccb40efc3c6154358593c410a139b76acf2cae000\n[I 18:02:26.723 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).\n[C 18:02:26.724 NotebookApp]\n\n Copy/paste this URL into your browser when you connect for the first time,\n to login with a token:\n http://localhost:8888/?token=e03f10ccb40efc3c6154358593c410a139b76acf2cae785c\n[I 18:02:27.209 NotebookApp] Accepting one-time-token-authenticated connection from ::1\n\n\n\n\n\n\nA note for Windows users\n\n\n\nIf you see the error message Start : This command cannot be run due to the error: The system cannot find the file specified. ... then try starting Jupyter with jupyter notebook --no-browser then copy the URL given into the browser directly.\n\n\n\n\n\n\n\n\nJupyter notebook versions\n\n\n\nDepending on what version of the notebook conda package you have installed, the interface may look slightly different. The screenshots in this tutorial are from version 7, an update which has brought the ‘classic’ Jupyter notebook closer to the Jupyter lab interface. Read more about this update at the Jupyter homepage.\n\n\nThe Jupyter Notebook/Lab interface probably opened up a web browser for you automatically, otherwise go to the address specified in the message in the terminal. Note that the server is running locally (as http://localhost:8888) so this does not require that you have an active internet connection. Also note that it says:\nServing notebooks from local directory: </some/local/path/workshop-reproducible-research/tutorials/jupyter>\nEverything you do in your Notebook session will be stored in this directory, so you won’t lose any work if you shut down the server.\n\n\n\nWhat you’re looking at is the Notebook dashboard. This is where you manage your files, notebooks, and kernels. The Files tab shows the files in your directory. 
The Running tab keeps track of all your processes.\nThe Jupyter lab dashboard should look something like this:\n\n\n\nLet’s start by creating an empty notebook. You can do this by selecting the Files tab and clicking New > Notebook. When the notebook opens, select the suggested Python 3 kernel from the drop-down menu.\nThis will open up a new tab or window looking like this:\n\n\n\nStart by giving your notebook a name by clicking on the text “Untitled” at the top of the page. Enter “jupyter-tutorial.ipynb”.\nNote that for most of this tutorial we will describe how you work in the actual notebook and not devote a lot of time to the extra features available in the Jupyter lab interface.\n\n\n\n\n\n\nTip\n\n\n\nIf you want to start Jupyter Notebooks on a cluster that you SSH to (e.g. Uppmax) see the section in the Extra material\n\n\nJupyter notebooks are made up of cells, and you are currently standing in the first cell in your notebook. Your cursor should be blinking in this cell, indicating that you are in “Edit mode”, meaning that you can type text in the cell. Pressing the Esc key puts you in “Command mode”, which allows you to manipulate the notebook as a whole; more on this later.\nCells in Jupyter notebooks can be of two types: markdown or code.\n\nMarkdown:\n\nThese cells contain static material such as captions, text, lists, images and so on. You express this using Markdown, which is a lightweight markup language. Markdown documents can then be converted to other formats for viewing (the document you’re reading now is written in Markdown and then converted to HTML). The format is discussed a little more in detail in the Quarto tutorial. Jupyter Notebook uses a dialect of Markdown called GitHub Flavoured Markdown, which is described here.\n\nCode:\n\nThese are the cells that actually do something, just as code chunks do in Quarto/R Markdown. You can write code in dozens of languages and do all kinds of clever tricks. You then run the code cell and any output the code generates, such as text or figures, will be displayed beneath the cell. We will get back to this in much more detail, but for now it’s enough to understand that code cells are for executing code that is interpreted by a kernel (in this case the Python version in your Conda environment).\nBefore we continue, here are some shortcuts that can be useful. Note that they are only applicable when in “Command mode”. Most of them are also available from the menus. You can also view this list of shortcuts from the Help menu under “Show Keyboard Shortcuts”.\n\n\n\nShortcut\nEffect\n\n\n\n\nenter\nEnter Edit mode\n\n\nescape\nEnter Command mode\n\n\nctrl-enter\nRun the cell\n\n\nshift-enter\nRun the cell and select the cell below\n\n\nalt-enter\nRun the cell and insert a new cell below\n\n\ns\nSave the notebook\n\n\ntab\nFor code completion or indentation\n\n\nm,y\nToggle between Markdown and Code cells\n\n\nd-d\nDelete a cell\n\n\na\nInsert cells above current cell\n\n\nb\nInsert cells below current cell\n\n\nx\nCut currently selected cells\n\n\nv\nPaste cell below\n\n\no\nToggle output of current cell" }, { "objectID": "pages/jupyter.html#writing-markdown", "href": "pages/jupyter.html#writing-markdown", "title": "Working with Jupyter", "section": "3 Writing markdown", "text": "3 Writing markdown\nLet’s use our first cell to create a header. Change the format from Code to Markdown using the drop-down list in the Notebook Toolbar, or by pressing the m key when in command mode. 
Double click on the cell, or hit enter to enter editing mode and input “# My notebook” (“#” is used in Markdown for header 1). Run the cell with ctrl-enter (cmd-enter on Mac).\nMarkdown is a simple way to structure your notebook into sections with descriptive notes, lists, links, images etc.\nBelow are some examples of what you can do in markdown. Paste all or parts of it into one or more cells in your notebook to see how it renders. Make sure you set the cell type to Markdown.\n## Introduction\nIn this notebook I will try out some of the **fantastic** concepts of Jupyter\nNotebooks.\n\n## Markdown basics\nExamples of text attributes are:\n\n- *italics*\n- **bold**\n- `monospace`\n\nSections can be separated by horizontal lines.\n\n---\n\nBlockquotes can be added, for instance to insert a Monty Python quote:\n\n> Spam!\n> Spam!\n> Spam!\n> Spam!\n\nSee [here](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html) for more information." }, { "objectID": "pages/jupyter.html#writing-code", "href": "pages/jupyter.html#writing-code", "title": "Working with Jupyter", "section": "4 Writing code", "text": "4 Writing code\nNow let’s write some code! Since we chose a Python kernel, Python would be the native language to run in a cell. Enter this code in the second cell and run it:\nprint(\"Hello world!\")\nNote how the output is directly displayed below the cell. This interactive way of working is one of the things that sets Jupyter Notebook apart from RStudio and Quarto. In RStudio/Quarto, documents are typically rendered top-to-bottom in one run, while you work in a Jupyter notebook in a different way. This requires some special attention when it comes to reproducibility, which we will get back to in the reproducibility section.\nWhat is a Jupyter notebook? Let’s take a closer look at the notebook we’re currently working in. Jupyter Notebooks are autosaved every minute or so, so you will already have it available. We can be a little meta and do this from within the notebook itself, by running some shell commands in a code cell. This very handy functionality is possible by prepending the command with !. Try adding !ls to a code cell and run it. This will list the files in the current directory.\nAha, we have a new file called jupyter-tutorial.ipynb! This is our notebook. Look at the first ten lines of the file by using !head jupyter-tutorial.ipynb. Seems like it’s just a plain old JSON file. Since it’s a text file it’s suitable for version control with for example Git. There are however some special considerations to take into account for Notebooks which we will cover in the reproducibility section of this tutorial.\nVariables defined in cells become variables in the global namespace. You can therefore share information between cells. Try to define a function or variable in one cell and use it in the next. For example, add the following to a new cell and run it:\ndef print_me(str):\n print(str)\nNow create a new cell and add:\nprint_me(\"Hi!\")\nYour notebook should now look something like this.\n\n\n\nThe focus of this tutorial is not on how to write Markdown or Python; you can make really pretty notebooks with Markdown and you can code whatever you want with Python. 
Rather, we will focus on the Jupyter Notebook features that allow you to do a little more than that.\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nThat a Jupyter notebook consists of a series of cells, and that they can be either markdown or code cells.\nThat we execute the code in a code cell with the kernel that we chose when opening the notebook.\nWe can run shell commands by prepending them with !.\nA Jupyter notebook is simply a text file in JSON format." + }, + { + "objectID": "pages/jupyter.html#magics", + "href": "pages/jupyter.html#magics", + "title": "Working with Jupyter", + "section": "5 Magics", + "text": "5 Magics\nMagics constitute a simple command language that significantly extends the power of Jupyter notebooks. There are two types of magics:\n\nLine magics: Commands that are prepended by %, and whose arguments only extend to the end of the line.\nCell magics: Commands that start with %% and then apply to the whole cell. Must be written on the first line of a cell.\n\nNow list all available magics with %lsmagic (which itself is a magic). You add a question mark to a magic to show the help (e.g. %lsmagic?). Some of them act as shortcuts for commonly used shell commands (%ls, %cp, %cat, ..). Others are useful for debugging and optimizing your code (%timeit, %debug, %prun, ..). For more information see the magics documentation.\nA very useful magic, in particular when using shell commands a lot in your work, is %%capture. This will capture the stdout/stderr of any code cell and store them in a Python object. Run %%capture? to display the help and try to understand how it works. Try it out with either some Python code, other magics or shell commands. Here is an example of how you can make it work:\n%%capture output\n%%bash\necho \"Print to stdout\"\necho \"Print to stderr\" >&2\n… And in another cell:\nprint(\"stdout:\" + output.stdout)\nprint(\"stderr:\" + output.stderr)\n\nTip You can capture the output of some magics directly like this: my_dir = %pwd.\n\nThe %%script magic is used for specifying a program (Bash, Perl, Ruby, ..) with which to run the code (similar to a shebang). For some languages it’s possible to use these shortcuts:\n\n%%ruby\n%%perl\n%%bash\n%%html\n%%latex\n%%R\n\n\n\n\n\n\n\nA note on R code\n\n\n\nIn order to use the %%R magic you need to install the rpy2 extension, for example with Conda. This package is already installed in the jupyter-env environment you’re using for this tutorial. However, you also have to load it by running %load_ext rpy2.ipython in a cell.\n\n\nTry this out if you know any of the languages above. Otherwise you can always try to print the quadratic formula with LaTeX!\n\\begin{array}{*{20}c} {x = \\frac{{ - b \\pm \\sqrt {b^2 - 4ac} }}{{2a}}} & {{\\rm{when}}} & {ax^2 + bx + c = 0} \\\\ \\end{array}\nAnother useful magic is %precision which sets the floating point precision in the notebook. 
As a quick example, add the following to a cell and run it:\nfloat(100/3)\nNext set the precision to 4 decimal points by running a cell with:\n%precision 4\nNow run the cell with float(100/3) again to see the difference.\nRunning %precision without additional arguments will restore the default.\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nThe basics of Jupyter magics and the difference between line magics and cell magics\nHow to capture and use output from notebook cells with %%capture\nHow to use magics to run non-Python code in notebooks" + }, + { + "objectID": "pages/jupyter.html#plotting", + "href": "pages/jupyter.html#plotting", + "title": "Working with Jupyter", + "section": "6 Plotting", + "text": "6 Plotting\nAn essential feature of Jupyter Notebooks is of course the ability to visualize data and results via plots. A full guide to plotting in Python is beyond the scope of this course, but we’ll offer a few glimpses into the plotting landscape of Python.\nFirst of all, Python has a library for plotting called matplotlib, which comes packed with functionality for creating high-quality plots. Below is an example of how to generate a line plot of a sine wave.\n# Import packages\nimport numpy as np\nimport matplotlib.pyplot as plt\n# Generate a set of evenly spaced numbers between 0 and 100\nx = np.linspace(0,3*np.pi,100)\n# Use the sine function to generate y-values\ny = np.sin(x)\n# Plot the data\nline, = plt.plot(x, y, color='red', linestyle=\"-\")\nBy default plots are rendered in the notebook as rasterised images which can make the quality poor. To render in scalable vector graphics format use the set_matplotlib_formats from the matplotlib_inline package:\nimport matplotlib_inline\nmatplotlib_inline.backend_inline.set_matplotlib_formats('pdf', 'svg')\nNow try running the code for the sine wave plot again." + }, + { + "objectID": "pages/jupyter.html#other-packages-for-plotting", + "href": "pages/jupyter.html#other-packages-for-plotting", + "title": "Working with Jupyter", + "section": "7 Other packages for plotting", + "text": "7 Other packages for plotting\nAs we mentioned Matplotlib comes with a lot of functionality which is great because it allows you to create all sorts of plots and modify them exactly to your liking. However, this can also mean that creating very basic plots might involve a lot of cumbersome coding, when all you want is a simple bar chart!\nFortunately there are a number of Python packages that build upon matplotlib but with a much simplified interface. One such popular package is Seaborn. Below we’ll see how to generate a nice looking bar plot with error bars.\nFirst import the Seaborn package (using an abbreviated name to simplify typing):\nimport seaborn as sns\nNext we’ll load some example data of penguins collected at the Palmer Station, in Antarctica.\npenguins = sns.load_dataset(\"penguins\")\n# Look at first 5 lines of the data\npenguins.head(5)\nThe most basic way to generate a bar plot of this data with Seaborn is:\nsns.barplot(data=penguins)\nSimple right? Yes, but maybe not very informative. 
Here Seaborn simply calculates the mean of all numeric variables for the penguins and plots them with error bars representing a 95% confidence interval.\nLet’s say that instead we want to plot the mean value of the body mass of the penguins at the different islands where they were examined.\nsns.barplot(data=penguins, x=\"island\", y=\"body_mass_g\", errorbar=\"sd\");\nHere we specified to use values in the ‘island’ column as categories for the x-axis, and values in the ‘body_mass_g’ column as values for the y-axis. The barplot function of Seaborn will then calculate the mean body mass for each island and plot the bars. With errorbar=\"sd\" we tell the function to draw the standard deviation as error bars, instead of computing a confidence interval.\nIf we instead want to visualize the data as a scatterplot we can use the sns.scatterplot function. Let’s plot the body mass vs bill length for all penguins and colour the data points by species. We’ll also move the legend outside of the plotting area and modify the x and y-axis labels:\n# Store the matplotlib axes containing the plot in a variable called 'ax'\nax = sns.scatterplot(data=penguins, x=\"bill_length_mm\", y=\"body_mass_g\",\n hue=\"species\")\n# Modify the labels of the plot\nax.set_xlabel(\"Bill length (mm)\")\nax.set_ylabel(\"Body mass (g)\")\n# Set legend position outside of plot\nax.legend(bbox_to_anchor=(1,1));\nIf you want to save a plot to file you can use the plt.savefig function. Add the following to the bottom of the cell with the scatterplot code:\nplt.savefig(\"scatterplot.pdf\", bbox_inches=\"tight\")\nThe bbox_inches=\"tight\" setting ensures that the figure is not clipped when saved to file.\nThe Seaborn website contains great tutorials and examples of other ways to plot data!\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nHow to generate simple plots with matplotlib\nHow to import and use the Seaborn package for plotting\nHow to save plots from notebooks to a file" + }, + { + "objectID": "pages/jupyter.html#widgets", + "href": "pages/jupyter.html#widgets", + "title": "Working with Jupyter", + "section": "8 Widgets", + "text": "8 Widgets\nSince we’re typically running our notebooks in a web browser, they are quite well suited for also including more interactive elements. A typical use case could be that you want to communicate some results to a collaborator or to a wider audience, and that you would like them to be able to modify how the results are displayed. It could, for example, be to select which gene to plot for, or to see how some parameter value affects a clustering. Jupyter notebooks have great support for this in the form of widgets.\nWidgets are eventful Python objects that have a representation in the browser, often as a control like a slider, text box, etc. These are implemented in the ipywidgets package.\nThe easiest way to get started with using widgets is via the interact and interactive functions. These functions auto-generate widgets from functions that you define, and then call those functions when you manipulate the widgets. This might sound abstract so let’s look at an example.\nLet’s take the scatterplot of the penguins dataset that we generated in the previous section and add widgets that let us choose variables to plot as well as coloring of the points.\nFirst we’ll import the interactive function from ipywidgets. Let’s also import the widgets module which we’ll use later. 
Add the following code to a cell and run it:\nfrom ipywidgets import interactive, widgets\nNow, in a new cell, define a function called scatterplot with the code to generate the plot itself. Also add a palette argument to the function so that we can specify the colour palette to use for the plot. The function should look like this:\ndef scatterplot(x, y, hue, palette):\n ax = sns.scatterplot(data=penguins, x=x, y=y, hue=hue, palette=palette)\nRun the cell and create a new cell below it.\nNext, we’ll use the interactive function to generate a widget to control the x, y, hue and palette arguments. The interactive function takes a function as its first argument, and then keyword arguments for each of the arguments in the function. The returned value is a widget which we will store in a variable called interactive_scatterplot. Add the following to a cell and run it:\ninteractive_scatterplot = interactive(scatterplot,\n x=[\"bill_length_mm\",\"bill_depth_mm\",\"flipper_length_mm\",\"body_mass_g\"],\n y=[\"body_mass_g\",\"bill_length_mm\",\"bill_depth_mm\",\"flipper_length_mm\"],\n hue=[\"species\",\"island\",\"sex\"],\n palette=[\"Set1\",\"Set2\",\"Dark2\",\"Paired2\"])\nImportantly, all parameters defined in the scatterplot function must be given in the interactive call. The interactive_scatterplot widget is now tied to the scatterplot function. However, we still haven’t displayed the widget itself. To do that, simply add interactive_scatterplot to a new cell and run it:\ninteractive_scatterplot\nThis should show the scatterplot with drop-down menus for each of the arguments. Try changing the x and y variables to plot by selecting from the respective drop-downs. The hue drop-down now lets you change what variable to use for colouring the points and the palette drop-down changes the colour palette. As you can see, the available options in the drop-downs are the ones we specified in the interactive call.\nDepending on the type of the passed argument different types of widgets will be created by interactive. For instance:\n\nint or float arguments will generate a slider\nbool arguments (True/False) will generate checkbox widgets\nlist arguments will generate a drop-down\nstr arguments will generate a text-box\n\nLet’s add a slider to control the size of the points. In the Seaborn package this is controlled by the s argument to the scatterplot function. Modify the cell with your scatterplot function so it looks like this (remember to run the cell in order to update the function definition):\ndef scatterplot(x, y, hue, palette, size=50):\n ax = sns.scatterplot(data=penguins, x=x, y=y, hue=hue, palette=palette, s=size)\nNote that we added a size argument to the function and supplied it to the Seaborn scatterplot call with s=size. Setting size=50 in the function definition means that the default size of the points will be 50.\nNow we need to add a slider for the size argument. 
Update the cell where we call the interactive function so that it looks like this, then run it:\ninteractive_scatterplot = interactive(scatterplot,\n x=[\"bill_length_mm\",\"bill_depth_mm\",\"flipper_length_mm\",\"body_mass_g\"],\n y=[\"body_mass_g\",\"bill_length_mm\",\"bill_depth_mm\",\"flipper_length_mm\",],\n hue=[\"species\",\"island\",\"sex\"],\n palette=[\"Set1\",\"Set2\",\"Dark2\",\"Paired2\"],\n size=(20,100,10))\nHere the size argument is defined as a tuple which sets the minimum value of the slider to 20, the maximum value to 100 and the step size to 10.\nFinally, re-run the cell where we displayed the interactive_scatterplot widget. You should now see a slider for the size argument (starting at 50). Try changing the size of the points by moving the slider.\nThis is how it should look if everything works.\n\nThere are lots of widgets, e.g.:\n\nDrop-down menus\nToggle buttons\nRange sliders\nFile uploader\n\n… And much, much more. Here is a list of all available widgets together with documentation and examples. Some of these widgets cannot be auto-generated by interactive, but fear not! Instead of relying on auto-generation we can define the widget and supply it directly to interactive.\nTo see this in practice, we’ll modify the scatterplot function to display a title and add a color picker widget that let’s us set the color of the title text.\nFirst, update the scatterplot function so that it looks like this:\ndef scatterplot(x, y, hue, palette, size, color):\n ax = sns.scatterplot(data=penguins, x=x, y=y, hue=hue, palette=palette, s=size)\n ax.set_title(\"Penguin scatterplot\", color=color)\nThen run the cell to update the function definition.\nNext, we’ll define the colour picker widget. Add the definition to the cell where you defined the interactive_scatterplot then supply the widget to the interactive call. The cell should look like this:\ncolorpicker = widgets.ColorPicker(\n concise=False,\n description='Title color',\n value='blue',\n disabled=False\n)\ninteractive_scatterplot = interactive(scatterplot,\n x=[\"bill_length_mm\",\"bill_depth_mm\",\"flipper_length_mm\",\"body_mass_g\"],\n y=[\"body_mass_g\",\"bill_length_mm\",\"bill_depth_mm\",\"flipper_length_mm\"],\n hue=[\"species\",\"island\",\"sex\"],\n palette=[\"Set1\",\"Set2\",\"Dark2\",\"Paired2\"],\n size=(20, 100, 10),\n color=colorpicker)\nRun the cell to update the widgets.\nFinally, re-run the cell where we displayed the interactive_scatterplot. The plot should now have a title and you should see a new color picker below the slider for the point size. Try changing the title colour by clicking on the new color picker.\n\n\n\n\n\n\nCaution\n\n\n\nNote that you may have to close the colour picker once you’ve made your choice in order to make the plot update." + }, + { + "objectID": "pages/jupyter.html#other-interactive-plots", + "href": "pages/jupyter.html#other-interactive-plots", + "title": "Working with Jupyter", + "section": "9 Other interactive plots", + "text": "9 Other interactive plots\nJupyter widgets, like we used here, is the most vanilla way of getting interactive graphs in Jupyter notebooks. Some other alternatives are:\n\naltair is a plotting library that uses Vega-Lite grammar which is reminiscent of ggplot2 in R. The syntax is different from what we’ve shown here, but it’s very powerful once you get the hang of it.\nPlotly is actually an API to a web service that renders your graph and returns it for display in your Jupyter notebook. 
Generates very visually appealing graphs, but from a reproducibility perspective it’s maybe not a good idea to be so reliant on a third party.\nBokeh is another popular tool for interactive graphs. Most plotting packages for Python are built on top of matplotlib, but Bokeh has its own library. This can give a steeper learning curve if you’re used to the standard packages.\nmpld3 tries to integrate matplotlib with Javascript and the D3js package. It doesn’t scale well for very large datasets, but it’s easy to use and works quite seamlessly.\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nHow to implement interactive widgets in notebooks" + }, + { + "objectID": "pages/jupyter.html#extensions", + "href": "pages/jupyter.html#extensions", + "title": "Working with Jupyter", + "section": "10 Extensions", + "text": "10 Extensions\nJupyter Notebook extensions are add-ons that can increase the functionality of your notebooks. Extensions include themes, editors, git support, renderers and much more. The most user-friendly way of managing extensions is via the Extension Manager available in the Jupyter lab interface. You can access it by clicking the puzzle piece icon in the left sidebar.\n\n\n\n\n\n\nCaution\n\n\n\nNote that the extension manager contains a disclaimer warning you that these third-party extensions are not reviewed for vulnerabilities which means that you should be careful about what extensions you install.\n\n\nYou can use the search field to perform a free text search for available extensions, then click ‘Install’ to install an extension. Note that in some cases you will be prompted to install additional packages.\nWhile an in-depth listing of available extensions is well beyond the scope of this tutorial we offer this list of a few extensions that are of particular relevance to this course:\n\nJupyterlab/GitHub - view and open files from GitHub\nJupyterlab/Git - version controlling with git\nmamba-org/gator-lab - manage Conda environments\nvoila-dashboards/Jupyterlab-preview - preview a rendered version of your notebook\n\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nWhat Jupyter extensions are and how to manage them" + }, + { + "objectID": "pages/jupyter.html#reproducibility", + "href": "pages/jupyter.html#reproducibility", + "title": "Working with Jupyter", + "section": "11 Reproducibility", + "text": "11 Reproducibility\nNow that you have a feeling for what Jupyter can do we’ll spend a little time on things to consider specifically from a reproducibility point of view when it comes to Jupyter notebooks." + }, + { + "objectID": "pages/jupyter.html#version-control-of-jupyter-notebooks", + "href": "pages/jupyter.html#version-control-of-jupyter-notebooks", + "title": "Working with Jupyter", + "section": "12 Version control of Jupyter notebooks", + "text": "12 Version control of Jupyter notebooks\nAs we’ve seen, Jupyter notebooks are plain-text JSON files. This means that they can be version controlled with Git just like any other text file. However, because of the way Jupyter notebooks store their content, the diffs produced by Git can be difficult to interpret. Luckily, there are tools that can provide content-aware diffs and merge functionality for Jupyter notebooks.\nOne such tool is nbdime. nbdime is built to understand the structure of Jupyter notebooks and can therefore generate diffs that are easier to read. 
It can also be used to merge changes made to notebooks, which is great especially when collaborating on notebooks with others.\nnbdime is already installed in the jupyter-env Conda environment you are using for this tutorial. To try it in action, create a new notebook and name it Analysis.ipynb. Add the following code to the first cell, then run it:\nimport numpy as np\nimport seaborn as sns\npenguins = sns.load_dataset(\"penguins\")\nThis simply imports some python modules and loads a dataset.\nSave the notebook. Now we’ll add and commit the new notebook to the Git repository:\ngit add Analysis.ipynb\ngit commit -m \"Add Analysis notebook\"\nSo far so good. And nothing new here compared to what we’ve already learned about version control. Now let’s make some changes to the notebook. First we’ll replace one of the loaded modules. Update the first cell of the notebook so that it reads:\nimport pandas as pd\nimport seaborn as sns\npenguins = sns.load_dataset(\"penguins\")\nThen create a new cell where we’ll calculate the mean of each numeric value grouped by species. In the new cell, add the following code:\npenguins.groupby(\"species\").mean(numeric_only=True)\nRun the cell and save the notebook.\nNow use git diff to view the changes we’ve made to the notebook. Run:\ngit diff Analysis.ipynb\nEven with very minor modifications to the notebook the diff will contain numerous lines that are difficult to interpret. This is because the notebook not only contains the code, but also cell metadata and output (in this case a table produced by the second cell).\nNow let’s generate a more easy-to-read diff. Run:\nnbdiff -s Analysis.ipynb\nThis will use the nbdiff tool that comes with nbdime to show an inline diff of the notebook. The -s flag tells nbdiff to only show differences for the actual code changes, ignoring changes in metadata and output. There are a number of flags you can use here to customise the diff. The uppercase version of each flag will ignore the respective change type. For example, to see the diff but ignore changes to the output of cells you can run:\nnbdiff -O Analysis.ipynb\nnbdime also comes with a graphical web-based diff viewer. To try it, run:\nnbdiff-web Analysis.ipynb\nThis will open up a tab in your web browser showing you changes made to the notebook side-by-side for each cell, including also cell output. This makes it easy to see changes made both to code and outputs such as tables and plots.\n\n12.1 Other tools for version control of notebooks\n\nYou can also install the nbdime jupyter lab extension to get access to the diff functionality directly from the Jupyter lab interface. If you also install the jupyterlab-git extension you can both view diffs and commit changes directly from Jupyter lab.\nVS Code actually comes with built-in support for both Jupyter notebooks and Git so that you can view informative diffs directly from the editor" + }, + { + "objectID": "pages/jupyter.html#making-sure-notebooks-work-as-expected", + "href": "pages/jupyter.html#making-sure-notebooks-work-as-expected", + "title": "Working with Jupyter", + "section": "13 Making sure notebooks work as expected", + "text": "13 Making sure notebooks work as expected\nOne of the great things with Jupyter notebooks is the ability to do data exploration in an interactive way. Because loaded data, defined variables and functions remain in the notebook until you restart the kernel, you can easily make changes to your analysis and re-run cells to see the effect of the changes immediately. 
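For example, imagine a (made-up) pair of cells along these lines:\n# Cell 1: define a cutoff\ncutoff = 4000\n# Cell 2: filter the penguins dataframe loaded earlier using the cutoff\npenguins[penguins[\"body_mass_g\"] > cutoff].shape\nChanging the cutoff in the first cell and re-running only the second immediately shows the effect of the change. 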
However, this can also be a source of errors and inconsistencies if you, during your work, modify or use variables in cells upstream of their initial definition.\nThe nbval package can help you catch these types of errors. nbval is a plugin for the pytest testing framework that can be used to test Jupyter notebooks. It works by executing each cell in the notebook and comparing the output to the output stored in the notebook. If the output is the same, the test passes. If the output differs, the test fails. nbval is also pre-installed in the jupyter-env Conda environment you’re using for this tutorial.\nAs an example, we’ll keep working with the Analysis.ipynb notebook we’ve created.\nLet’s say we want to estimate the size of the bill of penguins using the bill_length_mm and bill_depth_mm columns. We’ll do this by adding a new cell to our notebook with the following code:\npenguins[\"bill_size\"] = (penguins[\"bill_length_mm\"] * penguins[\"bill_depth_mm\"])\nRun the cell and add a new one below it. In the new cell, output the mean of each column grouped by island using the following code:\npenguins.groupby(\"island\").mean(numeric_only=True)\nRun the cell to see the output. Looks good. Now we have a very simple example of some exploratory analyses on a dataset.\nSave the notebook and try running nbval on it to see if it works as expected. From the commandline, run:\npytest --nbval Analysis.ipynb\nnbval tests each cell in your notebook by executing it and comparing the output to the output stored in the notebook. If the output is the same, the test passes. The output of the test should look something like this:\ncollected 4 items\n\nAnalysis.ipynb .... [100%]\n\n========== 4 passed in 1.93s ==========\nNow let’s say we realize that we want to normalize the bill_size values by the body mass of the penguins. We’ll just modify the cell where we calculated this value, introducing a small piece of code to divide by the body_mass_g column.\nChange the third cell of the notebook so that it reads:\npenguins[\"bill_size\"] = (penguins[\"bill_length_mm\"] * penguins[\"bill_depth_mm\"]) / penguins[\"body_mass_g\"]\nsns.scatterplot(data=penguins, x=\"bill_size\", y=\"flipper_length_mm\", hue=\"island\")\nRe-run the cell and save the notebook. So far so good! Let’s test the notebook again with nbval. Just like before run it from the commandline with:\npytest --nbval Analysis.ipynb\nIf you’ve followed the instructions, this second run of nbval should generate a FAILED test, showing something like:\n==================== short test summary info ====================\nFAILED Analysis.ipynb::Cell 3\n================== 1 failed, 3 passed in 1.83s ==================\nWhat happened here was that we modified the cell where we calculated the bill_size value, but we didn’t re-run the cell where we output the mean of each column grouped by island. This means that the output of the last cell in the notebook now differs from what is actually stored in the notebook variables. This type of error can be difficult to spot, especially if you have a large notebook with many cells. Luckily, nbval can help us here.\n\n\n\n\n\n\nNote\n\n\n\nNote that nbval reports cell numbers using 0-based numbering, so when the test fails on Cell 3 it actually refers to the 4th cell in the notebook.\n\n\nThis problem could have been solved if we had re-run the cell where we output the mean of each column grouped by island. In fact, it is good practice to re-run all cells in a notebook before saving it. 
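A convenient way to do this is the Run > Run All Cells menu item in the Jupyter lab interface; from the command line, something along the lines of jupyter nbconvert --to notebook --execute --inplace Analysis.ipynb (using the nbconvert tool described later in this tutorial) achieves a similar result. 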
If you in addition restart the kernel before re-running you make sure that you haven’t introduced any ‘hidden states’\n\n\n\n\n\n\nIgnoring specific cells\n\n\n\nOne caveat of nbval is that it doesn’t work well with cells that generate plots. You can tell nbval to ignore the output of specific cells by adding # NBVAL_IGNORE_OUTPUT to the top of a cell.\n\n\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned: - How to use nbdime to view diffs of Jupyter notebooks - How to use nbval to test that notebooks work as expected" + }, + { + "objectID": "pages/jupyter.html#converting-notebooks", + "href": "pages/jupyter.html#converting-notebooks", + "title": "Working with Jupyter", + "section": "14 Converting notebooks", + "text": "14 Converting notebooks\nNotebooks can be converted to various output formats such as HTML, PDF, LaTeX etc. directly from the File -> Save and Export Notebook As… menu.\nConversion can also be performed on the command line using the jupyter nbconvert command. nbconvert is installed together with the jupyter Conda package and is executed on the command line by running jupyter nbconvert.\nThe syntax for converting a Jupyter notebook is:\njupyter nbconvert --to <FORMAT> notebook.ipynb\nHere <FORMAT> can be any of asciidoc, custom, html, latex, markdown, notebook, pdf, python, rst, script, slides. Converting to some output formats (e.g. PDF) may require you to install separate software such as Pandoc or a TeX environment.\nTry converting the jupyter-tutorial.ipynb notebook that you have been working on for this tutorial to HTML using jupyter nbconvert.\n\n\n\n\n\n\nTip\n\n\n\nIf the plots in HTML rendered version of your notebook are not displayed properly, try changing the matplotlib_inline.backend_inline.set_matplotlib_formats('pdf', 'svg') line to matplotlib_inline.backend_inline.set_matplotlib_formats('retina').\n\n\nnbconvert can also be used to run a Jupyter notebook from the command line by running:\njupyter nbconvert --execute --to <FORMAT> notebook.ipynb\nnbconvert executes the cells in a notebook, captures the output and saves the results in a new file. Try running it on the jupyter-tutorial.ipynb notebook.\nYou can also specify a different output file with --output <filename>.\nSo in order to execute your jupyter-tutorial.ipynb notebook and save it to a file named report.html you could run:\njupyter nbconvert --to html --output report.html --execute jupyter-tutorial.ipynb\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nHow to convert Jupyter notebooks to various other formats\nHow to use nbconvert to convert notebooks on the command line" + }, + { + "objectID": "pages/jupyter.html#notebooks-and-quarto", + "href": "pages/jupyter.html#notebooks-and-quarto", + "title": "Working with Jupyter", + "section": "15 Notebooks and Quarto", + "text": "15 Notebooks and Quarto\nYou may have noticed that a lot of the functionality in Jupyter is overlapping with Quarto. And you may be wondering which one to use. This is a difficult question to answer as it will depend on your use-case and personal preference. As such, any answer will be subjective, but we’ll try to give you some pointers on how to get the best out of both worlds.\nWhile similar in some ways Jupyter and Quarto are not completely overlapping. Quarto is great for generating high-quality reports and manuscripts, and is agnostic to the programming language used. 
Jupyter on the other hand is great for interactive data analysis and exploration with a more direct connection between code and output. While Jupyter is also somewhat agnostic to programming language, it is most commonly used with Python and with both the Jupyter and Python ecosystem at its back it can be customized with a lot of different extensions and plugins.\nThe good news is that the two can be used together allowing you to get the best of both. For example, you may like the professional look of rendered Quarto documents but really like the interactive and exploratory nature of Jupyter. Well you can simply work as you normally do in Jupyter and then use Quarto to render the notebook to a high-quality report or manuscript.\nTo give you an example, take a look at the supplementary_material.ipynb file in the jupyter/ tutorial directory. Open this notebook in the Jupyter lab interface (make sure you have activated the jupyter-env Conda environment).\nAs you can see this notebook contains some brief descriptions in Markdown and code to generate a few plots. It uses the output from the MRSA case-study Snakemake workflow you worked on in the Snakemake tutorial. This is a common use-case for Jupyter notebooks: to generate summary statistics and plots from the results of a workflow run. (A real-world example could of course include a lot more in-depth exploratory analyses).\nNow, let’s say you want to share the results of this notebook with your PI or collaborators. We could simply share the notebook file, or as we saw in the previous section, convert it to HTML or PDF via jupyter nbconvert.\nLet’s do that first so we have something to compare with. Run the following:\njupyter nbconvert --to HTML --output supplementary_material.nbconvert.html supplementary_material.ipynb\nOpen the supplementary_material.nbconvert.html file in a browser to see that it looks like you expect. This looks more or less like the original notebook.\nNow let’s go one step further and render the notebook to a high-quality report using Quarto. We can actually add a YAML header to the notebook with some document options that Quarto understands. Create a new cell in the notebook (from the Jupyter lab interface) and move it to the top. In this cell, add the following:\n---\ntitle: Supplementary material\nsubtitle: Supplementary tables and plots for the MRSA study\nformat:\n html:\n embed-resources: true\n code-fold: true\n code-tools: true\nlanguage:\n code-summary: Click to show code\nbibliography: references.bib\n---\nSet the cell type to Markdown, then run the cell. Most likely that cell will look rather weird but that’s OK. We’ll fix that in a bit.\nSave the notebook and now render the document with Quarto from the commandline:\nquarto render supplementary_material.ipynb\nOpen up the supplementary_material.html file in a browser and compare it to the supplementary_material.nbconvert.html file. You should see that the Quarto version looks a lot better. The fact that Quarto supports rendering of Jupyter notebooks means you can keep editing your notebooks as you normally would and use Quarto for rendering the final document. Also there’s very little we had to change in the notebook to make it work with Quarto. If you look closely at the code cells used to generate the plots and table you’ll see that they contain code-chunk options in the same form we used in the Quarto tutorial. 
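In a Jupyter notebook such options appear as specially formatted comments at the top of a code cell, for example something like:\n#| label: fig-example\n#| fig-cap: \"An example figure caption\"\n(the exact options used in supplementary_material.ipynb may differ, but the #| syntax is the same). 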
These options do not impact the notebook when run in Jupyter, making it easy to use the two tools in combination.\nLet’s go back to the YAML header cell and fix how it looks in the Jupyter notebook. The reason it looks weird is that Jupyter doesn’t understand the syntax. But luckily there’s a Jupyter lab Quarto extension you can install to fix this. Click the extension icon in the left sidebar and search for quarto. Install the jupyterlab-quarto extension and then reload the page. Now the YAML header should look a lot better.\nTry adding more options to the header to customize the look of the rendered document. For instance you could:\n\nadd a Table of contents with (toc: true)\ntry out different themes\nadd your name as author (author: Your Name)\nadd a date (date: last-modified)\n\nand much more.\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nHow to use Quarto to render Jupyter notebooks to high-quality reports." + }, + { + "objectID": "pages/jupyter.html#extra-material", + "href": "pages/jupyter.html#extra-material", + "title": "Working with Jupyter", + "section": "16 Extra material", + "text": "16 Extra material\nThe following material contains some additional tips and tricks on how to use Jupyter notebooks. This is not part of the core of the Jupyter material and you can choose what you want to go through, or skip it entirely.\nHere are some useful resources if you want to read more about Jupyter in general:\n\nThe Jupyter project site contains a lot of information and inspiration.\nThe Jupyter Notebook documentation.\nA guide to using widgets for creating interactive notebooks." + }, + { + "objectID": "pages/jupyter.html#running-jupyter-notebooks-on-a-cluster", + "href": "pages/jupyter.html#running-jupyter-notebooks-on-a-cluster", + "title": "Working with Jupyter", + "section": "17 Running Jupyter notebooks on a cluster", + "text": "17 Running Jupyter notebooks on a cluster\n\nLogin to Uppmax, making sure to use a specific login node, e.g. rackham1:\n\nssh <your-user-name>@rackham1.uppmax.uu.se\n\nCreate/activate a Conda environment containing jupyter, e.g.:\n\nconda create -n jupyter -c conda-forge jupyter\n\nactivate the environment, then run:\n\njupyter notebook --no-browser\nWhen the Jupyter server starts up you should see something resembling:\n[I 2023-11-13 22:15:36.944 ServerApp] Serving notebooks from local directory: <path-to-your-directory>\n[I 2023-11-13 22:15:36.944 ServerApp] Jupyter Server 2.10.0 is running at:\n[I 2023-11-13 22:15:36.944 ServerApp] http://localhost:8888/tree?token=25fa07e89b7c0bc2e518f259ba79c67847ca813cdf4eeed6\n[I 2023-11-13 22:15:36.944 ServerApp] http://127.0.0.1:8888/tree?token=25fa07e89b7c0bc2e518f259ba79c67847ca813cdf4eeed6\n[I 2023-11-13 22:15:36.944 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).\nNow a Jupyter notebook server is running on the Uppmax end. The line that says:\n[I 2023-11-13 22:15:36.944 ServerApp] http://localhost:8888/tree?token=25fa07e89b7c0bc2e518f259ba79c67847ca813cdf4eeed6\nContains information on the port used on the server side (8888 in this case) and the token required to use the server (25fa07e89b7c0bc2e518f259ba79c67847ca813cdf4eeed6).\nNext step is to use this information to login to the server from your local computer.\nOn your local computer\nIn a terminal, run the following command to start port forwarding of port 8080 on your local computer to the remote port on the Uppmax side. 
Replace <remote-port> with the port given when you started the server on Uppmax. Also replace <your-user-name> with your user name on Uppmax.\nssh -N -L localhost:8080:localhost:<remote-port> <your-user-name>@rackham1.uppmax.uu.se\nAs long as this process is running the port forwarding is running. To disable it simply interrupt it with CTRL + C.\nConnect to the Jupyter server by opening localhost:8080 in your browser. When prompted, paste the token you got when starting the server on Uppmax and set a new password." + }, + { + "objectID": "pages/jupyter.html#using-binder-to-share-interactive-notebooks", + "href": "pages/jupyter.html#using-binder-to-share-interactive-notebooks", + "title": "Working with Jupyter", + "section": "18 Using Binder to share interactive notebooks", + "text": "18 Using Binder to share interactive notebooks\nBinder is a service that allows you to share Jupyter notebooks with others, while also allowing them to run the notebooks in the browser. This is great if you wish to share an analysis and have others interact with the code and results, without them having to install anything locally. What you will need is:\n\nA public GitHub repository containing the notebooks you want to share.\nAn environment.yml file in the repository containing the Conda environment required to run the notebooks.\nData files (if any) required to run the notebook(s).\n\nBinder will then create a Docker image containing the Conda environment and the notebooks, and run a Jupyter server on this image. The Docker image is then hosted on the Binder server and can be used by anyone with the link to the repository to run the notebooks interactively in their browser.\nTo show you an example we’ve created a basic GitHub repository containing the supplementary_material.ipynb notebook from the previous section. If you go to the repository you will see a badge saying “launch binder”, click this to start the Binder server. This will take a few minutes the first time you do it, but after that it should be faster. When the server is ready you will be presented with the now familiar Jupyter interface. Go ahead and open up the supplementary_material.ipynb notebook and run it.\nYou can now interact with the notebook as you would if you had it running on a local Jupyter server. You can change the code, run it, and see the results. You can also add new cells and write new code. However, you cannot save the changes you make to the notebook.\nTo read more about Binder and how to use it, see the Binder documentation. For pointers on how to make data available to the notebooks you share via Binder, see this guide on Accessing data in your Binder." + }, + { + "objectID": "pages/nextflow.html", + "href": "pages/nextflow.html", + "title": "Working with Nextflow", + "section": "", + "text": "Nextflow is a workflow management system (WfMS), and is one of the most common such systems within the bioinformatic and academic communities. These systems are important for scientific reproducibility in that they greatly facilitate keeping track of which files have been processed in what way throughout an entire project.\nNextflow is built from the ground-up to be portable, scalable, reproducible and usable in a platform-agnostic sense. This means that any workflow you write in Nextflow can be run locally on your laptop, a computer cluster or a cloud service (as long as your architecture has the necessary computational resources). You can also define the compute environment in which each task is carried out on a per-task basis. 
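As a minimal sketch of what this can look like (not part of the tutorial files, and using a made-up process name and a placeholder container image), a nextflow.config file could select resources and software per process:\nprocess {\n    withName: 'SOME_PROCESS' {\n        cpus = 2\n        container = '<some-container-image>' // placeholder, not a real image\n    }\n}\nThe details of such configuration are beyond this introduction, but it illustrates the per-task flexibility. 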
You might thus develop your workflow on your local computer using a minimal test dataset, but run the full analyses with all samples on e.g. a computer cluster. Nextflow can work on both files and arbitrary values, often-times connected in useful and advanced ways.\nNextflow can easily work with dynamic inputs where the exact output is unknown, e.g. the exact number of files or which samples pass some arbitrary quality control threshold. While Nextflow is based on the Groovy language, you don’t need to know how to code Groovy to be able to write good Nextflow workflows. Nextflow has a large community centred around it, including the nf-core curated collection of high quality pipelines used by e.g. the National Genomics Infrastructure.\nThis tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to set it up if you haven’t done so already, then open up a terminal and go to workshop-reproducible-research/tutorials/nextflow and activate your nextflow-env Conda environment." + }, + { + "objectID": "pages/nextflow.html#introduction", + "href": "pages/nextflow.html#introduction", + "title": "Working with Nextflow", + "section": "", + "text": "Nextflow is a workflow management system (WfMS), and is one of the most common such systems within the bioinformatic and academic communities. These systems are important for scientific reproducibility in that they greatly facilitate keeping track of which files have been processed in what way throughout an entire project.\nNextflow is built from the ground-up to be portable, scalable, reproducible and usable in a platform-agnostic sense. This means that any workflow you write in Nextflow can be run locally on your laptop, a computer cluster or a cloud service (as long as your architecture has the necessary computational resources). You can also define the compute environment in which each task is carried out on a per-task basis. You might thus develop your workflow on your local computer using a minimal test dataset, but run the full analyses with all samples on e.g. a computer cluster. Nextflow can work on both files and arbitrary values, often-times connected in useful and advanced ways.\nNextflow can easily work with dynamic inputs where the exact output is unknown, e.g. the exact number of files or which samples pass some arbitrary quality control threshold. While Nextflow is based on the Groovy language, you don’t need to know how to code Groovy to be able to write good Nextflow workflows. Nextflow has a large community centred around it, including the nf-core curated collection of high quality pipelines used by e.g. the National Genomics Infrastructure.\nThis tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to set it up if you haven’t done so already, then open up a terminal and go to workshop-reproducible-research/tutorials/nextflow and activate your nextflow-env Conda environment." + }, + { + "objectID": "pages/nextflow.html#the-basics", + "href": "pages/nextflow.html#the-basics", + "title": "Working with Nextflow", + "section": "2 The basics", + "text": "2 The basics\nWe’ll start by creating a very simple workflow from scratch, to show how Nextflow works: it will take two input files and convert them to UPPERCASE letters.\n\nStart by running the following commands:\n\ntouch main.nf\necho \"This is a.txt\" > a.txt\necho \"This is b.txt\" > b.txt\nOpen the main.nf file with an editor of your choice. 
This is the main workflow file used in Nextflow, where workflows and their processes are defined.\n\nCopy the following code into your main.nf file:\n\n// Workflow definition\nworkflow {\n // Define input files\n ch_input = Channel.fromPath( \"a.txt\" )\n\n // Run workflow\n CONVERT_TO_UPPER_CASE( ch_input )\n}\n\n// Process definition\nprocess CONVERT_TO_UPPER_CASE {\n publishDir \"results/\",\n mode: \"copy\"\n\n input:\n path(file)\n\n output:\n path(\"a.upper.txt\")\n\n script:\n \"\"\"\n tr [a-z] [A-Z] < ${file} > a.upper.txt\n \"\"\"\n}\nHere we have two separate parts. The first is the workflow definition, while the last is a process. Let’s go through them both in more detail!\n\n\n\n\n\n\nNextflow comments\n\n\n\nDouble-slashes (//) are used for comments in Nextflow.\n\n\n\n\n\n\n\n\nNextflow and whitespace\n\n\n\nNextflow is not indentation-sensitive. In fact, Nextflow doesn’t care at all about whitespace, so go ahead and use it in whatever manner you think is easiest to read and work with! Do keep in mind that indentations and other types of whitespace does improve readability, so it’s generally not a good idea to forego it entirely, even though you can.\n\n\n\n2.1 Workflow definitions\nworkflow {\n // Define input files\n ch_input = Channel.fromPath( \"a.txt\" )\n\n // Run workflow\n CONVERT_TO_UPPER_CASE( ch_input )\n}\nThe workflow definition here has two parts, each doing an important job for any Nextflow workflow. The first part defines a channel, which is an asynchronous first-in-first-out stream of data that connect a workflow’s various inputs and outputs. In simpler terms, channels contain the data that you want to process with the workflow and can be passed between the various parts of the workflow.\nChannels can be created in various different ways using channel factories, depending on what type data you want to put into them and where this data is stored. In this particular case we define our ch_input channel using the .fromPath channel factory, which takes a file path as input - here we use the a.txt file. You can thus read ch_input = Channel.fromPath(\"a.txt\") as “create the channel ch_input and send the file a.txt into it”.\n\n\n\n\n\n\nNaming channels\n\n\n\nA channel can be named anything you like, but it is good practice to prepend them with ch_, as that makes it clear which variables are channels and which are just normal variables.\n\n\nHow do we use these channels then? Channels pass data to and from processes through our workflow. By providing channels as arguments to processes, we describe how we want data to flow. This is exactly what we do in the second part: we call our CONVERT_TO_UPPER_CASE process with the ch_input as input argument - this is very similar to functional programming.\nThis is our entire workflow, for now: the creation of a channel followed by using the contents of that channel as input to a single process. Let’s look at how processes themselves are defined!\n\n\n2.2 Process definitions\nprocess CONVERT_TO_UPPER_CASE {\n publishDir \"results/\",\n mode: \"copy\"\n\n input:\n path(file)\n\n output:\n path(\"a.upper.txt\")\n\n script:\n \"\"\"\n tr [a-z] [A-Z] < ${file} > a.upper.txt\n \"\"\"\n}\nLooking at the process in the code above, we can see several parts. 
The process block starts with its name, in this case CONVERT_TO_UPPER_CASE, followed by several sections, or directives as Nextflow calls them: publishDir, input, output and script.\n\n\n\n\n\n\nNaming processes\n\n\n\nA process can be named using any case, but a commonly used convention is to use UPPERCASE letters for processes to visually distinguish them in the workflow. You do not have to follow this if you don’t want to, but we do so here.\n\n\nLet’s start with the first directive: publishDir. This tells Nextflow where the output of the process should be placed when it is finished. Setting mode to \"copy\" just means that we want to copy the output files to the publishing directory, rather than using a symbolic link (which is the default).\nThe input and output directives describe the data expected to come through this specific process. Each line of input describes the data expected for each process argument, in the order used in the workflow. In this case, CONVERT_TO_UPPER_CASE expects a single channel (one line of input), and expects the data to be filenames ( i.e. of type path). The script directive is where you put the code that the process should execute.\nNotice that there is a difference between how the inputs and outputs are declared? The output is an explicit string (i.e. surrounded by quotes), while the input is a variable named file. This means inputs can be referenced in the process without naming the data explicitly, unlike the output where the name needs to be explicit. We’ll get back to exactly how this works in just a moment. While the name of the input variable here is chosen to be the descriptive file, we could also have chosen something completely different, e.g. banana (we’d also have to change its reference in the script directive).\n\n\n2.3 Executing workflows\nLet’s try running the workflow we just created!\n\nType the following in your terminal:\n\nnextflow run main.nf\nThis will make Nextflow run the workflow specified in your main.nf file. You should see something along these lines:\nN E X T F L O W ~ version 22.10.6\nLaunching `./main.nf` [mad_legentil] - revision: 87f0c253ed\nexecutor > local (1)\n[32/9124a1] process > CONVERT_TO_UPPER_CASE (1) [100%] 1 of 1 ✔\nThe first few lines are information about this particular run, including the Nextflow version used, which workflow definition file was used, a randomly generated run name (an adjective and a scientist), the revision ID as well as where the processes were executed (locally, in this case, as opposed to e.g. SLURM or AWS).\nWhat follows next is a list of all the various processes for this particular workflow. The order does not necessarily reflect the order of execution (depending on each process’ input and output dependencies), but they are in the order they were defined in the workflow file - there’s only the one process here, of course. The first part (e.g. [32/9124a1]) is the process ID, which is also the first part of the subdirectory in which the process is run (the full subdirectory will be something like 32/9124a1dj56n2346236245i2343, so just a longer hash). We then get the process and its name. Lastly, we get how many instances of each process are currently running or have finished. 
Here we only have the one process, of course, but this will soon change.\n\nLet’s check that everything worked: type ls results/ and see that it contains the output we expected.\nLet’s explore the working directory: change into whatever directory is specified by the process ID (your equivalent to work/32/9124a1[...]).\n\nWhat do you see when you list the contents of this directory? You should see a symbolic link named a.txt pointing to the real location of this file, plus a normal file a.upper.txt, which is the output of the process that was run in this directory. You generally only move into these work directories when debugging errors in your workflow, and Nextflow has some tricks to make this process a lot easier - more on this later.\nSo, in summary: we have three components: a set of inputs stored in a channel, a set of processes and a workflow that defines which processes should be run in what order. We tell Nextflow to push the inputs through the entire workflow, so to speak.\n\nNow it’s your turn! Move back to the workflow root and make it use only the b.txt input file and give you the b.upper.txt instead.\nRun your workflow and make sure it works before you move on; check below if you’re having trouble.\n\n\n\n\n\n\n\nClick to show\n\n\n\n\n\nch_input = Channel.fromPath( \"b.txt\" )\n\n\n\n\n\n2.4 Viewing channel contents\nSomething that’s highly useful during development of Nextflow workflows is to view the contents of channels, which can be done with the view() operator.\n\nAdd the following to your workflow definition (on a new line) and execute the workflow: ch_input.view(). What do you see?\nRemove the view() operator once you’re done.\n\nIt can be quite helpful to view the channel contents whenever you’re unsure of what a channel contains or if you’ve run into some kind of bug or error, or even just when you’re adding something new to your workflow. Remember to view the channel contents whenever you need to during the rest of this tutorial!\n\n\n2.5 Files and sample names\nOne powerful feature of Nextflow is that it can handle complex data structures as input, and not only filenames. One of the more useful things this allows us to do is to couple sample names with their respective data files inside channels.\n\nChange the channel definition to the following:\n\nch_input = Channel\n .fromPath ( \"a.txt\" )\n .map { file -> tuple(file.getBaseName(), file) }\nHere we create a tuple (something containing multiple parts) using the map operator, the base name of the file (a) and the file path (a.txt). The statement .map{ file -> tuple(file.getBaseName(), file) } can thus be read as “replace the channel’s contents with a tuple containing the base name and the file path”. The contents of the channel thus change from [a.txt] to [a, a.txt]. 
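You can confirm this with the view() operator from before: adding ch_input.view() should now print something along the lines of [a, /full/path/to/a.txt] instead of just the file path (the exact path will differ on your system). 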
Passing the sample name or ID together with the sample data in this way is extremely useful in a workflow context and can greatly simplify downstream processes.\nBefore this will work, however, we have to change the process itself to make use of this new information contained in the ch_input channel.\n\nChange the process definition to the following:\n\nprocess CONVERT_TO_UPPER_CASE {\n publishDir \"results/\",\n mode: \"copy\"\n\n input:\n tuple val(sample), path(file)\n\n output:\n path(\"${sample}.upper.txt\")\n\n script:\n \"\"\"\n tr [a-z] [A-Z] < ${file} > ${sample}.upper.txt\n \"\"\"\n}\nNotice how the input now is aware that we’re passing a tuple as input, which allows us to use both the file variable (as before) and the new sample variable. All that’s left now is to change the input to our pipeline!\n\nChange the channel definition line from .fromPath ( \"a.txt\" ) to .fromPath ( [\"a.txt\", \"b.txt\"] ) and try running the pipeline. Make sure it works before you move on! Remember to use the view() operator if you want to inspect the channel contents in detail.\n\n\n\n2.6 Input from samplesheets\nSo far we’ve been specifying inputs using strings inside the workflow itself, but hard-coding inputs like this is not ideal. A better solution is to use samplesheets instead, e.g. comma- or tab-separated data files; this is standard for many pipelines, including nf-core. Take, for example, the following CSV file:\na,a.txt\nb,b.txt\nThis specifies the samples and their respective files on each row. Using such a file is much more portable, scalable and overall easier to use than simply hard-coding things in the workflow definition itself. We might also include an arbitrary number of additional metadata columns, useful for downstream processing and analyses. Using contents of files as input can be done using the .splitCsv() and .map{} operators, like so:\nch_input = Channel\n .fromPath ( \"first_samplesheet.csv\" )\n .splitCsv ( )\n .map { row -> tuple(row[0], file(row[1])) }\nThe .splitCsv() operator lets the channel know the input is a CSV file, while the .map{} operator makes the CSV content into a tuple from the first and second elements of each row.\n\nChange the input channel definition to the code above and create the first_samplesheet.csv file as shown above.\nAdd the .view() operator somewhere to show the contents of ch_input.\nExecute the pipeline. Do you see what you expect? Remove the .view() operator before moving on.\n\n\n\n\n\n\n\nNote\n\n\n\nWhile we are still hard-coding the name of the samplesheet it is still much better to edit a samplesheet than having to edit the pipeline itself - there are also convenient ways to work around this using parameters, which we’ll talk more about later in this tutorial.\n\n\nWe can also specify a header in our samplesheet like so: .splitCsv(header: true). This will allow us to reference the columns using their names instead of their index, e.g. row.col1 instead of row[0].\n\nAdd an appropriate header to your samplesheet, make sure your workflow can read it and execute. Use .view() to see what’s going on, if needed.\n\n\n\n2.7 Adding more processes\nIt’s time to add more processes to our workflow! We have the two files a.upper.txt and b.upper.txt; the next part of the workflow is a step that concatenates the content of all these UPPERCASE files.\nWe already have a channel containing the two files we need: the output of the CONVERT_TO_UPPER_CASE process called CONVERT_TO_UPPER_CASE.out. 
We can use this output as input to a new process using the syntax: CONVERT_TO_UPPER_CASE.out.collect(). The collect() operator groups all the outputs in the channel into a single data object for the next process. This is a many-to-one type of operation: a stream with several files (many) is merged into a lone list of files (one). If collect() was not used, the next process would try to run a task for each file in the output channel.\nLet’s put this in use by adding a new process to the workflow definition. We’ll call this process CONCATENATE_FILES and it will take the output from CONVERT_TO_UPPER_CASE as input, grouped using the collect() operator.\n\nAdd a line to your workflow definition for this new process with the appropriate input - remember that you can use .view() to check channel contents; click below if you’re having trouble.\n\n\n\n\n\n\n\nClick to show\n\n\n\n\n\nCONCATENATE_FILES( CONVERT_TO_UPPER_CASE.out.collect() )\n\n\n\nNow all we have to do is define the actual CONCATENATE_FILES process in the process definition section.\n\nCopy the following code as a new process into your workflow:\n\nprocess CONCATENATE_FILES {\n publishDir \"results/\",\n mode: \"copy\"\n\n input:\n path(files)\n\n output:\n path(\"*.txt\")\n\n script:\n \"\"\"\n cat ${files} > concat.txt\n \"\"\"\n}\n\nRun your workflow again and check the results/ directory. At this point you should have three files there: a.upper.txt, b.upper.txt and concat.txt.\nInspect the contents of concat.txt - do you see everything as you expected?\n\nNote the use of path(files) as input. Although we pass a list of files as input, the list is considered a single object, and so the files variable references a list. Each file in that list can be individually accessed using an index e.g. ${files[0]}, or as we do here, use the variable without an index to list all the input files.\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learnt:\n\nHow to create, execute and extend workflows\nHow to explore the work directory and channel contents\nHow to couple sample names to sample data files\nHow to use samplesheets as input\nHow to collect multiple files as single inputs for processes" + }, + { + "objectID": "pages/nextflow.html#executing-workflows-1", + "href": "pages/nextflow.html#executing-workflows-1", + "title": "Working with Nextflow", + "section": "3 Executing workflows", + "text": "3 Executing workflows\nIt’s time to start working with a more realistic workflow using the MRSA case study of this course! We’ve created a bare-bones version of this pipeline for you, but we’ll work our way through it as we go along and learn more about Nextflow’s features and functionality. The MRSA workflow looks like this:\nworkflow {\n\n // Workflow for generating count data for the MRSA case study\n\n // Get input files from a samplesheet\n ch_input = Channel\n .fromPath ( \"samplesheet.csv\" )\n .splitCsv ( header: true)\n\n // Define the workflow\n DOWNLOAD_FASTQ_FILES (\n ch_input\n )\n RUN_FASTQC (\n DOWNLOAD_FASTQ_FILES.out\n )\n RUN_MULTIQC (\n RUN_FASTQC.out[1].collect()\n )\n GET_GENOME_FASTA ()\n INDEX_GENOME (\n GET_GENOME_FASTA.out.fasta\n )\n ALIGN_TO_GENOME (\n DOWNLOAD_FASTQ_FILES.out,\n INDEX_GENOME.out.index\n )\n SORT_BAM (\n ALIGN_TO_GENOME.out.bam\n )\n GET_GENOME_GFF3 ()\n GENERATE_COUNTS_TABLE (\n SORT_BAM.out.bam.collect(),\n GET_GENOME_GFF3.out.gff\n )\n}\nThe workflow has one input channel named ch_input, which reads input from the samplesheet.csv file. 
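Because the samplesheet is read with header: true, each element emitted by ch_input is a map of column names to values - something along the lines of [sra_id:SRR935092, figshare_link:<some-link>], assuming those are the column names used in samplesheet.csv (check the file in the course materials for the exact contents). 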
We then define the processes to be executed by this workflow, nine in total. The first process (DOWNLOAD_FASTQ_FILES) takes the ch_input channel as input, while the rest of the processes take the output of previous processes as input. Before we go into more detail regarding the ins-and-outs of this workflow, let’s start with some specifics of how workflows are executed and what you can get from them.\n\n3.1 Reports and visualisations\nLet’s start with running the workflow plus getting some reports and visualisations while we’re at it!\n\nRun the workflow using the following command: nextflow run main_mrsa.nf -with-report report.html -with-timeline timeline.html -with-dag dag.png.\n\nAfter successful execution, you will find three more files in your current directory: report.html, timeline.html and dag.png. The first file contains a workflow report, which includes various information regarding execution such as runtime, resource usage and details about the different processes. The second file contains a timeline for how long each individual process took to execute, while the last contains a visualisation of the workflow itself.\nTake a few minutes to browse these files for yourself. When running a workflow you can of course choose which of these additional files you want to include by picking which ones are important or interesting to you - or don’t include any!\n\n3.2 Logs\nNextflow keeps a log of all the workflows that have been executed. Let’s check it out!\n\nType nextflow log to get a list of all the executions.\n\nHere we get information about when the workflow was executed, how long it ran, its run name, whether it succeeded or not and what command was used to run it. You can also use nextflow log <run name> to show the directory of each task that was executed for that run. You can also supply the -f (or -fields) flag along with additional fields to show.\n\nRun nextflow log <run name> -f hash,name,exit,status\n\nThis shows us not only the beginning of each task’s working directory, but also its name, exit code and status (i.e. if it completed successfully or failed in some manner).\n\n\n\n\n\n\nListing fields\n\n\n\nIf you want to see a complete list of all the fields you might explore using the log, just type nextflow log -l or nextflow log -list-fields. This is highly useful for debugging when there’s some specific information about a run you’re particularly interested in!\n\n\nWe can also get even more detailed information about the latest run by looking into the .nextflow.log file!\n\nLook into the latest log by typing less .nextflow.log.\n\nYou’ll be greeted by a wealth of debugging information, which may even seem a bit overkill at this point! This level of detail is, however, quite useful both as a history of what you’ve attempted and as an additional help when you run into errors! Also, it helps with advanced debugging - which we’ll get into later.\n\n3.3 Re-running workflows\nSomething you often want to do in Nextflow (or any WfMS for that matter) is to re-run the workflow when you changed some input files or some of the code for its analyses, but you don’t want to re-run the entire workflow from start to finish. Let’s find out how this works in Nextflow!\n\nRun the same nextflow run main_mrsa.nf command again.\n\nYou’ll notice that Nextflow actually re-ran the entire workflow from scratch, even though we didn’t change anything. 
This is the default behaviour of Nextflow.\n\nLet’s try that again: nextflow run main_mrsa.nf -resume instead.\n\nNow you can see that Nextflow didn’t actually re-run anything. The -resume flag instructed Nextflow to use the cached results from the previous run!\nNextflow automatically keeps track of not only changes to input files, but also changes to code, process definitions and scripts. You can thus change anything relating to your workflow and just re-run with the -resume flag and be sure that only processes relevant to your changes are executed again!\n\nUse tree work/ to list the contents of the work directory.\n\nBecause Nextflow keeps track of all the runs, we’ve now got two sets of files in the work directory. One set from the first run, and another from the second run. This can take up valuable space, so let’s clean that up.\n\nUse nextflow clean -n -before <run_name> to show which work directories will be cleaned up (use nextflow log to find the run name if you don’t remember it). Then delete those directories by changing -n (dry-run) to -f (force).\n\nNextflow’s clean subcommand can be used to clean up failed tasks and unused processes. Here we used the -before flag, meaning that any runs before the specified run are removed; use nextflow help clean to see other options for cleaning. This is the preferred way to clean up the working directory.\n\nRemove the results directory and re-run the workflow again using the -resume flag.\n\nWe removed all the results we used before, but we still managed to resume the workflow and use its cache - how come? Remember that Nextflow uses the work directory to run all of its tasks, while the results directory is just where we have chosen to publish our outputs. We can thus delete the results directory as often as we like (a necessity when output filenames are changed) and still get everything back without having to re-run anything. If we were to delete the work directory, however…\n\nDelete the work directory and re-run the workflow using the -resume flag.\n\nThere is no longer any cache for Nextflow to use, so it re-runs from the start! This is good to keep in mind: you can always delete the output directories of your workflow, but if you mess with work you’ll lose, well… work!\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learnt:\n\nHow to get automatic reports and visualisations\nHow to check the Nextflow logs\nHow to re-run workflows\nHow to clean the Nextflow cache" + }, + { + "objectID": "pages/nextflow.html#working-with-processes", + "href": "pages/nextflow.html#working-with-processes", + "title": "Working with Nextflow", + "section": "4 Working with processes", + "text": "4 Working with processes\nNow that we’ve gone through the specifics of executing workflows in a bit more detail, let’s go through working with processes. 
While there are numerous process directives that can be used, we’ll go through some of the more commonly used ones here.\n\n4.1 Tags\nLet’s look at the command line output we got during the workflow’s execution, which should look something like this:\nN E X T F L O W ~ version 22.10.6\nLaunching `./main.nf` [friendly_bhaskara] - revision: b4490b9201\nexecutor > local (17)\n[c9/e5f818] process > DOWNLOAD_FASTQ_FILES (SRR935092) [100%] 3 of 3 ✔\n[d5/b5f24e] process > RUN_FASTQC (SRR935092) [100%] 3 of 3 ✔\n[91/2cea54] process > RUN_MULTIQC [100%] 1 of 1 ✔\n[e0/b4fd37] process > GET_GENOME_FASTA [100%] 1 of 1 ✔\n[87/32ce10] process > INDEX_GENOME [100%] 1 of 1 ✔\n[56/e9a460] process > ALIGN_TO_GENOME (SRR935092) [100%] 3 of 3 ✔\n[ed/d8c223] process > SORT_BAM (SRR935092) [100%] 3 of 3 ✔\n[e7/4a6bda] process > GET_GENOME_GFF3 [100%] 1 of 1 ✔\n[e9/84f093] process > GENERATE_COUNTS_TABLE [100%] 1 of 1 ✔\nHave you noticed that there are SRA IDs after some of the processes? Well, if you look at which processes show these SRA IDs you might see that it’s only those processes that are executed three times, i.e. once per SRA ID. This doesn’t happen automatically, however, and comes from something called tags. Let’s look at the DOWNLOAD_FASTQ_FILES process:\nprocess DOWNLOAD_FASTQ_FILES {\n\n // Download a single-read FASTQ file from the SciLifeLab Figshare remote\n\n tag \"${sra_id}\"\n publishDir \"results/data\",\n mode: \"copy\"\n\n input:\n tuple val(sra_id), val(figshare_link)\n\n output:\n tuple val(sra_id), path(\"*.fastq.gz\")\n\n script:\n \"\"\"\n wget ${figshare_link} -O ${sra_id}.fastq.gz\n \"\"\"\n}\nYou can see the tag directive at the very top of the process definition. Tags can be used to e.g. show information about the sample currently being analysed by the process. This is useful both during run-time (allowing you to see which sample is being processed) and for debugging or finding problematic samples in case of errors or odd output. There is, naturally, no need to use tags for processes which are only run once.\n\nComment out (prefix with //) the tag directive from the DOWNLOAD_FASTQ_FILES process and run the workflow again. What do you see?\n\nWithout the tag directive you should instead see the numbers 1 through 3, representing the input files (of which there are three). Nextflow still tells us that it’s working on one of the input files, but it’s generally much more useful to actually see the sample name or ID, rather than just a number.\n\nUncomment the tag directive before you move on.\n\n\n\n4.2 Named outputs\nLet’s move on to the next process! It looks like this:\nprocess RUN_FASTQC {\n\n // Run FastQC on a FASTQ file.\n\n tag \"${sample}\"\n publishDir \"results/\",\n mode: \"copy\"\n\n input:\n tuple val(sample), path(fastq)\n\n output:\n path(\"*.html\")\n path(\"*.zip\")\n\n script:\n \"\"\"\n fastqc ${fastq} -q\n \"\"\"\n}\nHere is a process with two output channels! One contains all the .html files, while the other contains all the .zip files. How is this handled in the workflow definition of downstream processes that use the outputs? 
The RUN_MULTIQC process uses this output, and its part in the workflow definition looks like this:\nRUN_MULTIQC (\n RUN_FASTQC.out[1].collect()\n)\nWe already know about .out and .collect(), but we have something new here: the RUN_MULTIQC process is taking the second channel of the output from the RUN_FASTQC process - [1] is the index for the second channel, as Groovy is zero-based (the first channel is indexed by [0]).\nThis comes with some issues, however. What if we accidentally changed the order of the outputs in the process, or added a new one? Using positions like this is easy to mess up, but there is a better solution: named outputs! This can be achieved by adding the emit option for some or all of the outputs, like so:\noutput:\npath(\"*.txt\"), emit: text\nInstead of referring to the output by its position in an array as before, we refer to the channel with a label of our choosing (.out.text). This benefits us in two ways: a channel named text tells us a lot more about its contents than [1] does, and named outputs also make us less error-prone when rewriting parts of a workflow.\n\nYour turn! Add named outputs to the RUN_FASTQC process and make RUN_MULTIQC use those outputs. You’ll have to change both the output section of the RUN_FASTQC process, and the workflow definition section for RUN_MULTIQC. If you need help, see the hint below.\n\n\n\n\n\n\n\nClick to show\n\n\n\n\n\n// Workflow definition for RUN_MULTIQC\nRUN_MULTIQC (\n RUN_FASTQC.out.zip.collect()\n)\n\n// Output section of RUN_FASTQC\noutput:\npath(\"*.html\"), emit: html\npath(\"*.zip\"), emit: zip\n\n\n\nCheck if it works by executing the workflow.\n\n\n4.3 Advanced publishing\nSo far we’ve only used the publishDir directive in a very simple way: specifying a directory and the mode to use when publishing (to copy the files rather than symbolically link them). There are more things you can do, however, especially for processes with more than one output. For example, we can publish outputs in separate directories, like so:\npublishDir \"results/tables\",\n pattern: \"*.tsv\",\n mode: \"copy\"\npublishDir \"results/logs\",\n pattern: \"*.log\",\n mode: \"copy\"\nIn this example, *.tsv files are copied to the folder results/tables/, while *.log files are copied to the folder results/logs/. The publishDir directive can be used multiple times in a single process, allowing one to separate output as above, or publish the same output to multiple folders.\n\nEdit the RUN_FASTQC process to place the HTML and compressed files in separate directories. Remove the results directory and re-run the workflow to check that it worked - click below if you’re having trouble.\n\n\n\n\n\n\n\nClick to show\n\n\n\n\n\nprocess RUN_FASTQC {\n\n (...)\n\n publishDir \"results/fastqc/html\",\n pattern: \"*.html\",\n mode: \"copy\"\n publishDir \"results/fastqc/zip\",\n pattern: \"*.zip\",\n mode: \"copy\"\n\n (...)\n}\n\n\n\n\nNote that an output and a published output are different things: something can be an output of a process without being published. In fact, the RUN_FASTQC process is a prime example of this! Think about the compressed output: this output is only used by the downstream process RUN_MULTIQC and is never meant to be viewed by a human or used in some downstream task that is not part of the pipeline itself. We would thus like to keep the compressed files as an output, but not publish said output. How do we do this? 
Just remove the corresponding publishDir directive!\nThe MRSA workflow we’ve made here was refactored directly from its original version in the Snakemake tutorial of this course, which means that its output structure is not fully taking advantage of some of Nextflow’s functionality. The compressed output we’ve already talked about above is one example.\n\nSee if you can find any other processes in the current implementation of the MRSA workflow that you could optimise like this!\n\nThink about whether all processes actually need to have published outputs. Make sure you test executing the workflow after you’ve made any changes; click below if you want a hint.\n\n\n\n\n\n\nClick to show\n\n\n\n\n\nThe GET_GENOME_FASTA and GET_GENOME_GFF3 processes both download reference files which are only needed by the workflow itself and do not need to be published; the same goes for the genome index generated by the INDEX_GENOME process.\nOne could argue that neither of the BAM files generated by the ALIGN_TO_GENOME and SORT_BAM processes are needed by the user if only the final counts table is of interest, but BAM files can also be useful for exploring the alignments in e.g. IGV. There is, however, no need to publish both BAM files: only the sorted one should be published if one is interested in BAM files.\n\n\n\n\n\n4.4 Debugging\nIt is, sadly, inevitable that we all make mistakes while coding - nobody’s perfect! Nextflow helps you quite a bit when this happens, not just with its logs but also with informative error messages. Let’s introduce an error and look at what we get:\n\nChange the final output line in the RUN_MULTIQC process to the following and re-run the workflow: path(\"multiqc_general_stats.csv\") - notice the usage of .csv rather than .txt as before.\n\nWe got an error! We get a number of things, actually, including (in order from the top) the name of the process that gave the error, the likely cause, the command that was executed, along with its exit status, output, error and the work directory that the task was run in. Let’s focus on the Caused by: part at the top, which should look something like this:\nCaused by:\n Missing output file(s) `multiqc_general_stats.csv` expected by process `RUN_MULTIQC`\nWe can also see that the command’s exit status is 0, which means that the command was successful; any exit status other than 0 means there was an error of some kind. We can thus infer that the command (1) worked, but (2) failed to give us the output expected by Nextflow. Thankfully, Nextflow graciously prints the work directory for us so that we may check out what happened in more detail.\n\nCopy the working directory path, cd into it and list its contents using ls.\n\nYou might already have spotted the error in the message above; the error we introduced here was that the expected output file has a .csv extension, rather than the correct .txt. Nextflow is expecting the .csv output, but the process script directive is (correctly) giving us the .txt file, which we can see inside the process’ work directory.\n\nGo back to the root directory, revert the error you introduced and re-run the workflow to make sure it works again.\n\nThis might have seemed like a trivial error, but a lot of errors in Nextflow can be solved in the same manner, i.e. by just following the debugging output reported by Nextflow and inspecting the specific subdirectory in question.\n\n\n\n\n\n\nA note about Bash\n\n\n\nIf you are using Bash variables inside the script directive you have to be careful to prepend them with a backslash, e.g. 
\\${BASH_VARIABLE}. This is because the dollar-sign is used by Nextflow, so you have to tell Nextflow explicitly when you’re using a Bash variable. This is a common source of errors when using Bash variables, so keeping it in mind can save you some debugging time!\n\n\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learnt:\n\nHow to use the tag directive\nHow to use named output with emit\nHow to publish outputs into different directories\nHow to debug errors and mistakes"
  },
  {
    "objectID": "pages/nextflow.html#workflow-configuration",
    "href": "pages/nextflow.html#workflow-configuration",
    "title": "Working with Nextflow",
    "section": "5 Workflow configuration",
    "text": "5 Workflow configuration\nWe’ve so far been working with a relatively non-generalised workflow: it’s got hard-coded inputs, paths and genome references. This is perfectly fine for a project that is purely aimed at getting reproducible results (which is the full extent of what you want in a lot of cases), but it can be made a lot more generalisable. Let’s go through the MRSA workflow and see what can be improved!\n\n5.1 Parameters\nOne of the things that allow generalisability of Nextflow workflows is parameters, which hold information and values that can be changed directly on the command-line at the time of execution. One use of parameters in our MRSA workflow is to remove the hard-coded results output directory, for example. Parameters can be written in the following form:\nparams {\n parameter_1 = \"some/data/path\" // A string parameter\n parameter_2 = 42 // A value parameter\n parameter_3 = [\"a\", \"b\", \"c\", \"d\"] // A list parameter\n}\nYou would then refer to these parameters using e.g. params.parameter_1 anywhere you need to in the workflow. Although parameters can be defined in main_mrsa.nf, it is preferable to define them in a separate configuration file. The default name of this file is nextflow.config and if such a file is present it will be used automatically by Nextflow (to supply a config file with another name use nextflow -c <path-to-config-file> run main_mrsa.nf).\n\nCreate a configuration file and add a parameter for the results output directory.\nUse your newly created parameter in the publishDir directive of a process.\nRun your workflow to see if it worked; click below if you need help.\n\n\n\n\n\n\n\nClick to show\n\n\n\n\n\n// Configuration file\nparams {\n outdir = \"results\"\n}\n\n// A publishDir directive in a process\npublishDir \"${params.outdir}\",\n mode: \"copy\"\n\n\n\n\n\n5.2 Command line parameters\nWorkflow parameters can be assigned on the command-line by executing workflows like so: nextflow run main_mrsa.nf --parameter_name 'some_value'. The workflow parameter parameter_name is prefixed by a double dash -- to tell Nextflow this is a parameter to the workflow (a single dash is a parameter to Nextflow, e.g. -resume). The value is also quoted (this is important for parameters that take file paths as values).\n\nRun your workflow using the parameter you previously created, but pick something other than the default value!\n\nYou should now have a new directory containing all the results! This is highly useful if you want to keep track of separate runs of a workflow with different software parameters, for example: nextflow run main.nf --important_param 'value1' --resultsdir 'results-value1', or if you simply want to keep the results of separate versions of the same workflow. 
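As a minimal sketch (assuming you named your parameter outdir, as in the hint above), two runs could then be kept side by side like so:\nnextflow run main_mrsa.nf --outdir 'results-run1'\nnextflow run main_mrsa.nf --outdir 'results-run2'\n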
You can also change parameters by using the -params-file option or by using another configuration file (and using -c), rather than on the command line!\n\n\n5.3 Configuring inputs\nRemember the input for the MRSA workflow, the ch_input channel? This input (the samplesheet.csv file) is hard-coded inside the main_mrsa.nf file. This could also be made into a parameter!\n\nChange the definition of the ch_input channel to take the value of a new parameter of your choice, defined in the configuration file.\n\nYou should now have a more generalised input to your workflow! Try to run it to make sure it works - look below if you need some help.\n\n\n\n\n\n\nClick to show\n\n\n\n\n\n// Channel definition\nch_input = Channel\n .fromPath ( params.input )\n .splitCsv ( header: true )\n\n// Configuration file\ninput = \"samplesheet.csv\"\n\n\n\nBy specifying inputs from sample sheets like this we can change inputs of a workflow execution by creating another sample sheet and specifying e.g., --input samplesheet-2.csv on the command line. This is highly useful when you want to run a single sample e.g., when testing a workflow, or when you want to keep track of all the different inputs you’ve used historically.\n\n\n5.4 Other configuration scopes\nThere are lots of things that you might want to add to your configuration, not just parameters! The workflow manifest, for example, which might look like this:\nmanifest {\n name = \"My Workflow\"\n description = \"My awesome workflow, created by me\"\n author = \"Me\"\n mainScript = \"main.nf\"\n version = \"1.0.0\"\n}\n\nGo ahead and add a workflow manifest to your nextflow.config file!\n\nThe manifest is useful when you’re publishing or sharing the workflow through e.g. GitHub or similar. There are many more such configuration scopes that you might want to use - read more about them in the documentation.\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we learnt:\n\nHow to create parameters in a configuration file\nHow to specify parameters on the command line\nHow to add workflow manifest and other configuration scopes" + }, + { + "objectID": "pages/nextflow.html#optimising-the-mrsa-workflow", + "href": "pages/nextflow.html#optimising-the-mrsa-workflow", + "title": "Working with Nextflow", + "section": "6 Optimising the MRSA workflow", + "text": "6 Optimising the MRSA workflow\nWe just added several parameters and configurations to our MRSA workflow, but we didn’t do anything about the reference genomes: those are still hard-coded. The current MRSA workflow is, in fact, not very well-optimised for Nextflow at all, being a refactor from the Snakemake tutorial of this course.\nAll of the processes are basically unchanged, excluding some minor alterations. For example, the run_fastqc rule in Snakemake used the -o flag to specify that the results should be in the current directory, followed by moving the output files to their respective output directory. The first part is not needed in Nextflow (as everything is run in its own subdirectory), and the second part is done by the publishDir directive. These are just minor alterations, though, but we can do much more if we fully utilise Nextflow’s features!\n\n6.1 Remote files\nOne of these features is the ability to automatically download remote files, without needing to explicitly do so! The path input type can handle either file paths (like we’ve done so far) or a URI-supported protocol (such as http://, s3://, ftp://, etc.). This would be highly useful for e.g. 
the GET_GENOME_FASTA process - in fact, we don’t even need that process at all! All we need to do is to change the input to the INDEX_GENOME and ALIGN_TO_GENOME processes.\n\nCreate a new input channel using the fromPath() channel factory and the absolute path (the FTP address) to the genome FASTA.\nMake the INDEX_GENOME process use that input channel instead of the previously used output of the GET_GENOME_FASTA process.\nRemove the GET_GENOME_FASTA process, as it is not needed anymore.\n\nRe-run the workflow to see if it worked. Check the code below for an example if you’re stuck:\n\n\n\n\n\n\nClick to show\n\n\n\n\n\n// Channel creation\nch_genome_fasta = Channel.fromPath( \"ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna/Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz\" )\n\n// Workflow definition\nINDEX_GENOME (\n ch_genome_fasta\n)\n\n\n\nWe could also do this using parameters from our configfile, of course!\n\nNow change the input to the GENERATE_COUNTS_TABLE to use the remote GFF3 file and remove the GET_GENOME_GFF3 in the same manner as above, but using a new parameter instead.\n\nRe-run the workflow again to make sure it worked; check below if you’re stuck.\n\n\n\n\n\n\nClick to show\n\n\n\n\n\n// [ nextflow.config ]\nparams {\n genome_gff3 = \"ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gff3/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.37.gff3.gz\"\n}\n\n// [ main.nf ]\n// Channel creation\nch_genome_ggf3 = Channel.fromPath ( params.genome_gff3 )\n\n// Workflow definition\nGENERATE_COUNTS_TABLE (\n SORT_BAM.out.bam.collect(),\n ch_genome_ggf3\n)\n\n\n\nIf we want to get detailed we can also change the hard-coded “NCT8325” naming in e.g. the INDEX_GENOME process and put that in another parameter, or grab the baseName() from the channel and make a [prefix, file] tuple using the map{} operator like we did previously; check below if you’re curious of how this could be done.\n\n\n\n\n\n\nClick to show\n\n\n\n\n\n// Channel definition\nch_genome_fasta = Channel\n .fromPath( \"ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna/Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz\" )\n .map { file -> tuple(file.getBaseName(), file) }\n\n// INDEX_GENOME process definition\nprocess INDEX_GENOME {\n\n publishDir \"results/bowtie2/\",\n mode: \"copy\"\n\n input:\n tuple val(fasta_name), path(fasta)\n\n output:\n path(\"*.b2t\"), emit: index\n\n script:\n \"\"\"\n # Bowtie2 cannot use .gz, so unzip to a temporary file first\n gunzip -c ${fasta} > tempfile\n bowtie2-build tempfile ${fasta_name}\n \"\"\"\n}\n\n\n\n\n\n6.2 Subworkflows\nThe DSL2 allows highly modular workflow design, where a workflow may contain multiple subworkflows. A subworkflow is just like a normal workflow, but it can be called inside other workflows, similar to a process. There is thus no special difference between a subworkflow and a workflow; the only difference is how you use them in practice. 
Let’s take a look at a toy example:\nworkflow {\n ch_input = Channel.fromPath ( params.input )\n SUBWORKFLOW (\n ch_input\n )\n}\n\nworkflow SUBWORKFLOW {\n\n take:\n input_file\n\n main:\n ALIGN_READS( input_file )\n\n emit:\n bam = ALIGN_READS.out.bam\n}\nHere we have an unnamed, main workflow like before, plus a named subworkflow. A workflow can have inputs specified by the take directive, which is the equivalent of process input for workflows. The main part is the workflow body, which contains how to run which processes in which order. The last part, emit, also works the same as for processes, in that we name the different outputs of the workflow so that we may use them in other workflows or processes. Nextflow will run the unnamed workflow by default, unless the -entry flag is specified, like so:\nnextflow run main.nf -entry SUBWORKFLOW\nThis will run the workflow named SUBWORKFLOW, but nothing else. You can also store subworkflows in separate files, so that everything doesn’t have to be crammed into a single main.nf file. A subworkflow named SUBWORKFLOW contained in the file subworkflow.nf can be loaded into a main.nf file like so:\ninclude { SUBWORKFLOW } from \"./subworkflow.nf\"\nIf you have a complex workflow with several subworkflows you might thus store them in a separate directory, e.g. subworkflows/. This allows you to have fine-grained control over the general architecture of your Nextflow workflows, organising them in a manner that is easy to code and maintain. A process can also be treated in the same manner, and defined separately in another file.\n\nNow it’s your turn! Separate the RUN_FASTQC and RUN_MULTIQC processes out of the main workflow and into a subworkflow. Check below if you’re having trouble.\n\n\n\n\n\n\n\nClick to show\n\n\n\n\n\n// [ main.nf ]\n// Include subworkflow\ninclude { QUALITY_CONTROLS } from \"./subworkflows/quality_controls.nf\"\n\n// Main workflow\nQUALITY_CONTROLS (\n DOWNLOAD_FASTQ_FILES.out\n)\n\n// [ subworkflows/quality_controls.nf ]\n// Quality controls subworkflow\nworkflow QUALITY_CONTROLS {\n\n take:\n fastq\n\n main:\n RUN_FASTQC (\n fastq\n )\n RUN_MULTIQC (\n RUN_FASTQC.out.zip.collect()\n )\n\n emit:\n html = RUN_MULTIQC.out.html\n general_stats = RUN_MULTIQC.out.general_stats\n}\n\n// [ Include RUN_FASTQC and RUN_MULTIQC processes here ]\n\n\n\nIf you want to challenge yourself, try to do the same with the INDEX_GENOME, ALIGN_TO_GENOME and SORT_BAM processes! 
Be careful of where you get your inputs and outputs; check below if you want one of the ways in which you can do this:\n\n\n\n\n\n\nClick to show\n\n\n\n\n\n// [ main.nf ]\n// Include subworkflow\ninclude { ALIGNMENT } from \"./subworkflows/alignment.nf\"\n\n// Main workflow\nALIGNMENT (\n ch_genome_fasta,\n DOWNLOAD_FASTQ_FILES.out\n)\n\n// [ subworkflows/alignment.nf ]\n// Alignment subworkflow\nworkflow ALIGNMENT {\n\n take:\n fasta\n fastq\n\n main:\n INDEX_GENOME (\n fasta\n )\n ALIGN_TO_GENOME (\n fastq,\n INDEX_GENOME.out.index\n )\n SORT_BAM (\n ALIGN_TO_GENOME.out.bam\n )\n\n emit:\n bam = SORT_BAM.out.bam\n}\n\n// [ Include INDEX_GENOME, ALIGN_TO_GENOME and SORT_BAM processes here ]\n\n\n\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we learnt:\n\nHow to automatically download remote files\nHow to create and work with subworkflows"
  },
  {
    "objectID": "pages/nextflow.html#extra-material",
    "href": "pages/nextflow.html#extra-material",
    "title": "Working with Nextflow",
    "section": "7 Extra material",
    "text": "7 Extra material\nThere are many more things you can do with Nextflow than covered here. If you are interested in learning more about Nextflow, we will briefly show some of its advanced features in this section. But first, here are some links to additional resources on Nextflow:\n\nNextflow patterns that can help with common operations and concepts\nThe Nextflow documentation\nNextflow training at Seqera\nA work-in-progress Nextflow Carpentry course\nCommunity help from Nextflow’s Slack channel\n\n\n7.1 Using containers in Nextflow\nNextflow has built-in support for using both Docker and Apptainer containers (and others too), either with a single container for the workflow as a whole or separate containers for each individual process. The simplest way to do it is to have a single container for your entire workflow, in which case you simply run the workflow and specify the image you want to use, like so:\n# Run with docker\nnextflow run main.nf -with-docker image-name\n\n# Run with Apptainer\nnextflow run main.nf -with-apptainer image.sif\nIf you don’t want to supply this at every execution, you can also add it directly to your configuration file:\n# Docker configuration\nprocess.container = 'image-name'\ndocker.enabled = true\n\n# Apptainer configuration\nprocess.container = 'path/to/image.sif'\napptainer.enabled = true\nIf you instead would like to have each process use a different container you can use the container directive in your processes:\nprocess PROCESS_01 {\n (...)\n container: 'image_01'\n (...)\n}\n\nprocess PROCESS_02 {\n (...)\n container: 'image_02'\n (...)\n}\nRegardless of which solution you go for, Nextflow will execute all the processes inside the specified container. In practice, this means that Nextflow will automatically wrap your processes and run them by executing the Docker or Apptainer command with the image you have provided.\n\n7.2 Using Conda in Nextflow\nWhile you can execute Nextflow inside Conda environments just like you would any other type of software, you can also use Conda with Nextflow in the same way as for Docker and Apptainer above. 
You can either supply an environment.yml file, the path to an existing environment or the packages and their versions directly in the conda directive, like so:\nprocess PROCESS_01 {\n (...)\n conda: 'mrsa-environment.yml'\n (...)\n}\nprocess PROCESS_02 {\n (...)\n conda: 'path/to/mrsa-env'\n (...)\n}\nprocess PROCESS_03 {\n (...)\n conda: 'bioconda::bwa=0.7.17 bioconda::samtools=1.13'\n (...)\n}\nYou can use either of the methods described above with your configuration file as well, here exemplified using an environment.yml file:\nprocess.conda = 'mrsa-environment.yml'\n\n\n7.3 Running Nextflow on Uppmax\nA lot of researchers in Sweden are using the Uppmax computer cluster in Uppsala, which is easily handled by Nextflow. What you need to do is to add the following profile to your nextflow.config file:\nprofiles {\n // Uppmax general profile\n uppmax {\n params {\n account = null\n }\n process {\n executor = 'slurm'\n clusterOptions = \"-A '${params.account}'\"\n memory = { 6.GB * task.attempt }\n cpus = { 1 * task.attempt }\n time = { 10.h * task.attempt }\n scratch = '$SNIC_TMP'\n errorStrategy = 'retry'\n maxRetries = 1\n }\n }\n}\nThis will add a profile to your workflow, which you can access by running the workflow with -profile uppmax. You will also have to supply an extra parameter account which corresponds to your SNIC project account, but the rest you can leave as-is, unless you want to tinker with e.g. compute resource specifications. That’s all you need! Nextflow will take care of communications with SLURM (the system used by Uppmax, specified by the executor line) and will send off jobs to the cluster for you, and everything will look exactly the same way as if you were executing the pipeline locally.\nThe memory, cpus and time lines define the various resources Nextflow will use as well as how much to automatically increase them by if re-trying failed tasks; this, in turn, is specified by the errorStrategy and maxRetries variables. The scratch variable defines where each node’s local storage is situated, which gives Nextflow the most optimal access to the Uppmax file system for temporary files.\n\n\n7.4 Advanced channel creation\nThe input data shown in the MRSA example workflow is not that complex, but Nextflow channels can do much more than that. A common scenario in high-throughput sequencing is that you have pairs of reads for each sample. Nextflow has a special, built-in way to create channels for this data type: the fromFilePairs channel factory:\nch_raw_reads = Channel\n .fromFilePairs ( \"data/*_R{1,2}.fastq.gz\" )\nThis will create a channel containing all the reads in the data/ directory in the format <sample>_R1.fastq.gz and <sample>_R2.fastq.gz and will pair them together into a nested tuple looking like this:\n[sample, [data/sample_R1.fastq.gz, data/sample_R2.fastq.gz]]\nThe first element of the tuple ([0]) thus contains the value sample, while the second element ([1]) contains another tuple with paths to both read files. This nested tuple can be passed into processes for e.g. read alignment, and it makes the entire procedure of going from read pairs (i.e. two separate files, one sample) into a single alignment file (one file, one sample) very simple. 
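As a minimal sketch of a process that could consume such a channel (the process name and the trimming command here are made up purely for illustration), it might look something like this:\nprocess TRIM_READS {\n\n tag \"${sample}\"\n\n input:\n tuple val(sample), path(reads)\n\n output:\n tuple val(sample), path(\"*.trimmed.fastq.gz\")\n\n script:\n \"\"\"\n # reads[0] is the R1 file and reads[1] the R2 file of the pair\n trimming_tool ${reads[0]} ${reads[1]} -o ${sample}_R1.trimmed.fastq.gz -O ${sample}_R2.trimmed.fastq.gz\n \"\"\"\n}\n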
For more methods of reading in data see the Nextflow documentation on Channel Factories.\nWe can also do quite advanced things to manipulate data in channels, such as this:\nsamples_and_treatments = Channel\n .fromPath ( params.metadata )\n .splitCsv ( sep: \"\\t\", header: true )\n .map { row -> tuple(\"${row.sample_id}\", \"${row.treatment}\") }\n .filter { id, treatment -> treatment != \"DMSO\" }\n .unique ( )\nThat’s a bit of a handful! But what does it do? The first line specifies that we want to read some data from a file specified by the metadata parameter, and the second line actually reads that data using tab as delimiter, including a header. The map operator takes each entire row and subsets it to only two columns: the sample_id and treatment columns (discarding the other columns). This subset is stored as a tuple. The filter operator is then used to remove any tuples where the second entry (treatment) is not equal to the string \"DMSO\" (i.e. untreated cells, in this example). Finally, we only keep unique tuple values. Let’s say that this is the metadata we’re reading:\nsample dose group treatment\nsample_1 0.1 control DMSO\nsample_1 1.0 control DMSO\nsample_1 2.0 control DMSO\nsample_2 0.1 case vorinostat\nsample_2 1.0 case vorinostat\nsample_2 2.0 case vorinostat\nsample_3 0.1 case fulvestrant\nsample_3 1.0 case fulvestrant\nsample_3 2.0 case fulvestrant\nGiven the channel creation strategy above, we would get the following result:\n[sample_2, vorinostat]\n[sample_3, fulvestrant]\nIn this way, you can perform complex operations on input files or input metadata and send the resulting content to your downstream processes in a simple way. Composing data manipulations in Nextflow like this can be half the fun of writing the workflow. Check out Nextflow’s documentation on Channel operators to see the full list of channel operations at your disposal.\n\n\n7.5 Using Groovy in processes\nYou don’t have to use bash or external scripts inside your processes all the time unless you want to: Nextflow is based on Groovy, which allows you to use both Groovy and Bash in the same process. For example, have a look at this:\nprocess index_fasta {\n tag \"${fasta_name}\"\n\n input:\n tuple val(fasta), path(fasta_file)\n\n output:\n path(\"${fasta_name}.idx\"), emit: fasta\n\n script:\n fasta_name = fasta.substring(0, fasta.lastIndexOf(\".\"))\n \"\"\"\n index --ref ${fasta_file},${fasta_name}\n \"\"\"\n}\nHere we have some command index that, for whatever reason, requires both the path to a FASTA file and the name of that file without the .fasta extension. We can use Groovy in the script directive together with normal Bash, mixing and matching as we like. The first line of the script directive gets the name of the FASTA file without the extension by removing anything after the dot, while the second calls the index command like normal using bash.\n\n\n7.6 The nf-core pipeline collection\nYou may have heard of the nf-core pipeline collection previously, which is a large, collaborative bioinformatics community dedicated to building, developing and maintaining Nextflow workflows. In fact, if you have sequenced data at e.g. the National Genomics Infrastructure (NGI), you can be sure that the data processing has been run using one of the nf-core pipelines! 
While the community only started in 2018 (with a Nature Biotechnology paper in 2020), it already has over 30 production-ready pipelines with everything from genomics, transcriptomics, proteomics and metagenomics - and more being developed all the time.\nThe nf-core pipelines all work in the same way, in that they have the same exact base for inputs, parameters and arguments, making them all highly similar to run. Since you’ve already learnt the basics of Nextflow in this course, you should now be able to also run the nf-core pipelines! It might be that you have a data type that you can analyse using one of the pipelines in nf-core, meaning you don’t need to do anything other than find out what parameters you should run it with.\nEach pipeline comes with extensive documentation, test datasets that you can use to practice on, can be run on both HPCs like Uppmax, cloud services like AWS or locally on your own computer. All pipelines support both Conda and Docker/Apptainer, and you can additionally run specific versions of the pipelines, allowing for full reproducibility of your analyses. If you want to check nf-core out, simply head over to their list of pipelines and see what’s available! Who knows, you might even write your own nf-core pipeline in the future?" + }, + { + "objectID": "pages/snakemake.html", + "href": "pages/snakemake.html", + "title": "Working with Snakemake", + "section": "", + "text": "A workflow management system (WfMS) is a piece of software that sets up, performs and monitors a defined sequence of computational tasks (i.e. “a workflow”). Snakemake is a WfMS that was developed in the bioinformatics community, and as such it has a number of features that make it particularly well-suited for creating reproducible and scalable data analyses.\nFirst of all the language you use to formulate your workflows is based on Python, which is a language with strong standing in academia. However, users are not required to know how to code in Python to work efficiently with Snakemake. Workflows can easily be scaled from your desktop to server, cluster, grid or cloud environments. This makes it possible to develop a workflow on your laptop, maybe using only a small subset of your data, and then run the real analysis on a cluster. Snakemake also has several features for defining the environment with which each task is carried out. This is important in bioinformatics, where workflows often involve running a large number of small third-party tools.\nSnakemake is primarily intended to work on files (rather than for example streams, reading/writing from databases or passing variables in memory). This fits well with many fields of bioinformatics, notably next-generation sequencing, that often involve computationally expensive operations on large files. It’s also a good fit for a scientific research setting, where the exact specifications of the final workflow aren’t always known at the beginning of a project.\nLastly, a WfMS is a very important tool for making your analyses reproducible. By keeping track of when each file was generated, and by which operation, it is possible to ensure that there is a consistent “paper trail” from raw data to final results. Snakemake also has features that allow you to package and distribute the workflow, and any files it involves, once it’s done.\nThis tutorial depends on files from the course GitHub repo. 
Take a look at the setup for instructions on how to set it up if you haven’t done so already, then open up a terminal and go to workshop-reproducible-research/tutorials/snakemake and activate your snakemake-env Conda environment." + }, + { + "objectID": "pages/snakemake.html#introduction", + "href": "pages/snakemake.html#introduction", + "title": "Working with Snakemake", + "section": "", + "text": "A workflow management system (WfMS) is a piece of software that sets up, performs and monitors a defined sequence of computational tasks (i.e. “a workflow”). Snakemake is a WfMS that was developed in the bioinformatics community, and as such it has a number of features that make it particularly well-suited for creating reproducible and scalable data analyses.\nFirst of all the language you use to formulate your workflows is based on Python, which is a language with strong standing in academia. However, users are not required to know how to code in Python to work efficiently with Snakemake. Workflows can easily be scaled from your desktop to server, cluster, grid or cloud environments. This makes it possible to develop a workflow on your laptop, maybe using only a small subset of your data, and then run the real analysis on a cluster. Snakemake also has several features for defining the environment with which each task is carried out. This is important in bioinformatics, where workflows often involve running a large number of small third-party tools.\nSnakemake is primarily intended to work on files (rather than for example streams, reading/writing from databases or passing variables in memory). This fits well with many fields of bioinformatics, notably next-generation sequencing, that often involve computationally expensive operations on large files. It’s also a good fit for a scientific research setting, where the exact specifications of the final workflow aren’t always known at the beginning of a project.\nLastly, a WfMS is a very important tool for making your analyses reproducible. By keeping track of when each file was generated, and by which operation, it is possible to ensure that there is a consistent “paper trail” from raw data to final results. Snakemake also has features that allow you to package and distribute the workflow, and any files it involves, once it’s done.\nThis tutorial depends on files from the course GitHub repo. Take a look at the setup for instructions on how to set it up if you haven’t done so already, then open up a terminal and go to workshop-reproducible-research/tutorials/snakemake and activate your snakemake-env Conda environment." + }, + { + "objectID": "pages/snakemake.html#the-basics", + "href": "pages/snakemake.html#the-basics", + "title": "Working with Snakemake", + "section": "2 The basics", + "text": "2 The basics\nIn this part of the tutorial we will create a very simple workflow from scratch, in order to show the fundamentals of how Snakemake works. The workflow will take two files as inputs, a.txt and b.txt, and the purpose is to convert the text in the files to upper case and then to concatenate them.\nRun the following shell commands. The first one will make an empty file named Snakefile, which will later contain the workflow. The second and third commands generate two files containing some arbitrary text.\ntouch Snakefile\necho \"This is a.txt\" > a.txt\necho \"This is b.txt\" > b.txt\nThen open Snakefile in your favourite text editor. 
A Snakemake workflow is based on rules which take some file(s) as input, performs some type of operation on them, and generate some file(s) as outputs. Here is a very simple rule that produces a.upper.txt as an output, using a.txt as input. Copy this rule to your Snakefile and save it.\nrule convert_to_upper_case:\n output:\n \"a.upper.txt\"\n input:\n \"a.txt\"\n shell:\n \"\"\"\n tr [a-z] [A-Z] < {input} > {output}\n \"\"\"\n\n\n\n\n\n\nCaution\n\n\n\nIndentation is important in Snakefiles, so make sure that you have the correct number of spaces before input/output/shell and their respective subsections. The number of spaces per level doesn’t matter as long as you’re consistent. Here we use four, but you could just as well use two for a more compact look. Don’t use tabs (unless your editor automatically converts them to spaces).\n\n\nRules can be given names, here it’s convert_to_upper_case. While rule names are not strictly necessary we encourage you to use them and to make an effort to name your rules in a way that makes it easy to understand the purpose of the rule, as rule names are one of the main ways to interact with the workflow. The shell section (or directive) contains the shell commands that will convert the text in the input file to upper case and send it to the output file. In the shell command string, we can refer to elements of the rule via curly brackets. Here, we refer to the output file by specifying {output} and to the input file by specifying {input}. If you’re not very familiar with Bash, this particular command can be read like “send the contents of a.txt to the program tr, which will convert all characters in the set [a-z] to the corresponding character in the set [A-Z], and then send the output to a.upper.txt”.\nNow let’s run our first Snakemake workflow. When a workflow is executed Snakemake tries to generate a set of target files. Target files can be specified via the command line (or, as you will see later, in several other ways). Here we ask Snakemake to make the file a.upper.txt. We can specify the file containing our rules with -s but since the default behaviour of Snakemake is to look for a file called Snakefile in either the working directory or in a subdirectory called workflow/ we don’t need to specify that here. It’s good practice to first run with the flag -n (or --dry-run), which will show what Snakemake plans to do without actually running anything, and you also need to specify how many cores to be used for the workflow with --cores or -c. For now, you only need 1 so set -c 1. You can also use the flag -p, for showing the shell commands that it will execute, and the flag -r for showing the reason for running a specific rule. snakemake --help will show you all available flags.\n$ snakemake -n -c 1 -r -p a.upper.txt\n\nBuilding DAG of jobs...\nJob stats:\njob count min threads max threads\n--------------------- ------- ------------- -------------\nconvert_to_upper_case 1 1 1\ntotal 1 1 1\n\n\n[Mon Oct 25 16:48:43 2021]\nrule convert_to_upper_case:\n input: a.txt\n output: a.upper.txt\n jobid: 0\n reason: Missing output files: a.upper.txt\n resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T\n\n\n tr [a-z] [A-Z] < a.txt > a.upper.txt\n\nJob stats:\njob count min threads max threads\n--------------------- ------- ------------- -------------\nconvert_to_upper_case 1 1 1\ntotal 1 1 1\n\nThis was a dry-run (flag -n). 
The order of jobs does not reflect the order of execution.\nYou can see that Snakemake plans to run one job: the rule convert_to_upper_case with a.txt as input and a.upper.txt as output. The reason for doing this is that it’s missing the file a.upper.txt. Now execute the workflow without the -n flag and check that the contents of a.upper.txt is as expected. Then try running the same command again. What do you see? It turns out that Snakemake only reruns jobs if there have been changes to either the input files, or the workflow itself. This is how Snakemake ensures that everything in the workflow is up to date. We will get back to this shortly.\nWhat if we ask Snakemake to generate the file b.upper.txt?\n$ snakemake -n -c 1 -r -p b.upper.txt\n\nBuilding DAG of jobs...\nMissingRuleException:\nNo rule to produce b.upper.txt (if you use input functions make sure that they don't raise unexpected exceptions).\nThat didn’t work well. We could copy the rule to make a similar one for b.txt, but that would be a bit cumbersome. Here is where named wildcards come in; one of the most powerful features of Snakemake. Simply change the input from input: \"a.txt\" to input: \"{some_name}.txt\" and the output to output: \"{some_name}.upper.txt\". Now try asking for b.upper.txt again.\nTada! What happens here is that Snakemake looks at all the rules it has available (actually only one in this case) and tries to assign values to all wildcards so that the targeted files can be generated. In this case it was quite simple, you can see that it says that wildcards: some_name=b, but for large workflows and multiple wildcards it can get much more complex. Named wildcards is what enables a workflow (or single rules) to be efficiently generalized and reused between projects or shared between people.\nIt seems we have the first part of our workflow working, now it’s time to make the second rule for concatenating the outputs from convert_to_upper_case. The rule structure will be similar; the only difference is that here we have two inputs instead of one. This can be expressed in two ways, either with named inputs like this:\ninput:\n firstFile=\"...\",\n secondFile=\"...\"\nshell:\n \"\"\"\n some_function {input.firstFile} {input.secondFile}\n \"\"\"\nOr with indexes like this:\ninput:\n \"...\",\n \"...\"\nshell:\n \"\"\"\n some_function {input[0]} {input[1]}\n \"\"\"\n\n\n\n\n\n\nCaution\n\n\n\nIf you have multiple inputs or outputs they need to be delimited with a comma (as seen above). This is a very common mistake when writing Snakemake workflows. The parser will complain, but sometimes the error message can be difficult to interpret.\n\n\nNow try to construct this rule yourself and name it concatenate_a_and_b. The syntax for concatenating two files in Bash is cat first_file.txt second_file.txt > output_file.txt. Call the output c.txt. Run the workflow in Snakemake and validate that the output looks as expected.\nWouldn’t it be nice if our workflow could be used for any files, not just a.txt and b.txt? We can achieve this by using named wildcards (or in other ways as we will discuss later). As we’ve mentioned, Snakemake looks at all the rules it has available and tries to assign values to all wildcards so that the targeted files can be generated. We therefore have to name the output file in a way so that it also contains information about which input files it should be based on. Try to figure out how to do this yourself. If you’re stuck you can look at the spoiler below, but spend some time on it before you look. 
Also rename the rule to concatenate_files to reflect its new more general use.\n\n\n\n\n\n\nClick to show\n\n\n\n\n\nrule concatenate_files:\n output:\n \"{first}_{second}.txt\"\n input:\n \"{first}.upper.txt\",\n \"{second}.upper.txt\"\n shell:\n \"\"\"\n cat {input[0]} {input[1]} > {output}\n \"\"\"\n\n\n\nWe can now control which input files to use by the name of the file we ask Snakemake to generate. Run the workflow without the flag -n (or --dry-run) to execute both rules, providing one core with -c 1 (or --cores 1):\n$ snakemake a_b.txt -c 1\n\nBuilding DAG of jobs...\nUsing shell: /bin/bash\nProvided cores: 1 (use --cores to define parallelism)\nRules claiming more threads will be scaled down.\nJob stats:\njob count min threads max threads\n--------------------- ------- ------------- -------------\nconcatenate_files 1 1 1\nconvert_to_upper_case 2 1 1\ntotal 3 1 1\n\nSelect jobs to execute...\n\n[Mon Oct 25 16:51:52 2021]\nrule convert_to_upper_case:\n input: b.txt\n output: b.upper.txt\n jobid: 2\n wildcards: some_name=b\n resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T\n\n[Mon Oct 25 16:51:53 2021]\nFinished job 2.\n1 of 3 steps (33%) done\nSelect jobs to execute...\n\n[Mon Oct 25 16:51:53 2021]\nrule convert_to_upper_case:\n input: a.txt\n output: a.upper.txt\n jobid: 1\n wildcards: some_name=a\n resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T\n\n[Mon Oct 25 16:51:53 2021]\nFinished job 1.\n2 of 3 steps (67%) done\nSelect jobs to execute...\n\n[Mon Oct 25 16:51:53 2021]\nrule concatenate_files:\n input: a.upper.txt, b.upper.txt\n output: a_b.txt\n jobid: 0\n wildcards: first=a, second=b\n resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T\n\n[Mon Oct 25 16:51:53 2021]\nFinished job 0.\n3 of 3 steps (100%) done\nNeat!\n\n\n\n\n\n\nTip\n\n\n\nYou can name a file whatever you want in a Snakemake workflow, but you will find that everything falls into place much nicer if the filename reflects the file’s path through the workflow, e.g. sample_a.trimmed.deduplicated.sorted.bam.\n\n\nThe input to Snakemake rules have to be strings or lists of strings, however you don’t have to specify these strings directly in the input: section of rules. Instead, you can specify Python functions that return strings or lists of strings. This allows you to supply input to rules that can vary depending on the wildcards being used. We’ll get to why that’s useful in a sec, but first let’s put it to use for the conatenate_files rule. Because Snakemake is based on Python we can mix rule definitions with standard python code in the same file. Add a function just above the concatenate_files that looks like this:\ndef concat_input(wildcards):\n files = [wildcards.first + \".upper.txt\", wildcards.second + \".upper.txt\"]\n return files\nThis is the syntax to define a function in Python. The def concat_input(wildcards): line shows the name of the function (concat_input) and the variable passed to the function (the wildcards object). In the second line we add two items to a list that we call files and add the ‘.upper.txt’ suffix to each item. Finally, the function returns the list. Because the concatenate_files rule has two wildcards {first} and {second} we can access the actual strings in the wildcards object using wildcards.first and wildcards.second. When we ask for the file a_b.txt then wildcards.first == 'a' and wildcards.second == 'b'. This means that the files list returned by the function will be ['a.upper.txt', 'b.upper.txt']. 
To see for yourself you can add the following line to the function, just before the return statement: print (wildcards.first, wildcards.second, files). This way the wildcard values and the list will be printed to the terminal when you run Snakemake.\nNow that we’ve defined the function to use as input, we can use it in the concatenate_files rule. Update the rule so that it looks like this:\nrule concatenate_files:\n output:\n \"{first}_{second}.txt\"\n input:\n concat_input\n shell:\n \"\"\"\n cat {input[0]} {input[1]} > {output}\n \"\"\"\nYou see that the name of the function concat_input is added in place of the input strings. When using the wildcards object in input functions like this we have to call the function without any arguments (simply concat_input) and the function has to be defined to accept a single argument (here def concat_input(wildcards):). Let’s run the workflow with the updated rule. Remove the file a_b.txt or add -f to the Snakemake command to force a re-run:\nsnakemake a_b.txt -c 1 -f\nIf you added the print statement to the function you should see the following printed to your terminal:\nBuilding DAG of jobs...\na b ['a.upper.txt', 'b.upper.txt']\nFollowed by the rest of the workflow output.\nThere are a number of possible use-cases for input functions. For example, say that you have an experiment where you’ve sequenced three samples: sample1, sample2 and sample3 with the corresponding FASTQ files under data/ and you want to write a rule that outputs the statistics of all sequences within each sample. However, samples sample1 and sample2 have been sequenced with single-end technology while sample3 have paired-end reads. The single-end samples will have only one FASTQ file whereas the paired-end sample will have two (one for each sequenced end). Thus, depending on the name of the sample the input to the function will either be one file or two. With input functions we can write a generalized rule that can handle both types:\ndef fastq_input(wildcards):\n if wildcards.sample_id in [\"sample1\", \"sample2\"]:\n return \"data/\" + wildcards.sample_id + \".fastq.gz\"\n else:\n return [\"data/\" + wildcards.sample_id + \".R1.fastq.gz\",\n \"data/\" + wildcards.sample_id + \".R2.fastq.gz\"]\n\nrule fastq_stats:\n output:\n \"{sample_id}.stats.txt\"\n input:\n fastq_input\n shell:\n \"\"\"\n seqtk comp {input} > {output}\n \"\"\"\nAs you can see, the fastq_stats rule outputs one file {sample_id}.stats.txt and takes as input the value returned from the fastq_input function. In this function the sample id is evaluated and if it is either sample1 or sample2 (our single-end samples) then the function returns a single string which is the path to the FASTQ file for that sample. Otherwise, the function returns a list containing both the R1 and R2 files for the sample. 
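To make this concrete: since {input} expands to a space-separated list of the returned file(s), the rendered commands for a single-end and a paired-end sample would look roughly like this:\nseqtk comp data/sample1.fastq.gz > sample1.stats.txt\nseqtk comp data/sample3.R1.fastq.gz data/sample3.R2.fastq.gz > sample3.stats.txt\n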
In the shell: directive of the rule the seqtk comp command is run on the input and the output is sent to the output file.\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nHow a simple Snakemake rule looks.\nHow to define target files when executing a workflow.\nHow to use named wildcards for writing generic and flexible rules.\nHow to use input functions in rules" + }, + { + "objectID": "pages/snakemake.html#visualising-workflows", + "href": "pages/snakemake.html#visualising-workflows", + "title": "Working with Snakemake", + "section": "3 Visualising workflows", + "text": "3 Visualising workflows\nAll that we’ve done so far could quite easily be done in a simple shell script that takes the input files as parameters. Let’s now take a look at some of the features where a WfMS like Snakemake really adds value compared to a more straightforward approach. One such feature is the possibility to visualize your workflow. Snakemake can generate three types of graphs, one that shows how the rules are connected, one that shows how the jobs (i.e. an execution of a rule with some given inputs/outputs/settings) are connected, and finally one that shows rules with their respective input/output files.\nFirst we look at the rule graph. The following command will generate a rule graph in the dot language and pipe it to the program dot, which in turn will save a visualization of the graph as a PNG file (if you’re having troubles displaying PNG files you could use SVG or JPG instead).\n\n\n\n\n\n\nCaution\n\n\n\nIf you added the print(wildcards.first,wildcards.second,files) statement to the concat_input function in the previous section you need to remove that line before running the commands below.\n\n\nsnakemake --rulegraph a_b.txt | dot -Tpng > rulegraph.png\n\nThis looks simple enough, the output from the rule convert_to_upper_case will be used as input to the rule concatenate_files.\nFor a more typical bioinformatics project it can look something like this when you include all the rules from processing of the raw data to generating figures for the paper.\n\nWhile saying that it’s easy to read might be a bit of a stretch, it definitely gives you a better overview of the project than you would have without a WfMS.\nThe second type of graph is based on the jobs, and looks like this for our little workflow (use --dag instead of --rulegraph).\nsnakemake --dag a_b.txt | dot -Tpng > jobgraph.png\n\nThe main difference here is that now each node is a job instead of a rule. You can see that the wildcards used in each job are also displayed. Another difference is the dotted lines around the nodes. A dotted line is Snakemake’s way of indicating that this rule doesn’t need to be rerun in order to generate a_b.txt. Validate this by running snakemake -n -r a_b.txt and it should say that there is nothing to be done.\nWe’ve discussed before that one of the main purposes of using a WfMS is that it automatically makes sure that everything is up to date. This is done by recursively checking that outputs are always newer than inputs for all the rules involved in the generation of your target files. Now try to change the contents of a.txt to some other text and save it. 
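For example (any change to the file will do):\necho \"This is a new version of a.txt\" > a.txt\n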
What do you think will happen if you run snakemake -n -r a_b.txt again?\n\n\n\n\n\n\nClick to show\n\n\n\n\n\n$ snakemake -n -r a_b.txt\n\nBuilding DAG of jobs...\nJob stats:\njob count min threads max threads\n--------------------- ------- ------------- -------------\nconcatenate_files 1 1 1\nconvert_to_upper_case 1 1 1\ntotal 2 1 1\n\n\n[Mon Oct 25 17:00:02 2021]\nrule convert_to_upper_case:\n input: a.txt\n output: a.upper.txt\n jobid: 1\n reason: Updated input files: a.txt\n wildcards: some_name=a\n resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T\n\n\n[Mon Oct 25 17:00:02 2021]\nrule concatenate_files:\n input: a.upper.txt, b.upper.txt\n output: a_b.txt\n jobid: 0\n reason: Input files updated by another job: a.upper.txt\n wildcards: first=a, second=b\n resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T\n\nJob stats:\njob count min threads max threads\n--------------------- ------- ------------- -------------\nconcatenate_files 1 1 1\nconvert_to_upper_case 1 1 1\ntotal 2 1 1\n\nThis was a dry-run (flag -n). The order of jobs does not reflect the order of execution.\n\n\n\nWere you correct? Also generate the job graph and compare to the one generated above. What’s the difference? Now rerun without -n and validate that a_b.txt contains the new text (don’t forget to specify -c 1). Note that Snakemake doesn’t look at the contents of files when trying to determine what has changed, only at the timestamp for when they were last modified.\nWe’ve seen that Snakemake keeps track of if files in the workflow have changed, and automatically makes sure that any results depending on such files are regenerated. What about if the rules themselves are changed? It turns out that since version 7.8.0 Snakemake keeps track of this automatically.\nLet’s say that we want to modify the rule concatenate_files to also include which files were concatenated.\nrule concatenate_files:\n output:\n \"{first}_{second}.txt\"\n input:\n \"{first}.upper.txt\",\n \"{second}.upper.txt\"\n shell:\n \"\"\"\n echo 'Concatenating {input}' | cat - {input[0]} {input[1]} > {output}\n \"\"\"\n\n\n\n\n\n\nNote\n\n\n\nIt’s not really important for the tutorial, but the shell command used here first outputs “Concatenating” followed by a space delimited list of the files in input. This string is then sent to the program cat where it’s concatenated with input[0] and input[1] (the parameter - means that it should read from standard input). Lastly, the output from cat is sent to {output}.\n\n\nIf you now run the workflow as before you should see:\nrule concatenate_files:\n input: a.upper.txt, b.upper.txt\n output: a_b.txt\n jobid: 0\n reason: Code has changed since last execution\n wildcards: first=a, second=b\nBecause although no files involved in the workflow have been changed, Snakemake recognizes that the workflow code itself has been modified and this triggers a re-run.\nSnakemake is aware of changes to four categories of such “rerun-triggers”: “input” (changes to rule input files), “params” (changes to the rule params section), “software-env” (changes to Conda environment files specified by the conda: directive) and “code” (changes to code in the shell:, run:, script: and notebook: directives).\nPrior to version 7.8.0, only changes to the modification time of input files would trigger automatic re-runs. To run Snakemake with this previous behaviour you can use the setting --rerun-triggers mtime at the command line. 
Change the shell: section of the concatenate_files rule back to the previous version, then try running: snakemake -n -r a_b.txt --rerun-triggers mtime and you should again see Nothing to be done (all requested files are present and up to date).\nYou can also export information on how all files were generated (when, by which rule, which version of the rule, and by which commands) to a tab-delimited file like this:\nsnakemake a_b.txt -c 1 -D > summary.tsv\nThe content of summary.tsv is shown in the table below:\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\noutput_file\ndate\nrule\nversion\nlog-file(s)\ninput-file(s)\nshellcmd\nstatus\nplan\n\n\n\n\na_b.txt\nMon Oct 25 17:01:46 2021\nconcatenate_files\n-\n\na.upper.txt,b.upper.txt\ncat a.upper.txt b.upper.txt > a_b.txt\nrule implementation changed\nupdate pending\n\n\na.upper.txt\nMon Oct 25 17:01:46 2021\nconvert_to_upper_case\n-\n\na.txt\ntr [a-z] [A-Z] < a.txt > a.upper.txt\nok\nno update\n\n\nb.upper.txt\nMon Oct 25 17:01:46 2021\nconvert_to_upper_case\n-\n\nb.txt\ntr [a-z] [A-Z] < b.txt > b.upper.txt\nok\nno update\n\n\n\n\nYou can see in the second last column that the rule implementation for a_b.txt has changed. The last column shows if Snakemake plans to regenerate the files when it’s next executed. You can see that for the concatenate_files the plan is update pending because we generated the summary with the default behaviour of using all rerun-triggers.\nYou might wonder where Snakemake keeps track of all these things? It stores all information in a hidden subdirectory called .snakemake. This is convenient since it’s easy to delete if you don’t need it anymore and everything is contained in the project directory. Just be sure to add it to .gitignore so that you don’t end up tracking it with git.\nBy now you should be familiar with the basic functionality of Snakemake, and you can build advanced workflows with only the features we have discussed here. There’s a lot we haven’t covered though, in particular when it comes to making your workflow more reusable. In the following section we will start with a workflow that is fully functional but not very flexible. We will then gradually improve it, and at the same time showcase some Snakemake features we haven’t discussed yet. Note that this can get a little complex at times, so if you felt that this section was a struggle then you could move on to one of the other tutorials instead.\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nHow to use --dag and --rulegraph for visualizing the job and rule graphs, respectively.\nHow Snakemake reruns relevant parts of the workflow after there have been changes.\nHow Snakemake tracks changes to files and code in a workflow" + }, + { + "objectID": "pages/snakemake.html#the-mrsa-workflow", + "href": "pages/snakemake.html#the-mrsa-workflow", + "title": "Working with Snakemake", + "section": "4 The MRSA workflow", + "text": "4 The MRSA workflow\nAs you might remember from the intro, we are attempting to understand how lytic bacteriophages can be used as a future therapy for the multi-resistant bacteria MRSA (methicillin-resistant Staphylococcus aureus). In order to do this we have performed RNA-seq of three strains, one test and two controls. We have already set up a draft Snakemake workflow for the RNA-seq analysis and it seems to be running nicely. The rest of the Snakemake tutorial will be spent improving and making this workflow more flexible!\n\n\n\n\n\n\nTip\n\n\n\nThis section will leave a little more up to you compared to the previous one. 
If you get stuck at some point the final workflow after all the modifications is available in tutorials/git/Snakefile.\n\n\nYou are probably already in your snakemake-env environment, otherwise activate it (use conda info --envs if you are unsure).\n\n\n\n\n\n\nTip\n\n\n\nHere we have one Conda environment for executing the whole Snakemake workflow. Snakemake also supports using explicit Conda environments on a per-rule basis, by specifying something like conda: rule-specific-env.yml in the rule definition and running Snakemake with the --use-conda flag. The given rule will then be run in the Conda environment specified in rule-specific-env.yml that will be created and activated on the fly by Snakemake. Note that by default Snakemake uses mamba to generate the rule-specific environments. This behaviour can be changed by running with --conda-frontend conda, which will force Snakemake to use conda instead.\n\n\nLet’s start by generating the rule graph so that we get an overview of the workflow. Here we have to specify the file with the rules using the -s flag to Snakemake since the path to the file differs from the default.\nsnakemake -s snakefile_mrsa.smk --rulegraph | dot -T png > rulegraph_mrsa.png\nThere’s another difference in this command compared to the one we’ve used before, namely that we don’t define a target. In the toy example we used a_b.txt as a target, and the wildcards were resolved based on that. How come we don’t need to do that here? It turns out that by default Snakemake targets the first rule in a workflow. By convention, we call this rule all and let it serve as a rule for aggregating the main outputs of the workflow.\n\nNow take some time and look through the workflow file and try to understand how the rules fit together. Use the rule graph as an aid. The rules represent a quite standard, although somewhat simplified, workflow for RNA-seq analysis. If you are unfamiliar with the purpose of the different operations (index genome, FastQC and so on), then take a look at the intro.\nAlso generate the job graph in the same manner. Here you can see that three samples will be downloaded: SRR935090, SRR935091, and SRR935092. The original sample files contain tens of millions of reads but for the purpose of this course we have sub-sampled them to 100,000 reads per sample, so that they are easy to manage, and made them available at the SciLifeLab Data Repository. These FASTQ files will then be quality controlled with FastQC and aligned to a genome. The QC output will be aggregated with MultiQC and the alignments will be used to generate a count table, i.e. a table that shows how many reads map to each gene for each sample. This count table is then what the downstream analysis will be based on.\n\nNow try to run the whole workflow. 
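Using the same flags as before, the command would be something like (adjust the number of cores if you want to):\nsnakemake -s snakefile_mrsa.smk -c 1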
Hopefully you see something like this.\nBuilding DAG of jobs...\nUsing shell: /bin/bash\nProvided cores: 1 (use --cores to define parallelism)\nRules claiming more threads will be scaled down.\nJob stats:\njob count min threads max threads\n-------------------- ------- ------------- -------------\nalign_to_genome 3 1 1\nall 1 1 1\nfastqc 3 1 1\ngenerate_count_table 1 1 1\ngenerate_rulegraph 1 1 1\nget_SRA_by_accession 3 1 1\nget_genome_fasta 1 1 1\nget_genome_gff3 1 1 1\nindex_genome 1 1 1\nmultiqc 1 1 1\nsort_bam 3 1 1\ntotal 19 1 1\n\nSelect jobs to execute...\n\n[Mon Oct 25 17:13:47 2021]\nrule get_genome_fasta:\n output: data/ref/NCTC8325.fa.gz\n jobid: 6\n resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T\n\n--2021-10-25 17:13:48-- ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz\n => ‘data/ref/NCTC8325.fa.gz’\nResolving ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)... 193.62.197.75\nConnecting to ftp.ensemblgenomes.org (ftp.ensemblgenomes.org)|193.62.197.75|:21... connected.\nLogging in as anonymous ... Logged in!\n==> SYST ... done. ==> PWD ... done.\n.\n.\n[lots of stuff]\n.\n.\nlocalrule all:\n input: results/tables/counts.tsv, results/multiqc/multiqc.html, results/rulegraph.png\n jobid: 0\n resources: tmpdir=/var/folders/p0/6z00kpv16qbf_bt52y4zz2kc0000gp/T\n\n[Mon Oct 25 17:14:38 2021]\nFinished job 0.\n19 of 19 steps (100%) done\nAfter everything is done, the workflow will have resulted in a bunch of files in the directories data/ and results/. Take some time to look through the structure, in particular the quality control reports in results/multiqc/ and the count table in results/tables/.\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nHow the MRSA workflow looks.\nHow to run the MRSA workflow.\nWhich output files the MRSA workflow produces." + }, + { + "objectID": "pages/snakemake.html#parameters", + "href": "pages/snakemake.html#parameters", + "title": "Working with Snakemake", + "section": "5 Parameters", + "text": "5 Parameters\nIn a typical bioinformatics project, considerable efforts are spent on tweaking parameters for the various programs involved. It would be inconvenient if you had to change in the shell scripts themselves every time you wanted to run with a new setting. Luckily, there is a better option for this: the params keyword.\nrule some_rule:\n output:\n \"...\"\n input:\n \"...\"\n params:\n cutoff=2.5\n shell:\n \"\"\"\n some_program --cutoff {params.cutoff} {input} {output}\n \"\"\"\nMost of the programs are run with default settings in the MRSA workflow and don’t use the params: directive. However, the get_SRA_by_accession rule is an exception. 
Here the remote address for each of the files to download is passed to the shell directive via:\ndef get_sample_url(wildcards):\n samples = {\n \"SRR935090\": \"https://figshare.scilifelab.se/ndownloader/files/39539767\",\n \"SRR935091\": \"https://figshare.scilifelab.se/ndownloader/files/39539770\",\n \"SRR935092\": \"https://figshare.scilifelab.se/ndownloader/files/39539773\"\n }\n return samples[wildcards.sample_id]\n\nrule get_SRA_by_accession:\n \"\"\"\n Retrieve a single-read FASTQ file\n \"\"\"\n output:\n \"data/{sample_id}.fastq.gz\"\n params:\n url = get_sample_url\n shell:\n \"\"\"\n wget -O - {params.url} | seqtk sample - 25000 | gzip -c > {output[0]}\n \"\"\"\nYou may recognize this from page 2 of this tutorial where we used input functions to generate strings and lists of strings for the input: section of a rule. Using a function to return values based on the wildcards also works for params:. Here sample_id is a wildcard which in this specific workflow can be either SRR935090, SRR935091, or SRR935092. The wildcards object is passed to the function get_sample_url and depending on what output the rule is supposed to generate, wildcards.sample_id will take the value of either of the three sample ids. The samples variable defined in the function is a Python dictionary that has the URLs for each sample_id hard-coded. This dictionary is used to convert the value of the sample_id wildcard to a URL, which is returned by the function. Finally, in the shell: directive we access the url parameter with {params.url}. (We could have written three separate rules to download the samples, but it’s easy to see how that can become impractical.)\nLet’s add another parameter to the get_SRA_by_accession rule. As you can see in the shell command the FASTQ file downloaded by wget gets piped directly (the -O - part means send contents to STDOUT) to the seqtk sample command which reads from STDIN and outputs 25000 randomly sampled reads (out of the 100,000 contained in the example FASTQ file). Change the rule to use the parameter max_reads instead and set the value to 20000. If you need help, click to show the solution below.\n\n\n\n\n\n\nClick to show\n\n\n\n\n\nrule get_SRA_by_accession:\n \"\"\"\n Retrieve a single-read FASTQ file\n \"\"\"\n output:\n \"data/{sample_id}.fastq.gz\"\n params:\n url = get_sample_url,\n max_reads = 20000\n shell:\n \"\"\"\n wget -O - {params.url} | seqtk sample - {params.max_reads} | gzip -c > {output[0]}\n \"\"\"\n\n\n\nNow run through the workflow. Because there have been changes to the get_SRA_by_accession rule this will trigger a re-run of the rule for all three accessions. In addition all downstream rules that depend on output from get_SRA_by_accession are re-run.\nAs you can see the parameter values we set in the params section don’t have to be static, they can be any Python expression. In particular, Snakemake provides a global dictionary of configuration parameters called config. Let’s modify get_SRA_by_accession to look something like this in order to make use of this dictionary:\nrule get_SRA_by_accession:\n \"\"\"\n Retrieve a single-read FASTQ file\n \"\"\"\n output:\n \"data/{sample_id}.fastq.gz\"\n params:\n url = get_sample_url,\n max_reads = config[\"max_reads\"]\n shell:\n \"\"\"\n wget -O - {params.url} | seqtk sample - {params.max_reads} | gzip -c > {output[0]}\n \"\"\"\nNote that Snakemake now expects there to be a key named max_reads in the config dictionary. 
If we don’t populate the dictionary somehow it will be empty, so if you were to run the workflow now it would trigger a KeyError (try running snakemake -s snakefile_mrsa.smk -n to see for yourself). In order to populate the config dictionary with data for the workflow we could use the snakemake --config KEY=VALUE syntax directly from the command line (e.g. snakemake --config max_reads=20000 -s snakefile_mrsa.smk). However, from a reproducibility perspective, it’s not optimal to set parameters from the command line, since it’s difficult to keep track of which parameter values were used.\nA much better alternative is to use the --configfile FILE option to supply a configuration file to Snakemake. In this file we can collect all the project-specific settings, sample ids and so on. This also enables us to write the Snakefile in a more general manner so that it can be better reused between projects. Like several other files used in these tutorials, this file should be in YAML format. Create the file below and save it as config.yml.\nmax_reads: 25000\nIf we now run Snakemake with --configfile config.yml, it will parse this file to form the config dictionary. If you want to overwrite a parameter value, e.g. for testing, you can still use the --config KEY=VALUE flag, as in --config max_reads=1000.\n\n\n\n\n\n\nTip\n\n\n\nRather than supplying the config file from the command line you could also add the line configfile: \"config.yml\" to the top of your Snakefile. Keep in mind that with such a setup Snakemake will complain if the file config.yml is not present.\n\n\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nHow to set parameter values with the params directive.\nHow to run Snakemake with the config variable and with a configuration file." + }, + { + "objectID": "pages/snakemake.html#logs", + "href": "pages/snakemake.html#logs", + "title": "Working with Snakemake", + "section": "6 Logs", + "text": "6 Logs\nAs you probably noticed it was difficult to follow how the workflow progressed since some rules printed a lot of output to the terminal. In some cases this also contained important information, such as statistics on the sequence alignments or genome indexing. This could be valuable for example if you later in the project get weird results and want to debug. It’s also important from a reproducibility perspective that the “paper trail” describing how the outputs were generated is saved. Luckily, Snakemake has a feature that can help with this. Just as we define input and output in a rule we can also define log.\nrule some_rule:\n output:\n \"...\"\n input:\n \"...\"\n log:\n \"...\"\n shell:\n \"\"\"\n echo 'Converting {input} to {output}' > {log}\n \"\"\"\nA log file is not different from any other output file, but it’s dealt with a little differently by Snakemake. For example, it’s shown in the file summary when using -D and, unlike other output files, it’s not deleted if jobs fail, which of course is necessary for debugging purposes. It’s also a good way to clarify the purpose of the file. 
We probably don’t need to save logs for all the rules, only the ones with interesting output.\n\nget_genome_fasta and get_genome_gff3 would be good to log since they are dependent on downloading files from an external server.\nmultiqc aggregates quality control data for all the samples into one html report, and the log contains information about which samples were aggregated.\nindex_genome outputs some statistics about the genome indexing.\nalign_to_genome outputs important statistics about the alignments. This is probably the most important log to save.\n\nNow add a log file to some or all of the rules above. A good place to save them to would be results/logs/rule_name/. To avoid having multiple jobs write to the same files, Snakemake requires that all output and log files contain the same wildcards, so be sure to include any wildcards used in the rule in the log name as well, e.g. {some_wildcard}.log.\nYou also have to specify in the shell section of each rule what you want the log to contain. Some of the programs we use send their log information to standard out, some to standard error and some let us specify a log file via a flag.\nFor example, in the align_to_genome rule, it could look like this (Bowtie2 writes log info to standard error):\nrule align_to_genome:\n \"\"\"\n Align a fastq file to a genome index using Bowtie 2.\n \"\"\"\n output:\n \"results/bam/{sample_id,\\w+}.bam\"\n input:\n \"data/{sample_id}.fastq.gz\",\n \"results/bowtie2/NCTC8325.1.bt2\",\n \"results/bowtie2/NCTC8325.2.bt2\",\n \"results/bowtie2/NCTC8325.3.bt2\",\n \"results/bowtie2/NCTC8325.4.bt2\",\n \"results/bowtie2/NCTC8325.rev.1.bt2\",\n \"results/bowtie2/NCTC8325.rev.2.bt2\"\n log:\n \"results/logs/align_to_genome/{sample_id}.log\"\n shell:\n \"\"\"\n bowtie2 -x results/bowtie2/NCTC8325 -U {input[0]} > {output} 2>{log}\n \"\"\"\nTo save some time you can use the info below.\n# wget has a -o flag for specifying the log file\nwget remote_file -O output_file -o {log}\n\n# MultiQC and featureCounts write to standard error so we redirect with \"2>\"\nmultiqc -n output_file input_files 2> {log}\nfeatureCounts -t gene -g gene_id -a gff_file -o output_file input_files 2>{log}\n\n# Bowtie2-build redirects to standard out so we use \">\"\nbowtie2-build input_file index_dir > {log}\nNow rerun the whole workflow. Do the logs contain what they should? Note how much easier it is to follow the progression of the workflow when the rules write to logs instead of to the terminal.\n\n\n\n\n\n\nTip\n\n\n\nIf you have a rule with a shell directive in which several commands are run and you want to save stdout and stderr for all commands into the same log file you can add exec &>{log} as the first line of the shell directive.\n\n\nIf you run with -D (or -S for a simpler version) you will see that the summary table now also contains the log file for each of the files in the workflow.\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nHow to redirect output to log files with the log directive." + }, + { + "objectID": "pages/snakemake.html#temporary-files", + "href": "pages/snakemake.html#temporary-files", + "title": "Working with Snakemake", + "section": "7 Temporary files", + "text": "7 Temporary files\nIt’s not uncommon that workflows contain temporary files that should be kept for some time and then deleted once they are no longer needed. A typical case could be that some operation generates a file, which is then compressed to save space or indexed to make searching faster. 
There is then no need to save the original output file. Take a look at the job graph for our workflow again. The output from align_to_genome is a BAM file, which contains information about all the reads for a sample and where they map in the genome. For downstream processing we need this file to be sorted by genome coordinates. This is what the rule sort_bam is for. We therefore end up with both results/bam/{sample_id}.bam and results/bam/{sample_id}.sorted.bam.\nIn Snakemake we can mark an output file as temporary like this:\noutput: temp(\"...\")\nThe file will then be deleted as soon as all jobs where it’s an input have finished. Now do this for the output of align_to_genome. We have to rerun the rule for it to trigger, so use -R align_to_genome. It should look something like this:\n.\n.\nrule sort_bam:\n input: results/bam/SRR935090.bam\n output: results/bam/SRR935090.sorted.bam\n jobid: 2\n wildcards: sample_id=SRR935090\n\nRemoving temporary output file results/bam/SRR935090.bam.\nFinished job 2.\n.\n.\n\n\n\n\n\n\nTip\n\n\n\nSometimes you may want to trigger removal of temporary files without actually rerunning the jobs. You can then use the --delete-temp-output flag. In some cases you may instead want to run only parts of a workflow and therefore want to prevent files marked as temporary from being deleted (because the files are needed for other parts of the workflow). In such cases you can use the --notemp flag.\n\n\nSnakemake has a number of options for marking files:\n\ntemp(\"...\"): The output file should be deleted once it’s no longer needed by any rules.\nprotected(\"...\"): The output file should be write-protected. Typically used to protect files that require a huge amount of computational resources from being accidentally deleted.\nancient(\"...\"): The timestamp of the input file is ignored and it’s always assumed to be older than any of the output files.\ntouch(\"...\"): The output file should be “touched”, i.e. created or updated, when the rule has finished. Typically used as “flag files” to enforce some rule execution order without real file dependencies.\ndirectory(\"...\"): The output is a directory rather than a file.\n\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nHow to mark an output file as temporary for automatic removal." + }, + { + "objectID": "pages/snakemake.html#targets", + "href": "pages/snakemake.html#targets", + "title": "Working with Snakemake", + "section": "8 Targets", + "text": "8 Targets\nWe’ve mentioned that Snakemake rules take either strings or a list of strings as input, and that we can use any Python expression in Snakemake workflows. Here we’ll show how these features help us condense the code of rules.\nConsider the rule align_to_genome below.\nrule align_to_genome:\n \"\"\"\n Align a fastq file to a genome index using Bowtie 2.\n \"\"\"\n output:\n \"results/bam/{sample_id}.bam\"\n input:\n \"data/{sample_id}.fastq.gz\",\n \"results/bowtie2/NCTC8325.1.bt2\",\n \"results/bowtie2/NCTC8325.2.bt2\",\n \"results/bowtie2/NCTC8325.3.bt2\",\n \"results/bowtie2/NCTC8325.4.bt2\",\n \"results/bowtie2/NCTC8325.rev.1.bt2\",\n \"results/bowtie2/NCTC8325.rev.2.bt2\"\n shell:\n \"\"\"\n bowtie2 -x results/bowtie2/NCTC8325 -U {input[0]} > {output}\n \"\"\"\nHere we have seven inputs; the FASTQ file with the reads and six files with similar file names from the Bowtie2 genome indexing. Instead of writing all the filenames we can tidy this up by using a Python expression to generate a list of these files instead. 
If you’re familiar with Python you could do this with list comprehensions like this:\ninput:\n \"data/{sample_id}.fastq.gz\",\n [f\"results/bowtie2/NCTC8325.{substr}.bt2\" for\n substr in [\"1\", \"2\", \"3\", \"4\", \"rev.1\", \"rev.2\"]]\nThis will take the elements of the list of substrings one by one, and insert that element in the place of {substr}. Since this type of aggregating rule is quite common, Snakemake also has a more compact way of achieving the same thing.\ninput:\n \"data/{sample_id}.fastq.gz\",\n expand(\"results/bowtie2/NCTC8325.{substr}.bt2\",\n substr = [\"1\", \"2\", \"3\", \"4\", \"rev.1\", \"rev.2\"])\n\n\n\n\n\n\nCaution\n\n\n\nWhen using expand() like this, substr is not a wildcard because it is resolved to the values explicitly given inside the expand expression.\n\n\nNow change the rules index_genome and align_to_genome to use the expand() expression.\nIn the workflow we decide which samples to run by including the SRR ids in the names of the inputs to the rules multiqc and generate_count_table:\nrule generate_count_table:\n output:\n \"results/tables/counts.tsv\"\n input:\n bams = [\"results/bam/SRR935090.sorted.bam\",\n \"results/bam/SRR935091.sorted.bam\",\n \"results/bam/SRR935092.sorted.bam\"],\n...\nrule multiqc:\n output:\n html = \"results/multiqc/multiqc.html\",\n stats = \"results/multiqc/multiqc_general_stats.txt\"\n input:\n \"results/fastqc/SRR935090_fastqc.zip\",\n \"results/fastqc/SRR935091_fastqc.zip\",\n \"results/fastqc/SRR935092_fastqc.zip\"\nThe output files from these two rules, results/multiqc/multiqc.html and results/tables/counts.tsv, are in turn specified as input to the all rule at the top of the file. Because the first rule is targeted by default when we run Snakemake on the command line (like we mentioned in snakemake-4-the-mrsa-workflow) this is what triggers the rules to run on each of the three samples.\nHowever, this is a potential source of errors since it’s easy to change in one place and forget to change in the other. Because we can use Python code “everywhere” let’s instead define a list of sample ids and put it at the very top of the Snakefile, just before the rule all:\nSAMPLES = [\"SRR935090\", \"SRR935091\", \"SRR935092\"]\nNow use expand() in multiqc and generate_count_table to use SAMPLES for the sample ids. For the multiqc rule it could look like this:\ninput:\n expand(\"results/fastqc/{sample_id}_fastqc.zip\", sample_id = SAMPLES)\nSee if you can update the generate_count_table rule in the same manner!\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nHow to use the expand() expression to create a list with file names, inserting all provided wildcard values."
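As a final note on expand(): if you pass several lists it will generate all combinations of the values, which is good to keep in mind if you add more wildcards later on. A small standalone illustration (the file name pattern here is made up):\nexpand(\"{sample}.{ext}\", sample = [\"a\", \"b\"], ext = [\"txt\", \"log\"])\n# gives four file names, one for each combination of sample and ext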
+ }, + { + "objectID": "pages/snakemake.html#shadow-rules", + "href": "pages/snakemake.html#shadow-rules", + "title": "Working with Snakemake", + "section": "9 Shadow rules", + "text": "9 Shadow rules\nTake a look at the index_genome rule below:\nrule index_genome:\n \"\"\"\n Index a genome using Bowtie 2.\n \"\"\"\n output:\n index = expand(\"results/bowtie2/NCTC8325.{substr}.bt2\",\n substr = [\"1\", \"2\", \"3\", \"4\", \"rev.1\", \"rev.2\"])\n input:\n \"data/NCTC8325.fa.gz\"\n log:\n \"results/logs/index_genome/NCTC8325.log\"\n shell:\n \"\"\"\n # Bowtie2 cannot use .gz, so unzip to a temporary file first\n gunzip -c {input} > tempfile\n bowtie2-build tempfile results/bowtie2/NCTC8325 >{log}\n\n # Remove the temporary file\n rm tempfile\n \"\"\"\nThere is a temporary file here called tempfile which is the uncompressed version of the input, since Bowtie2 cannot use compressed files. There are a number of drawbacks with having files that aren’t explicitly part of the workflow as input/output files to rules:\n\nSnakemake cannot clean up these files if the job fails, as it would do for normal output files.\nIf several jobs are run in parallel there is a risk that they write to tempfile at the same time. This can lead to very scary results.\nSometimes we don’t know the names of all the files that a program can generate. It is, for example, not unusual that programs leave some kind of error log behind if something goes wrong.\n\nAll of these issues can be dealt with by using the shadow option for a rule. The shadow option results in that each execution of the rule is run in an isolated temporary directory (located in .snakemake/shadow/ by default). There are a few options for shadow (for the full list of these options see the Snakemake docs). The most simple is shadow: \"minimal\", which means that the rule is executed in an empty directory that the input files to the rule have been symlinked into. For the rule below, that means that the only file available would be input.txt. The shell commands would generate the files some_other_junk_file and output.txt. Lastly, Snakemake will move the output file (output.txt) to its “real” location and remove the whole shadow directory. We therefore never have to think about manually removing some_other_junk_file.\nrule some_rule:\n output:\n \"output.txt\"\n input:\n \"input.txt\"\n shadow: \"minimal\"\n shell:\n \"\"\"\n touch some_other_junk_file\n cp {input} {output}\n \"\"\"\nTry this out for the rules where we have to “manually” deal with files that aren’t tracked by Snakemake (multiqc, index_genome). Also remove the shell commands that remove temporary files from those rules, as they are no longer needed. Now rerun the workflow and validate that the temporary files don’t show up in your working directory.\n\n\n\n\n\n\nTip\n\n\n\nSome people use the shadow option for almost every rule and some never use it at all. One thing to keep in mind is that it leads to some extra file operations when the outputs are moved to their final location. This is no issue when the shadow directory is on the same disk as the output directory, but if you’re running on a distributed file system and generate very many or very large files it might be worth considering other options (see e.g. the --shadow-prefix flag).\n\n\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nHow to use the shadow option to handle files that are not tracked by Snakemake." 
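For reference, the --shadow-prefix flag mentioned in the tip above is given on the command line, for example like this (the path is just a placeholder for a suitable scratch location on your system):\nsnakemake -s snakefile_mrsa.smk -c 1 --shadow-prefix /path/to/scratch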
+ }, + { + "objectID": "pages/snakemake.html#generalising-workflows", + "href": "pages/snakemake.html#generalising-workflows", + "title": "Working with Snakemake", + "section": "10 Generalising workflows", + "text": "10 Generalising workflows\nIt’s a good idea to separate project-specific parameters from the actual implementation of the workflow. This allows anyone using the workflow to modify its behaviour without changing the underlying code, making the workflow more general.\nIn order to generalize our RNA-seq analysis workflow we should move all project-specific information to config.yml. This means that we want the config file to:\n\nSpecify which samples to run.\nSpecify which genome to align to and where to download its sequence and annotation files.\n(Contain any other parameters we might need to make it into a general workflow, e.g. to support both paired-end and single-read sequencing)\n\n\n\n\n\n\n\nNote\n\n\n\nPutting all configuration in config.yml will break the generate_rulegraph rule. You can fix it either by replacing --config max_reads=0 with --configfile=config.yml in the shell command of that rule in the Snakefile, or by adding configfile: \"config.yml\" to the top of the Snakefile (as mentioned in a previous tip).\n\n\nThe first point is straightforward; rather than using SAMPLES = [\"...\"] in the Snakefile we define it as a parameter in config.yml. You can either add it as a list similar to the way it was expressed before by adding:\nSAMPLES: [\"SRR935090\", \"SRR935091\", \"SRR935092\"]\nTo config.yml, or you can use this YAML notation (whether you choose SAMPLES or sample_ids as the name of the entry doesn’t matter, you will just have to reference the same name in the config dictionary inside the workflow):\nsample_ids:\n - SRR935090\n - SRR935091\n - SRR935092\nChange the workflow to reference config[\"sample_ids\"] (if using the latter example) instead of SAMPLES, as in:\nexpand(\"results/fastqc/{sample_id}_fastqc.zip\",\n sample_id = config[\"sample_ids\"])\nRemove the line with SAMPLES = [\"SRR935090\", \"SRR935091\", \"SRR935092\"] that we added to the top of snakefile_mrsa.smk in Snakemake 8: Targets.\nDo a dry-run afterwards to make sure that everything works as expected.\nYou may remember from the snakemake-5-parameters part of this tutorial that we’re using a function to return the URL of the FASTQ files to download for each sample:\ndef get_sample_url(wildcards):\n samples = {\n \"SRR935090\": \"https://figshare.scilifelab.se/ndownloader/files/39539767\",\n \"SRR935091\": \"https://figshare.scilifelab.se/ndownloader/files/39539770\",\n \"SRR935092\": \"https://figshare.scilifelab.se/ndownloader/files/39539773\"\n }\n return samples[wildcards.sample_id]\nHere the URLs of each sample_id is hard-coded in the samples dictionary inside the function. To generalize this function we can move the definition to the config file, placing it for example under an entry that we call sample_urls like this:\nsample_urls:\n SRR935090: \"https://figshare.scilifelab.se/ndownloader/files/39539767\"\n SRR935091: \"https://figshare.scilifelab.se/ndownloader/files/39539770\"\n SRR935092: \"https://figshare.scilifelab.se/ndownloader/files/39539773\"\nThis is what’s called ‘nested’ key/value pairs, meaning that each sample_id -> URL pair becomes nested under the config key sample_urls. So in order to access the URL of e.g. SRR935090 we would use config[\"sample_urls\"][\"SRR935090\"]. 
This means that you will have to update the get_sample_url function to:\ndef get_sample_url(wildcards):\n return config[\"sample_urls\"][wildcards.sample_id]\nNow the function uses the global config dictionary to return URLs for each sample_id. Again, do a dry-run to see that the new implementation works.\n\n\n\n\n\n\nTip\n\n\n\nIf you were to scale up this workflow with more samples it could become impractical to have to define the URLs by hand in the config file. A tip then is to have a separate file where samples are listed in one column and the URLs (or file paths) in another column. With a few lines of python code you could then read that list at the start of the workflow and add each sample to the config dictionary.\n\n\nNow let’s take a look at the genome reference used in the workflow. In the get_genome_fasta and get_genome_gff3 rules we have hard-coded FTP paths to the FASTA and GFF3 annotation files for the genome NCTC8325. We can generalize this in a similar fashion to what we did with the get_SRA_by_accession rule. Let’s add a nested entry called genomes to the config file that will hold the genome id and FTP paths to the FASTA and GFF file:\ngenomes:\n NCTC8325:\n fasta: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz\n gff3: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gff3/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.37.gff3.gz\n ST398:\n fasta: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection//staphylococcus_aureus_subsp_aureus_st398/dna/Staphylococcus_aureus_subsp_aureus_st398.ASM958v1.dna.toplevel.fa.gz\n gff3: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/gff3/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_st398//Staphylococcus_aureus_subsp_aureus_st398.ASM958v1.37.gff3.gz\nAs you can see this is very similar to what we did with sample_urls, just that we have one more nested level. Now to access the FTP path to the FASTA file for genome id NCTC8325 we can use config[\"genomes\"][\"NCTC8325\"][\"fasta\"].\nLet’s now look at how to do the mapping from genome id to FASTA path in the rule get_genome_fasta. This is how the rule currently looks (if you have added the log section as previously described).\nrule get_genome_fasta:\n \"\"\"\n Retrieve the sequence in fasta format for a genome.\n \"\"\"\n output:\n \"data/ref/NCTC8325.fa.gz\"\n log:\n \"results/logs/get_genome_fasta/NCTC8325.log\"\n shell:\n \"\"\"\n wget -o {log} ftp://ftp.ensemblgenomes.org/pub/bacteria/release-37/fasta/bacteria_18_collection/staphylococcus_aureus_subsp_aureus_nctc_8325/dna//Staphylococcus_aureus_subsp_aureus_nctc_8325.ASM1342v1.dna_rm.toplevel.fa.gz -O {output}\n \"\"\"\nWe don’t want the hard-coded genome id NCTC8325, so replace that with a wildcard, say {genome_id} (remember to add the wildcard to the log: directive as well). We now need to supply the remote paths to the FASTA file for a given genome id. 
Because we’ve added this information to the config file we just need to pass it to the rule in some way, and just like in the get_SRA_by_accession rule we’ll use a function to do the job:\ndef get_fasta_path(wildcards):\n return config[\"genomes\"][wildcards.genome_id][\"fasta\"]\n\nrule get_genome_fasta:\n \"\"\"\n Retrieve the sequence in fasta format for a genome.\n \"\"\"\n output:\n \"data/ref/{genome_id}.fa.gz\"\n log:\n \"results/logs/get_genome_fasta/{genome_id}.log\"\n params:\n fasta_path = get_fasta_path\n shell:\n \"\"\"\n wget -o {log} {params.fasta_path} -O {output}\n \"\"\"\nNow change the get_genome_gff3 rule in a similar manner. Click to see the solution below if you’re having trouble.\n\n\n\n\n\n\nClick to show\n\n\n\n\n\ndef get_gff_path(wildcards):\n return config[\"genomes\"][wildcards.genome_id][\"gff3\"]\n\nrule get_genome_gff3:\n \"\"\"\n Retrieve annotation in gff3 format for a genome.\n \"\"\"\n output:\n \"data/ref/{genome_id}.gff3.gz\"\n log:\n \"results/logs/get_genome_gff3/{genome_id}.log\"\n params:\n gff3_path = get_gff_path\n shell:\n \"\"\"\n wget -o {log} {params.gff3_path} -O {output}\n \"\"\"\n\n\n\nAlso change index_genome to use a wildcard rather than a hard-coded genome id. Here you will run into a complication if you have followed the previous instructions and use the expand() expression. We want the list to expand to [\"results/bowtie2/{genome_id}.1.bt2\", \"results/bowtie2/{genome_id}.2.bt2\", ...], i.e. only expanding the wildcard referring to the Bowtie2 index. To keep the genome_id wildcard from being expanded we have to “mask” it with double curly brackets: {{genome_id}}. In addition, we need to replace the hard-coded results/bowtie2/NCTC8325 in the shell directive of the rule with the genome id wildcard. Inside the shell directive the wildcard object is accessed with this syntax: {wildcards.genome_id}, so the Bowtie2-build command should be:\nbowtie2-build tempfile results/bowtie2/{wildcards.genome_id} > {log}\nNote that this will only work if the {genome_id} wildcard can be resolved to something defined in the config (currently NCTC8325 or ST398). If you try to generate a FASTA file for a genome id not defined in the config Snakemake will complain, even at the dry-run stage.\nFinally, remember that any wildcards need to be present both in the output: and log: directives? This means we have to update the log: directive in index_genome as well. The final rule should look like this:\nrule index_genome:\n \"\"\"\n Index a genome using Bowtie 2.\n \"\"\"\n output:\n expand(\"results/bowtie2/{{genome_id}}.{substr}.bt2\",\n substr = [\"1\", \"2\", \"3\", \"4\", \"rev.1\", \"rev.2\"])\n input:\n \"data/ref/{genome_id}.fa.gz\"\n log:\n \"results/logs/index_genome/{genome_id}.log\"\n shadow: \"minimal\"\n shell:\n \"\"\"\n # Bowtie2 cannot use .gz, so unzip to a temporary file first\n gunzip -c {input} > tempfile\n bowtie2-build tempfile results/bowtie2/{wildcards.genome_id} > {log}\n \"\"\"\nGood job! The rules get_genome_fasta, get_genome_gff3 and index_genome can now download and index any genome as long as we provide valid links in the config file.\nHowever, we need to define somewhere which genome id we actually want to use when running the workflow. This needs to be done both in align_to_genome and generate_count_table. Do this by introducing a parameter in config.yml called \"genome_id\" (you can set it to either NCTC8325 or ST398), e.g.:\ngenome_id: \"NCTC8325\"\nNow we can resolve the genome_id wildcard from the config. 
See below for an example for align_to_genome. Here the substr wildcard gets expanded from a list while genome_id gets expanded from the config file.\ninput:\n \"data/{sample_id}.fastq.gz\",\n index = expand(\"results/bowtie2/{genome_id}.{substr}.bt2\",\n genome_id = config[\"genome_id\"],\n substr = [\"1\", \"2\", \"3\", \"4\", \"rev.1\", \"rev.2\"])\nAlso change the hard-coded genome id in the generate_count_table input in a similar manner:\nrule generate_count_table:\n \"\"\"\n Generate a count table using featureCounts.\n \"\"\"\n output:\n \"results/tables/counts.tsv\",\n \"results/tables/counts.tsv.summary\"\n input:\n bams=expand(\"results/bam/{sample_id}.sorted.bam\",\n sample_id = config[\"sample_ids\"]),\n annotation=expand(\"data/ref/{genome_id}.gff3.gz\",\n genome_id = config[\"genome_id\"])\n log:\n \"results/logs/generate_count_table.log\"\n shell:\n \"\"\"\n featureCounts -t gene -g gene_id -a {input.annotation} -o {output[0]} {input.bams} 2>{log}\n \"\"\"\nIn general, we want the rules as far downstream as possible in the workflow to be the ones that determine what the wildcards should resolve to. In our case this is align_to_genome and generate_count_table. You can think of it like the rule that really “needs” the file asks for it, and then it’s up to Snakemake to determine how it can use all the available rules to generate it. Here the align_to_genome rule says “I need this genome index to align my sample to” and then it’s up to Snakemake to determine how to download and build the index.\nOne last thing is to change the hard-coded NCTC8325 in the shell: directive of align_to_genome. Bowtie2 expects the index name supplied with the -x flag to be without the “.*.bt2” suffix so we can’t use -x {input.index}. Instead we’ll insert the genome_id directly from the config like this:\nshell:\n \"\"\"\n bowtie2 -x results/bowtie2/{config[genome_id]} -U {input[0]} > {output} 2>{log}\n \"\"\"\n\n\n\n\n\n\nSummary\n\n\n\nWell done! You now have a complete Snakemake workflow with a number of excellent features:\n\nA general RNA-seq pipeline which can easily be reused between projects, thanks to clear separation between code and settings.\nGreat traceability due to logs and summary tables.\nClearly defined the environment for the workflow using Conda.\nThe workflow is neat and free from temporary files due to using temp() and shadow.\nA logical directory structure which makes it easy to separate data and results of different software packages.\nA project set up in a way that makes it very easy to distribute and reproduce either via Git, Snakemake’s --archive option or a Docker image." + }, + { + "objectID": "pages/snakemake.html#reading-samples-from-a-file-instead-of-hard-coding-them", + "href": "pages/snakemake.html#reading-samples-from-a-file-instead-of-hard-coding-them", + "title": "Working with Snakemake", + "section": "11 Reading samples from a file instead of hard-coding them", + "text": "11 Reading samples from a file instead of hard-coding them\nSo far we’ve specified the samples to use in the workflow either as a hard-coded list in the Snakefile, or as a list in the configuration file. This is of course impractical for large real-world examples. Here we’ll just quickly show how you could supply the samples instead via a tab-separated file. 
For example you could create a file called samples.tsv with the following content:\nSRR935090 https://figshare.scilifelab.se/ndownloader/files/39539767\nSRR935091 https://figshare.scilifelab.se/ndownloader/files/39539770\nSRR935092 https://figshare.scilifelab.se/ndownloader/files/39539773\nThe first column has the sample id and the second column has the url to the fastq file. Now in order to read this into the workflow we need to use a few lines of python code. Since you can mix python code with rule definitions in Snakemake we’ll just add the following lines to the top of the Snakefile:\n# define an empty 'samples' dictionary\nsamples = {}\n# read the sample list file and populate the dictionary\nwith open(\"samples.tsv\", \"r\") as fhin:\n for line in fhin:\n # strip the newline character from the end of the line\n # then split by tab character to get the sample id and url\n sample_id, url = line.strip().split(\"\\t\")\n # store the url in the dictionary with the sample id as key\n samples[sample_id] = url\nNow we can use the samples dictionary in the workflow. For example, to get the url for SRR935090 we can use samples[\"SRR935090\"].\nFor example, the get_sample_url function can now be written as:\ndef get_sample_url(wildcards):\n return samples[wildcards.sample_id]\nWe can also use the samples dictionary in expand(), for example in the multiqc rule:\nrule multiqc:\n \"\"\"\n Aggregate all FastQC reports into a MultiQC report.\n \"\"\"\n output:\n html=\"results/multiqc/multiqc.html\",\n stats=\"results/multiqc/multiqc_general_stats.txt\"\n input:\n expand(\"results/fastqc/{sample_id}_fastqc.zip\", sample_id = samples.keys())\n log:\n \"results/logs/multiqc/multiqc.log\"\n shadow: \"minimal\"\n shell:\n \"\"\"\n # Run multiQC and keep the html report\n multiqc -n multiqc.html {input} 2> {log}\n mv multiqc.html {output.html}\n mv multiqc_data/multiqc_general_stats.txt {output.stats}\n \"\"\"\nNow this depends on there being a samples.tsv file in the working directory. To make this a configurable parameter we can add it to the config file:\nsample_list: \"samples.tsv\"\nand update the code for populating the samples dictionary:\n# define an empty 'samples' dictionary\nsamples = {}\n# read the sample list file and populate the dictionary\nwith open(config[\"sample_list\"], \"r\") as fhin:\n for line in fhin:\n # strip the newline character from the end of the line\n # then split by tab character to get the sample id and url\n sample_id, url = line.strip().split(\"\\t\")\n # store the url in the dictionary with the sample id as key\n samples[sample_id] = url\nThis way, anyone can take our Snakefile and just update the path to their own sample_list using the config file.\n\n\n\n\n\n\nQuick recap\n\n\n\nIn this section we’ve learned:\n\nHow to generalize a Snakemake workflow." + }, + { + "objectID": "pages/snakemake.html#extra-material", + "href": "pages/snakemake.html#extra-material", + "title": "Working with Snakemake", + "section": "12 Extra material", + "text": "12 Extra material\nIf you want to read more about Snakemake in general you can find several resources here:\n\nThe Snakemake documentation is available on ReadTheDocs.\nHere is another (quite in-depth) tutorial.\nIf you have questions, check out stack overflow.\n\n\n12.1 Using containers in Snakemake\nSnakemake also supports defining an Apptainer or Docker container for each rule (you will have time to work on the Containers tutorial later during the course). 
Analogous to using a rule-specific Conda environment, specify container: \"docker://some-account/rule-specific-image\" in the rule definition. Instead of a link to a container image, it is also possible to provide the path to a *.sif file (= a Singularity image file). When executing Snakemake, add the --software-deployment-method apptainer (or the shorthand --sdm apptainer) flag to the command line. For the given rule, an Apptainer container will then be created from the image or file that is provided in the rule definition on the fly by Snakemake and the rule will be run in this container.\nYou can find pre-made Apptainer or Docker images for many tools on https://biocontainers.pro/ (bioinformatics-specific) or on https://hub.docker.com/.\nHere is an example for a rule and its execution:\nrule align_to_genome:\n output:\n temp(\"results/bam/{sample_id,\\w+}.bam\")\n input:\n fastq = \"data/{sample_id}.fastq.gz\",\n index = expand(\"results/bowtie2/{genome_id}.{substr}.bt2\",\n genome_id=config[\"genome_id\"],\n substr=[\"1\", \"2\", \"3\", \"4\", \"rev.1\", \"rev.2\"])\n log:\n expand(\"results/logs/align_to_genome/{{sample_id}}_{genome_id}.log\",\n genome_id = config[\"genome_id\"])\n container: \"docker://quay.io/biocontainers/bowtie2:2.5.0--py310h8d7afc0_0\"\n shell:\n \"\"\"\n bowtie2 -x results/bowtie2/{config[genome_id]} -U {input.fastq} > {output} 2>{log}\n \"\"\"\nStart your Snakemake workflow with the following command:\nsnakemake --software-deployment-method apptainer\nFeel free to modify the MRSA workflow according to this example. As Apptainer is a container software that was developed for HPC clusters, and for example the Mac version is still a beta version, it might not work to run your updated Snakemake workflow with Apptainer locally on your computer. In the next section we explain how you can run Snakemake workflows on UPPMAX where Apptainer is pre-installed.\n\n\n12.2 Running Snakemake workflows on HPC clusters\nIf you need to run a Snakemake workflow on a high-performance computing (HPC) cluster you have a wide range of options at your disposal. Via the plugin catalog you can find plugins that will add support for various HPC schedulers to Snakemake.\nHere we will focus on how to run Snakemake workflows on clusters with SLURM, a workload manager commonly used on HPC clusters in Sweden such as Rackham, Tetralith and Dardel.\n\n\n\n\n\n\nTip\n\n\n\nWhen running on remote clusters we highly recommend to use a session manager like tmux or screen so that you can run your workflow in a session in the background while doing other things on the cluster or even logging out of the cluster.\n\n\n\n12.2.1 Option 1: Run the entire workflow as a single job\nFor short workflows with only a few rules that need the same compute resources in terms of CPU (cores) and memory, you can submit the entire workflow as a job directly to the SLURM scheduler, or start an interactive job (in your tmux or screen session) and run your Snakemake workflow as you would do that on your local machine. Make sure to give your job enough time to finish running all rules of your Snakemake workflow.\nIf you choose this option, you don’t need to install anything from the plugin catalogue. 
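As a rough sketch, a batch script for this could look something like the following (the account, partition, time and core values are placeholders that you need to adapt to your cluster and project):\n#!/bin/bash\n#SBATCH -A naiss-2023-01-001\n#SBATCH -p shared\n#SBATCH -t 04:00:00\n#SBATCH -c 4\n\n# activate the environment with your workflow tools first if needed, then\n# run the entire workflow inside this single SLURM job\nsnakemake -s snakefile_mrsa.smk -c 4\nYou would then submit the script with sbatch, or run the same snakemake command directly in an interactive job.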
However, your workflow may not run as efficiently as it could if you were to add SLURM support in Snakemake.\n\n\n12.2.2 Option 2: Use built-in SLURM support\nFor workflows with long run times and/or where each rule requires different compute resources, Snakemake comes with built-in functionality for interacting with the SLURM workload manager, sending each rule as a job to the SLURM queue and tracking the status of each job.\nIn this case, you can start the workflow on the login node and let it run there until all jobs have finished. Given that workflows often consist of many rules, some of which may be highly resource demanding, this is the option we recommend when running most Snakemake workflows on HPC clusters.\nTo add SLURM support to Snakemake you first need to install the SLURM plugin from the plugin catalog. This can be done with conda:\nconda install -c conda-forge snakemake-executor-plugin-slurm\nOnce installed, adding the --executor slurm flag to your Snakemake command line call will enable the plugin. You also need to specify how many jobs Snakemake can submit to the SLURM queue at the same time with the -j flag. For example, to allow up to 100 jobs to be put into the queue at any given time, you would run Snakemake with the following command:\nsnakemake --executor slurm -j 100 <other flags>\n\n\n\n12.3 Specifying resources for SLURM\nDepending on the cluster you are using, you will need to specify some resource requirements for the rules in your workflow, such as the number of CPUs, memory, runtime and account id. This can be done either:\n\ndirectly on the command line with the --default-resources flag which sets default resource settings for all rules\nin the rule definition of your workflow using the resources: directive, or\nin a configuration profile, a folder with a config.yaml file that contains the resource settings.\n\nYou can also use a combination of these methods. For example, the SLURM account id (e.g. naiss-2023-01-001), which will most likely be the same for all rules, can be set with --default-resources:\nsnakemake --executor slurm -j 100 --default-resources slurm_account=naiss-2023-01-001\nRule-specific resources such as runtime, memory and number of CPUs can be set in the rule definition, for example:\nrule testrule:\n output:\n \"results/output.txt\"\n resources:\n runtime = 60,\n mem_mb = 16000,\n cpus_per_task = 4\n shell:\n \"\"\"\n uname -a > {output}\n \"\"\"\nThis rule uses the standard resource runtime to set the maximum allowed time (in minutes) for the rule, sets the memory requirement with mem_mb and the number of requested CPUs with cpus_per_task. In this example the rule will have a time limit of 60 minutes, will require 16G of RAM and 4 CPUs.\nSome clusters also require you to specify the partition you want to run your job on. The partition name will differ between clusters, for example the Rackham cluster uses core and node partitions, while Dardel uses e.g. shared and main. See the documentation for the cluster you are using for more information.\nThe partition can be set with the slurm_partition resource, for example like so:\nrule testrule:\n output:\n \"results/output.txt\"\n resources:\n runtime = 60,\n mem_mb = 16000,\n cpus_per_task = 4,\n slurm_partition = \"shared\"\n shell:\n \"\"\"\n uname -a > {output}\n \"\"\"\nTo make it easy to adapt your workflow to different compute clusters it is recommended to define resource settings in a configuration profile. 
A configuration profile is a folder with a config.yaml file that contains values for Snakemake command line arguments, allowing you to modify the behavior of Snakemake without changing the workflow code. For example, you could create a dardel folder (e.g. in the root of your workflow) with a config.yaml file that contains the following:\nexecutor: \"slurm\"\njobs: 100\ndefault-resources:\n slurm_account: \"naiss-2023-01-001\"\n slurm_partition: \"shared\"\n mem_mb: 16000\n cpus_per_task: 4\n runtime: 60\nThis YAML-formatted file contains Snakemake command line arguments that will be used when running the workflow. You can then run Snakemake with the --profile flag pointing to the folder containing the config.yaml file:\nsnakemake --profile dardel\nThis greatly simplifies running the workflow on different clusters, and makes the command line call much more succinct.\nTo set rule-specific resources in the configuration profile, you can add a set-resources: section to the config.yaml file:\nexecutor: \"slurm\"\njobs: 100\ndefault-resources:\n slurm_account: \"naiss-2023-01-001\"\n slurm_partition: \"shared\"\n mem_mb: 16000\n cpus_per_task: 4\n runtime: 60\nset-resources:\n index_genome:\n runtime: 240\n mem_mb: 32000\n cpus_per_task: 8\n align_to_genome:\n runtime: 120\n mem_mb: 24000\n cpus_per_task: 6\nIn this example, the index_genome rule will have a runtime of 240 minutes, will require 32G of RAM and 8 CPUs, while the align_to_genome rule will have a runtime of 120 minutes, will require 24G of RAM and 6 CPUs. Both rules will use the slurm_account and slurm_partition settings from the default-resources section, unless overridden in the rule-specific settings.\nYou can still define resources in the rule definition, but the values in the configuration profile will take precedence.\nNow, when you run your Snakemake workflow with:\nsnakemake --profile dardel\nSnakemake will submit each job to the SLURM queue and inform you about both the local jobid and the SLURM jobid by writing something similar to this to your terminal:\nJob 0 has been submitted with SLURM jobid 37099380 (log: .snakemake/slurm_logs/rule_name/37099380.log).\nIn this example the log output from the job will be in .snakemake/slurm_logs/rule_name/37099380.log.\nYou can read more details about running Snakemake on compute clusters in the Snakemake docs." 
} ] \ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml index 3875e2ea..415d77cc 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -4,10 +4,22 @@ https://nbisweden.github.io/workshop-reproducible-research/pages/take-down.html 2024-10-04T16:44:47.576Z + + https://nbisweden.github.io/workshop-reproducible-research/pages/quarto.html + 2024-10-15T14:34:31.985Z + + + https://nbisweden.github.io/workshop-reproducible-research/pages/markdown.html + 2024-10-15T06:49:44.656Z + https://nbisweden.github.io/workshop-reproducible-research/pages/introduction.html 2024-10-04T16:44:47.568Z + + https://nbisweden.github.io/workshop-reproducible-research/pages/containers.html + 2024-10-15T06:55:22.700Z + https://nbisweden.github.io/workshop-reproducible-research/lectures/snakemake/snakemake.html 2024-10-04T16:44:47.455Z @@ -68,12 +80,24 @@ https://nbisweden.github.io/workshop-reproducible-research/lectures/quarto/quarto.html 2024-10-04T16:44:47.451Z + + https://nbisweden.github.io/workshop-reproducible-research/pages/conda.html + 2024-10-15T05:51:51.919Z + https://nbisweden.github.io/workshop-reproducible-research/pages/git.html 2024-10-04T16:44:47.528Z - https://nbisweden.github.io/workshop-reproducible-research/pages/markdown.html - 2024-10-04T16:44:47.570Z + https://nbisweden.github.io/workshop-reproducible-research/pages/jupyter.html + 2024-10-15T14:15:07.462Z + + + https://nbisweden.github.io/workshop-reproducible-research/pages/nextflow.html + 2024-10-15T08:14:28.350Z + + + https://nbisweden.github.io/workshop-reproducible-research/pages/snakemake.html + 2024-10-15T07:42:52.382Z