
Basic usage and practical information

Adelme Bazin edited this page Oct 30, 2019 · 8 revisions

The 'workflow' subcommand

We created the 'workflow' subcommand to make PPanGGOLiN relatively easy to use. It runs a pangenome analysis whose exact steps depend on the input files you provide. It produces files and figures that describe the pangenome of your taxonomic group of interest in different ways.

The minimal command is as follows:

ppanggolin workflow --fasta ORGANISMS_FASTA_LIST

It uses parameters that we found to be generally the best when working with species pangenomes.

The file ORGANISMS_FASTA_LIST is a tab-separated (TSV) file in which each line represents one organism, with the following columns:

  1. The first column contains a unique organism name
  2. The second column contains the path to the associated FASTA file
  3. Any following columns contain the identifiers of circular contigs

An example with 50 Chlamydia trachomatis genomes can be found in the testingDataset/ directory.
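
The column layout described above can be produced with a short script. The following is a minimal sketch, not part of PPanGGOLiN: the function name `write_organism_list`, the accepted file extensions, and the choice to use each file's base name as the organism name are all assumptions made for illustration.

```python
from pathlib import Path

# Assumed FASTA extensions; adjust to match your own files.
FASTA_SUFFIXES = {".fasta", ".fa", ".fna"}


def write_organism_list(fasta_dir, out_path, circular=None):
    """Write one tab-separated line per FASTA file found in fasta_dir.

    Columns: organism name, path to the FASTA file, then any circular
    contig identifiers given for that organism in `circular`
    (a dict mapping organism name -> list of contig identifiers).
    """
    circular = circular or {}
    rows = []
    for fasta in sorted(Path(fasta_dir).iterdir()):
        if fasta.suffix.lower() not in FASTA_SUFFIXES:
            continue
        # Assumption: the file name without its extension is the organism name.
        name = fasta.stem
        rows.append([name, str(fasta), *circular.get(name, [])])

    # The first column must hold a unique organism name.
    names = [row[0] for row in rows]
    if len(names) != len(set(names)):
        raise ValueError("organism names must be unique")

    with open(out_path, "w") as handle:
        for row in rows:
            handle.write("\t".join(row) + "\n")
    return rows
```

For instance, `write_organism_list("genomes/", "ORGANISMS_FASTA_LIST", circular={"orgA": ["contig1"]})` would write one line per FASTA file in a hypothetical `genomes/` directory and mark `contig1` as circular for `orgA`.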

You can also give PPanGGOLiN your own annotations, such as the ones produced by prokka, by providing .gff or .gbff/.gbk files instead of .fasta files. To do so, use the following command:

ppanggolin workflow --anno ORGANISMS_ANNOTATION_LIST

An example of such an annotation list file can also be found in the testingDataset/ directory.

Required computing resources

Most of PPanGGOLiN's commands should be run with as many CPUs as you can give them, using the --cpu option, as PPanGGOLiN's speed scales relatively well with the number of CPUs. While the 'smallest' pangenomes (up to a few hundred genomes) can easily be analysed on a normal desktop computer, the biggest ones will require a good amount of RAM (at the time of writing, the biggest pangenome we had built contained 20,656 genomes and required over 120 GB of RAM).

Usage and basic options

As with most bioinformatics programs, you can always specify some utility options.

You can specify the number of CPUs to use with the --cpu option. This is recommended, as the default is to use just one.

You can specify the output directory (if not provided, one will be generated) using the option --output.

If you work in an environment that has little or no available disk space in the '/tmp' (or equivalent) directory, you can specify another temporary directory using the --tmp option.

If you want to redo an analysis from scratch and store it in a directory that already exists, you will have to use the --force option. Be aware, however, that any file in that directory with the same name as an output file written by ppanggolin will be overwritten.
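
To show how these options combine, here is a minimal sketch that assembles a 'workflow' command line from Python. The helper `build_workflow_command` is hypothetical and only uses the options named on this page (--fasta, --cpu, --output, --tmp, --force). It assumes --force is a flag that takes no value, and it falls back to the number of CPUs reported by the operating system when no CPU count is given.

```python
import os
import shlex


def build_workflow_command(fasta_list, cpu=None, output=None, tmp=None, force=False):
    """Return the argument list for a 'ppanggolin workflow' run (hypothetical helper)."""
    cmd = ["ppanggolin", "workflow", "--fasta", str(fasta_list)]

    # Use every CPU the OS reports unless a count is given explicitly.
    n_cpu = cpu if cpu is not None else (os.cpu_count() or 1)
    cmd += ["--cpu", str(n_cpu)]

    if output is not None:
        cmd += ["--output", str(output)]
    if tmp is not None:
        cmd += ["--tmp", str(tmp)]
    if force:
        # Assumption: --force is a plain flag with no value.
        cmd.append("--force")
    return cmd


# Print the command so it can be inspected before being run.
print(shlex.join(build_workflow_command("ORGANISMS_FASTA_LIST", cpu=8, output="my_pangenome")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once PPanGGOLiN is installed. Building the list separately from running it makes it easy to check the options first, which matters most with --force since existing output files will be overwritten.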