restructure manual, split input into more sections

ablab · Apr 2, 2024 · 8664e13 · 8664e13
1 parent f3b297d
commit 8664e13

Showing 10 changed files with 378 additions and 306 deletions.
diff --git a/docs/datatypes.md b/docs/datatypes.md
@@ -0,0 +1,76 @@
+# Tips on SPAdes parameters
+
+## Assembling IonTorrent reads
+
+Only FASTQ or BAM files are supported as input.
+
+The selection of k-mer length is non-trivial for IonTorrent. If the dataset is more or less conventional (good coverage, moderate or low GC, etc.), then use our [recommendation for long reads](datatypes.md#assembling-long-illumina-paired-reads) (e.g. assemble using k-mer lengths 21,33,55,77,99,127). However, due to the increased error rate, some changes to the k-mer lengths (e.g. selecting shorter ones) may be required. For example, if you ran SPAdes with k-mer lengths 21,33,55,77 and then decided to assemble the same data set using more iterations and larger values of K, you can run SPAdes once again specifying the same output folder and the following options: `--restart-from k77 -k 21,33,55,77,99,127 --mismatch-correction -o <previous_output_dir>`. Do not forget to copy contigs and scaffolds from the previous run. We are planning to tackle the issue of selecting k-mer lengths for IonTorrent reads in future versions.
+
+You may not need error correction at all for data produced with the Hi-Q enzyme. However, we suggest assembling your data both with and without error correction and selecting the better result.
+
+For non-trivial datasets (e.g. with high GC, low or uneven coverage) we suggest enabling single-cell mode (the `--sc` option) and using k-mer lengths of 21,33,55.
+
+## Assembling long Illumina paired reads
+
+Recent advances in DNA sequencing technology have led to a rapid increase in read length. Nowadays, it is a common situation to have a data set consisting of 2x150 or 2x250 paired-end reads produced by Illumina MiSeq or HiSeq2500. However, the use of longer reads alone will not automatically improve assembly quality. An assembler that can properly take advantage of them is needed.
+
+SPAdes' use of iterative k-mer lengths allows it to benefit from the full potential of long paired-end reads. Currently the assembler options have to be set manually, but we plan to incorporate automatic calculation of the necessary options soon.
+
+Please note that in addition to the read length, the insert length also matters a lot. It is not recommended to sequence a 300bp fragment with a pair of 250bp reads. We suggest using 350-500 bp fragments with 2x150 reads and 550-700 bp fragments with 2x250 reads.
+
+### Multi-cell data set with read length 2x150 bp
+
+Do not turn off SPAdes error correction (BayesHammer module), which is included in SPAdes default pipeline.
+
+If you have enough coverage (50x+), then you may want to try to set k-mer lengths of 21, 33, 55, 77 (selected by default for reads with length 150bp).
+
+Make sure you run the assembler with the `--careful` option to minimize the number of mismatches in the final contigs.
+
+We recommend that you check the SPAdes log file at the end of each iteration to monitor the average coverage of the contigs.
+
+For reads corrected prior to running the assembler:
+
+``` bash
+
+    spades.py -k 21,33,55,77 --careful --only-assembler <your reads> -o spades_output
+```
+
+To correct and assemble the reads:
+
+``` bash
+
+    spades.py -k 21,33,55,77 --careful <your reads> -o spades_output
+```
+
+### Multi-cell data set with read lengths 2x250 bp
+
+Do not turn off SPAdes error correction (BayesHammer module), which is included in SPAdes default pipeline.
+
+By default we suggest increasing k-mer lengths in increments of 22 until the k-mer length reaches 127. The exact length of the k-mer depends on the coverage: k-mer length of 127 corresponds to 50x k-mer coverage and higher. For read length 250bp SPAdes automatically chooses K values equal to 21, 33, 55, 77, 99, 127.
+
+Make sure you run the assembler with the `--careful` option to minimize the number of mismatches in the final contigs.
+
+We recommend that you check the SPAdes log file at the end of each iteration to monitor the average coverage of the contigs.
+
+For reads corrected prior to running the assembler:
+
+``` bash
+
+    spades.py -k 21,33,55,77,99,127 --careful --only-assembler <your reads> -o spades_output
+```
+
+To correct and assemble the reads:
+
+``` bash
+
+    spades.py -k 21,33,55,77,99,127 --careful <your reads> -o spades_output
+```
+
+### Single-cell data set with read lengths 2 x 150 or 2 x 250
+
+The default k-mer lengths are recommended. For single-cell data sets SPAdes selects k-mer sizes 21, 33 and 55.
+
+However, it might be tricky to fully utilize the advantages of the long reads you have. Consider contacting us for more information and to discuss an assembly strategy.
+
+
+
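Note on `docs/datatypes.md`: the k-mer recommendations in the sections above can be summarized as a small lookup. The sketch below is illustrative only. `suggest_kmers` is a hypothetical helper (not part of SPAdes), and treating every read length of 250 bp or more like 2x250 and everything shorter like 2x150 is a simplifying assumption. According to the text, SPAdes selects these defaults on its own.

```shell
# Hypothetical helper, not part of SPAdes: maps read length and data type
# to the k-mer lists quoted in docs/datatypes.md.
suggest_kmers() {
    read_len=$1   # read length in bp, e.g. 150 or 250
    mode=$2       # "multicell" or "singlecell"
    if [ "$mode" = "singlecell" ]; then
        echo "21,33,55"                 # single-cell data sets
    elif [ "$read_len" -ge 250 ]; then
        echo "21,33,55,77,99,127"       # multi-cell, 2x250
    else
        echo "21,33,55,77"              # multi-cell, 2x150 (assumed for shorter reads too)
    fi
}

# Print (rather than run) the matching command line from the docs.
echo "spades.py -k $(suggest_kmers 250 multicell) --careful <your reads> -o spades_output"
```

The command is only printed, so the sketch can be tried without SPAdes installed.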
diff --git a/docs/feedback.md b/docs/feedback.md
@@ -1,5 +1,5 @@
 # Feedback and bug reports
 
-Your comments, bug reports, and suggestions are very welcomed. They will help us to further improve SPAdes. If you have any troubles running SPAdes, please send us `params.txt` and `spades.log` from the directory `<output_dir>`.
+Your comments, bug reports, and suggestions are very welcome. They will help us to further improve SPAdes. If you have any trouble running SPAdes, please send us `params.txt` and `spades.log` from the output folder.
 
 You can leave your comments and bug reports at [our GitHub repository tracker](https://github.com/ablab/spades/issues).
diff --git a/docs/hmm.md b/docs/hmm.md
@@ -0,0 +1,21 @@
+# HMM-guided mode
+The majority of SPAdes assembly modes (normal multicell, single-cell, rnaviral, meta and of course biosynthetic) also support the HMM-guided mode as implemented in biosyntheticSPAdes. A detailed description can be found in the [biosyntheticSPAdes paper](https://genome.cshlp.org/content/early/2019/06/03/gr.243477.118), but in short: amino acid profile HMMs are aligned to the edges of the assembly graph. Then the subgraphs containing the set of matches ("domains") are extracted, and all possible paths through the domains that are supported both by paired-end data (via scaffolds) and by graph topology are obtained (putative biosynthetic gene clusters).
+
+HMM-guided mode can be enabled by providing a set of HMMs via the `--custom-hmms` option. In HMM-guided mode the set of contigs and scaffolds (see the [SPAdes output](output.md#spades-output) section for more information) is kept intact, while the additional [biosyntheticSPAdes output](output.md#biosyntheticspades-output) represents the output of the HMM-guided assembly.
+
+Note that the normal biosyntheticSPAdes mode (via the `--bio` option) is a bit different from HMM-guided mode: besides using a special set of profile HMMs representing a family of NRPS/PKS domains, it also includes a set of assembly graph simplification and processing settings aimed at fuller recovery of biosynthetic gene clusters.
+
+## coronaSPAdes mode
+
+Given the increased interest in coronavirus research, we developed a coronavirus assembly mode for the SPAdes assembler (also known as coronaSPAdes). It can assemble full-length Coronaviridae genomes from transcriptomic and metatranscriptomic data. Algorithmically, coronaSPAdes is an rnaviralSPAdes that uses the set of HMMs from the [Pfam SARS-CoV-2 2.0](ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam_SARS-CoV-2_2.0/) set as well as additional HMMs as outlined by [(Phan et al, 2019)](https://doi.org/10.1093/ve/vey035). coronaSPAdes can be run via a dedicated `coronaspades.py` script. See the [coronaSPAdes preprint](https://www.biorxiv.org/content/10.1101/2020.07.28.224584v1) for more information about rnaviralSPAdes, coronaSPAdes and HMM-guided mode. Output for any HMM-related mode (`--bio`, `--corona`, or `--custom-hmms` flags) is the same as the biosyntheticSPAdes output.
+
+
+## wastewaterSPAdes mode
+
+SARS-CoV-2 wastewater samples are extensively collected and studied because they allow quantitative assessment of viral load in surrounding populations. We developed wastewaterSPAdes, which solves the SARS-CoV-2 deconvolution problem using the assembly graph structure.
+To use wastewaterSPAdes, you'll need to:
+
+- Pass the `--sewage` flag to `coronaspades.py`.
+- Provide the SARS-CoV-2 reference genome as trusted contigs.
+
+Results of wastewaterSPAdes are stored in the `lineages.csv` file. The first column contains the strain name, and the second column contains the estimated abundance of this strain in the sample.
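
Note on `lineages.csv`: since the file is described as two columns (strain name, estimated abundance), the most abundant lineages can be listed with standard tools. This is a minimal sketch, assuming the file is comma-separated, has no header row, and stores abundances as plain decimal numbers; none of this is stated in the docs. `top_lineages` is a hypothetical helper, not a SPAdes script.

```shell
# Hypothetical helper: print the N most abundant lineages from lineages.csv.
# Assumes "strain,abundance" rows, comma-separated, with no header line.
top_lineages() {
    csv=$1
    n=${2:-5}     # number of rows to show, 5 if not given
    # Sort by the second column as a number, largest first, keep the top N.
    # LC_ALL=C makes "." the decimal separator regardless of locale.
    LC_ALL=C sort -t, -k2,2nr "$csv" | head -n "$n"
}

# Example call: top_lineages <output_dir>/lineages.csv 3
```

If the real file turns out to have a header row, it would need to be skipped (for instance with `tail -n +2`) before sorting.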
diff --git a/docs/hybrid.md b/docs/hybrid.md
@@ -0,0 +1,18 @@
+# Hybrid assembly
+
+## PacBio and Oxford Nanopore reads
+
+SPAdes can take an unlimited number of PacBio and Oxford Nanopore libraries as input.
+
+PacBio CLR and Oxford Nanopore reads are used for hybrid assemblies (e.g. with Illumina or IonTorrent). There is no need to pre-correct this kind of data. SPAdes will use PacBio CLR and Oxford Nanopore reads for gap closure and repeat resolution.
+
+For PacBio you just need to have filtered subreads in FASTQ/FASTA format. Provide these filtered subreads using the `--pacbio` option. Oxford Nanopore reads are provided with the `--nanopore` option.
+
+PacBio CCS/Reads of Insert reads or pre-corrected (using third-party software) PacBio CLR / Oxford Nanopore reads can be simply provided as single reads to SPAdes.
+
+## Additional contigs
+
+In case you have contigs of the same genome generated by other assembler(s) and you wish to merge them into the SPAdes assembly, you can specify additional contigs using `--trusted-contigs` or `--untrusted-contigs`. The first option is used when high-quality contigs are available. These contigs will be used for graph construction, gap closure and repeat resolution. The second option is used for less reliable contigs that may have more errors, or for contigs of unknown quality. These contigs will be used only for gap closure and repeat resolution. The number of additional contigs is unlimited.
+
+Note that SPAdes does not perform assembly using genomes of closely related species. Only contigs of the same genome should be specified.
+
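Note on `docs/hybrid.md`: the options described above combine with a short-read library roughly as follows. This is a sketch under assumptions: `hybrid_cmd` is a hypothetical helper that only prints a command line, the file names are placeholders, and `-1`/`-2` are the standard SPAdes paired-end options, which this file does not show. `--pacbio` and `--nanopore` are the options named in the text.

```shell
# Hypothetical helper: print (not run) a hybrid-assembly command line.
hybrid_cmd() {
    r1=$1; r2=$2            # short-read pair (left and right reads)
    long_type=$3            # "pacbio" or "nanopore"
    long_reads=$4           # long-read file, no pre-correction needed
    case "$long_type" in
        pacbio|nanopore) ;;
        *) echo "long_type must be pacbio or nanopore" >&2; return 1 ;;
    esac
    echo "spades.py -1 $r1 -2 $r2 --$long_type $long_reads -o spades_output"
}

hybrid_cmd R1.fastq R2.fastq nanopore ont_reads.fastq
```

Printing the command instead of executing it keeps the example usable on a machine without SPAdes.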
diff --git a/docs/input.md b/docs/input.md
@@ -0,0 +1,52 @@
+# SPAdes basic input
+
+SPAdes takes as input paired-end reads, mate-pairs and single (unpaired) reads in FASTA and FASTQ formats. For IonTorrent data SPAdes also supports unpaired reads in unmapped BAM format (like the one produced by Torrent Server). However, in order to run read error correction, reads should be in FASTQ or BAM format. Sanger, Oxford Nanopore and PacBio CLR reads can be provided in either FASTA or FASTQ format since SPAdes does not run error correction for these types of data.
+
+To run SPAdes you need at least one library of the following types:
+
+-   Illumina paired-end/high-quality mate-pairs/unpaired reads
+-   IonTorrent paired-end/high-quality mate-pairs/unpaired reads
+-   PacBio CCS reads
+
+Illumina and IonTorrent libraries should not be assembled together. All other types of input data are compatible. SPAdes should not be used if only PacBio CLR, Oxford Nanopore, Sanger reads or additional contigs are available.
+
+SPAdes supports mate-pair only assembly. However, we recommend using only high-quality mate-pair libraries in this case (e.g. ones that do not have a paired-end part). We tested the mate-pair only pipeline using Illumina Nextera mate-pairs. See more [here](running.md#specifying-multiple-libraries).
+
+Notes:
+
+-   It is strongly suggested to provide multiple paired-end and mate-pair libraries in order of their insert size (from smallest to largest).
+-   It is not recommended to run SPAdes on PacBio reads with low coverage (less than 5x).
+-   We suggest not to run SPAdes on PacBio reads for large genomes.
+-   SPAdes accepts gzip-compressed files.
+
+## Paired read libraries
+
+Using the command line interface, you can specify up to nine different paired-end libraries, up to nine mate-pair libraries and also up to nine high-quality mate-pair ones. If you wish to use more, you can use a [YAML data set file](running.md#specifying-multiple-libraries-with-yaml-data-set-file). From here on we refer to paired-end and mate-pair libraries simply as read-pair libraries.
+
+By default, SPAdes assumes that paired-end and high-quality mate-pair reads have forward-reverse (fr) orientation and usual mate-pairs have reverse-forward (rf) orientation. However, different orientations can be set for any library by using SPAdes options.
+
+To distinguish reads in pairs we refer to them as left and right reads. For forward-reverse orientation, the forward reads correspond to the left reads and the reverse reads, to the right. Similarly, in reverse-forward orientation left and right reads correspond to reverse and forward reads, respectively, etc.
+
+Each read-pair library can be stored in several files or several pairs of files. Paired reads can be organized in two different ways:
+
+-   In file pairs. In this case left and right reads are placed in different files and go in the same order in respective files.
+-   In interleaved files. In this case, the reads are interlaced, so that each right read goes after the corresponding paired left read.
+
+For example, Illumina produces paired-end reads in two files: `R1.fastq` and `R2.fastq`. If you choose to store reads in file pairs make sure that for every read from `R1.fastq` the corresponding paired read from `R2.fastq` is placed in the respective paired file on the same line number. If you choose to use interleaved files, every read from `R1.fastq` should be followed by the corresponding paired read from `R2.fastq`.
+
+If adapter and/or quality trimming software has been used prior to assembly, files with the orphan reads can be provided as "single read files" for the corresponding read-pair library.
+
+If you have merged some of the reads from your paired-end (not mate-pair or high-quality mate-pair) library (using tools such as [BBMerge](https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmerge-guide/) or [STORM](https://bitbucket.org/yaoornl/align_test/overview)), you should provide the file with the resulting reads as a "merged read file" for the corresponding library.
+Note that non-empty files with the remaining unmerged left/right reads (separate or interlaced) **must** be provided for the same library (for SPAdes to correctly detect the original read length).
+
+In the unlikely case that some of the reads from your mate-pair (or high-quality mate-pair) library are "merged", you should provide the resulting reads as a SEPARATE single-read library.
+
+## Unpaired (single-read) libraries
+
+Using the command line interface, you can specify up to nine different single-read libraries. To input more libraries, you can use a [YAML data set file](running.md#specifying-multiple-libraries-with-yaml-data-set-file).
+
+Single-read libraries are assumed to have high quality and reasonable coverage. For example, you can provide PacBio CCS reads as a single-read library.
+
+Note that you should not specify PacBio CLR, Sanger reads or additional contigs as single-read libraries; each of them has a separate [option](running.md#input-data).
+
+
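Note on `docs/input.md`: the two layouts for paired reads (file pairs and interleaved files) can be checked and converted with standard tools. This is a minimal sketch, assuming uncompressed FASTQ with exactly four lines per record; `count_reads`, `same_read_count` and `interleave_fastq` are hypothetical helpers, not part of SPAdes.

```shell
# Number of reads in a FASTQ file (assumes 4 lines per record).
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Succeeds only if both files of a pair hold the same number of reads.
same_read_count() {
    [ "$(count_reads "$1")" -eq "$(count_reads "$2")" ]
}

# Write an interleaved stream: each left read (file 1) is followed
# by the corresponding right read (file 2).
interleave_fastq() {
    awk -v r2="$2" '
        { print }                       # copy the current line of the left file
        NR % 4 == 0 {                   # a full left record has been printed
            for (i = 0; i < 4; i++)
                if ((getline line < r2) > 0) print line
        }
    ' "$1"
}

# Example call:
#   same_read_count R1.fastq R2.fastq && interleave_fastq R1.fastq R2.fastq > interleaved.fastq
```

The read-count check catches only the most obvious mismatch between the two files. It does not confirm that reads appear in the same order.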
diff --git a/docs/installation.md b/docs/installation.md
@@ -12,7 +12,7 @@ In case of successful installation the following files will be placed in the `bi
 -   `metaviralspades.py` (main executable script for [metaviralSPAdes](running.md#basic-options-and-modes))
 -   `rnaspades.py` (main executable script for [rnaSPAdes](rna.md))
 -   `rnaviralspades.py` (main executable script for rnaviralSPAdes)
--   `coronaspades.py` (wrapper script for coronaSPAdes mode)
+-   `coronaspades.py` (wrapper script for [coronaSPAdes mode](hmm.md#hmm-guided-mode))
 -   `spades-core`  (assembly module)
 -   `spades-gbuilder`  (standalone graph builder application)
 -   `spades-gmapper`  (standalone long read to graph aligner)
@@ -30,8 +30,8 @@ To download [SPAdes Linux binaries](https://github.com/ablab/spades/releases/dow
 ``` bash
 
     wget https://github.com/ablab/spades/releases/download/v3.15.5/SPAdes-3.15.5-Linux.tar.gz
-    tar -xzf SPAdes-3.15.4-Linux.tar.gz
-    cd SPAdes-3.15.4-Linux/bin/
+    tar -xzf SPAdes-3.15.5-Linux.tar.gz
+    cd SPAdes-3.15.5-Linux/bin/
 ```
 
 In this case you do not need to run any installation scripts - SPAdes is ready to use. We also suggest adding SPAdes installation directory to the `PATH` variable.
@@ -46,8 +46,8 @@ To obtain [SPAdes binaries for Mac](https://github.com/ablab/spades/releases/dow
 ``` bash
 
     curl -L -O https://github.com/ablab/spades/releases/download/v3.15.5/SPAdes-3.15.5-Darwin.tar.gz
-    tar -zxf SPAdes-3.15.4-Darwin.tar.gz
-    cd SPAdes-3.15.4-Darwin/bin/
+    tar -zxf SPAdes-3.15.5-Darwin.tar.gz
+    cd SPAdes-3.15.5-Darwin/bin/
 ```
 
 Just as in Linux, SPAdes is ready to use and no further installation steps are required. We also suggest adding SPAdes installation directory to the `PATH` variable.
@@ -67,8 +67,8 @@ If you meet these requirements, you can download the [SPAdes source code](https:
 ``` bash
 
     wget https://github.com/ablab/spades/releases/download/v3.15.5/SPAdes-3.15.5.tar.gz
-    tar -xzf SPAdes-3.15.4.tar.gz
-    cd SPAdes-3.15.4
+    tar -xzf SPAdes-3.15.5.tar.gz
+    cd SPAdes-3.15.5
 ```
 
 and build it with the following script: