diff --git a/docs/binspreader.md b/docs/binspreader.md index 6b23a4be77..68e09f90ff 100644 --- a/docs/binspreader.md +++ b/docs/binspreader.md @@ -16,6 +16,16 @@ source of information for refining. Optionally, BinSPreader can be provided with multiple Hi-C and/or paired-end libraries. The [BinSPreader protocol](https://star-protocols.cell.com/protocols/2802) contains more detailed instructions on installing and running BinSPreader. +## Compilation + +To compile BinSPreader, run + +``` +./spades_compile -SPADES_ENABLE_PROJECTS=binspreader +``` + +After the compilation is complete, `binspreader` executable will be located in the `bin` folder. + ## Command line options Required positional arguments: @@ -69,7 +79,7 @@ binspreader [OPTION...] Labels correction regularization parameter for labeled data (default: 0.6) -### Output +## Output BinSPreader stores all output files in the output directory ` ` set by the user. - `/binning.tsv` contains refined binning in `.tsv` format @@ -83,3 +93,11 @@ In addition - `/bin_label_1.fastq, /bin_label_2.fastq` read set for bin labeled by `bin_label` (if `--reads` was used) - `/pe_links.tsv` list of paired-end links between assembly graph edges with weights (if `--debug` was used) - `/graph_links.tsv` list of graph links between assembly graph edges with weights (if `--debug` was used) + + +## References + +If you are using **BinSPreader** in your research, please cite: + +[Tolstoganov et al., 2022](https://www.cell.com/iscience/pdf/S2589-0042(22)01042-2.pdf) and +[Ochkalova et al., 2023](https://www.sciencedirect.com/science/article/pii/S2666166723003842). 
\ No newline at end of file diff --git a/docs/getting-started.md b/docs/getting-started.md index 4ada6e47bd..cbad1dce9e 100644 --- a/docs/getting-started.md +++ b/docs/getting-started.md @@ -89,20 +89,20 @@ bin/spades.py --rnaviral -1 left.fastq.gz -2 right.fastq.gz -o output_folder ## Standalone SPAdes tools -- `spades-kmercount` - k-mer counting; +- [`spades-kmercount`](standalone.md#k-mer-counter) - k-mer counting; -- `spades-read-filter` - read filtering using k-mer coverage; +- [`spades-read-filter`](standalone.md#k-mer-coverage-read-filter) - read filtering using k-mer coverage; -- `spades-kmer-estimating` - estimating number of unique k-mers; +- [`spades-kmer-estimating`](standalone.md#k-mer-cardinality-estimating) - estimating number of unique k-mers; -- `spades-gbuilder` - assembly graph construction; +- [`spades-gbuilder`](standalone.md#graph-construction) - assembly graph construction; -- `spades-gsimplifier` - assembly graph simplification; +- [`spades-gsimplifier`](standalone.md#graph-simplification) - assembly graph simplification; -- `spalgner` - alignment of long reads to assembly graph; +- [`spaligner`](spaligner.md) - alignment of long reads to assembly graph; -- `spades-gmapper` - specific alignment of long reads to assembly graph used in hybrid assembly pipeline; - -- `binspreader` - refinement of metagenome-assembled genomes. +- [`spades-gmapper`](standalone.md#hybridspades-aligner) - specific alignment of long reads to assembly graph used in hybrid assembly pipeline; +- [`binspreader`](binspreader.md) - refinement of metagenome-assembled genomes; +- [`pathracer`](pathracer.md) - alignment of profile HMMs to assembly graph. diff --git a/docs/installation.md b/docs/installation.md index 88f4e98332..fa9120150b 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -88,7 +88,7 @@ for example: which will install SPAdes into `/usr/local/bin`. 
-After installation you will get the same files (listed above) in `./bin` directory (or `/bin` if you specified PREFIX). We also suggest adding `bin` directory to the `PATH` variable. +After installation, you will get the same files (listed above) in the `bin` directory (or `/bin` if you specified PREFIX). We also suggest adding the `bin` directory to the `PATH` environment variable. ## Building additional tools SPAdes toolkit includes a number of standalone tools that are built using core @@ -103,10 +103,10 @@ can pass `-SPADES_ENABLE_PROJECTS="semicolon-separated list of projects"` to ena subset of SPAdes components. The components are: - `spades` - - `spades_tools` [standalone SPAdes tools](standalone.md) - - `binspareader` [BinSPreader](binspreader.md) - - `pathracer` - - `spaligner` + - [`spades_tools`](standalone.md) + - [`binspreader`](binspreader.md) + - [`pathracer`](pathracer.md) + - [`spaligner`](spaligner.md) By default, only SPAdes and SPAdes tools are enabled (so `-DSPADES_ENABLE_PROJECTS="spades;spades_tools"` is the default). Alternatively, diff --git a/src/projects/pathracer/extra/pHMM_with_frameshifts.svg b/docs/pHMM_with_frameshifts.svg similarity index 100% rename from src/projects/pathracer/extra/pHMM_with_frameshifts.svg rename to docs/pHMM_with_frameshifts.svg diff --git a/src/projects/pathracer/README.md b/docs/pathracer.md similarity index 86% rename from src/projects/pathracer/README.md rename to docs/pathracer.md index 44c9de7cee..fb3d722500 100644 --- a/src/projects/pathracer/README.md +++ b/docs/pathracer.md @@ -1,14 +1,11 @@ -PathRacer: racing profile HMM paths on assembly graph -===================================================== -MANUAL ------- - -### Overview - - - - - +# PathRacer: racing profile HMM paths on assembly graph + +## Overview +PathRacer is a tool for aligning assembly graphs against profile HMMs. It supports +both _local-local_ and _global-local_ (aka _glocal_) alignment and both nucleotide and amino acid profile HMMs. 
+The tool finds all proper alignments rather than only the best one. +That allows extracting all genes satisfying HMM gene model from the assembly. + **PathRacer** is a tool for alignment of assembly graph against pHMM. It provides the set of _k_ most probable paths traversed by a HMM through the whole assembly @@ -22,21 +19,33 @@ translation on-fly walking through frameshifts. Both tool use extended pHMM model allowing frame shifts: -![Scheme of extended pHMM](./extra/pHMM_with_frameshifts.svg) +![Scheme of extended pHMM](./pHMM_with_frameshifts.svg) but for `pathracer-seq-fs` this extension is crucial: for aligning amino-acid pHMMs without allowing indels in the nucleotide space six frame translation + `hmmsearch` from **HMMer** package is more than enough. -### Input -Currently the tool supports only _de Bruijn_ graphs in GFA format as produced by **SPAdes** or compatible assembler in this matter (e.g., **MEGAHIT**). +## Compilation + +To compile PathRacer, run + +``` +./spades_compile -SPADES_ENABLE_PROJECTS=pathracer +``` + +After the compilation is complete, `pathracer` executable will be located in the `bin` folder. + +## Input +Currently, the tool supports only _de Bruijn_ graphs in GFA format as produced by **SPAdes** or compatible assembler in this matter (e.g., **MEGAHIT**). Contact us if you need some other format support. Input sequences are supposed to be in FASTA/FASTQ format. Profile HMM should be in **HMMer3** format, but one can pass nucleotide or amino acid sequences as well. These sequences will be converted to proxy pHMMs. Aligning of these pHMMs would be equivalent to performing alignment using Levenshtein distance for each input sequence. +## pathracer tool + +### pathracer command line options -### `pathracer` command line options Required positional arguments: 1. Query file (.hmm file or .fasta) @@ -72,7 +81,7 @@ Debug output control: _In addition:_ Some other developer options that are not supposed to be tuned by the end-user. 
Could be removed in further releases. -### `pathracer` output +### pathracer output For each input pHMM (gene model) `pathracer` reports: - **<gene\_name>.seqs.fa**: sequences correspondent to _N_ best scored paths ordered by score along with their alignment in CIGAR format @@ -89,7 +98,11 @@ In addition: - **pathracer.log**: log file - **graph\_with\_hmm\_paths.gfa**: _(optional)_ input graph with top scored paths added -### `pathracer-seq-fs` command line options + +## pathracer-seq-fs tool + +### pathracer-seq-fs command line options + Required positional arguments: 1. Query .hmm file (.fasta is not supported yet) @@ -106,11 +119,11 @@ Main options: Heuristics options: _The same as in main `pathracer`_ -### `pathracer-seq-fs` output +### pathracer-seq-fs output For each input pHMM (gene model): **<gene\_name>.seqs.fa** and **<gene\_name>.nucls.fa**, the same as in main `pathracer` -### Output files format +## Output files format **<gene\_name>.seqs.fa** and **<gene\_name>.nucls.fa** files contain metainformation in FASTA headers. For main `pathracer` the header format is: > @@ -143,7 +156,7 @@ For alignment with frameshifts the extemded CIGAR/FASTA is used: P/"-" — one nucleotide insertion, G/"=" — two nucleotides insertion -### Examples +## Examples One can download example datasets from - **urban_strain.gfa**: strain assembly graph of Singapore clinical isolation ward wastewater metagenome (SRA accession SRR5997548, dataset H1) @@ -188,13 +201,10 @@ export OMP_STACKSIZE=1G pathracer bac.hmm synth_strain_gbuilder.gfa --queries 16S_rRNA -m 250 --top 1000000 --output pathracer_synth_strain_gbuilder_16s --no-top-score-filter ``` -### References -If you are using **PathRacer** in your research, please cite: -A. Shlemov and A. Korobeynikov. PathRacer: racing profile HMM paths on assembly -graph. In _Proceedings of International Conference on Algorithms for Computational Biology, -AlCoB 2019. 
Berkeley, California, USA, May 28–30, 2019,_ volume 11488 LNCS, pages -80–94, 2019. - +## References + +If you are using **PathRacer** in your research, please cite: + +[Shlemov and Korobeynikov, 2019](https://link.springer.com/chapter/10.1007/978-3-030-18174-1_6) -In case of any problems running **PathRacer** please contact [SPAdes support](https://github.com/ablab/spades/issues)> attaching the log file. -Your suggestions are also very welcome! +In case of any problems running **PathRacer** please contact [SPAdes support](https://github.com/ablab/spades/issues) attaching the log file. diff --git a/src/projects/spaligner/pipeline.jpg b/docs/spaligner.jpg similarity index 100% rename from src/projects/spaligner/pipeline.jpg rename to docs/spaligner.jpg diff --git a/src/projects/spaligner/README.md b/docs/spaligner.md similarity index 82% rename from src/projects/spaligner/README.md rename to docs/spaligner.md index 390b5a7fab..1d6c49feff 100644 --- a/src/projects/spaligner/README.md +++ b/docs/spaligner.md @@ -1,31 +1,57 @@ -# SPAligner +# SPAligner: long read to graph aligner + +SPAligner is a tool for fast and accurate alignment of nucleotide sequences to assembly graphs. +It takes file with sequences (in fasta/fastq format) and assembly in GFA format and outputs long read +to graph alignment in various formats (such as tsv, fasta and [GPA](https://github.com/ocxtal/gpa "GPA-format spec")). + + +## Compilation + +To compile SPAligner, run + +``` +./spades_compile -SPADES_ENABLE_PROJECTS=spaligner +``` + +After the compilation is complete, `spaligner` executable will be located in the `bin` folder. -Tool for fast and accurate alignment of nucleotide sequences (s.a. long reads, coding sequences, etc.) to assembly graphs. 
## Running SPAligner - spaligner spaligner_config.yaml \ # config file +Synopsis: + + spaligner spaligner_config.yaml \ # config file -d pacbio \ # data type: pacbio, nanopore - -g assembly_graph.gfa \ # gfa-file with assembly graph - -k 77 \ # graph K-mer size - -s pacbio_reads.fastq.gz \ # sequences to align in fasta/fastq formats - -t 8 # number of threads, 8 by default + -g assembly_graph.gfa \ # assembly graph + -k 77 \ # graph k-mer size + -s pacbio_reads.fastq.gz \ # input sequences / reads + -t 8 # number of threads -By default, spaligner_config.yaml will be installed into /usr/share/spaligner/ or can be found in assembler/projects/spaligner/. +By default, `spaligner_config.yaml` can be found in `src/projects/spaligner/`. -Alignments will be saved to spaligner_result/alignment.tsv by default. +Alignments will be saved to `spaligner_result/alignment.tsv` by default. -## Compilation +### Command line options + +`-d ` + long reads type: `nanopore` or `pacbio` + +`-s ` + file with sequences in FASTA or FASTQ formats (can be gzipped) - git clone https://github.com/ablab/spades.git - cd spades/assembler/ - mkdir build && cd build && cmake ../src - make spaligner +`-g ` + file with an assembly graph in GFA format -Now to run SPAligner move to folder `assembler/` and execute +`-k ` + k-mer length that was used for graph construction + +`-t ` + number of threads (default: 8) + +`-o, --outdir ` + output directory to use (default: `spaligner_result/`) - build/bin/spaligner ## Output @@ -102,7 +128,7 @@ If a sequence was not fully aligned, SPAligner tries to prolong the longest alig Overview of the alignment of the nucleotide query sequence *S* (orange bar) to assembly graph *G*. Assembly graph edges are considered directed left-to-right (explicit edge orientation was omitted to improve the clarity). -![pipeline](pipeline.jpg) +![pipeline](spaligner.jpg) 1. 
**Anchor search.** Anchors (regions of high similarity) between the query and the edge labels are identified with [BWA-MEM](http://bio-bwa.sourceforge.net/). 2. **Anchor filtering.** Anchors shorter than *K*, assembly graph *K*-mer size,(anchors 2, 6, 11), anchors “in the middle” of long edge (anchor 7) or ambiguous anchors (anchor 10 mostly covered by anchor 9, both anchors 4 and 5) are discarded. @@ -146,6 +172,10 @@ Increase of `max_gs_states`, `max_restorable_length`, `queue_limit`, `iteration_ Turning off restore_ends or run_dijkstra in nucleotide sequence alignment mode leads to shorter alignments, but considerable speed-up. -## Contacts +## References + +If you are using **SPAligner** in your research, please cite: + +[Dvorkina et al., 2020](https://link.springer.com/article/10.1186/s12859-020-03590-7) For any questions or suggestions please do not hesitate to contact Tatiana Dvorkina . diff --git a/docs/standalone.md b/docs/standalone.md index 55cd3e5495..3e83181503 100644 --- a/docs/standalone.md +++ b/docs/standalone.md @@ -169,12 +169,17 @@ Additional options are: original graph -## Long read to graph alignment +## hybridSPAdes aligner + +_Not to be confused with [SPAligner](spaligner.md)._ -### hybridSPAdes aligner A tool `spades-gmapper ` gives the opportunity to extract long read alignments generated with hybridSPAdes pipeline options. It has three mandatory options: dataset description file in [YAML format](running.md#specifying-multiple-libraries-with-yaml-data-set-file), graph file in GFA format and an output file name. +While `spades-gmapper` is a solution for those who work on hybridSPAdes assembly and +want to get exactly its intermediate results, [SPAligner](spaligner.md) is an end-product application for sequence-to-graph alignment with tunable parameters and output types. 
+ + Synopsis: `spades-gmapper [-k ] [-t ] [-tmpdir ]` Additional options are: @@ -188,13 +193,11 @@ Additional options are: `-tmpdir ` scratch directory to use -While `spades-gmapper` is a solution for those who work on hybridSPAdes assembly and want to get exactly its intermediate results, [SPAligner](standalone.md#spaligner) is an end-product application for sequence-to-graph alignment with tunable parameters and output types. ### SPAligner A tool for fast and accurate alignment of nucleotide sequences to assembly graphs. It takes file with sequences (in fasta/fastq format) and assembly in GFA format and outputs long read to graph alignment in various formats (such as tsv, fasta and [GPA](https://github.com/ocxtal/gpa "GPA-format spec")). -Synopsis: `spaligner src/projects/spaligner_config.yaml -d -s -g -k [-t ] [-o ]` Parameters are: @@ -216,8 +219,6 @@ Parameters are: `-o, --outdir ` output directory to use (default: spaligner_result/) -For more information on parameters and options please refer to the main SPAligner manual (assembler/src/projects/spaligner/README.md). - Also if you want to align protein sequences please refer to our [pre-release version](https://github.com/ablab/spades/releases/tag/spaligner-paper). Note that in order you use SPAligner one needs either to use pre-built binaries or compile SPAdes from sources using the additional `-DSPADES_ENABLE_PROJECTS=spaligner` option. diff --git a/mkdocs.yml b/mkdocs.yml index f6d77f82fb..39c7fef5b0 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -13,6 +13,8 @@ nav: - HMM-guided mode: hmm.md - Transcriptome assembly: rna.md - Binning refining: binspreader.md + - HMM mapping on assembly graph: pathracer.md + - Sequence to graph alignment: spaligner.md - SPAdes tools: standalone.md - Citation: citation.md - Feedback: feedback.md