Tools docs #1309

Merged
merged 3 commits into from
May 31, 2024
20 changes: 19 additions & 1 deletion docs/binspreader.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,16 @@ source of information for refining. Optionally, BinSPreader can be provided with
multiple Hi-C and/or paired-end libraries. The [BinSPreader protocol](https://star-protocols.cell.com/protocols/2802) contains more detailed
instructions on installing and running BinSPreader.

## Compilation

To compile BinSPreader, run

```
./spades_compile -DSPADES_ENABLE_PROJECTS=binspreader
```

After the compilation is complete, the `binspreader` executable will be located in the `bin` folder.

## Command line options

Required positional arguments:
Expand Down Expand Up @@ -69,7 +79,7 @@ binspreader <graph (in GFA)> <binning (in .tsv)> <output directory> [OPTION...]
Labels correction regularization parameter for labeled data (default: 0.6)


### Output
## Output
BinSPreader stores all output files in the output directory `<output_dir>` set by the user.

- `<output_dir>/binning.tsv` contains refined binning in `.tsv` format
Expand All @@ -83,3 +93,11 @@ In addition
- `<output_dir>/bin_label_1.fastq, <output_dir>/bin_label_2.fastq` read set for bin labeled by `bin_label` (if `--reads` was used)
- `<output_dir>/pe_links.tsv` list of paired-end links between assembly graph edges with weights (if `--debug` was used)
- `<output_dir>/graph_links.tsv` list of graph links between assembly graph edges with weights (if `--debug` was used)
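
The refined binning can be summarized with standard command-line tools. The following is a minimal sketch, not part of BinSPreader itself. It assumes that `binning.tsv` is tab-separated with the contig name in the first column and the bin label in the second, and that there is no header line. The directory name `binspreader_out` and the helper `count_bins` are hypothetical.

```shell
# Hypothetical output directory; replace with the directory passed to binspreader.
output_dir="binspreader_out"

# Print "<bin label><TAB><number of contigs>" for every bin.
# Assumption: column 1 is the contig name, column 2 is the bin label.
count_bins() {
    cut -f2 "$1" | sort | uniq -c | awk '{print $2 "\t" $1}'
}

# Summarize the refined binning only if the file is present.
if [ -f "$output_dir/binning.tsv" ]; then
    count_bins "$output_dir/binning.tsv"
fi
```

If the actual column layout differs, only the `cut -f2` field number needs to change.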


## References

If you are using **BinSPreader** in your research, please cite:

[Tolstoganov et al., 2022](https://www.cell.com/iscience/pdf/S2589-0042(22)01042-2.pdf) and
[Ochkalova et al., 2023](https://www.sciencedirect.com/science/article/pii/S2666166723003842).
18 changes: 9 additions & 9 deletions docs/getting-started.md
Original file line number Diff line number Diff line change
Expand Up @@ -89,20 +89,20 @@ bin/spades.py --rnaviral -1 left.fastq.gz -2 right.fastq.gz -o output_folder

## Standalone SPAdes tools

- `spades-kmercount` - k-mer counting;
- [`spades-kmercount`](standalone.md#k-mer-counter) - k-mer counting;

- `spades-read-filter` - read filtering using k-mer coverage;
- [`spades-read-filter`](standalone.md#k-mer-coverage-read-filter) - read filtering using k-mer coverage;

- `spades-kmer-estimating` - estimating number of unique k-mers;
- [`spades-kmer-estimating`](standalone.md#k-mer-cardinality-estimating) - estimating number of unique k-mers;

- `spades-gbuilder` - assembly graph construction;
- [`spades-gbuilder`](standalone.md#graph-construction) - assembly graph construction;

- `spades-gsimplifier` - assembly graph simplification;
- [`spades-gsimplifier`](standalone.md#graph-simplification) - assembly graph simplification;

- `spaligner` - alignment of long reads to assembly graph;
- [`spaligner`](spaligner.md) - alignment of long reads to assembly graph;

- `spades-gmapper` - specific alignment of long reads to assembly graph used in hybrid assembly pipeline;

- `binspreader` - refinement of metagenome-assembled genomes.
- [`spades-gmapper`](standalone.md#long-read-to-graph-alignment) - specific alignment of long reads to assembly graph used in hybrid assembly pipeline;

- [`binspreader`](binspreader.md) - refinement of metagenome-assembled genomes;

- [`pathracer`](pathracer.md) - alignment of profile HMMs to assembly graph.
10 changes: 5 additions & 5 deletions docs/installation.md
Original file line number Diff line number Diff line change
Expand Up @@ -88,7 +88,7 @@ for example:

which will install SPAdes into `/usr/local/bin`.

After installation you will get the same files (listed above) in `./bin` directory (or `<destination_dir>/bin` if you specified PREFIX). We also suggest adding `bin` directory to the `PATH` variable.
After installation, you will get the same files (listed above) in the `bin` directory (or `<destination_dir>/bin` if you specified PREFIX). We also suggest adding the `bin` directory to the `PATH` environment variable.
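
As an illustration of the `PATH` suggestion, the sketch below adds the `bin` directory for the current shell session. It assumes SPAdes was built in the current working directory; the variable name `SPADES_BIN` is only illustrative.

```shell
# Assumption: the current directory is the one that contains the bin folder.
SPADES_BIN="$(pwd)/bin"

# Prepend it so that the freshly built executables are found first.
export PATH="$SPADES_BIN:$PATH"
```

To keep the setting across sessions, the same `export` line with an absolute path can be placed in the shell start-up file (for example `~/.bashrc` for bash).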

## Building additional tools
SPAdes toolkit includes a number of standalone tools that are built using core
Expand All @@ -103,10 +103,10 @@ can pass `-SPADES_ENABLE_PROJECTS="semicolon-separated list of projects"` to ena
subset of SPAdes components. The components are:

- `spades`
- `spades_tools` [standalone SPAdes tools](standalone.md)
- `binspreader` [BinSPreader](binspreader.md)
- `pathracer`
- `spaligner`
- [`spades_tools`](standalone.md)
- [`binspreader`](binspreader.md)
- [`pathracer`](pathracer.md)
- [`spaligner`](spaligner.md)

By default, only SPAdes and SPAdes tools are enabled (so
`-DSPADES_ENABLE_PROJECTS="spades;spades_tools"` is the default). Alternatively,
Expand Down
68 changes: 39 additions & 29 deletions src/projects/pathracer/README.md → docs/pathracer.md
Original file line number Diff line number Diff line change
@@ -1,14 +1,11 @@
PathRacer: racing profile HMM paths on assembly graph
=====================================================
MANUAL
------

### Overview
<!-- PathRacer is assembly graph against profile HMM aligning tool supporting -->
<!-- both _local-local_ and _global-local_ (aka _glocal_) alignment and both nucleotide and amino acid profile HMMs. -->
<!-- The tool finds all proper alignments rather than only the best one. -->
<!-- That allows extracting all genes satisfying HMM gene model from the assembly. -->
<!-- -->
# PathRacer: racing profile HMM paths on assembly graph

## Overview
PathRacer is a tool for aligning profile HMMs against assembly graphs. It supports
both _local-local_ and _global-local_ (aka _glocal_) alignment and both nucleotide and amino acid profile HMMs.
The tool finds all proper alignments rather than only the best one,
which allows extracting all genes that satisfy the HMM gene model from the assembly.


**PathRacer** is a tool for aligning an assembly graph against a pHMM. It provides
the set of _k_ most probable paths traversed by an HMM through the whole assembly
Expand All @@ -22,21 +19,33 @@ translation on-fly walking through frameshifts.

Both tools use an extended pHMM model that allows frameshifts:

![Scheme of extended pHMM](./extra/pHMM_with_frameshifts.svg)
![Scheme of extended pHMM](./pHMM_with_frameshifts.svg)

but for `pathracer-seq-fs` this extension is crucial: for aligning amino acid pHMMs without allowing indels in the nucleotide space,
six-frame translation plus `hmmsearch` from the **HMMer** package is more than enough.

### Input
Currently the tool supports only _de Bruijn_ graphs in GFA format as produced by **SPAdes** or compatible assembler in this matter (e.g., **MEGAHIT**).
## Compilation

To compile PathRacer, run

```
./spades_compile -DSPADES_ENABLE_PROJECTS=pathracer
```

After the compilation is complete, the `pathracer` executable will be located in the `bin` folder.

## Input
Currently, the tool supports only _de Bruijn_ graphs in GFA format as produced by **SPAdes** or by another assembler that is compatible in this respect (e.g., **MEGAHIT**).
Contact us if you need support for another format. Input sequences are expected to be in FASTA/FASTQ format.

Profile HMMs should be in **HMMer3** format, but one can pass nucleotide or amino acid sequences as well.
These sequences will be converted to proxy pHMMs.
Aligning these pHMMs is equivalent to performing alignment using Levenshtein distance for each input sequence.

## pathracer tool

### pathracer command line options

### `pathracer` command line options
Required positional arguments:

1. Query file (.hmm file or .fasta)
Expand Down Expand Up @@ -72,7 +81,7 @@ Debug output control:

_In addition:_ Some other developer options that are not supposed to be tuned by the end-user. Could be removed in further releases.

### `pathracer` output
### pathracer output
For each input pHMM (gene model) `pathracer` reports:

- **&lt;gene\_name&gt;.seqs.fa**: sequences corresponding to the _N_ best-scoring paths, ordered by score, along with their alignments in CIGAR format
Expand All @@ -89,7 +98,11 @@ In addition:
- **pathracer.log**: log file
- **graph\_with\_hmm\_paths.gfa**: _(optional)_ input graph with top scored paths added

### `pathracer-seq-fs` command line options

## pathracer-seq-fs tool

### pathracer-seq-fs command line options

Required positional arguments:

1. Query .hmm file (.fasta is not supported yet)
Expand All @@ -106,11 +119,11 @@ Main options:
Heuristics options:
_The same as in main `pathracer`_

### `pathracer-seq-fs` output
### pathracer-seq-fs output
For each input pHMM (gene model): **&lt;gene\_name&gt;.seqs.fa** and **&lt;gene\_name&gt;.nucls.fa**, the same as in main `pathracer`


### Output files format
## Output files format
**&lt;gene\_name&gt;.seqs.fa** and **&lt;gene\_name&gt;.nucls.fa** files contain metainformation in FASTA headers.
For main `pathracer` the header format is:
><code>
Expand Down Expand Up @@ -143,7 +156,7 @@ For alignment with frameshifts the extended CIGAR/FASTA is used:
P/"-" &mdash; one nucleotide insertion, G/"=" &mdash; two nucleotides insertion


### Examples
## Examples
One can download example datasets from <http://cab.spbu.ru/software/pathracer/>

- **urban_strain.gfa**: strain assembly graph of Singapore clinical isolation ward wastewater metagenome (SRA accession SRR5997548, dataset H1)
Expand Down Expand Up @@ -188,13 +201,10 @@ export OMP_STACKSIZE=1G
pathracer bac.hmm synth_strain_gbuilder.gfa --queries 16S_rRNA -m 250 --top 1000000 --output pathracer_synth_strain_gbuilder_16s --no-top-score-filter
```

### References
If you are using **PathRacer** in your research, please cite:
A. Shlemov and A. Korobeynikov. PathRacer: racing profile HMM paths on assembly
graph. In _Proceedings of International Conference on Algorithms for Computational Biology,
AlCoB 2019. Berkeley, California, USA, May 28&ndash;30, 2019,_ volume 11488 LNCS, pages
80&ndash;94, 2019.
<https://link.springer.com/chapter/10.1007/978-3-030-18174-1_6>
## References

If you are using **PathRacer** in your research, please cite:

[Shlemov and Korobeynikov, 2019](https://link.springer.com/chapter/10.1007/978-3-030-18174-1_6)

In case of any problems running **PathRacer** please contact [SPAdes support](https://github.com/ablab/spades/issues)> attaching the log file.
Your suggestions are also very welcome!
If you encounter any problems running **PathRacer**, please contact [SPAdes support](https://github.com/ablab/spades/issues) and attach the log file.
File renamed without changes
66 changes: 48 additions & 18 deletions src/projects/spaligner/README.md → docs/spaligner.md
Original file line number Diff line number Diff line change
@@ -1,31 +1,57 @@
# SPAligner
# SPAligner: long read to graph aligner

SPAligner is a tool for fast and accurate alignment of nucleotide sequences to assembly graphs.
It takes a file with sequences (in FASTA/FASTQ format) and an assembly graph in GFA format and outputs long-read-to-graph
alignments in various formats (such as TSV, FASTA and [GPA](https://github.com/ocxtal/gpa "GPA-format spec")).


## Compilation

To compile SPAligner, run

```
./spades_compile -DSPADES_ENABLE_PROJECTS=spaligner
```

After the compilation is complete, the `spaligner` executable will be located in the `bin` folder.

Tool for fast and accurate alignment of nucleotide sequences (s.a. long reads, coding sequences, etc.) to assembly graphs.

## Running SPAligner

spaligner spaligner_config.yaml \ # config file
Synopsis:

spaligner spaligner_config.yaml \ # config file
-d pacbio \ # data type: pacbio, nanopore
-g assembly_graph.gfa \ # gfa-file with assembly graph
-k 77 \ # graph K-mer size
-s pacbio_reads.fastq.gz \ # sequences to align in fasta/fastq formats
-t 8 # number of threads, 8 by default
-g assembly_graph.gfa \ # assembly graph
-k 77 \ # graph k-mer size
-s pacbio_reads.fastq.gz \ # input sequences / reads
-t 8 # number of threads

By default, spaligner_config.yaml will be installed into /usr/share/spaligner/ or can be found in assembler/projects/spaligner/.
By default, `spaligner_config.yaml` can be found in `src/projects/spaligner/`.

Alignments will be saved to spaligner_result/alignment.tsv by default.
Alignments will be saved to `spaligner_result/alignment.tsv` by default.


## Compilation
### Command line options

`-d <type> `
long reads type: `nanopore` or `pacbio`

`-s <filename> `
file with sequences in FASTA or FASTQ formats (can be gzipped)

git clone https://github.com/ablab/spades.git
cd spades/assembler/
mkdir build && cd build && cmake ../src
make spaligner
`-g <filename> `
file with an assembly graph in GFA format

Now to run SPAligner move to folder `assembler/` and execute
`-k <int> `
k-mer length that was used for graph construction

`-t <int> `
number of threads (default: 8)

`-o, --outdir <dir> `
output directory to use (default: `spaligner_result/`)
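
The annotated synopsis above is meant for reading rather than pasting: in a shell, a comment placed after a trailing backslash breaks the line continuation. The following is a sketch of an equivalent invocation that can be pasted as is. It uses only the options listed above; the input file names are placeholders, and the command is executed only if `spaligner` and both input files are found.

```shell
# Placeholder inputs; replace with real paths.
config="src/projects/spaligner/spaligner_config.yaml"
graph="assembly_graph.gfa"
reads="pacbio_reads.fastq.gz"
outdir="spaligner_result"

# Store the command in the positional parameters so it can be printed and run unchanged.
set -- spaligner "$config" -d pacbio -g "$graph" -k 77 -s "$reads" -t 8 -o "$outdir"

# Show the exact command line.
echo "$@"

# Run it only when the executable and the inputs are available.
if command -v spaligner >/dev/null 2>&1 && [ -f "$graph" ] && [ -f "$reads" ]; then
    "$@"
fi
```

The value passed to `-k` must match the k-mer length that was used to build the graph.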

build/bin/spaligner

## Output

Expand Down Expand Up @@ -102,7 +128,7 @@ If a sequence was not fully aligned, SPAligner tries to prolong the longest alig

Overview of the alignment of the nucleotide query sequence *S* (orange bar) to assembly graph *G*. Assembly graph edges are considered directed left-to-right (explicit edge orientation was omitted to improve the clarity).

![pipeline](pipeline.jpg)
![pipeline](spaligner.jpg)

1. **Anchor search.** Anchors (regions of high similarity) between the query and the edge labels are identified with [BWA-MEM](http://bio-bwa.sourceforge.net/).
2. **Anchor filtering.** Anchors shorter than *K*, the assembly graph *K*-mer size (anchors 2, 6, 11), anchors “in the middle” of a long edge (anchor 7), and ambiguous anchors (anchor 10, mostly covered by anchor 9; both anchors 4 and 5) are discarded.
Expand Down Expand Up @@ -146,6 +172,10 @@ Increase of `max_gs_states`, `max_restorable_length`, `queue_limit`, `iteration_
Turning off `restore_ends` or `run_dijkstra` in nucleotide sequence alignment mode leads to shorter alignments but a considerable speed-up.


## Contacts
## References

If you are using **SPAligner** in your research, please cite:

[Dvorkina et al., 2020](https://link.springer.com/article/10.1186/s12859-020-03590-7)

For any questions or suggestions please do not hesitate to contact Tatiana Dvorkina <[email protected]>.
13 changes: 7 additions & 6 deletions docs/standalone.md
Original file line number Diff line number Diff line change
Expand Up @@ -169,12 +169,17 @@ Additional options are:
original graph


## Long read to graph alignment

## hybridSPAdes aligner

_Not to be confused with [SPAligner](spaligner.md)._

### hybridSPAdes aligner
The `spades-gmapper` tool extracts long-read alignments generated with hybridSPAdes pipeline options. It has three mandatory arguments: a dataset description file in [YAML format](running.md#specifying-multiple-libraries-with-yaml-data-set-file), a graph file in GFA format, and an output file name.

While `spades-gmapper` is a solution for those who work on hybridSPAdes assembly and
want to get exactly its intermediate results, [SPAligner](spaligner.md) is an end-product application for sequence-to-graph alignment with tunable parameters and output types.


Synopsis: `spades-gmapper <dataset description (in YAML)> <graph (in GFA)> <output filename> [-k <value>] [-t <value>] [-tmpdir <dir>]`

Additional options are:
Expand All @@ -188,13 +193,11 @@ Additional options are:
`-tmpdir <dir_name> `
scratch directory to use
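
A call following the synopsis above might look like the sketch below. All file names are hypothetical, the `-k` and `-t` values are arbitrary examples, and it is an assumption that an existing temporary directory is acceptable for `-tmpdir`. The tool is invoked only if it and its inputs are present.

```shell
# Hypothetical inputs and output name; replace with real paths.
dataset="dataset.yaml"
graph="assembly_graph.gfa"
output="gmapper_alignments"

# Scratch directory for -tmpdir (assumption: an existing directory is acceptable).
tmpdir="$(mktemp -d)"

if command -v spades-gmapper >/dev/null 2>&1 && [ -f "$dataset" ] && [ -f "$graph" ]; then
    spades-gmapper "$dataset" "$graph" "$output" -k 77 -t 8 -tmpdir "$tmpdir"
else
    echo "spades-gmapper or its inputs were not found; nothing was run" >&2
fi
```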

While `spades-gmapper` is a solution for those who work on hybridSPAdes assembly and want to get exactly its intermediate results, [SPAligner](standalone.md#spaligner) is an end-product application for sequence-to-graph alignment with tunable parameters and output types.


### SPAligner
A tool for fast and accurate alignment of nucleotide sequences to assembly graphs. It takes file with sequences (in fasta/fastq format) and assembly in GFA format and outputs long read to graph alignment in various formats (such as tsv, fasta and [GPA](https://github.com/ocxtal/gpa "GPA-format spec")).

Synopsis: `spaligner src/projects/spaligner_config.yaml -d <value> -s <value> -g <value> -k <value> [-t <value>] [-o <value>]`

Parameters are:

Expand All @@ -216,8 +219,6 @@ Parameters are:
`-o, --outdir <dir> `
output directory to use (default: spaligner_result/)

For more information on parameters and options please refer to the main SPAligner manual (assembler/src/projects/spaligner/README.md).

Also, if you want to align protein sequences, please refer to our [pre-release version](https://github.com/ablab/spades/releases/tag/spaligner-paper).

Note that to use SPAligner, one needs either to use pre-built binaries or to compile SPAdes from source with the additional `-DSPADES_ENABLE_PROJECTS=spaligner` option.
2 changes: 2 additions & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,8 @@ nav:
- HMM-guided mode: hmm.md
- Transcriptome assembly: rna.md
- Binning refining: binspreader.md
- HMM mapping on assembly graph: pathracer.md
- Sequence to graph alignment: spaligner.md
- SPAdes tools: standalone.md
- Citation: citation.md
- Feedback: feedback.md
Expand Down