manual section about supported aligners

histopathology · Jan 19, 2023 · a5c13f5
1 parent faca4b8
commit a5c13f5

Showing 2 changed files with 11 additions and 2 deletions.
diff --git a/README.md b/README.md
@@ -92,6 +92,8 @@ Please refer to the [user manual](http://arriba.readthedocs.io/en/latest/) for i
    - [Viral detection](https://arriba.readthedocs.io/en/latest/current-limitations/#viral-detection)
    - [Targeted sequencing](https://arriba.readthedocs.io/en/latest/current-limitations/#targeted-sequencing)
    - [Supporting read count vs. coverage](https://arriba.readthedocs.io/en/latest/current-limitations/#supporting-read-count-vs-coverage)
+   - [Supported organisms](https://arriba.readthedocs.io/en/latest/current-limitations/#supported-organisms)
+   - [Supported aligners](https://arriba.readthedocs.io/en/latest/current-limitations/#supported-aligners)
 
 10. [Internal algorithm](https://arriba.readthedocs.io/en/latest/internal-algorithm/)
 

diff --git a/documentation/current-limitations.md b/documentation/current-limitations.md
@@ -43,8 +43,8 @@ Supporting read count vs. coverage
 
 The number of reads (or fragments) supporting a fusion is given in the columns `split_reads1/2` and `discordant_mates`. These columns only report reads which passed all filters and can be thought of as high-quality supporting reads. Reads which failed one or more filters are reported in the column `filters`. In contrast, the columns `coverage1/2` report all reads covering the fusion breakpoints. No filters are applied to the coverage calculation, so these numbers are not affected by the negative bias of the supporting reads columns. Most notably, the coverage calculation includes duplicates, whereas the supporting reads lack duplicates. Moreover, Arriba by default ignores supporting reads in excess of 300 for performance reasons (see also parameter `-U`). Therefore, the coverage values and supporting read counts are only roughly comparable - especially when a high number of duplicates is expected, for example with targeted sequencing libraries or highly expressed genes. Nevertheless, the implications for fusion calling are negligible, because few filters make use of coverage information. But users who desire consistent counting of supporting reads and coverage should remove (not just mark) all duplicates from the BAM file prior to running Arriba. This is currently the only way to obtain comparable counts.
 
-Unsupported organisms
----------------------
+Supported organisms
+-------------------
 
 Arriba officially supports only human (hg19/GRCh37/hs37d5 or hg38/GRCh38) and mouse (mm10/GRCm38 or mm39/GRCm39). Other organisms or genome assemblies can be used in principle, but the results will be less accurate and the annotation incomplete. This is because important reference files are not available, including:
 
@@ -60,3 +60,10 @@ In order to improve the fusion calls from unsupported organisms, users can build
 
 A blacklist can be built by simply running Arriba on the set of training samples. The breakpoint pairs to be blacklisted can then be extracted from the columns `breakpoint1/2` from both the main output file (as specified by the parameter `-o`) and the discarded fusions file (as specified by the parameter `-O`). The extracted breakpoint pairs must be stored in a tab-separated file with two columns - one for each breakpoint. Depending on the type of training samples used (normal vs. malignant), the recurrence threshold should be adjusted accordingly. If normal samples were used for training, any breakpoint pair which is found in more than one sample can be blacklisted as a recurrent artifact; for malignant training samples, the threshold should be much higher - at least as high as the most prevalent oncogenic driver fusion in the given disease. After the recurrent breakpoint pairs have been added to the blacklist, the list can optionally be fine-tuned further by adding special keywords. For example, when a certain gene is involved in a lot of artifacts even after the newly built blacklist has been applied, the gene may be blacklisted completely by putting the gene name in the first column and the keyword `any` in the second column. All valid keywords are described in the [section about the blacklist](input-files#blacklist).
 
+Supported aligners
+------------------
+
+In principle, Arriba is compatible with any RNA-Seq aligner which reports split reads and discordant mates in a format that is compliant with the [SAM format specification](https://samtools.github.io/hts-specs/SAMv1.pdf). That is, paired-end discordant mates must be marked as such by means of having the `BAM_FPROPER_PAIR (0x2)` flag unset, and split reads must be represented as supplementary alignments with the `BAM_FSUPPLEMENTARY (0x800)` flag set on the supplementary record and an `SA` tag on the anchor read. However, Arriba currently has the limitation that it can only utilize supplementary alignments if there is exactly one supplementary alignment per read. Reads which have multiple supplementary alignments are ignored. Multi-mapping chimeric reads are recognized by Arriba provided that all SAM records pertaining to the same alignment have a `HI` tag and the tag has the same value. In other words, when a read maps to multiple loci, all SAM records pertaining to the first alignment must have the tag `HI:i:1`, and all SAM records pertaining to the second alignment must have the tag `HI:i:2`, and so on.
+
+Alignment tools that have been tested successfully with Arriba are the [STAR aligner](https://github.com/alexdobin/STAR) and Illumina's Dragen aligner. (Note: The open-source implementation of Dragen, DRAGMAP, is not compatible with Arriba, since it is not suitable for RNA-Seq data.) STAR is preferred over Dragen, because it is better at aligning split reads and because multi-mapping reads are stored in a way that Arriba can handle. Users who want to use an incompatible aligner in conjunction with Arriba can run the script [`run_arriba_on_prealigned_bam.sh`](../utility-scripts/#run-arriba-on-prealigned-bam-file), which takes a BAM file (aligned with any aligner) as input and uses STAR to realign only those reads which are relevant to fusion detection, namely, clipped and unmapped reads. The fusion calls from this workflow should be close to those of the recommended workflow based entirely on STAR, but the workflow avoids realigning the entire BAM file just for the sake of fusion detection, which saves CPU time.
+
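
To illustrate the flag and tag conventions described in the new "Supported aligners" section, here is a minimal Python sketch that checks plain-text SAM records against them. It is not part of this commit and not how Arriba itself is implemented. All function names are hypothetical, and the rule "exactly one entry in the anchor's `SA` tag" is an assumed reading of "exactly one supplementary alignment per read". The discordance check only looks at the paired and proper-pair bits and ignores other cases, such as unmapped mates.

```python
from collections import defaultdict

# Flag bits as defined in the SAM specification.
PAIRED = 0x1            # read is part of a pair
PROPER_PAIR = 0x2       # BAM_FPROPER_PAIR
SUPPLEMENTARY = 0x800   # BAM_FSUPPLEMENTARY


def parse_sam_record(line):
    """Parse one SAM alignment line into a small dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:          # optional fields look like TAG:TYPE:VALUE
        name, _type, value = field.split(":", 2)
        tags[name] = value
    return {"qname": fields[0], "flag": int(fields[1]), "tags": tags}


def is_discordant_mate(rec):
    """Paired read whose proper-pair flag is unset."""
    return bool(rec["flag"] & PAIRED) and not rec["flag"] & PROPER_PAIR


def is_supplementary(rec):
    """Record carries the supplementary-alignment flag."""
    return bool(rec["flag"] & SUPPLEMENTARY)


def count_sa_entries(rec):
    """Number of alignments listed in the semicolon-delimited SA tag."""
    return len([entry for entry in rec["tags"].get("SA", "").split(";") if entry])


def is_usable_split_read_anchor(rec):
    """Assumption: an anchor is usable if its SA tag lists exactly one alignment."""
    return not is_supplementary(rec) and count_sa_entries(rec) == 1


def group_by_hit_index(records):
    """Group records by HI tag value; records without HI go under None."""
    groups = defaultdict(list)
    for rec in records:
        hi = rec["tags"].get("HI")
        groups[int(hi) if hi is not None else None].append(rec)
    return dict(groups)
```

Records that share a read name and the same `HI` value would then be treated as belonging to one alignment of a multi-mapping read, which is the grouping the section asks aligners to provide.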