Read Structures

NEW: Validate your read structures using this online tool.

Overview
Operators
General Rules
Examples
Formal Grammar

Overview

In fgbio, fqtk, sgdemux, and Picard, a Read Structure is a string that describes how the bases in a sequencing run should be allocated into logical reads. It serves a similar purpose to the --use-bases-mask option in Illumina's bcl2fastq software, but provides some additional capabilities.

A Read Structure is a sequence of <number><operator> pairs, or segments, where the last segment in the string may optionally use + instead of a number for its length. The + translates to whatever bases are left after the other segments are processed and can be thought of as meaning [0..infinity].
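As an illustration of the segment layout described above, here is a minimal Python sketch that tokenizes a read structure into (length, operator) pairs. It is not fgbio's implementation; the name `parse_read_structure` is hypothetical, and the operator letters are the four listed in the Operators section.

```python
import re
from typing import List, Optional, Tuple

# One segment: a positive integer or "+", followed by one operator letter.
SEGMENT = re.compile(r"(\+|[1-9][0-9]*)([TBMS])")

def parse_read_structure(text: str) -> List[Tuple[Optional[int], str]]:
    """Return (length, operator) pairs; length is None for a "+" segment."""
    segments: List[Tuple[Optional[int], str]] = []
    pos = 0
    while pos < len(text):
        match = SEGMENT.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse a segment at offset {pos} in {text!r}")
        length, operator = match.groups()
        segments.append((None if length == "+" else int(length), operator))
        pos = match.end()
    if not segments:
        raise ValueError("a read structure needs at least one segment")
    if any(length is None for length, _ in segments[:-1]):
        raise ValueError("only the last segment may use '+'")
    return segments
```

For example, `parse_read_structure("10M5S+T")` yields three segments, the last of which has no fixed length.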

Read Structures are most commonly used in tools that convert from sequencer output formats (e.g. fastq files, BCLs) to downstream formats like SAM/BAM/CRAM, and in tools that process SAM/BAM/CRAM to extract non-template bases from the reads. Examples include:

DemuxFastqs in fgbio to demultiplex a set of multi-sample fastq files and optionally extract UMIs
FastqToBam in fgbio to convert from fastq to BAM while preserving sample barcode and UMI information
ExtractUmisFromBam in fgbio which re-writes a BAM file with UMI sequences extracted from the reads and placed into tags
IlluminaBasecallsToSam and IlluminaBasecallsToFastq in Picard, both of which process BCLs and related files in an Illumina run folder and create BAMs or FASTQs respectively

Operators

Four kinds of operator are supported:

T or Template: the bases in the segment are reads of template (e.g. genomic DNA, RNA, etc.)
B or Sample Barcode: the bases in the segment are an index sequence used to identify the sample being sequenced
M or Molecular Barcode: the bases in the segment are an index sequence used to identify the unique source molecule being sequenced (i.e. a UMI)
S or Skip: the bases in the segment should be skipped or ignored, for example if they are monotemplate sequence generated by the library preparation
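To make the four operators concrete, the following Python sketch applies a read structure to the bases of a single read and groups them by operator. The function name `extract_bases` is hypothetical, and the sketch assumes the structure is already valid; it does not reflect how any of the tools above are implemented.

```python
import re
from typing import Dict, List

SEGMENT = re.compile(r"(\+|[1-9][0-9]*)([TBMS])")

def extract_bases(structure: str, bases: str) -> Dict[str, List[str]]:
    """Group the bases of one read by operator; skipped (S) bases are dropped."""
    out: Dict[str, List[str]] = {"T": [], "B": [], "M": []}
    offset = 0
    for length, operator in SEGMENT.findall(structure):
        # "+" takes every remaining base, which may be none at all.
        end = len(bases) if length == "+" else offset + int(length)
        if end > len(bases):
            raise ValueError("read is shorter than the read structure requires")
        if operator != "S":
            out[operator].append(bases[offset:end])
        offset = end
    return out
```

With the structure `4M2S+T` and the bases `ACGTNNTTGGCC`, the first four bases become the molecular barcode, the next two are skipped, and the remaining six are template.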

General Rules

Any number of segments >= 1 is valid
The length of each segment must be a positive integer (or +)
Only the last segment in a read structure may use + for its length
Adjacent segments may use the same operator. E.g. if two sample indices are ligated onto a molecule separately such that they are adjacent, a structure of 6B6B+T is perfectly acceptable.
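The rules above can be checked one at a time. The sketch below is a hypothetical helper (`check_rules` is not part of fgbio) that reports each violation as a message; it additionally rejects lengths with a leading zero, in line with the Formal Grammar section.

```python
from typing import List, Tuple

OPERATORS = "TBMS"

def check_rules(structure: str) -> List[str]:
    """Return a list of rule violations; an empty list means the structure is valid."""
    problems: List[str] = []
    segments: List[Tuple[str, str]] = []  # (length text, operator) pairs
    length_text = ""
    for char in structure:
        if char in OPERATORS:
            segments.append((length_text, char))
            length_text = ""
        else:
            length_text += char
    if length_text:
        problems.append(f"trailing characters without an operator: {length_text!r}")
    if not segments:
        problems.append("at least one segment is required")
    for index, (length, operator) in enumerate(segments):
        label = f"segment {index + 1} ({length}{operator})"
        if length == "+":
            if index != len(segments) - 1:
                problems.append(f"{label}: only the last segment may use '+'")
        elif not (length.isascii() and length.isdigit() and not length.startswith("0")):
            problems.append(f"{label}: length must be a positive integer or '+'")
    return problems
```

Here `check_rules("6B6B+T")` returns an empty list, while `check_rules("+T5B")` reports that only the last segment may use `+`.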

Examples

The following examples show the recommended way to describe a sequencing run in two different ways: first as a single Read Structure for the entire run, as you might use with IlluminaBasecallsToSam, and second as a set of read structures that map one-to-one onto the physical reads after fastq conversion and, optionally, adapter trimming (which creates variable-length reads):

A simple 2x150bp paired end run with no sample or molecular indices:
- 150T150T
- [+T, +T]
A 2x75bp paired end run with an 8bp I1 index read:
- 75T8B75T
- [+T, 8B, +T]
A 2x150bp paired end run with an 8bp I1 index read and an inline 8bp UMI in read 1:
- 8M142T8B150T
- [8M+T, 8B, +T]
A 2x150bp duplex sequencing run with dual sample-barcoding (I1 and I2) and both a 10bp UMI and 5bp monotemplate at the start of both R1 and R2:
- 10M5S135T8B8B10M5S135T
- [10M5S+T, 8B, 8B, 10M5S+T]
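The relationship between the two forms in each example can be sketched in Python. The helper below (`split_by_read`, a hypothetical name) takes a whole-run structure plus the number of cycles in each physical read and derives the per-read structures, rewriting a trailing template segment as `+T`. It assumes every segment has a fixed length and that no segment straddles a read boundary; it is an illustration, not something the tools above provide.

```python
import re
from typing import List

SEGMENT = re.compile(r"(\+|[1-9][0-9]*)([TBMS])")

def split_by_read(run_structure: str, cycles_per_read: List[int]) -> List[str]:
    """Split a whole-run structure into one structure per physical read."""
    # int() fails on "+", so only fixed-length run structures are accepted.
    segments = [(int(n), op) for n, op in SEGMENT.findall(run_structure)]
    result: List[str] = []
    index = 0
    for cycles in cycles_per_read:
        if cycles < 1:
            raise ValueError("each read needs at least one cycle")
        current = []
        remaining = cycles
        while remaining > 0:
            if index == len(segments):
                raise ValueError("run structure is shorter than the reads")
            length, op = segments[index]
            if length > remaining:
                raise ValueError("segment straddles a read boundary")
            current.append((length, op))
            remaining -= length
            index += 1
        last_length, last_op = current[-1]
        parts = [f"{n}{op}" for n, op in current[:-1]]
        # A trailing template becomes "+T" so trimmed reads still fit.
        parts.append("+T" if last_op == "T" else f"{last_length}{last_op}")
        result.append("".join(parts))
    if index != len(segments):
        raise ValueError("run structure is longer than the reads")
    return result
```

For instance, `split_by_read("75T8B75T", [75, 8, 75])` gives `["+T", "8B", "+T"]`, matching the second example above.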

Formal Grammar

The formal grammar for Read Structures supported by fgbio is as follows:

<read-structure>     ::= <fixed-structure> <variable-segment>
<fixed-structure>    ::= "" | <fixed-length> <operator> <fixed-structure>
<variable-segment>   ::= "" | <variable-length> <operator>
<segment>            ::= <any-length><operator>
<operator>           ::= "T" | "B" | "M" | "S"
<fixed-length>       ::= <non-zero-digit>{<digit>}
<variable-length>    ::= "+"
<any-length>         ::= <fixed-length> | <variable-length>
<non-zero-digit>     ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<digit>              ::= "0" | <non-zero-digit>
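Because the grammar is regular, it can be translated into a single regular expression. The following Python sketch is one such translation (the names `READ_STRUCTURE` and `matches_grammar` are hypothetical). Note that the grammar as written also derives the empty string, whereas the General Rules require at least one segment, so a caller would likely want to reject empty input separately.

```python
import re

# Zero or more <fixed-length><operator> segments, then an optional "+" segment.
READ_STRUCTURE = re.compile(r"(?:[1-9][0-9]*[TBMS])*(?:\+[TBMS])?")

def matches_grammar(text: str) -> bool:
    """True if the whole string can be derived from <read-structure>."""
    return READ_STRUCTURE.fullmatch(text) is not None
```

Under this translation `10M5S+T` and `+T` match, while `+T5B` (a `+` before the last segment) and `08T` (a leading zero) do not.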
