diff --git a/.gitignore b/.gitignore index d8d2a4a02..c365740e8 100644 --- a/.gitignore +++ b/.gitignore @@ -6,5 +6,6 @@ *.ver *.dvi -*.pdf *.ps + +/_site diff --git a/BCFv1_qref.pdf b/BCFv1_qref.pdf new file mode 100644 index 000000000..8cbcd3563 Binary files /dev/null and b/BCFv1_qref.pdf differ diff --git a/BCFv2_qref.pdf b/BCFv2_qref.pdf new file mode 100644 index 000000000..45c4b0187 Binary files /dev/null and b/BCFv2_qref.pdf differ diff --git a/CRAMv2.1.pdf b/CRAMv2.1.pdf new file mode 100644 index 000000000..ac50bd025 Binary files /dev/null and b/CRAMv2.1.pdf differ diff --git a/CRAMv2.1.tex b/CRAMv2.1.tex index 920d0b0b3..ae0630af0 100644 --- a/CRAMv2.1.tex +++ b/CRAMv2.1.tex @@ -392,12 +392,30 @@ \section{\textbf{File definition}} \hline unsigned byte & major format number & 2 (0x2)\tabularnewline \hline -unsigned byte & minor format number & 0 (0x0)\tabularnewline +unsigned byte & minor format number & 1 (0x1)\tabularnewline \hline byte[20] & file id & CRAM file identifier (e.g. file name or SHA1 checksum)\tabularnewline \hline \end{tabular} +Valid CRAM \textit{major}.\textit{minor} version numbers are as follows: + +\begin{itemize} +\item[\textit{1.0}] +The original public CRAM release. + +\item[\textit{2.0}] +The first CRAM release implemented in both Java and C; tidied up +implementation vs specification differences in \textit{1.0}. + +\item[\textit{2.1}] +Gained end of file markers; compatible with \textit{2.0}. + +\item[\textit{3.0}] +Additional compression methods; header and data checksums; +improvements for unsorted data. 
+\end {itemize} + \section{\textbf{Container structure}} The file definition is followed by one or more containers with the following header diff --git a/CRAMv3.pdf b/CRAMv3.pdf new file mode 100644 index 000000000..eb93e7c61 Binary files /dev/null and b/CRAMv3.pdf differ diff --git a/CRAMv3.tex b/CRAMv3.tex index 60f3e6d72..4b9bb5fff 100644 --- a/CRAMv3.tex +++ b/CRAMv3.tex @@ -83,7 +83,7 @@ \section{\textbf{Data types}} types are written as words (e.g. int) while physical data types are written using single letters (e.g. i). The difference between the two is that storage data types define how logical data types are stored in CRAM. Data in CRAM is stored either -as as bits or as bytes. Writing values as bits and bytes is described in detail +as bits or bytes. Writing values as bits and bytes is described in detail below. \subsection{\textbf{Logical data types}} @@ -195,7 +195,7 @@ \subsection{\textbf{Writing bytes to a byte stream}} the number of bytes to follow. To accommodate 32 bits such representation requires 5 bytes with only 4 lower bits used in the last byte 5. -\item[LTF-8 long or (ltf8)]\ \newline +\item[LTF-8 long (ltf8)]\ \newline See ITF-8 for more details. The only difference between ITF-8 and LTF-8 is the number of bytes used to encode a single value. To do so 64 bits are required and this can be done with 9 byte at most with the first byte consisting of just 1s @@ -204,11 +204,8 @@ \subsection{\textbf{Writing bytes to a byte stream}} \item[{Array ([ ])}]\ \newline Array length is written first as integer (itf8), followed by the elements of the array. -\end{description} - - -\subsubsection*{Encoding} +\item[{Encoding}]\ \newline Encoding is a data type that specifies how data series have been compressed. Encodings are defined as encoding\texttt{<}type\texttt{>} where the type is a logical data type as opposed to a storage data type. 
@@ -244,8 +241,7 @@ \subsubsection*{Encoding} K = 0x1 = 1 -\subsubsection*{Map} - +\item[{Map}]\ \newline A map is a collection of keys and associated values. A map with N keys is written as follows: @@ -258,11 +254,13 @@ \subsubsection*{Map} Both the size in bytes and the number of keys are written as integer (itf8). Keys and values are written according to their data types and are specific to each map. -\subsection{\textbf{Strings}} - -Strings are represented as byte arrays using UTF-8 format. Read names, reference +\item[String]\ \newline +A string is represented as byte arrays using UTF-8 format. Read names, reference sequence names and tag values with type `Z' are stored as UTF-8. +\end{description} + + \section{\textbf{Encodings }} Encoding is a data structure that captures information about compression details @@ -397,7 +395,7 @@ \section{\textbf{File definition}} \hline byte[4] & format magic number & CRAM (0x43 0x52 0x41 0x4d)\tabularnewline \hline -unsigned byte & major format number & 2 (0x2)\tabularnewline +unsigned byte & major format number & 3 (0x3)\tabularnewline \hline unsigned byte & minor format number & 0 (0x0)\tabularnewline \hline @@ -405,6 +403,24 @@ \section{\textbf{File definition}} \hline \end{tabular} +Valid CRAM \textit{major}.\textit{minor} version numbers are as follows: + +\begin{itemize} +\item[\textit{1.0}] +The original public CRAM release. + +\item[\textit{2.0}] +The first CRAM release implemented in both Java and C; tidied up +implementation vs specification differences in \textit{1.0}. + +\item[\textit{2.1}] +Gained end of file markers; compatible with \textit{2.0}. + +\item[\textit{3.0}] +Additional compression methods; header and data checksums; +improvements for unsorted data. 
+\end {itemize} + \section{\textbf{Container structure}} The file definition is followed by one or more containers with the following header @@ -1009,6 +1025,8 @@ \subsection{\textbf{CRAM record bit flags (BF data series)}} \hline 0x400 & & PCR or optical duplicate\tabularnewline \hline +0x800 & & Supplementary alignment\tabularnewline +\hline \end{tabular} * For segments within the same slice. diff --git a/CSIv1.pdf b/CSIv1.pdf new file mode 100644 index 000000000..38e52754f Binary files /dev/null and b/CSIv1.pdf differ diff --git a/CSIv2.pdf b/CSIv2.pdf new file mode 100644 index 000000000..4ce7f92e4 Binary files /dev/null and b/CSIv2.pdf differ diff --git a/MAINTAINERS.md b/MAINTAINERS.md new file mode 100644 index 000000000..564824c0d --- /dev/null +++ b/MAINTAINERS.md @@ -0,0 +1,68 @@ +## Specification maintainers + +The SAM, BAM, and VCF formats originated in the 1000 Genomes Project. +In February 2014, ongoing format maintenance was brought under the aegis of the [Global Alliance for Genomics & Health][ga4gh-ff]. +At this time, lead maintainers for each of the formats were nominated. +The current maintainers are listed below. + +### SAM/BAM + +* James Bonfield (@jkbonfield) +* John Marshall (@jmarshall) +* Yossi Farjoun (@yfarjoun) + +Past SAM/BAM maintainers include Jay Carey, Tim Fennell, and Nils Homer. + +### CRAM + +* James Bonfield (@jkbonfield) +* Vadim Zalunin (@vadimzalunin) + +### VCF/BCF + +* Cristina Yenyxe Gonzalez Garcia (@cyenyxe) +* David Roazen (@droazen) +* Petr Danecek (@pd3) + +Past VCF/BCF maintainers include Ryan Poplin. + +### Htsget + +* Mike Lin (@mlin) + +[ga4gh-ff]: https://genomicsandhealth.org/working-groups/our-work/file-formats + + +## Generating PDF specification documents + +Use the _Makefile_ to generate PDFs from the TeX source documents. 
+Both TeX source and generated PDFs are checked into the **master** branch, so the make rules are set up to stage PDFs into a _new/_ subdirectory, from where they can be copied when you are ready to check them in. + +Most of the specifications use a _.ver_ file and associated rules to display a commit hash and datestamp on their title page. +(See _SAMv1.tex_ and _new/SAMv1.pdf_'s _Makefile_ dependencies for how to add this to other specifications.) +So the usual workflow when editing these documents is (for example, when working on the SAM specification): + +1. Edit _SAMv1.tex_, and type `make new/SAMv1.pdf` to generate a working PDF to preview your work. + +2. When you are ready, commit your _.tex_ source changes (but don't commit any changed PDF files yet). + +3. Type `make clean SAMv1.pdf` to regenerate the PDF and copy it to the main directory. +(Optionally, verify that it contains the correct commit hash for your source changes.) +Now commit your _.pdf_ changes, separately from any source changes. + +### Rationale + +It is a little inconvenient having the working PDFs down in a subdirectory, but this is outweighed by the convenience of being able to switch between Git branches etc without trouble — as there would be if updated working PDFs were in the main directory, overwriting the checked-in PDFs. + +The intention is that the commit hash embedded in a PDF encompasses all the source changes and commits that contribute to that PDF. +The hash of the particular commit that updates the PDF is of course not yet known when the PDF is being generated, so the best that can be done is the hash of a slightly-previous commit. +Therefore: +* The PDF needs to be committed separately from the corresponding TeX source changes. +* The PDF should not be updated in a merge commit (as commits from one or the other of the merge's parents will not be recorded), and there's not much point updating it in a pull request. 
+* So pull requests need to be merged, and then their PDFs updated separately as a non-merge commit on **master**. +* If a series of changes are being made or several pull requests are being merged at once, the PDF updates can be batched up and just made once at the end. +* Conversely, if there are changes pending to several (even unrelated) PDFs, there is no reason not to commit them all at once. + +If you are working on several PDFs at once, be careful in step 3 and perhaps use `make clean new/VCFv4.2.pdf new/VCFv4.3.pdf; make VCFv4.2.pdf VCFv4.3.pdf` to ensure that spurious “-dirty” commit hashes don't make their way into your PDFs. + + diff --git a/Makefile b/Makefile index f42930521..a5a06bdae 100644 --- a/Makefile +++ b/Makefile @@ -6,35 +6,40 @@ PDFS = BCFv1_qref.pdf \ CRAMv3.pdf \ CSIv1.pdf \ SAMv1.pdf \ + SAMtags.pdf \ tabix.pdf \ VCFv4.1.pdf \ VCFv4.2.pdf \ VCFv4.3.pdf -pdf: $(PDFS) +pdf: $(PDFS:%=new/%) -CRAMv2.1.pdf: CRAMv2.1.tex CRAMv2.1.ver -CRAMv3.pdf: CRAMv3.tex CRAMv3.ver -SAMv1.pdf: SAMv1.tex SAMv1.ver -VCFv4.1.pdf: VCFv4.1.tex VCFv4.1.ver -VCFv4.2.pdf: VCFv4.2.tex VCFv4.2.ver -VCFv4.3.pdf: VCFv4.3.tex VCFv4.3.ver +%.pdf: new/%.pdf + cp $^ $@ +new/CRAMv2.1.pdf: CRAMv2.1.tex new/CRAMv2.1.ver +new/CRAMv3.pdf: CRAMv3.tex new/CRAMv3.ver +new/SAMv1.pdf: SAMv1.tex new/SAMv1.ver +new/SAMtags.pdf: SAMtags.tex new/SAMtags.ver +new/VCFv4.1.pdf: VCFv4.1.tex new/VCFv4.1.ver +new/VCFv4.2.pdf: VCFv4.2.tex new/VCFv4.2.ver +new/VCFv4.3.pdf: VCFv4.3.tex new/VCFv4.3.ver -.SUFFIXES: .tex .pdf .ver -.tex.pdf: - pdflatex $< - while grep -q 'Rerun to get [a-z-]* right' $*.log; do pdflatex $< || exit; done -.tex.ver: +new/%.pdf: %.tex + pdflatex --output-directory new $< + while grep -q 'Rerun to get [a-z-]* right' new/$*.log; do pdflatex --output-directory new $< || exit; done + +new/%.ver: %.tex echo "@newcommand*@commitdesc{`git describe --always --dirty`}@newcommand*@headdate{`git rev-list -n1 --format=%aD HEAD $< | sed '1d;s/.*, *//;s/ *[0-9]*:.*//'`}" | tr @ \\ > 
$@ mostlyclean: - -rm -f *.aux *.idx *.log *.out *.toc *.ver + -cd new && rm -f *.aux *.idx *.log *.out *.toc *.ver clean: mostlyclean - -rm -f $(PDFS) + -cd new && rm -f $(PDFS) + -rm -rf _site .PHONY: all pdf mostlyclean clean diff --git a/README.md b/README.md index 80d433f83..6bff26cfd 100644 --- a/README.md +++ b/README.md @@ -1,47 +1,56 @@ SAM/BAM and related specifications ================================== -Quick links ------------ - - -[HTS-spec GitHub page](http://samtools.github.io/hts-specs/) -[SAMv1.pdf](http://samtools.github.io/hts-specs/SAMv1.pdf) -[CRAMv2.1.pdf](http://samtools.github.io/hts-specs/CRAMv2.1.pdf) -[CRAMv3.pdf](http://samtools.github.io/hts-specs/CRAMv3.pdf) -[BCFv1.pdf](http://samtools.github.io/hts-specs/BCFv1_qref.pdf) -[BCFv2.1.pdf](http://samtools.github.io/hts-specs/BCFv2_qref.pdf) -[CSIv1.pdf](http://samtools.github.io/hts-specs/CSIv1.pdf) -[tabix.pdf](http://samtools.github.io/hts-specs/tabix.pdf) -[VCFv4.1.pdf](http://samtools.github.io/hts-specs/VCFv4.1.pdf) -[VCFv4.2.pdf](http://samtools.github.io/hts-specs/VCFv4.2.pdf) +Links **in bold** point to the corresponding PDFs on this repository's [GitHub Pages website][hts-specs]. + +Please request improvements or report errors using this repository, but see also [the list of maintainers](MAINTAINERS.md) if you need to contact them directly. + Alignment data files -------------------- -**SAMv1.tex** is the canonical specification for the SAM (Sequence Alignment/Map) format, BAM (its binary equivalent), and the BAI format for indexing BAM files. +**[SAMv1.tex]** is the canonical specification for the SAM (Sequence Alignment/Map) format, BAM (its binary equivalent), and the BAI format for indexing BAM files. +**[SAMtags.tex]** is a companion specification describing the predefined standard optional fields and tags found in SAM, BAM, and CRAM files. These formats are discussed on the [samtools-devel mailing list][samdev-ml]. 
-**CRAMv3.tex** is the canonical specification for the CRAM format, while **CRAMv2.1.tex** describes its now-obsolete predecessor. +**[CRAMv3.tex]** is the canonical specification for the CRAM format, while **[CRAMv2.1.tex]** describes its now-obsolete predecessor. Further details can be found at [ENA's CRAM toolkit page][ena-cram]. CRAM discussions can also be found on the [samtools-devel mailing list][samdev-ml]. -The **tabix.tex** and **CSIv1.tex** quick references summarize more recent index formats: the [tabix] tool indexes generic textual genome position-sorted files, while CSI is [htslib]'s successor to the BAI index format. +The **[tabix.tex]** and **[CSIv1.tex]** quick references summarize more recent index formats: the tabix tool indexes generic textual genome position-sorted files, while CSI is [htslib]'s successor to the BAI index format. Variant calling data files -------------------------- -**VCFv4.1.tex** and **VCFv4.2.tex** are the canonical specifications for the Variant Call Format and its textual (VCF) and binary encodings (BCF 2.x). +**[VCFv4.3.tex]** is the canonical specification for the Variant Call Format and its textual (VCF) and binary (BCF) encodings, while **[VCFv4.1.tex]** and **[VCFv4.2.tex]** describe their predecessors. These formats are discussed on the [vcftools-spec mailing list][vcfspec-ml]. -**BCFv1_qref.tex** summarizes the obsolete BCF1 format historically produced by [samtools]. This format is no longer recommended for use, as it has been superseded by the more widely-implemented BCF2. +**[BCFv1_qref.tex]** summarizes the obsolete BCF1 format historically produced by [samtools]. This format is no longer recommended for use, as it has been superseded by the more widely-implemented BCF2. + +**[BCFv2_qref.tex]** is a quick reference describing just the layout of data within BCF2 files. 
+ +Transfer protocols +------------------ + +**[Htsget.md]** describes the _hts-get_ retrieval protocol, which enables parallel streaming access to data sharded across multiple URLs or files. -**BCFv2_qref.tex** is a quick reference describing just the layout of data within BCF2 files. +[SAMv1.tex]: http://samtools.github.io/hts-specs/SAMv1.pdf +[SAMtags.tex]: http://samtools.github.io/hts-specs/SAMtags.pdf +[CRAMv2.1.tex]: http://samtools.github.io/hts-specs/CRAMv2.1.pdf +[CRAMv3.tex]: http://samtools.github.io/hts-specs/CRAMv3.pdf +[CSIv1.tex]: http://samtools.github.io/hts-specs/CSIv1.pdf +[tabix.tex]: http://samtools.github.io/hts-specs/tabix.pdf +[VCFv4.1.tex]: http://samtools.github.io/hts-specs/VCFv4.1.pdf +[VCFv4.2.tex]: http://samtools.github.io/hts-specs/VCFv4.2.pdf +[VCFv4.3.tex]: http://samtools.github.io/hts-specs/VCFv4.3.pdf +[BCFv1_qref.tex]: http://samtools.github.io/hts-specs/BCFv1_qref.pdf +[BCFv2_qref.tex]: http://samtools.github.io/hts-specs/BCFv2_qref.pdf +[Htsget.md]: http://samtools.github.io/hts-specs/htsget.html [ena-cram]: http://www.ebi.ac.uk/ena/about/cram_toolkit [htslib]: https://github.com/samtools/htslib [samtools]: https://github.com/samtools/samtools -[tabix]: https://github.com/samtools/tabix +[hts-specs]: http://samtools.github.io/hts-specs/ [samdev-ml]: https://lists.sourceforge.net/lists/listinfo/samtools-devel [vcfspec-ml]: https://lists.sourceforge.net/lists/listinfo/vcftools-spec diff --git a/SAMtags.pdf b/SAMtags.pdf new file mode 100644 index 000000000..785715148 Binary files /dev/null and b/SAMtags.pdf differ diff --git a/SAMtags.tex b/SAMtags.tex new file mode 100644 index 000000000..c155bf83f --- /dev/null +++ b/SAMtags.tex @@ -0,0 +1,395 @@ +\documentclass[10pt]{article} +\usepackage[margin=1in]{geometry} +\usepackage{longtable} +\usepackage[pdfborder={0 0 0},hyperfootnotes=false]{hyperref} +\usepackage[title]{appendix} + +\newcommand{\mailtourl}[1]{\href{mailto:#1}{\tt #1}} +\newcommand{\tagvalue}[1]{\tt #1} 
+\newcommand{\tagregex}[1]{\tt #1} + +\begin{document} + +\input{SAMtags.ver} +\title{Sequence Alignment/Map Optional Fields Specification} +\author{The SAM/BAM Format Specification Working Group} +\date{\headdate} +\maketitle +\begin{quote}\small +The master version of this document can be found at +\url{https://github.com/samtools/hts-specs}.\\ +This printing is version~\commitdesc\ from that repository, +last modified on the date shown above. +\end{quote} +\vspace*{1em} + +\noindent +This document is a companion to the {\sl Sequence Alignment/Map Format +Specification} that defines the SAM and~BAM formats, and to the {\sl CRAM +Format Specification} that defines the CRAM format.\footnote{See +\href{http://samtools.github.io/hts-specs/SAMv1.pdf}{\tt SAMv1.pdf} and +\href{http://samtools.github.io/hts-specs/CRAMv3.pdf}{\tt CRAMv3.pdf} +at \url{https://github.com/samtools/hts-specs}.} +Alignment records in each of these formats may contain a number of optional +fields, each labelled with a {\it tag\/} identifying that field's data. +This document describes each of the predefined standard tags, and discusses +conventions around creating new tags. + +\section{Standard tags} + +Predefined standard tags are listed in the following table and described +in greater detail in later subsections. +Optional fields are usually displayed as {\tt TAG:TYPE:VALUE}; the {\it type\/} +may be one of +{\tt A} (character), +{\tt B} (general array), +{\tt f} (real number), +{\tt H} (hexadecimal array), +{\tt i} (integer), +or +{\tt Z} (string). 
+ +\begin{center}\small +\begin{longtable}{ccp{12.5cm}} + \hline + {\bf Tag} & {\bf Type} & {\bf Description} \\ + \hline + {\tt AM} & i & The smallest template-independent mapping quality of segments in the rest \\ + {\tt AS} & i & Alignment score generated by aligner \\ + {\tt BC} & Z & Barcode sequence identifying the sample \\ + {\tt BQ} & Z & Offset to base alignment quality (BAQ) \\ + {\tt BZ} & Z & Phred quality of the unique molecular barcode bases in the {\tt OX} tag \\ + {\tt CC} & Z & Reference name of the next hit \\ + {\tt CM} & i & Edit distance between the color sequence and the color reference (see also {\tt NM}) \\ + {\tt CO} & Z & Free-text comments \\ + {\tt CP} & i & Leftmost coordinate of the next hit \\ + {\tt CQ} & Z & Color read base qualities \\ + {\tt CS} & Z & Color read sequence \\ + {\tt CT} & Z & Complete read annotation tag, used for consensus annotation dummy features \\ + {\tt E2} & Z & The 2nd most likely base calls \\ + {\tt FI} & i & The index of segment in the template \\ + {\tt FS} & Z & Segment suffix \\ + {\tt FZ} & B,S & Flow signal intensities \\ + {\tt GC} & ? & Reserved for backwards compatibility reasons \\ + {\tt GQ} & ? & Reserved for backwards compatibility reasons \\ + {\tt GS} & ? & Reserved for backwards compatibility reasons \\ + {\tt H0} & i & Number of perfect hits \\ + {\tt H1} & i & Number of 1-difference hits (see also {\tt NM}) \\ + {\tt H2} & i & Number of 2-difference hits \\ + {\tt HI} & i & Query hit index \\ + {\tt IH} & i & Number of stored alignments in SAM that contains the query in the current record \\ + {\tt LB} & Z & Library \\ + {\tt MC} & Z & CIGAR string for mate/next segment \\ + {\tt MD} & Z & String for mismatching positions \\ + {\tt MF} & ? 
& Reserved for backwards compatibility reasons \\ + {\tt MI} & Z & Molecular identifier; a string that uniquely identifies the molecule from which the record was derived \\ + {\tt MQ} & i & Mapping quality of the mate/next segment \\ + {\tt NH} & i & Number of reported alignments that contains the query in the current record \\ + {\tt NM} & i & Edit distance to the reference \\ + {\tt OC} & Z & Original CIGAR \\ + {\tt OP} & i & Original mapping position \\ + {\tt OQ} & Z & Original base quality \\ + {\tt OX} & Z & Original unique molecular barcode bases \\ + {\tt PG} & Z & Program \\ + {\tt PQ} & i & Phred likelihood of the template \\ + {\tt PT} & Z & Read annotations for parts of the padded read sequence \\ + {\tt PU} & Z & Platform unit \\ + {\tt Q2} & Z & Phred quality of the mate/next segment sequence in the {\tt R2} tag \\ + {\tt QT} & Z & Phred quality of the sample-barcode sequence in the {\tt BC} (or {\tt RT}) tag \\ + {\tt QX} & Z & Quality score of the unique molecular identifier in the {\tt RX} tag \\ + {\tt R2} & Z & Sequence of the mate/next segment in the template \\ + {\tt RG} & Z & Read group \\ + {\tt RT} & Z & Barcode sequence (deprecated; use {\tt BC} instead) \\ + {\tt RX} & Z & Sequence bases of the (possibly corrected) unique molecular identifier \\ + {\tt SA} & Z & Other canonical alignments in a chimeric alignment \\ + {\tt SM} & i & Template-independent mapping quality \\ + {\tt SQ} & ? & Reserved for backwards compatibility reasons \\ + {\tt S2} & ? & Reserved for backwards compatibility reasons \\ + {\tt TC} & i & The number of segments in the template \\ + {\tt U2} & Z & Phred probability of the 2nd call being wrong conditional on the best being wrong \\ + {\tt UQ} & i & Phred likelihood of the segment, conditional on the mapping being correct \\ + {\tt X?} & ? & Reserved for end users \\ + {\tt Y?} & ? & Reserved for end users \\ + {\tt Z?} & ? 
& Reserved for end users \\ + \hline +\end{longtable} +\end{center} + +\subsection{Additional Template and Mapping data} + +\begin{description} +\item[AM:i:\tagvalue{int}] +The smallest template-independent mapping quality of segments in the rest. + +\item[AS:i:\tagvalue{score}] +Alignment score generated by aligner. + +\item[BQ:Z:\tagvalue{qualities}] +Offset to base alignment quality (BAQ), of the same length as the read sequence. +At the $i$-th read base, ${\rm BAQ}_i=Q_i-({\rm BQ}_i-64)$ where $Q_i$ is the $i$-th base quality. + +\item[CC:Z:\tagvalue{rname}] +Reference name of the next hit; `{\tt =}' for the same chromosome. + +\item[CP:i:\tagvalue{pos}] +Leftmost coordinate of the next hit. + +\item[E2:Z:\tagvalue{bases}] +The 2nd most likely base calls. Same encoding and same length as {\sf SEQ}. +See also {\tt U2} for associated quality values. + +\item[FI:i:\tagvalue{int}] +The index of segment in the template. + +\item[FS:Z:\tagvalue{str}] +Segment suffix. + +\item[H0:i:\tagvalue{count}] +Number of perfect hits. + +\item[H1:i:\tagvalue{count}] +Number of 1-difference hits (see also {\tt NM}). + +\item[H2:i:\tagvalue{count}] +Number of 2-difference hits. + +\item[HI:i:\emph{i}] +Query hit index, indicating the alignment record is the $i$-th one stored +in SAM. + +\item[IH:i:\tagvalue{count}] +Number of stored alignments in SAM that contains the query in the current +record. + +\item[MC:Z:\tagvalue{cigar}] +CIGAR string for mate/next segment. + +\item[MD:Z:\tagregex{[0-9]+(([A-Z]|\char92\char94[A-Z]+)[0-9]+)*}] +String for mismatching positions. + +The {\tt MD} field aims to achieve SNP/indel calling without +looking at the reference. 
For example, a string `{\tt 10A5\char94AC6}' means +from the leftmost reference base in the alignment, there are 10 matches +followed by an A on the reference which is different from the aligned read +base; the next 5 reference bases are matches followed by a 2bp deletion from +the reference; the deleted sequence is AC; the last 6~bases are matches. +The {\tt MD} field ought to match the {\sf CIGAR} string. + +\item[MQ:i:\tagvalue{}] +Mapping quality of the mate/next segment. + +\item[NH:i:\tagvalue{}] +Number of reported alignments that contain the query in the current record. + +\item[NM:i:\tagvalue{}] +Edit distance to the reference, including ambiguous bases but excluding clipping. + +\item[PQ:i:\tagvalue{}] +Phred likelihood of the template, conditional on both mappings being correct. + +\item[Q2:Z:\tagvalue{qualities}] +Phred quality of the mate/next segment sequence in the {\tt R2} tag. +Same encoding as {\sf QUAL}. + +\item[R2:Z:\tagvalue{bases}] +Sequence of the mate/next segment in the template. See also {\tt Q2} +for any associated quality values. + +\item[SA:Z:\tagregex{{\tt (}\emph{rname}{\tt ,}\emph{pos}{\tt ,}\emph{strand}{\tt ,}\emph{CIGAR}{\tt ,}\emph{mapQ}{\tt ,}\emph{NM}{\tt ;)}+}] +Other canonical alignments in a chimeric alignment, formatted as a semicolon-delimited list. +Each element in the list represents a part of the chimeric alignment. Conventionally, at a supplementary line, the first element points to the primary line. + +\item[SM:i:\tagvalue{}] +Template-independent mapping quality. + +\item[TC:i:\tagvalue{}] +The number of segments in the template. + +\item[U2:Z:\tagvalue{}] +Phred probability of the 2nd call being wrong conditional on the best being wrong. +The same encoding and length as {\sf QUAL}. See also {\tt E2} for associated base calls. + +\item[UQ:i:\tagvalue{}] +Phred likelihood of the segment, conditional on the mapping being correct. 
+\end{description} + +\subsection{Metadata} + +\begin{description} +\item[RG:Z:\tagvalue{readgroup}] +The read group to which the read belongs. +If {\tt @RG} headers are present, then \emph{readgroup} must match the +{\tt RG-ID} field of one of the headers. + +\item[LB:Z:\tagvalue{library}] +The library from which the read has been sequenced. +If {\tt @RG} headers are present, then \emph{library} must match the +{\tt RG-LB} field of one of the headers. + +\item[PG:Z:\tagvalue{}] +Program. Value matches the header {\tt PG-ID} tag if {\tt @PG} is present. + +\item[PU:Z:\tagvalue{platformunit}] +The platform unit in which the read was sequenced. +If {\tt @RG} headers are present, then \emph{platformunit} must match the +{\tt RG-PU} field of one of the headers. + +\item[CO:Z:\tagvalue{text}] +Free-text comments. +\end{description} + +\subsection{Barcodes} + +\begin{description} +\item[BC:Z:\tagvalue{sequence}] +Barcode sequence (identifying the sample/library), with any quality scores (optionally) stored in the {\tt QT} tag. +The {\tt BC} tag should match the {\tt QT} tag in length. +In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes and places a hyphen (`{\tt -}') between the barcodes from the same template. + +\item[QT:Z:\tagvalue{qualities}] +Phred quality of the sample-barcode sequence in the {\tt BC} (or {\tt RT}) tag. +Same encoding as {\sf QUAL}, i.e., Phred score + 33. +In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the quality strings with spaces (`{\tt \textvisiblespace}') between the different strings from the same template. + +\item[RX:Z:\tagvalue{sequence+}] +Sequence bases from the unique molecular identifier. +These could be either corrected or uncorrected. Unlike {\tt MI}, the value may be non-unique in the file. +The value should consist of a sequence of bases. 
+In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes with a hyphen (`{\tt -}') between the different barcodes. + +If the bases represent corrected bases, the original sequence can be stored in {\tt OX} (similar to {\tt OQ} storing the original qualities of bases). + +\item[QX:Z:\tagvalue{qualities+}] +Phred quality of the unique molecular identifier sequence in the {\tt RX} tag. +Same encoding as {\sf QUAL}, i.e., Phred score + 33. +The qualities here may have been corrected (raw bases and qualities can be stored in {\tt OX} and {\tt BZ} respectively). +The lengths of the {\tt QX} and the {\tt RX} tags must match. +In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the quality strings with a space (`{\tt \textvisiblespace}') between the different strings. + +\item[MI:Z:\tagvalue{str}] +Molecular Identifier. +A unique ID within the SAM file for the source molecule from which this read is derived. +All reads with the same {\tt MI} tag represent the group of reads derived from the same source molecule. + +\item[OX:Z:\tagvalue{sequence+}] +Raw (uncorrected) unique molecular identifier bases, with any quality scores (optionally) stored in the {\tt BZ} tag. +In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the barcodes with a hyphen (`{\tt -}') between the different barcodes. + +\item[BZ:Z:\tagvalue{qualities+}] +Phred quality of the (uncorrected) unique molecular identifier sequence in the {\tt OX} tag. +Same encoding as {\sf QUAL}, i.e., Phred score + 33. +The {\tt OX} tag should match the {\tt BZ} tag in length. 
+In the case of multiple unique molecular identifiers (e.g., one on each end of the template) the recommended implementation concatenates all the quality strings with a space (`{\tt \textvisiblespace}') between the different strings. + +\item[RT:Z:\tagvalue{sequence}] +Deprecated alternative to {\tt BC} tag originally used at Sanger. +\end{description} + +\subsection{Original data} + +\begin{description} +\item[OC:Z:\tagvalue{cigar}] +Original CIGAR, usually before realignment. + +\item[OP:i:\tagvalue{pos}] +Original mapping position, usually before realignment. + +\item[OQ:Z:\tagvalue{qualities}] +Original base quality, usually before recalibration. +Same encoding as {\sf QUAL}. +\end{description} + +\subsection{Annotation and Padding} + +\begin{description} +\item[CT:Z:\tagregex{\emph{strand};\emph{type}(;\emph{key}(=\emph{value}))*}] +Complete read annotation tag, used for consensus annotation dummy features. + +The {\tt CT} tag is intended primarily for annotation +dummy reads, and consists of a \emph{strand}, \emph{type} and zero or +more \emph{key}=\emph{value} pairs, each separated with semicolons. +The \emph{strand} field has four values as in GFF3, and supplements FLAG +bit 0x10 to allow unstranded (`{\tt .}'), and stranded but unknown strand +(`{\tt ?}') annotation. For these and annotation on the forward strand +(\emph{strand} set to `{\tt +}'), do not set FLAG bit 0x10. For +annotation on the reverse strand, set the \emph{strand} to `{\tt -}' +and set FLAG bit 0x10. + +The \emph{type} and any \emph{keys} and their +optional \emph{values} are all percent encoded according to +RFC3986 to escape meta-characters `{\tt =}', `{\tt \%}', `{\tt ;}', +`{\tt |}' or non-printable characters not matched by the isprint() +macro (with the C locale). For example, a percent sign becomes +`{\tt \%25}'. +%NOTE - This leaves open the possibility of allowing multiple such +%entries for a single CT tag to be combined with | as in the PT tag. 
+ +\item[PT:Z:\tagregex{\tt \emph{start};\emph{end};\emph{strand};\emph{type}(;\emph{key}(=\emph{value}))*(\char92|\emph{start};\emph{end};\emph{strand};\emph{type}(;\emph{key}(=\emph{value}))*)*}] +Read annotations for parts of the padded read sequence. + +The {\tt PT} tag value has the format of a series of +tags separated by `{\tt |}', each annotating a sub-region of the read. +Each tag consists of \emph{start}, \emph{end}, \emph{strand}, +\emph{type} and zero or more \emph{key}{\tt =}\emph{value} pairs, each +separated with semicolons. \emph{Start} and \emph{end} are 1-based +positions between one and the sum of the {\tt M/I/D/P/S/=/X} +{\sf CIGAR} operators, i.e. {\sf SEQ} length plus any pads. Note +any editing of the CIGAR string may require updating the `{\tt PT}' +tag coordinates, or even invalidate them. +As in GFF3, \emph{strand} is one of `{\tt +}' for forward strand tags, +`{\tt -}' for reverse strand, `{\tt .}' for unstranded or `{\tt ?}' +for stranded but unknown strand. +The \emph{type} and any \emph{keys} and their optional \emph{values} +are all percent encoded as in the {\tt CT} tag. +\end{description} + +\subsection{Technology-specific data} + +\begin{description} +\item[FZ:B,S:\tagvalue{intensities}] +Flow signal intensities on the original strand of the read, stored as {\tt (uint16\_t) round(value * 100.0)}. +\end{description} + +\subsubsection{Color space} + +% TODO Describe color space and the encoding here. + +\begin{description} +\item[CM:i:\tagvalue{distance}] +Edit distance between the color sequence and the color reference (see also {\tt NM}). + +\item[CS:Z:\tagvalue{sequence}] +Color read sequence on the original strand of the read. The primer base must be included. + +\item[CQ:Z:\tagvalue{qualities}] +Color read quality on the original strand of the read. Same encoding as {\sf QUAL}; same length as {\tt CS}. +\end{description} + +\section{Locally-defined tags} + +You can freely add new tags. 
+Note that tags starting with `{\tt X}', `{\tt Y}', or `{\tt Z}' and tags +containing lowercase letters in either position are reserved for local use +and will not be formally defined in any future version of this specification. + +If a new tag may be of general interest, it may be useful to have it added +to this specification. Additions can be proposed by opening a new issue at +\url{https://github.com/samtools/hts-specs/issues} and/or by sending email +to \mailtourl{samtools-devel@lists.sourceforge.net}. + +\begin{appendices} +\appendix +\section{SAM Tags History}\label{sec:history} + +This lists the date of each tagged SAM version along with changes that +have been made while that version was current. + +\subsection*{1.5: 23 May 2013 to current} +\begin{itemize} +\item Add UMI-related tags (RX, QX, OX, BZ, MI) and clarify usage of sample barcode tag BC. (August 2017) +\item SAMtags.txt (this file) created with tags from SAMv1 +\end{itemize} + +\end{appendices} + +\end{document} diff --git a/SAMv1.pdf b/SAMv1.pdf new file mode 100644 index 000000000..8a15ebf9d Binary files /dev/null and b/SAMv1.pdf differ diff --git a/SAMv1.tex b/SAMv1.tex index 129b9e691..27646a8ac 100644 --- a/SAMv1.tex +++ b/SAMv1.tex @@ -5,7 +5,9 @@ \usepackage{framed} \usepackage{enumitem} \usepackage{longtable} +\usepackage{makecell} \usepackage[pdfborder={0 0 0},hyperfootnotes=false]{hyperref} +\usepackage[title]{appendix} \makeindex @@ -34,6 +36,10 @@ \section{The SAM Format Specification} information such as mapping position, and variable number of optional fields for flexible or aligner specific information. +This specification is for version 1.5 of the SAM and BAM formats. Each SAM and +BAM file may optionally specify the version being used via the +{\tt @HD VN} tag. For full version history see Appendix~\ref{sec:history}. + \subsection{An example}\label{sec:example} Suppose we have the following alignment with bases in lower cases clipped from the alignment.
Read {\tt r001/1} and {\tt r001/2} @@ -193,10 +199,28 @@ \subsection{The header section} grouped by {\sf QNAME}), and {\tt reference} (alignments are grouped by {\sf RNAME}/{\sf POS}).\\\cline{1-3} \multicolumn{2}{|l}{\tt @SQ} & Reference sequence dictionary. The order of {\tt @SQ} lines defines the alignment sorting order.\\\cline{2-3} - & {\tt SN}* & Reference sequence name. Each {\tt @SQ} line must have a unique {\tt SN} tag. The value of this - field is used in the + & {\tt SN}* & Reference sequence name. +The {\tt SN} tags and all individual {\tt AN} names in all {\tt @SQ} lines +must be distinct. + The value of this field is used in the alignment records in {\sf RNAME} and {\sf RNEXT} fields. Regular expression: {\tt [!-)+-\char60\char62-\char126][!-\char126]*}\\\cline{2-3} & {\tt LN}* & Reference sequence length. \emph{Range}: {\tt [1,2$^{31}$-1]}\\\cline{2-3} + & {\tt AH} & Indicates that this sequence is an alternate locus.% +\footnote{See \url{https://www.ncbi.nlm.nih.gov/grc/help/definitions} for descriptions of \emph{alternate locus} and \emph{primary assembly}.} + The value is the locus in the primary assembly for which this sequence is an alternative, in the format `\emph{chr}{\tt :}\emph{start}{\tt -}\emph{end}', `\emph{chr}' (if known), or `{\tt *}' (if unknown), where `\emph{chr}' is a sequence in the primary assembly. + Must not be present on sequences in the primary assembly.\\\cline{2-3} + & {\tt AN} & Alternative reference sequence names. +A comma-separated list of alternative names that tools may use when referring +to this reference sequence.% +\footnote{For example, given `{\tt @SQ SN:MT AN:chrMT,M,chrM LN:16569}', +tools can ensure that a user's request for any of `MT', `chrMT', `M', +or~`chrM' succeeds and refers to the same sequence. 
+Note the restricted set of characters allowed in an alternative name.} +These alternative names are not used elsewhere within the SAM file; +in particular, they must not appear in alignment records' {\sf RNAME} +or~{\sf RNEXT} fields. +\emph{Regular expression}: \emph{name}{\tt (,}\emph{name}{\tt )*} +where \emph{name} is {\tt [0-9A-Za-z][0-9A-Za-z*+.@\_|-]*}\\\cline{2-3} & {\tt AS} & Genome assembly identifier. \\\cline{2-3} & {\tt M5} & MD5 checksum of the sequence in the uppercase, excluding spaces but including pads (as `*'s).\\\cline{2-3} & {\tt SP} & Species.\\\cline{2-3} @@ -321,11 +345,18 @@ \subsection{The alignment section: mandatory fields}\label{sec:alnrecord} segment has been mapped. When 0x4 is set, this indicates whether the unmapped read is stored in its original orientation as it came off the sequencing machine. - \item If 0x40 and 0x80 are both set, the read is part of a linear + \item Bits 0x40 and 0x80 reflect the read ordering within each template + inherent in the sequencing technology used.\footnote{For example, + in Illumina paired-end sequencing, {\sf first}~(0x40) corresponds to + the R1~`forward' read and {\sf last}~(0x80) to the R2~`reverse' read. + (Despite the terminology, this is unrelated to the segments' orientations + when they are mapped: either, neither, or both may have their + {\sf reverse} flag bits~(0x10) set after mapping.)} + If 0x40 and 0x80 are both set, the read is part of a linear template, but it is neither the first nor the last read. If both 0x40 and 0x80 are unset, the index of the read in the template - is unknown. This may happen for a non-linear template or the index - is lost in data processing. + is unknown. This may happen for a non-linear template or when this + information is lost during data processing. \item If 0x1 is unset, no assumptions can be made about 0x2, 0x8, 0x20, 0x40 and 0x80. \item Bits that are not listed in the table are reserved for future use. 
@@ -339,8 +370,9 @@ \subsection{The alignment section: mandatory fields}\label{sec:alnrecord} also have an ordinary coordinate such that it can be placed at a desired position after sorting. If {\sf RNAME} is `*', no assumptions can be made about {\sf POS} and {\sf CIGAR}. -\item {\sf POS}: 1-based leftmost mapping POSition of the first matching - base. The first base in a reference sequence has coordinate 1. {\sf +\item {\sf POS}: 1-based leftmost mapping POSition of the first {\sf + CIGAR} operation that ``consumes'' a reference base (see table below). + The first base in a reference sequence has coordinate 1. {\sf POS} is set as 0 for an unmapped read without coordinate. If {\sf POS} is 0, no assumptions can be made about {\sf RNAME} and {\sf CIGAR}. @@ -351,23 +383,26 @@ \subsection{The alignment section: mandatory fields}\label{sec:alnrecord} \item {\sf CIGAR}: CIGAR string. The CIGAR operations are given in the following table (set `*' if unavailable): \begin{center}\small - \begin{tabular}{ccl} + \begin{tabular}{cclcc} \hline - Op & BAM & Description\\ + Op & BAM & Description & \makecell{Consumes \\ query} & \makecell{Consumes \\ reference}\\ \hline - {\tt M} & 0 & alignment match (can be a sequence match or mismatch)\\ - {\tt I} & 1 & insertion to the reference \\ - {\tt D} & 2 & deletion from the reference \\ - {\tt N} & 3 & skipped region from the reference \\ - {\tt S} & 4 & soft clipping (clipped sequences present in {\sf SEQ})\\ - {\tt H} & 5 & hard clipping (clipped sequences NOT present in {\sf SEQ})\\ - {\tt P} & 6 & padding (silent deletion from padded reference)\\ - {\tt =} & 7 & sequence match \\ - {\tt X} & 8 & sequence mismatch \\ + {\tt M} & 0 & alignment match (can be a sequence match or mismatch) & yes & yes \\ + {\tt I} & 1 & insertion to the reference & yes & no \\ + {\tt D} & 2 & deletion from the reference & no & yes \\ + {\tt N} & 3 & skipped region from the reference & no & yes \\ + {\tt S} & 4 & soft clipping (clipped sequences 
present in {\sf SEQ}) & yes & no \\ + {\tt H} & 5 & hard clipping (clipped sequences NOT present in {\sf SEQ}) & no & no \\ + {\tt P} & 6 & padding (silent deletion from padded reference) & no & no \\ + {\tt =} & 7 & sequence match & yes & yes \\ + {\tt X} & 8 & sequence mismatch & yes & yes \\ \hline \end{tabular} \end{center} \begin{itemize} + \item ``Consumes query'' and ``consumes reference'' indicate + whether the CIGAR operation causes the alignment to step along the + query sequence and the reference sequence respectively. \item {\tt H} can only be present as the first and/or last operation. \item {\tt S} may only have {\tt H} operations between them and the ends of the {\sf CIGAR} string. @@ -427,8 +462,8 @@ \subsection{The alignment section: optional fields}\label{sec:alnaux} A & {\tt [!-\char126]} & Printable character \\ i & {\tt [-+]?[0-9]+} & Signed integer\footnotemark\\ f & {\tt [-+]?[0-9]*\char92.?[0-9]+([eE][-+]?[0-9]+)?} & Single-precision floating number \\ -Z & {\tt [\,\,\,!-\char126]+} & Printable string, including space\\ -H & {\tt [0-9A-F]+} & Byte array in the Hex format\footnotemark\\ +Z & {\tt [\,\,\,!-\char126]*} & Printable string, including space\\ +H & {\tt ([0-9A-F][0-9A-F])*} & Byte array in the Hex format\footnotemark\\ B & {\tt [cCsSiIf](,[-+]?[0-9]*\char92.?[0-9]+([eE][-+]?[0-9]+)?)+} & Integer or numeric array\\ \hline \end{tabular} @@ -447,119 +482,15 @@ \subsection{The alignment section: optional fields}\label{sec:alnaux} may be changed if the new type is also compatible with the array. \footnotetext{Explicit typing eases format parsing and helps to reduce the file size when SAM is converted to BAM.} -{Predefined tags are shown in the following table. You can - freely add new tags, and if a new tag may be of general interest, you - can email {\tt samtools-devel@lists.sourceforge.net} to add the new tag - to the specification. 
Note that tags starting with `{\tt X}', `{\tt Y}' - and `{\tt Z}' or tags containing lowercase letters in either position - are reserved for local use and will not be formally - defined in any future version of this specification.} -\begin{center}\small -\begin{longtable}{ccp{12.5cm}} - \hline - {\bf Tag\footnotemark} & {\bf Type} & {\bf Description} \\ - \hline - {\tt X?} & ? & Reserved fields for end users (together with {\tt Y?} and {\tt Z?}) \\ - {\tt AM} & i & The smallest template-independent mapping quality of segments in the rest \\ - {\tt AS} & i & Alignment score generated by aligner \\ - {\tt BC} & Z & Barcode sequence, with any quality scores stored in the {\tt QT} tag. \\ - {\tt BQ} & Z & Offset to base alignment quality (BAQ), of the same length as the read sequence. - At the $i$-th read base, ${\rm BAQ}_i=Q_i-({\rm BQ}_i-64)$ where $Q_i$ is the $i$-th base quality. \\ - {\tt CC} & Z & Reference name of the next hit; `{\tt =}' for the same chromosome \\ - {\tt CM} & i & Edit distance between the color sequence and the color reference (see also {\tt NM})\\ - {\tt CO} & Z & Free-text comments \\ - {\tt CP} & i & Leftmost coordinate of the next hit \\ - {\tt CQ} & Z & Color read quality on the original strand of the read. Same encoding as {\sf QUAL}; same length as {\tt CS}.\\ - {\tt CS} & Z & Color read sequence on the original strand of the read. The primer base must be included.\\ - {\tt CT} & Z & Complete read annotation tag, used for consensus annotation dummy features.\footnotemark\\ - {\tt E2} & Z & The 2nd most likely base calls. Same encoding and same length as {\sf QUAL}.\\ - {\tt FI} & i & The index of segment in the template.\\ - {\tt FS} & Z & Segment suffix.\\ - {\tt FZ} & B,S & Flow signal intensities on the original strand of the read, stored as {\tt (uint16\_t) round(value * 100.0)}. \\ - {\tt LB} & Z & Library. 
Value to be consistent with the header {\tt RG-LB} tag if {\tt @RG} is present.\\ - {\tt H0} & i & Number of perfect hits\\ - {\tt H1} & i & Number of 1-difference hits (see also {\tt NM})\\ - {\tt H2} & i & Number of 2-difference hits \\ - {\tt HI} & i & Query hit index, indicating the alignment record is the $i$-th one stored in SAM\\ - {\tt IH} & i & Number of stored alignments in SAM that contains the query in the current record\\ - {\tt MC} & Z & CIGAR string for mate/next segment\\ - {\tt MD} & Z & String for mismatching positions. \emph{Regex}: {\tt [0-9]+(([A-Z]|\char92\char94[A-Z]+)[0-9]+)*}\footnotemark\\ - {\tt MQ} & i & Mapping quality of the mate/next segment \\ - {\tt NH} & i & Number of reported alignments that contains the query in the current record\\ - {\tt NM} & i & Edit distance to the reference, including ambiguous bases but excluding clipping\\ - {\tt OQ} & Z & Original base quality (usually before recalibration). Same encoding as {\sf QUAL}.\\ - {\tt OP} & i & Original mapping position (usually before realignment) \\ - {\tt OC} & Z & Original CIGAR (usually before realignment) \\ - {\tt PG} & Z & Program. Value matches the header {\tt PG-ID} tag if {\tt @PG} is present. \\ - {\tt PQ} & i & Phred likelihood of the template, conditional on both the mapping being correct \\ - {\tt PT} & Z & Read annotations for parts of the padded read sequence\footnotemark\\ - {\tt PU} & Z & Platform unit. Value to be consistent with the header {\tt RG-PU} tag if {\tt @RG} is present.\\ - {\tt QT} & Z & Phred quality of the barcode sequence in the {\tt BC} (or {\tt RT}) tag. Same encoding as {\sf QUAL}. \\ - {\tt Q2} & Z & Phred quality of the mate/next segment sequence in the {\tt R2} tag. Same encoding as {\sf QUAL}.\\ - {\tt R2} & Z & Sequence of the mate/next segment in the template. \\ - {\tt RG} & Z & Read group. Value matches the header {\tt RG-ID} tag if {\tt @RG} is present in the header. 
\\ - {\tt RT} & Z & Deprecated alternative to {\tt BC} tag originally used at Sanger. \\ - {\tt SA} & Z & Other canonical alignments in a chimeric alignment, formatted as a semicolon-delimited list: - {\tt (}\emph{rname}{\tt ,}\emph{pos}{\tt ,}\emph{strand}{\tt ,}\emph{CIGAR}{\tt ,}\emph{mapQ}{\tt ,}\emph{NM}{\tt ;)}+. - Each element in the list represents a part of the chimeric alignment. Conventionally, at a supplementary line, - the first element points to the primary line.\\ - {\tt SM} & i & Template-independent mapping quality \\ - {\tt TC} & i & The number of segments in the template.\\ - {\tt U2} & Z & Phred probility of the 2nd call being wrong conditional on the best being wrong. The same encoding as {\sf QUAL}. \\ - {\tt UQ} & i & Phred likelihood of the segment, conditional on the mapping being correct \\ - \hline -\end{longtable} -\end{center} -\addtocounter{footnote}{-3} -\footnotetext{The {\tt GS}, {\tt GC}, {\tt GQ}, {\tt MF}, {\tt S2} - and {\tt SQ} are reserved for backward compatibility.} -\stepcounter{footnote} -\footnotetext{The {\tt CT} tag is intended primarily for annotation -dummy reads, and consists of a \emph{strand}, \emph{type} and zero or -more \emph{key}=\emph{value} pairs, each separated with semicolons. -The \emph{strand} field has four values as in GFF3, and supplements FLAG -bit 0x10 to allow unstranded (`{\tt .}'), and stranded but unknown strand -(`{\tt ?}') annotation. For these and annotation on the forward strand -(\emph{strand} set to `{\tt +}'), do not set FLAG bit 0x10. For -annotation on the reverse strand, set the \emph{strand} to `{\tt -}' -and set FLAG bit 0x10. The \emph{type} and any \emph{keys} and their -optional \emph{values} are all percent encoded according to -RFC3986 to escape meta-characters `{\tt =}', `{\tt \%}', `{\tt ;}', -`{\tt |}' or non-printable characters not matched by the isprint() -macro (with the C locale). For example a percent sign becomes -`{\tt \%2C}'. 
The CT record matches: -``{\tt \emph{strand};\emph{type}(;\emph{key}(=\emph{value}))*}''. -%NOTE - This leaves open the possibility of allowing multiple such -%entries for a single CT tag to be combined with | as in the PT tag. -}%End of CT tag footnote -\stepcounter{footnote} -\footnotetext{The {\tt MD} field aims to achieve SNP/indel calling without -looking at the reference. For example, a string `{\tt 10A5\char94AC6}' means -from the leftmost reference base in the alignment, there are 10 matches -followed by an A on the reference which is different from the aligned read -base; the next 5 reference bases are matches followed by a 2bp deletion from -the reference; the deleted sequence is AC; the last 6~bases are matches. -The {\tt MD} field ought to match the {\sf CIGAR} string.} -\stepcounter{footnote} -\footnotetext{The {\tt PT} tag value has the format of a series of -tags separated by {\tt |}, each annotating a sub-region of the read. -Each tag consists of \emph{start}, \emph{end}, \emph{strand}, -\emph{type} and zero or more \emph{key}=\emph{value} pairs, each -separated with semicolons. \emph{Start} and \emph{end} are 1-based -positions between one and the sum of the {\tt M/I/D/P/S/=/X} -{\sf CIGAR} operators, i.e. {\sf SEQ} length plus any pads. Note -any editing of the CIGAR string may require updating the `{\tt PT}' -tag coordinates, or even invalidate them. -As in GFF3, \emph{strand} is one of `{\tt +}' for forward strand tags, -`{\tt -}' for reverse strand, `{\tt .}' for unstranded or `{\tt ?}' -for stranded but unknown strand. -The \emph{type} and any \emph{keys} and their optional \emph{values} -are all percent encoded as in the {\tt CT} tag. -Formally the entire PT record matches: - ``{\tt \emph{start};\emph{end};\emph{strand};\emph{type}(;\emph{key}(=\emph{value}))*(\char92|\emph{start};\emph{end};\emph{strand};\emph{type}(;\emph{key}(=\emph{value}))*)*}''. 
- }%End of PT tag footnote - - +Predefined tags are described in the separate {\sl Sequence Alignment/Map +Optional Fields Specification}.\footnote{See +\href{http://samtools.github.io/hts-specs/SAMtags.pdf}{\tt SAMtags.pdf} +at \url{https://github.com/samtools/hts-specs}.} +See that document for details of existing standard tag fields and conventions +around creating new tags that may be of general interest. +Tags starting with `{\tt X}', `{\tt Y}' or `{\tt Z}' and tags containing +lowercase letters in either position are reserved for local use and will +not be formally defined in any future version of these specifications. \pagebreak @@ -1133,4 +1064,75 @@ \subsection{C source code for computing bin number and overlapping bins}\label{s \end{verbatim} } +\pagebreak + +\begin{appendices} +\appendix +\section{SAM Version History}\label{sec:history} + +This lists the date of each tagged SAM version along with changes that +have been made while that version was current. The key changes +that caused the version number to change are shown in bold. + +Note the auxiliary tags have now moved to their own +specification with its own version numbering.\footnote{ +\href{http://samtools.github.io/hts-specs/SAMtags.pdf}{http://samtools.github.io/hts-specs/SAMtags.pdf}} + +\subsection*{1.5: 23 May 2013 to current} + +\begin{itemize} +\item Add {\tt @SQ AH} header tag. (Mar 2017) +\item Auxiliary tags migrated to SAMtags document. (Sep 2016) +\item Z and H auxiliary tags are permitted to be zero length. (Jun 2016) +\item QNAME limited to 254 bytes (was 255). (Aug 2015) +\item Generalise 0x200 flag bit as filtered-out bit. (Aug 2015) +\item Add {\tt @HD GO} for group order. (Mar 2015) +\item Add {\tt ONT} to the {\tt @RG PL} and {\tt @RG PM} header tags. (Mar 2015) +\item Add meaning to reverse FLAG on unmapped reads. (Mar 2015) +\item Document the {\tt idxstats} .bai elements. (Nov 2014) +\item Addition of CSI index. (Sep 2014) +\item Add {\tt MC} auxiliary tag. 
(Dec 2013) +\item Add {\tt @PG DS} header field. (Dec 2013) +\item Document the BAM EOF byte values. (Dec 2013) +\item Glossary of alignment types. (May 2013) +\item Add {\tt SA:Z} tag; PNEXT/RNEXT points to next read, not + segment. (May 2013) +\item \textbf{Add SUPPLEMENTARY flag bit}. (May 2013) +\end{itemize} + +\subsection*{1.4: 21 April 2011 to May 2013} + +\begin{itemize} +\item Add guide to using sequence annotations ({\tt CT/PT tags}). (Mar 2012) +\item Increase max reference length from $2^{29}$ to $2^{31}$. (Sep + 2011) +\item Add {\tt CO} and {\tt RT} auxiliary tags. (Sep 2011) +\item Clarify {\tt @SQ M5} header tag generation. (Sep 2011) +\item Describe padded alignments and add {\tt CT/PT tags}. (Sep 2011) +\item Add {\tt BC} barcode auxiliary tag. (Sep 2011) +\item Change {\tt FZ} tag from type {\tt H} to type {\tt B,S}. (Aug 2011) +\item Add {\tt @RG FO}, {\tt KS} header fields. (Apr 2011) +\item Add {\tt FZ} auxiliary tag. (Apr 2011) +\item Clarify chaining of PG records. (Apr 2011) +\item \textbf{Add {\tt B} array auxiliary tag type.} (Apr 2011)\ +\item \textbf{Permit IUPAC in SEQ and {\tt MD} auxiliary tag.} (Apr 2011) +\item \textbf{Permit QNAME ``{\tt *}''.} (Apr 2011) +\end{itemize} + +\subsection*{1.3: July 2010 to April 2011} + +\begin{itemize} +\item Re-add {\tt CC} and {\tt CP} auxiliary tags. (Mar 2011) +\item Add CIGAR N intron/skip operator. (Dec 2010) +\item Add {\tt BQ} BAQ tag. (Nov 2010) +\item Add {\tt RG PG} header field. (Nov 2010) +\item Add BAM description and index sections. (Nov 2010) +\item \textbf{Removal of FLAG letters.} (July 2010) +\end{itemize} + +\subsection*{1.0: 2009 to July 2010} + +Initial edition. 
+ +\end{appendices} \end{document} diff --git a/VCFv4.1.pdf b/VCFv4.1.pdf new file mode 100644 index 000000000..dfc8f74c0 Binary files /dev/null and b/VCFv4.1.pdf differ diff --git a/VCFv4.1.tex b/VCFv4.1.tex index 5623e5b3d..ce486d6e1 100644 --- a/VCFv4.1.tex +++ b/VCFv4.1.tex @@ -161,7 +161,7 @@ \subsubsection{Fixed fields} \item POS - position: The reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. It is permitted to have multiple records with the same POS. Telomeres are indicated by using positions 0 or N+1, where N is the length of the corresponding chromosome or contig. (Integer, Required) \item ID - identifier: Semi-colon separated list of unique identifiers where available. If this is a dbSNP variant it is encouraged to use the rs number(s). No identifier should be present in more than one data record. If there is no identifier available, then the missing value should be used. (String, no white-space or semi-colons permitted) \item REF - reference base(s): Each base must be one of A,C,G,T,N (case insensitive). Multiple bases are permitted. The value in the POS field refers to the position of the first base in the String. For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT Strings must include the base before the event (which must be reflected in the POS field), unless the event occurs at position 1 on the contig in which case it must include the base after the event; this padding base is not required (although it is permitted) for e.g. complex substitutions or other events where all alleles have at least one base represented in their Strings. If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String ``$<$ID$>$'') then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism. 
Tools processing VCF files are not required to preserve case in the allele Strings. (String, Required). - \item ALT - alternate base(s): Comma separated list of alternate non-reference alleles called on at least one of the samples. Options are base Strings made up of the bases A,C,G,T,N, (case insensitive) or an angle-bracketed ID String (``$<$ID$>$'') or a breakend replacement string as described in the section on breakends. If there are no alternative alleles, then the missing value should be used. Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive. (String; no whitespace, commas, or angle-brackets are permitted in the ID String itself) + \item ALT - alternate base(s): Comma separated list of alternate non-reference alleles. These alleles do not have to be called in any of the samples. Options are base Strings made up of the bases A,C,G,T,N, (case insensitive) or an angle-bracketed ID String (``$<$ID$>$'') or a breakend replacement string as described in the section on breakends. If there are no alternative alleles, then the missing value should be used. Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive. (String; no whitespace, commas, or angle-brackets are permitted in the ID String itself) \item QUAL - quality: Phred-scaled quality score for the assertion made in ALT. i.e. $-10log_{10}$ prob(call in ALT is wrong). If ALT is `.' (no variant) then this is $-10log_{10}$ prob(variant), and if ALT is not `.' this is $-10log_{10}$ prob(no variant). High QUAL scores indicate high confidence calls. Although traditionally people use integer phred scores, this field is permitted to be a floating point to enable higher resolution for low confidence calls if desired. If unknown, the missing value should be specified. (Numeric) \item FILTER - filter status: PASS if this position has passed all filters, i.e. 
a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail. e.g. ``q10;s50'' might indicate that at this site the quality is below 10 and the number of samples with data is below 50\% of the total number of samples. `0' is reserved and should not be used as a filter String. If filters have not been applied, then this field should be set to the missing value. (String, no white-space or semi-colons permitted) \item INFO - additional information: (String, no white-space, semi-colons, or equals-signs permitted; commas are permitted only as delimiters for lists of values) INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: $<$key$>$=$<$data$>$[,data]. Arbitrary keys are permitted, although the following sub-fields are reserved (albeit optional): @@ -1155,7 +1155,7 @@ \subsubsection{Type encoding} \vspace{0.3cm} -\textbf{Integers} may be encoded as 8, 16, or 32 bit values, in little-endian order. It is up to the encoder to determine the appropriate ranged value to use when writing the BCF2 file. For each integer size, the value with all bits set (0x80, 0x8000, 0x80000000) for 8, 16, and 32 bit values, respectively) indicates that the field is a missing value. +\textbf{Integers} may be encoded as 8, 16, or 32 bit values, in little-endian order. It is up to the encoder to determine the appropriate ranged value to use when writing the BCF2 file. For each integer size, the values 0x80, 0x8000, 0x80000000 are interpreted as missing values. \vspace{0.3cm} \textbf{Floats} are encoded as single-precision (32 bit) in the basic format defined by the IEEE-754-1985 standard. This is the standard representation for floating point numbers on modern computers, with direct support in programming languages like C and Java (see Java's Double class for example). BCF2 supports the full range of values from -Infinity to +Infinity, including NaN. 
BCF2 needs to represent missing values for single precision floating point numbers. This is accomplished by writing the NaN value as the quiet NaN (qNaN), while the MISSING value is encoded as a signaling NaN. From the NaN wikipedia entry, we have: @@ -1332,17 +1332,17 @@ \subsubsection{Encoding ID} \subsubsection{Encoding REF/ALT fields} We encode each of REF and ALT as typed strings, first REF followed immediately -by ALT. Each is a 1 element string (0x19), which would then be followed by the +by ALT. Each is a 1 element string (0x17), which would then be followed by the single bytes for the bases of 0x43 and 0x41: \vspace{0.3cm} \begin{tabular}{|l| l|} \hline -0x19 0x41 & REF A \\ \hline -0x19 0x43 & ALT C \\ \hline +0x17 0x41 & REF A \\ \hline +0x17 0x43 & ALT C \\ \hline \end{tabular} \vspace{0.3cm} -Just for discussion, suppose instead that ALT was ALT=C,T. The only thing that could change is that there would be another typed string following immediately after C encoding 0x19 (1 element string) with the value of 0x54. +Just for discussion, suppose instead that ALT was ALT=C,T. The only thing that could change is that there would be another typed string following immediately after C encoding 0x17 (1 element string) with the value of 0x54. \subsubsection{Encoding FILTER} @@ -1377,12 +1377,12 @@ \subsubsection{Encoding the INFO fields} \end{tabular} \vspace{0.3cm} -The ancestral allele (AA) tell us that among other primates the original allele is C, a Character here. Because we represent Characters as single element strings in BCF2 (0x19) with value 0x43 (C). So the entire key/value pair is: +The ancestral allele (AA) tell us that among other primates the original allele is C, a Character here. Because we represent Characters as single element strings in BCF2 (0x17) with value 0x43 (C). 
So the entire key/value pair is: \vspace{0.3cm} \begin{tabular}{|l |l|} \hline 0x11 0x51 & AA key \\ \hline -0x19 0x43 & with value of C \\ \hline +0x17 0x43 & with value of C \\ \hline \end{tabular} \subsubsection{Encoding Genotypes} @@ -1445,8 +1445,8 @@ \subsubsection{Encoding Genotypes} 0x00020004 & n\_allele\_info \\ \hline 0x05000003 & n\_fmt\_samples \\ \hline 0x59 0x72 0x73 0x31 0x32 0x33 & ID \\ \hline -0x19 0x41 & REF A \\ \hline -0x19 0x43 & ALT C \\ \hline +0x17 0x41 & REF A \\ \hline +0x17 0x43 & ALT C \\ \hline 0x11 0x00 & FILTER field PASS \\ \hline 0x11 0x50 0x11 0x01 & HM3 flag is present \\ \hline 0x11 0x51 & AC key \\ \hline @@ -1454,7 +1454,7 @@ \subsubsection{Encoding Genotypes} 0x11 0x52 & AN key \\ \hline 0x11 0x06 & with value of 6 \\ \hline 0x11 0x51 & AA key \\ \hline -0x19 0x43 & with value of C \\ \hline +0x17 0x43 & with value of C \\ \hline 0x1101 0x21 0x020202040404 & GT \\ \hline 0x1102 0x11 0x0A0A0A & GQ \\ \hline 0x1103 0x11 0x203040 & DP \\ \hline diff --git a/VCFv4.2.pdf b/VCFv4.2.pdf new file mode 100644 index 000000000..e85463da4 Binary files /dev/null and b/VCFv4.2.pdf differ diff --git a/VCFv4.2.tex b/VCFv4.2.tex index 49b7245e7..6360bd092 100644 --- a/VCFv4.2.tex +++ b/VCFv4.2.tex @@ -178,7 +178,7 @@ \subsubsection{Fixed fields} \item POS - position: The reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. It is permitted to have multiple records with the same POS. Telomeres are indicated by using positions 0 or N+1, where N is the length of the corresponding chromosome or contig. (Integer, Required) \item ID - identifier: Semi-colon separated list of unique identifiers where available. If this is a dbSNP variant it is encouraged to use the rs number(s). No identifier should be present in more than one data record. If there is no identifier available, then the missing value should be used. 
(String, no white-space or semi-colons permitted) \item REF - reference base(s): Each base must be one of A,C,G,T,N (case insensitive). Multiple bases are permitted. The value in the POS field refers to the position of the first base in the String. For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT Strings must include the base before the event (which must be reflected in the POS field), unless the event occurs at position 1 on the contig in which case it must include the base after the event; this padding base is not required (although it is permitted) for e.g. complex substitutions or other events where all alleles have at least one base represented in their Strings. If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String ``$<$ID$>$'') then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism. Tools processing VCF files are not required to preserve case in the allele Strings. (String, Required). - \item ALT - alternate base(s): Comma separated list of alternate non-reference alleles called on at least one of the samples. Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive) or an angle-bracketed ID String (``$<$ID$>$'') or a breakend replacement string as described in the section on breakends. The `*' allele is reserved to indicate that the allele is missing due to a upstream deletion. If there are no alternative alleles, then the missing value should be used. Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive. (String; no whitespace, commas, or angle-brackets are permitted in the ID String itself) + \item ALT - alternate base(s): Comma separated list of alternate non-reference alleles. These alleles do not have to be called in any of the samples. 
Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive) or an angle-bracketed ID String (``$<$ID$>$'') or a breakend replacement string as described in the section on breakends. The `*' allele is reserved to indicate that the allele is missing due to a upstream deletion. If there are no alternative alleles, then the missing value should be used. Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive. (String; no whitespace, commas, or angle-brackets are permitted in the ID String itself) \item QUAL - quality: Phred-scaled quality score for the assertion made in ALT. i.e. $-10log_{10}$ prob(call in ALT is wrong). If ALT is `.' (no variant) then this is $-10log_{10}$ prob(variant), and if ALT is not `.' this is $-10log_{10}$ prob(no variant). If unknown, the missing value should be specified. (Numeric) \item FILTER - filter status: PASS if this position has passed all filters, i.e. a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail. e.g. ``q10;s50'' might indicate that at this site the quality is below 10 and the number of samples with data is below 50\% of the total number of samples. `0' is reserved and should not be used as a filter String. If filters have not been applied, then this field should be set to the missing value. (String, no white-space or semi-colons permitted) \item INFO - additional information: (String, no white-space, semi-colons, or equals-signs permitted; commas are permitted only as delimiters for lists of values) INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: $<$key$>$=$<$data$>$[,data]. 
Arbitrary keys are permitted, although the following sub-fields are reserved (albeit optional): @@ -885,7 +885,7 @@ \subsubsection{Clonal derivation relationships} PEDIGREE= \end{verbatim} -This line asserts that the DNA in genome is asexually or clonally derived with mutations from the DNA in genome . This is the asexual analog of the VCF format that has been proposed for family relationships between genomes, i.e. there is one entry per of the form: +This line asserts that the DNA in genome ID2 is asexually or clonally derived with mutations from the DNA in genome ID1. This is the asexual analog of the VCF format that has been proposed for family relationships between genomes, i.e. there is one entry per of the form: \begin{verbatim} PEDIGREE= @@ -1172,7 +1172,7 @@ \subsubsection{Type encoding} \vspace{0.3cm} -\textbf{Integers} may be encoded as 8, 16, or 32 bit values, in little-endian order. It is up to the encoder to determine the appropriate ranged value to use when writing the BCF2 file. For each integer size, the value with all bits set (0x80, 0x8000, 0x80000000) for 8, 16, and 32 bit values, respectively) indicates that the field is a missing value. +\textbf{Integers} may be encoded as 8, 16, or 32 bit values, in little-endian order. It is up to the encoder to determine the appropriate ranged value to use when writing the BCF2 file. For each integer size, the values 0x80, 0x8000, 0x80000000 are interpreted as missing values. \vspace{0.3cm} \textbf{Floats} are encoded as single-precision (32 bit) in the basic format defined by the IEEE-754-1985 standard. This is the standard representation for floating point numbers on modern computers, with direct support in programming languages like C and Java (see Java's Double class for example). BCF2 supports the full range of values from -Infinity to +Infinity, including NaN. BCF2 needs to represent missing values for single precision floating point numbers. 
This is accomplished by writing the NaN value as the quiet NaN (qNaN), while the MISSING value is encoded as a signaling NaN. From the NaN wikipedia entry, we have:
@@ -1349,17 +1349,17 @@ \subsubsection{Encoding ID}
 \subsubsection{Encoding REF/ALT fields}
 We encode each of REF and ALT as typed strings, first REF followed immediately
-by ALT. Each is a 1 element string (0x19), which would then be followed by the
+by ALT. Each is a 1 element string (0x17), which would then be followed by the
 single bytes for the bases of 0x43 and 0x41:
 \vspace{0.3cm}
 \begin{tabular}{|l| l|} \hline
-0x19 0x41 & REF A \\ \hline
-0x19 0x43 & ALT C \\ \hline
+0x17 0x41 & REF A \\ \hline
+0x17 0x43 & ALT C \\ \hline
 \end{tabular}
 \vspace{0.3cm}
-Just for discussion, suppose instead that ALT was ALT=C,T. The only thing that could change is that there would be another typed string following immediately after C encoding 0x19 (1 element string) with the value of 0x54.
+Just for discussion, suppose instead that ALT was ALT=C,T. The only thing that could change is that there would be another typed string following immediately after C encoding 0x17 (1 element string) with the value of 0x54.
 \subsubsection{Encoding FILTER}
@@ -1394,12 +1394,12 @@ \subsubsection{Encoding the INFO fields}
 \end{tabular}
 \vspace{0.3cm}
-The ancestral allele (AA) tell us that among other primates the original allele is C, a Character here. Because we represent Characters as single element strings in BCF2 (0x19) with value 0x43 (C). So the entire key/value pair is:
+The ancestral allele (AA) tell us that among other primates the original allele is C, a Character here. Because we represent Characters as single element strings in BCF2 (0x17) with value 0x43 (C).
So the entire key/value pair is:
 \vspace{0.3cm}
 \begin{tabular}{|l |l|} \hline
 0x11 0x51 & AA key \\ \hline
-0x19 0x43 & with value of C \\ \hline
+0x17 0x43 & with value of C \\ \hline
 \end{tabular}
 \subsubsection{Encoding Genotypes}
@@ -1462,8 +1462,8 @@ \subsubsection{Encoding Genotypes}
 0x00020004 & n\_allele\_info \\ \hline
 0x05000003 & n\_fmt\_samples \\ \hline
 0x59 0x72 0x73 0x31 0x32 0x33 & ID \\ \hline
-0x19 0x41 & REF A \\ \hline
-0x19 0x43 & ALT C \\ \hline
+0x17 0x41 & REF A \\ \hline
+0x17 0x43 & ALT C \\ \hline
 0x11 0x00 & FILTER field PASS \\ \hline
 0x11 0x50 0x11 0x01 & HM3 flag is present \\ \hline
 0x11 0x51 & AC key \\ \hline
@@ -1471,7 +1471,7 @@ \subsubsection{Encoding Genotypes}
 0x11 0x52 & AN key \\ \hline
 0x11 0x06 & with value of 6 \\ \hline
 0x11 0x51 & AA key \\ \hline
-0x19 0x43 & with value of C \\ \hline
+0x17 0x43 & with value of C \\ \hline
 0x1101 0x21 0x020202040404 & GT \\ \hline
 0x1102 0x11 0x0A0A0A & GQ \\ \hline
 0x1103 0x11 0x203040 & DP \\ \hline
diff --git a/VCFv4.3.pdf b/VCFv4.3.pdf
new file mode 100644
index 000000000..878ee7559
Binary files /dev/null and b/VCFv4.3.pdf differ
diff --git a/VCFv4.3.tex b/VCFv4.3.tex
index dbbcb2cce..0f73adf70 100644
--- a/VCFv4.3.tex
+++ b/VCFv4.3.tex
@@ -1,6 +1,7 @@
 \documentclass[8pt]{article}
 \usepackage{enumerate}
 \usepackage{graphicx}
+\usepackage{tabularx}
 \usepackage{lscape}
 \usepackage[margin=0.75in]{geometry}
 \usepackage[pdfborder={0 0 0}]{hyperref}
@@ -31,21 +32,19 @@
 \tableofcontents
 \newpage
-\section{The VCF specification}
-
-VCF is a text file format (most likely stored in a compressed manner).
+\section{The VCF specification}
+VCF is a text file format (most likely stored in a compressed manner).
 It contains meta-information lines (prefixed with "\#\#"), a header
-line (prefixed with "\#"), and data lines each containing information
-about a position in the genome and genotype information on samples for
-each position (text fields separated by tabs).
The VCF format can also +line (prefixed with "\#"), and data lines +each containing information about a position in the genome and genotype +information on samples for each position +(text fields separated by tabs). The VCF format can also store information on DNA methylation from bisulfite sequencing experiments and other sources alongside information about genome -sequence variation. Zero length fields are not allowed, a dot (".") -should be used instead. In order to ensure interoperability across -platforms, VCF compliant implementations must support both LF -(\texttt{\textbackslash n}) and CR+LF (\texttt{\textbackslash -r\textbackslash n}) newline conventions. - +sequence variation. Zero length fields are not allowed, a dot (".") must +be used instead. +In order to ensure interoperability across platforms, VCF compliant implementations must support +both LF (\texttt{\textbackslash n}) and CR+LF (\texttt{\textbackslash r\textbackslash n}) newline conventions. \subsection{An example} \scriptsize @@ -113,11 +112,12 @@ \subsection{Data types} \subsection{Meta-information lines} + File meta-information is included after the \#\# string and must be key=value pairs. Meta-information lines are optional, but if they are present then they must be completely well-formed. Note that BCF, the binary -counterpart of VCF, requires that all entries are present. It is strongly -encouraged to include meta-information lines describing the entries used in the +counterpart of VCF, requires that all entries are present. It is recommended +to include meta-information lines describing the entries used in the body of the VCF file. All structured lines that have their value enclosed within "$<>$" require an ID @@ -127,10 +127,10 @@ \subsection{Meta-information lines} ##INFO= \end{verbatim} In the above example, the extra fields of ``Source'' and ``Version'' are -provided. Optional fields should be stored as strings even for numeric values. +provided. 
Optional fields must be stored as strings even for numeric values. -It is highly recommended (but not required) that the header -include tags describing the reference and contigs backing the data contained in +It is recommended in VCF and required in BCF that the header +includes tags describing the reference and contigs backing the data contained in the file. These tags are based on the SQ field from the SAM spec; all tags are optional (see the VCF example above). @@ -139,7 +139,7 @@ \subsection{Meta-information lines} \subsubsection{File format} -A single `fileformat' line is always required, must be the first line in the file, and details the VCF format version number. For VCF version 4.3, this line should read: +A single `fileformat' line is always required, must be the first line in the file, and details the VCF format version number. For VCF version 4.3, this line is: \begin{verbatim} ##fileformat=VCFv4.3 @@ -148,7 +148,7 @@ \subsubsection{File format} \subsubsection{Information field format} -INFO fields should be described as follows (first four keys are required, source and version are recommended): +INFO fields are described as follows (first four keys are required, source and version are recommended): \begin{verbatim} ##INFO= @@ -158,28 +158,28 @@ \subsubsection{Information field format} String. The Number entry is an Integer that describes the number of values that can be included with the INFO field. For example, if the INFO field contains a -single number, then this value should be $1$; if the INFO field describes a -pair of numbers, then this value should be $2$ and so on. There are also +single number, then this value must be $1$; if the INFO field describes a +pair of numbers, then this value must be $2$ and so on. There are also certain special characters used to define special cases: \begin{itemize} - \item If the field has one value per alternate allele then this value should be `A'. 
- \item If the field has one value for each possible allele (including the reference), then this value should be `R'. - \item If the field has one value for each possible genotype (more relevant to the FORMAT tags) then this value should be `G'. - \item If the number of possible values varies, is unknown, or is unbounded, then this value should be `.'. + \item A: The field has one value per alternate allele. The values must be in the same order as listed in the ALT column (described in section \ref{data-lines}). + \item R: The field has one value for each possible allele, including the reference. The order of the values must be the reference allele first, then the alternate alleles as listed in the ALT column. + \item G: The field has one value for each possible genotype. The values must be in the same order as prescribed in section \ref{genotype-fields:genotype-ordering} (see \textsc{Genotype Ordering}). + \item . (dot): The number of possible values varies, is unknown or unbounded. \end{itemize} -The `Flag' type indicates that the INFO field does not contain a Value entry, and hence the Number should be $0$ in this case. The Description value must be surrounded by double-quotes. Double-quote character can be escaped with backslash $\backslash$ and backslash as $\backslash\backslash$. Source and Version values likewise should be surrounded by double-quotes and specify the annotation source (case-insensitive, e.g. ``dbsnp'') and exact version (e.g. ``138''), respectively for computational use. +The `Flag' type indicates that the INFO field does not contain a Value entry, and hence the Number must be $0$ in this case. The Description value must be surrounded by double-quotes. Double-quote character must be escaped with backslash $\backslash$ and backslash as $\backslash\backslash$. Source and Version values likewise must be surrounded by double-quotes and specify the annotation source (case-insensitive, e.g. ``dbsnp'') and exact version (e.g. 
``138''), respectively for computational use. \subsubsection{Filter field format} -FILTERs that have been applied to the data should be described as follows: +FILTERs that have been applied to the data are described as follows: \begin{verbatim} ##FILTER= \end{verbatim} \subsubsection{Individual format field format} -Likewise, Genotype fields specified in the FORMAT field should be described as follows: +Genotype fields specified in the FORMAT field are described as follows: \begin{verbatim} ##FORMAT= @@ -188,7 +188,7 @@ \subsubsection{Individual format field format} Possible Types for FORMAT fields are: Integer, Float, Character, and String (this field is otherwise defined precisely as the INFO field). \subsubsection{Alternative allele field format} -Symbolic alternate alleles should be described as follows: +Symbolic alternate alleles are described as follows: \begin{verbatim} ##ALT= \end{verbatim} @@ -197,7 +197,7 @@ \subsubsection{Alternative allele field format} In symbolic alternate alleles for imprecise structural variants, the ID field indicates the type of structural variant, and can be a colon-separated list of types and subtypes. ID values are case sensitive -strings and may not contain whitespace or angle brackets. The first level type +strings and must not contain whitespace or angle brackets. The first level type must be one of the following: \begin{itemize} \item DEL Deletion relative to the reference @@ -234,8 +234,8 @@ \subsubsection{Assembly field format} \subsubsection{Contig field format} \label{sec-contig-field} -It is highly recommended (and required for BCF) that the header includes tags -describing the contigs referred to in the VCF file. The structured \texttt{contig} +It is recommended for VCF, and required for BCF, that the header includes tags +describing the contigs referred to in the file. 
The structured \texttt{contig} field must include the ID attribute and typically includes also sequence length, MD5 checksum, URL tag to indicate where the sequence can be found, etc. For example: @@ -295,17 +295,18 @@ \subsection{Header line syntax} and there must be no tab characters at the end of the line. \subsection{Data lines} +\label{data-lines} All data lines are tab-delimited -with no tab character at the end of the line. The last data line should end with a line separator. In all cases, +with no tab character at the end of the line. The last data line must end with a line separator. In all cases, missing values are specified with a dot (`.'). \subsubsection{Fixed fields} There are 8 fixed fields per record. Fixed fields are: \begin{enumerate} - \item CHROM - chromosome: An identifier from the reference genome or an angle-bracketed ID String (``$<$ID$>$'') pointing to a contig in the assembly file (cf. the \#\#assembly line in the header). All entries for a specific CHROM should form a contiguous block within the VCF file. The colon symbol (:) must be absent from all chromosome names to avoid parsing errors when dealing with breakends. (String, no white-space permitted, Required). + \item CHROM - chromosome: An identifier from the reference genome or an angle-bracketed ID String (``$<$ID$>$'') pointing to a contig in the assembly file (cf. the \#\#assembly line in the header). All entries for a specific CHROM must form a contiguous block within the VCF file. The colon symbol (:) must be absent from all chromosome names to avoid parsing errors when dealing with breakends. (String, no white-space permitted, Required). \item POS - position: The reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. It is permitted to have multiple records with the same POS. 
Telomeres are indicated by using positions 0 or N+1, where N is the length of the corresponding chromosome or contig. (Integer, Required) - \item ID - identifier: Semi-colon separated list of unique identifiers where available. If this is a dbSNP variant it is encouraged to use the rs number(s). No identifier should be present in more than one data record. If there is no identifier available, then the missing value should be used. (String, no white-space or semi-colons permitted, duplicate values not allowed.) + \item ID - identifier: Semi-colon separated list of unique identifiers where available. If this is a dbSNP variant the rs number(s) should be used. No identifier should be present in more than one data record. If there is no identifier available, then the missing value should be used. (String, no white-space or semi-colons permitted, duplicate values not allowed.) \item REF - reference base(s): Each base must be one of A,C,G,T,N (case insensitive). Multiple bases are permitted. The value in the POS field refers to the position of the first base in the String. For simple insertions and @@ -327,72 +328,97 @@ \subsubsection{Fixed fields} (thus R as a reference base is converted to A in VCF.) - \item ALT - alternate base(s): Comma separated list of alternate non-reference alleles called on at least one of the samples. Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive) or an angle-bracketed ID String (``$<$ID$>$'') or a breakend replacement string as described in the section on breakends. The `*' allele is reserved to indicate that the allele is missing due to a an overlapping deletion. If there are no alternative alleles, then the missing value should be used. Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive. 
(String; no whitespace, commas, or angle-brackets are permitted in the ID String itself) - \item QUAL - quality: Phred-scaled quality score for the assertion made in ALT. i.e. $-10log_{10}$ prob(call in ALT is wrong). If ALT is `.' (no variant) then this is $-10log_{10}$ prob(variant), and if ALT is not `.' this is $-10log_{10}$ prob(no variant). If unknown, the missing value should be specified. (Float) - \item FILTER - filter status: PASS if this position has passed all filters, i.e. a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail. e.g. ``q10;s50'' might indicate that at this site the quality is below 10 and the number of samples with data is below 50\% of the total number of samples. `0' is reserved and should not be used as a filter String. If filters have not been applied, then this field should be set to the missing value. (String, no white-space or semi-colons permitted, duplicate values not allowed.) - \item INFO - additional information: (String, no semi-colons or - equals-signs permitted; commas are permitted only as delimiters for lists of + \item ALT - alternate base(s): Comma separated list of alternate non-reference alleles. These alleles do not have to be called in any of the samples. Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive) or a missing value `.' (no variant) or an angle-bracketed ID String (``$<$ID$>$'') or a breakend replacement string as described in the section on breakends. The `*' allele is reserved to indicate that the allele is missing due to a an overlapping deletion. If there are no alternative alleles, then the missing value must be used. Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive. 
(String; no whitespace, commas, or angle-brackets are permitted in the ID String itself) + \item QUAL - quality: Phred-scaled quality score for the assertion made in ALT. i.e. $-10log_{10}$ prob(call in ALT is wrong). If ALT is `.' (no variant) then this is $-10log_{10}$ prob(variant), and if ALT is not `.' this is $-10log_{10}$ prob(no variant). If unknown, the missing value must be specified. (Float) + \item FILTER - filter status: PASS if this position has passed all filters, i.e. a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail. e.g. ``q10;s50'' might indicate that at this site the quality is below 10 and the number of samples with data is below 50\% of the total number of samples. `0' is reserved and must not be used as a filter String. If filters have not been applied, then this field must be set to the missing value. (String, no white-space or semi-colons permitted, duplicate values not allowed.) + \item INFO - additional information: (String, no semi-colons or equals-signs permitted; commas are permitted only as delimiters for lists of values; characters with special meaning can be encoded using the percent encoding, see Section~\ref{character-encoding}; space characters are allowed) - INFO fields are encoded as a semicolon-separated series of short keys - with optional values in the format: $<$key$>$=$<$data$>$[,data]. - INFO keys must match the regular expression \texttt{\^{}[A-Za-z\_][0-9A-Za-z\_.]*\$}, duplicate fields are not allowed. 
- Arbitrary keys are permitted, although the following sub-fields are reserved (albeit optional):
-\begin{itemize}
- \item AA : ancestral allele
- \item AC : allele count in genotypes, for each ALT allele, in the same order as listed
- \item AD, ADF, ADR: read depths for each allele; total (AD), on the forward (ADF) and the reverse (ADR) strand (Integer, Number=R)
- \item AF : allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes
- \item AN : total number of alleles in called genotypes
- \item BQ : RMS base quality at this position
- \item CX: The 5 base context determined from the reference surrounding and including the current position (i.e., from position -2 to +2)
- \item CIGAR : cigar string describing how to align an alternate allele to the reference allele
- \item DB : dbSNP membership
- \item DP : combined depth across samples, e.g. DP=154
- \item END : end position of the variant described in this record (for use with symbolic alleles)
- \item H2 : membership in hapmap2
- \item H3 : membership in hapmap3
- \item MQ : RMS mapping quality, e.g. MQ=52
- \item MQ0 : Number of MAPQ == 0 reads covering this record
- \item NS : Number of samples with data
- \item SB : strand bias at this position
- \item SOMATIC : indicates that the record is a somatic mutation, for cancer genomics
- \item VALIDATED : validated by follow-up experiment
- \item 1000G : membership in 1000 Genomes
- \item $\ldots$ see Section~\ref{sv-info-keys} for a list of INFO keys reserved for structural variants.
-\end{itemize}
+ INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: $<$key$>$=$<$data$>$[,data].
+ INFO keys must match the regular expression \texttt{\^{}[A-Za-z\_][0-9A-Za-z\_.]*\$}, duplicate fields are not allowed. Arbitrary keys are permitted, although the sub-fields listed in Table~\ref{table:reserved-info} are reserved (albeit optional).
+
+ \begin{table}[htbp]
+ \centering
+ \begin{tabularx}{\textwidth}{ | p{2.5cm} | p{1.5cm} | p{1.5cm} | X | }
+ Field & Number & Type & Description \\ \hline
+ AA & 1 & String & Ancestral allele \\
+ AC & A & Integer & Allele count in genotypes, for each ALT allele, in the same order as listed \\
+ AD & R & Integer & Total read depth for each allele \\
+ ADF & R & Integer & Read depth for each allele on the forward strand \\
+ ADR & R & Integer & Read depth for each allele on the reverse strand \\
+ AF & A & Float & Allele frequency for each ALT allele in the same order as listed (estimated from primary data, not called genotypes) \\
+ AN & 1 & Integer & Total number of alleles in called genotypes \\
+ BQ & 1 & Float & RMS base quality \\
+ CIGAR & A & String & Cigar string describing how to align an alternate allele to the reference allele \\
+ DB & 0 & Flag & dbSNP membership \\
+ DP & 1 & Integer & Combined depth across samples \\
+ END & 1 & Integer & End position (for use with symbolic alleles) \\
+ H2 & 0 & Flag & HapMap2 membership \\
+ H3 & 0 & Flag & HapMap3 membership \\
+ MQ & 1 & . & RMS mapping quality \\
+ MQ0 & 1 & Integer & Number of MAPQ == 0 reads \\
+ NS & 1 & Integer & Number of samples with data \\
+ SB & . & . & Strand bias \\
+ SOMATIC & 0 & Flag & Somatic mutation (for cancer genomics) \\
+ VALIDATED & 0 & Flag & Validated by follow-up experiment \\
+ 1000G & 0 & Flag & 1000 Genomes membership \\
+ \end{tabularx}
+ \caption{Reserved INFO fields}
+ \label{table:reserved-info}
+ \end{table}
+
+ The exact format of each INFO sub-field should be specified in the meta-information (as described above).
+ Example for an INFO field: DP=154;MQ=52;H2. Keys without corresponding values may be used to indicate group membership (e.g. H2 indicates the SNP is found in HapMap 2). See Section~\ref{sv-info-keys} for additional reserved INFO sub-fields used to encode structural variants.
\end{enumerate} -The exact format of each INFO sub-field should be specified in the meta-information (as described above). -Example for an INFO field: DP=154;MQ=52;H2. Keys without corresponding values are allowed in order to indicate group membership (e.g. H2 indicates the SNP is found in HapMap 2). It is not necessary to list all the properties that a site does NOT have, by e.g. H2=0. See below for additional reserved INFO sub-fields used to encode structural variants. -\subsubsection{Genotype fields} -If genotype information is present, then the same types of data must be present -for all samples. First a FORMAT field is given specifying the data types and -order (colon-separated FORMAT ids matching the regular expression \texttt{\^{}[A-Za-z\_][0-9A-Za-z\_.]*\$}, duplicate fields are not allowed). This is followed by one data block per -sample, with the colon-separated data corresponding to the types -specified in the format. The first sub-field must always be the genotype (GT) -if it is present. There are no required sub-fields. -As with the INFO field, there are several common, reserved keywords that are standards across the community: +\subsubsection{Genotype fields} +If genotype information is present, then the same types of data must be present for all samples. First a FORMAT field is given specifying the data types and order (colon-separated FORMAT ids matching the regular expression \texttt{\^{}[A-Za-z\_][0-9A-Za-z\_.]*\$}, duplicate fields are not allowed). This is followed by one data block per sample, with the colon-separated data corresponding to the types specified in the format. The first sub-field must always be the genotype (GT) if it is present. There are no required sub-fields. Additional Genotype fields can be defined in the meta-information, however, software support for such fields is not guaranteed. + +If any of the fields is missing, it is replaced with the missing value. 
For example if the FORMAT is GT:GQ:DP:HQ then $0\mid0:.:23:23,34$ indicates that GQ is missing. Trailing fields can be dropped, with the exception of the GT field, which should always be present if specified in the FORMAT field.
+
+As with the INFO field, there are several common, reserved keywords that are standards across the community. See their detailed definitions below, as well as table~\ref{table:reserved-genotypes} for their reference Number, Type and Description. See also Section~\ref{sv-format-keys} for a list of genotype keys reserved for structural variants.
+
+\begin{table}[htbp]
+ \centering
+ \begin{tabularx}{\textwidth}{ | p{2.5cm} | p{1.5cm} | p{1.5cm} | X | }
+ Field & Number & Type & Description \\ \hline
+ AD & R & Integer & Read depth for each allele \\
+ ADF & R & Integer & Read depth for each allele on the forward strand \\
+ ADR & R & Integer & Read depth for each allele on the reverse strand \\
+ DP & 1 & Integer & Read depth \\
+ EC & A & Integer & Expected alternate allele counts \\
+ FT & 1 & String & Filter indicating if this genotype was ``called'' \\
+ GL & G & Float & Genotype likelihoods \\
+ GP & G & Float & Genotype posterior probabilities \\
+ GQ & 1 & Integer & Conditional genotype quality \\
+ GT & 1 & String & Genotype \\
+ HQ & 2 & Integer & Haplotype quality \\
+ MQ & 1 & Integer & RMS mapping quality \\
+ PL & G & Integer & Phred-scaled genotype likelihoods rounded to the closest integer \\
+ PQ & 1 & Integer & Phasing quality \\
+ PS & 1 & Integer & Phase set \\
+ \end{tabularx}
+ \caption{Reserved genotype fields}
+ \label{table:reserved-genotypes}
+\end{table}
 \begin{itemize}
 \renewcommand{\labelitemii}{$\circ$}
- \item AD, ADF, ADR: per-sample read depths for each allele; total (AD), on the forward (ADF) and the reverse (ADR) strand (Integer, Number=R)
- \item DP : read depth at this position for this sample (Integer)
- \item EC : comma separated list of expected alternate allele counts for each alternate allele in the
same order as listed in the ALT field (typically used in association analyses) (Integers) - \item FT : sample genotype filter indicating if this genotype was ``called'' (similar in concept to the FILTER field). Again, use PASS to indicate that all filters have been passed, a semi-colon separated list of codes for filters that fail, or `.' to indicate that filters have not been applied. These values should be described in the meta-information in the same way as FILTERs (String, no white-space or semi-colons permitted) - \item GQ : conditional genotype quality, encoded as a phred quality $-10log_{10}$ p(genotype call is wrong, conditioned on the site's being variant) (Integer) - \item GP : genotype posterior probabilities in the range 0 to 1 using the same ordering as the GL field; one use can be to store imputed genotype probabilities (Float) - \item GT : genotype, encoded as allele values separated by either of $/$ or $\mid$. The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be $0/1$, $1\mid0$, or $1/2$, etc. For haploid calls, e.g. on Y, male non-pseudoautosomal X, or mitochondrion, only one allele value should be given; a triploid call might look like $0/0/1$. If a call cannot be made for a sample at a given locus, `.' should be specified for each missing allele in the GT field (for example `$./.$' for a diploid genotype and `.' for haploid genotype). The meanings of the separators are as follows (see the PS field below for more details on incorporating phasing information into the genotypes): + \item AD, ADF, ADR (Integer): Per-sample read depths for each allele; total (AD), on the forward (ADF) and the reverse (ADR) strand. + \item DP (Integer): Read depth at this position for this sample. 
+ \item EC (Integer): Comma separated list of expected alternate allele counts for each alternate allele in the same order as listed in the ALT field. Typically used in association analyses. + \item FT (String): Sample genotype filter indicating if this genotype was ``called'' (similar in concept to the FILTER field). Again, use PASS to indicate that all filters have been passed, a semi-colon separated list of codes for filters that fail, or `.' to indicate that filters have not been applied. These values should be described in the meta-information in the same way as FILTERs. No white-space or semi-colons permitted. + \item GQ (Integer): Conditional genotype quality, encoded as a phred quality $-10log_{10}$ p(genotype call is wrong, conditioned on the site's being variant). + \item GP (Float): Genotype posterior probabilities in the range 0 to 1 using the same ordering as the GL field; one use can be to store imputed genotype probabilities. + \item GT (String): Genotype, encoded as allele values separated by either of $/$ or $\mid$. The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be $0/1$, $1\mid0$, or $1/2$, etc. Haploid calls, e.g. on Y, male non-pseudoautosomal X, or mitochondrion, are indicated by having only one allele value. A triploid call might look like $0/0/1$. If a call cannot be made for a sample at a given locus, `.' must be specified for each missing allele in the GT field (for example `$./.$' for a diploid genotype and `.' for haploid genotype). 
The meanings of the separators are as follows (see the PS field below for more details on incorporating phasing information into the genotypes): \begin{itemize} \item $/$ : genotype unphased \item $\mid$ : genotype phased \end{itemize} - \item GL : genotype likelihoods comprised of comma separated floating point - $log_{10}$-scaled likelihoods for all possible genotypes given the set of - alleles defined in the REF and ALT fields. In presence of the GT field the - same ploidy is expected; without GT field, diploidy is assumed. + \item GL (Float): Genotype likelihoods comprised of comma separated floating point $log_{10}$-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields. In presence of the GT field the same ploidy is expected; without GT field, diploidy is assumed. - \textsc{Genotype Ordering.} In general case of ploidy P and N alternate alleles (0 is the REF and 1..N + \textsc{Genotype Ordering.} \label{genotype-fields:genotype-ordering} + In general case of ploidy P and N alternate alleles (0 is the REF and $1\ldots N$ the alternate alleles), the ordering of genotypes for the likelihoods can be expressed by the following pseudocode with as many nested loops as ploidy:\footnote{Note that we use inclusive \texttt{for} loop boundaries.} \begingroup @@ -445,16 +471,14 @@ \subsubsection{Genotype fields} } \end{itemize} - \item HQ : haplotype qualities, two comma separated phred qualities (Integers) - \item MQ : RMS mapping quality, similar to the version in the INFO field. (Integer) - \item PL : the phred-scaled genotype likelihoods rounded to the closest integer (and otherwise defined precisely as the GL field) (Integers) - \item PQ : phasing quality, the phred-scaled probability that alleles are ordered incorrectly in a heterozygote (against all other members in the phase set). 
We note that we have not yet included the specific measure for precisely defining ``phasing quality''; our intention for now is simply to reserve the PQ tag for future use as a measure of phasing quality. (Integer) - \item PS : phase set. A phase set is defined as a set of phased genotypes to which this genotype belongs. Phased genotypes for an individual that are on the same chromosome and have the same PS value are in the same phased set. A phase set specifies multi-marker haplotypes for the phased genotypes in the set. All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set. If the genotype in the GT field is unphased, the corresponding PS field is ignored. The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required). (Non-negative 32-bit Integer) - \item $\ldots$ see Section~\ref{sv-format-keys} for a list of genotype keys reserved for structural variants. + \item HQ (Integer): Haplotype qualities, two comma separated phred qualities. + \item MQ (Integer): RMS mapping quality, similar to the version in the INFO field. + \item PL (Integer): The phred-scaled genotype likelihoods rounded to the closest integer, and otherwise defined precisely as the GL field. + \item PQ (Integer): Phasing quality, the phred-scaled probability that alleles are ordered incorrectly in a heterozygote (against all other members in the phase set). We note that we have not yet included the specific measure for precisely defining ``phasing quality''; our intention for now is simply to reserve the PQ tag for future use as a measure of phasing quality. + \item PS (non-negative 32-bit Integer): Phase set. A phase set is defined as a set of phased genotypes to which this genotype belongs. Phased genotypes for an individual that are on the same chromosome and have the same PS value are in the same phased set. 
A phase set specifies multi-marker haplotypes for the phased genotypes in the set. All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set. If the genotype in the GT field is unphased, the corresponding PS field is ignored. The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required). \end{itemize} - -If any of the fields is missing, it is replaced with the missing value. For example if the FORMAT is GT:GQ:DP:HQ then $0\mid0:.:23:23,34$ indicates that GQ is missing. Trailing fields can be dropped (with the exception of the GT field, which should always be present if specified in the FORMAT field). +If any of the fields is missing, it is replaced with the missing value. For example if the FORMAT is GT:GQ:DP:HQ then $0\mid0:.:23:23,34$ indicates that GQ is missing. Trailing fields can be dropped (with the exception of the GT field, which must always be present if specified in the FORMAT field). See below for additional genotype fields used to encode structural variants. Additional Genotype fields can be defined in the meta-information. However, software support for such fields is not guaranteed. @@ -462,22 +486,22 @@ \subsubsection{Bisulfite sequencing specific fields} As with genotype data, if DNA methylation information from bisulfite sequencing experiments is present then the same type of information -should be present for all samples, and the FORMAT field should +must be present for all samples, and the FORMAT field must specifiy the data types and order. If both methylation and genotype -data are present then they should be reported together. The relative +data are present then they must be reported together. 
The relative order of genotype and methylation fields is not determined by the -specifications,except that the first sub-field should be the genotype +specifications,except that the first sub-field must be the genotype (GT) if present as described above. There are no required sub-fields. It is, however, strongly recommended that the bisulfite strand specific counts (MC8) are present. If methylation data only are present, then -the GT and other genotype associated fields should be omitted. +the GT and other genotype associated fields must be omitted. In contrast to normal practice with genotype only data where only positions where sequence variants are called are generally present in the mVCF file, when methylation data is present then all observed positions where a C or a G allele is present either in the observed data or in the reference and every position where a non-reference -allele is reported should be present in the VCF file. In practice, +allele is reported must be present in the VCF file. In practice, this means every position where the called genotype for all samples is \emph{not} homozygous reference with the reference being A or T. Hard filtering of sites on read depth criteria is allowed, but it is @@ -485,22 +509,22 @@ \subsubsection{Bisulfite sequencing specific fields} quality as this can introduce biases, since different combinations of genotype/methylation required different coverage to achieve the same confidence of genotype call. If any of the fields are missing, they -should be replaced by the missing value. If the allele count field +must be replaced by the missing value. If the allele count field (MC8) is present then the methylation point estimates (MEF, MER) and number of methylation informative bases (MN) are not required. -However, if MC8 is \emph{not} present then MEF, MER and MN should all +However, if MC8 is \emph{not} present then MEF, MER and MN must all be present. 
\begin{itemize} \renewcommand{\labelitemii}{$\circ$} - \item MC8: Base counts for A,C,G,T \emph{not} informative for methylation followed by base counts for A,C,G,T \emph{informative} for methylation.(8 Integers). These counts do not consider the genotype call, and simply report the number of bases of each type seen at the position (after an optional quality filtering step). If not all counts are available (due to conversion from another format) then the missing character '.' should be used to represent the missing values. - \item CS: Strand of Cytosine with respect to reference genome (+/-/+-/NA). Heterozygous C/G SNPs should be represented as '+-' as there is a cytosine on both strands. Sites where no Cytosine is present on either strand should be represented by 'NA'. (String) + \item MC8: Base counts for A,C,G,T \emph{not} informative for methylation followed by base counts for A,C,G,T \emph{informative} for methylation.(8 Integers). These counts do not consider the genotype call, and simply report the number of bases of each type seen at the position (after an optional quality filtering step). If not all counts are available (due to conversion from another format) then the missing character '.' must be used to represent the missing values. + \item CS: Strand of Cytosine with respect to reference genome (+/-/+-/NA). Heterozygous C/G SNPs must be represented as '+-' as there is a cytosine on both strands. Sites where no Cytosine is present on either strand must be represented by 'NA'. (String) \item CG : CpG status for position as determined by the called genotypes. This field can take values 'CG',' N', 'H' or '?' to represent 'Yes', 'No', 'Heterozygous' and 'Unknown'. A position called as homozygous C that is followed by a homozygous G call would have a CpG status of 'Y', whereas if the following position was called as a heterozygote containing a G (i.e., AG or TG) then the CpG status would be 'H'. 
A status of 'N' is only given when the following base is confidently called as \emph{not} containing a G. Similar rules apply to a position called as a homozygous G with respect to the whether the genotype call for the preceding base contains a C. (String) - \item CX: 5 base sequence context based on called genotypes. This field provides additional information to the CG field above by giving the genotype calls for the 2 bases before the current position, the base at the current position, and the 2 bases following. The sequence context is always given with respect to the forward strand. Heterozygous genotype calls should be represented using the IUPAC codes. (String) + \item CX: 5 base sequence context based on called genotypes. This field provides additional information to the CG field above by giving the genotype calls for the 2 bases before the current position, the base at the current position, and the 2 bases following. The sequence context is always given with respect to the forward strand. Heterozygous genotype calls must be represented using the IUPAC codes. (String) \item MN: Number of bases informative for methylation. (Integer) - \item MEF: Methylation point estimate from the forward strand i.e., applying to a C. The estimate should be from 0-1. (Float) - \item MER: Methylation point estimate from the reverse strand. i.e., applying to a G. The estimate should be from 0-1. (Float) + \item MEF: Methylation point estimate from the forward strand i.e., applying to a C. The estimate must be from 0-1. (Float) + \item MER: Methylation point estimate from the reverse strand. i.e., applying to a G. The estimate must be from 0-1. (Float) \end{itemize} @@ -517,7 +541,7 @@ \subsection{VCF tag naming conventions} \begin{itemize} \item The "L" suffix means "likelihood" as log-likelihood in the sampling distribution, log10 Pr(Data$|$Model). Likelihoods are represented as log10 - scale, so has to be negative (e.g.~GL, CNL). 
The likelihood can be also + scale, thus they are negative numbers (e.g.~GL, CNL). The likelihood can be also represented in some cases as phred-scale in a separate tag (e.g.~PL). \item The "P" suffix means "probability" as linear-scale probability in the @@ -542,18 +566,28 @@ \section{INFO keys used for structural variants} \end{verbatim} \normalsize For precise variants, END is POS + length of REF allele - 1, and the for imprecise variants the corresponding best estimate. + \footnotesize \begin{verbatim} ##INFO= \end{verbatim} \normalsize -Value should be one of DEL, INS, DUP, INV, CNV, BND. This key can be derived from the REF/ALT fields but is useful for filtering. +This key can be derived from the REF/ALT fields but is useful for filtering. The reserved values must be used for the types listed below: +\begin{itemize} + \item DEL: Deletion relative to the reference + \item INS: Insertion of novel sequence relative to the reference + \item DUP: Region of elevated copy number relative to the reference + \item INV: Inversion of reference sequence + \item CNV: Copy number variable region (may be both deletion and duplication) +\end{itemize} + \footnotesize \begin{verbatim} ##INFO= \end{verbatim} \normalsize One value for each ALT allele. Longer ALT alleles (e.g. insertions) have positive values, shorter ALT alleles (e.g. deletions) have negative values. + \footnotesize \begin{verbatim} ##INFO= @@ -564,6 +598,7 @@ \section{INFO keys used for structural variants} \end{verbatim} \normalsize For precise variants, the consensus sequence the alternate allele assembly is derivable from the REF and ALT fields. However, the alternate allele assembly file may contain additional information about the characteristics of the alt allele contigs. + \footnotesize \begin{verbatim} ##INFO= @@ -840,7 +875,7 @@ \subsection{Specifying complex rearrangements with breakends} An arbitrary rearrangement event can be summarized as a set of novel \textbf{adjacencies}. 
Each adjacency ties together $2$ \textbf{breakends}. The two breakends at either end of a novel adjacency are called \textbf{mates}. -There is one line of VCF (i.e. one record) for each of the two breakends in a novel adjacency. A breakend record is identified with the tag ``SYTYPE=BND'' in the INFO field. The REF field of a breakend record indicates a base or sequence s of bases beginning at position POS, as in all VCF records. The ALT field of a breakend record indicates a replacement for s. This ``breakend replacement'' has three parts: +There is one line of VCF (i.e. one record) for each of the two breakends in a novel adjacency. A breakend record is identified with the tag ``SVTYPE=BND'' in the INFO field. The REF field of a breakend record indicates a base or sequence s of bases beginning at position POS, as in all VCF records. The ALT field of a breakend record indicates a replacement for s. This ``breakend replacement'' has three parts: \begin{enumerate} \item The string t that replaces places s. The string t may be an extended version of s if some novel bases are inserted during the formation of the novel adjacency. \item The position p of the mate breakend, indicated by a string of the form ``chr:pos''. This is the location of the first mapped base in the piece being joined at this novel adjacency. @@ -970,7 +1005,7 @@ \subsubsection{Multiple mates} \normalsize \subsubsection{Explicit partners} -Two breakends which are connected in the reference genome but disconnected in the variants are called partners. Each breakend only has one partner, typically one basepair left or right. However, it is not uncommon to observe loss of a few basepairs during the rearrangement. It is then possible to explicitly name a breakend's partner, such as in Figure 5.: +Two breakends which are connected in the reference genome but disconnected in the variants are called partners. Each breakend only has one partner, typically one basepair left or right. 
However, it is not uncommon to observe loss of a few basepairs during the rearrangement. A breakend's partner may be explicitly named as in Figure 5: \begin{figure}[ht] \centering @@ -1305,9 +1340,9 @@ \section{Representing DNA methylation variation in VCF records} \subsection{An example} \scriptsize \begin{verbatim} -##fileformat=mVCFv1.0 +##fileformat=mVCFv4.3 ##fileDate=20150505 -##source=myBScallerV1.1 +##source=myBScallerV4.3 ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta ##contig=l ##INFO= @@ -1371,7 +1406,7 @@ \subsection{Overall file organization} No record may refer to a contig not present in the header itself. \item All INFO and GENOTYPE fields must be fully typed in the BCF2 header to -enable type-specific encoding of the fields in records. An error should be +enable type-specific encoding of the fields in records. An error must be thrown when converting a VCF to BCF2 when an unknown or not fully specified field is encountered in the records. \end{itemize} @@ -1459,8 +1494,8 @@ \subsection{BCF2 records} blocks. Each record is conceptually two parts. First is the site information (chr, pos, INFO field). Immediately after the sites data is the genotype data for every sample in the BCF2 file. The genotype data may be omitted entirely -from the record if there is no genotype data in the VCF file. Note that it's -acceptable to not BGZF compress a BCF2 file. +from the record if there is no genotype data in the VCF file. +Compression of a BCF file is recommended but not required. \subsubsection{Site encoding} @@ -1471,7 +1506,7 @@ \subsubsection{Site encoding} l\_indiv & uint32\_t & Data length of FORMAT and individual genotype fields \\ \hline CHROM & int32\_t & Given as an offset into the mandatory contig dictionary \\ \hline POS & int32\_t & 0-based leftmost coordinate \\ \hline -rlen & int32\_t & Length of the record as projected onto the reference sequence. 
May be the actual length of the REF allele but for symbolic alleles should be the declared length respecting the END attribute \\ \hline +rlen & int32\_t & Length of the record as projected onto the reference sequence. Must be the length of the REF allele or the declared length of a symbolic allele respecting the END attribute \\ \hline n\_allele\_info & int32\_t & n\_info, where n\_allele is the number of REF+ALT alleles in this record, and n\_info is the number of VCF INFO fields present in this record \\ \hline n\_fmt\_sample & uint32\_t & n\_sample, where n\_fmt is the number of format fields for genotypes in this record, and n\_samples is the number of samples present in this sample. Note that the number of samples must be equal to the number of samples in the header \\ \hline QUAL & float & Variant quality; 0x7F800001 for a missing value \\ \hline @@ -1517,7 +1552,7 @@ \subsubsection{Genotype encoding} The value is always implicitly a vector of N values, where N is the number of samples. The type byte of the value field indicates the type of each value of the N length vector. For atomic values this is straightforward (size = 1). But if the type field indicates that the values are themselves vectors (as often occurs, such as with the PL field) then each of the N values in the outer vector is itself a vector of values. This encoding is efficient when every value in the genotype field vector has the same length and type. -Note that the specific order of fields isn't defined, but it's probably a good idea to respect the ordering as specified in the input VCF/BCF2 file. +It is recommended to respect the ordering as specified in the input VCF/BCF2 file, but parsers should not rely on a specific ordering. If there are no sample records (genotype data) in this VCF/BCF2 file, the size of the genotypes block will be 0. @@ -1607,7 +1642,7 @@ \subsubsection{Type encoding} \end{tabular} \vspace{0.3cm} -\textbf{Character} values are not explicitly typed in BCF2. 
Instead, VCF Character values should be encoded by a single character string. See also \ref{character-encoding}. +\textbf{Character} values are not explicitly typed in BCF2. Instead, VCF Character values must be encoded by a single character string. See also \ref{character-encoding}. \vspace{0.3cm} \textbf{Flags} values -- which can only appear in INFO fields -- in BCF2 should be encoded by any non-MISSING value. The recommended best practice is to encode the value as an 1-element INT8 (type 0x11) with value of 1 to indicate present. Because FLAG values can only be encoded in INFO fields, BCF2 provides no mechanism to encode FLAG values in genotypes, but could be easily extended to do so if allowed in a future VCF version. @@ -1784,17 +1819,17 @@ \subsubsection{Encoding ID} \subsubsection{Encoding REF/ALT fields} We encode each of REF and ALT as typed strings, first REF followed immediately -by ALT. Each is a 1 element string (0x19), which would then be followed by the +by ALT. Each is a 1 element string (0x17), which would then be followed by the single bytes for the bases of 0x43 and 0x41: \vspace{0.3cm} \begin{tabular}{|l| l|} \hline -0x19 0x41 & REF A \\ \hline -0x19 0x43 & ALT C \\ \hline +0x17 0x41 & REF A \\ \hline +0x17 0x43 & ALT C \\ \hline \end{tabular} \vspace{0.3cm} -Just for discussion, suppose instead that ALT was ALT=C,T. The only thing that could change is that there would be another typed string following immediately after C encoding 0x19 (1 element string) with the value of 0x54. +Just for discussion, suppose instead that ALT was ALT=C,T. The only thing that could change is that there would be another typed string following immediately after C encoding 0x17 (1 element string) with the value of 0x54. \subsubsection{Encoding FILTER} @@ -1829,12 +1864,12 @@ \subsubsection{Encoding the INFO fields} \end{tabular} \vspace{0.3cm} -The ancestral allele (AA) tell us that among other primates the original allele is C, a Character here. 
Because we represent Characters as single element strings in BCF2 (0x19) with value 0x43 (C). So the entire key/value pair is: +The ancestral allele (AA) tell us that among other primates the original allele is C, a Character here. Because we represent Characters as single element strings in BCF2 (0x17) with value 0x43 (C). So the entire key/value pair is: \vspace{0.3cm} \begin{tabular}{|l |l|} \hline 0x11 0x51 & AA key \\ \hline -0x19 0x43 & with value of C \\ \hline +0x17 0x43 & with value of C \\ \hline \end{tabular} \subsubsection{Encoding Genotypes} @@ -1897,8 +1932,8 @@ \subsubsection{Encoding Genotypes} 0x00020004 & n\_allele\_info \\ \hline 0x05000003 & n\_fmt\_samples \\ \hline 0x59 0x72 0x73 0x31 0x32 0x33 & ID \\ \hline -0x19 0x41 & REF A \\ \hline -0x19 0x43 & ALT C \\ \hline +0x17 0x41 & REF A \\ \hline +0x17 0x43 & ALT C \\ \hline 0x11 0x00 & FILTER field PASS \\ \hline 0x11 0x50 0x11 0x01 & HM3 flag is present \\ \hline 0x11 0x51 & AC key \\ \hline @@ -1906,7 +1941,7 @@ \subsubsection{Encoding Genotypes} 0x11 0x52 & AN key \\ \hline 0x11 0x06 & with value of 6 \\ \hline 0x11 0x51 & AA key \\ \hline -0x19 0x43 & with value of C \\ \hline +0x17 0x43 & with value of C \\ \hline 0x1101 0x21 0x020202040404 & GT \\ \hline 0x1102 0x11 0x0A0A0A & GQ \\ \hline 0x1103 0x11 0x203040 & DP \\ \hline @@ -1933,6 +1968,13 @@ \subsection{BCF2 block gzip and indexing} \section{List of changes} +\subsection{Changes to VCFv4.3} + +\begin{itemize} +\item More strict language: "should" replaced with "must" where appropriate +\item Tables with Type and Number definitions for INFO and FORMAT reserved keys +\end{itemize} + \subsection{Changes between VCFv4.2 and VCFv4.3} \begin{itemize} @@ -1957,7 +1999,6 @@ \subsection{Changes between VCFv4.2 and VCFv4.3} \item Chromosome names cannot use reserved symbolic alleles and contain characters used by breakpoints (Section~\ref{sec-contig-field}). \item IUPAC ambiguity codes should be converted to a concrete base. 
\item Symbolic ALTs for IUPAC codes. -\item A section about storing DNA methylation information from bisulfite sequencing experiments was added. \end{itemize} \subsection{Changes between BCFv2.1 and BCFv2.2} diff --git a/_config.yml b/_config.yml new file mode 100644 index 000000000..fccf7e4b0 --- /dev/null +++ b/_config.yml @@ -0,0 +1,15 @@ +# Site settings +title: HTS file format specifications +description: > + Samtools organisation for next-generation sequencing + developers: htslib C API, htsjdk Java API, file format specifications, + and samtools/bcftools source code. +baseurl: "/hts-specs" +url: "http://samtools.github.io" +github_username: samtools +exclude: ["*.tex", img, new, Makefile, MAINTAINERS.md, README.md] + +# Build settings +markdown: kramdown +kramdown: + parse_block_html: true diff --git a/_includes/footer.html b/_includes/footer.html new file mode 100644 index 000000000..7ab07dc6d --- /dev/null +++ b/_includes/footer.html @@ -0,0 +1,18 @@ +
+
+ + +
+

{{ site.description }}

+
+
+
+
diff --git a/_includes/head.html b/_includes/head.html new file mode 100644 index 000000000..61eba9751 --- /dev/null +++ b/_includes/head.html @@ -0,0 +1,11 @@ + + + + + + {% if page.title %}{{ page.title | escape }}{% else %}{{ site.title | escape }}{% endif %} + + + + + diff --git a/_includes/icon-github.html b/_includes/icon-github.html new file mode 100644 index 000000000..e501a16b1 --- /dev/null +++ b/_includes/icon-github.html @@ -0,0 +1 @@ +{% include icon-github.svg %}{{ include.username }} diff --git a/_includes/icon-github.svg b/_includes/icon-github.svg new file mode 100644 index 000000000..4422c4f5d --- /dev/null +++ b/_includes/icon-github.svg @@ -0,0 +1 @@ + diff --git a/_layouts/default.html b/_layouts/default.html new file mode 100644 index 000000000..c58a0488c --- /dev/null +++ b/_layouts/default.html @@ -0,0 +1,17 @@ + + + + {% include head.html %} + + +
+
+ {{ content }} +
+
+ + {%unless page.suppress_footer %}{% include footer.html %}{% endunless %} + + + + diff --git a/htsget.md b/htsget.md new file mode 100644 index 000000000..0725897be --- /dev/null +++ b/htsget.md @@ -0,0 +1,385 @@ +--- +layout: default +title: htsget protocol +suppress_footer: true +--- + +# Htsget retrieval API spec v1.0.0 + +# Design principles + +This data retrieval API bridges from existing genomics bulk data transfers to a client/server model with the following features: + +* Incumbent data formats (BAM, CRAM) are preferred initially, with a future path to others. +* Multiple server implementations are supported, including those that do format transcoding on the fly, and those that return essentially unaltered filesystem data. +* Multiple use cases are supported, including access to small subsets of genomic data (e.g. for browsing a given region) and to full genomes (e.g. for calling variants). +* Clients can provide hints of the information to be retrieved; servers can respond with more information than requested but not less. +* We use the following pan-GA4GH standards: + * 0 start, half open coordinates + * The structuring of POST inputs, redirects and other non-reads data will be protobuf3 compatible JSON + +Explicitly this API does NOT: + +* Provide a way to discover the identifiers for valid ReadGroupSets --- clients obtain these via some out of band mechanism + + +# Protocol essentials + +All API invocations are made to a configurable HTTP(S) endpoint, receive URL-encoded query string parameters, and return JSON output. Successful requests result in HTTP status code 200 and have UTF-8-encoded JSON in the response body. The server may provide responses with chunked transfer encoding. The client and server may mutually negotiate HTTP/2 upgrade using the standard mechanism.
+ +The JSON response is an object with the single key `htsget` as described in the [Response JSON fields](#response-json-fields) and [Error Response JSON fields](#error-response-json-fields) sections. This ensures that, apart from whitespace differences, the message always starts with the same prefix. The presence of this prefix can be used as part of a client's response validation. + +Any timestamps that appear in the response from an API method are given as [ISO 8601] date/time format. + +HTTP responses may be compressed using [RFC 2616] `transfer-coding`, not `content-coding`. + +Requests adhering to this specification MAY include an `Accept` header specifying the htsget protocol version they are using: + + Accept: application/vnd.ga4gh.htsget.v1.0.0+json + +JSON responses SHOULD include a `Content-Type` header describing the htsget protocol version defining the JSON schema used in the response, e.g., + + Content-Type: application/vnd.ga4gh.htsget.v1.0.0+json; charset=utf-8 + +## Errors + +The server MUST respond with an appropriate HTTP status code (4xx or 5xx) when an error condition is detected. In the case of transient server errors, (e.g., 503 and other 5xx status codes), the client SHOULD implement appropriate retry logic as discussed in [Reliability & performance considerations](#reliability--performance-considerations) below. + +For errors that are specific to the `htsget` protocol, the response body SHOULD be a JSON object (content-type `application/json`) providing machine-readable information about the nature of the error, along with a human-readable description. The structure of this JSON object is described as follows. + +### Error Response JSON fields + + + +
+`htsget` +_object_ + +Container for response object. + + + +
+`error` +_string_ + +The type of error. This SHOULD be chosen from the list below. +
+`message` +_string_ + +A message specific to the error providing information on how to debug the problem. Clients MAY display this message to the user. +
+
+ +The following error types are defined: + +Error type | HTTP status code | Description +|-----|:---:|-----| +InvalidAuthentication | 401 | Authorization provided is invalid +PermissionDenied | 403 | Authorization is required to access the resource +NotFound | 404 | The resource requested was not found +UnsupportedFormat | 400 | The requested file format is not supported by the server +InvalidInput | 400 | The request parameters do not adhere to the specification +InvalidRange | 400 | The requested range cannot be satisfied + +The error type SHOULD be chosen from this table and be accompanied by the specified HTTP status code. An example of a valid JSON error response is: +```json +{ + "htsget" : { + "error": "NotFound", + "message": "No such accession 'ENS16232164'" + } +} +``` +## Security + +The htsget API enables the retrieval of potentially sensitive genomic data by means of a client/server model. Effective security measures are essential to protect the integrity and confidentiality of these data. + +Sensitive information transmitted on public networks, such as access tokens and human genomic data, MUST be protected using Transport Layer Security (TLS) version 1.2 or later, as specified in [RFC 5246](https://tools.ietf.org/html/rfc5246). + +If the data holder requires client authentication and/or authorization, then the client's HTTPS API request MUST present an OAuth 2.0 bearer access token as specified in [RFC 6750](https://tools.ietf.org/html/rfc6750), in the `Authorization` request header field with the `Bearer` authentication scheme: + +``` +Authorization: Bearer [access_token] +``` + +The policies and processes used to perform user authentication and authorization, and the means through which access tokens are issued, are beyond the scope of this API specification. GA4GH recommends the use of the OAuth 2.0 framework ([RFC 6749](https://tools.ietf.org/html/rfc6749)) for authentication and authorization.
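As an illustration only (this is not part of the patch), the request headers and error body described in the Errors and Security sections above could be handled on the client side roughly as follows. This is a minimal sketch: the helper names `build_headers` and `parse_error` and the `ERROR_STATUS` mapping are hypothetical, and only the media type, the `Bearer` scheme, the `htsget`/`error`/`message` keys and the error table are taken from the text.

```python
import json

# Media type quoted in the "Protocol essentials" section above.
HTSGET_MEDIA_TYPE = "application/vnd.ga4gh.htsget.v1.0.0+json"

# Error types and HTTP status codes copied from the error table above.
ERROR_STATUS = {
    "InvalidAuthentication": 401,
    "PermissionDenied": 403,
    "NotFound": 404,
    "UnsupportedFormat": 400,
    "InvalidInput": 400,
    "InvalidRange": 400,
}


def build_headers(access_token=None):
    """Return request headers for an htsget call (hypothetical helper)."""
    headers = {"Accept": HTSGET_MEDIA_TYPE}
    if access_token is not None:
        # OAuth 2.0 bearer token in the Authorization header (RFC 6750).
        headers["Authorization"] = "Bearer " + access_token
    return headers


def parse_error(body):
    """Extract (error, message) from an htsget error body, or None.

    Returns None when the body is not JSON or lacks the htsget/error keys,
    because the text only says the body SHOULD be such an object.
    """
    try:
        doc = json.loads(body)
    except ValueError:
        return None
    if not isinstance(doc, dict):
        return None
    inner = doc.get("htsget")
    if not isinstance(inner, dict) or "error" not in inner:
        return None
    return inner["error"], inner.get("message", "")
```

A client would pass the result of `build_headers` to whatever HTTP library it uses, and call `parse_error` on the body of any 4xx or 5xx response before falling back to the bare status code.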
+ +## CORS + +All API resources should have the following support for cross-origin resource sharing ([CORS]) to support browser-based clients: + +If a request to the URL of an API method includes the `Origin` header, its contents will be propagated into the `Access-Control-Allow-Origin` header of the response. Preflight requests (`OPTIONS` requests to the URL of an API method, with appropriate extra headers as defined in the CORS specification) will be accepted if the value of the `Access-Control-Request-Method` header is `GET`. +The values of `Origin` and `Access-Control-Request-Headers` (if any) of the request will be propagated to `Access-Control-Allow-Origin` and `Access-Control-Allow-Headers` respectively in the preflight response. +The `Access-Control-Max-Age` of the preflight response is set to the equivalent of 30 days. + + +# Method: get reads by ID + + GET /reads/ + +The core mechanic for accessing specified reads data. The JSON response is a "ticket" allowing the caller to obtain the desired data in the specified format, which may involve follow-on requests to other endpoints, as detailed below. + +The client can request only reads overlapping a given genomic range. The response may however contain a superset of the desired results, including all records overlapping the range, and potentially other records not overlapping the range; the client should filter out such extraneous records if necessary. Successful requests with empty result sets still produce a valid response in the requested format (e.g. including header and EOF marker). + +## URL parameters + + + +
+`id` +_required_ + +A string specifying which reads to return. + +The format of the string is left to the discretion of the API provider, including allowing embedded "/" characters. Strings could be ReadGroupSetIds as defined by the GA4GH API, or any other format the API provider chooses (e.g. "/data/platinum/NA12878", "/byRun/ERR148333"). +
+ +## Query parameters + + + + + + + + + +
+`format` +_optional string_ + +Request read data in this format. Default: BAM. Allowed values: BAM,CRAM. + +The server SHOULD reply with an `UnsupportedFormat` error if the requested format is not supported. +[^a] +
+`referenceName` +_optional_ + +The reference sequence name, for example "chr1", "1", or "chrX". If unspecified, all reads (mapped and unmapped) are returned. [^b] + +The server SHOULD reply with a `NotFound` error if the requested reference does not exist. +
+`start` +_optional 32-bit unsigned integer_ + +The start position of the range on the reference, 0-based, inclusive. + +The server SHOULD respond with an `InvalidInput` error if `start` is specified and a reference is not specified +(see `referenceName`). + +The server SHOULD respond with an `InvalidRange` error if `start` and `end` are specified and `start` is greater +than `end`. +
+`end` +_optional 32-bit unsigned integer_ + +The end position of the range on the reference, 0-based, exclusive. + +The server SHOULD respond with an `InvalidInput` error if `end` is specified and a reference is not specified +(see `referenceName`). + +The server SHOULD respond with an `InvalidRange` error if `start` and `end` are specified and `start` is greater +than `end`. +
+`fields` +_optional_ + +A list of fields to include; see "Field filtering" below. +Default: all. +
+`tags` +_optional_ + +A comma separated list of tags to include, default: all. If the empty string is specified (tags=) no tags are included. + +The server SHOULD respond with an `InvalidInput` error if `tags` and `notags` intersect. +
+`notags` +_optional_ + +A comma separated list of tags to exclude, default: none. + +The server SHOULD respond with an `InvalidInput` error if `tags` and `notags` intersect. +
+ +### Field filtering + +The list of fields is based on BAM fields: + +Field | Description +|-------|-------| +QNAME | Read names +FLAG | Read bit flags +RNAME | Reference sequence name +POS | Alignment position +MAPQ | Mapping quality score +CIGAR | CIGAR string +RNEXT | Reference sequence name of the next fragment in the template +PNEXT | Alignment position of the next fragment in the template +TLEN | Inferred template size +SEQ | Read bases +QUAL | Base quality scores + +Example: `fields=QNAME,FLAG,POS`. + +## Response JSON fields + + + +
+`htsget` +_object_ + +Container for response object. + + + + +
+`format` +_string_ + +Read data format. Default: BAM. Allowed values: BAM,CRAM. +
+`urls` +_array of objects_ + +An array providing URLs from which raw data can be retrieved. The client must retrieve binary data blocks from each of these URLs and concatenate them to obtain the complete response in the requested format. + +Each element of the array is a JSON object with the following fields: + + + + +
+`url` +_string_ + +One URL. + +May be either a `https:` URL or an inline `data:` URI. HTTPS URLs require the client to make a follow-up request (possibly to a different endpoint) to retrieve a data block. Data URIs provide a data block inline, without necessitating a separate request. + +Further details below. +
+`headers` +_optional object_ + +For HTTPS URLs, the server may supply a JSON object containing one or more string key-value pairs which the client MUST supply as headers with any request to the URL. For example, if headers is `{"Range": "bytes=0-1023", "Authorization": "Bearer xxxx"}`, then the client must supply the headers `Range: bytes=0-1023` and `Authorization: Bearer xxxx` with the HTTPS request to the URL. +
+ +
+`md5` +_optional hex string_ + +MD5 digest of the blob resulting from concatenating all of the "payload" data, i.e. the data blocks retrieved from `urls`. +
+
+ +An example of a JSON response is: +```json +{ + "htsget" : { + "format" : "BAM", + "urls" : [ + { + "url" : "data:application/vnd.ga4gh.bam;base64,QkFNAQ==" + }, + { + "url" : "https://htsget.blocksrv.example/sample1234/header" + }, + { + "url" : "https://htsget.blocksrv.example/sample1234/run1.bam", + "headers" : { + "Authorization" : "Bearer xxxx", + "Range" : "bytes=65536-1003750" + } + }, + { + "url" : "https://htsget.blocksrv.example/sample1234/run1.bam", + "headers" : { + "Authorization" : "Bearer xxxx", + "Range" : "bytes=2744831-9375732" + } + } + ] + } +} +``` + +## Response data blocks + +### Diagram of core mechanic + +![Diagram showing ticket flow](pub/htsget-ticket.png) + +1. Client sends a request with id, genomic range, and filter. +2. Server replies with a ticket describing data block locations (URLs and headers). +3. Client fetches the data blocks using the URLs and headers. +4. Client concatenates data blocks to produce local blob. + +While the blocks must be finally concatenated in the given order, the client may fetch them in parallel. + +### HTTPS data block URLs + +1. must have percent-encoded path and query (e.g. javascript encodeURIComponent; python urllib.urlencode) +2. must accept GET requests +3. should provide CORS +4. should allow multiple request retries, within reason +5. should use HTTPS rather than plain HTTP except for testing or internal-only purposes (providing both security and robustness to data corruption in flight) +6. need not use the same authentication scheme as the API server. URL and `headers` must include any temporary credentials necessary to access the data block. Client must not send the bearer token used for the API, if any, to the data block endpoint, unless copied in the required `headers`. +7. Server must send the response with either the Content-Length header, or chunked transfer encoding, or both. Clients must detect premature response truncation. +8. 
Client and URL endpoint may mutually negotiate HTTP/2 upgrade using the standard mechanism. +9. Client must follow 3xx redirects from the URL, subject to typical fail-safe mechanisms (e.g. maximum number of redirects), always supplying the `headers`, if any. +10. If a byte range HTTP header accompanies the URL, then the client MAY decompose this byte range into several sub-ranges and open multiple parallel, retryable requests to fetch them. (The URL and `headers` must be sufficient to authorize such behavior by the client, within reason.) + +### Inline data block URIs + +e.g. `data:application/vnd.ga4gh.bam;base64,SGVsbG8sIFdvcmxkIQ==` ([RFC 2397], [Data URI]). +The client obtains the data block by decoding the embedded base64 payload. + +1. must use base64 payload encoding (simplifies client decoding logic) +2. client should ignore the media type (if any), treating the payload as a partial blob. + +Note: the base64 text should not be additionally percent encoded. + +### Reliability & performance considerations + +To provide robustness to sporadic transfer failures, servers should divide large payloads into multiple data blocks in the `urls` array. Then if the transfer of any one block fails, the client can retry that block and carry on, instead of starting all over. Clients may also fetch blocks in parallel, which can improve throughput. + +Initial guidelines, which we expect to revise in light of future experience: +* Data blocks should not exceed ~1GB +* Inline data URIs should not exceed a few megabytes + +### Security considerations + +The data block URL and headers might contain embedded authentication tokens; therefore, production clients and servers should not unnecessarily print them to console, write them to logs, embed them in error messages, etc. + + +# Possible future enhancements + +1. add a mechanism to request reads from more than one ID at a time (e.g. for a trio) +2. allow clients to provide a suggested data block size to the server +3. 
consider adding other data types (e.g. variants) +4. add POST support (if and when request sizes get large) +5. [mlin] add a way to request all unmapped reads (e.g. by passing `*` for `referenceName`) +6. [dglazer] add a way to request reads in GA4GH binary format [^d] (e.g. fmt=proto) + +## Existing clarification suggestions + +[^a]: This should probably be specified as a (comma separated?) list in preference order. If the client can accept both BAM and CRAM it is useful for it to indicate this and let the server pick whichever format it is most comfortable with. +[^d]: How will compression work in this case - can we benefit from columnar compression as does Parquet? + + +[CORS]: http://www.w3.org/TR/cors/ +[Data URI]: https://en.wikipedia.org/wiki/Data_URI_scheme +[ISO 8601]: http://www.iso.org/iso/iso8601 +[RFC 2397]: https://www.ietf.org/rfc/rfc2397.txt +[RFC 2616]: http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html +[RFC 5246]: https://tools.ietf.org/html/rfc5246 +[RFC 6749]: https://tools.ietf.org/html/rfc6749 +[RFC 6750]: https://tools.ietf.org/html/rfc6750 + + diff --git a/index.md b/index.md new file mode 100644 index 000000000..cb6af7728 --- /dev/null +++ b/index.md @@ -0,0 +1,34 @@ +--- +layout: default +title: HTS format specifications +--- +{% capture newline %} +{% endcapture %} +{% capture readme %}{% include_relative README.md %}{% endcapture %} +{% assign readme_lines = readme | split: newline %} + +{% for line in readme_lines limit: 2 %} +{{line}}{% endfor %} + + +
+{% for line in readme_lines offset: 8 %} +{{line}}{% endfor %} +
+
diff --git a/new/.gitignore b/new/.gitignore new file mode 100644 index 000000000..a13633799 --- /dev/null +++ b/new/.gitignore @@ -0,0 +1 @@ +*.pdf diff --git a/pub/htsget-ticket.png b/pub/htsget-ticket.png new file mode 100644 index 000000000..db8fbe4ed Binary files /dev/null and b/pub/htsget-ticket.png differ diff --git a/pub/main.css b/pub/main.css new file mode 100644 index 000000000..2bd33fa0c --- /dev/null +++ b/pub/main.css @@ -0,0 +1,58 @@ +body { + font-family: Helvetica, Arial, sans-serif; +} + +a { color: #2a7ae2; text-decoration: none; } +a:hover { color: #000; text-decoration: underline; } +a:visited { color: #205caa; } + +.wrapper { margin: 2ex 4em; } +div.sidebar { margin-left: 10px; width: 230px; float: left; } +div.mainbar { margin-left: 240px; line-height: 1.5; } +div.lowered { margin-top: 10ex; } +div.clear { clear: both; } + +div.sidebar li { margin: 0.5ex 0; } + +table { + border-collapse: collapse; +} + +th, td { + border: 1px solid black; + padding: 1ex 1em; + vertical-align: top; +} + +.site-footer { + margin-top: 4ex; + border-top: 1px solid #e8e8e8; + padding-top: 3ex; +} + +.site-footer ul { list-style: none; } +.site-footer ul, .site-footer p { margin-top: 0; } +.site-footer li, .site-footer p { + font-size: 15px; + letter-spacing: -.3px; + color: #828282; +} + +.icon > svg { + display: inline-block; + width: 16px; + height: 16px; + vertical-align: middle; +} + +@media print, screen and (max-width: 720px) { +.wrapper { margin: 1ex 2em; } +div.sidebar { width: 160px; } +div.mainbar { margin-left: 170px; } +} + +@media print, screen and (max-width: 480px) { +.wrapper { margin: 1ex 1em; } +div.sidebar { display: none; } +div.mainbar { margin-left: 0; } +} diff --git a/tabix.pdf b/tabix.pdf new file mode 100644 index 000000000..ba9b9a8dc Binary files /dev/null and b/tabix.pdf differ