Update VCFv4.3

samtools · Oct 23, 2017 · 7b379de
1 parent 45b6c67
commit 7b379de

Showing 34 changed files with 1,434 additions and 336 deletions.
diff --git a/.gitignore b/.gitignore
@@ -6,5 +6,6 @@
 *.ver
 
 *.dvi
-*.pdf
 *.ps
+
+/_site
diff --git a/BCFv1_qref.pdf b/BCFv1_qref.pdf
diff --git a/BCFv2_qref.pdf b/BCFv2_qref.pdf
diff --git a/CRAMv2.1.pdf b/CRAMv2.1.pdf
diff --git a/CRAMv2.1.tex b/CRAMv2.1.tex
@@ -392,12 +392,30 @@ \section{\textbf{File definition}}
 \hline
 unsigned byte & major format number & 2 (0x2)\tabularnewline
 \hline
-unsigned byte & minor format number & 0 (0x0)\tabularnewline
+unsigned byte & minor format number & 1 (0x1)\tabularnewline
 \hline
 byte[20] & file id & CRAM file identifier (e.g. file name or SHA1 checksum)\tabularnewline
 \hline
 \end{tabular}
 
+Valid CRAM \textit{major}.\textit{minor} version numbers are as follows:
+
+\begin{itemize}
+\item[\textit{1.0}]
+The original public CRAM release.
+
+\item[\textit{2.0}]
+The first CRAM release implemented in both Java and C; tidied up
+implementation vs specification differences in \textit{1.0}.
+
+\item[\textit{2.1}]
+Gained end of file markers; compatible with \textit{2.0}.
+
+\item[\textit{3.0}]
+Additional compression methods; header and data checksums;
+improvements for unsorted data.
+\end {itemize}
+
 \section{\textbf{Container structure}}
 
 The file definition is followed by one or more containers with the following header 

diff --git a/CRAMv3.pdf b/CRAMv3.pdf
diff --git a/CRAMv3.tex b/CRAMv3.tex
@@ -83,7 +83,7 @@ \section{\textbf{Data types}}
 types are written as words (e.g. int) while physical data types are written using 
 single letters (e.g. i). The difference between the two is that storage data types 
 define how logical data types are stored in CRAM. Data in CRAM is stored either 
-as as bits or as bytes. Writing values as bits and bytes is described in detail 
+as bits or bytes. Writing values as bits and bytes is described in detail 
 below.
 
 \subsection{\textbf{Logical data types}}
@@ -195,7 +195,7 @@ \subsection{\textbf{Writing bytes to a byte stream}}
 the number of bytes to follow. To accommodate 32 bits such representation requires 
 5 bytes with only 4 lower bits used in the last byte 5.
 
-\item[LTF-8 long or (ltf8)]\ \newline
+\item[LTF-8 long (ltf8)]\ \newline
 See ITF-8 for more details. The only difference between ITF-8 and LTF-8 is the 
 number of bytes used to encode a single value. To do so 64 bits are required and 
 this can be done with 9 byte at most with the first byte consisting of just 1s 
@@ -204,11 +204,8 @@ \subsection{\textbf{Writing bytes to a byte stream}}
 \item[{Array ([ ])}]\ \newline
 Array length is written first as integer (itf8), followed by the elements of the 
 array. 
-\end{description}
-
-
-\subsubsection*{Encoding}
 
+\item[{Encoding}]\ \newline
 Encoding is a data type that specifies how data series have been compressed. Encodings 
 are defined as encoding\texttt{<}type\texttt{>} where the type is a logical data 
 type as opposed to a storage data type.
@@ -244,8 +241,7 @@ \subsubsection*{Encoding}
 K = 0x1 = 1
 
 
-\subsubsection*{Map}
-
+\item[{Map}]\ \newline
 A map is a collection of keys and associated values. A map with N keys is written 
 as follows: 
 
@@ -258,11 +254,13 @@ \subsubsection*{Map}
 Both the size in bytes and the number of keys are written as integer (itf8). Keys 
 and values are written according to their data types and are specific to each map.
 
-\subsection{\textbf{Strings}}
-
-Strings are represented as byte arrays using UTF-8 format. Read names, reference 
+\item[String]\ \newline
+A string is represented as byte arrays using UTF-8 format. Read names, reference 
 sequence names and tag values with type `Z' are stored as UTF-8.
 
+\end{description}
+
+
 \section{\textbf{Encodings }}
 
 Encoding is a data structure that captures information about compression details 
@@ -397,14 +395,32 @@ \section{\textbf{File definition}}
 \hline
 byte[4] & format magic number & CRAM (0x43 0x52 0x41 0x4d)\tabularnewline
 \hline
-unsigned byte & major format number & 2 (0x2)\tabularnewline
+unsigned byte & major format number & 3 (0x3)\tabularnewline
 \hline
 unsigned byte & minor format number & 0 (0x0)\tabularnewline
 \hline
 byte[20] & file id & CRAM file identifier (e.g. file name or SHA1 checksum)\tabularnewline
 \hline
 \end{tabular}
 
+Valid CRAM \textit{major}.\textit{minor} version numbers are as follows:
+
+\begin{itemize}
+\item[\textit{1.0}]
+The original public CRAM release.
+
+\item[\textit{2.0}]
+The first CRAM release implemented in both Java and C; tidied up
+implementation vs specification differences in \textit{1.0}.
+
+\item[\textit{2.1}]
+Gained end of file markers; compatible with \textit{2.0}.
+
+\item[\textit{3.0}]
+Additional compression methods; header and data checksums;
+improvements for unsorted data.
+\end {itemize}
+
 \section{\textbf{Container structure}}
 
 The file definition is followed by one or more containers with the following header 
@@ -1009,6 +1025,8 @@ \subsection{\textbf{CRAM record bit flags (BF data series)}}
 \hline
 0x400 &  & PCR or optical duplicate\tabularnewline
 \hline
+0x800 &  & Supplementary alignment\tabularnewline
+\hline
 \end{tabular}
 
 * For segments within the same slice.

diff --git a/CSIv1.pdf b/CSIv1.pdf
diff --git a/CSIv2.pdf b/CSIv2.pdf
diff --git a/MAINTAINERS.md b/MAINTAINERS.md
@@ -0,0 +1,68 @@
+## Specification maintainers
+
+The SAM, BAM, and VCF formats originated in the 1000 Genomes Project.
+In February 2014, ongoing format maintenance was brought under the aegis of the [Global Alliance for Genomics & Health][ga4gh-ff].
+At this time, lead maintainers for each of the formats were nominated.
+The current maintainers are listed below.
+
+### SAM/BAM
+
+* James Bonfield (@jkbonfield)
+* John Marshall (@jmarshall)
+* Yossi Farjoun (@yfarjoun)
+
+Past SAM/BAM maintainers include Jay Carey, Tim Fennell, and Nils Homer.
+
+### CRAM
+
+* James Bonfield (@jkbonfield)
+* Vadim Zalunin (@vadimzalunin)
+
+### VCF/BCF
+
+* Cristina Yenyxe Gonzalez Garcia (@cyenyxe)
+* David Roazen (@droazen)
+* Petr Danecek (@pd3)
+
+Past VCF/BCF maintainers include Ryan Poplin.
+
+### Htsget
+
+* Mike Lin (@mlin)
+
+[ga4gh-ff]:  https://genomicsandhealth.org/working-groups/our-work/file-formats
+
+
+## Generating PDF specification documents
+
+Use the _Makefile_ to generate PDFs from the TeX source documents.
+Both TeX source and generated PDFs are checked into the **master** branch, so the make rules are set up to stage PDFs into a _new/_ subdirectory, from where they can be copied when you are ready to check them in.
+
+Most of the specifications use a _.ver_ file and associated rules to display a commit hash and datestamp on their title page.
+(See _SAMv1.tex_ and _new/SAMv1.pdf_'s _Makefile_ dependencies for how to add this to other specifications.)
+So the usual workflow when editing these documents is (for example, when working on the SAM specification):
+
+1. Edit _SAMv1.tex_, and type `make new/SAMv1.pdf` to generate a working PDF to preview your work.
+
+2. When you are ready, commit your _.tex_ source changes (but don't commit any changed PDF files yet).
+
+3. Type `make clean SAMv1.pdf` to regenerate the PDF and copy it to the main directory.
+(Optionally, verify that it contains the correct commit hash for your source changes.)
+Now commit your _.pdf_ changes, separately from any source changes.
+
+### Rationale
+
+It is a little inconvenient having the working PDFs down in a subdirectory, but this is outweighed by the convenience of being able to switch between Git branches etc without trouble — as there would be if updated working PDFs were in the main directory, overwriting the checked-in PDFs.
+
+The intention is that the commit hash embedded in a PDF encompasses all the source changes and commits that contribute to that PDF.
+The hash of the particular commit that updates the PDF is of course not yet known when the PDF is being generated, so the best that can be done is the hash of a slightly-previous commit.
+Therefore:
+* The PDF needs to be committed separately from the corresponding TeX source changes.
+* The PDF should not be updated in a merge commit (as commits from one or the other of the merge's parents will not be recorded), and there's not much point updating it in a pull request.
+* So pull requests need to be merged, and then their PDFs updated separately as a non-merge commit on **master**.
+* If a series of changes are being made or several pull requests are being merged at once, the PDF updates can be batched up and just made once at the end.
+* Conversely, if there are changes pending to several (even unrelated) PDFs, there is no reason not to commit them all at once.
+
+If you are working on several PDFs at once, be careful in step 3 and perhaps use `make clean new/VCFv4.2.pdf new/VCFv4.3.pdf; make VCFv4.2.pdf VCFv4.3.pdf` to ensure that spurious “-dirty” commit hashes don't make their way into your PDFs.
+
+<!-- vim:set linebreak: -->
diff --git a/Makefile b/Makefile
@@ -6,35 +6,40 @@ PDFS =	BCFv1_qref.pdf \
 	CRAMv3.pdf \
 	CSIv1.pdf \
 	SAMv1.pdf \
+	SAMtags.pdf \
 	tabix.pdf \
 	VCFv4.1.pdf \
 	VCFv4.2.pdf \
 	VCFv4.3.pdf
 
-pdf: $(PDFS)
+pdf: $(PDFS:%=new/%)
 
-CRAMv2.1.pdf: CRAMv2.1.tex CRAMv2.1.ver
-CRAMv3.pdf: CRAMv3.tex CRAMv3.ver
-SAMv1.pdf: SAMv1.tex SAMv1.ver
-VCFv4.1.pdf: VCFv4.1.tex VCFv4.1.ver
-VCFv4.2.pdf: VCFv4.2.tex VCFv4.2.ver
-VCFv4.3.pdf: VCFv4.3.tex VCFv4.3.ver
+%.pdf: new/%.pdf
+	cp $^ $@
 
+new/CRAMv2.1.pdf: CRAMv2.1.tex new/CRAMv2.1.ver
+new/CRAMv3.pdf: CRAMv3.tex new/CRAMv3.ver
+new/SAMv1.pdf: SAMv1.tex new/SAMv1.ver
+new/SAMtags.pdf: SAMtags.tex new/SAMtags.ver
+new/VCFv4.1.pdf: VCFv4.1.tex new/VCFv4.1.ver
+new/VCFv4.2.pdf: VCFv4.2.tex new/VCFv4.2.ver
+new/VCFv4.3.pdf: VCFv4.3.tex new/VCFv4.3.ver
 
-.SUFFIXES: .tex .pdf .ver
-.tex.pdf:
-	pdflatex $<
-	while grep -q 'Rerun to get [a-z-]* right' $*.log; do pdflatex $< || exit; done
 
-.tex.ver:
+new/%.pdf: %.tex
+	pdflatex --output-directory new $<
+	while grep -q 'Rerun to get [a-z-]* right' new/$*.log; do pdflatex --output-directory new $< || exit; done
+
+new/%.ver: %.tex
 	echo "@newcommand*@commitdesc{`git describe --always --dirty`}@newcommand*@headdate{`git rev-list -n1 --format=%aD HEAD $< | sed '1d;s/.*, *//;s/ *[0-9]*:.*//'`}" | tr @ \\ > $@
 
 
 mostlyclean:
-	-rm -f *.aux *.idx *.log *.out *.toc *.ver
+	-cd new && rm -f *.aux *.idx *.log *.out *.toc *.ver
 
 clean: mostlyclean
-	-rm -f $(PDFS)
+	-cd new && rm -f $(PDFS)
+	-rm -rf _site
 
 
 .PHONY: all pdf mostlyclean clean
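
Note on the `new/%.ver` rule above: the `sed` script and the trailing `tr @ \\` are terse, so here is a rough Python sketch of what they appear to do to the two-line output of `git rev-list -n1 --format=%aD`. The function names and the sample commit line and date are hypothetical and only illustrate the text transformation; the sketch does not call Git.

```python
import re


def head_date(rev_list_output: str) -> str:
    """Mimic sed '1d;s/.*, *//;s/ *[0-9]*:.*//' on the rev-list output."""
    # '1d' drops the leading "commit <hash>" line; the date is on line 2.
    date_line = rev_list_output.splitlines()[1]
    # 's/.*, *//' strips the weekday prefix, e.g. "Mon, ".
    date_line = re.sub(r".*, *", "", date_line, count=1)
    # 's/ *[0-9]*:.*//' strips the time of day and the timezone offset.
    return re.sub(r" *[0-9]*:.*", "", date_line, count=1)


def ver_contents(commit_desc: str, rev_list_output: str) -> str:
    """Build the text written to new/<name>.ver (without echo's newline)."""
    template = "@newcommand*@commitdesc{%s}@newcommand*@headdate{%s}"
    text = template % (commit_desc, head_date(rev_list_output))
    # 'tr @ \\' turns every '@' into a backslash, which avoids quoting
    # backslashes inside the double-quoted shell string.
    return text.replace("@", "\\")


# Hypothetical rev-list output in RFC 2822 (%aD) form.
sample = "commit 0123abc\nMon, 23 Oct 2017 12:34:56 +0100\n"
ver_text = ver_contents("7b379de", sample)
```

With the sample input, `ver_text` defines `\commitdesc` as `7b379de` and `\headdate` as `23 Oct 2017`, which is what the title pages then pick up.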
diff --git a/README.md b/README.md
@@ -1,47 +1,56 @@
 SAM/BAM and related specifications
 ==================================
 
-Quick links
------------
-
-<!-- Whitespace at the ends of these lines are Markdown line breaks -->
-[HTS-spec GitHub page](http://samtools.github.io/hts-specs/)  
-[SAMv1.pdf](http://samtools.github.io/hts-specs/SAMv1.pdf)  
-[CRAMv2.1.pdf](http://samtools.github.io/hts-specs/CRAMv2.1.pdf)  
-[CRAMv3.pdf](http://samtools.github.io/hts-specs/CRAMv3.pdf)  
-[BCFv1.pdf](http://samtools.github.io/hts-specs/BCFv1_qref.pdf)  
-[BCFv2.1.pdf](http://samtools.github.io/hts-specs/BCFv2_qref.pdf)  
-[CSIv1.pdf](http://samtools.github.io/hts-specs/CSIv1.pdf)  
-[tabix.pdf](http://samtools.github.io/hts-specs/tabix.pdf)  
-[VCFv4.1.pdf](http://samtools.github.io/hts-specs/VCFv4.1.pdf)  
-[VCFv4.2.pdf](http://samtools.github.io/hts-specs/VCFv4.2.pdf)  
+Links **in bold** point to the corresponding PDFs on this repository's [GitHub Pages website][hts-specs].
+
+Please request improvements or report errors using this repository, but see also [the list of maintainers](MAINTAINERS.md) if you need to contact them directly.
+
 
 Alignment data files
 --------------------
 
-**SAMv1.tex** is the canonical specification for the SAM (Sequence Alignment/Map) format, BAM (its binary equivalent), and the BAI format for indexing BAM files.
+**[SAMv1.tex]** is the canonical specification for the SAM (Sequence Alignment/Map) format, BAM (its binary equivalent), and the BAI format for indexing BAM files.
+**[SAMtags.tex]** is a companion specification describing the predefined standard optional fields and tags found in SAM, BAM, and CRAM files.
 These formats are discussed on the [samtools-devel mailing list][samdev-ml].
 
-**CRAMv3.tex** is the canonical specification for the CRAM format, while **CRAMv2.1.tex** describes its now-obsolete predecessor.
+**[CRAMv3.tex]** is the canonical specification for the CRAM format, while **[CRAMv2.1.tex]** describes its now-obsolete predecessor.
 Further details can be found at [ENA's CRAM toolkit page][ena-cram].
 CRAM discussions can also be found on the [samtools-devel mailing list][samdev-ml].
 
-The **tabix.tex** and **CSIv1.tex** quick references summarize more recent index formats: the [tabix] tool indexes generic textual genome position-sorted files, while CSI is [htslib]'s successor to the BAI index format.
+The **[tabix.tex]** and **[CSIv1.tex]** quick references summarize more recent index formats: the tabix tool indexes generic textual genome position-sorted files, while CSI is [htslib]'s successor to the BAI index format.
 
 Variant calling data files
 --------------------------
 
-**VCFv4.1.tex** and **VCFv4.2.tex** are the canonical specifications for the Variant Call Format and its textual (VCF) and binary encodings (BCF 2.x).
+**[VCFv4.3.tex]** is the canonical specification for the Variant Call Format and its textual (VCF) and binary (BCF) encodings, while **[VCFv4.1.tex]** and **[VCFv4.2.tex]** describe their predecessors.
 These formats are discussed on the [vcftools-spec mailing list][vcfspec-ml].
 
-**BCFv1_qref.tex** summarizes the obsolete BCF1 format historically produced by [samtools].  This format is no longer recommended for use, as it has been superseded by the more widely-implemented BCF2.
+**[BCFv1_qref.tex]** summarizes the obsolete BCF1 format historically produced by [samtools].  This format is no longer recommended for use, as it has been superseded by the more widely-implemented BCF2.
+
+**[BCFv2_qref.tex]** is a quick reference describing just the layout of data within BCF2 files.
+
+Transfer protocols
+------------------
+
+**[Htsget.md]** describes the _hts-get_ retrieval protocol, which enables parallel streaming access to data sharded across multiple URLs or files.
 
-**BCFv2_qref.tex** is a quick reference describing just the layout of data within BCF2 files.
+[SAMv1.tex]:    http://samtools.github.io/hts-specs/SAMv1.pdf
+[SAMtags.tex]:  http://samtools.github.io/hts-specs/SAMtags.pdf
+[CRAMv2.1.tex]: http://samtools.github.io/hts-specs/CRAMv2.1.pdf
+[CRAMv3.tex]:   http://samtools.github.io/hts-specs/CRAMv3.pdf
+[CSIv1.tex]:    http://samtools.github.io/hts-specs/CSIv1.pdf
+[tabix.tex]:    http://samtools.github.io/hts-specs/tabix.pdf
+[VCFv4.1.tex]:  http://samtools.github.io/hts-specs/VCFv4.1.pdf
+[VCFv4.2.tex]:  http://samtools.github.io/hts-specs/VCFv4.2.pdf
+[VCFv4.3.tex]:  http://samtools.github.io/hts-specs/VCFv4.3.pdf
+[BCFv1_qref.tex]: http://samtools.github.io/hts-specs/BCFv1_qref.pdf
+[BCFv2_qref.tex]: http://samtools.github.io/hts-specs/BCFv2_qref.pdf
+[Htsget.md]:    http://samtools.github.io/hts-specs/htsget.html
 
 [ena-cram]:   http://www.ebi.ac.uk/ena/about/cram_toolkit
 [htslib]:     https://github.com/samtools/htslib
 [samtools]:   https://github.com/samtools/samtools
-[tabix]:      https://github.com/samtools/tabix
+[hts-specs]:  http://samtools.github.io/hts-specs/
 
 [samdev-ml]:  https://lists.sourceforge.net/lists/listinfo/samtools-devel
 [vcfspec-ml]: https://lists.sourceforge.net/lists/listinfo/vcftools-spec

diff --git a/SAMtags.pdf b/SAMtags.pdf