Update VCFv4.3
MarcosFernandez committed Oct 23, 2017
1 parent 45b6c67 commit 7b379de
Showing 34 changed files with 1,434 additions and 336 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -6,5 +6,6 @@
*.ver

*.dvi
*.pdf
*.ps

/_site
Binary file added BCFv1_qref.pdf
Binary file not shown.
Binary file added BCFv2_qref.pdf
Binary file not shown.
Binary file added CRAMv2.1.pdf
Binary file not shown.
20 changes: 19 additions & 1 deletion CRAMv2.1.tex
@@ -392,12 +392,30 @@ \section{\textbf{File definition}}
\hline
unsigned byte & major format number & 2 (0x2)\tabularnewline
\hline
unsigned byte & minor format number & 0 (0x0)\tabularnewline
unsigned byte & minor format number & 1 (0x1)\tabularnewline
\hline
byte[20] & file id & CRAM file identifier (e.g. file name or SHA1 checksum)\tabularnewline
\hline
\end{tabular}

Valid CRAM \textit{major}.\textit{minor} version numbers are as follows:

\begin{itemize}
\item[\textit{1.0}]
The original public CRAM release.

\item[\textit{2.0}]
The first CRAM release implemented in both Java and C; tidied up
implementation vs specification differences in \textit{1.0}.

\item[\textit{2.1}]
Gained end of file markers; compatible with \textit{2.0}.

\item[\textit{3.0}]
Additional compression methods; header and data checksums;
improvements for unsorted data.
\end {itemize}

\section{\textbf{Container structure}}

The file definition is followed by one or more containers with the following header
Binary file added CRAMv3.pdf
Binary file not shown.
42 changes: 30 additions & 12 deletions CRAMv3.tex
@@ -83,7 +83,7 @@ \section{\textbf{Data types}}
types are written as words (e.g. int) while physical data types are written using
single letters (e.g. i). The difference between the two is that storage data types
define how logical data types are stored in CRAM. Data in CRAM is stored either
as as bits or as bytes. Writing values as bits and bytes is described in detail
as bits or bytes. Writing values as bits and bytes is described in detail
below.

\subsection{\textbf{Logical data types}}
@@ -195,7 +195,7 @@ \subsection{\textbf{Writing bytes to a byte stream}}
the number of bytes to follow. To accommodate 32 bits such representation requires
5 bytes with only 4 lower bits used in the last byte 5.

\item[LTF-8 long or (ltf8)]\ \newline
\item[LTF-8 long (ltf8)]\ \newline
See ITF-8 for more details. The only difference between ITF-8 and LTF-8 is the
number of bytes used to encode a single value. To do so 64 bits are required and
this can be done with 9 byte at most with the first byte consisting of just 1s
@@ -204,11 +204,8 @@ \subsection{\textbf{Writing bytes to a byte stream}}
\item[{Array ([ ])}]\ \newline
Array length is written first as integer (itf8), followed by the elements of the
array.
\end{description}


\subsubsection*{Encoding}

\item[{Encoding}]\ \newline
Encoding is a data type that specifies how data series have been compressed. Encodings
are defined as encoding\texttt{<}type\texttt{>} where the type is a logical data
type as opposed to a storage data type.
@@ -244,8 +241,7 @@ \subsubsection*{Encoding}
K = 0x1 = 1


\subsubsection*{Map}

\item[{Map}]\ \newline
A map is a collection of keys and associated values. A map with N keys is written
as follows:

@@ -258,11 +254,13 @@ \subsubsection*{Map}
Both the size in bytes and the number of keys are written as integer (itf8). Keys
and values are written according to their data types and are specific to each map.

\subsection{\textbf{Strings}}

Strings are represented as byte arrays using UTF-8 format. Read names, reference
\item[String]\ \newline
A string is represented as byte arrays using UTF-8 format. Read names, reference
sequence names and tag values with type `Z' are stored as UTF-8.

\end{description}


\section{\textbf{Encodings }}

Encoding is a data structure that captures information about compression details
@@ -397,14 +395,32 @@ \section{\textbf{File definition}}
\hline
byte[4] & format magic number & CRAM (0x43 0x52 0x41 0x4d)\tabularnewline
\hline
unsigned byte & major format number & 2 (0x2)\tabularnewline
unsigned byte & major format number & 3 (0x3)\tabularnewline
\hline
unsigned byte & minor format number & 0 (0x0)\tabularnewline
\hline
byte[20] & file id & CRAM file identifier (e.g. file name or SHA1 checksum)\tabularnewline
\hline
\end{tabular}

Valid CRAM \textit{major}.\textit{minor} version numbers are as follows:

\begin{itemize}
\item[\textit{1.0}]
The original public CRAM release.

\item[\textit{2.0}]
The first CRAM release implemented in both Java and C; tidied up
implementation vs specification differences in \textit{1.0}.

\item[\textit{2.1}]
Gained end of file markers; compatible with \textit{2.0}.

\item[\textit{3.0}]
Additional compression methods; header and data checksums;
improvements for unsorted data.
\end {itemize}
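
As an illustration of the file definition table and the version list above, reading the 26-byte file definition could be sketched in Python roughly as follows. The function name `parse_file_definition`, the `VALID_VERSIONS` set and the sample bytes are assumptions made for this example; they are not part of the specification or of any existing library.

```python
import struct

# Version pairs copied from the list above; the set name is hypothetical.
VALID_VERSIONS = {(1, 0), (2, 0), (2, 1), (3, 0)}


def parse_file_definition(data: bytes):
    """Split the CRAM file definition into (major, minor, file_id).

    Layout assumed from the table: byte[4] magic, unsigned byte major,
    unsigned byte minor, byte[20] file id (26 bytes in total).
    """
    if len(data) < 26:
        raise ValueError("file definition needs 26 bytes")
    # "<" disables native alignment so the layout is exactly 26 bytes.
    magic, major, minor, file_id = struct.unpack("<4sBB20s", data[:26])
    if magic != b"CRAM":  # 0x43 0x52 0x41 0x4d
        raise ValueError("bad magic number")
    if (major, minor) not in VALID_VERSIONS:
        raise ValueError(f"unsupported CRAM version {major}.{minor}")
    return major, minor, file_id


# Hypothetical sample: a 3.0 header whose file id is a zero-padded file name.
sample = b"CRAM" + bytes([3, 0]) + b"example.cram".ljust(20, b"\0")
major, minor, file_id = parse_file_definition(sample)
```

Rejecting unknown version pairs up front is one possible design choice; a reader that wants to accept future minor versions would relax that check.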

\section{\textbf{Container structure}}

The file definition is followed by one or more containers with the following header
@@ -1009,6 +1025,8 @@ \subsection{\textbf{CRAM record bit flags (BF data series)}}
\hline
0x400 & & PCR or optical duplicate\tabularnewline
\hline
0x800 & & Supplementary alignment\tabularnewline
\hline
\end{tabular}

* For segments within the same slice.
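
A reader of the BF data series could test the two flag bits shown in this hunk with a plain bit mask. The sketch below assumes the flag value has already been decoded to an integer; the constant and function names are hypothetical.

```python
# Flag values taken from the BF table rows above; names are hypothetical.
BF_DUPLICATE = 0x400      # PCR or optical duplicate
BF_SUPPLEMENTARY = 0x800  # Supplementary alignment (the row added here)


def is_supplementary(bf: int) -> bool:
    """Return True when the supplementary-alignment bit is set."""
    return bool(bf & BF_SUPPLEMENTARY)


def is_duplicate(bf: int) -> bool:
    """Return True when the PCR/optical duplicate bit is set."""
    return bool(bf & BF_DUPLICATE)
```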
Binary file added CSIv1.pdf
Binary file not shown.
Binary file added CSIv2.pdf
Binary file not shown.
68 changes: 68 additions & 0 deletions MAINTAINERS.md
@@ -0,0 +1,68 @@
## Specification maintainers

The SAM, BAM, and VCF formats originated in the 1000 Genomes Project.
In February 2014, ongoing format maintenance was brought under the aegis of the [Global Alliance for Genomics & Health][ga4gh-ff].
At this time, lead maintainers for each of the formats were nominated.
The current maintainers are listed below.

### SAM/BAM

* James Bonfield (@jkbonfield)
* John Marshall (@jmarshall)
* Yossi Farjoun (@yfarjoun)

Past SAM/BAM maintainers include Jay Carey, Tim Fennell, and Nils Homer.

### CRAM

* James Bonfield (@jkbonfield)
* Vadim Zalunin (@vadimzalunin)

### VCF/BCF

* Cristina Yenyxe Gonzalez Garcia (@cyenyxe)
* David Roazen (@droazen)
* Petr Danecek (@pd3)

Past VCF/BCF maintainers include Ryan Poplin.

### Htsget

* Mike Lin (@mlin)

[ga4gh-ff]: https://genomicsandhealth.org/working-groups/our-work/file-formats


## Generating PDF specification documents

Use the _Makefile_ to generate PDFs from the TeX source documents.
Both TeX source and generated PDFs are checked into the **master** branch, so the make rules are set up to stage PDFs into a _new/_ subdirectory, from where they can be copied when you are ready to check them in.

Most of the specifications use a _.ver_ file and associated rules to display a commit hash and datestamp on their title page.
(See _SAMv1.tex_ and _new/SAMv1.pdf_'s _Makefile_ dependencies for how to add this to other specifications.)
So the usual workflow when editing these documents is (for example, when working on the SAM specification):

1. Edit _SAMv1.tex_, and type `make new/SAMv1.pdf` to generate a working PDF to preview your work.

2. When you are ready, commit your _.tex_ source changes (but don't commit any changed PDF files yet).

3. Type `make clean SAMv1.pdf` to regenerate the PDF and copy it to the main directory.
(Optionally, verify that it contains the correct commit hash for your source changes.)
Now commit your _.pdf_ changes, separately from any source changes.

### Rationale

It is a little inconvenient having the working PDFs down in a subdirectory, but this is outweighed by the convenience of being able to switch between Git branches etc without trouble — as there would be if updated working PDFs were in the main directory, overwriting the checked-in PDFs.

The intention is that the commit hash embedded in a PDF encompasses all the source changes and commits that contribute to that PDF.
The hash of the particular commit that updates the PDF is of course not yet known when the PDF is being generated, so the best that can be done is the hash of a slightly-previous commit.
Therefore:
* The PDF needs to be committed separately from the corresponding TeX source changes.
* The PDF should not be updated in a merge commit (as commits from one or the other of the merge's parents will not be recorded), and there's not much point updating it in a pull request.
* So pull requests need to be merged, and then their PDFs updated separately as a non-merge commit on **master**.
* If a series of changes are being made or several pull requests are being merged at once, the PDF updates can be batched up and just made once at the end.
* Conversely, if there are changes pending to several (even unrelated) PDFs, there is no reason not to commit them all at once.

If you are working on several PDFs at once, be careful in step 3 and perhaps use `make clean new/VCFv4.2.pdf new/VCFv4.3.pdf; make VCFv4.2.pdf VCFv4.3.pdf` to ensure that spurious “-dirty” commit hashes don't make their way into your PDFs.

<!-- vim:set linebreak: -->
33 changes: 19 additions & 14 deletions Makefile
@@ -6,35 +6,40 @@ PDFS = BCFv1_qref.pdf \
CRAMv3.pdf \
CSIv1.pdf \
SAMv1.pdf \
SAMtags.pdf \
tabix.pdf \
VCFv4.1.pdf \
VCFv4.2.pdf \
VCFv4.3.pdf

pdf: $(PDFS)
pdf: $(PDFS:%=new/%)

CRAMv2.1.pdf: CRAMv2.1.tex CRAMv2.1.ver
CRAMv3.pdf: CRAMv3.tex CRAMv3.ver
SAMv1.pdf: SAMv1.tex SAMv1.ver
VCFv4.1.pdf: VCFv4.1.tex VCFv4.1.ver
VCFv4.2.pdf: VCFv4.2.tex VCFv4.2.ver
VCFv4.3.pdf: VCFv4.3.tex VCFv4.3.ver
%.pdf: new/%.pdf
	cp $^ $@

new/CRAMv2.1.pdf: CRAMv2.1.tex new/CRAMv2.1.ver
new/CRAMv3.pdf: CRAMv3.tex new/CRAMv3.ver
new/SAMv1.pdf: SAMv1.tex new/SAMv1.ver
new/SAMtags.pdf: SAMtags.tex new/SAMtags.ver
new/VCFv4.1.pdf: VCFv4.1.tex new/VCFv4.1.ver
new/VCFv4.2.pdf: VCFv4.2.tex new/VCFv4.2.ver
new/VCFv4.3.pdf: VCFv4.3.tex new/VCFv4.3.ver

.SUFFIXES: .tex .pdf .ver
.tex.pdf:
	pdflatex $<
	while grep -q 'Rerun to get [a-z-]* right' $*.log; do pdflatex $< || exit; done

.tex.ver:
new/%.pdf: %.tex
	pdflatex --output-directory new $<
	while grep -q 'Rerun to get [a-z-]* right' new/$*.log; do pdflatex --output-directory new $< || exit; done

new/%.ver: %.tex
	echo "@newcommand*@commitdesc{`git describe --always --dirty`}@newcommand*@headdate{`git rev-list -n1 --format=%aD HEAD $< | sed '1d;s/.*, *//;s/ *[0-9]*:.*//'`}" | tr @ \\ > $@


mostlyclean:
	-rm -f *.aux *.idx *.log *.out *.toc *.ver
	-cd new && rm -f *.aux *.idx *.log *.out *.toc *.ver

clean: mostlyclean
	-rm -f $(PDFS)
	-cd new && rm -f $(PDFS)
	-rm -rf _site


.PHONY: all pdf mostlyclean clean
51 changes: 30 additions & 21 deletions README.md
@@ -1,47 +1,56 @@
SAM/BAM and related specifications
==================================

Quick links
-----------

<!-- Whitespace at the ends of these lines are Markdown line breaks -->
[HTS-spec GitHub page](http://samtools.github.io/hts-specs/)
[SAMv1.pdf](http://samtools.github.io/hts-specs/SAMv1.pdf)
[CRAMv2.1.pdf](http://samtools.github.io/hts-specs/CRAMv2.1.pdf)
[CRAMv3.pdf](http://samtools.github.io/hts-specs/CRAMv3.pdf)
[BCFv1.pdf](http://samtools.github.io/hts-specs/BCFv1_qref.pdf)
[BCFv2.1.pdf](http://samtools.github.io/hts-specs/BCFv2_qref.pdf)
[CSIv1.pdf](http://samtools.github.io/hts-specs/CSIv1.pdf)
[tabix.pdf](http://samtools.github.io/hts-specs/tabix.pdf)
[VCFv4.1.pdf](http://samtools.github.io/hts-specs/VCFv4.1.pdf)
[VCFv4.2.pdf](http://samtools.github.io/hts-specs/VCFv4.2.pdf)
Links **in bold** point to the corresponding PDFs on this repository's [GitHub Pages website][hts-specs].

Please request improvements or report errors using this repository, but see also [the list of maintainers](MAINTAINERS.md) if you need to contact them directly.


Alignment data files
--------------------

**SAMv1.tex** is the canonical specification for the SAM (Sequence Alignment/Map) format, BAM (its binary equivalent), and the BAI format for indexing BAM files.
**[SAMv1.tex]** is the canonical specification for the SAM (Sequence Alignment/Map) format, BAM (its binary equivalent), and the BAI format for indexing BAM files.
**[SAMtags.tex]** is a companion specification describing the predefined standard optional fields and tags found in SAM, BAM, and CRAM files.
These formats are discussed on the [samtools-devel mailing list][samdev-ml].

**CRAMv3.tex** is the canonical specification for the CRAM format, while **CRAMv2.1.tex** describes its now-obsolete predecessor.
**[CRAMv3.tex]** is the canonical specification for the CRAM format, while **[CRAMv2.1.tex]** describes its now-obsolete predecessor.
Further details can be found at [ENA's CRAM toolkit page][ena-cram].
CRAM discussions can also be found on the [samtools-devel mailing list][samdev-ml].

The **tabix.tex** and **CSIv1.tex** quick references summarize more recent index formats: the [tabix] tool indexes generic textual genome position-sorted files, while CSI is [htslib]'s successor to the BAI index format.
The **[tabix.tex]** and **[CSIv1.tex]** quick references summarize more recent index formats: the tabix tool indexes generic textual genome position-sorted files, while CSI is [htslib]'s successor to the BAI index format.

Variant calling data files
--------------------------

**VCFv4.1.tex** and **VCFv4.2.tex** are the canonical specifications for the Variant Call Format and its textual (VCF) and binary encodings (BCF 2.x).
**[VCFv4.3.tex]** is the canonical specification for the Variant Call Format and its textual (VCF) and binary (BCF) encodings, while **[VCFv4.1.tex]** and **[VCFv4.2.tex]** describe their predecessors.
These formats are discussed on the [vcftools-spec mailing list][vcfspec-ml].

**BCFv1_qref.tex** summarizes the obsolete BCF1 format historically produced by [samtools]. This format is no longer recommended for use, as it has been superseded by the more widely-implemented BCF2.
**[BCFv1_qref.tex]** summarizes the obsolete BCF1 format historically produced by [samtools]. This format is no longer recommended for use, as it has been superseded by the more widely-implemented BCF2.

**[BCFv2_qref.tex]** is a quick reference describing just the layout of data within BCF2 files.

Transfer protocols
------------------

**[Htsget.md]** describes the _hts-get_ retrieval protocol, which enables parallel streaming access to data sharded across multiple URLs or files.

**BCFv2_qref.tex** is a quick reference describing just the layout of data within BCF2 files.
[SAMv1.tex]: http://samtools.github.io/hts-specs/SAMv1.pdf
[SAMtags.tex]: http://samtools.github.io/hts-specs/SAMtags.pdf
[CRAMv2.1.tex]: http://samtools.github.io/hts-specs/CRAMv2.1.pdf
[CRAMv3.tex]: http://samtools.github.io/hts-specs/CRAMv3.pdf
[CSIv1.tex]: http://samtools.github.io/hts-specs/CSIv1.pdf
[tabix.tex]: http://samtools.github.io/hts-specs/tabix.pdf
[VCFv4.1.tex]: http://samtools.github.io/hts-specs/VCFv4.1.pdf
[VCFv4.2.tex]: http://samtools.github.io/hts-specs/VCFv4.2.pdf
[VCFv4.3.tex]: http://samtools.github.io/hts-specs/VCFv4.3.pdf
[BCFv1_qref.tex]: http://samtools.github.io/hts-specs/BCFv1_qref.pdf
[BCFv2_qref.tex]: http://samtools.github.io/hts-specs/BCFv2_qref.pdf
[Htsget.md]: http://samtools.github.io/hts-specs/htsget.html

[ena-cram]: http://www.ebi.ac.uk/ena/about/cram_toolkit
[htslib]: https://github.com/samtools/htslib
[samtools]: https://github.com/samtools/samtools
[tabix]: https://github.com/samtools/tabix
[hts-specs]: http://samtools.github.io/hts-specs/

[samdev-ml]: https://lists.sourceforge.net/lists/listinfo/samtools-devel
[vcfspec-ml]: https://lists.sourceforge.net/lists/listinfo/vcftools-spec
Binary file added SAMtags.pdf
Binary file not shown.