Skip to content

Commit

Permalink
Added information on fields for storing DNA methylation information i…
Browse files Browse the repository at this point in the history
…n the VCF format
  • Loading branch information
heathsc committed Aug 1, 2016
1 parent 025cf96 commit 45b6c67
Showing 1 changed file with 105 additions and 10 deletions.
115 changes: 105 additions & 10 deletions VCFv4.3.tex
Original file line number Diff line number Diff line change
Expand Up @@ -31,16 +31,21 @@
\tableofcontents
\newpage

\section{The VCF specification}
VCF is a text file format (most likely stored in a compressed manner).
\section{The VCF specification}

VCF is a text file format (most likely stored in a compressed manner).
It contains meta-information lines (prefixed with "\#\#"), a header
line (prefixed with "\#"), and data lines
each containing information about a position in the genome and genotype
information on samples for each position
(text fields separated by tabs). Zero length fields are not allowed, a dot (".") should
be used instead.
In order to ensure interoperability across platforms, VCF compliant implementations must support
both LF (\texttt{\textbackslash n}) and CR+LF (\texttt{\textbackslash r\textbackslash n}) newline conventions.
line (prefixed with "\#"), and data lines each containing information
about a position in the genome and genotype information on samples for
each position (text fields separated by tabs). The VCF format can also
store information on DNA methylation from bisulfite sequencing
experiments and other sources alongside information about genome
sequence variation. Zero length fields are not allowed, a dot (".")
should be used instead. In order to ensure interoperability across
platforms, VCF compliant implementations must support both LF
(\texttt{\textbackslash n}) and CR+LF (\texttt{\textbackslash
r\textbackslash n}) newline conventions.


\subsection{An example}
\scriptsize
Expand Down Expand Up @@ -108,7 +113,6 @@ \subsection{Data types}

\subsection{Meta-information lines}


File meta-information is included after the \#\# string and must be key=value
pairs. Meta-information lines are optional, but if they are present then
they must be completely well-formed. Note that BCF, the binary
Expand Down Expand Up @@ -340,6 +344,7 @@ \subsubsection{Fixed fields}
\item AF : allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes
\item AN : total number of alleles in called genotypes
\item BQ : RMS base quality at this position
\item CX: The 5 base context determined from the reference surrounding and including the current position (i.e., from position -2 to +2)
\item CIGAR : cigar string describing how to align an alternate allele to the reference allele
\item DB : dbSNP membership
\item DP : combined depth across samples, e.g. DP=154
Expand Down Expand Up @@ -453,6 +458,52 @@ \subsubsection{Genotype fields}

See below for additional genotype fields used to encode structural variants. Additional Genotype fields can be defined in the meta-information. However, software support for such fields is not guaranteed.

\subsubsection{Bisulfite sequencing specific fields}

As with genotype data, if DNA methylation information from bisulfite
sequencing experiments is present then the same type of information
should be present for all samples, and the FORMAT field should
specifiy the data types and order. If both methylation and genotype
data are present then they should be reported together. The relative
order of genotype and methylation fields is not determined by the
specifications,except that the first sub-field should be the genotype
(GT) if present as described above. There are no required sub-fields.
It is, however, strongly recommended that the bisulfite strand specific
counts (MC8) are present. If methylation data only are present, then
the GT and other genotype associated fields should be omitted.

In contrast to normal practice with genotype only data where only
positions where sequence variants are called are generally present in
the mVCF file, when methylation data is present then all observed
positions where a C or a G allele is present either in the observed
data or in the reference and every position where a non-reference
allele is reported should be present in the VCF file. In practice,
this means every position where the called genotype for all samples is
\emph{not} homozygous reference with the reference being A or T. Hard
filtering of sites on read depth criteria is allowed, but it is
recommended \emph{not} to perform hard filtering on genotype call
quality as this can introduce biases, since different combinations of
genotype/methylation required different coverage to achieve the same
confidence of genotype call. If any of the fields are missing, they
should be replaced by the missing value. If the allele count field
(MC8) is present then the methylation point estimates (MEF, MER) and
number of methylation informative bases (MN) are not required.
However, if MC8 is \emph{not} present then MEF, MER and MN should all
be present.

\begin{itemize}

\renewcommand{\labelitemii}{$\circ$}
\item MC8: Base counts for A,C,G,T \emph{not} informative for methylation followed by base counts for A,C,G,T \emph{informative} for methylation.(8 Integers). These counts do not consider the genotype call, and simply report the number of bases of each type seen at the position (after an optional quality filtering step). If not all counts are available (due to conversion from another format) then the missing character '.' should be used to represent the missing values.
\item CS: Strand of Cytosine with respect to reference genome (+/-/+-/NA). Heterozygous C/G SNPs should be represented as '+-' as there is a cytosine on both strands. Sites where no Cytosine is present on either strand should be represented by 'NA'. (String)
\item CG : CpG status for position as determined by the called genotypes. This field can take values 'CG',' N', 'H' or '?' to represent 'Yes', 'No', 'Heterozygous' and 'Unknown'. A position called as homozygous C that is followed by a homozygous G call would have a CpG status of 'Y', whereas if the following position was called as a heterozygote containing a G (i.e., AG or TG) then the CpG status would be 'H'. A status of 'N' is only given when the following base is confidently called as \emph{not} containing a G. Similar rules apply to a position called as a homozygous G with respect to the whether the genotype call for the preceding base contains a C. (String)
\item CX: 5 base sequence context based on called genotypes. This field provides additional information to the CG field above by giving the genotype calls for the 2 bases before the current position, the base at the current position, and the 2 bases following. The sequence context is always given with respect to the forward strand. Heterozygous genotype calls should be represented using the IUPAC codes. (String)
\item MN: Number of bases informative for methylation. (Integer)
\item MEF: Methylation point estimate from the forward strand i.e., applying to a C. The estimate should be from 0-1. (Float)
\item MER: Methylation point estimate from the reverse strand. i.e., applying to a G. The estimate should be from 0-1. (Float)

\end{itemize}

\section{Understanding the VCF format and the haplotype representation}
VCF records use a single general system for representing genetic variation data composed of:
\begin{itemize}
Expand Down Expand Up @@ -1241,6 +1292,49 @@ \subsection{Representing unspecified alleles and REF-only blocks (gVCF)}
\end{flushleft}
\normalsize

\section{Representing DNA methylation variation in VCF records}

DNA methylation is an important and widespread epigenetic marker that
can be assayed at the level of individual bases in the same way as
sequence variation. Common experiments to assay DNA methylation also
provide information on sequence variation, and joint consideration of
sequence and methylation variation is important when comparing
multiple samples. The VCF format allows storing both sequence and
methylation information in the same records.

\subsection{An example}
\scriptsize
\begin{verbatim}
##fileformat=mVCFv1.0
##fileDate=20150505
##source=myBScallerV1.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=22,length=51304566,assembly=B36,species="Homo sapiens",taxonomy=x>l
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=CX,Number=1,Type=String,Description="5 base sequence context (from position -2 to +2 on the positive strand) determined from the reference">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GL,Number=G,Type=Float,Description="Genotype Likelihood">
##FORMAT=<ID=MC8,Number=8,Type=Integer,Description="Base counts non-informative for methylation (ACGT) followed by informative for methylation (ACGT)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=CG,Number=1,Type=String,Description="CpG Status (from genotype calls)'>
##FORMAT=<ID=CS,Number=1,Type=String,Description="Strand of Cytosine relative to reference sequence (+/-/+-/NA)'>
##FORMAT=<ID=CX,Number=1,Type=String,Description='5 base sequence context (from position -2 to +2 on the positive strand) determined from genotype calls'>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
22 18549904 . G . 38 PASS CX=AAGAA GT:DP:GL:MC8:CG:CS:CX 0/0:20:-5.541e-05:0,0,15,0,5,0,0,0:N:-:AAGAA
22 18549908 . C . 44 PASS CX=ATCGC GT:DP:GL:MC8:CG:CS:CX 0/0:20:-1.464e-05:0,4,0,0,0,15,0,1:Y:+:ATCGN
22 18549909 . G . 49 PASS CX=TCGTT GT:DP:GL:MC8:CG:CS:CX 0/0:45:-4.984e-06:0,0,25,0,0,1,19,0:Y:-:TCGNT
22 18549981 . G . 9 q10 CX=GTGAG GT:DP:GL:MC8:CG:CS:CX 0/0:17:-0.007085:0,0,8,0,9,0,0,0:N:-:GTGRG
22 18549982 . A G 25 PASS CX=TGAGA GT:DP:GL:MC8:CG:CS:CX 1/0:19:-1.684,-0.009095,-29.17:7,0,1,0,11,0,0,0:N:-:TGRGA
\end{verbatim}
\normalsize
This example shows (in order): a G in non-CpG context with 5 converted bases and 0 non-converted bases giving an estimated methylation value (from the proportion of non-converted converted counts) of 0, a C followed by a G (so in CpG context) with converted and non-converted counts of $1,15$\ and $0,19$\ for the top and bottom strands respectively, giving methylation estimates of $15/16$ and 1, a G in non-CpG context with estimated methylation 0 and filltered for low quality, and a good A/G SNP where the G is not in CpG context and has an estimated methylation of 0.
Note that the counts in the MC8 field are raw counts and take no account of the called genotype. In the example above at position $18549909$, a count of 1 informative C has been reported despite the position being called as homozygous G based on the other counts. In this case the C is most likely to be a sequencing error.

\pagebreak
\section{BCF specification}

Expand Down Expand Up @@ -1863,6 +1957,7 @@ \subsection{Changes between VCFv4.2 and VCFv4.3}
\item Chromosome names cannot use reserved symbolic alleles and contain characters used by breakpoints (Section~\ref{sec-contig-field}).
\item IUPAC ambiguity codes should be converted to a concrete base.
\item Symbolic ALTs for IUPAC codes.
\item A section about storing DNA methylation information from bisulfite sequencing experiments was added.
\end{itemize}

\subsection{Changes between BCFv2.1 and BCFv2.2}
Expand Down

0 comments on commit 45b6c67

Please sign in to comment.