From 1bf2db95bd17f25eb13285b4ccfd477404c90059 Mon Sep 17 00:00:00 2001 From: Yossi Farjoun Date: Wed, 24 Jul 2019 14:49:23 -0400 Subject: [PATCH 1/6] Local_allele stuff --- VCFv4.4.tex | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/VCFv4.4.tex b/VCFv4.4.tex index 5e62456bd..cfad10b34 100644 --- a/VCFv4.4.tex +++ b/VCFv4.4.tex @@ -482,6 +482,9 @@ \subsubsection{Genotype fields} GQ & 1 & Integer & Conditional genotype quality \\ GT & 1 & String & Genotype \\ HQ & 2 & Integer & Haplotype quality \\ + LAA & . & Integer & Local Alternate Alleles\footnotemark[1]\\ + LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the Reference and the LAA alleles only.\footnotemark[1]\\ + LAD & . & Integer & Local Allele Depth for the reference and each of the local alleles\footnotemark[1]\\ MQ & 1 & Integer & RMS mapping quality \\ PL & G & Integer & Phred-scaled genotype likelihoods rounded to the closest integer \\ PP & G & Integer & Phred-scaled genotype posterior probabilities rounded to the closest integer \\ @@ -578,6 +581,14 @@ \subsubsection{Genotype fields} \end{itemize} \item HQ (Integer): Haplotype qualities, two comma separated phred qualities. + \item Local Alleles (*): + For callsets with a large number of samples, it is often the case that the majority of sites are not called and sites end up involving many alleles for which all the samples need to provide PL and AD. + This can cause the file-sizes to grow super-linearly with the number of samples. + To prevent this, one can choose to specify the allele depth and the genotype likelihood against a subset of ``Local Alleles''. + LAA is the (1-based) index into the list of alleles that are ALT for that variant. + For example, if REF is G, ALT is A,C,T,\verb!<*>! and a genotype only has information about G, C, and \verb!<*>!, one can have LAA=[2,4] and thus LPL will be interpreted as pertaining to the alleles [G, C, \verb!<*>!] 
and not contain likelihood values for genotypes that involve A or T. + It is implicit that REF is part of any ``local'' context, and it always has index 0, even if the genotype is compound HET. + LAA is required in order to interpret LAD and LPL. \item MQ (Integer): RMS mapping quality, similar to the version in the INFO field. \item PL (Integer): The phred-scaled genotype likelihoods rounded to the closest integer, and otherwise defined in the same way as the GL field. \item PP (Integer): The phred-scaled genotype posterior probabilities rounded to the closest integer, and otherwise defined in the same way as the GP field. From 9f04587c0807518790285b806a6a38858df90f40 Mon Sep 17 00:00:00 2001 From: Yossi Farjoun Date: Wed, 24 Jul 2019 16:50:06 -0400 Subject: [PATCH 2/6] responding to review comments --- VCFv4.4.tex | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/VCFv4.4.tex b/VCFv4.4.tex index cfad10b34..9095d3a59 100644 --- a/VCFv4.4.tex +++ b/VCFv4.4.tex @@ -482,9 +482,9 @@ \subsubsection{Genotype fields} GQ & 1 & Integer & Conditional genotype quality \\ GT & 1 & String & Genotype \\ HQ & 2 & Integer & Haplotype quality \\ - LAA & . & Integer & Local Alternate Alleles\footnotemark[1]\\ - LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the Reference and the LAA alleles only.\footnotemark[1]\\ - LAD & . & Integer & Local Allele Depth for the reference and each of the local alleles\footnotemark[1]\\ + LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample \\ + LAD & . & Integer & Local Allele Depth for the reference and each of the local alternate alleles (see: LAA) \\ + LPL & . 
& Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the Reference and the LAA alleles only (see LAA) \\ MQ & 1 & Integer & RMS mapping quality \\ PL & G & Integer & Phred-scaled genotype likelihoods rounded to the closest integer \\ PP & G & Integer & Phred-scaled genotype posterior probabilities rounded to the closest integer \\ @@ -581,7 +581,7 @@ \subsubsection{Genotype fields} \end{itemize} \item HQ (Integer): Haplotype qualities, two comma separated phred qualities. - \item Local Alleles (*): + \item LAA (and LAD and LPL (*): For callsets with a large number of samples, it is often the case that the majority of sites are not called and sites end up involving many alleles for which all the samples need to provide PL and AD. This can cause the file-sizes to grow super-linearly with the number of samples. To prevent this, one can choose to specify the allele depth and the genotype likelihood against a subset of ``Local Alleles''. From 4f2bc722df027c524c33bb24ebb2ed39297aac74 Mon Sep 17 00:00:00 2001 From: Yossi Farjoun Date: Mon, 16 Sep 2019 11:24:39 -0400 Subject: [PATCH 3/6] responding to review comments --- VCFv4.4.tex | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/VCFv4.4.tex b/VCFv4.4.tex index 9095d3a59..6081059e3 100644 --- a/VCFv4.4.tex +++ b/VCFv4.4.tex @@ -482,9 +482,9 @@ \subsubsection{Genotype fields} GQ & 1 & Integer & Conditional genotype quality \\ GT & 1 & String & Genotype \\ HQ & 2 & Integer & Haplotype quality \\ - LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample \\ - LAD & . & Integer & Local Allele Depth for the reference and each of the local alternate alleles (see: LAA) \\ - LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the Reference and the LAA alleles only (see LAA) \\ + LAA & . 
& Integer & Strictly increasing, 1-based indices into ALT, indicating which alternate alleles are relevant (local) for the current sample \\ + LAD & . & Integer & Local allele read depth for the reference and each of the local alternate alleles listed in LAA \\ + LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the reference and the local alternative alleles listed in LAA \\ MQ & 1 & Integer & RMS mapping quality \\ PL & G & Integer & Phred-scaled genotype likelihoods rounded to the closest integer \\ PP & G & Integer & Phred-scaled genotype posterior probabilities rounded to the closest integer \\ @@ -581,14 +581,18 @@ \subsubsection{Genotype fields} \end{itemize} \item HQ (Integer): Haplotype qualities, two comma separated phred qualities. - \item LAA (and LAD and LPL (*): - For callsets with a large number of samples, it is often the case that the majority of sites are not called and sites end up involving many alleles for which all the samples need to provide PL and AD. - This can cause the file-sizes to grow super-linearly with the number of samples. + \item LAA + In callsets with many samples, sites may grow to include numerous alternate alleles at the same POS. + Usually, few of these alleles are actually observed in any one sample, but each genotype must supply fields like PL and AD for all of the alleles---a very inefficient representation as PL's size is quadratic in the allele count. + Similarly, in rare sites, which can be the bulk of the sites, the vast majority of the samples are reference. To prevent this, one can choose to specify the allele depth and the genotype likelihood against a subset of ``Local Alleles''. - LAA is the (1-based) index into the list of alleles that are ALT for that variant. + LAA is the strictly increasing, 1-based index into ALT, pointing out the alternative alleles that are actually in-play for that sample. For example, if REF is G, ALT is A,C,T,\verb!<*>! 
and a genotype only has information about G, C, and \verb!<*>!, one can have LAA=[2,4] and thus LPL will be interpreted as pertaining to the alleles [G, C, \verb!<*>!] and not contain likelihood values for genotypes that involve A or T. It is implicit that REF is part of any ``local'' context, and it always has index 0, even if the genotype is compound HET. + Note that reordering might be required and care need to be taken to reorder LAD and LPL appropriately. LAA is required in order to interpret LAD and LPL. + \item LAD: See LAA + \item LPL: See LAA \item MQ (Integer): RMS mapping quality, similar to the version in the INFO field. \item PL (Integer): The phred-scaled genotype likelihoods rounded to the closest integer, and otherwise defined in the same way as the GL field. \item PP (Integer): The phred-scaled genotype posterior probabilities rounded to the closest integer, and otherwise defined in the same way as the GP field. From 692e4cd98f2b00a593a4c06069a12e42e0909984 Mon Sep 17 00:00:00 2001 From: Yossi Farjoun Date: Mon, 7 Oct 2019 11:58:26 -0400 Subject: [PATCH 4/6] added example --- VCFv4.4.tex | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/VCFv4.4.tex b/VCFv4.4.tex index 6081059e3..6187ad95c 100644 --- a/VCFv4.4.tex +++ b/VCFv4.4.tex @@ -483,7 +483,8 @@ \subsubsection{Genotype fields} GT & 1 & String & Genotype \\ HQ & 2 & Integer & Haplotype quality \\ LAA & . & Integer & Strictly increasing, 1-based indices into ALT, indicating which alternate alleles are relevant (local) for the current sample \\ - LAD & . & Integer & Local allele read depth for the reference and each of the local alternate alleles listed in LAA \\ + LAD & . & Integer & Read depth for the reference and each of the local alternate alleles listed in LAA \\ + LGT & . & String & Genotype against the local alleles \\ LPL & . 
& Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the reference and the local alternative alleles listed in LAA \\ MQ & 1 & Integer & RMS mapping quality \\ PL & G & Integer & Phred-scaled genotype likelihoods rounded to the closest integer \\ @@ -587,11 +588,23 @@ \subsubsection{Genotype fields} Similarly, in rare sites, which can be the bulk of the sites, the vast majority of the samples are reference. To prevent this, one can choose to specify the allele depth and the genotype likelihood against a subset of ``Local Alleles''. LAA is the strictly increasing, 1-based index into ALT, pointing out the alternative alleles that are actually in-play for that sample. - For example, if REF is G, ALT is A,C,T,\verb!<*>! and a genotype only has information about G, C, and \verb!<*>!, one can have LAA=[2,4] and thus LPL will be interpreted as pertaining to the alleles [G, C, \verb!<*>!] and not contain likelihood values for genotypes that involve A or T. + LAD is the depth of the local alleles, LPL is subset of the PL array that pertains to the alleles that are REF or referred to by LAA, LGT is the genotype but referencing the local alleles rather than the global ones. It is implicit that REF is part of any ``local'' context, and it always has index 0, even if the genotype is compound HET. + For example, if REF is G, ALT is A,C,T,\verb!<*>! and a genotype only has information about G, C, and \verb!<*>!, one can have LAA=[2,4] and thus LPL will be interpreted as pertaining to the alleles [G, C, \verb!<*>!] and not contain likelihood values for genotypes that involve A or T. + In this case LGT=0/1 means that the sample is G/C. + GQ is still the genotype quality, even when the genotype is given against the local alleles. Note that reordering might be required and care need to be taken to reorder LAD and LPL appropriately. - LAA is required in order to interpret LAD and LPL. 
+ LAA is required in order to interpret LAD, LPL, and LGT. + + For example, these two lines are encoding the same information (some columns removed for clarity): + + \begin{tabular}[l]{lllll} +REF& ALT&FORMAT&Alice&Bob\\ +G&A,C,T,\textless*\textgreater& LAA:LGT:LAD:LPL& 2,4:1/1:20,30,10:90,80,0,100,110,120 &3:0/1:15,25:40,0,80\\ +G&A,C,T,\textless*\textgreater& GT:AD:PL& 2/2:20,.,30,.,10:90,.,.,80,.,0,.,.,.,.,100,.,110,.,120&0/3:15,.,.,25,.:40,.,.,.,.,.,0,.,.,80,.,.,.,.\\ +\end{tabular} \item LAD: See LAA + \item LGT: See LAA \item LPL: See LAA \item MQ (Integer): RMS mapping quality, similar to the version in the INFO field. \item PL (Integer): The phred-scaled genotype likelihoods rounded to the closest integer, and otherwise defined in the same way as the GL field. From 3dca8b8593cd60f9274d771a837d17afdf18c792 Mon Sep 17 00:00:00 2001 From: Yossi Farjoun Date: Mon, 7 Oct 2019 13:55:16 -0400 Subject: [PATCH 5/6] added example --- VCFv4.4.tex | 24 ++++++++++++++++-------- 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/VCFv4.4.tex b/VCFv4.4.tex index 6187ad95c..675b7cf70 100644 --- a/VCFv4.4.tex +++ b/VCFv4.4.tex @@ -586,9 +586,11 @@ \subsubsection{Genotype fields} In callsets with many samples, sites may grow to include numerous alternate alleles at the same POS. Usually, few of these alleles are actually observed in any one sample, but each genotype must supply fields like PL and AD for all of the alleles---a very inefficient representation as PL's size is quadratic in the allele count. Similarly, in rare sites, which can be the bulk of the sites, the vast majority of the samples are reference. - To prevent this, one can choose to specify the allele depth and the genotype likelihood against a subset of ``Local Alleles''. + To prevent this growth in VCF size, one can choose to specify the genotype, allele depth and the genotype likelihood against a subset of ``Local Alleles''. 
LAA is the strictly increasing, 1-based index into ALT, pointing out the alternative alleles that are actually in-play for that sample. - LAD is the depth of the local alleles, LPL is subset of the PL array that pertains to the alleles that are REF or referred to by LAA, LGT is the genotype but referencing the local alleles rather than the global ones. + LAD is the depth of the local alleles, + LPL is subset of the PL array that pertains to the alleles that are REF or referred to by LAA, + LGT is the genotype but referencing the local alleles rather than the global ones. It is implicit that REF is part of any ``local'' context, and it always has index 0, even if the genotype is compound HET. For example, if REF is G, ALT is A,C,T,\verb!<*>! and a genotype only has information about G, C, and \verb!<*>!, one can have LAA=[2,4] and thus LPL will be interpreted as pertaining to the alleles [G, C, \verb!<*>!] and not contain likelihood values for genotypes that involve A or T. In this case LGT=0/1 means that the sample is G/C. @@ -596,12 +598,18 @@ \subsubsection{Genotype fields} Note that reordering might be required and care need to be taken to reorder LAD and LPL appropriately. LAA is required in order to interpret LAD, LPL, and LGT. 
- For example, these two lines are encoding the same information (some columns removed for clarity): - - \begin{tabular}[l]{lllll} -REF& ALT&FORMAT&Alice&Bob\\ -G&A,C,T,\textless*\textgreater& LAA:LGT:LAD:LPL& 2,4:1/1:20,30,10:90,80,0,100,110,120 &3:0/1:15,25:40,0,80\\ -G&A,C,T,\textless*\textgreater& GT:AD:PL& 2/2:20,.,30,.,10:90,.,.,80,.,0,.,.,.,.,100,.,110,.,120&0/3:15,.,.,25,.:40,.,.,.,.,.,0,.,.,80,.,.,.,.\\ + In the following example, the records with the same POS encode the same information (some columns removed for clarity): + + \begin{tabular}[l]{llllll} +POS &REF& ALT&FORMAT&sample\\ +1&G&A,C,T,\textless*\textgreater& LAA:LGT:LAD:LPL& 2,4:1/1:20,30,10:90,80,0,100,110,120\\ +1&G&A,C,T,\textless*\textgreater& GT:AD:PL& 2/2:20,.,30,.,10:90,.,.,80,.,0,.,.,.,.,100,.,110,.,120\\ +2&A&C,G,T,\textless*\textgreater& LAA:LGT:LAD:LPL& 3:0/1:15,25:40,0,80\\ +2&A&C,G,T,\textless*\textgreater& GT:AD:PL&0/3:15,.,.,25,.:40,.,.,.,.,.,0,.,.,80,.,.,.,.\\ +3&C&G,T,\textless*\textgreater& LAA:LGT:LAD:LPL& 4:0/0:30,1:0,30,80\\ +3&C&G,T,\textless*\textgreater& GT:AD:PL& 0/0:30,.,.1:0,.,.,.,.,.,.,.,.,.,30,.,.,.,80\\ +4&G&A,T,\textless*\textgreater& LAA:LGT:LAD:LPL& :0/0:30:0\\ +4&G&A,T,\textless*\textgreater& GT:AD:PL& 0/0:30,.,..:0,.,.,.,.,.,.,.,.,.,.,.,.,.,.\\ \end{tabular} \item LAD: See LAA \item LGT: See LAA From 4e4bae4cc7a2889b079688170c777b1ff5401e39 Mon Sep 17 00:00:00 2001 From: Yossi Farjoun Date: Mon, 7 Oct 2019 14:02:36 -0400 Subject: [PATCH 6/6] added example --- VCFv4.4.tex | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/VCFv4.4.tex b/VCFv4.4.tex index 675b7cf70..d263a617e 100644 --- a/VCFv4.4.tex +++ b/VCFv4.4.tex @@ -582,7 +582,7 @@ \subsubsection{Genotype fields} \end{itemize} \item HQ (Integer): Haplotype qualities, two comma separated phred qualities. 
- \item LAA + \item LAA is a sorted list of $n$ distinct integers, where $1 \le n \le \left|\mathrm{ALT}\right|$, giving the (1-based) indices within ALT of the alleles that are observed in the sample. In callsets with many samples, sites may grow to include numerous alternate alleles at the same POS. Usually, few of these alleles are actually observed in any one sample, but each genotype must supply fields like PL and AD for all of the alleles---a very inefficient representation as PL's size is quadratic in the allele count. Similarly, in rare sites, which can be the bulk of the sites, the vast majority of the samples are reference. @@ -611,9 +611,11 @@ \subsubsection{Genotype fields} 4&G&A,T,\textless*\textgreater& LAA:LGT:LAD:LPL& :0/0:30:0\\ 4&G&A,T,\textless*\textgreater& GT:AD:PL& 0/0:30,.,..:0,.,.,.,.,.,.,.,.,.,.,.,.,.,.\\ \end{tabular} - \item LAD: See LAA - \item LGT: See LAA - \item LPL: See LAA + \item LAD: is a list of $n+1$ integers giving read depths (as per AD) for the REF allele and each of the local alleles as listed in LAA. + \item LGT: is the genotype, encoded as allele indexes separated by either of $/$ or $\mid$, as with GT; however, the indexes are into the list consisting of REF and the ALTs referenced by LAA. + For example, when LAA is 2,3, LGT=0/2 is equivalent to GT=0/3 and LGT=1/2 is equivalent to GT=2/3 (see example above). + \item LPL: is a list of $n+\mathrm{Ploidy} \choose \mathrm{Ploidy}$ integers giving phred-scaled genotype likelihoods (rounded to the closest integer; as per PL) for all possible genotypes given the set of alleles defined in the REF and LAA local alleles. + The precise ordering is defined in the GL paragraph. \item MQ (Integer): RMS mapping quality, similar to the version in the INFO field. \item PL (Integer): The phred-scaled genotype likelihoods rounded to the closest integer, and otherwise defined in the same way as the GL field. 
\item PP (Integer): The phred-scaled genotype posterior probabilities rounded to the closest integer, and otherwise defined in the same way as the GP field.
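
The local-to-global mapping that this patch series describes can be illustrated with a short Python sketch. It is not part of the patch and not a reference implementation: the function names `diploid_pl_index`, `local_to_global` and `render` are hypothetical, and the sketch assumes diploid genotypes only, with PL entries ordered as in the GL paragraph (genotype a/b with a <= b at position b(b+1)/2 + a). Under that assumption, REF plus n local alleles give (n+1)(n+2)/2 LPL values, which agrees with the six values shown for LAA=2,4 in the example table.

```python
from typing import List, Optional, Sequence, Tuple


def diploid_pl_index(a: int, b: int) -> int:
    """Position of diploid genotype a/b (a <= b) in a GL/PL-ordered array."""
    return b * (b + 1) // 2 + a


def local_to_global(
    n_alt: int,
    laa: Sequence[int],
    lgt: Sequence[int],
    lad: Sequence[int],
    lpl: Sequence[int],
) -> Tuple[List[int], List[Optional[int]], List[Optional[int]]]:
    """Expand LGT/LAD/LPL into GT/AD/PL over REF plus all n_alt ALT alleles.

    Entries with no local information are returned as None (written '.').
    """
    # Local index i refers to global allele alleles[i]; REF is always local 0.
    alleles = [0] + list(laa)
    if any(x >= y for x, y in zip(alleles, alleles[1:])) or alleles[-1] > n_alt:
        raise ValueError("LAA must be strictly increasing 1-based indices into ALT")
    k = len(alleles)
    if len(lad) != k or len(lpl) != k * (k + 1) // 2:
        raise ValueError("LAD/LPL length does not match LAA (diploid assumed)")

    gt = [alleles[i] for i in lgt]

    ad: List[Optional[int]] = [None] * (n_alt + 1)
    for local_idx, depth in enumerate(lad):
        ad[alleles[local_idx]] = depth

    total = n_alt + 1
    pl: List[Optional[int]] = [None] * (total * (total + 1) // 2)
    for lb in range(k):
        for la in range(lb + 1):
            # LAA is increasing, so la <= lb implies alleles[la] <= alleles[lb].
            pl[diploid_pl_index(alleles[la], alleles[lb])] = lpl[diploid_pl_index(la, lb)]
    return gt, ad, pl


def render(values: Sequence[Optional[int]]) -> str:
    """Comma-join a field, writing '.' for missing entries."""
    return ",".join("." if v is None else str(v) for v in values)


if __name__ == "__main__":
    # POS 1 of the example table: REF=G, ALT=A,C,T,<*>
    gt, ad, pl = local_to_global(4, [2, 4], [1, 1], [20, 30, 10],
                                 [90, 80, 0, 100, 110, 120])
    print("/".join(map(str, gt)), render(ad), render(pl), sep=":")
```

Applied to the POS 1 record of the example, the sketch places the six LPL values at global PL positions 0, 3, 5, 10, 12 and 14, the same positions as in the GT:AD:PL row of the table. Handling other ploidies or phased genotypes would need a more general genotype-index function than the diploid one assumed here.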