Which types of mutation effects should be ignored? #2

Open
dhimmel opened this issue Jul 14, 2016 · 5 comments
@dhimmel
Member

dhimmel commented Jul 14, 2016

The PANCAN_mutation dataset (online doc) contains several types of mutations under the effect column. My processing of the dataset (notebook) yielded the following mutation effects and their frequencies (as counts and percentages):

| Effect | Count | Percent |
| --- | ---: | ---: |
| Missense_Mutation | 1,044,846 | 58.152% |
| Silent | 432,995 | 24.099% |
| Nonsense_Mutation | 81,092 | 4.513% |
| RNA | 71,493 | 3.979% |
| Frame_Shift_Del | 46,941 | 2.613% |
| Splice_Site | 43,262 | 2.408% |
| Frame_Shift_Ins | 22,546 | 1.255% |
| missense_variant | 20,241 | 1.127% |
| In_Frame_Del | 11,455 | 0.638% |
| synonymous_variant | 7,907 | 0.440% |
| Translation_Start_Site | 3,258 | 0.181% |
| In_Frame_Ins | 3,052 | 0.170% |
| stop_gained | 1,573 | 0.088% |
| 3_prime_UTR_variant | 1,420 | 0.079% |
| Nonstop_Mutation | 1,318 | 0.073% |
| exon_variant | 945 | 0.053% |
| EXON | 420 | 0.023% |
| 5_prime_UTR_variant | 395 | 0.022% |
| splice_acceptor_variant | 294 | 0.016% |
| splice_region_variant | 255 | 0.014% |
| 3'UTR | 211 | 0.012% |
| splice_donor_variant | 203 | 0.011% |
| Intron | 148 | 0.008% |
| 5_prime_UTR_premature_start_codon_gain_variant | 110 | 0.006% |
| NON_SYNONYMOUS_CODING | 95 | 0.005% |
| INTRAGENIC | 57 | 0.003% |
| UTR_3_PRIME | 38 | 0.002% |
| SYNONYMOUS_CODING | 36 | 0.002% |
| start_lost | 32 | 0.002% |
| 5'UTR | 28 | 0.002% |
| UTR_5_PRIME | 22 | 0.001% |
| stop_lost | 19 | 0.001% |
| IGR | 16 | 0.001% |
| stop_retained_variant | 7 | 0.000% |
| STOP_GAINED | 6 | 0.000% |
| initiator_codon_variant | 2 | 0.000% |
| SPLICE_SITE_ACCEPTOR | 2 | 0.000% |
| SYNONYMOUS_STOP | 1 | 0.000% |
| 5'Flank | 1 | 0.000% |
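
For readers who want to reproduce this kind of tally, here is a minimal sketch. It is not the code from the notebook: `effect_frequencies` is a hypothetical helper name, and the `effects` list is made-up illustrative data standing in for the dataset's effect column.

```python
from collections import Counter

def effect_frequencies(effects):
    """Return (effect, count, percent) rows sorted by descending count."""
    counts = Counter(effects)
    total = sum(counts.values())
    return [
        (effect, count, 100 * count / total)
        for effect, count in counts.most_common()
    ]

# Illustrative values only, not real data from the dataset.
effects = ["Missense_Mutation"] * 6 + ["Silent"] * 3 + ["Nonsense_Mutation"]

for effect, count, percent in effect_frequencies(effects):
    # Thousands separator and three decimals, matching the table's layout.
    print(f"{effect}\t{count:,}\t{percent:.3f}%")
```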

It appears that certain effects are duplicates — such as 5_prime_UTR_variant, 5'UTR, UTR_5_PRIME — which if true represents a poor case of standardization. If we want to improve the standardization, we can create our own mapping, or we can report the issue to the upstream creators (although these fixes usually take a long time).
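
If we went the route of our own mapping, it could be as simple as a lookup table. The sketch below is only an illustration: `EFFECT_SYNONYMS` and `standardize_effect` are hypothetical names, the choice of `5'UTR` as the canonical label is arbitrary, and the 3' UTR entries are an assumption made by analogy with the 5' UTR example above.

```python
# Hypothetical mapping from variant spellings to one canonical label.
# Only the 5' UTR labels are named as suspected duplicates in this issue;
# the 3' UTR entries are an analogous guess (assumption).
EFFECT_SYNONYMS = {
    "5_prime_UTR_variant": "5'UTR",
    "UTR_5_PRIME": "5'UTR",
    "3_prime_UTR_variant": "3'UTR",
    "UTR_3_PRIME": "3'UTR",
}

def standardize_effect(effect):
    """Map a raw effect label to its canonical form; unknown labels pass through."""
    return EFFECT_SYNONYMS.get(effect, effect)

raw = ["5_prime_UTR_variant", "5'UTR", "UTR_5_PRIME", "Silent"]
print([standardize_effect(e) for e in raw])
```

Passing unknown labels through unchanged keeps the mapping safe to apply before anyone has reviewed every category.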

Anyways, we'll have to decide which types of effects to consider as functionally relevant mutations. For example, a "Silent" mutation generally does not have a biological effect. We could also let users decide for themselves, but that adds complexity.

@clairemcleod, @mp8, @DCousminer, @gwaygenomics, @cgreene, @stephenshank — I thought you may have a better understanding than I do of the biology here. Can any of these categories be discarded as irrelevant to a tumor's function and classification? Are you interested in creating a consolidated set of effects with duplicates merged?

@clairemcleod
Member

Are we interested in preserving mutation type as a data field? If I recall correctly, we were talking about having mutation as a binary outcome variable. If this is still the case, I think there are several ways we could get there. The first would be to parse the above set of effects, potentially eliminating some. I think it would be fine to eliminate the silent mutations category, but am unsure about the others. In the UCSC Xena documentation, they've grouped the mutation effects into four color-coded groups. It seems like this might be based on severity, although I am not familiar enough with the topic to be sure. The groups (from here) are:

Red --> Nonsense_Mutation, frameshift_variant, stop_gained, splice_acceptor_variant, splice_acceptor_variant&intron_variant, splice_donor_variant, splice_donor_variant&intron_variant, Splice_Site, Frame_Shift_Del, Frame_Shift_Ins

Blue --> splice_region_variant, splice_region_variant&intron_variant, missense, non_coding_exon_variant, missense_variant, Missense_Mutation, exon_variant, RNA, Indel, start_lost, start_gained, De_novo_Start_OutOfFrame, Translation_Start_Site, De_novo_Start_InFrame, stop_lost, Nonstop_Mutation, initiator_codon_variant, 5_prime_UTR_premature_start_codon_gain_variant, disruptive_inframe_deletion, inframe_deletion, inframe_insertion, In_Frame_Del, In_Frame_Ins

Green --> synonymous_variant, 5_prime_UTR_variant, 3_prime_UTR_variant, 5'Flank, 3'Flank, 3'UTR, 5'UTR, Silent, stop_retained_variant

Orange --> others, SV, upstream_gene_variant, downstream_gene_variant, intron_variant, intergenic_region
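
The four groups listed above can be turned into a lookup table. This is a rough sketch, not anything provided by Xena: `XENA_GROUPS` and `xena_group` are hypothetical names, and treating any unlisted label as orange is an assumption based on the "others" entry in that group.

```python
# Labels copied from the four groups listed in this comment.
# "upstream_gene_variant" is written with underscores on the assumption
# that they were lost in the page's formatting.
XENA_GROUPS = {
    "red": (
        "Nonsense_Mutation frameshift_variant stop_gained "
        "splice_acceptor_variant splice_acceptor_variant&intron_variant "
        "splice_donor_variant splice_donor_variant&intron_variant "
        "Splice_Site Frame_Shift_Del Frame_Shift_Ins"
    ).split(),
    "blue": (
        "splice_region_variant splice_region_variant&intron_variant missense "
        "non_coding_exon_variant missense_variant Missense_Mutation "
        "exon_variant RNA Indel start_lost start_gained "
        "De_novo_Start_OutOfFrame Translation_Start_Site "
        "De_novo_Start_InFrame stop_lost Nonstop_Mutation "
        "initiator_codon_variant "
        "5_prime_UTR_premature_start_codon_gain_variant "
        "disruptive_inframe_deletion inframe_deletion inframe_insertion "
        "In_Frame_Del In_Frame_Ins"
    ).split(),
    "green": (
        "synonymous_variant 5_prime_UTR_variant 3_prime_UTR_variant "
        "5'Flank 3'Flank 3'UTR 5'UTR Silent stop_retained_variant"
    ).split(),
    "orange": (
        "SV upstream_gene_variant downstream_gene_variant "
        "intron_variant intergenic_region"
    ).split(),
}

# Invert the grouping so each effect label points at its color.
EFFECT_TO_GROUP = {
    effect: group for group, effects in XENA_GROUPS.items() for effect in effects
}

def xena_group(effect):
    """Return the color group; unlisted labels fall back to orange (assumption)."""
    return EFFECT_TO_GROUP.get(effect, "orange")

print(xena_group("Silent"), xena_group("Frame_Shift_Del"))
```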

A second option would be using the somatic mutation data that is already called at the gene level. Positive mutation calls reflect the effects: nonsense, missense, frame-shift indels, splice site mutations, stop codon readthroughs, change of start codon, and inframe indels. We could also implement this same calling procedure ourselves.

@gwaybio
Member

gwaybio commented Jul 15, 2016

Yes, I agree - I think we can toss Silent mutations.

I also think that keeping it simple would be the way to go. Other available resources are cleaner and simpler than this TCGA Firehose data and may be worth exploring.

@cgreene
Member

cgreene commented Jul 15, 2016

@clairemcleod & @gwaygenomics : If you wanted to provide simple groups that would get people started, how would you combine them? We can always provide the option to drill down to a greater level of detail (e.g. any KRAS G12V mutation), but I agree with you both that a simple initial interface is optimal.

The very granular items will only be useful for mutations that are particularly common.

dhimmel added a commit to dhimmel/cancer-data that referenced this issue Jul 15, 2016
Changed `base_url` for downloading data from the Xena browser from
https://genome-cancer.ucsc.edu/download/public/xena/TCGA/TCGA.PANCAN.sampleMap/
to https://tcga.xenahubs.net/download/TCGA.PANCAN.sampleMap/. This new location
seems to have resolved the unstandardized mutation effects reported in
cognoma#2.

Added json metadata files to `download` providing version info at time of
download. Thanks @jingchunzhu for the suggestion. See
https://groups.google.com/forum/#!msg/ucsc-cancer-genomics-browser/eg6nJOFSefw/wO0wNrMeAgAJ
@dhimmel
Member Author

dhimmel commented Jul 15, 2016

In dhimmel/cancer-data@0239cba, I changed the download location for UCSC Xena data (and added version tracking). This resolved the unstandardized mutation effect types. The updated version of the frequency table is below (color refers to the Xena characterizations mentioned above):

| Effect | Count | Percent | Color |
| --- | ---: | ---: | --- |
| Missense_Mutation | 1,132,319 | 59.504% | Blue |
| Silent | 474,679 | 24.945% | Green |
| Nonsense_Mutation | 87,104 | 4.577% | Red |
| RNA | 75,134 | 3.948% | Blue |
| Frame_Shift_Del | 46,991 | 2.469% | Red |
| Splice_Site | 46,477 | 2.442% | Red |
| Frame_Shift_Ins | 21,657 | 1.138% | Red |
| In_Frame_Del | 10,663 | 0.560% | Blue |
| Translation_Start_Site | 3,437 | 0.181% | Blue |
| In_Frame_Ins | 2,685 | 0.141% | Blue |
| Nonstop_Mutation | 1,370 | 0.072% | Blue |
| 3'UTR | 211 | 0.011% | Green |
| Intron | 149 | 0.008% | Orange |
| 5'UTR | 28 | 0.001% | Green |
| IGR | 16 | 0.001% | Orange |
| 5'Flank | 1 | 0.000% | Green |

@clairemcleod, nice find with the mutation_bcgsc_gene dataset. This is a gene × sample matrix, which we could transpose to achieve our desired matrix. Unfortunately, this dataset seems to only include 3,219 samples, whereas our processed mutation matrix has 8,499 samples.
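
To make the transposition concrete, here is a toy sketch with plain lists. The gene and sample labels and the 0/1 values are invented for illustration and do not come from the mutation_bcgsc_gene dataset; `transpose` is a hypothetical helper.

```python
def transpose(matrix):
    """Turn gene x sample rows into sample x gene rows."""
    return [list(column) for column in zip(*matrix)]

genes = ["TP53", "KRAS"]       # illustrative row labels
samples = ["S1", "S2", "S3"]   # illustrative column labels
gene_by_sample = [
    [1, 0, 1],  # TP53 status in S1, S2, S3
    [0, 0, 1],  # KRAS status in S1, S2, S3
]

sample_by_gene = transpose(gene_by_sample)
for sample, row in zip(samples, sample_by_gene):
    print(sample, dict(zip(genes, row)))
```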

dhimmel added a commit to dhimmel/cancer-data that referenced this issue Jul 15, 2016
Addresses cognoma#2 -- add additional mutation effects. Added all
red & blue mutations from http://xena.ucsc.edu/how-we-characterize-mutations/
that were present in the data.
@dhimmel
Member Author

dhimmel commented Jul 16, 2016

> a simple initial interface is optimal

I went with a simple solution. In dhimmel/cancer-data@ffe66ab, I retained only red and blue mutations (according to Xena), meaning orange and green mutations were removed. The only removed mutation effect category that was an appreciable portion of the data was "Silent" -- which I think we're all in agreement should be excluded.
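
The filtering step amounts to something like the sketch below. This is not the code from the commit: `RETAINED_EFFECTS` lists the red and blue labels that appear in the updated table above, while `mutation_matrix`, the record layout, and the sample and gene names are assumptions for illustration.

```python
# Red and blue effect labels present in the updated frequency table.
RETAINED_EFFECTS = {
    "Missense_Mutation", "Nonsense_Mutation", "RNA", "Frame_Shift_Del",
    "Splice_Site", "Frame_Shift_Ins", "In_Frame_Del",
    "Translation_Start_Site", "In_Frame_Ins", "Nonstop_Mutation",
}

def mutation_matrix(records):
    """Build {sample: {gene: 1}} from (sample, gene, effect) records.

    Records whose effect is not retained (for example Silent) are skipped,
    so a gene is marked mutated only by a red or blue effect.
    """
    matrix = {}
    for sample, gene, effect in records:
        if effect in RETAINED_EFFECTS:
            matrix.setdefault(sample, {})[gene] = 1
    return matrix

# Made-up records, not real calls from the dataset.
records = [
    ("S1", "KRAS", "Missense_Mutation"),
    ("S1", "TP53", "Silent"),
    ("S2", "TP53", "Nonsense_Mutation"),
]
print(mutation_matrix(records))
```

A missing gene key for a sample means "no retained mutation observed", which gives the binary outcome discussed earlier in the thread.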

I posted the mutation and expression datasets from this commit to figshare. Mutations were retained for 8,508 samples, 7,706 of which had corresponding expression data.

dhimmel added the task label Jul 17, 2016
dhimmel added a commit to dhimmel/cancer-data that referenced this issue Jul 22, 2016
Unstandardized mutation types were resolved. See
cognoma#2 (comment)

Addressed this reviewer comment:
cognoma#7 (comment)