Which types of mutation effects should be ignored? #2

dhimmel · 2016-07-14T22:06:02Z

The PANCAN_mutation dataset (online doc) contains several types of mutations under the effect column. My processing of the dataset (notebook) yielded the following mutation effects and their frequencies (as counts and percentages):

Effect	Count	Percent
Missense_Mutation	1,044,846	58.152%
Silent	432,995	24.099%
Nonsense_Mutation	81,092	4.513%
RNA	71,493	3.979%
Frame_Shift_Del	46,941	2.613%
Splice_Site	43,262	2.408%
Frame_Shift_Ins	22,546	1.255%
missense_variant	20,241	1.127%
In_Frame_Del	11,455	0.638%
synonymous_variant	7,907	0.440%
Translation_Start_Site	3,258	0.181%
In_Frame_Ins	3,052	0.170%
stop_gained	1,573	0.088%
3_prime_UTR_variant	1,420	0.079%
Nonstop_Mutation	1,318	0.073%
exon_variant	945	0.053%
EXON	420	0.023%
5_prime_UTR_variant	395	0.022%
splice_acceptor_variant	294	0.016%
splice_region_variant	255	0.014%
3'UTR	211	0.012%
splice_donor_variant	203	0.011%
Intron	148	0.008%
5_prime_UTR_premature_start_codon_gain_variant	110	0.006%
NON_SYNONYMOUS_CODING	95	0.005%
INTRAGENIC	57	0.003%
UTR_3_PRIME	38	0.002%
SYNONYMOUS_CODING	36	0.002%
start_lost	32	0.002%
5'UTR	28	0.002%
UTR_5_PRIME	22	0.001%
stop_lost	19	0.001%
IGR	16	0.001%
stop_retained_variant	7	0.000%
STOP_GAINED	6	0.000%
initiator_codon_variant	2	0.000%
SPLICE_SITE_ACCEPTOR	2	0.000%
SYNONYMOUS_STOP	1	0.000%
5'Flank	1	0.000%
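
A frequency table like the one above can be tallied in a few lines. The following is a minimal sketch using only the standard library, not the code from the notebook; the function name and the toy input are made up for illustration.

```python
from collections import Counter


def effect_frequencies(effects):
    """Return (effect, count, percent) rows sorted by descending count."""
    counts = Counter(effects)
    total = sum(counts.values())
    return [
        (effect, count, 100 * count / total)
        for effect, count in counts.most_common()
    ]


# Toy values standing in for the dataset's effect column (not real data).
toy_effects = ["Missense_Mutation"] * 6 + ["Silent"] * 3 + ["Nonsense_Mutation"]

for effect, count, percent in effect_frequencies(toy_effects):
    print(f"{effect}\t{count:,}\t{percent:.3f}%")
```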

It appears that certain effects are duplicates, such as 5_prime_UTR_variant, 5'UTR, and UTR_5_PRIME. If so, the dataset is poorly standardized. To improve the standardization, we can create our own mapping, or we can report the issue to the upstream creators (although upstream fixes usually take a long time).
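
If we went the mapping route, it could look roughly like the sketch below. This is only an illustration: the 5' UTR synonyms are the ones named above, the 3' UTR entries are my assumption by analogy, and the choice of which spelling is canonical is arbitrary. The names `EFFECT_SYNONYMS`, `standardize_effect`, and `merge_counts` are hypothetical.

```python
from collections import Counter

# Hypothetical consolidation map: variant spelling -> canonical name.
# Picking 5'UTR / 3'UTR as the canonical spellings is an arbitrary choice.
EFFECT_SYNONYMS = {
    "5_prime_UTR_variant": "5'UTR",
    "UTR_5_PRIME": "5'UTR",
    # Assumed by analogy with the 5' UTR case:
    "3_prime_UTR_variant": "3'UTR",
    "UTR_3_PRIME": "3'UTR",
}


def standardize_effect(effect):
    """Map a known synonym to its canonical name; pass others through."""
    return EFFECT_SYNONYMS.get(effect, effect)


def merge_counts(counts):
    """Sum per-effect counts after standardizing the effect names."""
    merged = Counter()
    for effect, n in counts.items():
        merged[standardize_effect(effect)] += n
    return merged


# Counts for the three 5' UTR spellings, taken from the table above.
five_prime = {"5_prime_UTR_variant": 395, "5'UTR": 28, "UTR_5_PRIME": 22}
print(merge_counts(five_prime))
```

Effects missing from the map pass through unchanged, so the map can grow one confirmed synonym at a time.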

Either way, we'll have to decide which types of effects to consider functionally relevant mutations. For example, a "Silent" mutation generally does not have a biological effect. We could also let users decide for themselves, but that adds complexity.

@clairemcleod, @mp8, @DCousminer, @gwaygenomics, @cgreene, @stephenshank: I thought you might have a better understanding of the biology here than I do. Can any of these categories be discarded as irrelevant to a tumor's function and classification? Are you interested in creating a consolidated set of effects with duplicates merged?

clairemcleod · 2016-07-15T18:24:54Z

Are we interested in preserving mutation type as a data field? If I recall correctly, we were talking about having mutation as a binary outcome variable. If this is still the case, I think there are several ways we could get there. The first would be to parse the above set of effects, potentially eliminating some. I think it would be fine to eliminate the silent mutations category, but I am unsure about the others. In the UCSC Xena documentation, the mutation effects are grouped into four color-coded groups. The grouping seems to be based on severity, although I am not familiar enough with the topic to be sure. The groups (from here) are:

Red --> Nonsense_Mutation, frameshift_variant, stop_gained, splice_acceptor_variant, splice_acceptor_variant&intron_variant, splice_donor_variant, splice_donor_variant&intron_variant, Splice_Site, Frame_Shift_Del, Frame_Shift_Ins

Blue --> splice_region_variant, splice_region_variant&intron_variant, missense, non_coding_exon_variant, missense_variant, Missense_Mutation, exon_variant, RNA, Indel, start_lost, start_gained, De_novo_Start_OutOfFrame, Translation_Start_Site, De_novo_Start_InFrame, stop_lost, Nonstop_Mutation, initiator_codon_variant, 5_prime_UTR_premature_start_codon_gain_variant, disruptive_inframe_deletion, inframe_deletion, inframe_insertion, In_Frame_Del, In_Frame_Ins

Green --> synonymous_variant, 5_prime_UTR_variant, 3_prime_UTR_variant, 5'Flank, 3'Flank, 3'UTR, 5'UTR, Silent, stop_retained_variant

Orange --> others, SV, upstream_gene_variant, downstream_gene_variant, intron_variant, intergenic_region
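
To make these groups usable in code, one option is a simple lookup table. The sketch below copies the Red, Blue, and Green lists as quoted above and treats anything unlisted as Orange, on the assumption that this matches the "others" entry in the Orange group. The names `XENA_GROUPS` and `xena_color` are hypothetical, not part of any Xena API.

```python
# Effect lists copied from the Red, Blue, and Green groups quoted above.
XENA_GROUPS = {
    "Red": [
        "Nonsense_Mutation", "frameshift_variant", "stop_gained",
        "splice_acceptor_variant", "splice_acceptor_variant&intron_variant",
        "splice_donor_variant", "splice_donor_variant&intron_variant",
        "Splice_Site", "Frame_Shift_Del", "Frame_Shift_Ins",
    ],
    "Blue": [
        "splice_region_variant", "splice_region_variant&intron_variant",
        "missense", "non_coding_exon_variant", "missense_variant",
        "Missense_Mutation", "exon_variant", "RNA", "Indel", "start_lost",
        "start_gained", "De_novo_Start_OutOfFrame", "Translation_Start_Site",
        "De_novo_Start_InFrame", "stop_lost", "Nonstop_Mutation",
        "initiator_codon_variant",
        "5_prime_UTR_premature_start_codon_gain_variant",
        "disruptive_inframe_deletion", "inframe_deletion",
        "inframe_insertion", "In_Frame_Del", "In_Frame_Ins",
    ],
    "Green": [
        "synonymous_variant", "5_prime_UTR_variant", "3_prime_UTR_variant",
        "5'Flank", "3'Flank", "3'UTR", "5'UTR", "Silent",
        "stop_retained_variant",
    ],
}

# Invert the groups into an effect -> color lookup.
EFFECT_TO_COLOR = {
    effect: color
    for color, effects in XENA_GROUPS.items()
    for effect in effects
}


def xena_color(effect):
    """Return the group color; unlisted effects fall back to Orange (assumption)."""
    return EFFECT_TO_COLOR.get(effect, "Orange")


for effect in ["Nonsense_Mutation", "RNA", "Silent", "Intron"]:
    print(effect, xena_color(effect))
```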

A second option would be to use the somatic mutation data that is already called at the gene level. Positive mutation calls reflect the following effects: nonsense, missense, frame-shift indels, splice site mutations, stop codon readthroughs, change of start codon, and inframe indels. We could also implement this same calling procedure ourselves.

gwaybio · 2016-07-15T20:43:18Z

Yes, I agree - I think we can toss Silent mutations.

I also think that keeping it simple would be the way to go. There are other resources available that are cleaner/simpler than this data available from TCGA Firehose that may be worth exploring.

cgreene · 2016-07-15T20:47:27Z

@clairemcleod & @gwaygenomics: If you wanted to provide simple groups that would get people started, how would you combine them? We can always provide the option to drill down to a greater level of detail (e.g. any KRAS G12V mutation), but I agree with you both that a simple initial interface is optimal.

The very granular items will only be useful for mutations that are particularly common.

Changed `base_url` for downloading data from the Xena browser from https://genome-cancer.ucsc.edu/download/public/xena/TCGA/TCGA.PANCAN.sampleMap/ to https://tcga.xenahubs.net/download/TCGA.PANCAN.sampleMap/. This new location seems to have resolved the unstandardized mutation effects reported in cognoma#2. Added json metadata files to `download` providing version info at time of download. Thanks @jingchunzhu for the suggestion. See https://groups.google.com/forum/#!msg/ucsc-cancer-genomics-browser/eg6nJOFSefw/wO0wNrMeAgAJ

dhimmel · 2016-07-15T22:57:42Z

In dhimmel/cancer-data@0239cba, I changed the download location for UCSC Xena data (and added version tracking). This resolved the unstandardized mutation effect types. The updated version of the frequency table is below (color refers to the Xena characterizations mentioned above):

Effect	Count	Percent	Color
Missense_Mutation	1,132,319	59.504%	Blue
Silent	474,679	24.945%	Green
Nonsense_Mutation	87,104	4.577%	Red
RNA	75,134	3.948%	Blue
Frame_Shift_Del	46,991	2.469%	Red
Splice_Site	46,477	2.442%	Red
Frame_Shift_Ins	21,657	1.138%	Red
In_Frame_Del	10,663	0.560%	Blue
Translation_Start_Site	3,437	0.181%	Blue
In_Frame_Ins	2,685	0.141%	Blue
Nonstop_Mutation	1,370	0.072%	Blue
3'UTR	211	0.011%	Green
Intron	149	0.008%	Orange
5'UTR	28	0.001%	Green
IGR	16	0.001%	Orange
5'Flank	1	0.000%	Green

@clairemcleod, nice find with the mutation_bcgsc_gene dataset. This is a gene × sample matrix, which we could transpose to achieve our desired matrix. Unfortunately, this dataset seems to only include 3,219 samples, whereas our processed mutation matrix has 8,499 samples.
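
For reference, the transpose itself is a one-liner. The sketch below uses a toy list-of-lists matrix with made-up gene and sample labels; the real dataset would presumably be loaded into a table library first.

```python
def transpose(matrix):
    """Swap rows and columns of a rectangular list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


genes = ["GENE_A", "GENE_B"]      # hypothetical row labels
samples = ["S1", "S2", "S3"]      # hypothetical column labels

# Toy gene x sample matrix: one row per gene, one column per sample.
gene_by_sample = [
    [1, 0, 0],
    [0, 1, 1],
]

# After transposing: one row per sample, one column per gene.
sample_by_gene = transpose(gene_by_sample)
print(sample_by_gene)
```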

Addresses cognoma#2 -- add additional mutation effects. Added all red & blue mutations from http://xena.ucsc.edu/how-we-characterize-mutations/ that were present in the data.

dhimmel · 2016-07-16T19:43:46Z

> a simple initial interface is optimal

I went with a simple solution. In dhimmel/cancer-data@ffe66ab, I retained only red and blue mutations (according to Xena), meaning orange and green mutations were removed. The only removed mutation effect category that was an appreciable portion of the data was "Silent" -- which I think we're all in agreement should be excluded.
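
In outline, the filtering works like the sketch below. It is not the code from the commit: the retained set is just the red and blue effects listed in the updated frequency table, and the record layout (sample, gene, effect), the sample IDs, and the function names are assumptions for illustration.

```python
# Red and blue effects that appear in the updated frequency table.
RETAINED_EFFECTS = {
    "Missense_Mutation", "Nonsense_Mutation", "RNA", "Frame_Shift_Del",
    "Splice_Site", "Frame_Shift_Ins", "In_Frame_Del",
    "Translation_Start_Site", "In_Frame_Ins", "Nonstop_Mutation",
}


def mutated_genes_by_sample(records):
    """Collect, per sample, the genes that have at least one retained mutation."""
    mutated = {}
    for sample, gene, effect in records:
        if effect in RETAINED_EFFECTS:
            mutated.setdefault(sample, set()).add(gene)
    return mutated


def to_binary_matrix(mutated, samples, genes):
    """Build a sample x gene matrix of 0/1 mutation calls."""
    return [
        [int(gene in mutated.get(sample, set())) for gene in genes]
        for sample in samples
    ]


# Toy records with made-up sample IDs (not real data).
records = [
    ("S1", "KRAS", "Missense_Mutation"),
    ("S1", "TP53", "Silent"),             # green: dropped
    ("S2", "TP53", "Nonsense_Mutation"),
    ("S3", "KRAS", "Intron"),             # orange: dropped
]
samples = ["S1", "S2", "S3"]
genes = ["KRAS", "TP53"]

matrix = to_binary_matrix(mutated_genes_by_sample(records), samples, genes)
print(matrix)
```

A sample whose only mutations are green or orange ends up as an all-zero row in this sketch; whether such samples should instead be dropped is a separate decision.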

I posted the mutation and expression datasets from this commit to figshare. Mutations were retained for 8,508 samples, 7,706 of which had corresponding expression data.

Unstandardized mutation types were resolved. See cognoma#2 (comment) Addressed this reviewer comment: cognoma#7 (comment)

dhimmel added the task label Jul 17, 2016

dhimmel mentioned this issue Jul 20, 2016

Process data into expression and mutation matrices #7

Merged

dhimmel added a commit to dhimmel/cancer-data that referenced this issue Jul 22, 2016

Remove old note on unstandardized mutation types

16a69c0

dhimmel mentioned this issue Aug 1, 2016

Process the clinical matrix to extract sample attributes #10

Closed

dhimmel mentioned this issue Apr 12, 2018

Filtering samples is (potentially) too strict #43

Closed
