Current Xena PANCAN_mutation dataset is missing some samples and variables from a previous release #16

gwaybio · 2016-08-09T23:43:03Z

I have had this issue in the past (see Zenodo file), and it looks like the current PANCAN_mutation file from Xena has fewer samples and fewer columns than a previous version.

One of the missing columns is the specific nucleotide mutation, and its absence is preventing us from completing #15.

It may be good to ask a direct question to the UCSC Xena Google Group. They have been helpful in the past (see #14)

gwaybio · 2016-08-10T00:09:46Z

@stephenshank

jingchunzhu · 2016-08-10T17:27:23Z

Agreed. If you have any questions regarding data from UCSC Xena, the Google Group is the most effective way to get the message to us.

Jing
UCSC Xena
http://xena.ucsc.edu

gwaybio · 2016-08-11T16:44:56Z

Thanks @jingchunzhu - your group continues to be very helpful!

For the cognoma community, here is a link to the UCSC Xena google group discussion about this issue

dhimmel · 2016-08-16T16:31:43Z

@gwaygenomics, with @jingchunzhu's latest reply (quoted below), what's the status of this issue? Basically, should it be closed or is the issue ongoing (and what's the next step to progress forward)?

> > I have noticed that different versions of PanCancer mutation data have been deposited in the browser here. The current version is not a complete representation of previous versions. It lacks several columns and several samples that were present in a previous version the last time I accessed the data on 12 June 2015.
>
> Yes. These are not from the same release. However, I am surprised by "lacking several columns". Could you say how the columns are different?
>
> A change in sample number is not surprising because we periodically update TCGA data. This particular dataset is compiled by the Xena team at UCSC. In almost all cases, TCGA has multiple versions of mutation calls from several sequencing and analysis groups (Broad, WashU, BCM, and UCSC); there are also curated and automated calls, and different sequencing platforms. So we made our own internal decision on which datasets to include, and the exact selection has changed over time, not drastically, but there are changes. These changes will affect sample numbers.
>
> > Is there a place where data is versioned besides in the JSON metadata (see discussion here)?
>
> Starting in 2016, we store our release data on AWS S3, which means that all versions of data from 2016 onward will be on S3. We plan to continue doing so as long as there are resources to sustain it. The .json files are part of the data releases and store the version information. Our previous data releases are not on S3. Do you need the previous version that you retrieved on June 12? We can send it to you directly.
>
> Jing
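
Since the version information lives in the .json file that ships with each release, a minimal sketch of reading it might look like the following. The `version` key, the example payload, and the `read_release_version` helper are assumptions made for illustration; an actual Xena metadata file may use different keys.

```python
import json

# Hypothetical metadata payload, invented for this example.
# Real Xena .json files may name or structure these fields differently.
EXAMPLE_METADATA = '{"name": "PANCAN_mutation", "version": "2016-01-01"}'


def read_release_version(metadata_text, key="version"):
    """Return the value stored under `key` in a JSON metadata string, or None."""
    metadata = json.loads(metadata_text)
    return metadata.get(key)


if __name__ == "__main__":
    print(read_release_version(EXAMPLE_METADATA))
```

Recording this value alongside any downloaded copy would make it easier to tell later which release an analysis was based on.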

gwaybio · 2016-08-16T16:35:30Z

@dhimmel - I responded to @jingchunzhu on the google groups but the message was not posted. Not sure what happened here.

My post listed the different columns between the two versions. There were many more columns in the older version. Perhaps @jingchunzhu is looking into it before passing my comment through the moderators?
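
For reference, a comparison like the one described above can be sketched with the Python standard library. This is only an illustration under stated assumptions: the tab-separated layout, the `sample` column name, the toy rows, and the helper names `summarize` and `compare_releases` are hypothetical and are not taken from the actual PANCAN_mutation files.

```python
import csv
import io


def summarize(tsv_text, sample_column="sample"):
    """Return (column names, sample IDs) as two sets for a tab-separated table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    columns = set(reader.fieldnames or [])
    samples = {row[sample_column] for row in reader}
    return columns, samples


def compare_releases(old_text, new_text, sample_column="sample"):
    """List the columns and samples that differ between two releases."""
    old_cols, old_samples = summarize(old_text, sample_column)
    new_cols, new_samples = summarize(new_text, sample_column)
    return {
        "columns_dropped": sorted(old_cols - new_cols),
        "columns_added": sorted(new_cols - old_cols),
        "samples_dropped": sorted(old_samples - new_samples),
        "samples_added": sorted(new_samples - old_samples),
    }


# Toy tables invented for this example; they are not real release data.
OLD_RELEASE = "sample\tgene\textra_column\nS1\tTP53\tx\nS2\tKRAS\ty\n"
NEW_RELEASE = "sample\tgene\nS1\tTP53\n"

if __name__ == "__main__":
    for label, values in compare_releases(OLD_RELEASE, NEW_RELEASE).items():
        print(label, values)
```

Set differences keep the report symmetric, so columns or samples that appear only in the newer release are listed as well.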

dhimmel · 2016-08-16T16:43:47Z

> I responded to @jingchunzhu on the google groups but the message was not posted. Not sure what happened here.

@gwaygenomics good to know. Give it time -- there is a delay between posting and the message appearing (perhaps an approval stage with a poor user experience). I actually posted a suggestion to move the Google Group to GitHub issues to avoid these blocks, although this post is also currently hidden.

jingchunzhu · 2016-08-16T17:57:59Z

I don't see either of the two messages. Not sure what's going on. Sorry.

> Perhaps @jingchunzhu is looking into it before passing my comment through the moderators?

Greg, I don't know if the message will show up at all. Can you email me with your post that did not get through?

dhimmel · 2016-08-16T18:20:17Z

@jingchunzhu, every time I post to the Google Group there is a substantial delay till it appears. I'm pretty sure the messages will show up if we wait.

dhimmel · 2016-08-18T14:39:02Z

@jingchunzhu I'm starting to think that @gwaygenomics's and my posts may actually be permanently missing this time. Is the Google Group moderated and, if so, can you confirm that our posts are not waiting on approval?

jingchunzhu · 2016-08-18T17:28:13Z

Yes, it is moderated. I think the delay is because Mary is on vacation until next Monday, so all incoming posts are in the to-be-approved queue. I will talk to her about giving me approval permission after she comes back, or perhaps see whether there is a feature in Google Groups that can give some people or some accounts permission to bypass moderation.

dhimmel · 2016-08-26T22:21:53Z

So it seems like one of the reasons for missing samples could be the upgrade to hg19. I'm content with the fluctuation in sample number -- we'll work with whatever the latest release from Xena contains.

@gwaygenomics, it seems that there is still one outstanding question before we can close this issue. You mention that a previous release of PANCAN_mutation contained an extra column that could help us interpret the amino acid effect of a variant. What exactly was this variable and its name?

gwaybio · 2016-09-06T12:45:43Z

@dhimmel @jingchunzhu sorry for the late response. The exact variables are HGVSc and HGVSp - they are the actual mutation calls at the DNA level and at the protein level.
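
To illustrate what these two variables encode, here is a minimal sketch that parses simple HGVS-style strings. It assumes coding-DNA substitutions of the form `c.1799T>A` and protein substitutions in one-letter form such as `p.V600E`; whether the older release used exactly this notation is not confirmed here, and anything more complex (indels, frameshifts, three-letter amino acid codes) is deliberately left unparsed. The function names are hypothetical.

```python
import re

# Simple single-nucleotide substitution at the coding DNA level, e.g. c.1799T>A
HGVSC_SUBSTITUTION = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")
# Single-residue substitution at the protein level, one-letter codes, e.g. p.V600E
HGVSP_SUBSTITUTION = re.compile(r"^p\.([A-Z*])(\d+)([A-Z*])$")


def parse_hgvsc(value):
    """Parse a simple coding-DNA substitution; return None for anything else."""
    match = HGVSC_SUBSTITUTION.match(value)
    if match is None:
        return None
    position, ref, alt = match.groups()
    return {"position": int(position), "ref": ref, "alt": alt}


def parse_hgvsp(value):
    """Parse a simple protein substitution; return None for anything else."""
    match = HGVSP_SUBSTITUTION.match(value)
    if match is None:
        return None
    ref, position, alt = match.groups()
    return {"position": int(position), "ref": ref, "alt": alt}


if __name__ == "__main__":
    print(parse_hgvsc("c.1799T>A"))
    print(parse_hgvsp("p.V600E"))
```

Returning `None` for unrecognized strings keeps the sketch honest about its limited coverage instead of guessing at more complex variant types.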

Thanks!

gwaybio added the help wanted label Aug 9, 2016

gwaybio closed this as completed Aug 11, 2016

gwaybio reopened this Aug 11, 2016

dhimmel changed the title ~~Mutation Data from Xena is not complete~~ Current Xena PANCAN_mutation dataset is missing some samples and variables from a previous release Aug 16, 2016

dhimmel mentioned this issue Aug 25, 2016

Extract sample info from PANCAN_clinicalMatrix #20

Merged

gwaybio mentioned this issue Sep 7, 2016

Add exploratory analyses of mutation data #22

Merged
