Current Xena PANCAN_mutation dataset is missing some samples and variables from a previous release #16

Open
gwaybio opened this issue Aug 9, 2016 · 12 comments

Comments

@gwaybio
Member

gwaybio commented Aug 9, 2016

I have had this issue in the past (see zenodo file), and it looks like the current PANCAN_mutation file from Xena has fewer samples and fewer columns than a previous version.

One of the missing columns is the specific nucleotide mutation, and its absence is preventing us from completing #15.

It may be good to ask a direct question on the UCSC Xena Google Group. They have been helpful in the past (see #14).

@gwaybio
Member Author

gwaybio commented Aug 10, 2016

@stephenshank

@jingchunzhu

Agreed. If you have any questions regarding data from UCSC Xena, the Google Group is the most effective way to get the message to us.

Jing
UCSC Xena
http://xena.ucsc.edu

@gwaybio
Member Author

gwaybio commented Aug 11, 2016

Thanks @jingchunzhu - your group continues to be very helpful!

For the cognoma community, here is a link to the UCSC Xena Google Group discussion about this issue.

@gwaybio gwaybio closed this as completed Aug 11, 2016
@gwaybio gwaybio reopened this Aug 11, 2016
@dhimmel
Member

dhimmel commented Aug 16, 2016

@gwaygenomics, with @jingchunzhu's latest reply (quoted below), what's the status of this issue? Basically, should it be closed or is the issue ongoing (and what's the next step to progress forward)?


I have noticed that different versions of PanCancer mutation data have been deposited in the browser here. The current version is not a complete representation of previous versions. It lacks several columns and several samples that were present in a previous version the last time I accessed the data on 12 June 2015.

Yes. These are not from the same release. However, I am surprised by "lacking several columns". Could you say how the columns are different?

A change in sample number is not surprising because we periodically update TCGA data. This particular dataset is compiled by the Xena team at UCSC. In almost all cases, TCGA has multiple versions of mutation calls from several sequencing and analysis groups (Broad, WashU, BCM, and UCSC), plus there are curated and automated calls, plus there are different sequencing platforms. So we made our own internal decision on which datasets to include, and the exact selection has changed over time, not drastically, but there are changes. These changes will affect sample numbers.

Is there a place where data is versioned besides in the JSON metadata (see discussion here)?

Starting in 2016, we store our release data on AWS S3, which means that all versions of data from 2016 onward will be on S3. We plan to keep doing so as long as there are resources to sustain it. The .json files are part of the data releases and store the version information. Our previous data releases are not on S3. Do you need the previous version that you retrieved on June 12? We can send it to you directly.

Jing

@gwaybio
Member Author

gwaybio commented Aug 16, 2016

@dhimmel - I responded to @jingchunzhu on the google groups but the message was not posted. Not sure what happened here.

My post listed the different columns between the two versions. There were many more columns in the older version. Perhaps @jingchunzhu is looking into it before passing my comment through the moderators?
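
For readers who want to reproduce this kind of comparison, here is a minimal sketch of listing the columns and samples that appear in an older release but not in a newer one. It assumes both releases are tab-separated text with a header row and a `sample` column; the column names, sample IDs, and rows below are hypothetical stand-ins, not real Xena data.

```python
import csv
import io


def summarize_release(tsv_text, sample_column="sample"):
    """Return (set of column names, set of sample IDs) for one tab-separated release."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    columns = set(reader.fieldnames or [])
    samples = {row[sample_column] for row in reader}
    return columns, samples


# Hypothetical miniature stand-ins for the two releases (not real Xena data).
old_release = (
    "sample\tgene\teffect\tHGVSc\tHGVSp\n"
    "SAMPLE-A\tGENE1\tmissense\tc.1A>G\tp.M1V\n"
    "SAMPLE-B\tGENE2\tnonsense\tc.4C>T\tp.Q2*\n"
)
new_release = (
    "sample\tgene\teffect\n"
    "SAMPLE-A\tGENE1\tmissense\n"
)

old_columns, old_samples = summarize_release(old_release)
new_columns, new_samples = summarize_release(new_release)

# Set differences give what the newer release no longer contains.
missing_columns = sorted(old_columns - new_columns)
missing_samples = sorted(old_samples - new_samples)
print("Columns only in the old release:", missing_columns)
print("Samples only in the old release:", missing_samples)
```

With real files, the strings above would be replaced by the contents of the two downloaded releases, and `sample_column` would need to match whatever the sample identifier column is actually called.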

@dhimmel
Member

dhimmel commented Aug 16, 2016

> I responded to @jingchunzhu on the google groups but the message was not posted. Not sure what happened here.

@gwaygenomics good to know. Give it time -- there is a delay between posting and the message appearing (perhaps an approval stage with a poor user experience). I actually posted a suggestion to move the Google Group to GitHub issues to avoid these blocks, although this post is also currently hidden.

@dhimmel dhimmel changed the title from "Mutation Data from Xena is not complete" to "Current Xena PANCAN_mutation dataset is missing some samples and variables from a previous release" Aug 16, 2016
@jingchunzhu

jingchunzhu commented Aug 16, 2016

I don't see either of the two messages. Not sure what's going on. Sorry.

> Perhaps @jingchunzhu is looking into it before passing my comment through the moderators?

Greg, I don't know if the message will show up at all. Can you email me with your post that did not get through?

@dhimmel
Member

dhimmel commented Aug 16, 2016

@jingchunzhu, every time I post to the Google Group there is a substantial delay till it appears. I'm pretty sure the messages will show up if we wait.

@dhimmel
Member

dhimmel commented Aug 18, 2016

@jingchunzhu I'm starting to think that @gwaygenomics's posts and mine may actually be permanently missing this time. Is the Google Group moderated, and if so, can you confirm that our posts are not waiting on approval?

@jingchunzhu

jingchunzhu commented Aug 18, 2016

Yes, it is moderated. I think the delay is because Mary is on vacation until next Monday, so all incoming posts are in the to-be-approved queue. I will ask her to give me approval permission after she comes back, or perhaps see whether there is a feature in Google Groups that lets some people or accounts bypass moderation.

@dhimmel
Member

dhimmel commented Aug 26, 2016

So it seems like one of the reasons for missing samples could be the upgrade to hg19. I'm content with the fluctuation in sample number -- we'll work with whatever the latest release from Xena contains.

@gwaygenomics, it seems that there is still one outstanding question before we can close this issue. You mentioned that a previous release of PANCAN_mutation contained an extra column that could help us interpret the amino acid effect of a variant. What exactly was this variable and its name?

@gwaybio
Member Author

gwaybio commented Sep 6, 2016

@dhimmel @jingchunzhu sorry for the late response. The exact variables are HGVSc and HGVSp - they are the actual mutation calls at the DNA level and at the protein level.

Thanks!
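
To illustrate what these two columns carry, here is a rough sketch that splits simple HGVS-style values into their parts. It handles only single-nucleotide substitutions at the coding-DNA level (HGVSc) and one-letter amino acid substitutions at the protein level (HGVSp). Whether the older file used exactly these notations is an assumption, the function names are hypothetical, and the example values are the well-known BRAF V600E substitution, used purely for illustration.

```python
import re

# Simplified patterns (an assumption about the format): coding-DNA substitutions
# such as "c.1799T>A" and one-letter protein substitutions such as "p.V600E".
HGVSC_SUBSTITUTION = re.compile(r"^c\.(\d+)([ACGT])>([ACGT])$")
HGVSP_SUBSTITUTION = re.compile(r"^p\.([A-Z])(\d+)([A-Z*])$")


def parse_hgvsc(value):
    """Parse a simple coding-DNA substitution; return None for anything else."""
    match = HGVSC_SUBSTITUTION.match(value)
    if match is None:
        return None
    position, ref, alt = match.groups()
    return {"position": int(position), "ref": ref, "alt": alt}


def parse_hgvsp(value):
    """Parse a simple one-letter protein substitution; return None for anything else."""
    match = HGVSP_SUBSTITUTION.match(value)
    if match is None:
        return None
    ref, position, alt = match.groups()
    return {"position": int(position), "ref": ref, "alt": alt}


dna_change = parse_hgvsc("c.1799T>A")
protein_change = parse_hgvsp("p.V600E")
print(dna_change)
print(protein_change)
```

Values outside these two simple forms (deletions, insertions, three-letter amino acid codes, and so on) return `None` here and would need a fuller HGVS parser.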
