Much Richer Sample Clinical Data #38

Open
gwaybio opened this issue Feb 17, 2017 · 9 comments

Comments

@gwaybio
Member

gwaybio commented Feb 17, 2017

Stumbled upon snaptron today and eventually found my way to this resource.

There are many curated variables measured on each sample (in samples.tsv), including treatment (both the specific therapeutic agent and the class of therapy, e.g. chemotherapy or immunotherapy). I know that @yigalron was very interested in this particular data...

@dhimmel
Member

dhimmel commented Feb 17, 2017

Wow, http://snaptron.cs.jhu.edu/data/tcga/samples.tsv does have lots of variables! Do you know whether this resource will be continually maintained?
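
For a first look at which variables a file like this offers, a sketch along the following lines might help. It parses a tiny made-up tab-separated excerpt and picks out the columns whose names mention treatment. The column names `sample_id`, `drug_name`, `therapy_class`, and `age` are hypothetical placeholders, not the real header of samples.tsv, and the search pattern is only a guess at how treatment columns might be named.

```r
## Toy stand-in for samples.tsv; column names and values are made up for illustration
toy_tsv <- paste(
    "sample_id\tdrug_name\ttherapy_class\tage",
    "S1\tcisplatin\tchemotherapy\t61",
    "S2\tNA\tNA\t47",
    sep = "\n"
)

## Read tab-separated text; replace `text = toy_tsv` with a file path or URL for real data
samples <- read.delim(text = toy_tsv, stringsAsFactors = FALSE)

## Number of samples (rows) and variables (columns)
dim(samples)

## Columns whose names look treatment-related (the pattern is an assumption)
treatment_cols <- grep("drug|therapy|treatment", colnames(samples),
    value = TRUE, ignore.case = TRUE)
treatment_cols
```

Passing the URL above to `read.delim()` in place of the inline text should behave the same way, assuming the file is still served there.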

@gwaybio
Member Author

gwaybio commented Feb 17, 2017

Not sure, but perhaps @ChristopherWilks will know? We could figshare it.

@dhimmel
Member

dhimmel commented Feb 17, 2017

@gwaygenomics here are the attributes I care about:

  1. The data is public domain and released under a CC0 license if necessary.
  2. The creators are committed to providing support.
  3. The creators are committed to updating the resource to fix issues and incorporate new TCGA data. Preferably, there is some established workflow to facilitate updates.

@ChristopherWilks

ChristopherWilks commented Feb 17, 2017

Good to see Snaptron and/or its data could be useful for you folks.
@nellore and @lcolladotor were primarily the ones who put samples.tsv together, originally for the recount project, https://jhubiostatistics.shinyapps.io/recount/

As far as maintenance goes, the project as a whole will be maintained in terms of the expression and junction data for various data sources (SRA, GTEx, TCGA). I'll defer to @nellore and @lcolladotor regarding updates to the metadata, specifically TCGA (not surprisingly, this was the hardest metadata to integrate and there was a non-trivial effort put into it).

@gwaybio
Member Author

gwaybio commented Feb 17, 2017

thanks for the quick response @ChristopherWilks - yes, I certainly understand that collecting that metadata was no small feat!

I also agree with the 3 points @dhimmel raises. It would be great to incorporate this data into cognoma if possible!

@lcolladotor

Hi,

Sorry for the delay. We had to discuss internally some of your concerns with @jtleek, @nellore and others.

Basically, our answer is no, we won't use a CC0 license due to the lack of attribution. We are providing support to https://jhubiostatistics.shinyapps.io/recount/ according to the needs of that project. So we'll most likely fix issues with the data in recount2 if they arise and will be glad to take pull requests to fix any issues with the TCGA metadata in particular. If we update the TCGA RNA-seq data in recount2 at some point, we would update the metadata accordingly. But given the cost of doing so, I assume that it won't happen anytime soon if at all.

So basically, you can use the TCGA metadata we cleaned at your own risk if you cite our work. Currently the manuscript is available as a pre-print, but it is in press and will be published soon.

Best,
Leonardo

PS The related GitHub repositories to recount2 are:

There's also the recount Bioconductor package at https://github.com/leekgroup/recount and http://bioconductor.org/packages/recount

@dhimmel
Member

dhimmel commented Feb 28, 2017

@lcolladotor thanks for the detailed information. The links are really helpful, and it's great to see that all your work is on GitHub.

> Basically, our answer is no, we won't use a CC0 license due to the lack of attribution.

Very understandable. Have you considered releasing your data under a Creative Commons Attribution (CC BY) or Open Data Commons Attribution (ODC-BY) License? Both of these meet the Open Definition and allow reuse as long as attribution is provided.

As it stands, without a license, it is potentially copyright infringement to reuse your datasets. The situation is complicated, since the data is likely not subject to copyright in the United States, but may be subject to copyright in Europe. Additionally, fair use may protect reuse, but that varies by jurisdiction and is a subjective & non-deterministic judgement.

Therefore, an open license can help others know with greater certainty what reuse is permissible. I know most researchers aren't excited to divert precious time to understanding legalities. But I bring it up because with a little effort up front, I think we can save a lot of headaches down the road and help advance the data sciences.

@lcolladotor

Hi,

Thank you for your detailed reply. We have decided to use the CC BY 4.0 license for the data in the recount2 project. For the code we'll use the MIT license.

In more detail:

In relation to the TCGA metadata, the code for creating it is in https://github.com/leekgroup/recount-website. The resulting table is downloadable via https://jhubiostatistics.shinyapps.io/recount/, specifically at http://duffel.rail.bio/recount/TCGA/TCGA.tsv. An R-formatted version can be downloaded with:

## Install if needed (on current Bioconductor releases, use BiocManager::install('recount') instead of biocLite)
source("https://bioconductor.org/biocLite.R")
biocLite('recount')

library('recount')
metadata <- all_metadata('TCGA')

## Some quick info about it
dim(metadata)
## [1] 11284   864
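
With several hundred columns, one possible next step is to check how well populated each variable is before relying on it. The sketch below is only an illustration on a toy table: `column_completeness()` is a hypothetical helper, not part of recount, and the column names are made up. It assumes the metadata is held in a plain `data.frame` (converting with `as.data.frame()` first may be needed).

```r
## Hypothetical helper (not part of recount): fraction of non-missing values per column
column_completeness <- function(df) {
    sort(colMeans(!is.na(df)), decreasing = TRUE)
}

## Toy stand-in for the metadata table; names and values are made up
toy_metadata <- data.frame(
    sample_id = c("S1", "S2", "S3", "S4"),
    project = c("BRCA", "BRCA", "LUAD", NA),
    drug_name = c("cisplatin", NA, NA, NA),
    stringsAsFactors = FALSE
)

completeness <- column_completeness(toy_metadata)
completeness

## Keep only the columns that are at least half populated (the threshold is arbitrary)
well_populated <- names(completeness)[completeness >= 0.5]
toy_metadata[, well_populated, drop = FALSE]
```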

Best,
Leonardo

@dhimmel
Member

dhimmel commented Mar 9, 2017

@lcolladotor thanks! I know licensing is a hassle. Really appreciate that you took the time to unambiguously and openly license recount2.
