-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Query with thousands of intervals #18
Comments
Hi Pedro, Sorry for the late reply, this slipped off my radar for a time. You're correct that Snapcount was made to do multiple studies/samples in a single query, however, it was originally intended for one or a few queries at a time (e.g. a single gene of interest, or a handful of related gene regions). Doing 50K regions as you said could take a while and may even fail. For large scale screens, such as what you're proposing, I suggest downloading the Snaptron backing data and doing a search over a local stream of that data offline with your own code. That said, you can certainly try to use snapcount to do the whole set (there's no hard limit on number of queries or throttling). However, all of the Snaptron/snapcount backing data is available publicly here: http://snaptron.cs.jhu.edu/data/ where subdirectories are named according to their compilation short name (e.g. srav2). Junction databases are in both block-gzip (readable by gzip) and SQLite3 formats, generically named: junctions.bgz for a specific compilation. The junction columns are defined in: Additional information about the rows/columns which snapcount presents from Snaptron can be found here: Links to the full set of metadata fields for each of the recount2 era compilations (including srav2), are here: I don't know why you'd be getting that error you mentioned in the edit, if you could post the full query that you were running when you got that (or one of them), that would be helpful. Thanks, |
Hi Chris, I was too busy to look at this in detail, but now i'm again on track.
Thanks for pointing out this. It will be useful for sure.
Now that I'm really into understanding how snaptron works behind the scenes, I realised that my previous query did not make sense. I'm starting simple, so I picked one known intron and exon from a gene just to make sure I understand the output of basic queries. And right away I was intrigued by the snapcount output regarding the number of samples. Let me know if this makes sense:
Why is the sample count so different, given that all the annotated isoforms for the gene are composed by this intron and exon ? For the GTEx compilation it is the same. Additionally, why was the overall count so low, given that in the paper is mentioned that SRAv2 has more than 44k public samples ? Thanks in advance, |
Hi Pedro, [Just a note about my response frequency, I've moved on from actively maintaining Snaptron as one of my primary tasks as I'm no longer a student in the lab that produced Snaptron, so my responses will be hit & miss. That said you should feel free to continue to post here (it's the proper place) and someone else may pick up where I left off]. So I'll check the exon myself, but to your question about low sample counts for the junction, and in general, this is expected. Junctions are called in a heuristic manner by the aligner (Rail-RNA in this case, but this is true about spliced aligners in general, e.g. STAR), independent of an annotation. The aligner only counts reads which actually span a junction, so this also limits the amount of evidence (read coverage) present for the aligner to make a junction call, on top of that, the aligner typically will filter out suspicious junction calls which have lower coverage and/or really short alignments on either side of the junction (anchor lengths). Also, given the variety of technical factors (read length, sequencing depth, paired/unpaired) and biological factors (specific conditions, regions, tissues sequenced) represented in the RNA-seq sequence datasets in recount2/3, pulled out from SRA with minimal filtering, it's unlikely from my experience to have junctions, even annotated ones, called in every sample in a large compendium. I've also seen evidence of a few annotated junctions that never appear to be called in a large compendium of SRA based datasets. It's important to remember that RNA-seq, unlike DNA, is only giving you a snapshot of what's transcribed in the targeted tissue/cells/regions in the experiment at that time, so missing known exons/junctions is not necessarily surprising in certain samples depending on their specific experiment parameters. Hope that helps, |
Thanks for the note, I will open new issues, if needed.
That makes sense indeed. I understand. This raised another question: i'm not sure I understood how exon calls are made. In the recount package it is mentioned: "With base-pair coverage for the exonic sequences computed, the coverage count for each distinct exon is simply the sum of the base-pair coverage for each base in a given distinct exon.". Are exon queries related to exon counts calculated as part of Recount ? Since it relies on base-pair coverage, I assume it accounts for both spanning junction reads and non-split reads. May this also be the reason why
This is true, and I was missing a very important aspect in my simple query. MYBPC3 is a cardiac gene, so it is mostly expressed in the heart tissue. By refining the query ( How can I query snaptron using group names, instead of using the SMTS field ? And last one, my apologies for throwing so many questions. Where can I get, if any, detailed documentation of the sra compilation ? I would like to know more about the metadata, and how multiple samples from the same project are taken care. Does a sample ID refers to a single sample from a single accession number ? Best regards, |
Hi,
Thanks for this user friendly R package.
I'm starting to play around with snapcount, I would like to know what's your advice for querying with many intervals spanning many genes across a full compilation (e.g. srav2). From the recount vignette I read that it's advisable to use it when we are querying single studies, while snapcount is preferred when using multiple regions, am i right ?
Example of what I'm doing now for a small number of regions:
I'm afraid It will be quite slow for around 50k intervals that I have, or I even that i exceed any limited number of requests.
Do you have any tips to optimize such large queries ?
Additionally, It would be good to improve documentation of row and column filters, as I'm not sure what kind of filters are available for the sra compilation. For example, SMTS is only available in gtex.
Edit:
I have some intervals with no data returned. Is that the cause of this error I get with
return_rse = TRUE
?Error: Matrices must have same number of columns in rbind2(argl[[i]], r)
Best,
Pedro
The text was updated successfully, but these errors were encountered: