SRA scale search problem/question - Results never appear? #55

Open
Anto007 opened this issue Feb 7, 2025 · 13 comments


Anto007 commented Feb 7, 2025

Hi @ctb @luizirber

Thank you for bringing out such a cool tool for the community. I tried submitting two different genomes on the branchwater web-app on two different days (around 5 days apart; latest submission was just an hour ago) but never got to see the results? I waited for an hour or longer and yet I saw no results. Also, there's a bit of a confusion in my mind regarding the below matters and I was hoping that you could clear it for me. Thank you very much.

  1. The branchwater web-app page says that 1,161,119 metagenomes are being screened, yet according to the wiki on the main page of this GitHub repository, only ~946,000 SRA metagenomes are indexed on the server?

  2. According to the info at this link, containment of >=0.20 corresponds to a minimum of 92% average nucleotide identity. However, your recently published paper on early branching Cyanobacteriia states that this containment value corresponds to nearly 94% ANI? On the branchwater web-app, there is mention of a cANI metric, and cANI > 0.97 is described as representative of a species-level match. So far I've not got to see the results from my searches on the branchwater web-app, but if it were to work, is my understanding correct that I can retrieve species-level matches to my input prokaryote genome? I remember from an earlier blog page from @ctb that mastiff SRA searches via nucleotide k-mers come with the limitation of not being able to go down to the species level? Previously, I have successfully used mastiff to screen the SRA metagenomes.

Author

Anto007 commented Feb 7, 2025

Btw, trying searches with the examples on the branchwater web-app also doesn't seem to be working

Contributor

ctb commented Feb 7, 2025

hi @Anto007 thanks - we literally just (yesterday) updated the branchwater-web server, so probably it has to do with that. Clearly we need better release testing 😅 .

I'll have to get back to you on the precise numbers - I'm traveling at the moment - but in our experience (and also math, simulations, etc.) k-mer approaches (containment & cANI) accurately distinguish between genomes in the range of 90-99.5% ANI. You can definitely get to species level. It is harder to go beyond genus level (as in, upwards - genus and family level matches are less guaranteed, although it also depends on the taxonomy you use, since NCBI isn't ANI-based).

Thank you for asking these questions! We've clearly got some harmonization to do among these resources! The code inside of sourmash does properly calculate cANI, and does also adjust for k-mer size (which matters for translating containment to cANI), but I need to make sure we're properly connecting all the dots here.
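
For readers following along, the dependence on k-mer size can be sketched with the widely used point estimate cANI ≈ C^(1/k), where C is containment and k is the k-mer size. This is a minimal illustration under that assumption, not a call into sourmash: the function names are hypothetical, and it leaves out any confidence intervals or other corrections a real implementation may apply.

```python
def containment_to_cani(containment: float, ksize: int) -> float:
    """Point estimate (assumed form): cANI ~= containment ** (1 / ksize)."""
    if not 0.0 < containment <= 1.0:
        raise ValueError("containment must be in (0, 1]")
    return containment ** (1.0 / ksize)


def cani_to_containment(cani: float, ksize: int) -> float:
    """Inverse of the point estimate: containment ~= cani ** ksize."""
    return cani ** ksize


if __name__ == "__main__":
    # The same containment maps to a different cANI at each k-mer size.
    for k in (21, 31):
        print(k, round(containment_to_cani(0.20, k), 3))
    # Containment implied by a cANI of 0.97 at k=31.
    print(round(cani_to_containment(0.97, 31), 2))
```

Under this estimate, a containment of 0.20 comes out at roughly 0.93 for k=21 and roughly 0.95 for k=31, so a containment threshold only translates to an ANI value once the k-mer size is fixed.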

Member

luizirber commented Feb 7, 2025

> Thank you for bringing out such a cool tool for the community. I tried submitting two different genomes on the branchwater web-app on two different days (around 5 days apart; latest submission was just an hour ago) but never got to see the results? I waited for an hour or longer and yet I saw no results. Also, there's a bit of a confusion in my mind regarding the below matters and I was hoping that you could clear it for me. Thank you very much.

Working on better feedback on what is happening (instead of "wait 5 minutes!"), and stabilizing the service right now 👷

> The branchwater web-app page says that 1,161,119 metagenomes are being screened and yet according to the wiki on the main page of this github repository, only ~946,000 SRA metagenomes are indexed on the server?

Yes, as Titus mentioned it was updated this week, will do a wider announcement when service is more stable. Also need to update the README to point to the new number 😓

You might want to try https://branchwater-dev.sourmash.bio in the meantime. It is running at a higher scaled value (s=100,000 instead of s=1000), but should give similar results if your genome is large enough (~500kb).
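
As a rough illustration of why genome size matters at a higher scaled value: a scaled sketch keeps about 1 in every `scaled` k-mer hashes, so the expected sketch size is approximately the number of distinct k-mers divided by `scaled`. The sketch below assumes the number of distinct k-mers is about equal to the genome length, which is a simplification, and the function name is hypothetical.

```python
def expected_sketch_size(genome_length_bp: int, scaled: int) -> float:
    """Rough expected number of hashes kept: about 1 in `scaled` k-mers.

    Assumes distinct k-mers ~= genome length (a simplification).
    """
    if scaled <= 0:
        raise ValueError("scaled must be positive")
    return genome_length_bp / scaled


if __name__ == "__main__":
    # A ~500 kb genome at the two scaled values mentioned in this thread.
    for scaled in (1_000, 100_000):
        print(scaled, expected_sketch_size(500_000, scaled))
```

With these assumptions, a ~500 kb genome yields on the order of 500 hashes at s=1000 but only about 5 at s=100,000, so smaller genomes leave very little to match on at the higher setting.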

Author

Anto007 commented Feb 8, 2025

Thank you very much for your responses @ctb and @luizirber. I tried but didn't have any luck in seeing any sort of results despite waiting for more than 1 hour now. Screenshot attached below

Image

Author

Anto007 commented Feb 10, 2025

Hi again @ctb and @luizirber, https://branchwater-dev.sourmash.bio worked well today with my input genome. The results page looks cool and is useful. I was wondering if you would in the near future allow users to search specifically cANI > 0.97 to get only species-level matches? That way the map plot might end up being a bit more useful. Thanks again for this fantastic tool

Author

Anto007 commented Feb 10, 2025

Seems to have stopped working now with yet another input genome that I tried :-(

Author

Anto007 commented Feb 10, 2025

@ctb and @luizirber, I have a question for understanding: For example, if I get a SRR13208515 hit (at cANI > 0.97) with my input genome/MAG, my understanding based on sourmash is that it doesn't return any abundance info of the input genome in SRR13208515, correct? In this context, if I were to download this SRR13208515 read-set and perform a metagenomic assembly with it, would I be able to recover the MAG that is close to my input genome? My question here is to understand if branchwater detects an organism's content in metagenomic reads even when it is way too lowly abundant to be recoverable as a MAG and/or when it is present in the metagenomic reads in a very incomplete form (in terms of genomic completeness-for example, only 5% of the total genome present in the metagenomic reads)?

Contributor

ctb commented Feb 10, 2025

> @ctb and @luizirber, I have a question for understanding: For example, if I get a SRR13208515 hit (at cANI > 0.97) with my input genome/MAG, my understanding based on sourmash is that it doesn't return any abundance info of the input genome in SRR13208515, correct? In this context, if I were to download this SRR13208515 read-set and perform a metagenomic assembly with it, would I be able to recover the MAG that is close to my input genome? My question here is to understand if branchwater detects an organism's content in metagenomic reads even when it is way too lowly abundant to be recoverable as a MAG and/or when it is present in the metagenomic reads in a very incomplete form (in terms of genomic completeness-for example, only 5% of the total genome present in the metagenomic reads)?

excellent question! Let me try to break it down -

> My question here is to understand if branchwater detects an organism's content in metagenomic reads even when it is way too lowly abundant to be recoverable as a MAG and/or when it is present in the metagenomic reads in a very incomplete form (in terms of genomic completeness-for example, only 5% of the total genome present in the metagenomic reads)?

Yes, it does detect (extremely) low abundance and partial matches - e.g. 1x coverage (and even below - partial matches are fine), and/or anything over 10-50kb worth of genomic detection. This is because collections of 31-mers are extremely species- and genome-specific.

Per Biogeographic distribution of five Antarctic cyanobacteria using large-scale k-mer searching with sourmash branchwater, you would definitely expect the metagenomic reads to map to the detected genome, and the depth and coverage of those reads should generally be greater than what is detected by k-mers (i.e. branchwater underestimates mapping).

> In this context, if I were to download this SRR13208515 read-set and perform a metagenomic assembly with it, would I be able to recover the MAG that is close to my input genome?

Maybe, or maybe not. Assembly and binning are extremely lossy. Assembly drops low abundance and strain-varying content, while binning typically drops the vast majority of contigs. It doesn't hurt to try, but don't be disappointed if it doesn't work!

Two complementary alternatives would be to do "read recruitment" to your query genome, which will give you SNPs and small indels, but not detect significant novel content; or to use a tool like spacegraphcats. I'm afraid spacegraphcats has not been kept up, and it is also not super efficient or easy to use, but it was (not by coincidence) designed for this purpose ;). There may be more modern alternatives to spacegraphcats, too.

HTH! Ask as you have more questions!

Contributor

ctb commented Feb 10, 2025

> Seems to have stopped working now with yet another input genome that I tried :-(

if you can send your query genomes (or your query sketches with k=31 and scaled=1000) I would love to try them. I suspect that the front-end is timing out but we can verify... Sorry for this!

@bluegenes
Collaborator

> I was wondering if you would in the near future allow users to search specifically cANI > 0.97 to get only species-level matches

You can filter the results using the boxes above each column. After entering, the table and figures will update.
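
The same cANI cutoff could also be applied offline, assuming the results table has been saved as a CSV file. The sketch below is only illustrative: the column names (`acc`, `cANI`, `containment`) are assumptions rather than the app's documented output, and every row except the SRR13208515 accession mentioned in this thread is made up.

```python
import csv
import io

# Made-up rows for illustration; the column names are assumptions.
SAMPLE_CSV = """acc,cANI,containment
SRR13208515,0.981,0.55
SRR0000001,0.952,0.22
SRR0000002,0.973,0.43
"""


def filter_by_cani(rows, min_cani=0.97):
    """Keep rows whose cANI is at or above the threshold."""
    return [row for row in rows if float(row["cANI"]) >= min_cani]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
species_level = filter_by_cani(rows)
print([row["acc"] for row in species_level])
```

For a real export, `io.StringIO(SAMPLE_CSV)` would be swapped for an open file handle, and the column names adjusted to whatever the file actually contains.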

Author

Anto007 commented Feb 11, 2025

Thank you very much @ctb and @bluegenes for taking the time to respond and for your detailed answers. I was also thinking along similar lines as your answer with respect to the ability to recover a MAG in a hit genome and thanks again for making this super-clear. @ctb I was able to get the results for my second genome when I tried the server this morning but thank you very much for your offer; I might take it up the next time that the server becomes non-responsive. @bluegenes sorry, I missed that function on the results page and that's actually cool that the map plot gets automatically updated. Any reason why the map doesn't show all the sample locations?
For example, in the example below from my analysis on branchwater, metagenome read libraries from 41 locations in the USA had hits to my input genome (at cANI >= 0.97), but only a few locations (green dots) are shown? I suppose each green dot represents a unique metagenome source/biome type, and multiple identical source/biome types are not plotted. I believe plotting all locations would be useful irrespective of the metagenome source/biome type, as it would perhaps offer a better picture of the global diversity of the bacterium of interest? I hope I've not missed some function on the results page of your wonderful tool that can already do this.

Image

Author

Anto007 commented Feb 11, 2025

Now that I've plotted this on my own with all the locations with the hit, I see that I end up with a similar plot as yours. I guess multiple read libraries are represented by the same sampling coordinates. Thank you once again

Collaborator

bluegenes commented Feb 11, 2025

Great, I'm glad the filtering worked for you! Unfortunately, not all metagenomes have latitude/longitude data; some only have country data. That is most likely why you're seeing fewer green dots relative to the number of results (but please do let us know if that's not the case).
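
One way to check both explanations raised in this thread (several read libraries sharing the same coordinates, and hits with no latitude/longitude at all) is to count them directly. This is a minimal sketch on made-up data: the field names (`acc`, `lat`, `lon`) and accession labels are hypothetical, not the app's actual metadata schema.

```python
from collections import Counter

# Made-up hits; the field names ("acc", "lat", "lon") are assumptions.
hits = [
    {"acc": "run_A", "lat": 38.54, "lon": -121.74},
    {"acc": "run_B", "lat": 38.54, "lon": -121.74},  # same site as run_A
    {"acc": "run_C", "lat": 44.97, "lon": -93.23},
    {"acc": "run_D", "lat": None, "lon": None},      # country-level only
]


def summarize_locations(hits):
    """Count hits, hits with coordinates, and distinct coordinate pairs."""
    located = [
        (h["lat"], h["lon"])
        for h in hits
        if h["lat"] is not None and h["lon"] is not None
    ]
    per_site = Counter(located)
    return {
        "hits": len(hits),
        "with_coordinates": len(located),
        "unique_sites": len(per_site),
        "missing_coordinates": len(hits) - len(located),
    }


print(summarize_locations(hits))
```

If `unique_sites` is much smaller than `hits`, a map with one dot per coordinate pair will naturally show fewer dots than there are rows in the results table.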
