Trim GG before clustering #1

Open
andreas-wilm opened this issue Jan 26, 2016 · 7 comments

Comments

@andreas-wilm

The classification database (99% OTUs) should have been primer-trimmed before clustering, rather than taken from the preclustered database. Pointed out by Christophe LAY.

@lch14forever
Member

Do you need help on this?

@andreas-wilm
Author

Sure :) Are you sure you have the time?

The issue is this: for classification we map against a version of Greengenes that is pre-clustered at 99% identity. To make things comparable, though, the clustering should have happened after primer trimming. So we would need to primer-trim Greengenes (discarding the sequences that don't match the primers?) and then cluster at 99%.

Andreas
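
A rough sketch of the trimming step described above, in case it helps the discussion. It looks for the forward primer and the reverse complement of the reverse primer in each reference sequence, keeps only the region in between, and discards sequences where either primer is missing. Everything here is an assumption for illustration: the function names are made up, the primer sequences in the usage note are placeholders (the thread does not say which primers are used), and matching is exact apart from IUPAC ambiguity codes, whereas a real trimming tool would normally tolerate a few mismatches.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def revcomp(seq):
    """Reverse complement of a DNA sequence (IUPAC codes supported)."""
    return seq.upper().replace("U", "T").translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    """Turn a (possibly degenerate) primer into a regex pattern."""
    return "".join(IUPAC[base] for base in primer.upper())


def trim_amplicon(seq, fwd_primer, rev_primer):
    """Return the region between the two primers (primers excluded).

    Returns None if either primer is not found, so the caller can
    discard sequences that do not match the primers.
    """
    seq = seq.upper().replace("U", "T")
    fwd = re.search(primer_regex(fwd_primer), seq)
    if fwd is None:
        return None
    # The reverse primer binds the opposite strand, so search for its
    # reverse complement downstream of the forward primer.
    downstream = seq[fwd.end():]
    rev = re.search(primer_regex(revcomp(rev_primer)), downstream)
    if rev is None:
        return None
    return downstream[:rev.start()]


def trim_records(records, fwd_primer, rev_primer):
    """Trim a {sequence_id: sequence} dict, dropping non-matching entries."""
    trimmed = {}
    for seq_id, seq in records.items():
        amplicon = trim_amplicon(seq, fwd_primer, rev_primer)
        if amplicon is not None:
            trimmed[seq_id] = amplicon
    return trimmed
```

With placeholder primers `ACGYA` and `TTGCA`, `trim_amplicon("GGGACGTACCCCAAAATGCAAGG", "ACGYA", "TTGCA")` keeps only the stretch between the two primer sites, and a sequence without a forward primer match yields `None`. The trimmed sequences would then go into the 99% clustering step.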

@lch14forever
Member

I have done the trimming and clustering for SILVA in fact. If you want to stick to GG, I can help you on this.

@andreas-wilm
Author

Cool! Yes please. With vsearch or uclust? Input would be /mnt/genomeDB/misc/greengenes.secondgenome.com/downloads/13_5/gg_13_5.fasta
I'm unsure how exactly to assign a taxonomy to each cluster though. The existing OTU clustering came with the assignment. Any idea?

Andreas
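
One possible way to assign a taxonomy to each new cluster, sketched here purely as a suggestion and not as something anyone in this thread has settled on: give the cluster the deepest lineage that all of its members agree on. The sketch assumes Greengenes-style lineage strings (ranks separated by `;`, with empty ranks written as e.g. `s__`), and the function names are hypothetical.

```python
def parse_lineage(lineage):
    """Split a Greengenes-style lineage string into a list of rank labels."""
    return [rank.strip() for rank in lineage.split(";") if rank.strip()]


def is_unassigned(rank_label):
    """True for empty ranks such as 'g__' or 's__'."""
    return rank_label.split("__", 1)[-1] == ""


def consensus_lineage(lineages):
    """Deepest lineage prefix shared by every member of a cluster.

    Stops at the first rank where the members disagree or where the
    rank is unassigned, so ambiguous clusters get a shorter lineage.
    """
    parsed = [parse_lineage(lineage) for lineage in lineages]
    consensus = []
    for ranks in zip(*parsed):
        first = ranks[0]
        if is_unassigned(first) or any(rank != first for rank in ranks):
            break
        consensus.append(first)
    return "; ".join(consensus)
```

For a cluster whose members agree down to `g__Lactobacillus` but differ (or are empty) at species level, this returns the lineage up to the genus. A majority vote per rank would be a less conservative alternative.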

@lch14forever
Member

That is tricky then... Previously I kept only sequences with a species-level assignment.

Chenhao.


@paolaflorez

Hey guys, thanks for chasing this. Chenhao, do I understand correctly that you would retain only the sequences with a species-level assignment in the gg_13_5.fasta file? The 99_otu_taxonomy.txt file contains 203,452 entries. Only 16,869 of these can be assigned to one species. In total we have 639 unique species in there. If your suggestion is to keep only the 16,869, it seems drastic to cut out so many entries.

# Command to count the entries that have a species-level designation:
grep -c 's__[A-Za-z0-9]' 99_otu_taxonomy.txt

Cheers,
Paola
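
For anyone who wants to reproduce these counts without grep, here is a small sketch that reports the total number of entries, the number with a species-level designation (same pattern as the grep above), and the number of distinct species. It assumes one lineage per line with Greengenes-style `g__` and `s__` labels, and it counts a species as a unique genus plus species pair, which is my assumption and may not be how the 639 figure was derived. The function name is hypothetical.

```python
import re

# Same idea as the grep pattern: "s__" followed by at least one character.
SPECIES = re.compile(r"s__([A-Za-z0-9][^;\s]*)")
GENUS = re.compile(r"g__([^;\s]*)")


def species_summary(lines):
    """Return (total entries, entries with species, unique genus/species pairs)."""
    total = 0
    with_species = 0
    unique = set()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        total += 1
        species = SPECIES.search(line)
        if species is None:
            continue
        with_species += 1
        genus = GENUS.search(line)
        unique.add((genus.group(1) if genus else "", species.group(1)))
    return total, with_species, len(unique)


# Usage (file name taken from the comment above):
# with open("99_otu_taxonomy.txt") as handle:
#     print(species_summary(handle))
```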

@andreas-wilm
Author

Let's not do that. This will just introduce a bias. I'm happy to live with some ambiguity instead.
