Trim GG before clustering #1

andreas-wilm · 2016-01-26T11:06:52Z

The classification database (99% OTUs) should have been primer-trimmed before clustering, instead of using the pre-clustered database. Pointed out by Christophe LAY.

lch14forever · 2016-01-26T14:25:19Z

Do you need help on this?

andreas-wilm · 2016-01-26T14:28:38Z

Sure :) Are you sure you have the time?

The issue is this: for classification we map against a version of Greengenes that is pre-clustered at 99% identity. To keep things comparable, though, the clustering should have happened after primer trimming. So we would need to primer-trim Greengenes (and discard the sequences that don't match the primers?) and then cluster at 99%.

Andreas
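To make the "primer trim, discard non-matching" step concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the thread does not name the primers, so `fwd_primer` and `rev_primer` are hypothetical parameters, matching is exact (no mismatches or indels), and a real run would more likely use a dedicated trimming tool.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def primer_regex(primer):
    """Turn a (possibly degenerate) primer into a regex pattern."""
    return "".join(IUPAC[base] for base in primer.upper())


def reverse_complement(seq):
    """Reverse-complement a DNA sequence, including IUPAC ambiguity codes."""
    return seq.upper().replace("U", "T").translate(COMPLEMENT)[::-1]


def trim_amplicon(seq, fwd_primer, rev_primer):
    """Return the region between the primers, or None if a primer is missing.

    Hypothetical helper: exact matching only, primers themselves are removed.
    rev_primer is given 5'->3' as ordered, so its reverse complement is
    searched for downstream of the forward primer.
    """
    seq = seq.upper().replace("U", "T")
    fwd = re.search(primer_regex(fwd_primer), seq)
    if fwd is None:
        return None  # discard: forward primer not found
    downstream = seq[fwd.end():]
    rev = re.search(primer_regex(reverse_complement(rev_primer)), downstream)
    if rev is None:
        return None  # discard: reverse primer not found
    return downstream[:rev.start()]
```

Sequences for which `trim_amplicon` returns `None` would be dropped before clustering at 99%, which is the "discard the ones not matching the primer" option raised above.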

lch14forever · 2016-01-26T14:30:55Z

In fact, I have already done the trimming and clustering for SILVA. If you want to stick with GG, I can help you with this.

andreas-wilm · 2016-01-26T14:48:34Z

Cool! Yes please. With vsearch or uclust? Input would be /mnt/genomeDB/misc/greengenes.secondgenome.com/downloads/13_5/gg_13_5.fasta
I'm unsure how exactly to assign a taxonomy to each cluster though. The existing OTU clustering came with the assignment. Any idea?

Andreas
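One possible answer to the taxonomy question, sketched in Python: give each new cluster the deepest lineage that all of its members share. This is an assumption for illustration, not something the thread settled on. The function names are hypothetical, and the input is assumed to be Greengenes-style lineage strings such as `k__Bacteria; p__Firmicutes; ...; s__`.

```python
def parse_lineage(lineage):
    """Split a Greengenes-style lineage into ranks, stopping at the first empty rank."""
    ranks = []
    for rank in lineage.split(";"):
        rank = rank.strip()
        # An entry such as "s__" has a prefix but no name: treat it as unassigned.
        if len(rank) <= 3:
            break
        ranks.append(rank)
    return ranks


def consensus_lineage(lineages):
    """Return the deepest lineage prefix shared by every member of a cluster."""
    parsed = [parse_lineage(lineage) for lineage in lineages]
    consensus = []
    # zip stops at the shortest lineage, so an unassigned rank in any
    # member caps the consensus at the rank above it.
    for ranks in zip(*parsed):
        if len(set(ranks)) != 1:
            break
        consensus.append(ranks[0])
    return "; ".join(consensus)
```

With this rule, a single member lacking a species name pulls the whole cluster up to genus level. A majority vote per rank would be a less conservative alternative.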

lch14forever · 2016-01-26T14:55:25Z

That is tricky then... Previously I kept only sequences with a species-level assignment.

Chenhao.

paolaflorez · 2016-01-27T14:51:50Z

Hey guys, thanks for chasing this. Chenhao, do I understand correctly that you retained only the sequences with a species-level assignment from the gg_13_5.fasta file? The 99_otu_taxonomy.txt file contains 203,452 entries. Only 16,869 of these are assigned to a species, and in total there are 639 unique species in there. If your suggestion is to keep only those 16,869, it seems drastic to cut out so many entries.

# Command to count the entries that have a species-level designation.
grep -c 's__[A-Za-z0-9]' 99_otu_taxonomy.txt

Cheers,
Paola
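For anyone wanting to reproduce these counts beyond the `grep -c`, a small Python sketch along the same lines follows. It assumes each line of the taxonomy file holds an OTU ID followed by a Greengenes-style lineage, and it counts a species as a unique (genus, species) pair. That definition is an assumption and may not be how the 639 figure above was derived. `species_summary` is a hypothetical helper name.

```python
import re
from collections import Counter

# Same idea as the grep pattern: "s__" followed by at least one name character.
SPECIES = re.compile(r"s__([A-Za-z0-9][^;\s]*)")
GENUS = re.compile(r"g__([^;\s]*)")


def species_summary(lines):
    """Return (total entries, species-level entries, unique species)."""
    total = 0
    species = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        total += 1
        match = SPECIES.search(line)
        if match:
            genus = GENUS.search(line)
            key = (genus.group(1) if genus else "", match.group(1))
            species[key] += 1
    return total, sum(species.values()), len(species)
```

It can be run on the real file by passing an open file handle, for example `species_summary(open("99_otu_taxonomy.txt"))`.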

andreas-wilm · 2016-01-27T14:56:32Z

Let's not do that. It would just introduce a bias. I'm happy to live with some ambiguity instead.

lch14forever closed this as completed Jan 26, 2016

andreas-wilm reopened this Jan 26, 2016

andreas-wilm added the enhancement label May 17, 2016
