How to format the taxonomy file to retrain classifier #18

Open
yingeddi2008 opened this issue Nov 30, 2016 · 36 comments

@yingeddi2008

yingeddi2008 commented Nov 30, 2016

Hi rdp staff,

I am trying to retrain the RDP classifier using the NCBI 16S database. However, after looking at the example taxonomy file and the FASTA file, I am not sure how to generate the taxonomy file.

0*Root*-1*0*rootrank
1*Bacteria*0*1*domain
2*"Actinobacteria"*1*2*phylum
3*Actinobacteria*2*3*class
4*Acidimicrobidae*3*4*subclass
5*Acidimicrobiales*4*5*order
6*"Acidimicrobineae"*5*6*suborder
7*Acidimicrobiaceae*6*7*family
8*Acidimicrobium*7*8*genus
9*Ferrimicrobium*7*8*genus
10*Ferrithrix*7*8*genus
11*Ilumatobacter*7*8*genus
12*Iamiaceae*6*7*family
3102*Aquihabitans*12*8*genus
13*Iamia*12*8*genus

Could you please explain how each line is constructed? Let me take one line as an example:

6*"Acidimicrobineae"*5*6*suborder

I would guess that the first number, 6, is the taxonomy ID for Acidimicrobineae, and that its parent is taxon 5, Acidimicrobiales. I assume that the suborder at the end of the line indicates that Acidimicrobineae is at the taxonomic rank of suborder, right? Then what does the 6 before suborder mean? When I look at 12*Iamiaceae*6*7*family, I can see that Iamiaceae is a family-level taxon, but does it have two parents, 6 (Acidimicrobineae) and 7 (Acidimicrobiaceae)? I am not sure I am getting what's the rule of constructing the taxonomy file here. Could you please explain how this is done?

Thanks in advance,

Eddi

@rdpstaffmsu

rdpstaffmsu commented Nov 30, 2016 via email

@yingeddi2008
Author

Hi Benli,

Thanks for your prompt response. I am now looking for the two scripts you mentioned in the reply, lineage2taxTrain.py and addFullLineage.py. Could you please show me where they are included? Are they in the RDP zipped folder?

Thanks,

Eddi

@chaibenl
Contributor

chaibenl commented Dec 1, 2016 via email

@yingeddi2008
Author

I still don't see them...

@rdpstaffmsu

rdpstaffmsu commented Dec 1, 2016 via email

@yingeddi2008
Author

You can send the scripts to [email protected]. Thanks.

@yingeddi2008
Author

yingeddi2008 commented Dec 5, 2016

Hi Benli,

I haven't heard from you about the scripts for a while. I would really appreciate it if you could follow up on this issue.

Thanks a lot!

Eddi

@chaibenl
Contributor

chaibenl commented Dec 5, 2016 via email

@yingeddi2008
Author

Thanks Benli, I received them.

@yingeddi2008
Author

yingeddi2008 commented Dec 5, 2016

Hi Benli,

I am trying the scripts you provided by email to retrain the RDP classifier using the NCBI 16S database, but I get error messages when I train with the files generated by your scripts.

I have generated the FASTA file with the lineage added to each sequence ID. You can download it from https://www.dropbox.com/s/86uqecg3iflrom5/16SMicrobial.ready4train.fasta?dl=0

I also have the taxonomy file in RDP-compatible format. You can download it from https://www.dropbox.com/s/rnmw2izjdsdc39f/16SMicrobial.ready4train.taxonomy?dl=0

When I tried to train using the following command (I am using version 2.12):
java -Xmx1g -jar /Users/huaiyinglin/Downloads/rdp_classifier_2.12/dist/classifier.jar train -o 16S_ncbi -s 16SMicrobial.ready4train.fasta -t 16SMicrobial.ready4train.taxonomy

Error messages like the following appear:

edu.msu.cme.rdp.classifier.train.NameRankDupException: Error: duplicate taxon name and rank in the taxonomy file.
ponticoccus genus 2
at edu.msu.cme.rdp.classifier.train.TreeFactory.creatTaxidMap(TreeFactory.java:126)
at edu.msu.cme.rdp.classifier.train.TreeFactory.<init>(TreeFactory.java:61)
at edu.msu.cme.rdp.classifier.train.ClassifierTraineeMaker.<init>(ClassifierTraineeMaker.java:63)
at edu.msu.cme.rdp.classifier.train.ClassifierTraineeMaker.main(ClassifierTraineeMaker.java:170)
at edu.msu.cme.rdp.classifier.cli.ClassifierMain.main(ClassifierMain.java:77)

I looked up the genus name "ponticoccus" listed in the error message and found three entries for Ponticoccus at genus level, but for three different species. Since I want to train at species level, I made sure when building the taxonomy file that there is no duplicated taxonomy information, so each sequence should be taxonomically unique at species level.

It seems to me that the RDP classifier can only be trained at genus level, even when I provide species-level information. Could you please help me figure out how I can train at species level?

Thanks a lot in advance!

Eddi
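
One way to look for the entries behind a NameRankDupException before training is to group the taxonomy lines by name and rank. The sketch below is not part of the RDP tools. It assumes the five-field `taxid*name*parentid*depth*rank` layout shown at the top of this thread and, because the error message prints the name in lower case, it compares names case-insensitively. The function name `find_name_rank_duplicates`, the sample IDs, and the placeholder family names are made up for illustration.

```python
from collections import defaultdict


def find_name_rank_duplicates(lines):
    """Map (lowercased name, rank) to the taxids that share it, keeping only duplicates."""
    seen = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumed layout: taxid*name*parentid*depth*rank
        taxid, name, _parent_id, _depth, rank = line.split("*")
        seen[(name.lower(), rank.lower())].append(int(taxid))
    return {key: ids for key, ids in seen.items() if len(ids) > 1}


# Hypothetical excerpt: the same genus name placed under two different families.
example = [
    "100*FamilyA*50*5*family",
    "101*Ponticoccus*100*6*genus",
    "200*FamilyB*60*5*family",
    "201*Ponticoccus*200*6*genus",
]

for (name, rank), ids in find_name_rank_duplicates(example).items():
    print(f"{name} ({rank}) appears under taxids {ids}")
```

Whether every reported pair actually triggers the exception may depend on the classifier version, so the output is best treated as a list of candidates to inspect.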

@rdpstaffmsu

rdpstaffmsu commented Dec 7, 2016 via email

@yingeddi2008
Author

Thanks Benli, I see where the problem is. I will remove any convergent evolution and try again.

@AnnyYoung

AnnyYoung commented Apr 11, 2017

Hi Eddi,

I ran into a problem with rdp_classifier-2.4.jar. I already checked the format of the .fasta and taxonomy.txt files as described in "How to format the taxonomy file to retrain classifier #18", but I get the same error, so I want to try lineage2taxTrain.py and addFullLineage.py. Could you give me these two scripts, please?

The error looks like this:
Exception in thread "main" java.lang.IllegalArgumentException:
Illegal taxonomy format at 3260**32597genus
at edu.msu.cme.rdp.classifier.train.TreeFactory.creatTaxidMap(TreeFactory.java:79)
at edu.msu.cme.rdp.classifier.train.TreeFactory.<init>(TreeFactory.java:58)
at edu.msu.cme.rdp.classifier.train.ClassifierTraineeMaker.<init>(ClassifierTraineeMaker.java:40)
at edu.msu.cme.rdp.classifier.train.ClassifierTraineeMaker.main(ClassifierTraineeMaker.java:131)

Thanks a lot!

anny
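
An "Illegal taxonomy format" error points at a single line, but a file may contain more than one. The following sketch scans a taxonomy file for lines that do not fit the `taxid*name*parentid*depth*rank` layout from the example at the top of this thread. The exact rules the classifier enforces are not stated here, so the three checks (five fields, no empty field, integer IDs and depth) are assumptions, as are the function names and the deliberately broken sample lines.

```python
def _is_int(text):
    """Return True if text parses as an integer (a leading minus sign is allowed)."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def find_malformed_lines(lines):
    """Return (line number, reason) pairs for lines that break the assumed layout."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue
        fields = text.split("*")
        if len(fields) != 5:
            problems.append((lineno, f"expected 5 fields, found {len(fields)}"))
            continue
        taxid, _name, parent_id, depth, _rank = fields
        if not all(field.strip() for field in fields):
            problems.append((lineno, "empty field"))
        elif not (_is_int(taxid) and _is_int(parent_id) and _is_int(depth)):
            problems.append((lineno, "non-integer id or depth"))
    return problems


# The first line is from the thread; the other three are made-up broken lines.
example = [
    "8*Acidimicrobium*7*8*genus",
    "3260**3259*7*genus",          # empty taxon name
    "9*Ferrimicrobium*7*genus",    # one field missing
    "10*Ferrithrix*seven*8*genus", # parent id is not a number
]

for lineno, reason in find_malformed_lines(example):
    print(f"line {lineno}: {reason}")
```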

@TurbulentCupcake

Hi RDP Staff,

Can you pass on the script used to create the taxonomy file mentioned earlier in the thread? I would greatly appreciate it.

Thanks,
Adithya

@chaibenl
Contributor

chaibenl commented Jun 23, 2017 via email

@TurbulentCupcake

Hi,

Thanks for the response, but I am unable to see them. Can you forward them to [email protected]?

Thanks,
Adithya

@jbholm

jbholm commented Jan 8, 2018

Hi, I'm looking for the following scripts for retraining the classifier with a new lineage. Are these publicly available somewhere, or must they be emailed? If so, my email is [email protected]

Thanks!

lineage2taxTrain.py
addFullLineage.py

@xysswang

Hi RDP Team,

I would like to create my own training data. Could you also send me the scripts lineage2taxTrain.py and addFullLineage.py? I would really appreciate it. My email address is [email protected]

Thanks !

@AmeLaporte

Hello RDP Team,
I am also interested in the scripts to generate my training set. Would it be possible to receive them at my email address, [email protected]?

Thanks!
Amélie

@mbenucci

mbenucci commented Apr 10, 2018

Hi all,
I came across this issue recently, like many other users, while testing the classifier and training it with our own reference sequence files. I think I managed to find the above-mentioned Python scripts that allow correct formatting of the files for retraining the RDP classifier. I cloned them and checked the scripts to make sure they do what I thought they were supposed to do, and they seem to be working fine for me.

https://github.com/GLBRC-TeamMicrobiome/python_scripts.git

I hope this helps others as well.
Marco

@chaibenl
Contributor

chaibenl commented May 22, 2018 via email

@jbholm

jbholm commented May 22, 2018 via email

@anaya1

anaya1 commented May 31, 2018

Dear Benli,
Would it be possible to provide me with your scripts lineage2taxTrain.py and addFullLineage.py? I would highly appreciate your help.

Anna Alessi ([email protected])

@anaya1

anaya1 commented Jun 4, 2018

Dear Benli,
Thank you for providing the scripts. I have successfully created a ready4train_taxonomy.txt file for my database. However, when I try to add the lineage to my rawSeq.fasta, the output file says "AB001438 not in taxonomy file". What do you think is causing it? I have checked the taxonomy and sequence files, and they both contain AB001438. In fact, it is the first entry in both files. Many thanks,
Anna

@anaya1

anaya1 commented Jun 4, 2018

Hi Benli,
I know why I had the previous issue. You must use: python addFullLineage.py rawtax.txt rawSeq.fasta > ready4train_seq.fasta. Now, however, I have another problem while trying to train my database:
Exception in thread "main" java.lang.IllegalArgumentException: Sequence AY230195 has different lowest rank: Genus from the previous lowest rank: Species
at edu.msu.cme.rdp.classifier.train.TreeFactory.addSequencewithLineage(TreeFactory.java:278)
at edu.msu.cme.rdp.classifier.train.TreeFactory.parseSequenceFile(TreeFactory.java:152)
at edu.msu.cme.rdp.classifier.train.ClassifierTraineeMaker.<init>(ClassifierTraineeMaker.java:65)
at edu.msu.cme.rdp.classifier.train.ClassifierTraineeMaker.main(ClassifierTraineeMaker.java:170)
at edu.msu.cme.rdp.classifier.cli.ClassifierMain.main(ClassifierMain.java:77)
I was wondering if you could comment on it and help me to solve this problem.
Thanks,
Anna
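
The exception above complains that sequences end at different ranks. A quick way to spot such sequences is to count the lineage levels in each FASTA header and list the headers that differ from the most common count. This is only a sketch: it assumes headers of the form `>seqid<TAB>Root;Bacteria;...` (identifier, whitespace, then a semicolon-separated lineage), which this thread does not spell out, and it uses the number of levels as a stand-in for the lowest rank. The function names and the sample headers are hypothetical.

```python
from collections import Counter


def lineage_depths(fasta_lines):
    """Map each sequence id to the number of lineage levels in its header."""
    depths = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence data, not a header
        parts = line[1:].strip().split(None, 1)
        if not parts:
            continue  # header with no identifier
        seq_id = parts[0]
        lineage = parts[1] if len(parts) > 1 else ""
        levels = [level for level in lineage.split(";") if level.strip()]
        depths[seq_id] = len(levels)
    return depths


def unusual_depths(depths):
    """Return the ids whose level count differs from the most common one."""
    if not depths:
        return {}
    common, _count = Counter(depths.values()).most_common(1)[0]
    return {seq_id: d for seq_id, d in depths.items() if d != common}


# Hypothetical headers: two lineages end at species, one stops at genus.
example = [
    ">seqA\tRoot;Bacteria;PhylumX;GenusY;SpeciesZ",
    "ACGTACGT",
    ">seqB\tRoot;Bacteria;PhylumX;GenusY;SpeciesW",
    "ACGTTTGT",
    ">AY230195\tRoot;Bacteria;PhylumX;GenusY",
    "ACGTAAGT",
]

for seq_id, depth in unusual_depths(lineage_depths(example)).items():
    print(f"{seq_id}: {depth} lineage levels")
```

Sequences reported this way would either need a lineage extended to the same lowest rank as the rest or be left out of the training set.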

@chaibenl
Contributor

chaibenl commented Jun 4, 2018 via email

@rwst

rwst commented Aug 7, 2018

@yingeddi2008

I am not sure I am getting what's the rule of constructing the taxonomy file here. Could you please explain how this is done?

I'm pretty sure that the second number is the level, i.e. depth of the tree node, not another id.
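
Read this way, each line is `taxid*name*parentid*depth*rank`, and the sample lines from the first post are consistent with it: every taxon sits exactly one level below its parent. The sketch below parses lines under that reading and checks the depth rule. It is an illustration, not RDP code; the names `Taxon`, `parse_taxonomy_line`, and `check_depths` are invented here, and treating a parent ID of -1 as the root marker is inferred from the `0*Root*-1*0*rootrank` line.

```python
from typing import List, NamedTuple


class Taxon(NamedTuple):
    taxid: int
    name: str
    parent_id: int
    depth: int
    rank: str


def parse_taxonomy_line(line: str) -> Taxon:
    """Split one 'taxid*name*parentid*depth*rank' line into typed fields."""
    taxid, name, parent_id, depth, rank = line.strip().split("*")
    return Taxon(int(taxid), name, int(parent_id), int(depth), rank)


def check_depths(taxa: List[Taxon]) -> List[str]:
    """Report taxa whose parent is unknown or whose depth is not parent depth + 1."""
    by_id = {taxon.taxid: taxon for taxon in taxa}
    problems = []
    for taxon in taxa:
        parent = by_id.get(taxon.parent_id)
        if parent is None:
            if taxon.parent_id != -1:  # -1 is assumed to mark the root
                problems.append(f"{taxon.name}: unknown parent {taxon.parent_id}")
        elif taxon.depth != parent.depth + 1:
            problems.append(
                f"{taxon.name}: depth {taxon.depth}, expected {parent.depth + 1}"
            )
    return problems


# Lines copied from the first post in this thread.
sample = [
    '0*Root*-1*0*rootrank',
    '1*Bacteria*0*1*domain',
    '2*"Actinobacteria"*1*2*phylum',
    '3*Actinobacteria*2*3*class',
    '4*Acidimicrobidae*3*4*subclass',
    '5*Acidimicrobiales*4*5*order',
    '6*"Acidimicrobineae"*5*6*suborder',
    '7*Acidimicrobiaceae*6*7*family',
    '12*Iamiaceae*6*7*family',
    '13*Iamia*12*8*genus',
]

taxa = [parse_taxonomy_line(line) for line in sample]
print(check_depths(taxa))
```

Under this reading, `12*Iamiaceae*6*7*family` has a single parent, taxon 6 (Acidimicrobineae), and sits at depth 7.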

@rwst

rwst commented Aug 7, 2018

@rdpstaffmsu
Is the copynumber file required? If so, can you please add information on what exactly should the content be?

@andrewmaltezthomas

Dear @rdpstaffmsu

Could you send the python scripts:

lineage2taxTrain.py

addFullLineage.py

To my email address:

[email protected]

Thanks

@050114dragon

Dear @rdpstaffmsu

I would be very grateful if you could send me lineage2taxTrain.py and addFullLineage.py.
My email is [email protected].

Thanks

@wangchao-malab

Dear Benli,
Would it be possible to provide me with your scripts lineage2taxTrain.py and addFullLineage.py? I would highly appreciate your help.
My email is: [email protected]

@ghost

ghost commented Nov 2, 2019

Dear @rdpstaffmsu

Can you please send me a copy of lineage2taxTrain.py and addFullLineage.py? My email is [email protected]

Thanks

@wangchao-malab

wangchao-malab commented Nov 7, 2019 via email

@ghost

ghost commented Nov 13, 2019

Hello,

Thanks for your response. The university email associated with my github account automatically rejects any attachments with code in them. Could you please resend the scripts to my gmail account [email protected]?

Thanks for your help,
Carter

@LIU3379

LIU3379 commented May 11, 2021

Hi RDP Team,

I would like to create my own training data. Could you also send me the scripts lineage2taxTrain.py and addFullLineage.py? I would really appreciate it. My email address is [email protected]

Thanks !

@ctb

ctb commented Jul 7, 2021

Note to the RDP team: you can attach the files to this GitHub issue by renaming them as .txt files and adding them through the web interface, if you like. Or if you send them to me at [email protected] I can do that for you :)
