Kraken2 needs species level classification for 16S. #911
Comments
I pushed a change to the kraken2 repo that updates
Here's an example run:
|
Does this change fix the species-level issue? Or does it just switch to the latest version of SILVA? |
Apologies! I was too keyboard happy and didn't see the rest of the message! Your modifications seem good so far. The reason why the sequences are not being classified could be because you did not update the seqid2taxid.map which looks something like this:
This file tells kraken what taxid a particular sequence id maps to. Example: Given the following sequence:
And your sample taxonomy (modified so that the taxids are whole numbers):
You will need to add an entry in the seqid2taxid.map that maps Please make sure that your taxids are unique, however, and if there is already an entry for |
I don't think you are reading the post at all :(. I did what you said, in this part of my post: ACC_TAXID: TAX: . As you can see, they are all unique. I don't know what happens after this point. |
I missed that part, but what I was trying to point out is that the taxid needs to be an integer, not a float. That may be the reason why kraken2 is not able to map the sequence to the ID. |
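As an illustration of the integer requirement discussed above, here is a minimal Python sketch that turns a decimal-style taxid such as `12591.3` into a whole number by appending the part after the dot as a zero-padded counter. The function name `to_integer_taxid` and the `width` parameter are hypothetical, chosen only for this example; whether the resulting numbers collide with existing SILVA taxids is not checked here and would need to be verified separately.

```python
def to_integer_taxid(taxid: str, width: int = 5) -> int:
    """Turn a decimal-style taxid such as '12591.3' into a whole number.

    The part before the dot is kept, and the part after the dot is treated
    as a counter and appended with zero padding, e.g. '12591.3' becomes
    1259100003 when width is 5. (Hypothetical helper, not part of kraken2.)
    """
    base, _, suffix = taxid.partition(".")
    if not suffix:
        # Already a whole number: leave it unchanged.
        return int(base)
    if len(suffix) > width:
        raise ValueError(f"suffix of {taxid!r} does not fit in {width} digits")
    return int(base) * 10**width + int(suffix)


if __name__ == "__main__":
    for raw in ["12591", "12591.3", "12591.90"]:
        print(raw, "->", to_integer_taxid(raw))
```

The same function would have to be applied to both the accession-to-taxid file and the taxonomy file so that the two stay consistent.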
I tried again. It gave the same error... I really need help here. I changed the decimal numbers to integers. Here is the head of each file I edited again: --------------------SILVA_138.2_SSURef_NR99_tax_silva.fasta (I didn't change anything here):
----------------------tax_slv_ssu_138.2.acc_taxid: What I did: to get rid of the decimal points, I appended a 0000n counter to the end of each taxid. So if it was it became ----------------------tax_slv_ssu_138.2.txt: . Then I ran this code: build_silva_taxonomy.pl "${TAXO_PREFIX}.txt" kraken2-build --db $KRAKEN2_DB_NAME --build --threads $KRAKEN2_THREAD_CT And got the same result as before: Creating sequence ID to taxonomy ID map (step 1)... Well, I have run out of ideas. I can't see what I am doing wrong. I'm open to any help at this point. |
I’ll see what I can do for you this evening.
On Feb 16, 2025, at 11:55 AM, Aytaç Dursun Öksüzoğlu wrote:
I tried again. It gave the same error... I really need help here. I changed the decimal numbers to integers. Here is the head of each file I edited again:
--------------------SILVA_138.2_SSURef_NR99_tax_silva.fasta (I didn't change anything here):
>AY846379.1.1791 Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium sp. Itas 9/21 14-6w
sequence
>AY846382.1.1778 Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium contortum
sequence
----------------------tax_slv_ssu_138.2.acc_taxid:
AY846379.1.1791 1259100001
AY846382.1.1778 1259100002
AB000393.1.1510 6268800001
AY909590.1.2352 4990200001
AB000480.1.1326 6185900001
BD359736.3.2150 877500001
AY909586.1.1754 4990200002
AY955003.1.1727 1263400001
AY955005.1.1727 1263400002
AB001783.1.1507 5993600001
FZ423313.1.1291 5865800001
HG529990.1.1403 4406900001
AB001778.1.1507 5993600002
AB001521.1.1560 6242000001
AB000479.1.1326 6185900002
AY846384.1.2409 1259100003
What I did: to get rid of the decimal points, I appended a 0000n counter to the end of each taxid. So if it was
AY846379.1.1791 12591
AY846382.1.1778 12591
AYxxxxxx.x.xxxx(Lets say 90th) 12591
it became
AY846379.1.1791 1259100001
AY846382.1.1778 1259100002
AYxxxxxx.x.xxxx(Lets say 90th) 1259100090
----------------------tax_slv_ssu_138.2.txt:
Archaea; 2 domain
Archaea;Aenigmarchaeota; 11084 phylum 123
Archaea;Aenigmarchaeota;Aenigmarchaeia; 42913 class 138
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales; 42914 order 138
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;Candidatus Aenigmarchaeum; 42915 genus 138
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;Incertae Sedis; 57376 family 138.2
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;Incertae Sedis;Candidatus Aenigmarchaeum; 57377 genus 138.2
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700001 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700002 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700003 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700004 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedmarinearchaeon; 5737700005 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedmarinearchaeon; 5737700006 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700007 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedeuryarchaeote; 5737700008 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedeuryarchaeote; 5737700009 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700010 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700011 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700012 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700013 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700014 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700015 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;CandidatusAenigmarchaeumsubterraneumSCGCAAA011-O16; 5737700016 species 2025
.
.
.
.
.
Then I ran this code:
mkdir -p "$KRAKEN2_DB_NAME"
pushd "$KRAKEN2_DB_NAME"
mkdir -p data taxonomy library
pushd data
wget "$REMOTE_DIR/${FASTA_FILENAME}.gz"
gunzip "${FASTA_FILENAME}.gz"
wget "$REMOTE_DIR/taxonomy/${TAXO_PREFIX}.acc_taxid.gz"
gunzip "${TAXO_PREFIX}.acc_taxid.gz"
wget "$REMOTE_DIR/taxonomy/${TAXO_PREFIX}.txt.gz"
gunzip "${TAXO_PREFIX}.txt.gz"
build_silva_taxonomy.pl "${TAXO_PREFIX}.txt"
popd
mv data/names.dmp data/nodes.dmp taxonomy/
mv data/${TAXO_PREFIX}.acc_taxid seqid2taxid.map
sed -e '/^>/!y/U/T/' "data/$FASTA_FILENAME" > library/silva.fna
popd
kraken2-build --db $KRAKEN2_DB_NAME --build --threads $KRAKEN2_THREAD_CT
And got the same result as before:
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map already present, skipping map creation.
Estimating required capacity (step 2)...
Estimated hash table requirement: 147564248 bytes
Capacity estimation complete. [5.321s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 19 bits reserved for taxid.
Completed processing of 0 sequences, 0 bp <------------- IT DIDN'T READ THEM AT ALL AGAIN!
Writing data to disk... complete.
Database files completed. [4.149s]
Database construction complete. [Total: 9.490s]
Well, I have run out of ideas. I can't see what I am doing wrong. I'm open to any help at this point.
|
I was able to successfully do it as evidenced below
1259100 is my made up tax ID. Here are the steps that I took. I used If you do not have
This:
Changes to:
If the process hangs after it has indicated that it is completed then simply terminate the process by pressing
Output:
You can see that AB600437.1.1389, the only sequence that we changed, got classified with our new tax ID, 5737700. I hope this helps! |
I will try to understand your steps before I move on. But I guess the problem I'm having comes from this line of code:
It gives an error saying there is a problem with a strange input on line 8, which is the one I added. As you can see, it is the 5737700001 line.
I tried the same with other lines: I iterated over all of them in Python to see what the separator between the fields is. The pattern is: But I still don't clearly understand what I am missing in the tax_slv_ssu_138.2.txt file that I updated. If we can solve this, I hope there will be no problem building the database. I will try your code. |
This is the error:
|
Did you make sure that the fields are separated by tabs? |
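A quick way to answer the tab question is to scan the edited taxonomy file before passing it to `build_silva_taxonomy.pl`. The sketch below is only an illustration: the function name `find_malformed_lines` is hypothetical, and it assumes the column order seen in the excerpts in this thread (taxonomy path, taxid, rank, then further fields), with tabs as the separator.

```python
def find_malformed_lines(lines, min_fields=3):
    """Return (line_number, reason) pairs for lines that look malformed.

    Assumes each line is tab-separated with the taxonomy path first,
    the taxid second and the rank third (an assumption based on the
    file excerpts quoted in this thread).
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < min_fields:
            problems.append(
                (number, f"expected at least {min_fields} tab-separated "
                         f"fields, found {len(fields)}")
            )
            continue
        if not fields[1].strip().isdigit():
            problems.append((number, f"taxid {fields[1]!r} is not a whole number"))
    return problems


if __name__ == "__main__":
    sample = [
        "Archaea;Aenigmarchaeota;\t11084\tphylum\t\t123\n",   # tabs, fine
        "Archaea;Aenigmarchaeota; 11084 phylum 123\n",        # spaces only
        "Eukaryota;Monoraphidium;\t12591.1\tspecies\t\t2025\n",  # decimal taxid
    ]
    for number, reason in find_malformed_lines(sample):
        print(f"line {number}: {reason}")
```

To run it on a real file, pass an open file handle, for example `find_malformed_lines(open("tax_slv_ssu_138.2.txt"))`.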
I tried to create a database from SILVA's 138.2 release archive (https://www.arb-silva.de/no_cache/download/archive/release_138_2/).
FASTA - SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz
ACC_TAXID - tax_slv_ssu_138.2.acc_taxid.gz
TAX - tax_slv_ssu_138.2.txt.gz
The FASTA file has the sequences. As far as I understand what kraken2 does, it takes sequences from your FASTQ/FASTA data and matches them against the FASTA sequences. Then it uses the headers and gets, let's say, ACCESSION TAXONOMY:
It returns the headers in some form of:
I think it gets the TAXID from ACC_TAXID based on the matching ACCESSION.
Then it gets the taxonomic name from TAX by using this TAXID and prints the TAXONOMY.
If I am wrong, please tell me.
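The two-step lookup described above (accession to taxid, then taxid to taxonomy path) can be sketched in Python as follows. This is only a simplified picture of the bookkeeping: kraken2 itself matches k-mers rather than whole sequences, and the function name `lookup_taxonomy` is hypothetical. The accession and taxid come from the excerpts in this thread; pairing taxid 12591 with the Monoraphidium genus path is inferred from those excerpts.

```python
# Accession -> taxid, as in the ACC_TAXID file (one pair taken from the thread).
acc_to_taxid = {
    "AY846379.1.1791": 12591,
}

# Taxid -> taxonomy path, as in the TAX file (pairing inferred from the thread).
taxid_to_path = {
    12591: "Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;"
           "Chlorophyceae;Sphaeropleales;Monoraphidium;",
}


def lookup_taxonomy(accession, acc_map, path_map):
    """Resolve an accession to its taxonomy path, or None if a step fails."""
    taxid = acc_map.get(accession)
    if taxid is None:
        return None  # accession missing from the accession-to-taxid map
    return path_map.get(taxid)  # None if the taxid is missing from the taxonomy


if __name__ == "__main__":
    print(lookup_taxonomy("AY846379.1.1791", acc_to_taxid, taxid_to_path))
```

If either lookup fails, the accession cannot be tied to a taxonomy entry, which is why the two files have to be edited consistently.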
#################
So the problem is that the SILVA TAX file is designed for genus level. I tried to add species and to update ACC_TAXID accordingly, because in ACC_TAXID:
ACCESSION_species_1 17414 (Random number)
ACCESSION_species_2 17414
ACCESSION_species_3 17414
Even though there is information for 3 species, all of them are assigned the same taxid, 17414. (This is a random number; I didn't check ACC_TAXID for a real one.)
Then in TAX it is a genus, like:
Bacteria;-------;-------;-------;Some_genus 17414.
I think this is SILVA's problem.
I tried to update them as in FASTA:
Example:
ACC_TAXID:
...
AY846379.1.1791 12591.3
AY846380.1.2583 12591.4
AY846382.1.1778 12591.5
AY846384.1.2409 12591.6
...
TAX:
...
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium sp. Itas 9/21 14-6w; 12591.1 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium contortum; 12591.2 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium saxatile; 12591.3 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium convolutum; 12591.4 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium minutum; 12591.5 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium contortum; 12591.6 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium sp. YLY-2; 12591.7 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium sp. C29; 12591.8 species 2025
...
Then I ran this code with my own file names. I also renamed ACC_TAXID.txt to ACC_TAXID.acc_taxid in Linux.
mkdir -p "$KRAKEN2_DB_NAME"
pushd "$KRAKEN2_DB_NAME"
mkdir -p data taxonomy library
pushd data
wget "$REMOTE_DIR/${FASTA_FILENAME}.gz"
gunzip "${FASTA_FILENAME}.gz"
wget "$REMOTE_DIR/taxonomy/${TAXO_PREFIX}.acc_taxid.gz"
gunzip "${TAXO_PREFIX}.acc_taxid.gz"
wget "$REMOTE_DIR/taxonomy/${TAXO_PREFIX}.txt.gz"
gunzip "${TAXO_PREFIX}.txt.gz"
build_silva_taxonomy.pl "${TAXO_PREFIX}.txt"
popd
mv data/names.dmp data/nodes.dmp taxonomy/
mv data/${TAXO_PREFIX}.acc_taxid seqid2taxid.map
sed -e '/^>/!y/U/T/' "data/$FASTA_FILENAME" > library/silva.fna
popd
kraken2-build --db $KRAKEN2_DB_NAME --build --threads $KRAKEN2_THREAD_CT
It built it but with this anomaly:
I think there is a problem :
(metagenomics) XXXX:~/Desktop/Aytac$ kraken2-build --db SILVA_species_db/ --build --threads 8
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map already present, skipping map creation.
Estimating required capacity (step 2)...
Estimated hash table requirement: 147564248 bytes
Capacity estimation complete. [6.402s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 14 bits reserved for taxid.
Completed processing of 0 sequences, 0 bp <------------------PROBLEMMMM!
Writing data to disk... complete.
Database files completed. [2.893s]
Database construction complete. [Total: 9.318s]
And it didn't match any of my sequences.
%0.00 classified.
I really need this for my final project. Does anyone know how to fix this? I saw a similar post, but I couldn't get that approach to work, so I need help with this.
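One possible explanation for "Completed processing of 0 sequences", not confirmed in this thread, is that the sequence IDs in the library do not resolve to a taxid that exists in the generated taxonomy. The Python sketch below cross-checks the three files under that assumption. All function names are hypothetical, the `nodes.dmp` layout is assumed to be the usual `|`-delimited NCBI style, and the accession `XX000000.1.10` is made up for the example.

```python
def fasta_ids(fasta_lines):
    """Collect the first token after '>' from each FASTA header line."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.append(tokens[0])
    return ids


def parse_seqid_map(map_lines):
    """Parse whitespace-separated 'seqid taxid' lines into a dict of strings."""
    mapping = {}
    for line in map_lines:
        tokens = line.split()
        if len(tokens) >= 2:
            mapping[tokens[0]] = tokens[1]
    return mapping


def nodes_taxids(nodes_lines):
    """Collect the first '|'-delimited field of each nodes.dmp line."""
    return {line.split("|")[0].strip() for line in nodes_lines if line.strip()}


def unresolved(fasta_lines, map_lines, nodes_lines):
    """Return (seqid, reason) for sequences that cannot be tied to a taxid."""
    mapping = parse_seqid_map(map_lines)
    known = nodes_taxids(nodes_lines)
    problems = []
    for seqid in fasta_ids(fasta_lines):
        taxid = mapping.get(seqid)
        if taxid is None:
            problems.append((seqid, "no entry in seqid2taxid.map"))
        elif taxid not in known:
            problems.append((seqid, f"taxid {taxid} not in nodes.dmp"))
    return problems


if __name__ == "__main__":
    fasta = [">AY846379.1.1791 Eukaryota;...\n", "ACGT\n",
             ">AY846382.1.1778 Eukaryota;...\n", "ACGT\n",
             ">XX000000.1.10 made-up example\n", "ACGT\n"]
    seqmap = ["AY846379.1.1791\t1259100001\n", "AY846382.1.1778\t1259100002\n"]
    nodes = ["1\t|\t1\t|\tno rank\t|\t-\t|\n",
             "1259100001\t|\t12591\t|\tspecies\t|\t-\t|\n"]
    for seqid, reason in unresolved(fasta, seqmap, nodes):
        print(seqid, "->", reason)
```

If every sequence shows up in the output for the real files, the edited taxonomy and map are probably out of sync, which would be worth fixing before rebuilding.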