Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kraken2 needs species level classification for 16S. #911

Open
aytovicc opened this issue Feb 8, 2025 · 11 comments
Open

Kraken2 needs species level classification for 16S. #911

aytovicc opened this issue Feb 8, 2025 · 11 comments

Comments

@aytovicc
Copy link

aytovicc commented Feb 8, 2025

I tried to create a database from SILVA's 138.2 respiratory. (https://www.arb-silva.de/no_cache/download/archive/release_138_2/).

FASTA - SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz
ACC_TAXID - tax_slv_ssu_138.2.acc_taxid.gz
TAX - tax_slv_ssu_138.2.txt.gz

FASTA folder has sequences. As far as i tried to understand what kraken2 does, it takes sequences from your fastq/fasta data and match with FASTA sequences. Than it uses the headers and gets the lets say ACCESSION TAXONOMY:

AB000393.1.1510 Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Vibrionaceae;Vibrio;Vibrio halioticoli
ACCESSION TAXONOMY

It returns the headers some form of:

ACCESSION|kraken:taxid|TAXID

I think it gets TAXID from ACC_TAXID based on its match from the ACCESSION.

Then it gets the taxonomic name from TAX by using this TAXID and prints the TAXONOMY.

If i am wrong please tell me.

#################
So the problem is SILVA TAX designed for Genus level. I tried to add species and tried to update ACC_TAXID. Because in ACC_TAXID:

ACCESSION_species_1 17414 (Random number)
ACCESSION_species_2 17414
ACCESSION_species_3 17414

Even though there is the knowledge for 3 species its designed as all the same for 17414. A random number i didnt checked the ACC_TAXID for a real one.

Then in TAX its for a genus like:
Bacteria;-------;-------;-------;Some_genus 17414.

I think this is the SILVA's problem.

I tried the update them as in FASTA:

ACCESSION_species_1|kraken:taxid|17414
sequence

Example:

AY846379.1.1791|kraken:taxid|12591.3

ACC_TAXID:
...
AY846379.1.1791 12591.3
AY846380.1.2583 12591.4
AY846382.1.1778 12591.5
AY846384.1.2409 12591.6
...

TAX:
...
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium sp. Itas 9/21 14-6w; 12591.1 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium contortum; 12591.2 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium saxatile; 12591.3 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium convolutum; 12591.4 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium minutum; 12591.5 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium contortum; 12591.6 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium sp. YLY-2; 12591.7 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium sp. C29; 12591.8 species 2025
...

Then i run this code as for my named files. I also changed ACC_TAXID.txt to ACC_TAXID.acc_taxid in Linux.

mkdir -p "$KRAKEN2_DB_NAME"
pushd "$KRAKEN2_DB_NAME"
mkdir -p data taxonomy library
pushd data
wget "$REMOTE_DIR/${FASTA_FILENAME}.gz"
gunzip "${FASTA_FILENAME}.gz"
wget "$REMOTE_DIR/taxonomy/${TAXO_PREFIX}.acc_taxid.gz"
gunzip "${TAXO_PREFIX}.acc_taxid.gz"
wget "$REMOTE_DIR/taxonomy/${TAXO_PREFIX}.txt.gz"
gunzip "${TAXO_PREFIX}.txt.gz"

build_silva_taxonomy.pl "${TAXO_PREFIX}.txt"
popd
mv data/names.dmp data/nodes.dmp taxonomy/
mv data/${TAXO_PREFIX}.acc_taxid seqid2taxid.map
sed -e '/^>/!y/U/T/' "data/$FASTA_FILENAME" > library/silva.fna
popd

kraken2-build --db $KRAKEN2_DB_NAME --build --threads $KRAKEN2_THREAD_CT

It built it but with this anomaly:

I think there is a problem :

(metagenomics) XXXX:~/Desktop/Aytac$ kraken2-build --db SILVA_species_db/ --build --threads 8
Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map already present, skipping map creation.
Estimating required capacity (step 2)...
Estimated hash table requirement: 147564248 bytes
Capacity estimation complete. [6.402s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 14 bits reserved for taxid.
Completed processing of 0 sequences, 0 bp <------------------PROBLEMMMM!
Writing data to disk... complete.
Database files completed. [2.893s]
Database construction complete. [Total: 9.318s]

And it didnt make a match to my sequence.
%0.00 classified.

I really need this for my finishing project. Does anyone else know how to fix this. I saw a post similar to this but i couldnt do that. So i need help with this.

@ch4rr0
Copy link
Collaborator

ch4rr0 commented Feb 13, 2025

I pushed a change to the kraken2 repo that updates k2 to use 138.2 of SILVA. Since the changes have not been officially released yet you will need to clone the latest kraken2 repo and compile the code. You can find the k2 script inside the scripts directory.
Silva can be built using the following command:

./scripts/k2 build --db <path_to_database> --special silva --threads <n>

Here's an example run:

./scripts/k2 build --db /tmp/silva --special silva --threads 4
[INFO - 2025-02-13 10:51:28,706]: Downloading ftp://ftp.arb-silva.de/release_138_2/Exports/SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz
[INFO - 2025-02-13 10:51:45,321]: Saved SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz to /tmp/silva/data
[INFO - 2025-02-13 10:51:45,528]: Downloading ftp://ftp.arb-silva.de/release_138_2/Exports/taxonomy/tax_slv_ssu_138.2.acc_taxid.gz
[INFO - 2025-02-13 10:51:47,193]: Saved tax_slv_ssu_138.2.acc_taxid.gz to /tmp/silva/data
[INFO - 2025-02-13 10:51:47,193]: Decompressing /tmp/silva/data/tax_slv_ssu_138.2.acc_taxid.gz
[INFO - 2025-02-13 10:51:47,241]: Finished decompressing /tmp/silva/data/tax_slv_ssu_138.2.acc_taxid.gz
[INFO - 2025-02-13 10:51:47,444]: Downloading ftp://ftp.arb-silva.de/release_138_2/Exports/taxonomy/tax_slv_ssu_138.2.txt.gz
[INFO - 2025-02-13 10:51:48,414]: Saved tax_slv_ssu_138.2.txt.gz to /tmp/silva/data
[INFO - 2025-02-13 10:52:04,395]: Masking low-complexity regions of downloaded library silva
[INFO - 2025-02-13 10:52:11,485]: A seqid2taxid.map already present and newer than any of the prelim_map.txt files, skipping
[INFO - 2025-02-13 10:52:11,501]: Running: /home/rcharles/src/git/kraken2/src/estimate_capacity -S 1111111111111111101010101010101 -k 35 -l 31 -p 4 -B 16384
[INFO - 2025-02-13 10:52:17,994]: Estimated hash table requirement: 138.86MB
[INFO - 2025-02-13 10:52:17,994]: Starting database build
[INFO - 2025-02-13 10:52:18,004]: Running: /home/rcharles/src/git/kraken2/src/build_db -H hash.k2d.tmp -t taxo.k2d.tmp -o opts.k2d.tmp -n taxonomy -m seqid2taxid.map -c 36401737.14285714 -S 1111111111111111101010101010101 -k 35 -l 31 -p 4 -B 16384 -b 4096 -r 0
[DEBUG - 2025-02-13 10:52:18,758]: Taxonomy parsed and converted.
[DEBUG - 2025-02-13 10:52:18,805]: CHT created with 14 bits reserved for taxid.
[DEBUG - 2025-02-13 10:54:52,909]: Processed:100% [=============================>] 510495/510495
[INFO - 2025-02-13 10:54:52,992]: Finished building database

@aytovicc
Copy link
Author

Is this change changed the species level issue? Or is it just to use latest version ofSILVA?

@ch4rr0
Copy link
Collaborator

ch4rr0 commented Feb 13, 2025

Apologies! I was too keyboard happy and didn't see the rest of the message!

Your modifications seem good so far. The reason why the sequences are not being classified could be because you did not update the seqid2taxid.map which looks something like this:

head seqid2taxid.map 
A16379.1.1485   62616
A45315.1.1521   58772
A61579.1.1437   46692
AAAA02020713.1.1297     62507
AAAA02020714.1.1202     61970
AAAA02038450.2584.4394  10099
AAAA02039541.11.1886    47183
AAAA02041579.2617.4209  12194
AAAA02046270.117.1956   12194
AAAA02048270.689.2185   44317

This file tells kraken what taxid a particular sequence id maps to.

Example:

Given the following sequence:

>AY846379.1.1791 Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium sp. Itas 9/21 14-6w
AACCTGGTTGATCCTGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTCTAAGTATAAACTG
CTTATACTGTGAAACTGCGAATGGCTCATTAAATCAGTTATAGTTTATTTGATGGTACCTCTACACGGATAA
CCGTAGTAATTCTAGAGCTAATACGTGCGTAAATCCCGACTTCTGGAAGGGACGTATTTATTAGATAAAAGG
CCGACCGAGCTTTGCTCGACCCGCGGTGAATCATGATAACTTCACGAATCGCATAGCCTTGTGCTGGCGATG
TTTCATTCAAATTTCTGCCCTATCAACTTTCGATGGTAGGATAGAGGCCTACCATGGTGGTAACGGGTGACG
GAGGATTAGGGTTCGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCACATCCAAGGAAGGCAGCAGGCGCG
CAAATTACCCAATCCTGATACGGGGAGGTAGTGACAATAAATAACAATGCCGGGCATTTCATGTCTGGCAAT
TGGAATGAGTACAATCTAAATCCCTTAACGAGGATCAATTGGAGGGCAAGTCTGGTGCCAGCAGCCGCGGTA
ATTCCAGCTCCAATAGCGTATATTTAAGTTGTTGCAGTTAAAAAGCTCGTAGTTGGATTTCGGGTGGGTTCC

And your sample taxonomy (modified so that the taxids are whole numbers):

Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium
sp. Itas 9/21 14-6w; 12591 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium
contortum; 12591 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium
saxatile; 12591 species 2025

You will need to add an entry in the seqid2taxid.map that maps AY846379.1.1791 to 12591.

Please make sure that your taxids are unique, however, and if the there is already an entry for AY846379.1.1791 in your seqid2taxid.map your may need to update it to map to your newly created taxid. I hope that makes sense.

@aytovicc
Copy link
Author

I don't think you are reading the post at all :(. Because i did what you said. In this part of my post:

ACC_TAXID:
...
AY846379.1.1791 12591.3
AY846380.1.2583 12591.4
AY846382.1.1778 12591.5
AY846384.1.2409 12591.6
...

TAX:
...
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium sp. Itas 9/21 14-6w; 12591.1 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium contortum; 12591.2 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium saxatile; 12591.3 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium convolutum; 12591.4 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium minutum; 12591.5 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium contortum; 12591.6 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium sp. YLY-2; 12591.7 species 2025
Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium sp. C29; 12591.8 species 2025

.

As you can see all unqiue. I don't know what happens after this point.

@ch4rr0
Copy link
Collaborator

ch4rr0 commented Feb 14, 2025

I missed that part, but what I trying to point out to you is that the taxid needs to be an integer not a float, that may be reason why kraken2 is not able to map the sequence to the ID.

@aytovicc
Copy link
Author

I tried again. It gave the same error... I really need help here. I changed decimal numbers to integers. I'm giving the head of the files i edited again:

--------------------SILVA_138.2_SSURef_NR99_tax_silva.fasta (I didn't changed anything here):

AY846379.1.1791 Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium sp. Itas 9/21 14-6w
sequence
AY846382.1.1778 Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Sphaeropleales;Monoraphidium;Monoraphidium contortum
sequence

----------------------tax_slv_ssu_138.2.acc_taxid:
AY846379.1.1791 1259100001
AY846382.1.1778 1259100002
AB000393.1.1510 6268800001
AY909590.1.2352 4990200001
AB000480.1.1326 6185900001
BD359736.3.2150 877500001
AY909586.1.1754 4990200002
AY955003.1.1727 1263400001
AY955005.1.1727 1263400002
AB001783.1.1507 5993600001
FZ423313.1.1291 5865800001
HG529990.1.1403 4406900001
AB001778.1.1507 5993600002
AB001521.1.1560 6242000001
AB000479.1.1326 6185900002
AY846384.1.2409 1259100003

What i did: To get rid of decimal points i added 0000n number to end of it. So if it was
AY846379.1.1791 12591
AY846382.1.1778 12591
AYxxxxxx.x.xxxx(Lets say 90th) 12591

it became
AY846379.1.1791 1259100001
AY846382.1.1778 1259100002
AYxxxxxx.x.xxxx(Lets say 90th) 1259100090

----------------------tax_slv_ssu_138.2.txt:
Archaea; 2 domain
Archaea;Aenigmarchaeota; 11084 phylum 123
Archaea;Aenigmarchaeota;Aenigmarchaeia; 42913 class 138
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales; 42914 order 138
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;Candidatus Aenigmarchaeum; 42915 genus 138
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;Incertae Sedis; 57376 family 138.2
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;Incertae Sedis;Candidatus Aenigmarchaeum; 57377 genus 138.2
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700001 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700002 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700003 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700004 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedmarinearchaeon; 5737700005 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedmarinearchaeon; 5737700006 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700007 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedeuryarchaeote; 5737700008 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedeuryarchaeote; 5737700009 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700010 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700011 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700012 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700013 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700014 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700015 species 2025
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;CandidatusAenigmarchaeumsubterraneumSCGCAAA011-O16; 5737700016 species 2025

.
.
.
.
.

then i run this codes:
mkdir -p "$KRAKEN2_DB_NAME"
pushd "$KRAKEN2_DB_NAME"
mkdir -p data taxonomy library
pushd data
wget "$REMOTE_DIR/${FASTA_FILENAME}.gz"
gunzip "${FASTA_FILENAME}.gz"
wget "$REMOTE_DIR/taxonomy/${TAXO_PREFIX}.acc_taxid.gz"
gunzip "${TAXO_PREFIX}.acc_taxid.gz"
wget "$REMOTE_DIR/taxonomy/${TAXO_PREFIX}.txt.gz"
gunzip "${TAXO_PREFIX}.txt.gz"

build_silva_taxonomy.pl "${TAXO_PREFIX}.txt"
popd
mv data/names.dmp data/nodes.dmp taxonomy/
mv data/${TAXO_PREFIX}.acc_taxid seqid2taxid.map
sed -e '/^>/!y/U/T/' "data/$FASTA_FILENAME" > library/silva.fna
popd

kraken2-build --db $KRAKEN2_DB_NAME --build --threads $KRAKEN2_THREAD_CT

And got the same result as before:

Creating sequence ID to taxonomy ID map (step 1)...
Sequence ID to taxonomy ID map already present, skipping map creation.
Estimating required capacity (step 2)...
Estimated hash table requirement: 147564248 bytes
Capacity estimation complete. [5.321s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 19 bits reserved for taxid.
Completed processing of 0 sequences, 0 bp <------------- IT DIDNT READ THEM ATT ALL AGAIN!
Writing data to disk... complete.
Database files completed. [4.149s]
Database construction complete. [Total: 9.490s]


Well i ran out of ideas. I can't see where am i doing wrong. Im open to any help at that point.

@ch4rr0
Copy link
Collaborator

ch4rr0 commented Feb 16, 2025 via email

@ch4rr0
Copy link
Collaborator

ch4rr0 commented Feb 17, 2025

I was able to successfully do it as evidenced below

[INFO - 2025-02-16 19:55:36,379]: Decompressing /home/rcharles/src/git/kraken2/scripts/silva/data/tax_slv_ssu_138.2.acc_taxid.gz
[INFO - 2025-02-16 19:55:36,427]: Finished decompressing /home/rcharles/src/git/kraken2/scripts/silva/data/tax_slv_ssu_138.2.acc_taxid.gz
[INFO - 2025-02-16 19:55:36,567]: Masking low-complexity regions of downloaded library silva
[INFO - 2025-02-16 19:55:36,570]: A seqid2taxid.map already present and newer than any of the prelim_map.txt files, skipping
[INFO - 2025-02-16 19:55:36,582]: Running: /home/rcharles/src/git/kraken2/src/estimate_capacity -S 1111111111111111101010101010101 -k 35 -l 31 -p 1 -B 16384
[INFO - 2025-02-16 19:55:36,657]: Estimated hash table requirement: 51.43kB
[INFO - 2025-02-16 19:55:36,657]: Starting database build
[INFO - 2025-02-16 19:55:36,666]: Running: /home/rcharles/src/git/kraken2/src/build_db -H hash.k2d.tmp -t taxo.k2d.tmp -o opts.k2d.tmp -n taxonomy -m seqid2taxid.map -c 13165.714285714286 -S 1111111111111111101010101010101 -k 35 -l 31 -p 1 -B 16384 -b 16384 -r 0
[DEBUG - 2025-02-16 19:55:37,398]: Taxonomy parsed and converted.
[DEBUG - 2025-02-16 19:55:37,399]: CHT created with 14 bits reserved for taxid.
[DEBUG - 2025-02-16 19:55:37,400]: Completed processing of 1 sequences, 1791 bp--] 1/510495
[DEBUG - 2025-02-16 19:55:37,400]: Writing data to disk...  complete.
./k2 classify --db silva silva/library/silva/library.fna 
Loading database information... done.
Processed 1 sequences (1791 bp) ...C    AY846379.1.1791 1259100 1791    1259100:1757
1 sequences (0.00 Mbp) processed in 0.000s (147.4 Kseq/m, 264.03 Mbp/m).
  1 sequences classified (100.00%)
  0 sequences unclassified (0.00%)

1259100 is my made up tax ID.


Here are the steps that I took.

I used k2 a new script that we have been developing for this demonstration. I would advise that you clone the kraken2 to repo if you are intending to use k2. You can do so with the command git clone https://github.com/DerrickWood/kraken2. Once the clone is done cd into the kraken2/src and run make to compile the binaries. k2 can be found inside the scripts directory. All commands will be run inside the kraken2 directory that the git created.

If you do not have tax_slv_ssu_138.2.acc_taxid.gz, and tax_slv_ssu_138.2.txt.gz run the following command to download the files:
./scripts/k2 build --special silva --db silva and cancel the command soon as the files are downloaded.

  • To prevent k2 from overwriting your changes while testing modify k2 and comment lines 2681-2684 and 2686. These lines are responsible for downloading the taxonomy and sequence data.

  • Decompress and modify the tax_slv_ssu_138.2.txt.gz. In this case I added the line:

Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;Incertae Sedis;Candidatus Aenigmarchaeum;uncultured archaeon; 5737700 species 138.2

Archaea;        2       domain          
Archaea;Aenigmarchaeota;        11084   phylum          123
Archaea;Aenigmarchaeota;Aenigmarchaeia; 42913   class           138
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;        42914   order           138
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;Candidatus Aenigmarchaeum;      42915   genus           138
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;Incertae Sedis; 57376   family          138.2
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;Incertae Sedis;Candidatus Aenigmarchaeum;       57377   genus           138.2
Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;Incertae Sedis;Candidatus Aenigmarchaeum;uncultured archaeon;   5737700 species  138.2
  • Decompress and modify the tax_slv_ssu_138.2.acc.taxid and find the tax ID you would like to remap and update them accordingly. In our example I remapped AB600437.1.1389 from 57377 to 5737700. This is also shown below.

This:

...
AB600437.1.1389	57377
AB600439.1.1331	57676
AB600440.1.1635	57381
AB600441.1.1309	57618
AB600442.1.1581	57381
AB600444.1.1288	57653
AB600445.1.1407	57377
...

Changes to:

...
AB600437.1.1389	5737700
AB600439.1.1331	57676
AB600440.1.1635	57381
AB600441.1.1309	57618
AB600442.1.1581	57381
AB600444.1.1288	57653
AB600445.1.1407	57377
...
  • Once this is done go ahead and build the database using the command ./k2 build --db <name_of_database> --special silva --threads <n>. The script will generate the taxonomy/names.dmp, taxonomy/nodes.dmp and seqid2taxid.map.
./scripts/k2 build --db silva --special silva --threads 4
[INFO - 2025-02-16 22:08:12,559]: Decompressing /kraken2/scripts/silva/data/tax_slv_ssu_138.2.acc_taxid.gz
[INFO - 2025-02-16 22:08:12,606]: Finished decompressing /kraken2/scripts/silva/data/tax_slv_ssu_138.2.acc_taxid.gz
[INFO - 2025-02-16 22:08:27,501]: Masking low-complexity regions of downloaded library silva
[INFO - 2025-02-16 22:08:33,918]: A seqid2taxid.map already present and newer than any of the prelim_map.txt files, skipping
[INFO - 2025-02-16 22:08:33,928]: Running: /kraken2/src/estimate_capacity -S 1111111111111111101010101010101 -k 35 -l 31 -p 4 -B 16384
[INFO - 2025-02-16 22:08:39,997]: Estimated hash table requirement: 138.86MB
[INFO - 2025-02-16 22:08:39,997]: Starting database build
[INFO - 2025-02-16 22:08:40,002]: Running: /kraken2/src/build_db -H hash.k2d.tmp -t taxo.k2d.tmp -o opts.k2d.tmp -n taxonomy -m seqid2taxid.map -c 36401737.14285714 -S 1111111111111111101010101010101 -k 35 -l 31 -p 4 -B 16384 -b 4096 -r 0
[DEBUG - 2025-02-16 22:08:40,711]: Taxonomy parsed and converted.
[DEBUG - 2025-02-16 22:08:40,752]: CHT created with 14 bits reserved for taxid.
[DEBUG - 2025-02-16 22:11:11,552]: Completed processing of 510494 sequences, 747370590 bp/510495
[DEBUG - 2025-02-16 22:11:11,591]: Writing data to disk...  complete.

If the process hangs after it has indicated that it is completed then simply terminate the process by pressing Ctrl-C. cd into the database directory, it is called silva in this example, and rename the *.k2d.tmp files to *.k2d. If you do not do this kraken2 will not consider the database valid.

  • Now that the build is complete we are now going to classify the sequence ./scripts//k2 classify --db silva foo.fasta where foo.fasta contains the following data:
>AB600437.1.1389 Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales
TTCCGGTTGATCCTGCCGGAGGTCACCGTTATCAGGTTTCGACTAAGCCATGCGAGGCGTGAGTCATCTGGA
CTCGCCGAACTGCTCAGTAACACGTAGTTAACCTAACCTCAAGTGGAGGATAATCCCGCGAAAGTGGGGCTA
ATACTCCATAGCTGTTGAATGCTGGAATGCTTTAACAGCCAAAGTCTCGGCGCTTGAGGATGGGACTGCGAC
CTATCATGGTAGTTGCGAGTATAAAATACTCACAAGCCCGTAACGGGTAGGGGCCATGAGAGTGGAAGCCTC
GAGATGCCTTCTGAGACATGAAGGCAGGTCCTACGGGACGCAGCAGGTACGAAAACTTTCCAATAGCTTAAC
GGCTGAGAGGGGAAGCCTGAGTGACTATGCACTGCATAGTCTTTTGTTTTGTCTAAAAAGCAGGACGAATAA
GTGGTGGGTAAGACGGGTGGCAGCCGCCGCGGTAATACCCGCGCCACGAATGGTGACCACGATTATTGGGTC
TAAAGCGTCCGTAGCCGGTTCAACAAGTTCTTGGTTAAATTCAACAGCTGACTGTTGGTTTGCTGAGAATAC
TGTTTGACTTGGGATTGGGAAAGGAGAAAGGTATTCCACGTGTAGGGGTTAAATCCTTTAATCCGTGGAAGA
CCTCCGGTGGCGAAGGCGTTTCTCTGGAACAAGTCCGACGGTGAGGGACGAAAGCTAGGGGAGCAAATGGGA
>AB600445.1.1407 Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales
TTCCGGTTGATCCTGCCGGAGATCACCGTTATCAGTGTTCGACTTAGCCATGCGAGGCGAGAGCTCTTTGGG
CTCGCCGAACTGCTCAGTAACGCGTAGTTAACATACCCTCAAGTGAAGGATATTCTCGCGAAAGTGAGGCTA
ATACTTCATAGCTGTTAAATGCTGGAATGCTTTAACAGTTAAAGCTGAGGCGCTTGAGGATTGGACTGCGAT
CTATCATGGCTGTTGTGAGTGTAAAGTACTCACAAACCTGTAACGGATAGGGGTCATGAAAGTGGGAGCCTC
GAGATGCCTTCTGAGACATGAAGGCAGGTCCTACGGGACGCAGCAGGTGCGAAAACTTTCCAATTCCTGCAA
GGGTGAGAAGGGAATTCTGAATGGTTATGCAATGCATAACCTTTTATCTTGTCTAAAAAGCAAGATGAATAA
GTGGTGGGTAAGACGGGTGGCAGCCGCCGCGGTAATACCCGCGCCACAAATGGTGATCACTTTTATTTGGTC
TAAAGGGTCCGTAGCCGGTTCAGCAAGTTCTTAGGGAAATCTGCTAGCTAACTAGCAGGCTTTCTGAGAATA
CTATTGAACTTGGGATTGGGATAGGATAGAGGTATTCCATGTGTAGGGGTTAAATCCTTTAATCCATGGAAG
ACCTCCGGTGGCGAAGGCGTCTATCTAGAACAAGTCCGACGGTGAGGGACGAAAGTATGGGGAGCAAACGGG

Output:

./k2 classify --db silva foo.fasta
Loading database information... done.
Processed 2 sequences (1440 bp) ...C    AB600437.1.1389 5737700 720     5737700:8 57377:2 5737700:13 57376:7 5737700:3 2:5 57376:9 2:18 57376:4 2:10 57377:5 11084:7 57377:6 5737700:4 57377:1 5737700:13 57376:1 5737700:10 57376:22 5737700:104 57377:4 57376:56 57377:3 57376:3 5737700:13 57376:5 5737700:1 57377:4 5737700:6 57377:2 5737700:11 57377:18 5737700:6 57377:5 5737700:29 57376:6 2:3 57376:3 2:1 57376:5 2:3 57376:17 11084:5 57376:5 11084:3 57376:9 2:4 57376:5 2:13 57377:11 5737700:49 57377:42 57376:14 5737700:7 57376:5 57377:1 5737700:1 57377:5 5737700:1 57376:1 5737700:5 57376:5 57377:20 57376:1 2:9 1:3 2:5 1:1
C       AB600445.1.1407 57377   720     57376:1 57377:44 57376:2 57377:5 57376:4 57377:158 57376:14 57377:7 57376:11 57377:18 57376:5 57377:5 57376:51 57377:3 57376:1 57377:5 57376:5 57377:45 57376:5 57377:15 57376:20 2:3 57376:3 2:1 57376:5 2:3 57376:17 11084:5 57377:3 2:5 57377:18 2:8 57377:97 57376:1 57377:5 57376:36 2:4 57376:10 57377:2 57376:3 2:5 57376:4 57377:6 57376:2 57377:11 1:5
2 sequences (0.00 Mbp) processed in 0.001s (200.0 Kseq/m, 144.00 Mbp/m).
  2 sequences classified (100.00%)
  0 sequences unclassified (0.00%)

You can see that AB600437.1.1389, the only sequence that we changed, got classified with our new tax ID, 5737700.

I hope this helps!

@aytovicc
Copy link
Author

I will try to understand your steps before i move on. But i guess the problem im having is from this line of code:

build_silva_taxonomy.pl "tax_slv_ssu_138.2.txt"

It gives a error that says there is some problem a strange input in line 8. Which is the one i added. As you can see the 5737700001.

Archaea; 2 domain Archaea;Aenigmarchaeota; 11084 phylum 123 Archaea;Aenigmarchaeota;Aenigmarchaeia; 42913 class 138 Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales; 42914 order 138 Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;Candidatus Aenigmarchaeum; 42915 genus 138 Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;Incertae Sedis; 57376 family 138.2 Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;Incertae Sedis;Candidatus Aenigmarchaeum; 57377 genus 138.2 Archaea;Aenigmarchaeota;Aenigmarchaeia;Aenigmarchaeales;IncertaeSedis;CandidatusAenigmarchaeum;unculturedarchaeon; 5737700001 species 138.3

i tried to do same with others like i itterate all this lines in Python to see whats the seperator of the empty spaces. The pattern is:
-> f'{row["TAXONOMY"]};\t{row["TAXID_NEW"]}\t{row["LEVEL"]}\t\t{row["VERSION"]}\n'

But i still clearly dont understand what am i missing in this tax_slv_ssu_138.2.txt file that i updated. If we can solve this it will be no problem in building database i hope.

I will try your code.
But i didnt understand that part " Once the clone is done cd into the kraken2/src and run make to compile the binaries." how to run make? There is no code named make, there is Makefile.

@aytovicc
Copy link
Author

This is the error:

(./custom_db/data$ kraken2/scripts/build_silva_taxonomy.pl "tax_slv_ssu_138.2.txt" build_silva_taxonomy.pl: orphan error, line 8

@ch4rr0
Copy link
Collaborator

ch4rr0 commented Feb 17, 2025

Did you make sure that the fields are separated by tabs?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants