This script is a toolbox for automatically retrieving information from NCBI.
It was built with Python 3.6.5+ and has only two dependencies:
- Biopython (Bio.Entrez): to understand the basic syntax.
- NCBI Entrez ebook: to check the databases, output types, and file formats that Entrez can access.
- To retrieve GenBank information for nucleotide sequences:
python ncbi_seq_retrieve.py -in file_with_access_ids.txt -db nucleotide -ot gb
Or, to retrieve it in XML format, just add the parameter -tf xml.
- To retrieve CDS translated to amino acids from nucleotide sequences:
python ncbi_seq_retrieve.py -in file_with_access_ids.txt -db nucleotide -ot fasta_cds_aa
Or, to retrieve untranslated CDS, just replace fasta_cds_aa with fasta_cds_na.
- To retrieve nucleotide or amino acid sequences:
python ncbi_seq_retrieve.py -in file_with_access_ids.txt -db (nucleotide or protein) -ot fasta
Or, to retrieve them in XML format, just add the parameter -tf xml.
- To retrieve taxonomy information for NCBI access IDs:
python ncbi_seq_retrieve.py -in file_with_access_ids.txt -db (nucleotide or protein) -ot gb -tx True
- To retrieve taxonomy information for the host of NCBI access IDs (ideal for viruses):
python ncbi_seq_retrieve.py -in file_with_access_ids.txt -db (nucleotide or protein) -ot gb -tx True -th True
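The commands above presumably end up as Entrez E-utilities requests. The following is a minimal sketch, not the script's actual code, of how a list of access IDs might be fetched with Biopython's Bio.Entrez.efetch, assuming that -db maps to db, -ot to rettype, and -tf to retmode. It also shows one possible way to read the lineage and host from a parsed GenBank record; the script may well query the Taxonomy database instead. The helper names (read_ids, batched, fetch_records, lineage_and_host), the batch size, and the email placeholder are all hypothetical.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_ids(lines: Iterable[str]) -> List[str]:
    """Return one access ID per non-blank line (works on an open file or a list)."""
    return [line.strip() for line in lines if line.strip()]


def batched(ids: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield successive chunks of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_records(ids, db="nucleotide", rettype="gb", retmode="text",
                  email="you@example.org"):
    """Fetch records from NCBI and return them as one string (needs network access)."""
    # Imported here so the helpers above can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # placeholder address; NCBI asks clients to identify themselves
    chunks = []
    for chunk in batched(list(ids)):
        handle = Entrez.efetch(db=db, id=",".join(chunk),
                               rettype=rettype, retmode=retmode)
        try:
            data = handle.read()
        finally:
            handle.close()
        # Some Biopython versions return bytes for XML output.
        if isinstance(data, bytes):
            data = data.decode("utf-8")
        chunks.append(data)
    return "".join(chunks)


def lineage_and_host(record) -> Tuple[List[str], Optional[str]]:
    """Return (taxonomy lineage, host) from a SeqRecord parsed from GenBank format."""
    lineage = list(record.annotations.get("taxonomy", []))
    host = None
    for feature in record.features:
        if feature.type == "source":
            # Qualifier values are lists; the host qualifier is optional.
            host = feature.qualifiers.get("host", [None])[0]
            break
    return lineage, host
```

Under these assumptions, calling fetch_records(read_ids(open("file_with_access_ids.txt")), rettype="fasta_cds_aa") would roughly correspond to the translated-CDS command above, and lineage_and_host could be applied to each record returned by Bio.SeqIO.parse on GenBank output.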
If you have a file with IDs from nucleotide sequences, you can't use it with the protein database, and vice versa. If you call the help function, it shows a table of which text formats are allowed per output type and which output types are allowed per database.
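The compatibility rules described above can be pictured as a lookup table. This is only a sketch: the table below contains just the database and output-type combinations that appear in the example commands in this document, with the assumption that the CDS output types apply to the nucleotide database only. It is not the script's full table, which the help output lists. The names ALLOWED_OUTPUT_TYPES and check_combination are hypothetical.

```python
# Illustrative subset only, taken from the example commands in this document.
ALLOWED_OUTPUT_TYPES = {
    "nucleotide": {"gb", "fasta", "fasta_cds_aa", "fasta_cds_na"},
    "protein": {"gb", "fasta"},
}


def check_combination(db: str, output_type: str) -> None:
    """Raise ValueError if the output type is not listed for the database."""
    if db not in ALLOWED_OUTPUT_TYPES:
        raise ValueError(f"unknown database: {db!r}")
    if output_type not in ALLOWED_OUTPUT_TYPES[db]:
        allowed = ", ".join(sorted(ALLOWED_OUTPUT_TYPES[db]))
        raise ValueError(
            f"output type {output_type!r} is not allowed for {db!r}; "
            f"allowed: {allowed}"
        )
```

Checking the combination before sending any request lets an invalid pairing such as the protein database with fasta_cds_aa fail early with a readable message instead of an error from the server.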
- This script will continue to be developed to cover other functions, such as sequence features.