NOTE:
restez
is no longer available on CRAN due to the archiving of a key dependency (MonetDBLite). It can still be installed via GitHub. The issue is being dealt with and hopefully a new version ofrestez
will be available on CRAN soon.
ADDITIONAL NOTE (2022-07-07): MonetDBLite has now been replaced with duckdb in version 2.0.0, which should allow for submission to CRAN. Because of this change, restez v2.0.0 or higher is not compatible with databases built with previous versions of restez.
Download parts of NCBI’s GenBank to a local folder and create a simple SQL-like database. Use ‘get’ tools to query the database by accession IDs. rentrez wrappers are available, so that if sequences are not available locally they can be searched for online through Entrez.
See the detailed tutorials for more information.
Vous entrez, vous rentrez et, maintenant, vous …. restez!
Downloading sequences and sequence information from GenBank and related NCBI taxonomic databases is often performed via the NCBI API, Entrez. Entrez, however, has a limit on the number of requests and downloading large amounts of sequence data in this way can be inefficient. For programmatic situations where multiple Entrez calls are made, downloading may take days, weeks or even months.
This package aims to make sequence retrieval more efficient by allowing a user to download large sections of the GenBank database to their local machine and query this local database either through package specific functions or Entrez wrappers. This process is more efficient as GenBank downloads are made via NCBI’s FTP using compressed sequence files. With a good internet connection and a middle-of-the-road computer, a database comprising 20 GB of sequence information can be generated in less than 10 minutes.
The package can currently only be installed through GitHub:
# install.packages("remotes")
remotes::install_github("ropensci/restez")
(It was previously available via CRAN but was archived due to a key dependency MonetDBLite being no longer available.)
For more detailed information on the package’s functions and detailed guides on downloading, constructing and querying a database, see the detailed tutorials.
# Warning: running these examples may take a few minutes
library(restez)
# choose a location to store GenBank files
restez_path_set(rstz_pth)
# Run the download function
db_download()
# after download, create the local database
db_create()
# get a random accession ID from the database
id <- sample(list_db_ids(), 1)
#> Warning in list_db_ids(): Number of ids returned was limited to [100].
#> Set `n=NULL` to return all ids.
# you can extract:
# sequences
seq <- gb_sequence_get(id)[[1]]
str(seq)
#> chr "GATCCATATGTATACGGAAAGCGTGTTTACCTCTTTGGTCAGATANTAACTGATGCTCACCCAATCNTNCCGTAGACCGTGCCTGATTGGGCCGGANGACCTGGGGAACGA"| __truncated__
# definitions
def <- gb_definition_get(id)[[1]]
print(def)
#> [1] "Unidentified clone A3 DNA sequence from ocean beach sand"
# organisms
org <- gb_organism_get(id)[[1]]
print(org)
#> [1] "unidentified"
# or whole records
rec <- gb_record_get(id)[[1]]
cat(rec)
#> LOCUS AF298085 537 bp DNA linear UNA 23-NOV-2000
#> DEFINITION Unidentified clone A3 DNA sequence from ocean beach sand.
#> ACCESSION AF298085
#> VERSION AF298085.1
#> KEYWORDS .
#> SOURCE unidentified
#> ORGANISM unidentified
#> unclassified sequences.
#> REFERENCE 1 (bases 1 to 537)
#> AUTHORS Naviaux,R.K.
#> TITLE Sand DNA: a multigenomic library on the beach
#> JOURNAL Unpublished
#> REFERENCE 2 (bases 1 to 537)
#> AUTHORS Naviaux,R.K.
#> TITLE Direct Submission
#> JOURNAL Submitted (21-AUG-2000) Medicine, University of California, San
#> Diego School of Medicine, 200 West Arbor Drive, San Diego, CA
#> 92103-8467, USA
#> FEATURES Location/Qualifiers
#> source 1..537
#> /organism="unidentified"
#> /mol_type="genomic DNA"
#> /db_xref="taxon:32644"
#> /clone="A3"
#> /note="anonymous environmental sample sequence from ocean
#> beach sand"
#> ORIGIN
#> 1 gatccatatg tatacggaaa gcgtgtttac ctctttggtc agatantaac tgatgctcac
#> 61 ccaatcntnc cgtagaccgt gcctgattgg gccggangac ctggggaacg accgcgaagt
#> 121 tgtgtanctg cggaacctcn gtatnggngt nttacancaa cactgtgccc tcggccattt
#> 181 ccaggttgcc gttgtcctcc tcagnctgna tgacccatnn ataacggcca tctgccaaaa
#> 241 cccggntcac gatctgctgg gtaccgccgg ccagttccac cgtttccggc tcgtccacaa
#> 301 caccntccca atagacactg tatgcaccgg gcccccggcg ccggttctgc cgaaagaaga
#> 361 attgttgctc gtcttcnttc tcnaaatata tgctgacgtt ngccgaacga gtgagtctat
#> 421 accggatctc ggtgacgtcg gtgtcgncat cggcgtncgg nctgatccna tcgggggcga
#> 481 cgctnaatcc ctnaacagcg ggccaaaggc cangtgcatt tggccgcaac cacctgc
#> //
# use the entrez_* wrappers to access GB data
res <- entrez_fetch(db = 'nucleotide', id = id, rettype = 'fasta')
cat(res)
#> >AF298085.1 Unidentified clone A3 DNA sequence from ocean beach sand
#> GATCCATATGTATACGGAAAGCGTGTTTACCTCTTTGGTCAGATANTAACTGATGCTCACCCAATCNTNC
#> CGTAGACCGTGCCTGATTGGGCCGGANGACCTGGGGAACGACCGCGAAGTTGTGTANCTGCGGAACCTCN
#> GTATNGGNGTNTTACANCAACACTGTGCCCTCGGCCATTTCCAGGTTGCCGTTGTCCTCCTCAGNCTGNA
#> TGACCCATNNATAACGGCCATCTGCCAAAACCCGGNTCACGATCTGCTGGGTACCGCCGGCCAGTTCCAC
#> CGTTTCCGGCTCGTCCACAACACCNTCCCAATAGACACTGTATGCACCGGGCCCCCGGCGCCGGTTCTGC
#> CGAAAGAAGAATTGTTGCTCGTCTTCNTTCTCNAAATATATGCTGACGTTNGCCGAACGAGTGAGTCTAT
#> ACCGGATCTCGGTGACGTCGGTGTCGNCATCGGCGTNCGGNCTGATCCNATCGGGGGCGACGCTNAATCC
#> CTNAACAGCGGGCCAAAGGCCANGTGCATTTGGCCGCAACCACCTGC
# if the id is not in the local database
# these wrappers will search online via the rentrez package
res <- entrez_fetch(db = 'nucleotide', id = c('S71333.1', id),
rettype = 'fasta')
#> [1] id(s) are unavailable locally, searching online.
cat(res)
#> >AF298085.1 Unidentified clone A3 DNA sequence from ocean beach sand
#> GATCCATATGTATACGGAAAGCGTGTTTACCTCTTTGGTCAGATANTAACTGATGCTCACCCAATCNTNC
#> CGTAGACCGTGCCTGATTGGGCCGGANGACCTGGGGAACGACCGCGAAGTTGTGTANCTGCGGAACCTCN
#> GTATNGGNGTNTTACANCAACACTGTGCCCTCGGCCATTTCCAGGTTGCCGTTGTCCTCCTCAGNCTGNA
#> TGACCCATNNATAACGGCCATCTGCCAAAACCCGGNTCACGATCTGCTGGGTACCGCCGGCCAGTTCCAC
#> CGTTTCCGGCTCGTCCACAACACCNTCCCAATAGACACTGTATGCACCGGGCCCCCGGCGCCGGTTCTGC
#> CGAAAGAAGAATTGTTGCTCGTCTTCNTTCTCNAAATATATGCTGACGTTNGCCGAACGAGTGAGTCTAT
#> ACCGGATCTCGGTGACGTCGGTGTCGNCATCGGCGTNCGGNCTGATCCNATCGGGGGCGACGCTNAATCC
#> CTNAACAGCGGGCCAAAGGCCANGTGCATTTGGCCGCAACCACCTGC
#>
#> >S71333.1 alpha 1,3 galactosyltransferase [New World monkeys, mermoset lymphoid cell line B95.8, mRNA Partial, 1131 nt]
#> ATGAATGTCAAAGGAAAAGTAATTCTGTCGATGCTGGTTGTCTCAACTGTGATTGTTGTGTTTTGGGAAT
#> ATATCAACAGCCCAGAAGGCTCTTTCTTGTGGATATATCACTCAAAGAACCCAGAAGTTGATGACAGCAG
#> TGCTCAGAAGGACTGGTGGTTTCCTGGCTGGTTTAACAATGGGATCCACAATTATCAACAAGAGGAAGAA
#> GACACAGACAAAGAAAAAGGAAGAGAGGAGGAACAAAAAAAGGAAGATGACACAACAGAGCTTCGGCTAT
#> GGGACTGGTTTAATCCAAAGAAACGCCCAGAGGTTATGACAGTGACCCAATGGAAGGCGCCGGTTGTGTG
#> GGAAGGCACTTACAACAAAGCCATCCTAGAAAATTATTATGCCAAACAGAAAATTACCGTGGGGTTGACG
#> GTTTTTGCTATTGGAAGATATATTGAGCATTACTTGGAGGAGTTCGTAACATCTGCTAATAGGTACTTCA
#> TGGTCGGCCACAAAGTCATATTTTATGTCATGGTGGATGATGTCTCCAAGGCGCCGTTTATAGAGCTGGG
#> TCCTCTGCGTTCCTTCAAAGTGTTTGAGGTCAAGCCAGAGAAGAGGTGGCAAGACATCAGCATGATGCGT
#> ATGAAGACCATCGGGGAGCACATCTTGGCCCACATCCAACACGAGGTTGACTTCCTCTTCTGCATGGATG
#> TGGACCAGGTCTTCCAAGACCATTTTGGGGTAGAGACCCTGGGCCAGTCGGTGGCTCAGCTACAGGCCTG
#> GTGGTACAAGGCAGATCCTGATGACTTTACCTATGAGAGGCGGAAAGAGTCGGCAGCATATATTCCATTT
#> GGCCAGGGGGATTTTTATTACCATGCAGCCATTTTTGGAGGAACACCGATTCAGGTTCTCAACATCACCC
#> AGGAGTGCTTTAAGGGAATCCTCCTGGACAAGAAAAATGACATAGAAGCCGAGTGGCATGATGAAAGCCA
#> CCTAAACAAGTATTTCCTTCTCAACAAACCCTCTAAAATCTTATCTCCAGAATACTGCTGGGATTATCAT
#> ATAGGCCTGCCTTCAGATATTAAAACTGTCAAGCTATCATGGCAAACAAAAGAGTATAATTTGGTTAGAA
#> AGAATGTCTGA
Want to contribute? Check the contributing page.
MIT
Bennett et al. (2018). restez: Create and Query a Local Copy of GenBank in R. Journal of Open Source Software, 3(31), 1102. https://doi.org/10.21105/joss.01102
Benson, D. A., Karsch-Mizrachi, I., Clark, K., Lipman, D. J., Ostell, J., & Sayers, E. W. (2012). GenBank. Nucleic Acids Research, 40(Database issue), D48–D53. http://doi.org/10.1093/nar/gkr1202
Winter DJ. (2017) rentrez: An R package for the NCBI eUtils API. PeerJ Preprints 5:e3179v2 https://doi.org/10.7287/peerj.preprints.3179v2
This package previously developed and maintained by Dom Bennett