This is a template for any new hash-based MLST database

Why?

We want a space to share MLST alleles, with mechanisms to add, remove, and curate them. No solution is perfect, so here are the advantages and disadvantages of our approach.

Advantages

Contextualize genomes with what else is out there
Alleles are hashed and so sequence data are not revealed
The hash is a fixed length, and so it is an easy check to see if an allele has been truncated.
Frees the database from dependence on dedicated funding.
Git repo!
- ... can be copied and/or made decentralized easily.
- ... can be versioned
- ... can be forked - individuals or institutions can decide to have their own database
- ... can be pushed - new alleles or loci can be updated
- ... can be pulled - databases can update with the latest alleles or loci
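The fixed-length property of a hash can be sketched in a few lines of Python. This is only an illustration: SHA-256 is an assumption made here for the example (the actual algorithm is whatever docs/specification.md defines), and `hash_allele` and `looks_truncated` are hypothetical helper names, not functions from this repo.

```python
import hashlib

def hash_allele(sequence: str) -> str:
    """Return a hex digest for an allele sequence.

    SHA-256 is an assumption for this sketch, not necessarily
    the algorithm this template uses.
    """
    normalized = sequence.strip().upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()

def looks_truncated(digest: str, expected_length: int = 64) -> bool:
    """A digest whose length differs from the fixed length is suspect."""
    return len(digest) != expected_length

digest = hash_allele("ATGGCTAAATAA")
print(len(digest))                   # 64 hex characters for SHA-256
print(looks_truncated(digest))       # False
print(looks_truncated(digest[:40]))  # True
```

Because every digest has the same length regardless of the allele length, a simple length check is enough to flag a stored value that was cut off.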

Disadvantages

Allelic sequences are lost through hashing.
The database can be queried in only one way: a query either matches an exact hashsum or it doesn't.
The database does not state whether any one allele conforms to any one rule. For example, it is unknown if a particular allele is bound by start and stop sites.
There is a lot of work ahead of us.
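The exact-match limitation can be illustrated with a small Python sketch. The hash algorithm (SHA-256), the in-memory dictionary, and the label `locusA_1` are all assumptions invented for this example; they do not describe how the scripts in this repo store or query alleles.

```python
import hashlib

def hash_allele(sequence: str) -> str:
    """Hash an allele sequence (SHA-256 is assumed for illustration)."""
    return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()

# Hypothetical in-memory database: hash -> allele label
known = {hash_allele("ATGGCTAAATAA"): "locusA_1"}

def lookup(sequence: str, database: dict):
    """Return the allele label on an exact hash hit, otherwise None."""
    return database.get(hash_allele(sequence))

print(lookup("ATGGCTAAATAA", known))  # locusA_1
print(lookup("ATGGCTAAGTAA", known))  # None: one base differs
```

A single-base difference produces an unrelated hash, so the database cannot report a near match or say how close the query was to a known allele.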

Database format

In the db folder, each scheme has these files.

refs.fasta - reference alleles for each locus
alleles.tsv - information on each allele
clusters.tsv - information on clusters, which could be outbreak codes or something else, such as allele codes
profiles.tsv - each sample and its alleles

The specification is at docs/specification.md
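As a quick sanity check, the file list above can be verified with a short Python sketch. It uses only the four file names given here; the function name `missing_scheme_files` and the directory `db/senterica` are hypothetical, and the sketch does not validate file contents against the specification.

```python
from pathlib import Path

# File names taken from the list above
SCHEME_FILES = ("refs.fasta", "alleles.tsv", "clusters.tsv", "profiles.tsv")

def missing_scheme_files(scheme_dir):
    """Return the expected scheme files that are absent from scheme_dir."""
    scheme_dir = Path(scheme_dir)
    return [name for name in SCHEME_FILES
            if not (scheme_dir / name).is_file()]

if __name__ == "__main__":
    # Hypothetical scheme directory; adjust to your own layout
    missing = missing_scheme_files("db/senterica")
    print("missing files:", missing)
```

An empty list means all four files are present.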

Example

python

mkdir -v db
python3 scripts/digestFasta.py t/senterica/*.tfa --out db/senterica.dbhpy --force

perl

mkdir -v db
perl scripts/digestFasta.pl t/senterica/*.tfa --out db/senterica.dbhpl --force

Installation

Clone the repo
Put scripts into your PATH

Usage

To add your own database, use this repo as a template to create a new repo, add your database with the scripts, and then upload it to a Git hosting site such as GitHub.

CFSAN-Biostatistics/mlst-hash-template
