We want a space to share MLST alleles, with mechanisms to add, remove, and curate those alleles. There is no perfect solution to this, so here are the advantages and disadvantages of our approach.
Advantages:
- Contextualizes genomes against what else is out there
- Alleles are hashed, so sequence data are not revealed
- Each hash has a fixed length, so it is easy to check whether an allele has been truncated
- Frees the database from dependence on funding sources
- Git repo!
  - ... can be copied and/or made decentralized easily
  - ... can be versioned
  - ... can be forked - individuals or institutions can decide to have their own database
  - ... can be pushed - new alleles or loci can be updated
  - ... can be pulled - databases can update with the latest alleles or loci

Disadvantages:
- Allelic sequences are lost through hashing.
- The database can only be queried in a limited way: a query either matches an exact hashsum or it doesn't.
- The database does not state whether any one allele conforms to any one rule. For example, it is unknown whether a particular allele is bounded by start and stop sites.
- There is a lot of work ahead of us.
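
The trade-offs around hashing can be seen in a few lines of Python. This is only a minimal sketch: the hash algorithm (SHA-256), the uppercase normalization, the function name `hash_allele`, and the sequences are all assumptions made for illustration, not what this repo necessarily uses.

```python
import hashlib


def hash_allele(sequence: str) -> str:
    """Return a fixed-length hex digest for an allele sequence.

    Assumption: SHA-256 over the uppercased sequence. The real database
    may use a different algorithm or normalization.
    """
    normalized = sequence.strip().upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


# Hypothetical allele sequences, for illustration only.
known = {hash_allele("ATGACCGGTTAA"): "locusA allele 1"}

exact = hash_allele("ATGACCGGTTAA")
near = hash_allele("ATGACCGGTTAG")  # one base differs from the known allele

# Every digest has the same length, whatever the sequence length.
print(len(exact), len(near))
# An identical sequence hits the database.
print(exact in known)
# A near-identical sequence does not hit at all.
print(near in known)
```

The digest does not reveal the sequence, and a lookup is all-or-nothing: a single base change produces an unrelated hash, so there is no way to ask for the closest allele.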
In the db folder, each scheme has these files:
- refs.fasta: reference alleles for each locus
- alleles.tsv: information on each allele
- clusters.tsv: information on clusters. Clusters could be outbreak codes, or they could be something else such as allele codes.
- profiles.tsv: each sample and its alleles
The specification is at docs/specification.md
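
As a rough sketch of how one of the .tsv files might be read, the Python below loads a tab-separated table into a list of dictionaries. It assumes the file has a header row, and the column names and values in the example are invented; the real layout is whatever the specification defines.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Hypothetical content standing in for a scheme's profiles.tsv.
# The column names and values here are invented for illustration.
example = io.StringIO(
    "sample\tlocusA\tlocusB\n"
    "sample1\thash1\thash2\n"
    "sample2\thash1\thash3\n"
)

profiles = read_tsv(example)
for row in profiles:
    print(row["sample"], row["locusA"], row["locusB"])
```

To read a real file, pass an open file handle instead of the `io.StringIO` object, for example `open("db/senterica/profiles.tsv", newline="")`, where the path is again only an assumed example.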
Example using the Python script:

    mkdir -v db
    python3 scripts/digestFasta.py t/senterica/*.tfa --out db/senterica.dbhpy --force

The same example using the Perl script:

    mkdir -v db
    perl scripts/digestFasta.pl t/senterica/*.tfa --out db/senterica.dbhpl --force
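
For readers who want a feel for what digesting a FASTA file into hashes involves, here is a minimal Python sketch. It is not the code in scripts/digestFasta.py: the function names, the choice of SHA-256, and the example records are assumptions for illustration only.

```python
import hashlib
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record begins, so emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def digest_fasta(handle):
    """Map each FASTA header to a hash of its sequence (SHA-256 assumed)."""
    return {
        header: hashlib.sha256(seq.upper().encode("ascii")).hexdigest()
        for header, seq in read_fasta(handle)
    }


# Hypothetical two-allele FASTA; the first sequence is wrapped over two lines.
example = io.StringIO(">locusA_1\nATGACC\nGGTTAA\n>locusA_2\nATGACCGGTTAG\n")
digests = digest_fasta(example)
for name, digest in digests.items():
    print(name, digest)
```

Wrapped sequence lines are joined before hashing, so the same allele gives the same hash however the FASTA file is line-wrapped.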
- Clone the repo
- Put the scripts folder into your PATH
To add your own database, use this repo as a template, add your database using the scripts, and make a new repo with it. Then upload it to a git hosting site such as GitHub.