File format specification for the hash allele database

The hash allele database format describes an MLST database where each locus has many alleles. Each locus has a reference allele against which query sequences can be matched. The other alleles are hashed, so a query can only find them by exact match.

In many other allele databases, the loci have alleles with integer identifiers. Here, we use hashsum identifiers so that databases can be merged and compared universally. "Allele number 1 of locus 1" is unfortunately ambiguous in a decentralized database or across multiple databases of the same type. However, allele FF..FF is unambiguous between schemes that use the same hashsum algorithm. Therefore, merging alleles from different hash allele databases should be trivial and unambiguous.

Each database is a folder, and the folder name is the name of the database. The folder contains these files:

refs.fasta
profiles.tsv
clusters.tsv
alleles.tsv

refs.fasta

These are reference alleles for each locus. The defline must be in the format of >locus or >locus_allele. Locus must match the regex /[A-Z0-9-]+/i, i.e., only letters, numbers, and dashes. refs.fasta must be compatible with bioinformatics software such as makeblastdb and blastn. This file would normally have one allele per locus but it can have more than one allele per locus.
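
As an illustration, the defline rule above can be checked with a short Python sketch. The function name `parse_defline` is hypothetical. The sketch assumes that the identifier ends at the first whitespace and that the first underscore separates the locus from the allele; the specification does not state either detail explicitly.

```python
import re

# Locus names: letters, digits, and dashes only (case-insensitive).
LOCUS_RE = re.compile(r"[A-Z0-9-]+", re.IGNORECASE)

def parse_defline(line):
    """Split a refs.fasta defline into (locus, allele).

    allele is None for the '>locus' form. Raises ValueError when the
    line is not a defline or the locus name has invalid characters.
    """
    if not line.startswith(">"):
        raise ValueError(f"not a defline: {line!r}")
    # Assumption: the identifier ends at the first whitespace.
    fields = line[1:].split()
    if not fields:
        raise ValueError("empty defline")
    identifier = fields[0]
    # Assumption: the first underscore separates locus from allele.
    locus, sep, allele = identifier.partition("_")
    if not LOCUS_RE.fullmatch(locus):
        raise ValueError(f"invalid locus name: {locus!r}")
    if sep and not allele:
        raise ValueError(f"missing allele after underscore: {identifier!r}")
    return locus, (allele if sep else None)

print(parse_defline(">aroC_1"))
print(parse_defline(">aroC"))
```

A validator like this could be run over every defline before building a BLAST database from refs.fasta.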

profiles.tsv

This is a listing of every MLST profile. Whitespace is not allowed in the values. Values are separated by tabs. The first line is a header. The first column is the MLST scheme and its header is scheme. The second column is the sequence type and its header is ST. The third column is the hashsum algorithm and its header is hash-type. The subsequent columns are named after loci, and the names must be identical to those found in alleles.tsv.

Note: typically, sequence types are integers. However, since this is a decentralized specification reliant on hashsums instead of integers defined from a central location, the sequence type is a hashsum too. It is calculated by concatenating the alleles in the profile, separated by tabs, in the alphabetical order of their loci, and then hashing the resulting string. If the hashsum result is case-insensitive, then the values should be uppercase. Because the sequence type is a hashsum, there is a third required column, hash-type, which names the algorithm used.

An example calculation of a sequence type is with these five loci and their alleles. The alleles shown are truncated for simplicity.

xyzB	fooB	locusC	barK	helloW
AB	2F	A2	22	a4

Loci are sorted alphabetically like so: barK, fooB, helloW, locusC, xyzB. Therefore, the alleles, concatenated with tabs, would look like this:
22 2F a4 A2 AB

On the command line, hashsumming looks like this:

echo -ne $'22\t2F\ta4\tA2\tAB' | openssl dgst -md5 -binary | openssl enc -base64

The base64-encoded md5sum of this string is hGPy1TKezj177pTM29V7lA== and therefore this is the sequence type of this example profile.
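
The same calculation can be sketched in Python with only the standard library. The function name `sequence_type` is hypothetical, and the sketch assumes that "alphabet-sorted" means Python's default, case-sensitive string ordering; the specification does not say how mixed-case locus names should be ordered.

```python
import base64
import hashlib

def sequence_type(profile):
    """Return the base64-encoded MD5 sequence type for a {locus: allele} dict."""
    # Assumption: loci are ordered with Python's default string sort.
    loci = sorted(profile)
    # Join the alleles with tabs, with no trailing newline.
    joined = "\t".join(profile[locus] for locus in loci)
    digest = hashlib.md5(joined.encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

# The five example loci and their truncated alleles from above.
profile = {"xyzB": "AB", "fooB": "2F", "locusC": "A2", "barK": "22", "helloW": "a4"}
print(sequence_type(profile))
```

The function hashes the same bytes as the openssl pipeline above, so the two should agree for any profile.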

See the hashing appendix for more information.

Special alleles in profiles.tsv

. indicates that the allele is the same as the reference allele in the database. This is the single allele shown in refs.fasta for this locus. This is an invalid allele if there are multiple alleles in refs.fasta for this locus.
- indicates that there is no allele call for this locus.

Defined columns in profiles.tsv

Columns can be in any order and so the column number is just a suggestion.

Label	column number (1-based)	definition	example
scheme	1	The MLST scheme	`Salmonella_enterica_cgMLST`
ST	2	The sequence type	`689ec302e620f47a02daa4c38168b852`
hash-type	3	The hashsum algorithm used to define the ST	`md5`
locus-name1	subsequent column	There are unlimited columns starting here, describing each locus and its allele, one at a time.	an allele hashsum
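
Because columns can be in any order, a reader is probably safest addressing them by header name rather than by position. The following is a minimal sketch of that idea; `read_profiles` and the sample values are hypothetical, and the header names are assumed to match exactly as written in the table above.

```python
import csv
import io

# Columns defined by the specification; every other column is a locus.
FIXED_COLUMNS = ("scheme", "ST", "hash-type")

def read_profiles(handle):
    """Yield one dict per profile with keys scheme, ST, hash-type, alleles."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = reader.fieldnames or []
    missing = [name for name in FIXED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    for row in reader:
        profile = {name: row[name] for name in FIXED_COLUMNS}
        # '.' (reference allele) and '-' (no call) are kept verbatim.
        profile["alleles"] = {
            locus: allele for locus, allele in row.items()
            if locus not in FIXED_COLUMNS
        }
        yield profile

# Placeholder values for illustration only; they are not real hashsums.
example = (
    "ST\tscheme\taroC\thash-type\tdnaN\n"
    "exampleST\tSalmonella_enterica_cgMLST\t.\tmd5\t-\n"
)
for profile in read_profiles(io.StringIO(example)):
    print(profile)
```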

clusters.tsv

This file has a similar purpose to profiles.tsv, but instead of listing every allele in a profile it assigns each sample to a named cluster within a clustering scheme. Use it if your system has something like allele codes or SNP codes.

Defined columns in clusters.tsv

Columns can be in any order and so the column number is just a suggestion.

Label	column number (1-based)	definition	example(s)
sample	1	The name of your strain, sample, or genome	`LT2`
clusterScheme	2	The name of the scheme for clustering	`alleleCode`
clusterName	3	The cluster group	The value of the cluster group in this cluster scheme
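
One natural use of this file is to group samples by cluster. The sketch below shows one way to do that; `read_clusters`, the second sample name, and the cluster values are hypothetical and used only for illustration.

```python
import csv
import io
from collections import defaultdict

def read_clusters(handle):
    """Map (clusterScheme, clusterName) to the list of samples in that cluster."""
    groups = defaultdict(list)
    # Columns are looked up by header name, so their order does not matter.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        key = (row["clusterScheme"], row["clusterName"])
        groups[key].append(row["sample"])
    return dict(groups)

# Hypothetical rows; only LT2 and alleleCode come from the table above.
example = (
    "sample\tclusterScheme\tclusterName\n"
    "LT2\talleleCode\tcluster-A\n"
    "sample2\talleleCode\tcluster-A\n"
)
print(read_clusters(io.StringIO(example)))
```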

alleles.tsv

This file has two sections: a header and a body. The header lines start with ## and indicate information about the file itself.

The first line of alleles.tsv should describe the file format and version like ## hash-alleles-format v0.2

Lines starting with single pound signs are comments and can be ignored.

There is an optional comment line allowed between the header and body which can describe the fields:

# locus  allele  hash-type  attributes

After the headers and this optional line describing the fields, each line in the file is an allele definition with mandatory fields. Attributes is the only optional field. In a database folder, alleles.tsv can be separated into multiple files by placing letters between alleles and .tsv, e.g., alleles.aa.tsv, alleles.ab.tsv, ... , alleles.yz.tsv, alleles.zz.tsv.

The fields:

Locus: the locus name. Must match the regex /[A-Z0-9_-]+/i.
Allele: the hashsum of the sequence in base64.
hash-type: the algorithm that hashed the sequence. The resulting hash should be in base64 format. There is only one valid value at this time: md5. This field is case insensitive. See the hashing appendix for more information.
attributes: optional fields in GFF attributes format.
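
The header, comment, and body rules above could be read with a sketch like the following. `read_alleles` is a hypothetical name, and the sketch assumes that body fields are separated by single tab characters, as the .tsv extension suggests.

```python
import io

def read_alleles(handle):
    """Parse alleles.tsv text into (headers, records)."""
    headers, records = [], []
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        if line.startswith("##"):
            # Header line: information about the file itself.
            headers.append(line[2:].strip())
        elif line.startswith("#"):
            # Single pound sign: a comment, ignored.
            continue
        else:
            # Assumption: fields are separated by single tabs.
            fields = line.split("\t")
            if len(fields) < 3:
                raise ValueError(f"expected at least 3 fields: {line!r}")
            records.append({
                "locus": fields[0],
                "allele": fields[1],
                # hash-type is case insensitive, so normalize it.
                "hash-type": fields[2].lower(),
                "attributes": fields[3] if len(fields) > 3 else "",
            })
    return headers, records

example = (
    "## hash-alleles-format v0.3\n"
    "# locus\tallele\thash-type\n"
    "aroC\t6GUMqxkMYXpIDEPWB7GXJg\tmd5\n"
)
print(read_alleles(io.StringIO(example)))
```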

Attributes field

The attributes are in the fourth column and are in the GFF attributes format.

attributes are key/values separated by =.
different attributes are separated with ;.
keys and values are case insensitive. ChewBBACA should be interpreted the same as chewbbaca.
defined attributes are allele-caller, allele-caller-version, sequencing-platform, sequencing-platform-model, assembler, assembler-version
Fields with version should have values in semver format, e.g., 3.0.0.
Values should be quoted. Values cannot have the " character because it is reserved. Values are allowed to have single quotes ' however.
example attributes: allele-caller="chewbbaca";allele-caller-version="2";sequencing-platform="A fake 'SNP' platform"
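
The rules above can be turned into a small parser. This is a sketch under two assumptions that go slightly beyond the text: every value is quoted (the text says "should"), and keys are lowercased while values are kept verbatim so that callers can compare them case-insensitively when needed. `parse_attributes` is a hypothetical name.

```python
import re

# One key="value" pair followed by ';' or the end of the string.
# A value may hold anything except the reserved double quote.
ATTRIBUTE_RE = re.compile(r'\s*([^=;"\s]+)="([^"]*)"\s*(?:;|$)')

def parse_attributes(text):
    """Return {key: value} from an attributes string, with lowercased keys."""
    text = text.strip()
    attributes = {}
    pos = 0
    while pos < len(text):
        match = ATTRIBUTE_RE.match(text, pos)
        if match is None:
            raise ValueError(f"cannot parse attributes at offset {pos}: {text[pos:]!r}")
        key, value = match.groups()
        # Keys are case insensitive, so store them lowercased.
        attributes[key.lower()] = value
        pos = match.end()
    return attributes

example = (
    'allele-caller="chewbbaca";allele-caller-version="2";'
    "sequencing-platform=\"A fake 'SNP' platform\""
)
print(parse_attributes(example))
```

Because the pattern only ends a value at a double quote, a quoted value that itself contains semicolons is kept in one piece.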

Defined attributes for alleles.tsv

Attribute	Data type	Description	Example
allele-caller	String	The software used to call the allele	ChewBBACA
allele-caller-version	Version	The version of the allele-caller	2.1.0
allele-caller-options	String	Any non-default options used in the allele caller	--size-threshold 0.3
sequencing-platform	String	The sequencing platform used to sequence this allele	Illumina
sequencing-platform-model	String	The model name of the sequencing platform	MiSeq
assembler	String	The software used to assemble the raw reads from the sequencer	SPAdes
assembler-version	Version	The version of the assembler software	3.13
assembler-options	String	Any non-default options used in the assembler	--careful
start-sequence	String	The first nucleotides of the allele, usually the start codon	ATG
stop-sequence	String	The last nucleotides of the allele, usually the stop codon, in the forward direction	TGA
length	Integer	The number of nucleotides in the allele	947
CIGAR experimental	String	A CIGAR string describing the match to the reference sequence. Specification for the CIGAR string is described in the SAM specification. This field requires another field `ref`. `M` is discouraged, as it does not distinguish between a match and a mismatch. Instead, use `Y` for match and `X` for mismatch. Normally, a CIGAR has `=` for a match, but unfortunately this is a reserved character already in the attributes field. Assumes the reference is the single reference sequence in the database for this locus. If multiple references exist for this locus, then `ref` is required.	30Y5I30Y
SNP experimental	String	A SNP notation describing what was the reference nucleotide and what is the new nucleotide. The format is concatenated and has three fields: reference base(s), position, allele base(s). Coordinates are 1-based. Can describe indels too. Multiple SNP values can be separated by semicolons. This field is available but discouraged if you can use CIGAR instead. Assumes the reference is the single reference sequence in the database for this locus. If multiple references exist for this locus, then `ref` is required.	SNP in the 5th position, insertion in the 10th position from A to two Gs, and deletion of ATG in the 30th position: `A5G;A10GG;ATG30`
ref	String	The identifier of the reference allele that this allele was compared against. Do not include extra information after the whitespace in an identifier, if it exists. The allele must exist in `refs.fasta`.	aroC_1
was	String	The original identifier of the allele	`adk_1` or `1`
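
To illustrate the experimental CIGAR attribute, the sketch below parses a CIGAR string and reports how many nucleotides it spans on the allele and on the reference. Which operations consume which sequence follows the SAM specification, with `Y` treated like SAM's `=`. The function name `cigar_lengths` is hypothetical, and cross-checking the result against the `length` attribute is a suggestion, not a requirement of this format.

```python
import re

# A count followed by one operation letter. '=' is excluded because it is
# reserved in the attributes field; Y takes its place.
CIGAR_OP_RE = re.compile(r"(\d+)([MIDNSHPXY])")

# Operations that advance along the allele (query) and along the reference.
CONSUMES_ALLELE = set("MYXIS")
CONSUMES_REFERENCE = set("MYXDN")

def cigar_lengths(cigar):
    """Return (allele_length, reference_length) implied by a CIGAR string."""
    ops = CIGAR_OP_RE.findall(cigar)
    # Reassembling the parsed operations must reproduce the input exactly.
    if not ops or "".join(count + op for count, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    allele_len = sum(int(count) for count, op in ops if op in CONSUMES_ALLELE)
    ref_len = sum(int(count) for count, op in ops if op in CONSUMES_REFERENCE)
    return allele_len, ref_len

print(cigar_lengths("30Y5I30Y"))  # (65, 60)
```

For the example from the table, 30 matches, a 5-base insertion, and 30 more matches describe a 65-nucleotide allele aligned to 60 reference nucleotides.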

Examples for alleles.tsv

Only required fields

## hash-alleles-format v0.3
# locus allele  hash-type
aroC    6GUMqxkMYXpIDEPWB7GXJg  md5
aroC    YaT2ElkUSm8IvbW6g/hxSg  md5
aroC    PO9EWkqaMIxKj7kRtQUt5A  md5
dnaN    1AF2Py325f6H4eB9PBcP5g  md5
dnaN    8khwhE2lNGi1ARavWpiPnw  md5
dnaN    D9pt/Lk/D8BOMO0ZmkGSlA  md5
hemD    /kXf/b7JIRAdxKQR2OWB2A  md5
hemD    Z1wFdsONZPsiBY0We8badg  md5
hemD    Xqa0fIqryOcOG390D1HfNQ  md5
hisD    n3YsJGxULFLJTFAiymIxHA  md5
hisD    PDnj+IrIcQ0hqksnlaInLA  md5
hisD    rJG6kUykD7QR+6kVB+3uag  md5
purE    3+0cJja2LgafXtLwFWlSRg  md5
purE    /58bj78QhjGigSl9bPtV/A  md5
purE    8iP6DvzzYcjFiBOmOVWydg  md5
sucA    SBtkVPM/rnh1tJeMFAlOww  md5
sucA    PcnmEBZq9wOow/WyVMFHZg  md5
sucA    VLbw66gQl3nDdppBRX5R/Q  md5
thrA    6uxkS0Eb0LOrHghvur0pyQ  md5
thrA    3Iobq+fag08oHdKCJ9b5tQ  md5
thrA    dhqKwb2BFpPAvDaWt3+9yA  md5

With attributes field

The third entry has no attributes field, to help illustrate that alleles with and without an attribute field can be in the same file.

## hash-alleles-format v0.3
# locus allele  hash-type
aroC    6GUMqxkMYXpIDEPWB7GXJg  md5  allele-caller="chewbbaca";allele-caller-version="2";sequencing-platform="Illumina";sequencing-platform-model="MiSeq";start-sequence="GTT";stop-sequence="GGT"
aroC    YaT2ElkUSm8IvbW6g/hxSg  md5  allele-caller="stringmlst";allele-caller-version="0.6.3";sequencing-platform="Illumina";sequencing-platform-model="NovaSeq"
aroC    PO9EWkqaMIxKj7kRtQUt5A  md5

Appendix

Accepted hashing methods

Hashing methods must output base64 instead of their usual hexadecimal output. One way to do this is with openssl, like so:

echo -ne $'22\t2F\ta4\tA2\tAB' | \
  openssl dgst -md5 -binary | \
  openssl enc -base64

The first line is the echo statement. The -e flag tells echo to interpret backslash escapes such as \t (bash's $'' quoting already expands them, so the flag is harmless here), and -n suppresses the trailing newline, which would otherwise change the output hash. The openssl dgst subcommand computes the md5 digest, and -binary emits it as raw bytes rather than hexadecimal. The second openssl command, with the enc subcommand, converts those bytes to base64.

name	note
crc32	Known to cause collisions. See: #11, #13, and #16.
md5
sha1
sha256
plaintext	No hashing; actual nucleic acid sequence
plaintext-protein	No hashing; actual amino acid sequence. Not supported in the executables as of version 0.4
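
For readers who prefer a library call to the openssl pipeline, the sketch below hashes an allele sequence with Python's standard library. Several details are assumptions rather than part of this specification: the function name `hash_allele`, hashing the sequence exactly as given (no case normalization, no trailing newline), and the `strip_padding` option, which is offered only because the allele examples in this document appear without trailing `=` characters. crc32 is left out because this document does not describe how its value is encoded as bytes.

```python
import base64
import hashlib

# Assumption: only the hashlib-backed methods from the table are covered.
HASHERS = {"md5": hashlib.md5, "sha1": hashlib.sha1, "sha256": hashlib.sha256}

def hash_allele(sequence, hash_type="md5", strip_padding=False):
    """Return the base64 hashsum of a sequence, or the sequence for plaintext."""
    hash_type = hash_type.lower()  # hash-type is case insensitive
    if hash_type in ("plaintext", "plaintext-protein"):
        return sequence  # no hashing for plaintext types
    if hash_type not in HASHERS:
        raise ValueError(f"unsupported hash-type: {hash_type!r}")
    # Assumption: the sequence is hashed exactly as given.
    digest = HASHERS[hash_type](sequence.encode("ascii")).digest()
    encoded = base64.b64encode(digest).decode("ascii")
    return encoded.rstrip("=") if strip_padding else encoded

print(hash_allele("ATGAAATGA"))
print(hash_allele("ATGAAATGA", "sha256"))
```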
