INITIAL DRAFT (to be replaced by the version number)
The macromolecular transmission format (MMTF) is a binary encoding of biological structures. It includes the coordinates, the topology and associated data. Specifically, a large subset of the data in mmCIF or PDB files can be represented. Pronounced goals are a reduced file size for efficient transmission over the Internet or from hard disk to memory and fast decoding/parsing speed. Additionally the format aims to be easy to understand and implement to facilitates its dissemination. For testing encoder and decoder implementations a test suite is available.
This specification describes a set of required and optional fields representing molecular structures and associated data. The fields are limited to six primitive types for efficient serialization and deserialization using the binary MessagePack format. The fields in MMTF are stored in a binary container format. The top-level of the container contains the field names as keys and field data as values. To describe the layout of data in MMTF we use the JSON notation throughout this document.
The first step of decoding MMTF is decoding the MessagePack-encoded container. Many of the resulting MMTF fields do not need to be decoded any further. However, to allow for custom compression some fields are given as binary data and must be decoded using the strategies described below. For maximal size savings the binary MMTF data can be compressed using general purpose algorithms like gzip or brotli.
The fields in the MMTF format group data of the same type together to create a flat data-structure, for instance, the coordinates of all atoms are stored together, instead of in atom objects with other atom-related data. This avoids imposing a deeply-nested hierarchical structure on consuming programs, while still allowing efficient traversal of models, chains, groups, and atoms.
In principle any serialization format that supports the types described below can be used to store the above fields. MMTF files (specifically files with the .mmtf
extension) use the binary MessagePack serialization format.
The MessagePack format (version 5) is used as the binary container format of MMTF. The MessagePack specification describes the data types and the data layout. Encoding and decoding libraries for MessagePack are available in many languages, see the MessagePack website.
The test suite will additionally provide files representing the MMTF fields as JSON to help validating implementations of this specification.
The following types are used for the fields in this specification.
String
An UTF-8 encoded string.Float
A 32-bit floating-point number.Integer
A 32-bit signed integer.Map
A data structure of key-value pairs where each key is unique. Also known as "dictionary", "hash".Array
A list of elements that may be of different type.Binary
A list of unsigned 8-bit integer numbers representing binary data.
The Binary
type is used here to store arrays with values of the same type. Such "typed arrays" are not directly supported by the msgpack
format. However it is straightforward to work with arrays of simple numeric types by re-interpreting the data in a Binary
field:
- List of 8-bit unsigned integers.
- List of 8-bit signed integers.
- List of 16-bit unsigned integers in big-endian format.
- List of 16-bit signed integers in big-endian format.
- List of 32-bit unsigned integers in big-endian format.
- List of 32-bit signed integers in big-endian format.
- List of 32-bit floating-point numbers in big-endian format.
For example, for an array of 32-bit integers groups of 4 bytes are interpreted as 32-bit integers. All such multi-byte types must be represented in big-endian format.
Note that the MessagePack format limits the String
, Map
, Array
and Binary
type to (2^32)-1 entries per instance.
The following encoding strategies are used to compress the data contained in MMTF files.
Run-length decoding can generally be used to compress lists that contain stretches of equal values. Instead of storing each value itself, stretches of equal values are represented by the value itself and the occurrence count, that is a value/count pair.
Examples:
Starting with the encoded list of value/count pairs. In the following example there are three pairs 1, 10
, 2, 1
and 1, 4
. The first entry in a pair is the value to be repeated and the second entry denotes how often the value must be repeated.
[ 1, 10, 2, 1, 1, 4 ]
Applying run-length decoding by repeating, for each pair, the value as often as denoted by the count entry.
[ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1 ]
Another example that has strings instead of numbers as values:
[ "A", 5, "B", 2 ]
Applying run-length decoding:
[ "A", "A", "A", "A", "A", "B", "B" ]
Delta encoding is used to store lists of numbers. Instead of storing the numbers themselves, the differences (deltas) between the numbers are stored. When the values of the deltas are smaller than the numbers themselves they can be more efficiently packed to require less space.
Note that lists which change by an identical amount for a range of consecutive values lend themselves to subsequent run-length encoding.
Example:
Starting with the encoded list of delta values:
[ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1 ]
Applying delta decoding. The first entry in the list is left as is, the second is calculated as the sum of the first and the second (not decoded) value, the third as the sum of the second (decoded) and third (not decoded) value and so forth.
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16 ]
Split-list delta encoding is an adjusted delta encoding to handle lists with some intermittent large delta values. The list is split into two arrays, called "big" and "small". The "big" array of 32-bit signed integers alternates between a large delta value (>=2^15) and the number of subsequent small delta values (<2^15). The "small" array of 16-bit signed integers holds the small values, that is, the values fitting into a 16-bit signed integer.
Example:
Starting with the "big" and the "small" arrays:
[ 1200, 3, 100, 2 ] // big
[ 0, 2, -1, -3, 5 ] // small
Building the decoded array step by step. The first value, 1200
, from the "big" array is the initial delta:
[ 1200 ]
It is followed by 3
delta values from the "small" array 0, 2, -1
:
[ 1200, 1200, 1202, 1201 ]
Adding the next delta value in the "big" array, 100
:
[ 1200, 1200, 1202, 1201, 1301 ]
Followed by 2
delta values from the "small" array -3, 5
to create the final array of 32-bit signed integers::
[ 1200, 1200, 1202, 1201, 1301, 1298, 1303 ]
In integer encoding, floating point numbers are converted to integer values by multiplying with a factor and discard everything after the decimal point. Depending on the multiplication factor this can change the precision but with a sufficiently large factor it is lossless. The integer values can then often be compressed with delta encoding which is the main motivation for it.
Example:
Starting with the list of integer values:
[ 100, 100, 100, 100, 50, 50 ]
Applying integer decoding with a divisor of 100
:
[ 1.00, 1.00, 1.00, 1.00, 0.50, 0.50 ]
For dictionary encoding a Map
is created to store key-value pairs. References to the keys can then be used instead of repeating the values over and over again. Lists of keys can afterwards be compressed with delta and run-length encoding.
Example:
First create a Map
to hold values that are referable by keys. In the following example the are two keys, 1
and 2
with some values associated.
{
"1": {
"A": [ 1, 2, 3 ],
"B": "HELLO"
},
"2": {
"A": [ 4, 5, 6 ],
"B": "HEY"
}
}
The keys can then be used to reference the values as often as needed:
[ 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1 ]
Note that in the above example the list of keys can also be efficiently run-length encoded:
[ 1, 4, 2, 4, 1, 3 ]
The following table lists all top level fields, including their type and whether they are required or optional. The top-level fields themselves are stores as a Map
.
Required field
Type: String.
Description: The version number of the specification the file adheres to. The specification follows a semantic versioning scheme. In a version number MAJOR.MINOR
, the MAJOR
part is incremented when specification changes are incompatible with previous versions. The MINOR
part is changed for additions to the specification that are backwards compatible.
Examples:
The current, unreleased, in development specification:
"0.1"
A future version with additions backwards compatible to versions "1.0" and "1.1":
"1.2"
Required field
Type: String.
Description: The name and version of the software used to produce the file. For development versions it can be useful to also include the checksum of the commit. The main purpose of this field is to identify the software that has written a file, for instance because it has format errors.
Examples:
A software name and the checksum of a commit:
"RCSB PDB mmtf-java-encoder---version: 6b8635f8d319beea9cd7cc7f5dd2649578ac01a0"
Another software name and its version number:
"NGL mmtf exporter v1.2"
Optional field
Type: String.
Description: A short description of the structural data included in the file.
Example:
"CRAMBIN"
Optional field
Type: String.
Description: An ID for the structure, for example the PDB ID if applicable. If not in conflict with the format of the ID, it must be given in uppercase.
Example:
"1CRN"
Optional field
Type: String with the format YYYY-MM-DD
, where YYYY
stands for the year in the Gregorian calendar, MM
is the month of the year between 01 (January) and 12 (December), and DD
is the day of the month between 01 and 31.
Description: A date that relates to the deposition of the structure in a database, e.g. the wwPDB archive.
Example:
For example, the second day of October in the year 2005 is written as:
"2005-10-02"
Optional field
Type: String with the format YYYY-MM-DD
, where YYYY
stands for the year in the Gregorian calendar, MM
is the month of the year between 01 (January) and 12 (December), and DD
is the day of the month between 01 and 31.
Description: A date that relates to the release of the structure in a database, e.g. the wwPDB archive.
Example:
For example, the third day of December in the year 2013 is written as:
"2013-12-03"
Required field
Type: Integer.
Description: The overall number of atoms in the structure. This also includes atoms at alternate locations.
Example:
1023
Required field
Type: Integer.
Description: The overall number of bonds. This number must reflect both the bonds given in bondAtomList
and the bonds given in the groupType
entries in groupList
.
Example:
1142
Optional field
Type: String.
Description: The Hermann-Mauguin space-group symbol.
Example:
"P 1 21 1"
Optional field
Type: Array of six Float values.
Description: List of six values defining the unit cell. The first three entries are the length of the sides a
, b
, and c
in Å. The last three angles are the alpha
, beta
, and gamma
angles in degree.
Example:
[ 10.0, 12.0, 30.0, 90.0, 90.0, 120.0 ]
Optional field
Type: Array
of assembly objects with the following fields:
Name | Type | Description |
---|---|---|
transformList | Array | List of transform objects |
Fields in a transform
object:
Name | Type | Description |
---|---|---|
chainIndexList | Array | Pointers into chain data fields, Integers |
matrix | Array | 4x4 transformation matrix, Floats |
The entries of chainIndexList
are indices into the chainIdList and chainNameList fields.
The elements of the 4x4 transformation matrix
are stored linearly in row major order. Thus, the translational component comprises the 4th, 8th, and 12th element.
Description: List of instructions on how to transform coordinates for a list of chains to create (biological) assemblies. The translational component is given in Å.
Example:
The following example shows two transform objects from PDB ID 4OPJ. The transformation matrix of the first object performs no rotation and a translation of 42.387 Å in dimension x. The second one translates -42.387 Å in dimension x.
[
{
"transformList": [
{
"chainIndexList": [ 0, 4, 6 ],
"matrix": [
1.0, 0.0, 0.0, 42.387,
0.0, 1.0, 0.0, 0.000,
0.0, 0.0, 1.0, 0.000,
0.0, 0.0, 0.0, 1.000
]
}
]
},
{
"transformList": [
{
"chainIndexList": [ 0, 4, 6 ],
"matrix": [
1.0, 0.0, 0.0, -42.387,
0.0, 1.0, 0.0, 0.000,
0.0, 0.0, 1.0, 0.000,
0.0, 0.0, 0.0, 1.000
]
}
]
}
]
Optional field
Type: Array of entity objects with the following fields:
Name | Type | Description |
---|---|---|
chainIndexList | Array | Pointers into chain data fields, Integers |
description | String | Description of the entity |
type | String | Name of the entity type |
sequence | String | Sequence of the full construct in one-letter-code |
The entries of chainIndexList
are indices into the chainIdList and chainNameList fields.
The sequence
string contains the full construct, not just the resolved residues. Its characters are referenced by the entries of the sequenceIndexList field. Further, characters follow the IUPAC single letter code for protein or DNA/RNA residues, otherwise the character 'X'.
Description: List of unique molecular entities within the structure. Each entry in chainIndexList
represents an instance of that entity in the structure.
Vocabulary: Known values for the entity field type
from the mmCIF dictionary are macrolide
, non-polymer
, polymer
, water
.
Example:
[
{
"chainIndexList": [ 0, 1, 2, 3 ],
"description": "some polymer",
"type": "polymer",
"sequence": "MNTYASDE"
},
{
"chainIndexList": [ 4 ],
"description": "some ligand",
"type": "non-polymer",
"sequence": ""
}
]
Optional field
Type: Float.
Description: The experimental resolution in Angstrom. If not applicable the field must be omitted.
Examples:
2.3
Optional field
Type: Float.
Description: The R-free value. If not applicable the field must be omitted.
Examples:
0.203
Optional field
Type: Float.
Description: The R-work value. If not applicable the field must be omitted.
Examples:
0.176
Optional field
Description: The list of experimental methods employed for structure determination.
Vocabulary: Known values from the mmCIF dictionary are ELECTRON CRYSTALLOGRAPHY
, ELECTRON MICROSCOPY
, EPR
, FIBER DIFFRACTION
, FLUORESCENCE TRANSFER
, INFRARED SPECTROSCOPY
, NEUTRON DIFFRACTION
, POWDER DIFFRACTION
, SOLID-STATE NMR
, SOLUTION NMR
, SOLUTION SCATTERING
, THEORETICAL MODEL
, X-RAY DIFFRACTION
.
Example:
[ "X-RAY DIFFRACTION" ]
Optional field
Type: Binary data that is interpreted as an array of 32-bit unsigned integers.
Description: Pairs of values represent indices of covalently bonded atoms. The indices point to the Atom data arrays. Only covalent bonds may be given.
Example:
In the following example there are three bonds, one between the atoms with the indices 0 and 1, one between the atoms with the indices 0 and 2, as well as one between the atoms with the indices 2 and 4.
[ 0, 1, 0, 2, 2, 4 ]
Optional field If it exists bondAtomList must also be present. However bondAtomList
may exist without bondOrderList
.
Type: Binary data that is interpreted as an array of 8-bit unsigned integers, i.e. take as is.
Description: List of bond orders for bonds in bondAtomList
. Must be values between 1 and 3.
Example:
In the following example there are bond orders given for three bonds. The first and third bond have a bond order of 1 while the second bond has a bond order of 2.
[ 1, 2, 1 ]
The number of models in a structure is equal to the length of the chainsPerModel field. The chainsPerModel
field also defines which chains belong to each model.
Required field
Type: Array of Integer numbers. The number of models is thus equal to the length of the chainsPerModel
field.
Description: List of the number of chains in each model. The list allows looping over all models:
# initialize index counter
set modelIndex to 0
# traverse models
for modelChainCount in chainsPerModel
print modelIndex
increment modelIndex by one
Example:
In the following example there are 2 models. The first model has 5 chains and the second model has 8 chains. This also means that the chains with indices 0 to 4 belong to the first model and that the chains with indices 5 to 12 belong to the second model.
[ 5, 8 ]
The number of chains in a structure is equal to the length of the groupsPerChain field. The groupsPerChain
field also defines which groups belong to each chain.
Required field
Type: Array of Integer numbers.
Description: List of the number of groups (aka residues) in each chain. The number of chains is thus equal to the length of the groupsPerChain
field. In conjunction with chainsPerModel
, the list allows looping over all chains:
# initialize index counters
set chainIndex to 0
set chainIndex to 0
# traverse models
for modelChainCount in chainsPerModel
print modelIndex
# traverse chains
for 1 to modelChainCount
print chainIndex
set offset to chainIndex * 4
print chainIdList[ offset : offset + 4 ]
print chainNameList[ offset : offset + 4 ]
increment chainIndex by 1
increment modelIndex by 1
Example:
In the following example there are 3 chains. The first chain has 73 groups, the second 59 and the third 1. This also means that the groups with indices 0 to 72 belong to the first chain, groups with indices 73 to 131 to the second chain and the group with index 132 to the third chain.
[ 73, 59, 1 ]
Required field
Type: Binary data that is interpreted as an array of 8-bit unsigned integers representing ASCII characters.
Decoding: Groups of four consecutive ASCII characters create the list of chain IDs. Note that the ASCII decoding here is optional, a decoding library may choose to pass the array of 8-bit unsigned integers on for performance reasons. Nevertheless we describe all the steps for complete decoding here as an illustration.
Description: List of chain IDs. For storing data from mmCIF files the chainIdList
field should contain the value from the label_asym_id
mmCIF data item and the chainNameList
the auth_asym_id
mmCIF data item. In PDB files there is only a single name/identifier for chains that corresponds to the auth_asym_id
item. When there is only a single chain identifier available it must be stored in the chainIdList
field.
Example:
Starting with the array of 8-bit unsigned integers:
[ 0, 0, 0, 65, 0, 0, 0, 66, 0, 0, 89, 67 ]
Decoding the ASCII characters:
[ "", "", "", "A", "", "", "", "B", "", "", "Y", "C" ]
Creating the list of chain IDs:
[ "A", "B", "YC" ]
Optional field
Type: Binary data that is interpreted as an array of 8-bit unsigned integers representing ASCII characters.
Decoding: Same as for the chainIdList
field.
Description: List of chain names. This field allows to specify an additional set of labels/names for chains. For example, it can be used to store both, the label_asym_id
(in chainIdList
) and the auth_asym_id
(in chainNameList
) from mmCIF files.
The fields in the following sections hold group-related data.
The mmCIF format allows for so-called micro-heterogeneity on the group-level. For groups (residues) with micro-heterogeneity there are two or more entries given that have the same sequence index, group id (and insertion code) but are of a different group type. The defining property is their identical sequence index.
Required field
Type: Array of groupType
objects with the following fields:
Name | Type | Description |
---|---|---|
atomChargeList | Array | List of formal charges as Integers |
atomNameList | Array | List of atom names, 0 to 5 character Strings |
elementList | Array | List of elements, 0 to 3 character Strings |
bondAtomList | Array | List of bonded atom indices, Integers |
bondOrderList | Array | List of bond orders as Integers between 1 and 3 |
groupName | String | The name of the group, 0 to 5 characters |
singleLetterCode | String | The single letter code, 1 character |
chemCompType | String | The chemical component type |
The element name must follow the IUPAC standard where only the first character is capitalized and the remaining ones are lower case, for instance Cd
for Cadmium.
Two consecutive entries in bondAtomList
representing indices of covalently bound atoms. The indices point into the atomChargeList
, atomNameList
, and elementList
fields.
The singleLetterCode
is the IUPAC single letter code for protein or DNA/RNA residues, otherwise the character 'X'.
Description: Common group (residue) data that is referenced via the groupType
key by group entries.
Vocabulary: Known values for the groupType field chemCompType
from the mmCIF dictionary are D-beta-peptide, C-gamma linking
, D-gamma-peptide, C-delta linking
, D-peptide COOH carboxy terminus
, D-peptide NH3 amino terminus
, D-peptide linking
, D-saccharide
, D-saccharide 1,4 and 1,4 linking
, D-saccharide 1,4 and 1,6 linking
, DNA OH 3 prime terminus
, DNA OH 5 prime terminus
, DNA linking
, L-DNA linking
, L-RNA linking
, L-beta-peptide, C-gamma linking
, L-gamma-peptide, C-delta linking
, L-peptide COOH carboxy terminus
, L-peptide NH3 amino terminus
, L-peptide linking
, L-saccharide
, L-saccharide 1,4 and 1,4 linking
, L-saccharide 1,4 and 1,6 linking
, RNA OH 3 prime terminus
, RNA OH 5 prime terminus
, RNA linking
, non-polymer
, other
, peptide linking
, peptide-like
, saccharide
.
Example:
{
"0": {
"atomChargeList": [ 0, 0, 0, 0 ],
"atomNameList": [ "N", "CA", "C", "O" ],
"elementList": [ "N", "C", "C", "O" ],
"bondAtomList": [ 1, 0, 2, 1, 3, 2 ],
"bondOrderList": [ 1, 1, 2 ],
"chemCompType": "PEPTIDE LINKING",
"groupName": "GLY",
"singleLetterCode": "G"
}
}
Required field
Type: Binary data that is interpreted as an array of 32-bit signed integers.
Description: List of pointers to groupType
entries in groupList
by their keys. One entry for each residue, thus the number of residues is equal to the length of the groupTypeId
field.
Example:
In the following example there are 5 groups. The 1st, 4th and 5th reference the groupType
with key 2
, the 2nd references key 0
and the third references key 1
.
[ 2, 0, 1, 2, 2 ]
Required field
Type: Binary data that is interpreted as an array of 32-bit signed integers.
Decoding: First, run-length decode the input array of 32-bit signed integers into a second array of 32-bit signed integers. Finally apply delta decoding to the second array, which can be done in-place, to create the output array of 32-bit signed integers.
Description: List of group (residue) numbers. One entry for each group/residue.
Example:
Starting with the array of 32-bit signed integers:
[ 1, 10, -10, 1, 1, 4 ]
Applying run-length decoding:
[ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -10, 1, 1, 1, 1 ]
Applying delta decoding:
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5 ]
Optional field
Type: Binary data that is interpreted as an array of 8-bit signed integers.
Description: List of secondary structure assignments coded according to the following table, which shows the eight different types of secondary structure the DSSP algorithm distinguishes. If the field is included there must be an entry for each group (residue) either in all models or only in the first model.
Code | Name |
---|---|
0 | pi helix |
1 | bend |
2 | alpha helix |
3 | extended |
4 | 3-10 helix |
5 | bridge |
6 | turn |
7 | coil |
-1 | undefined |
Example:
Starting with the array of 8-bit signed integers:
[ 7, 7, 2, 2, 2, 2, 2, 2, 2, 7 ]
Optional field
Type: Binary data that is interpreted as an array of 32-bit signed integers.
Decoding: Run-length decode the input array of 32-bit signed integers into an array of 8-bit unsigned integers representing ASCII characters.
Description: List of insertion codes, one for each group (residue).
Example:
Starting with the array of 32-bit signed integers:
[ 0, 5, 65, 3, 66, 2 ]
Applying run-length decoding:
[ 0, 0, 0, 0, 0, 65, 65, 65, 66, 66 ]
If needed the ASCII codes can be converted to an Array
of String
s with the zeros as zero-length String
s:
[ "", "", "", "", "", "A", "A", "A", "B", "B" ]
Required field
Type: Binary data that is interpreted as an array of 32-bit signed integers.
Decoding: First, run-length decode the input array of 32-bit signed integers into a second array of 32-bit signed integers. Finally apply delta decoding to the second array, which can be done in-place, to create the output array of 32-bit signed integers.
Description: List of indices that point into the sequence
property of an entity object in the entityList field that is associated with the chain the group belongs to (i.e. the index of the chain is included in the chainIndexList
of the entity). There is one entry for each group (residue). It must be set to -1
when a group entry has no associated entity (and thus no sequence), for example water molecules.
Example:
Starting with the array of 32-bit signed integers:
[ 1, 10, -10, 1, 1, 4 ]
Applying run-length decoding:
[ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -10, 1, 1, 1, 1 ]
Applying delta decoding:
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0, 1, 2, 3, 4 ]
The fields in the following sections hold atom-related data.
The mmCIF format allows for alternate locations of atoms. Such atoms have multiple entries in the atom-level fields (including the fields in the groupList entries). They can be identified and distinguished by their distinct values in the altLocList field.
Optional field
Type: Binary data that is interpreted as an array of 32-bit signed integers.
Decoding: First, run-length decode the input array of 32-bit signed integers into a second array of 32-bit signed integers. Finally apply delta decoding to the second array, which can be done in-place, to create the output array of 32-bit signed integers.
Description: List of atom serial numbers. One entry for each atom.
Example:
Starting with the array of 32-bit signed integers:
[ 1, 7, 2, 1 ]
Applying run-length decoding:
[ 1, 1, 1, 1, 1, 1, 1, 2 ]
Applying delta decoding:
[ 1, 2, 3, 4, 5, 6, 7, 9 ]
Optional field
Type: Binary data that is interpreted as an array of 32-bit signed integers.
Decoding: Run-length decode the input array of 32-bit signed integers into an array of 8-bit unsigned integers representing ASCII characters.
Description: List of alternate location labels, one for each atom.
Example:
Starting with the array of 32-bit signed integers:
[ 0, 5, 65, 3, 66, 2 ]
Applying run-length decoding:
[ 0, 0, 0, 0, 0, 65, 65, 65, 66, 66 ]
If needed the ASCII codes can be converted to an Array
of String
s with the zeros as zero-length String
s:
[ "", "", "", "", "", "A", "A", "A", "B", "B" ]
Optional fields
Type: Two Binary data fields that are interpreted as array of 32-bit signed integers and array of 16-bit signed integers.
Decoding: First split-list delta decode the input array of 32-bit signed integers and the array of 16-bit signed integers into a second array of 32-bit signed integers. Finally integer decode the second array using 100
as the divisor to create an array of 32-bit floating-point values.
Description: List of atom B-factors in in Å^2. One entry for each atom.
Example:
Starting with the "big" array of 32-bit signed integers and the "small" array of 16-bit signed integers:
[ 200, 3, 100, 2 ] // big
[ 0, 2, -1, -3, 5 ] // small
Applying split-list delta decoding to create an array of 32-bit signed integers:
[ 200, 200, 202, 201, 301, 298, 303 ]
Applying integer decoding with a divisor of 100
to create an array of 32-bit floating-point values:
[ 2.00, 2.00, 2.02, 2.01, 3.01, 2.98, 3.03 ]
Required fields
Type: Two Binary data fields that are interpreted as array of 32-bit signed integers and array of 16-bit signed integers.
Decoding: First split-list delta decode the input array of 32-bit signed integers and the array of 16-bit signed integers into a second array of 32-bit signed integers. Finally integer decode the second array using 1000
as the divisor to create an array of 32-bit floating-point values.
Description: List of x, y, and z atom coordinates, respectively, in Å. One entry for each atom and coordinate.
Note: To clarify, the data for each coordinate is stored in a separate pair of arrays.
Example:
Starting with the "big" array of 32-bit signed integers and the "small" array of 16-bit signed integers:
[ 1200, 3, 100, 2 ] // big
[ 0, 2, -1, -3, 5 ] // small
Applying split-list delta decoding to create an array of 32-bit signed integers:
[ 1200, 1200, 1202, 1201, 1301, 1298, 1303 ]
Applying integer decoding with a divisor of 1000
to create an array of 32-bit floating-point values:
[ 1.000, 1.200, 1.202, 1.201, 1.301, 1.298, 1.303 ]
Optional field
Description: List of atom occupancies, one for each atom.
Type: Binary data that is interpreted as an array of 32-bit signed integers.
Decoding: First, run-length decode the input array of 32-bit signed integers into a second array of 32-bit signed integers. Finally apply integer decoding using 100
as the divisor to the second array to create a array of 32-bit floating-point values.
Example:
Starting with the array of 32-bit signed integers:
[ 100, 4, 50, 2 ]
Applying run-length decoding:
[ 100, 100, 100, 100, 50, 50 ]
Applying integer decoding with a divisor of 100
to create an array of 32-bit floating-point values:
[ 1.00, 1.00, 1.00, 1.00, 0.50, 0.50 ]
The following traversal pseudo code assumes that all fields have been decoded and specifically that the split-list delta encoded fields are decoded into fields named like in the following example: xCoordBig
/xCoordSmall
decode into xCoordList
.
# initialize index counters
set modelIndex to 0
set chainIndex to 0
set groupIndex to 0
set atomIndex to 0
# traverse models
for modelChainCount in chainsPerModel
print modelIndex
# traverse chains
for 1 to modelChainCount
print chainIndex
set offset to chainIndex * 4
print chainIdList[ offset : offset + 4 ]
print chainNameList[ offset : offset + 4 ]
set chainGroupCount to groupsPerChain[ chainIndex ]
# traverse groups
for 1 to chainGroupCount
print groupIndex
print groupIdList[ groupIndex ]
print insCodeList[ groupIndex ]
print secStructList[ groupIndex ]
print sequenceIndexList[ groupIndex ]
print groupTypeList[ groupIndex ]
set group to groupList[ groupTypeList[ groupIndex ] ]
print group.groupName
print group.singleLetterCode
print group.chemCompType
set atomOffset to atomIndex
set groupBondCount to group.bondAtomList.length / 2
for i in 1 to groupBondCount
print atomOffset + group.bondAtomList[ i * 2 ] # atomIndex1
print atomOffset + group.bondAtomList[ i * 2 + 1 ] # atomIndex2
print group.bondOrderList[ i ]
set groupAtomCount to group.atomNameList.length
# traverse atoms
for i in 1 to groupAtomCount
print atomIndex
print xCoordList[ atomIndex ]
print yCoordList[ atomIndex ]
print zCoordList[ atomIndex ]
print bFactorList[ atomIndex ]
print atomIdList[ atomIndex ]
print altLocList[ atomIndex ]
print occupancyList[ atomIndex ]
print group.atomChargeList[ i ]
print group.atomNameList[ i ]
print group.elementList[ i ]
increment atomIndex by 1
increment groupIndex by 1
increment chainIndex by 1
increment modelIndex by 1
# traverse inter-group bonds
for i in 1 to bondAtomList.length / 2
print bondAtomList[ i * 2 ] # atomIndex1
print bondAtomList[ i * 2 + 1 ] # atomIndex2
print bondOrderList[ i ]