Specification on PDB input files

Valid input

NucleicNet operates on protein inputs in PDB file format. Place these input files in the "GridData" folder, as indicated in our bash script. To allow uniform processing of PDB files, users are advised to check the validity of their input files against the following criteria:

The file should contain only rows starting with "ATOM" or "TER". Chain termination is indicated by "TER".

The PDB file should contain only protein (i.e. no RNA, DNA, solvent, ligands, HETATM records, et cetera).

The protein chain should not contain non-standard amino acids.

The file should not contain any chemical species other than protein.

Each PDB file should contain only one model (cf. NMR structures, which often contain several models). If multiple models are included in the same file, only the first one will be analysed.

The file name can be any four-character alphanumeric string that starts with a digit, followed by the ".pdb" suffix (e.g. "03Aa.pdb" and "2357.pdb" are valid, but "t3f4.pdb" is not).

The file name should not contain any non-alphanumeric character other than the "." in ".pdb".

Include the chain ID and keep all fields intact, as specified in ftp://ftp.wwpdb.org/pub/pdb/doc/format_descriptions/Format_v33_Letter.pdf

Both residue and atom indices must be base-10 integers. (Some PDB files are written with base-16 integers to accommodate their very large size. These files are currently unsupported.)

Protein residues should preferably be intact atom-wise. Users may run PDBFixer (https://anaconda.org/omnia/pdbfixer) to fulfil this requirement, or simply remove the residues that are not intact.
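Several of the criteria above can be checked automatically before running NucleicNet. The following is a minimal Python sketch, not part of NucleicNet itself. The names is_valid_filename and find_problems are hypothetical, and the column positions are assumed to follow the fixed-width ATOM record layout of the wwPDB format description linked above. It does not check whether residues are intact atom-wise, and it cannot recognise a base-16 index that happens to contain only the digits 0 to 9.

```python
import re

# The 20 standard amino acids (three-letter residue names).
STANDARD_RESIDUES = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}

# Four alphanumeric characters, the first one a digit, then ".pdb".
FILENAME_RE = re.compile(r"[0-9][0-9A-Za-z]{3}\.pdb")

# A base-10 integer; residue numbers may be negative.
DECIMAL_RE = re.compile(r"-?[0-9]+")


def is_valid_filename(name):
    """Return True if the file name matches the naming rule above."""
    return FILENAME_RE.fullmatch(name) is not None


def find_problems(lines):
    """Return a list of messages, one per rule violation found."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines are tolerated
        record = line[:6].strip()
        if record == "TER":
            continue
        if record != "ATOM":
            problems.append(f"line {number}: unexpected record {record!r}")
            continue
        # Fixed-width fields of an ATOM record (0-based slices).
        serial = line[6:11].strip()
        res_name = line[17:20].strip()
        chain_id = line[21:22].strip()
        res_seq = line[22:26].strip()
        if DECIMAL_RE.fullmatch(serial) is None:
            problems.append(f"line {number}: atom index {serial!r} is not base-10")
        if DECIMAL_RE.fullmatch(res_seq) is None:
            problems.append(f"line {number}: residue index {res_seq!r} is not base-10")
        if res_name not in STANDARD_RESIDUES:
            problems.append(f"line {number}: non-standard residue {res_name!r}")
        if not chain_id:
            problems.append(f"line {number}: missing chain ID")
    return problems
```

To use it, read a candidate file with open(path).readlines(), pass the lines to find_problems, and treat an empty result as passing these particular checks. Because only "ATOM" and "TER" rows are accepted, records such as "MODEL" or "HETATM" are reported as unexpected.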
