feat/docs/cases: Covalent Bond Input #329

Open
wants to merge 6 commits into
base: dev
70 changes: 64 additions & 6 deletions apps/protein_folding/helixfold3/README.md
@@ -63,7 +63,15 @@ Note: If you have a different version of python3 and cuda, please refer to [here
#### Install Maxit
The conversion between `.cif` and `.pdb` relies on [Maxit](https://sw-tools.rcsb.org/apps/MAXIT/index.html).
Download Maxit source code from https://sw-tools.rcsb.org/apps/MAXIT/maxit-v11.100-prod-src.tar.gz. Untar and follow
its `README` to complete installation. If you encounter an error saying that your GCC version is not supported (9.4.0, for example), edit `etc/platform.sh` as shown below and rerun the compilation:

```bash
# Check if it is a Linux platform
Linux)
# Check if it is GCC version 4.x
gcc_ver=`gcc --version | grep -e " 4\."` # edit `4\.` to `9\.`
if [[ -z $gcc_ver ]]
```

### Usage

@@ -96,10 +104,11 @@ The script `scripts/download_all_data.sh` can be used to download and set up all

There are some demo inputs under `./data/` for testing and reference. Input data is a JSON file containing
several entities such as `protein`, `ligand`, `nucleic acids`, and `ion`. Protein and nucleic acid inputs are given as sequences.
HelixFold3 accepts ligands as SMILES, CCD ids, or small-molecule files; see `/data/demo_6zcy_smiles.json` and `data/demo_p450_heme_sdf.json`
for examples. Small-molecule files can be in any format that Open Babel can read; run `obabel -L formats | grep -v 'Write-only'` to list them.

An example of input data is as follows:

```json
{
"entities": [
@@ -116,11 +125,59 @@ A example of input data is as follows:
]
}
```
Example of **covalently modified** input:

```json
{
"entities": [
{
"type": "protein",
"sequence": "MDALYKSTVAKFNEVIQLDCSTEFFSIALSSIAGILLLLLLFRSKRHSSLKLPPGKLGIPFIGESFIFLRALRSNSLEQFFDERVKKFGLVFKTSLIGHPTVVLCGPAGNRLILSNEEKLVQMSWPAQFMKLMGENSVATRRGEDHIVMRSALAGFFGPGALQSYIGKMNTEIQSHINEKWKGKDEVNVLPLVRELVFNISAILFFNIYDKQEQDRLHKLLETILVGSFALPIDLPGFGFHRALQGRAKLNKIMLSLIKKRKEDLQSGSATATQDLLSVLLTFRDDKGTPLTNDEILDNFSSLLHASYDTTTSPMALIFKLLSSNPECYQKVVQEQLEILSNKEEGEEITWKDLKAMKYTWQVAQETLRMFPPVFGTFRKAITDIQYDGYTIPKGWKLLWTTYSTHPKDLYFNEPEKFMPSRFDQEGKHVAPYTFLPFGGGQRSCVGWEFSKMEILLFVHHFVKTFSSYTPVDPDEKISGDPLPPLPSKGFSIKLFPRP",
"count": 1
},
{
"type": "ligand",
"ccd": "HEM",
"count": 1
},
{
"type": "ligand",
"smiles": "CC1=C2CC[C@@]3(CCCC(=C)[C@H]3C[C@@H](C2(C)C)CC1)C",
"count": 1
},
{
"type": "bond",
"bond": "A,CYS,445,SG,B,HEM,1,FE,covale,2.3",
"_comment": "<chain-id>,<residue name>,<residue index>,<atom id>,<chain-id>,<residue name>,<residue index>,<atom id>,<bond type>,<bond length>",
"_another_comment": "use semicolon to separate multiple bonds",
"_also_comment": "For ccd input, use CCD key as residue name; for smiles and file input, use `UNK-<index>` where index is the chain order you input. eg. `UNK-1` for the first ligand chain(or the count #1), `UNK-2` the second(or the count #2)."
}
]
}
```
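The `bond` string format described in the `_comment` field above can be split into named fields. The following is a minimal Python sketch, not HelixFold3's own parser: the `Bond` class and `parse_bonds` function are hypothetical names introduced for illustration, and the field order is taken from the `_comment` field.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Bond:
    """One bond entry; field order follows the `_comment` in the example above."""
    chain1: str
    res_name1: str
    res_index1: int
    atom1: str
    chain2: str
    res_name2: str
    res_index2: int
    atom2: str
    bond_type: str
    length: float


def parse_bonds(spec: str) -> List[Bond]:
    """Split a `bond` string on semicolons, then each item on commas."""
    bonds = []
    for item in spec.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        fields = [f.strip() for f in item.split(",")]
        if len(fields) != 10:
            raise ValueError(
                f"expected 10 comma-separated fields, got {len(fields)}: {item!r}"
            )
        c1, r1, i1, a1, c2, r2, i2, a2, bond_type, length = fields
        bonds.append(
            Bond(c1, r1, int(i1), a1, c2, r2, int(i2), a2, bond_type, float(length))
        )
    return bonds


if __name__ == "__main__":
    for bond in parse_bonds("A,CYS,445,SG,B,HEM,1,FE,covale,2.3"):
        print(bond)
```

Parsing into a typed record makes it easy to reject entries with a missing field before starting a long inference run.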
Another example of **disulfide modified** input:

```json
{
"entities": [
{
"type": "protein",
"sequence": "MASVKSSSSSSSSSFISLLLLILLVIVLQSQVIECQPQQSCTASLTGLNVCAPFLVPGSPTASTECCNAVQSINHDCMCNTMRIAAQIPAQCNLPPLSCSAN",
"count": 1
},
{
"type": "bond",
"bond": "A,CYS,41,SG,A,CYS,77,SG,disulf,2.2;A,CYS,51,SG,A,CYS,66,SG,disulf,2.2;A,CYS,67,SG,A,CYS,92,SG,disulf,2.2;A,CYS,79,SG,A,CYS,99,SG,disulf,2.2",
"_case_from": "https://www.uniprot.org/uniprotkb/Q43495/entry#ptm_processing"
}
]
}
```
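Before running inference, it can help to confirm that each residue named in a `bond` entry matches the sequence. The Python sketch below is an illustration only: `check_bond_residues` and `THREE_TO_ONE` are hypothetical names, the chain-to-sequence mapping is supplied by the caller, and residue indices are assumed to be 1-based, which is consistent with the examples in this README.

```python
from typing import Dict, List

# Subset of three-letter to one-letter residue codes; extend as needed.
THREE_TO_ONE = {"CYS": "C", "LYS": "K", "ASN": "N", "GLY": "G"}


def check_bond_residues(sequences: Dict[str, str], bond_spec: str) -> List[str]:
    """Return a list of mismatches between bond entries and protein sequences.

    `sequences` maps a chain id to its one-letter sequence. Chains that are
    not in the mapping (ligands, for example) are skipped.
    """
    problems = []
    for item in bond_spec.split(";"):
        fields = [f.strip() for f in item.split(",")]
        if len(fields) != 10:
            continue  # this sketch ignores empty or malformed items
        for chain, res_name, res_index in (
            (fields[0], fields[1], int(fields[2])),
            (fields[4], fields[5], int(fields[6])),
        ):
            seq = sequences.get(chain)
            expected = THREE_TO_ONE.get(res_name)
            if seq is None or expected is None:
                continue
            if not 1 <= res_index <= len(seq):
                problems.append(f"{chain}:{res_name}{res_index} is outside the sequence")
            elif seq[res_index - 1] != expected:
                problems.append(
                    f"{chain}:{res_name}{res_index} is {seq[res_index - 1]!r} in the sequence"
                )
    return problems


if __name__ == "__main__":
    toy = {"A": "ACDC"}  # toy sequence with cysteines at positions 2 and 4
    print(check_bond_residues(toy, "A,CYS,2,SG,A,CYS,4,SG,disulf,2.2"))
    print(check_bond_residues(toy, "A,CYS,1,SG,A,CYS,4,SG,disulf,2.2"))
```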

#### Running HelixFold for Inference
To run inference on a sequence or multiple sequences using HelixFold3's pretrained parameters, run e.g.:
* Inference on a single GPU (change the settings in the script BEFORE you run it)
```shell
sh run_infer.sh
```

@@ -184,7 +241,7 @@ The outputs will be in a subfolder of `output_dir`, including the computed MSAs,
ranked structures, and evaluation metrics. For a task that runs inference twice with a diffusion batch size of 3,
and assuming your input JSON is named `demo_data.json`, the `output_dir` directory will have the following structure:

```text
<output_dir>/
└── demo_data/
├── demo_data-pred-1-1/
@@ -208,9 +265,10 @@ assume your input JSON is named `demo_data.json`, the `output_dir` directory wil
└── ...

```
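The naming pattern of the prediction folders can be enumerated with a few lines of Python. This is a sketch under stated assumptions: `expected_prediction_dirs` is a hypothetical helper, and the `<name>-pred-X-Y` pattern with 1-based X (inference) and Y (diffusion batch) is inferred from the listing above rather than from the HelixFold3 source.

```python
from typing import List


def expected_prediction_dirs(
    json_name: str, num_inferences: int, diffusion_batch_size: int
) -> List[str]:
    """List the `<name>-pred-X-Y` folder names for a given input JSON."""
    # Drop the `.json` extension to get the task name used in folder names.
    stem = json_name[:-5] if json_name.endswith(".json") else json_name
    return [
        f"{stem}-pred-{x}-{y}"
        for x in range(1, num_inferences + 1)
        for y in range(1, diffusion_batch_size + 1)
    ]


if __name__ == "__main__":
    for name in expected_prediction_dirs("demo_data.json", 2, 3):
        print(name)
```

Comparing this list with the folders actually present under `<output_dir>/demo_data/` is a quick way to spot an interrupted run.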

The contents of each output file are as follows:
* `final_features.pkl` – A `pickle` file containing the input feature NumPy arrays
    used by the models to predict the structures. If you need to re-run an inference without re-building the MSAs, delete this file.
* `msas/` - A directory containing the files describing the various genetic
tool hits that were used to construct the input MSA.
* `demo_data-pred-X-Y` - Prediction results of `demo_data.json` in the X-th inference and Y-th diffusion batch,
155 changes: 155 additions & 0 deletions apps/protein_folding/helixfold3/data/7s69_glycan.sdf
@@ -0,0 +1,155 @@

OpenBabel03042416223D

72 77 0 0 1 0 0 0 0 0999 V2000
29.7340 3.2540 76.7430 C 0 0 0 0 0 2 0 0 0 0 0 0
29.8160 4.4760 77.6460 C 0 0 1 0 0 3 0 0 0 0 0 0
28.5260 5.2840 77.5530 C 0 0 2 0 0 3 0 0 0 0 0 0
28.1780 5.5830 76.1020 C 0 0 1 0 0 3 0 0 0 0 0 0
28.2350 4.3240 75.2420 C 0 0 1 0 0 3 0 0 0 0 0 0
28.1040 4.6170 73.7650 C 0 0 0 0 0 2 0 0 0 0 0 0
31.3020 3.8250 79.4830 C 0 0 0 0 0 0 0 0 0 0 0 0
31.3910 3.4410 80.9280 C 0 0 0 0 0 1 0 0 0 0 0 0
30.0760 4.0880 79.0210 N 0 0 0 0 0 2 0 0 0 0 0 0
28.6870 6.5050 78.2670 O 0 0 0 0 0 1 0 0 0 0 0 0
26.8490 6.0910 76.0350 O 0 0 0 0 0 0 0 0 0 0 0 0
29.4950 3.6650 75.4130 O 0 0 0 0 0 0 0 0 0 0 0 0
29.3670 4.5550 73.1150 O 0 0 0 0 0 1 0 0 0 0 0 0
32.2950 3.8940 78.7640 O 0 0 0 0 0 0 0 0 0 0 0 0
26.7420 7.4140 75.6950 C 0 0 1 0 0 3 0 0 0 0 0 0
25.2700 7.7830 75.6110 C 0 0 1 0 0 3 0 0 0 0 0 0
25.1290 9.2300 75.1610 C 0 0 2 0 0 3 0 0 0 0 0 0
25.9180 10.1440 76.0880 C 0 0 1 0 0 3 0 0 0 0 0 0
27.3630 9.6720 76.2210 C 0 0 1 0 0 3 0 0 0 0 0 0
28.1310 10.4360 77.2730 C 0 0 0 0 0 2 0 0 0 0 0 0
23.8820 5.8170 75.1400 C 0 0 0 0 0 0 0 0 0 0 0 0
23.1980 5.0100 74.0810 C 0 0 0 0 0 1 0 0 0 0 0 0
24.5530 6.8930 74.7160 N 0 0 0 0 0 2 0 0 0 0 0 0
23.7530 9.5950 75.1670 O 0 0 0 0 0 1 0 0 0 0 0 0
25.9170 11.4700 75.5730 O 0 0 0 0 0 0 0 0 0 0 0 0
27.4050 8.2900 76.6040 O 0 0 0 0 0 0 0 0 0 0 0 0
29.5300 10.4030 77.0280 O 0 0 0 0 0 1 0 0 0 0 0 0
23.8300 5.5110 76.3290 O 0 0 0 0 0 0 0 0 0 0 0 0
25.3940 12.4250 76.4090 C 0 0 1 0 0 3 0 0 0 0 0 0
25.9490 13.7680 75.9090 C 0 0 2 0 0 3 0 0 0 0 0 0
25.1320 14.9560 76.4900 C 0 0 2 0 0 3 0 0 0 0 0 0
23.6130 14.6900 76.6390 C 0 0 1 0 0 3 0 0 0 0 0 0
23.3700 13.3000 77.2280 C 0 0 1 0 0 3 0 0 0 0 0 0
21.9020 12.9360 77.3500 C 0 0 0 0 0 2 0 0 0 0 0 0
25.9010 13.8490 74.4810 O 0 0 0 0 0 1 0 0 0 0 0 0
25.3420 16.1410 75.7110 O 0 0 0 0 0 0 0 0 0 0 0 0
23.0420 15.6520 77.5170 O 0 0 0 0 0 1 0 0 0 0 0 0
23.9910 12.3690 76.3570 O 0 0 0 0 0 0 0 0 0 0 0 0
21.3660 12.8480 76.0500 O 0 0 0 0 0 0 0 0 0 0 0 0
20.8090 11.6500 75.6780 C 0 0 2 0 0 3 0 0 0 0 0 0
20.6800 11.6410 74.1740 C 0 0 2 0 0 3 0 0 0 0 0 0
19.5510 12.5850 73.8180 C 0 0 2 0 0 3 0 0 0 0 0 0
18.2370 12.0940 74.4540 C 0 0 1 0 0 3 0 0 0 0 0 0
18.4030 11.9240 75.9810 C 0 0 1 0 0 3 0 0 0 0 0 0
17.2710 11.1260 76.6120 C 0 0 0 0 0 2 0 0 0 0 0 0
20.2900 10.3510 73.7080 O 0 0 0 0 0 1 0 0 0 0 0 0
19.4280 12.7380 72.4110 O 0 0 0 0 0 0 0 0 0 0 0 0
17.2120 13.0460 74.2030 O 0 0 0 0 0 1 0 0 0 0 0 0
19.6260 11.2000 76.3010 O 0 0 0 0 0 0 0 0 0 0 0 0
16.0670 11.4490 75.9360 O 0 0 0 0 0 1 0 0 0 0 0 0
20.2190 13.6280 71.7260 C 0 0 2 0 0 3 0 0 0 0 0 0
19.6090 14.0000 70.3810 C 0 0 2 0 0 3 0 0 0 0 0 0
19.6360 12.7820 69.4880 C 0 0 2 0 0 3 0 0 0 0 0 0
21.0860 12.3100 69.3240 C 0 0 1 0 0 3 0 0 0 0 0 0
21.7030 12.0240 70.7120 C 0 0 1 0 0 3 0 0 0 0 0 0
23.1940 11.7460 70.6620 C 0 0 0 0 0 2 0 0 0 0 0 0
20.4080 14.9810 69.7000 O 0 0 0 0 0 1 0 0 0 0 0 0
19.0310 13.0500 68.2340 O 0 0 0 0 0 1 0 0 0 0 0 0
21.1060 11.1280 68.5380 O 0 0 0 0 0 1 0 0 0 0 0 0
21.5380 13.1700 71.5840 O 0 0 0 0 0 0 0 0 0 0 0 0
23.8240 12.5210 71.6820 O 0 0 0 0 0 1 0 0 0 0 0 0
26.0070 17.3020 76.0200 C 0 0 2 0 0 3 0 0 0 0 0 0
27.0750 17.5250 74.9350 C 0 0 2 0 0 3 0 0 0 0 0 0
28.3660 16.8320 75.3290 C 0 0 2 0 0 3 0 0 0 0 0 0
28.7820 17.2470 76.7510 C 0 0 1 0 0 3 0 0 0 0 0 0
27.6930 16.8120 77.7320 C 0 0 1 0 0 3 0 0 0 0 0 0
27.9770 17.2020 79.1710 C 0 0 0 0 0 2 0 0 0 0 0 0
27.3990 18.9140 74.8010 O 0 0 0 0 0 1 0 0 0 0 0 0
29.4060 17.0990 74.3950 O 0 0 0 0 0 1 0 0 0 0 0 0
30.0160 16.6410 77.0930 O 0 0 0 0 0 1 0 0 0 0 0 0
26.4610 17.4820 77.3520 O 0 0 0 0 0 0 0 0 0 0 0 0
27.3660 18.4620 79.4040 O 0 0 0 0 0 1 0 0 0 0 0 0
1 2 1 0 0 0 0
1 12 1 0 0 0 0
2 3 1 0 0 0 0
2 9 1 1 0 0 0
3 10 1 1 0 0 0
3 4 1 0 0 0 0
4 5 1 0 0 0 0
4 11 1 1 0 0 0
5 6 1 6 0 0 0
5 12 1 0 0 0 0
6 13 1 0 0 0 0
7 14 2 0 0 0 0
7 8 1 0 0 0 0
7 9 1 0 0 0 0
15 16 1 0 0 0 0
15 11 1 1 0 0 0
15 26 1 0 0 0 0
16 23 1 6 0 0 0
16 17 1 0 0 0 0
17 18 1 0 0 0 0
17 24 1 1 0 0 0
18 25 1 6 0 0 0
18 19 1 0 0 0 0
19 20 1 1 0 0 0
19 26 1 0 0 0 0
20 27 1 0 0 0 0
21 22 1 0 0 0 0
21 23 1 0 0 0 0
21 28 2 0 0 0 0
29 38 1 0 0 0 0
29 25 1 6 0 0 0
29 30 1 0 0 0 0
30 35 1 6 0 0 0
30 31 1 0 0 0 0
31 32 1 0 0 0 0
31 36 1 6 0 0 0
32 33 1 0 0 0 0
32 37 1 1 0 0 0
33 38 1 0 0 0 0
33 34 1 6 0 0 0
34 39 1 0 0 0 0
40 49 1 0 0 0 0
40 41 1 0 0 0 0
40 39 1 1 0 0 0
41 46 1 1 0 0 0
41 42 1 0 0 0 0
42 43 1 0 0 0 0
42 47 1 6 0 0 0
43 48 1 1 0 0 0
43 44 1 0 0 0 0
44 49 1 0 0 0 0
44 45 1 6 0 0 0
45 50 1 0 0 0 0
51 47 1 6 0 0 0
51 60 1 0 0 0 0
51 52 1 0 0 0 0
52 53 1 0 0 0 0
52 57 1 6 0 0 0
53 54 1 0 0 0 0
53 58 1 6 0 0 0
54 59 1 6 0 0 0
54 55 1 0 0 0 0
55 56 1 6 0 0 0
55 60 1 0 0 0 0
56 61 1 0 0 0 0
62 71 1 0 0 0 0
62 36 1 1 0 0 0
62 63 1 0 0 0 0
63 68 1 1 0 0 0
63 64 1 0 0 0 0
64 69 1 6 0 0 0
64 65 1 0 0 0 0
65 70 1 1 0 0 0
65 66 1 0 0 0 0
66 67 1 1 0 0 0
66 71 1 0 0 0 0
67 72 1 0 0 0 0
M END
$$$$
20 changes: 20 additions & 0 deletions apps/protein_folding/helixfold3/data/demo_3fap_protein_sm.json
@@ -0,0 +1,20 @@
{
"entities": [
{
"type": "protein",
"sequence": "GVQVETISPGDGRTFPKRGQTCVVHYTGMLEDGKKFDSSRDRNKPFKFMLGKQEVIRGWEEGVAQMSVGQRAKLTISPDYAYGATGHPGIIPPHATLVFDVELLKLE",
"count": 1
},
{
"type": "protein",
"sequence": "VAILWHEMWHEGLEEASRLYFGERNVKGMFEVLEPLHAMMERGPQTLKETSFNQAYGRDLMEAQEWCRKYMKSGNVKDLTQAWDLYYHVFRRIS",
"count": 1
},
{
"type": "ligand",
"sdf": "/mnt/data/yinying/tests/helixfold/ligands/ARD_ideal.sdf",
"use_3d": false,
"count": 1
}
]
}
45 changes: 45 additions & 0 deletions apps/protein_folding/helixfold3/data/demo_4Fe-4S.json
@@ -0,0 +1,45 @@
{
"entities": [
{
"type": "protein",
"sequence": "MKLNVDGLLVYFPYDYIYPEQFSYMRELKRTLDAKGHGVLEMPSGTGKTVSLLALIMAYQRAYPLEVTKLIYCSRTVPEIEKVIEELRKLLNFYEKQEGEKLPFLGLALSSRKNLCIHPEVTPLRFGKDVDGKCHSLTASYVRAQYQHDTSLPHCRFYEEFDAHGREVPLPAGIYNLDDLKALGRRQGWCPYFLARYSILHANVVVYSYHYLLDPKIADLVSKELARKAVVVFDEAHNIDNVCIDSMSVNLTRRTLDRCQGNLETLQKTVLRIKETDEQRLRDEYRRLVEGLREASAARETDAHLANPVLPDEVLQEAVPGSIRTAEHFLGFLRRLLEYVKWRLRVQHVVQESPPAFLSGLAQRVCIQRKPLRFCAERLRSLLHTLEITDLADFSPLTLLANFATLVSTYAKGFTIIIEPFDDRTPTIANPILHFSCMDASLAIKPVFERFQSVIITSGTLSPLDIYPKILDFHPVTMATFTMTLARVCLCPMIIGRGNDQVAISSKFETREDIAVIRNYGNLLLEMSAVVPDGIVAFFTSYQYMESTVASWYEQGILENIQRNKLLFIETQDGAETSVALEKYQEACENGRGAILLSVARGKVSEGIDFVHHYGRAVIMFGVPYVYTQSRILKARLEYLRDQFQIRENDFLTFDAMRHAAQCVGRAIRGKTDYGLMVFADKRFARGDKRGKLPRWIQEHLTDANLNLTVDEGVQVAKYFLRQMAQPFHREDQLGLSLLSLEQLESEETLKRIEQIAQQL",
"count": 1
},
{
"type": "ligand",
"ccd": "SF4",
"count": 1,
"_note": "5T5I"
},
{
"type": "bond",
"bond": "A,CYS,116,SG,B,SF4,1,FE1,metalc,2.2;A,CYS,134,SG,B,SF4,1,FE2,metalc,2.2;A,CYS,155,SG,B,SF4,1,FE3,metalc,2.2;A,CYS,190,SG,B,SF4,1,FE4,metalc,2.2",
"_case_from": "https://www.uniprot.org/uniprotkb/P18074/entry",
"_note":"ALL_CYS-ALL_FE"
},
{
"type": "bond",
"bond": "B,SF4,1,FE1,B,SF4,1,S2,metalc,2.2;B,SF4,1,FE1,B,SF4,1,S3,metalc,2.2;B,SF4,1,FE1,B,SF4,1,S4,metalc,2.2",
"_case_from": "https://www.uniprot.org/uniprotkb/P18074/entry",
"_note":"FE1-S234"
},
{
"type": "bond",
"bond": "B,SF4,1,FE2,B,SF4,1,S1,metalc,2.2;B,SF4,1,FE2,B,SF4,1,S3,metalc,2.2;B,SF4,1,FE2,B,SF4,1,S4,metalc,2.2",
"_case_from": "https://www.uniprot.org/uniprotkb/P18074/entry",
"_note":"FE2-S134"
},
{
"type": "bond",
"bond": "B,SF4,1,FE3,B,SF4,1,S1,metalc,2.2;B,SF4,1,FE3,B,SF4,1,S2,metalc,2.2;B,SF4,1,FE3,B,SF4,1,S4,metalc,2.2",
"_case_from": "https://www.uniprot.org/uniprotkb/P18074/entry",
"_note":"FE3-S124"
},
{
"type": "bond",
"bond": "B,SF4,1,FE4,B,SF4,1,S1,metalc,2.2;B,SF4,1,FE4,B,SF4,1,S2,metalc,2.2;B,SF4,1,FE4,B,SF4,1,S3,metalc,2.2",
"_case_from": "https://www.uniprot.org/uniprotkb/P18074/entry",
"_note":"FE4-S123"
}
]
}
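The intra-cluster bonds in this file follow a regular pattern, as the `_note` fields indicate: each `FEi` is bonded to every `Sj` with `j != i`. Generating the strings programmatically avoids hand-typed atom names. The Python sketch below is an illustration only; `sf4_internal_bonds` is a hypothetical helper, and the default chain id, residue index, and bond length are copied from the entries above.

```python
from typing import List


def sf4_internal_bonds(
    chain: str = "B", res_index: int = 1, length: float = 2.2
) -> List[str]:
    """Build the 12 FE-S `metalc` bond items of one SF4 cluster."""
    bonds = []
    for i in range(1, 5):          # FE1..FE4
        for j in range(1, 5):      # S1..S4
            if i == j:
                continue           # FEi is not bonded to Si in this pattern
            bonds.append(
                f"{chain},SF4,{res_index},FE{i},"
                f"{chain},SF4,{res_index},S{j},metalc,{length}"
            )
    return bonds


if __name__ == "__main__":
    # Join with semicolons to obtain a single value for the `bond` key.
    print(";".join(sf4_internal_bonds()))
```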
20 changes: 20 additions & 0 deletions apps/protein_folding/helixfold3/data/demo_7s69_coval.json
@@ -0,0 +1,20 @@
{
"entities": [
{
"type": "protein",
"sequence": "DRHHHHHHKLGKMKIVEEPNSFGLNNPFLSQTNKLQPRVQPSPVSGPSHLFRLAGKCFNLVESTYKYELCPFHNVTQHEQTFRWNAYSGILGIWQEWDIENNTFSGMWMREGDSCGNKNRQTKVLLVCGKANKLSSVSEPSTCLYSLTFETPLVCHPHSLLVYPTLSEGLQEKWNEAEQALYDELITEQGHGKILKEIFREAGYLKTTKPDGEGKETQDKPKEFDSLEKCNKGYTELTSEIQRLKKMLNEHGISYVTNGTSRSEGQPAEVNTTFARGEDKVHLRGDTGIRDGQ",
"count": 1
},
{
"type": "ligand",
"sdf": "/repo/PaddleHelix/apps/protein_folding/helixfold3/data/7s69_glycan.sdf",
"count": 1
},
{
"type": "bond",
"bond": "A,ASN,74,ND2,B,UNK-1,1,C16,covale,2.3",
"_comment": "'A,74,ND2:B,1:CW,null' from RF2AA.",
"_also_comment": "For ccd input, use CCD key as residue name; for smiles and file input, use `UNK-<index>` where index is the chain order you input"
}
]
}
23 changes: 23 additions & 0 deletions apps/protein_folding/helixfold3/data/demo_E2-Ub.json
@@ -0,0 +1,23 @@
{
"entities": [
{
"type": "protein",
"sequence": "MSRAKRIMKEIQAVKDDPAAHITLEFVSESDIHHLKGTFLGPPGTPYEGGKFVVDIEVPMEYPFKPPKMQFDTKVYHPNISSVTGAICLDILKNAWSPVITLKSALISLQALLQSPEPNDPQDAEVAQHYLRDRESFNKTAALWTRLYASETSNGQKGNVEESDLYGIDHDLIDEFESQGFEKDKIVEVLRRLGVKSLDPNDNNTANRIIEELLK",
"count": 1,
"_case_from": "https://www.uniprot.org/uniprotkb/P21734/entry#sequences",
"_note": "E2"
},
{
"type": "protein",
"sequence": "MQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLADYNIQKESTLHLVLRLRGG",
"count": 1,
"_case_from": "https://www.uniprot.org/uniprotkb/P21734/entry#sequences",
"_note": "Ub"
},
{
"type": "bond",
"bond": "A,LYS,93,NZ,B,GLY,76,C,covale,1.66",
"_note": "Glycyl lysine isopeptide (Lys-Gly) (interchain with G-Cter in ubiquitin)"
}
]
}