Missing Unresolved Residues in mmcif_parsing #284

Open
v-shaoningli opened this issue Sep 23, 2024 · 1 comment

@v-shaoningli

Hi All!

Thank you for your effort in developing the open-source AF3.

Issue Description

I have encountered an issue with the mmcif_parsing module related to unresolved residues. It appears that, because the protein sequence is parsed directly from the Biopython structure object, unresolved residues (those absent from the coordinate section of the mmCIF file, _atom_site) are not included in the MmcifObject.

Impact

We need the unresolved residues for some computations, such as calculating the unresolved relative solvent accessible surface area (RASA).

Example

For instance, the actual sequence for the protein with PDB ID 7a4d is:

QVQLQESGGGLVQPGGSLRLSCAAPGFRLDNYVIGWFRQAPGKEREGVSCISSSAGSTYYADSVKGRFTISRDNAKNTVYLQMNSLKPEDTAVYYCATACYSSYVTYWGQGTQVTVSSGRYPYDVPDYGSGRA

However, when using mmcif_parsing, the parsed sequence is:

QLQESGGGLVQPGGSLRLSCAAPGFRLDNYVIGWFRQAPGKEREGVSCISSSAGSTYYADSVKGRFTISRDNAKNTVYLQMNSLKPEDTAVYYCATACYSSYVTYWGQGTQVTVSSGR
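
To make the discrepancy concrete, here is a minimal sketch that reports which residues of the full sequence are absent from the parsed one. The helper name missing_flanks is hypothetical, and the sketch assumes the parsed sequence is a single contiguous stretch of the full sequence (only terminal residues unresolved); it returns None when that assumption does not hold.

```python
def missing_flanks(full_seq: str, parsed_seq: str):
    """Return the (N-terminal, C-terminal) residues of full_seq that flank parsed_seq.

    Hypothetical helper: assumes parsed_seq is one contiguous substring of
    full_seq. Returns None if it is not (e.g. internal unresolved loops).
    """
    start = full_seq.find(parsed_seq)
    if start == -1:
        return None
    end = start + len(parsed_seq)
    return full_seq[:start], full_seq[end:]


# Sequences copied from this issue (PDB ID 7a4d).
full_seq = (
    "QVQLQESGGGLVQPGGSLRLSCAAPGFRLDNYVIGWFRQAPGKEREGVSCISSSAGSTYYADSVKGRF"
    "TISRDNAKNTVYLQMNSLKPEDTAVYYCATACYSSYVTYWGQGTQVTVSSGRYPYDVPDYGSGRA"
)
parsed_seq = (
    "QLQESGGGLVQPGGSLRLSCAAPGFRLDNYVIGWFRQAPGKEREGVSCISSSAGSTYYADSVKGRF"
    "TISRDNAKNTVYLQMNSLKPEDTAVYYCATACYSSYVTYWGQGTQVTVSSGR"
)

flanks = missing_flanks(full_seq, parsed_seq)
if flanks is None:
    print("Parsed sequence is not a contiguous part of the full sequence.")
else:
    n_term, c_term = flanks
    print(f"Missing at N-terminus: {n_term!r} ({len(n_term)} residues)")
    print(f"Missing at C-terminus: {c_term!r} ({len(c_term)} residues)")
```

A chain with unresolved internal loops would need a proper alignment rather than a substring search.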

Additional Observations

We also noticed that the cached MSAs in data/pdb_data/data_caches/msa/train_msas/7a4d-assembly1A_protein.a3m are computed based on the latter sequence, which excludes the unresolved residues.

Request for Assistance

Is there a solution to include the unresolved residues in the parsed sequence? Any guidance or help with this issue would be greatly appreciated.

Best regards,
Shaoning

@amorehead
Contributor

Hi, @v-shaoningli. I'm glad to see you've found this work useful so far.

As you said, mmcif_parsing initially relies on Biopython to parse the mmCIF input files' metadata, after which we manually collect all atoms associated with coordinate data (following AF2's parsing logic). This parsing logic is quite complex because it has to account for numerous edge cases that can arise when working with heterogeneous PDB complexes. It's possible to modify this parsing function to accommodate the use case you've outlined above, but I will warn you that side effects can easily "leak" into the downstream components of the codebase unless the change is followed by rigorous unit testing.
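
One possible starting point, offered as a sketch rather than as the repository's approach: the mmCIF category _entity_poly_seq lists every residue of each polymer entity, resolved or not, so the full sequence can be rebuilt from it independently of _atom_site. The function names full_sequences and load_full_sequences and the THREE_TO_ONE table are hypothetical and not part of mmcif_parsing. The sketch assumes the dictionary maps each key to a list of strings, as Biopython's MMCIF2Dict does in recent versions, and it maps any non-standard or non-protein monomer to "X".

```python
from collections import defaultdict

# Standard amino acids only; anything else becomes "X" in this sketch.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def full_sequences(cif: dict) -> dict:
    """Build one-letter sequences per entity ID from _entity_poly_seq.

    `cif` is a mapping of mmCIF keys to lists of strings (assumption).
    """
    entity_ids = cif["_entity_poly_seq.entity_id"]
    nums = cif["_entity_poly_seq.num"]
    mon_ids = cif["_entity_poly_seq.mon_id"]

    residues = defaultdict(dict)
    for entity_id, num, mon_id in zip(entity_ids, nums, mon_ids):
        # A position may be listed more than once (heterogeneity);
        # this sketch keeps the first monomer it sees.
        residues[entity_id].setdefault(int(num), mon_id)

    return {
        entity_id: "".join(THREE_TO_ONE.get(by_num[n], "X") for n in sorted(by_num))
        for entity_id, by_num in residues.items()
    }


def load_full_sequences(path: str) -> dict:
    """Read an mmCIF file with Biopython and return its full sequences."""
    from Bio.PDB.MMCIF2Dict import MMCIF2Dict  # imported lazily; requires Biopython

    return full_sequences(MMCIF2Dict(path))
```

The result is keyed by entity ID, not chain ID, so mapping it onto individual chains (and onto the cached MSAs) would still need the entity-to-chain assignment, which is where the side effects mentioned above are likely to show up.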

If you have additional questions along the way, let me know. Best of luck.

Best,
Alex
