Hi All!
Thank you for your effort in developing the open-source AF3.
Issue Description
I have encountered an issue with the mmcif_parsing module related to unresolved residues. It appears that when the protein sequence is parsed directly from the structure object in Biopython, the unresolved residues — those that do not appear in the mmcif coordinates part (_atom_site) — are not included in the MmcifObject.
Impact
We need the unresolved residues for some computations, such as calculating the unresolved relative solvent accessible surface area (RASA).
Example
For instance, the actual sequence for the protein with PDB ID 7a4d is:
However, when using mmcif_parsing, the parsed sequence is:
Additional Observations
We also noticed that the cached MSAs in data/pdb_data/data_caches/msa/train_msas/7a4d-assembly1A_protein.a3m are computed from the parsed sequence, which excludes the unresolved residues.
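One way to check which sequence a cached MSA was built from is to read the query (first) record of the .a3m file and compare it with the full sequence. The following is only a rough sketch: the helper name `a3m_query_sequence` is hypothetical, and it assumes the usual A3M layout in which the first record is the query, `-` and `.` mark gaps, and lines starting with `#` are comments.

```python
def a3m_query_sequence(a3m_text: str) -> str:
    """Return the first (query) sequence of an A3M alignment without gap characters."""
    chunks = []
    seen_header = False
    for line in a3m_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            if seen_header:
                break  # second header reached: the query record is complete
            seen_header = True
            continue
        if seen_header:
            chunks.append(line)
    raw = "".join(chunks)
    # Drop gap symbols; keep every residue letter, normalised to upper case.
    return "".join(c.upper() for c in raw if c.isalpha())


# Toy alignment (made-up data, not the real 7a4d MSA).
toy_a3m = ">query\nAGS\n>hit_1\nA-S\n"
toy_full_sequence = "MAGSK"

query = a3m_query_sequence(toy_a3m)
msa_covers_full_sequence = (query == toy_full_sequence)
```

If the query is shorter than the full sequence, the MSA was most likely computed on the resolved residues only and would need to be regenerated or re-indexed before it can be paired with the full sequence.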
Request for Assistance
Is there a solution to include the unresolved residues in the parsed sequence? Any guidance or help with this issue would be greatly appreciated.
Best regards,
Shaoning
Hi, @v-shaoningli. I'm glad to see you've found this work useful so far.
As you said, mmcif_parsing initially relies on Biopython to parse the metadata of the mmCIF input files, after which we manually collect all atoms associated with coordinate data here (following AF2's parsing logic). This parsing logic is quite complex because it has to account for the numerous edge cases that arise when working with heterogeneous PDB complexes. It's possible to modify this function to accommodate the use case you've outlined above, but I will warn you that side effects can easily "leak" into the downstream components of the codebase unless the change is followed by rigorous unit testing.
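To make the idea concrete, here is a minimal sketch of one possible approach. It is not the repository's parsing code. It assumes the mmCIF file has already been loaded into a plain dictionary that maps column names to lists of strings (the layout is assumed to mirror what Biopython's `MMCIF2Dict` returns), and the function name `sequence_with_resolved_mask` is hypothetical. The full sequence comes from the `_entity_poly_seq` category, and a residue counts as resolved if at least one `_atom_site` row points at its `label_seq_id`.

```python
from typing import Dict, List, Tuple

# One-letter codes for the 20 standard amino acids; anything else maps to "X".
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def sequence_with_resolved_mask(
    cif: Dict[str, List[str]], entity_id: str, asym_id: str
) -> Tuple[str, List[bool]]:
    """Return the full chain sequence and a per-residue 'has coordinates' mask."""
    # Full sequence, including unresolved residues, keyed by sequence index.
    # setdefault keeps the first monomer if an index is listed more than once.
    by_num: Dict[int, str] = {}
    for ent, num, mon in zip(
        cif["_entity_poly_seq.entity_id"],
        cif["_entity_poly_seq.num"],
        cif["_entity_poly_seq.mon_id"],
    ):
        if ent == entity_id:
            by_num.setdefault(int(num), mon)

    # Sequence indices with at least one atom record in the requested chain.
    # Non-polymer atoms carry "." instead of a number and are skipped.
    resolved = {
        int(seq_id)
        for asym, seq_id in zip(
            cif["_atom_site.label_asym_id"], cif["_atom_site.label_seq_id"]
        )
        if asym == asym_id and seq_id.isdigit()
    }

    nums = sorted(by_num)
    sequence = "".join(THREE_TO_ONE.get(by_num[n], "X") for n in nums)
    mask = [n in resolved for n in nums]
    return sequence, mask


# Toy input (made-up data): five residues, only residues 2-4 have coordinates.
toy_cif = {
    "_entity_poly_seq.entity_id": ["1", "1", "1", "1", "1"],
    "_entity_poly_seq.num": ["1", "2", "3", "4", "5"],
    "_entity_poly_seq.mon_id": ["MET", "ALA", "GLY", "SER", "LYS"],
    "_atom_site.label_asym_id": ["A", "A", "A", "B"],
    "_atom_site.label_seq_id": ["2", "3", "4", "."],
}
sequence, mask = sequence_with_resolved_mask(toy_cif, entity_id="1", asym_id="A")
resolved_only = "".join(aa for aa, ok in zip(sequence, mask) if ok)
```

Keeping the mask separate from the sequence means downstream code that expects only resolved residues can still filter on it, which may limit how far the change propagates through the rest of the pipeline.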
If you have additional questions along the way, let me know. Best of luck.