Add the ability for the PDBManager to perform interface-based chain filtering #333

Open
wants to merge 5 commits into
base: master
from

Conversation

amorehead
Contributor

What does this implement/fix? Explain your changes

This allows one to select PDB complex chains satisfying certain interface contact or hydrogen bonding constraints.
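As a rough illustration of what an interface-contact constraint could look like, here is a minimal pure-Python sketch. It is not the PDBManager implementation from this PR: the function name `chains_with_interface_contact`, the `Atom` tuple layout, and the 8.0 Å cutoff are all assumptions made for the example. Only the contact atom names come from the diff.

```python
from itertools import combinations
from math import dist
from typing import List, Sequence, Set, Tuple

# Atom names taken from the PR diff; everything else below is hypothetical.
PRIMARY_INTERCHAIN_CONTACT_ATOMS_FOR_FILTERING: List[str] = ["CA", "C4'"]

# (chain_id, atom_name, xyz coordinates in angstroms)
Atom = Tuple[str, str, Tuple[float, float, float]]


def chains_with_interface_contact(
    atoms: Sequence[Atom],
    contact_atoms: Sequence[str] = PRIMARY_INTERCHAIN_CONTACT_ATOMS_FOR_FILTERING,
    cutoff: float = 8.0,  # assumed distance threshold, not taken from the PR
) -> Set[str]:
    """Return IDs of chains with a contact atom within `cutoff` of another chain."""
    selected = [atom for atom in atoms if atom[1] in contact_atoms]
    keep: Set[str] = set()
    # Compare every pair of selected atoms that sit on different chains.
    for (chain_a, _, xyz_a), (chain_b, _, xyz_b) in combinations(selected, 2):
        if chain_a != chain_b and dist(xyz_a, xyz_b) <= cutoff:
            keep.update((chain_a, chain_b))
    return keep


example_atoms: List[Atom] = [
    ("A", "CA", (0.0, 0.0, 0.0)),
    ("A", "CB", (1.5, 0.0, 0.0)),   # ignored: not a contact atom
    ("B", "C4'", (5.0, 0.0, 0.0)),  # 5 angstroms from chain A's CA
    ("C", "CA", (50.0, 0.0, 0.0)),  # far from every other chain
]
interface_chains = chains_with_interface_contact(example_atoms)
```

The pairwise loop is quadratic in the number of selected atoms, which is fine for a sketch but would likely need a spatial index or vectorised distance matrix for full PDB entries.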

@@ -23,6 +25,11 @@
)
from graphein.utils.dependencies import is_tool

PRIMARY_INTERCHAIN_CONTACT_ATOMS_FOR_FILTERING: List[str] = ["CA", "C4'"]
SECONDARY_INTERCHAIN_CONTACT_ATOMS_NOT_FOR_FILTERING: List[str] = ["H"]
PRIMARY_HYDROGEN_BOND_ATOMS_FOR_FILTERING: List[str] = ["N", "O", "N1", "N9", "N3", "C2", "C4", "C5", "C6"]
Contributor Author

This should be vetted more carefully, as I initially chose these atom types heuristically.

Owner

What is this atom naming scheme? It doesn't ring any bells for me (`ATOM_NUMBERING: Dict[str, int] = {`)

Owner

We already have these constants:

HYDROGEN_BOND_DONORS: Dict[str, Dict[str, int]] = {

HYDROGEN_BOND_ACCEPTORS: Dict[str, Dict[str, int]] = {

Contributor Author

@amorehead amorehead Aug 19, 2023

> What is this atom naming scheme? It doesn't ring any bells for me (`ATOM_NUMBERING: Dict[str, int] = {`)

The N, CA, O, and H atoms correspond to regular protein vocabulary; however, all other types correspond to nucleic acid residue atoms. My initial goal with this PR was to make a generic dataset chain filter for protein-protein, protein-nucleic acid, and nucleic acid-nucleic acid interactions, inspired by the dataset curation technique of RoseTTAFold2NA for protein-nucleic acid structure prediction (https://www.biorxiv.org/content/10.1101/2022.09.09.507333v1.full.pdf, page 8). I am essentially trying to reproduce this filtering logic with the PDBManager (minus all the sequence alignments), and I thought a PR would be in order.
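To make the split between protein and nucleic acid atom names concrete, here is a small sketch that labels a pair of atom names with the kind of interaction it would represent. The two sets are built from the atom names in the diff according to the description above; the helper names `molecule_type` and `interaction_type` are hypothetical and not part of graphein.

```python
from typing import FrozenSet

# Grouping follows the comment above: N, CA, O, H are protein atoms,
# every other name from the diff is treated as a nucleic acid atom.
PROTEIN_ATOMS: FrozenSet[str] = frozenset({"N", "CA", "O", "H"})
NUCLEIC_ACID_ATOMS: FrozenSet[str] = frozenset(
    {"C4'", "N1", "N9", "N3", "C2", "C4", "C5", "C6"}
)


def molecule_type(atom_name: str) -> str:
    """Map an atom name to 'protein' or 'nucleic acid' (hypothetical helper)."""
    if atom_name in PROTEIN_ATOMS:
        return "protein"
    if atom_name in NUCLEIC_ACID_ATOMS:
        return "nucleic acid"
    raise ValueError(f"Unrecognised atom name: {atom_name!r}")


def interaction_type(atom_a: str, atom_b: str) -> str:
    """Label an atom pair, listing 'protein' first when the types differ."""
    kinds = sorted((molecule_type(atom_a), molecule_type(atom_b)), reverse=True)
    return "-".join(kinds)
```

For example, `interaction_type("CA", "C4'")` gives `"protein-nucleic acid"` regardless of argument order. Classifying by atom name alone is a simplification; a real filter would probably also consult the residue name.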

Contributor Author

Per a suggestion from a colleague, I have removed the carbon atoms (C2, C4, C5, C6) from the hydrogen bond calculation, as these atoms are very rarely involved in the formation of hydrogen bonds in proteins and nucleic acids (NAs).
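For reference, the effect of that change on the constant from the diff can be sketched as follows. The variable name `hbond_atoms_without_carbon` is hypothetical, and the `startswith("C")` test is a shortcut that only works because every "C"-prefixed name in this particular list is a carbon atom (it would wrongly drop, say, a chlorine atom named "CL").

```python
from typing import List

# Original heuristic list from the PR diff.
PRIMARY_HYDROGEN_BOND_ATOMS_FOR_FILTERING: List[str] = [
    "N", "O", "N1", "N9", "N3", "C2", "C4", "C5", "C6",
]

# Drop the carbon atoms; valid only for this list (see the caveat above).
hbond_atoms_without_carbon: List[str] = [
    name
    for name in PRIMARY_HYDROGEN_BOND_ATOMS_FOR_FILTERING
    if not name.startswith("C")
]
```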

Owner

> The N, CA, O, and H atoms correspond to regular protein vocabulary; however, all other types correspond to nucleic acid residue atoms.

Got it, bells ring for me now :)

So these H-bond definitions do not account for sidechain-X hbonds, only backbone-backbone hbonds?

Contributor Author

@amorehead amorehead Aug 20, 2023

Right. Here's a naive question on my part: how frequent would you say sidechain-X hbonds are? If they are pretty common, perhaps we can simply add more protein and nucleic acid (NA) atom types to the list here?

Owner

Contributor Author

Because of how I have designed this filtering logic, I am assuming that each (protein or NA) residue (potentially) contains the following atoms: "N", "O", "N1", "N9", "N3". Given the prevalence of sidechain hbonds, which protein atom types (shared across all residue types) would be most reasonable to include to cover most of the possible hbonds mentioned in this article? The only other atom type I think we could include would be the carbon-beta (CB) atoms.
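One way to check which atom names are shared across residue types is a set intersection. The sketch below uses a hand-written table of heavy atoms for three amino acids; it is an illustrative subset, not graphein's own residue data, and the names `RESIDUE_HEAVY_ATOMS` and `shared_atoms` are hypothetical.

```python
from typing import Dict, Set

# Heavy-atom names for three standard amino acids (illustrative subset only).
RESIDUE_HEAVY_ATOMS: Dict[str, Set[str]] = {
    "GLY": {"N", "CA", "C", "O"},
    "ALA": {"N", "CA", "C", "O", "CB"},
    "SER": {"N", "CA", "C", "O", "CB", "OG"},
}

# Atom names present in every residue of the table.
shared_atoms: Set[str] = set.intersection(*RESIDUE_HEAVY_ATOMS.values())
```

With this table only the backbone atoms survive the intersection. Glycine has no CB, so a CB-based rule would need to handle glycine separately.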

@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 1 (rating A)

Coverage: no information
Duplication: 0.0%

2 participants