-
Notifications
You must be signed in to change notification settings - Fork 121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RMSD calculation for whole PDBs #69
Comments
Hi there, |
I've had a go at adding a fix for this but I hit a bit of a brick wall. My solution
ProblemEven after subsetting the longer pdb df to contain only residues present in the shorter, the 0th axis of the dataframes are still not equal. I found this was because although the subsetted dfs contain the same amino acid sequences, a small number of residues within the structure differ in their atom composition (how.. they are the same amino acid!? I have included my python code and some intermediate outputs below so you can check it yourself. from biopandas.pdb import PandasPdb
from biopandas.pdb.engines import amino3to1dict
from pathlib import Path
import difflib
from itertools import chain, groupby
# !mkdir data
# !curl -O https://files.rcsb.org/download/2VU9.pdb && mv 2VU9.pdb data/
# !curl -O https://files.rcsb.org/download/2VUA.pdb && mv 2VUA.pdb data/
DATA_PATH = Path('../data')
pdb_paths = !ls {DATA_PATH}/*
pdb1 = PandasPdb().read_pdb(pdb_paths[0])
pdb2 = PandasPdb().read_pdb(pdb_paths[1])
def amino3to1(pdbdf):
tmp = pdbdf.copy()
indices, residues = [], []
for k, g in groupby(tmp.itertuples(), key = lambda x: x.residue_number):
first_atom = list(g)[0]
indices.append(first_atom.Index)
residues.append(amino3to1dict[first_atom.residue_name])
tmp = (
tmp.loc[indices, ['chain_id', 'residue_number', 'insertion']]
.assign(residue_name = residues)
.reset_index()
)
return tmp
def locate_matching_blocks(seq1, seq2):
matcher = difflib.SequenceMatcher(None, seq1, seq2)
blocks = matcher.get_matching_blocks()[:-1]
indices_1 = list(chain.from_iterable([
range(block.a, block.a + block.size)
for block in blocks
]))
indices_2 = list(chain.from_iterable([
range(block.b, block.b + block.size)
for block in blocks
]))
return indices_1, indices_2
seq_df1 = amino3to1(pdb1.df['ATOM'])
seq_df2 = amino3to1(pdb2.df['ATOM'])
seq1 = ''.join(seq_df1.residue_name.tolist())
seq2 = ''.join(seq_df2.residue_name.tolist())
# residue numbers 355, 356, 357 should be missing in sequence 1 (non matching tag)
seq_df1_indices_to_compare, seq_df2_indices_to_compare = locate_matching_blocks(seq1, seq2)
def add_residue_number_insertion_col(df):
return (df.assign(
residue_number_insertion = df.residue_number.astype(str).str.cat(df.insertion, sep = '_'))
)
def subset_residue_number_insertions(df, subset):
return df.loc[df.residue_number_insertion.isin(subset)]
seq1_residue_number_insertions = set(
seq_df1.iloc[seq_df1_indices_to_compare]
.pipe(add_residue_number_insertion_col)
.residue_number_insertion
)
seq2_residue_number_insertions = set(
seq_df2.iloc[seq_df2_indices_to_compare]
.pipe(add_residue_number_insertion_col)
.residue_number_insertion
)
subsetted_df1 = pdb1.df['ATOM'].pipe(add_residue_number_insertion_col).pipe(subset_residue_number_insertions, seq1_residue_number_insertions)
subsetted_df2 = pdb2.df['ATOM'].pipe(add_residue_number_insertion_col).pipe(subset_residue_number_insertions, seq2_residue_number_insertions)
def join_seq(df): return ''.join(amino3to1(df).residue_name)
# amino acid sequences are exactly the same
join_seq(subsetted_df1) == join_seq(subsetted_df2)
True
# shapes still unequal
print(subsetted_df1.shape, subsetted_df2.shape)
(3509, 22) (3514, 22) If we collect the residue numbers mapped to their atom lists for each subsetted, dataframe you can see that although the residues match, some residue numbers are composed of different atoms in each dataframe. atom_groups1 = [
(k, [o.atom_name for o in list(g)] )
for k, g in groupby(subsetted_df1.itertuples(), key = lambda x: x.residue_number_insertion)
]
atom_groups2 = [
(k, [o.atom_name for o in list(g)] )
for k, g in groupby(subsetted_df2.itertuples(), key = lambda x: x.residue_number_insertion)
]
[(g1,g2) for g1,g2 in zip(atom_groups1, atom_groups2) if g1[1] != g2[1]]
[(('873_', ['N', 'CA', 'C', 'O', 'CB', 'CG', 'ND1', 'CD2', 'CE1', 'NE2']),
('873_', ['CA', 'C', 'O', 'CB', 'CG', 'ND1', 'CD2', 'CE1', 'NE2'])),
(('899_', ['N', 'N', 'CA', 'CA', 'C', 'C', 'O', 'O', 'CB', 'CB', 'CG', 'CG', 'OD1', 'OD1', 'ND2', 'ND2']),
('899_', ['N', 'C', 'O', 'CA', 'CB', 'CG', 'OD1', 'ND2', 'CA', 'CB', 'CG', 'OD1', 'ND2'])),
(('925_', ['N', 'CA', 'C', 'O', 'CB', 'CG', 'CD', 'OE1', 'OE2']),
('925_', ['N', 'C', 'O', 'CA', 'CB', 'CG', 'CD', 'OE1', 'OE2', 'CA', 'CB', 'CG', 'CD', 'OE1', 'OE2']))
...
] There are 8 mismatches in the pair given in the description of the issue. Not sure what the best way to proceed is (or why this is happening..) @rasbt any thoughts? |
Describe the workflow you want to enable
From what I can understand PandasPDB.rmsd can calculate the rmsd only if both dataframes have the same length. However, if I want to compare 2 PDBs (one chain each) where the target protein (Uniprot ID) is the same (for example 2vua vs 2vu9) I can't because although the protein is the same one has a different purification tag and thus I cannot use the rsmd function
Describe your proposed solution
it should be pretty easy to calculate de identity between 2 structures and select only those residues. This can ofc be done manually by each user, but I think it would be a great improvement.
The text was updated successfully, but these errors were encountered: