Hi all,
I stumbled upon this paper describing the mapping of PDB residue IDs to those in the sequence deposited in UniProt:
Choudhary, P.; Anyango, S.; Berrisford, J.; Varadi, M.; Tolchard, J.; Velankar, S. Unified Access to up-to-Date Residue-Level Annotations from UniProt and Other Biological Databases for PDB Data via PDBx/mmCIF Files. bioRxiv, 2022, 2022.08.10.503473. https://doi.org/10.1101/2022.08.10.503473.
Frustrated by the inconsistencies in numbering, I'm writing some code that outputs PDB files with these UniProt-matching residue IDs, using biopandas for the crunching.
The mmCIFs with the mapped residues can be downloaded from the URL:
https://www.ebi.ac.uk/pdbe/entry-files/download/{pdb_id}_updated.cif
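A minimal sketch of filling in that URL template and fetching the file, using only the standard library. The helper names `updated_cif_url` and `download_updated_cif` are made up for illustration, and lower-casing the PDB ID is an assumption about how the files are named on the server.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL template quoted in the post; {pdb_id} is the four-character PDB code.
URL_TEMPLATE = "https://www.ebi.ac.uk/pdbe/entry-files/download/{pdb_id}_updated.cif"


def updated_cif_url(pdb_id: str) -> str:
    """Fill the template for one entry (lower-casing the ID is an assumption)."""
    return URL_TEMPLATE.format(pdb_id=pdb_id.strip().lower())


def download_updated_cif(pdb_id: str, out_dir: str = ".") -> Path:
    """Download the updated mmCIF and return the local path (needs network access)."""
    target = Path(out_dir) / f"{pdb_id.strip().lower()}_updated.cif"
    urlretrieve(updated_cif_url(pdb_id), str(target))
    return target


print(updated_cif_url("1CBS"))
```

Calling `download_updated_cif("1cbs")` would then save `1cbs_updated.cif` in the current directory, provided the entry exists and the server is reachable.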
The CIF file is read nicely by the mmCIF parser. The resid matching the one in UniProt is in the column pdbx_sifts_xref_db_num, which gives None for residues without a mapping to the sequence, e.g. ligands and UNKs.
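Roughly, reading that column could look like the sketch below. It assumes biopandas' `PandasMmcif` reader and that the column shows up in the `ATOM` data frame of an updated file. The helper names are hypothetical, and the list of "missing" markers (`?`, `.`, `None`, `nan`) is a guess meant to cover both the raw mmCIF placeholders and whatever the parser returns.

```python
from typing import List, Optional

SIFTS_COLUMN = "pdbx_sifts_xref_db_num"  # column named in the post

# Values treated as "no UniProt mapping" (assumed set, see lead-in).
MISSING = {"", "?", ".", "None", "nan", "<NA>"}


def to_uniprot_resid(value) -> Optional[int]:
    """Convert one column value to an int, or None if the residue is unmapped."""
    if value is None:
        return None
    text = str(value).strip()
    if text in MISSING:
        return None
    return int(float(text))  # tolerates "12" as well as "12.0"


def load_uniprot_resids(cif_path: str) -> List[Optional[int]]:
    """Per-atom UniProt resids from an updated mmCIF (requires biopandas)."""
    from biopandas.mmcif import PandasMmcif  # imported lazily on purpose

    atoms = PandasMmcif().read_mmcif(cif_path).df["ATOM"]
    return [to_uniprot_resid(v) for v in atoms[SIFTS_COLUMN]]
```

Ligands normally sit in the `HETATM` frame rather than `ATOM`, so the same conversion would have to be applied there as well.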
This paper (with Python code and a web server) describes a similar approach using SIFTS:
Faezov, B.; Dunbrack, R. L., Jr. PDBrenum: A Webserver and Program Providing Protein Data Bank Files Renumbered according to Their UniProt Sequences. PLoS One 2021, 16 (7), e0253411. https://doi.org/10.1371/journal.pone.0253411.
Residues without a mapping are renumbered with an offset of 5k/50k so that they do not overlap with the new resids of the amino acids.
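As an illustration of the offset idea, here is a small sketch on plain lists. It is not taken from the PDBrenum code. The rule for picking 5k versus 50k (use 50000 once a mapped resid reaches 5000) is an assumption, and both function names are hypothetical.

```python
from typing import List, Optional


def choose_offset(mapped: List[Optional[int]]) -> int:
    """Pick 5000 or 50000 (assumed rule: the larger one if mapped resids reach 5000)."""
    largest = max((m for m in mapped if m is not None), default=0)
    return 50000 if largest >= 5000 else 5000


def renumber_with_offset(
    original: List[int],
    mapped: List[Optional[int]],
    offset: Optional[int] = None,
) -> List[int]:
    """Use the UniProt resid where available, else original resid + offset."""
    if len(original) != len(mapped):
        raise ValueError("original and mapped must have the same length")
    if offset is None:
        offset = choose_offset(mapped)
    return [m if m is not None else o + offset for o, m in zip(original, mapped)]


# Two mapped residues followed by an unmapped ligand with original resid 301.
print(renumber_with_offset([1, 2, 301], [27, 28, None]))  # [27, 28, 5301]
```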
However, occasionally part of a chain consists of UNKs, so I will implement a way to use continuous numbering with respect to the UniProt-mapped resids for these.
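One possible reading of "continuous numbering" is sketched below for a single chain: unmapped residues continue counting from the last mapped resid, and unmapped residues at the start of the chain count backwards from the first mapped one. The function name `fill_continuous` is hypothetical. The sketch works per residue (not per atom) and does not check for collisions with mapped resids further along the chain.

```python
from typing import List, Optional


def fill_continuous(mapped: List[Optional[int]]) -> List[Optional[int]]:
    """Fill None gaps in one chain's per-residue UniProt resids."""
    filled = list(mapped)

    # Forward pass: each unmapped residue gets the previous resid + 1.
    last = None
    for i, value in enumerate(filled):
        if value is not None:
            last = value
        elif last is not None:
            last += 1
            filled[i] = last

    # Leading gap: count backwards from the first mapped residue.
    first = next((i for i, v in enumerate(mapped) if v is not None), None)
    if first is not None:
        for i in range(first - 1, -1, -1):
            filled[i] = filled[i + 1] - 1

    return filled  # unchanged if nothing in the chain was mapped


print(fill_continuous([None, 10, 11, None, None, 20]))  # [9, 10, 11, 12, 13, 20]
```

A gap longer than the jump to the next mapped resid would produce duplicate numbers, so a real implementation would need a fallback (for example the offset scheme) in that case.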
Work in progress - if there's an already existing way to do this, let me know :)
mrauha changed the title from "Using" to "Using SIFTS data for renumbering residues to match the Uniprot sequence resids" on Aug 18, 2022