Hackathon team. Lead: Philippe Youkharibache. Tech: Matteo Manfredi, Chiragkumar Patel, and Raul Cachau. Writer: Stephanie Byrum
Understanding 3D protein structure within 2D space. A 1D protein sequence provides amino acid residue information, but critical topological information is completely absent. A 3D structure, on the other hand, provides topological annotation and visualization of domain interactions, but lacks the amino acid residue information, which makes biological function difficult to interpret. 2D representation maps provide both the sequence information and the 3D assembly of domains, allowing biologists to understand residue relationships and, ultimately, function.
Ig domains are an important component of the immune system. They recognize foreign antigens, bind the receptor binding domain (RBD), and clear antigens from the biological system. IgV domains are particularly important due to the ability of the variable region to bind new antigens and RBDs. Visualizing the domain folds and RBD binding in a 2D map will provide insight into regions of insertions and/or mutations preventing the immune system from recognizing pathogens. 2D maps will provide a tool to identify new therapeutic targets.
- 2D map visualization: create a 2D map of Ig domains that non-experts can understand.
- Overlay features on the 2D map: use 3D annotation to create a vector representation of residue features and incorporate those features into the 2D map. Example: bold residues to indicate whether the structure is facing in or out.
- Create 2D maps for domains with no prior 3D annotation: identify the key features using machine learning in order to create novel 2D maps from 1D sequence information.
- A structure is parsed by iCn3D to obtain a residue contact list
- The corresponding sequence is extracted and its numbering harmonized. This step is necessary to establish a correspondence between 2D map positions, which makes it possible to compare two or more sequences. We use the Kabat reference numbering when available. This requirement can be simplified by aligning the sequences of the proteins to be compared and assigning sequence numbers that match the alignment.
- An enhanced 1D sequence is generated and saved in a JSON file. The purpose of this step is to capture per-residue information (e.g., the number of contacts a residue makes with other residues and the list of residues it interacts with).
- The enhanced 1D sequence can be used to generate enhanced 1D representations (logo plots, contact maps, etc.).
- If no template exists, a new 2D plot must be generated from the residue contact information. This procedure is being developed, based on the contact map information.
- When a template is available, a new 2D ProteoMap can be generated as a modification of that template. This is a valuable approach because it allows ProteoMaps to be compared. ProteoMaps can be enhanced by color coding 1D properties captured in the enhanced 1D sequence.
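
The enhanced 1D sequence step can be illustrated with a minimal Python sketch. It is not the project's actual code: the function name `build_enhanced_sequence`, the input shapes (a list of `(position, residue)` pairs and a list of contact pairs), and the JSON field names are all assumptions made for illustration, and the real iCn3D contact list would need to be parsed into this shape first.

```python
import json


def build_enhanced_sequence(numbered_residues, contacts):
    """Attach per-residue contact information to a numbered sequence.

    numbered_residues: list of (position_label, amino_acid) pairs, e.g. ("52A", "S").
    contacts: iterable of (position_label, position_label) pairs.
    Both shapes are hypothetical; adapt them to the real contact list format.
    """
    # Remember the sequence order so contact partners can be listed in that order.
    order = {position: index for index, (position, _) in enumerate(numbered_residues)}
    partners = {position: set() for position in order}

    for first, second in contacts:
        # Skip self-contacts and positions that are not part of this sequence.
        if first == second or first not in order or second not in order:
            continue
        # Contacts are symmetric, so record both directions.
        partners[first].add(second)
        partners[second].add(first)

    return [
        {
            "position": position,
            "residue": residue,
            "n_contacts": len(partners[position]),
            "contacts": sorted(partners[position], key=order.get),
        }
        for position, residue in numbered_residues
    ]


if __name__ == "__main__":
    sequence = [("1", "Q"), ("2", "V"), ("3", "Q")]
    contact_pairs = [("1", "3"), ("3", "1"), ("2", "3")]
    print(json.dumps(build_enhanced_sequence(sequence, contact_pairs), indent=2))
```

Storing the position label as a string keeps insertion codes such as `52A` intact, which matters when two sequences are compared position by position.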
- scripts/excel_template.py -> can be called from the command line with a parameter of 1 or 0: 1 creates the template(s) with the color coding for the contact maps, and 0 creates the standard ones.
- scripts/.xlsx -> templates
- data/source -> contains the .txt files output by ANARCI (http://opig.stats.ox.ac.uk/webapps/newsabdab/sabpred/anarci/) for Kabat numbering normalization. The .txt to .json conversion is provided by an Excel macro (Chirag).
- data/input_files -> folder with the JSON files
- scripts/contacts.sh -> contact counts are generated with this script, starting from the contact map generated by iCn3D (refer to iCn3D for details)
- data/output and data/output2 -> folders where the colorized (by number of contacts) outputs are stored
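
For readers who prefer a scripted route, the .txt to .json conversion could also be sketched in Python. This is an illustrative alternative, not the Excel macro used in the repository. It assumes the ANARCI text output has comment lines starting with `#`, one residue per line as chain, position, optional insertion code, and amino acid, gaps written as `-`, and `//` closing each domain. The function name `parse_anarci_text` and the JSON field names are hypothetical.

```python
import json


def parse_anarci_text(text):
    """Convert ANARCI-style numbered text into a list of residue records.

    Assumed line layout (check against your own files):
        H 52    A S    -> chain, position, insertion code, amino acid
        H 53      G    -> chain, position, amino acid
    """
    residues = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, header comments, and the domain terminator.
        if not line or line.startswith("#") or line == "//":
            continue
        fields = line.split()
        if len(fields) == 3:
            chain, number, amino_acid = fields
            insertion = ""
        elif len(fields) == 4:
            chain, number, insertion, amino_acid = fields
        else:
            continue  # Unrecognized layout; ignore rather than guess.
        if amino_acid == "-":
            continue  # Gap: the numbering position is unoccupied.
        residues.append(
            {"chain": chain, "position": number + insertion, "residue": amino_acid}
        )
    return residues


if __name__ == "__main__":
    sample = "# Scheme = kabat\nH 1       Q\nH 2       V\nH 3       -\nH 52    A S\n//\n"
    print(json.dumps(parse_anarci_text(sample), indent=2))
```

Gap positions are dropped here so that the JSON lists only residues actually present; keeping them instead would be a reasonable choice if downstream templates expect every numbered position.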