Authors: ROUAUD Lucas
Formation: Master 2 Bioinformatics at Université de Paris
This program was written during an internship at the Institut de Minéralogie, de Physique et de Cosmochimie, Sorbonne Université, UMR 7590, CNRS, Muséum national d'Histoire naturelle, in the "bioinformatique et biophysique" team.
This work was supported by the French National Research Agency (ANR-21-CE12-0021).
A script called INSTALL.sh
is provided to facilitate the installation. To use it, run:
bash INSTALL.sh
All the commands it runs are described in the next parts (Cloning the repository; Install conda environment; Data decompression)! This script is available on the release page: https://github.com/FilouPlains/FIERLENIUZ/releases/tag/v1.2.3
Then activate the environment:
conda activate fierleniuz
To clone the repository onto your computer, use the following commands:
git clone [email protected]:FilouPlains/FIERLENIUZ.git
cd FIERLENIUZ/
This repository uses Python. Packages are installed with conda; refer to the conda website for installation instructions: https://docs.conda.io/projects/conda/en/stable/user-guide/install/download.html
Once conda is installed (if it was not already the case), simply use the following commands (from the root project directory 📁 `./`):
conda env create -n fierleniuz -f env/fierleniuz.yml
conda activate fierleniuz
Some data files were too large to be stored directly in the repository, so they were compressed. The following commands have to be used (from the root project directory 📁 `./`):
tar -xf data/peitsch2vec/default_domain.tar.gz -C data/peitsch2vec/
tar -xf data/peitsch2vec/redundancy/30_percent_redundancy.tar.gz -C data/peitsch2vec/redundancy/
tar -xf data/peitsch2vec/redundancy/70_percent_redundancy.tar.gz -C data/peitsch2vec/redundancy/
tar -xf data/peitsch2vec/redundancy/90_percent_redundancy.tar.gz -C data/peitsch2vec/redundancy/
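If you prefer to script the decompression, the same extraction can be done with Python's standard `tarfile` module. This is only a sketch: the archive list mirrors the repository layout above, and the self-contained demonstration at the end runs on a throw-away archive instead of the real data.

```python
import tarfile
import tempfile
from pathlib import Path

# The compressed archives shipped with the repository (see the tar
# commands above); each one is extracted next to its parent directory.
ARCHIVES = [
    "data/peitsch2vec/default_domain.tar.gz",
    "data/peitsch2vec/redundancy/30_percent_redundancy.tar.gz",
    "data/peitsch2vec/redundancy/70_percent_redundancy.tar.gz",
    "data/peitsch2vec/redundancy/90_percent_redundancy.tar.gz",
]

def extract_all(archives: list[str]) -> None:
    """Extract every existing .tar.gz archive into its parent directory."""
    for archive in archives:
        path = Path(archive)
        if path.exists():
            with tarfile.open(path, "r:gz") as tar:
                tar.extractall(path.parent)

# Self-contained demonstration on a temporary archive:
with tempfile.TemporaryDirectory() as tmp:
    member = Path(tmp) / "dummy.txt"
    member.write_text("hello")
    archive = Path(tmp) / "dummy.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(member, arcname="dummy.txt")
    member.unlink()                      # remove the original file...
    extract_all([str(archive)])          # ...and restore it from the archive
    restored = (Path(tmp) / "dummy.txt").read_text()

print(restored)  # hello
```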
For a description of the parameters and an example command, run:
python src/embeddings/peitsch2vec.py -h
This script transforms a corpus of hydrophobic clusters into vectors.
- A script used to transform a `.fasta` file into a `.out` file. For a description of the parameters and an example command, run:
python3 src/hca_extraction/hca_extraction.py -h
- A script used to compute the context diversity and output it in `.csv` format. For a description of the parameters and an example command, run:
python3 src/scope_tree/context_extraction.py -h
- A script used to compute a network linked to SCOPe, with context diversity coloration and the context diversity distribution plotted through Plotly. For a description of the parameters and an example command, run:
python3 src/scope_tree/scope_tree.py -h
`cd-hit` is a program used to treat sequence redundancy. It is available on this webpage (GitHub): https://github.com/weizhongli/cdhit/releases. To use it, the following command is used:
cd-hit -i {scope_database}.fa -o cd-hit_{i}.fasta -c 1
With:
- `{scope_database}.fa`: use here the original SCOPe database, with the sequences in `.fa` format. You can download the dataset here: https://scop.berkeley.edu/astral/subsets/ver=2.08. In "Percentage identity-filtered Astral SCOPe genetic domain sequence subsets, based on PDB SEQRES records", use the `sequences` option and the `less than 30`, `less than 70` and `less than 90` parameters.
- `cd-hit_{i}.fasta`: how to name the output file. For this repository, the outputs are named `cd-hit_30.fasta`, `cd-hit_70.fasta` and `cd-hit_90.fasta`.
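As a sketch, the three invocations used for this repository can be generated programmatically. Note that the Astral subset file names below are assumptions (they follow the usual Astral naming scheme); check the names of the files you actually downloaded.

```python
# Identity thresholds mapped to the assumed Astral subset file names
# (hypothetical: verify against your own downloads).
ASTRAL_SUBSETS = {
    30: "astral-scopedom-seqres-gd-sel-gs-bib-30-2.08.fa",
    70: "astral-scopedom-seqres-gd-sel-gs-bib-70-2.08.fa",
    90: "astral-scopedom-seqres-gd-sel-gs-bib-90-2.08.fa",
}

def cd_hit_command(identity: int, fasta: str) -> list[str]:
    """Build one cd-hit command line, as in the command shown above.

    `-c 1` sets the cd-hit identity threshold to 100 %: the 30/70/90 %
    filtering itself was already done by Astral.
    """
    return ["cd-hit", "-i", fasta, "-o", f"cd-hit_{identity}.fasta", "-c", "1"]

commands = [cd_hit_command(i, f) for i, f in ASTRAL_SUBSETS.items()]
for command in commands:
    print(" ".join(command))
```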
For some scripts, the PCIA cluster (Plateforme de calcul intensif du Muséum national d'Histoire naturelle) was used. The following script was used to launch the job (from this cluster path 📁 `STAGE_M2/`):
sbatch LAUNCH_SCRIPT.sh
- Output is written to the directory 📁 `/mnt/beegfs/abruley/CONTEXT/`.
- The script HAS TO BE MANUALLY EDITED if you want to change the input database.

The script used is available at 📁 `src/cluster/launch_script_90.sh`.
- 📁 `src/cluster/launch_script_90.sh`: Script used on the cluster to compute the context diversity.
- 📁 `src/embeddings/arg_parser.py`: Parses the given arguments for the main script, 📁 `src/embeddings/peitsch2vec.py`.
- 📁 `src/embeddings/context_analyzer.py`: Computes ordered and unordered context diversities. There is also a function to extract and center words for a given window.
- 📁 `src/embeddings/domain.py`: UNUSED, deprecated.
- 📁 `src/embeddings/genetic_deep_learning/correlation_matrix.py`: Computes the correlation between two matrices.
- 📁 `src/embeddings/genetic_deep_learning/genetic_algorithm.py`: Genetic algorithm to select the best Word2Vec model.
- 📁 `src/embeddings/genetic_deep_learning/hca_out_format_reader.py`: Transforms a whole `.out` file into a corpus usable by Word2Vec.
- 📁 `src/embeddings/genetic_deep_learning/running_model.py`: Runs a Word2Vec model.
- 📁 `src/embeddings/hca_reader.py`: Parses a `.out` file to extract information from it.
- 📁 `src/embeddings/hcdb_parser.py`: Parses the hydrophobic cluster database.
- 📁 `src/embeddings/notebook/comparing_distribution.ipynb`: Plots the distribution of some characteristics using Plotly.
- 📁 `src/embeddings/notebook/data_meaning.ipynb`: Plots information, mostly related to the norm, using Plotly.
- 📁 `src/embeddings/notebook/matplotlib_for_report.ipynb`: Uses matplotlib to produce `.pdf` plots for the report.
- 📁 `src/embeddings/notebook/matrix.ipynb`: Computes the cosine similarity matrix.
- 📁 `src/embeddings/notebook/projection.ipynb`: Tests numerous projections of the vectors, with numerous descriptors.
- 📁 `src/embeddings/notebook/sammon.py`: Computes a Sammon map using this GitHub repository: https://github.com/tompollard/sammon.
- 📁 `src/embeddings/peitsch2vec.py`: The main program, used to compute the Word2Vec vectors and other characteristics.
- 📁 `src/embeddings/peitsch.py`: Object to manipulate the hydrophobic clusters.
- 📁 `src/embeddings/write_csv.py`: Writes a `.csv` file with some hydrophobic cluster characteristics.
- 📁 `src/hca_extraction/arg_parser.py`: Parses the given arguments for the 📁 `src/hca_extraction/hca_extraction.py` script.
- 📁 `src/hca_extraction/domain_comparison.py`: Script used to compare multiple domains with each other by computing the context diversity; outputs the best results according to a user-defined threshold.
- 📁 `src/hca_extraction/hca_extraction.py`: Goes from `.fasta` files to a `.out` file.
- 📁 `src/scope_tree/arg_parser.py`: Parses the given arguments for the 📁 `src/scope_tree/context_extraction.py` and 📁 `src/scope_tree/scope_score.py` scripts.
- 📁 `src/scope_tree/context_extraction.py`: Extracts the context information, also taking the SCOPe levels into consideration, and outputs a `.csv` file.
- 📁 `src/scope_tree/scope_score.py`: Computes a score between two or more domains to see how far apart they are in the SCOPe tree.
- 📁 `src/scope_tree/scope_tree.py`: Computes a network for one given hydrophobic cluster. The network is linked to the SCOPe tree, with the context diversity indicated on each node.
1. 📁 `data/HCDB_2018_summary_rss.csv`: Hydrophobic cluster database with the summary of the regular secondary structures. Made in 2018.
2. 📁 `pyHCA_SCOPe_30identity_globular.out`: pyHCA output, applied to the SCOPe `2.07` database with a redundancy level of 30 %, downloaded through Astral. Original dataset available here: https://raw.githubusercontent.com/DarkVador-HCA/Order-Disorder-continuum/main/data/SCOPe/hca.out
3. 📁 `SCOPe_2.08_classification.txt`: A file that permits going from the domain ID to the precise SCOPe class (for instance, from `d1ux8a_` to `a.1.1.1`). The file is available here: https://scop.berkeley.edu/downloads/parse/dir.des.scope.2.08-stable.txt
4. 📁 `output_plot/`: All plots produced by the notebook `src/embeddings/notebook/matplotlib_for_report.ipynb`, all in `.pdf` format.
5. 📁 `data/REDUNDANCY_DATASET/cd-hit_30.fasta`; 📁 `data/REDUNDANCY_DATASET/cd-hit_70.fasta`; 📁 `data/REDUNDANCY_DATASET/cd-hit_90.fasta`: Amino acid sequences from SCOPe `2.08` with different redundancy levels (30 %, 70 %, 90 %). Redundancy was treated through Astral and cd-hit. Original datasets are available here: https://scop.berkeley.edu/astral/subsets/ver=2.08
6. 📁 `data/REDUNDANCY_DATASET/cd-hit_30.out`; 📁 `data/REDUNDANCY_DATASET/cd-hit_70.out`; 📁 `data/REDUNDANCY_DATASET/cd-hit_90.out`: Hydrophobic cluster sequences from SCOPe `2.08` with different redundancy levels (30 %, 70 %, 90 %). Redundancy was treated through Astral and cd-hit. Not treated by pyHCA.
7. 📁 `data/REDUNDANCY_DATASET/redundancy_30_context_conservation_2023-05-09_14-38-42.csv`; 📁 `data/REDUNDANCY_DATASET/redundancy_70_context_conservation_2023-05-11_10-39-29.csv`; 📁 `data/REDUNDANCY_DATASET/redundancy_90_context_conservation_2023-05-11_10-41-19.csv`: All context diversities computed for the different redundancy levels (30 %, 70 %, 90 %). Redundancy was treated through Astral and cd-hit. Little things to know: `100.0` = context computed with a full diversity; `100` = context could not be computed, so a full diversity was attributed. This has been corrected in the program by putting `NA` instead of `int(100)`.
8. 📁 `data/peitsch2vec/default_domain/`: Data for the dataset with a redundancy level of 30 %, treated by pyHCA, not treated by cd-hit.
9. 📁 `data/peitsch2vec/redundancy/30_percent_redundancy/`; 📁 `data/peitsch2vec/redundancy/70_percent_redundancy/`; 📁 `data/peitsch2vec/redundancy/90_percent_redundancy/`: Data for the datasets with redundancy levels of 30 %, 70 % and 90 %, not treated by pyHCA, treated by cd-hit.
For the paths given in points 8. and 9., the next files are present:
- 📁 `characteristics_{date}.npy`: Hydrophobic cluster characteristics for a given redundancy level, like the size or the regular secondary structure. The characteristics are listed here, in the same order as in the file:
    - Peitsch code.
    - Hydrophobic cluster (binary code).
    - Hydrophobic score.
    - Cluster size.
    - Regular secondary structure.
    - Occurrences.
    - Number of clusters inside the domain where cluster[i] is found.
    - Size of the domain where cluster[i] is found.
    - HCA score, defined by pyHCA, of the domain where cluster[i] is found.
    - P-value, defined by pyHCA, of the domain where cluster[i] is found.
- 📁 `corpus_{date}.npy`: Corpus given to Word2Vec, after applying the filters.
- 📁 `embedding_{date}.npy`: Vector embeddings generated by Word2Vec.
- 📁 `matrix_cosine_{date}.npy`: Cosine similarity matrix generated from the vector embeddings produced by Word2Vec.
- 📁 `model_{date}.w2v`: The trained Word2Vec models.
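As a sketch of how these arrays can be read back, assuming standard NumPy `.npy` files: a small random matrix stands in for `embedding_{date}.npy` (the real `{date}` suffix depends on the run), and a cosine matrix analogous to `matrix_cosine_{date}.npy` is rebuilt from it.

```python
import io

import numpy as np

# Stand-in for `embedding_{date}.npy`: a synthetic embedding with
# 5 "words" of 16 dimensions, saved and reloaded through np.save/np.load
# exactly as the real files would be (an in-memory buffer replaces the path).
rng = np.random.default_rng(0)
buffer = io.BytesIO()
np.save(buffer, rng.normal(size=(5, 16)))
buffer.seek(0)
embedding = np.load(buffer)

# Cosine similarity matrix: normalise every row to unit length, then
# take all pairwise dot products.
unit = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
cosine = unit @ unit.T

print(cosine.shape)  # (5, 5)
```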
$ tree -lF -h
[6.8G]
.
├── [4.0K] "data/"
│   ├── [4.0K] "output_plot/"
│   ├── [4.0K] "peitsch2vec/"
│   │   ├── [4.0K] "default_domain/"
│   │   └── [4.0K] "redundancy/"
│   │       ├── [4.0K] "30_percent_redundancy/"
│   │       ├── [4.0K] "70_percent_redundancy/"
│   │       └── [4.0K] "90_percent_redundancy/"
│   └── [4.0K] "REDUNDANCY_DATASET/"
├── [4.0K] "env/"
│   ├── [ 905] "fierleniuz.yml"
│   └── [ 885] "README.md"
├── [ 895] "INSTALL.sh"
├── [ 20K] "LICENSE"
├── [ 13K] "README.md"
└── [4.0K] "src/"
    ├── [4.0K] "cluster/"
    ├── [4.0K] "embeddings/"
    │   ├── [4.0K] "genetic_deep_learning/"
    │   └── [4.0K] "notebook/"
    ├── [4.0K] "hca_extraction/"
    └── [4.0K] "scope_tree/"

18 directories, 88 files
This work is licensed under a Creative Commons Attribution 4.0 International License.