- Acknowledgement
- Getting started
- Utilities
- Developing
This repository contains several utilities that simulate an evolutionary process and match it against real biological data.
The suffix tree implementation was adapted from Peter Us' code: https://github.com/ptrus/suffix-trees
Clone the repository, then step into the cloned directory:

    cd jump_model

Create a Python 3 virtual environment (further reading: https://docs.python.org/3/tutorial/venv.html):

    python3 -m venv venv

Activate the virtual environment:

    source venv/bin/activate

Use pip to install the Python requirements:

    pip install -r requirements.txt

Now you can start developing!
This utility runs a simulation of the jump model, producing a zipped JSON file as the result.
The simulation constructs a YuleTree model in which the edge lengths of the tree are drawn from an exponential distribution according to an input scale parameter.
After constructing the tree, a "genome" is propagated from the root to the leaves while optionally mutating it at each inheritance step.
The longer an edge, the more probable it is that a gene will "jump" during the inheritance of the genome to that node.
When a "jump" occurs, the size of the jumping group is drawn from a geometric distribution parameterized by the alpha parameter.
The leaves of the tree are taken as the genomes of a simulated population; these genomes are used as sequences of integers (representing genes) to construct a generalized suffix tree.
The suffix tree is used to count the number of occurrences of each shared subsequence for each subsequence length.
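The two distributions above drive the randomness of the simulation. As a rough illustration only (assuming NumPy; this is not the actual `Simulate.py` implementation), edge lengths and jump-group sizes can be drawn like this:

    import numpy as np

    rng = np.random.default_rng(1234)   # corresponds to the "seed" in the output file

    scale = 0.5    # expected edge length (the exponential scale parameter)
    alpha = 0.5    # parameter of the geometric distribution for the jump group size

    edge_length = rng.exponential(scale)     # length of a single tree edge
    jump_group_size = rng.geometric(alpha)   # number of genes that jump together

    print(edge_length, jump_group_size)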
    python Simulate.py CONFIG_FILE
The resulting file has the following structure:
    {
      "model": {
        "newick": "(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);",
        "edge_count": 0,
        "median_edge_len": 0.0,
        "average_edge_len": 0.0
      },
      "genome_size": 4096,
      "total_jumps": 17,
      "avg_jumps": 10.0,
      "expected_edge_len": 0.5,
      "leaves_count": 256,
      "seed": 1234,
      "occurrences": {
        "2": [2, 2, 2, 3],
        "10": [3, 4, 4]
      },
      "alpha": 0.5
    }
- `model` - Holds data related to the construction of the tree:
  - `newick` - The resulting tree, represented in Newick format.
  - `edge_count` - The number of edges in the tree.
  - `median_edge_len` - The median edge length.
  - `average_edge_len` - The average edge length.
- `genome_size` - The number of genes in each genome.
- `total_jumps` - The total number of jump events which occurred during the simulation.
- `avg_jumps` - The average number of jump events in a single inheritance step.
- `expected_edge_len` - The expected edge length for the constructed tree (the scale parameter of the exponential distribution).
- `leaves_count` - The number of leaves in the generated tree.
- `seed` - The value used to seed the random number generator.
- `occurrences` - A dictionary containing the list of common occurrences for each word size.
- `alpha` - The alpha argument used to determine the size of the "jumping" group.
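Assuming the zipped result is a gzip-compressed JSON file (the file name below is hypothetical), a single result can be loaded and inspected like this:

    import gzip
    import json

    # Hypothetical path; any result file written by the Simulate utility will do.
    with gzip.open("data/genes/example_result.gz", "rt") as handle:
        result = json.load(handle)

    print(result["expected_edge_len"], result["leaves_count"])
    print(result["occurrences"].get("2", []))   # occurrence counts for word size 2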
The simulation reads its parameters from a configuration file; an example file can be found in the code directory.
    {
      "data_path": "~/jump_model/data/genes/",
      "tree_count": 70,
      "alpha": 1,
      "genome_size": 4096,
      "leaf_count": 256,
      "processes": 20,
      "scale": [0.1, 0.6, 0.1],
      "ultrametric": true
    }
- `data_path` - Output directory.
- `tree_count` - The number of trees to generate; a JSON file is produced for each tree.
- `alpha` - The alpha parameter (0 < alpha <= 1.0).
- `genome_size` - Number of genes in each genome.
- `leaf_count` - The number of leaves in the tree (each leaf represents a genome).
- `processes` - Number of processes to use for concurrency.
- `scale` - The scale range used to determine the exponential distribution of the edge lengths: starting from 0.1 up to (and including) 0.6, advancing by 0.1 each step (see the sketch after this list).
- `ultrametric` - If `false`, the tree is constructed by adding two child nodes to a randomly selected leaf until the number of leaves in the tree equals `leaf_count`. If set to `true`, the tree is constructed by "hanging" a new parent for a randomly selected leaf and creating a new sibling for it, thus keeping the edge lengths more evenly distributed.
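As a minimal sketch (assuming NumPy, and purely for illustration), the `scale` triple `[start, stop, step]` expands into the individual expected edge lengths like this:

    import numpy as np

    # Expand the "scale" triple into the individual expected-edge-length values.
    start, stop, step = 0.1, 0.6, 0.1
    scales = np.arange(start, stop + step / 2, step)   # make the stop value inclusive
    print(scales.round(1))                              # [0.1 0.2 0.3 0.4 0.5 0.6]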
This utility is used to convert the JSON files produced by the `Simulate` utility into CSV files.
    python Tabulate.py CONFIG_FILE
The utility reads its parameters from a configuration file; an example file can be found in the code directory.
    {
      "data": "~/jump_model/data/genes",
      "output": "~/jump_model/data/distributions",
      "file_pattern": "*.gz",
      "processes": 20
    }
- `data` - The path to the directory containing the JSON files produced by the `Simulate` utility.
- `output` - The directory to put the produced CSVs in.
- `file_pattern` - The pattern to match the JSON files to be parsed.
- `processes` - Number of processes to use for concurrency (a sketch of how these fields can be used follows this list).
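The sketch below is purely illustrative: it shows how a configuration like the one above could drive a parallel pass over the simulation files (assuming the `*.gz` files are gzip-compressed JSON), but it is not `Tabulate.py` itself, and the layout of the CSVs the utility actually writes is not shown.

    import glob
    import gzip
    import json
    import os
    from multiprocessing import Pool

    CONFIG = {
        "data": os.path.expanduser("~/jump_model/data/genes"),
        "file_pattern": "*.gz",
        "processes": 20,
    }

    def load_result(path):
        """Load one gzip-compressed JSON result produced by the Simulate utility."""
        with gzip.open(path, "rt") as handle:
            return json.load(handle)

    if __name__ == "__main__":
        paths = glob.glob(os.path.join(CONFIG["data"], CONFIG["file_pattern"]))
        with Pool(CONFIG["processes"]) as pool:
            results = pool.map(load_result, paths)
        print(f"loaded {len(results)} simulation results")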
This utility contains several subcommands used to parse real biological data into structures relevant for the Jump Model simulation.
This subcommand (`parse`) reads its parameters from a configuration file; an example file can be found in the code directory.
The input CSVs are assumed to have the following fields: "Taxid", "Gene name", "Contig", "Srnd", "Start", "Stop", "Length", "Cog" (a minimal reader sketch follows the configuration description below).
    python RealData.py parse CONFIG_FILE
    {
      "real_data": "/tmp",
      "output": "/tmp"
    }
- `real_data` - The path to the directory containing the "real data" files.
- `output` - The path in which to create the resulting JSON file.
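The input CSVs described above can be read with a standard `csv.DictReader`; the following is a minimal, illustrative sketch (the file name is hypothetical):

    import csv

    with open("real_data/example_genome.csv", newline="") as handle:
        for row in csv.DictReader(handle):
            # Each row describes one gene; the column names are those listed above.
            print(row["Taxid"], row["Gene name"], row["Cog"], int(row["Length"]))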
    python RealData.py make_csvs DATA_FILE OUT_DIR MIN_OCCURRENCES MIN_DENSITY
- `DATA_FILE` - The JSON file produced by the `parse` subcommand.
- `OUT_DIR` - The directory to put the resulting CSVs in.
- `MIN_OCCURRENCES` - The minimum number of occurrences to consider (set to 0 to consider all).
- `MIN_DENSITY` - The minimum density per occurrence to consider (set to 0 to consider all).
    python RealData.py draw DATA_FILE OUT_DIR
- `DATA_FILE` - The JSON file produced by the `parse` subcommand.
- `OUT_DIR` - The directory to put the resulting PNGs in.
This utility is used to calculate a probability "score" by comparing real biological data produced by the `RealData` utility to data produced by the `Tabulate` utility.
    python Likelihood.py SIMULATED_DIR SIMULATED_GLOB REALDATA_DIR REALDATA_GLOB [SIMULATED_REGEX = \d+] [REALDATA_REGEX = \d+]
- `SIMULATED_DIR` - Directory containing CSVs produced by the `Tabulate` utility.
- `SIMULATED_GLOB` - Glob pattern to match CSV files in the simulated directory.
- `REALDATA_DIR` - Directory containing CSVs produced by the `RealData` utility.
- `REALDATA_GLOB` - Glob pattern to match CSV files in the real-data directory.
- `SIMULATED_REGEX` - Optional. The regex used to extract the word size from a CSV file name. Default: `\d+` (see the example after this list).
- `REALDATA_REGEX` - Optional. The regex used to extract the word size from a CSV file name. Default: `\d+`.
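As an example of the word-size extraction mentioned above, the default regex simply picks the first run of digits in a file name (the file names below are made up):

    import re

    simulated_regex = r"\d+"
    for name in ["word_size_2.csv", "word_size_10.csv"]:
        match = re.search(simulated_regex, name)
        if match:
            print(name, "-> word size", int(match.group()))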
This utility calculates the average distribution of word occurrences from multiple JSON files produced by the `Simulate` utility.
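A purely illustrative sketch of such an averaging step, using made-up data (this is not the actual `Averages.py` logic):

    from collections import defaultdict

    results = [
        {"occurrences": {"2": [2, 2, 3], "10": [3, 4]}},
        {"occurrences": {"2": [1, 2], "10": [5]}},
    ]

    # Pool the occurrence counts per word size, then average them.
    totals = defaultdict(list)
    for result in results:
        for word_size, counts in result["occurrences"].items():
            totals[word_size].extend(counts)

    averages = {size: sum(counts) / len(counts) for size, counts in totals.items()}
    print(averages)   # {'2': 2.0, '10': 4.0}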
    python Averages.py CONFIG_FILE
A sample configuration file can be found in the code directory.
    {
      "data": "~/jump_model/data/genes",
      "output": "~/jump_model/data/distributions",
      "file_pattern": "*.gz",
      "processes": 20
    }
- `data` - The path to the directory containing the JSON files produced by the `Simulate` utility.
- `output` - The directory to put the resulting averaged distributions in.
- `file_pattern` - The file pattern to match when looking for the JSON files to parse.
- `processes` - Number of processes to use for concurrency.
This utility is used to create PNGs of distribution plots from the JSON files created by the `Averages` utility.
Each plot shows the data for the different expected edge lengths in the same image.
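As a rough illustration (assuming matplotlib; the data and axis labels below are made up and not taken from `MakePlots.py`), a plot with one line per expected edge length could look like this:

    import matplotlib.pyplot as plt

    # Made-up distributions, keyed by expected edge length.
    distributions = {
        0.1: [0.50, 0.30, 0.15, 0.05],
        0.3: [0.40, 0.30, 0.20, 0.10],
    }

    for edge_len, dist in distributions.items():
        plt.plot(range(1, len(dist) + 1), dist, label=f"expected edge length {edge_len}")

    plt.xlabel("number of occurrences")
    plt.ylabel("frequency")
    plt.legend()
    plt.savefig("distribution_plot.png")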
    python MakePlots.py DATA_PATH OUTPUT_PATH EDGE_LENGTHS
- `DATA_PATH` - The path to the directory containing the JSON files produced by the `Averages` utility.
- `OUTPUT_PATH` - The path to the directory that will contain the produced images.
- `EDGE_LENGTHS` - The number of different expected edge lengths included in the input JSON files.
This utility is used to combine the PNGs produced by the `MakePlots` utility into a single GIF file.
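A minimal Pillow sketch of merging PNG frames into a single GIF (not necessarily how `MergePlots.py` does it; the file names are made up):

    from PIL import Image

    frame_paths = ["plots/frame_01.png", "plots/frame_02.png", "plots/frame_03.png"]
    frames = [Image.open(path) for path in frame_paths]

    # Save the first frame and append the rest as an animated GIF.
    frames[0].save(
        "plots/merged.gif",
        save_all=True,
        append_images=frames[1:],
        duration=500,   # milliseconds per frame
        loop=0,         # loop forever
    )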
    python MergePlots.py OUTPUT_DIR OUTPUT_NAME VISUALIZED_PATH
- `OUTPUT_DIR` - The directory that will contain the produced GIFs.
- `OUTPUT_NAME` - The name under which to save the resulting GIF file.
- `VISUALIZED_PATH` - The directory containing the PNG files produced by the `MakePlots` utility.
To add a new Python package, add it to the requirements.txt file.
Testing is done using pytest: https://docs.pytest.org
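A minimal, hypothetical pytest example (not an actual test from the repository); place it in a `tests/` directory and run `pytest` from the repository root:

    def average(values):
        """Tiny helper used only for this example."""
        return sum(values) / len(values)

    def test_average():
        assert average([2, 2, 2, 3]) == 2.25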