Code to simulate phylogenetic trees that can be used to train neural networks

aezarebski/derp-simulation

DERP Simulation

This repository contains the code to simulate a database of phylogenetic trees that will be used in a machine learning project in which a neural network will be trained to solve phylodynamics problems.

Usage

Getting started generating a database

To generate the database, run the main script:

python main.py <path/to/config.json>

If you want a small example to test this out, try using the config/debugging.json file. The whole simulation is configured by the JSON file provided at the command line.
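If you prefer to launch runs from Python (for example, to queue several configurations), the following is a minimal sketch of how the command above could be assembled. The helper name simulation_command is hypothetical and not part of this repository; it only checks that the config file parses as JSON before building the same command line shown above.

```python
import json
import sys
from pathlib import Path


def simulation_command(config_path, script="main.py"):
    """Build the command for one run after checking that the config parses as JSON.

    Hypothetical helper: it does not validate the config against the schema,
    it only fails early on a missing file or malformed JSON.
    """
    config_path = Path(config_path)
    with config_path.open() as handle:
        json.load(handle)  # raises json.JSONDecodeError on malformed JSON
    # Equivalent to: python main.py <path/to/config.json>
    return [sys.executable, script, str(config_path)]
```

The returned list can be passed to subprocess.run(cmd, check=True) from the repository root, for example with config/debugging.json as the config path.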

Configuring a simulation

The way in which a dataset is simulated is configured with a JSON file. There is a schema for valid configurations described here. There are some example configurations provided:

Additional information about these datasets is given below.

Visualising the data

Two scripts, visualisation.py and visualisation_temporal.py, can be used to visualise the output of a simulation. Each script must first be edited so that it points to the location of the config; it is then run without arguments.

python visualisation.py
python visualisation_temporal.py

Note that the latter applies only to simulations that are configured to report temporal data (that is, report_temporal_data is set to true in the config).
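Before running visualisation_temporal.py, it can be useful to confirm that a config actually enables temporal reporting. The sketch below is one way to do that. Because this README does not say where report_temporal_data sits within the JSON, the code searches the whole structure for that key; the function names are hypothetical.

```python
import json
from pathlib import Path


def find_key(node, key):
    """Return the first value stored under `key` in nested dicts/lists, else None."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def reports_temporal_data(config_path):
    """True only if the config explicitly sets report_temporal_data to true."""
    config = json.loads(Path(config_path).read_text())
    return find_key(config, "report_temporal_data") is True
```

A config for which reports_temporal_data returns False should only be used with visualisation.py.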

Update this section of the documentation!!!

Using the database

The following demonstrates how to use the database in Python. Don’t forget to close the database connection after using it!

import pickle

import h5py

# Open the database read-only.
db_conn = h5py.File("dataset.hdf5", "r")
for k in db_conn.keys():
    out_grp = db_conn[k]['output']
    # R0 values and the times at which they change.
    r0_vals = out_grp['parameters']['r0']['values'][()]
    r0_chng = out_grp['parameters']['r0']['change_times'][()]
    prev = out_grp["present_prevalence"][()]
    print(f"Record {k} prevalence {prev}")
    # The tree is stored as the raw bytes of a pickled object.
    tree = pickle.loads(db_conn[k]['input']['tree'][...].tobytes())
db_conn.close()

If you want a GUI to inspect the output HDF5 file, the HDFCompass tool provides a simple way to inspect the data that has been generated. There is some basic information about the simulation stored as attributes in the HDF5 file. This includes the date of creation and the size of the dataset.

Conda environment

A conda environment to run this simulation can be created from the environment.yaml file by running the following command:

conda env create -f environment.yaml

This environment will have all the correct packages for running the simulations.

Installing BEAST2

BEAST2 is used to simulate the data. If you don’t have BEAST2 installed, the script scr/setuplib.sh will download and install it for you. Once BEAST2 is installed, you will need to install the remaster package through BEAUti.

Datasets

There is a sequence of configurations: Charmander, Charmeleon and Charizard. These all use the same model but are of increasing size and use broader distributions over the simulation parameters.

Charmander

This is intended as a toy dataset. It has an 800-100-100 training-validation-testing split. The parameters are nearly constant through time; for example, the $R_0$ values are shown in the figure below. The configuration for this simulation is simulation-charmander.json.

./out/sim-charmander/plots/r0_trajectories.png

Charmander contemporaneous

This is very similar to the Charmander configuration but instead of serial sampling, there is a single contemporaneous sample at the present.

Charmeleon

This is intended as a small dataset. It has a 1600-200-200 training-validation-testing split. The parameters vary significantly through time; for example, the $R_0$ values are shown in the figure below. The configuration for this simulation is simulation-charmeleon.json.

./out/sim-charmeleon/plots/r0_trajectories.png

Charizard

This is intended as a plausible dataset for use in training a useful neural network. It has an 8000-1000-1000 training-validation-testing split (although 11000 simulations are attempted to allow for failures). The parameters vary significantly through time; for example, the $R_0$ values are shown in the figure below. The configuration for this simulation is simulation-charizard.json.

./out/sim-charizard/plots/r0_trajectories.png
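To make the Charizard numbers concrete, the following sketch works out how many failed simulations the stated figures leave room for. It assumes that 11000 is the total number of attempts and that the split is filled from the successful ones; how the repository actually handles failures is not described here, and the function name is hypothetical.

```python
def failure_budget(train, validation, test, attempted):
    """Return (required, spare, tolerated_failure_rate) for a simulation run."""
    required = train + validation + test
    if attempted < required:
        raise ValueError("attempted simulations cannot cover the requested split")
    spare = attempted - required
    return required, spare, spare / attempted


required, spare, rate = failure_budget(8000, 1000, 1000, attempted=11000)
print(f"{required} required, {spare} spare, tolerated failure rate {rate:.1%}")
# prints: 10000 required, 1000 spare, tolerated failure rate 9.1%
```

Under these assumptions, the split can still be filled as long as no more than 1000 of the 11000 attempts fail.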
