This repository contains the code to simulate a database of phylogenetic trees for a machine learning project in which a neural network is trained to solve phylodynamics problems.
To generate the database, run the main script:

```sh
python main.py <path/to/config.json>
```

The whole simulation is configured by the JSON file provided at the command line; for a small test example, try the `config/debugging.json` file. There is a schema for valid configurations described here. Some example configurations are provided:
- debugging
- Charmander (see here)
- Charmander with a contemporaneous sample
- Charmeleon (see here)
- Charizard (see here)
Additional information about these datasets is given below.
Two scripts, `visualisation.py` and `visualisation_temporal.py`, can be used to visualise the output of a simulation. They need to be modified so that Python knows the location of the config, and are then run without arguments:

```sh
python visualisation.py
python visualisation_temporal.py
```
Note that the latter only applies to simulations which are configured to report temporal data (that is, `report_temporal_data` is set to `true` in the config).
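For reference, a minimal configuration sketch is shown below. Only the `report_temporal_data` key is taken from this document; the other key is a hypothetical placeholder, so consult the schema and the example configurations for the real set of fields:

```json
{
    "report_temporal_data": true,
    "num_simulations": 10
}
```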
The following demonstrates how to use the database in Python. Don’t forget to close the database connection after using it!
```python
import pickle

import h5py

# Open the simulated dataset; close the connection when done.
db_conn = h5py.File("dataset.hdf5", "r")
for k in db_conn.keys():
    out_grp = db_conn[k]["output"]
    # Reproduction number through time: the values and the times at which they change.
    r0_vals = out_grp["parameters"]["r0"]["values"][()]
    r0_chng = out_grp["parameters"]["r0"]["change_times"][()]
    prev = out_grp["present_prevalence"][()]
    print(f"Record {k} prevalence {prev}")
    # The simulated tree is stored as a pickled object.
    tree = pickle.loads(db_conn[k]["input"]["tree"][...].tobytes())
db_conn.close()
```
If you want a GUI, the HDFCompass tool provides a simple way to inspect the generated HDF5 file. Some basic information about the simulation is stored as attributes in the HDF5 file, including the date of creation and the size of the dataset.
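These attributes can also be read directly with h5py. The helper below is a sketch: the exact attribute names depend on how the simulation wrote them, so it simply iterates over whatever is stored at the top level of the file:

```python
import h5py


def print_dataset_metadata(path):
    """Print the top-level attributes of a simulation HDF5 file.

    The attribute names (e.g. creation date, dataset size) are set by the
    simulation, so we iterate over whatever is present rather than assume
    specific keys.
    """
    with h5py.File(path, "r") as db:
        attrs = dict(db.attrs)
        for name, value in attrs.items():
            print(f"{name}: {value}")
        return attrs
```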
A conda environment with all the packages needed to run the simulations can be created from the `environment.yaml` file by running the following command:

```sh
conda env create -f environment.yaml
```
BEAST2 is used to simulate the data. If you don't have BEAST2 installed, the script `scr/setuplib.sh` will download and install it for you. Once BEAST2 is installed, you will need to install remaster through BEAUti.
There is a sequence of configurations: Charmander, Charmeleon and Charizard. These all use the same model but are of increasing size and use progressively broader distributions over the simulation parameters.
This is intended as a toy dataset. It has an 800-100-100 training-validation-testing split. The parameters are nearly constant through time.
This is very similar to the Charmander configuration, but instead of serial sampling there is a single contemporaneous sample at the present.
This is intended as a small dataset. It has a 1600-200-200 training-validation-testing split. The parameters vary significantly through time.
This is intended as a plausible dataset for use in training a useful neural network. It has an 8000-1000-1000 training-validation-testing split (although 11000 simulations are attempted to adjust for failures). The parameters vary significantly through time.