Any details about your local setup that might be helpful in troubleshooting.
+
Detailed steps to reproduce the bug.
+
+
+
+
Fix Bugs
+
Look through the GitHub issues for bugs. Anything tagged with bug and help wanted is open to whoever wants to implement it.
+
+
+
Implement Features
+
Look through the GitHub issues for features. Anything tagged with enhancement and help wanted is open to whoever wants to implement it.
+
+
+
Write Documentation
+
spaceprime could always use more documentation, whether as part of the official spaceprime docs, in docstrings, or even on the web in blog posts, articles, and such.
Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:
+
$ mkvirtualenv spaceprime
+$ cd spaceprime/
+$ python setup.py develop
+
Create a branch for local development:
+
$ git checkout -b name-of-your-bugfix-or-feature
+
Now you can make your changes locally.
+
When you’re done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox:
+
$ flake8 spaceprime tests
+$ python setup.py test or pytest
+$ tox
+
To get flake8 and tox, just pip install them into your virtualenv.
+
Commit your changes and push your branch to GitHub:
+
$ git add .
+$ git commit -m "Your detailed description of your changes."
+$ git push origin name-of-your-bugfix-or-feature
+
Submit a pull request through the GitHub website.
+
+
+
+
Pull Request Guidelines
+
Before you submit a pull request, check that it meets these guidelines:
+
+
The pull request should include tests.
+
If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in README.rst.
spaceprime is a Python package that facilitates the creation and analysis of spatially gridded coalescent models in the msprime library. The package is designed to make it easier for practitioners to convert spatial maps of habitat suitability into extensible two-dimensional stepping-stone models of gene flow, where each pixel of the map represents a deme and demes are able to migrate with their neighbors. Demes and migration rates are able to change over time according to habitat suitability model projections to different time periods. These demographic parameters are then used to simulate genetic data under a coalescent model with msprime as the simulator, which can be used to infer the demographic history of the population. The package is designed to be user-friendly and intuitive, allowing users to easily simulate and analyze spatially explicit genetic data.
+
+
+
+
+
+
+Note for R users
+
+
+
+
spaceprime is implemented in Python, yet many interested users may come from an R background. I have a spaceprime for R users vignette that provides a brief introduction to the Python concepts necessary to use spaceprime through a practical walk-through of an example analysis. If you still want to use R, it is possible to use Python code in R using the reticulate package. For more information on how to use Python code in R, see the reticulate documentation.
+
+
+
+
+
Main features
+
spaceprime includes a number of features:
+
+
Convert habitat suitability values into demographic parameters, including deme sizes, migration rates, and their change through time using very little code. Code complexity does not increase with model complexity, allowing users to focus on the biological questions they are interested in.
+
+
Simulate spatially explicit genetic data under a coalescent model with msprime. The modeling approach is fully coalescent with no forward-time component, allowing for computationally efficient simulations of large spatially explicit models.
+
+
Visualize demographic models to facilitate model interpretation and model checking.
+
Compute genetic summary statistics for simulated and empirical data to facilitate comparison with empirical data.
+
+
Extensibility: spaceprime is designed to be interoperable with msprime, where users can setup a model with spaceprime, then customize it using the full range of msprime functionality.
+
+
+
+
Installation
+
+
Stable release
+
spaceprime can be installed using pip or conda. Due to the package’s reliance on msprime, the pip installation is only available on unix-based systems (MacOS, Linux). Windows users should install spaceprime using conda.
+
To install spaceprime using pip, run the following command in your terminal:
+
pip install spaceprime
+
To install spaceprime using conda, run the following command in your terminal:
+
conda install -c conda-forge spaceprime
+
The mamba package manager can also be used to install spaceprime using conda. mamba tends to be much faster than conda. To install spaceprime using mamba, install mamba, then run the following command in your terminal:
+
mamba install -c conda-forge spaceprime
+
+
+
From source
+
To install spaceprime from source, run this command in your terminal:
To install the analysis module, which is imported separately from the main package to reduce the number of dependencies:
+
pip install spaceprime[analysis]
+
or
+
conda install -c conda-forge spaceprime[analysis]
+
+
+
+
Usage
+
To use spaceprime in a project:
+
+
import spaceprime
+
+
To use the analysis module, which is imported separately from the main package to reduce the number of dependencies:
+
+
from spaceprime import analysis
+
+
+
+
+
+
+
+Important
+
+
+
+
Make sure to install the relevant analysis dependencies with pip install spaceprime[analysis] or conda install -c conda-forge spaceprime[analysis].
+
+
+
+
+
Quickstart
+
+
+
+
+
+
+Tip
+
+
+
+
This quickstart guide assumes you have a basic understanding of Python. If you are an R user, please refer to the spaceprime for R users vignette for an overview of spaceprime with the necessary Python concepts explained.
+
+
+
+
1. Download data
+
The data we’re using in this example are a GeoTiff raster file of habitat suitability values and a GeoJSON file containing geographic localities and metadata for this cute frog, Phyllomedusa distincta:
+
+
+
+
Follow the link to download the projections.tif file. You do not need to download the localities.geojson file, as it is read in from the web in the code below.
+
The raster file contains 23 layers, where each layer is a projection of the habitat suitability model (aka species distribution model or ecological niche model) to a time slice in the past, ranging from the present day to 22,000 years ago in 1,000 year intervals. The habitat suitability values range from zero to one, where zero represents no suitability for the species and one represents perfect suitability. In the following plots, yellow represents higher suitability and purple represents lower suitability. Here are a few time slices of the model:
+
+
+
+
+
+
+
+
The GeoJSON file contains geographic sampling localities of P. distincta in the Brazilian Atlantic Forest, along with metadata about each locality. Each row is a single individual/observation. spaceprime counts the number of observations with coordinates that overlap with a raster cell/deme and samples the calculated number for simulations and summary statistics. Here are the localities plotted on top of the present-day habitat suitability model:
+
+
+
+
+
+
+
+
+
+
2. Read in packages and data
+
Now that we have our data, let’s read in the packages and data we’ll be using. The GeoPandas and Rasterio packages are dependencies of spaceprime, so you shouldn’t need to install them separately. They are needed for reading in locality data and habitat suitability rasters, respectively.
+
+
import spaceprime as sp
+import geopandas as gpd
+import rasterio
+
+
Make sure to replace the projections.tif file path with the path to the file on your system. The GeoJSON file is read in from the web, so you don’t need to download it.
+
+
r = rasterio.open("projections.tif")
+locs = gpd.read_file("https://raw.githubusercontent.com/connor-french/spaceprime/main/spaceprime/data/localities.geojson")
+
+
+
+
3. Set up the demographic model
+
Next, we’ll convert the habitat suitability values into deme sizes, so each cell in the raster will represent a deme in our model. We’ll use a linear transformation to convert the suitability values to deme sizes, where the suitability value is multiplied by a constant to get the deme size. The constant is the maximum local deme size, which we set to 1000. For more on transformations, see the suitability to deme size transformation functions vignette.
+
+
d = sp.raster_to_demes(r, transformation="linear", max_local_size=1000)
+
+
Now that we have our deme sizes, we can set up the demographic model. The model that spaceprime uses is a two-dimensional stepping-stone model with a global migration rate of 0.001 between neighboring demes. The global rate by default is scaled, where demes exchange the same number of migrants with their neighbors, regardless of deme size. To change this behavior, set scale=false. We’re assuming that P. distincta has a generation time of one year. Using a single value for the timesteps argument tells spaceprime that 1000 generations passes in between each raster time step in the model.
+
This step may take a few seconds (10-15 seconds on my machine) to run.
+
+
# initialize the model
+demo = sp.spDemography()
+
+# populate the spDemography object with the deme sizes and migration rates
+demo.stepping_stone_2d(d, rate=0.001, timesteps=1000)
+
+
After initializing the spatial component of the simulation, it’s desirable to add one or more ancestral populations to the model. This is done by providing a list of ancestral population sizes and the time (in generations) at which the spatially distributed demes migrate into the ancestral population(s). The following code adds a single ancestral population of 100,000 individuals that demes merge into 23,000 generations in the past:
+
+
# add ancestral population
+demo.add_ancestral_populations([100000], 23000)
+
+
+
+
4. Inspect your model
+
Now that we have our demographic model set up, we can inspect it to make sure it looks as expected. spaceprime has a series of plot_() functions that make this easier.
+
+
plot_landscape()
+
plot_landscape() plots the deme sizes in space, which allows you to quickly inspect whether the transformation you applied to your habitat suitability map make sense. Here, we provide the demographic model object, the raster object, the index of the time slice to plot (0 for the present day in this case), and basemap=True to add an OpenStreetMap basemap, providing geographic context to the plot. If you don’t have an internet connection, set basemap=False (the default) to plot without the basemap.
+
+
sp.plot_landscape(demo, r, 0, basemap=True)
+
+
+
+
+
+
+
+
+
plot_model()
+
plot_model() plots the deme sizes on a folium interactive map, with precise deme sizes and outgoing migration rates for each deme present in a popup.
+
+
sp.plot_model(demo, r, 0)
+
+
Make this Notebook Trusted to load map: File -> Trust Notebook
+
+
+
+
+
+
5. Simulate genetic data
+
Before simulating this demography, we need to create a sample dictionary that translates the empirical sampling localities to the model’s deme indices and maps those to the number of samples to take from each deme. By default, this function sets the number of individuals to sample from each deme to the number of empirical localities in that deme. The coords_to_sample_dict() function also returns two other dictionaries that are not used in this example, so we’ll ignore them.
Now we get to simulate! The first task is to simulate the ancestry of the samples using the coalescent. All of the hard work is done through msprime’s sim_ancestry() function, for which spaceprime provides a convenience wrapper. This function returns a tskit TreeSequence, which “represents a sequence of correlated evolutionary trees along a genome” and is an incredibly powerful and compact data representation for population genomic analyses. The minimum number of arguments required for this function are the sample dictionary and the demographic model. If you need to overlay mutations, you need to supply the sequence length. Notice the lack of mutations in the table. We’ll set record_provenance to False to decrease the memory overhead of storing a bunch of metadata about the simulation.
We’ll take a peak at a single tree from the TreeSequence object to see what it looks like. The draw_svg() method plots trees from the TreeSequence object. Here, I selected a single tree and removed the node labels because there are tons of nodes that crowd the plot and we’re only interested in the tree structure.
Overlaying mutations after simulating ancestry isn’t necessary for calculating genetic summary statistics on a TreeSequence, but it is necessary if you would like to compare your simulations with empirical data that are represented as a table of genotypes rather than a TreeSequence. The sim_mutations() function overlays mutations on the TreeSequence object returned by sim_ancestry() and requires the mutation rate. The mutation rate is the number of mutations per base pair per generation. For this example, we’ll use a mutation rate of 1e-10 so we don’t overcrowd the tree sequence visualization. You can see from the table that the tree sequence has some mutations!
Use the analysis module to calculate genetic summary statistics on the TreeSequence object. For more on the analysis module, see the analysis module documentation.
+
+
Save the TreeSequence to use later or analyze on a platform like tskit with sim.dump(file/path/to/write/to.trees).
+
+
Convert the TreeSequence with mutations to a genotype matrix for use in a program like scikit-allel with sim.genotype_matrix(). For more information on this function, see the tskit documentation.
+
+
Export the TreeSequence with mutations to a VCF file using sim.write_vcf. For more information on how to use this function, see the tskit documentation.
+
+
+
+
+
+
+
+TODO
+
+
+
+
add a link to the analysis module documentation when it’s ready.
Calculates a migration matrix based on deme sizes and a global migration rate. The migration rate can be scaled based on population size or set as a constant.
+
+
Parameters
+
+
+
+
+
+
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
demes
+
np.ndarray
+
The 2D numpy array representing the deme sizes.
+
required
+
+
+
rate
+
float
+
The migration rate.
+
required
+
+
+
scale
+
bool
+
Whether to scale the migration rate based on population size. Default is True.
Calculates a suite of genetic summary statistics on allele counts matrices generated by filter_gt or otherwise generated through scikit-allel. The required input is an allele counts matrix for all individuals/demes, and a dictionary mapping deme IDs to their coordinates, generated by coords_to_deme_dict. Optional inputs are dictionaries mapping ancestral population IDs to their constituent demes (anc_demes_dict), an dictionary of allele counts matrices for each deme (ac_demes), and dictionary of allele counts matrices for each ancestral population (ac_anc).
+
+
Parameters
+
+
+
+
+
+
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
ac
+
np.ndarray
+
An allele counts matrix for all individuals/demes.
+
required
+
+
+
coords_dict
+
dict
+
A dictionary mapping deme IDs to their coordinates, generated by [coords_to_deme_dict][utilities.coords_to_deme_dict].
+
required
+
+
+
anc_demes_dict
+
dict
+
A dictionary mapping ancestral population IDs to their constituent demes. Defaults to None.
+
None
+
+
+
ac_demes
+
dict
+
A dictionary mapping deme IDs to their allele counts matrices. Necessary if you want to calculate Fst or Dxy between demes. Defaults to None.
+
None
+
+
+
ac_anc
+
dict
+
A dictionary mapping ancestral population IDs to their allel counts matrices. If provided, summary statistics are calculated within ancestral populations and not among them. Defaults to None.
+
None
+
+
+
between_anc_pop_sumstats
+
bool
+
Whether to calculate Fst or Dxy between ancestral populations. Defaults to False.
+
False
+
+
+
return_df
+
bool
+
Whether to return the summary statistics as a pandas DataFrame. Defaults to False.
+
False
+
+
+
precision
+
int
+
The number of decimal places to round the summary statistics to. Defaults to 6.
+
6
+
+
+
+
+
+
Returns
+
+
+
+
Type
+
Description
+
+
+
+
+
dict
+
A dictionary of summary statistics.
+
+
+
+
+
+
Notes
+
This function calculates the following summary statistics, either species-wide or per ancestral population, if provided: - Site Frequency Spectrum Hill numbers (q1 and q2), corrected for the number of sites - Pi (nucleotide diversity) - Tajima’s D - Pairwise Dxy - If between_anc_pop_sumstats is True, also calculates pairwise Dxy and Hudson’s FST between ancestral populations - Pairwise Hudson’s FST - If between_anc_pop_sumstats is True, also calculates pairwise Dxy and Hudson’s FST between ancestral populations - Isolation-by-distance slope and R2 - Moran’s I
Finds the cells a given set of coordinates belong to in a raster and returns a dictionary mapping the cell indices to the centroid coordinates of those cells. Because the cells correspond with demes in the 2D stepping stone models, the cell indices are considered deme indices. The coordinates typically correspond to empirical data that the simulations need to be sampled from.
+
+
Parameters
+
+
+
+
+
+
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
raster
+
rasterio.DatasetReader
+
The raster data as a rasterio DatasetReader object.
Convert sample coordinates to sample dictionaries for simulation and analysis. Can optionally include empirical data, which is accepted as a path to a VCF file.
+
This function takes a raster, a list of coordinates, and optional individual IDs and VCF path. It masks the raster with the given coordinates, retrieves the cell IDs for each individual’s locality, and returns two dictionaries: a sample dictionary containing the number of individuals to sample from the simulation, and a sample dictionary containing the range of individual indices for each cell ID. The first dictionary is used to sample individuals from the simulation, and the second dictionary is used to calculate genetic summary statistics from the sampled individuals.
+
+
Parameters
+
+
+
+
+
+
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
raster
+
Union[np.ndarray, rasterio.DatasetReader]
+
The raster data as a numpy array or rasterio DatasetReader object.
A tuple containing two or three dictionaries. The first dictionary contains the number of individuals to sample from the simulation for each cell ID. The second dictionary contains the indices of individuals for each cell ID. The third, optional dictionary contains the indices of individuals in the VCF file for each cell ID.
Creates a raster dataset from a numpy array and reference raster and writes it to a new GeoTiff file. The new raster dataset will have the same dimensions, crs, and transform as the reference raster.
+
+
Parameters
+
+
+
+
+
+
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
data
+
np.ndarray
+
The numpy array containing the data you want for the raster.
+
required
+
+
+
reference_raster
+
rasterio.DatasetReader
+
The reference rasterio DatasetReader object.
+
required
+
+
+
out_folder
+
str
+
The output folder location where the new raster dataset will be saved.
+
required
+
+
+
out_prefix
+
str
+
The prefix for the output file name.
+
required
+
+
+
+
+
+
Returns
+
+
+
+
+
+
+
+
Type
+
Description
+
+
+
+
+
None
+
The function writes the new raster dataset to the output file location
Filter genotype matrices output by ts.genotype_matrix() to filter out monomorphic sites, loci in linkage disequilibrium, and recreate missing data patterns common to empirical genotype data. Returns the genotype matrix and allele counts matrix for the filtered loci, and optionally allele counts matrices for demes and ancestral populations.
+
+
Parameters
+
+
+
+
+
+
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
gt
+
np.ndarray
+
The genotype matrix.
+
required
+
+
+
deme_dict_inds
+
dict
+
A dictionary containing the indices of individuals in each deme. Defaults to None.
+
None
+
+
+
deme_dict_anc
+
dict
+
A dictionary containing the indices of individuals in each ancestral population. Defaults to None.
+
None
+
+
+
missing_data_perc
+
float
+
The percentage of missing data allowed. Defaults to 0.
+
0
+
+
+
r2_thresh
+
float
+
The threshold for linkage disequilibrium. Defaults to 0.1.
+
0.1
+
+
+
filter_monomorphic
+
bool
+
Whether to filter out monomorphic sites, keeping only segregating sites. Defaults to True.
+
True
+
+
+
filter_singletons
+
bool
+
Whether to filter out singletons. Defaults to True.
A tuple containing the filtered genotype matrix, the allele counts matrix, a dictionary of allele counts matrices for demes (if deme_dict_inds is provided), and a dictionary of allele counts matrices for ancestral populations (if deme_dict_anc is provided).
+
+
+
+
+
+
Notes
+
This function uses a random mask to simulate missing data in the genotype matrix. For reproducibility it’s advised to set a np.random.seed() before calling this function.
Calculates a migration matrix based on deme sizes and a global migration rate. The migration rate can be scaled based on population size or set as a constant.
Finds the cells a given set of coordinates belong to in a raster and returns a dictionary mapping the cell indices to the centroid coordinates of those cells.
Convert sample coordinates to sample dictionaries for simulation and analysis. Can optionally include empirical data, which is accepted as a path to a VCF file.
This function takes the coordinates of empirical sampling localities, finds which raster cells they belong to, extracts the values of the first layer for those localities, and finds the minimum value.
Converts a raster to a 2D np.ndarray of deme sizes using either linear, threshold, or sigmoid transformation functions. For more detail about transformation functions, see this brief overview.
Filter genotype matrices output by ts.genotype_matrix() to filter out monomorphic sites, loci in linkage disequilibrium, and recreate missing data patterns common to empirical genotype data.
This function takes the coordinates of empirical sampling localities, finds which raster cells they belong to, extracts the values of the first layer for those localities, and finds the minimum value. This value is the maximum threshold value to determine a presence vs absence in a threshold transformation. If the threshold is set any higher, empirical sampling localities will not be sampled in the simulations.
+
+
Parameters
+
+
+
+
+
+
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
raster
+
rasterio.DatasetReader
+
The rasterio DatasetReader object representing the raster data containing the suitability values.
Plots a static map of a transformed landscape at the timestep of your choice.
+
+
Parameters
+
+
+
+
+
+
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
demo
+
spaceprime.spDemography
+
The demographic model to plot.
+
required
+
+
+
raster
+
rasterio.DatasetReader
+
The raster dataset used to create the demes matrix(es).
+
required
+
+
+
timestep
+
int
+
The timestep to plot.
+
required
+
+
+
cmap
+
str
+
The colormap to use. Defaults to “viridis”.
+
'viridis'
+
+
+
legend
+
bool
+
Whether to show the colorbar legend. Defaults to True.
+
True
+
+
+
basemap
+
bool
+
Whether to add a basemap. Requires an internet connection. Defaults to False.
+
False
+
+
+
+
+
+
Returns
+
+
+
+
Type
+
Description
+
+
+
+
+
matplotlib.axes.Axes
+
A plot of the transformed landscape.
+
+
+
+
+
+
Note
+
Setting basemap=True requires an internet connection to download the basemap tiles. It may take some time to load the tiles depending on your internet speed. Since this function returns a matplotlib axes object, you can further modify the plot with the matplotlib library.
Converts a raster to a 2D np.ndarray of deme sizes using either linear, threshold, or sigmoid transformation functions. For more detail about transformation functions, see this brief overview. Raster data should be continuous and positive. This function was created with the idea of taking in habitat suitability rasters scaled from 0 to 1, where 0 is no suitability and 1 is the highest suitability. However, it is flexible enough to accommodate other continuous rasters that can be coaxed to a 0 to 1 scale with the operation (data - np.min(data)) / (np.max(data) - np.min(data)) by setting the normalize flag to True.
+
+
Parameters
+
+
+
+
+
+
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
raster
+
Union[np.ndarray, rasterio.DatasetReader]
+
The input raster data. It can be a numpy array or a rasterio DatasetReader with one or more layers.
+
required
+
+
+
transformation
+
str
+
The transformation function to be used. Options are “linear”, “threshold”, and “sigmoid”. Default is “linear”.
+
'linear'
+
+
+
max_local_size
+
int
+
The maximum local deme size. Default is 1000.
+
1000
+
+
+
normalize
+
bool
+
Whether to normalize the raster data. Use if your data is not scaled from 0-1. Default is False.
+
False
+
+
+
threshold
+
float
+
The threshold value for the “threshold” transformation method. Default is None.
+
None
+
+
+
thresh_norm
+
bool
+
Whether to normalize the local deme size based on the average suitability above the threshold. This is useful when comparing thresholded simulations with linear or sigmoid simulations, to maintain similar landscape-wide population sizes across max_local_size values. Default is False.
+
False
+
+
+
inflection_point
+
float
+
The inflection point for the “sigmoid” transformation method. Default is 0.5.
+
0.5
+
+
+
slope
+
float
+
The slope value for the “sigmoid” transformation method. Default is 0.05.
The TreeSequence object representing the results of the simulation if no replication is performed, or an iterator over the independent replicates simulated if the num_replicates parameter has been used.
+
+
+
+
+
+
Notes
+
This function takes the same arguments as msprime.sim_ancestry and calls it directly, allowing users to use simulation functionality within the spaceprime namespace.
+
See the msprime.sim_ancestry documentation for more information: See the msprime documentation for more information: https://tskit.dev/msprime/docs/stable/api.html#msprime.sim_ancestry
+
+
+
Returns
+
+
+
+
+
+
+
+
Type
+
Description
+
+
+
+
+
Union[TreeSequence, Iterator[TreeSequence]]
+
The TreeSequence object representing the results of the simulation if no replication is performed, or an iterator over the independent replicates simulated if the num_replicates parameter has been used.
The result of msprime.sim_mutations with the provided arguments.
+
+
+
+
+
+
+
Notes
+
This function takes the same arguments as msprime.sim_ancestry and calls it directly, allowing users to use simulation functionality within the spaceprime namespace.
+
See the msprime.sim_mutations documentation for more information: https://tskit.dev/msprime/docs/stable/api.html#msprime.sim_mutations
Adds ancestral populations to the given demographic model, mapping demes in the spatial simulation to ancestral populations.
+
+
Parameters
+
+
+
+
+
+
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
anc_sizes
+
List[float]
+
A list of ancestral population sizes.
+
required
+
+
+
merge_time
+
Union[float, int]
+
The time at which all demes in the spatial simulation merge into one or more ancestral populations.
+
required
+
+
+
anc_id
+
Optional[np.ndarray]
+
An array of ancestral population IDs- the output of [split_landscape_by_pop][utilities.split_landscape_by_pop]. Defaults to None.
+
None
+
+
+
anc_merge_times
+
Optional[List[float]]
+
A list of merge times for ancestral populations. Defaults to None.
+
None
+
+
+
anc_merge_sizes
+
Optional[List[float]]
+
A list of sizes for merged ancestral populations. Defaults to None.
+
None
+
+
+
migration_rate
+
Optional[float]
+
The symmetric migration rate between ancestral populations. Defaults to None.
+
None
+
+
+
+
+
+
Returns
+
+
+
+
+
+
+
+
Type
+
Description
+
+
+
+
+
msprime.Demography
+
The demographic model with the added ancestral populations.
+
+
+
+
+
+
Raises
+
+
+
+
+
+
+
+
Type
+
Description
+
+
+
+
+
ValueError
+
If the model already contains ancestral populations.
+
+
+
ValueError
+
If the number of demes in the demographic model does not match the number of demes in the admixture ID raster.
+
+
+
+
+
+
Notes
+
The function adds ancestral populations to the given demographic model. If anc_id is not provided, a single ancestral population is added with the initial size specified in anc_sizes[0]. If anc_id is provided, a new ancestral population is added for each admixture population, with sizes specified in anc_sizes. The demes in the simulation are then merged into their respective ancestral populations based on the values in anc_id. If anc_merge_times is provided, the ancestral populations are merged at the specified times. If migration_rate is provided, symmetric migration is allowed between ancestral populations.
Adds ancestral populations to the given demographic model, mapping demes in the spatial simulation to ancestral populations.
+
+
Parameters
+
+
+
+
+
+
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
anc_sizes
+
List[float]
+
A list of ancestral population sizes.
+
required
+
+
+
merge_time
+
Union[float, int]
+
The time at which all demes in the spatial simulation merge into one or more ancestral populations.
+
required
+
+
+
anc_id
+
Optional[np.ndarray]
+
An array of ancestral population IDs- the output of [split_landscape_by_pop][utilities.split_landscape_by_pop]. Defaults to None.
+
None
+
+
+
anc_merge_times
+
Optional[List[float]]
+
A list of merge times for ancestral populations. Defaults to None.
+
None
+
+
+
anc_merge_sizes
+
Optional[List[float]]
+
A list of sizes for merged ancestral populations. Defaults to None.
+
None
+
+
+
migration_rate
+
Optional[float]
+
The symmetric migration rate between ancestral populations. Defaults to None.
+
None
+
+
+
+
+
+
Returns
+
+
+
+
+
+
+
+
Type
+
Description
+
+
+
+
+
msprime.Demography
+
The demographic model with the added ancestral populations.
+
+
+
+
+
+
Raises
+
+
+
+
+
+
+
+
Type
+
Description
+
+
+
+
+
ValueError
+
If the model already contains ancestral populations.
+
+
+
ValueError
+
If the number of demes in the demographic model does not match the number of demes in the admixture ID raster.
+
+
+
+
+
+
Notes
+
The function adds ancestral populations to the given demographic model. If anc_id is not provided, a single ancestral population is added with the initial size specified in anc_sizes[0]. If anc_id is provided, a new ancestral population is added for each admixture population, with sizes specified in anc_sizes. The demes in the simulation are then merged into their respective ancestral populations based on the values in anc_id. If anc_merge_times is provided, the ancestral populations are merged at the specified times. If migration_rate is provided, symmetric migration is allowed between ancestral populations.
Create a 2D stepping stone model, either for a single time step or for multiple time steps of deme size change.
+
+
Parameters
+
+
+
+
+
+
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
d
+
numpy.ndarray
+
The demography matrix representing the population sizes.
+
required
+
+
+
rate
+
float or numpy.ndarray
+
The migration rate(s) between populations. If a float, it represents a constant migration rate for all populations. If a numpy.ndarray, it represents a migration matrix with shape (T, N, N), where N is the total number of populations and T is the number of time steps - 1, if T > 1.
+
required
+
+
+
scale
+
bool
+
Whether to scale the migration rate matrix. Default is True.
+
True
+
+
+
timesteps
+
Union[int, List[int]]
+
The list of timesteps representing the amount of time passing between each demographic event, in generations. If a single integer is provided, the function assumes that the time steps are equal.
+
None
+
+
+
+
+
+
Returns
+
+
+
+
+
+
+
+
Type
+
Description
+
+
+
+
+
msprime.Demography
+
The constructed 2d stepping stone model as an msprime.Demography object.
+
+
+
+
+
+
Notes
+
The demography matrix d should have shape (n, m) or (k, n, m), where n is the number of rows and m is the number of columns for a 2D array and k is the number of layers in a 3D array. The migration rate matrix rate should have shape (N, N), where N is the total number of populations. If there are multiple time steps of population size change, the add_landscape_change function is called to modify the model accordingly.
Create a 2D stepping stone model, either for a single time step or for multiple time steps of deme size change.
+
+
Parameters
+
+
+
+
+
+
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
d
+
numpy.ndarray
+
The demography matrix representing the population sizes.
+
required
+
+
+
rate
+
float or numpy.ndarray
+
The migration rate(s) between populations. If a float, it represents a constant migration rate for all populations. If a numpy.ndarray, it represents a migration matrix with shape (T, N, N), where N is the total number of populations and T is the number of time steps - 1, if T > 1.
+
required
+
+
+
scale
+
bool
+
Whether to scale the migration rate matrix. Default is True.
+
True
+
+
+
timesteps
+
Union[int, List[int]]
+
The list of timesteps representing the amount of time passing between each demographic event, in generations. If a single integer is provided, the function assumes that the time steps are equal.
+
None
+
+
+
+
+
+
Returns
+
+
+
+
+
+
+
+
Type
+
Description
+
+
+
+
+
msprime.Demography
+
The constructed 2d stepping stone model as an msprime.Demography object.
+
+
+
+
+
+
Notes
+
The demography matrix d should have shape (n, m) or (k, n, m), where n is the number of rows and m is the number of columns for a 2D array and k is the number of layers in a 3D array. The migration rate matrix rate should have shape (N, N), where N is the total number of populations. If there are multiple time steps of population size change, the add_landscape_change function is called to modify the model accordingly.
Uses nearest-neighbor interpolation to classify a landscape raster based on the ancestral population assigned to sampled individuals. This function takes in a raster and a list of coordinates and ancestral population IDs assigned to each individual in the empirical data set. It then interpolates the population IDs across the landscape and returns the new raster as a masked array.
+
+
Parameters
+
+
+
+
+
+
+
+
+
+
Name
+
Type
+
Description
+
Default
+
+
+
+
+
raster
+
rasterio.DatasetReader
+
The rasterio DatasetReader object representing the landscape raster that you want to divide.
A list of (x, y) coordinates or a geopandas GeoDataFrame representing the coordinates assigned to each individual in the empirical data set.
+
required
+
+
+
anc_pop_id
+
List[Union[int, np.integer]]
+
A list of ancestral population IDs assigned to each empirical individual1.
+
required
+
+
+
band_index
+
int
+
The index of the raster to read in. Default is 1. Note- rasterio begins indexing at 1 for raster bands.
+
1
+
+
+
mask_rast
+
bool
+
Whether to mask the interpolation by the landscape. Default is False.
+
False
+
+
+
+
+
+
Returns
+
+
+
+
+
+
+
+
Type
+
Description
+
+
+
+
+
np.ma.MaskedArray
+
The new population assignment raster as a masked array.
+
+
+
+
+
+
Notes
+
+
+
+
+
+
+
Footnotes
+
+
+
These IDs are assigned to each empirical individual typically based on genetic clustering methods like STRUCTURE or PCA. The IDs are used to assign individuals to ancestral populations in the landscape.↩︎
The primary reason for implementing spaceprime in Python is to facilitate a natural integration with the msprime library. While there are R packages that can interface with msprime, most notably slendr, maintaining the connection between R and Python adds significant maintenance overhead. If someone wants to pay me to maintain it, I’d be happy to create an R API, but until then, Python it is! It’s also good to learn a little Python- it’s a very useful language to know. Fortunately, you don’t have to be an expert in Python to use spaceprime, so I’ll get you up to speed on the basics.
+
+
+
Installation
+
In R, package installation is designed to be easy, which is manageable due to R’s limited scope. Python is an incredibly diverse language, so flexibility in package design and management is valued more than ease of use. The focus then is to create unique development environments for the projects you are working on. This facilitates reproducibility as well as an explicit and clean developing environment.
+
To maintain a clean developing environment, it’s common practice to use a virtual environment for your projects. This is a self-contained Python environment that allows you to install packages without affecting your system Python installation. The most popular package for managing virtual environments is conda. There are multiple conda distributions, but I recommend installing the miniforge distribution, which contains the mamba package manager. Mamba is a faster, more efficient package manager than conda, and it’s what I use to install packages. If you’re already familiar with conda, mamba is a drop-in replacement.
+
Once you have miniforge installed, you can create a new environment with the following command:
+
mamba create -n spaceprime
+
This will create a new environment called spaceprime. To activate the environment, use the following command:
+
mamba activate spaceprime
+
Now, when you install packages, they will be installed in the spaceprime environment. To install the packages you need for this tutorial, use the following command:
+
mamba install spaceprime rasterio geopandas
+
The rasterio and geopandas packages are used for reading in and manipulating spatial data. I’ll discuss them in more detail later.
+
+
+
Loading packages
+
In Python, you load packages using the import statement. For example, to load the spaceprime package, you would use the following command:
+
+
import spaceprime as sp
+
+
The as sp part of the command is an alias, which allows you to refer to the package by a shorter name. You want to do this for longer package names to save typing. Python requires you to use the package name when calling functions from the package, rather than it being optional, like in R. For example, to call a function from the spaceprime package, you would use the following syntax:
+
+
sp.function_name()
+
+
This is analogous to the package::function_name() syntax in R.
+
In Python, you can also import specific modules or functions from a package, rather than importing the entire package. For example, to import just the demography module from the spaceprime package, you would use the following command:
+
+
from spaceprime import demography
+
+
+
+
+
+
+
+Modules
+
+
+
+
In Python, a module is a file that contains code you can use to perform specific tasks. This code can include functions, variables, and classes. Python packages typically contain multiple modules, each of which performs a specific set of tasks. Unlike in R, Python allows you to import specific modules or functions from a package, rather than having to load the entire package.
+
+
+
If you want to just import the raster_to_demes() function from the utilities module, you would use the following command:
+
+
from spaceprime.utilities import raster_to_demes
+
+
+
Import the necessary packages
+
For this tutorial, we’ll be using the spaceprime, rasterio, and geopandas packages.
+
+
import spaceprime as sp
+import rasterio
+import geopandas as gpd
+
+
The rasterio package is used for reading, manipulating, and writing raster data, with its closest R analogue being the terra package. The geopandas package is used for reading, manipulating, and writing vector data, with its closest R analogue being the sf package.
+
+
+
+
Download data
+
The data we’re using in this example are a GeoTiff raster file of habitat suitability values and a GeoJSON file containing geographic localities and metadata for this cute frog, Phyllomedusa distincta:
+
+
+
+
Follow the link to download the projections.tif file. You do not need to download the localities.geojson file, as it is read in from the web in the code below.
+
The raster file contains 23 layers, where each layer is a projection of the habitat suitability model (aka species distribution model or ecological niche model) to a time slice in the past, ranging from the present day to 22,000 years ago in 1,000 year intervals. The habitat suitability values range from zero to one, where zero represents no suitability for the species and one represents perfect suitability. In the following plots, yellow represents higher suitability and purple represents lower suitability. Here are a few time slices of the model:
+
+
+
+
+
+
+
+
The GeoJSON file contains geographic sampling localities of P. distincta in the Brazilian Atlantic Forest, along with metadata about each locality. Each row is a single individual/observation. spaceprime counts the number of observations with coordinates that overlap with a raster cell/deme and samples the calculated number for simulations and summary statistics. Here are the localities plotted on top of the present-day habitat suitability model:
+
+
+
+
+
+
+
+
+
Read in data
+
Make sure to replace the projections.tif file path with the path to the file on your system. The GeoJSON file is read in from the web, so you don’t need to download it. Notice that each function is called with the package name or alias followed by a period, then the function name!
+
+
r = rasterio.open("projections.tif")
+locs = gpd.read_file("https://raw.githubusercontent.com/connor-french/spaceprime/main/spaceprime/data/localities.geojson")
+
+
To check out your raster object, you can print the meta attribute to the console. This will give you a summary of the raster, including the number of bands, the width and height of the raster, the coordinate reference system (CRS), and the bounds of the raster.
To check out your GeoDataFrame object, use the head() method, which will print the first few rows to your console.
+
+
locs.head()
+
+
+
+
+
+
+
+
+
species
+
longitude
+
latitude
+
individual_id
+
anc_pop_id
+
geometry
+
+
+
+
+
0
+
distincta
+
-48.633300
+
-25.883300
+
1145545021
+
1.0
+
POINT (-48.6333 -25.8833)
+
+
+
1
+
distincta
+
-48.716829
+
-27.169879
+
3044567206
+
2.0
+
POINT (-48.71683 -27.16988)
+
+
+
2
+
distincta
+
-47.983204
+
-24.116604
+
2596355434
+
1.0
+
POINT (-47.9832 -24.1166)
+
+
+
3
+
distincta
+
-47.014321
+
-24.369859
+
3772589489
+
1.0
+
POINT (-47.01432 -24.36986)
+
+
+
4
+
distincta
+
-49.581356
+
-28.559573
+
2448001684
+
2.0
+
POINT (-49.58136 -28.55957)
+
+
+
+
+
+
+
+
If you would like to perform further data exploration on your GeoDataFrame object, I highly recommend the plotnine package, which is a Python implementation of the ggplot2 package in R. It uses the grammar of graphics to make nice plots. It’s the most painless way to switch from R plotting to Python plotting.
+
+
+
+
Set up the demographic model
+
Next, we’ll convert the habitat suitability values into deme sizes, so each cell in the raster will represent a deme in our model. We’ll use a linear transformation to convert the suitability values to deme sizes, where the suitability value is multiplied by a constant to get the deme size. The constant is the maximum local deme size, which we set to 1000. For more on transformations, see the suitability to deme size transformation functions vignette.
+
+
d = sp.raster_to_demes(r, transformation="linear", max_local_size=1000)
+
+
Now that we have our deme sizes, we can set up the demographic model. spaceprime uses an object-oriented programming paradigm for creating a demographic model. Although most R users are more familiar with a functional programming paradigm, object-oriented programming does exist in R.
+
In object-oriented programming, you create an instance of a class and then call methods on that instance. The class is like a blueprint for creating objects, and the methods are functions that operate on the object’s data. This is useful for creating complex data structures like demographic models, where you have multiple components that interact with each other.
+
In spaceprime, the spDemography class is used to set up the demographic model. The spDemography class has methods for setting up the spatial component of the model, adding ancestral populations, and inherits all of the methods of the msprime Demography class.
+
+
demo = sp.spDemography()
+
+
Now you can run methods on the demo object to set up the demographic model. The first method we’ll run is the stepping_stone_2d() method, which sets up a two-dimensional stepping-stone model. The migration rate, specified by rate, can be a single global rate or an array of values specifying each neighbor’s migration rate. Here, we’re using a global rate of 0.001. The global rate by default is scaled, where demes exchange the same number of migrants with their neighbors, regardless of deme size. To change this behavior, set scale=false. We’re assuming that P. distincta has a generation time of one year. Using a single value for the timesteps argument tells spaceprime that 1000 generations passes in between each raster time step in the model.
+
This step may take a few seconds (10-15 seconds on my machine) to run.
+
+
# populate the spDemography object with the deme sizes and migration rates
+demo.stepping_stone_2d(d, rate=0.001, timesteps=1000)
+
+
You may have noticed that we didn’t assign the output to a new variable. This is because the stepping_stone_2d() method modifies the demo object in place, rather than returning a new object. This is a common pattern in object-oriented programming, where methods modify the object they’re called on rather than returning a new object.
+
After initializing the spatial component of the simulation, it’s desirable to add one or more ancestral populations to the model. This is done by providing a list of ancestral population sizes and the time (in generations) at which the spatially distributed demes migrate into the ancestral population(s). The following code adds a single ancestral population of 100,000 individuals that demes merge into 23,000 generations in the past. The brackets in [100000] specify a list in python. In this case, it is a list of length one.
+
+
# add ancestral population
+demo.add_ancestral_populations([100000], 23000)
+
+
+
Inspect your model
+
Now that we have our demographic model set up, we can inspect it to make sure it looks as expected. spaceprime has a series of plot_() functions that make this easier.
+
+
plot_landscape()
+
plot_landscape() plots the deme sizes in space, which allows you to quickly inspect whether the transformation you applied to your habitat suitability map make sense. Here, we provide the demographic model object, the raster object, the index of the time slice to plot (0 for the present day in this case), and basemap=True to add an OpenStreetMap basemap, providing geographic context to the plot. If you don’t have an internet connection, set basemap=False (the default) to plot without the basemap.
+
+
sp.plot_landscape(demo, r, 0, basemap=True)
+
+
+
+
+
+
+
+
+
plot_model()
+
plot_model() plots the deme sizes on a folium interactive map, with precise deme sizes and outgoing migration rates for each deme present in a popup.
+
+
sp.plot_model(demo, r, 0)
+
+
Make this Notebook Trusted to load map: File -> Trust Notebook
+
+
+
+
+
+
+
Simulate genetic data
+
Before simulating this demography, we need to create a sample dictionary that translates the empirical sampling localities to the model’s deme indices and maps those to the number of samples to take from each deme. By default, coords_to_sample_dict() sets the number of individuals to sample from each deme to the number of empirical localities in that deme. The function also returns two other dictionaries that are not used in this example, so we’ll ignore them. In Python, it’s common practice for a function to return multiple objects as a list. If you want the objects separated, you use the formatting below. Using an underscore _ leads to the object being ignored.
+
+
+
+
+
+
+Dictionaries
+
+
+
+
Dictionaries are a data structure in Python that map keys to values. They are similar to lists, but instead of using an index to access elements, you use a key. Dictionaries are useful for storing data that has a key-value relationship, like a phone book, where the key is the name of the person and the value is their phone number. In this case, the key is the deme index and the value is the number of samples to take from that deme.
Now we get to simulate! The first task is to simulate the ancestry of the samples using the coalescent. All of the hard work is done through msprime’s sim_ancestry() function, for which spaceprime provides a convenience wrapper. This function returns a tskit TreeSequence, which “represents a sequence of correlated evolutionary trees along a genome” and is an incredibly powerful and compact data representation for population genomic analyses. The minimum number of arguments required for this function are the sample dictionary and the demographic model. If you need to overlay mutations, you need to supply the sequence length. Notice the lack of mutations in the table. We’ll set record_provenance to False to decrease the memory overhead of storing a bunch of metadata about the simulation.
We’ll take a peak at a single tree from the TreeSequence object to see what it looks like. The draw_svg() method plots trees from the TreeSequence object. Here, I selected a single tree and removed the node labels because there are tons of nodes that crowd the plot and we’re only interested in the tree structure.
Overlaying mutations after simulating ancestry isn’t necessary for calculating genetic summary statistics on a TreeSequence, but it is necessary if you would like to compare your simulations with empirical data that are represented as a table of genotypes rather than a TreeSequence. The sim_mutations() function overlays mutations on the TreeSequence object returned by sim_ancestry() and requires the mutation rate. The mutation rate is the number of mutations per base pair per generation. For this example, we’ll use a mutation rate of 1e-10 so we don’t overcrowd the tree sequence visualization. You can see from the table that the tree sequence has some mutations!
You might have noticed that we return new objects for sim_ancestry and sim_mutations. This is because we have returned to a functional programming paradigm, where functions return new objects rather than modifying objects in place. We do this because the TreeSequence objects are a different type of object than the demographic model object, and we want to keep them separate.
+
And now for the tree. The red X’s represent mutations on the tree, with their ID numbers next to them.
Use the analysis module to calculate genetic summary statistics on the TreeSequence object. For more on the analysis module, see the analysis module documentation.
+
+
Save the TreeSequence to use later or analyze on a platform like tskit with sim.dump(file/path/to/write/to.trees).
+
+
Convert the TreeSequence with mutations to a genotype matrix for use in a program like scikit-allel with sim.genotype_matrix(). For more information on this function, see the tskit documentation.
+
+
Export the TreeSequence with mutations to a VCF file using sim.write_vcf. For more information on how to use this function, see the tskit documentation.
+
+
+
+
+
+
+
+TODO
+
+
+
+
add a link to the analysis module documentation when it’s ready.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/spaceprime-for-r-users_files/figure-html/draw-mutations-output-1.svg b/spaceprime-for-r-users_files/figure-html/draw-mutations-output-1.svg
new file mode 100644
index 0000000..6bd1373
--- /dev/null
+++ b/spaceprime-for-r-users_files/figure-html/draw-mutations-output-1.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/spaceprime-for-r-users_files/figure-html/draw-tree-output-1.svg b/spaceprime-for-r-users_files/figure-html/draw-tree-output-1.svg
new file mode 100644
index 0000000..06801a1
--- /dev/null
+++ b/spaceprime-for-r-users_files/figure-html/draw-tree-output-1.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/spaceprime-for-r-users_files/figure-html/localities-output-1.png b/spaceprime-for-r-users_files/figure-html/localities-output-1.png
new file mode 100644
index 0000000..c668464
Binary files /dev/null and b/spaceprime-for-r-users_files/figure-html/localities-output-1.png differ
diff --git a/spaceprime-for-r-users_files/figure-html/plot-landscape-output-1.png b/spaceprime-for-r-users_files/figure-html/plot-landscape-output-1.png
new file mode 100644
index 0000000..94e8064
Binary files /dev/null and b/spaceprime-for-r-users_files/figure-html/plot-landscape-output-1.png differ
diff --git a/spaceprime-for-r-users_files/figure-html/timeslices1-output-1.png b/spaceprime-for-r-users_files/figure-html/timeslices1-output-1.png
new file mode 100644
index 0000000..b063c3f
Binary files /dev/null and b/spaceprime-for-r-users_files/figure-html/timeslices1-output-1.png differ
diff --git a/transformation-functions.html b/transformation-functions.html
new file mode 100644
index 0000000..d4bdcd1
--- /dev/null
+++ b/transformation-functions.html
@@ -0,0 +1,576 @@
+
+
+
+
+
+
+
+
+
+transformation-functions – spaceprime
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
All transformations are performed on a raster with 0-1 suitability values (or a raster that will be normalized to 0-1). The output of the transformation is an array of the same shape as the input raster, but with values transformed according to the specified function.
+
The “linear” transformation multiplies the input values by the maximum local deme size (max_local_size).
+
The “threshold” transformation creates an array where values below the threshold value are set to zero and values about the threshold value are set to 1.
+
The “sigmoid” transformation applies a sigmoid function to the data using Eq. 1 from Frazier and Wang 2013, Modeling landscape structure response across a gradient of land cover intensity, where an inflection_point and slope are specified. The inflection_point can be thought of like a threshold value, where original values below this value descend quicker to zero, and values about this value increase quicker to 1. The slope determines how fast values change on either side of the inflection point. A sufficiently steep slope makes this a threshold function, while a sufficiently shallow slope makes this a linear function.