-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert forward simulation notebooks. #14
Comments
Also, we want to make sure these pages don't take to long to run. |
I doubt they need conda--apt should be fine. The one with cython took quite a while to run, in order to demonstrate correctness. I have previously suggested that we delete it. |
Your call on deleting stuff. Demonstrating correctness is beyond the scope of these tutorials now I think. |
Looking at it now. It was written back when the access to tskit was via msprime. It'll take some time to convert. Also, there's an annoyance--if you convert with |
I've not bothered modernising the content in the first pass @molpopgen, if that's any help. I've basically just been getting things working, and going to make notes on the outstanding issues so we can follow up in smaller chunks afterwards. |
yeah, I'm going for "working" right now, which it doesn't! |
I'm going to shorten it considerably. Looking at it now, I really don't like the length. |
This will take a bit more time than I thought. Teaching, etc.,.. |
Hi @molpopgen - I'm just looking at our missing tutorials, and see there's a space for forward simulations. Did you get anywhere with this? |
No, totally off of my radar. I still vote to delete the Cython example and then minimally port over the other to markdown. |
That sounds ideal to me. You mentioned something about shortening though? Is that still relevant? |
Yes, IMO a tutorial should strive to be only a "screen or 2 long". To that end, something like cell 3 modified to do exactly 1 crossover / meiosis in the "pseudo-haploid" style of the C example will be sufficient. Updating to include periodic simplification would be a good thing, too, I think. |
@molpopgen - I am looking at this with a view to showing a simple forwards simulator for a workshop. I rethought the python code from scratch, and came up with this. I would be interested in how clear or otherwise you think it is (I am guessing it isn't very efficient though). The nice thing about picking parents, rather than genomes, at random is not just that it is more obvious to a biologist, but that we can relatively easily modify the import tskit
import random
# First building block: make an individual containing 2 genomes. Note that
# we flag up the nodes as "sample nodes", which helps when plotting.
def make_individual(tables, time):
"""
Creates a new individual containing 2 new genomes (nodes), and returns the ID
of the individual and a tuple of the IDs of the individual's two genomes
"""
indivID = tables.individuals.add_row()
mat_nodeID = tables.nodes.add_row(tskit.NODE_IS_SAMPLE, time, individual=indivID)
pat_nodeID = tables.nodes.add_row(tskit.NODE_IS_SAMPLE, time, individual=indivID)
return indivID, (mat_nodeID, pat_nodeID)
# Second building block: add "inheritance paths" between the
# genome the child inherits from one parent (say the mother) and
# the two genomes contained in that parent. This is where "meiosis"
# comes in. However the moment, we ignore recombination and simply pick
# one of the two genomes at random
def add_inheritance_paths(
tables, parent_genomes, child_genome, recombination_rate=None
):
"""
Given two parent genome (node) IDs from the same individual, add inheritance
paths from those genomes to a child genome (node) ID.
"""
left = 0
right = tables.sequence_length
inherit_from = random.randint(0, 1) # randomly chose 0 or 1
tables.edges.add_row(left, right, parent_genomes[inherit_from], child_genome)
# Last building block: take a collection of parents
# and generate children from them by picking two parents at random
# Here we use `random.choices` to sample with replacement, but we could
# use `random.sample` to exclude the possibility of selfing
def make_new_generation(tables, time, num_children, parents):
"""
Returns a Python dictionary of children (i.e. length `num_children`) mapping
{individual_ID: (maternal_genome_ID, paternal_genome_ID), ...}
The `parents` parameter is a dictionary of the same form, listing possible parents
for the children. If the dictionary is empty, no inheritance paths are set.
"""
children = {}
parent_IDs = list(parents.keys())
for _ in range(num_children):
child_ID, child_genomes = make_individual(tables, time)
children[child_ID] = child_genomes
if parents:
# pick IDs for two parents at random with replacement (allows selfing)
random_parent_IDs = random.choices(parent_IDs, k=2)
for child_genome, parent_ID in zip(child_genomes, random_parent_IDs):
add_inheritance_paths(tables, parents[parent_ID], child_genome)
return children
def setup_simulation(sequence_length, random_seed):
"Returns a TableCollection where we can store our growing genealogy"
random.seed(random_seed)
tables = tskit.TableCollection(sequence_length)
tables.time_units = "generations"
return tables
def simple_diploid_sim(popsize, generations, sequence_length, random_seed=123):
tables = setup_simulation(sequence_length, random_seed)
# tskit time is in units ago: we want the final generation to be at time 0
generations_ago = generations - 1 # start with the oldest generation ago
# Initialise an empty population of parents
parents = make_new_generation(tables, generations_ago, popsize, parents={})
# Run the simulation
while generations_ago > 0:
generations_ago = generations_ago - 1
parents = make_new_generation(tables, generations_ago, popsize, parents)
# Sort the tables, e.g. to put edges in the required order
tables.sort()
return tables.tree_sequence()
ts = simple_diploid_sim(5, generations=15, sequence_length=1000) Here's how you might swap in a function that allows recombination: def add_inheritance_paths(tables, parent_nodes, child_node, recombination_rate=0):
"""
Add inheritance paths from parent node IDs (i.e. parent genomes)
to a child node ID (i.e. a child genome)
"""
breakpoints = ... # e.g. pick breakpoints with probability rho per integer position
inherit_from = random.randint(0, 1)
for lft, rgt in zip([0] + breakpoints, breakpoints + [tables.sequence_length]):
tables.edges.add_row(
left=lft, right=rgt, parent=parent_nodes[inherit_from], child=child_node)
inherit_from = 1 - inherit_from # switch: inherit from the other parent genome |
@hyanwong I find it a bit hard to read for a few reasons:
|
I think I'd also go straight to the case with recombination, including assertions on the invariants that need to be upheld. |
Thanks. Really useful. I will try to standardise variable names (distinguishing between the individual IDs and the genome (node) IDs is the main problem). Type hints are a good idea. I want to turn this into a workbook, so I'll make changes there and post a link. |
ISWYM. It is nice to be able to simulate without recombination, so you get a single tree for viz purposes. But I could do that simply by setting recombination_rate to 0. |
As much as Python type hints can bother me, I do think they'd help here. :) |
I should also say that having |
I have worked this up into a (very) draft workbook at https://hyanwong.github.io/workshop-pages/lab?path=ForwardSimulation.ipynb, which has a bit of explanation between the function definitions, etc, and examples of what simplification does. Type hints are only present for return values of functions (and there are fewer functions). I have not talked about incorporating simplification into the actual simulation, however - perhaps I should do so. I have also not mentioned mutation. The idea would be to gradually port @molpopgen 's nice text into this notebook somehow, which could perhaps form the nucleus of a new tutorial. But if we want to keep the existing code, that's fine too (as I am wanting to use mine for a separate "workbook" -type file for workshops rather than the standard More comments gratefully accepted. I know it's a bit rough at the moment. Todo:
|
Thinking about this, I wonder if it is a good idea to split it into two tutorials. The first can be something like the workbook I put up, and is a "simple simulator" (but with recombination). The second can contain all the gotchas, details about mutations, repeated simplification (and invariance), recapitation, validation, etc. What do you think @molpopgen ? |
It think that the level of detail depends a bit on your target audience. For static tutorials served up on a website, I think that several short pages is okay. |
Yes, I agree. Are you happy for me to take your
I think we should describe a node as a genome rather than an "event" (e.g. the birth of a lineage). For instance, a coalescent node might not actually mark the exact time of coalescence: rather it is the neareast representative genome we have to that event. See the eARG vs gARG distinction in the ARG preprint.
I don't think we need to require diploids to be adjacent nodes any more, do we? Instead we can group them as belonging to the same individual. Indeed, once we have simplified, the internal nodes can no longer be thought of as adjacent diploids anyway (ancestral nodes are unlikely to also have their matching diploid pair in the genealogy)
I have not followed that scaling in my example, preferring e.g. to have a genome of length e.g. 50kb. Using base pairs seems easier for beginners to me, but I can also see the benefits of using 0..1, so am willing to be convinced that the Hudson scaling is better as an introduction.
I find it easier to code time as generations ago (i.e. standard tskit direction), rather than distract the reader by having to reverse the time scale on export. The time-reversal technique has the down-side that the tables are invalid all the way through the simulation, up until the point that the times are reversed. My suggestion is that when we simply count I suggest that we combine both counting-down techniques. I.e. we set the initial generation to be time |
I defined a node intentionally with respect to the birth of a genome. I think this definition is appropriate in the context.
Here, you need to think about computer architecture. Even with "individuals", nodes are likely to be kept adjacent for performance reasons. The "data oriented" paradigm from, e.g. computer games, would have "nodes" by an array within an "individuals" structure.
Either way.
Here it is a matter of taste. I'm not sure that "generations" ago generalizes to a forward simulation. For example, there could be a termination condition such that the end time is not tskit's "zero". While that doesn't matter for the data model, it is not necessarily true that the final time is zero. It is, in many cases, more natural for a forward simulation to increment time using positive integers and then do the conversion. |
How about e.g.
I think imagining it as the birth point of a genome is less event-y than imagining it as the birth point of a lineage (which is what was written before)
Yes, I think that we could say that, for efficiency, you might want that. But I don't think we need any more to define a diploid as two successive node IDs. And certainly, two successive node IDs don't have to correspond to a diploid, right? (otherwise all ancestors would have to be diploids too)
Yes, you are right here. I was suggesting that we could get around this by subtracting the "current generation time", if not zero.
I agree that it feels more natural, but that is offset by the complication of having to explain that the times need to be reversed later. The fact that by using forward generations, we would be creating ts-invalid tables during simulation was what tipped me toward using "generations_ago" instead. Maybe it's worth asking a newcomer? Thanks for the replies, by the way. |
I think there's a miscommunication here. My text refers to what forward simulations generate, which is individuals born with some ploidy. The text has that audience in mind. What tskit does with ancestors is an internal detail of tskit. |
I think it depend on the target audience. For example, is familiarity with the tskit data model a strong prereq for this tutorial? |
Ah yes, that is a very good point. I am probably thinking too much in terms of tskit definitions, rather than the (independent) simulation code. |
However, if, at any point in the simulation, we want to remove (or renumber) a node independently of the individual in which it resides, then using adjacent IDs to define an individual becomes problematic. Most of the time we leave the details of removing / reallocating IDs to tskit, but I suppose I could imagine cases where this might be required in the simulation itself? |
But again, I am talking about births. Birth happen at once, "in bulk", adding rows to the node table. That's consecutive, right? |
Right. One of the confusions is coming from the fact that I am using tskit to allocate (successive) node IDs, rather than e.g. resetting internal simulation IDs to zero each generation. That means, if we are simplifying part-way, there could be an odd number of nodes in the node table (i.e. after simplification), so that individuals start being allocated nodes such as (101,102), (103, 104). Although these groupings are consecutive, they do not start with even numbers, implying that the lower IDs do not conform to the defined (2-nodes = one individual) grouping. |
There's no requirement that things start with an even number. For a diploid, they are simply the next 2 available numbers. |
Sure, but the implication would then be that nodes exist (or have existed) in the simulation that are not grouped in pairs. Without going into the gory details of dropping single nodes from the ID list, it could potentially be a bit confusing to the reader? |
At some point, you have to assume some background in population genetics or else you will go crazy. This is domain-specific stuff you're talking about. |
Sure - anyway, I think it's useful to mention the tskit way of recording individuals (because we can store those from the simulation too), so I'll see if I can put together some sort of wording that details both cases, without going into too much detail for the beginner (wish me luck!). |
Oh, now I see the confusion! For me, an individual has nothing to do with the "individuals table" until a final export happens. Even when keeping ancient samples, it is often cleaner to manage those nodes yourself, remapping them each time you simplify, then deal with the individuals table. |
Ah yes. Personally, I think it is helpful for a forwards simulation to record individuals in the individual table as they are created (rather than generate them on export). Then you also keep all the individuals associated with ancestral non-sample genomes too, which can be helpful. For example, it means you can e.g. store the geographical location of each individual in the table, and hence gain access to the location of ancestral genomes. ( |
That works, but has drawbacks. For example, having the locations only in the tskit table means that they aren't in, say, an R-tree or some other means of efficiently looking up things in "space". |
Yes, although I'm not saying that they should only be stored in tskit. In practice, you might keep the locations for the parent and the child generations in some efficient structure too, but throw them away when you go to the next generation. Having the locations permanently stored for all ancestors is generally useful (and if you wanted to do anything time consuming with those locations, you might well extract the coordinates from the individual table and put them into a more efficient structure before analysis). Either way, I'm pretty sure that it is helpful to store individuals as-we-go, recording any metadata about them as necessary. Individuals associated with non-sample nodes can be discarded later if need be, but they can contain useful information which can be worth storing and accessing from the output tree sequence. I don't think this is easily done by constructing the individuals at the end, just before ts output. |
Here's what I have so far (a hybrid of your texts (mostly) and mine):
|
@molpopgen - I suggest that material in your |
Please just go ahead and do whatever you think is needed. I don't have any attachment to anything these docs and don't need to approve any changes. |
See extensive discussion at tskit-dev#14
See extensive discussion at tskit-dev#14
See extensive discussion at tskit-dev#14
See extensive discussion at tskit-dev#14
See extensive discussion at tskit-dev#14
See extensive discussion at tskit-dev#14
See extensive discussion at tskit-dev#14
See extensive discussion at tskit-dev#14
There's two notebooks (wfforward.ipynb and wfcython.ipynb) in the old-content/notebooks directory that need to be converted to jupyterbook format. Hopefully this can be reasonably automatic using jupytext.
The only concern is the environment requirements - what is need to run these notebooks? It would be great if we keep this a no-conda affair.
@molpopgen, would you mind taking a look please?
The text was updated successfully, but these errors were encountered: