Rewire tsdate to allow nonfixed sample nodes #1
Comments
We can pass the sample node to date as a "datable node" here:
but we will need to specify a mean and variance for the distribution somehow, or modify the contents of the returned
I think the best thing would be to set any nodes as "datable" if they have a non-zero variance (rather than if they are not samples). Then we simply need to figure out how to give samples a non-zero variance (we can take the mean for the prior as the time of the node in the tree sequence). It looks like the
And which contains
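To make the "datable if the variance is non-zero" rule concrete, here is a minimal sketch. It is not tsdate code: `datable_nodes` and the `prior_var` list are hypothetical names invented for illustration, and the variances are made-up numbers.

```python
def datable_nodes(prior_var):
    """Return the IDs of nodes whose prior variance is non-zero.

    Hypothetical helper: `prior_var[i]` is assumed to hold the prior
    variance of node i, with 0 meaning "time is fixed and known".
    """
    return [node_id for node_id, var in enumerate(prior_var) if var > 0]


# Made-up example: nodes 0 and 1 are samples with known times,
# node 2 is a sample with an uncertain time, nodes 3 and 4 are internal.
prior_var = [0.0, 0.0, 1000.0, 2.5e6, 4.0e6]
print(datable_nodes(prior_var))
```

With this rule, whether a node is a sample no longer matters: a sample is dated as soon as it is given a non-zero variance.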
I have made some progress with tskit-dev/tsdate@786cf12. Here is some code to test:
import tsinfer
import msprime
import tskit
import tsdate
import numpy as np
import matplotlib.pyplot as plt
Ne = 10000
samples = [
    msprime.SampleSet(2),
    msprime.SampleSet(1, time=100),
]
mutated_ts = msprime.sim_ancestry(
    samples=samples,
    population_size=Ne,
    sequence_length=2e4,
    recombination_rate=0,  # For testing, just have a single tree
    random_seed=1,
)
mutated_ts = msprime.mutate(mutated_ts, rate=1e-8, random_seed=1)
def create_sampledata_with_individual_times(ts):
    """
    The tsinfer.SampleData.from_tree_sequence function doesn't allow different time
    units for sites and individuals. This function adds individual times by hand
    """
    # sampledata file with times-as-frequencies
    sd = tsinfer.SampleData.from_tree_sequence(ts)
    # Set individual times separately - warning: this mixes time units
    # so that sites have TIME_UNCALIBRATED but individuals have meaningful times
    # Use a float fill value so that non-integer times are not truncated
    individual_time = np.full(sd.num_individuals, -1.0)
    for sample, node_id in zip(sd.samples(), ts.samples()):
        if individual_time[sample.individual] >= 0:
            # All nodes of one individual must share the same time
            assert individual_time[sample.individual] == ts.node(node_id).time
        individual_time[sample.individual] = ts.node(node_id).time
    assert np.all(individual_time >= 0)
    sd = sd.copy()
    sd.individuals_time[:] = individual_time
    sd.finalise()
    return sd
def set_times_for_historical_samples(ts):
    """
    Use the times stored in the individuals metadata of an inferred tree sequence
    to constrain the times.
    """
    tables = ts.dump_tables()
    tables.individuals.metadata_schema = tskit.MetadataSchema.permissive_json()
    ts = tables.tree_sequence()
    times = np.zeros(ts.num_nodes)
    # set sample node times of historic samples
    for node_id in ts.samples():
        individual_id = ts.node(node_id).individual
        if individual_id != tskit.NULL:
            times[node_id] = ts.individual(individual_id).metadata.get("sample_data_time", 0)
    constrained_times = tsdate.core.constrain_ages_topo(ts, times, eps=1e-1)
    tables.nodes.time = constrained_times
    tables.mutations.time = np.full(ts.num_mutations, tskit.UNKNOWN_TIME)
    tables.sort()
    return tables.tree_sequence()
sampledata = create_sampledata_with_individual_times(mutated_ts)
inferred_ts = tsinfer.infer(sampledata)
inferred_ts_w_times = set_times_for_historical_samples(inferred_ts).simplify()
print(inferred_ts_w_times.node(5))
prior = tsdate.build_prior_grid(inferred_ts_w_times, Ne=10000, allow_historical_samples=True, truncate_priors=True, node_var_override={5:1000})
dated_ts = tsdate.date(inferred_ts_w_times, priors=prior, mutation_rate=1e-8)
This fails when truncating priors, however:
I can't quite figure out the logic in that function. Perhaps @awohns can talk me through it and we can see what is not working. It should be perfectly possible to truncate on the basis of a few fixed sample nodes.
We can test the pathway without truncation using the code above via
prior = tsdate.build_prior_grid(inferred_ts_w_times, Ne=10000, allow_historical_samples=True, truncate_priors=False, node_var_override={5:10})
dated_ts = tsdate.date(inferred_ts_w_times, priors=prior, mutation_rate=1e-8)
With tskit-dev/tsdate@786cf12 this now complains about dangling nodes on the inside pass, which is correct, as the node corresponding to the sample-to-date will appear as if it is dangling.
The key line is here, where we fill the inside either with
inside = self.priors.clone_with_new_data(  # store inside matrix values
    grid_data=np.nan, fixed_data=self.lik.identity_constant
)
Note that my changes simply create a lognormal distribution (with a user-specified variance) for the prior on an undated sample node. If a more complicated prior is needed, I guess it can be created by hand. We can show an example of this in the docs.
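For reference, here is one way to turn a mean and a variance into a lognormal prior, using standard moment matching. This is only a sketch under that assumption: `lognormal_params` and `lognormal_pdf` are hypothetical helpers, and the linked commits may parametrise the distribution differently.

```python
import math


def lognormal_params(mean, var):
    """Return (mu, sigma2) of a lognormal with the given mean and variance.

    Standard moment matching: sigma2 = ln(1 + var / mean^2) and
    mu = ln(mean) - sigma2 / 2. Both inputs must be strictly positive.
    """
    if mean <= 0 or var <= 0:
        raise ValueError("mean and var must be positive")
    sigma2 = math.log(1.0 + var / mean**2)
    mu = math.log(mean) - sigma2 / 2.0
    return mu, sigma2


def lognormal_pdf(t, mu, sigma2):
    """Density of the lognormal at time t (zero for t <= 0)."""
    if t <= 0:
        return 0.0
    z = (math.log(t) - mu) ** 2 / (2.0 * sigma2)
    return math.exp(-z) / (t * math.sqrt(2.0 * math.pi * sigma2))


# Prior for a sample at time 100 with variance 1000, as in node_var_override={5: 1000}
mu, sigma2 = lognormal_params(mean=100.0, var=1000.0)
print(mu, sigma2)
print([lognormal_pdf(t, mu, sigma2) for t in (50.0, 100.0, 150.0)])
```

Under this parametrisation, a variance of 1000 around a mean of 100 is a standard deviation of about 32 time units, so the density stays concentrated near the node's original time.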
Wow, with tskit-dev/tsdate@979f55c it's almost working with the outside_maximization method. The only issue now is setting the times so that they are topologically constrained:
prior = tsdate.build_prior_grid(inferred_ts_w_times, Ne=10000, allow_historical_samples=True, truncate_priors=False, node_var_override={5:1000})
dated_ts = tsdate.date(inferred_ts_w_times, priors=prior, mutation_rate=1e-8, method="maximization")
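To spell out what "topologically constrained" means here, the sketch below pushes each parent to be at least `eps` older than its children. It is an illustration only, not the `tsdate.core.constrain_ages_topo` implementation: `constrain_times` is a hypothetical function working on a plain list of times and a list of `(parent, child)` pairs.

```python
def constrain_times(times, edges, eps=0.1):
    """Return a copy of `times` in which every parent is at least `eps`
    older than each of its children.

    `edges` is a list of (parent, child) node-ID pairs and is assumed to
    be acyclic, as in a tree sequence; otherwise the loop never ends.
    """
    constrained = list(times)
    changed = True
    # Repeat full passes over the edges until no parent needs raising
    while changed:
        changed = False
        for parent, child in edges:
            min_parent_time = constrained[child] + eps
            if constrained[parent] < min_parent_time:
                constrained[parent] = min_parent_time
                changed = True
    return constrained


# Node 2 is a historical sample at time 100, but its parent (node 4)
# was estimated at time 60, which is inconsistent with the topology.
times = [0.0, 0.0, 100.0, 50.0, 60.0]
edges = [(3, 0), (3, 1), (4, 3), (4, 2)]
print(constrain_times(times, edges))
```

Only times that violate a parent-child ordering are changed; nodes that are already consistent keep their estimates.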
Fixed with tskit-dev/tsdate@da58644 and tskit-dev/tsdate@02a9b67
The current PR tskit-dev/tsdate#214 works, but only with the outside maximisation method, which won't return posteriors. Here's what we get when trying the inside-outside:
import tsinfer
import msprime
import tskit
import tsdate
import numpy as np
Ne = 10000
samples = [
    msprime.SampleSet(2),
    msprime.SampleSet(1, time=100),
]
mutated_ts = msprime.sim_ancestry(
    samples=samples,
    population_size=Ne,
    sequence_length=2e4,
    recombination_rate=0,  # For testing, just have a single tree
    random_seed=1,
)
mutated_ts = msprime.mutate(mutated_ts, rate=1e-8, random_seed=1)
def create_sampledata_with_individual_times(ts):
    """
    The tsinfer.SampleData.from_tree_sequence function doesn't allow different time
    units for sites and individuals. This function adds individual times by hand
    """
    # sampledata file with times-as-frequencies
    sd = tsinfer.SampleData.from_tree_sequence(ts)
    # Set individual times separately - warning: this mixes time units
    # so that sites have TIME_UNCALIBRATED but individuals have meaningful times
    # Use a float fill value so that non-integer times are not truncated
    individual_time = np.full(sd.num_individuals, -1.0)
    for sample, node_id in zip(sd.samples(), ts.samples()):
        if individual_time[sample.individual] >= 0:
            # All nodes of one individual must share the same time
            assert individual_time[sample.individual] == ts.node(node_id).time
        individual_time[sample.individual] = ts.node(node_id).time
    assert np.all(individual_time >= 0)
    sd = sd.copy()
    sd.individuals_time[:] = individual_time
    sd.finalise()
    return sd
def set_times_for_historical_samples(ts):
    """
    Use the times stored in the individuals metadata of an inferred tree sequence
    to constrain the times.
    """
    tables = ts.dump_tables()
    tables.individuals.metadata_schema = tskit.MetadataSchema.permissive_json()
    ts = tables.tree_sequence()
    times = np.zeros(ts.num_nodes)
    # set sample node times of historic samples
    for node_id in ts.samples():
        individual_id = ts.node(node_id).individual
        if individual_id != tskit.NULL:
            times[node_id] = ts.individual(individual_id).metadata.get("sample_data_time", 0)
    # Just need to make the ts consistent
    constrained_times = tsdate.core.constrain_ages_topo(ts, times, eps=1e-1)
    tables.nodes.time = constrained_times
    tables.mutations.time = np.full(ts.num_mutations, tskit.UNKNOWN_TIME)
    tables.sort()
    return tables.tree_sequence()
sampledata = create_sampledata_with_individual_times(mutated_ts)
inferred_ts = tsinfer.infer(sampledata)
inferred_ts_w_times = set_times_for_historical_samples(inferred_ts).simplify()
prior = tsdate.build_prior_grid(inferred_ts_w_times, Ne=10000, allow_historical_samples=True, truncate_priors=False, node_var_override={5:1000})
dated_ts, posteriors = tsdate.date(inferred_ts_w_times, priors=prior, mutation_rate=1e-8, method="maximization", return_posteriors=True) # WORKS!
dated_ts, posteriors = tsdate.date(inferred_ts_w_times, priors=prior, mutation_rate=1e-8, return_posteriors=True) # FAILS
It's failing because self.norm[edge.child] is
tskit-dev/tsdate@974038d sets the normalization constant to unity for non-fixed leaf nodes. However, I'm having second thoughts about the
It's technically working but there's a bug, I think. I reckon the following should give a relatively flat prior for node 5:
It doesn't for me. The variance logic must be wrong, I think.
Here's some code to discuss:
Currently tsdate only allows sample nodes which have a known date. We want to rewire tsdate so sample nodes can have an unknown date, allowing for "molecular sampling".