Reconsidering filtered samples from the past N consecutive days #193

Open

szhan opened this issue Jul 20, 2024 · 32 comments

Comments

@szhan
Contributor

szhan commented Jul 20, 2024

An inappropriate HMM cost threshold may filter out epidemiologically relevant variants, which is undesirable (e.g. see #188). But, in practice, it is not clear what an appropriate threshold value should be.

To see if we can work around this issue, we are trying a strategy where samples filtered out by the HMM cost threshold from the past N days are reconsidered in a post-processing step at the end of a round of daily extension.

The procedure is as follows (a rough code sketch is given after the list):

  • Given N, fetch the samples (and their already inferred paths and mutations) from the past N consecutive days' worth of pickled files. Days with no filtered samples do not contribute to this pool of reconsidered samples.
  • Group the reconsidered samples by their paths and immediate reversions as done in add_matching_results.
  • The samples in groups of size greater than k are attached to the ARG based on their inferred paths (i.e. without needing to rerun HMM matching).
  • After proceeding to the following day, the oldest filtered samples that were not attached to the ARG are dropped from the pool in a FIFO fashion.
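
Assuming filtered samples are pickled per day and carry path and immediate_reversions attributes (the file naming and attribute names here are illustrative, not the actual sc2ts API), the pooling and grouping steps could look like this:

import collections
import datetime
import pathlib
import pickle


def load_reconsidered_samples(excluded_dir, current_date, num_days):
    # Pool filtered samples from the past num_days days' pickle files;
    # days with no file simply contribute nothing to the pool.
    pool = []
    for offset in range(1, num_days + 1):
        day = current_date - datetime.timedelta(days=offset)
        path = pathlib.Path(excluded_dir) / f"excluded_{day.isoformat()}.pkl"
        if path.exists():
            with path.open("rb") as f:
                pool.extend(pickle.load(f))
    return pool


def group_reconsidered_samples(samples, min_group_size=2):
    # Group by (path, immediate reversions), as in add_matching_results,
    # and keep only groups with at least min_group_size members.
    # Assumes paths and reversions are already hashable tuples.
    groups = collections.defaultdict(list)
    for sample in samples:
        key = (tuple(sample.path), tuple(sample.immediate_reversions))
        groups[key].append(sample)
    return {k: v for k, v in groups.items() if len(v) >= min_group_size}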

A good place to do this post-processing step may be after calling add_matching_results inside extend.

If this works, then we should see the first Alpha samples added in Sep. 2020, even though they were filtered out by the HMM cost filter.

@szhan
Contributor Author

szhan commented Jul 21, 2024

After grouping reconsidered samples by their paths and immediate reversions, for each group:

  • Insert a new node (set its time arbitrarily to be slightly earlier than that of the oldest reconsidered sample).
  • Add new edges connecting the new node to the shared parent nodes of the grouped samples as per their shared path.
  • Add the shared mutations (including immediate reversions) above the new node.
  • Attach each of the reconsidered samples to the new node via new edges.

@jeromekelleher I think this sounds about right? I think we should keep the immediate reversions here. A rough sketch of these steps follows.
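
For illustration only, a literal translation into tskit table operations (group, group.path, group.mutations, and group.samples are placeholders, and the real code would presumably reuse the existing local tree building machinery rather than doing this by hand):

import tskit


def attach_group(tables, group, group_time):
    # group_time: arbitrarily set slightly older than the oldest
    # reconsidered sample in the group (proper node dating deferred).
    new_node = tables.nodes.add_row(flags=0, time=group_time)

    # New edges connecting the new node to the shared parent nodes,
    # following the group's shared path of (left, right, parent) segments.
    for left, right, parent in group.path:
        tables.edges.add_row(left=left, right=right, parent=parent, child=new_node)

    # Shared mutations (including immediate reversions) go above the new node.
    for site_id, derived_state in group.mutations:
        tables.mutations.add_row(site=site_id, node=new_node, derived_state=derived_state)

    # Attach each reconsidered sample to the new node (metadata omitted;
    # sample node times are a separate issue).
    for sample in group.samples:
        child = tables.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0)
        tables.edges.add_row(
            left=0, right=tables.sequence_length, parent=new_node, child=child
        )

    # Edges must be sorted again before the tables are turned back
    # into a tree sequence.
    tables.sort()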

@jeromekelleher
Owner

I was imagining that we'd reuse the same local tree building technology as we have in the usual attachment case. I guess we'd have to change node times, but otherwise should work?

@szhan
Contributor Author

szhan commented Jul 22, 2024

add_matching_results has been modified to take the argument min_group_size to exclude grouped matches having fewer than a specified number of samples.

@szhan
Contributor Author

szhan commented Jul 22, 2024

We have decided to handle sample node dates in a later issue. The focus now is to see whether this strategy adds the early Alpha samples.

@szhan
Contributor Author

szhan commented Jul 23, 2024

I think it is working! Using a min group size of 2 and looking at the past 5 days' worth of filtered samples, these three Alpha samples collected in early Sep. 2020 got attached on 2020-09-25.

strain       date
ERR4659819   2020-09-20
ERR4682028   2020-09-21
ERR5217634   2020-09-23

@jeromekelleher
Owner

Wahey! How much other stuff gets attached at the same time?

@jeromekelleher
Owner

As in, how many other samples also get added back in besides the ones we think should be added in?

@szhan
Contributor Author

szhan commented Jul 23, 2024

Hmm, it seems like quite a lot of samples are being added back in.

date         filtered samples added back
2020-09-24   1750
2020-09-25   2670
2020-09-26   2423
2020-09-27   1868
2020-09-28   1489
2020-09-29   1175
2020-09-30     42
2020-10-01     43

@szhan
Contributor Author

szhan commented Jul 23, 2024

The numbers above are misleading, because they count excluded samples that may have already been added back in. Still, it seems like samples besides the ones we think should be added back in are being added back.

@szhan
Contributor Author

szhan commented Jul 23, 2024

One thing clearly missing is to make sure that the same excluded samples don't get added back in multiple times.

@szhan
Contributor Author

szhan commented Jul 23, 2024

There are five Alpha samples collected in Sep. 2020 in the Viridian dataset. The three above have been added. I'm looking at the other two samples, both collected on 2020-09-30. One got added in but the other didn't, and for some reason that one is not in the excluded samples file. I think all five Alpha samples should be added by 2020-10-01, so I'm investigating this further before doing anything else.

@szhan
Contributor Author

szhan commented Jul 23, 2024

Ah, the sample (ERR5071073) is listed in the metadata but not in the input alignments, because it got filtered out for having too many Ns in the consensus sequence. Okay, I think the four Alpha samples that should be there have been added in.

@szhan
Contributor Author

szhan commented Jul 23, 2024

Also, this function needs to be modified.

def last_date(ts):
    if ts.num_samples == 0:
        # Special case for the initial ts which contains the
        # reference but not as a sample
        u = ts.num_nodes - 1
    else:
        u = ts.samples()[-1]
    node = ts.node(u)
    assert node.time == 0
    return parse_date(node.metadata["date"])

It gets the date from the last node added, assuming that node is a sample from the latest day. In the previous pipeline, where filtered samples were not reconsidered, this did give the latest date among the samples. But when we add back samples from the past N days, I don't think it returns the latest date among the samples anymore.

@jeromekelleher
Owner

Ah, good catch. I guess this is a motivation for getting the time of the extra added sample nodes correct. Then, we can define the last_date as

samples = ts.samples()
time_0 = samples[ts.nodes_time[samples] == 0]
node = ts.node(time_0[0])  # Arbitrarily pick the first time-zero sample

@jeromekelleher
Owner

To get stuff working, what you could do is:

max(parse_date(node.metadata["date"]) for node in ts.nodes())

Slow, but will work ok for prototyping.

@szhan
Contributor Author

szhan commented Jul 23, 2024

I was thinking the former.

def last_date(ts):
    if ts.num_samples == 0:
        # Special case for the initial ts which contains the
        # reference but not as a sample
        u = ts.num_nodes - 1
    else:
        samples = ts.samples()
        u = samples[ts.nodes_time[samples] == 0][0]
    node = ts.node(u)
    assert node.time == 0
    return parse_date(node.metadata["date"])

@jeromekelleher
Owner

Yeah, that works, but you'll have samples in there with time=0 that aren't from the most recent day.

@szhan
Contributor Author

szhan commented Jul 23, 2024

Ah, okay. Gotta fix that later too.
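
One possible shape for that later fix (a sketch only, assuming every time-zero sample has a "date" metadata field) is to take the maximum date over the time-zero samples rather than picking one arbitrarily:

def last_date(ts):
    if ts.num_samples == 0:
        # Special case for the initial ts which contains the
        # reference but not as a sample
        return parse_date(ts.node(ts.num_nodes - 1).metadata["date"])
    samples = ts.samples()
    time_zero = samples[ts.nodes_time[samples] == 0]
    # Added-back samples also sit at time zero, so take the latest
    # collection date instead of an arbitrary time-zero sample.
    return max(parse_date(ts.node(u).metadata["date"]) for u in time_zero)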

@szhan
Contributor Author

szhan commented Jul 23, 2024

It may be useful to promote the collection date to an attribute of the Sample class rather than keeping it as an entry in the metadata, because we access the collection date pretty often.

EDIT: Probably will do this later, since I don't feel like changing it and then needing to update other parts of the code.
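
If we do go that way, the change could be as small as something like this (purely illustrative; the real Sample class has more fields):

import dataclasses
import datetime


@dataclasses.dataclass
class Sample:
    strain: str
    date: datetime.date  # promoted from metadata["date"]
    metadata: dict = dataclasses.field(default_factory=dict)
    path: list = dataclasses.field(default_factory=list)
    mutations: list = dataclasses.field(default_factory=list)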

@szhan
Contributor Author

szhan commented Jul 23, 2024

Okay, the list of reconsidered samples is now being updated FIFO. The thing left to do is to make sure the times of the excluded samples added back in are not simply the time at which they were added.
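
The FIFO bookkeeping is roughly this (a sketch; filter_date and strain are assumed attribute names, and added_strains tracks samples already attached so they don't get added back twice):

import datetime


def update_reconsidered_pool(pool, current_date, num_days, added_strains):
    # FIFO expiry: keep only samples filtered within the past num_days
    # days, and drop any that have already been added back to the ARG.
    cutoff = current_date - datetime.timedelta(days=num_days)
    return [
        s for s in pool
        if s.filter_date > cutoff and s.strain not in added_strains
    ]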

@szhan
Contributor Author

szhan commented Jul 23, 2024

Should we tackle the node times of the added excluded samples in a separate issue?

@szhan
Contributor Author

szhan commented Jul 23, 2024

Also, I think the CLI command daily-extend should take excluded_sample_dir as a separate argument from output_prefix.

@jeromekelleher
Owner

I think we probably want to use a SQLite DB to store the excluded samples. This will make it easier to keep track of what's in there, and which ones have been inserted back in. I can do this if you have a working prototype based on pickle files.

@jeromekelleher
Owner

Should we tackle the node times of the added excluded samples in a separate issue?

That's up to you - if the code is working well enough for evaluation purposes, I'm happy to kick the node dating can down the road.

@szhan
Contributor Author

szhan commented Jul 24, 2024

I'll tackle the node times in a separate issue right after PR #194 gets merged. One reason I want to tackle it a bit later is that I want to get new 2020 trees for Yan's workshop ASAP, so I want to run what we have now.

@szhan
Contributor Author

szhan commented Jul 24, 2024

Also, we should probably log the number of excluded samples, reconsidered samples, and added samples per round of daily extension.

@szhan
Contributor Author

szhan commented Jul 24, 2024

I think we probably want to use a SQLite DB to store the excluded samples. This will make it easier to keep track of what's in there, and which ones have been inserted back in. I can do this if you have a working prototype based on pickle files.

As in, we store the Sample objects as BLOBs and query them by collection date, like we do with the sample metadata?

@jeromekelleher
Owner

Pretty much.
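
A sketch of what that could look like, with pickled Sample objects stored as BLOBs and queried by collection date (the schema and column names are assumptions, not the eventual implementation):

import pickle
import sqlite3


def create_excluded_db(path):
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS excluded_samples (
            strain TEXT PRIMARY KEY,
            date TEXT,
            added_back INTEGER DEFAULT 0,
            pkl BLOB
        )
        """
    )
    return conn


def add_excluded(conn, sample):
    # Store the whole Sample object as a pickled BLOB, keyed by strain.
    conn.execute(
        "INSERT OR REPLACE INTO excluded_samples (strain, date, pkl) VALUES (?, ?, ?)",
        (sample.strain, sample.date, pickle.dumps(sample)),
    )
    conn.commit()


def get_reconsidered(conn, start_date, end_date):
    # Fetch samples filtered in [start_date, end_date] that have not
    # yet been added back to the ARG.
    rows = conn.execute(
        "SELECT pkl FROM excluded_samples "
        "WHERE date BETWEEN ? AND ? AND added_back = 0",
        (start_date, end_date),
    )
    return [pickle.loads(r[0]) for r in rows]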

@szhan
Contributor Author

szhan commented Jul 24, 2024

I think these are the TODO items before closing this issue.

  • Fix the node times of added-back samples.
  • Store pickled Sample objects in a SQLite database.
  • Log the number of excluded, reconsidered, and added-back samples per iteration.

@szhan
Contributor Author

szhan commented Jul 24, 2024

Sorry, I had to fix adding more reconsidered samples. #195

@szhan
Contributor Author

szhan commented Jul 25, 2024

The trees built using the code from PR #196 contain the four Alpha samples from Sep. 2020. Also, by Dec. 31, 2020, there are 22,136 Alpha samples. So, it is working, I think.

EDIT: Note also that these trees contain 186,951, or ~70%, of the 2020 samples.

@szhan
Contributor Author

szhan commented Jul 29, 2024

Quick update. Trees built from the samples collected up to and including Feb. 19, 2021 (using a max HMM cost of 5, a min group size of 2, and reconsidering the past 5 days of excluded samples) contain XA as a recombinant. I'll take a closer look at these trees.
