Reconsidering filtered samples from the past N consecutive days #193

Open

szhan opened this issue Jul 20, 2024 · 32 comments

Comments

@szhan
Contributor

szhan commented Jul 20, 2024

An inappropriate HMM cost threshold may filter out epidemiologically relevant variants, which is undesirable (e.g. see #188). But, in practice, it is not clear what an appropriate threshold value should be.

To see if we can work around this issue, we are trying a strategy where samples filtered out by the HMM cost threshold from the past N days are reconsidered in a post-processing step at the end of a round of daily extension.

The procedure is as follows (a rough code sketch is given after the list):

  • Given N, fetch the samples (and their already inferred paths and mutations) from the past N consecutive days' worth of pickled files. Days with no filtered samples do not contribute to this pool of reconsidered samples.
  • Group the reconsidered samples by their paths and immediate reversions as done in add_matching_results.
  • The samples in groups of size greater than k are attached to the ARG based on their inferred paths (i.e. without needing to rerun HMM matching).
  • After proceeding to the following day, the oldest filtered samples that were not attached to the ARG are dropped from the pool in a FIFO fashion.
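
Assuming filtered samples are pickled per day and carry path and immediate_reversions attributes (the file naming and attribute names here are illustrative, not the actual sc2ts API), the pooling and grouping steps could look like this:

import collections
import datetime
import pathlib
import pickle


def load_reconsidered_samples(excluded_dir, current_date, num_days):
    # Pool filtered samples from the past num_days days' pickle files;
    # days with no file simply contribute nothing to the pool.
    pool = []
    for offset in range(1, num_days + 1):
        day = current_date - datetime.timedelta(days=offset)
        path = pathlib.Path(excluded_dir) / f"excluded_{day.isoformat()}.pkl"
        if path.exists():
            with path.open("rb") as f:
                pool.extend(pickle.load(f))
    return pool


def group_reconsidered_samples(samples, min_group_size=2):
    # Group by (path, immediate reversions), as in add_matching_results,
    # and keep only groups with at least min_group_size members.
    # Assumes paths and reversions are already hashable tuples.
    groups = collections.defaultdict(list)
    for sample in samples:
        key = (tuple(sample.path), tuple(sample.immediate_reversions))
        groups[key].append(sample)
    return {k: v for k, v in groups.items() if len(v) >= min_group_size}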

A good place to do this post-processing step may be after calling add_matching_results inside extend.

If this works, then we should see the first Alpha samples added in Sep. 2020, even though they were filtered out by the HMM cost filter.

@szhan
Contributor Author

szhan commented Jul 21, 2024

After grouping reconsidered samples by their paths and immediate reversions, for each group:

  • Insert a new node (set its time arbitrarily to be slightly earlier than that of the oldest reconsidered sample).
  • Add new edges connecting the new node to the shared parent nodes of the grouped samples as per their shared path.
  • Add the shared mutations (including immediate reversions) above the new node.
  • Attach each of the reconsidered samples to the new node via new edges.

@jeromekelleher I think this sounds about right? I think we should keep the immediate reversions here. A rough sketch of these steps follows.
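
For illustration only, a literal translation into tskit table operations (group, group.path, group.mutations, and group.samples are placeholders, and the real code would presumably reuse the existing local tree building machinery rather than doing this by hand):

import tskit


def attach_group(tables, group, group_time):
    # group_time: arbitrarily set slightly older than the oldest
    # reconsidered sample in the group (proper node dating deferred).
    new_node = tables.nodes.add_row(flags=0, time=group_time)

    # New edges connecting the new node to the shared parent nodes,
    # following the group's shared path of (left, right, parent) segments.
    for left, right, parent in group.path:
        tables.edges.add_row(left=left, right=right, parent=parent, child=new_node)

    # Shared mutations (including immediate reversions) go above the new node.
    for site_id, derived_state in group.mutations:
        tables.mutations.add_row(site=site_id, node=new_node, derived_state=derived_state)

    # Attach each reconsidered sample to the new node (metadata omitted;
    # sample node times are a separate issue).
    for sample in group.samples:
        child = tables.nodes.add_row(flags=tskit.NODE_IS_SAMPLE, time=0)
        tables.edges.add_row(
            left=0, right=tables.sequence_length, parent=new_node, child=child
        )

    # Edges must be sorted again before the tables are turned back
    # into a tree sequence.
    tables.sort()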

@jeromekelleher
Owner

I was imagining that we'd reuse the same local tree building technology as we have in the usual attachment case. I guess we'd have to change node times, but otherwise should work?

@szhan
Contributor Author

szhan commented Jul 22, 2024

add_matching_results has been modified to take the argument min_group_size to exclude grouped matches having fewer than a specified number of samples.

@szhan
Contributor Author

szhan commented Jul 22, 2024

We have decided to handle sample node dates in a later issue. The focus now is to see whether this strategy adds the early Alpha samples.

@szhan
Contributor Author

szhan commented Jul 23, 2024

I think it is working! Using a min group size of 2 and looking at the past 5 days' worth of filtered samples, these three Alpha samples collected in early Sep. 2020 got attached on 2020-09-25.

strain       date
ERR4659819   2020-09-20
ERR4682028   2020-09-21
ERR5217634   2020-09-23

@jeromekelleher
Owner

Wahey! How much other stuff gets attached at the same time?

@jeromekelleher
Owner

As in, how many other samples also get added back in besides the ones we think should be added in?

@szhan
Contributor Author

szhan commented Jul 23, 2024

Hmm, it seems like quite a lot of samples are being added back in.

date         filtered samples added back
2020-09-24   1750
2020-09-25   2670
2020-09-26   2423
2020-09-27   1868
2020-09-28   1489
2020-09-29   1175
2020-09-30     42
2020-10-01     43

@szhan
Contributor Author

szhan commented Jul 23, 2024

The numbers above are misleading, because they count excluded samples that may have already been added back in. Still, it seems like samples besides the ones we think should be added back in are being added back.

@szhan
Contributor Author

szhan commented Jul 23, 2024

One thing clearly missing is to make sure that the same excluded samples don't get added back in multiple times.

@szhan
Contributor Author

szhan commented Jul 23, 2024

There are five Alpha samples collected in Sep. 2020 in the Viridian dataset. The three above have been added. I'm looking at the other two samples, both collected on 2020-09-30. One got added in but the other didn't, and for some reason that one is not in the excluded samples file. I think all five Alpha samples should be added by 2020-10-01, so I'm investigating this further before doing anything else.

@szhan
Contributor Author

szhan commented Jul 23, 2024

Ah, the sample (ERR5071073) is listed in the metadata but not in the input alignments, because it got filtered out for having too many Ns in the consensus sequence. Okay, I think the four Alpha samples that should be there have been added in.

@szhan
Contributor Author

szhan commented Jul 23, 2024

Also, this function needs to be modified.

def last_date(ts):
    if ts.num_samples == 0:
        # Special case for the initial ts which contains the
        # reference but not as a sample
        u = ts.num_nodes - 1
    else:
        u = ts.samples()[-1]
    node = ts.node(u)
    assert node.time == 0
    return parse_date(node.metadata["date"])

It gets the date from the last node added, assuming that node is a sample from the latest day. In the previous pipeline, where filtered samples were not reconsidered, this did give the latest date among the samples. But when we add back samples from the past N days, I don't think it returns the latest date among the samples anymore.

@jeromekelleher
Owner

Ah, good catch. I guess this is a motivation for getting the time of the extra added sample nodes correct. Then, we can define the last_date as

samples = ts.samples()
time_0 = samples[ts.nodes_time[samples] == 0]
node = ts.node(time_0[0])  # Arbitrarily pick the first time-zero sample

@jeromekelleher
Owner

To get stuff working, what you could do is:

max(parse_date(node.metadata["date"]) for node in ts.nodes())

Slow, but will work ok for prototyping.

@szhan
Contributor Author

szhan commented Jul 23, 2024

I was thinking the former.

def last_date(ts):
    if ts.num_samples == 0:
        # Special case for the initial ts which contains the
        # reference but not as a sample
        u = ts.num_nodes - 1
    else:
        samples = ts.samples()
        u = samples[ts.nodes_time[samples] == 0][0]
    node = ts.node(u)
    assert node.time == 0
    return parse_date(node.metadata["date"])

@jeromekelleher
Owner

Yeah, that works, but you'll have samples in there with time=0 that aren't from the most recent day.

@szhan
Contributor Author

szhan commented Jul 23, 2024

Ah, okay. Gotta fix that later too.
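
One possible shape for that later fix (a sketch only, assuming every time-zero sample has a "date" metadata field) is to take the maximum date over the time-zero samples rather than picking one arbitrarily:

def last_date(ts):
    if ts.num_samples == 0:
        # Special case for the initial ts which contains the
        # reference but not as a sample
        return parse_date(ts.node(ts.num_nodes - 1).metadata["date"])
    samples = ts.samples()
    time_zero = samples[ts.nodes_time[samples] == 0]
    # Added-back samples also sit at time zero, so take the latest
    # collection date instead of an arbitrary time-zero sample.
    return max(parse_date(ts.node(u).metadata["date"]) for u in time_zero)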

@szhan
Contributor Author

szhan commented Jul 23, 2024

It may be useful to promote the collection date to an attribute of the Sample class rather than keeping it as an entry in the metadata, because we access the collection date pretty often.

EDIT: Probably will do this later, since I don't feel like changing it and then needing to update other parts of the code.
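
If we do go that way, the change could be as small as something like this (purely illustrative; the real Sample class has more fields):

import dataclasses
import datetime


@dataclasses.dataclass
class Sample:
    strain: str
    date: datetime.date  # promoted from metadata["date"]
    metadata: dict = dataclasses.field(default_factory=dict)
    path: list = dataclasses.field(default_factory=list)
    mutations: list = dataclasses.field(default_factory=list)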

@szhan
Contributor Author

szhan commented Jul 23, 2024

Okay, the list of reconsidered samples is now being updated FIFO. The thing left to do is to make sure the times of the excluded samples added back in are not simply the time at which they were added.
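
The FIFO bookkeeping is roughly this (a sketch; filter_date and strain are assumed attribute names, and added_strains tracks samples already attached so they don't get added back twice):

import datetime


def update_reconsidered_pool(pool, current_date, num_days, added_strains):
    # FIFO expiry: keep only samples filtered within the past num_days
    # days, and drop any that have already been added back to the ARG.
    cutoff = current_date - datetime.timedelta(days=num_days)
    return [
        s for s in pool
        if s.filter_date > cutoff and s.strain not in added_strains
    ]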

@szhan
Contributor Author

szhan commented Jul 23, 2024

Should we tackle the node times of the added excluded samples in a separate issue?

@szhan
Contributor Author

szhan commented Jul 23, 2024

Also, I think the CLI command daily-extend should take excluded_sample_dir as a separate argument from output_prefix.

@jeromekelleher
Owner

I think we probably want to use a SQLite DB to store the excluded samples. This will make it easier to keep track of what's in there, and which ones have been inserted back in. I can do this if you have a working prototype based on pickle files.

@jeromekelleher
Owner

Should we tackle the node times of the added excluded samples in a separate issue?

That's up to you - if the code is working well enough for evaluation purposes, I'm happy to kick the node dating can down the road.

@szhan
Contributor Author

szhan commented Jul 24, 2024

I'll tackle the node times in a separate issue right after PR #194 gets merged. One reason I want to tackle it a bit later is that I want to get new 2020 trees for Yan's workshop ASAP, so I want to run what we have now.

@szhan
Contributor Author

szhan commented Jul 24, 2024

Also, we should probably log the number of excluded samples, reconsidered samples, and added samples per round of daily extension.

@szhan
Contributor Author

szhan commented Jul 24, 2024

I think we probably want to use a SQLite DB to store the excluded samples. This will make it easier to keep track of what's in there, and which ones have been inserted back in. I can do this if you have a working prototype based on pickle files.

As in, we store the Sample objects as BLOBs and query them by collection date, like we do with the sample metadata?

@jeromekelleher
Owner

Pretty much.
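
A sketch of what that could look like, with pickled Sample objects stored as BLOBs and queried by collection date (the schema and column names are assumptions, not the eventual implementation):

import pickle
import sqlite3


def create_excluded_db(path):
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS excluded_samples (
            strain TEXT PRIMARY KEY,
            date TEXT,
            added_back INTEGER DEFAULT 0,
            pkl BLOB
        )
        """
    )
    return conn


def add_excluded(conn, sample):
    # Store the whole Sample object as a pickled BLOB, keyed by strain.
    conn.execute(
        "INSERT OR REPLACE INTO excluded_samples (strain, date, pkl) VALUES (?, ?, ?)",
        (sample.strain, sample.date, pickle.dumps(sample)),
    )
    conn.commit()


def get_reconsidered(conn, start_date, end_date):
    # Fetch samples filtered in [start_date, end_date] that have not
    # yet been added back to the ARG.
    rows = conn.execute(
        "SELECT pkl FROM excluded_samples "
        "WHERE date BETWEEN ? AND ? AND added_back = 0",
        (start_date, end_date),
    )
    return [pickle.loads(r[0]) for r in rows]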

@szhan
Contributor Author

szhan commented Jul 24, 2024

I think these are the TODO items before closing this issue.

  • Fix the node times of added-back samples.
  • Store pickled Sample objects in a SQLite database.
  • Log the number of excluded, reconsidered, and added-back samples per iteration.

@szhan
Contributor Author

szhan commented Jul 24, 2024

Sorry, I had to fix adding more reconsidered samples. #195

@szhan
Contributor Author

szhan commented Jul 25, 2024

The trees built using the code from PR #196 contain the four Alpha samples from Sep. 2020. Also, by Dec. 31, 2020, there are 22,136 Alpha samples. So, it is working, I think.

EDIT: Note also that these trees contain 186,951, or ~70%, of the 2020 samples.

@szhan
Contributor Author

szhan commented Jul 29, 2024

Quick update. Trees built from the samples collected up to and including Feb. 19, 2021 (using a max HMM cost of 5, a min group size of 2, and reconsidering the past 5 days of excluded samples) contain XA as a recombinant. I'll take a closer look at these trees.
