-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reconsidering filtered samples from the past N consecutive days #193
Comments
After grouping reconsidered samples by their paths and immediate reversions, for each group:
@jeromekelleher I think this sounds about right? I think we should keep the immediate reversions here. |
I was imagining that we'd reuse the same local tree building technology as we have in the usual attachment case. I guess we'd have to change node times, but otherwise should work? |
|
We have decided to handle sampling node dates at a later issue. The focus now is to see whether this strategy adds the early Alpha samples. |
I think it is working! Using min group size of 2 and looking at the past 5 days worth of filtered samples, these three Alpha samples collected in early Sep. 2020 got attached on 2020-09-25.
|
Wahey! How much other stuff gets attached at the same time? |
As in, how many other samples also get added back in besides the ones we think should be added in? |
Hmm, it seems like quite a lot of samples are being added back in.
|
The numbers above are misleading, because they count excluded samples that may be added back in already. Still, it seems like samples (besides the ones we think should be added back in) are being added back in. |
One thing clearly missing is to make sure that the same excluded samples don't get added back in multiple times. |
There are five Alpha samples collected in Sep. 2020 in the Viridian dataset. The three above are added. I'm looking at the other two samples both collected on 2020-09-30. One got added in but the other didn't, and for some reason the other one is not in the excluded samples file. I think all five Alpha samples should be added by 2020-10-01, so I'm investigating this further before doing anything else. |
Ah, the sample (ERR5071073) is listed in the metadata, but not in the input alignments because it got filtered out for having too many Ns in the consensus sequence. Okay, I think the four Alpha samples that should be there are added in. |
Also, this function needs to be modified. def last_date(ts):
if ts.num_samples == 0:
# Special case for the initial ts which contains the
# reference but not as a sample
u = ts.num_nodes - 1
else:
u = ts.samples()[-1]
node = ts.node(u)
assert node.time == 0
return parse_date(node.metadata["date"]) It is getting the last date among the samples based on the last node added. In the previous pipeline where filtered samples are not reconsidered, then it is getting the latest date among the samples. But when we add samples from the past N days, then I don't think it is returning the latest date among the samples anymore. |
Ah, good catch. I guess this is a motivation for getting the time of the extra added sample nodes correct. Then, we can define the
|
To get stuff working, what you could do is:
Slow, but will work ok for prototyping. |
I was thinking the former. def last_date(ts):
if ts.num_samples == 0:
# Special case for the initial ts which contains the
# reference but not as a sample
u = ts.num_nodes - 1
else:
samples = ts.samples()
u = samples[ts.nodes_time[samples] == 0][0]
node = ts.node(u)
assert node.time == 0
return parse_date(node.metadata["date"]) |
Yeah, that works, but you'll have samples in there with time=0 but aren't from the most recent day. |
Ah, okay. Gotta fix that later too. |
It may be useful to promote collection date to an attribute in the Sample class rather than keeping it as an entry in the metadata, because we are accessing the collection date pretty often. EDIT: Probably will do this later, since I don't feel like changing it and then needing to update other parts of the code. |
Okay, the list of reconsidered samples are now being updated FIFO. The thing left to do is to make sure the times of the excluded samples added back in are not the time when they are added. |
Should we tackle the node times of the added excluded samples in a separate issue? |
Also, I think the CLI command |
I think we probably want to use a SQLite DB to store the excluded samples. This will make it easier to keep track of what's in there, and which ones have been inserted back in. I can do this if you have a working prototype based on pickle files. |
That's up to you - if the code is working well enough for evaluation purposes, I'm happy to kick the node dating can down the road. |
I'll tackle the node times in a separate issue right after PR #194 gets merged. One reason I want to tackle it a bit later is because I want to get new 2020 trees for Yan's workshop ASAP and so want to run what we got now. |
Also, we should probably log the number of excluded samples, reconsidered samples, and added samples per round of daily extension. |
Like we store the Sample objects as BLOBs and query them by collection date like we do with sample metadata? |
Pretty much. |
I think these are the TODO items before closing this issue.
|
Sorry, I had to fix adding more reconsidered samples. #195 |
The trees built using the code with PR #196 have the four Alpha samples in Sep. 2020. Also, by Dec. 31, 2020, there are 22,136 Alpha samples. So, it is working, I think. EDIT: Note also that these trees contain 186,951, or ~70%, of the 2020 samples. |
Quick update. Trees built (using max HMM cost of 5, min group size of 2, and reconsider past 5 days of excluded samples) from the samples collected up to and including Feb. 19, 2021 contain XA as a recombinant. I'll take a closer look at these trees. |
An inappropriate HMM cost threshold may filter out epidemiologically relevant variants, which is undesirable (e.g. see #188). But , in practice, it is not easy to figure out, or clear, what an appropriate threshold value should be.
To see if we can work around this issue, we are trying a strategy where samples filtered out by the HMM cost threshold from the past N days are reconsidered in a post-processing step at the end of a round of daily extension.
The procedure is as follows:
add_matching_results
.A good place to do this post-processing step may after calling
add_matching_results
while callingextend
.If this works, then we should see the first Alpha samples added in Sep., 2020, even though they were filtered out by the HMM cost filter.
The text was updated successfully, but these errors were encountered: