Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Identify sensible max HMM cost thresholds #188

Open
szhan opened this issue Jul 2, 2024 · 8 comments
Open

Identify sensible max HMM cost thresholds #188

szhan opened this issue Jul 2, 2024 · 8 comments

Comments

@szhan
Copy link
Contributor

szhan commented Jul 2, 2024

To explore how applying a max HMM cost threshold affects ARG building, I've run sc2ts on all the 2020 samples (with full precision collection dates) using varying thresholds (with k = 3). The threshold values are None (the baseline for comparison), 20, 15, 10, and 5.

First, a base ARG was built up to Jan. 31 2020 (inclusive) without a max threshold. Then, that base ARG is extended while applying a max threshold.

High-level comparison of the ARGs

The table below shows a high-level summary of the differences between the ARGs.

threshold num_samples num_nodes num_edges num_mutations diff_samples diff_nodes diff_edges diff_mutations
None 173216 195179 195252 141778 0 0 0 0
20 173191 195130 195200 141460 25 49 52 318
15 173108 195010 195074 140561 108 169 178 1217
10 172483 194067 194118 136062 733 1112 1134 5716
5 169428 189673 189703 123736 3788 5506 5549 18042

diff_* denotes the number of samples, nodes, edges, and mutations reduced compared to the baseline (None).

There are a couple of big jumps going from threshold 15 to 10 and then from 10 to 5. The threshold may be too high at 10 already. I'm not getting much more out of looking at this.

Time-travelling Alpha sample

The sample ERR6713815 seems to be a time-travelling case of Alpha (B.1.1.7) in the Viridian dataset. It is labelled as B.1.1.7 according to the column Viridian_pangolin in the metadata, and it has a collection date of March 01, 2020. The Alpha variant was first identified in Sep. 2020 (ref).

In the ARG built using a threshold of 5, the sample's path has a single parent, but it has 29 mutations, so its HMM cost is 29. Note that none of the mutations are flagged as (immediate) reversions. The time-travelling sample should be filtered out, but the thresholding only applies to newly added samples that are classified as singletons in terms of attachment path and set of immediate reversions. Here, the time traveller was not classified as a singleton because it shares the same parent with other newly added samples which also have no immediate reversions (actually, no mutations at all).

Maybe we have to change how we group newly added samples.

@szhan szhan changed the title Identify max HMM cost threshold Identify sensible max HMM cost thresholds Jul 2, 2024
@szhan
Copy link
Contributor Author

szhan commented Jul 2, 2024

After some discussion, we decided that applying a HMM cost filter after grouping newly added samples isn't quite right, for example, it doesn't exclude the Alpha time traveller above.

For now, we will separate HMM cost filtering from match grouping. The next step is to see what samples get filtered out per day using a fairly strict threshold (max threshold of 5, which seems to be reducing the number of mutations by quite a bit) before improving the filtering strategy.

@szhan
Copy link
Contributor Author

szhan commented Jul 8, 2024

In the latest run, samples are attached to a base ARG (built up to Jan. 31, 2020) as described above using a max HMM cost threshold of 5. Samples excluded due to this threshold are stored in pickle files.

The time traveller Alpha sample (ERR6713815; pointed out above) is gone, but so are the other Alpha samples that are supposed to show up in Sep./Oct. In an ARG built up to Dec. 10, 2020, no Alpha samples have been attached. The threshold is just too strict, I think.

Below is a table showing the number of samples excluded on certain dates. It is pretty clear that an increasing number of samples are excluded over time.

date number of excluded samples on the date
2020-02-29 0
2020-03-31 5
2020-04-30 1
2020-05-31 0
2020-06-30 34
2020-07-31 28
2020-08-31 249
2020-09-30 357
2020-10-31 336
2020-11-30 312
2020-12-10 1358

Also, the pickling step is taking quite some time, it seems, because the run is taking at least one and a half day longer than without pickling the excluded samples.

@jeromekelleher
Copy link
Owner

Great, I think this is working as expected. Now we need to think of a simple way of aggregating the "kicked out" samples over several days, and see if we get a sufficiently large grouping clustering under the same node with ~ the same mutations. That's the signal of a new lineage.

@szhan
Copy link
Contributor Author

szhan commented Jul 9, 2024

There are supposed to be three Jackson et al. recombinants sampled in Dec. 2020 (see Table 1).

recombinant number of samples sampling date
Group B 2 2020/12/23–2020/12/24
CAMC-CBA018 1 2020/12/18
MILK-103C712 1 2020/12/18

I don't think they are among the recombinants detected in an ARG (up to Dec. 30, 2020) built using the max HMM cost threshold of 5.

EDIT: The Jackson et al. recombinants have Alpha as a parent. Since Alpha was not added to this ARG, the recombinants are not present.

@szhan
Copy link
Contributor Author

szhan commented Jul 25, 2024

Quick update. In the latest test run using the code from PR #196, four of the Alpha samples collected in Sep. 2020 were added to the trees. Also, we identified two of the Jackson et al. recombinants (the two Group B sequences) as recombinants in our trees, but the two singletons (which have different breakpoints) were not added (likely due to our strict HMM cost).

Because the metadata file of the Viridian dataset does not have strain names or the EPI IDs, I had to manually search for the ENA accessions of the Jackson et al. recombinants. Thankfully, the strain names are in the description text of the ENA entries, and so I was able to get the corresponding ENA accessions. They are below:

ENA accession strain name group
ERR5058070 QEUH-CCCB30 Group B
ERR5058141 QEUH-CD0F1F Group B
ERR5054123 CAMC-CBA018 singleton
ERR5054082 CAMC-CB7AB3 singleton

So, it looks like the new approach is working, even though singleton, evolutionary unsuccessful recombinants may not be added.

I'll do a bit more testing and also run sc2ts to Feb. 2021 when the XA recombinant emerges.

@jeromekelleher
Copy link
Owner

Awesome. Looks like it's working!

@szhan
Copy link
Contributor Author

szhan commented Jul 29, 2024

As mentioned in this comment, we can pick up XA in Feb. 2021. So, combined with reconsidering samples, this approach seems to be working.

@szhan
Copy link
Contributor Author

szhan commented Aug 9, 2024

Since we have been checking whether the new trees identify the Jackson et al. recombinants, it is probably useful to have their ENA run accessions (used as the unique sequence identifiers in our latest runs) and strain names here. I have extracted the ENA accessions and strain names from Table S2 of the Jackson et al. paper below.

ENA run accession strain name group sampling date
ERR5308556 ALDP-11CF93B Group A 2021-01-30
ERR5323237 ALDP-125C4D7 Group A 2021-02-06
ERR5414941 ALDP-130BB95 Group A 2021-02-21
ERR5404883 LIVE-DFCFFE Group A 2021-02-14
ERR5058070 QEUH-CCCB30 Group B 2020-12-23
ERR5058141 QEUH-CD0F1F Group B 2020-12-24
ERR5272107 MILK-1166F52 Group C 2021-01-24
ERR5406307 MILK-11C95A6 Group C 2021-01-30
ERR5232711 QEUH-109B25C Group C 2021-01-18
ERR5349458 MILK-126FE1F Group D 2021-02-07
ERR5335088 RAND-12671E1 Group D 2021-02-02
ERR5340986 RAND-128FA33 Group D 2021-02-02
ERR5054123 CAMC-CBA018 singleton 2020-12-18
ERR5054082 CAMC-CB7AB3 singleton 2020-12-18
ERR5304348 MILK-103C712 singleton 2021-01-12
ERR5238288 QEUH-1067DEF singleton 2021-01-17

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants