Fix: Avoid nam id collisions that reduce accuracy #350

marcelm · 2023-09-28T14:44:38Z

The nam_id started from 0 both for forward and reverse hits, which resulted in NAM ids being reused for different NAMs. This regression was introduced by commit 91f2223 and reduced accuracy.

This PR brings accuracy back to the expected levels (that is, higher than v0.11.0 due to tuned parameters).

I will need to re-run the parameter evaluation ...

See #345

Comparison

for a couple of datasets

dataset	v0.11.0 (`93be0c2`)	this PR (`a290e7b`)	difference
drosophila-50-pe	90.1908	90.499	+0.3082
drosophila-100-pe	92.3875	92.40435	+0.0168
drosophila-150-pe	93.21175	93.2124	+0.0007
drosophila-200-pe	93.5184	93.54275	+0.0243
drosophila-300-pe	95.3617	95.37225	+0.0105
maize-50-pe	71.47185	72.9601	+1.4882
maize-100-pe	87.13235	87.2842	+0.1519
CHM13-50-pe	90.6348	91.1365	+0.5017
CHM13-100-pe	93.2195	93.2809	+0.0614
CHM13-150-pe	94.14135	94.1508	+0.0094
CHM13-200-pe	94.43375	94.43975	+0.0060
CHM13-300-pe	95.62465	95.63985	+0.0152
rye-50-pe	69.1411	70.8096	+1.6685
rye-100-pe	85.6489	85.73105	+0.0821
rye-150-pe	90.20365	90.23185	+0.0282
rye-200-pe	91.47915	91.50855	+0.0294

This fixes a regression introduced by commit 91f2223 and brings accuracy to the expected levels (higher than v0.11.0 due to tuned parameters). See #345

ksahlin · 2023-09-28T15:32:36Z

Great that you found it and fixed it so quickly!

Yes, a bit painful to rerun but exciting to see what the new optimization will arrive at. I guess the numbers reported above are close to minimum improvement we could arrive at (if you use 0.25SE + 0.75PE as objective function).

ksahlin · 2023-09-28T15:37:50Z

Approved btw. If we want to take the new accuracy numbers discussion somewhere else.

marcelm · 2023-09-29T13:04:12Z

We can take the discussion here.

Here are the new numbers for read length 50, 75, 100 and 150. For them, the same parameter settings as before maximize accuracy.

For canonical read length 250, the best parameters are now (23, 19, 4, 14) (marked 250* in the table below).

I copied the table from #345 and updated it. The single-end numbers don’t change. I haven’t had time to understand why that is the case.

Canonical read length	Current parameters	suggested parameters	SE improvement	PE improvement
50	(20, 16, -3, 2)	(18, 14, -2, -2)	+2.478	~~+1.116~~ +0.987
75	(20, 16, -3, 2)	(20, 16, -2, -2)	+0.185	~~+0.288~~ +0.225
100	(20, 16, -2, 2)	(20, 16, -1, -1)	+0.070	~~+0.137~~ +0.084
150	(20, 16, 1, 7)	(20, 16, 1, 7)	already best
250	(20, 16, 4, 13)	(22, 18, 4, 14)	+0.028	~~+0.020~~ +0.022
250*		(23, 19, 4, 14)	+0.049	+0.025
300	(22, 18, 2, 12)	(22, 18, 3, 13)	+0.033	~~+0.013~~ +0.018

For read length 300, there’s a whole bunch of better parameter settings. The best is (24, 20, 6, 13), but this increases $l$ quite a bit. I’ll just dump the list here for now:

k	s	l	u	SE improvement	PE improvement	comment
24	20	6	13	+0.055	+0.036
23	19	8	13	+0.043	+0.032
23	19	5	13	+0.055	+0.028
24	20	4	13	+0.054	+0.028
23	19	6	12	+0.042	+0.031
23	19	6	13	+0.053	+0.025
22	18	6	12	+0.038	+0.029
23	19	4	12	+0.036	+0.028
22	18	6	13	+0.046	+0.025
24	20	2	13	+0.031	+0.028
23	19	6	11	+0.027	+0.027
24	20	4	12	+0.034	+0.025
23	19	5	11	+0.024	+0.024
23	19	7	11	+0.018	+0.023
22	18	3	13	+0.033	+0.018	chosen in #345
21	17	6	13	+0.021	+0.02
23	19	2	13	+0.02	+0.02
21	17	7	13	+0.022	+0.019
23	19	3	13	+0.038	+0.014
22	18	5	13	+0.04	+0.012
22	18	4	12	+0.024	+0.015
23	19	4	11	+0.026	+0.014
22	18	4	13	+0.032	+0.011
23	19	3	12	+0.017	+0.012
22	18	2	13	+0.019	+0.008
22	18	2	12	0.0	0.0	current (baseline)

All of these numbers are relatively small, so maybe it doesn’t matter too much.

marcelm · 2023-10-02T07:46:49Z

I’ve updated my previous comment. It now reports numbers for all read length except 400 (which is still running and which you didn’t want to change anyway).

ksahlin · 2023-10-02T08:12:15Z

Okay, I don't have any idea why results would be the same for SE either. I am okay with the suggestions above, but remain skeptical at increasing l too much for the read lengths tested (for longer reads it might be ok). I think the current suggested 300 setting is okay.

For another discussion: I wanted to check if you ever tested different q? I was thinking if we should remove the whole masking and Bitcount() it since it uses popcount which seems expensive (I can show you some results from another study offline). q and Bitcount() were initially introduced to skew the sampling towards shorter seeds, which positively affected the shortest reads (=< 100ish), but with above settings, read lengths 50-100 does not even need this computation.

Let me know when you have a new commit to test.

marcelm · 2023-10-02T09:11:11Z

Okay, I don't have any idea why results would be the same for SE either. I am okay with the suggestions above, but remain skeptical at increasing l too much for the read lengths tested (for longer reads it might be ok). I think the current suggested 300 setting is okay.

Can you confirm that parameter settings for canonical read length 300 should remain (22, 18, 3, 13) as decided upon in #345? (Just asking because #345 is already merged, which makes "current suggested" ambiguous.)

For another discussion: I wanted to check if you ever tested different q?

No, can you open a new issue for that?

As discussed in #350

ksahlin · 2023-10-02T09:15:06Z

Can you confirm that parameter settings for canonical read length 300 should remain (22, 18, 3, 13) as decided upon in #345?

Yes, I confirm that (22, 18, 3, 13) is good for 300.

No, can you open a new issue for that?

Will do.

Fix: Avoid nam id collisions that reduce accuracy

a290e7b

This fixes a regression introduced by commit 91f2223 and brings accuracy to the expected levels (higher than v0.11.0 due to tuned parameters). See #345

marcelm mentioned this pull request Sep 28, 2023

Optimize parameters #345

Merged

3 tasks

marcelm added a commit that referenced this pull request Oct 2, 2023

Slightly improve accuracy for read length 250

655e58d

As discussed in #350

marcelm mentioned this pull request Oct 2, 2023

Slightly improve accuracy for read length 250 #351

Merged

marcelm merged commit 2f6e78f into main Oct 2, 2023
9 checks passed

marcelm deleted the fix-accuracy branch October 2, 2023 09:17

ksahlin mentioned this pull request Oct 2, 2023

Removing parameter q and bitcount? #352

Open

marcelm mentioned this pull request Nov 13, 2023

Baseline comparison CI job passes unexpectedly #361

Closed

marcelm mentioned this pull request Nov 23, 2023

Release 0.12.0 #349

Closed

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: Avoid nam id collisions that reduce accuracy #350

Fix: Avoid nam id collisions that reduce accuracy #350

marcelm commented Sep 28, 2023

ksahlin commented Sep 28, 2023

ksahlin commented Sep 28, 2023

marcelm commented Sep 29, 2023 •

edited

Loading

marcelm commented Oct 2, 2023

ksahlin commented Oct 2, 2023 •

edited

Loading

marcelm commented Oct 2, 2023

ksahlin commented Oct 2, 2023

Fix: Avoid nam id collisions that reduce accuracy #350

Fix: Avoid nam id collisions that reduce accuracy #350

Conversation

marcelm commented Sep 28, 2023

Comparison

ksahlin commented Sep 28, 2023

ksahlin commented Sep 28, 2023

marcelm commented Sep 29, 2023 • edited Loading

marcelm commented Oct 2, 2023

ksahlin commented Oct 2, 2023 • edited Loading

marcelm commented Oct 2, 2023

ksahlin commented Oct 2, 2023

marcelm commented Sep 29, 2023 •

edited

Loading

ksahlin commented Oct 2, 2023 •

edited

Loading