Optimize parameters (again) #407

Open
marcelm opened this issue Mar 12, 2024 · 4 comments

@marcelm
Collaborator

marcelm commented Mar 12, 2024

Here are suggested new indexing parameters for all read lengths.

This supersedes #397.

I ran the optimization script for both v0.12.0 (commit 6fd4c5d) and multi-context seeds (commit c4a7f61).

Differences from #397:

  • "accuracy slack" is set to 0.1: The accuracy of a single dataset may drop up to 0.1 percentage points below the baseline without the parameter setting being excluded from further consideration. This is intended to avoid getting stuck in local maxima.
  • Optimization criterion is regular accuracy, not score-based accuracy.
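
As a rough illustration of the slack rule described above, the following sketch keeps a candidate only if no single dataset's accuracy falls more than `slack` percentage points below the baseline. The function name, the dict layout, the dataset names and their numbers are all hypothetical; the actual logic in `search.py` may differ.

```python
def within_slack(candidate, baseline, slack=0.1):
    """Return True if every dataset's accuracy is at most `slack`
    percentage points below the baseline accuracy for that dataset.

    `candidate` and `baseline` map dataset name -> accuracy in percent.
    (Hypothetical data layout, for illustration only.)
    """
    return all(candidate[name] >= baseline[name] - slack for name in baseline)


# Made-up accuracies (percent) for two made-up datasets
baseline = {"dataset_a": 90.00, "dataset_b": 80.00}
slightly_worse = {"dataset_a": 89.95, "dataset_b": 80.40}  # drop of 0.05: kept
much_worse = {"dataset_a": 89.80, "dataset_b": 81.00}      # drop of 0.20: excluded

print(within_slack(slightly_worse, baseline))  # True
print(within_slack(much_worse, baseline))      # False
```

Note that the second candidate is excluded even though it improves the other dataset by a full point, because the rule is applied per dataset.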

Command used:

./search.py -c ${commit} -x --accuracy-slack 0.1 --mapping-rate-slack 1 -r ${read_length}
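
For reference, a small sketch that expands the `${commit}` and `${read_length}` placeholders into concrete command lines. The commit hashes come from the text above and the read lengths are the ones measured in the table below; whether the runs were actually launched in a loop like this is an assumption, and the sketch only prints the commands without executing them.

```python
import shlex

COMMITS = ["6fd4c5d", "c4a7f61"]  # v0.12.0 and multi-context seeds
READ_LENGTHS = [50, 75, 100, 150, 200, 300, 500]  # measured read lengths


def build_commands(commits, read_lengths):
    """Return one search.py command line (as a string) per commit and read length."""
    commands = []
    for commit in commits:
        for read_length in read_lengths:
            argv = ["./search.py", "-c", commit, "-x",
                    "--accuracy-slack", "0.1",
                    "--mapping-rate-slack", "1",
                    "-r", str(read_length)]
            commands.append(shlex.join(argv))
    return commands


for command in build_commands(COMMITS, READ_LENGTHS):
    print(command)
```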

Suggested changes

Parameters are given as a tuple $(k, s, l, u)$.

I did not mechanically pick the settings that optimize mapping-only accuracy, but made sure that they also work well for extension alignment mode. Many parameter settings are found that are essentially equally good, so it was possible for me to find settings that work equally well for v0.12.0 and multi-context seeds, except for read lengths 100 and 150.

| Read length | Before | Suggestion | Comment |
|---|---|---|---|
| 50 | (18, 14, -2, 1) | (16, 12, -2, 0) | |
| 75 | (20, 16, -3, 2) | (20, 16, -3, -1) | alternative: (21, 17, -3, 1) |
| 100 | (20, 16, -2, 2) | (16, 12, 1, 3) | for v0.12. Alternative (17, 13, 1, 3) is very similar |
| 100 | (20, 16, -2, 2) | (18, 14, 1, 3) | for multi-context seeds |
| 125 | (20, 16, -1, 4) | - | not measured |
| 150 | (20, 16, 1, 7) | (20, 16, 2, 5) | for v0.12. Reduces ext. alignment SE accuracy slightly; alternative (20, 16, 2, 8) would not (but improve mapping-only PE accuracy much less) |
| 150 | (20, 16, 1, 7) | (22, 18, 3, 5) | for multi-context seeds. Reduces ext. alignment SE accuracy slightly; alternative (23, 19, 2, 7) would not (but improve mapping-only PE accuracy a bit less) |
| 200 | (22, 18, 2, 12) | (24, 20, 4, 12) | |
| 300 | (22, 18, 2, 12) | (24, 20, 5, 13) | |
| 500 | (23, 17, 2, 12) | (25, 19, 7, 13) | |

We only have canonical read length 250. Using the interpolated parameters (24, 20, 5, 12) or (24, 20, 4, 12) gives ok results for read lengths 200 and 300.
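
The issue does not say how the interpolation was done. One plausible reading, sketched here as an assumption, is a component-wise midpoint between the suggestions for read lengths 200 and 300, rounded to integers. Both tuples mentioned above are among the possible roundings.

```python
from itertools import product
from math import ceil, floor

params_200 = (24, 20, 4, 12)  # suggested (k, s, l, u) for read length 200
params_300 = (24, 20, 5, 13)  # suggested (k, s, l, u) for read length 300

# Component-wise midpoint, standing in for read length 250
midpoint = tuple((a + b) / 2 for a, b in zip(params_200, params_300))
print(midpoint)  # (24.0, 20.0, 4.5, 12.5)

# All integer tuples obtained by rounding each component down or up
candidates = sorted(set(product(*[(floor(x), ceil(x)) for x in midpoint])))
print(candidates)
```

The candidate list contains (24, 20, 4, 12) and (24, 20, 5, 12), along with (24, 20, 4, 13) and (24, 20, 5, 13).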

The script was run in a mode where it optimizes mapping-only accuracy. I am currently running it to optimize extension-alignment accuracy. In theory, the results could be different. So far, for the read lengths that are finished (currently 50, 75, 100), they are not.

Details for v0.12

This shows how mapping-only and extension-alignment accuracy change for the suggested parameters.

| Read length | (k, s, l, u) | maponly SE | maponly PE | extalign SE | extalign PE |
|---|---|---|---|---|---|
| 50 | (16, 12, -2, 0) | +0.7657 | +1.1146 | +0.9158 | +0.2027 |
| 75 | (20, 16, -3, -1) | -0.0090 | +0.1043 | +0.0296 | +0.0170 |
| 75 | (21, 17, -3, -1) | +0.0397 | +0.0744 | -0.0139 | +0.0229 |
| 100 | (16, 12, 1, 3) | +0.6626 | +0.4397 | +0.2958 | +0.1274 |
| 100 | (17, 13, 1, 3) | +0.6701 | +0.4101 | +0.2421 | +0.1311 |
| 150 | (20, 16, 2, 5) | -0.0016 | +0.0917 | -0.0119 | +0.0357 |
| 150 | (20, 16, 2, 8) | +0.1089 | +0.0357 | +0.0204 | +0.0241 |
| 200 | (24, 20, 4, 12) | +0.0516 | +0.0533 | +0.0041 | +0.0295 |
| 200 | (24, 20, 5, 12) | +0.0150 | +0.0496 | | |
| 300 | (24, 20, 4, 12) | +0.1591 | +0.0674 | | |
| 300 | (24, 20, 5, 12) | +0.1725 | +0.0729 | | |
| 300 | (24, 20, 5, 13) | +0.2264 | +0.0809 | +0.0438 | +0.0315 |
| 500 | (25, 19, 7, 13) | +0.2737 | +0.1441 | +0.0520 | +0.0306 |
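
The values in this table appear to be differences in percentage points relative to the previous default parameters (the rows marked `*****` in the detailed tables further down). A minimal sketch of that arithmetic for read length 50, using the rounded accuracies from the detailed v0.12.0 mapping-only table; the variable names are illustrative only.

```python
# (acc_se, acc_pe) in percent, read length 50, mapping-only, v0.12.0
baseline = (62.2715, 73.5222)   # (18, 14, -2, 1), the previous default
suggested = (63.0373, 74.6368)  # (16, 12, -2, 0), the suggestion

diff_se = suggested[0] - baseline[0]
diff_pe = suggested[1] - baseline[1]
print(f"{diff_se:+.4f} {diff_pe:+.4f}")
```

Because the inputs here are already rounded to four decimals, the last digit can differ by one from the tabulated difference, which was presumably computed from unrounded accuracies.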
More details

Details have been shortened because GitHub’s maximum comment size was reached.

# v0.12.0

## Read length 50: Weighted SE/PE results - mapping-only

parameters	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(16, 12, -2, 0)	63.0373	74.6368	+0.7657	+1.1146	96.585	96.573	+0.457	+0.443		pareto
(16, 12, -2, 1)	62.9708	74.5278	+0.6992	+1.0056	96.693	96.680	+0.564	+0.551		
(16, 12, -2, 2)	62.9749	74.5250	+0.7033	+1.0028	96.694	96.682	+0.565	+0.552		
(17, 13, -2, 0)	62.7352	74.2179	+0.4637	+0.6957	96.297	96.293	+0.169	+0.163		
(17, 13, -2, 1)	62.6388	74.0388	+0.3673	+0.5166	96.426	96.421	+0.297	+0.291		
(17, 13, -2, 2)	62.6360	74.0215	+0.3645	+0.4993	96.430	96.426	+0.302	+0.296		
(18, 14, -2, 0)	62.4612	73.7873	+0.1897	+0.2651	95.983	95.983	-0.145	-0.146		
(18, 14, -2, 1)	62.2715	73.5222	-0.0000	+0.0000	96.128	96.130	+0.000	+0.000	*****	
(18, 14, -2, 2)	62.2643	73.4951	-0.0072	-0.0270	96.137	96.139	+0.009	+0.009		

## Read length 50: Weighted SE/PE results - with extension alignment

parameters	sacc_se	sacc_pe	diff_se	diff_pe	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(16, 12, -2, 0)	92.8490	97.3130	+0.4439	-0.0220	66.0610	80.8919	+0.9158	+0.2027	96.585	99.455	+0.457	-0.056		pareto
(18, 14, -2, 1)	92.4051	97.3350	+0.0000	-0.0000	65.1452	80.6892	+0.0000	+0.0000	96.128	99.511	+0.000	+0.000


## Read length 75: Weighted SE/PE results - mapping-only

parameters	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(20, 16, -3, -1)	71.5283	82.4506	-0.0090	+0.1043	98.817	98.823	-0.182	-0.181		pareto
(21, 17, -3, -1)	71.5770	82.4207	+0.0397	+0.0744	98.747	98.750	-0.252	-0.254		pareto
(20, 16, -3, 0)	71.5360	82.3733	-0.0013	+0.0269	98.979	98.984	-0.020	-0.020		
(20, 16, -3, 2)	71.5373	82.3463	-0.0000	+0.0000	98.999	99.004	+0.000	+0.000	*****	
(20, 16, -3, 3)	71.5373	82.3463	-0.0000	+0.0000	98.999	99.004	+0.000	+0.000		
(20, 16, -3, 1)	71.5378	82.3412	+0.0004	-0.0051	98.999	99.004	-0.000	-0.000		
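
The `pareto` flag is not defined in this issue. One reading that matches the mapping-only rows shown is that a setting is flagged when no other setting is at least as good in both `acc_se` and `acc_pe` and strictly better in one. The sketch below checks this assumption against the read length 75 table above; it is not taken from `search.py`, and the real criterion may involve further columns.

```python
def pareto_front(rows):
    """Return the parameter tuples not dominated in (acc_se, acc_pe).

    A row is dominated if another row is at least as good in both
    accuracies and strictly better in at least one.
    """
    front = []
    for params, se, pe in rows:
        dominated = any(
            o_se >= se and o_pe >= pe and (o_se > se or o_pe > pe)
            for _, o_se, o_pe in rows
        )
        if not dominated:
            front.append(params)
    return front


# (parameters, acc_se, acc_pe) copied from the read length 75 table above
rows = [
    ((20, 16, -3, -1), 71.5283, 82.4506),
    ((21, 17, -3, -1), 71.5770, 82.4207),
    ((20, 16, -3, 0), 71.5360, 82.3733),
    ((20, 16, -3, 2), 71.5373, 82.3463),
    ((20, 16, -3, 3), 71.5373, 82.3463),
    ((20, 16, -3, 1), 71.5378, 82.3412),
]
print(pareto_front(rows))  # [(20, 16, -3, -1), (21, 17, -3, -1)]
```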

## Read length 75: Weighted SE/PE results - with extension alignment

parameters	sacc_se	sacc_pe	diff_se	diff_pe	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(20, 16, -3, -1)	95.6523	98.4026	-0.1678	-0.0217	74.8279	86.7332	+0.0296	+0.0170	98.817	99.761	-0.182	-0.029		pareto
(21, 17, -3, -1)	95.6500	98.4203	-0.1701	-0.0040	74.7843	86.7391	-0.0139	+0.0229	98.747	99.754	-0.252	-0.036		pareto
(20, 16, -3, 2)	95.8202	98.4242	+0.0000	+0.0000	74.7982	86.7161	+0.0000	+0.0000	98.999	99.790	+0.000	+0.000	*****	

## Read length 100: Weighted SE/PE results - mapping-only

parameters	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(16, 12, 1, 3)	77.3307	86.5965	+0.6626	+0.4397	99.181	99.180	-0.301	-0.301		pareto
(17, 13, 1, 3)	77.3382	86.5670	+0.6701	+0.4101	99.118	99.120	-0.364	-0.361		pareto
(18, 14, 1, 3)	77.3075	86.5272	+0.6393	+0.3704	99.042	99.040	-0.440	-0.442		
(17, 13, 0, 3)	76.9034	86.3236	+0.2352	+0.1668	99.387	99.387	-0.095	-0.094		
(18, 14, 0, 3)	76.9109	86.3185	+0.2428	+0.1617	99.344	99.342	-0.138	-0.140		
(16, 12, 0, 3)	76.8668	86.3232	+0.1986	+0.1664	99.446	99.446	-0.036	-0.035		
(18, 14, 0, 2)	76.8087	86.3316	+0.1405	+0.1747	99.250	99.250	-0.232	-0.232		
(17, 13, 0, 2)	76.7686	86.3275	+0.1004	+0.1707	99.292	99.293	-0.190	-0.188		
(19, 15, 0, 3)	76.8939	86.2880	+0.2257	+0.1311	99.283	99.285	-0.199	-0.196		
(19, 15, 0, 2)	76.7953	86.2902	+0.1272	+0.1333	99.202	99.201	-0.280	-0.280		
(16, 12, 0, 2)	76.7280	86.3065	+0.0598	+0.1496	99.347	99.347	-0.135	-0.134		
(20, 16, -2, 2)	76.6682	86.1568	+0.0000	+0.0000	99.482	99.481	+0.000	+0.000	*****	
(20, 16, -2, 3)	76.6916	86.1485	+0.0234	-0.0083	99.491	99.490	+0.009	+0.008		
(21, 17, -2, 1)	76.5986	86.1625	-0.0696	+0.0057	99.393	99.392	-0.089	-0.089		

## Read length 100: Weighted SE/PE results - with extension alignment

parameters	sacc_se	sacc_pe	diff_se	diff_pe	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(16, 12, 1, 3)	96.8270	98.8985	+0.0900	+0.0661	80.2273	89.7221	+0.2958	+0.1274	99.181	99.770	-0.301	-0.036		pareto
(17, 13, 1, 3)	96.8201	98.9200	+0.0831	+0.0876	80.1736	89.7258	+0.2421	+0.1311	99.118	99.769	-0.364	-0.036		pareto
(20, 16, -2, 2)	96.7371	98.8324	+0.0000	+0.0000	79.9315	89.5947	+0.0000	+0.0000	99.482	99.805	+0.000	+0.000	*****	

## Read length 150: Weighted SE/PE results - mapping-only

parameters	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(21, 17, 2, 6)	83.8433	90.4057	+0.0664	+0.0783	99.662	99.662	-0.065	-0.064		pareto
(20, 16, 2, 5)	83.7752	90.4192	-0.0016	+0.0917	99.666	99.668	-0.061	-0.059		pareto
(20, 16, 2, 6)	83.8460	90.3985	+0.0692	+0.0711	99.685	99.686	-0.042	-0.041		pareto
(21, 17, 2, 5)	83.7807	90.4081	+0.0038	+0.0806	99.645	99.643	-0.082	-0.084		pareto
(20, 16, 3, 6)	83.8200	90.3907	+0.0432	+0.0633	99.629	99.627	-0.098	-0.100		
(21, 17, 2, 7)	83.8663	90.3741	+0.0895	+0.0466	99.672	99.672	-0.055	-0.055		pareto
(20, 16, 2, 8)	83.8858	90.3632	+0.1089	+0.0357	99.699	99.698	-0.028	-0.029		pareto
(20, 16, 2, 7)	83.8592	90.3694	+0.0824	+0.0420	99.695	99.695	-0.032	-0.032		
(20, 16, 3, 5)	83.7589	90.3940	-0.0179	+0.0666	99.584	99.582	-0.143	-0.145		
(19, 15, 3, 7)	83.7862	90.3850	+0.0093	+0.0575	99.709	99.708	-0.018	-0.019		
(22, 18, 2, 7)	83.8621	90.3577	+0.0852	+0.0302	99.644	99.641	-0.083	-0.086		
(21, 17, 2, 8)	83.8798	90.3507	+0.1030	+0.0232	99.682	99.680	-0.045	-0.046		
(20, 16, 3, 7)	83.8378	90.3611	+0.0609	+0.0337	99.648	99.647	-0.079	-0.080		
(19, 15, 4, 7)	83.7849	90.3693	+0.0081	+0.0418	99.649	99.648	-0.078	-0.079		
(19, 15, 4, 8)	83.8282	90.3539	+0.0513	+0.0264	99.671	99.670	-0.056	-0.056		
(19, 15, 3, 8)	83.8379	90.3502	+0.0610	+0.0228	99.717	99.716	-0.010	-0.011		
(20, 16, 3, 8)	83.8600	90.3425	+0.0832	+0.0150	99.661	99.659	-0.067	-0.068		
(19, 15, 4, 6)	83.7073	90.3794	-0.0696	+0.0520	99.611	99.610	-0.116	-0.117		
(21, 17, 1, 7)	83.7966	90.3424	+0.0197	+0.0149	99.712	99.711	-0.015	-0.015		
(18, 14, 4, 8)	83.7598	90.3319	-0.0170	+0.0044	99.693	99.690	-0.034	-0.036		
(20, 16, 1, 7)	83.7768	90.3275	+0.0000	+0.0000	99.727	99.727	+0.000	+0.000	*****	
(20, 16, 1, 8)	83.8208	90.3148	+0.0440	-0.0127	99.730	99.729	+0.003	+0.002		
(21, 17, 1, 8)	83.8313	90.3107	+0.0545	-0.0167	99.715	99.714	-0.012	-0.013		
(18, 14, 3, 8)	83.7611	90.3143	-0.0157	-0.0131	99.728	99.727	+0.001	+0.000		
(22, 18, 1, 6)	83.7280	90.3213	-0.0488	-0.0061	99.681	99.680	-0.046	-0.046		
(22, 18, 1, 7)	83.7775	90.3080	+0.0007	-0.0194	99.686	99.684	-0.041	-0.043		
(19, 15, 2, 8)	83.7611	90.3010	-0.0158	-0.0264	99.741	99.740	+0.014	+0.013		
(18, 14, 5, 8)	83.6952	90.2884	-0.0816	-0.0391	99.633	99.630	-0.094	-0.096		

## Read length 150: Weighted SE/PE results - with extension alignment

parameters	sacc_se	sacc_pe	diff_se	diff_pe	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(21, 17, 2, 5)	97.8650	99.2698	-0.0823	+0.0274	86.1799	92.3607	-0.0529	+0.0468	99.645	99.780	-0.082	-0.001		pareto
(20, 16, 2, 5)	97.8731	99.2638	-0.0742	+0.0214	86.2209	92.3496	-0.0119	+0.0357	99.666	99.780	-0.061	-0.001		pareto
(21, 17, 2, 7)	97.9767	99.2781	+0.0294	+0.0357	86.2365	92.3424	+0.0037	+0.0285	99.672	99.781	-0.055	-0.000		pareto
(20, 16, 2, 8)	98.0105	99.2648	+0.0632	+0.0225	86.2533	92.3380	+0.0204	+0.0241	99.699	99.781	-0.028	-0.000		pareto
(21, 17, 2, 6)	97.9248	99.2729	-0.0225	+0.0305	86.2055	92.3472	-0.0273	+0.0333	99.662	99.781	-0.065	-0.000		
(20, 16, 2, 6)	97.9302	99.2642	-0.0171	+0.0218	86.2404	92.3337	+0.0076	+0.0198	99.685	99.781	-0.042	-0.000		
(20, 16, 1, 7)	97.9473	99.2424	-0.0000	-0.0000	86.2328	92.3139	+0.0000	+0.0000	99.727	99.781	+0.000	+0.000	*****	

## Read length 200: Weighted SE/PE results - mapping-only

parameters	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(24, 20, 4, 11)	87.5395	91.8639	+0.0401	+0.0623	99.720	99.717	-0.025	-0.025		pareto
(24, 20, 3, 12)	87.5548	91.8545	+0.0554	+0.0528	99.732	99.730	-0.013	-0.013		pareto
(24, 20, 4, 12)	87.5510	91.8549	+0.0516	+0.0533	99.721	99.718	-0.024	-0.024		pareto
(24, 20, 4, 10)	87.4988	91.8656	-0.0006	+0.0640	99.719	99.716	-0.026	-0.027		pareto
(24, 20, 3, 10)	87.5058	91.8626	+0.0065	+0.0610	99.730	99.729	-0.014	-0.014		
(24, 20, 3, 11)	87.5212	91.8571	+0.0218	+0.0555	99.732	99.730	-0.013	-0.013		
(24, 20, 3, 13)	87.5738	91.8424	+0.0744	+0.0408	99.733	99.731	-0.012	-0.012		pareto
(23, 19, 3, 12)	87.5488	91.8472	+0.0494	+0.0455	99.737	99.735	-0.008	-0.008		
(23, 19, 3, 10)	87.4930	91.8580	-0.0064	+0.0563	99.736	99.734	-0.009	-0.009		
...
(22, 18, 2, 12)	87.4994	91.8016	+0.0000	+0.0000	99.745	99.743	+0.000	+0.000	*****	


## Read length 200: Weighted SE/PE results - with extension alignment

parameters	sacc_se	sacc_pe	diff_se	diff_pe	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(24, 20, 4, 12)	98.4782	99.3803	+0.0502	+0.0442	89.4550	93.1223	+0.0041	+0.0295	99.721	99.750	-0.024	+0.000		pareto
(24, 20, 3, 12)	98.4726	99.3639	+0.0446	+0.0279	89.4582	93.1198	+0.0074	+0.0270	99.732	99.750	-0.013	+0.000		pareto
(24, 20, 4, 11)	98.4497	99.3765	+0.0216	+0.0404	89.4347	93.1223	-0.0161	+0.0295	99.720	99.750	-0.025	+0.000		
(24, 20, 3, 13)	98.4901	99.3666	+0.0620	+0.0306	89.4578	93.1104	+0.0070	+0.0176	99.733	99.750	-0.012	+0.000		
(24, 20, 4, 10)	98.4217	99.3768	-0.0063	+0.0407	89.4191	93.1175	-0.0318	+0.0246	99.719	99.750	-0.026	+0.000		
(22, 18, 2, 12)	98.4280	99.3361	+0.0000	-0.0000	89.4508	93.0928	+0.0000	+0.0000	99.745	99.750	+0.000	+0.000	*****	

## Read length 300: Weighted SE/PE results - mapping-only

parameters	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(24, 20, 6, 13)	91.0242	94.7012	+0.2182	+0.0960	99.692	99.690	-0.001	-0.001		pareto
(24, 20, 7, 13)	91.0230	94.6993	+0.2171	+0.0940	99.691	99.690	-0.001	-0.001		
(24, 20, 8, 13)	90.9953	94.7001	+0.1893	+0.0948	99.690	99.689	-0.002	-0.002		
(24, 20, 5, 13)	91.0323	94.6862	+0.2264	+0.0809	99.692	99.691	-0.000	-0.000		pareto
(23, 19, 6, 13)	91.0143	94.6817	+0.2084	+0.0764	99.692	99.690	-0.000	-0.000		
(24, 20, 6, 12)	90.9737	94.6912	+0.1678	+0.0859	99.691	99.690	-0.001	-0.001		
...
(22, 18, 2, 12)	90.8059	94.6053	+0.0000	+0.0000	99.692	99.691	+0.000	+0.000	*****	


## Read length 300: Weighted SE/PE results - with extension alignment

parameters	sacc_se	sacc_pe	diff_se	diff_pe	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(24, 20, 5, 13)	98.7497	99.4555	+0.0807	+0.0234	92.4811	95.6001	+0.0438	+0.0315	99.692	99.691	-0.000	+0.000		pareto
(24, 20, 6, 13)	98.7494	99.4608	+0.0804	+0.0287	92.4793	95.6004	+0.0419	+0.0319	99.692	99.691	-0.001	-0.000		pareto
(22, 18, 2, 12)	98.6690	99.4322	+0.0000	+0.0000	92.4373	95.5686	-0.0000	+0.0000	99.692	99.691	+0.000	+0.000	*****	

## Read length 500: Weighted SE/PE results - mapping-only

parameters	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(25, 19, 8, 13)	93.5670	95.5009	+0.2708	+0.1493	99.578	99.574	-0.000	-0.000		pareto
(25, 19, 7, 13)	93.5699	95.4957	+0.2737	+0.1441	99.578	99.574	-0.000	-0.000		pareto
(25, 19, 6, 13)	93.5695	95.4906	+0.2733	+0.1390	99.578	99.574	-0.000	+0.000		
(25, 19, 7, 12)	93.5153	95.4898	+0.2191	+0.1382	99.578	99.574	-0.000	-0.000		
...
(23, 17, 2, 12)	93.2962	95.3516	+0.0000	+0.0000	99.578	99.574	+0.000	+0.000	*****	


## Read length 500: Weighted SE/PE results - with extension alignment

parameters	sacc_se	sacc_pe	diff_se	diff_pe	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(25, 19, 8, 13)	98.9691	99.3465	+0.0594	+0.0285	94.7018	96.0529	+0.0562	+0.0347	99.578	99.574	-0.000	+0.000		pareto
(25, 19, 7, 13)	98.9757	99.3490	+0.0660	+0.0310	94.6976	96.0488	+0.0520	+0.0306	99.578	99.574	-0.000	-0.000		
(23, 17, 2, 12)	98.9097	99.3180	+0.0000	+0.0000	94.6456	96.0182	+0.0000	+0.0000	99.578	99.574	+0.000	+0.000	*****	

# Multi-context seeds (c4a7f61)


## Read length 50: Weighted SE/PE results - mapping-only

parameters	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(16, 12, -2, 0)	63.2349	74.9216	+0.3466	+0.6877	97.123	97.112	+0.146	+0.136		pareto
(16, 12, -2, 1)	63.2007	74.8490	+0.3124	+0.6150	97.252	97.244	+0.275	+0.268		
(16, 12, -2, 2)	63.1981	74.8449	+0.3097	+0.6110	97.254	97.246	+0.277	+0.270		
(17, 13, -2, 0)	63.1308	74.6860	+0.2425	+0.4520	96.977	96.969	-0.000	-0.007		
(17, 13, -2, 1)	63.0462	74.5605	+0.1578	+0.3265	97.151	97.143	+0.174	+0.168		
(17, 13, -2, 2)	63.0538	74.5557	+0.1654	+0.3218	97.158	97.150	+0.181	+0.174		
(18, 14, -2, 0)	62.9883	74.4344	+0.0999	+0.2004	96.756	96.754	-0.220	-0.222		
(18, 14, -2, 1)	62.8884	74.2339	+0.0000	+0.0000	96.977	96.976	+0.000	+0.000	*****	
(18, 14, -2, 2)	62.8892	74.2215	+0.0008	-0.0124	96.992	96.992	+0.015	+0.016		

## Read length 50: Weighted SE/PE results - with extension alignment

parameters	sacc_se	sacc_pe	diff_se	diff_pe	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(16, 12, -2, 0)	93.1696	97.3440	+0.0728	-0.0574	66.3488	80.9302	+0.5248	+0.1651	97.123	99.487	+0.146	-0.077		pareto
(18, 14, -2, 1)	93.0968	97.4014	+0.0000	+0.0000	65.8240	80.7651	+0.0000	+0.0000	96.977	99.565	+0.000	+0.000	*****	

## Read length 75: Weighted SE/PE results - mapping-only

parameters	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(21, 17, -3, -1)	71.7299	82.5870	+0.0342	+0.0823	98.887	98.888	-0.258	-0.260		pareto
(20, 16, -3, 0)	71.6998	82.5136	+0.0042	+0.0089	99.119	99.123	-0.025	-0.025		
(20, 16, -3, 1)	71.6971	82.5044	+0.0015	-0.0003	99.144	99.148	-0.000	-0.000		
(20, 16, -3, 2)	71.6957	82.5047	+0.0000	-0.0000	99.145	99.149	+0.000	+0.000	*****	
(20, 16, -3, 3)	71.6957	82.5047	+0.0000	-0.0000	99.145	99.149	+0.000	+0.000		

## Read length 75: Weighted SE/PE results - with extension alignment

parameters	sacc_se	sacc_pe	diff_se	diff_pe	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(21, 17, -3, -1)	95.7962	98.4268	-0.1689	-0.0038	74.9228	86.7618	-0.0016	+0.0307	98.887	99.756	-0.258	-0.036		pareto
(20, 16, -3, 2)	95.9650	98.4305	+0.0000	+0.0000	74.9245	86.7311	+0.0000	+0.0000	99.145	99.792	+0.000	-0.000	*****	pareto

## Read length 100: Weighted SE/PE results - mapping-only

parameters	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(18, 14, 1, 3)	77.4674	86.7038	+0.6795	+0.4186	99.235	99.238	-0.367	-0.363		pareto
(17, 13, 1, 3)	77.4442	86.6853	+0.6563	+0.4001	99.266	99.267	-0.335	-0.335		
(19, 15, 1, 3)	77.4468	86.6761	+0.6589	+0.3909	99.221	99.220	-0.380	-0.382		
(16, 12, 1, 3)	77.3820	86.6520	+0.5941	+0.3668	99.289	99.288	-0.312	-0.314		
(20, 16, 0, 3)	77.4297	86.6071	+0.6418	+0.3219	99.199	99.195	-0.402	-0.407		
(20, 16, 0, 2)	77.4049	86.6042	+0.6170	+0.3190	99.164	99.159	-0.437	-0.442		
(19, 15, 0, 3)	77.0658	86.4753	+0.2779	+0.1900	99.471	99.472	-0.131	-0.129		
(21, 17, -1, 1)	77.0216	86.4653	+0.2337	+0.1801	99.312	99.308	-0.289	-0.293		
(20, 16, -1, 3)	77.1043	86.4446	+0.3164	+0.1594	99.451	99.452	-0.151	-0.149		
(20, 16, -1, 1)	76.9934	86.4722	+0.2056	+0.1870	99.345	99.349	-0.256	-0.253		
(20, 16, -1, 2)	77.0752	86.4512	+0.2874	+0.1660	99.431	99.434	-0.171	-0.168		
(18, 14, 0, 3)	77.0269	86.4572	+0.2390	+0.1720	99.484	99.484	-0.118	-0.118		
(19, 15, 0, 2)	76.9685	86.4657	+0.1807	+0.1805	99.389	99.387	-0.213	-0.215		
(18, 14, 0, 2)	76.9113	86.4563	+0.1234	+0.1711	99.393	99.394	-0.208	-0.208		
(17, 13, 0, 3)	76.9784	86.4225	+0.1905	+0.1373	99.495	99.497	-0.107	-0.105		
(17, 13, 0, 2)	76.8298	86.4074	+0.0419	+0.1222	99.402	99.404	-0.199	-0.198		
(16, 12, 0, 3)	76.9166	86.3725	+0.1287	+0.0873	99.525	99.525	-0.076	-0.076		
(22, 18, -2, 3)	76.8749	86.3023	+0.0871	+0.0171	99.561	99.563	-0.040	-0.039		
(22, 18, -2, 1)	76.7769	86.3194	-0.0110	+0.0342	99.512	99.512	-0.089	-0.090		
(22, 18, -2, 2)	76.8542	86.2984	+0.0663	+0.0132	99.553	99.554	-0.048	-0.048		
(21, 17, -2, 2)	76.8275	86.3046	+0.0397	+0.0194	99.580	99.579	-0.022	-0.023		
(21, 17, -2, 3)	76.8509	86.2957	+0.0631	+0.0105	99.588	99.587	-0.013	-0.014		
(21, 17, -2, 1)	76.7558	86.3089	-0.0320	+0.0237	99.536	99.534	-0.065	-0.068		
(20, 16, -2, 3)	76.8415	86.2799	+0.0536	-0.0053	99.611	99.611	+0.010	+0.009		
(20, 16, -2, 2)	76.7879	86.2852	+0.0000	+0.0000	99.601	99.602	+0.000	+0.000	*****	

## Read length 100: Weighted SE/PE results - with extension alignment

parameters	sacc_se	sacc_pe	diff_se	diff_pe	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(18, 14, 1, 3)	96.9435	98.9435	+0.0754	+0.1069	80.2780	89.7660	+0.2204	+0.1587	99.235	99.773	-0.367	-0.033		pareto
(20, 16, -2, 2)	96.8680	98.8366	+0.0000	+0.0000	80.0575	89.6073	+0.0000	+0.0000	99.601	99.806	+0.000	+0.000	*****	

## Read length 150: Weighted SE/PE results - mapping-only

parameters	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(23, 19, 3, 6)	83.9188	90.4801	+0.1618	+0.1330	99.690	99.689	-0.072	-0.072		pareto
(23, 19, 3, 5)	83.8571	90.4898	+0.1000	+0.1427	99.653	99.653	-0.109	-0.109		pareto
(22, 18, 3, 5)	83.8277	90.4916	+0.0707	+0.1445	99.662	99.664	-0.099	-0.098		pareto
(22, 18, 3, 6)	83.8811	90.4733	+0.1240	+0.1263	99.700	99.700	-0.062	-0.061		
(23, 19, 2, 5)	83.8464	90.4809	+0.0894	+0.1338	99.710	99.709	-0.052	-0.053		
(23, 19, 3, 7)	83.9240	90.4585	+0.1670	+0.1115	99.710	99.709	-0.052	-0.053		pareto
(21, 17, 3, 5)	83.8062	90.4857	+0.0491	+0.1387	99.669	99.669	-0.093	-0.092		
(23, 19, 2, 6)	83.8871	90.4654	+0.1301	+0.1184	99.725	99.725	-0.037	-0.037		
(23, 19, 2, 7)	83.9296	90.4526	+0.1725	+0.1055	99.734	99.734	-0.028	-0.028		pareto
(22, 18, 2, 5)	83.8371	90.4757	+0.0800	+0.1286	99.714	99.712	-0.048	-0.049		
(21, 17, 2, 6)	83.8632	90.4686	+0.1062	+0.1215	99.734	99.734	-0.027	-0.028		
(22, 18, 2, 6)	83.8924	90.4575	+0.1354	+0.1104	99.731	99.730	-0.031	-0.032		
(21, 17, 3, 6)	83.8703	90.4606	+0.1133	+0.1135	99.711	99.709	-0.051	-0.052		
(21, 17, 3, 7)	83.8881	90.4484	+0.1311	+0.1014	99.726	99.725	-0.036	-0.037		
(22, 18, 2, 7)	83.9069	90.4414	+0.1498	+0.0943	99.740	99.740	-0.022	-0.022		
(20, 16, 3, 6)	83.8398	90.4549	+0.0828	+0.1079	99.717	99.716	-0.045	-0.046		
(22, 18, 3, 7)	83.9017	90.4377	+0.1446	+0.0907	99.720	99.720	-0.042	-0.042		
(23, 19, 2, 8)	83.9332	90.4283	+0.1761	+0.0813	99.740	99.739	-0.022	-0.022		pareto
(23, 19, 3, 8)	83.9324	90.4264	+0.1754	+0.0793	99.721	99.720	-0.041	-0.042		
...
(20, 16, 1, 7)	83.7570	90.3470	+0.0000	+0.0000	99.762	99.762	-0.000	+0.000	*****	


## Read length 150: Weighted SE/PE results - with extension alignment

parameters	sacc_se	sacc_pe	diff_se	diff_pe	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(22, 18, 3, 5)	97.9170	99.3012	-0.0674	+0.0578	86.2434	92.3854	-0.0286	+0.0711	99.662	99.779	-0.099	-0.002		pareto
(23, 19, 3, 5)	97.9233	99.3057	-0.0610	+0.0623	86.2171	92.3813	-0.0549	+0.0670	99.653	99.779	-0.109	-0.002		
(23, 19, 3, 6)	97.9852	99.3068	+0.0008	+0.0634	86.2630	92.3627	-0.0091	+0.0484	99.690	99.780	-0.072	-0.001		pareto
(23, 19, 3, 7)	98.0339	99.3086	+0.0496	+0.0652	86.2616	92.3630	-0.0105	+0.0487	99.710	99.781	-0.052	-0.000		pareto
(23, 19, 2, 7)	98.0662	99.2976	+0.0819	+0.0542	86.2846	92.3562	+0.0126	+0.0419	99.734	99.781	-0.028	-0.000		pareto
(23, 19, 2, 8)	98.1032	99.2958	+0.1189	+0.0524	86.3059	92.3469	+0.0339	+0.0326	99.740	99.781	-0.022	-0.000		pareto
(20, 16, 1, 7)	97.9843	99.2434	+0.0000	+0.0000	86.2720	92.3143	+0.0000	+0.0000	99.762	99.781	-0.000	+0.000	*****	

## Read length 200: Weighted SE/PE results - mapping-only

parameters	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(24, 20, 4, 12)	87.4547	91.8541	+0.0733	+0.0944	99.747	99.745	-0.003	-0.003		pareto
(24, 20, 5, 10)	87.4016	91.8602	+0.0203	+0.1006	99.741	99.739	-0.009	-0.009		pareto
(24, 20, 5, 11)	87.4243	91.8544	+0.0430	+0.0947	99.743	99.741	-0.007	-0.007		pareto
(24, 20, 4, 11)	87.4371	91.8494	+0.0557	+0.0897	99.746	99.744	-0.004	-0.004		
(24, 20, 4, 10)	87.4239	91.8470	+0.0426	+0.0874	99.745	99.744	-0.005	-0.005		
(24, 20, 5, 12)	87.4302	91.8424	+0.0489	+0.0827	99.744	99.742	-0.006	-0.006		
(24, 20, 4, 13)	87.4852	91.8282	+0.1038	+0.0685	99.748	99.746	-0.002	-0.002		pareto
(24, 20, 5, 13)	87.4570	91.8312	+0.0757	+0.0716	99.745	99.743	-0.005	-0.005		pareto
(24, 20, 3, 13)	87.4898	91.8222	+0.1084	+0.0626	99.749	99.747	-0.001	-0.001		pareto
(24, 20, 3, 10)	87.4065	91.8406	+0.0252	+0.0809	99.747	99.746	-0.003	-0.003		
...
(22, 18, 2, 12)	87.3813	91.7597	+0.0000	+0.0000	99.750	99.748	+0.000	+0.000	*****	


## Read length 200: Weighted SE/PE results - with extension alignment

parameters	sacc_se	sacc_pe	diff_se	diff_pe	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(24, 20, 4, 12)	98.5003	99.3799	+0.0720	+0.0447	89.4840	93.1277	+0.0491	+0.0333	99.747	99.750	-0.003	+0.000		pareto
(24, 20, 5, 13)	98.5112	99.3841	+0.0828	+0.0489	89.4901	93.1235	+0.0551	+0.0291	99.745	99.750	-0.005	+0.000		pareto
(24, 20, 5, 11)	98.4640	99.3801	+0.0356	+0.0449	89.4556	93.1223	+0.0206	+0.0279	99.743	99.750	-0.007	+0.000		
(24, 20, 4, 13)	98.5208	99.3765	+0.0924	+0.0413	89.4810	93.1150	+0.0461	+0.0207	99.748	99.750	-0.002	+0.000		
(24, 20, 5, 10)	98.4288	99.3803	+0.0004	+0.0451	89.4413	93.1245	+0.0064	+0.0302	99.741	99.750	-0.009	+0.000		
(24, 20, 3, 13)	98.5043	99.3654	+0.0759	+0.0302	89.4898	93.1114	+0.0548	+0.0170	99.749	99.750	-0.001	+0.000		
(22, 18, 2, 12)	98.4284	99.3352	+0.0000	+0.0000	89.4350	93.0943	+0.0000	+0.0000	99.750	99.750	+0.000	+0.000	*****	

## Read length 300: Weighted SE/PE results - mapping-only

parameters	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(24, 20, 6, 13)	90.7976	94.5880	+0.1863	+0.0741	99.692	99.691	-0.000	-0.000		pareto
(24, 20, 5, 13)	90.8041	94.5827	+0.1927	+0.0688	99.692	99.691	+0.000	+0.000		pareto
(24, 20, 7, 13)	90.7785	94.5875	+0.1671	+0.0735	99.692	99.691	-0.000	-0.000		
(24, 20, 6, 12)	90.7392	94.5910	+0.1279	+0.0770	99.692	99.690	-0.000	-0.000		pareto
(24, 20, 4, 13)	90.7948	94.5756	+0.1835	+0.0616	99.692	99.691	+0.000	+0.000		
(24, 20, 4, 12)	90.7514	94.5846	+0.1400	+0.0707	99.692	99.691	-0.000	-0.000		
(24, 20, 5, 12)	90.7636	94.5773	+0.1523	+0.0633	99.692	99.691	+0.000	-0.000		
(24, 20, 3, 13)	90.7737	94.5729	+0.1624	+0.0590	99.692	99.691	+0.000	+0.000		
...
(22, 18, 2, 12)	90.6113	94.5139	+0.0000	+0.0000	99.692	99.691	+0.000	+0.000	*****	


## Read length 300: Weighted SE/PE results - with extension alignment

parameters	sacc_se	sacc_pe	diff_se	diff_pe	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(24, 20, 6, 13)	98.7377	99.4604	+0.0772	+0.0287	92.4746	95.6028	+0.0431	+0.0364	99.692	99.691	-0.000	-0.000		pareto
(24, 20, 5, 13)	98.7374	99.4547	+0.0769	+0.0229	92.4670	95.5950	+0.0355	+0.0286	99.692	99.691	+0.000	+0.000		
(24, 20, 6, 12)	98.7202	99.4553	+0.0596	+0.0235	92.4535	95.5976	+0.0220	+0.0312	99.692	99.691	-0.000	-0.000		
(22, 18, 2, 12)	98.6606	99.4318	+0.0000	+0.0000	92.4316	95.5664	+0.0000	+0.0000	99.692	99.691	+0.000	+0.000	*****	

## Read length 500: Weighted SE/PE results - mapping-only

parameters	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(25, 19, 7, 13)	93.2384	95.3643	+0.1550	+0.1071	99.578	99.574	-0.000	-0.000		pareto
(25, 19, 6, 13)	93.2549	95.3558	+0.1714	+0.0986	99.578	99.574	-0.000	+0.000		pareto
(25, 19, 5, 13)	93.2581	95.3468	+0.1747	+0.0896	99.578	99.574	+0.000	+0.000		pareto
(25, 19, 7, 12)	93.1965	95.3606	+0.1131	+0.1034	99.578	99.574	-0.000	-0.000		
(25, 19, 6, 12)	93.2142	95.3496	+0.1307	+0.0924	99.578	99.574	-0.000	+0.000		
(25, 19, 4, 13)	93.2463	95.3391	+0.1628	+0.0819	99.578	99.574	+0.000	+0.000		
...
(23, 17, 2, 12)	93.0834	95.2572	+0.0000	+0.0000	99.578	99.574	+0.000	+0.000	*****	

## Read length 500: Weighted SE/PE results - with extension alignment

parameters	sacc_se	sacc_pe	diff_se	diff_pe	acc_se	acc_pe	diff_se	diff_pe	mprt_se	mprt_pe	diff_se	diff_pe
(25, 19, 7, 13)	98.9583	99.3693	+0.0592	+0.0267	94.6867	96.0819	+0.0416	+0.0251	99.578	99.574	-0.000	+0.000		pareto
(25, 19, 5, 13)	98.9473	99.3601	+0.0482	+0.0175	94.6765	96.0713	+0.0313	+0.0145	99.578	99.574	+0.000	+0.000		
(25, 19, 6, 13)	98.9524	99.3623	+0.0533	+0.0197	94.6751	96.0669	+0.0299	+0.0101	99.578	99.574	-0.000	+0.000		
(23, 17, 2, 12)	98.8991	99.3426	+0.0000	+0.0000	94.6451	96.0568	+0.0000	+0.0000	99.578	99.574	+0.000	+0.000	*****	
@ksahlin
Owner

ksahlin commented Mar 15, 2024

Great, I think we could go with your suggested parameter changes in this issue for a benchmark between current hashing and multi-context hashing.

It is interesting that many of the read lengths have the same parameter combination; I am not sure if this is a sign of something bad (e.g., overfitting the design to data, underutilization of partial hits, or underevaluation). Regardless, I think it serves its purpose for now. We are thinking about asymmetrical seeds, which are more important now and may alter things slightly.

(Note: in further evaluations, we should probably log how many times the new hashing scheme successfully used a 'partial hit' instead of the full hit. Here, 'successfully' is a bit vague and could have several meanings, such as simply finding a partial hit, or a partial hit being used to build a higher-scoring NAM or pair of NAMs.)

marcelm added a commit that referenced this issue Mar 20, 2024
marcelm added a commit that referenced this issue Mar 20, 2024
marcelm added a commit that referenced this issue Mar 20, 2024
@marcelm
Collaborator Author

marcelm commented Mar 20, 2024

I have added two branches to the repository, each with a single new commit that switches to the optimized parameters:

  • v0.12.0-optimized-parameters is on top of v0.12.0.
  • mcs-optimized-parameters is on top of Ivan’s multi-context-seeds branch

For completeness, I picked (20, 16, 1, 4) for canonical read length 125 for both branches, but this should not be relevant as the test datasets don’t include that read length.

I also noticed that v0.12.0 still has canonical read length 300, so I left it that way and did not use the interpolated parameters as I had originally suggested.

It would be possible to apply these changes on top of v0.13.0, but since I benchmarked v0.12.0 and there have been very few changes since then that affect accuracy, I suggest we stick to v0.12.0.

@ksahlin
Owner

ksahlin commented Mar 20, 2024

I have started a benchmark of the two commits.

> For completeness, I picked (20, 16, 1, 4) for canonical read length 125 for both branches, but this should not be relevant as the test datasets don’t include that read length.

The evaluation does include read length 125. The full set of read lengths is ["50", "75", "91", "100", "111", "125", "136", "150", "176", "200", "250", "300", "500"], which also tests the 'worst case' for some of the parameter ranges.

> It would be possible to apply these changes on top of v0.13.0, but since I benchmarked v0.12.0 and there have been very few changes since then that affect accuracy, I suggest we stick to v0.12.0.

It's great to compare these two commits as a checkpoint to see where we are. However, I am afraid this might not be the last benchmark I do between the two seeding variants. The larger goal before an eventual merge of mcs would be to get rid of the redundant NAMs that cause redundant extension calls (particularly visible in the mcs branch). Ivan is now exploring the asymmetrical version of mcs, checking whether my comment in #405 (comment) is true. If my guess is correct, it would be nice to benchmark two asymmetrical versions against each other.

@ksahlin
Owner

ksahlin commented Mar 21, 2024

Evaluation is ready (see attached plots). All results are for PE alignment, symmetric seeds. Main points:

Accuracy

  • Extension-based accuracy is nearly identical between the two seeds.
  • Mapping-only accuracy is slightly better for mcs for short reads (see particularly drosophila and CHM13), and slightly worse for longer reads. Notable here is the dip at read length 111 for our current seeds. Another notable issue is that mcs are strictly worse for longer seeds. I do not expect (or accept) this.

Percent mapped

  • mcs beats current seeds in almost all cases and with quite a big margin, which is nice to see.

Runtime

  • It seems mcs are faster more often than not for short reads - nice! Possibly because of less rescue extension.
  • mcs are consistently and substantially slower than current seeds for the longest reads. Ivan and I believe that this is because more mapping sites are tried with extension due to more matches (coming from partial matches). If 'chaining'/scoring of NAMs is implemented well, I do not see a reason for accepting this. Using asymmetric seeds would lead to better NAM merging, hence scoring, and would take care of this (according to @marcelm's analysis).

Overall:

  1. mcs offer some clear advantages in mapping (in mapped percentage, accuracy, and time) for short reads, but are currently slightly stifled by NAM scoring/chaining, leading to lower accuracy and slower runtime on longer reads. It will be interesting to see if this can be solved with asymmetric seeds. If this last issue is ironed out, I think we have a strong case for using mcs as the new strategy.
  2. Evaluation does not include SE alignment - but all evidence points to mcs being even better (relatively) on SE data.

@Itolstoganov

accuracy_plot_cut_at_80.pdf
percentage_aligned_plot.pdf
time_plot.pdf

marcelm added a commit that referenced this issue May 22, 2024
marcelm added a commit that referenced this issue May 22, 2024
marcelm added a commit that referenced this issue May 23, 2024