Optimize parameters (again) #407
Comments
Great, I think we could go with your suggested parameter changes in this issue for a benchmark between current hashing and multi-context hashing. It is interesting that many of the read lengths have the same parameter combination; I am not sure whether this is a sign of something bad (e.g., overfitting the design to the data, underutilization of partial hits, or underevaluation). Regardless, I think it serves its purpose for now. We are thinking about asymmetrical seeds, which are more important now and may alter things slightly. (Note: in further evaluations, we should probably log how many times we successfully used a 'partial hit' rather than the full hit in the new hashing scheme. Here, 'successfully' is a bit vague and could have several meanings, such as simply finding a partial hit, or the partial hit being used to build a higher-scoring NAM or pair of NAMs.)
I have added two branches to the repository, each with a single new commit that switches to the optimized parameters:
For completeness, I picked (20, 16, 1, 4) for canonical read length 125 for both branches, but this should not be relevant as the test datasets don’t include that read length. I also noticed that v0.12.0 still has canonical read length 300, so I left it that way and did not use the interpolated parameters as I had originally suggested. It would be possible to apply these changes on top of v0.13.0, but since I benchmarked v0.12.0 and there have been very few changes since then that affect accuracy, I suggest we stick to v0.12.0.
I have started a benchmark of the two commits.
The evaluation does include read length 125 as well as read lengths
It's great to compare these two commits as a checkpoint to see where we are. However, I am afraid this might not be the last benchmark I do between the two seeding variants. The larger goal before an eventual merge of mcs would be to get rid of the redundant NAMs that cause redundant extension calls (particularly visible in the mcs branch). Ivan is now exploring the asymmetrical version of mcs, checking whether my comment in #405 (comment) is true. If my guess is correct, it would be nice to benchmark two asymmetrical versions against each other.
Evaluation is ready (see attached plots). All results are for PE alignment, symmetric seeds. Main points:
Accuracy
Percent mapped
Runtime
Overall:
accuracy_plot_cut_at_80.pdf
Here are suggested new indexing parameters for all read lengths.
This supersedes #397.
I ran the optimization script for both v0.12.0 (commit 6fd4c5d) and multi-context seeds (commit c4a7f61).
Differences to #397:
Command used:
Suggested changes
Parameters are given as a tuple $(k, s, l, u)$.
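To make the tuple easier to read, here is a minimal sketch of how the window bounds for the second strobe might be derived from it. It assumes the convention that $k$ is the strobe length, $s$ the syncmer submer size, and $l$ and $u$ the lower and upper offsets relative to $k/(k-s+1)$; the function name `window_offsets` is made up for illustration and is not part of strobealign.

```python
def window_offsets(k, s, l, u):
    """Return (w_min, w_max) for a parameter tuple (k, s, l, u).

    Assumption: the window for sampling the second strobe starts
    k // (k - s + 1) + l syncmers downstream and ends
    k // (k - s + 1) + u syncmers downstream, with a lower bound of 1.
    """
    if not 0 < s <= k:
        raise ValueError("expected 0 < s <= k")
    base = k // (k - s + 1)
    w_min = max(1, base + l)
    w_max = base + u
    return w_min, w_max


# The setting picked for canonical read length 125 in this thread.
print(window_offsets(20, 16, 1, 4))  # (5, 8)
```

Under that assumption, increasing $l$ and $u$ moves the second strobe further downstream, so settings with larger offsets span more of the read.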
I did not mechanically pick the settings that optimize mapping-only accuracy, but made sure that they also work well for extension alignment mode. Many parameter settings are found that are essentially equally good, so it was possible for me to find settings that work equally well for v0.12.0 and multi-context seeds, except for read lengths 100 and 150.
Alternative (17, 13, 1, 3) is very similar
slightly; alternative (20, 16, 2, 8) would not
(but improve mapping-only PE accuracy much less)
alternative (23, 19, 2, 7) would not
(but improve mapping-only PE accuracy a bit less)
We only have canonical read length 250. Using the interpolated parameters (24, 20, 5, 12) or (24, 20, 4, 12) gives ok results for read lengths 200 and 300.
The script was run in a mode where it optimizes mapping-only accuracy. I am currently running it to optimize extension-alignment accuracy. In theory, the results could be different. So far, for the read lengths that are finished (currently 50, 75, 100), they are not.
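The selection idea described above (keep only settings that are close to the best in both mapping-only and extension-alignment mode, then look for settings shared by both seeding variants) could be sketched roughly as follows. This is not the optimization script itself; the function names, the tolerance, and all accuracy numbers are invented for illustration and are not measured results.

```python
def near_optimal_in_both(results, tolerance):
    """Return settings within `tolerance` of the best accuracy in both modes.

    `results` maps a (k, s, l, u) tuple to a pair
    (mapping_only_accuracy, extension_accuracy), both in percent.
    """
    best_map = max(m for m, _ in results.values())
    best_ext = max(e for _, e in results.values())
    return sorted(
        params
        for params, (m, e) in results.items()
        if best_map - m <= tolerance and best_ext - e <= tolerance
    )


def shared_settings(results_a, results_b, tolerance):
    """Settings that are near-optimal for both seeding variants."""
    a = set(near_optimal_in_both(results_a, tolerance))
    b = set(near_optimal_in_both(results_b, tolerance))
    return sorted(a & b)


# Toy accuracies, invented for this example (NOT benchmark results).
toy_variant_a = {
    (20, 16, 1, 4): (90.0, 92.0),
    (20, 16, 2, 8): (90.1, 91.0),
    (17, 13, 1, 3): (89.95, 91.9),
}
toy_variant_b = {
    (20, 16, 1, 4): (91.0, 93.0),
    (20, 16, 2, 8): (91.05, 92.95),
    (17, 13, 1, 3): (90.0, 92.0),
}

print(shared_settings(toy_variant_a, toy_variant_b, tolerance=0.2))
```

With a tolerance like this, a setting that wins on mapping-only accuracy but loses clearly in extension mode is dropped, which matches the reasoning for not picking the mapping-only optimum mechanically.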
Details for v0.12
This shows how mapping-only and extension-alignment accuracy change for the suggested parameters.
More details
Details have been shortened because GitHub’s maximum comment size was reached.