
[FIX] Include relaxed FPR into the DP algorithm. #224

Merged
merged 4 commits into seqan:main from relaxed_fpr
Sep 10, 2024

Conversation

@smehringer (Member) commented Sep 7, 2024

This PR makes the DP aware of the relaxed false positive rate by down-scaling the size of merged bins.
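To illustrate the idea, here is a minimal, self-contained sketch (not the library's API): the correction-factor formula mirrors the one added in this PR, while the concrete FPR values, the hash-function count, and the merged-bin size are illustrative assumptions.

#include <cassert>
#include <cmath>
#include <cstddef>
#include <iostream>

int main()
{
    // Assumed example values: split bins must satisfy maximum_fpr,
    // merged bins only have to satisfy the larger relaxed_fpr.
    double const maximum_fpr{0.05};
    double const relaxed_fpr{0.3};
    std::size_t const number_of_hash_functions{2};

    // Size ratio of a Bloom filter built for relaxed_fpr vs. maximum_fpr; always <= 1.0.
    double const numerator = std::log1p(-std::exp(std::log(maximum_fpr) / number_of_hash_functions));
    double const denominator = std::log1p(-std::exp(std::log(relaxed_fpr) / number_of_hash_functions));
    double const relaxed_fpr_correction_factor = numerator / denominator;
    assert(relaxed_fpr_correction_factor <= 1.0);

    // A merged bin's size can then be down-scaled by this factor before the DP
    // compares layouts (the "down-scaling" described above).
    std::size_t const merged_bin_kmer_count{2000};
    double const corrected_size = merged_bin_kmer_count * relaxed_fpr_correction_factor;

    std::cout << relaxed_fpr_correction_factor << '\n'; // ~0.32 for the values above
    std::cout << corrected_size << '\n';                // ~638
}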

@seqan-actions commented:
Documentation preview available at https://docs.seqan.de/preview/seqan/hibf/224

codecov bot commented Sep 7, 2024

Codecov Report

Attention: Patch coverage is 95.45455% with 1 line in your changes missing coverage. Please review.

Project coverage is 99.63%. Comparing base (f83ba1a) to head (f444aa3).
Report is 3 commits behind head on main.

Files with missing lines      | Patch % | Lines
src/layout/compute_layout.cpp | 66.66%  | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #224      +/-   ##
==========================================
- Coverage   99.68%   99.63%   -0.06%     
==========================================
  Files          49       50       +1     
  Lines        1914     1911       -3     
  Branches        5        5              
==========================================
- Hits         1908     1904       -4     
- Misses          6        7       +1     


Comment on lines -33 to +46
- std::vector<seqan::hibf::layout::layout::max_bin> expected_max_bins{{{1}, 22}, {{2}, 22}};
+ std::vector<seqan::hibf::layout::layout::max_bin> expected_max_bins{{{2}, 22}, {{3}, 0}};

  std::vector<seqan::hibf::layout::layout::user_bin> expected_user_bins{{{}, 0, 1, 7},
-                                                                       {{1}, 0, 22, 4},
-                                                                       {{1}, 22, 21, 5},
-                                                                       {{1}, 43, 21, 6},
-                                                                       {{2}, 0, 22, 0},
-                                                                       {{2}, 22, 21, 2},
-                                                                       {{2}, 43, 21, 3},
-                                                                       {{}, 3, 1, 1}};
+                                                                       {{}, 1, 1, 6},
+                                                                       {{2}, 0, 22, 3},
+                                                                       {{2}, 22, 21, 4},
+                                                                       {{2}, 43, 21, 5},
+                                                                       {{3}, 0, 42, 1},
+                                                                       {{3}, 42, 11, 0},
+                                                                       {{3}, 53, 11, 2}};
@smehringer (Member Author) commented:

Input: (no union estimation enabled)

user bin ID | 0    1     2    3    4    5    6    7
size        | 500, 1000, 500, 500, 500, 500, 500, 500

Before:

Top Level               | 7 | 4,5,6 | 1 | 0,2,3 |    
                             /                \
1st Level   | 22x4 | 21x5 | 21x6 |           | 22x0 | 21x2 | 21x3 | 

After:

Top Level                | 7 | 6 | 3,4,5 | 0,1,2 |
                                  /              \
1st Level     | 22x3 | 21x4 | 21x5 |          | 11x0 | 42x1 | 11x2 |

With this change it is better to merge the large UB-1: the size of merging 0,1,2 (500 + 500 + 1000), corrected by the relaxed false positive rate (0.3 vs. 0.05), is smaller than keeping UB-1 with its size of 1000 as a single bin at the top level. This result matches the intuition behind the change.
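Rough numbers behind that comparison (my back-of-the-envelope, assuming 2 hash functions and ignoring the remaining DP cost terms):

relaxed_fpr_correction_factor ≈ ln(1 − 0.05^(1/2)) / ln(1 − 0.3^(1/2)) ≈ (−0.253) / (−0.793) ≈ 0.32
corrected merged size ≈ (500 + 500 + 1000) · 0.32 ≈ 640 < 1000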

@eseiler (Member) left a comment:

I pushed a review, check it out

Comment on lines 88 to 93
double const numerator =
std::log1p(-std::exp(std::log(config_.maximum_fpr) / config_.number_of_hash_functions));
double const denominator =
std::log1p(-std::exp(std::log(config_.relaxed_fpr) / config_.number_of_hash_functions));
relaxed_fpr_correction_factor = numerator / denominator;
assert(relaxed_fpr_correction_factor <= 1.0);
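For context, my reading of where this factor comes from (the standard Bloom filter size relation; not stated in the PR itself): with target FPR $p$, $h$ hash functions, and $n$ elements, a Bloom filter needs

$$ m(p) = \frac{-h\,n}{\ln\bigl(1 - p^{1/h}\bigr)} $$

bits, so the value computed above is the size ratio

$$ \frac{m(\mathrm{relaxed\_fpr})}{m(\mathrm{maximum\_fpr})} = \frac{\ln\bigl(1 - \mathrm{maximum\_fpr}^{1/h}\bigr)}{\ln\bigl(1 - \mathrm{relaxed\_fpr}^{1/h}\bigr)} \le 1, $$

i.e. how much smaller a merged bin may be counted when it only has to satisfy the relaxed FPR.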
@smehringer (Member Author) commented:

I would at least test that the variable is non-zero, no?
I don't really like that the members must be set outside but are used only here.

@smehringer (Member Author) commented Sep 10, 2024

LGTM. Follow-up:

1: void data_store::validate()

  • We need to test whether the data store is set up correctly before we start executing hierarchical binning.

2: Refactor data_store

  • If the config is stored in the data store, the data store can get its own constructor that initialises the fpr correction factors via initialising lambdas. This would be less error-prone (rough sketch below).
  • The data store is not very intuitive because it contains members that
    • are given externally and are always const,
    • are given externally but change in recursive calls (e.g. positions),
    • are local but const,
    • are local but change in iterations of the DP (e.g. union_estimation).
    This should be restructured.
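A rough sketch of what that follow-up could look like (purely hypothetical interface, not the current data_store; the stand-in config struct only mirrors the three members used by the correction formula):

#include <cmath>
#include <cstddef>
#include <stdexcept>

// Stand-in config; only the members needed for the correction factor.
struct config
{
    double maximum_fpr{};
    double relaxed_fpr{};
    std::size_t number_of_hash_functions{};
};

struct data_store
{
    // 2: Initialise the correction factor in the constructor via an
    //    immediately invoked lambda, so it is const and cannot be forgotten.
    explicit data_store(config const & cfg) :
        relaxed_fpr_correction_factor{[&]()
        {
            double const numerator =
                std::log1p(-std::exp(std::log(cfg.maximum_fpr) / cfg.number_of_hash_functions));
            double const denominator =
                std::log1p(-std::exp(std::log(cfg.relaxed_fpr) / cfg.number_of_hash_functions));
            return numerator / denominator;
        }()}
    {}

    // 1: Fail early instead of deep inside hierarchical binning.
    void validate() const
    {
        if (!(relaxed_fpr_correction_factor > 0.0 && relaxed_fpr_correction_factor <= 1.0))
            throw std::invalid_argument{"data_store is not set up correctly."};
    }

    double const relaxed_fpr_correction_factor;
};

Validation would then be a single data_store::validate() call right before hierarchical binning starts.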

@eseiler merged commit 14026d0 into seqan:main on Sep 10, 2024
37 checks passed
@eseiler mentioned this pull request on Sep 10, 2024
@smehringer deleted the relaxed_fpr branch on September 11, 2024