
[FIX] Include relaxed FPR into the DP algorithm. #224

Merged
merged 4 commits into seqan:main from relaxed_fpr
Sep 10, 2024

Conversation

@smehringer (Member) commented Sep 7, 2024

This PR makes the DP aware of the relaxed false positive rate by down-scaling the size of merged bins.
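To illustrate the idea, here is a minimal, self-contained sketch (not the library's API): the correction-factor formula mirrors the one added in this PR, while the concrete FPR values, the hash-function count, and the merged-bin size are illustrative assumptions.

#include <cassert>
#include <cmath>
#include <cstddef>
#include <iostream>

int main()
{
    // Assumed example values: split bins must satisfy maximum_fpr,
    // merged bins only have to satisfy the larger relaxed_fpr.
    double const maximum_fpr{0.05};
    double const relaxed_fpr{0.3};
    std::size_t const number_of_hash_functions{2};

    // Size ratio of a Bloom filter built for relaxed_fpr vs. maximum_fpr; always <= 1.0.
    double const numerator = std::log1p(-std::exp(std::log(maximum_fpr) / number_of_hash_functions));
    double const denominator = std::log1p(-std::exp(std::log(relaxed_fpr) / number_of_hash_functions));
    double const relaxed_fpr_correction_factor = numerator / denominator;
    assert(relaxed_fpr_correction_factor <= 1.0);

    // A merged bin's size can then be down-scaled by this factor before the DP
    // compares layouts (the "down-scaling" described above).
    std::size_t const merged_bin_kmer_count{2000};
    double const corrected_size = merged_bin_kmer_count * relaxed_fpr_correction_factor;

    std::cout << relaxed_fpr_correction_factor << '\n'; // ~0.32 for the values above
    std::cout << corrected_size << '\n';                // ~638
}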

@seqan-actions commented:
Documentation preview available at https://docs.seqan.de/preview/seqan/hibf/224

codecov bot commented Sep 7, 2024

Codecov Report

Attention: Patch coverage is 95.45455% with 1 line in your changes missing coverage. Please review.

Project coverage is 99.63%. Comparing base (f83ba1a) to head (f444aa3).
Report is 3 commits behind head on main.

Files with missing lines      | Patch % | Lines
src/layout/compute_layout.cpp | 66.66%  | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #224      +/-   ##
==========================================
- Coverage   99.68%   99.63%   -0.06%     
==========================================
  Files          49       50       +1     
  Lines        1914     1911       -3     
  Branches        5        5              
==========================================
- Hits         1908     1904       -4     
- Misses          6        7       +1     


Comment on lines -33 to +46
- std::vector<seqan::hibf::layout::layout::max_bin> expected_max_bins{{{1}, 22}, {{2}, 22}};
+ std::vector<seqan::hibf::layout::layout::max_bin> expected_max_bins{{{2}, 22}, {{3}, 0}};

  std::vector<seqan::hibf::layout::layout::user_bin> expected_user_bins{{{}, 0, 1, 7},
-                                                                       {{1}, 0, 22, 4},
-                                                                       {{1}, 22, 21, 5},
-                                                                       {{1}, 43, 21, 6},
-                                                                       {{2}, 0, 22, 0},
-                                                                       {{2}, 22, 21, 2},
-                                                                       {{2}, 43, 21, 3},
-                                                                       {{}, 3, 1, 1}};
+                                                                       {{}, 1, 1, 6},
+                                                                       {{2}, 0, 22, 3},
+                                                                       {{2}, 22, 21, 4},
+                                                                       {{2}, 43, 21, 5},
+                                                                       {{3}, 0, 42, 1},
+                                                                       {{3}, 42, 11, 0},
+                                                                       {{3}, 53, 11, 2}};
@smehringer (Member Author) commented:

Input: (no union estimation enabled)

user bin ID | 0    1     2    3    4    5    6    7
size        | 500, 1000, 500, 500, 500, 500, 500, 500

Before:

Top Level               | 7 | 4,5,6 | 1 | 0,2,3 |    
                             /                \
1st Level   | 22x4 | 21x5 | 21x6 |           | 22x0 | 21x2 | 21x3 | 

After:

Top Level                | 7 | 6 | 3,4,5 | 0,1,2 |
                                  /              \
1st Level     | 22x3 | 21x4 | 21x5 |          | 11x0 | 42x1 | 11x2 |

With this change it is better to merge the large UB-1: the size of merging 0,1,2 (500 + 500 + 1000), corrected by the relaxed false positive rate (0.3 vs. 0.05), is smaller than keeping UB-1 with its size of 1000 as a single bin at the top level. This result matches the intuition behind the change.
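Rough numbers behind that comparison (my back-of-the-envelope, assuming 2 hash functions and ignoring the remaining DP cost terms):

relaxed_fpr_correction_factor ≈ ln(1 − 0.05^(1/2)) / ln(1 − 0.3^(1/2)) ≈ (−0.253) / (−0.793) ≈ 0.32
corrected merged size ≈ (500 + 500 + 1000) · 0.32 ≈ 640 < 1000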

@eseiler (Member) left a comment:

I pushed a review, check it out

Comment on lines 88 to 93
double const numerator =
std::log1p(-std::exp(std::log(config_.maximum_fpr) / config_.number_of_hash_functions));
double const denominator =
std::log1p(-std::exp(std::log(config_.relaxed_fpr) / config_.number_of_hash_functions));
relaxed_fpr_correction_factor = numerator / denominator;
assert(relaxed_fpr_correction_factor <= 1.0);
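For context, my reading of where this factor comes from (the standard Bloom filter size relation; not stated in the PR itself): with target FPR $p$, $h$ hash functions, and $n$ elements, a Bloom filter needs

$$ m(p) = \frac{-h\,n}{\ln\bigl(1 - p^{1/h}\bigr)} $$

bits, so the value computed above is the size ratio

$$ \frac{m(\mathrm{relaxed\_fpr})}{m(\mathrm{maximum\_fpr})} = \frac{\ln\bigl(1 - \mathrm{maximum\_fpr}^{1/h}\bigr)}{\ln\bigl(1 - \mathrm{relaxed\_fpr}^{1/h}\bigr)} \le 1, $$

i.e. how much smaller a merged bin may be counted when it only has to satisfy the relaxed FPR.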
@smehringer (Member Author) commented:

I would at least test that the variable is non-zero, no?
I don't really like that the members must be set outside but are used only here.

@smehringer (Member Author) commented Sep 10, 2024

LGTM. Follow-up:

1: void data_store::validate()

  • We need to test whether the data store is set up correctly before we start executing hierarchical binning.

2: Refactor data_store

  • If the config is stored in the data store, the data store can get its own constructor that initialises the fpr correction factors via initialising lambdas. This would be less error-prone (rough sketch below).
  • The data store is not very intuitive because it contains members that
    • are given externally and are always const,
    • are given externally but change in recursive calls (e.g. positions),
    • are local but const,
    • are local but change in iterations of the DP (e.g. union_estimation).
    This should be restructured.
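A rough sketch of what that follow-up could look like (purely hypothetical interface, not the current data_store; the stand-in config struct only mirrors the three members used by the correction formula):

#include <cmath>
#include <cstddef>
#include <stdexcept>

// Stand-in config; only the members needed for the correction factor.
struct config
{
    double maximum_fpr{};
    double relaxed_fpr{};
    std::size_t number_of_hash_functions{};
};

struct data_store
{
    // 2: Initialise the correction factor in the constructor via an
    //    immediately invoked lambda, so it is const and cannot be forgotten.
    explicit data_store(config const & cfg) :
        relaxed_fpr_correction_factor{[&]()
        {
            double const numerator =
                std::log1p(-std::exp(std::log(cfg.maximum_fpr) / cfg.number_of_hash_functions));
            double const denominator =
                std::log1p(-std::exp(std::log(cfg.relaxed_fpr) / cfg.number_of_hash_functions));
            return numerator / denominator;
        }()}
    {}

    // 1: Fail early instead of deep inside hierarchical binning.
    void validate() const
    {
        if (!(relaxed_fpr_correction_factor > 0.0 && relaxed_fpr_correction_factor <= 1.0))
            throw std::invalid_argument{"data_store is not set up correctly."};
    }

    double const relaxed_fpr_correction_factor;
};

Validation would then be a single data_store::validate() call right before hierarchical binning starts.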

@eseiler merged commit 14026d0 into seqan:main on Sep 10, 2024
37 checks passed
@eseiler mentioned this pull request on Sep 10, 2024
@smehringer deleted the relaxed_fpr branch on September 11, 2024