
Performance improvement 3: implement string cache #196

Draft · wants to merge 1 commit into master
Conversation

msm-code (Contributor) commented:

This replaces #195. The two PRs are incompatible with each other (though the general idea from #195 can still be implemented later, on top of the cache).

The idea is very simple - "just" use a string cache, and when the same string is requested twice, return the cached result.

There are a few implementation problems that make it more complicated than it sounds (a sketch follows the list below):

  • Most strings only occur once, so there's no point in caching them. I didn't measure the memory usage difference, but I'd like to avoid OOMs (especially for people with large datasets). So we need a way to count string occurrences in a query.
  • This means that the oracle should get std::vector<PrimitiveQuery> as a parameter instead of just a single PrimitiveQuery. That makes the code a bit uglier.
  • Finally, PrimitiveQuery must be usable as a key in an STL container, so we must give it a < operator.
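A minimal sketch of the whole mechanism under these constraints (the PrimitiveQuery fields, the QueryResult alias, and the CachingOracle class are illustrative placeholders here, not the actual PR code):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <vector>

struct PrimitiveQuery {
    uint32_t ntype;       // hypothetical field: which index type is queried
    std::string trigram;  // hypothetical field: the raw bytes being matched

    // Needed so PrimitiveQuery can be used as a std::map key below.
    bool operator<(const PrimitiveQuery &other) const {
        return std::tie(ntype, trigram) < std::tie(other.ntype, other.trigram);
    }
};

using QueryResult = std::vector<uint32_t>;  // placeholder for a file-id list

class CachingOracle {
    std::map<PrimitiveQuery, int> counts_;
    std::map<PrimitiveQuery, QueryResult> cache_;

public:
    // First pass: the oracle receives the whole query up front, so it can
    // count how often each string occurs and cache only the repeated ones.
    explicit CachingOracle(const std::vector<PrimitiveQuery> &queries) {
        for (const auto &q : queries) {
            counts_[q]++;
        }
    }

    QueryResult query(const PrimitiveQuery &q) {
        auto it = cache_.find(q);
        if (it != cache_.end()) {
            return it->second;  // cache hit: no index read needed
        }
        QueryResult result = expensive_index_read(q);
        if (counts_[q] > 1) {
            cache_.emplace(q, result);  // only cache strings seen twice+
        }
        return result;
    }

private:
    // Stub standing in for the real on-disk index lookup.
    QueryResult expensive_index_read(const PrimitiveQuery &) { return {}; }
};
```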

Nevertheless, the results look good:
https://github.com/msm-code/ursa-bench/blob/master/results/3_cache_hdd_all.txt
https://github.com/msm-code/ursa-bench/blob/master/results/hdd_all.html
http://65.21.130.153:8000/hdd_all.html

❯ python3 benchcompare.py results/2_earlyexit_hdd_all.txt results/3_cache_hdd_all.txt | head
apt_apt6_malware.yar: read: 8184 vs 4092 (-49.999994%)
apt_apt19.yar: read: 1427 vs 1243 (-12.894175%)
apt_apt28.yar: read: 1598 vs 1393 (-12.828528%)
apt_apt29_grizzly_steppe.yar: read: 3505 vs 3354 (-4.308130%)
apt_apt30_backspace.yar: read: 19599 vs 15891 (-18.919332%)
apt_aus_parl_compromise.yar: read: 6254 vs 3992 (-36.168846%)
apt_buckeye.yar: read: 2431 vs 1864 (-23.323725%)
apt_cheshirecat.yar: read: 1077 vs 703 (-34.726059%)
apt_cn_netfilter.yar: read: 4273 vs 3819 (-10.624851%)
apt_cn_pp_zerot.yar: read: 2906 vs 1862 (-35.925659%)

Where it helps, it speeds things up by 10-50%. On the whole corpus this gives us a 20% speedup, which is not game-changing, but not terrible either.

@msm-code msm-code marked this pull request as draft December 15, 2022 18:35
@msm-code (Contributor, Author) commented:

Never mind, scratch that too. I think we're better off with a per-ngram cache. I thought it would consume crazy amounts of RAM, but it actually looks reasonable (at least at my dataset sizes). And the performance improvements are significant (at least judging by the total number of reads on the test corpus):

  • version 2 (with early exit): 1839k reads
  • version 3 (the string cache from this PR): 1463k reads
  • per-ngram cache: 899k reads

And we can even combine that with the previous PR.

I think the ideal solution would be:

  • per-ngram cache
  • a limit on the cache size (for example, no more than 100 MB)
  • making this limit configurable (in the database config)

And when all this is done, this may actually be good enough for the next release.
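A rough sketch of what a size-capped per-ngram cache could look like (the TriGram and FileList aliases, the FIFO eviction policy, and all names here are assumptions for illustration, not UrsaDB's actual implementation):

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <vector>

using TriGram = uint32_t;                // placeholder ngram type
using FileList = std::vector<uint32_t>;  // placeholder posting list

class NgramCache {
    size_t max_bytes_;  // e.g. 100 MB, read from the database config
    size_t used_bytes_ = 0;
    std::unordered_map<TriGram, FileList> cache_;
    std::deque<TriGram> insertion_order_;  // for simple FIFO eviction

public:
    explicit NgramCache(size_t max_bytes) : max_bytes_(max_bytes) {}

    // Returns the cached posting list, or nullptr on a cache miss.
    const FileList *get(TriGram ngram) const {
        auto it = cache_.find(ngram);
        return it == cache_.end() ? nullptr : &it->second;
    }

    void put(TriGram ngram, FileList files) {
        if (cache_.count(ngram) > 0) {
            return;  // already cached
        }
        size_t cost = files.size() * sizeof(uint32_t);
        if (cost > max_bytes_) {
            return;  // never cache an entry bigger than the whole budget
        }
        // Evict oldest entries until the new one fits under the cap.
        while (used_bytes_ + cost > max_bytes_ && !insertion_order_.empty()) {
            TriGram victim = insertion_order_.front();
            insertion_order_.pop_front();
            used_bytes_ -= cache_[victim].size() * sizeof(uint32_t);
            cache_.erase(victim);
        }
        used_bytes_ += cost;
        insertion_order_.push_back(ngram);
        cache_.emplace(ngram, std::move(files));
    }
};
```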
