
Performance improvement 3: implement string cache #196

Draft · wants to merge 1 commit into master
Conversation

msm-code (Contributor) commented:

This replaces #195. The two PRs are incompatible with each other (though the general idea from #195 can still be implemented later, on top of the cache).

The idea is very simple - "just" use a string cache, and when the same string is requested twice, return the cached result.

There are a few implementation problems that make it more complicated than it sounds (a sketch follows the list below):

  • Most strings only occur once, so there's no point in caching them. I didn't measure the memory usage difference, but I'd like to avoid OOMs (especially for people with large datasets). So we need a way to count string occurrences in a query.
  • This means that the oracle should get std::vector<PrimitiveQuery> as a parameter instead of just a single PrimitiveQuery. That makes the code a bit uglier.
  • Finally, PrimitiveQuery must be usable as a key in an STL container, so we must give it a < operator.
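A minimal sketch of the whole mechanism under these constraints (the PrimitiveQuery fields, the QueryResult alias, and the CachingOracle class are illustrative placeholders here, not the actual PR code):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <tuple>
#include <vector>

struct PrimitiveQuery {
    uint32_t ntype;       // hypothetical field: which index type is queried
    std::string trigram;  // hypothetical field: the raw bytes being matched

    // Needed so PrimitiveQuery can be used as a std::map key below.
    bool operator<(const PrimitiveQuery &other) const {
        return std::tie(ntype, trigram) < std::tie(other.ntype, other.trigram);
    }
};

using QueryResult = std::vector<uint32_t>;  // placeholder for a file-id list

class CachingOracle {
    std::map<PrimitiveQuery, int> counts_;
    std::map<PrimitiveQuery, QueryResult> cache_;

public:
    // First pass: the oracle receives the whole query up front, so it can
    // count how often each string occurs and cache only the repeated ones.
    explicit CachingOracle(const std::vector<PrimitiveQuery> &queries) {
        for (const auto &q : queries) {
            counts_[q]++;
        }
    }

    QueryResult query(const PrimitiveQuery &q) {
        auto it = cache_.find(q);
        if (it != cache_.end()) {
            return it->second;  // cache hit: no index read needed
        }
        QueryResult result = expensive_index_read(q);
        if (counts_[q] > 1) {
            cache_.emplace(q, result);  // only cache strings seen twice+
        }
        return result;
    }

private:
    // Stub standing in for the real on-disk index lookup.
    QueryResult expensive_index_read(const PrimitiveQuery &) { return {}; }
};
```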

Nevertheless, the results look good:
https://github.com/msm-code/ursa-bench/blob/master/results/3_cache_hdd_all.txt
https://github.com/msm-code/ursa-bench/blob/master/results/hdd_all.html
http://65.21.130.153:8000/hdd_all.html

❯ python3 benchcompare.py results/2_earlyexit_hdd_all.txt results/3_cache_hdd_all.txt | head
apt_apt6_malware.yar: read: 8184 vs 4092 (-49.999994%)
apt_apt19.yar: read: 1427 vs 1243 (-12.894175%)
apt_apt28.yar: read: 1598 vs 1393 (-12.828528%)
apt_apt29_grizzly_steppe.yar: read: 3505 vs 3354 (-4.308130%)
apt_apt30_backspace.yar: read: 19599 vs 15891 (-18.919332%)
apt_aus_parl_compromise.yar: read: 6254 vs 3992 (-36.168846%)
apt_buckeye.yar: read: 2431 vs 1864 (-23.323725%)
apt_cheshirecat.yar: read: 1077 vs 703 (-34.726059%)
apt_cn_netfilter.yar: read: 4273 vs 3819 (-10.624851%)
apt_cn_pp_zerot.yar: read: 2906 vs 1862 (-35.925659%)

Where it helps, it speeds things up by 10-50%. On the whole corpus this gives us a 20% speedup, which is not game-changing, but not terrible either.

@msm-code msm-code marked this pull request as draft December 15, 2022 18:35
@msm-code (Contributor, Author) commented:

Never mind, scratch that too. I think we're better off with a per-ngram cache. I thought it would consume crazy amounts of RAM, but it actually looks reasonable (at least at my dataset sizes). And the performance improvements are significant (at least judging by the total number of reads on the test corpus):

  • version 2 (with early exit): 1839k reads
  • version 3 (the string cache from this PR): 1463k reads
  • per-ngram cache: 899k reads

And we can even combine that with the previous PR.

I think the ideal solution would be:

  • per-ngram cache
  • a limit on the cache size (for example, no more than 100 MB)
  • making this limit configurable (in the database config)

And when all this is done, this may actually be good enough for the next release.
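A rough sketch of what a size-capped per-ngram cache could look like (the TriGram and FileList aliases, the FIFO eviction policy, and all names here are assumptions for illustration, not UrsaDB's actual implementation):

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <vector>

using TriGram = uint32_t;                // placeholder ngram type
using FileList = std::vector<uint32_t>;  // placeholder posting list

class NgramCache {
    size_t max_bytes_;  // e.g. 100 MB, read from the database config
    size_t used_bytes_ = 0;
    std::unordered_map<TriGram, FileList> cache_;
    std::deque<TriGram> insertion_order_;  // for simple FIFO eviction

public:
    explicit NgramCache(size_t max_bytes) : max_bytes_(max_bytes) {}

    // Returns the cached posting list, or nullptr on a cache miss.
    const FileList *get(TriGram ngram) const {
        auto it = cache_.find(ngram);
        return it == cache_.end() ? nullptr : &it->second;
    }

    void put(TriGram ngram, FileList files) {
        if (cache_.count(ngram) > 0) {
            return;  // already cached
        }
        size_t cost = files.size() * sizeof(uint32_t);
        if (cost > max_bytes_) {
            return;  // never cache an entry bigger than the whole budget
        }
        // Evict oldest entries until the new one fits under the cap.
        while (used_bytes_ + cost > max_bytes_ && !insertion_order_.empty()) {
            TriGram victim = insertion_order_.front();
            insertion_order_.pop_front();
            used_bytes_ -= cache_[victim].size() * sizeof(uint32_t);
            cache_.erase(victim);
        }
        used_bytes_ += cost;
        insertion_order_.push_back(ngram);
        cache_.emplace(ngram, std::move(files));
    }
};
```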
