Cap eviction effort (CPU under stress) in HyperClockCache #12141
Summary: HyperClockCache (HCC) is intended to mitigate performance problems under stress conditions (as well as optimizing average-case parallel performance). In LRUCache, the biggest such problem is lock contention when one or a small number of cache entries become particularly hot. Regardless of cache sharding, accesses to any particular cache entry are linearized against a single mutex, which is held while each access updates the LRU list. All HCC variants are fully lock/wait-free for accessing blocks already in the cache, which fully mitigates this contention problem.

However, HCC (and CLOCK in general) can exhibit extremely degraded performance under a different stress condition: when no (or almost no) entries in a cache shard are evictable because they are pinned. Unlike LRU, which can find any evictable entry immediately (at the cost of more coordination / synchronization on each access), CLOCK has to search for evictable entries. Under the right conditions (almost exclusively MB-scale caches, not GB-scale), the CPU cost of each cache miss could fall off a cliff and bog down the whole system.

To (IMHO) effectively mitigate this problem, I'm introducing a new default behavior and tuning parameter for HCC, `eviction_effort_cap`. See the comments on the new config parameter in the public API.

Test Plan: unit test included

## Performance test

We can use cache_bench to validate that there is no regression (CPU and memory) in normal operation, and to measure the change in behavior when the cache is almost entirely pinned. (TODO: I'm not sure why I had to get the pinned ratio parameter well over 1.0 to see truly bad performance, but the behavior is there.) Build with `make DEBUG_LEVEL=0 USE_CLANG=1 PORTABLE=0 cache_bench`. We also set `MALLOC_CONF="narenas:1"` for all these runs to essentially remove jemalloc variances from the results, so that the max RSS given by /usr/bin/time is essentially ideal (assuming the allocator minimizes fragmentation and other memory overheads well).

Base command reproducing bad behavior:

```
./cache_bench -cache_type=auto_hyper_clock_cache -threads=12 -histograms=0 -pinned_ratio=1.7
```

```
Before, LRU (alternate baseline not exhibiting bad behavior):
Rough parallel ops/sec = 2290997
1088060 maxresident

Before, AutoHCC (bad behavior):
Rough parallel ops/sec = 141011 <- Yes, more than 10x slower
1083932 maxresident
```

Now let us sample a range of values in the solution space:

```
After, AutoHCC, eviction_effort_cap = 1:
Rough parallel ops/sec = 3212586
2402216 maxresident

After, AutoHCC, eviction_effort_cap = 10:
Rough parallel ops/sec = 2371639
1248884 maxresident

After, AutoHCC, eviction_effort_cap = 30:
Rough parallel ops/sec = 1981092
1131596 maxresident

After, AutoHCC, eviction_effort_cap = 100:
Rough parallel ops/sec = 1446188
1090976 maxresident

After, AutoHCC, eviction_effort_cap = 1000:
Rough parallel ops/sec = 549568
1084064 maxresident
```

It looks like `cap=30` is a sweet spot balancing acceptable CPU and memory overheads, so it is chosen as the default.

```
Change to -pinned_ratio=0.85

Before, LRU:
Rough parallel ops/sec = 2108373
1078232 maxresident

Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2164910
1077312 maxresident

After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2145542
1077216 maxresident
```

The slight CPU improvement above is consistent with the cap, with no measurable memory overhead under moderate stress.

```
Change to -pinned_ratio=0.25 (low stress)

Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2221149
1076540 maxresident

After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2224521
1076664 maxresident
```

No measurable difference under normal circumstances.

Some tests were repeated with FixedHCC, with similar results.
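To make the failure mode in the summary concrete, here is a minimal toy model of a CLOCK sweep over one shard, with and without an effort cap. It is a sketch for illustration only and is not the RocksDB implementation: `Slot`, `EvictionResult`, `Evict`, `max_steps`, and in particular the abort rule (stop once the number of pinned entries seen reaches `effort_cap * (freed + 1)`) are all assumptions made up for this example. The real semantics of `eviction_effort_cap` are documented in the comments on the config parameter in the public API.

```cpp
#include <cstddef>
#include <vector>

// Toy model of one CLOCK cache shard. Illustration only; not RocksDB code.
struct Slot {
  bool occupied = false;
  bool pinned = false;  // held by a reader, so not evictable
  int countdown = 0;    // CLOCK counter; evictable once it reaches 0
};

struct EvictionResult {
  std::size_t freed = 0;        // entries evicted
  std::size_t seen_pinned = 0;  // pinned entries skipped over
  std::size_t steps = 0;        // slots examined (a proxy for CPU cost)
  bool aborted = false;         // true if the effort cap ended the search
};

// Sweeps the clock hand until `want` entries are freed, the cap is hit, or
// `max_steps` slots have been examined.
// Hypothetical abort rule (an assumption, not taken from the RocksDB source):
// give up once seen_pinned >= effort_cap * (freed + 1).
// effort_cap <= 0 disables the cap in this toy model.
EvictionResult Evict(std::vector<Slot>& slots, std::size_t& hand,
                     std::size_t want, int effort_cap,
                     std::size_t max_steps) {
  EvictionResult r;
  if (slots.empty()) return r;
  while (r.freed < want && r.steps < max_steps) {
    Slot& s = slots[hand];
    hand = (hand + 1) % slots.size();
    ++r.steps;
    if (!s.occupied) continue;
    if (s.pinned) {
      ++r.seen_pinned;
      if (effort_cap > 0 &&
          r.seen_pinned >=
              static_cast<std::size_t>(effort_cap) * (r.freed + 1)) {
        r.aborted = true;  // stop burning CPU; caller may exceed capacity
        break;
      }
      continue;
    }
    if (s.countdown > 0) {
      --s.countdown;  // give the entry another lap before eviction
      continue;
    }
    s = Slot{};  // evict the entry
    ++r.freed;
  }
  return r;
}
```

In this toy model, a fully pinned shard makes an uncapped sweep run until `max_steps`, while a capped sweep gives up after a bounded number of pinned entries. The price of giving up is that the caller inserts without having freed space, which is in line with the higher max RSS reported above for small cap values.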
LGTM! Would it be useful to emit a warning or have some sort of stat if this happens too often? IIRC, there's a periodic dump of cache related stats into the info log.
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pdillinger has updated the pull request. You must reimport the pull request before landing.
@pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pdillinger merged this pull request in 88bc91f.