Cap eviction effort (CPU under stress) in HyperClockCache #12141

Closed
pdillinger wants to merge 6 commits

Conversation

pdillinger (Contributor)


Summary: HyperClockCache is intended to mitigate performance problems
under stress conditions (as well as optimizing average-case parallel
performance). In LRUCache, the biggest such problem is lock contention
when one or a small number of cache entries becomes particularly hot.
Regardless of cache sharding, accesses to any particular cache entry are
linearized against a single mutex, which is held while each access
updates the LRU list. All HCC variants are fully lock/wait-free for
accessing blocks already in the cache, which fully mitigates this
contention problem.
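
To illustrate the contrast, here is a minimal sketch (not RocksDB's
actual code; all names are invented for illustration) of why an LRU
lookup serializes on the shard mutex while a CLOCK-style lookup does not:

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

// LRU-style shard: every lookup, even of an already-cached entry, must
// take the shard mutex to splice the entry to the head of the recency
// list, so a single hot entry serializes all of its readers.
struct LruHandle {
  LruHandle* prev = nullptr;
  LruHandle* next = nullptr;
};

struct LruShard {
  std::mutex mu;
  void OnLookup(LruHandle* h) {
    std::lock_guard<std::mutex> guard(mu);  // contention point for hot entries
    // ... unlink h and re-insert it at the head of the LRU list ...
    (void)h;
  }
};

// CLOCK-style entry: a lookup only bumps an atomic on the entry itself,
// so concurrent reads of a hot block never wait on each other.
struct ClockHandle {
  std::atomic<uint64_t> meta{0};  // packed ref count + usage/clock bits
};

inline void OnLookup(ClockHandle* h) {
  h->meta.fetch_add(1, std::memory_order_acquire);  // lock/wait-free
}
```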

However, HCC (and CLOCK in general) can exhibit extremely degraded
performance under a different stress condition: when no (or almost no)
entries in a cache shard are evictable (they are all pinned). Unlike
LRU, which can find any evictable entries immediately (at the cost of
more coordination/synchronization on each access), CLOCK has to search
for evictable entries. Under the right conditions (almost exclusively
MB-scale caches, not GB-scale), the CPU cost of each cache miss can
spike, so performance falls off a cliff and bogs down the whole system.
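
Roughly why that search gets pathological, as an illustrative sketch
(not the real HCC data structure): a CLOCK eviction scan walks the ring
of slots, skipping pinned entries and decrementing usage counters, so
with (almost) no evictable entries each miss can walk the entire shard
one or more times:

```cpp
#include <cstddef>
#include <vector>

struct Slot {
  bool pinned = false;  // externally referenced; cannot be evicted
  int counter = 0;      // CLOCK "second chance" usage counter
};

// Returns an evictable slot, or nullptr after max_steps of effort.
// Before this PR the effort was effectively unbounded; with almost
// every slot pinned, each cache miss re-scans the shard and burns CPU.
// Assumes ring is non-empty.
Slot* ClockEvict(std::vector<Slot>& ring, size_t& clock_ptr,
                 size_t max_steps) {
  for (size_t steps = 0; steps < max_steps; ++steps) {
    Slot& s = ring[clock_ptr];
    clock_ptr = (clock_ptr + 1) % ring.size();
    if (s.pinned) continue;  // skip pinned entries
    if (s.counter > 0) {
      --s.counter;  // give a second chance, keep scanning
      continue;
    }
    return &s;  // found a victim
  }
  return nullptr;  // gave up: this is where an effort cap kicks in
}
```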

To effectively mitigate this problem (IMHO), I'm introducing a new
default behavior and tuning parameter for HCC, `eviction_effort_cap`.
See the comments on the new config parameter in the public API.
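
As a usage sketch, assuming the option is exposed on
`HyperClockCacheOptions` as the public-API comment mentioned above
describes (capacity and other values here are illustrative, not from
this PR):

```cpp
#include <memory>

#include "rocksdb/cache.h"

int main() {
  // MB-scale caches are where the pathological eviction scans show up.
  rocksdb::HyperClockCacheOptions opts(
      /*capacity=*/64 << 20,
      /*estimated_entry_charge=*/0);  // 0 selects AutoHCC
  opts.eviction_effort_cap = 30;  // the new knob; 30 is the default chosen below
  std::shared_ptr<rocksdb::Cache> cache = opts.MakeSharedCache();
  // ... use e.g. as BlockBasedTableOptions::block_cache ...
  return 0;
}
```

The numbers below suggest the trade the cap controls: a low cap gives up
on eviction sooner, temporarily exceeding capacity (more memory), while
a high cap keeps scanning (more CPU).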

Test Plan: unit test included

## Performance test

We can use cache_bench to validate that there is no regression (CPU and
memory) in normal operation, and to measure the change in behavior when
the cache is almost entirely pinned. (TODO: I'm not sure why I had to
get the pinned ratio parameter well over 1.0 to see truly bad
performance, but the behavior is there.) Build with `make DEBUG_LEVEL=0
USE_CLANG=1 PORTABLE=0 cache_bench`. We also set
`MALLOC_CONF="narenas:1"` for all these runs to remove jemalloc variance
from the results, so that the max RSS given by `/usr/bin/time` is
essentially ideal (assuming the allocator minimizes fragmentation and
other memory overheads well). Base command reproducing the bad behavior:

```
MALLOC_CONF="narenas:1" /usr/bin/time ./cache_bench -cache_type=auto_hyper_clock_cache -threads=12 -histograms=0 -pinned_ratio=1.7
```

```
Before, LRU (alternate baseline not exhibiting bad behavior):
Rough parallel ops/sec = 2290997
1088060 maxresident

Before, AutoHCC (bad behavior):
Rough parallel ops/sec = 141011 <- Yes, more than 10x slower
1083932 maxresident
```

Now let us sample a range of values in the solution space:

```
After, AutoHCC, eviction_effort_cap = 1:
Rough parallel ops/sec = 3212586
2402216 maxresident

After, AutoHCC, eviction_effort_cap = 10:
Rough parallel ops/sec = 2371639
1248884 maxresident

After, AutoHCC, eviction_effort_cap = 30:
Rough parallel ops/sec = 1981092
1131596 maxresident

After, AutoHCC, eviction_effort_cap = 100:
Rough parallel ops/sec = 1446188
1090976 maxresident

After, AutoHCC, eviction_effort_cap = 1000:
Rough parallel ops/sec = 549568
1084064 maxresident
```

It looks like `cap=30` is a sweet spot balancing acceptable CPU and
memory overheads (roughly 14% fewer ops/sec and 4% higher max RSS than
the LRU baseline under this extreme stress), so it is chosen as the
default.

```
Change to -pinned_ratio=0.85
Before, LRU:
Rough parallel ops/sec = 2108373
1078232 maxresident

Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2164910
1077312 maxresident

After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2145542
1077216 maxresident
```

The slight CPU difference above is consistent with the cap, with no
measurable memory overhead under moderate stress.

```
Change to -pinned_ratio=0.25 (low stress)
Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2221149
1076540 maxresident

After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2224521
1076664 maxresident
```

No measurable difference under normal circumstances.

Some tests were repeated with FixedHCC, with similar results.
@anand1976 (Contributor) left a comment:

LGTM! Would it be useful to emit a warning or have some sort of stat if this happens too often? IIRC, there's a periodic dump of cache-related stats into the info log.

@facebook-github-bot: @pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot: @pdillinger has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot: @pdillinger has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot: @pdillinger merged this pull request in 88bc91f.
