
Experiment with S3-FIFO eviction policy #29

Open
aktau opened this issue Aug 28, 2023 · 87 comments
@aktau

aktau commented Aug 28, 2023

Just saw this pass by on Hacker News: https://blog.jasony.me/system/cache/2023/08/01/s3fifo. Seems interesting. I wonder if it outperforms (not just in hit ratio, but also CPU time to calculate whether to evict) TinyLFU.

@Yiling-J
Owner

@aktau Thank you for sharing the information! I think we can take a look at this discussion by Ben, the author of Caffeine/TinyLFU: 1a1a11a/libCacheSim#20

@ben-manes

I got a new laptop last week (m3 max 14-core) and ran Caffeine's benchmark using 16 threads. At close to 1B reads/s (40% of an unbounded baseline), I have to disagree with those authors that this represents a scalability bottleneck. There is also an independent hit rate analysis by a library author who is trying to adopt their algorithm with their guidance. That shows a large fluctuation where the hit rate can decrease at larger sizes. It does not appear yet to be a reliably general purpose algorithm.

[Screenshot: Caffeine 16-thread benchmark results, 2023-11-24]

@maypok86

maypok86 commented Dec 5, 2023

Let me give my opinion as well, since I've already made quite a few attempts at adopting it.

  1. First, I tried to implement the cache on a hash table and lock-free queues, as the authors describe. To my annoyance, S3-FIFO was indeed many times faster than an LRU cache built on a hash table with a global mutex, but it still lost to ristretto and theine, which use BP-Wrapper techniques. The profiler also showed that the eviction policy was where the bulk of the time went. I was only able to beat ristretto and theine after rewriting the implementation using BP-Wrapper.

  2. Actually, the big hit-ratio jumps are caused by differences in how the ghost queue is implemented. In the paper the authors suggest keeping the full items in the ghost queue (this is what redpanda does now, and I tried it this way too). The large jumps seem to come from the fact that the algorithm limits the size of the ghost queue to the size of the main queue. With a larger cache capacity the small queue's capacity grows => we store items longer => more items reach the main queue => the ghost queue also stores more items. In the end I concluded that storing small queue capacity + main queue capacity + ghost queue capacity = 0.1 capacity + 0.9 capacity + 0.9 capacity = 1.9x the cache capacity in items is cheating. So I decided to try storing only the hashes of items in the ghost queue and evicting them in FIFO order (a sketch of that idea follows at the end of this comment). And got frustrated 😔. Limiting the ghost queue to the size of the main queue works very badly at small cache sizes. The best I got tonight was limiting the ghost queue to the size of the entire cache, with these results: r.txt. S3-FIFO (with minor modifications) actually outperformed or was about equal to W-TinyLFU on all traces except S3 and DS1, without much fluctuation.

In summary, I completely disagree about S3-FIFO's scalability advantage over the other policies. In terms of hit ratio S3-FIFO can indeed often outperform W-TinyLFU, but it struggles on LFU-friendly traces. I think S3-FIFO has a right to live, especially if someone can make something more adaptive out of it, but the thought of dropping everything and rewriting it all on top of W-TinyLFU is getting more and more tempting 🙂.
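
For reference, here is a minimal sketch of the "hashes in a FIFO ghost queue" idea from point 2 above. It is illustrative only (made-up names, no concurrency), not otter's actual code:

package ghost

// ghostQueue keeps only the hashes of recently evicted items, in FIFO order.
type ghostQueue struct {
	hashes []uint64            // ring buffer of hashes, displaced in FIFO order
	seen   map[uint64]struct{} // membership check used on a cache miss
	head   int
	size   int
}

// newGhostQueue creates a ghost queue; capacity must be greater than zero.
func newGhostQueue(capacity int) *ghostQueue {
	return &ghostQueue{
		hashes: make([]uint64, capacity),
		seen:   make(map[uint64]struct{}, capacity),
	}
}

// add records the hash of an evicted item, displacing the oldest hash when the
// queue is full. Duplicate hashes are tolerated; a re-added hash may simply
// expire a little earlier than expected.
func (g *ghostQueue) add(h uint64) {
	if g.size == len(g.hashes) {
		delete(g.seen, g.hashes[g.head])
	} else {
		g.size++
	}
	g.hashes[g.head] = h
	g.seen[h] = struct{}{}
	g.head = (g.head + 1) % len(g.hashes)
}

// contains reports whether the hash is still present in the ghost history.
func (g *ghostQueue) contains(h uint64) bool {
	_, ok := g.seen[h]
	return ok
}

The ring itself costs 8 bytes per ghost entry; the membership map adds more, and a real implementation would likely fold that check into the main hash table instead.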

@maypok86

maypok86 commented Dec 5, 2023

Though to be honest, I still don't understand what was causing the hit ratio to go down on DS1.

@ben-manes

ben-manes commented Dec 5, 2023

Thanks, @maypok86! It's always very interesting to hear an experience report.

I can't speak for any implementation other than my own, so it would be helpful if you could also compare hit rates against Caffeine in its simulator. Could you please give that a try? I expect that static W-TinyLFU will lose on recency-biased traces, as this was a known deficiency (shared with the S3-FIFO author in 2016 here). The adaptive scheme has worked very well in all of the workloads that I have tested, where the policy is competitive with the leading algorithm. I don't expect it to always win, but for a general-purpose cache, I wanted it to robustly be near the top in any user workload. There are likely areas deserving improvement, but without data, I have to wait until there is something to analyze and discuss.

If comparing just on hit rates, the best alternative policy that I have seen is LIRS2 (code, paper, video). However, LIRS (v1) is very complicated to implement correctly and debug, which is why I invested in improving TinyLFU, as it seemed to be a more promising approach that I could reasonably maintain. In their v2, it is competitive except for my adaptive stress test. Of course, in practice, one has to consider all of the other characteristics, like metadata overhead and concurrency needs. There are a lot of design tradeoffs to consider when deciding on a policy, as the best choice depends on the end goals.

I would have expected S3FIFO to support lock-free reads and blocking writes based on the description. Since the worst-case eviction time is O(n), that time under the lock could become a concern. I'd probably just evict the next item after a threshold scan limit because I would be concerned about an unknown user workload hitting the worst-case behavior (akin to GC thrashing causing long pauses). The BP-Wrapper approach amortizes those costs as it samples the read requests, and the read throughput exceeds practical needs, so the time it steals helps keep the critical section bounded and more predictable. It is more complex than using a Clock-based policy, but as an engineer, this predictability in tail latencies is worth the effort when providing a library. My very limited (and likely outdated) understanding is that Go does not have a very good suite of concurrent data structures and primitives, at least compared to Java's, so it can be much more difficult to implement efficient multi-threaded code. I don't recall the author implementing a thread-safe version, so there might be oversights, and I suppose those claims should be taken skeptically.
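
As an aside, here is a rough Go sketch of the "stop after a threshold scan limit" idea; it is not code from Caffeine, otter, or the S3-FIFO paper, and the entry type and names are invented for the example:

package cache

import "sync/atomic"

// entry is an illustrative cache entry with a Clock-style frequency counter.
type entry struct {
	key  string
	freq atomic.Int32
}

// evictOne scans at most maxScan entries from the head of a FIFO, giving
// recently used entries a second chance, and then force-evicts so the time
// spent under the eviction lock stays bounded even on a pathological workload.
func evictOne(fifo *[]*entry, maxScan int) *entry {
	q := *fifo
	for i := 0; i < maxScan && len(q) > 0; i++ {
		head := q[0]
		if head.freq.Load() == 0 {
			*fifo = q[1:]
			return head // cold entry: evict immediately
		}
		// Warm entry: clear its frequency and recycle it to the tail.
		head.freq.Store(0)
		q = append(q[1:], head)
	}
	if len(q) == 0 {
		*fifo = nil
		return nil
	}
	// Scan budget exhausted: evict the current head anyway.
	*fifo = q[1:]
	return q[0]
}

The point is only that the scan budget keeps the time under the eviction lock bounded, at the cost of occasionally evicting a warm entry.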

@Yiling-J
Owner

Yiling-J commented Dec 6, 2023

@maypok86 One thing I'm interested in: did you compare the performance of xsync map and a sharded map (throughput and GC)? According to the xsync author there is some overhead (puzpuzpuz/xsync#102). If xsync map always outperforms, I think I can also switch to it. The current Theine implementation is based on a sharded map, and I've done some optimizations on it. Also, as @ben-manes suggested, you can test hit rates using Caffeine's simulator; Theine's window queue and adaptive algorithm are a little different from Caffeine's.

@maypok86

maypok86 commented Dec 6, 2023

@Yiling-J I wanted to open a separate issue about this (and most likely about the theine benchmarks in general), but that would be a very long discussion, and it's much more important for me to finish work on otter right now. So I'll get to that, but much later 🙂. After all, I was able to get my ideas across to the author of memorycache, but it took a huge amount of time.

@ben-manes LIRS2 looks very smart; I'll try to look in its direction, but I'm not sure it would be easy to maintain such a thing. I also tried running the Caffeine simulator, and the verdict is about the same: S3-FIFO was within +-2% on all traces used in otter except S3 and DS1, and otter showed about the same hit ratio. On DS1 (6,000,000 capacity), for example, the loss was quite serious:

╔══════════════════╤══════════╤════════════╤════════════╤════════════╤════════════╤════════════╤═══════════╗
║ Policy           │ Hit Rate │ Hits       │ Misses     │ Requests   │ Evictions  │ Steps      │ Time      ║
╠══════════════════╪══════════╪════════════╪════════════╪════════════╪════════════╪════════════╪═══════════╣
║ opt.Clairvoyant  │ 61.82 %  │ 27,019,597 │ 16,685,382 │ 43,704,979 │ 10,685,382 │            │ 1.080 min ║
╟──────────────────┼──────────┼────────────┼────────────┼────────────┼────────────┼────────────┼───────────╢
║ product.Caffeine │ 58.51 %  │ 25,570,957 │ 18,134,022 │ 43,704,979 │ 12,134,022 │            │ 39.94 s   ║
╟──────────────────┼──────────┼────────────┼────────────┼────────────┼────────────┼────────────┼───────────╢
║ two-queue.Qdlp   │ 46.65 %  │ 20,390,489 │ 23,314,490 │ 43,704,979 │ 17,314,490 │ 43,704,979 │ 35.36 s   ║
╚══════════════════╧══════════╧════════════╧════════════╧════════════╧════════════╧════════════╧═══════════╝

By the way, I was quite surprised by the difference in hit ratio between theine and caffeine on P3 (25,000 capacity): 8% vs 14%.

Yes, perhaps the main problems with S3-FIFO are its inability to adapt and the possibility of an eviction taking O(n), and any attempt to fix this can lead to a worse hit ratio. In the short term, I don't know what to do about it.

Go really still doesn't have efficient concurrent data structures; there are only implementations scattered across different libraries, based on research papers or ported from other languages, and unfortunately most of them are barely maintained. This is largely what motivated me to build an efficient cache library under such limited circumstances.

@ben-manes

It looks like the ARC traces are no longer public due to an overzealous Dropbox security restriction, and unfortunately I did not use P3 regularly or recently enough to have it locally. I emailed the author to ask if they can be made available again, but if you have all of the downloads I might need you to publish them.

@maypok86

maypok86 commented Dec 6, 2023

@ben-manes I just took the traces from the ristretto benchmarks: https://github.com/dgraph-io/benchmarks/tree/master/cachebench/ristretto/trace (better to download them individually, as the repository is very large).

@1a1a11a

1a1a11a commented Dec 7, 2023

Hi all, I missed the fun party! This is the author of the S3-FIFO algorithm.
@maypok86 I just created an issue (maypok86/otter#17); I believe there is a potential bug causing more objects to be evicted. I think this should fix the fluctuation problem.
However, it is true that W-TinyLFU is better than S3-FIFO on the DS1 and S3 traces.
I believe this partially comes from not being adaptive, and this is one direction we are working on. In general, S3-FIFO is more suitable for web workloads (e.g., key-value cache, object cache), which follow a skewed popularity distribution with popularity decay.

On block workloads, especially old ones (like the ARC traces, which are more than twenty years old), S3-FIFO may not outperform W-TinyLFU. Similar results can also be found in our paper: https://dl.acm.org/doi/10.1145/3600006.3613147

With that being said, I believe that if you

  • need a very general cache replacement policy that can handle both block and web workloads,
  • don't mind the complexity,
  • don't mind the scalability cost,
  • and are OK with the extra metadata overhead,

then W-TinyLFU is still better than S3-FIFO. But if your workloads are mostly from web applications, then S3-FIFO should be a better choice.

@1a1a11a

1a1a11a commented Dec 7, 2023

Hi @ben-manes, can you give a more detailed description of the scalability benchmark you showed in the figure? I am interested in learning how you improved LRU's scalability. We did have a prototype in which we observed much better scalability than the TinyLFU implementation in the cachelib library.

@maypok86

maypok86 commented Dec 7, 2023

@1a1a11a I'm certainly not Ben, but it seems that BP-Wrapper and a set of wait-free lossy buffers can give even greater scalability than lock-free queues. Articles on the Caffeine architecture: first and second

@ben-manes

ben-manes commented Dec 7, 2023

Cachelib mimics memcached's scheme of an exclusive lock (optionally a try-lock) and a per-item time interval between policy updates (60s). That was good enough for their internal needs.

Caffeine uses dynamic striped, lock-free, lossy mpsc ring buffers to sample the request events. This differs from the author's BP-Wrapper implementation where they used an isolated system with a dedicated cache (Postgres), so they leveraged thread-local buffers as the memory usage and thread counts were known up front. That didn't make sense for a general purpose library embedded into an application which might have hundreds of instances, and each with different scalability needs. It is their design but with a different implementation approach.

The dynamic striping (the number of read buffers) increases as contention is detected, mirroring how a scalable counter is implemented. The ring buffers capture the events cheaply, and if a buffer is full then the event is dropped. Since a cache is probabilistic and trying to detect a usage pattern for hot-cold entries, this lossy behavior does not meaningfully impact the hit rates. Whether one uses a lossy buffer, a time threshold, a fifo clock, etc., the insight is the same: a perfect history of the access order is not required to make good predictions.

The writes are also buffered using a bounded MPSC queue, which grows up to a threshold. This allows the cache to absorb most independent writes without threads contending on each other and to batch the work to be replayed under the lock. These writes are cache misses, which indicate that the application had to perform an expensive load (10ms+). While in applications the write rate won't exceed the eviction rate, it can happen in a synthetic stress test. In those cases back-pressure is applied to the buffering, which assists in batching the work to avoid context switches or lock contention. Generally writes are limited by the hash table's write throughput, as shown in those benchmarks. The hash table offers lock-free reads, uses dynamically striped locks for writes, offers computations under the fine-grained lock, and protects against hash flooding by switching the buckets from linked lists to red-black trees when needed.

The buffering of events allows us to schedule the draining as appropriate, either after a write or when a read buffer is full. That uses a simple state machine and a try-lock to coordinate, so the eviction policy does not suffer from lock contention. As BP-Wrapper describes itself, it "(almost) eliminates lock contention for any replacement algorithm without requiring any changes to the algorithm". This allows for more exploration of algorithms and data structures because the policies do not need to be thread-safe.
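
To make the shape of that concrete, here is a rough Go sketch of the pattern (Caffeine itself is Java and uses striped, lock-free ring buffers rather than a channel; every name and size below is illustrative):

package bpwrapper

import "sync"

type node struct{ key string }

// policy buffers read events and replays them under a try-lock, so readers
// never block on the (non-thread-safe) eviction policy.
type policy struct {
	mu     sync.Mutex // guards the eviction policy state
	buf    chan *node // bounded channel standing in for a lossy read buffer
	access []*node    // stand-in for the real policy's access-order state
}

func newPolicy() *policy {
	return &policy{buf: make(chan *node, 64)}
}

// recordRead notes a cache hit. If the buffer is full the event is simply
// dropped: a lossy sample is enough for the policy to observe the heavy hitters.
func (p *policy) recordRead(n *node) {
	select {
	case p.buf <- n:
		if len(p.buf) == cap(p.buf) {
			p.tryDrain()
		}
	default:
		// Full buffer: shed load instead of blocking the reader.
	}
}

// tryDrain replays buffered events if the lock is free; contending threads
// skip the maintenance rather than queueing up behind it.
func (p *policy) tryDrain() {
	if !p.mu.TryLock() {
		return
	}
	defer p.mu.Unlock()
	for {
		select {
		case n := <-p.buf:
			p.access = append(p.access, n) // apply the access to the policy
		default:
			return
		}
	}
}

A channel is used here only to keep the sketch short; the essential properties are that a full buffer drops events and that maintenance is attempted with a try-lock rather than waited on.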

The benchmark that I use has a pre-populated cache with a zipf distribution for the access requests. That creates hot spots, as some entries are used much more often than others, just like a real cache would experience. A common mistake is a uniform distribution, which evenly spreads the contention and would imply that random replacement is the ideal policy. A skewed distribution means that locks suffer higher contention, so it exacerbates that problem, while also benefiting contention-free implementations which can better utilize the CPU cache. In the real world, 50-70M reads/s is a good bar to reach to satisfy most users, and the focus then turns towards other areas like hit rate and features.

@ben-manes

ben-manes commented Dec 9, 2023

you need a very general cache replacement that can handle both block and web workloads

Yes

and don't mind the complexity

Yes. There are more pieces, each simple, which offer greater predictability. It is a tradeoff of a little more upfront work for fewer runtime surprises.

don't mind scalability

No. If we assume zero overhead (an unbounded map) then the measured per operation overhead is,
1.06 nanoseconds − 0.435 nanoseconds ≈ 0.625 nanoseconds
A Clock-based policy would fall in the middle, e.g. 0.7 ns. At these numbers the cache would not be the scalability bottleneck and it would be bounded by something else in the system.

ok with extra metadata overhead

This seems debatable and would require being more explicit about the calculations.

If we assume a traditional ghost history based on a doubly-linked list, closed hashing, and 64-bit machine then
List node (2 x 8 bytes) + Hash Entry (2 x 8 bytes) + key hash (4 bytes) = 36 bytes
This does not account for wasted space due to the load factor, object headers, etc that might be required.

If optimized for space, we can say the lower limit must be at least the 32-bit key hash (4 bytes). As in, we increase the time complexity for find / add / delete through an array of hashes. This seems to be an unlikely implementation choice and it will be higher in practice.

Caffeine uses a CountMinSketch sized at 8 bytes x ceilingPowerOfTwo(maximumSize) to hold 4-bit counters. This means that each entry adds 8-16 bytes. The paper discusses the doorkeeper optimization (adding a bloom filter) to reduce the size of the CountMinSketch. In my experiments this allowed the total memory overhead to be reduced by 50% without meaningfully impacting the hit rate. Thus, we are looking at a range between 4-16 bytes per entry for maintaining the history metadata.
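
As a rough illustration of that sizing math, here is a sketch in Go; it only loosely mirrors the description above (a single row of counters rather than a count-min sketch's several hashed rows, and a placeholder scheme for picking the counter within a word):

package sketch

import "math/bits"

// freqSketch packs sixteen 4-bit counters into each 64-bit word, with one word
// per cache entry rounded up to a power of two, i.e. 8-16 bytes per entry.
type freqSketch struct {
	table []uint64
	mask  uint64
}

// newFreqSketch sizes the table to ceilingPowerOfTwo(maximumSize) words;
// maximumSize must be at least 1.
func newFreqSketch(maximumSize int) *freqSketch {
	n := 1 << bits.Len(uint(maximumSize-1))
	return &freqSketch{table: make([]uint64, n), mask: uint64(n - 1)}
}

// increment bumps one 4-bit counter for the hash, saturating at 15.
func (s *freqSketch) increment(hash uint64) {
	idx := hash & s.mask
	shift := ((hash >> 32) & 15) * 4 // which of the 16 counters in the word
	if (s.table[idx]>>shift)&0xF < 15 {
		s.table[idx] += 1 << shift
	}
}

// frequency reads that counter back.
func (s *freqSketch) frequency(hash uint64) uint64 {
	idx := hash & s.mask
	shift := ((hash >> 32) & 15) * 4
	return (s.table[idx] >> shift) & 0xF
}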

The extra metadata overhead of maintaining the history seems to be, at best, equivalent. In practice it appears more likely that the frequency sketch will be a lower cost than a ghost list. The size of the ghost list is the unknown factor, e.g. @maypok86 uses 0.9x. If it was instead significantly smaller than the total capacity, e.g. 0.2x, then it could be favorable.

The resident per entry metadata overhead is less clear. W-TinyLFU describes only an algorithmic structure and the paper discusses the implementation choices made by Caffeine for the purpose of an evaluation. Ristretto decided to use sampled LFU with TinyLFU (doorkeeper + count-min) to reduce the metadata overhead, based on their analysis that their use-case was frequency-biased. It appears to me that S3-FIFO and W-TinyLFU would be equivalent for resident entries in their common implementations.

@maypok86

maypok86 commented Dec 9, 2023

Ugh, I urgently need a popcorn reaction on GitHub. I'm probably not competent enough to argue about S3-FIFO and W-TinyLFU. Let me just say that if you only need find and add operations in a ghost queue, you can get by with about 18 additional bytes if the hash takes 8 bytes (half that if it takes 4 bytes), but if you need to remove arbitrary items from it when changing the algorithm, things get bad.

@ben-manes

It's not so much an argument for or against; it's that there are a lot of design choices where those details matter. I find it frustrating when sweeping generalizations are made that conveniently leave out important caveats or alternatives that a reader should be aware of because they may weaken the author's thesis. As an engineer, I appreciate having a richer picture because I need to balance tradeoffs, support the usage long-term, and cope with future requirements. This is in conflict with an academic audience, where goals are more short-term, e.g. peer review, and the work is typically discarded after publication. When the discussion is oriented towards an engineering audience, going into more breadth and depth over the facts matters so that the tradeoffs are understood. Usually that ends up with a mixture of ideas, like how you applied insights from both BP-Wrapper and S3-FIFO, to discover the right design fit for the target environment.

@1a1a11a

1a1a11a commented Dec 12, 2023

Hi @ben-manes, Thanks for the super detailed explanations! A few comments and questions.

Comments

I like the TinyLFU design, and it shows the second-best results in our evaluation (with a window size of 10%). I have no doubt in calling it state-of-the-art.

[Figure: miss ratio comparison of eviction algorithms across the evaluated traces]

The figure above shows the evaluation of 6594 traces from 14 datasets (CDN, kv, and block).
Whether S3-FIFO is truly better than TinyLFU is hard to say; we have traces where S3-FIFO is better (especially for web workloads), and we also have traces where S3-FIFO is worse than TinyLFU (especially on block workloads).
These empirical results depend on workloads, and since we do not have a good model of workloads, I will not argue that S3-FIFO is strictly better than TinyLFU on the miss ratio.

Metadata size

Regarding the metadata size, as a FIFO-based algorithm, if objects in the workload have a uniform size, we can implement the FIFO queues using ring buffers, and AFAIK, several production cache implementations (e.g., TigerBeetle, VMware) do use ring buffers.

For the ghost entries, we implemented them as part of a bucket-based hash table to take advantage of the over-provisioned space. They are cleared out during hash collisions. This means we used no more than 8 bytes (a 4-byte timestamp and a 4-byte fingerprint) per ghost entry. Since there is a similar number of ghost entries as entries in the cache, we can estimate that the ghost entries add no more than 8 bytes per cached object. Our sensitivity analysis shows that we can reduce the number of ghost entries to 40% of the cached entries with almost no impact.
In summary, with that reduction the ghost entries add fewer than 4 bytes per cached object.
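
As an illustration of that layout, the packing arithmetic for such an 8-byte ghost entry might look like this in Go (this is not libCacheSim's actual bucket code):

package ghostentry

// packGhost stores the insertion timestamp in the high 32 bits and the key
// fingerprint in the low 32 bits, so a ghost entry fits in one 8-byte slot.
func packGhost(timestamp, fingerprint uint32) uint64 {
	return uint64(timestamp)<<32 | uint64(fingerprint)
}

// unpackGhost recovers both fields from the slot.
func unpackGhost(slot uint64) (timestamp, fingerprint uint32) {
	return uint32(slot >> 32), uint32(slot)
}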

Scalability

I appreciate the effort put into making Caffeine very scalable. However, the FIFO queues in S3-FIFO do not need any fancy techniques to be scalable; all we need is atomics. I implemented it in C/C++, and everything is straightforward. However, I recently realized that it is non-trivial to have scalable and efficient implementations in other languages, e.g., Java and Rust. But the scalability of S3-FIFO is fundamental to the design.
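
For example, a cache hit in S3-FIFO only needs to bump a small per-entry frequency counter, with no queue manipulation. A sketch of that read path in Go (the cap of 3 follows the paper's two-bit counter; the types are illustrative, not the authors' code):

package s3fifo

import "sync/atomic"

// entry carries the 2-bit frequency counter (0..3) used by S3-FIFO.
type entry struct {
	key  string
	freq atomic.Int32
}

// onHit bumps the frequency up to the cap without taking any lock.
func onHit(e *entry) {
	for {
		f := e.freq.Load()
		if f >= 3 {
			return // already saturated
		}
		if e.freq.CompareAndSwap(f, f+1) {
			return
		}
	}
}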

Questions

1.06 nanoseconds − 0.435 nanoseconds ≈ 0.625 nanoseconds
Can you explain what the numbers are?

I still doubt how Caffeine can achieve 1000 MQPS on 16 threads. The numbers indicate that a hash look-up and recording the request on a ring buffer takes 16 ns. But CPU L1 access is ~1 ns, and DRAM access takes 100 ns. I simply don't understand how this is possible. This indicates that most data accesses (including the hash table) happen in the L1 cache.

A friend of mine implemented (not optimized) FIFO in C and benchmarked its scalability on a server with 72 cores (using a pre-populated Zipf request stream). However, she was only able to get ~100 MQPS.

Happy to learn more from you! :)

@ben-manes

ben-manes commented Dec 17, 2023

Great comments, @1a1a11a. I am very happy to see more caches moving beyond LRU, so a variety of simple, easy, or more comprehensive solutions is wonderful. There is a lot of low-hanging fruit beyond the classics.

One of the differences to keep in mind is that the TinyLFU papers are closer to design patterns than algorithms. They provide a structure and techniques but only suggest implementation approaches. That means that implementations can differ; for example, cachelib uses 32-bit counters in their CountMinSketch, whereas Caffeine's is 4-bit. The variety of choices can impact metadata overhead, hit rate, throughput, and concurrency support.

Metadata

I don't believe that the ghost entry's cost of the hash table and FIFO queue can be dismissed to say it only takes 8 bytes (timestamp + fingerprint). A hash table typically has a load factor of around 75%, and may be higher for a cache since the maximum capacity is known. That leaves less space for the over-provisioned table size but also adds at least a pointer for the chained entry and maybe more (e.g. hash field if cached, uses a doubly linked list for faster removals, etc). The cost of the FIFO should also be accounted for as part of the ghost's overhead. A reasonable minimum estimate may be that the FIFO holds a 4-byte fingerprint, the hash table adds 8 bytes for the chain pointer, the table over-provisioning is dismissed for simplicity, the entry costs 8 bytes for the key and timestamp, and ghost region ranges from 0.4x to 0.9x of total capacity. That is then an incremental cost of 8 to 18 bytes of metadata overhead for each additional cache entry. That's great and not much higher than the 4-16 bytes per entry for TinyLFU.

Scalability

This indicates that most data accesses (including the hash table) happen in the L1 cache.

Yes, that is the ideal because this micro-benchmark tries to isolate the costs to only measuring the cache in a concurrent workload. It would show bottlenecks such as due to hardware cache coherence, data dependencies, poor branch prediction, saturation of the logic units (there are multiple ALUs for every FP unit), (lack of) SIMD, waits on the store buffer, blocking and context switches, system calls, inefficient algorithms, etc.

The numbers indicate that a hash look-up and recording the request on a ring buffer takes 16 ns. But CPU L1 access is ~1 ns, and DRAM access takes 100 ns.

A modern CPU core is superscalar (6 wide) and fuses multiple operations into a bundle (8 micro-ops). If we avoid data dependencies and use simple integer instructions, then we'll see higher CPU utilization, as it is much more work than a single instruction per cycle. That could mean that, at best, a CPU core can perform 48 instructions per cycle, which at 4GHz means 192 instructions per nanosecond, or 2688 instructions per nanosecond across all 14 cores.

The benchmark's overhead is thread-local work for an index increment, array lookup, loop, and operation counter. A read performs a hash table lookup, a few liveliness validation checks, hashing, and maybe an addition to a ring buffer. The hashes are very inexpensive bitwise and multiplication operations. Since the ring buffers for recording a read are lossy and drained by a single thread, they fill up very quickly and are usually skipped. In those cases there is no blocking on a lock, no CAS, no volatile writes, etc. so that at maximum load it stays close to the hash table's performance. If we introduce a cost like data dependencies (such as by using a random number generator to select the next key), we'd see this fall sharply as we no longer allow the CPU/compiler to use the hardware's full potential.

I refer to BP-Wrapper as request sampling because it sheds load (policy maintenance) by not recording the read in the ring buffer. The popular items will reenter the buffer at a higher frequency so when the policy drains the read buffer it still gets to observe the heavy hitters and the policy can make a reasonable prediction without slowing down the application.

A friend of mine implemented (not optimized) FIFO in C and benchmarked its scalability on a server with 72 cores (using a pre-populated Zipf request stream). However, she was only able to get ~100 MQPS.

There are many easy mistakes that could severely diminish throughput.

  1. Confirm that the key distribution is pre-generated. If not, you are paying not only the cost of randomizing but also creating data dependencies which severely limits the instruction parallelism. The random seed is often modified under a lock. The benchmark's own cost should be scrutinized.
  2. Is all of the mutable state (effectively) thread local? If you are reading and updating shared state, like incrementing statistic counters, then that introduces coherence costs and data dependencies.
  3. Is shared mutable state padded to avoid false sharing? Any write barrier (CAS, volatile update) should be inspected to avoid contention on the cache line. The field may be isolated to the thread, but updating a shared cache line can have a severe penalty.
  4. Does the workload fit within the CPU cache? A large distribution would include misses to main memory, thereby creating CPU stalls, and may increase the number of collisions observed during the hash table's lookup.
  5. Is the workload only reads with a 100% hit rate?
  6. Are the hash function and other math performed efficiently? An integer modulus or division (20-80 cycles) is much more expensive than add/sub/bitwise (1 cycle) or multiplication (6 cycles), and FP is even more (30-150). Often power-of-two sizing is used to replace a modulus with a mask, e.g. hash % table.length becomes hash & (table.length - 1) (see the sketch after this list). A 32-bit hash may be cheaper than 64-bit by packing more work per hardware logic unit. An expensive algorithm like SipHash might be replaced by a very simplistic hash, using red-black tree bins to mitigate collisions.
  7. What is the cost compared to an unbounded hash table? An immutable map vs a concurrent one? Are reads lock-free (e.g. do they require a read-lock)? Also compare against a modern hash table, like Google's Abseil and Facebook's Folly. Even on decade-old hardware, over 10M reads/s per core was observed for concurrent hash tables (6 core benchmark, 2 core and up but note low-perf cores used at high CPU counts). A modern 2022 benchmark observed 100M/s on 2 cores.
  8. Does performance scale linearly or super-linearly as threads/cores are added? It should until saturation.
  9. Are there any hard to predict branches, system calls (e.g. clock read), locks, memory fences, problematic data dependencies (e.g. disallow loop unrolling), etc. on the critical path? Just reading the system time can be expensive.
  10. Has the code been profiled? Try Linux perf and Intel's VTune.
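
A tiny Go illustration of the masking trick from point 6 (function names are made up):

package hashbench

// index maps a hash onto a table of arbitrary length using an integer modulus,
// which is comparatively expensive when the length is not known to the compiler.
func index(hash uint64, tableLen uint64) uint64 { return hash % tableLen }

// indexPow2 computes the same value when tableLen is a power of two, replacing
// the modulus with a single AND: hash % tableLen == hash & (tableLen - 1).
func indexPow2(hash uint64, tableLen uint64) uint64 { return hash & (tableLen - 1) }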

@maypok86

I tried to update the otter benchmarks today and to replicate the caffeine benchmarks, and there seems to be a reason for the numbers in the caffeine benchmark results.

Caffeine uses a capacity of 2 << 14 = 32768 (cache pre-filled) and the scrambled zipfian distribution from Yahoo's YCSB.

And on such a benchmark I got fantastic speed from otter.

goos: darwin
goarch: arm64
pkg: github.com/maypok86/benchmarks/perf
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8               11247033               102.2 ns/op         9780838 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8            24089149                44.99 ns/op       22225554 ops/s              27 B/op          1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=75%,writes=25%-8             18038624                61.85 ns/op       16167099 ops/s             108 B/op          1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=75%,writes=25%-8            21097772                53.16 ns/op       18811359 ops/s              43 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8                240418147                5.085 ns/op     196646932 ops/s               0 B/op          0 allocs/op
PASS
ok      github.com/maypok86/benchmarks/perf     6.965s

But if I increase the cache capacity to 2 << 20 = 2097152 (used in the ristretto benchmarks), the results are already noticeably worse, and I suspect it will be the same with caffeine.

goos: darwin
goarch: arm64
pkg: github.com/maypok86/benchmarks/perf
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8                8520392               127.9 ns/op         7819737 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8            19632382                56.17 ns/op       17804392 ops/s              27 B/op          1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=75%,writes=25%-8             13077836                99.85 ns/op       10015271 ops/s              86 B/op          1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=75%,writes=25%-8            15576997                75.88 ns/op       13178764 ops/s              41 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8                67229958                16.86 ns/op       59302586 ops/s               0 B/op          0 allocs/op
PASS
ok      github.com/maypok86/benchmarks/perf     21.023s

I also don't understand why the benchmarks are configured so that elements are not evicted at all. It doesn't seem very representative of real-world loads.

@maypok86

Of course, I can attach such pictures in the readme, but I have big questions about the honesty of such benchmarks.

[chart: otter benchmark results]

@ben-manes

ben-manes commented Dec 17, 2023

I also don't understand why the benchmarks are configured so that elements are not evicted at all. It doesn't seem very representative of real-world loads.

This is because it is easy to write an incorrect benchmark or to misunderstand its results. Therefore that benchmark is very narrow, to answer a very specific question: in this case, whether the cache is a fundamental scalability bottleneck that may harm the application's performance. A thorough analysis should ask multiple narrow, verifiable questions that each have their own micro-benchmark for understanding the system. In no way is that single benchmark enough, but it does answer a common question and counters the belief against LRU. That question (whether the cache hit penalty is a bottleneck) and its answer do not mean that being infinitely faster is important; there is a point of diminishing returns, and the answer is "yes or no", not "by how much".

Caffeine does include an eviction benchmark to measure that worst-case throughput. Otherwise the questions become much murkier and easy to misinterpret, so those benchmarks were written as needed and not included in the repository. The concern is that, if included, they might imply that they are valid and trustworthy comparisons rather than exploratory throwaways made during development. You should write more benchmarks for analyzing your cache, but you should also be conservative publicly to avoid others misinterpreting the results.

An example of trying to follow your logic and making an easy mistake was RedHat's benchmark. It is much faster to perform a cache miss than a cache hit because there is no additional work for maintaining the eviction policy. That penalized caches with higher hit rates, as increased throughput was observed when there were more misses. They didn't realize this and misunderstood their measurements, so during a rewrite their newer implementation suffered very high contention on a cache hit, and the lower hit rate made it appear to be an improvement. Your question is good and worth exploring, but be very careful, because it is trivially easy to make an invalid comparison or an honest misunderstanding of the results.

A few related talks that I found useful,

  1. Performance Anxiety
  2. Benchmarking for Good (or skip to a real mistake)
  3. (The Art of) (Java) Benchmarking
  4. A Crash Course in Modern Hardware

@maypok86

Interesting idea about the benchmark separation.

Let me share yesterday's adventures (and conclusions); they might be useful to someone.

I ran the benchmarks in the same configuration as caffeine on two versions of otter:

  1. Used a spinlock to synchronize access to the key-value pair (so the value can also be read atomically), and the results came out as follows:
goos: darwin
goarch: arm64
pkg: github.com/maypok86/otter/perf
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8               11870023               100.4 ns/op         9960592 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8            28630963                43.99 ns/op       22732079 ops/s              26 B/op          1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=75%,writes=25%-8             17515051                60.59 ns/op       16503227 ops/s             106 B/op          1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=75%,writes=25%-8            21355162                53.30 ns/op       18761140 ops/s              43 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8                86992596                14.91 ns/op       67050184 ops/s               0 B/op          0 allocs/op
PASS
ok      github.com/maypok86/otter/perf  7.752s
  2. Used atomic.Pointer for the value; in fact, that's what caffeine does.
type Node[K comparable, V any] struct {
	key        K
	value      atomic.Pointer[V]
	...
}

And the results were much better (you've already seen them):

goos: darwin
goarch: arm64
pkg: github.com/maypok86/otter/perf
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8                9978348               115.1 ns/op         8690168 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8            25813344                48.54 ns/op       20603083 ops/s              27 B/op          1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=75%,writes=25%-8             18204561                64.09 ns/op       15602573 ops/s             113 B/op          1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=75%,writes=25%-8            19909934                55.59 ns/op       17987431 ops/s              43 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8                210819736                5.589 ns/op     178916881 ops/s               1 B/op          0 allocs/op
PASS
ok      github.com/maypok86/otter/perf  7.292s

So it looks like no one in Go can come close to the throughput of caffeine without additional memory overhead and pressure on the GC. Otter of course already uses a set of tricks to reduce GC pressure and extra memory use, but such a change effectively cancels them all out.

@ben-manes

We are running on different machines, so you would have to see what Caffeine's numbers were on yours. My latest numbers came from a brand new M3 Max macbook, where I bumped the benchmark from 8 to 16 threads. The three workload types would also be nice to see, and a comparison to the unbounded hash map as a control group. Java's ConcurrentHashMap is very fast so that might be what gives caffeine an advantage if it wins on your hardware.

@ben-manes

I'm unsure why you'd spin-lock on read, but if it's important then you might find a stamped lock useful. Basically, use a version counter so reads verify their stamp to ensure a consistent read, instead of blocking each other by acquiring exclusive access.
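
Here is a sketch of the write side of that version-counter (seqlock-style) idea as it might look in Go; the Node type is illustrative, and as the later comments in this thread show, the read side additionally needs care around compiler reordering:

package node

import (
	"runtime"
	"sync/atomic"
)

// Node guards its value with a sequence counter: odd while a write is in
// progress, even when the value is published.
type Node[V any] struct {
	seq   atomic.Uint32
	value V
}

// SetValue publishes a new value under the seqlock.
func (n *Node[V]) SetValue(v V) {
	for {
		s := n.seq.Load()
		// Only move from an even (stable) sequence to an odd (writing) one.
		if s&1 == 0 && n.seq.CompareAndSwap(s, s+1) {
			break
		}
		runtime.Gosched() // another writer is in progress; yield and retry
	}
	n.value = v
	n.seq.Add(1) // back to even: readers that saw an unchanged sequence are consistent
}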

@ben-manes

Oh, and your spinlock is TAS whereas using TTAS is usually preferred.

@maypok86

The problem isn't even so much the numbers compared to caffeine, but the fact that the spinlock eats up a lot of time in such benchmarks (CLHT is the hash table that otter uses).

goos: darwin
goarch: arm64
pkg: github.com/maypok86/otter/perf
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8               10268367               102.8 ns/op         9725720 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8            23940507                45.51 ns/op       21972560 ops/s              27 B/op          1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=75%,writes=25%-8             19561802                61.60 ns/op       16234376 ops/s             111 B/op          1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=75%,writes=25%-8            20381444                53.41 ns/op       18721946 ops/s              43 B/op          0 allocs/op
BenchmarkCache/Zipf_CLHT_reads=75%,writes=25%-8                 280790150                4.242 ns/op     235737370 ops/s               1 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8                194110579                5.995 ns/op     166793393 ops/s               1 B/op          0 allocs/op
PASS
ok      github.com/maypok86/otter/perf  9.498s

With the spinlock, otter is about three times slower.

@ben-manes

ben-manes commented Dec 18, 2023

spinlock eats up a lot of time on such benchmarks

I think if you use a stamped spinlock then the optimistic read will usually succeed and the spinlock overhead will mostly disappear. But I’d also try TTAS first.

@maypok86

@ben-manes Why am I using a spinlock?

To asynchronously update the Entry received from the hash table when calling Set, since there is no synchronized or anything like that in Go. I was even able to save 4 bytes with this. Also, since Get on the cache instance then needs to read the value atomically, I use the same spinlock there.

There is a problem with StampedLock: it takes much more memory, judging by the Java implementation.

@maypok86

And a TTAS spinlock doesn't seem to help much either:

func (sl *SpinLock) Lock() {
	acquired := false
	for !acquired {
		// Test first with a cheap load, and only then attempt the CAS.
		if atomic.LoadUint32((*uint32)(sl)) == 0 {
			acquired = atomic.CompareAndSwapUint32((*uint32)(sl), 0, 1)
		}
	}
}

goos: darwin
goarch: arm64
pkg: github.com/maypok86/otter/perf
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8               10992760               104.5 ns/op         9566058 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8            26589556                46.96 ns/op       21294953 ops/s              28 B/op          1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=75%,writes=25%-8             17358097                62.27 ns/op       16059811 ops/s             107 B/op          1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=75%,writes=25%-8            20358709                54.12 ns/op       18478836 ops/s              43 B/op          0 allocs/op
BenchmarkCache/Zipf_CLHT_reads=75%,writes=25%-8                 267684039                4.182 ns/op     239110129 ops/s               1 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8                76844876                14.15 ns/op       70652962 ops/s               0 B/op          0 allocs/op
PASS
ok      github.com/maypok86/otter/perf  7.847s

@ben-manes

ben-manes commented Jan 6, 2024

Supposedly a black hole assignment is a trick to hint to the compiler to insert a compiler barrier. It is meant to ensure that seq is considered live and used, preventing the compiler from optimizing it away, or in this case from moving it after the value assignment. However, that seems to be a poorly explained trick and I am unsure if it works, but maybe try this next.

func compilerBarrier() {
  var dummy int
  _ = dummy
}

func (n *Node[K, V]) Value() V {
  for {
    seq := n.lock.Load()
    compilerBarrier()

    if seq&1 != 0 {
      runtime.Gosched()
      continue
    }

    value := n.value // Safe read after memory barrier

    newSeq := n.lock.Load()
    if seq == newSeq {
      return value
    }
  }
}

@maypok86

maypok86 commented Jan 6, 2024

I've tried something similar, but it doesn't work; the compiler is too smart in this case. I'll have to take the issue to the Go developers, because I don't really understand why it decides to move the copy of value after the newSeq load.

b15: ← b7-
  v46 (+38) = InlMark <void> [6] v17 
  v60 (+35) = MOVQload <int> [8] v6 v17
  v56 (35) = MOVQload <int> [16] v6 v17
  v54 (35) = MOVQload <int> [24] v6 v17
  v49 (+116) = MOVLatomicload <uint32,mem> v6 v17
  v51 (116) = Select0 <uint32> v49 (~R0[uint32], newSeq[uint32])
  v50 (116) = Select1 <mem> v49
  v14 (+40) = CMPL <flags> v51 v18
EQ v14 → b17 b16 (unlikely) (40)

gossa

@ben-manes

ben-manes commented Jan 6, 2024

What if you make a method-scoped mutex variable and lock around the value read? It would be uncontended and effectively dead code, but it might not be eliminated, or at least it might force the barriers. If Go follows the roach-motel model it would also be safe, I think?

@maypok86

maypok86 commented Jan 6, 2024

Ahahahaha, that kind of thing really works.

func (n *Node[K, V]) Value() V {
	var mutex sync.Mutex
	for {
		seq := n.lock.Load()
		if seq&1 != 0 {
			runtime.Gosched()
			continue
		}

		mutex.Lock()
		value := n.value
		mutex.Unlock()

		newSeq := n.lock.Load()
		if seq == newSeq {
			return value
		}
	}
}

@ben-manes

ben-manes commented Jan 6, 2024

Wonderful! You may be able to use an atomic.Uint32 in a similar, lighter fashion to create a store barrier. It sounds like they'll reorder around load barriers but not stores, where the lock/unlock are writing to the state field. We assumed they would honor an acquiring load, e.g. a [LoadLoad][LoadStore] barrier on the atomic.LoadInt32, and not allow the compiler to reorder around it. Instead it seems that you have to rely on the releasing store, [StoreStore][LoadStore], for the compiler to act how one would expect. It would be nice if they wrote those semantics down or provided the ability to emit the proper fences. I believe this method-scoped variable would be stack allocated, so it would incur no GC penalty, and therefore should be extremely cheap and could be optimized away to just emitting the fences.

@maypok86

maypok86 commented Jan 6, 2024

@ben-manes Yes, it's very similar to that and everything will indeed happen on the stack. I played around a bit and found out that even this ridiculous variant works.

func (n *Node[K, V]) Value() V {
	var lol atomic.Uint32
	for {
		seq := n.lock.Load()
		if seq&1 != 0 {
			runtime.Gosched()
			continue
		}

		value := n.value
		lol.Store(1)

		newSeq := n.lock.Load()
		if seq == newSeq {
			return value
		}
	}
}

@ben-manes

Yep, that's what I was suggesting and it makes sense. They are enforcing compiler barriers for release semantics but oddly not for acquire semantics. That's really convoluted, but 🤷‍♂️

@maypok86

maypok86 commented Jan 6, 2024

Very interesting: on laptops with Linux and Windows and Intel and AMD (x64) processors (all combinations), everything works fine without the additional atomics, but it doesn't work on ARM Macs. I wish I could get an Intel Mac somewhere...

@1a1a11a

1a1a11a commented Jan 6, 2024

I believe ARM uses a weak memory model: https://preshing.com/20120930/weak-vs-strong-memory-models/

@maypok86

maypok86 commented Jan 6, 2024

In general, it doesn't look like that's it, since gossa shows the instructions being rearranged. And when the Store is added, judging by the instructions, the barrier takes effect. Perhaps on some processors/architectures the race condition simply doesn't manifest in time. But it is not certain.

@1a1a11a

1a1a11a commented Jan 7, 2024

True, if the instruction has already been rearranged before running, then it is probably not relevant. I am not an expert on this, so I will just watch and learn.

@jbduncan

jbduncan commented Jan 7, 2024

My very limited (and likely outdated) understanding is that Go does not have a very good suite of concurrent data structures and primitives, at least compared to Java's...

@ben-manes That is true, thread-safe data structures are rife with locks in Go. But in my experience, this is less of an issue than in Java because of goroutines (lightweight threads) and channels, which encourage message passing over memory sharing.

@maypok86

maypok86 commented Jan 7, 2024

Oh man, in the Value method, lol and mutex are allocated on the heap.

@maypok86

maypok86 commented Jan 7, 2024

@jbduncan That's not quite true. For a large number of applications in Go, Java, and other languages the usual map with a mutex will suffice, but if you need something more, you suddenly start to have problems, because a map with a mutex will not save you from blocking hundreds of goroutines, and sync.Map is practically useless. The channels cannot be called efficient either. In Java you can just take ConcurrentHashMap and live peacefully, but in Go people have to work around the problem (most often with plain map sharding). Unfortunately, the canonical Go way doesn't always work either; the simplest examples are fasthttp and VictoriaMetrics, which break the rules but achieve excellent efficiency and are very popular. For example, my company switched to VictoriaMetrics simply because it handled loads that competitors couldn't, and nobody cares that it has a huge amount of non-canonical code and specific optimizations. Also, when you have a well-performing, more efficient solution, everyone I know immediately asks, "Why would I use something else?"

@jbduncan

jbduncan commented Jan 7, 2024

@maypok86 Oh, good point about sync.Map and mutexes blocking goroutines. I've only ever had to consider problems like net/http servers where things are abstracted away or small ETL jobs where goroutines and channels could be viable, so thanks for bringing me back down to earth.

Though I'm not sure what you mean by "channels cannot be called efficient in any way", so at the risk of derailing things, I was wondering if you had an example?

@maypok86

maypok86 commented Jan 7, 2024

@jbduncan In fact it is very hard to notice problems with channels, unlike with a map under a mutex, and they really do their job in 99% of cases. But they use a mutex internally, which is why on large machines under heavy load the transfer of data between goroutines can become the program's bottleneck.

Here is an example comparison against a more elaborate queue architecture (concurrent producers and consumers (1:1), queue/channel size 1,000, some work done by both producers and consumers):

QueueProdConsWork100                             252ns ± 0%
QueueProdConsWork100-2                           206ns ± 5%
QueueProdConsWork100-4                           136ns ±12%
QueueProdConsWork100-8                           110ns ± 6%
QueueProdConsWork100-16                          108ns ± 2%
QueueProdConsWork100-32                          102ns ± 2%
QueueProdConsWork100-64                          101ns ± 0%
ChanProdConsWork100                              283ns ± 0%
ChanProdConsWork100-2                            406ns ±21%
ChanProdConsWork100-4                            549ns ± 7%
ChanProdConsWork100-8                            754ns ± 7%
ChanProdConsWork100-16                           828ns ± 7%
ChanProdConsWork100-32                           810ns ± 8%
ChanProdConsWork100-64                           832ns ± 4%

@ben-manes

Oh man, in the Value method, lol and mutex are allocated on the heap.

In sync.Pool it uses runtime_LoadAcquintptr for acquire semantics, which the compiler probably honors. Would a pool.Get() on an empty, never populated, shared pool give you a compiler barrier without causing unnecessary heap or coherence overhead? If that works, how much overhead does this no-op hack have?

@maypok86

maypok86 commented Jan 7, 2024

In sync.Pool it uses runtime_LoadAcquintptr for acquire semantics, which the compiler probably honors. Would a pool.Get() on an empty, never populated, shared pool give you a compiler barrier without causing unnecessary heap or coherence overhead? If that works, how much overhead does this no-op hack have?

It is huge, because an empty sync.Pool falls into the getSlow path (it finds nothing and tries to steal from the other per-P pools). Interestingly, the counter-based approach for determining the OS thread id is faster, but in practice it ends up about equal to the spinlock.

Benchmark results (reads slower than updates)

goos: darwin
goarch: arm64
pkg: github.com/maypok86/otter/perf
BenchmarkCache/Zipf_Theine_reads=100%,writes=0%-8               10877155               111.5 ns/op         8969638 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=100%,writes=0%-8            40693945                30.85 ns/op       32418153 ops/s              16 B/op          1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=100%,writes=0%-8             27307274                45.90 ns/op       21788883 ops/s              96 B/op          2 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=100%,writes=0%-8            36409423                33.47 ns/op       29873592 ops/s              47 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=100%,writes=0%-8                43991157                23.78 ns/op       42049901 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8                8651826               124.7 ns/op         8017364 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8            24971624                51.71 ns/op       19336805 ops/s              28 B/op          1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=75%,writes=25%-8             17303230                71.20 ns/op       14044841 ops/s             108 B/op          1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=75%,writes=25%-8            19197260                58.07 ns/op       17219988 ops/s              43 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8                58528850                18.31 ns/op       54613259 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Theine_reads=0%,writes=100%-8               20914059                54.85 ns/op       18231676 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=0%,writes=100%-8             8795355               136.0 ns/op         7354091 ops/s             120 B/op          3 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=0%,writes=100%-8             12765872                81.69 ns/op       12241376 ops/s             165 B/op          0 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=0%,writes=100%-8            21821074                53.93 ns/op       18540956 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=0%,writes=100%-8                71402260                16.70 ns/op       59866267 ops/s               0 B/op          0 allocs/op
PASS
ok      github.com/maypok86/otter/perf  20.408s

@ben-manes

ben-manes commented Jan 7, 2024

I guess then: what is the impact of having a global atomic field that is blindly written to, e.g. always setting a boolean to false? Since it's not loaded or in lockstep, will it cause a lot of coherence traffic as CPUs trade MESI ownership of the cache line? Or will it just create a store barrier and some light traffic?

I suspect a little heap allocation will be the best of the bad choices. The GC will keep getting better, and eventually you may get a proper compiler fence to replace the allocation.

@maypok86

maypok86 commented Jan 8, 2024

Unfortunately, the global atomic makes things much worse.

goos: darwin
goarch: arm64
pkg: github.com/maypok86/otter/perf
BenchmarkCache/Zipf_Theine_reads=100%,writes=0%-8               11039739                99.10 ns/op       10091010 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=100%,writes=0%-8            43249087                24.22 ns/op       41296106 ops/s              16 B/op          1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=100%,writes=0%-8             29019181                38.42 ns/op       26025690 ops/s              96 B/op          2 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=100%,writes=0%-8            34859906                29.98 ns/op       33353715 ops/s              47 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=100%,writes=0%-8                45606712                24.18 ns/op       41349324 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8               10703342               102.7 ns/op         9736724 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8            25954760                43.42 ns/op       23030717 ops/s              27 B/op          1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=75%,writes=25%-8             18474672                61.49 ns/op       16263119 ops/s             112 B/op          1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=75%,writes=25%-8            22329164                52.62 ns/op       19004293 ops/s              43 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8                60689072                19.50 ns/op       51290554 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Theine_reads=0%,writes=100%-8               28396710                44.02 ns/op       22715338 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=0%,writes=100%-8            10060098               123.9 ns/op         8068016 ops/s             121 B/op          3 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=0%,writes=100%-8             14165078                75.17 ns/op       13303308 ops/s             173 B/op          0 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=0%,writes=100%-8            25403186                49.72 ns/op       20114664 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=0%,writes=100%-8                75778336                17.89 ns/op       55907056 ops/s               0 B/op          0 allocs/op
PASS
ok      github.com/maypok86/otter/perf  20.938s

An allocation on every Get request bothers me, because the GC will most likely spend much more time (and in fact a lot of memory too) cleaning up these allocations than we gain.

There are several possible solutions here:

  1. Use atomic.Value and put up with the extra memory and GC load, especially since, unlike Theine and many other cache libraries, Otter does not store keys twice (the Ristretto approach of replacing keys with their hash is too dirty in my opinion). Because of this, when using strings as keys our implementations can actually be comparable in GC load and memory consumption (although clearing values from the heap is certainly the more unpleasant operation). A minimal sketch of this option follows the list.
goos: darwin
goarch: arm64
pkg: github.com/maypok86/otter/perf
BenchmarkCache/Zipf_Theine_reads=100%,writes=0%-8               10890712               105.2 ns/op         9506177 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=100%,writes=0%-8            42186675                25.67 ns/op       38957096 ops/s              16 B/op          1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=100%,writes=0%-8             32435317                40.15 ns/op       24909024 ops/s              96 B/op          2 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=100%,writes=0%-8            38724302                30.79 ns/op       32474269 ops/s              47 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=100%,writes=0%-8                266765196                4.576 ns/op     218544656 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8               10897141               110.2 ns/op         9071319 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8            27160022                45.09 ns/op       22177331 ops/s              28 B/op          1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=75%,writes=25%-8             17890664                63.31 ns/op       15794165 ops/s             106 B/op          1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=75%,writes=25%-8            23185473                51.38 ns/op       19464213 ops/s              43 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8                185191293                6.631 ns/op     150817471 ops/s               1 B/op          0 allocs/op
BenchmarkCache/Zipf_Theine_reads=0%,writes=100%-8               28361977                42.62 ns/op       23461253 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=0%,writes=100%-8            12942342                95.99 ns/op       10417285 ops/s             120 B/op          3 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=0%,writes=100%-8             13148985                76.05 ns/op       13148494 ops/s             171 B/op          0 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=0%,writes=100%-8            24734936                48.88 ns/op       20459221 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=0%,writes=100%-8                60494036                24.93 ns/op       40112645 ops/s              16 B/op          1 allocs/op
PASS
ok      github.com/maypok86/otter/perf  21.076s
  2. Use an RCU approach and put up with its disadvantages. Here we would have to change quite a lot of the current implementation, which is acceptable, although the heavy slowdown of updates is worrying.

  3. Leave everything as it is and live with slower reads.
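As referenced in option 1, here is a minimal sketch of the atomic.Value approach (hypothetical names, not Otter's actual node type): the value is published through an atomic.Value, so reads never block, at the cost of boxing the value on every update, which is usually an extra heap allocation for the GC.

package sketch

import "sync/atomic"

// valueNode publishes its value through an atomic.Value. Load assumes Store
// was called at least once; Store boxes v into an interface (an allocation).
type valueNode[V any] struct {
	value atomic.Value
}

func (n *valueNode[V]) load() V {
	return n.value.Load().(V)
}

func (n *valueNode[V]) store(v V) {
	n.value.Store(v)
}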

@maypok86
Copy link

maypok86 commented Jan 8, 2024

By the way, here is the performance with a seqlock and an atomic barrier on every Get request.

goos: darwin
goarch: arm64
pkg: github.com/maypok86/otter/perf
BenchmarkCache/Zipf_Theine_reads=100%,writes=0%-8               11564922               100.3 ns/op         9973789 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=100%,writes=0%-8            40285806                26.70 ns/op       37454930 ops/s              16 B/op          1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=100%,writes=0%-8             30965144                40.81 ns/op       24502239 ops/s              96 B/op          2 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=100%,writes=0%-8            40173415                31.08 ns/op       32171586 ops/s              47 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=100%,writes=0%-8                192483022                6.916 ns/op     144595413 ops/s               4 B/op          1 allocs/op
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8               10873077               106.4 ns/op         9395601 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8            29815732                43.06 ns/op       23222686 ops/s              27 B/op          1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=75%,writes=25%-8             17980041                62.49 ns/op       16002663 ops/s             103 B/op          1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=75%,writes=25%-8            22022995                53.27 ns/op       18773787 ops/s              43 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8                136333018                9.592 ns/op     104255772 ops/s               3 B/op          0 allocs/op
BenchmarkCache/Zipf_Theine_reads=0%,writes=100%-8               31642126                39.82 ns/op       25115454 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=0%,writes=100%-8             9879114               120.2 ns/op         8317107 ops/s             121 B/op          3 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=0%,writes=100%-8             13663359                86.14 ns/op       11609071 ops/s             238 B/op          0 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=0%,writes=100%-8            24398157                49.66 ns/op       20135512 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=0%,writes=100%-8                58525999                21.32 ns/op       46915199 ops/s               0 B/op          0 allocs/op
PASS
ok      github.com/maypok86/otter/perf  22.698s
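The seqlock variant measured above follows the usual pattern; this is only a schematic sketch with made-up names, not the actual Otter code: the writer makes the sequence odd while mutating, and the reader copies optimistically and retries if the sequence changed.

package sketch

import "sync/atomic"

// seqNode is a seqlock-guarded slot. Note that in Go the unsynchronized copy
// of a pointer-containing value is still formally a data race, which is what
// the runtime write-barrier trick later in this thread is about.
type seqNode[V any] struct {
	seq   atomic.Uint32
	value V
}

func (n *seqNode[V]) load() V {
	for {
		s := n.seq.Load()
		if s&1 == 1 {
			continue // writer in progress, retry
		}
		v := n.value // optimistic, unsynchronized copy
		if n.seq.Load() == s {
			return v
		}
	}
}

func (n *seqNode[V]) store(v V) {
	n.seq.Add(1) // odd: readers will retry
	n.value = v
	n.seq.Add(1) // even: publish
}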

And with a spinlock, the results look like this.

goos: darwin
goarch: arm64
pkg: github.com/maypok86/otter/perf
BenchmarkCache/Zipf_Theine_reads=100%,writes=0%-8               11513593               102.8 ns/op         9725518 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=100%,writes=0%-8            44476977                28.48 ns/op       35114128 ops/s              16 B/op          1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=100%,writes=0%-8             29389074                41.46 ns/op       24118353 ops/s              96 B/op          2 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=100%,writes=0%-8            40236950                30.71 ns/op       32565429 ops/s              47 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=100%,writes=0%-8                87322472                12.33 ns/op       81082476 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8               10701369               109.8 ns/op         9109595 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8            23923245                45.17 ns/op       22139916 ops/s              27 B/op          1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=75%,writes=25%-8             18169131                60.89 ns/op       16421834 ops/s             103 B/op          1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=75%,writes=25%-8            22694677                53.62 ns/op       18649904 ops/s              43 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8                87237304                13.64 ns/op       73295198 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Theine_reads=0%,writes=100%-8               35994420                47.10 ns/op       21231164 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=0%,writes=100%-8             9810844               129.5 ns/op         7724999 ops/s             121 B/op          3 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=0%,writes=100%-8             13800042                78.14 ns/op       12797693 ops/s             185 B/op          0 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=0%,writes=100%-8            24117067                48.82 ns/op       20484021 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=0%,writes=100%-8                67451888                18.75 ns/op       53345341 ops/s               0 B/op          0 allocs/op
PASS
ok      github.com/maypok86/otter/perf  20.128s
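The spinlock variant is the classic CAS loop (again just a sketch with made-up names, not the actual implementation):

package sketch

import (
	"runtime"
	"sync/atomic"
)

// spinLock is a single atomic word locked with CAS; a blocked caller yields
// the processor between attempts instead of busy-spinning the whole quantum.
type spinLock struct {
	state atomic.Uint32
}

func (l *spinLock) lock() {
	for !l.state.CompareAndSwap(0, 1) {
		runtime.Gosched()
	}
}

func (l *spinLock) unlock() {
	l.state.Store(0)
}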

@ben-manes
Copy link

Yep, as expected with the global. Given low write rates, @1a1a11a's RCU approach is probably your best option design-wise. The spinlock or atomic.Value is the most pragmatic: low effort, likely good enough, and easiest to refactor if you get the missing functionality in a later Go release.

@maypok86
Copy link

maypok86 commented Jan 8, 2024

In fact, I doubt anything will be added to Go that can fix seqlocks.

So I tried rewriting the code to use RCU, and based on preliminary measurements I'm quite happy with the performance. (At 100% writes it looks like everything is bottlenecked by the queue.)

reads=100%,writes=0%

BenchmarkCache/Zipf_Theine_reads=100%,writes=0%-8               10947901               100.4 ns/op         9958631 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=100%,writes=0%-8            45642127                25.84 ns/op       38702390 ops/s              16 B/op          1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=100%,writes=0%-8             32109850                40.04 ns/op       24973950 ops/s              96 B/op          2 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=100%,writes=0%-8            39218516                30.52 ns/op       32761562 ops/s              47 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=100%,writes=0%-8                253637361                4.377 ns/op     228480222 ops/s               0 B/op          0 allocs/op

reads=75%,writes=25%

BenchmarkCache/Zipf_Theine_reads=75%,writes=25%-8               10209930               104.4 ns/op         9582855 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=75%,writes=25%-8            27865933                43.44 ns/op       23020143 ops/s              27 B/op          1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=75%,writes=25%-8             19538709                62.58 ns/op       15979712 ops/s             110 B/op          1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=75%,writes=25%-8            19207093                54.99 ns/op       18184478 ops/s              43 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=75%,writes=25%-8                176039958                6.490 ns/op     154081867 ops/s               2 B/op          0 allocs/op

reads=50%,writes=50%

BenchmarkCache/Zipf_Theine_reads=50%,writes=50%-8               11559314                97.44 ns/op       10262300 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=50%,writes=50%-8            22018971                48.94 ns/op       20435061 ops/s              34 B/op          1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=50%,writes=50%-8             16228872                62.82 ns/op       15919191 ops/s             132 B/op          1 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=50%,writes=50%-8            21352249                54.48 ns/op       18355882 ops/s              39 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=50%,writes=50%-8                107763754               10.72 ns/op       93268480 ops/s               5 B/op          0 allocs/op

reads=25%,writes=75%

BenchmarkCache/Zipf_Theine_reads=25%,writes=75%-8               12707719                92.89 ns/op       10765447 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=25%,writes=75%-8            20564748                57.58 ns/op       17367336 ops/s              55 B/op          1 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=25%,writes=75%-8             16359062                65.76 ns/op       15206757 ops/s             165 B/op          0 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=25%,writes=75%-8            23839993                45.78 ns/op       21843813 ops/s              23 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=25%,writes=75%-8                52671734                22.94 ns/op       43586449 ops/s              11 B/op          0 allocs/op

reads=0%,writes=100%

BenchmarkCache/Zipf_Theine_reads=0%,writes=100%-8               26785988                42.13 ns/op       23735382 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Ristretto_reads=0%,writes=100%-8             9223939               126.2 ns/op         7920840 ops/s             122 B/op          3 allocs/op
BenchmarkCache/Zipf_Bigcache_reads=0%,writes=100%-8             13591893                81.53 ns/op       12264735 ops/s             252 B/op          0 allocs/op
BenchmarkCache/Zipf_Fastcache_reads=0%,writes=100%-8            24455442                49.59 ns/op       20163620 ops/s               0 B/op          0 allocs/op
BenchmarkCache/Zipf_Otter_reads=0%,writes=100%-8                 9646405               116.5 ns/op         8587281 ops/s              64 B/op          1 allocs/op

It seems that even for write-heavy workloads, a quickly written RCU implementation holds up quite well.
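The RCU-style read path being described is roughly the following (a sketch under assumptions, with hypothetical names, not the actual Otter code): readers follow an atomic pointer to an immutable snapshot, and writers install a fresh copy instead of mutating in place.

package sketch

import "sync/atomic"

// snapshot is an immutable view of a node's data; rcuNode publishes it via an
// atomic pointer. Old snapshots are reclaimed by the GC once no reader holds
// them. load assumes store was called at least once.
type snapshot[V any] struct {
	value V
}

type rcuNode[V any] struct {
	p atomic.Pointer[snapshot[V]]
}

func (n *rcuNode[V]) load() V {
	return n.p.Load().value // lock-free read of the current snapshot
}

func (n *rcuNode[V]) store(v V) {
	n.p.Store(&snapshot[V]{value: v}) // one allocation per update
}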

@maypok86
Copy link

Anyway, if anyone is interested, seqlocks in Go can be made to work with dark magic.

Namely, by linking in this function from the Go runtime:

// runtimeatomicwb links against runtime.atomicwb, the write barrier the
// runtime runs before an atomic pointer store; calling it manually lets a
// pointer be published from code the compiler cannot instrument.
//go:linkname runtimeatomicwb runtime.atomicwb
//go:noescape
func runtimeatomicwb(ptr *unsafe.Pointer, new unsafe.Pointer)

But this technique is too forbidden 🙂.

@Yiling-J
Copy link
Owner

Yiling-J commented Aug 6, 2024

@maypok86
I'm considering improving Theine a bit, and the first thing I plan to do is change the read buffer. I took a look at Otter, and the striped ring buffer seems very high performance. However, I also have another interesting idea.

The original Ristretto version uses a sync.Pool of buffers, making each buffer local to a P so no lock is needed. However, sync.Pool contents are easily GCed, causing loss of buffers. What if I disable sync.Pool GC, preallocate all buffers, and make the pool's New function only return nil?

An example of overriding sync pool: https://github.com/bytedance/gopkg/blob/main/lang/syncx/pool.go

I actually wrote some simple demo code while reading the Otter code:

type SyncPoolBuffer[K comparable, V any] struct {
	pool Pool
	pm   []*PolicyBuffers[K, V]
}

func NewSP[K comparable, V any](nodeManager *node.Manager[K, V], size int) *SyncPoolBuffer[K, V] {

	b := &SyncPoolBuffer[K, V]{
		pool: Pool{
			New: func() any {
				return nil
			},
		},
	}
	// Preallocate the buffers; Put is called from separate goroutines,
	// presumably so the buffers end up in different P-local pools, while pm
	// keeps a reference to every buffer.
	for i := 0; i < size/4; i++ {
		buffer := &PolicyBuffers[K, V]{make([]node.Node[K, V], 0, capacity)}
		b.pm = append(b.pm, buffer)
		go b.pool.Put(buffer)
	}

	return b
}

func (b *SyncPoolBuffer[K, V]) Add(n node.Node[K, V]) *PolicyBuffers[K, V] {
	raw := b.pool.Get()
	if raw == nil {
		// No buffer available on this P: the read record is simply dropped.
		return nil
	}
	pb := raw.(*PolicyBuffers[K, V])
	if len(pb.Returned) >= capacity {
		// Buffer is full: hand it to the caller to drain; Free puts it back.
		return pb
	}
	pb.Returned = append(pb.Returned, n)
	b.pool.Put(pb)
	return nil
}

// Free returns the processed buffer back and also clears it.
func (b *SyncPoolBuffer[K, V]) Free(pb *PolicyBuffers[K, V]) {
	pb.Returned = pb.Returned[:0]
	b.pool.Put(pb)
}

But the problem with this approach is also obvious: there are no contention-based losses anymore. Here I added a simple counter in afterGet:

	if SyncPoolBufferMode {
		pb = c.syncPoolBuffer.Add(got)
	} else {
		pb = c.stripedBuffer[idx].Add(got)
	}
	if pb != nil {
		c.evictionMutex.Lock()
		c.policy.Read(pb.Returned)
		logged += 1
		c.evictionMutex.Unlock()

		if SyncPoolBufferMode {
			c.syncPoolBuffer.Free(pb)
		} else {
			c.stripedBuffer[idx].Free()
		}
	}

and here is the throughput benchmark result:

Striped RingBuffer:

cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkCache/zipf_otter_reads=100%,writes=0%-8         	100000000	        12.25 ns/op	  81640149 ops/s
==== logged 20817

SyncPool Buffer:

cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkCache/zipf_otter_reads=100%,writes=0%-8         	100000000	        36.38 ns/op	  27486156 ops/s
==== logged 5882350

What do you think about this idea?

@maypok86
Copy link

maypok86 commented Aug 6, 2024

I plan to do is change the read buffer.

Yeah, this is probably Theine's main problem when working under high load.

for i := 0; i < size/4; i++ {
	buffer := &PolicyBuffers[K, V]{make([]node.Node[K, V], 0, capacity)}
	b.pm = append(b.pm, buffer)
	go b.pool.Put(buffer)
}

It looks interesting and funny. It should even work, although it relies very heavily on the behavior of the scheduler. I would be careful here: the scheduler may ignore one of the Ps, and then the cache will always behave poorly there.

It seems that you implemented Ristretto's read buffers, but with the GC removed. I wouldn't say that GC is the problem with Ristretto's read buffers. Moreover, I suspect that this has almost no effect on throughput benchmarks. It may affect the hit ratio, but that needs to be checked. I haven't been able to reproduce most of the Ristretto charts, although the charts from my hit ratio simulator are almost exactly the same as yours. But at least I found the reason for Ristretto's very low hit ratio.

In fact, I thought about the options for implementing read buffers for an incredibly long time and decided to do it in this form based on the following thoughts:

  1. Under high load, sync.Pool won't help much, since all cores will be waiting for the read buffer to be processed by the eviction policy.
  2. With contention-based losses, the cores do not wait and show excellent results on throughput benchmarks. This may affect the hit ratio, but I don't think it's very important, for two reasons. First, it has almost no effect on the vast majority of users, since they cannot generate such a load. Second, if an application can generate such a load, the cache is still only trying to identify hot entries, and dropping a small number of accesses will hardly change the hot/cold distribution, especially at the millions of requests per second it takes to produce meaningful contention. I even compared the hit ratio losses of Theine and Otter, and the difference is about 2% at 100% CPU usage (quite expected), which is extremely unlikely in reality. I was more surprised by the presence of hit ratio losses due to the use of BP-Wrapper. (A simplified sketch of such a lossy buffer follows this list.)
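To make the contention-based losses concrete, here is a simplified, hypothetical stripe (Otter's actual striped ring buffer is more involved): the reader records an access only if it can grab the stripe without waiting; otherwise the record is dropped.

package sketch

import "sync"

// stripe buffers access records. add returns the full batch when the stripe
// fills up, or nil if the record was buffered or dropped due to contention.
type stripe[T any] struct {
	mu    sync.Mutex
	items []T
}

func (s *stripe[T]) add(item T, capacity int) []T {
	if !s.mu.TryLock() {
		return nil // contention: drop the access record instead of waiting
	}
	defer s.mu.Unlock()
	s.items = append(s.items, item)
	if len(s.items) < capacity {
		return nil
	}
	batch := s.items
	s.items = make([]T, 0, capacity)
	return batch
}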

Maybe I'm wrong somewhere, but I reasoned something like this :)

It's funny that I recently tried to make the read buffers dynamic, and it even sort of worked, but I won't risk pushing and merging it :).

@Yiling-J
Copy link
Owner

Yiling-J commented Aug 7, 2024

@maypok86 I revisited your Ristretto issue, and finally part of the mystery has been solved. I totally agree that losses on the read buffer are intended. I think I'll just borrow your striped buffer because it naturally fits the current Theine, which handles read policy updates synchronously. The modified sync pool is extremely 'unsafe' and seems too magical, and I would also need to figure out how to make it lossy.

@Yiling-J
Copy link
Owner

Yiling-J commented Aug 9, 2024

@maypok86 Here are the improved results after switching to Otter's striped buffer and several other optimizations:

BenchmarkCache/zipf_otter_reads=100%,writes=0%-8         	101792066	        11.97 ns/op	  83515868 ops/s
BenchmarkCache/zipf_theine_reads=100%,writes=0%-8        	40965662	        30.05 ns/op	  33275970 ops/s
BenchmarkCache/zipf_ristretto_reads=100%,writes=0%-8     	29334450	        39.74 ns/op	  25166567 ops/s

BenchmarkCache/zipf_otter_reads=75%,writes=25%-8         	54122091	        18.56 ns/op	  53871082 ops/s
BenchmarkCache/zipf_theine_reads=75%,writes=25%-8        	29716508	        41.82 ns/op	  23910140 ops/s
BenchmarkCache/zipf_ristretto_reads=75%,writes=25%-8     	16239280	        86.11 ns/op	  11613020 ops/s

And this is the PR (#42); maybe you could also help review it, because I borrowed your code directly and only made small modifications.
