Please update the link in the Readme #33

Closed · mskorina opened this issue Jan 10, 2024 · 30 comments · Fixed by #35
Labels: bug (Something isn't working)

Comments

@mskorina

Hi @maypok86

Please update the link "BP-Wrapper: A Framework Making Any Replacement Algorithms (Almost) Lock Contention Free" (http://web.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-09-1.pdf).

I found another link from your reply Yiling-J/theine-go#29 (comment).

https://www.researchgate.net/publication/220966845_BP-Wrapper_A_System_Framework_Making_Any_Replacement_Algorithms_Almost_Lock_Contention_Free

And perhaps you can add the following links about s3fifo:
https://blog.jasony.me/system/cache/2023/08/01/s3fifo
https://blog.jasony.me/system/cache/2023/12/28/fifo

Thanks for the development of this project, I believe there will be a stable version for production use soon ;)

@mskorina added the bug (Something isn't working) label on Jan 10, 2024
@maypok86 (Owner)

Hi @mskorina, thanks for the issue!

Yes, I've already seen that the site the bp-wrapper link leads to is down most of the time. I just didn't know whether to change it to ResearchGate or https://dgraph.io/blog/refs/bp_wrapper.pdf, but I think ResearchGate looks more formal.

There is only one problem with the additional links: it looks like it's time to create a Wiki section, because I also want to add hit ratio simulation results on newer traces from well-known companies (Meta, Twitter, Alibaba, etc.) instead of the traces that are more than 20 years old but that everyone keeps testing caches on, and there's simply no room for all of that in the README.

Speaking of the suggested links, I definitely wouldn't want to add an article about lock-free queues.

  1. otter doesn't use anything from there.
  2. I disagree with the author about the advantage of lock-free queues: they are extremely inconvenient on the engineering side, and they generate a large number of atomic operations, because of which they can be beaten by regular spinlock-based queues. I discussed this with the author on Hacker News, and Ben also described it in his messages in the theine s3-fifo thread.
  3. The article actually describes Michael Scott's queue, but it describes the queue in a simplified way and doesn't mention that maintaining a cache on top of such queues actually requires a lot more atomic operations.

It's possible, of course, that I'm wrong about a cache based on lock-free queues, but I've tried it and I haven't been able to squeeze something worthy of fighting bp-wrapper out of it. So I'd be very interested to see the solution of someone who succeeds.

The s3-fifo article seems to be a simplified, updated version of the original paper, and it's unclear which one is better to refer to.

I too hope otter will live to see its first production-ready release. 😅 (It's already quite stable, but I'm not happy with many api contracts and want to tidy them up so that I don't have to make a lot of contract-breaking changes in a library used by a lot of people).

@ben-manes commented Jan 11, 2024

advantage of lock-free queues

Clock policies are almost always written as lock-free reads and a mutex around writes. Here is an example of that (written at the time as a stopgap, since people were manually building the broken alpha code). A lock-free queue doesn't make much sense since eviction is still a sequential loop around the queue, so it seems like a lot of unnecessary complexity. That was my experience when trying to write a lock-free LRU, since it always reordered the node to the end of the queue, which made it slower than a lock-based one. Plus, more advanced Clock policies like ClockPro (which better handles scans) assume a single eviction thread; I've never really seen concurrent eviction advocated as a feature.
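
For readers unfamiliar with that pattern, here is a minimal sketch in Go of the "lock-free reads, mutex around writes" Clock shape (an illustration only, not Ben's linked code and not otter's implementation; updates of existing keys and sizing details are omitted):

```go
package clock

import (
	"sync"
	"sync/atomic"
)

// entry carries the per-key metadata; the reference bit is all a read touches.
type entry[K comparable, V any] struct {
	key        K
	value      V
	referenced atomic.Bool
}

// Cache is a minimal Clock cache: reads never take the policy lock,
// writes and the eviction sweep serialize on a single mutex.
type Cache[K comparable, V any] struct {
	items   sync.Map // K -> *entry[K, V]
	mu      sync.Mutex
	ring    []*entry[K, V]
	hand    int
	maxSize int
}

func New[K comparable, V any](maxSize int) *Cache[K, V] {
	return &Cache[K, V]{maxSize: maxSize}
}

// Get is lock-free with respect to the policy: a hit only sets the reference bit.
func (c *Cache[K, V]) Get(key K) (V, bool) {
	if v, ok := c.items.Load(key); ok {
		e := v.(*entry[K, V])
		e.referenced.Store(true)
		return e.value, true
	}
	var zero V
	return zero, false
}

// Set serializes all writers and the eviction sweep on one mutex.
func (c *Cache[K, V]) Set(key K, value V) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.items.Load(key); ok {
		return // updating an existing entry is omitted to keep the sketch short
	}
	e := &entry[K, V]{key: key, value: value}
	if len(c.ring) < c.maxSize {
		c.ring = append(c.ring, e)
	} else {
		// Second-chance sweep: clear reference bits until a cold victim is found.
		for c.ring[c.hand].referenced.Swap(false) {
			c.hand = (c.hand + 1) % len(c.ring)
		}
		c.items.Delete(c.ring[c.hand].key)
		c.ring[c.hand] = e
		c.hand = (c.hand + 1) % len(c.ring)
	}
	c.items.Store(key, e)
}
```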

In that case BP-Wrapper is only helpful for other access policies, like time-to-idle (expireAfterAccess). In the case of expiration, since events can be lost anyway, the timer wheel fixes things up when it reorders items. Then you wouldn't need BP-Wrapper for reads, but could still use it to mask write latencies when not under a synthetic write load.

Since you've done the work, I wouldn't advocate against keeping BP-Wrapper, since it lets you play with more design parameters. The workloads where the S3-FIFO authors found their algorithm superior turned out to be due to their TinyLFU implementation (others may have made similar oversights). Caffeine aced all of their test cases while theirs can fail very badly (Otter improved on that, so I can't say how it fares). I don't mind old traces as valid patterns that might emerge for a general-purpose cache, though I also have no problem optimizing instead for a specific set of workload types and traces for a specialized one (hence Sieve clearly stating its target was perfectly fine, while I took issue with S3-FIFO pretending otherwise).

I'm not happy with many api contracts

I tried to keep the top-level api very simple and clear, then stash feature-specific methods under Cache.policy() as extension points. This way I could offer apis like put(key, value, ttl duration) only if the cache was built with it enabled and code generate the optimized cache entry to minimize metadata overhead. I like Josh Bloch's Bumper-Sticker API Design (video).

@maypok86 (Owner)

@ben-manes (@rueian from the other thread)

Okay, as promised in the other thread there are going to be a lot of words now.

I've tried to divide it into topics, but it can still be mixed. Also, not all of this needs comments (it's just a list of observations and pains with some questions), but even small comments can be helpful.

Memory consumption

Right now otter has a few memory consumption issues:

  1. Static node configuration that consumes more memory than necessary, but I hope to fix that in this issue.
  2. Static buffers that consume a large amount of memory relative to small caches. As I understand it, caffeine uses read buffers that can grow based on contention, and MpscGrowableArrayQueue for writes. It would be cool to port the read buffers to Go, but I'm not sure they wouldn't have the same problems as seqlocks. The same goes for MpscGrowableArrayQueue, but here I have an additional question: why couldn't you use the same approach as with the read buffers and just increase the number of write buffers based on contention?

Why do I even need all this stuff? Besides just reducing memory consumption, I really dislike the ristretto approach with the loss of insertion events. And ideally I'd like to allow users to cut otter into shards if they need more write throughput (there's also the goroutines issue, but that's beside the point).
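
For what it's worth, here is a rough sketch (not otter's or Caffeine's actual implementation) of the lossy, striped read buffer idea being discussed: a read drops its policy hint when the chosen stripe is busy or full, and draining happens under the eviction lock. Striping is by key hash, since Go has no cheap thread/goroutine id; growing the stripe count under contention is left out.

```go
package readbuffer

import "sync"

const (
	numStripes = 8   // a real implementation could add stripes under contention
	stripeCap  = 128 // per-stripe capacity
)

type stripe[T any] struct {
	mu    sync.Mutex
	items []*T
}

// Buffer records read events for the eviction policy. It is lossy on purpose:
// a dropped event only loses a policy hint, never cached data.
type Buffer[T any] struct {
	stripes [numStripes]stripe[T]
}

// Add appends the event to the stripe selected by the key's hash.
// If the stripe is busy or full, the hint is simply dropped.
func (b *Buffer[T]) Add(keyHash uint64, item *T) {
	s := &b.stripes[keyHash%numStripes]
	if !s.mu.TryLock() {
		return // contended: drop the hint instead of blocking the reader
	}
	if len(s.items) < stripeCap {
		s.items = append(s.items, item)
	}
	s.mu.Unlock()
}

// Drain hands all buffered events to the eviction policy; it is meant to be
// called by the maintenance goroutine (or by a caller holding the eviction lock).
func (b *Buffer[T]) Drain(consume func(*T)) {
	for i := range b.stripes {
		s := &b.stripes[i]
		s.mu.Lock()
		for _, item := range s.items {
			consume(item)
		}
		s.items = s.items[:0]
		s.mu.Unlock()
	}
}
```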

Hit ratio

I was wondering how often applications are written in Go that might have unfriendly workloads to S3-FIFO (I only knew about analytics, page caches in the DB, and search). And the following came to light:

  1. Analytics is extremely rare, but still occurs
  2. Database engines are already much more common. There are many examples, the same vitess, cockroach, dgraph, badger, bolt, etc.
  3. I had a revelation with search. Almost no one writes search engines in Go, but there are a large number of users with about the same workload. Usually these are services that act as a gateway and proxy queries (based on responses from search engines like elastic, manticore, sphinx) to more narrowly focused services. It is also extremely important for them to have minimal latency (that's why redis, memcached, etc. are usually not used), maximum possible hit ratio and resistance to attacks.

So a few questions.

  1. Is there any way to know which traces are worth checking the hit ratio on? Otherwise, I will spend a lot of time finding out on my own. (I know scan, loop and zig-zag from the LIRS collection, but there are probably others worth checking that I don't know about.) I wanted to check the newer ones just out of interest, and as it turns out, W-TinyLFU (in the theine version) still shows not the best, but acceptable, results on them.
AlibabaBlock

| Cache | 10,000 | 20,000 | 30,000 | 40,000 | 50,000 | 60,000 | 70,000 | 80,000 |
|---|---|---|---|---|---|---|---|---|
| otter | 18.81 | 30.71 | 40.23 | 48.98 | 56.19 | 61.69 | 66.01 | 69.47 |
| theine | 18.58 | 29.70 | 38.70 | 46.17 | 52.73 | 57.50 | 62.59 | 66.48 |
| ristretto | 0.78 | 1.14 | 1.40 | 1.67 | 1.85 | 2.11 | 2.36 | 2.58 |
| lru | 13.19 | 24.54 | 33.89 | 42.44 | 49.72 | 55.37 | 60.15 | 64.16 |
| arc | 16.16 | 27.91 | 36.42 | 44.47 | 51.00 | 56.27 | 60.79 | 64.57 |
  2. You once wrote that the doorkeeper allows you to reduce the count-min sketch size without a big hit ratio loss. I tried to compare the normal version of theine with the version with the doorkeeper enabled. The results don't look good, but maybe theine is doing something wrong.
S3

| Cache | 100,000 | 200,000 | 300,000 | 400,000 | 500,000 | 600,000 | 700,000 | 800,000 |
|---|---|---|---|---|---|---|---|---|
| theine | 9.87 | 18.97 | 27.80 | 36.63 | 44.47 | 53.53 | 60.41 | 67.02 |
| theineDoorkeeper | 10.53 | 17.84 | 23.38 | 30.50 | 35.96 | 41.12 | 46.44 | 50.89 |

P3

| Cache | 25,000 | 50,000 | 100,000 | 200,000 | 300,000 | 400,000 | 500,000 | 600,000 |
|---|---|---|---|---|---|---|---|---|
| theine | 11.56 | 19.01 | 37.76 | 54.21 | 66.43 | 76.07 | 79.04 | 79.95 |
| theineDoorkeeper | 11.30 | 22.96 | 33.88 | 45.87 | 51.45 | 53.57 | 53.77 | 53.77 |
  3. I have yet to see a cache that at least replicates caffeine's hit ratio and its stability. Surely there is no mega secret? :) (Jitter doesn't look like it.)

Core

  1. The expiration policy is not proactive, but I hope to fix that in this issue as well.
  2. I'm a bit disappointed with the prefetching from bp-wrapper. It feels like it's simply not needed when using lossy buffers, so I want to try to give up on it.
  3. I've been unsure for a very long time about whether it's worth trying to remove the goroutines for write buffer processing and the expiration policy. In theory, it might help to reduce the write time a bit (right now I have to put this goroutine to sleep and wake it up) and reduce memory consumption a bit (only slightly, but if we still allow sharding otter, it looks more useful).

API

You're actually right about policy, but I decided to leave it for now, especially since I'm terribly annoyed with the way people are trying to use otter now.

Everyone I've found, without exception, uses WithVariableTTL and always passes the same ttl value. I like the approach of passing an object/function to compute the variable ttl (expireAfter), but there's a huge problem: nobody does this in Go right now, and I end up having to go against the grain :(. It would also make it terribly difficult to add otter to the various wrappers like gocache. So I don't understand at all what to do about it.
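
Purely to illustrate the expireAfter idea being referred to here (hypothetical names, not otter's or Caffeine's Go API), such a contract could look something like this, with a function adapter for the common "same rule for every event" case:

```go
package cache

import "time"

// ExpiryCalculator is a hypothetical Go version of Caffeine's expireAfter
// contract: the cache asks it for a per-entry lifetime on create, update, and read.
type ExpiryCalculator[K comparable, V any] interface {
	ExpireAfterCreate(key K, value V) time.Duration
	ExpireAfterUpdate(key K, value V, currentTTL time.Duration) time.Duration
	ExpireAfterRead(key K, value V, currentTTL time.Duration) time.Duration
}

// ExpiryFunc adapts a single function for the common case, so callers who
// today pass a constant ttl only have to write one line.
type ExpiryFunc[K comparable, V any] func(key K, value V) time.Duration

func (f ExpiryFunc[K, V]) ExpireAfterCreate(key K, value V) time.Duration { return f(key, value) }

func (f ExpiryFunc[K, V]) ExpireAfterUpdate(key K, value V, _ time.Duration) time.Duration {
	return f(key, value)
}

// ExpireAfterRead keeps the remaining lifetime, i.e. reads don't extend it.
func (f ExpiryFunc[K, V]) ExpireAfterRead(_ K, _ V, currentTTL time.Duration) time.Duration {
	return currentTTL
}
```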

@ben-manes commented Feb 16, 2024

Why couldn't you use the same approach as with read buffers and just increase the number of write buffers based on contention?

I haven't observed the write buffer being a bottleneck so there has not been a reason to stripe it. If doing so for space reasons, then MpscGrowableArrayQueue will start small and grow by doubling up to a capacity limit. Since writes are less common and that work is drained asap, this queue has low footprint.

Why do I even need all this stuff?

Most Clock-based caches simply have all writes serialize on a single lock for both the hash table and policy updates. Guava embedded LRU into its hash table's segments, letting the read buffer drain by try-locking the segment. The write buffer is not necessary; it's just an optimization to allow for faster independent writers. You could do away with it and performance should be fine, and if not, then users can stripe the cache.
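
To make the "it's just an optimization" point concrete, here is a minimal sketch (not otter's actual write path) where the writer itself drains the buffer under a try-lock, so no dedicated goroutine is required and a full buffer simply turns into back pressure:

```go
package cache

import "sync"

// writeTask describes one pending eviction-policy update.
type writeTask[K comparable, V any] struct {
	key   K
	value V
}

// Cache sketches caller-run maintenance: writers enqueue a policy update and
// opportunistically drain the buffer on their own goroutine.
type Cache[K comparable, V any] struct {
	data       sync.Map // K -> V, standing in for the cache's hash table
	evictionMu sync.Mutex
	buf        chan writeTask[K, V]  // bounded write buffer
	applyTask  func(writeTask[K, V]) // eviction-policy update, supplied by the policy
}

func New[K comparable, V any](bufSize int, apply func(writeTask[K, V])) *Cache[K, V] {
	return &Cache[K, V]{buf: make(chan writeTask[K, V], bufSize), applyTask: apply}
}

func (c *Cache[K, V]) Set(key K, value V) {
	c.data.Store(key, value)
	task := writeTask[K, V]{key: key, value: value}
	for {
		select {
		case c.buf <- task:
			// Enqueued: run maintenance only if nobody else is already doing it.
			if c.evictionMu.TryLock() {
				c.drain()
				c.evictionMu.Unlock()
			}
			return
		default:
			// Buffer full: block on the lock and help drain before retrying.
			c.evictionMu.Lock()
			c.drain()
			c.evictionMu.Unlock()
		}
	}
}

// drain applies all queued policy updates; callers must hold evictionMu.
func (c *Cache[K, V]) drain() {
	for {
		select {
		case t := <-c.buf:
			c.applyTask(t)
		default:
			return
		}
	}
}
```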

I really dislike the ristretto approach with the loss of insertion events.

100% agree. It's not what I did or would advise.

Is there any way to know on which traces it is worth checking hit ratio

I don't know either, as most users don't collect traces or monitor, and typically the size & expiration settings are guesstimates. I don't discriminate against a workload and try to be good across all of them, because even if I advised against a use-case, someone would drop Caffeine in regardless. Instead I try to avoid hard failures, like the 0% hit rates for S3-FIFO on loop or TinyLFU on corda. Once it is reasonably okay then you can try to do better if that is fun and interesting, else it is probably okay as is.

You once wrote that doorkeeper allows you to reduce the count min-sketch size without big hit ratio loss. I tried to compare the normal version of theine and the version with doorkeeper enabled.

That was my recollection from enabling it in the simulator and doing a spot check, but the idea predated my involvement. It made the policy more LFU / MRU biased, which further hurt TinyLFU+LRU on the broader set of traces that I was using, like OLTP (ARC, UMass). That hurt the static W-TinyLFU's 1% window, so I focused on further fixing the policy for LRU-biased traces, as Corda & Scarab both observed poor performance in their production usages. After adaptivity, I did a brief review and did not see any major faults, but I have never explored it in depth due to a lack of complaints about the memory overhead.

When I try it, it works fine when halving the CountMin4's size.

S3

| Policy | 100,000 | 200,000 | 300,000 | 400,000 | 500,000 | 600,000 | 700,000 | 800,000 |
|---|---|---|---|---|---|---|---|---|
| HillClimberWindowTinyLfu | 12.14 | 23.40 | 34.07 | 43.25 | 51.55 | 59.47 | 65.62 | 70.75 |
| HillClimberWindowTinyLfu (doorkeeper 1x) | 12.99 | 23.93 | 33.96 | 43.64 | 52.04 | 59.66 | 65.80 | 70.89 |
| HillClimberWindowTinyLfu (doorkeeper 0.5x) | 12.65 | 23.44 | 33.62 | 43.05 | 51.36 | 59.27 | 65.38 | 70.55 |

P3

| Policy | 25,000 | 50,000 | 100,000 | 200,000 | 300,000 | 400,000 | 500,000 | 600,000 |
|---|---|---|---|---|---|---|---|---|
| HillClimberWindowTinyLfu | 14.93 | 25.57 | 38.41 | 53.06 | 67.07 | 75.95 | 79.08 | 79.90 |
| HillClimberWindowTinyLfu (doorkeeper 1x) | 15.85 | 26.18 | 37.96 | 53.85 | 66.60 | 75.80 | 79.07 | 79.91 |
| HillClimberWindowTinyLfu (doorkeeper 0.5x) | 15.48 | 25.58 | 38.53 | 54.30 | 67.13 | 75.98 | 78.98 | 79.91 |
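
For readers who haven't seen it, this is the doorkeeper idea being measured above (a sketch of the general TinyLFU scheme, not theine's or Caffeine's code): a small bloom filter absorbs the first occurrence of every key, so one-hit wonders never reach the count-min sketch and the sketch can be sized smaller; the frequency estimate then adds one whenever the doorkeeper contains the key.

```go
package tinylfu

// doorkeeper is a small bloom filter placed in front of the count-min sketch.
type doorkeeper struct {
	bits []uint64
	mask uint64
}

func newDoorkeeper(counters uint64) *doorkeeper {
	n := uint64(64)
	for n < counters {
		n <<= 1 // round up to a power of two so indexing is a mask, not a mod
	}
	return &doorkeeper{bits: make([]uint64, n/64), mask: n - 1}
}

// allow reports whether the key's hash was already present, inserting it if not.
// Only keys that were already present should go on to increment the sketch.
func (d *doorkeeper) allow(hash uint64) bool {
	i1, i2 := hash&d.mask, (hash>>32)&d.mask
	seen := d.contains(i1) && d.contains(i2)
	d.bits[i1/64] |= 1 << (i1 % 64)
	d.bits[i2/64] |= 1 << (i2 % 64)
	return seen
}

func (d *doorkeeper) contains(i uint64) bool { return d.bits[i/64]&(1<<(i%64)) != 0 }

// reset clears the filter; it is performed together with the sketch's periodic aging.
func (d *doorkeeper) reset() {
	for i := range d.bits {
		d.bits[i] = 0
	}
}
```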

I have yet to see a cache that at least replicates caffeine's hit ratio and its stability. Surely there is no mega secret? :)

I can't really speak towards anyone else's code. It is all open source, documented, and I share what I think is helpful. I probably over optimized everything as it was a fun excuse to learn and push myself. Perhaps @bitfaster's cache (C#) has implemented the algorithms more closely? I think others wanted to solve it their own way as a fun challenge and my mistakes were mostly ironed out.

I'm a bit disappointed with prefetching from bp-wrapper. It feels like when using lossy buffers it's just not needed.

Maybe it was an artifact of older hardware having tiny caches, slow cache coherence, and a constrained memory bus? It was probably the professor's experience from late 90s hardware (like Sparc) as a general best practice. They weren't using lossy buffers or viewed it as request sampling, so they probably didn't fully understand why their ideas worked so well and which had the most impact.

I've been confused for a very long time about whether it's worth trying to remove goroutines for write buffer processing and expiration policy.

A lot of users set Caffeine's executor to a caller runs policy, once even for a performance benefit. The maintenance cost is tiny so my prior caches did it on the caller without any trouble. Java 8 introduced Map computations which block on user code by locking the hash bin for that entry and its neighbors. That could mean a long running computation, like a cache load on a different key, could delay eviction. Similarly the eviction listener runs foreign code under a computation as some cases need to be in sync with the mapping. If those don't apply then you probably are not getting a benefit.

Everyone I've found without exception uses WithVariableTTL and always passes the same ttl value

That's why I got away without that feature for a long time. Even then, most misuse the api and I need to add a fluent static factory to assist them (ben-manes/caffeine#1499).

I'm more concerned by the ample usage of explicit writes in Go caches (Set(k, v)) instead of loading through the cache to avoid stampedes. The callbacks are to be compatible with that. Do you think Go developers would change that bad habit?

@rueian commented Feb 16, 2024

I'm more concerned by the ample usage of explicit writes in Go caches (Set(k, v)) instead of loading through the cache to avoid stampedes. The callbacks are to be compatible with that. Do you think Go developers would change that bad habit?

I believe most Go developers are trying to avoid stampedes. If the cache doesn't provide the ability to avoid that, then developers will add singleflight in front of the explicit Set(). But it would be best if the singleflight ability were built into the cache, so that there is no additional hashmap and locking.
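
For example, a minimal sketch of that external pattern using golang.org/x/sync/singleflight on top of otter's public Get/Set (not a built-in otter feature; it assumes an otter.Cache built elsewhere and a user-supplied load function):

```go
package loader

import (
	"github.com/maypok86/otter"
	"golang.org/x/sync/singleflight"
)

// LoadingCache glues a singleflight group onto an otter cache so that
// concurrent misses for the same key trigger exactly one load.
type LoadingCache[V any] struct {
	cache otter.Cache[string, V]
	group singleflight.Group
	load  func(key string) (V, error)
}

func NewLoadingCache[V any](cache otter.Cache[string, V], load func(string) (V, error)) *LoadingCache[V] {
	return &LoadingCache[V]{cache: cache, load: load}
}

func (l *LoadingCache[V]) Get(key string) (V, error) {
	if v, ok := l.cache.Get(key); ok {
		return v, nil
	}
	// All concurrent callers for the same key share this one load.
	res, err, _ := l.group.Do(key, func() (any, error) {
		v, err := l.load(key)
		if err == nil {
			l.cache.Set(key, v)
		}
		return v, err
	})
	if err != nil {
		var zero V
		return zero, err
	}
	return res.(V), nil
}
```

The downside, as noted above, is exactly the extra hashmap and locking that singleflight.Group keeps internally on top of the cache's own table.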

This is actually part of the ongoing work I am doing to integrate otter into https://github.com/redis/rueidis. I am forking otter and trying to add a new loading state to a node. A node created in the loading state will bypass the eviction policy and remain in the hashmap until its value, cost, and ttl are updated.

@maypok86 (Owner)

If doing so for space reasons

Yes, this is purely to reduce memory consumption, although I don't think the loss is very significant either.

Here are the results of a memory consumption benchmark depending on capacity. Starting with capacity >= 20,000, the overhead per entry starts to matter more than the size of the buffers.

(charts omitted: memory_256, memory_1024, memory_10000, memory_25000, memory_100000 — memory consumption at capacities 256, 1,024, 10,000, 25,000, and 100,000)

I even considered doing the same as theine does, but I'm not too sure about it.

Instead try to avoid hard failures like the 0% hit rates for S3-FIFO on loop or TinyLFU on corda.

Actually, I thought you knew some other traces on which S3-FIFO feels bad. But okay, I'll try to test it on more traces later.

I'm more concerned by the ample usage of explicit writes in Go caches (Set(k, v)) instead of loading through the cache to avoid stampedes.

Usually it goes something like this: Set(k, v) is used -> problems appear because of stampedes -> singleflight is added. I don't think this will be fixed in any language, because usually everyone fixes problems only after directly encountering them, and it's not clear how to change that mentality. Often these problems aren't even noticed; I only discovered the problems with ristretto after trying to understand why a microservice had such high latency and adding a hit ratio dashboard to Grafana.

I was more surprised by the strange use of the variable ttl expiration policy because it's just less convenient, but maybe I don't understand user logic :(.

@maypok86 (Owner) commented Feb 17, 2024

I am forking otter and trying to add a new loading state to a node.

Sounds really cool, especially since my strength is already lacking (I've also managed to get sick😢).

A node created in the loading state will bypass eviction policy and remain in the hashmap until its value, cost, and ttl are updated.

It seems too complicated, because it is not clear what to do with an entry that has not yet had a value loaded but that we are trying to add to the eviction policy. It is much easier to add a Compute function to the hash table and store Loaded[V] instead of V, which would additionally store the load status (uint32/sync.WaitGroup) and an error; when loading finishes, all waiters wake up and return the loaded value. Or simplify everything and just use a sharded singleflight.

@ben-manes commented Feb 17, 2024

Yes, this is purely to reduce memory consumption

What about using a growable ring buffer, but guarding it with a lock? You could use a lock-free ring buffer that, when full, you lock in order to double its capacity if possible. That growth is rare and there's a limit, so being fully lock-free doesn't offer much of a benefit. Likely locking all reads/writes to the buffer would also be fine. I use a range of 4 ... 128 * pow2(# cpus).
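
Something like this mutex-guarded growable ring (a sketch of the suggestion, not what otter ships): it starts at a small capacity and doubles under the same lock until a limit, returning false once full so the caller gets back pressure.

```go
package queue

import "sync"

// Growable is a mutex-guarded ring buffer that starts small and doubles its
// capacity, up to maxCap, when it fills up.
type Growable[T any] struct {
	mu     sync.Mutex
	buf    []T
	head   int // index of the oldest element
	count  int
	maxCap int
}

func NewGrowable[T any](minCap, maxCap int) *Growable[T] {
	return &Growable[T]{buf: make([]T, minCap), maxCap: maxCap}
}

// TryPush appends an item, growing the ring if possible; it reports false
// when the ring is full and already at maxCap (back pressure for the caller).
func (g *Growable[T]) TryPush(item T) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.count == len(g.buf) {
		if len(g.buf) >= g.maxCap {
			return false
		}
		g.grow()
	}
	g.buf[(g.head+g.count)%len(g.buf)] = item
	g.count++
	return true
}

func (g *Growable[T]) TryPop() (T, bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	var zero T
	if g.count == 0 {
		return zero, false
	}
	item := g.buf[g.head]
	g.buf[g.head] = zero // don't retain a reference to the popped value
	g.head = (g.head + 1) % len(g.buf)
	g.count--
	return item, true
}

// grow doubles the capacity and copies the elements in order; growth is rare,
// so doing it under the same lock is fine.
func (g *Growable[T]) grow() {
	next := make([]T, len(g.buf)*2)
	for i := 0; i < g.count; i++ {
		next[i] = g.buf[(g.head+i)%len(g.buf)]
	}
	g.buf = next
	g.head = 0
}
```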

Actually, I thought you knew some other traces on which S3-FIFO feels bad.

I used a standard set for a quick check, saw that it did well in some and poorly in others, and when it was good it merely matched the others rather than leaping past a barrier. Caffeine stayed competitive in their best workloads, so there was no reason to analyze deeper. And honestly, I wasn't going to invest a lot of time for authors who invented a fake quote and other misrepresentations to exaggerate their work. That dirtying-up masks anything good they did do, which doesn't need to be novel to be respected and evangelized. If I can't trust them then I'll quickly check, see if anything is interesting enough to dig into deeper, and if not then ignore it as passing noise.

usually everyone fixes problems only after directly encountering them, and it's not clear how to change the mentality.

I think the best you can do is write APIs that nudge a user towards better practices in the api and docs. Most of the time users just pattern match on the docs, know what they want and skim to copy what looks like the fit. Making it easier to do the right thing goes surprisingly far because they don't really care and want to move onto solving their actual problem, not learning your library.

A node created in the loading state will bypass eviction policy and remain in the hashmap until its value, cost, and ttl are updated.

Sounds like Caffeine's AsyncCache. It inserts a future as the mapping as a pinned entry, and when complete the callback updates the mapping's metadata or removes it if a failure. A pinned entry is a generic feature where the weight of zero is skipped by size eviction and a max expiration is unreachable (300 yrs). This is a decorator with slightly lower level access to coerce the cache statistics.

It is much easier to add Compute function to the hash table and instead of V store Loaded[V]

A hash table compute is really handy! In Java's it doesn't store a Loaded[V] and instead locks the hashbin during the operation. That's a fine-grained bin so it is usually okay, allows more flexible computations, and ensures linearizability. The AsyncCache covers those limitations well enough that users switch to that when needed.

Guava did as you suggested, where it has a future value during the load and afterwards swaps the entry type to store the plain value. This is very nice if only a loader is needed, but doesn't support richer computations as nicely and is harder to make linearizable.

@maypok86 (Owner) commented Feb 17, 2024

What about using a growable ring buffer but guard it by a lock?

Hmm, it's worth a try.


About Loaded[V], I think I meant a little differently.

Ideologically, I wanted to be able to do something like this.

type Loaded[V any] struct {
	value V
	// I'd like to get rid of storing the error here...
	err        error
	loadStatus atomic.Uint32
}

func (l *Loaded[V]) Load() (V, error) {
	for l.loadStatus.Load() == 0 {
		runtime.Gosched()
	}

	return l.value, l.err
}

func (l *Loaded[V]) Exec(f func() (V, error)) {
	l.value, l.err = f()
	l.loadStatus.Store(1)
}

// want to occasionally give away an incomplete load so that the user can request it when needed.
func (c *AwesomeCache[K, V]) Get(key K) (*Loaded[V], bool) {
    // values in nodes are *Loaded[V]

    // It's better to use ComputeIfAbsent here, but we need to think about the api.
    n := node.New(key) // value is the zero Loaded[V]
    v := c.hashmap.SetIfAbsent(n)
    if v != nil {
        // another node was found
        return v.Value(), true
    }

    l := n.Value()
    // only one run is guaranteed
    c.workerPool.Run(func() {
        // the loader should be wired up differently, but it's fine for the example.
        l.Exec(func() (V, error) { return c.loader(key) })
        if l.err != nil {
            c.hashmap.DeleteNode(n)
            return
        }
        // we need to think about setting cost and ttl, but it seems solvable.
        c.writeBuffer.Insert(node.NewAddTask(n))
    })

    // other work
    ...

    return l, true
}

// or if the user just needs the value, we do something like this
func (c *AwesomeCache[K, V]) Load(key K) (V, error) {
    l, ok := c.Get(key)
    if !ok {
        var v V
        // most likely, this shouldn't happen at all
        return v, ErrNotFound
    }
    return l.Load()
}

While I was writing all this up, I realized that Loaded is a bad name, given that it's really a promise, or a call (in the singleflight implementation), or a future in Java (judging from Ben's posts). But I really don't like the increased memory consumption. Perhaps a sharded singleflight would be even better.

it has a future value during the load and afterwards swaps the entry type to store the plain value.

In Go, we're probably already getting bogged down with node casts. So I didn't even consider it, maybe only if to optimize memory consumption. But how to make it linearizable is really not obvious at first glance.

@ben-manes

The promise solution has a few nice characteristics of async programming: the in-flight promise can chain dependent actions, be canceled, have local timeouts, and the mapping can be discarded early. It also allows for nicer handling of bulk loads by inserting any absent mappings and then filling them in with the results. That could be an explicit getAll(keys) or an implicit coalescing of individual get(key) operations. However, it has that wrapper overhead, and I'm not sure if it can support linearizability because the hash table and entry are governed by independent locks. It makes implementing fancy compute methods harder.

The hash-bin locking approach is attractive for linearizability and fancy computes, which is very useful for internal policy handling and for users. Unfortunately, a long-running compute, including by a colliding neighbor, can cause problems. In this mode Caffeine doesn't support bulk loads that protect against cache stampedes, which is an unfortunate limitation, since we can't lock multiple hash bins at once. I think later on it could, by using a fanout of virtual threads (goroutines) to create a promise-style load, letting those block in a compute until they are sent the loaded value (VTs are still beta quality in Java). When users hit the limitations here they switch to the async cache.

Anyway, I think the hashbin-style is more powerful and allows for the promise-style to be layered on afterwards to solve its limitations. It felt wrong and frustrating at first, but turned out to be a really great foundation that I appreciated.

@rueian commented Feb 18, 2024

It seems too complicated, because it is not clear what to do with an entry that has not yet had a value loaded, but we are trying to add it to the eviction policy. It is much easier to add Compute function to the hash table

If we allow an entry to remain in the hashmap and ignore all eviction policies until its cost and ttl are updated, that will already be sufficient for users to implement singleflight by themselves. The idea behind @ben-manes's AsyncCache, setting the initial cost to zero and the ttl to forever, is really clever.

I believe adding a Compute function to the hash table is actually more complex than the above "placeholder" mechanism in terms of API design decisions, and it has some limitations:

  • Choosing a function signature for the Compute function that suits everyone's needs is impossible. A simple func() (V, error) is likely not enough. One example is that users usually want a scoped context.Context to be passed in so they can cancel the operation.
  • A func() (V, error) is okay if it is a function parameter of Get, but it will easily escape to the heap, causing an additional allocation on every Get.
  • As @ben-manes mentioned, neither of the above can easily work for bulk loads. An ideal bulk load works by first inserting placeholders into the hashmap, then loading the missing values all at once, and finally swapping those placeholders with the loaded values. For example, rueidis' DoMultiCache does this with its current LRU cache implementation, but I am trying to replace that implementation with otter.

@maypok86 (Owner)

I think I'm clearly missing something, but I don't know what it is :)

If we allow an entry that remains in the hashmap and ignores all eviction policies until an update to its cost and ttl, that will be already sufficient for users to implement singleflights by themselves.

Why? The loader is only guaranteed to execute once and everyone gets the same result; besides, it's not clear to me how my proposal is fundamentally different from a pinned entry, which is also ignored by the policies. Also, in both cases you need to update the cost/weight in the eviction policy, which will also require synchronization (and in my opinion, in the case of a pinned entry it is even more complicated).

The hashbin locking approach is attractive for linearizability and fancy computes

Do I understand correctly that caffeine just locks the bucket in the hash table until the loader is executed? If so, this is a very controversial action. And if not, I don't understand how it works. :(

Choosing a function signature for the Compute function to suit everyone's needs is impossible.

Yes, but it doesn't need to. The hash table just doesn't need to know about error and context.Context, and if you do need them, you can use closures. And I didn't say that Compute would be added to the public api.

A func() (V, error) is ok if it is a function parameter of Get, but it will easily escape to the heap causing an additional allocation for every Get.

Escape analysis in Go is a little smarter: in our case the function will be allocated on the heap only once, and in many user scenarios (if we add Compute to the public api) the same thing will happen.

neither the above can easily work with bulk loads.

Yes, this is really a problem.

@rueian commented Feb 18, 2024

Loader is only guaranteed to execute once and everyone gets the same result, especially since it's not clear to me how my proposal is fundamentally different from pinned entry which is actually ignored by the policies.

Your proposal is very close to the pinned entry approach, but I think leaving the workerPool part to users will be a better choice. That will enable users to implement their own singleflight and bulk loads.

Also in both cases you need to update the cost/weight in the eviction policy, which will also require synchronization (and in my opinion in the case of pinned entry it is even more complicated).

I may also have missed something, but it looks like the pinned entry approach can be achieved with this one-line change:

diff --git a/internal/hashtable/map.go b/internal/hashtable/map.go
index 839301f..fae5c92 100644
--- a/internal/hashtable/map.go
+++ b/internal/hashtable/map.go
@@ -319,7 +319,7 @@ func (m *Map[K, V]) Delete(key K) *node.Node[K, V] {
 // Returns the evicted node or nil if the node wasn't evicted.
 func (m *Map[K, V]) DeleteNode(n *node.Node[K, V]) *node.Node[K, V] {
        return m.delete(n.Key(), func(current *node.Node[K, V]) bool {
-               return n == current
+               return n == current && current.Cost() > 0
        })
 }

With that, I will be able to implement bulk loads and singleflight as well as context cancellation like this:

package main

import (
	"context"
	"github.com/maypok86/otter"
)

type Result struct {
	data string
	err  error
}

type keyidx struct {
	key string
	idx int
}

type flight struct {
	data string
	err  error
	done chan struct{}
}

func (s *flight) Wait(ctx context.Context) Result {
	select {
	case <-ctx.Done():
		return Result{err: ctx.Err()}
	case <-s.done:
		return Result{data: s.data, err: s.err}
	}
}

func (s *flight) Cost() uint32 {
	select {
	case <-s.done:
		return uint32(len(s.data))
	default:
		return 0
	}
}

var cache otter.CacheWithVariableTTL[string, *flight]

func BulkLoad(ctx context.Context, keys ...string) []Result {
	flights := make([]*flight, len(keys))
	pending := make([]keyidx, 0)
	for i, k := range keys {
		f, ok := cache.Get(k)
		for !ok {
			if cache.SetIfAbsent(k, &flight{done: make(chan struct{})}, -1) {
				pending = append(pending, keyidx{key: k, idx: i})
			}
			f, ok = cache.Get(k)
		}
		flights[i] = f
	}
	result, err := bulkFetchExternal(ctx, pending)
	for i, p := range pending {
		if err != nil {
			flights[p.idx].err = err
			close(flights[p.idx].done)
			cache.Delete(p.key)
		} else {
			flights[p.idx].data = result[i]
			close(flights[p.idx].done)
			cache.Set(p.key, flights[p.idx], someTTL)
		}
	}
	results := make([]Result, len(flights))
	for i, f := range flights {
		results[i] = f.Wait(ctx)
	}
	return results
}

func bulkFetchExternal(ctx context.Context, keys []keyidx) ([]string, error) {
	// TODO
}

func main() {
	var err error
	cache, err = otter.MustBuilder[string, *flight](10_000).
		Cost(func(key string, value *flight) uint32 { return value.Cost() }).
		WithVariableTTL().
		Build()
	if err != nil {
		panic(err)
	}
	BulkLoad(context.Background(), "k1", "k2", "k3")
}

@maypok86 (Owner) commented Feb 18, 2024

  1. Don't specify a negative ttl, it will just break everything (I wish we could add validation for this...). It is better to just specify an unreachably large one.
  2. The code seems to be based on workarounds.

I thought at first that it was not linearizable, but in this particular example it seems fine. But if you allow the user to call functions at will, I'm not sure it's going to work.

it looks like the pinned entry approach can be achieved by this line change

It seems that then the entry can be removed from the eviction policy, but a pinned entry should stay in it.

@ben-manes commented Feb 18, 2024

I understand correctly that caffeine just locks the bucket in the hash table until the loader is executed, right? If so, this is a very controversial action.

Yes, ConcurrentHashMap uses fine-grained buckets, and their number increases as the hash table resizes. The design tries to keep the number of entries per bucket very small, as described below: under ideal hash codes a bin is expected to be empty 60% of the time, hold a single entry 30% of the time, two entries 8% of the time, and so on. The Map.compute methods perform the user's function under this lock, and the mapping isn't established until the function completes.

Caffeine's AsyncCache is actually a Cache<K, CompletableFuture<V>>, so the hash table maps to a future value. When loading a value, the mapping is established under the segment lock, but that only creates the future, so it is inexpensive. The future is loaded asynchronously and on completion invokes our success/error callback. On success a no-op update is performed (Map.replace(k, future, future)) to recalculate the metadata and unpin the entry; on failure it is removed if present (Map.remove(key, future)). In this mode a long-running loader does not impact other entries, is visible while in-flight, and can be explicitly removed or canceled. Since the entry is visible (internally & externally) while in-flight, the cache uses pinning to disable the policies, which is more convenient than special-casing the logic. Unlike Guava, it doesn't swap the node type for efficiency, since the future is user-facing, whereas Guava's is a hidden implementation detail for a synchronous cache that uses a fixed number of segments.

I was also initially concerned with the approach taken, but the locking strategy has worked out very well in practice.

ConcurrentHashMap
* The main disadvantage of per-bin locks is that other update
* operations on other nodes in a bin list protected by the same
* lock can stall, for example when user equals() or mapping
* functions take a long time.  However, statistically, under
* random hash codes, this is not a common problem.  Ideally, the
* frequency of nodes in bins follows a Poisson distribution
* (http://en.wikipedia.org/wiki/Poisson_distribution) with a
* parameter of about 0.5 on average, given the resizing threshold
* of 0.75, although with a large variance because of resizing
* granularity. Ignoring variance, the expected occurrences of
* list size k are (exp(-0.5) * pow(0.5, k) / factorial(k)). The
* first values are:
*
* 0:    0.60653066
* 1:    0.30326533
* 2:    0.07581633
* 3:    0.01263606
* 4:    0.00157952
* 5:    0.00015795
* 6:    0.00001316
* 7:    0.00000094
* 8:    0.00000006
* more: less than 1 in ten million
*
* Lock contention probability for two threads accessing distinct
* elements is roughly 1 / (8 * #elements) under random hashes.
*
* Actual hash code distributions encountered in practice
* sometimes deviate significantly from uniform randomness.  This
* includes the case when N > (1<<30), so some keys MUST collide.
* Similarly for dumb or hostile usages in which multiple keys are
* designed to have identical hash codes or ones that differs only
* in masked-out high bits. So we use a secondary strategy that
* applies when the number of nodes in a bin exceeds a
* threshold. These TreeBins use a balanced tree to hold nodes (a
* specialized form of red-black trees), bounding search time to
* O(log N).  Each search step in a TreeBin is at least twice as
* slow as in a regular list, but given that N cannot exceed
* (1<<64) (before running out of addresses) this bounds search
* steps, lock hold times, etc, to reasonable constants (roughly
* 100 nodes inspected per operation worst case) so long as keys
* are Comparable (which is very common -- String, Long, etc).
* TreeBin nodes (TreeNodes) also maintain the same "next"
* traversal pointers as regular nodes, so can be traversed in
* iterators in the same way.

EDIT: Please take all of my commentary as "this is what worked for me" and not "this is what you should do". You should take what you like and discard what you dislike. The comments are meant mostly to illuminate your options to make the best tradeoffs for the project given the language, ecosystem, goals, and your time / interests.

@ben-manes

@rueian you might also be interested in something like this coalescing bulkloader example. There, individual loads are buffered over a short time window to allow for a more efficient bulk load. That takes advantage of a promise type letting you control when it is completed, so the task can be queued for a little while. Instead of implementing the queuing, executor, etc. natively, the example shows that one can easily combine libraries to get the desired effect, which helped limit my project's scope. Similarly, promise-style lets one compose with a resilience library to handle retries, fallbacks, rate limits, etc. Just FYI, in case these techniques are helpful for your project.

@rueian commented Feb 19, 2024

Hi @maypok86,

Don't specify ttl as negative, it will just break everything (I wish we could add validation for this...). It is better to specify it just unreachable.

Oh, thanks! I will keep that in mind.

I thought at first that it was not linearizable, but in this particular example it seems fine. But if you allow the user to call functions at will, I'm not sure it's going to work.

Though I think it is the users' responsibility to use cache.Delete and cache.Set correctly if they want to implement singleflight like this, I am not sure what linearizability you are referring to. Between which operations should it be linearizable?

It seems then the entry can be removed from the eviction policy, but the pinned entry should stay in it.

Frankly, I think it's best if pinned entries don't go into the policies at all. They simply stay in the hashmap. However, since the policies are applied asynchronously, a check like return n == current && current.Cost() > 0 while deleting a node from the hashmap cannot be omitted, and then it becomes less relevant whether the pinned entries are tracked by the policies or not.

One thing I am wondering is why the cache.SetIfAbsent does not return the current entry. If it does, then I could save one cache.Get in the above example.

@rueian you might also be interested in something like this coalescing bulkloader example.

Hi @ben-manes, yes, thanks for the information. Actually, rueidis does that: it collects all concurrent requests and pipes them to the redis server through one TCP connection. It works just like the eviction policies, having a ring buffer paired with two goroutines, one dedicated to sending requests to redis and the other dedicated to reading responses. Therefore, the above example looks more like this in practice:

func BulkLoad(ctx context.Context, keys ...string) []Result {
	// ... omitted

	bulkFetchExternal(ctx, pending)
	results := make([]Result, len(flights))
	for i, f := range flights {
		results[i] = f.Wait(ctx)
	}
	return results
}


func bulkFetchExternal(ctx context.Context, keys []keyidx) {
	// queue the request to the ring buffer and it will be pipelined to redis by another goroutine.
}

func readingResponses() { // this is also run in another goroutine.
	for {
		resp, err := readFromConn()
		flight, ok := cache.Get(resp.Key())
		if !ok {
			continue // no in-flight placeholder for this key
		}
		if err != nil {
			flight.err = err
			close(flight.done)
			cache.Delete(resp.Key())
		} else {
			flight.data = resp
			close(flight.done)
			cache.Set(resp.Key(), flight, someTTL)
		}
		}
	}
}

In terms of limiting project scopes by leveraging other libraries, well, unfortunately, there is no standard approach to promise in go. :(

@maypok86 (Owner) commented Feb 20, 2024

Though I think it is the users' responsibility to use cache.Delete and cache.Set correctly if they want to implement singleflight like this

If that's the case, it's fine. To oversimplify, I meant that you can overwrite or delete entries with concurrent cache.Set and cache.Delete.

One thing I am wondering is why the cache.SetIfAbsent does not return the current entry. If it does, then I could save one cache.Get in the above example.

Because this api looks very strange, especially since it is not clear what to return if there is already an association for this key.

// Is that what you mean?
func (c *Cache[K, V]) SetIfAbsent(key K, value V) (V, bool) {
    ...
}

@maypok86 (Owner) commented Feb 20, 2024

While I didn't have the mental energy to write code, I decided to check out other implementations of tinylfu in Go, and it looks like theine does indeed do something wrong with the doorkeeper. I was also surprised that theine has a lower hit ratio than go-tinylfu on almost all the traces I have locally. Could this be related to the cache-line-aligned count-min sketch? It seems like that could lead to more collisions.

S3

| Cache | 100,000 | 200,000 | 300,000 | 400,000 | 500,000 | 600,000 | 700,000 | 800,000 |
|---|---|---|---|---|---|---|---|---|
| otter | 11.37 | 20.47 | 27.77 | 32.63 | 39.50 | 46.69 | 54.05 | 62.03 |
| theine | 9.89 | 18.98 | 27.87 | 36.67 | 44.42 | 53.51 | 60.43 | 67.02 |
| tinylfu | 12.29 | 22.93 | 32.71 | 42.24 | 50.87 | 58.52 | 64.98 | 70.45 |
| lru | 2.33 | 4.63 | 7.59 | 12.04 | 22.77 | 34.63 | 46.04 | 56.60 |
| arc | 12.82 | 22.92 | 29.38 | 32.46 | 38.44 | 47.02 | 56.36 | 64.47 |

P8

| Cache | 10,000 | 20,000 | 30,000 | 40,000 | 50,000 | 60,000 | 70,000 | 80,000 |
|---|---|---|---|---|---|---|---|---|
| otter | 12.20 | 21.02 | 28.05 | 34.14 | 39.34 | 44.30 | 48.60 | 52.45 |
| theine | 10.75 | 19.40 | 26.97 | 33.87 | 39.35 | 44.13 | 48.85 | 52.58 |
| tinylfu | 13.48 | 23.42 | 31.24 | 38.15 | 43.72 | 48.58 | 53.01 | 56.68 |
| lru | 3.34 | 9.48 | 15.49 | 21.63 | 27.51 | 33.13 | 38.75 | 43.45 |
| arc | 10.17 | 18.53 | 25.04 | 30.66 | 35.55 | 40.31 | 44.56 | 47.93 |

@ben-manes

I thought at first that it was not linearizable, but in this particular example it seems fine. But if you allow the user to call functions at will, I'm not sure it's going to work.

Guava's use of a future for synchronous loading is not linearizable, since that would force a load - put - remove sequence to run in lock-step. Instead, the in-flight load can be stomped on and discarded by a mapping change. Simply joining on the future in the put wouldn't be enough, as the remove would have to wait on the put to finish as well, and a retry loop would cause non-deterministic ordering. I might be confusing myself; I just recall that it was not obvious how to make a future-based loader linearizable, and I did not try to ensure AsyncCache is via Lincheck.

Linearizability may not be a property you actually care about; we didn't in Guava's Cache. That lacked fancy computes and only had loads, so we thought that allowing stomping and the like was okay since the data was transient. I think the only issue we had was that an in-flight load was not discarded on a clear or invalidate, and some users wanted to broadcast an invalidation to ensure a consistency edge. We assumed local caches could have some staleness due to network disconnects, network delays, etc., and that users should use expiration as a safety net regardless. Overall I was pretty happy without it, but I also appreciate having it in Caffeine to offer both modes.

Because this api looks very strange, especially since it is not clear what to return if there is already an association for this key.

fwiw, in Java's ConcurrentMap#putIfAbsent it returns "the previous value associated with the specified key, or null if there was no mapping for the key."

This can't be related to the cache line aligned count-min sketch? Because that seems like it could lead to more collisions.

That should be a port of the Java sketch? The cache-aligned version was actually just for fun with @bitfaster, as the previous version was fine. In my tests it had a very tiny negative impact on the hit rate, not enough to scare me off a cool bit of code wrangling. Either way is more than good enough, but it might have exacerbated a degradation caused by something else. I recall the author saying his hill climber differs from Caffeine's, but I don't know in what ways, as that might also explain it if it makes poor adaptation choices.

@ben-manes

From his code, it looks like he is using the climber to adjust the victim's frequency instead of the region sizes. I believe that is covered in the paper as a possible scheme, using a hint or adjusting the sketch increments, but it was not as effective as changing the region sizes.
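
For context, the adaptive scheme being referred to climbs on the window size rather than the victim's frequency: sample the hit rate, keep moving in the same direction if the last step improved it, otherwise reverse, and decay the step. A sketch of that idea (not Caffeine's actual code):

```go
package adaptive

// hillClimber adjusts the admission-window size of a W-TinyLFU cache by
// simple hill climbing on the sampled hit rate.
type hillClimber struct {
	prevHitRate float64
	stepSize    float64 // current step in entries; the sign encodes direction
	sampleHits  int
	sampleReqs  int
	sampleSize  int // how many requests to observe before adapting
}

// record is called on every cache request; when a sample completes it returns
// how many entries the caller should move between the window and main regions.
func (h *hillClimber) record(hit bool) (adjustment float64, adapt bool) {
	h.sampleReqs++
	if hit {
		h.sampleHits++
	}
	if h.sampleReqs < h.sampleSize {
		return 0, false
	}
	hitRate := float64(h.sampleHits) / float64(h.sampleReqs)
	if hitRate < h.prevHitRate {
		h.stepSize = -h.stepSize // the last move hurt: reverse direction
	}
	h.stepSize *= 0.98 // decay the step so the configuration stabilizes
	h.prevHitRate = hitRate
	h.sampleHits, h.sampleReqs = 0, 0
	return h.stepSize, true
}
```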

@maypok86 (Owner)

fwiw, in Java's ConcurrentMap#putIfAbsent it returns "the previous value associated with the specified key, or null if there was no mapping for the key."

Yes, it looks like I got a bit carried away, but in Go I still have to return (V, ok), since the value can be of a non-nullable type and you don't want to allocate memory just to return the value.

That should be a port of the Java sketch?

Oh, interesting details. I'll have to look into W-TinyLFU after all...


Now one last question on the topic I seem to have the most trouble with.

What should caffeine's users do if they want to get the expiry time for a cache entry?

@ben-manes

What should caffeine's users do if they want to get the expiry time for a cache entry?

See Cache.policy() for the backdoor methods. Those are added with a little less care than the broader api, mostly to unblock users rather than to refine more broadly applicable interfaces. I very much believe that we shouldn't block developers from getting their job done, so this provides a bit of a balance between an ideal api and a pragmatic one. I try to foresee some of those usages, but more often the methods are added on request.

The simplest is likely to use cache.policy().getEntryIfPresentQuietly(key) which returns a CacheEntry with all of its metadata. A more low-level lookup is cache.policy().expireVariably() to get only the duration from getExpiresAfter(key).

@maypok86 (Owner) commented Mar 5, 2024

I tried to implement policy in Otter and here are some notes.

  1. It's not very obvious from the name that Cache.policy gives access to low-level functions, but I couldn't think of anything better (and short enough) than names like hack/trick.
  2. I tried to implement GetWithTTL/GetWithExpiration on top of this and got something not very pretty.
type Cache[K comparable, V any] struct {
	cache otter.Cache[K, V]
}

func NewCache[K comparable, V any](capacity int) (*Cache[K, V], error) {
	cache, err := otter.MustBuilder[K, V](capacity).WithTTL(time.Hour).Build()
	if err != nil {
		return nil, fmt.Errorf("create cache: %w", err)
	}
	return &Cache[K, V]{
		cache: cache,
	}, nil
}

func (c *Cache[K, V]) GetWithTTL(key K) (V, time.Duration, bool) {
	if c.cache.Has(key) {
		// It looks like what is needed is cache.Policy.GetEntry.
		entry, ok := c.cache.Policy().GetEntryQuietly(key)
		if ok {
			expiration := entry.Expiration()
			now := time.Now().Unix()
			ttl := time.Duration(expiration-now) * time.Second
			return entry.Value(), ttl, true
		}
	}
	var v V
	return v, 0, false
}

@ben-manes

Yeah, I can't advocate for the name. The best I could think of otherwise were capabilities(), advanced(), features(), and similar. I didn't want users to gravitate to it so a benign name like policy() didn't seem too awful and tbh I was more focused on implementation needs than api design. The majority of cases are served by the asMap() view, which is more obvious as a familiar platform type and offers computes, iterators, etc.

Is the redundant check needed (Has(key) and entry, ok := GetEntryQuietly) or can you rely on ok being false? That removes a nesting.

Can the entry be an external type with convenience methods, e.g. Expiration() as an int64 and TTL() as a Duration that is calculated on demand? That would make it a one-line return.
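
A small sketch of what such an entry type could look like in Go (hypothetical names, not otter's API): the entry stores only the expiration timestamp and derives the ttl on demand.

```go
package cache

import "time"

// Entry is a read-only snapshot of a cache entry's metadata.
type Entry[K comparable, V any] struct {
	key        K
	value      V
	expiration int64 // unix seconds; 0 means no expiration
}

func (e Entry[K, V]) Key() K            { return e.key }
func (e Entry[K, V]) Value() V          { return e.value }
func (e Entry[K, V]) Expiration() int64 { return e.expiration }

// TTL computes the remaining lifetime on demand from the stored expiration,
// so the entry itself only carries a single int64.
func (e Entry[K, V]) TTL() time.Duration {
	if e.expiration == 0 {
		return -1 // sentinel: the entry never expires
	}
	d := e.expiration - time.Now().Unix()
	if d <= 0 {
		return 0
	}
	return time.Duration(d) * time.Second
}
```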

@maypok86 (Owner) commented Mar 5, 2024

Oh, advanced looks interesting.

Yes, we could add a TTL function, but I don't like the fact that if the user wants to get the ttl/expiration along with the value and update the eviction policy, they have to make two calls. Especially since the entry might not be found during the GetEntryQuietly call. So the question is more about why there is no GetEntry? It looks very useful and is likely to be used more often than GetEntryQuietly. Most likely there just hasn't been a request to add it, but maybe there's some more fundamental reason?

And nesting can be removed in another way :).

func zeroValue[V any]() V {
	var zero V
	return zero
}

func (c *Cache[K, V]) GetWithTTL(key K) (V, time.Duration, bool) {
	if !c.cache.Has(key) {
		return zeroValue[V](), 0, false
	}

	entry, ok := c.cache.Policy().GetEntryQuietly(key)
	if !ok {
		return zeroValue[V](), 0, false
	}

	return entry.Value(), entry.TTL(), true
}

@ben-manes

It looks like I added it for an integration with Oracle Coherence, a distributed cache, and Verizon's Robusta, an internal persistent-memory-backed cache. They both needed to iterate over the metadata in a defined order, like the hottest or oldest entries. getIfPresentQuietly was added for a different integration ask, and those other CacheEntry apis were just future-proofing to round out the api.

Most users don't need the metadata, so it doesn't come up often. If someone had a use-case that made sense, then I'd probably add it.

@maypok86 (Owner) commented Mar 5, 2024

@ben-manes, thank you so much for your help, it was very helpful and gave me a lot more insight. 🤗

@maypok86 (Owner) commented Mar 8, 2024

I tried to write a growable queue and here are the conclusions:

  1. A queue with mutexes can be written so that its speed practically matches the speed of channels, but I never managed to overtake them, and it is unlikely to work out, since channels use a lower-level api after all.
  2. I also tried to write a bounded version of the queue from this page. It was able to overtake channels, but unfortunately it allocates memory, and if you add sync.Pool to it, its speed becomes equal to the speed of channels. :(

Also, according to observations, even with a load profile of reads=25%,writes=75%, the queue speed does not particularly affect the speed of the entire cache, but with only writes it becomes a bottleneck.

And some benchmarks.

goos: darwin
goarch: arm64
pkg: github.com/maypok86/otter/internal/queue
BenchmarkChanProdConsWork100-10                  3815607               324.0 ns/op             0 B/op          0 allocs/op
BenchmarkBlockingProdConsWork100-10              3406512               354.1 ns/op             0 B/op          0 allocs/op
BenchmarkMPSCProdConsWork100-10                  5416998               220.2 ns/op            16 B/op          0 allocs/op
BenchmarkMPSCPoolProdConsWork100-10              3699946               319.6 ns/op             0 B/op          0 allocs/op
PASS
ok      github.com/maypok86/otter/internal/queue        6.260s

So I'll probably just take a queue with mutexes as a good enough solution. Maybe later, when I get bored, I will try to implement the lock-free version, but that will definitely be much later.

@ben-manes commented Mar 9, 2024

Sounds like a great plan! That matches my recollection so a simpler mutex is a good baseline.

For (2), I originally wrote an unbounded version of that queue, with the addition of write combining, where an out-of-memory error would only happen in an artificial stress test. I switched to the bounded version when investigating a bug of runaway growth, which turned out to be in Java's ForkJoinPool, which was dropping the maintenance task due to an internal race condition. By switching to a bounded queue, the back pressure on a writer lets it try to help perform the maintenance work and recover. Had there not been the very nice JCTools growable ring buffer, I probably would have used a mutex too (maybe trying to be lock-free while at capacity and locking only when growing). That reminds me, though: for linked data structures you should take GC nepotism into consideration, e.g. I explicitly null out references on removed nodes to minimize that problem.
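
In Go terms, that last piece of advice translates to explicitly clearing a removed node's links, e.g. for an intrusive doubly-linked eviction list (a generic sketch, not otter's code):

```go
package lru

// node is an entry in an intrusive doubly-linked eviction list.
type node[K comparable, V any] struct {
	key   K
	value V
	prev  *node[K, V]
	next  *node[K, V]
}

type deque[K comparable, V any] struct {
	head *node[K, V]
	tail *node[K, V]
}

// remove unlinks n and nils out its links so that a dead node which is still
// referenced from somewhere (an iterator, a stale buffer slot) cannot keep a
// whole chain of other dead nodes and their values reachable — the "GC
// nepotism" problem mentioned above.
func (d *deque[K, V]) remove(n *node[K, V]) {
	if n.prev != nil {
		n.prev.next = n.next
	} else {
		d.head = n.next
	}
	if n.next != nil {
		n.next.prev = n.prev
	} else {
		d.tail = n.prev
	}
	n.prev = nil
	n.next = nil
	// The value can be cleared too, once eviction listeners no longer need it:
	// var zero V; n.value = zero
}
```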
