Shorten critical section in findEviction #217
Conversation
if (!toRecycle)
  continue;
Suggested change:

if (!toRecycle) {
  continue;
}
This happens iff we exhausted the entire linked list and haven't found a candidate, while not exhausting searchTries, right?
Yes, that's correct.
Please apply the formatting suggestion.
// remove the child from the mmContainer as we will not be evicting
// it. We could abort right here, but we need to cleanup in case
// unmarkForEviction() returns 0 - so just go through normal path.
if (!toRecycle_->isChainedItem() ||
Can you elaborate more?
In the old way of eviction, if we are trying to recycle a chained item (toRecycle.isChainedItem()) but the parent has changed, we don't do any specific treatment until releaseBackToAllocator: there we walk the chain with the parent being candidate and find that toRecycle was never reached. So releaseBackToAllocator fails and we continue iterating the eviction queue.
How does the new way work?
The new way is pretty much the same: as this comment states, we don't do any specific treatment either. We'll just try to evict the item as usual and leverage the existing logic in releaseBackToAllocator.
It might be possible to optimize this case but for this PR I felt like having similar behavior is better.
The only reason for this if here is to optimize the happy case: if the item is not a chained item, or the parent hasn't changed, we can safely remove the item from the MMContainer (we already hold the MMContainer lock, so doing it here has low overhead). If the parent has changed, we cannot call mmContainer.remove() here as it could lead to deadlocks. Instead, we'll just call mmContainer.remove() inside unlinkItemForEviction, which is slower (we need to acquire the lock again) but safe.
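For reference, the check being discussed looks roughly like this (a sketch based on the snippet quoted above; helpers such as asChainedItem() and getParentItem() are assumed to follow the usual CacheLib item API and may differ from the exact PR code):

  if (!toRecycle_->isChainedItem() ||
      &toRecycle_->asChainedItem().getParentItem(compressor_) == candidate_) {
    // Happy case: not a chained item, or the parent is still our candidate.
    // We already hold the MMContainer lock, so removing here is cheap.
    mmContainer.remove(itr);
  }
  // Otherwise skip the removal here (it could deadlock) and let
  // unlinkItemForEviction() re-acquire the MMContainer lock and remove it.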
I understand the optimization now.
I think as long as the chained item can't change its parent after this point, this is safe. And since we marked the parent as forEviction, it can't be moved or accessed, so we don't have to take chainedItemLock_ here.
How much does this optimization give us?
Using the default hit_ratio/graph_cache_leader_fbobj/config.json, this optimization improves throughput by ~40% on my machine. This is mostly due to an almost 100% eviction success rate (without the optimization I get a ~50% success rate).
token = createPutToken(*candidate_);

if (shouldWriteToNvmCache(*candidate_) && !token.isValid()) {
shouldWriteToNvmCache is called once in createPutToken and another time here; can we make it so it's called only once? I think we can just do it the old way?
Since I wrapped token creation in createPutToken, I no longer know whether the token is invalid due to a concurrent fill or because shouldWriteToNvmCache returned false. Perhaps I should just put stats_.evictFailConcurrentFill.inc(); inside createPutToken?
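A minimal sketch of that idea, assuming a helper shaped like the one discussed here (the real signature and PutToken type in CacheLib may differ):

  typename NvmCacheT::PutToken createPutToken(Item& item) {
    const bool evictToNvmCache = shouldWriteToNvmCache(item);
    auto token = evictToNvmCache ? nvmCache_->createPutToken(item.getKey())
                                 : typename NvmCacheT::PutToken{};
    if (evictToNvmCache && !token.isValid()) {
      // An invalid token at this point can only mean a concurrent fill,
      // so the counter can be bumped inside the helper itself.
      stats_.evictFailConcurrentFill.inc();
    }
    return token;
  }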
Is there a reason you have to create a function for createPutToken?
I wanted to use it in other places as well. But since for now it's only used in findEviction I can remove it. Done.
Also, I just realized I bumped evictFailParentAC and evictFailAC even if the failure was due to an invalid token - I fixed this as well.
Shorten critical section in findEviction and enable combinedLocking support. Shorten the MMContainer critical section by removing the item from the Access Container after the MMContainer lock is dropped. Only markForEviction is executed under the MMContainer critical section now. If markForEviction succeeds, the item is guaranteed to be evicted.
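In code, the new flow is roughly the following (a simplified sketch of the description above; NVM token handling and search limits are omitted, and the iterator API is assumed to match withEvictionIterator as used elsewhere in this PR):

  Item* candidate = nullptr;
  mmContainer.withEvictionIterator([&](auto&& itr) {
    // Under the MMContainer lock: only pick a candidate and markForEviction().
    while (itr) {
      Item& item = *itr;
      if (item.markForEviction()) {
        candidate = &item;
        break;
      }
      ++itr;
    }
  });
  if (candidate) {
    // The MMContainer lock is dropped by now; finish the unlink outside it.
    accessContainer_->remove(*candidate);
    removeFromMMContainer(*candidate);
  }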
Force-pushed from efba6bc to 0253683.
@haowu14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
if (evictToNvmCache)
  token = nvmCache_->createPutToken(candidate_->getKey());
Suggested change:

if (evictToNvmCache) {
  token = nvmCache_->createPutToken(candidate_->getKey());
}
And this might have been lost in one of the patches you added after you ran the test, but there is a bug here. The token can be created for candidate A, then candidate A fails to markForEviction and we move to the next item. For candidate B, evictToNvmCache is false but markForEviction succeeds. We'll exit this loop with candidate B but with the token created from candidate A.
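In other words, the problematic sequence looks roughly like this (a simplified sketch; variable names follow the quoted snippet, not the exact source):

  while (itr) {
    auto* candidate_ = &(*itr);
    const bool evictToNvmCache = shouldWriteToNvmCache(*candidate_);
    if (evictToNvmCache) {
      // For candidate A this assigns A's token to the outer `token`...
      token = nvmCache_->createPutToken(candidate_->getKey());
    }
    if (candidate_->markForEviction()) {
      // ...but if A fails to markForEviction and a later candidate B (with
      // evictToNvmCache == false) succeeds, we exit holding B together with
      // A's stale token.
      candidate = candidate_;
      break;
    }
    ++itr;  // `token` is not cleared before moving to the next item
  }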
I saw a crash in a production test with the exception "Item is not nvm evicted and nvm clean", and I suspect this is the reason. I'm putting together a hacky fix on my end to resume the test.
@igchor I continued the production test with a hacky change where I do auto token_ = createPutToken(...) and only do token = std::move(token_) iff the item is successfully marked for eviction. However, I noticed a regression in allocation latency, mostly caused by creating and destroying PutTokens.
I think this is because in this PR we try to create the put token before marking for eviction, while before this PR we checked refcount == 0 and called markMoving before trying to create the PutToken.
Can we do it the same way as before?
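The hacky fix described above would look roughly like this (a sketch showing only the token handling inside the candidate loop; names follow the discussion):

  auto token_ = createPutToken(*candidate_);
  if (shouldWriteToNvmCache(*candidate_) && !token_.isValid()) {
    stats_.evictFailConcurrentFill.inc();
    ++itr;
    continue;
  }
  if (!candidate_->markForEviction()) {
    ++itr;      // token_ is destroyed here, so no stale token can leak out
    continue;
  }
  token = std::move(token_);  // the outer token now belongs to this candidate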
@igchor has updated the pull request. You must reimport the pull request before landing.
We need to create the token before marking for eviction because … Could you share what the … is? Also, have you tried running with "useCombinedLockForIterators": true?
@igchor Not yet. I'd need to make sure that with this being false there's no regression (I hope).
@haowu14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Got it. Right, regarding useCombinedLockForIterators, I would expect no regression when it's set to false. When it's set to true, I'd expect lower allocate latency and higher throughput at the cost of slightly higher find latency.
In the most recent version there's no significant regression, however I don't see an improvement with combined locking either. Our test is a shadow setup, meaning that the binaries running base (trunk) and test (this PR) receive the exact same gets/sets at the same time, regardless of how quickly each binary can process the requests. So we can't look at QPS directly for comparison. I'll run the experiment again on a new pair of servers, since this existing pair ran my buggy code and may not have exactly the same cache content (but I don't think it matters much). My questions for you:
Also, one more thing that we've noticed when running benchmarks: with this patch, we can get similar (or even better) performance (QPS and latency) while using a lower AC
Sorry for the late review. This LGTM overall. A few comments inline.
template <typename CacheTrait>
void CacheAllocator<CacheTrait>::unlinkItemForEviction(Item& it) {
  XDCHECK(it.isMarkedForEviction());
  XDCHECK(it.getRefCount() == 0);
nitpick: XDCHECK_EQ(0, it.getRefCount());
Done.
  XDCHECK(it.isMarkedForEviction());
  XDCHECK(it.getRefCount() == 0);
  accessContainer_->remove(it);
  removeFromMMContainer(it);
Didn't we already remove this item from the mm-container in findEviction at L1313 (if it were a parent or just a regular item)?
Yes, we might have removed the item, but removeFromMMContainer checks if the item is still in the MMContainer and only then calls mmContainer.remove(). We need to keep this call here because if the parent of the item changed, the item will still be in the MMContainer.
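For context, removeFromMMContainer is essentially a guarded removal along these lines (paraphrased sketch, not copied from the source):

  template <typename CacheTrait>
  bool CacheAllocator<CacheTrait>::removeFromMMContainer(Item& item) {
    // Only remove if the item is still linked in an MMContainer; otherwise
    // this is a no-op and we report that nothing was removed.
    if (item.isInMMContainer()) {
      return getMMContainer(item).remove(item);
    }
    return false;
  }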
  // Since we managed to mark the item for eviction we must be the only
  // owner of the item.
  const auto ref = it.unmarkForEviction();
  XDCHECK(ref == 0u);
same nitpick: XDCHECK_EQ
Done
if (evictToNvmCache && !token_.isValid()) {
  stats_.evictFailConcurrentFill.inc();
} else if (candidate_->markForEviction()) {
  XDCHECK(candidate_->isMarkedForEviction());
Can we refactor this a bit to have fewer nested if-statements? E.g.:
if (evictToNvmCache && !token_.isValid()) {
  stats_.evictFailConcurrentFill.inc();
  ++itr;
  continue;
}

auto markedForEviction = candidate_->markForEviction();
if (!markedForEviction) {
  ++itr;
  continue;
}
// ...
Sure, done.
if (candidate->getRefCount() != 0 || !candidate->markMoving()) {
  ++itr;

while (config_.evictionSearchTries == 0 ||
       config_.evictionSearchTries > searchTries) {
Do we still need this when we're already doing a while loop inside withEvictionIterator? Can we just reset the iterator internally and restart the search under the lock?
Potentially yes, but it might have an impact on performance. The current implementation behaves pretty much the same as the original implementation. If we got rid of this loop, the MMContainer lock could be held for a longer duration.
Resetting the iterator under the lock might actually be a better solution but I just didn't want to change the behavior too much in this PR - perhaps it would be better to do it in a separate PR and evaluate it?
Also, I think the outer loop is also needed for cases when releaseBackToAllocator ends up not releasing the item we intend to - this is already done outside the lock.
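For context, the loop nesting being discussed has roughly this shape (a simplified sketch; NVM token handling and the chained-item logic are omitted):

  unsigned int searchTries = 0;
  while (config_.evictionSearchTries == 0 ||
         config_.evictionSearchTries > searchTries) {
    Item* candidate = nullptr;
    mmContainer.withEvictionIterator([&](auto&& itr) {
      // Inner loop: walk the queue under the MMContainer lock until a
      // candidate is marked for eviction, the list is exhausted, or we run
      // out of searchTries.
      while (itr && (config_.evictionSearchTries == 0 ||
                     config_.evictionSearchTries > searchTries)) {
        ++searchTries;
        if (itr->markForEviction()) {
          candidate = &(*itr);
          break;
        }
        ++itr;
      }
    });
    if (!candidate) {
      continue;  // list exhausted: retry with a fresh iterator (lock dropped)
    }
    // Outside the lock: evict and call releaseBackToAllocator(); on success
    // return the recycled item, otherwise fall through and retry.
  }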
Force-pushed from 7a033e6 to 5d5c3f0.
@igchor has updated the pull request. You must reimport the pull request before landing.
Performance tests have passed with autoFDO off, so we are good performance-wise.
@haowu14 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This pull request has been reverted by ed9e088.
Crashed in a staging test for a service. Validating a fix. We'll land it again once it's fixed.
We upstreamed this again in ae7442a. The problem was the lifetime of putToken. It could live after the …
@haowu14 thanks for letting us know! We'll try to apply the fix on our end and validate it as well. We've actually seen a few …
This is the next part of the 'critical section' patch after #183.
Original PR (some of the changes already upstreamed): #172
This PR contains changes to findEviction only. The remaining part (changes in SlabRelease code) will be provided in the next (final) PR.