Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix: Validate tags on cache hit #127

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

MGApcDev
Copy link

@MGApcDev MGApcDev commented Jun 21, 2024

Sometimes updates does not invalidate all caches relevant to the model in distributed system (unknown reason)

We are running Lada-cache on a big website with a lot of users.
We can't determine the cause, but sometimes fetch requests, will give us an old (stale) state of the model.

Technically from the code this should not be able to happen, if everything works, so I cannot correct any specific file.
But we can see the error state, so we know why we are getting stale caches.


Bug state:

  • Key1 is cached (tags: Tag1)
  • Key2 is cached (tags: Tag1)
  • Tag1 has Key2
  • Tag1 does NOT have Key1 // <-- Issue
  • When an update happens that will clear Tag1, it clears Key2, but it does not clear Key1
  • When we fetch Key1 we still get the old cached result
  • Invalidating Tag1 will never clear Key1, we have to await for Key1 to expire

Expected state:

  • Key1 is cached (tags: Tag1)
  • Key2 is cached (tags: Tag1)
  • Tag1 has Key2
  • Tag1 HAS Key1
  • When an update happens that will clear Tag1, it clears Key2, and clears Key1
  • When we fetch Key1, it will query the database for a fetch result.

  • The correction in my PR; is that when we get Cache hits, we make sure that the Key is in the tags that they are suppose to be in.
  • This does introduce a little performance overhead to cache hits
  • Tags are already calculated so the hit comes trying to verify the key is in the required tags.
  • The Redis tags are Unique atomic sets, so you cannot add the same key to 1 set. So Redis should have to perform a READ operation for each tag, to check if the Key is in the tag.

Performance on cache hits:

  • Performance before : 1x Redis READ
  • Performance after (tags are correct): 1x Redis READ + (number of tags)x Redis READ
  • Performance after (1 tag is missing): 1x Redis READ + (number of tags)x Redis READ + 1x Redis WRITE

We have currently been running my branch on our production environment for 2 weeks, and it seems to correct the issue completely.
If the performance overhead is a concern, we could also add this as an option to the config and default it to false. That way we don't have any extra overhead for users that don't experience the issue.

 'validate-tags-on-cache-hit' => env('LADA_CACHE_VALIDATE_TAGS_ON_CACHE_HIT', false),

I've tried to add a test that simulates the issue case.

@spiritix
Copy link
Owner

Thanks for your contribution, highly appreciated.

It would be very interesting to find out why the key doesn't get the required tag. I will investigate this further once I find some time. I'll probably add your PR with an additional option, need to check the impact it has on performance.

@MGApcDev
Copy link
Author

MGApcDev commented Jun 25, 2024

Thanks 🙌
We've been trying a few Eloquent model caching libraries, and this one performs by far the best.


Originally I was suspecting it was Redis getting full, and it was doing cache evictions on the tags in Redis (but I don't think that anymore).
We experienced the issues in a few different Eloquent model and different clients, we can't really see a pattern to it.
But then we found one Model instance in our production environment, where we could consistently replicate the issue.
Unfortunately it's limited what we can debug our production.

Very weird issue.
It was an Eloquent model with ID 2618
We had a few different select queries on that model for that ID, but 1 query kept returning the old stale value.
We could clear the whole lada-cache, which solved the stale issue.
But next time we tried to update that model instance, all the caches would cleared, expect for that 1 query.

Thought maybe that the 1 cache didn't generate the correct tags, but I can verify with the change in this PR, that was not the case.

Maybe there is another issue in the code like #124 causing this, but I can't find the cause 🤔
(say if you want any more information, I'll try to get it for you)

@spiritix
Copy link
Owner

If you know the particular query which caused the issue, can you try to reproduce it locally with that query?

@MGApcDev
Copy link
Author

I think we can trace it down to query:

select * from  `appearances` where  `id` = 2618 limit 1

But we have not been able to reproduce the issue locally.

@spiritix
Copy link
Owner

That looks like a rather simple query. I suspect it's an issue with Redis. Can you post the relevant PHP code please?

@MGApcDev
Copy link
Author

MGApcDev commented Jun 27, 2024

The updating code:

...
$appearance = $user->client->getAppearance($appearanceId);
$data = $request->getRequestPayload();
$appearance->update($data); // <-- Native Eloquent model update.

The fetching code. That was stuck for 1 model instance (99.9% of the time there is no issue):

...
$client = $this->getClient();
if ($client) {
    if ($this->clientHost->appearance) {
        $this->clientHost->appearance->refresh(); // <-- this line should Query the new instance
        $this->appearance = $this->clientHost->appearance;
    } else {
        $this->appearance = $client->getDefaultAppearance();
    }
}
return $this->appearance;

We've only ever seen the issue on production.
Unsure if it's because a lot of traffic, might trigger the issue (concurrency issue).
Or we just have more users, so there is a bigger probability of someone reporting the issue.

@spiritix
Copy link
Owner

Yeah I think it's some kind of race condition. Will be pretty hard to investigate. Going to have an in-depth look for the next release.

@MGApcDev
Copy link
Author

MGApcDev commented Sep 16, 2024

Hey, just wondering if there is any updates here.

I think it's just a really hard issue to investigate.
With my fork it's run very stable on our production environment.
Due to the slight performance overhead, I think it should be an option in the config, what do you think?

 'validate-tags-on-cache-hit' => false,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants