Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Local schema cache weighted invalidation based on schema size #2598

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

mattsewall-stripe
Copy link

Summary

  • While running schema registry we have observed scenarios where large schemas that contain many versions do not get invalidated quickly enough from the local schema cache. (Similar to large objects in an avalanche rising to the surface, large schemas with many references stay in the cache for max time period.)
  • Rather than increase the total heap space of our schema registry nodes or decrease the total size of the cache we've solved this issue with weight based local schema cache invalidation. Our schemas vary greatly in size and some of them are very large. It’s not efficient for us to size heap to accommodate a full cache of very large schemas without OOMing.
  • No performance degradation has been observed in our schema registry instance, but would love to work through mainline benchmark strategies as needed. Our hypothesis for this reasoning is that our cache hit % has stayed constant between sized based cache and weight based cache.
  • Included in this pr is metrics to track the local schema cache in both size and hits + misses.
  • By default this feature is not enabled.

Motivation

  • We have observed scenarios operating the schema registry where during periods of heavy write traffic the leader node will OOM based on the local cache becoming too large. This write traffic is extremely spiky and contains large schemas that contain many versions (important for checking compatibility).
  • This change hopefully serves to provide operational clarity into managing the local schema cache + provide a fix for this failure scenario.

@mattsewall-stripe mattsewall-stripe requested a review from a team as a code owner April 6, 2023 21:32
* <code>schema.cache.maximum.weight</code>
*/
public static final String SCHEMA_CACHE_MAXIMUM_WEIGHT_CONFIG = "schema.cache.maximum.weight";
public static final int SCHEMA_CACHE_MAXIMUM_WEIGHT_DEFAULT = 1000000;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sum of canonical schema length must be under 1,000,000 characters is what this variable means. From the guava cache docs there's a lot of fuzziness as to how things get invalidated within the cache itself (based on usage and % of contribution to weight total?) but we've set this limit such that large schemas should be getting invalidated over time. What this essentially means is that more smaller schemas will be cached (and that's good!) as we should be able to cache quite a few of them. No significant performance degradation has materialized from making this change.

rayokota
rayokota previously approved these changes Apr 6, 2023
Copy link
Member

@rayokota rayokota left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @mattsewall-stripe!

@rayokota rayokota dismissed their stale review April 6, 2023 23:01

The build failed with a few checkstyle errors

* <code>schema.cache.use.weight</code>
*/
public static final String SCHEMA_CACHE_USE_WEIGHT_CONFIG = "schema.cache.use.weight";
public static final boolean SCHEMA_CACHE_USE_WEIGHT_DEFAULT = false;
Copy link
Author

@mattsewall-stripe mattsewall-stripe Apr 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turned off by default but in my opinion likely could be turned on by default.

@mattsewall-stripe
Copy link
Author

@rayokota

Hey! I apologize for the delay but have built project locally and ran tests :)

@mattsewall-stripe
Copy link
Author

Hi! @rayokota can you take another look? Coming around to this again

@cla-assistant
Copy link

cla-assistant bot commented Sep 25, 2023

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants