Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache metric issues #1644

Open
darkowlzz opened this issue Oct 21, 2024 · 4 comments
Open

Cache metric issues #1644

darkowlzz opened this issue Oct 21, 2024 · 4 comments

Comments

@darkowlzz
Copy link
Contributor

This is to highlight the issue with the cache metrics and discuss how to
handle it with respect to the new generic typed cache.

At present, in source-controller we have a custom cache implementation for
storing the Helm repository index, refer
https://github.com/fluxcd/source-controller/tree/v1.4.1/internal/cache. This
cache can store any value in the form of an empty interface interface{}. For
instrumenting the cache, it exports a metric by the name
gotk_cache_events_total with labels event_type, name and namespace for
type of cache event (hit or miss), name of the Flux object for which the cache
is used and namespace of the object, respectively.

Following is how it appears when exported:

# HELP gotk_cache_events_total Total number of cache retrieval events for a Gitops Toolkit resource reconciliation.
# TYPE gotk_cache_events_total counter
gotk_cache_events_total{event_type="cache_hit",name="podinfo",namespace="default"} 3
gotk_cache_events_total{event_type="cache_miss",name="podinfo",namespace="default"} 1

If it is known that the only cache in source-controller is the Helm index cache,
this seems fine. But when we need to add other caches in source-controller for
other purposes, this metric gets confusing.

In https://github.com/fluxcd/pkg/tree/main/cache, a new generic typed cache is
implemented which will be used for different cache implementations across Flux.
See fluxcd/pkg#817 for the latest implementation of the
generic typed cache.

The following sections describe the issues with the existing metrics with
respect to the new generic typed cache.

Lack of object kind information

When a new cache instance (using the existing cache implementation), say auth
token cache, is added for GitRepository, OCIRepository, etc, due to lack of the
object kind in the metric label, we can't determine which object a metric
belongs to. Two objects of different kinds may have the same name and namespace.

The new generic typed cache also has a similar metric but it also includes the object
kind, which would help determine the associated object accurately.

Fixed metric name

The cache metric has hard-coded name, gotk_cache_events_total. If new cache
instance and metrics are added, they'll have the same name. The metric
registerer panics with the error duplicate metrics collector registration attempted when attempting to register two metrics with the same name.

This is true for the new generic typed cache too. The metric name is hard-coded.

In case of the old/existing cache implementation, the stored value is an empty
interface, which can store any value. This enables the usage of the same cache
for storing different items, although that may not be ideal, depending on the
use case. A single shared cache for everything would work fine.

But the new generic typed cache can't be shared because the cache instance are
created with their type and can only store values of specific type. This property of
the new cache requires creating separate caches for storing different types of data.
A cache for storing the helm index cache of type IndexFile. Another cache for
storing auth tokens of type string. The two with hard-coded metric names can't
be registered with the prometheus registerer.


Possible solutions

I believe ideally, the new cache would replace the existing cache in
source-controller. But for that we need to address the above issues. Using the
new generic typed cache will address the first issue as it includes the object kind
in the metric. But for addressing the fixed metric name issue, we may need to
allow dynamic/variable names.

The constructor of the cache metrics can accept a prefix which can be added
before the metric's fixed name. For example, if the fixed name is
cache_events_total, a prefix gotk_helm_index could be passed, resulting in
a new metric with name gotk_helm_index_cache_events_total. This would make is
easier to instrument individual caches and identify them easily. In case of the
helm index, all the metrics will belong to HelmRepository, but for auth token
cache, the same instance of cache may be shared amongst multiple reconcilers and
the object kind would help filter the metric for specific object kinds.

This would also benefit other users of our packages who may like to enable the
caching support. They can pass their own prefix, as gotk_cache wouldn't mean
anything to their users.

Concerns

  • Although we don't publish any monitoring dashboard that uses the existing
    cache metric, renaming the metric would break the existing monitoring setup of
    the users. For just helm index cache metrics, maybe we can find some way to
    continue exporting the metric with the same name, but we may eventually have to
    drop it.
  • For the auth token caches, should the cache be shared across the objects
    kinds/reconcilers? The considerations for this may not be based on the metrics,
    but the decision will affect the way metrics will be exported. If it's shared,
    the auth cache metric could be something like gotk_auth_cache_events_total.
    If it's not shared, it could be something like
    gotk_git_auth_cache_events_total, gotk_oci_auth_cache_events_total, etc.

The new generic typed cache implementation will be updated accordingly.

@stefanprodan
Copy link
Member

stefanprodan commented Oct 21, 2024

I'm for adding the cache prefix to the package and have dedicated metrics per kind in source-controller. Given that Git and OCI can't share the same key/token for Azure, I suspect this will be the case for other providers too.

@matheuscscp
Copy link
Member

matheuscscp commented Oct 22, 2024

I think the cache metrics should have specific names based on use case, and not specifically kind. Each use case will most likely always have a specific kind associated with, but I don't think the kind is what matters the most, as we may have two different use cases that are both associated with the same kind. I can't think of a specific example from the top of my head, but storing Helm indexes and auth tokens are already an example of two different use cases for caching, and the kind they relate to is only crucial for the former.

So I'm also for adding a metric name prefix to the metric/cache constructor. I'm fine with making the kind part of this prefix, as part of the use case name.

@darkowlzz
Copy link
Contributor Author

In the last week's dev meeting, we discussed this and arrived at the following conclusion:

  • The new cache implementation should accept prefix for the metrics to prevent any conflicts.
  • The caches in source-controller will use prefix for cache metrics. The metrics will have enough information in the name to determine what the metric is about and each kind will have their own separate cache so that they can work independently, resulting in metrics like gotk_git_auth_cache_events_total, gotk_oci_auth_cache_events_total.
  • ⚠️ The existing source-controller gotk_cache_events_total metric for helm index will be replaced with more specific gotk_helm_index_cache_events_total metric. This will be a breaking change for the users of this metric.
  • Although the metric names in this case indicate the kind of the object they are associated with, for consistency with other object based metrics, the metric labels will have kind, name and namespace of the associated object.

fluxcd/pkg#817 has been updated accordingly to support these use cases.

More discussion is required to determine how the new caches will be configured. But that can be discussed in a separate issue related to those new caches.

@darkowlzz
Copy link
Contributor Author

Re-opening to keep track of when this will be implemented, especially for the helm index metrics change which will be a breaking change and need to be mentioned in the release notes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants