-
Notifications
You must be signed in to change notification settings - Fork 190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cache metric issues #1644
Comments
I'm for adding the cache prefix to the package and have dedicated metrics per kind in source-controller. Given that Git and OCI can't share the same key/token for Azure, I suspect this will be the case for other providers too. |
I think the cache metrics should have specific names based on use case, and not specifically kind. Each use case will most likely always have a specific kind associated with, but I don't think the kind is what matters the most, as we may have two different use cases that are both associated with the same kind. I can't think of a specific example from the top of my head, but storing Helm indexes and auth tokens are already an example of two different use cases for caching, and the kind they relate to is only crucial for the former. So I'm also for adding a metric name prefix to the metric/cache constructor. I'm fine with making the kind part of this prefix, as part of the use case name. |
In the last week's dev meeting, we discussed this and arrived at the following conclusion:
fluxcd/pkg#817 has been updated accordingly to support these use cases. More discussion is required to determine how the new caches will be configured. But that can be discussed in a separate issue related to those new caches. |
Re-opening to keep track of when this will be implemented, especially for the helm index metrics change which will be a breaking change and need to be mentioned in the release notes. |
This is to highlight the issue with the cache metrics and discuss how to
handle it with respect to the new generic typed cache.
At present, in source-controller we have a custom cache implementation for
storing the Helm repository index, refer
https://github.com/fluxcd/source-controller/tree/v1.4.1/internal/cache. This
cache can store any value in the form of an empty interface
interface{}
. Forinstrumenting the cache, it exports a metric by the name
gotk_cache_events_total
with labelsevent_type
,name
andnamespace
fortype of cache event (hit or miss), name of the Flux object for which the cache
is used and namespace of the object, respectively.
Following is how it appears when exported:
If it is known that the only cache in source-controller is the Helm index cache,
this seems fine. But when we need to add other caches in source-controller for
other purposes, this metric gets confusing.
In https://github.com/fluxcd/pkg/tree/main/cache, a new generic typed cache is
implemented which will be used for different cache implementations across Flux.
See fluxcd/pkg#817 for the latest implementation of the
generic typed cache.
The following sections describe the issues with the existing metrics with
respect to the new generic typed cache.
Lack of object kind information
When a new cache instance (using the existing cache implementation), say auth
token cache, is added for GitRepository, OCIRepository, etc, due to lack of the
object kind in the metric label, we can't determine which object a metric
belongs to. Two objects of different kinds may have the same name and namespace.
The new generic typed cache also has a similar metric but it also includes the object
kind, which would help determine the associated object accurately.
Fixed metric name
The cache metric has hard-coded name,
gotk_cache_events_total
. If new cacheinstance and metrics are added, they'll have the same name. The metric
registerer panics with the error
duplicate metrics collector registration attempted
when attempting to register two metrics with the same name.This is true for the new generic typed cache too. The metric name is hard-coded.
In case of the old/existing cache implementation, the stored value is an empty
interface, which can store any value. This enables the usage of the same cache
for storing different items, although that may not be ideal, depending on the
use case. A single shared cache for everything would work fine.
But the new generic typed cache can't be shared because the cache instance are
created with their type and can only store values of specific type. This property of
the new cache requires creating separate caches for storing different types of data.
A cache for storing the helm index cache of type
IndexFile
. Another cache forstoring auth tokens of type string. The two with hard-coded metric names can't
be registered with the prometheus registerer.
Possible solutions
I believe ideally, the new cache would replace the existing cache in
source-controller. But for that we need to address the above issues. Using the
new generic typed cache will address the first issue as it includes the object kind
in the metric. But for addressing the fixed metric name issue, we may need to
allow dynamic/variable names.
The constructor of the cache metrics can accept a prefix which can be added
before the metric's fixed name. For example, if the fixed name is
cache_events_total
, a prefixgotk_helm_index
could be passed, resulting ina new metric with name
gotk_helm_index_cache_events_total
. This would make iseasier to instrument individual caches and identify them easily. In case of the
helm index, all the metrics will belong to HelmRepository, but for auth token
cache, the same instance of cache may be shared amongst multiple reconcilers and
the object kind would help filter the metric for specific object kinds.
This would also benefit other users of our packages who may like to enable the
caching support. They can pass their own prefix, as
gotk_cache
wouldn't meananything to their users.
Concerns
cache metric, renaming the metric would break the existing monitoring setup of
the users. For just helm index cache metrics, maybe we can find some way to
continue exporting the metric with the same name, but we may eventually have to
drop it.
kinds/reconcilers? The considerations for this may not be based on the metrics,
but the decision will affect the way metrics will be exported. If it's shared,
the auth cache metric could be something like
gotk_auth_cache_events_total
.If it's not shared, it could be something like
gotk_git_auth_cache_events_total
,gotk_oci_auth_cache_events_total
, etc.The new generic typed cache implementation will be updated accordingly.
The text was updated successfully, but these errors were encountered: