
ClusterAPI custom resource monitoring #37

Closed
wants to merge 6 commits

Conversation


@bavarianbidi commented May 30, 2022

Signed-off-by: Mario Constanti [email protected]

Status:

Signed-off-by: Mario Constanti <[email protected]>
@bavarianbidi requested review from a team, May 30, 2022 12:21
@fiunchinho
Member

Thanks for the RFC.

The CASM upstream proposal's "future work" section mentions adding provider metrics in the future, so I guess solving this means discussing the best solution with upstream.

Talking about the solutions proposed here (and without having talked to upstream people yet), I feel like having different implementations for different providers, living in different applications/repos, is more aligned with how the CAPI project is currently designed: different providers are implemented as different applications. As mentioned in the RFC, trying to put all potential provider metrics implementations into a single app would bring many code dependencies into the project (or a complex use of unstructured objects, with conditionals to use different fields for different providers). Apart from code dependencies, it would introduce provider coupling, meaning that development for one provider could affect other providers.

Using different implementations for different providers makes deployment harder, but that's a problem we are already solving for the CAPI controllers anyway.

Signed-off-by: Mario Constanti <[email protected]>
cluster-api-state-metrics-adoption/README.md (review thread)

But as the code isn't that complex, it's very easy to extend the existing implementation to support infrastructure-provider-specific metrics.

Besides generated changes in `go.mod` and `go.sum`, only the following changes are needed to add the two new metrics `capi_openstackcluster_created` and `capi_openstackcluster_paused` to CASM.
Contributor

Would it be so complex & dirty to add different providers by config? Like passing the GVK as config and using unstructured objects in the code.

Author

> Would it be so complex & dirty to add different providers by config? Like passing the GVK as config and using unstructured objects in the code.

My assumption was to align with the already existing implementation (import the type and reuse it) as much as possible.

If we're using unstructured objects, we have to deal with different fields for all the different providers in use; I'm not sure it's worth doing.

@fiunchinho already mentioned this above:

> As mentioned in the RFC, trying to put all potential provider metrics implementations into a single app would bring many code dependencies into the project (or a complex use of unstructured objects, with conditionals to use different fields for different providers).

IMO: as the upstream implementation doesn't have a solution for provider-specific metrics yet, I guess it's worth starting with a simple solution for now (the short-term benefits for us might be huge). And once the discussion starts upstream we can take part there, as we'll have to refactor this component anyway once upstream has found a good solution. WDYT @erkanerol ?

Contributor

I think it is the simplest solution for us (OpenStack provider), but I am not sure it is the optimal solution for the whole of Giant Swarm. We can mention this in the next KaaS sync.

Nevertheless, I don't see any big risk in starting with this solution and then improving it over time. So +1 from my side.

Author

Added a note for the upcoming KaaS sync.

Contributor

What was the conclusion in that sync?

Author

> What was the conclusion in that sync?

That everyone else will take a look at this PR. Will circulate it again in the next couple of days.


What's left for this RFC now?

Author

My time to convert it into a new RFC ... still on my list.

* WIP: notes about ksm
* changed highlighting for new .go file
Signed-off-by: Mario Constanti <[email protected]>
@bavarianbidi changed the title from "Adoption of CASM" to "ClusterAPI custom resource monitoring", Jun 8, 2022
@bavarianbidi requested review from a team and erkanerol, June 8, 2022 04:55
- Scope of KSM differs between management and workload clusters
  - resource requirements
  - permissions
- Changes in the KSM app affect all clusters, even if the changes are only relevant for management clusters (e.g. a version bump)
Contributor

Can we just change those values at deployment time depending on the target cluster? I think we do that for other components too.

Contributor

My understanding is yes, and this seems like the best solution to me. We already use Helm templates to differentiate app behavior (including monitoring stuff) according to the environment.

Author

Yes, it's possible. It's more about accepting the impact if we go this way.

Member

I like this variant mixed with things from other variants 😄

I'm wondering if the following would be possible. I was thinking about one KSM instance as described in this variant, but with the CustomResourceStateMetrics resources bundled with the CAPI apps, like cluster-api-app and cluster-api-provider-$provider-app. Same as explained in variant 2.3, but without having different KSM instances. I think in this case we'd get the benefits without having to deploy several KSMs and waste resources.

Author

That's exactly the idea KSM folks mentioned at some point in Slack (TODO marioc: link to that statement): each third-party component can ship its own CustomResourceStateMetrics configuration and KSM takes care of it.

Currently, CustomResourceStateMetrics is only defined as a key in the ConfigMap which KSM reads.
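To make that concrete, here is a minimal sketch of how this wiring could look today (the ConfigMap name, metric name, and flag wiring are assumptions based on the upstream KSM docs; the exact schema depends on the KSM version):

```yaml
# Sketch: the CustomResourceStateMetrics document is just a key in a
# ConfigMap that KSM reads, e.g. via --custom-resource-state-config-file.
# Names and paths here are illustrative, not our actual setup.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-state-metrics-cr-config   # hypothetical name
data:
  config.yaml: |
    kind: CustomResourceStateMetrics   # the "header" mentioned above
    spec:
      resources:
        - groupVersionKind:
            group: cluster.x-k8s.io
            version: v1beta1
            kind: Cluster
          metrics:
            - name: created
              help: Unix creation timestamp of the Cluster resource
              each:
                type: Gauge
                gauge:
                  path: [metadata, creationTimestamp]
```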


- Cluster API versions must be reflected in the `CustomResourceStateMetrics` configuration. We will potentially diverge on the Cluster API versions used per infrastructure provider.

#### Variant 2.2: dedicated `kube-state-metrics` instance on management cluster per CAPI provider

For this I understand we could put a bunch of `{{ if eq .Values.provider "openstack" }}` conditionals in the Helm chart and that would do, right?

Author

With variant 1 and variant 2.1 we will have a lot of `{{ if eq .Values.provider "openstack" }}` conditionals in the repos (and therefore we have to decide how to handle the corresponding app releases, as we could break other providers' monitoring).
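For illustration, a hypothetical fragment of such a chart template (the `.Values.provider` key and the API versions are assumptions, not our actual chart):

```yaml
# Sketch of a Helm-templated KSM configuration where provider-specific
# resources are only rendered for the matching provider.
    spec:
      resources:
        # ... common CAPI core resources ...
{{- if eq .Values.provider "openstack" }}
        - groupVersionKind:
            group: infrastructure.cluster.x-k8s.io
            version: v1alpha5          # hypothetical API version
            kind: OpenStackCluster
{{- end }}
{{- if eq .Values.provider "aws" }}
        - groupVersionKind:
            group: infrastructure.cluster.x-k8s.io
            version: v1beta1           # hypothetical API version
            kind: AWSCluster
{{- end }}
```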

@pipo02mix
Contributor

I am still confused whether this will be implemented in CASM or whether KSM will phase it out.

@bavarianbidi
Author

> I am still confused whether this will be implemented in CASM or whether KSM will phase it out.

The upstream discussion about KSM is still open (kubernetes/kube-state-metrics#1755 (comment), with no feedback from the KSM folks right now), but I still assume that we will go with KSM.

My idea was to get feedback on both possible implementations, so that once upstream has made a decision we can immediately start with our implementation.

@bavarianbidi
Author

For now I would start with a dedicated app, as I won't run the kube-system-based KSM from a forked custom branch (https://github.com/kubernetes/kube-state-metrics/compare/master...chrischdi:poc-additional-metric-types-2?expand=1).

@bavarianbidi
Author

@JosephSalisbury as you're taking the CAPI monitoring topic into an upcoming "product sync", I'm currently thinking about closing this RFC (with a comment) and creating a new RFC to

- adopt variant 2.1 as the general-purpose approach for future CR monitoring in management clusters (and eventually in WCs as well, if needed)
- and, if we (as GS) believe that a solution where
  - KSM in the long term provides a CustomResourceStateMetrics CRD and watches/lists/gets all CustomResourceStateMetrics CRs
  - and each controller implementation (e.g. ClusterAPI, Flux, ...) creates/serves its own metric configuration of type CustomResourceStateMetrics

  might be the better approach, we could think about implementing/contributing this feature to KSM.

During the initial implementation of cluster-api-monitoring-app and while demonstrating the special KSM instance, I often got asked whether CustomResourceStateMetrics is a CRD in the cluster (currently it's only the "header" of the configuration and is only used as an anchor for generating it).

Until now, the idea itself has only been published once, in a Slack thread by KSM maintainer Manuel Krug:

> I'm also currently thinking about a kube-state-metrics-customresourcemetrics CRD, so each CRD could ship their own set of metric definitions with its distribution and kube-state-metric would pick it up
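Purely as a thought experiment (no such CRD exists in KSM at the time of writing; the group, version, and field path below are invented), a provider app could then ship its own metric configuration as a namespaced CR:

```yaml
# Hypothetical CustomResourceStateMetrics CR that KSM would watch/list/get,
# bundled with e.g. a cluster-api-provider-openstack app.
apiVersion: kube-state-metrics.io/v1alpha1   # invented group/version
kind: CustomResourceStateMetrics
metadata:
  name: capo-metrics
  namespace: monitoring
spec:
  resources:
    - groupVersionKind:
        group: infrastructure.cluster.x-k8s.io
        version: v1alpha5
        kind: OpenStackCluster
      metrics:
        - name: paused
          help: Whether the OpenStackCluster is paused
          each:
            type: Gauge
            gauge:
              path: [spec, paused]   # illustrative field path
```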

@JosephSalisbury
Contributor

JosephSalisbury commented Aug 1, 2022

Yeah, I reckon recreating/updating #37 to focus on variant 2.1 (i.e. present all the options, make 2.1 the one we're going to go with) would be neat, and focusing it more on CR monitoring in general.

I'm not sure yet about CustomResourceStateMetrics CRs; it feels a bit too far away for us to make a bet on. I could see us having some app that we deploy with our config in it, and then, if CustomResourceStateMetrics CRs become a thing, we break that apart. Does that sound reasonable?

@fiunchinho
Member

Should we update the RFC with the approach currently being used and try to close this PR?

@pipo02mix
Contributor

Yeah, it would be a good idea. WDYT @bavarianbidi ?

@bavarianbidi
Author

> Yeah, it would be a good idea. WDYT @bavarianbidi ?

Yep, I agree ... I already have a local branch open, but currently I'm focused on making some progress on the MC-bootstrap-behind-a-proxy thing.

Will try to invest some time this week/next week to make the current status clear.

@github-actions

github-actions bot commented Mar 4, 2023

This RFC has had no activity for 100 days. It will be closed in 1 week, unless activity occurs or the label lifecycle/keep is added.

@bavarianbidi
Author

> Will try to invest some time this week/next week to make the current status clear.

Oh man, what was I thinking 🙈 Let's give it another try in the next couple of weeks 🤞

@bavarianbidi
Author

This PR will be closed soon.
The current state is documented here: https://github.com/giantswarm/giantswarm/pull/27092

All future discussions about the possibilities of KSM's custom resource monitoring feature will be moved into a dedicated issue, as there is a lot of progress upstream at the moment.

@bavarianbidi
Author

Current CAPI monitoring is described in https://intranet.giantswarm.io/docs/monitoring/capi_monitoring/.
This RFC will be closed as we have a solution for now. Any additional improvements will be handled in giantswarm/roadmap#2513.

@pipo02mix
Contributor

But isn't it the first variant? I was wondering why not merge instead of closing, so we don't lose the work you did.

@bavarianbidi
Author

> But isn't it the first variant? I was wondering why not merge instead of closing, so we don't lose the work you did.

@pipo02mix
I didn't want to merge the PR as we never had real consensus about the implementation; due to high priorities and the operational need we started with variant 2.1, so there was no "real discussion".

I've recently documented this setup in our intranet and also referenced this PR for possible alternative implementations (https://intranet.giantswarm.io/docs/monitoring/capi_monitoring/#reason-for-current-architecture).
Also, upstream KSM is extending the current feature, which should have an impact on the entire CR monitoring topic within GS (giantswarm/roadmap#2513).
