-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-39417: Add service account and token for service monitoring #613
OCPBUGS-39417: Add service account and token for service monitoring #613
Conversation
@Vincent056: This pull request references Jira Issue OCPBUGS-39417, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
8c0c61d
to
878149a
Compare
🤖 To deploy this PR, run the following command:
|
🤖 To deploy this PR, run the following command:
|
878149a
to
2969604
Compare
🤖 To deploy this PR, run the following command:
|
Note for reviewers - we pulled in new prometheus dependencies in #491 which needed these changes. |
/jira refresh |
@rhmdnd: This pull request references Jira Issue OCPBUGS-39417, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Looks like we have some cleanup we need to handle in the tests:
|
@Vincent056 With this PR, I cannot reproduce the failure logs in cov1.5.1(seen logs after 2024-09-04T09:29:20). But I do have found some other issues.
|
2969604
to
6521a99
Compare
🤖 To deploy this PR, run the following command:
|
@Vincent056 With this PR, I cannot reproduce the failure logs in cov1.have found some other issues. However, I can still see the "AlertmanagerReceiversNotConfigured" warning alert on GUI with a fresh install compliance operator with index image ghcr.io/complianceascode/compliance-operator-catalog:613-6521a99aa5f53fe9dc7c258539827cdb77e1795b. I haven't figured out whether it is related to CO or not. I will update it later.
|
6521a99
to
8071c28
Compare
🤖 To deploy this PR, run the following command:
|
8071c28
to
b062637
Compare
🤖 To deploy this PR, run the following command:
|
b062637
to
1c64960
Compare
/retest |
9d66805
to
73ffcac
Compare
🤖 To deploy this PR, run the following command:
|
🤖 To deploy this PR, run the following command:
|
/retest |
🤖 To deploy this PR, run the following command:
|
4211619
to
0ecb94e
Compare
Adding service account and token needed for ServiceMonitoring, this will create a new service account compliance-operator-metrics and use create a metric token, and we will use that token for ServiceMonitoring.
0ecb94e
to
53656f1
Compare
🤖 To deploy this PR, run the following command:
|
🤖 To deploy this PR, run the following command:
|
|
||
for _, metric := range metrics { | ||
if metric.Health != "up" { | ||
return fmt.Errorf("Metric %s is not up. LastError: %s", metric.Labels, metric.LastError) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure whether it is a good test case for the bug OCPBUGS-39417. With the last released version cov1.5.1, I can see all targets' health status are up.
% oc get csv
NAME DISPLAY VERSION REPLACES PHASE
compliance-operator.v1.5.1 Compliance Operator 1.5.1 compliance-operator.v1.5.0 Succeeded
% oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" https://prometheus-k8s.openshift-monitoring.svc.cluster.local:9091/api/v1/targets | jq -r >> without_PR_613.log
% cat without_PR_613.log| grep -i '"health":'| grep -Ev up
%
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that might be because the bug manifested on version 1.5.1. I was using 1.5.0 on a cluster and noticed those metrics endpoints are up.
@Vincent056 I am little confused with this bug, about what does the token in the servicemonitor used for? Actually except the error logs in openshift-monitoring, I didn't see any other effect as the metrics and alerts could be shown normally on the web console. |
I think token is required to create service monitoring, we used to use the token mounted in the pod file, but it is no longer supported with the new Prmethus dependency. |
I was able to experiment with this PR in a cluster and did the following:
oc run -i --rm --restart=Never --image=registry.fedoraproject.org/fedora-minimal:latest -n openshift-compliance metrics-test --overrides='{"spec": {"serviceAccountName": "prometheus-test"}}' -- bash -c 'TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token) && curl -k -s https://prometheus-k8s.openshift-monitoring.svc.cluster.local:9091/api/v1/targets --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca-crt -H "Authorization: Bearer $TOKEN" -H "Accept: application/json"'
I repeated those same steps with CO 1.5.1 and I wasn't able to find any metrics for the @xiaojiey Do those steps also work for you? |
I was able to verify the metrics on version 1.5.0 using the same steps noted above, and confirmed they're up to. |
const prometheusCommand = `TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token) && { curl -k -s https://prometheus-k8s.openshift-monitoring.svc.cluster.local:9091/api/v1/targets --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt -H "Authorization: Bearer $TOKEN"; }` | ||
namespace := f.OperatorNamespace | ||
out, err := runOCandGetOutput([]string{ | ||
"run", "--rm", "-i", "--restart=Never", "--image=registry.fedoraproject.org/fedora:latest", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could probably get away with only fedora-minimal here, but we can patch that after.
@Vincent056 I was able to reproduce the issue in the bug with 4.15 payload. With this PR, the issue gets resolved. The targets are up, and the metrics/alerts working well. What's more, the fix works for all ocp releases.
|
/label qe-approved |
@Vincent056: This pull request references Jira Issue OCPBUGS-39417, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: rhmdnd, Vincent056 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
fed54b4
into
ComplianceAsCode:master
@Vincent056: Jira Issue OCPBUGS-39417: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-39417 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Adding a service account and token needed for
ServiceMonitoring
, will create a new service accountcompliance-operator-metrics
, and make a metric token for thatServiceAccount
, and we will use that token for ServiceMonitoring.