-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
COSI-65: Add Prometheus Metrics Support/Instrumentation to COSI Driver #69
base: main
Are you sure you want to change the base?
COSI-65: Add Prometheus Metrics Support/Instrumentation to COSI Driver #69
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #69 +/- ##
==========================================
+ Coverage 93.40% 93.74% +0.33%
==========================================
Files 9 10 +1
Lines 637 671 +34
==========================================
+ Hits 595 629 +34
Misses 36 36
Partials 6 6 |
b0108f0
to
191d005
Compare
7965f66
to
aa85ea0
Compare
@@ -50,5 +50,6 @@ func main() { | |||
// Call the run function (defined in cmd.go) | |||
if err := run(ctx); err != nil { | |||
klog.ErrorS(err, "Scality COSI driver encountered an error, shutting down") | |||
os.Exit(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Graceful exit to shutdown the metrics server as well.
@@ -10,3 +10,9 @@ const ( | |||
LvlDebug // 4 - Debug-level logs, tricky logic areas | |||
LvlTrace // 5 - Trace-level logs, detailed troubleshooting context | |||
) | |||
|
|||
// Service initialization constants |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be expanded soon with all service initialization constants, to make sure we don't have magic numbers/strings
|
||
go func() { | ||
klog.InfoS("Starting Prometheus metrics server", "address", listener.Addr().String()) | ||
if err := srv.Serve(listener); err != nil && err != http.ErrServerClosed { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my own edification, why is ErrServerClosed
not considered an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The http.ErrServerClosed
error is not considered an error in this context because it is a normal part of the lifecycle of an HTTP server in Go.
Bacially, the http.ErrServerClosed
is returned by the http.Server.ListenAndServe
method when the server is shut down using the Shutdown or Close methods. This indicates a graceful shutdown, which is expected behavior, not a failure. Treating http.ErrServerClosed
as a non-error ensures that the shutdown process doesn’t log misleading or unnecessary error messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep makes total sense, thanks!
@@ -50,5 +50,6 @@ func main() { | |||
// Call the run function (defined in cmd.go) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't write a comment above this line, but at main.go:42
: doesn't ctx.Done()
close immediately when cancel
is called? I don't think checking that is worth it. The timeout is good though. We could however add another read on the sigs
channel in the select if we wanted to force shutdown on multiple SIGINT
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right the select block is indeed redundant.
When cancel() is called, ctx.Done() immediately closes, and the select statement in the goroutine will always choose the <-ctx.Done() case first. This makes the select block somewhat redundant in its current form.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking to change it to something like this
What do you think?
go func() {
sig := <-sigs
klog.InfoS("Signal received", "type", sig)
cancel()
klog.InfoS("Scality COSI driver shutdown initiated successfully, context canceled")
select {
case sig = <-sigs:
klog.ErrorS(nil, "Force shutdown due to repeated signal", "type", sig)
os.Exit(1)
case <-time.After(shutdownTimeout):
klog.ErrorS(nil, "Force shutdown due to timeout", "timeout", shutdownTimeout)
os.Exit(1)
}
}()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good 👍
Perhaps the first log could say something like "Initiating graceful shutdown, repeat signal to force shutdown"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I will create another PR for this, its out of scope for this one. Thanks
Name: "cosi_requests_total", | ||
Help: "Total number of requests handled by the COSI driver.", | ||
}, | ||
[]string{"method", "status"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually for http request labels in other components we use:
- method: HTTP method
- code: response code (instead of status)
- action: S3 action (optionaly)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a gRPC service and not HTTP, hence the difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I see.
In your other PR doc I see the possible values, maybe you can put a comment here to describe possible values for those labels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And should it be called *_grpc_requests_total
then to be clear ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All grpc metrics are generated automatically, this is a placeholder for custom metrics.
I can remove this method its that better, but I wanted to keep it for future use.
What I had inmind was have total requests for grpc and HTTP calls, but needs to be discussed with @davidmercier-scality so left it for now.
Its not being used to generate any custom metrics as of now even if further PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But indeed we can use the cosi_driver prefix for this.
MetricsPath = "/metrics" | ||
MetricsAddress = ":8080" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this go directly in the metrics.go
file ? It's not going to be used any where else ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the address use only local interface 127.0.0.1
by default ? And the port is already used by cloudserver, can we pick another one that's not used by other components ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you are right it will go in the metrics.go, the GO way is to keep it closer indeed if not being re-used.
To address your concerns:
1. Service Context: This service is not part of RING and will be deployed on customers’ Kubernetes clusters. Since it is deployed externally, it won’t conflict with any port numbers within RING.
2. Deployment Setup:
• The container is deployed as a Kubernetes pod, and the metrics route is exposed via a Kubernetes service with its own unique cluster IP.
• End-users can modify the exposed port via deployment configurations or Helm charts.
3. Port Conflict Analysis:
• Even without modifying the port, each Kubernetes service gets a unique IP.
• Since only one HTTP server (for metrics) runs within the pod, conflicts are highly unlikely within the pod.
Here’s an example for clarity:
✗ kubectl get svc --all-namespaces
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 17h
kube-system kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP,9153/TCP 17h
kubernetes-dashboard dashboard-metrics-scraper ClusterIP 10.103.11.103 <none> 8000/TCP 17h
kubernetes-dashboard kubernetes-dashboard ClusterIP 10.96.45.42 <none> 80/TCP 17h
scality-object-storage scality-cosi-driver-metrics ClusterIP 10.97.169.22 <none> 8080/TCP 17h
go.mod
Outdated
@@ -5,8 +5,10 @@ go 1.22.6 | |||
require ( | |||
github.com/aws/aws-sdk-go-v2/credentials v1.17.47 | |||
github.com/aws/smithy-go v1.22.1 | |||
github.com/grpc-ecosystem/go-grpc-prometheus v1.2.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The github says it's deprecated in favor of go-grpc-middleware
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you can check if you can prefix those metrics with s3_cosi_
so we can easily identify them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not an S3 service so we will prefix it with cosi_driver
for custom metrics but for default grpc metrics we should keep the convention of grpc_, just like we do for HTTP protocol.
Metrics such as grpc_server_started_total convey that these are standard gRPC server metrics, helping tools or dashboards that expect gRPC naming conventions to process them without requiring additional configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will check this out go-grpc-middleware
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I see.
Maybe a config option like metrics_prefix
if client wants to add some custom prefix for cosi. Otherwise it can still be differentiated with the job name and it might not matter at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For custom metrics we can add that
So default can be cosi_driver prefix which is configurable
But I would like to keep grpc metrics in standard format .
89e0dfc
to
a3e7337
Compare
- Introduces a new metrics package to handle Prometheus instrumentation. - Adds `RequestsTotal` as a custom metric to track COSI driver requests by method and status. - Implements `StartMetricsServer` to expose metrics at the configured HTTP endpoint. - Integrates Prometheus's `promhttp.Handler` for metrics scraping. - Uses constants from the `pkg/constants` package for the metrics path. Issue: COSI-65
- In main.go, allow graceful shutdown of the metrics server. - Added `metricsAddress` flag to configure the Prometheus metrics endpoint. - Integrated `metrics.StartMetricsServer` to expose metrics at the configured address. - Ensured graceful shutdown of the metrics server during service termination. - Updated the `run` function to include metrics server lifecycle management. - Maintains flexibility for metrics configuration using constants from the `pkg/constants` package. Issue: COSI-65
go-grpc-prometheus exports various metrics: - grpc_server_started_total: Count of RPCs started on the server by method. - grpc_server_handled_total: Count of RPCs completed on the server, regardless of success or failure. - grpc_server_handling_seconds_*: Histograms or summaries (if histograms are enabled) for tracking RPC handling duration. - grpc_server_msg_received_total: Number of messages received per RPC. - grpc_server_msg_sent_total: Number of messages sent per RPC. Issue: COSI-65
Issue: COSI-65
Issue: COSI-65
aa85ea0
to
e124149
Compare
}) | ||
|
||
AfterEach(func() { | ||
// Clean up the Unix socket file | ||
socketPath := strings.TrimPrefix(address, "unix://") | ||
if err := os.Remove(socketPath); err != nil && !os.IsNotExist(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be rather done in server code, before listen ?
I suppose the unix socket is cleaned up by a graceful shutdown, but if there is any crash that prevents the process from cleaning that file it might prevent restart of the server ?
This PR introduces comprehensive Prometheus metrics support for the COSI driver, including metrics instrumentation, integration, and unit tests. The changes are grouped into four key commits, each addressing a distinct aspect of the implementation. Reviewers are encouraged to follow the commit story for a structured understanding of the changes.
Documentation coming soon
Commit Summary:
• Introduces the metrics package for Prometheus instrumentation.
• Adds a custom RequestsTotal metric to track COSI driver requests by method and status.
• Implements StartMetricsServer to expose metrics at an HTTP endpoint.
• Adds a metricsAddress flag for configuring the metrics endpoint.
• Manages the metrics server lifecycle with graceful shutdown support.
• Integrates metrics.StartMetricsServer into the driver’s main runtime.
• Adds gRPC metrics such as RPC counts, handling duration, and message totals using go-grpc-prometheus.
• Adds unit tests for the metrics package, ensuring complete coverage of StartMetricsServer and StartMetricsServerWithListener.
Request for Reviewers:
Please follow the commit story to understand the changes in detail. Focus areas include:
• Metrics integration with the gRPC server.
• Implementation of StartMetricsServer and its lifecycle management.
• Completeness and accuracy of unit tests. (comments are very intentional for Contri-X as I am a bit new to prom unit testing, helping my future self.)
Issue:
Resolves: COSI-65
Example