
COSI-65: Add Prometheus Metrics Support/Instrumentation to COSI Driver #69

Open · anurag4DSB wants to merge 8 commits into main from feature/COSI-65-add-metrics-scrapable-by-prometheus
Conversation

@anurag4DSB (Collaborator) commented Dec 19, 2024

This PR introduces comprehensive Prometheus metrics support for the COSI driver, including metrics instrumentation, integration, and unit tests. The changes are grouped into four key commits, each addressing a distinct aspect of the implementation. Reviewers are encouraged to follow the commit story for a structured understanding of the changes.
Documentation coming soon

Commit Summary:

  1. Add metrics package for Prometheus instrumentation
    • Introduces the metrics package for Prometheus instrumentation.
    • Adds a custom RequestsTotal metric to track COSI driver requests by method and status.
    • Implements StartMetricsServer to expose metrics at an HTTP endpoint.
  2. Integrate Prometheus metrics server in COSI driver
    • Adds a metricsAddress flag for configuring the metrics endpoint.
    • Manages the metrics server lifecycle with graceful shutdown support.
    • Integrates metrics.StartMetricsServer into the driver’s main runtime.
  3. Instrument the gRPC server with a Prometheus exporter
    • Adds gRPC metrics such as RPC counts, handling duration, and message totals using go-grpc-prometheus.
  4. Metrics package unit tests
    • Adds unit tests for the metrics package, ensuring complete coverage of StartMetricsServer and StartMetricsServerWithListener.
  5. Update Codecov config

Request for Reviewers:

Please follow the commit story to understand the changes in detail. Focus areas include:
• Metrics integration with the gRPC server.
• Implementation of StartMetricsServer and its lifecycle management.
• Completeness and accuracy of unit tests. (The comments are very intentional, as I am a bit new to Prometheus unit testing; they are there for Contri-X and to help my future self.)

Issue:
Resolves: COSI-65

Example

# curl http://localhost:8080/metrics
# HELP go_cgo_go_to_c_calls_calls_total Count of calls made from Go to C by the current process.
# TYPE go_cgo_go_to_c_calls_calls_total counter
go_cgo_go_to_c_calls_calls_total 0
# HELP go_cpu_classes_gc_mark_assist_cpu_seconds_total Estimated total CPU time goroutines spent performing GC tasks to assist the GC and prevent it from falling behind the application. This metric is an overestimate, and not directly comparable to system CPU time measurements. Compare only with other /cpu/classes metrics.
# TYPE go_cpu_classes_gc_mark_assist_cpu_seconds_total counter
go_cpu_classes_gc_mark_assist_cpu_seconds_total 0.001221251
.
.
.
grpc_server_started_total{grpc_method="DriverCreateBucket",grpc_service="cosi.v1alpha1.Provisioner",grpc_type="unary"} 0
grpc_server_started_total{grpc_method="DriverDeleteBucket",grpc_service="cosi.v1alpha1.Provisioner",grpc_type="unary"} 0
grpc_server_started_total{grpc_method="DriverGetInfo",grpc_service="cosi.v1alpha1.Identity",grpc_type="unary"} 6
grpc_server_started_total{grpc_method="DriverGrantBucketAccess",grpc_service="cosi.v1alpha1.Provisioner",grpc_type="unary"} 0
grpc_server_started_total{grpc_method="DriverRevokeBucketAccess",grpc_service="cosi.v1alpha1.Provisioner",grpc_type="unary"} 0
.
.
.
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 37
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0


codecov bot commented Dec 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.74%. Comparing base (a9cf51d) to head (e124149).

Additional details and impacted files

Impacted file tree graph

Files with missing lines Coverage Δ
pkg/grpcfactory/server.go 87.80% <100.00%> (+4.47%) ⬆️
pkg/metrics/metrics.go 100.00% <100.00%> (ø)
Components Coverage Δ
🏠 Main Package ∅ <ø> (∅)
🚗 Driver Package 92.22% <ø> (ø)
📡 gRPC Factory Package 83.33% <100.00%> (+1.68%) ⬆️
🔐 IAM Client Package 100.00% <ø> (ø)
🌐 S3 Client Package 100.00% <ø> (ø)
🔧 Util Package 100.00% <ø> (ø)
📊 Metrics Package 100.00% <100.00%> (∅)
🔖 Constants Package ∅ <ø> (∅)
@@            Coverage Diff             @@
##             main      #69      +/-   ##
==========================================
+ Coverage   93.40%   93.74%   +0.33%     
==========================================
  Files           9       10       +1     
  Lines         637      671      +34     
==========================================
+ Hits          595      629      +34     
  Misses         36       36              
  Partials        6        6              

@anurag4DSB anurag4DSB force-pushed the feature/COSI-65-add-metrics-scrapable-by-prometheus branch 2 times, most recently from b0108f0 to 191d005 Compare December 19, 2024 08:22
@anurag4DSB anurag4DSB marked this pull request as ready for review December 19, 2024 08:27
@anurag4DSB anurag4DSB changed the title Add metrics package for Prometheus instrumentation Add Prometheus Metrics Support/Instrumentation to COSI Driver Dec 19, 2024
@anurag4DSB anurag4DSB changed the title Add Prometheus Metrics Support/Instrumentation to COSI Driver COSI-65: Add Prometheus Metrics Support/Instrumentation to COSI Driver Dec 19, 2024
@anurag4DSB anurag4DSB force-pushed the feature/COSI-65-add-metrics-scrapable-by-prometheus branch from 7965f66 to aa85ea0 Compare December 19, 2024 08:37
@@ -50,5 +50,6 @@ func main() {
// Call the run function (defined in cmd.go)
if err := run(ctx); err != nil {
klog.ErrorS(err, "Scality COSI driver encountered an error, shutting down")
os.Exit(1)
Collaborator Author:

Graceful exit, to shut down the metrics server as well.

@@ -10,3 +10,9 @@ const (
LvlDebug // 4 - Debug-level logs, tricky logic areas
LvlTrace // 5 - Trace-level logs, detailed troubleshooting context
)

// Service initialization constants
Collaborator Author:

This will be expanded soon with all service initialization constants, to make sure we don't have magic numbers/strings.


go func() {
klog.InfoS("Starting Prometheus metrics server", "address", listener.Addr().String())
if err := srv.Serve(listener); err != nil && err != http.ErrServerClosed {
Reviewer:
For my own edification, why is ErrServerClosed not considered an error?

Collaborator Author (@anurag4DSB) Dec 19, 2024:

The http.ErrServerClosed error is not treated as a failure here because it is a normal part of the lifecycle of an HTTP server in Go.

Basically, http.ErrServerClosed is returned by http.Server methods such as Serve and ListenAndServe when the server is stopped via Shutdown or Close. It indicates a graceful shutdown, which is expected behavior, not a failure. Treating http.ErrServerClosed as a non-error ensures the shutdown path doesn't log misleading or unnecessary error messages.

Reviewer:

Yep makes total sense, thanks!

@@ -50,5 +50,6 @@ func main() {
// Call the run function (defined in cmd.go)
Reviewer:

I can't write a comment above this line, but at main.go:42: doesn't ctx.Done() close immediately when cancel is called? I don't think checking that is worth it. The timeout is good though. We could however add another read on the sigs channel in the select if we wanted to force shutdown on multiple SIGINT.

Collaborator Author:

You are right, the select block is indeed redundant. When cancel() is called, ctx.Done() closes immediately, so the select statement in the goroutine will always choose the <-ctx.Done() case first.

Collaborator Author:

I am thinking of changing it to something like this. What do you think?

go func() {
    sig := <-sigs
    klog.InfoS("Signal received", "type", sig)
    cancel()

    klog.InfoS("Scality COSI driver shutdown initiated successfully, context canceled")

    select {
    case sig = <-sigs:
        klog.ErrorS(nil, "Force shutdown due to repeated signal", "type", sig)
        os.Exit(1)
    case <-time.After(shutdownTimeout):
        klog.ErrorS(nil, "Force shutdown due to timeout", "timeout", shutdownTimeout)
        os.Exit(1)
    }
}()

Reviewer:

I think this is good 👍

Perhaps the first log could say something like "Initiating graceful shutdown, repeat signal to force shutdown"?

Collaborator Author:

Okay, I will create another PR for this; it's out of scope for this one. Thanks!

Name: "cosi_requests_total",
Help: "Total number of requests handled by the COSI driver.",
},
[]string{"method", "status"},
@BourgoisMickael Dec 19, 2024:

Actually, for HTTP request labels in other components we use:

  • method: HTTP method
  • code: response code (instead of status)
  • action: S3 action (optionally)

Collaborator Author:

This is a gRPC service and not HTTP, hence the difference.

Ok I see.

In your other PR's doc I see the possible values; maybe you can add a comment here describing the possible values for those labels.


And should it be called *_grpc_requests_total then, to be clear?

Collaborator Author:

All gRPC metrics are generated automatically; this is a placeholder for custom metrics. I can remove this metric if that's better, but I wanted to keep it for future use. What I had in mind was to have total request counts for both gRPC and HTTP calls, but that needs to be discussed with @davidmercier-scality, so I left it for now.

It is not being used to generate any custom metrics as of now, even in further PRs.

Collaborator Author:

But indeed we can use the cosi_driver prefix for this.

Comment on lines +16 to +17
MetricsPath = "/metrics"
MetricsAddress = ":8080"


Should this go directly in the metrics.go file? It's not going to be used anywhere else, is it?


Should the address use only the local interface (127.0.0.1) by default? Also, the port is already used by cloudserver; can we pick another one that's not used by other components?

Collaborator Author:

Yes, you are right, it will go into metrics.go; the Go way is indeed to keep constants close to their usage when they are not reused elsewhere.

To address your concerns:
1. Service Context: This service is not part of RING and will be deployed on customers’ Kubernetes clusters. Since it is deployed externally, it won’t conflict with any port numbers within RING.
2. Deployment Setup:
• The container is deployed as a Kubernetes pod, and the metrics route is exposed via a Kubernetes service with its own unique cluster IP.
• End-users can modify the exposed port via deployment configurations or Helm charts.
3. Port Conflict Analysis:
• Even without modifying the port, each Kubernetes service gets a unique IP.
• Since only one HTTP server (for metrics) runs within the pod, conflicts are highly unlikely within the pod.

Here’s an example for clarity:

✗ kubectl get svc --all-namespaces  
NAMESPACE                NAME                          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                  AGE  
default                  kubernetes                    ClusterIP   10.96.0.1       <none>        443/TCP                  17h  
kube-system              kube-dns                      ClusterIP   10.96.0.10      <none>        53/UDP,53/TCP,9153/TCP   17h  
kubernetes-dashboard     dashboard-metrics-scraper     ClusterIP   10.103.11.103   <none>        8000/TCP                 17h  
kubernetes-dashboard     kubernetes-dashboard          ClusterIP   10.96.45.42     <none>        80/TCP                   17h  
scality-object-storage   scality-cosi-driver-metrics   ClusterIP   10.97.169.22    <none>        8080/TCP                 17h  

go.mod Outdated
@@ -5,8 +5,10 @@ go 1.22.6
require (
github.com/aws/aws-sdk-go-v2/credentials v1.17.47
github.com/aws/smithy-go v1.22.1
github.com/grpc-ecosystem/go-grpc-prometheus v1.2.0


The GitHub page says it's deprecated in favor of go-grpc-middleware.


Maybe you can check whether you can prefix those metrics with s3_cosi_ so we can easily identify them.

Collaborator Author:

This is not an S3 service, so we will prefix custom metrics with cosi_driver; for the default gRPC metrics we should keep the grpc_ convention, just as we do for the HTTP protocol.

Metrics such as grpc_server_started_total convey that these are standard gRPC server metrics, helping tools or dashboards that expect gRPC naming conventions to process them without requiring additional configuration.

Collaborator Author:

I will check out go-grpc-middleware.


Ok, I see.

Maybe add a config option like metrics_prefix if a client wants a custom prefix for COSI. Otherwise it can still be differentiated by the job name, so it might not matter at all.

Collaborator Author:

For custom metrics we can add that, so the default can be a configurable cosi_driver prefix. But I would like to keep the gRPC metrics in the standard format.

@anurag4DSB anurag4DSB force-pushed the bugfix/COSI-74-remove-silent-errors branch from 89e0dfc to a3e7337 Compare December 20, 2024 13:19
Base automatically changed from bugfix/COSI-74-remove-silent-errors to main December 20, 2024 13:30
- Introduces a new metrics package to handle Prometheus instrumentation.
- Adds `RequestsTotal` as a custom metric to track COSI driver requests
  by method and status.
- Implements `StartMetricsServer` to expose metrics at the configured
  HTTP endpoint.
- Integrates Prometheus's `promhttp.Handler` for metrics scraping.
- Uses constants from the `pkg/constants` package for the metrics path.

Issue: COSI-65
- In main.go, allow graceful shutdown of the metrics server.
- Added `metricsAddress` flag to configure the Prometheus metrics endpoint.
- Integrated `metrics.StartMetricsServer` to expose metrics at the
  configured address.
- Ensured graceful shutdown of the metrics server during service
  termination.
- Updated the `run` function to include metrics server lifecycle
  management.
- Maintains flexibility for metrics configuration using constants from
  the `pkg/constants` package.

Issue: COSI-65
go-grpc-prometheus exports various metrics:
- grpc_server_started_total: Count of RPCs started on the server by
  method.
- grpc_server_handled_total: Count of RPCs completed on the server,
  regardless of success or failure.
- grpc_server_handling_seconds_*: Histograms or summaries (if histograms
  are enabled) for tracking RPC handling duration.
- grpc_server_msg_received_total: Number of messages received per RPC.
- grpc_server_msg_sent_total: Number of messages sent per RPC.

Issue: COSI-65
@anurag4DSB anurag4DSB force-pushed the feature/COSI-65-add-metrics-scrapable-by-prometheus branch from aa85ea0 to e124149 Compare December 20, 2024 13:44
})

AfterEach(func() {
// Clean up the Unix socket file
socketPath := strings.TrimPrefix(address, "unix://")
if err := os.Remove(socketPath); err != nil && !os.IsNotExist(err) {


Should this rather be done in the server code, before Listen?

I suppose the Unix socket is cleaned up by a graceful shutdown, but if a crash prevents the process from cleaning up that file, it might prevent a restart of the server.
