-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement framework to validate backwards compatibility of metrics #6278
Comments
Instead of this, can we use upload artifact action and a single file of download artifact action So, we have one main file to get artifacts from all workflow runs which might look something like this:
|
## Which problem is this PR solving? - Towards #6219 ## Description of the changes - This PR moves the decoration of the span readers with the metrics factory to inside the storage factory itself rather than handling it a higher level (ex. all-in-one, query server/extension). - For v2, the namespacing of the metrics has been moved from the query extension to the storage extension. - For v1, the namespacing of the metrics has been moved from the various binaries to the storage meta-factory. - This PR contains a few breaking changes as it changes the namespace under which the following metrics are published: - Storage specific metrics were that were being pushed under `jaeger_query` to will now be pushed undder `jaeger_storage` - Cassandra specific metrics were pushed under `jaeger_cassandra` and `jaeger_cassandra-archive` will now be pushed under `jaeger_storage` with the following tags: - `kind=cassandra` - `role=primary` or `role=archive` - `name` with the name of the storage (in jaeger-v2 only) - ElasticSearch/OpenSearch specific metrics were being pushed under `jaeger_index_create` will now be pushed under `jaeger_storage_index_create` with the following tags: - `kind=elasticsearch`/kind=`opensearch` - `role=primary` or `role=archive` - `name` with the name of the storage (in jaeger-v2 only) ## How was this change tested? - CI - We've booked #6278 to implement a framework that will allow for validation of metrics ## Checklist - [x] I have read https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md - [x] I have signed all commits - [x] I have added unit tests for the new functionality - [x] I have run lint and test steps successfully - for `jaeger`: `make lint test` - for `jaeger-ui`: `yarn lint` and `yarn test` --------- Signed-off-by: Mahad Zaryab <[email protected]> Co-authored-by: Yuri Shkuro <[email protected]>
why is it "instead"? The point of the task was to merge all different workflows as different jobs in a single workflow. It's fine to keep them as separate included files (only if you keep them in the workflows dir they may still get executed separately, I don't think your |
Signed-off-by: vaidikcode <[email protected]> ## Which problem is this PR solving? - Task one of #6278 ## Description of the changes - Integrated all e2e workflows jobs into a single workflow ## How was this change tested? - CI ## Checklist - [x] I have read https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md - [x] I have signed all commits - [ ] I have added unit tests for the new functionality - [ ] I have run lint and test steps successfully - for `jaeger`: `make lint test` - for `jaeger-ui`: `yarn lint` and `yarn test` --------- Signed-off-by: Yuri Shkuro <[email protected]> Signed-off-by: Yuri Shkuro <[email protected]> Co-authored-by: Yuri Shkuro <[email protected]> Co-authored-by: Yuri Shkuro <[email protected]>
Hi, @yurishkuro. I wanted to know what specific tasks we want the helper script to perform in task 2? |
it need to get all metrics available from Jaeger binary at the end of the e2e test run and upload them as a workflow artifact. This will already allow us inspecting what the metrics look like. Then the next step would be to compare those metrics with those from the latest release. |
Hi @yurishkuro, Could you clarify how we plan to retrieve the metrics in the CI/CD environment? I’m working on the helper script to get metrics. Could you clarify if we need to retrieve metrics from specific endpoints? If yes, could you tell me what those endpoints are? Your guidance on this would be greatly appreciated. Thanks! |
@vaidikcode aftet a given e2e test is finished we can curl the metrics endpoint (usually on :8888) to get all metrics, while the binary is still running. |
## Which problem is this PR solving? - Part of #6278 ## Description of the changes - scrape script and usage in cit workflow - diff calculating script - cache save and restore from main workflow runs - A sample diff (txt) ``` --- +++ @@ -303,2 +303,2 @@ -rpc_server_requests_per_rpc{le="+Inf",rpc_grpc_status_code="0",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TestService",rpc_system="grpc",service_name="jaeger",service_version=""} -rpc_server_requests_per_rpc{le="0",rpc_grpc_status_code="0",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version=""} +rpc_server_requests_per_rpc{le="+Inf",rpc_grpc_status_code="0",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version=""} +rpc_server_requests_per_rpc{le="0",rpc_grpc_status_code="1",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version=""} @@ -321 +321 @@ -rpc_server_response_size{le="+Inf",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version=""} +rpc_server_response_size{le="+Inf",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version="",test_change="Export"} @@ -338 +338 @@ -rpc_server_response_size{rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version=""} +rpc_server_response_size{rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="test-jaeger",service_version=""} ``` ## How was this change tested? - ## Checklist - [x] I have read https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md - [x] I have signed all commits - [ ] I have added unit tests for the new functionality - [x] I have run lint and test steps successfully - for `jaeger`: `make lint test` - for `jaeger-ui`: `yarn lint` and `yarn test` --------- Signed-off-by: chahatsagarmain <[email protected]> Signed-off-by: chahat sagar <[email protected]> Signed-off-by: Yuri Shkuro <[email protected]> Co-authored-by: Yuri Shkuro <[email protected]>
…#6330) ## Which problem is this PR solving? - Part of jaegertracing#6278 ## Description of the changes - scrape script and usage in cit workflow - diff calculating script - cache save and restore from main workflow runs - A sample diff (txt) ``` --- +++ @@ -303,2 +303,2 @@ -rpc_server_requests_per_rpc{le="+Inf",rpc_grpc_status_code="0",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TestService",rpc_system="grpc",service_name="jaeger",service_version=""} -rpc_server_requests_per_rpc{le="0",rpc_grpc_status_code="0",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version=""} +rpc_server_requests_per_rpc{le="+Inf",rpc_grpc_status_code="0",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version=""} +rpc_server_requests_per_rpc{le="0",rpc_grpc_status_code="1",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version=""} @@ -321 +321 @@ -rpc_server_response_size{le="+Inf",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version=""} +rpc_server_response_size{le="+Inf",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version="",test_change="Export"} @@ -338 +338 @@ -rpc_server_response_size{rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version=""} +rpc_server_response_size{rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="test-jaeger",service_version=""} ``` ## How was this change tested? - ## Checklist - [x] I have read https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md - [x] I have signed all commits - [ ] I have added unit tests for the new functionality - [x] I have run lint and test steps successfully - for `jaeger`: `make lint test` - for `jaeger-ui`: `yarn lint` and `yarn test` --------- Signed-off-by: chahatsagarmain <[email protected]> Signed-off-by: chahat sagar <[email protected]> Signed-off-by: Yuri Shkuro <[email protected]> Co-authored-by: Yuri Shkuro <[email protected]>
Jaeger binaries produce various metrics that can be used to monitor Jaeger itself in production, such as throughput, queue saturation, error rates, etc. We historically treated those metrics as a stable public API (we even provide a Grafana dashboard mixin), but we never actually had proper integration tests that validate that changes to the code do not break the metrics.
Proposal
We can enhance our existing integration tests to also perform comparison of the metrics. The full set of all possible metrics is never available from a single run because if some components are not utilized (e.g. adaptive sampling or a particular storage type) then their metrics will not be registered. However, since our integration tests cover most of the components, we can scape the metrics endpoint at the end of each test and combine them into a full picture.
Caveat: the exact shape of the metrics depends on all the nested namespaces that could be applied to the
metrics.Factory
, so it is sensitive to the exact code in the main functions, which is wheremetrics.Factory
always originates. Our integration tests for Jaeger v2 usually use the actual binary, so the resulting metrics will reflect how that binary performs in production. But all integration tests for v1 are run from a unit testing framework and themetrics.Factory
initialization may not match how it's done in themain
s. So we may only be able to solve this for Jaeger v2, which is fine.Approach
Help Wanted
We seek community help to implement this functionality. This is not a 30min fix, but we still marked it as
good-first-issue
because it can be done incrementally.Tasks
The text was updated successfully, but these errors were encountered: