Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement framework to validate backwards compatibility of metrics #6278

Open
1 of 5 tasks
yurishkuro opened this issue Nov 29, 2024 · 6 comments
Open
1 of 5 tasks

Implement framework to validate backwards compatibility of metrics #6278

yurishkuro opened this issue Nov 29, 2024 · 6 comments
Labels
enhancement good first issue Good for beginners help wanted Features that maintainers are willing to accept but do not have cycles to implement v2

Comments

@yurishkuro
Copy link
Member

yurishkuro commented Nov 29, 2024

Jaeger binaries produce various metrics that can be used to monitor Jaeger itself in production, such as throughput, queue saturation, error rates, etc. We historically treated those metrics as a stable public API (we even provide a Grafana dashboard mixin), but we never actually had proper integration tests that validate that changes to the code do not break the metrics.

Proposal

We can enhance our existing integration tests to also perform comparison of the metrics. The full set of all possible metrics is never available from a single run because if some components are not utilized (e.g. adaptive sampling or a particular storage type) then their metrics will not be registered. However, since our integration tests cover most of the components, we can scape the metrics endpoint at the end of each test and combine them into a full picture.

Caveat: the exact shape of the metrics depends on all the nested namespaces that could be applied to the metrics.Factory, so it is sensitive to the exact code in the main functions, which is where metrics.Factory always originates. Our integration tests for Jaeger v2 usually use the actual binary, so the resulting metrics will reflect how that binary performs in production. But all integration tests for v1 are run from a unit testing framework and the metrics.Factory initialization may not match how it's done in the mains. So we may only be able to solve this for Jaeger v2, which is fine.

Approach

  • At the end of each integration test (CIT) workflow we scrape the metrics collected by the binary and upload them as a github artifact
  • Then the final workflow can gather all those artifacts and compare them with similar reports from the latest release. If differences are found it can upload them as another artifact and link to the PR so that maintainers can inspect the changes and decide if they are acceptable.
  • The artifacts uploaded for the official release can also be referenced from the documentation website as a way of documenting the current collection of metrics.

Help Wanted

We seek community help to implement this functionality. This is not a 30min fix, but we still marked it as good-first-issue because it can be done incrementally.

Tasks

  • currently all CIT workflows are independent. In order to be able to have a final job once all CIT jobs are finished we may need to combine them all into a single CIT workflow with multiple jobs
  • the ability to scrape and compare metrics was implemented in [V2]Add Script for metrics markdown table #5941. We need to integrate the scraping into each CIT workflow (using a helper script) and upload the output as workflow artifacts
  • implement the validation job in a workflow that would compare artifacts from the current PR with those from the latest release and generate a diff (also uploaded as a separate artifact)
  • make the validation job post some form of summary as a comment to the PR (or as the output of the Check)
  • implement a way to incorporate the metrics report into documentation website
@yurishkuro yurishkuro added good first issue Good for beginners help wanted Features that maintainers are willing to accept but do not have cycles to implement labels Nov 29, 2024
@Saumya40-codes
Copy link
Contributor

Tasks

  • currently all CIT workflows are independent. In order to be able to have a final job once all CIT jobs are finished we may need to combine them all into a single CIT workflow with multiple jobs

Instead of this, can we use upload artifact action and a single file of download artifact action

So, we have one main file to get artifacts from all workflow runs which might look something like this:

name: Get All Artifact

on:
  push:
    branches:
      - master
  workflow_dispatch:

jobs:
  workflow_1:
    uses: ./.github/workflows/workflow-1.yml

  workflow_2:
    uses: ./.github/workflows/workflow-2.yml

  deploy:
    runs-on: ubuntu-latest
    needs:
      - workflow_1
      - workflow_2
    steps:
      - name: 🔽 Checkout
        uses: actions/checkout@v3

      - name: ⬇️ Download artifact from workflow-1.yml
        uses: actions/download-artifact@v4
        with:
          name: 'metrics-1'
          path: 'deploy'

      - name: ⬇️ Download artifact from workflow-2.yml
        uses: actions/download-artifact@v4
        with:
          name: 'metrics-2'
          path: 'deploy'

      # we will be doing comparison here with the binary

      - name: '📄 Log the content of metrics-1.txt (from compile.yml)'
        run: |
          cat deploy/metrics-1.txt

      - name: '📄 Log the content of hello-2.txt (from compile-2.yml)'
        run: |
          cat deploy/metrics-2.txt

i was able to work through this
Image

yurishkuro added a commit that referenced this issue Nov 30, 2024
## Which problem is this PR solving?
- Towards #6219 

## Description of the changes
- This PR moves the decoration of the span readers with the metrics
factory to inside the storage factory itself rather than handling it a
higher level (ex. all-in-one, query server/extension).
- For v2, the namespacing of the metrics has been moved from the query
extension to the storage extension.
- For v1, the namespacing of the metrics has been moved from the various
binaries to the storage meta-factory.
- This PR contains a few breaking changes as it changes the namespace
under which the following metrics are published:
- Storage specific metrics were that were being pushed under
`jaeger_query` to will now be pushed undder `jaeger_storage`
- Cassandra specific metrics were pushed under `jaeger_cassandra` and
`jaeger_cassandra-archive` will now be pushed under `jaeger_storage`
with the following tags:
    - `kind=cassandra`
    - `role=primary` or `role=archive`
    - `name` with the name of the storage (in jaeger-v2 only)
- ElasticSearch/OpenSearch specific metrics were being pushed under
`jaeger_index_create` will now be pushed under
`jaeger_storage_index_create` with the following tags:
      - `kind=elasticsearch`/kind=`opensearch`
      - `role=primary` or `role=archive`
      - `name` with the name of the storage (in jaeger-v2 only)

## How was this change tested?
- CI
- We've booked #6278 to
implement a framework that will allow for validation of metrics

## Checklist
- [x] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [x] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `yarn lint` and `yarn test`

---------

Signed-off-by: Mahad Zaryab <[email protected]>
Co-authored-by: Yuri Shkuro <[email protected]>
@yurishkuro
Copy link
Member Author

Instead of this, can we use upload artifact action and a single file of download artifact action

why is it "instead"? The point of the task was to merge all different workflows as different jobs in a single workflow. It's fine to keep them as separate included files (only if you keep them in the workflows dir they may still get executed separately, I don't think your uses: syntax works for workflows)

yurishkuro added a commit that referenced this issue Dec 7, 2024
Signed-off-by: vaidikcode <[email protected]>

## Which problem is this PR solving?
- Task one of #6278 

## Description of the changes
- Integrated all e2e workflows jobs into a single workflow

## How was this change tested?
- CI

## Checklist
- [x] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [ ] I have added unit tests for the new functionality
- [ ] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `yarn lint` and `yarn test`

---------

Signed-off-by: Yuri Shkuro <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Co-authored-by: Yuri Shkuro <[email protected]>
Co-authored-by: Yuri Shkuro <[email protected]>
@vaidikcode
Copy link
Contributor

Hi, @yurishkuro. I wanted to know what specific tasks we want the helper script to perform in task 2?

@yurishkuro
Copy link
Member Author

it need to get all metrics available from Jaeger binary at the end of the e2e test run and upload them as a workflow artifact. This will already allow us inspecting what the metrics look like. Then the next step would be to compare those metrics with those from the latest release.

@vaidikcode
Copy link
Contributor

vaidikcode commented Dec 9, 2024

Hi @yurishkuro, Could you clarify how we plan to retrieve the metrics in the CI/CD environment? I’m working on the helper script to get metrics. Could you clarify if we need to retrieve metrics from specific endpoints? If yes, could you tell me what those endpoints are? Your guidance on this would be greatly appreciated.

Thanks!

@yurishkuro
Copy link
Member Author

@vaidikcode aftet a given e2e test is finished we can curl the metrics endpoint (usually on :8888) to get all metrics, while the binary is still running.

yurishkuro added a commit that referenced this issue Jan 16, 2025
## Which problem is this PR solving?
- Part of #6278 

## Description of the changes
- scrape script and usage in cit workflow
- diff calculating script
- cache save and restore from main workflow runs
- A sample diff (txt)
```
--- 
+++ 
@@ -303,2 +303,2 @@
-rpc_server_requests_per_rpc{le="+Inf",rpc_grpc_status_code="0",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TestService",rpc_system="grpc",service_name="jaeger",service_version=""}
-rpc_server_requests_per_rpc{le="0",rpc_grpc_status_code="0",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version=""}
+rpc_server_requests_per_rpc{le="+Inf",rpc_grpc_status_code="0",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version=""}
+rpc_server_requests_per_rpc{le="0",rpc_grpc_status_code="1",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version=""}

@@ -321 +321 @@
-rpc_server_response_size{le="+Inf",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version=""}
+rpc_server_response_size{le="+Inf",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version="",test_change="Export"}

@@ -338 +338 @@
-rpc_server_response_size{rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version=""}
+rpc_server_response_size{rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="test-jaeger",service_version=""}

```

## How was this change tested?
-
## Checklist
- [x] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [ ] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `yarn lint` and `yarn test`

---------

Signed-off-by: chahatsagarmain <[email protected]>
Signed-off-by: chahat sagar <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Co-authored-by: Yuri Shkuro <[email protected]>
Manik2708 pushed a commit to Manik2708/jaeger that referenced this issue Jan 21, 2025
…#6330)

## Which problem is this PR solving?
- Part of jaegertracing#6278 

## Description of the changes
- scrape script and usage in cit workflow
- diff calculating script
- cache save and restore from main workflow runs
- A sample diff (txt)
```
--- 
+++ 
@@ -303,2 +303,2 @@
-rpc_server_requests_per_rpc{le="+Inf",rpc_grpc_status_code="0",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TestService",rpc_system="grpc",service_name="jaeger",service_version=""}
-rpc_server_requests_per_rpc{le="0",rpc_grpc_status_code="0",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version=""}
+rpc_server_requests_per_rpc{le="+Inf",rpc_grpc_status_code="0",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version=""}
+rpc_server_requests_per_rpc{le="0",rpc_grpc_status_code="1",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version=""}

@@ -321 +321 @@
-rpc_server_response_size{le="+Inf",rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version=""}
+rpc_server_response_size{le="+Inf",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version="",test_change="Export"}

@@ -338 +338 @@
-rpc_server_response_size{rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="jaeger",service_version=""}
+rpc_server_response_size{rpc_method="Export",rpc_service="opentelemetry.proto.collector.trace.v1.TraceService",rpc_system="grpc",service_name="test-jaeger",service_version=""}

```

## How was this change tested?
-
## Checklist
- [x] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [ ] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `yarn lint` and `yarn test`

---------

Signed-off-by: chahatsagarmain <[email protected]>
Signed-off-by: chahat sagar <[email protected]>
Signed-off-by: Yuri Shkuro <[email protected]>
Co-authored-by: Yuri Shkuro <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement good first issue Good for beginners help wanted Features that maintainers are willing to accept but do not have cycles to implement v2
Projects
None yet
Development

No branches or pull requests

3 participants