Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Job state metrics: split into gauge and counter #14390

Draft
wants to merge 3 commits into
base: devel
Choose a base branch
from

Conversation

onefourfive
Copy link
Contributor

@onefourfive onefourfive commented Aug 28, 2023

Signed-off-by: onefourfive <>

SUMMARY

Closes #14369

Prometheus counter metric types should be used for metrics that can only increase, eg terminal jobs states like failed, canceled, error, and successful.

ISSUE TYPE
  • New or Enhanced Feature
COMPONENT NAME

Metrics

  • API
  • Other
AWX VERSION
awx: 22.5.1.dev29+gbd7ad057c4
ADDITIONAL INFORMATION

Metrics Before

# HELP awx_status_total Status of Job launched
# TYPE awx_status_total gauge
awx_status_total{status="running"} 1.0
awx_status_total{status="canceled"} 33.0
awx_status_total{status="waiting"} 0.0
awx_status_total{status="successful"} 65753.0
awx_status_total{status="error"} 359.0
awx_status_total{status="failed"} 985.0
awx_status_total{status="pending"} 0.0

Metrics After

# HELP awx_status_started Status of Job started
# TYPE awx_status_started gauge
awx_status_started{status="waiting"} 0.0
awx_status_started{status="pending"} 0.0
awx_status_started{status="running"} 0.0
# HELP awx_status_completed_total Status of Jobs completed
# TYPE awx_status_completed_total counter
awx_status_completed_total{status="canceled"} 0.0
awx_status_completed_total{status="failed"} 0.0
awx_status_completed_total{status="error"} 0.0
awx_status_completed_total{status="successful"} 0.0
# TYPE awx_status_completed_created gauge
awx_status_completed_created{status="canceled"} 1.693263938630845e+09
awx_status_completed_created{status="failed"} 1.6932639386309366e+09
awx_status_completed_created{status="error"} 1.6932639386309566e+09
awx_status_completed_created{status="successful"} 1.6932639386309748e+09

Note the _created metrics are automatically exported by Prometheus for counters. This can be disabled with an environment variable PROMETHEUS_DISABLE_CREATED_SERIES=True (see docs.

@jessicamack
Copy link
Member

Thank you for opening this PR. Our team will review it shortly and let you know if there are any changes needed.

Copy link
Contributor

@gundalow gundalow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything in prometheus_client that allows us to add aliases for the old names?

@AlanCoding
Copy link
Member

awx_status_completed_created{status="canceled"} 1.693263938630845e+09

These numbers don't look right, or is there something I'm not understanding?

@onefourfive
Copy link
Contributor Author

awx_status_completed_created{status="canceled"} 1.693263938630845e+09

These numbers don't look right, or is there something I'm not understanding?

This is a Unix time stamp showing when the metric was created— checking this value myself, this comes out to a date of this past Monday which makes sense.

@AlanCoding
Copy link
Member

Oh, I'm showing my lack of understanding of this system then.

so the metrics awx_status_completed_total{status="canceled"} 0.0 and awx_status_completed_created{status="canceled"} 1.693263938630845e+09 are complementary, is that fair to say? Since this is a rolling count, we need a time window over which it applies, which is all times after the _created.

This is certainly cool. But this segways into the obvious question here - when you run cleanup_jobs system management job, it will reduce the total counts. After that happens, I could see the data collector freaking out.

Copy link
Member

@kdelee kdelee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blockers from my side

@onefourfive
Copy link
Contributor Author

Is there anything in prometheus_client that allows us to add aliases for the old names?

Good question @gundalow ! Not that I am aware of, and none that I see in their documentation.

Since the underlying data driving these metrics hasn't been changed, it would still be possible to keep the original gauges if that's deemed to be needed. However that seems extraneous since it would be the same data.

This is certainly cool. But this segways into the obvious question here - when you run cleanup_jobs system management job, it will reduce the total counts. After that happens, I could see the data collector freaking out.

This is a good point @AlanCoding , while writing this I couldn't conceive of a situation where the counter awx_status_completed might be larger than the underlying statuses.get(state). This is a worthy case to consider.

@AlanCoding
Copy link
Member

Also, what exactly is the timestamp for the *_created metrics? Is it from startup? Does the count represent the number of jobs that went into that status from startup, or the total number in the database?

@djyasin
Copy link
Member

djyasin commented Sep 27, 2023

Hello,
We took another look at this and saw that a few of the CI checks are failing. This particular test seems to be causing the failure:

_____________________________ test_metrics_counts ______________________________
[gw1] linux -- Python 3.9.17 /var/lib/awx/venv/awx/bin/python3.9
Traceback (most recent call last):
  File "/awx_devel/awx/main/tests/functional/analytics/test_metrics.py", line 54, in test_metrics_counts
    assert EXPECTED_VALUES[name] == value
KeyError: 'awx_status_completed_created' 

If you could go ahead and get that test updated we would be happy to further investigate this. Thank you again for your time!

onefourfive and others added 3 commits February 14, 2024 16:52
these shouldn't be needed for local testing, but can be changed
@dmzoneill dmzoneill marked this pull request as draft February 14, 2024 16:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prometheus metrics: use Counters for job metrics that should only increase
6 participants