Prometheus metrics: use Counters for job metrics that should only increase #14369

Open
4 of 9 tasks
onefourfive opened this issue Aug 22, 2023 · 0 comments · May be fixed by #14390

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Feature type

Enhancement to Existing Feature

Feature Summary

Currently, the job metrics that AWX exposes for Prometheus are all of the gauge type:

# HELP awx_status_total Status of Job launched
# TYPE awx_status_total gauge
awx_status_total{status="running"} 0.0
awx_status_total{status="failed"} 21.0
awx_status_total{status="canceled"} 0.0
awx_status_total{status="successful"} 19.0
awx_status_total{status="waiting"} 0.0
awx_status_total{status="error"} 33.0
awx_status_total{status="pending"} 3.0

Gauges

Gauge metrics make sense for transient states like

  • running
  • waiting
  • pending

Since jobs can enter and exit these states, they fit the definition of a Prometheus gauge:

A gauge is a metric that represents a single numerical value that can arbitrarily go up and down.

Counters

However, gauges do not make sense for states that jobs can never exit, which might be called terminal states:

  • failed
  • error
  • canceled
  • successful

These metrics would be better captured in a counter:

A counter is a cumulative metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero on restart. For example, you can use a counter to represent the number of requests served, tasks completed, or errors.
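The distinction between the two kinds of state can be made concrete with a small model. The following is a minimal sketch, not AWX code: `JobStatusMetrics`, `TRANSIENT`, and `TERMINAL` are hypothetical names, and plain dictionaries stand in for a real Prometheus client library. It shows that a job moving through its lifecycle raises and lowers the transient values, while the terminal values only ever grow.

```python
from collections import defaultdict

# State names are taken from this issue; the grouping is the proposal itself.
TRANSIENT = {"pending", "waiting", "running"}
TERMINAL = {"failed", "error", "canceled", "successful"}


class JobStatusMetrics:
    """Toy model: gauge-like values for transient states, counter-like for terminal ones."""

    def __init__(self):
        self.gauge = defaultdict(int)    # may go up and down
        self.counter = defaultdict(int)  # only ever increases

    def transition(self, old, new):
        """Record one job moving from status `old` (None for a new job) to `new`."""
        if old in TERMINAL:
            raise ValueError(f"jobs never leave terminal state {old!r}")
        if old is not None and old not in TRANSIENT:
            raise ValueError(f"unknown status {old!r}")
        if new not in TRANSIENT | TERMINAL:
            raise ValueError(f"unknown status {new!r}")
        if old is not None:
            self.gauge[old] -= 1         # the job left a transient state
        if new in TRANSIENT:
            self.gauge[new] += 1
        else:
            self.counter[new] += 1       # terminal: counted once, never undone


# Hypothetical lifecycle of a single job.
metrics = JobStatusMetrics()
previous = None
for status in ["pending", "waiting", "running", "successful"]:
    metrics.transition(previous, status)
    previous = status
```

Once the job finishes, every transient value is back at zero and only the `successful` count has moved, which is the behavior a counter is meant to capture.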

Benefits

Counter metrics are more useful in Prometheus for building better visualizations and alerts using query functions like rate() and increase(), which in turn could be used to build SLOs based on AWX performance.
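To illustrate why a counter suits `increase()`, here is a simplified sketch of the reset handling that makes the function meaningful. It is an approximation, not the Prometheus implementation: the real `increase()` also extrapolates to the edges of the query window, which this sketch omits, and `counter_increase` is a hypothetical name. A drop in value is treated as a counter reset, so the post-reset value counts as new growth.

```python
def counter_increase(samples):
    """Total growth across ordered counter samples, treating any drop as a reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # The counter restarted from zero, so its current value is new growth.
            total += cur
    return total


# Made-up scrape values for a failed-job counter; the process restarts
# between the third and fourth scrape.
failed_samples = [19, 21, 21, 2, 5]
failed_increase = counter_increase(failed_samples)
```

Applied to a gauge, the same logic would misread every legitimate decrease as a reset, which is why these query functions are only reliable for metrics that never go down on their own.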

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Steps to reproduce

Scrape /api/v2/metrics

Current results

The awx_status_total gauge reports current values for jobs in every state, transient and terminal alike.

# HELP awx_status_total Status of Job launched
# TYPE awx_status_total gauge
awx_status_total{status="running"} 0.0
awx_status_total{status="failed"} 21.0
awx_status_total{status="canceled"} 0.0
awx_status_total{status="successful"} 19.0
awx_status_total{status="waiting"} 0.0
awx_status_total{status="error"} 33.0
awx_status_total{status="pending"} 3.0

Suggested feature result

A gauge is returned for transient job states and a counter is kept for terminal job states.

# HELP awx_status_launched Status of Jobs launched but not completed
# TYPE awx_status_launched gauge
awx_status_launched{status="running"} 0.0
awx_status_launched{status="waiting"} 0.0
awx_status_launched{status="pending"} 3.0
# HELP awx_status_completed Status of Jobs completed
# TYPE awx_status_completed counter
awx_status_completed{status="failed"} 21
awx_status_completed{status="canceled"} 0
awx_status_completed{status="successful"} 19
awx_status_completed{status="error"} 33
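One way to derive the two metric families from per-status job counts is sketched below. This is a minimal illustration under stated assumptions, not the AWX implementation: `render_split_metrics` is a hypothetical helper, the metric names are the ones suggested in this issue, and the text is assembled by hand rather than with a Prometheus client library.

```python
TRANSIENT = ("running", "waiting", "pending")
TERMINAL = ("failed", "canceled", "successful", "error")


def render_split_metrics(counts):
    """Render per-status job counts as one gauge family and one counter family."""
    lines = [
        "# HELP awx_status_launched Status of Jobs launched but not completed",
        "# TYPE awx_status_launched gauge",
    ]
    for status in TRANSIENT:
        value = float(counts.get(status, 0))
        lines.append(f'awx_status_launched{{status="{status}"}} {value}')
    lines += [
        "# HELP awx_status_completed Status of Jobs completed",
        "# TYPE awx_status_completed counter",
    ]
    for status in TERMINAL:
        value = int(counts.get(status, 0))
        lines.append(f'awx_status_completed{{status="{status}"}} {value}')
    return "\n".join(lines) + "\n"


# Counts copied from the current output shown in this issue.
current = {"running": 0, "failed": 21, "canceled": 0, "successful": 19,
           "waiting": 0, "error": 33, "pending": 3}
exposition = render_split_metrics(current)
```

Prometheus naming conventions suggest a `_total` suffix for counters, so a name such as `awx_status_completed_total` may be preferable to the one used here.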

Additional information

No response
