This repository has been archived by the owner on Feb 7, 2024. It is now read-only.

Spurious numbers in metrics #430

Open
mgrabovsky opened this issue Jun 23, 2021 · 1 comment
Labels
bug feature:metrics Features and bugs related to the metrics and monitoring subsystem.

Comments

Contributor

mgrabovsky commented Jun 23, 2021

In the 48 hours following the deployment of the Prometheus metrics endpoint, at least two bugs have become apparent thanks to the Grafana dashboard:

  1. Failed tasks often (but not always) seem to be counted twice in retrace_tasks_finished{result="fail"}.
  2. The number of running tasks (retrace_tasks_running) sporadically jumps up to wild numbers, such as 70, 18 or 39, for a few minutes at a time. The maximum allowed number of running tasks (MaxParallelTasks) is 12 on retrace.fp.org, so these numbers make no sense.
@mgrabovsky added the bug and feature:metrics labels Jun 23, 2021
Contributor Author

mgrabovsky commented Jun 24, 2021

The code relevant to 2. (running tasks) is located in retrace.py. It parses the output of ps, so I can imagine some odd interaction with threading, with how processes are listed, etc.

Edit: I'm wondering whether we may be witnessing race conditions here, since multiple workers may be writing to the SQLite database at the same time. I would hope SQLite can handle that, though.

Edit 2: OK, it wasn't a database bug. Here's a fragment of the ps output from one of the moments when an unusually high number of running tasks was detected:

    PID    PPID ELAPSED CMD
1578079       1    1727 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
[...]
1589317 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589318 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589319 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589320 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589321 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589322 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589323 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589324 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589325 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589326 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589327 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589328 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589329 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589330 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589331 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589332 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589333 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589334 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589335 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589336 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589337 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589338 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589339 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589340 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589341 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589342 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589343 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589344 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589345 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589346 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589347 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589348 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589349 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589350 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589351 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589352 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589353 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589354 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589355 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589356 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589357 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589358 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589359 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589360 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589362 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589363 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589365 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589366 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589367 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589368 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
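
One plausible reading of the listing above is that the worker (PID 1578079) briefly forks many children that share its command line, and each of them is counted as a separate running task. Below is a minimal sketch of a count that ignores such children. It is not the code from retrace.py: the function name `count_running_tasks` and the `WORKER_MARKER` constant are hypothetical, and the column layout (PID, PPID, ELAPSED, CMD) is assumed to match the fragment above.

```python
from typing import Iterable

# Substring identifying a worker in the CMD column (taken from the listing above).
WORKER_MARKER = "retrace-server-worker"


def count_running_tasks(ps_lines: Iterable[str]) -> int:
    """Count worker processes, ignoring children forked by another worker.

    Assumes each line has the columns PID, PPID, ELAPSED, CMD.
    """
    workers = []  # (pid, ppid) of every process that looks like a worker
    for line in ps_lines:
        fields = line.split()
        # Skip the header, elided lines such as "[...]", and malformed rows.
        if len(fields) < 4 or not fields[0].isdigit() or not fields[1].isdigit():
            continue
        if WORKER_MARKER not in " ".join(fields[3:]):
            continue
        workers.append((int(fields[0]), int(fields[1])))

    worker_pids = {pid for pid, _ in workers}
    # A worker whose parent is itself a worker is treated as a forked child,
    # not as a separate task.
    return sum(1 for _, ppid in workers if ppid not in worker_pids)
```

Applied to the fragment above, this would count the process with PPID 1 once and skip every row whose PPID is 1578079. Whether filtering by parent is the right fix depends on how the workers actually spawn their children, which the listing alone does not show.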

@mgrabovsky added this to the 2.0.0 milestone Jan 17, 2022