CancellationCleanup service pegs CPU to 100% #15231
Labels: bug, great writeup, performance
Bug summary
We have a medium-sized Prefect deployment on an AWS EKS cluster with an RDS Postgres database. Recently we started using a lot of subflows, accumulating about 50k of them (most in a Completed state). For the last couple of days we have been fire-fighting the deployment falling over: all 3 Prefect server pods overloaded at 100% CPU, everything extremely slow, late flow runs accumulating, and so on.
After investigation, we realized that the issue was the CancellationCleanup loop taking about 5 minutes per run and using ~60-70% of CPU, while also adding unreasonable load to the database. As soon as it finishes, the loop immediately starts again from the beginning, leaving the whole server starved for resources and failing in a lot of other places.
We confirmed it was the culprit by disabling the loops one by one and checking CPU usage, database load, and overall responsiveness of the web interface.
Specifically, what appears to happen is that in the `clean_up_cancelled_subflow_runs` function we iterate over ALL subflows in the database (in all states, including completed ones) and call `_cancel_subflow` for each of them. That initial query seems to be pretty heavy, as it also preloads the corresponding `flow_run_state` and so on. My guess is that this query is not doing what we expect it to do - maybe the `db.FlowRun.id > high_water_mark` condition needs to be moved into the AND expression?

https://github.com/PrefectHQ/prefect/blob/2.x/src/prefect/server/services/cancellation_cleanup.py#L79-L92
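
To make the suspicion concrete, here is a minimal, self-contained SQLAlchemy sketch (not the actual Prefect code; the table, columns, and state values are simplified stand-ins) showing how the placement of the high-water-mark predicate changes the query. If `FlowRun.id > high_water_mark` ends up as just another branch of an `or_()` rather than being ANDed with the state filter, every subflow above the high-water mark matches regardless of state, which would explain the scan over completed subflows:

```python
# Illustrative sketch only - simplified stand-in for the query in
# cancellation_cleanup.py, not the actual Prefect server code.
import sqlalchemy as sa

metadata = sa.MetaData()
flow_run = sa.Table(
    "flow_run",
    metadata,
    sa.Column("id", sa.Integer, primary_key=True),
    sa.Column("state_type", sa.String),
)

high_water_mark = 0

# Suspected shape: the id predicate is OR-ed with the state filter, so ANY
# row with id > high_water_mark matches, including Completed subflows.
too_broad = sa.select(flow_run).where(
    sa.or_(
        flow_run.c.state_type == "CANCELLING",
        flow_run.c.id > high_water_mark,
    )
)

# Expected shape: the id predicate is AND-ed with the state filter, so the
# scan only pages through subflows that are still in a relevant state.
scoped = sa.select(flow_run).where(
    sa.and_(
        flow_run.c.state_type == "CANCELLING",
        flow_run.c.id > high_water_mark,
    )
)

print(too_broad.compile(compile_kwargs={"literal_binds": True}))
print(scoped.compile(compile_kwargs={"literal_binds": True}))
```

With the predicate inside `or_()`, the compiled SQL matches essentially every subflow above the high-water mark; with `and_()` it only pages through runs in the targeted state, which is the behavior we expected.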
Version info (`prefect version` output)

Additional context
No response