
CancellationCleanup service pegs CPU to 100% #15231

Closed
ashtuchkin opened this issue Sep 4, 2024 · 4 comments · Fixed by #15286
Assignees: cicdw
Labels: bug (Something isn't working), great writeup (This is a wonderful example of our standards), performance (Related to an optimization or performance improvement)

Comments

@ashtuchkin (Contributor)

Bug summary

We have a medium-sized Prefect deployment on an AWS EKS cluster with an RDS Postgres database. Recently we started using a lot of subflows and have accumulated about 50k of them (most in a Completed state). For the last couple of days we have been fire-fighting the deployment falling over: all 3 Prefect server pods are overloaded (100% CPU), everything is extremely slow, late flows keep accumulating, etc.

After investigating, we realized that the issue is the CancellationCleanup loop, which takes about 5 minutes to run, uses ~60-70% of the CPU, and puts unreasonable load on the database. As soon as it finishes, the loop immediately starts over from the beginning, leaving the whole server starved for resources and failing in many other places.
We confirmed it is the culprit by disabling the loops one by one and checking CPU usage, database load, and the overall responsiveness of the web interface; a sketch of how the service loops can be toggled follows below.
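For context, here is a minimal sketch (not from the report) of how the individual server service loops can be inspected and toggled for this kind of diagnosis. The setting names below are assumptions based on Prefect 2.x conventions; verify them for your version (e.g. with `prefect config view --show-defaults`). The same names can be exported as environment variables on the server pods, for example via the helm chart's env values, with a value of "false" to disable a loop.

    # Sketch only: inspect which background service loops are currently enabled.
    # The setting names are assumptions based on Prefect 2.x; verify them before
    # relying on them.
    from prefect.settings import (
        PREFECT_API_SERVICES_CANCELLATION_CLEANUP_ENABLED,
        PREFECT_API_SERVICES_LATE_RUNS_ENABLED,
        PREFECT_API_SERVICES_SCHEDULER_ENABLED,
    )

    for setting in (
        PREFECT_API_SERVICES_CANCELLATION_CLEANUP_ENABLED,
        PREFECT_API_SERVICES_LATE_RUNS_ENABLED,
        PREFECT_API_SERVICES_SCHEDULER_ENABLED,
    ):
        # Each of these can be set to "false" in the server pods' environment
        # to disable the corresponding loop while diagnosing load issues.
        print(setting.name, setting.value())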

Specifically, what appears to happen is that the clean_up_cancelled_subflow_runs function iterates over ALL subflows in the database (in all states, including completed ones) and then runs _cancel_subflow for each of them. That initial query also seems to be fairly heavy, since it preloads the corresponding flow_run_state, etc.

My guess is that this query is not doing what we expect it to do. Maybe `db.FlowRun.id > high_water_mark` needs to be moved out of the `or_()` and into the outer AND expression? Because it sits inside the `or_()`, any subflow whose id is above the high-water mark matches regardless of its state, so the query ends up walking every subflow in the table (see the sketch after the snippet below).

https://github.com/PrefectHQ/prefect/blob/2.x/src/prefect/server/services/cancellation_cleanup.py#L79-L92

                sa.select(db.FlowRun)
                .where(
                    or_(
                        db.FlowRun.state_type == states.StateType.PENDING,
                        db.FlowRun.state_type == states.StateType.SCHEDULED,
                        db.FlowRun.state_type == states.StateType.RUNNING,
                        db.FlowRun.state_type == states.StateType.PAUSED,
                        db.FlowRun.state_type == states.StateType.CANCELLING,
                        db.FlowRun.id > high_water_mark,
                    ),
                    db.FlowRun.parent_task_run_id.is_not(None),
                )
                .order_by(db.FlowRun.id)
                .limit(self.batch_size)
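
For illustration only, here is a sketch of what the report is suggesting: lifting `db.FlowRun.id > high_water_mark` out of the `or_()` so it is ANDed with the state filter, which limits the scan to unfinished subflows past the high-water mark rather than every subflow in the table. This mirrors the guess above and is not the actual patch merged in #15286; `db`, `states`, `sa`, `or_`, `high_water_mark`, and `self.batch_size` are the same names as in the snippet.

    # Sketch of the suggested change: the high-water-mark condition now participates
    # in the AND, so only subflow runs in the listed (unfinished) states are paged.
    query = (
        sa.select(db.FlowRun)
        .where(
            or_(
                db.FlowRun.state_type == states.StateType.PENDING,
                db.FlowRun.state_type == states.StateType.SCHEDULED,
                db.FlowRun.state_type == states.StateType.RUNNING,
                db.FlowRun.state_type == states.StateType.PAUSED,
                db.FlowRun.state_type == states.StateType.CANCELLING,
            ),
            db.FlowRun.id > high_water_mark,
            db.FlowRun.parent_task_run_id.is_not(None),
        )
        .order_by(db.FlowRun.id)
        .limit(self.batch_size)
    )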

Version info (prefect version output)

We're using the helm chart https://prefecthq.github.io/prefect-helm, version 2024.6.28162841, in our AWS EKS Kubernetes cluster.
Database: AWS RDS Postgres.
Image: prefecthq/prefect:2.19.7-python3.10

Here's `prefect version` from the client:
Version:             2.19.7
API version:         0.8.4
Python version:      3.11.7
Git commit:          60f05122
Built:               Fri, Jun 28, 2024 11:27 AM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         ephemeral
Server:
  Database:          sqlite
  SQLite version:    3.40.1

Additional context

No response

@ashtuchkin ashtuchkin added the bug Something isn't working label Sep 4, 2024
@cicdw cicdw added performance Related to an optimization or performance improvement great writeup This is a wonderful example of our standards labels Sep 4, 2024
@zzstoatzz zzstoatzz moved this to Backlog in OSS Backlog Sep 5, 2024
@cicdw cicdw moved this from Backlog to Ready in OSS Backlog Sep 9, 2024
@cicdw cicdw self-assigned this Sep 9, 2024
@github-project-automation github-project-automation bot moved this from In progress to Done in OSS Backlog Sep 9, 2024
@cicdw (Member) commented Sep 9, 2024

@ashtuchkin I'll comment here once this is officially released in 2.20.7 later this week; if you're interested in testing this prior to release, you can install off our 2.x branch via pip install -U git+https://github.com/PrefectHQ/[email protected] once #15289 is merged!

@jashwanth9

@cicdw Thanks for fixing this issue. When can we expect the official release of 2.20.7?

@cicdw (Member) commented Sep 12, 2024

I just cut the release moments ago @jashwanth9 ! It should go live on PyPI imminently.

@ashtuchkin (Contributor, Author)

Thank you @cicdw !
