
CancellationCleanup service pegs CPU to 100% #15231

Closed
ashtuchkin opened this issue Sep 4, 2024 · 4 comments · Fixed by #15286
Assignees: cicdw
Labels: bug (Something isn't working), great writeup (This is a wonderful example of our standards), performance (Related to an optimization or performance improvement)

Comments

@ashtuchkin (Contributor)

Bug summary

We have a medium-sized Prefect deployment on an AWS EKS cluster with an RDS Postgres database. Recently we started using a lot of subflows and have accumulated about 50k of them (most in a Completed state). For the last couple of days we have been fire-fighting the deployment falling over: all 3 Prefect server pods are overloaded (100% CPU), everything is extremely slow, late flows keep accumulating, etc.

After investigating, we realized that the issue is the CancellationCleanup loop, which takes about 5 minutes to run, uses ~60-70% of the CPU, and puts unreasonable load on the database. As soon as it finishes, the loop immediately starts over from the beginning, leaving the whole server starved for resources and failing in many other places.
We confirmed it is the culprit by disabling the loops one by one and checking CPU usage, database load, and the overall responsiveness of the web interface; a sketch of how the service loops can be toggled follows below.
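For context, here is a minimal sketch (not from the report) of how the individual server service loops can be inspected and toggled for this kind of diagnosis. The setting names below are assumptions based on Prefect 2.x conventions; verify them for your version (e.g. with `prefect config view --show-defaults`). The same names can be exported as environment variables on the server pods, for example via the helm chart's env values, with a value of "false" to disable a loop.

    # Sketch only: inspect which background service loops are currently enabled.
    # The setting names are assumptions based on Prefect 2.x; verify them before
    # relying on them.
    from prefect.settings import (
        PREFECT_API_SERVICES_CANCELLATION_CLEANUP_ENABLED,
        PREFECT_API_SERVICES_LATE_RUNS_ENABLED,
        PREFECT_API_SERVICES_SCHEDULER_ENABLED,
    )

    for setting in (
        PREFECT_API_SERVICES_CANCELLATION_CLEANUP_ENABLED,
        PREFECT_API_SERVICES_LATE_RUNS_ENABLED,
        PREFECT_API_SERVICES_SCHEDULER_ENABLED,
    ):
        # Each of these can be set to "false" in the server pods' environment
        # to disable the corresponding loop while diagnosing load issues.
        print(setting.name, setting.value())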

Specifically, what appears to happen is that the clean_up_cancelled_subflow_runs function iterates over ALL subflows in the database (in all states, including completed ones) and then runs _cancel_subflow for each of them. That initial query also seems to be fairly heavy, since it preloads the corresponding flow_run_state, etc.

My guess is that this query is not doing what we expect it to do. Maybe `db.FlowRun.id > high_water_mark` needs to be moved out of the `or_()` and into the outer AND expression? Because it sits inside the `or_()`, any subflow whose id is above the high-water mark matches regardless of its state, so the query ends up walking every subflow in the table (see the sketch after the snippet below).

https://github.com/PrefectHQ/prefect/blob/2.x/src/prefect/server/services/cancellation_cleanup.py#L79-L92

                sa.select(db.FlowRun)
                .where(
                    or_(
                        db.FlowRun.state_type == states.StateType.PENDING,
                        db.FlowRun.state_type == states.StateType.SCHEDULED,
                        db.FlowRun.state_type == states.StateType.RUNNING,
                        db.FlowRun.state_type == states.StateType.PAUSED,
                        db.FlowRun.state_type == states.StateType.CANCELLING,
                        db.FlowRun.id > high_water_mark,
                    ),
                    db.FlowRun.parent_task_run_id.is_not(None),
                )
                .order_by(db.FlowRun.id)
                .limit(self.batch_size)
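
For illustration only, here is a sketch of what the report is suggesting: lifting `db.FlowRun.id > high_water_mark` out of the `or_()` so it is ANDed with the state filter, which limits the scan to unfinished subflows past the high-water mark rather than every subflow in the table. This mirrors the guess above and is not the actual patch merged in #15286; `db`, `states`, `sa`, `or_`, `high_water_mark`, and `self.batch_size` are the same names as in the snippet.

    # Sketch of the suggested change: the high-water-mark condition now participates
    # in the AND, so only subflow runs in the listed (unfinished) states are paged.
    query = (
        sa.select(db.FlowRun)
        .where(
            or_(
                db.FlowRun.state_type == states.StateType.PENDING,
                db.FlowRun.state_type == states.StateType.SCHEDULED,
                db.FlowRun.state_type == states.StateType.RUNNING,
                db.FlowRun.state_type == states.StateType.PAUSED,
                db.FlowRun.state_type == states.StateType.CANCELLING,
            ),
            db.FlowRun.id > high_water_mark,
            db.FlowRun.parent_task_run_id.is_not(None),
        )
        .order_by(db.FlowRun.id)
        .limit(self.batch_size)
    )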

Version info (prefect version output)

We're using the helm chart https://prefecthq.github.io/prefect-helm, version 2024.6.28162841, in our AWS EKS Kubernetes cluster.
Database: AWS RDS Postgres.
Image: prefecthq/prefect:2.19.7-python3.10

Here's `prefect version` from the client:
Version:             2.19.7
API version:         0.8.4
Python version:      3.11.7
Git commit:          60f05122
Built:               Fri, Jun 28, 2024 11:27 AM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         ephemeral
Server:
  Database:          sqlite
  SQLite version:    3.40.1

Additional context

No response

@ashtuchkin ashtuchkin added the bug Something isn't working label Sep 4, 2024
@cicdw cicdw added performance Related to an optimization or performance improvement great writeup This is a wonderful example of our standards labels Sep 4, 2024
@zzstoatzz zzstoatzz moved this to Backlog in OSS Backlog Sep 5, 2024
@cicdw cicdw moved this from Backlog to Ready in OSS Backlog Sep 9, 2024
@cicdw cicdw self-assigned this Sep 9, 2024
@github-project-automation github-project-automation bot moved this from In progress to Done in OSS Backlog Sep 9, 2024
@cicdw (Member) commented Sep 9, 2024

@ashtuchkin I'll comment here once this is officially released in 2.20.7 later this week; if you're interested in testing this prior to release, you can install off our 2.x branch via pip install -U git+https://github.com/PrefectHQ/[email protected] once #15289 is merged!

@jashwanth9

@cicdw Thanks for fixing this issue. When can we expect the official release of 2.20.7?

@cicdw (Member) commented Sep 12, 2024

I just cut the release moments ago @jashwanth9 ! It should go live on PyPI imminently.

@ashtuchkin (Contributor, Author)

Thank you @cicdw !
