Prefect Server Experiencing Timeouts Due to Slow Database Communication #16299
Comments
I'm having what I think may be a related issue? I tested out Prefect no problem on WSL, because we're considering using it as a workflow orchestrator. Now when I'm trying to test that same workflow on a corporate RH Linux environment, the server feels like it is starved for resources (even though I'm running a simple workflow and the VM has 4 cores & 64 GB RAM), since tons of services are taking longer to run than their designated loop intervals and it is slow to respond to UI interaction. Likely related: the SQLite database seems to always be locked.
11:29:52.962 | WARNING | prefect.server.services.marklateruns - MarkLateRuns took 16.083751 seconds to run, which is longer than its loop interval of 5.0 seconds.
11:29:52.977 | ERROR | uvicorn.error - Exception in ASGI application
Traceback (most recent call last):
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
self.dialect.do_execute(
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 941, in do_execute
cursor.execute(statement, parameters)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/dialects/sqlite/aiosqlite.py", line 147, in execute
self._adapt_connection._handle_exception(error)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/dialects/sqlite/aiosqlite.py", line 298, in _handle_exception
raise error
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/dialects/sqlite/aiosqlite.py", line 129, in execute
self.await_(_cursor.execute(operation, parameters))
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 132, in await_only
return current.parent.switch(awaitable) # type: ignore[no-any-return,attr-defined] # noqa: E501
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 196, in greenlet_spawn
value = await result
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/aiosqlite/cursor.py", line 48, in execute
await self._execute(self._cursor.execute, sql, parameters)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/aiosqlite/cursor.py", line 40, in _execute
return await self._conn._execute(fn, *args, **kwargs)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/aiosqlite/core.py", line 132, in _execute
return await future
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/aiosqlite/core.py", line 115, in run
result = function()
sqlite3.OperationalError: database is locked
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 403, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
return await self.app(scope, receive, send)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
raise exc
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
await self.app(scope, receive, _send)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/prefect/server/api/server.py", line 149, in __call__
await self.app(scope, receive, send)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
await route.handle(scope, receive, send)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/routing.py", line 460, in handle
await self.app(scope, receive, send)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
raise exc
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
await self.app(scope, receive, _send)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/middleware/gzip.py", line 20, in __call__
await responder(scope, receive, send)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/middleware/gzip.py", line 39, in __call__
await self.app(scope, receive, self.send_with_gzip)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
await self.middleware_stack(scope, receive, send)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
await route.handle(scope, receive, send)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
await self.app(scope, receive, send)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
raise exc
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
await app(scope, receive, sender)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
response = await f(request)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/prefect/server/utilities/server.py", line 47, in handle_response_scoped_depends
response = await default_handler(request)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
raw_response = await run_endpoint_function(
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/fastapi/routing.py", line 212, in run_endpoint_function
return await dependant.call(**values)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/prefect/server/api/task_runs.py", line 72, in create_task_run
model = await models.task_runs.create_task_run(
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/prefect/server/database/dependencies.py", line 168, in async_wrapper
return await func(db, *args, **kwargs) # type: ignore
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/prefect/server/models/task_runs.py", line 80, in create_task_run
await session.execute(insert_stmt)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/ext/asyncio/session.py", line 461, in execute
result = await greenlet_spawn(
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 201, in greenlet_spawn
result = context.throw(*sys.exc_info())
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 2362, in execute
return self._execute_internal(
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 2247, in _execute_internal
result: Result[Any] = compile_state_cls.orm_execute_statement(
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/orm/bulk_persistence.py", line 1294, in orm_execute_statement
result = conn.execute(
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1418, in execute
return meth(
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/sql/elements.py", line 515, in _execute_on_connection
return connection._execute_clauseelement(
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1640, in _execute_clauseelement
ret = self._execute_context(
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1846, in _execute_context
return self._exec_single_context(
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1986, in _exec_single_context
self._handle_dbapi_exception(
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 2355, in _handle_dbapi_exception
raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
self.dialect.do_execute(
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 941, in do_execute
cursor.execute(statement, parameters)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/dialects/sqlite/aiosqlite.py", line 147, in execute
self._adapt_connection._handle_exception(error)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/dialects/sqlite/aiosqlite.py", line 298, in _handle_exception
raise error
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/dialects/sqlite/aiosqlite.py", line 129, in execute
self.await_(_cursor.execute(operation, parameters))
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 132, in await_only
return current.parent.switch(awaitable) # type: ignore[no-any-return,attr-defined] # noqa: E501
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 196, in greenlet_spawn
value = await result
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/aiosqlite/cursor.py", line 48, in execute
await self._execute(self._cursor.execute, sql, parameters)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/aiosqlite/cursor.py", line 40, in _execute
return await self._conn._execute(fn, *args, **kwargs)
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/aiosqlite/core.py", line 132, in _execute
return await future
File "/home/my_username/python_environment/.venv/lib/python3.10/site-packages/aiosqlite/core.py", line 115, in run
result = function()
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked
[SQL: INSERT INTO task_run (flow_run_id, task_key, dynamic_key, cache_key, cache_expiration, task_version, flow_run_run_count, empirical_policy, task_inputs, tags, labels, name, run_count, total_run_time, id, created, updated) VALUES (:flow_run_id, :task_key, :dynamic_key, :cache_key, :cache_expiration, :task_version, :flow_run_run_count, :empirical_policy, :task_inputs, :tags, :labels, :name, :run_count, :total_run_time, :id, :created, :updated) ON CONFLICT (flow_run_id, task_key, dynamic_key) DO NOTHING]
[parameters: {'flow_run_id': '541ccbb0-372e-4580-862c-3a5b56177058', 'task_key': 'process_case_flow-a8ab69b3', 'dynamic_key': '0', 'cache_key': None, 'cache_expiration': None, 'task_version': '5fd26d2d027b6eb680eda186427acafc', 'flow_run_run_count': 0, 'empirical_policy': '{"max_retries": 0, "retry_delay_seconds": 0.0, "retries": 0, "retry_delay": 0, "retry_jitter_factor": null}', 'task_inputs': '{"case_data": [{"input_type": "task_run", "id": "08e56be0-6a98-4122-898a-1111edbc8ebd"}]}', 'tags': '[]', 'labels': '{"prefect.flow.id": "b7b42394-69e0-4734-9921-b99a2292fa38", "prefect.flow-run.id": "541ccbb0-372e-4580-862c-3a5b56177058"}', 'name': 'Process Case 1a-0', 'run_count': 0, 'total_run_time': '1970-01-01 00:00:00.000000', 'id': '924ccd4a-285f-46c2-b529-05ef671e8f1f', 'created': '2024-12-10 16:29:36.273078', 'updated': '2024-12-10 16:29:47.872167'}]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
11:29:52.994 | WARNING | prefect.server.services.recentdeploymentsscheduler - RecentDeploymentsScheduler took 16.116992 seconds to run, which is longer than its loop interval of 5 seconds.
11:29:52.998 | WARNING | prefect.server.services.failexpiredpauses - FailExpiredPauses took 16.119 seconds to run, which is longer than its loop interval of 5.0 seconds.
11:29:53.019 | WARNING | prefect.server.services.flowrunnotifications - FlowRunNotifications took 19.135031 seconds to run, which is longer than its loop interval of 4 seconds.
11:29:53.162 | WARNING | prefect.server.services.foreman - Foreman took 16.29065 seconds to run, which is longer than its loop interval of 15.0 seconds.
11:30:44.210 | WARNING | prefect.server.services.failexpiredpauses - FailExpiredPauses took 6.207736 seconds to run, which is longer than its loop interval of 5.0 seconds.
11:30:44.268 | WARNING | prefect.server.services.recentdeploymentsscheduler - RecentDeploymentsScheduler took 6.266031 seconds to run, which is longer than its loop interval of 5 seconds.
11:30:44.294 | WARNING | prefect.server.services.marklateruns - MarkLateRuns took 6.318531 seconds to run, which is longer than its loop interval of 5.0 seconds.
11:30:44.333 | WARNING | prefect.server.services.flowrunnotifications - FlowRunNotifications took 7.302994 seconds to run, which is longer than its loop interval of 4 seconds.
11:31:04.703 | WARNING | prefect.server.services.recentdeploymentsscheduler - RecentDeploymentsScheduler took 10.43387 seconds to run, which is longer than its loop interval of 5 seconds.
11:31:04.749 | WARNING | prefect.server.services.marklateruns - MarkLateRuns took 10.451921 seconds to run, which is longer than its loop interval of 5.0 seconds.
11:31:04.754 | WARNING | prefect.server.services.flowrunnotifications - FlowRunNotifications took 12.412788 seconds to run, which is longer than its loop interval of 4 seconds.
11:31:04.760 | WARNING | prefect.server.services.failexpiredpauses - FailExpiredPauses took 10.545929 seconds to run, which is longer than its loop interval of 5.0 seconds.
11:31:35.370 | WARNING | prefect.server.services.foreman - Foreman took 42.202888 seconds to run, which is longer than its loop interval of 15.0 seconds.
11:31:35.429 | WARNING | prefect.server.services.failexpiredpauses - FailExpiredPauses took 25.666847 seconds to run, which is longer than its loop interval of 5.0 seconds.
11:31:35.441 | WARNING | prefect.server.services.marklateruns - MarkLateRuns took 25.68895 seconds to run, which is longer than its loop interval of 5.0 seconds.
11:31:35.445 | WARNING | prefect.server.services.recentdeploymentsscheduler - RecentDeploymentsScheduler took 25.739011 seconds to run, which is longer than its loop interval of 5 seconds.
11:31:35.495 | WARNING | prefect.server.services.flowrunnotifications - FlowRunNotifications took 26.73926 seconds to run, which is longer than its loop interval of 4 seconds.
11:32:06.644 | WARNING | prefect.server.services.flowrunnotifications - FlowRunNotifications took 31.147717 seconds to run, which is longer than its loop interval of 4 seconds.
11:32:06.906 | WARNING | prefect.server.services.foreman - Foreman took 31.535556 seconds to run, which is longer than its loop interval of 15.0 seconds.
|
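For readers who want to see the `database is locked` error in isolation, here is a minimal sketch using only the standard-library `sqlite3` module. It is not Prefect code: the `task_run` table name is borrowed from the log above purely for illustration, and `provoke_lock` is a hypothetical helper. It assumes the default rollback-journal mode and a busy timeout of zero, so the second writer fails immediately instead of waiting.

```python
import os
import sqlite3
import tempfile


def provoke_lock(db_path: str) -> str:
    """Return the error message a second writer sees while a write lock is held."""
    # isolation_level=None puts the sqlite3 module in autocommit mode,
    # so the explicit BEGIN/ROLLBACK statements below control the transactions.
    # timeout=0 means: do not wait for a lock, fail right away.
    writer_a = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    writer_b = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    try:
        writer_a.execute(
            "CREATE TABLE IF NOT EXISTS task_run (id INTEGER PRIMARY KEY, name TEXT)"
        )
        writer_a.execute("BEGIN IMMEDIATE")  # take the write lock and hold it
        writer_a.execute("INSERT INTO task_run (name) VALUES ('a')")
        try:
            # The second connection asks for the same write lock.
            writer_b.execute("BEGIN IMMEDIATE")
        except sqlite3.OperationalError as exc:
            return str(exc)
        finally:
            writer_a.execute("ROLLBACK")  # release the lock either way
        writer_b.execute("ROLLBACK")
        return "no error"
    finally:
        writer_a.close()
        writer_b.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(provoke_lock(os.path.join(tmp, "demo.db")))
```

SQLite allows one writer at a time per database file, so any write that holds its transaction open for a long time can make every other writer fail this way once its own busy timeout expires.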
These seem to be related (#16304) |
It definitely has something to do with the database. I figured I'd test out running with an in-memory database and I had no issues whatsoever.
(python-environment) bash-4.4$ prefect config set PREFECT_API_DATABASE_CONNECTION_URL="sqlite+aiosqlite:///file::memory:?cache=shared&uri=true&check_same_thread=false"
Set 'PREFECT_API_DATABASE_CONNECTION_URL' to 'sqlite+aiosqlite:///file::memory:?cache=shared&uri=true&check_same_thread=false'.
Updated profile 'local'. |
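As a rough illustration of what the `file::memory:?cache=shared` part of that URL means, the sketch below opens two standard-library `sqlite3` connections to the same shared in-memory database. It only shows the SQLite side; how Prefect and aiosqlite use the URL is not covered here, and `shared_memory_roundtrip` and the `demo` table are hypothetical names.

```python
import sqlite3

# Same SQLite URI as in the connection URL above, minus the SQLAlchemy prefix.
URI = "file::memory:?cache=shared"


def shared_memory_roundtrip() -> list:
    """Write through one connection and read the row back through another."""
    first = sqlite3.connect(URI, uri=True, check_same_thread=False)
    second = sqlite3.connect(URI, uri=True, check_same_thread=False)
    try:
        first.execute("CREATE TABLE demo (value TEXT)")
        first.execute("INSERT INTO demo VALUES ('hello')")
        first.commit()
        # Both connections see the same database because the cache is shared.
        return second.execute("SELECT value FROM demo").fetchall()
    finally:
        # The in-memory database disappears once the last connection closes.
        first.close()
        second.close()


if __name__ == "__main__":
    print(shared_memory_roundtrip())
```

Nothing is written to disk in this mode, so all data is lost when the process exits. That makes it useful for narrowing down a problem, as in the comment above, but not as a permanent setup.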
I'll add that shortly after a few of these errors occur, our Prefect server pod just crashes and needs to be restarted. As a result, all running flows end up failing or in a zombie state (i.e., Prefect says the flow is running when it is not). |
Can confirm I'm still seeing this on 3.1.7.dev4. Seeing intermittent issues with both Postgres and SQLite. |
Hey @zzstoatzz, do you know if any Prefect engineers have found any possible causes or are investigating this? Sorry to pester, but we're having a lot of issues running flows and using the server UI. |
hi @tomukmatthews - thanks for the bump. this is something we're going to investigate over the coming days - any information you can share about your workloads would be appreciated! thanks for the infra detail so far |
I was running this toy workflow. Wasn't even doing anything real yet, I just wanted to see if Prefect would work for my use-case. Essentially, I had a main workflow that would load a .csv file, and then for every row in that .csv file it would start a subworkflow that would do some stuff to the data. Then once all the subworkflows were done it would come together and summarize the results. The actual processing here is complete nonsense, this was just to test Prefect.
import asyncio
from textwrap import dedent
from typing import List, Dict
from prefect import task, flow, get_run_logger
import pandas as pd
from time import sleep
from prefect.tasks import task_input_hash
from prefect.cache_policies import TASK_SOURCE, INPUTS
from prefect.artifacts import create_progress_artifact, update_progress_artifact, create_markdown_artifact

"""An example workflow with confidential information removed for https://github.com/PrefectHQ/prefect/issues/16299"""

#-----------------------------------------------------------------------------------------------------------------------
# Process Case Workflow
#-----------------------------------------------------------------------------------------------------------------------

# Don't cache this one
@task(name="Load CSV")
def load_csv(path: str) -> List[Dict]:
    df = pd.read_csv(path)
    return df.to_dict('records')


# cache_policy=TASK_SOURCE + INPUTS
@task(name="Solve Stuff")
def solve_stuff(row: Dict) -> Dict:
    sleep(2)
    row['left_side'] = row["p"]*row["v"]
    row["right_side"] = row["n"]*row["r"]*row["t"]
    return row


@task(name="Create File")
def create_file(bcs: Dict) -> str:
    sleep(2)
    return "\n File \n".join([f"{k}: {v}" for k, v in bcs.items()])


@task(name="Create QFile")
def create_qfile(bcs: Dict) -> str:
    sleep(2)
    return "\n Q \n".join([f"{k}: {v}" for k, v in bcs.items()])


@task(name="Launch Job")
async def launch_job(journal_file: str, q_file: str) -> str:
    # progress_artifact_id = create_progress_artifact(
    #     progress=0.0,
    #     description="Indicates the estimated progress of the run.",
    # )
    sleep_time = 20
    for i in range(1, sleep_time + 1):
        await asyncio.sleep(1)
        # update_progress_artifact(artifact_id=progress_artifact_id, progress=(i / sleep_time) * 100)
    return "job\n" + journal_file + q_file


# cache_policy=TASK_SOURCE + INPUTS
@task()
def validate_case_bcs(job_results: str, case_bcs: Dict) -> bool:
    sleep(2)
    markdown_report = dedent(f"""\
        # Boundary Conditions Summary
        Case ID: {case_bcs["id"]}
        ```python
        def example_func():
            return "It works"
        ```
        ```mermaid
        pie title NETFLIX
            "Time spent looking for movie" : 90
            "Time spent watching it" : 10
        ```"""
    )
    create_markdown_artifact(
        markdown=markdown_report,
        description="Validate Job Conditions",
    )
    return True


@task(name="Analyze Case Results")
def analyze_job_results(job_results: str, case_bcs: Dict) -> Dict:
    sleep(2)
    case_bcs["results"] = job_results
    return case_bcs


async def build_case_subflow(row: Dict):
    """https://github.com/PrefectHQ/prefect/issues/7319#issuecomment-1311968282"""
    @flow(name=f"Process Case {row.get('id', 'unknown')}")
    async def process_case_flow(case_data: Dict) -> Dict:
        case_bcs = solve_stuff(case_data)
        file_future, q_file_future = create_file.submit(case_bcs), create_qfile.submit(case_bcs)
        job_results = await launch_job(file_future, q_file_future)
        validate_case_bcs(job_results, case_bcs)
        return analyze_job_results(job_results, case_bcs)
    return await process_case_flow(row)


@task(name="Summarize All Results")
def summarize_results(all_results: List[Dict]) -> List[Dict]:
    sleep(2)
    return all_results


#-----------------------------------------------------------------------------------------------------------------------
# CFD Optimization Workflow
#-----------------------------------------------------------------------------------------------------------------------
@flow(name="Toy Job Workflow")
async def job_workflow(csv_path: str) -> List[Dict]:
    rows = load_csv(csv_path)
    all_cfd_results = await asyncio.gather(*[build_case_subflow(row) for row in rows])
    return summarize_results(all_cfd_results)


if __name__ == "__main__":
    asyncio.run(job_workflow("example_data.csv"))
|
thanks @CorMazz - much appreciated! |
Hi @zzstoatzz, i was curious if you found anything interesting re. the performance issues in the telemetry data? |
Our data pipelines are frequently failing with:
Do you know of any server- or client-side configuration we can change that will alleviate these issues in the interim and help our runs complete successfully? E.g., I've already tripled its memory allocation; maybe bumping up the server's memory and CPU further might help? |
I have the same problem currently, but with 2.20.15. The problem actually hits me when I open a flow run page, where it seems to time out while collecting data to display the flow graph and the logs. The setup is deployed via Helm on Azure and I use Azure PostgreSQL flexible server version 14.12. The stack trace reads like |
hi @tomukmatthews - there's still more investigation to do (apologies, last couple weeks have been slower due to the time of the year) from the preliminary investigation I did, after running work constantly for a while (was testing mostly on postgres), I was able to observe the lagging loop services mentioned above, which seemed to be caused by long waits to obtain new db connections. Some ideas have been floated about this being related to the in-memory events implementation, but anecdotally that specific change hasn't made a significant improvement. any findings you'd like to share would be appreciated! we will continue to look into this as time allows |
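To make the "long waits to obtain new db connections" observation concrete, here is a toy simulation. It is a sketch under loud assumptions: an `asyncio.Semaphore` stands in for a connection pool, `asyncio.sleep` stands in for a query, and none of the names (`run_query`, `max_wait`) come from Prefect or SQLAlchemy.

```python
import asyncio
import time


async def run_query(pool: asyncio.Semaphore, hold_seconds: float) -> float:
    """Acquire a pooled 'connection', hold it, and return how long the acquire waited."""
    started = time.perf_counter()
    async with pool:
        waited = time.perf_counter() - started
        await asyncio.sleep(hold_seconds)  # stand-in for the query itself
    return waited


async def max_wait(pool_size: int, callers: int, hold_seconds: float) -> float:
    """Return the longest time any caller spent waiting for a connection."""
    pool = asyncio.Semaphore(pool_size)
    waits = await asyncio.gather(
        *(run_query(pool, hold_seconds) for _ in range(callers))
    )
    return max(waits)


if __name__ == "__main__":
    for size in (2, 10):
        worst = asyncio.run(max_wait(pool_size=size, callers=10, hold_seconds=0.05))
        print(f"pool_size={size}: worst wait for a connection = {worst:.3f}s")
```

With more concurrent callers than pooled connections, the wait grows with the queue length even though each individual query is fast, which would be consistent with loop services overrunning their intervals while the database itself looks idle.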
I am seeing exactly this behavior -- timeout errors with messages about things taking longer than their loop allowances, and then Prefect's workflow management becoming entirely inactive -- but with 2.20.13 under moderate load. This occurred when running against a minimally-powered GCP PSQL server and it went away when I made the PSQL server beefier. |
I also had the impression that a higher SKU for the PostgreSQL server helps: after the upgrade the problem vanished, but in my case only for a while, and then the problem came back. More strangely, when Prefect stalled on the database I could still manually run my own queries against the Prefect tables, reflecting similar data pulls, with both psycopg and asyncpg, and results returned in <1s. The server metrics also showed ample spare capacity. |
@RobertFischer could you post the specs of your PSQL instance (before and after)? |
I try to go cheap: |
The lightweight version was GCP's |
Here's a total shot in the dark. Are we positive all DB connections are being closed? If there's a leak in the DB connections and some transaction isn't getting closed, it might cause problems like this due to locks in the DB. If/when I see this again, I can query the DB system tables to see what the situation is, if that's useful. |
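One way to check for the kind of leak suggested above is to look at PostgreSQL's `pg_stat_activity` view. The sketch below is an assumption-laden illustration, not something posted in this thread: the query uses the standard view and its `state` column, while `summarize_states` and `leaked_transactions` are hypothetical helpers that work on rows already fetched as dictionaries. Running the query itself needs a database driver and is left out.

```python
from collections import Counter
from typing import Iterable, List, Mapping

# Query against the standard PostgreSQL system view. The %s placeholder is the
# database name, in the parameter style used by psycopg.
ACTIVITY_SQL = """
SELECT pid, state, state_change, query
FROM pg_stat_activity
WHERE datname = %s
"""


def summarize_states(rows: Iterable[Mapping[str, object]]) -> Counter:
    """Count connections per state, e.g. 'idle', 'active', 'idle in transaction'."""
    return Counter(str(row["state"]) for row in rows)


def leaked_transactions(rows: Iterable[Mapping[str, object]]) -> List[object]:
    """Return the pids of sessions sitting inside an open transaction."""
    return [row["pid"] for row in rows if row["state"] == "idle in transaction"]


if __name__ == "__main__":
    # Made-up sample rows, shaped like the query result above.
    sample = [
        {"pid": 101, "state": "idle"},
        {"pid": 102, "state": "active"},
        {"pid": 103, "state": "idle in transaction"},
    ]
    print(summarize_states(sample))
    print(leaked_transactions(sample))
```

Plain `idle` sessions are usually just pooled connections waiting for work. Sessions that stay in `idle in transaction` for a long time are the ones that can hold locks and would point toward an unclosed transaction.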
One observation when running |
|
Another finding I could add: we have 2 instances in the cloud, DEV and PROD. On the seldom-used DEV environment I see 5 such idle connections, far fewer than on the frequently used PROD. |
Can confirm we also see these idle connections (5 at the minimum). |
any update from the prefect team on this, @zzstoatzz |
Deleting all the retry configurations seemed to resolve this issue. I have no idea why or if it's a red herring, but just FYI. |
Both @zzstoatzz and I are looking into this for the week; a few small updates:
We'll keep providing updates as we have them and please continue to share any / all relevant information! |
👋 hi all - wanted to share some findings from recent investigation. We believe we've identified a bottleneck in the service responsible for recording task runs which is likely responsible for a significant part of the bad performance reported here. This conclusion is consistent with:
The issue appears to be that in the task run recording service, task runs are recorded sequentially through an in-memory queue and therefore when running continual task-heavy flows, you get:
we're exploring / testing strategies for dealing with this (which have shown initial promise)
While we're still interested in adding external messaging support (e.g. redis) for larger scale, we'd like to reasonably exhaust opportunities for improvement with the in-memory queue. Please feel free to chime in if you feel this doesn't explain the behavior you're seeing or if you have other ideas/thoughts |
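The bottleneck described above can be illustrated with a toy model. This is not the Prefect task run recorder, and the thread does not say which strategies the maintainers tested. The sketch simply assumes that one database round trip has a fixed cost regardless of row count, and compares draining an in-memory queue one item at a time with draining it in batches. All names (`write_rows`, `drain_sequential`, `drain_batched`, `compare`) are hypothetical.

```python
import asyncio
import time

WRITE_LATENCY = 0.01  # assumed fixed cost of one database round trip, in seconds


async def write_rows(rows: list) -> None:
    """Stand-in for one INSERT round trip; assumed not to depend on row count."""
    await asyncio.sleep(WRITE_LATENCY)


async def drain_sequential(queue: asyncio.Queue) -> int:
    """Record one queued task run per round trip; return the number of trips."""
    trips = 0
    while not queue.empty():
        await write_rows([queue.get_nowait()])
        trips += 1
    return trips


async def drain_batched(queue: asyncio.Queue, batch_size: int) -> int:
    """Record up to batch_size queued task runs per round trip."""
    trips = 0
    while not queue.empty():
        batch = []
        while len(batch) < batch_size and not queue.empty():
            batch.append(queue.get_nowait())
        await write_rows(batch)
        trips += 1
    return trips


async def compare(n_events: int = 50, batch_size: int = 10) -> dict:
    """Return {strategy: (round_trips, elapsed_seconds)} for both strategies."""
    results = {}
    for name in ("sequential", "batched"):
        queue: asyncio.Queue = asyncio.Queue()
        for i in range(n_events):
            queue.put_nowait({"task_run": i})
        started = time.perf_counter()
        if name == "sequential":
            trips = await drain_sequential(queue)
        else:
            trips = await drain_batched(queue, batch_size)
        results[name] = (trips, time.perf_counter() - started)
    return results


if __name__ == "__main__":
    for name, (trips, elapsed) in asyncio.run(compare()).items():
        print(f"{name}: {trips} round trips in {elapsed:.2f}s")
```

Under this assumption, a sequential consumer falls behind as soon as events arrive faster than one per round trip, and the backlog keeps growing for as long as task-heavy flows keep running.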
Great. That would not explain, though, why we also have the problem in Prefect 2.x, would it? |
correct @OliverKleinBST - that would not directly explain seeing the same behavior in 2.x |
Bug summary
We're experiencing significant performance issues with our Prefect server installation, primarily manifesting as timeouts in database communications. The services are consistently running longer than their designated loop intervals:
FlowRunNotifications: Taking ~6.8s vs 4s interval
RecentDeploymentsScheduler: Taking ~7.5s vs 5s interval
The primary error appears to be a timeout in the PostgreSQL asyncpg connection:
Version info (for the server)
Additional context
2024.12.3011129
proxy-read-timeout
in my nginx ingress):
Server logs: