
Fix/correct streaming resource lock #1879

Merged · 5 commits · Jan 8, 2025

Conversation

@gjpower (Contributor) commented Dec 23, 2024

Fixes #1861

Establishes the exit stack context inside the correct streaming response task so that it can be closed correctly, avoiding the problem of opening it in one task and trying to close it in another (an anyio lock raises "The current task is not holding this lock" when released from a task other than the one that acquired it, which is the crash reported in #1861).

Also simplifies usage of the exit_stack by entering the async context manager directly: `async with contextlib.asynccontextmanager(...)()`.
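
As a rough sketch of that pattern (not the PR's exact code; the lock and `get_resource` here are stand-ins):

```python
import contextlib

import anyio

lock = anyio.Lock()


async def get_resource():
    # Generator-style dependency: acquire on entry, release on exit.
    # FastAPI's Depends(...) can consume this shape directly.
    async with lock:
        yield "resource"


async def streaming_task():
    # Wrap the same generator explicitly and enter it *inside* the task
    # that produces the streaming response, so the lock is acquired and
    # released by one and the same task.
    async with contextlib.asynccontextmanager(get_resource)() as resource:
        print("using", resource)


anyio.run(streaming_task)
```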

Does a basic refactor of create_completion and create_chat_completion, moving repeated code into a separate function.

Adds a check for client disconnect before the call to the LLM. If a client creates a request but closes the connection before the server acquires the llama_proxy lock, the server now simply closes the connection instead of asking the LLM to generate a response for an already-closed connection.
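
A minimal sketch of such a check using Starlette's `Request.is_disconnected()` (the endpoint body is illustrative, not the PR's actual code):

```python
from fastapi import FastAPI, Request

app = FastAPI()


@app.post("/v1/completions")
async def create_completion(request: Request) -> dict:
    # (in the real server, the llama_proxy lock is acquired here first)
    if await request.is_disconnected():
        # The client hung up while the request waited for the lock;
        # bail out instead of generating for a closed connection.
        return {}
    return {"choices": [{"text": "stub"}]}  # placeholder response
```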

Moves locking for streaming into the get_event_publisher call so the lock is acquired and released in the correct task for the streaming response.
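
Roughly, the shape of the publisher after the change (hypothetical signature and a stubbed get_llama_proxy; the real function in llama_cpp/server/app.py differs in detail):

```python
import contextlib
import json
import typing

import anyio
from anyio.streams.memory import MemoryObjectSendStream

llama_lock = anyio.Lock()


async def get_llama_proxy():
    # Stand-in for the real dependency guarding the shared llama instance.
    async with llama_lock:
        yield object()


async def get_event_publisher(
    inner_send_chan: MemoryObjectSendStream,
    make_iterator: typing.Callable[[typing.Any], typing.AsyncIterator[dict]],
):
    # The lock is now taken here, inside the task that produces the SSE
    # stream, so the same task both acquires and releases it.
    async with contextlib.asynccontextmanager(get_llama_proxy)() as proxy:
        async with inner_send_chan:
            async for chunk in make_iterator(proxy):
                await inner_send_chan.send(dict(data=json.dumps(chunk)))
```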
@gjpower (Contributor, Author) left a comment

Some general clarifications about the scope of this pull request

Commit: fix: correct type hints for body_model
@agronholm commented
Not your fault @gjpower but the original code looks really weird. An async generator w/o the use of @asynccontextmanager? And manual use of acquire() and release()? What was the original author thinking?

@gjpower (Contributor, Author) commented Dec 24, 2024

> Not your fault @gjpower but the original code looks really weird. An async generator w/o the use of @asynccontextmanager? And manual use of acquire() and release()? What was the original author thinking?

@agronholm It looks like they were making use of FastAPI's Depends-with-yield feature, which itself wraps the dependency resource in an async context manager.
https://fastapi.tiangolo.com/tutorial/dependencies/dependencies-with-yield/
https://github.com/fastapi/fastapi/blob/0.89.1/fastapi/dependencies/utils.py#L463

If it weren't for the streaming context, using Depends with a generator and a single lock would be sufficient. The explicit wrap with contextlib.asynccontextmanager is required so the same generator still works in the expected way with Depends for the other endpoints.
Have a look at this discussion on FastAPI fastapi/fastapi#9054 (comment)
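
For reference, the Depends-with-yield shape being discussed, in generic form (hypothetical get_db resource, not llama-cpp-python code):

```python
from fastapi import Depends, FastAPI

app = FastAPI()


async def get_db():
    db = {"connected": True}  # stand-in for acquiring a real resource
    try:
        # FastAPI enters this generator as an async context manager and
        # injects the yielded value into the endpoint.
        yield db
    finally:
        db["connected"] = False  # teardown runs after the response is sent


@app.get("/items")
async def read_items(db: dict = Depends(get_db)) -> dict:
    return {"db_connected": db["connected"]}
```

For plain endpoints this teardown timing is fine; the streaming case is what required entering the same generator manually inside the publisher task.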

@abetlen (Owner) commented Jan 8, 2025

@gjpower thank you, appreciate the fix!

@abetlen merged commit e8f14ce into abetlen:main on Jan 8, 2025
14 checks passed
Successfully merging this pull request may close these issues.

Crash due to "The current task is not holding this lock" when {"stream": true}