Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CONSVC-2052] feat: add timeout handling for query tasks #112

Merged
merged 2 commits into from
Nov 21, 2022

Conversation

ncloudioj
Copy link
Contributor

@ncloudioj ncloudioj commented Nov 14, 2022

This adds a query handling timeout for the suggest endpoint. The rationale is that since Firefox uses a timeout (currently 200ms) for each query request to Merino, Merino should do the same to ensure slow providers would not prevent Merino from serving suggestions from other providers. When timeout occurs, Merino will cancel all ongoing query tasks and log/record metrics accordingly.

Note that:

  • this works at the request-level rather than the provider-level, meaning that once the timeout gets triggered, all the unfinished query tasks will be cancelled and optionally processed by the timeout callback. It does what Merino needs for now. We can extend this to be provider-evel timeout if needed in the future.
  • asyncio.gather() also supports timeout, but it doesn't provide fine-grained control on timeout handling. Therefore, I ended up using the asyncio wait primitives. The benchmark shows that the overhead is negligible compared to asyncio.gather()

This fixes CONSVC-2052.

@Trinaa Trinaa requested a review from a team November 15, 2022 17:06
@quiiver
Copy link
Collaborator

quiiver commented Nov 15, 2022

meaning that once the timeout gets triggered, all the unfinished query tasks will be cancelled and optionally processed by the timeout callback. It does what Merino needs for now. We can extend this to be provider-evel timeout if needed in the future.

I haven't read the code yet but does this mean that completed providers will be returned in the response?

@ncloudioj
Copy link
Contributor Author

I haven't read the code yet but does this mean that completed providers will be returned in the response?

Correct.

lookups: list[Task] = []
for p in search_from:
task = metrics_client.timeit_task(
p.query(srequest), f"providers.{p.name}.query"
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So technically we are cancelling this task when we cancel the pending tasks upon timeout. does task cancellation propagate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the wait timeout occurs, we're cancelling all the pending tasks (one per each provider) one-by-one here. So there will no cancellation propagation involved here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok makes sense, I have another couple questions.

  1. the aiodogstatsd library uses a done_callback to record the timing of the underyling coroutine. does the done callback get called when a task is cancelled?
  2. according to the Task docs calling Task.cancel should throw a CancelledError exception but i'm not seeing any exception catching. Whats going on there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the aiodogstatsd library uses a done_callback to record the timing of the underyling coroutine. does the done callback get called when a task is cancelled?

Yes, it will be called as usual, meaning that it will time a cancelled task in this case. I thought about clearing the done_callback before it gets cancelled, but then figured we might want to keep it for completeness. I am open to other options, too.

according to the Task docs calling Task.cancel should throw a CancelledError exception but i'm not seeing any exception catching. Whats going on there?

We only call cancel on those pending tasks, which should be fine and no CancelledError will be raised. Are you concerning about calling cancel but somehow the task is completed at that point? In that case, it might raise I think we can add a check for that.

Copy link
Collaborator

@quiiver quiiver Nov 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you concerning about calling cancel but somehow the task is completed at that point?

Not particularly, the docs just make it seem like CancelledErrors are automatically thrown when you call cancel on a Task instance regardless of its completion status.

https://docs.python.org/3.11/library/asyncio-task.html#asyncio.Task.cancel

This arranges for a CancelledError exception to be thrown into the wrapped coroutine on the next cycle of the event loop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what you mean. I think it will raise a CancelledError if you await it or call task.result() once it gets cancelled, but neither is the case here as it just cancels them and returns them as a separate task list along with the done tasks. If the consumer decides to use them regardless, the yeah CancelledError will be thrown.

Does that make sense?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes thanks 😅

@Trinaa
Copy link
Contributor

Trinaa commented Nov 16, 2022

@ncloudioj I've merged #106, I apologize for the conflicts this creates. I'll note two things:

  1. Please move the tests/unit/test_task_runner.py module to tests/unit/util/test_task_runner.py. Perhaps propagate the util to utils directory name change to the unit test directory.

  2. (Optional) The test_with_timedout_provider could be included in the test_suggest.py, since it's a generalized suggest endpoint behaviour.

@ncloudioj
Copy link
Contributor Author

Please move the tests/unit/test_task_runner.py module to tests/unit/util/test_task_runner.py. Perhaps propagate the util to utils directory name change to the unit test directory.

Yep, sounds good, will do.

The test_with_timedout_provider could be included in the test_suggest.py, since it's a generalized suggest endpoint behaviour.

If you don't strongly disagree, I'd like to keep it in a separate test file as I am almost certain that we will add more related functionalities and test cases in the future, which might mess up those generic tests in test_suggest.py.

Copy link
Contributor

@Trinaa Trinaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things LGTM on the test side.

@ncloudioj ncloudioj merged commit 01d6a57 into main Nov 21, 2022
@ncloudioj ncloudioj deleted the suggest-api-timeout branch November 21, 2022 14:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants