Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix thread reentry deadlocks #2832

Merged

Conversation

richardsheridan
Copy link
Contributor

Should fix the timeouts @jakkdl worked on in #2827, which were introduced in #2392. Some discussion from chat:

@richardsheridan: the examples used seem to be extremely simple use cases where we do to_thread_run_sync(from_thread_run_X(afn)) back to back. if we can make a test that just loops for a few seconds doing that maybe it could reliably trigger the race for debugging? I won't be able to get to it for a while but just thinking out loud in case someone else wants to try first.

@oremanj: richardsheridan: I just took another look at the threading code changes and I think there is a race here that I missed. When a host-task run completes, the host task immediately writes the result to the queue, which unblocks the thread and makes it possible for it to terminate. The thread's termination will reschedule the host task. But there's one checkpoint that the host task executes in between putting the result on the queue and going back to sleep:

    async def run(self) -> None:
        # we use extra checkpoints to pick up and reset any context changes
        task = trio.lowlevel.current_task()
        old_context = task.context
        task.context = self.context.copy()
        try:
            await trio.lowlevel.cancel_shielded_checkpoint()
            result = await outcome.acapture(self.unprotected_afn)
            self.queue.put_nowait(result)
        finally:
            task.context = old_context
            await trio.lowlevel.cancel_shielded_checkpoint() # <-- this one

If the report_back_in_trio_thread callback is run during that checkpoint, then we will try to reschedule an already-scheduled task, which violates the task invariants and is probably what's causing this hang.

I'm going to first see how the test suite goes on my new regression test, then push the fix.

if there is a checkpoint after the result is put on the queue to the thread, there is a thread race with a chance to reschedule the task twice, causing hangs.

The try-finally suite was removed as it makes typing difficult and none of the operations can fail (barring MemoryErrors and friends.)
@richardsheridan richardsheridan force-pushed the fix_thread_reentry_deadlocks branch from 60687e4 to 66069ac Compare October 24, 2023 02:51
@codecov
Copy link

codecov bot commented Oct 24, 2023

Codecov Report

Merging #2832 (66069ac) into master (a0c480a) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #2832   +/-   ##
=======================================
  Coverage   99.16%   99.16%           
=======================================
  Files         115      115           
  Lines       17475    17483    +8     
  Branches     3111     3115    +4     
=======================================
+ Hits        17329    17337    +8     
  Misses        101      101           
  Partials       45       45           
Files Coverage Δ
trio/_tests/test_threads.py 100.00% <100.00%> (ø)
trio/_threads.py 100.00% <100.00%> (ø)

@richardsheridan
Copy link
Contributor Author

The test wasn't quite 100% reliably triggered in CI but much more frequently than the intermittent failures we had been seeing.

@oremanj oremanj merged commit 5620d17 into python-trio:master Oct 24, 2023
26 of 28 checks passed
@oremanj
Copy link
Member

oremanj commented Oct 24, 2023

Thanks for fixing this!

@richardsheridan richardsheridan deleted the fix_thread_reentry_deadlocks branch October 24, 2023 11:44
@jakkdl
Copy link
Member

jakkdl commented Oct 25, 2023

Awesome, thanks for the quick fix! 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants