
gh-128679: Fix tracemalloc.stop() race condition #128710

Closed
wants to merge 11 commits

Conversation

vstinner
Member

@vstinner vstinner commented Jan 10, 2025

Check again 'tracemalloc_config.tracing' once the GIL is held in tracemalloc_raw_alloc() and PyTraceMalloc_Track(), since another thread can call tracemalloc.stop() during the PyGILState_Ensure() call.
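
A minimal sketch of the re-check pattern described above (illustrative only, not the exact diff; track_with_gil_check() is a hypothetical name, while PyGILState_Ensure(), TABLES_LOCK() and tracemalloc_add_trace() are existing CPython/tracemalloc internals):

    static int
    track_with_gil_check(unsigned int domain, uintptr_t ptr, size_t size)
    {
        if (!tracemalloc_config.tracing) {
            return -2;  /* tracing is disabled */
        }

        PyGILState_STATE gil_state = PyGILState_Ensure();

        int res = -2;
        if (tracemalloc_config.tracing) {
            /* Still tracing: safe to touch the tables. */
            TABLES_LOCK();
            res = tracemalloc_add_trace(domain, ptr, size);
            TABLES_UNLOCK();
        }
        /* else: gh-128679, tracemalloc.stop() was called by another thread
           while this thread was waiting in PyGILState_Ensure(). */

        PyGILState_Release(gil_state);
        return res;
    }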

@ZeroIntensity
Member

Should we include a test based off the reproducer in the issue?

* Hold the table lock while calling _PyTraceMalloc_Stop().
* Add tracemalloc_raw_free().
@vstinner
Member Author

My fix was incomplete, the PyMem_RawFree() hook also had the bug:

    Fix the race condition in the PyMem_RawFree() hook
    
    * Hold the table lock while calling _PyTraceMalloc_Stop().
    * Add tracemalloc_raw_free().

Should we include a test based off the reproducer in the issue?

Ok, I added a test. It was a good idea: I found the PyMem_RawFree() bug with it :-)

@vstinner
Member Author

@ZeroIntensity: Would you mind reviewing the PR, which now has a test?

@ZeroIntensity ZeroIntensity self-requested a review January 14, 2025 16:52
@ZeroIntensity
Member

Will get to it a little later today :)

Member

@ZeroIntensity ZeroIntensity left a comment

I'm still not too comfortable relying on incidental lock ordering with the tables lock for synchronization here. Everywhere tracemalloc_config.tracing is used the GIL is held, right? We should be able to rely on the GIL for synchronization. Otherwise, let's either use a dedicated lock for it or one of the _Py_atomic APIs.

{
PyTraceMalloc_Track(123, 10, 1);

PyThread_type_lock lock = (PyThread_type_lock)data;
Member

It's a lot easier to use a PyMutex, considering you don't have to heap allocate it. I think it works without a thread state if you use _PyMutex_LockFlags(mutex, _Py_LOCK_DONT_DETACH)
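
A minimal sketch of this PyMutex suggestion (not the test added in this PR; it assumes the internal pycore_lock.h API, and the helper and variable names are illustrative):

    #include "Python.h"
    #include "pycore_lock.h"      // _PyMutex_LockFlags(), _Py_LOCK_DONT_DETACH

    static PyMutex test_mutex;    // zero-initialized; no heap allocation or free needed
    static int track_result;      // shared state protected by test_mutex

    static void
    track_race_thread(void *Py_UNUSED(data))
    {
        int res = PyTraceMalloc_Track(123, 10, 1);

        /* This thread may have no Python thread state, so take the lock
           without trying to detach from the interpreter. */
        _PyMutex_LockFlags(&test_mutex, _Py_LOCK_DONT_DETACH);
        track_result = res;
        PyMutex_Unlock(&test_mutex);
    }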

Modules/_testcapimodule.c (review thread resolved)
Comment on lines +636 to +638
    else {
        // gh-128679: tracemalloc.stop() was called by another thread
    }
Member

I think we can remove the else:

Suggested change

    -    else {
    -        // gh-128679: tracemalloc.stop() was called by another thread
    -    }
    +    // gh-128679: If we're not tracing, then tracemalloc.stop() was called by another thread

@@ -963,6 +993,10 @@ _PyTraceMalloc_Stop(void)
     if (!tracemalloc_config.tracing)
         return;

+    // Lock to synchronize with tracemalloc_raw_free() which checks
+    // 'tracing' while holding the lock.
+    TABLES_LOCK();
Member

As I said in the other PR, relying on lock ordering like this is really error prone. If we hold the GIL, then we should be able to use tracemalloc_config.tracing without a lock, right?

Member Author

As I said in the other PR, relying on lock ordering like this is really error prone.

Would you mind elaborating?

Member

See my comment: #128710 (comment)

@vstinner
Member Author

Everywhere tracemalloc_config.tracing is used the GIL is held, right? We should be able to rely on the GIL for synchronization.

This PR adds tracemalloc_raw_free(), which is called without holding the GIL and cannot acquire the GIL:

     /* GIL cannot be locked in PyMem_RawFree() because it would introduce
        a deadlock in _PyThreadState_DeleteCurrent(). */

So I decided to reuse TABLES_LOCK() to access tracemalloc_config.tracing in tracemalloc_raw_free(). For that, I also modified _PyTraceMalloc_Stop() to use TABLES_LOCK() while modifying tracemalloc_config.tracing (and the different tables).

tracemalloc_free() is called with the GIL held. I added tracemalloc_raw_free() to check tracemalloc_config.tracing and handle the tracemalloc.stop() race condition.
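
A sketch of the tracemalloc_raw_free() logic described above (the actual diff may differ slightly; TABLES_LOCK(), REMOVE_TRACE() and tracemalloc_config are existing tracemalloc.c internals):

    static void
    tracemalloc_raw_free(void *ctx, void *ptr)
    {
        PyMemAllocatorEx *alloc = (PyMemAllocatorEx *)ctx;

        alloc->free(alloc->ctx, ptr);

        /* The GIL cannot be used here, so rely on TABLES_LOCK() to
           synchronize with _PyTraceMalloc_Stop(), which clears 'tracing'
           while holding the same lock. */
        TABLES_LOCK();
        if (tracemalloc_config.tracing) {
            REMOVE_TRACE(ptr);
        }
        /* else: gh-128679, tracemalloc.stop() was called by another thread
           and the tables are already cleared. */
        TABLES_UNLOCK();
    }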

@ZeroIntensity
Member

So I decided to reuse TABLES_LOCK() to access tracemalloc_config.tracing in tracemalloc_raw_free(). For that, I also modified _PyTraceMalloc_Stop() to use TABLES_LOCK() while modifying tracemalloc_config.tracing (and the different tables).

Right, so this is only safe because of lock-ordering. This approach with the tables lock only works for this case because we assume the only thread that can modify tracemalloc_config.tracing during shutdown also holds the tables lock. But, that's not always true, because not all reads and writes are locked. For example, _PyTraceMalloc_IsTracing and _PyTraceMalloc_ClearTraces are both called without the tables lock held, so that will race. You could fix that by wrapping them with the tables lock too, but I think the much cleaner solution would be to use a dedicated lock (or atomics) for tracemalloc_config.tracing.

My other concern is re-entrancy and lock-ordering deadlocks. There are two cases here, depending on what PyThread_acquire_lock does when waiting:

  • If PyThread_acquire_lock releases the GIL while waiting, then we need to add an extra check for tracemalloc_config.tracing every time we use TABLES_LOCK(), because the field could have changed while the GIL was released.
  • If PyThread_acquire_lock doesn't release the GIL, then that's even worse. Imagine thread 1 is waiting on the GIL, and thread 2 holds the GIL. Thread 2 calls into _PyTraceMalloc_Stop, grabs the table lock, and then releases the GIL via an object's finalizer when clearing. Thread 2 is now waiting to re-acquire the GIL. Then, thread 1 takes the GIL and starts waiting on the tables lock. Now we're in a deadlock! Alternatively, an object's finalizer could attempt to acquire the tables lock, which will also deadlock.

Either way, it's problematic.

@tom-pytel
Contributor

  • If PyThread_acquire_lock releases the GIL while waiting

PyThread_acquire_lock() can release the GIL? I only see that explicitly in PyThread_acquire_lock_timed_with_retries(), which is not called by TABLES_LOCK(). What am I missing?

@vstinner
Member Author

If PyThread_acquire_lock releases the GIL while waiting

It doesn't release the GIL.

If PyThread_acquire_lock doesn't release the GIL, then that's even worse. Imagine thread 1 is waiting on the GIL, and thread 2 holds the GIL. Thread 2 calls into _PyTraceMalloc_Stop, grabs the table lock, and then releases the GIL via an object's finalizer when clearing.

_PyTraceMalloc_Stop() cannot release the GIL. It calls _Py_hashtable_clear() on these tables:

  • tracemalloc_traces: value_destroy_func=raw_free().
  • tracemalloc_domains: value_destroy_func=_Py_hashtable_destroy().
  • tracemalloc_tracebacks: key_destroy_func=raw_free().
  • tracemalloc_filenames: key_destroy_func=tracemalloc_clear_filename() which calls Py_DECREF() (filenames are Python str objects).

Py_DECREF(filename) cannot release the GIL; it's a simple Python str object.

tracemalloc is designed around "low-level" functions like the Py_hashtable API which should not release the GIL. Only the PyMem_RawMalloc() hook and PyTraceMalloc_Track() function explicitly acquire/release the GIL.


The tricky part is removing a trace in the PyMem_RawFree() hook, which is called without holding the GIL and cannot even acquire it temporarily.

@ZeroIntensity
Member

Generally, things release the GIL when they're waiting on something, like a lock. PyMutex releases it by default, but I'm not sure if PyThread_acquire_lock does the same. But, if it doesn't, then that's significantly worse because we're prone to deadlocks.

_PyTraceMalloc_Stop() cannot release the GIL.

Ok, good to know. I was under the impression that any Py_DECREF could release the GIL because it could trigger a garbage collection, which will in turn call other finalizers, but I guess not.

Anyways, the fix here would be a lot simpler (and less ambiguous) if we went with a dedicated lock for tracing, but if we're really opposed to doing that, I think we could set up a pending call with Py_AddPendingCall that will eventually deallocate it on the main thread.
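
A rough sketch of that Py_AddPendingCall idea (not the approach taken in this PR; raw_free_pending() is a hypothetical helper, while allocators.raw, TABLES_LOCK() and REMOVE_TRACE() are existing tracemalloc.c internals):

    static int
    raw_free_pending(void *ptr)
    {
        /* Runs later in the main thread, with the GIL held. */
        TABLES_LOCK();
        if (tracemalloc_config.tracing) {
            REMOVE_TRACE(ptr);
        }
        TABLES_UNLOCK();
        allocators.raw.free(allocators.raw.ctx, ptr);
        return 0;
    }

    /* In the PyMem_RawFree() hook (Py_AddPendingCall() does not need the
       GIL): queue the work instead of freeing inline. Handling a full
       pending-call queue is left out of this sketch. */
    /*     (void)Py_AddPendingCall(raw_free_pending, ptr); */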

@vstinner
Member Author

Otherwise, let's either use a dedicated lock for it or one of the _Py_atomic APIs.

There are multiple constraints.

  • (A) Only one thread at a time may access the tracemalloc tables. TABLES_LOCK() is used for ordering operations on tables.
  • (B) tracemalloc must stop adding traces as soon as _PyTraceMalloc_Stop() is called, i.e. as soon as tracemalloc_config.tracing is zero.

(B) means that tracemalloc_config.tracing must not change while the tables are being modified.

My fix for the PyMem_RawFree() hook combines constraints (A) and (B) by using TABLES_LOCK() for both, since the GIL cannot be used.

@vstinner
Member Author

PyThread_acquire_lock() is a thin wrapper around the pthread API; it is not aware of the GIL.

@tom-pytel
Contributor

tom-pytel commented Jan 15, 2025

Generally, things release the GIL when they're waiting on something, like a lock. PyMutex releases it by default, but I'm not sure if PyThread_acquire_lock does the same. But, if it doesn't, then that's significantly worse because we're prone to deadlocks.

Are you sure about PyMutex? The worst thing I see in _PyMutex_LockTimed() is a call to Py_MakePendingCalls(), which I thought was not supposed to release the GIL and which only actually executes in the main thread?

EDIT: Never mind, found it.

@ZeroIntensity
Member

TABLES_LOCK() is used for ordering operations on tables.

Well, sort of. It's used for protecting the operations from multithreaded races, not necessarily synchronizing the threads in a certain order.

Anyways, the fix works, but only for the specific test. I'd rather use dedicated locks or atomics because:

  • This will still race with PyTraceMalloc_Track if it's called without a thread state, because the initial tracemalloc_config.tracing read is done without the GIL or tables lock held.
  • This will likely end up being a maintenance burden; if we ever add a case where the GIL is acquired or re-acquired while holding the tables lock, then we're going to be cloudy with a chance of deadlocks.

So, I'm willing to approve this as a band-aid for 3.12, but if we go with this, let's try to get a better fix into 3.13 and main (where _Py_atomic and PyMutex are available).
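
For reference, a sketch of the _Py_atomic alternative mentioned above for 3.13+/main (not code from this PR; the helper names are illustrative and assume the _Py_atomic_*_int_relaxed functions available in recent CPython):

    static inline int
    tracemalloc_is_tracing(void)
    {
        /* Safe to call without the GIL or the tables lock. */
        return _Py_atomic_load_int_relaxed(&tracemalloc_config.tracing);
    }

    static inline void
    tracemalloc_set_tracing(int tracing)
    {
        /* _PyTraceMalloc_Start() / _PyTraceMalloc_Stop() would use this
           instead of a plain assignment. */
        _Py_atomic_store_int_relaxed(&tracemalloc_config.tracing, tracing);
    }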

@vstinner
Member Author

Here is something different: I wrote PR gh-128888 to redesign tracemalloc locking. It holds TABLES_LOCK() longer and uses it to protect tracemalloc_config.tracing. It doesn't fix #128679 but prepares the code to be able to fix the issue in a reliable way.

@@ -1317,9 +1353,16 @@ PyTraceMalloc_Track(unsigned int domain, uintptr_t ptr,

Member

Let's either lock the read above this or remove it entirely, then LGTM for 3.12.

Member Author

I wrote #128897 to fix Python 3.12 and 3.13.

@vstinner
Member Author

I wrote a different fix based on my refactoring that I just merged: #128893.

@vstinner vstinner closed this Jan 15, 2025
@vstinner vstinner deleted the tracemalloc branch January 15, 2025 20:49