Race condition in tracemalloc causes segfaults #128679
Comments
/* mymod: minimal C extension that reproduces the crash by calling
   PyTraceMalloc_Track() from threads that do not hold the GIL. */
#include <pthread.h>
#include "Python.h"

void *
threadfunc(void *p) {
    /* Runs in a raw pthread: PyTraceMalloc_Track() has to acquire the GIL
       itself, which is where it races with tracemalloc.stop(). */
    PyTraceMalloc_Track(123, 10, 1);
    return NULL;
}

static PyObject *
test(PyObject *self, PyObject *args) {
    /* Spawn up to 50 detached threads that each record one fake trace. */
    for (int i = 0; i < 50; i++) {
        pthread_t p;
        if (pthread_create(&p, NULL, threadfunc, NULL) != 0)
            break;
        pthread_detach(p);
    }
    Py_RETURN_NONE;
}

static PyMethodDef module_methods[] = {
    {"test", test, METH_NOARGS},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef module_definition = {
    PyModuleDef_HEAD_INIT,
    "mymod",
    "test module",
    -1,
    module_methods
};

PyMODINIT_FUNC
PyInit_mymod(void) {
    return PyModule_Create(&module_definition);
}
# repro.py: start tracing, kick off the racing threads, then stop tracing
# while some of those threads may still be inside PyTraceMalloc_Track().
import gc
import time
import tracemalloc

import mymod

tracemalloc.start()
mymod.test()
tracemalloc.stop()
gc.collect()
print('Waiting...')
time.sleep(10)
print('Done.')

Here are the changes as a starting point for the eventual fix:

$ git diff main
diff --git a/Python/tracemalloc.c b/Python/tracemalloc.c
index f661d69c03..d2e3dfc53b 100644
--- a/Python/tracemalloc.c
+++ b/Python/tracemalloc.c
@@ -538,11 +538,13 @@ tracemalloc_alloc(int use_calloc, void *ctx, size_t nelem, size_t elsize)
return NULL;
TABLES_LOCK();
- if (ADD_TRACE(ptr, nelem * elsize) < 0) {
- /* Failed to allocate a trace for the new memory block */
- TABLES_UNLOCK();
- alloc->free(alloc->ctx, ptr);
- return NULL;
+ if (tracemalloc_config.tracing) {
+ if (ADD_TRACE(ptr, nelem * elsize) < 0) {
+ /* Failed to allocate a trace for the new memory block */
+ TABLES_UNLOCK();
+ alloc->free(alloc->ctx, ptr);
+ return NULL;
+ }
}
TABLES_UNLOCK();
return ptr;
@@ -963,8 +965,11 @@ _PyTraceMalloc_Stop(void)
if (!tracemalloc_config.tracing)
return;
- /* stop tracing Python memory allocations */
+ /* stop tracing Python memory allocations,
+ but not while something might be in the middle of an operation */
+ TABLES_LOCK();
tracemalloc_config.tracing = 0;
+ TABLES_UNLOCK();
/* unregister the hook on memory allocators */
#ifdef TRACE_RAW_MALLOC
@@ -1317,6 +1322,12 @@ PyTraceMalloc_Track(unsigned int domain, uintptr_t ptr,
gil_state = PyGILState_Ensure();
+ if (!tracemalloc_config.tracing) {
+ /* tracing may have been turned off as we were acquiring the GIL */
+ PyGILState_Release(gil_state);
+ return -2;
+ }
+
TABLES_LOCK();
res = tracemalloc_add_trace(domain, ptr, size);
TABLES_UNLOCK();

As stated, the C extension module reproduces the problem 100% of the time and this fix appears to fix it 100% of the time, but the person in charge of tracemalloc should really have a look at this.
Thank you for the quick response and consistent reproducer!
@tom-pytel I can confirm that your patch appears to fix the problem. Using the
Done. |
Check again 'tracemalloc_config.tracing' once the GIL is held in tracemalloc_raw_alloc() and PyTraceMalloc_Track(), since another thread can call tracemalloc.stop() during PyGILState_Ensure() call.
* tracemalloc_realloc_gil() and tracemalloc_raw_realloc() no longer remove the trace on reentrant call.
* _PyTraceMalloc_Stop() unregisters _PyTraceMalloc_TraceRef().
* _PyTraceMalloc_GetTraces() sets the reentrant flag.
* tracemalloc_clear_traces_unlocked() sets the reentrant flag.
* Use TABLES_LOCK() to protect 'tracemalloc_config.tracing'.
* Hold TABLES_LOCK() longer while accessing tables.
* tracemalloc_realloc_gil() and tracemalloc_raw_realloc() no longer remove the trace on reentrant call.
* _PyTraceMalloc_Stop() unregisters _PyTraceMalloc_TraceRef().
* _PyTraceMalloc_GetTraces() sets the reentrant flag.
* tracemalloc_clear_traces_unlocked() sets the reentrant flag.

* Use TABLES_LOCK() to protect 'tracemalloc_config.tracing'.
* Hold TABLES_LOCK() longer while accessing tables.
* tracemalloc_realloc() and tracemalloc_free() no longer remove the trace on reentrant call.
* _PyTraceMalloc_Stop() unregisters _PyTraceMalloc_TraceRef().
* _PyTraceMalloc_GetTraces() sets the reentrant flag.
* tracemalloc_clear_traces_unlocked() sets the reentrant flag.
tracemalloc_alloc(), tracemalloc_realloc(), tracemalloc_free(), _PyTraceMalloc_TraceRef() and _PyTraceMalloc_GetMemory() now check tracemalloc_config.tracing after calling TABLES_LOCK().
tracemalloc_alloc(), tracemalloc_realloc(), tracemalloc_free(), _PyTraceMalloc_TraceRef() and _PyTraceMalloc_GetMemory() now check tracemalloc_config.tracing after calling TABLES_LOCK(). _PyTraceMalloc_TraceRef() now always returns 0.
tracemalloc_alloc(), tracemalloc_realloc(), PyTraceMalloc_Track(), PyTraceMalloc_Untrack() and _PyTraceMalloc_TraceRef() now check tracemalloc_config.tracing after calling TABLES_LOCK(). _PyTraceMalloc_Stop() now protects more code with TABLES_LOCK(), especially setting tracemalloc_config.tracing to 1. Add a test using PyTraceMalloc_Track() to test tracemalloc.stop() race condition.
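The commit message above describes the final approach: the tracing flag is only trusted once TABLES_LOCK() is held, the same lock that _PyTraceMalloc_Stop() now takes before clearing it. As a rough illustration, PyTraceMalloc_Track() ends up shaped roughly like this (a sketch pieced together from the commit message and the earlier diff, not the verbatim code in Python/tracemalloc.c):

/* Sketch only: internal names (TABLES_LOCK, tracemalloc_config,
 * tracemalloc_add_trace) are taken from the diff quoted earlier in this
 * thread; the exact final code may differ. */
int
PyTraceMalloc_Track(unsigned int domain, uintptr_t ptr, size_t size)
{
    PyGILState_STATE gil_state = PyGILState_Ensure();
    int res;

    TABLES_LOCK();
    if (!tracemalloc_config.tracing) {
        /* tracemalloc.stop() won the race while this thread was acquiring
         * the GIL: the tables may already be gone, so do not touch them. */
        res = -2;
    }
    else {
        res = tracemalloc_add_trace(domain, ptr, size);
    }
    TABLES_UNLOCK();

    PyGILState_Release(gil_state);
    return res;
}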
I fixed race conditions in tracemalloc itself, but a similar problem remains in the raw allocator functions: PyMem_RawMalloc() and PyMem_RawFree() read the global allocator without holding any lock. For example:

void PyMem_RawFree(void *ptr)
{
    _PyMem_Raw.free(_PyMem_Raw.ctx, ptr);
}

A Python memory allocator is made of two main parts: a context pointer (ctx) and the set of allocation functions (malloc, calloc, realloc, free) that receive that context.
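For reference, the public PyMemAllocatorEx structure bundles those two parts; it looks roughly like this (reproduced from memory of the CPython headers, so treat it as a sketch rather than the authoritative declaration):

typedef struct {
    /* user context passed as the first argument to the four functions */
    void *ctx;

    /* allocate a memory block */
    void* (*malloc)(void *ctx, size_t size);

    /* allocate a memory block initialized with zeros */
    void* (*calloc)(void *ctx, size_t nelem, size_t elsize);

    /* allocate or resize a memory block */
    void* (*realloc)(void *ctx, void *ptr, size_t new_size);

    /* release a memory block */
    void (*free)(void *ctx, void *ptr);
} PyMemAllocatorEx;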
The lock makes sure that writers (PyMem_SetAllocator()) update the context and the function pointers as one consistent unit. It's possible to fix the readers too, for example by copying the allocator under the lock before calling it:

void PyMem_RawFree(void *ptr)
{
    PyMutex_Lock(&ALLOCATORS_MUTEX);
    PyMemAllocatorEx raw = _PyMem_Raw;
    PyMutex_Unlock(&ALLOCATORS_MUTEX);
    raw.free(raw.ctx, ptr);
}

The new problem is that Python allocates a lot of small memory blocks (it calls the memory allocator very often), so any slowdown on the allocation path would have a significant impact on Python's overall performance.
The race condition is as old as Python 3.4 and PEP 445 – Add new APIs to customize Python memory allocators.
I'm not sure customizing the allocators is even supported under free-threading. Mimalloc must be used for the object allocator in order for the GC to work.
But I don't think that
There is a race condition between PyMem_SetAllocator() and PyMem_RawMalloc()/PyMem_RawFree(). While PyMem_SetAllocator() write is protected by a lock, PyMem_RawMalloc()/PyMem_RawFree() reads are not protected by a lock. PyMem_RawMalloc()/PyMem_RawFree() can be called with an old context and the new function pointer. On a release build, it's not an issue since the context is not used. On a debug build, the debug hooks use the context and so can crash.
…ython#128988) There is a race condition between PyMem_SetAllocator() and PyMem_RawMalloc()/PyMem_RawFree(). While PyMem_SetAllocator() write is protected by a lock, PyMem_RawMalloc()/PyMem_RawFree() reads are not protected by a lock. PyMem_RawMalloc()/PyMem_RawFree() can be called with an old context and the new function pointer. On a release build, it's not an issue since the context is not used. On a debug build, the debug hooks use the context and so can crash. (cherry picked from commit 9bc1964)
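To make the failure mode in these commit messages concrete, here is a deliberately simplified, hypothetical model (the names raw_allocator_t, raw and racy_raw_free are illustrative only, not CPython source): the allocator is a (context, function pointer) pair read without a lock, so a reader can observe the new function together with the old context.

/* Simplified model of the PyMem_SetAllocator() race: 'raw' stands in for
 * _PyMem_Raw and is read without any synchronization. */
typedef struct {
    void *ctx;
    void (*free)(void *ctx, void *ptr);
} raw_allocator_t;

static raw_allocator_t raw;

void
racy_raw_free(void *ptr)
{
    /* Two separate unsynchronized reads: a concurrent PyMem_SetAllocator()
     * can update 'raw' in between, leaving 'ctx' pointing at the old
     * context while 'f' is already the new function. On a release build
     * the context is unused, but the debug hooks dereference it and can
     * crash. */
    void *ctx = raw.ctx;
    void (*f)(void *, void *) = raw.free;
    f(ctx, ptr);
}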
tracemalloc_alloc(), tracemalloc_realloc(), PyTraceMalloc_Track(), PyTraceMalloc_Untrack() and _PyTraceMalloc_TraceRef() now check tracemalloc_config.tracing after calling TABLES_LOCK(). _PyTraceMalloc_Stop() now protects more code with TABLES_LOCK(), especially setting tracemalloc_config.tracing to 1. Add a test using PyTraceMalloc_Track() to test tracemalloc.stop() race condition. Call _PyTraceMalloc_Init() at Python startup.
…129022) [3.13] gh-128679: Fix tracemalloc.stop() race conditions (#128897) tracemalloc_alloc(), tracemalloc_realloc(), PyTraceMalloc_Track(), PyTraceMalloc_Untrack() and _PyTraceMalloc_TraceRef() now check tracemalloc_config.tracing after calling TABLES_LOCK(). _PyTraceMalloc_Stop() now protects more code with TABLES_LOCK(), especially setting tracemalloc_config.tracing to 1. Add a test using PyTraceMalloc_Track() to test tracemalloc.stop() race condition. Call _PyTraceMalloc_Init() at Python startup. (cherry picked from commit 6b47499)
Ok. Sadly, the added test crashes on a debug build (especially on FreeBSD); the code is only reliable on a release build. I'm closing the issue. Fixing the debug hooks on memory allocators to make the code atomic would badly impact Python performance, so it's a no-go.
Thank you to everyone who helped fix this!
Thanks @tom-pytel for the initial fix! The final fix is more complex since we decided to fix more functions and more race conditions.
Replace uncommon PyGILState_GetThisThreadState() with common _PyThreadState_GET().
_PyTraceMalloc_Stop() now calls PyRefTracer_SetTracer(NULL, NULL).
Crash report
What happened?
This is a bit of a tricky situation, but it is real and it impacts my ability to use tracemalloc. As background, I've added code to Polars so that it records all of its allocations in tracemalloc, and this is enabled in debug builds. That makes it possible to write unit tests that check memory usage, which is very useful for verifying that high memory usage has actually been fixed and for making sure it doesn't creep back up.
Unfortunately, I'm hitting a situation where tracemalloc causes segfaults in multi-threaded situations. I believe this is a race condition between PyTraceMalloc_Track() running in a new non-Python thread that does not hold the GIL, and tracemalloc.stop() being called in another thread. My hypothesis in detail: tracing is enabled with tracemalloc.start(); the new thread calls PyTraceMalloc_Track(), sees that tracing is on, and then blocks while acquiring the GIL; meanwhile the other thread calls tracemalloc.stop(); when the first thread finally gets the GIL, it touches tracemalloc state that has already been torn down.

If this hypothesis is correct, one solution would be for GIL acquisition to bypass tracemalloc altogether if it allocates; it's not like it allocates a lot of memory, so not tracking it is fine. This may be difficult in practice, so another approach would involve an additional lock so there's no race condition around checking whether tracemalloc is enabled.
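To make the hypothesis concrete, the pre-fix flow looks roughly like this (a sketch based on the tracemalloc internals shown in the diff quoted elsewhere in this issue, not the verbatim Python/tracemalloc.c source):

/* Sketch of the racy pre-fix pattern: the tracing flag is tested before
 * the GIL is acquired, leaving a window for tracemalloc.stop() to free
 * the trace tables. */
int
PyTraceMalloc_Track(unsigned int domain, uintptr_t ptr, size_t size)
{
    if (!tracemalloc_config.tracing) {
        /* not tracing: nothing to record */
        return -2;
    }

    /* <-- another thread may run tracemalloc.stop() here, or while the
     *     next call blocks waiting for the GIL */
    PyGILState_STATE gil_state = PyGILState_Ensure();

    TABLES_LOCK();
    /* the tables this touches may already have been freed by stop() */
    int res = tracemalloc_add_trace(domain, ptr, size);
    TABLES_UNLOCK();

    PyGILState_Release(gil_state);
    return res;
}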
Here is a stack trace from a coredump from the reproducer (see below) that led me to the above hypothesis:
To run the reproducer you will need to pip install rustimport and have Rust installed. (I tried with Cython, had a hard time, gave up.)

Here's the Python file:

And here is the Rust file, you should call it tracemalloc_repro.rs:

You can reproduce by calling repro.py. Because this is a race condition, you may need to run it a few times; I had more consistent crashes with Python 3.12, but it does crash on Python 3.13. You may need to tweak the number 50 above to make it happen.

CPython versions tested on:
3.12, 3.13
Operating systems tested on:
Linux
Output from running 'python -VV' on the command line:
Python 3.13.1 (main, Dec 4 2024, 08:54:15) [GCC 13.2.0]
Linked PRs