-
-
Notifications
You must be signed in to change notification settings - Fork 31k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-128679: fix race condition in tracemalloc #128695
Conversation
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, tracemalloc_config.tracing
isn't thread-safe even with the GIL, because it's read while the GIL is released. It's probably even worse on free-threading. (I think there's an issue somewhere about tracemalloc being non-thread-safe for FT.)
So, we need to either be more strict with the rules on when you can access it, or use atomic reads and writes. Something like _Py_atomic_load_int_relaxed
should do.
/* stop tracing Python memory allocations */ | ||
/* stop tracing Python memory allocations, | ||
but not while something might be in the middle of an operation */ | ||
TABLES_LOCK(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to lock writes, we need to lock reads as well, but see my other comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading without lock like the first check in PyTraceMalloc_Track()
can be done for an early out without problems in that particular case (though this absolutely necessitates the second check after GIL acquire). Atomic operations should be used in these cases though where they might be outside GIL protection, that should be safe then if relying on GIL. For free-threading safety more changes would be needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't really mix and match atomics. If something relies on a lock for thread-safety, then you must hold that lock in order to read or write to it. It's not safe to rely on the GIL for some reads, and then rely on a different lock for others. It might be OK in some cases with the GIL because of implicit lock-ordering, but that can break very easily.
In this specific case, this would race with PyTraceMalloc_Track
if it was called without a thread state (i.e., the GIL), because that doesn't hold the tables lock when reading, so there's no synchronization between the two operations. It's only slightly OK because the GIL will serialize it for you in almost every case--there are very few people that are using PyTraceMalloc_Track
while their thread state is detached. But that's besides the point, we should go for a more robust fix here rather than relying on lock-ordering with the GIL. (We'll also kill two birds with one stone--it will fix it for both the default build and for free-threading.)
I would either go with atomic reads and writes, or a compare-exchange if there ends up being some problems with inconsistent state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced that adding TABLES_LOCK/TABLES_UNLOCK while setting tracing avoids a race condition. At least, I don't understand how. What are your trying to protect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced that adding TABLES_LOCK/TABLES_UNLOCK while setting tracing avoids a race condition.
I'm not sure either, I was practicing defensive programming. Dunno if setting this to 0 here would cause a problem for an operation simultaneously taking place in another thread in a tables_lock
critical section but the possible outcomes are:
- Doesn't cause a problem and protect - good outcome.
- Causes a problem and protect - good outcome.
- Doesn't cause a problem and no protect - good outcome.
- Causes a problem and no protect - bad outcome.
I am assuming @vstinner you are familiar with the codebase so if you say this would not cause a problem then I can certainly remove it.
I would either go with atomic reads and writes, or a compare-exchange if there ends up being some problems with inconsistent state.
Neither of those would solve the OP problem by themselves and just changing the tracing
reads and writes to atomic or cmpxchg wouldn't be enough for free-threading either as you would need to explicitly replicate the GIL protection mechanism. On the other hand they won't hurt anything so if you want I can change the tracing
reads/writes to atomic load/store? (they don't actually do anything special on x86 anyway)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, switching it to completely atomic won't fix it, but it's a good start because again, this approach is error-prone. It might pass on the test case, but that doesn't necessarily mean it's the correct approach.
Imagine a case where:
- Thread 1 calls
tracemalloc_alloc
with the GIL held. - Thread 2 calls
PyGILState_Ensure
and is now blocked until it's released. - Thread 1 acquires the tables lock.
- Thread 1 triggers a garbage collection via
ADD_TRACE
- A finalizer releases the GIL.
- Thread 2 takes the GIL, and begins waiting on the tables lock.
- Deadlock ensues, as thread 1 is waiting on the GIL, and thread 2 is waiting on the tables lock. (This wouldn't be an issue if
PyThread_acquire_lock
released the GIL, but I'm not certain that it does, because that's not desirable for all locks.)
If you think that atomic operations will be too difficult to use here, then you can use a dedicated PyMutex
for the tracing
field, but then that lock should be held for absolutely all reads and writes to tracing
.
/* stop tracing Python memory allocations */ | ||
/* stop tracing Python memory allocations, | ||
but not while something might be in the middle of an operation */ | ||
TABLES_LOCK(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced that adding TABLES_LOCK/TABLES_UNLOCK while setting tracing avoids a race condition. At least, I don't understand how. What are your trying to protect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tracemalloc_config.tracing
variable is protected by the GIL:
cpython/Include/internal/pycore_tracemalloc.h
Lines 27 to 29 in 802556a
/* Is tracemalloc tracing memory allocations? | |
Variable protected by the GIL */ | |
int tracing; |
There is no need to add TABLES_LOCK(), it's redundant.
I wrote a different fix: PR gh-128710.
Tested, works, cleaner, closing out this one. |
tracemalloc_alloc()
to not add a trace if tracing was turned off while the GIL was being acquired.PyTraceMalloc_Track()
for tracing after acquiring GIL in case tracing was turned off while GIL was being acquired.tracemalloc_config.tracing = 0
in_PyTraceMalloc_Stop()
usingtable_lock
to avoid turning it off in the middle of critical operations.