
gh-128679: fix race condition in tracemalloc #128695

Closed
wants to merge 3 commits

Conversation

@tom-pytel tom-pytel (Contributor) commented Jan 9, 2025

  • Add an emergency fallback in tracemalloc_alloc() so that no trace is added if tracing was turned off while the GIL was being acquired.
  • Add a secondary tracing check in PyTraceMalloc_Track() after the GIL is acquired, in case tracing was turned off while the GIL was being acquired (rough sketch below).
  • Protect tracemalloc_config.tracing = 0 in _PyTraceMalloc_Stop() with TABLES_LOCK() to avoid turning tracing off in the middle of a critical operation.
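
A rough sketch of the PyTraceMalloc_Track() part of the change (the internal helper name and exact error handling are approximations, not the actual diff):

/* Sketch only: re-check tracing once the GIL is actually held, because
   _PyTraceMalloc_Stop() may have run while this thread was blocked in
   PyGILState_Ensure(). */
int
PyTraceMalloc_Track(unsigned int domain, uintptr_t ptr, size_t size)
{
    if (!tracemalloc_config.tracing) {
        /* Cheap early out: tracing is already off. */
        return -2;
    }

    PyGILState_STATE gil_state = PyGILState_Ensure();

    if (!tracemalloc_config.tracing) {
        /* Tracing was stopped while we waited for the GIL:
           do not add a trace. */
        PyGILState_Release(gil_state);
        return -2;
    }

    TABLES_LOCK();
    int res = tracemalloc_add_trace(domain, ptr, size);  /* approximate name */
    TABLES_UNLOCK();

    PyGILState_Release(gil_state);
    return res;
}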

@bedevere-app bedevere-app bot commented Jan 9, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@ZeroIntensity ZeroIntensity (Member) left a comment

As far as I can tell, tracemalloc_config.tracing isn't thread-safe even with the GIL, because it's read while the GIL is released. It's probably even worse on free-threading. (I think there's an issue somewhere about tracemalloc being non-thread-safe for FT.)

So, we need to either be more strict with the rules on when you can access it, or use atomic reads and writes. Something like _Py_atomic_load_int_relaxed should do.
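
A minimal sketch of the atomic approach, assuming the relaxed int helpers available in recent CPython (exact header and spelling depend on the CPython version):

/* Sketch: route every read/write of the flag through relaxed atomics so it
   can be observed safely while the GIL is released.
   _Py_atomic_load_int_relaxed()/_Py_atomic_store_int_relaxed() are assumed
   to be available (CPython 3.13+ pyatomic helpers). */
static inline int
tracemalloc_is_tracing(void)
{
    return _Py_atomic_load_int_relaxed(&tracemalloc_config.tracing);
}

static inline void
tracemalloc_set_tracing(int tracing)
{
    _Py_atomic_store_int_relaxed(&tracemalloc_config.tracing, tracing);
}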

/* stop tracing Python memory allocations */
/* stop tracing Python memory allocations,
but not while something might be in the middle of an operation */
TABLES_LOCK();
Member

If we're going to lock writes, we need to lock reads as well, but see my other comment.

Contributor Author

Reading without the lock, as in the first check in PyTraceMalloc_Track(), is fine as an early out in that particular case (though it absolutely necessitates the second check after the GIL is acquired). Atomic operations should be used where such reads can happen outside GIL protection; that is safe as long as we rely on the GIL. For free-threading safety, more changes would be needed.

Member

We can't really mix and match atomics. If something relies on a lock for thread-safety, then you must hold that lock in order to read or write to it. It's not safe to rely on the GIL for some reads, and then rely on a different lock for others. It might be OK in some cases with the GIL because of implicit lock-ordering, but that can break very easily.

In this specific case, this would race with PyTraceMalloc_Track if it was called without a thread state (i.e., without the GIL), because that path doesn't hold the tables lock when reading, so there's no synchronization between the two operations. It's only slightly OK because the GIL will serialize it for you in almost every case; very few people call PyTraceMalloc_Track while their thread state is detached. But that's beside the point: we should go for a more robust fix here rather than relying on lock-ordering with the GIL. (We'll also kill two birds with one stone: it fixes both the default build and free-threading.)

I would either go with atomic reads and writes, or a compare-exchange if there ends up being some problems with inconsistent state.
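
For the compare-exchange variant, a sketch of the idea (assuming the _Py_atomic_compare_exchange_int() helper; the teardown body is elided):

/* Sketch: only the thread that actually flips tracing from 1 to 0 performs
   the teardown, so a stop racing with another stop (or with a start) cannot
   tear the tables down twice. */
void
_PyTraceMalloc_Stop(void)
{
    int expected = 1;
    if (!_Py_atomic_compare_exchange_int(&tracemalloc_config.tracing,
                                         &expected, 0)) {
        /* Already stopped (or never started): nothing to do. */
        return;
    }

    TABLES_LOCK();
    /* ... free the traces/tracebacks tables, restore the allocators ... */
    TABLES_UNLOCK();
}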

Member

I'm not convinced that adding TABLES_LOCK/TABLES_UNLOCK while setting tracing avoids a race condition. At least, I don't understand how. What are you trying to protect?

Contributor Author

> I'm not convinced that adding TABLES_LOCK/TABLES_UNLOCK while setting tracing avoids a race condition.

I'm not sure either; I was practicing defensive programming. I don't know whether setting this to 0 here would cause a problem for an operation simultaneously taking place in another thread inside a TABLES_LOCK critical section, but the possible outcomes are:

  1. Doesn't cause a problem, and we protect - good outcome.
  2. Causes a problem, and we protect - good outcome.
  3. Doesn't cause a problem, and we don't protect - good outcome.
  4. Causes a problem, and we don't protect - bad outcome.

I am assuming, @vstinner, that you are familiar with the codebase, so if you say this would not cause a problem then I can certainly remove it.

> I would either go with atomic reads and writes, or a compare-exchange if there ends up being some problems with inconsistent state.

Neither of those would solve the OP's problem by itself, and just changing the tracing reads and writes to atomics or a cmpxchg wouldn't be enough for free-threading either, since you would need to explicitly replicate the GIL's protection. On the other hand, they won't hurt anything, so if you want I can change the tracing reads/writes to atomic load/store? (They don't actually do anything special on x86 anyway.)

Member

Right, switching it to completely atomic won't fix it, but it's a good start because, again, this approach is error-prone. It might pass the test case, but that doesn't necessarily mean it's the correct approach.

Imagine a case where:

  • Thread 1 calls tracemalloc_alloc with the GIL held.
  • Thread 2 calls PyGILState_Ensure and is now blocked until it's released.
  • Thread 1 acquires the tables lock.
  • Thread 1 triggers a garbage collection via ADD_TRACE
  • A finalizer releases the GIL.
  • Thread 2 takes the GIL, and begins waiting on the tables lock.
  • Deadlock ensues, as thread 1 is waiting on the GIL, and thread 2 is waiting on the tables lock. (This wouldn't be an issue if PyThread_acquire_lock released the GIL, but I'm not certain that it does, because that's not desirable for all locks.)

If you think that atomic operations will be too difficult to use here, then you can use a dedicated PyMutex for the tracing field, but then that lock should be held for absolutely all reads and writes to tracing.
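
A sketch of that dedicated-lock alternative, assuming the PyMutex API available in CPython 3.13+ (the helper names here are made up for illustration):

/* Sketch: a PyMutex that guards *every* read and write of
   tracemalloc_config.tracing, independent of the GIL and of TABLES_LOCK(). */
static PyMutex tracing_mutex;   /* zero-initialized, i.e. unlocked */

static int
get_tracing(void)
{
    PyMutex_Lock(&tracing_mutex);
    int tracing = tracemalloc_config.tracing;
    PyMutex_Unlock(&tracing_mutex);
    return tracing;
}

static void
set_tracing(int tracing)
{
    PyMutex_Lock(&tracing_mutex);
    tracemalloc_config.tracing = tracing;
    PyMutex_Unlock(&tracing_mutex);
}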

Python/tracemalloc.c (review thread resolved)

@vstinner vstinner (Member) left a comment

The tracemalloc_config.tracing variable is protected by the GIL:

/* Is tracemalloc tracing memory allocations?
Variable protected by the GIL */
int tracing;

There is no need to add TABLES_LOCK(), it's redundant.

I wrote a different fix: PR gh-128710.

@tom-pytel tom-pytel (Contributor Author)

> I wrote a different fix: PR gh-128710.

Tested, works, cleaner, closing out this one.

@tom-pytel tom-pytel closed this Jan 10, 2025
@tom-pytel tom-pytel deleted the fix-issue-128679 branch January 11, 2025 17:45