gh-115999: Specialize `LOAD_ATTR` for instance and class receivers in free-threaded builds #128164

mpage · 2024-12-21T21:58:42Z

This PR finishes specialization for LOAD_ATTR in the free-threaded build by adding support for class and instance receivers.

The bulk of it is dedicated to making the specialized instructions and the specialization logic thread-safe. This consists of using atomics / locks in the appropriate places, avoiding races in specialization related to reloading versions, and ensuring that the objects stored in inline caches remain valid when accessed (by only storing deferred objects). See the section on "Thread Safety" below for more details.

Additionally, making this work required a few unrelated changes to fix existing bugs or work around differences between the two builds that result from only storing deferred values (which causes specialization failures in the free-threaded build when a value that would be stored in the cache is not deferred):

- Fixed a bug in the cases generator where it wasn't treating macro tokens as terminators when searching for the assignment target of an expression involving PyStackRef_FromPyObjectNew.
- Refactored test_descr.MiscTests.test_type_lookup_mro_reference to work when specialization fails (and also be a behavioral test).
- Temporarily skipped test_capi.test_type.TypeTests.test_freeze_meta when running refleaks tests on free-threaded builds. Specialization failure triggers an existing bug.

Single-threaded Performance

Performance is improved by ~12% on free-threaded builds.
Performance is ~neutral on default builds. (Technically the benchmark runner reports a 0.2% improvement, but that looks like noise to me.)

We're leaving a bit of perf on the table by only storing deferred objects: we can't specialize attribute lookups that resolve to class attributes (e.g. counters, settings). I haven't measured how much perf we're giving up, but I'd like to address that in a separate PR.
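For illustration, the affected access pattern looks roughly like the sketch below. The class and attribute names are made up, and the comments about which objects are deferred are assumptions based on the description above (a function object is assumed to use deferred reference counting, a plain int class attribute is assumed not to).

```python
class Config:
    # Plain int stored on the class: assumed NOT to use deferred reference
    # counting, so a load that resolves to it would not be specialized in
    # the free-threaded build.
    retries = 3

    # Function object stored on the class: assumed to use deferred
    # reference counting, so it is eligible for the inline cache.
    def describe(self):
        return f"retries={self.retries}"


cfg = Config()


def hot_loop(n):
    total = 0
    for _ in range(n):
        total += cfg.retries   # LOAD_ATTR that resolves to a class attribute
        cfg.describe           # LOAD_ATTR that resolves to a method
    return total


result = hot_loop(1000)
```

Both loads behave identically from Python's point of view; the only expected difference is whether the interpreter can cache the result.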

Scalability

The object_cfunction and pymethod benchmarks are improved (1.4x slower -> 14.3x faster, 1.8x slower -> 13.4x faster, respectively). Other benchmarks appear unchanged.

I would expect that cmodule_function would also improve, but it looks like the benchmark is bottlenecked on increfing the int.__floor__ method that is returned from the call to _PyObject_LookupSpecial in math_floor (the incref happens in _PyType_LookupRef, which is called by _PyObject_LookupSpecial):

cpython/Modules/mathmodule.c

Line 1178 in 3879ca0

PyObject *method = _PyObject_LookupSpecial(number, state->str___floor__);

Raw numbers:

Running benchmarks with 28 threads
                   Merge-base    This PR
object_cfunction    1.4x slower  14.3x faster
cmodule_function    1.7x slower   1.7x slower
mult_constant      12.8x faster  13.6x faster
generator          11.5x faster  12.2x faster
pymethod            1.8x slower  13.4x faster
pyfunction         12.2x faster  13.0x faster
module_function    14.5x faster  14.8x faster
load_string_const  13.0x faster  13.2x faster
load_tuple_const   12.7x faster  14.1x faster
create_pyobject    12.8x faster  13.5x faster
create_closure     13.9x faster  14.8x faster
create_dict        10.5x faster  13.3x faster
thread_local_read   3.9x slower   4.0x slower
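As a rough reading aid, the two rows that moved from "slower" to "faster" can be turned into a single relative factor. This sketch assumes each entry is a throughput ratio against a single thread, so "Nx slower" corresponds to a ratio of 1/N; the helper name is made up.

```python
def scaling_factor(value, direction):
    """Convert an 'Nx faster' / 'Nx slower' entry into a throughput ratio."""
    return value if direction == "faster" else 1.0 / value


# (merge base, this PR) entries copied from the table above.
rows = {
    "object_cfunction": ((1.4, "slower"), (14.3, "faster")),
    "pymethod": ((1.8, "slower"), (13.4, "faster")),
}

relative = {
    name: scaling_factor(*after) / scaling_factor(*before)
    for name, (before, after) in rows.items()
}

for name, gain in relative.items():
    print(f"{name}: {gain:.1f}x better scaling than the merge base")
```

Under that assumption, object_cfunction scales about 20x better than on the merge base and pymethod about 24x better.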

Thread Safety

Thread safety of specialized instructions is addressed in a few different ways:

- Use atomics where needed.
- Lock the instance dictionary in _LOAD_ATTR_WITH_HINT.
- Only store deferred objects in the inline cache. All uops that retrieve objects from the cache are preceded by type version guards, which ensure that the (deferred) value retrieved from the cache is valid. The type version will be mutated before destroying the reference in the type. If the type held the last reference, then GC will need to run in order to reclaim the object. GC requires stopping the world, so if GC reclaims the object then we will see the new type version and the guard will fail.
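The guard ordering can be pictured with a toy, single-threaded Python model. This is only a sketch of the idea, not the C implementation: ToyType, InlineCache and load_attr are hypothetical names, and the stop-the-world GC step is not modeled.

```python
class ToyType:
    """Stands in for a type object: a version tag plus an attribute table."""

    def __init__(self, attrs):
        self.version = 1
        self.attrs = dict(attrs)

    def set_attr(self, name, value):
        # Bump the version *before* dropping the old reference, mirroring
        # the ordering described above.
        self.version += 1
        self.attrs[name] = value


class InlineCache:
    """Stands in for an inline cache entry: (type version, cached object)."""

    def __init__(self, tp, name):
        self.version = tp.version
        self.value = tp.attrs[name]


def load_attr(tp, name, cache):
    if cache.version == tp.version:   # the type version guard
        return cache.value, "fast path"
    return tp.attrs[name], "deopt"    # guard failed: do a full lookup


tp = ToyType({"f": len})
cache = InlineCache(tp, "f")
first = load_attr(tp, "f", cache)     # guard passes, cached object is used
tp.set_attr("f", max)                 # invalidates every cache built on version 1
second = load_attr(tp, "f", cache)    # guard fails, stale object is never returned
```

The point of the model is that a reader who passes the guard can only ever see an object that the type still referenced at that version.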

Thread safety of specialization is addressed using similar techniques:

- Atomics / locking is used to access shared state.
- Specialization code is refactored to read versions at the start of specialization and store the same version in caches (as opposed to rereading the version). This ensures that the specialization is consistent with the version by avoiding races where the type or shared keys dictionary changes after we've queried it.
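A deterministic sketch of why rereading the version is unsafe. The concurrent writer is simulated with a callback that runs between the lookup and the cache write; the dict-based "type" and all function names are hypothetical.

```python
def specialize_rereading(tp, name, interleave=None):
    value = tp["attrs"][name]       # lookup happens against the old state
    if interleave:
        interleave(tp)              # another thread mutates the type here
    # Re-reading pairs the NEW version with the OLD value.
    return {"version": tp["version"], "value": value}


def specialize_snapshot(tp, name, interleave=None):
    version = tp["version"]         # read once, before the lookup
    value = tp["attrs"][name]
    if interleave:
        interleave(tp)
    return {"version": version, "value": value}


def concurrent_write(tp):
    tp["version"] += 1
    tp["attrs"]["x"] = "new"


def guard_passes(tp, cache):
    return cache["version"] == tp["version"]


tp_a = {"version": 1, "attrs": {"x": "old"}}
bad = specialize_rereading(tp_a, "x", concurrent_write)

tp_b = {"version": 1, "attrs": {"x": "old"}}
good = specialize_snapshot(tp_b, "x", concurrent_write)

stale_hit = guard_passes(tp_a, bad) and bad["value"] != tp_a["attrs"]["x"]
safe_miss = not guard_passes(tp_b, good)
```

With the reread, the guard passes while the cache holds a stale value. With the snapshot, the worst case is a cache whose guard fails, which only costs a deopt.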

Stats

Following the instructions in the comment preceding specialize_attr_loadclassattr, I collected stats for the default build for both this PR and its merge base using ./python -m test_typing test_re test_dis test_zlib and compared them using Tools/scripts/summarize_stats.py. The results for LOAD_ATTR are nearly identical and are consistent with results from comparing the merge base against itself.

Issue: Make the specializing interpreter thread-safe in --disable-gil builds #115999

colesbury

Overall looks great!

I think I unintentionally changed the object_cfunction when cleaning it up for inclusion in CPython. It previously had a * 1.0 to coerce the value to a float and avoid the _PyObject_LookupSpecial.

(We should deal with the _PyObject_LookupSpecial bottlenecks, but it's a lower priority)

colesbury · 2024-12-23T15:06:34Z

Python/bytecodes.c

@@ -2222,16 +2227,36 @@ dummy_func(
            PyObject *attr_o;

            PyDictObject *dict = _PyObject_GetManagedDict(owner_o);
-            DEOPT_IF(hint >= (size_t)dict->ma_keys->dk_nentries);
+            DEOPT_IF(!LOCK_OBJECT(dict));


I'm not thrilled with the lock here -- we've tried to avoid locking in the fast path dictionary accesses -- but given the overall perf gains, we can come back to this later.

mpage · 2024-12-23T22:07:24Z

The JIT test failure is unrelated to this PR: here is an example of it failing in the same way on a different PR.

The failing test, test_multiprocessing_spawn.test_processes.WithProcessesTestLock.test_repr_rlock, looks racy.

The test:

cpython/Lib/test/_test_multiprocessing.py

Lines 1525 to 1530 in 30efede

    
        t = threading.Thread(target=self._acquire_release,
                             args=(lock, 0.2),
                             name=f'T1')
        t.start()
        time.sleep(0.1)
        self.assertEqual('<RLock(SomeOtherThread, nonzero)>', repr(lock))

The function, _acquire_release, which runs in the thread:

cpython/Lib/test/_test_multiprocessing.py

Lines 1489 to 1497 in 30efede

    
    @staticmethod
    def _acquire_release(lock, timeout, l=None, n=1):
        for _ in range(n):
            lock.acquire()
        if l is not None:
            l.append(repr(lock))
        time.sleep(timeout)
        for _ in range(n):
            lock.release()

This will fail if _acquire_release runs to completion or doesn't acquire the lock before the test checks the repr.
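For comparison, a sleep-free version of the same handshake could look like the sketch below. It uses threading primitives rather than the multiprocessing ones the real test exercises, it is not a proposed patch, and the helper and event names are made up.

```python
import threading


def acquire_release(lock, acquired, release):
    with lock:
        acquired.set()   # tell the main thread the lock is now held
        release.wait()   # keep holding it until the main thread has looked


lock = threading.RLock()
acquired = threading.Event()
release = threading.Event()

t = threading.Thread(target=acquire_release,
                     args=(lock, acquired, release),
                     name="T1")
t.start()

acquired.wait()
# At this point T1 is guaranteed to hold the lock, regardless of scheduling,
# so an observation such as repr(lock) would be taken in a known state.
held_by_other = not lock.acquire(blocking=False)

release.set()
t.join()
```

Explicit events remove both failure modes: the worker cannot finish early, and the main thread cannot look before the lock is acquired.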

nascheme · 2024-12-23T22:58:27Z

Lib/test/test_opcache.py

-                a = object()
+                # a must be set to an instance that uses deferred reference
+                # counting in free-threaded builds
+                a = type("Foo", (object,), {})


This looks a bit mysterious to me. Does this do something different than what class a: [...] does? Calling it an instance (which is technically true) is a bit confusing too. I would just say "set to an object that uses ...". It's a class object. If you need it written like this, I think a little helper function with a docstring explaining why it must be done that way would help:

    def make_deferred_ref_count_obj():
        return type("Foo", (object,), {})

Ah, I see later there is item.a = type("Foo", (object,), {}) so you want an expression not a statement.

nascheme · 2024-12-24T00:08:48Z

Regarding summarize_stats.py, I was expecting to see SPEC_FAIL_ATTR_DESCR_NOT_DEFERRED under the failure kind table for LOAD_ATTR. Do we know why it isn't included? It would give a clue to the amount of performance being missed due to only handling deferred refcount objects.

mpage · 2024-12-24T01:15:32Z

> Regarding summarize_stats.py, I was expecting to see SPEC_FAIL_ATTR_DESCR_NOT_DEFERRED under the failure kind table for LOAD_ATTR. Do we know why it isn't included?

Yep: the stats were collected on the default build. I've updated the PR description to make that clear.

> It would give a clue to the amount of performance being missed due to only handling deferred refcount objects.

That's a good point. Let me collect stats for the free-threaded build and see what things look like.

mpage · 2024-12-24T06:08:44Z

- Hypothesis failure is test_glob: test_selflink() failed on AMD64 RHEL8 Refleaks 3.x #109959
- TSAN failure looks new, but unrelated to this PR. I filed Race between PyUnicode_SET_UTF8 and _PyUnicode_CheckConsistency #128212 and will add a suppression.

mpage added 30 commits December 18, 2024 14:13

Add _PyDictKeys_StringLookupAndVersion

8eeb4fe

Look up a unicode key in an all unicode keys object along with the keys version, assigning one if not present. We need a keys version that is consistent with presence of the key for use in the guards.

Pass shared keys version to specialization

fcd05a0

Reading the shared keys version and looking up the key need to be performed atomically. Otherwise, a key inserted after the lookup could race with reading the version, causing us to incorrectly specialize that no key shadows the descriptor.
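A toy Python model of the combined operation, as a sketch only. The class, its per-object version counter, and the use of a plain lock are assumptions for illustration and do not mirror the C data structures.

```python
import threading


class ToySharedKeys:
    """Toy stand-in for a shared keys object with a lazily assigned version."""

    def __init__(self):
        self._lock = threading.Lock()
        self._keys = []
        self._version = 0        # 0 means "no version assigned yet"
        self._next_version = 1

    def insert(self, key):
        with self._lock:
            if key not in self._keys:
                self._keys.append(key)
                self._version = 0    # any change invalidates the old version

    def lookup_and_version(self, key):
        # Both reads happen under one lock, so the returned version always
        # describes the same key set that the lookup saw.
        with self._lock:
            if self._version == 0:
                self._version = self._next_version
                self._next_version += 1
            index = self._keys.index(key) if key in self._keys else -1
            return index, self._version


keys = ToySharedKeys()
keys.insert("x")
miss_index, miss_version = keys.lookup_and_version("y")   # "y" is absent
keys.insert("y")                                          # invalidates the version
hit_index, hit_version = keys.lookup_and_version("y")     # fresh version assigned
```

A cache built from the first call records "no key shadows the descriptor" under miss_version. Once "y" is inserted that version is gone, so a guard on it fails instead of silently trusting the stale answer.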

Add support for enabling each of the instance attribute kinds

5c03db0

Everything starts out disabled

Only cache deferred descriptors

a576748

Make analyze_descriptor thread-safe

8475fd6

* Check that the type hasn't changed across lookups
* Descr is now an owned ref

Use an atomic load for GUARD_TYPE_VERSION

29c3356

Use atomics to load valid bit for inline values in _GUARD_DORV_VALUES…

4f9eeb3

…_INST_ATTR_FROM_DICT

Use atomic to load keys version in _GUARD_KEYS_VERSION

e9050fc

Use an atomic load for managed dict in _CHECK_ATTR_METHOD_LAZY_DICT

ade57f2

Enable specialization of method loads

0a87264

Use atomics for fetching type flags

816e22f

Get a strong reference to dict in instance_has_key

ca1e232

Split specialize_dict_access

945b61c

Pass type version to specialize_dict_access

feb7d34

Take a critical section around dict specialization

ecfd199

Use thread-safe version of _PyDictKeys_StringLookup

9fda5db

Use atomic load in _CHECK_MANAGED_OBJECT_HAS_VALUES

3b2c220

Make _LOAD_ATTR_INSTANCE_VALUE thread-safe

408e44b

- Use atomic load for value
- Use _Py_TryIncrefCompareStackRef for incref

Make _LOAD_ATTR_WITH_HINT thread-safe

16aab70

Specialize instance accesses

d0920ea

Specialize LOAD_ATTR_SLOT

e7cea82

- Use atomics and _Py_TryIncrefCompareStackRef in _LOAD_ATTR_SLOT
- Pass type version during specialization

Enable LOAD_ATTR_PROPERTY

8dac8c4

- Check that fget is deferred
- Pass tp_version

Checkpoint LOAD_ATTR_GETATTRIBUTE_OVERRIDEN

d29b3aa

Lock dict in instance_has_key

b0b8102

Lock dict instead of owner when specializing for dict access

85dab0d

Use atomic load for valid bit

581869e

Use _PyDictKeys_StringLookup and lock around it, rather than wasting …

96be738

…out param

Fix cases_generator bug

9afe052

Macros should be treated as terminators when searching for the assignment target of an expression involving PyStackRef_FromPyObjectNew

Fix load

e190a0d

Remove FT_UNIMPLEMENTED

8c78369

All instance loads are complete!

mpage added 3 commits December 20, 2024 09:25

Use atomics when loading oparg

1b787b3

Fix formatting

b868363

Always return type version from analyze_descriptor_load

d6d4c73

This matches the previous implementation and causes failures to specialize due to the presence of both __getattr__ and __getattribute__ to be classified correctly (rather than being classified as being out of type versions).

mpage added the skip news label Dec 21, 2024

bedevere-app bot mentioned this pull request Dec 21, 2024

Make the specializing interpreter thread-safe in --disable-gil builds #115999

Open

mpage requested review from colesbury and nascheme December 21, 2024 22:34

mpage marked this pull request as ready for review December 21, 2024 22:34

mpage requested review from markshannon and methane as code owners December 21, 2024 22:34

bedevere-app bot added the awaiting core review label Dec 21, 2024

Fidget-Spinner mentioned this pull request Dec 23, 2024

gh-121459: Deferred LOAD_ATTR (methods) #124101

Closed

colesbury reviewed Dec 23, 2024

mpage added 6 commits December 23, 2024 10:27

Merge branch 'main' into pythongh-115999-load-attr-instance-merged

9755562

Combine incref/steal into new stackref

9673f78

Fix formatting

6c3041f

Remove unnecessary error check for PyUnicode_Type.tp_hash

a20a4a4

Check keys kind explicitly

35b31c6

Pass dict from _CHECK_ATTR_WITH_HINT to _LOAD_ATTR_WITH_HINT

e5a7ae9

mpage requested a review from Fidget-Spinner as a code owner December 23, 2024 20:00

Fix unused variable warning

bb00f5a

mpage requested a review from colesbury December 23, 2024 22:07

nascheme reviewed Dec 23, 2024

mpage added 3 commits December 23, 2024 17:43

Update number of specialization failure kinds

b6ae487

Clarify construction of deferred object

11a351d

Add suppression for _PyUnicode_CheckConsistency

6f8aebf
