Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-106485: Dematerialize instance dictionaries when possible #106539

Merged
merged 22 commits into from
Aug 9, 2023

Conversation

brandtbucher
Copy link
Member

@brandtbucher brandtbucher commented Jul 7, 2023

Here's an alternative to #106496, using LOAD_ATTR_INSTANCE_VALUE.

@brandtbucher brandtbucher added performance Performance or resource usage interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Jul 7, 2023
@brandtbucher brandtbucher self-assigned this Jul 7, 2023
Copy link
Member

@markshannon markshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this does the right thing, and does it the right way.

But I don't know if it does it at the right time, or in the right place.
In other words, are we dematerializing in the right instruction, and in such as way as to not degrade specialization?

I guess the measure of that is the stats.

DEOPT_IF(!_PyDictOrValues_IsValues(dorv), LOAD_ATTR);
res = _PyDictOrValues_GetValues(dorv)->values[index];
PyDictOrValues *dorv = _PyObject_DictOrValuesPointer(owner);
if (!_PyDictOrValues_IsValues(*dorv)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want this big chunk of code inline here, it's going to mess up the tier 2 optimizer (it won't help the compiler generate code for the tier 1 interpreter either).

How about DEOPT_IF(!_PyDictOrValues_IsValues(dorv), LOAD_ATTR_INSTANCE_DEMATERIALIZE)
and make another instruction LOAD_ATTR_INSTANCE_DEMATERIALIZE to do the dematerialization?

@@ -673,6 +674,8 @@ specialize_dict_access(
PyDictOrValues dorv = *_PyObject_DictOrValuesPointer(owner);
if (_PyDictOrValues_IsValues(dorv)) {
// Virtual dictionary
unmaterializing:
; // Load-bearing semicolon; don't touch!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe // C doesn't allow declarations immediately after a label?

@@ -695,6 +698,16 @@ specialize_dict_access(
return 0;
}
// We found an instance with a __dict__.
if (dict->ma_values) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite a complex condition. The goto makes it even harder to follow.
Could you compute a virtual_dict flag, then branch on that?

@markshannon
Copy link
Member

markshannon commented Jul 12, 2023

The stats still show "has managed dict" as the primary reason for specialization failure.

That suggests to me that we need to dematerialize in LOAD_ATTR_METHOD_WITH_VALUES and LOAD_ATTR_NONDESCRIPTOR_WITH_VALUES as well.

If that is the case we should be able to get the specialization rate for LOAD_ATTR up to 98% (ignoring misses).

@brandtbucher brandtbucher changed the title GH-106485: "Un-materialize" __dict__s in LOAD_ATTR_INSTANCE_VALUE GH-106485: Dematerialize instance dictionaries when possible Jul 16, 2023
@brandtbucher brandtbucher marked this pull request as ready for review July 16, 2023 14:56
@brandtbucher
Copy link
Member Author

Just kicked off a stats/benchmarking job for the new approach. Should be done soon.

@brandtbucher
Copy link
Member Author

Stats comparison vs base and benchmark comparison vs base. On my phone, so not able to dig into the results right now.

Copy link
Member

@markshannon markshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stats are puzzling.
This hugely improves the specialization success rate, but make almost no difference to the number of unspecialized LOAD_ATTRs executed.

@@ -1827,8 +1827,10 @@ dummy_func(
op(_CHECK_MANAGED_OBJECT_HAS_VALUES, (owner -- owner)) {
assert(Py_TYPE(owner)->tp_dictoffset < 0);
assert(Py_TYPE(owner)->tp_flags & Py_TPFLAGS_MANAGED_DICT);
PyDictOrValues dorv = *_PyObject_DictOrValuesPointer(owner);
DEOPT_IF(!_PyDictOrValues_IsValues(dorv), LOAD_ATTR);
PyDictOrValues *dorv = _PyObject_DictOrValuesPointer(owner);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is turning:

  test_bit = owner[-4] & 1
  jump if test_bit -> slow_path

into

  test_bit = owner[-4] & 1
  jump if not test_bit -> following
  spill-registers
  call _PyObject_MakeInstanceAttributesFromDict
  restore-registers
  jump if return-register ->following
  jump -> slow_path
following:

or, which is even worse:

  test_bit = owner[-4] & 1
  spill-registers
  jump if not test_bit -> following
  call _PyObject_MakeInstanceAttributesFromDict
  jump if return-register ->following
  jump -> slow_path
following:
  restore-registers

I think this will need to be

DEOPT_IF(!_PyDictOrValues_IsValues(*dorv), LOAD_ATTR_DEMATERIALIZE)

To keep the slow-path from messing up the code.

Copy link
Member Author

@brandtbucher brandtbucher Aug 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This really complicates the code (and probably needs ugly special-casing in the resulting uop trace), since LOAD_ATTR_DEMATERIALIZE would need to jump back into (or repeat) the instruction we happen to be executing after a successful dematerialization.

Can we leave that to another PR, if we determine that the compiler indeed hasn't just moved this cold branch with the register spills out-of-line (like it should, under PGO)?

  test_bit = owner[-4] & 1
  jump if test_bit -> cold
following:
  ...
cold:
  spill-registers
  call _PyObject_MakeInstanceAttributesFromDict
  restore-registers
  jump if return-register ->following
  jump -> slow_path

@@ -2718,8 +2720,10 @@ dummy_func(
assert(type_version != 0);
DEOPT_IF(self_cls->tp_version_tag != type_version, LOAD_ATTR);
assert(self_cls->tp_flags & Py_TPFLAGS_MANAGED_DICT);
PyDictOrValues dorv = *_PyObject_DictOrValuesPointer(self);
DEOPT_IF(!_PyDictOrValues_IsValues(dorv), LOAD_ATTR);
PyDictOrValues *dorv = _PyObject_DictOrValuesPointer(self);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as for LOAD_ATTR_INSTANCE_VALUES

@@ -2748,8 +2752,10 @@ dummy_func(
assert(type_version != 0);
DEOPT_IF(self_cls->tp_version_tag != type_version, LOAD_ATTR);
assert(self_cls->tp_flags & Py_TPFLAGS_MANAGED_DICT);
PyDictOrValues dorv = *_PyObject_DictOrValuesPointer(self);
DEOPT_IF(!_PyDictOrValues_IsValues(dorv), LOAD_ATTR);
PyDictOrValues *dorv = _PyObject_DictOrValuesPointer(self);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@brandtbucher
Copy link
Member Author

The stats are puzzling.
This hugely improves the specialization success rate, but make almost no difference to the number of unspecialized LOAD_ATTRs executed.

We're only dematerializing about 5M dicts, and the unspecialized LOAD_ATTR rates drop by about 35M (3%). So, on average, each object that used to have a dict gets about 7 new hits. That seems reasonable.

As a side note, I've learned to mostly ignore the specialization success/failure stats when comparing branches like this, since repeated specialization attempts (and exponential backoff) cause failure rates to grow or shrink disproportionately from the actual number of specialized sites. In this case, the number of successes drops by 236k and the number of failures drop by over a million. But the percentages say we actually have 5% more successes now. It's sort of confusing to derive much meaning from those sorts of numbers. 🙃

@brandtbucher
Copy link
Member Author

Confirmed that the weird mypy2 results seem to result from an error in the runner logic for that benchmark.

Comment on lines +5764 to +5766
if (_PyType_HasFeature(tp, Py_TPFLAGS_MANAGED_DICT)) {
OBJECT_STAT_INC(dict_materialized_on_request);
}
Copy link
Member Author

@brandtbucher brandtbucher Aug 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markshannon BTW, this was the missing STAT_INC. Anything that sets a custom tp_new and has Py_TPFLAGS_MANAGED_DICT doesn't get its inline values initialized, and has a NULL dict. We handle it fine, but we may want to consider treating a NULL dict as "has values that need initializing" rather than "has a dict that needs initializing".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out, this is a major source of dict materializations in the benchmarks (about ~4 million dicts that we weren't counting before). It looks like we're creating all of these just to dematerialize them almost immediately.

@brandtbucher brandtbucher enabled auto-merge (squash) August 9, 2023 18:30
@brandtbucher brandtbucher merged commit 326f0ba into python:main Aug 9, 2023
17 checks passed
}
// It's likely that this dict still shares its keys (if it was materialized
// on request and not heavily modified):
assert(PyDict_CheckExact(dict));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should allow dict subclasses?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can't

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants