The existence of a dangling object can cause a crash even if you don't touch it after it dangles #589

oremanj · 2024-05-23T04:38:01Z


I recently debugged an issue that a coworker of mine was having after converting some bindings from pybind11 to nanobind. The Python code in question was creating a nanobind instance wrapping a C++ reference (i.e., using rv_policy::reference). The referent was later destroyed in C++. The nanobind instance continued to exist, but was never accessed after the C++ referent was destroyed.

On pybind11, this worked fine. On nanobind, it would sporadically produce this error:

Critical nanobind error: nanobind::detail::inst_new_int(): unexpected collision!

The reason is a little subtle: since the original nanobind instance still existed, the dangling C++ pointer was still registered in inst_c2p. Since the underlying memory had been freed, that chunk was eligible to be reused for a newly allocated nanobind instance. The new instance just happened to be one with internal storage that started at exactly the location where the original C++ referent had been stored. This created a collision when inserting into inst_c2p, and nanobind assumes that such a collision is impossible.
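The sequence of events can be replayed in a toy model. This is only a sketch under stated assumptions, not nanobind's implementation: `ToyRegistry`, its two `register_*` methods, `CollisionError`, and the integer "addresses" are hypothetical stand-ins for inst_c2p, the instance constructors, the fatal error, and heap pointers.

```python
class CollisionError(RuntimeError):
    """Hypothetical stand-in for the 'unexpected collision!' abort."""


class ToyRegistry:
    """Hypothetical stand-in for inst_c2p: maps a C++ address to a wrapper."""

    def __init__(self):
        self.c2p = {}

    def register_reference(self, address, name):
        # Models wrapping an existing C++ object with rv_policy::reference.
        self.c2p[address] = name

    def register_internal(self, address, name):
        # Models the internal-storage path in this sketch: a fresh allocation
        # is assumed never to share an address with a registered instance.
        if address in self.c2p:
            raise CollisionError(
                f"address {address:#x} already maps to {self.c2p[address]}"
            )
        self.c2p[address] = name


registry = ToyRegistry()
registry.register_reference(0x1000, "wrapper_of_referent")

# C++ destroys the referent at 0x1000. Nothing tells the registry, so the
# entry stays. Later the allocator hands out 0x1000 again:
try:
    registry.register_internal(0x1000, "new_instance")
    outcome = "registered"
except CollisionError:
    outcome = "collision"

print(outcome)
```

Note that the dangling wrapper is never read or written in this model; its registry entry alone is enough to trigger the failure.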

I'm curious what you think is the right thing to do here. What is the programmer's responsibility in terms of managing the lifetime of a nanobind instance that wraps a reference, relative to the lifetime of its C++ referent? Obviously doing anything to the Python instance that actually reads or writes part of the C++ object is verboten. But it's difficult to definitively end the lifetime of a Python instance, so the current behavior, where we can get a nanobind assertion from the mere existence of a dangling instance without committing any C++ UB, seems a little too strict to me.

This would be easy to fix (modify inst_new_int to handle collisions in inst_c2p in the same way that inst_new_ext does), but I wasn't sure about the appetite for that change, so I figured I'd raise it in a discussion first.

oremanj (Author) · 2024-05-23T05:46:56Z

I just realized that there's a 100% valid way to produce this error: if you have a Python object created from a unique_ptr<T> returned by C++, and you pass its ownership back to C++ in order to initialize a new unique_ptr<T> argument. The C++ object address is still registered in inst_c2p, but C++ can validly destroy and deallocate it, which creates the same opportunity for a spurious collision assertion in inst_new_int later on.
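A minimal sketch of that ownership round trip, assuming a toy registry rather than nanobind's real data structures. `ToyInstance`, `wrap_unique_ptr`, `pass_to_cpp`, and the `c2p` dict are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class ToyInstance:
    address: int
    state: str = "ready"  # "ready" or "relinquished" in this sketch


c2p = {}  # hypothetical stand-in for inst_c2p


def wrap_unique_ptr(address):
    # Models C++ returning a unique_ptr<T>: Python now owns the object.
    inst = ToyInstance(address)
    c2p[address] = inst
    return inst


def pass_to_cpp(inst):
    # Models handing ownership back through a unique_ptr<T> argument.
    # The wrapper becomes unusable, but its registry entry is kept.
    inst.state = "relinquished"


obj = wrap_unique_ptr(0x2000)
pass_to_cpp(obj)

# C++ may now legitimately delete the object at 0x2000, after which the
# allocator is free to return 0x2000 for a new internal-storage instance.
stale_entry_present = 0x2000 in c2p
print(obj.state, stale_entry_present)
```

Every step here is well-defined on both sides, yet the registry ends up holding an address that C++ is entitled to free.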


wjakob (Maintainer) · 2024-05-23T07:12:52Z

When multiple Python objects share the same pointer address, they're supposed to have different types. That way, the mapping remains unambiguous. This is what allows you to create an internal reference to the first element of a struct. If the (type, pointer) mapping becomes nonsensical, you could get bogus instances back when casting a C++ pointer back to Python (i.e., nanobind would return an old dangling instance instead of creating a new one with the specified return value policy).
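The lookup hazard can be illustrated with a toy (type, pointer) table. This is a sketch under assumptions, not nanobind code: `cast_to_python`, the nested `c2p` dict, and the `stale` flag are hypothetical, and the flag exists only so the example can show which wrapper came back (a real registry has no such knowledge).

```python
from collections import defaultdict

# Hypothetical stand-in for inst_c2p: address -> {type name: wrapper}
c2p = defaultdict(dict)


def cast_to_python(address, type_name):
    """Return the registered wrapper for (type, address), or create one."""
    existing = c2p[address].get(type_name)
    if existing is not None:
        return existing
    wrapper = {"type": type_name, "address": address, "stale": False}
    c2p[address][type_name] = wrapper
    return wrapper


# A struct and its first member share an address but differ in type, so
# both can be registered without ambiguity.
outer = cast_to_python(0x3000, "Outer")
first = cast_to_python(0x3000, "Inner")

# The Outer object is destroyed in C++; its wrapper is now dangling.
outer["stale"] = True

# A new, unrelated Outer object is allocated at the same address. The lookup
# cannot tell it apart from the old one and returns the dangling wrapper.
again = cast_to_python(0x3000, "Outer")
print(again is outer, again["stale"])
```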

I don't have a great idea on how to solve this problem.


oremanj (Author) · 2024-05-31T19:12:38Z

There are basically two cases:

1. The dangling address is reused for an internal-storage instance. This one's easy: we have a brand new heap allocation, so any instances whose addresses conflict with it must be dangling. I have an implementation that "detaches" the dangling instances by removing them from inst_c2p, making them unusable (state=relinquished), and setting a sentinel offset=0 so we don't complain when we can't find them in inst_c2p upon eventual destruction of the Python object.
2. The dangling address is reused for an external-storage instance. This one is much trickier, because we can't tell whether the pointer being cast to Python is "fresh" or might have already been exposed via a different return value policy. We probably can't safely use the "detach" approach here, but instead would have to solve the general problem of: is this existing instance suitable for reuse when casting the apparently-same C++ object with this new RVP?
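The "detach" idea for the internal-storage case might look roughly like the following toy model. It is not the implementation mentioned above; `ToyInstance`, `register_external`, `register_internal`, `destroy`, and the `registered` flag (which plays the role of the offset=0 sentinel here) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class ToyInstance:
    address: int
    state: str = "ready"
    registered: bool = True  # False plays the role of the offset=0 sentinel


c2p = {}  # hypothetical stand-in for inst_c2p: address -> list of instances


def register_external(address):
    inst = ToyInstance(address)
    c2p.setdefault(address, []).append(inst)
    return inst


def register_internal(address):
    # Fresh heap storage: anything already registered here must be dangling,
    # so detach it instead of reporting a collision.
    for dangling in c2p.pop(address, []):
        dangling.state = "relinquished"
        dangling.registered = False
    inst = ToyInstance(address)
    c2p[address] = [inst]
    return inst


def destroy(inst):
    # Detached instances are no longer in the registry, so skip the lookup.
    if inst.registered:
        c2p[inst.address].remove(inst)
        if not c2p[inst.address]:
            del c2p[inst.address]


old = register_external(0x4000)  # wraps a reference; referent later freed
new = register_internal(0x4000)  # allocator reuses the address
destroy(old)                     # does not disturb the new entry
print(old.state, c2p[0x4000] == [new])
```

The same shortcut is not available in the external-storage case, because there an existing entry might be a live wrapper for the very object being cast.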

There are non-dangling cases that raise the "reuse" problem also:

#include <memory>
#include <utility>
#include <nanobind/nanobind.h>
#include <nanobind/stl/unique_ptr.h>

namespace nb = nanobind;

struct SomeData { int value = 42; };
struct Example {
    std::unique_ptr<SomeData> storage = std::make_unique<SomeData>();

    SomeData* inspect() const { return storage.get(); }
    std::unique_ptr<SomeData> take() { return std::move(storage); }
};

NB_MODULE(example_mod, m) {
    nb::class_<SomeData>(m, "SomeData")
        .def_rw("value", &SomeData::value);
    nb::class_<Example>(m, "Example")
        .def(nb::init<>())
        .def("inspect", &Example::inspect, nb::rv_policy::reference_internal)
        .def("take", &Example::take);
}

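import example_mod

# Placeholder so the snippet is self-contained: keeps the object alive.
stored = []
store_somewhere = stored.append

ex = example_mod.Example()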

if True:
    # This works, at least on CPython:
    if ex.inspect().value == 42:
        store_somewhere(ex.take())
else:
    # This doesn't; the existing rv_policy::reference instance is reused
    # as-is, and we get an assertion failure inside nb_put_unique_finalize().
    # If we were instead returning a raw pointer with rv_policy::take_ownership,
    # we'd get a leak as no one would be assuming ownership (C++ released it
    # and Python didn't take it because the existing reference-only instance was
    # reused.)
    data = ex.inspect()
    if data.value == 42:
        store_somewhere(ex.take())

IMO, the reuse problem doesn't really have anything to do with dangling; dangling is just one way to encounter it unpredictably. What do you think is the ideal solution here? We could:

- Create a new owning instance instead of reusing the existing non-owning one.
- Upgrade the existing non-owning instance to owning.
- Decide this is too obscure or complicated to be worth fixing.
- Maybe something else?
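The first two options can be contrasted in a toy model. This is a hedged sketch, not a proposal for nanobind's internals: `ToyInstance`, `cast_new_instance`, `cast_upgrade`, and the dict-based registries are hypothetical names used only to make the difference visible.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class ToyInstance:
    address: int
    owns: bool


def cast_new_instance(registry, address):
    # Option 1: leave the non-owning wrapper alone and register a second,
    # owning wrapper for the same address.
    inst = ToyInstance(address, owns=True)
    registry.setdefault(address, []).append(inst)
    return inst


def cast_upgrade(registry, address):
    # Option 2: reuse the existing wrapper but promote it to owning.
    inst = registry[address][0]
    inst.owns = True
    return inst


# Both runs start from one non-owning wrapper, as produced by inspect().
reg1 = {0x5000: [ToyInstance(0x5000, owns=False)]}
reg2 = {0x5000: [ToyInstance(0x5000, owns=False)]}

owner1 = cast_new_instance(reg1, 0x5000)
owner2 = cast_upgrade(reg2, 0x5000)

print(len(reg1[0x5000]), owner1.owns)  # two wrappers, the new one owns
print(len(reg2[0x5000]), owner2.owns)  # one wrapper, now owning
```

In this model, option 1 leaves two same-typed wrappers at one address, which is in tension with the (type, pointer) uniqueness described earlier in the thread, while option 2 silently changes the ownership of a wrapper the caller already holds.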

Shared ownership, intrusive refcounting, reference_internal, etc. add additional cases but don't change the basic strategy, IMO.

