Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyO3 Sub Interpreter Broken since 0.22.0 #4570

Open
Xuanwo opened this issue Sep 20, 2024 · 12 comments
Open

PyO3 Sub Interpreter Broken since 0.22.0 #4570

Xuanwo opened this issue Sep 20, 2024 · 12 comments
Labels

Comments

@Xuanwo
Copy link
Contributor

Xuanwo commented Sep 20, 2024

Bug Description

Arrow UDF is a User-Defined Functions Framework that allows users to easily create and run user-defined functions (UDF) on Apache Arrow.

We use pyo3 to create a sub-interpreter and run users' UDFs within it to avoid the global GIL. It used to work fine, but we discovered it broke after the 0.22 release, specifically after this commit in #4188.

Steps to Reproduce

I've prepared a repository: https://github.com/Xuanwo/pyo3-sub-interpreter-broken

To reproduce:

git clone https://github.com/Xuanwo/pyo3-sub-interpreter-broken
cd pyo3-sub-interpreter-broken
cargo test

Backtrace

#0  0x00007d3dd69bf880 in _PyGCHead_SET_NEXT (gc=0x7d3dd5e430c8, next=0x7d3dd014dc60) at ./Include/internal/pycore_gc.h:63
#1  _PyObject_GC_UNTRACK (op=0x7d3dd5c2e0a0) at ./Include/internal/pycore_object.h:246
#2  AttributeError_dealloc (self=0x7d3dd5c2e0a0) at Objects/exceptions.c:2315
#3  0x00005618f292c140 in pyo3_ffi::object::Py_DECREF (op=0x7d3dd5c2e0a0)
    at /home/xuanwo/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-ffi-0.22.3/src/object.rs:611
#4  pyo3::gil::ReferencePool::update_counts (self=0x5618f299aed8 <pyo3::gil::POOL+16>) at src/gil.rs:311
#5  0x00005618f292c24d in pyo3::gil::GILPool::new () at src/gil.rs:424
#6  0x00005618f292b94a in pyo3::gil::GILGuard::acquire_unchecked () at src/gil.rs:229
#7  0x00005618f292b8e2 in pyo3::gil::GILGuard::acquire () at src/gil.rs:211
#8  0x00005618f28d2661 in pyo3::marker::Python::with_gil<pyo3_sub_interpreter_broken::{impl#2}::new::{closure_env#0}, ()> (f=...)
    at /home/xuanwo/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pyo3-0.22.3/src/marker.rs:410
#9  0x00005618f28d3402 in pyo3_sub_interpreter_broken::SubInterpreter::new () at src/lib.rs:42
#10 0x00005618f28d2e45 in pyo3_sub_interpreter_broken::tests::assert_err (code=...) at src/lib.rs:161
#11 0x00005618f28d2f9e in pyo3_sub_interpreter_broken::tests::test_forbid () at src/lib.rs:156
#12 0x00005618f28d2d27 in pyo3_sub_interpreter_broken::tests::test_forbid::{closure#0} () at src/lib.rs:154
#13 0x00005618f28d30d6 in core::ops::function::FnOnce::call_once<pyo3_sub_interpreter_broken::tests::test_forbid::{closure_env#0}, ()> ()
    at /rustc/eeb90cda1969383f56a2637cbd3037bdf598841c/library/core/src/ops/function.rs:250
#14 0x00005618f290e7ab in core::ops::function::FnOnce::call_once<fn() -> core::result::Result<(), alloc::string::String>, ()> () at library/core/src/ops/function.rs:250
#15 test::__rust_begin_short_backtrace<core::result::Result<(), alloc::string::String>, fn() -> core::result::Result<(), alloc::string::String>> ()
    at library/test/src/lib.rs:624
#16 0x00005618f290e055 in test::run_test_in_process::{closure#0} () at library/test/src/lib.rs:647
#17 core::panic::unwind_safe::{impl#23}::call_once<core::result::Result<(), alloc::string::String>, test::run_test_in_process::{closure_env#0}> ()
    at library/core/src/panic/unwind_safe.rs:272
#18 std::panicking::try::do_call<core::panic::unwind_safe::AssertUnwindSafe<test::run_test_in_process::{closure_env#0}>, core::result::Result<(), alloc::string::String>> ()
    at library/std/src/panicking.rs:557
#19 std::panicking::try<core::result::Result<(), alloc::string::String>, core::panic::unwind_safe::AssertUnwindSafe<test::run_test_in_process::{closure_env#0}>> ()
    at library/std/src/panicking.rs:521
#20 std::panic::catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<test::run_test_in_process::{closure_env#0}>, core::result::Result<(), alloc::string::String>> ()
    at library/std/src/panic.rs:350
#21 test::run_test_in_process () at library/test/src/lib.rs:647
#22 test::run_test::{closure#0} () at library/test/src/lib.rs:568
#23 0x00005618f28d4dd4 in test::run_test::{closure#1} () at library/test/src/lib.rs:598
#24 std::sys::backtrace::__rust_begin_short_backtrace<test::run_test::{closure_env#1}, ()> () at library/std/src/sys/backtrace.rs:152
#25 0x00005618f28d8502 in std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure#0}<test::run_test::{closure_env#1}, ()> () at library/std/src/thread/mod.rs:538
#26 core::panic::unwind_safe::{impl#23}::call_once<(), std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<test::run_test::{closure_env#1}, ()>> ()
    at library/core/src/panic/unwind_safe.rs:272
#27 std::panicking::try::do_call<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<test::run_test::{closure_env#1}, ()>>, ()> () at library/std/src/panicking.rs:557
#28 std::panicking::try<(), core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<test::run_test::{closure_env#1}, ()>>> () at library/std/src/panicking.rs:521


### Your operating system and version

Archlinux

### Your Python version (`python --version`)

Python 3.12.6

### Your Rust version (`rustc --version`)

rustc 1.81.0 (eeb90cda1 2024-09-04)

### Your PyO3 version

> 0.22.0

### How did you install python? Did you use a virtualenv?

`pacman`, no virtualenv.

Not related.

### Additional Info

Also see: https://github.com/arrow-udf/arrow-udf/pull/70
@Xuanwo
Copy link
Contributor Author

Xuanwo commented Sep 20, 2024

Furthermore, in the upcoming pyo3 release, the GILPool will be removed entirely. What alternative can we use?

@mejrs
Copy link
Member

mejrs commented Sep 20, 2024

It used to work fine, but we discovered it broke after the 0.22 release, specifically after this commit in #4188.

Your code was always broken, this is just when you noticed it. Your code has various examples of UB, like releasing the gil when you aren't holding it, and calling python api without the gil held. Your are also not handling existing threadstate correctly, which is probably why the linked PR stopped your code from "working".

Mixing pyo3 api and subinterpreters is also not supported. My suggestion is to exclusively use pyo3-ffi. The documentation for that is at https://docs.python.org/3/c-api/init.html

@Xuanwo
Copy link
Contributor Author

Xuanwo commented Sep 20, 2024

cc @wangrunji0408 do you have any ideas?

@Xuanwo
Copy link
Contributor Author

Xuanwo commented Sep 20, 2024

Hi, @mejrs, thank you for your prompt response. Our use case is quite unique, as we will run users' UDFs exclusively in this sub-interpreter.

After some research, I found that we can use Python::with_gil_unchecked to correctly hold the GIL.

It seems this approach is incorrect; I didn't run that code in the sub-interpreter.

@davidhewitt
Copy link
Member

What you are doing is inherently unsafe given PyO3's lack of support for subinterpreters, but it's possible I can help work out a subset of operations which you can do without issue. This would be entirely at your own risk, of course.

@Xuanwo
Copy link
Contributor Author

Xuanwo commented Oct 8, 2024

What you are doing is inherently unsafe given PyO3's lack of support for subinterpreters, but it's possible I can help work out a subset of operations which you can do without issue. This would be entirely at your own risk, of course.

Hi, @davidhewitt, would you like to lead me a direction in finding a solution? I'm willing to help implement it in PyO3 or fix our usage.

@davidhewitt
Copy link
Member

davidhewitt commented Oct 10, 2024

I would suggest looking at the problem like this:

  • sharing almost all objects between subinterpreters is unsound; PyO3's Py<T> does nothing to prevent this so if you have these objects crossing between interpreters it's probably broken. Bound<'py, T> and Borrowed<'a, 'py, T> probably allow sharing too, although the 'py lifetime will make it a bit more awkward.
  • consequently, almost all global state is probably not sound when mixed across subinterpreters. #[pyclass] currently stores the type object in global state so #[pyclass] definitely cannot be used with subinterpreters at present.
  • PyO3 also has some internal state, which we're generally working to remove, for many reasons. The GIL Refs were one example of (thread-local) global state, we also have an internal counter to detect if we hold the GIL which I suspect is also a problematic optimization.

A safe solution in PyO3 is extremely nontrivial, see #576

What we should start with, is to consider: what state are you passing into the subinterpreter? If we can isolate that to a very well defined data payload, we can probably make a very limited subinterpreter call sound.

@wangrunji0408
Copy link

@davidhewitt Thank you for your suggestion. After our investigation, we found that the root cause of the issue is in the third point you mentioned—the internal GIL counter. In our code, after switching to a subinterpreter, a GILPool is created, which increments the internal counter by 1. However, after this commit, the counter remains unchanged at 0. This results in Python objects being registered in the internal queue for deferred drop instead of being dropped immediately when exiting the Python scope. When re-entering, a different subinterpreter might be used. This leads to exceptions. Unfortunately, we haven’t found any public interfaces in the latest version of PyO3 that allow us to manipulate the internal counter directly. As a result, this issue cannot be resolved for now. We are looking forward to native support for subinterpreters in PyO3 soon.

@Xuanwo
Copy link
Contributor Author

Xuanwo commented Oct 10, 2024

As a result, this issue cannot be resolved for now. We are looking forward to native support for subinterpreters in PyO3 soon.

From what I understand, it's a long way for PyO3 to officially support subinterpreters. I hope we can find a way to use subinterpreters unsafely for now and work together to build the final solution.

Is it possible for us to modify GILGuard or a similar API so we can manage them ourselves while switching subinterpreters?

@mejrs
Copy link
Member

mejrs commented Oct 10, 2024

Is it possible for us to modify GILGuard or a similar API so we can manage them ourselves while switching subinterpreters?

What makes you think GILGuard is the problem? You should fix the bugs in your code first.

@davidhewitt
Copy link
Member

Unfortunately, we haven’t found any public interfaces in the latest version of PyO3 that allow us to manipulate the internal counter directly. As a result, this issue cannot be resolved for now. We are looking forward to native support for subinterpreters in PyO3 soon.

You should use Python::with_gil to ensure a valid Python thread state within the subinterpreter. PyO3 will bump the internal counter as part of that call.

@davidhewitt
Copy link
Member

I would hasten to note that the internal counter is really an implementation detail / optimization, and you should expect that what you're doing might break again with future PyO3 upgrades.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants