What does task_group_context do? #1278

Closed
blonded04 opened this issue Dec 6, 2023 · 8 comments

@blonded04 (Contributor)

While modifying the TBB sources, I have been trying really hard to understand what happens in task_group_context and task_group_context_impl, but I failed miserably.

I'm trying to construct a task_group_context and then use it to execute a task in multiple separate threads, in the local_wait_for_all loop, via the execute_and_wait function. However, no matter what I do, it ends up like this (gdb output):

tbb::detail::r1::task_group_context_impl::copy_fp_settings (ctx=..., src=...) at onetbb-src/src/tbb/task_group_context.cpp:278
278         new (&ctx.my_cpu_ctl_env) d1::cpu_ctl_env(*src_ctl);
(gdb) bt
#0  tbb::detail::r1::task_group_context_impl::copy_fp_settings (ctx=..., src=...) at onetbb-src/src/tbb/task_group_context.cpp:278
#1  0x0000aaaaaaac24c0 in tbb::detail::r1::task_group_context_impl::bind_to_impl (ctx=..., td=td@entry=0xffff98000900)
    at onetbb-src/src/tbb/task_group_context.cpp:123
#2  0x0000aaaaaaac2628 in tbb::detail::r1::task_group_context_impl::bind_to (ctx=..., td=td@entry=0xffff98000900) at onetbb-src/src/tbb/task_group_context.cpp:188
#3  0x0000aaaaaaabeb88 in tbb::detail::r1::task_dispatcher::execute_and_wait (t=t@entry=0xffff98000e00, wait_ctx=..., w_ctx=...)
    at onetbb-src/src/tbb/task_dispatcher.cpp:161
#4  0x0000aaaaaaabf34c in tbb::detail::r1::execute_and_wait (t=..., t_ctx=..., wait_ctx=..., w_ctx=...) at onetbb-src/src/tbb/task_dispatcher.cpp:121
#5  0x0000aaaaaaaaad70 in tbb::detail::d1::execute_and_wait (w_ctx=..., wait_ctx=..., t_ctx=..., t=...)
    at onetbb-src/include/oneapi/tbb/detail/_task.h:191
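
For reference, the call on each thread looks roughly like this (a sketch, not my exact code; my_task and my_ctx are placeholders for my own objects, and the call shape mirrors start_for::run in parallel_for.h):

tbb::detail::d1::wait_context wait_ctx(1);  // one task to wait for
// task + its context, then the wait object + the context to wait in:
tbb::detail::d1::execute_and_wait(*my_task, my_ctx, wait_ctx, my_ctx);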

I tried:

  • sharing one context between all execute_and_wait calls
  • creating a separate context for each call

Also, I tried constructing the task_group_context with 2 different argument sets (2 * 2 = 4 configurations in total; both constructions are sketched after this list):

  • PARALLEL_FOR
  • tbb::task_group_context::bound, tbb::task_group_context::default_traits | tbb::task_group_context::concurrent_wait
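
Sketched out, the two constructions look like this (PARALLEL_FOR is an internal name used inside oneTBB's parallel_for.h, so the namespace qualification below is approximate; the second form is the public constructor):

// internal constructor, as used by the library's own algorithms
// (qualification approximate):
tbb::detail::d1::task_group_context ctx1(tbb::detail::d1::PARALLEL_FOR);
// public constructor:
tbb::task_group_context ctx2(tbb::task_group_context::bound,
                             tbb::task_group_context::default_traits
                                 | tbb::task_group_context::concurrent_wait);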

Yet I still get the same error :(

Thank you very much for all your previous answers btw.

@pavelkumbrasev (Contributor)

Hi @blonded04,
Sorry, I didn't understand your question.
What is the problem you highlighted in gdb?

@blonded04 (Contributor, Author)

> Hi @blonded04, Sorry, I didn't understand your question. What is the problem you highlighted in gdb?

I get segfaults on that line :(

new (&ctx.my_cpu_ctl_env) d1::cpu_ctl_env(*src_ctl);

Probably something is wrong with ctx.my_cpu_ctl_env (potentially null?) or with src_ctl.
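
For context, copy_fp_settings reads the FPU settings out of the source context and placement-constructs them into ours; paraphrased from memory (not the exact oneTBB source), the surrounding lines are roughly:

// paraphrased sketch of task_group_context_impl::copy_fp_settings:
const d1::cpu_ctl_env* src_ctl =
    reinterpret_cast<const d1::cpu_ctl_env*>(&src.my_cpu_ctl_env);
new (&ctx.my_cpu_ctl_env) d1::cpu_ctl_env(*src_ctl);  // the faulting line

So a dangling src (for example, a parent context whose owner thread is already gone) would fault on the *src_ctl dereference rather than on &ctx.my_cpu_ctl_env.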

@pavelkumbrasev (Contributor)

It's hard to say what went wrong. Could you provide a reproducer?

@blonded04 (Contributor, Author) commented Dec 7, 2023

Sorry, it took a while; the smallest reproducer for the task_group_context problem that I could get is a 65-line diff to TBB (Ubuntu, Ryzen 5500U).

My hypothesis is that TBB does not like a context being created inside local_wait_for_all when a context already exists there. If I'm right, how can I safely change the context?

main.cpp:

#include <atomic>
#include <iostream>

// shared_state between tbb and main.cpp
#include <oneapi/tbb/problems.h>
#include <tbb/parallel_for.h>

constexpr unsigned nthread = 12u;

int main() {
    // force nthread threads to join the arena (they won't leave, because the wait limit in the task dispatcher is increased)
    std::atomic<unsigned> spin_barrier(nthread);
    tbb::parallel_for(tbb::blocked_range<int>(0, nthread), [&spin_barrier](tbb::blocked_range<int>) {
        spin_barrier.fetch_sub(1, std::memory_order_release);
        while (spin_barrier.load(std::memory_order_acquire)) {
            asm volatile ("pause\npause\npause\npause");
        }
    });
    std::cout << "parallel_for_finished" << std::endl;

    // enable problematic behaviour
    tbb::set_flag();

    // some time later we will face SEGFAULT
    while (tbb::get_counter() < 1000000u) {
        std::cout << "\t" << tbb::get_counter() << std::endl;
    }

    // we will never even get here
    tbb::set_flag(false);
}

GDB backtrace for segfault:

Thread 2 "main" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff799c700 (LWP 12481)]
tbb::detail::r1::context_guard_helper<false>::set_ctx (ctx=0x7ffff799bcc0, this=0x7ffff799bb80) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/scheduler_common.h:155
155	            curr_cpu_ctl_env.set_env();
(gdb) bt
#0  tbb::detail::r1::context_guard_helper<false>::set_ctx (ctx=0x7ffff799bcc0, this=0x7ffff799bb80) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/scheduler_common.h:155
#1  tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7ffff0001e80, this=0x46a480) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/task_dispatcher.h:311
#2  tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x46a480) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/task_dispatcher.h:472
#3  tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/task_dispatcher.cpp:168
#4  0x000000000040c0c7 in tbb::detail::d1::execute_and_wait (w_ctx=..., wait_ctx=..., t_ctx=..., t=warning: RTTI symbol not found for class 'tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned int>, tbb::detail::r1::task_dispatcher::receive_or_steal_task<false, tbb::detail::r1::outermost_worker_waiter>(tbb::detail::r1::thread_data&, tbb::detail::r1::execution_data_ext&, tbb::detail::r1::outermost_worker_waiter&, long, bool, bool)::{lambda(tbb::detail::d1::blocked_range<unsigned int>)#1}, tbb::detail::d1::auto_partitioner const>'
...) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/../../include/oneapi/tbb/detail/_task.h:191
#5  tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned int>, tbb::detail::r1::task_dispatcher::receive_or_steal_task<false, tbb::detail::r1::outermost_worker_waiter>(tbb::detail::r1::thread_data&, tbb::detail::r1::execution_data_ext&, tbb::detail::r1::outermost_worker_waiter&, long, bool, bool)::{lambda(tbb::detail::d1::blocked_range<unsigned int>)#1}, tbb::detail::d1::auto_partitioner const>::run(tbb::detail::d1::blocked_range<unsigned int> const&, {lambda(tbb::detail::d1::blocked_range<unsigned int>)#1} const&, tbb::detail::d1::auto_partitioner&, tbb::detail::d1::task_group_context&) (context=..., partitioner=..., body=..., range=...) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/../../include/oneapi/tbb/parallel_for.h:113
#6  tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned int>, tbb::detail::r1::task_dispatcher::receive_or_steal_task<false, tbb::detail::r1::outermost_worker_waiter>(tbb::detail::r1::thread_data&, tbb::detail::r1::execution_data_ext&, tbb::detail::r1::outermost_worker_waiter&, long, bool, bool)::{lambda(tbb::detail::d1::blocked_range<unsigned int>)#1}, tbb::detail::d1::auto_partitioner const>::run(tbb::detail::d1::blocked_range<unsigned int> const&, {lambda(tbb::detail::d1::blocked_range<unsigned int>)#1} const&, tbb::detail::d1::auto_partitioner&, tbb::detail::d1::task_group_context&) (context=..., partitioner=..., body=..., range=...) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/../../include/oneapi/tbb/parallel_for.h:105
#7  tbb::detail::d1::start_for<tbb::detail::d1::blocked_range<unsigned int>, tbb::detail::r1::task_dispatcher::receive_or_steal_task<false, tbb::detail::r1::outermost_worker_waiter>(tbb::detail::r1::thread_data&, tbb::detail::r1::execution_data_ext&, tbb::detail::r1::outermost_worker_waiter&, long, bool, bool)::{lambda(tbb::detail::d1::blocked_range<unsigned int>)#1}, tbb::detail::d1::auto_partitioner const>::run(tbb::detail::d1::blocked_range<unsigned int> const&, {lambda(tbb::detail::d1::blocked_range<unsigned int>)#1} const&, tbb::detail::d1::auto_partitioner&) (partitioner=..., body=..., range=...) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/../../include/oneapi/tbb/parallel_for.h:102
#8  tbb::detail::d1::parallel_for<tbb::detail::d1::blocked_range<unsigned int>, tbb::detail::r1::task_dispatcher::receive_or_steal_task<false, tbb::detail::r1::outermost_worker_waiter>(tbb::detail::r1::thread_data&, tbb::detail::r1::execution_data_ext&, tbb::detail::r1::outermost_worker_waiter&, long, bool, bool)::{lambda(tbb::detail::d1::blocked_range<unsigned int>)#1}>(tbb::detail::d1::blocked_range<unsigned int> const&, tbb::detail::r1::task_dispatcher::receive_or_steal_task<false, tbb::detail::r1::outermost_worker_waiter>(tbb::detail::r1::thread_data&, tbb::detail::r1::execution_data_ext&, tbb::detail::r1::outermost_worker_waiter&, long, bool, bool)::{lambda(tbb::detail::d1::blocked_range<unsigned int>)#1} const&) (body=..., range=...) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/../../include/oneapi/tbb/parallel_for.h:230
#9  tbb::detail::r1::task_dispatcher::receive_or_steal_task<false, tbb::detail::r1::outermost_worker_waiter> (this=this@entry=0x46a480, tls=..., ed=..., waiter=..., isolation=<optimized out>, fifo_allowed=<optimized out>, critical_allowed=<optimized out>) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/task_dispatcher.h:238
#10 0x0000000000407c68 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (waiter=..., t=0x0, this=0x46a480) at /usr/include/c++/11/bits/atomic_base.h:818
#11 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (waiter=..., t=0x0, this=0x46a480) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/task_dispatcher.h:472
#12 tbb::detail::r1::arena::process (this=<optimized out>, tls=...) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/arena.cpp:137
#13 0x0000000000418b0b in tbb::detail::r1::market::process (this=0x460400, j=...) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/market.cpp:599
#14 0x000000000041be19 in tbb::detail::r1::rml::private_worker::run (this=0x468e00) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/private_server.cpp:271
#15 0x000000000041bfad in tbb::detail::r1::rml::private_worker::thread_routine (arg=<optimized out>) at /home/valery/repos/tbb_test/cmake_deps/onetbb-src/src/tbb/private_server.cpp:221
#16 0x00007ffff7b9c609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#17 0x00007ffff7ac1133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Diff in TBB:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 47872941..eaa81b12 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -66,7 +66,7 @@ include(CMakeDependentOption)
 # Handle C++ standard version.
 if (NOT MSVC)  # no need to cover MSVC as it uses C++14 by default.
     if (NOT CMAKE_CXX_STANDARD)
-        set(CMAKE_CXX_STANDARD 11)
+        set(CMAKE_CXX_STANDARD 20)
     endif()
 
     if (CMAKE_CXX${CMAKE_CXX_STANDARD}_STANDARD_COMPILE_OPTION)  # if standard option was detected by CMake
@@ -108,7 +108,7 @@ option(TBB_DISABLE_HWLOC_AUTOMATIC_SEARCH "Disable HWLOC automatic search by pkg
 option(TBB_ENABLE_IPO "Enable Interprocedural Optimization (IPO) during the compilation" ON)
 
 if (NOT DEFINED BUILD_SHARED_LIBS)
-    set(BUILD_SHARED_LIBS ON)
+    set(BUILD_SHARED_LIBS OFF)
 endif()
 
 if (NOT BUILD_SHARED_LIBS)
diff --git a/include/oneapi/tbb/parallel_for.h b/include/oneapi/tbb/parallel_for.h
index 91c7c44c..fd3f6aee 100644
--- a/include/oneapi/tbb/parallel_for.h
+++ b/include/oneapi/tbb/parallel_for.h
@@ -29,6 +29,7 @@
 #include "task_group.h"
 
 #include <cstddef>
+#include <functional>
 #include <new>
 
 namespace tbb {
@@ -109,12 +110,12 @@ struct start_for : public task {
             // defer creation of the wait node until task allocation succeeds
             wait_node wn;
             for_task.my_parent = &wn;
-            execute_and_wait(for_task, context, wn.m_wait, context);
+            d1::execute_and_wait(for_task, context, wn.m_wait, context);
         }
     }
     //! Run body for range, serves as callback for partitioner
     void run_body( Range &r ) {
-        tbb::detail::invoke(my_body, r);
+        my_body(r);
     }
 
     //! spawn right task, serves as callback for partitioner
diff --git a/include/oneapi/tbb/problems.h b/include/oneapi/tbb/problems.h
new file mode 100644
index 00000000..fa380b53
--- /dev/null
+++ b/include/oneapi/tbb/problems.h
@@ -0,0 +1,37 @@
+#pragma once
+
+#include <atomic>
+
+namespace tbb {
+
+namespace internal {
+
+inline std::atomic<bool>& get_flag_impl() {
+    static std::atomic<bool> flag(false);
+    return flag;
+}
+
+inline std::atomic<unsigned>& get_counter_impl() {
+    static std::atomic<unsigned> counter(0u);
+    return counter;
+}
+
+} // namespace internal
+
+inline void set_flag(bool value=true) {
+    internal::get_flag_impl().store(value, std::memory_order_release);
+}
+
+inline bool get_flag() {
+    return internal::get_flag_impl().load(std::memory_order_acquire);
+}
+
+inline unsigned get_counter() {
+    return internal::get_counter_impl().load(std::memory_order_acquire);
+}
+
+inline void increment_counter() {
+    internal::get_counter_impl().fetch_add(1, std::memory_order_release);
+}
+
+} // namespace tbb
\ No newline at end of file
diff --git a/src/tbb/scheduler_common.h b/src/tbb/scheduler_common.h
index 9e103657..ddf082aa 100644
--- a/src/tbb/scheduler_common.h
+++ b/src/tbb/scheduler_common.h
@@ -254,7 +254,7 @@ public:
         // threshold value tuned separately for macOS due to high cost of sched_yield there
         , my_yield_threshold{10 * yields_multiplier}
 #else
-        , my_yield_threshold{100 * yields_multiplier}
+        , my_yield_threshold{10000000 * yields_multiplier}
 #endif
         , my_pause_count{}
         , my_yield_count{}
diff --git a/src/tbb/task_dispatcher.h b/src/tbb/task_dispatcher.h
index f6ff3f17..aa16bf57 100644
--- a/src/tbb/task_dispatcher.h
+++ b/src/tbb/task_dispatcher.h
@@ -30,6 +30,10 @@
 #include "itt_notify.h"
 #include "concurrent_monitor.h"
 
+#include "oneapi/tbb/task_group.h"
+#include "oneapi/tbb/parallel_for.h"
+#include "oneapi/tbb/problems.h"
+
 #include <atomic>
 
 #if !__TBB_CPU_CTL_ENV_PRESENT
@@ -229,6 +233,16 @@ d1::task* task_dispatcher::receive_or_steal_task(
         }
         // Nothing to do, pause a little.
         waiter.pause(slot);
+
+        if (get_flag()) {
+            tbb::parallel_for(
+                tbb::blocked_range<unsigned>{0u, arena_index + 1u}, 
+                [] (tbb::blocked_range<unsigned> range) {
+                    for (unsigned idx = range.begin(); idx < range.end(); idx++) {
+                        increment_counter();
+                    }
+                });
+        }
     } // end of nonlocal task retrieval loop
 
     __TBB_ASSERT(is_alive(a.my_guard), nullptr);

@blonded04 (Contributor, Author) commented Dec 7, 2023

What happens internally is that I end up calling local_wait_for_all inside local_wait_for_all, but not inside any other task.

Are there any assumptions that forbid this kind of behavior, and are there any workarounds that respect the existing invariants?
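
For comparison, nesting parallel work from inside a task body (i.e. from a normal dispatch loop rather than the steal loop) is supported. A minimal sketch using the public isolation API, with the range and body as placeholders:

#include <oneapi/tbb/blocked_range.h>
#include <oneapi/tbb/parallel_for.h>
#include <oneapi/tbb/task_arena.h>

void nested_work() {  // assume this is called from inside an outer TBB task
    // isolate() keeps the inner wait from executing unrelated outer tasks:
    tbb::this_task_arena::isolate([] {
        tbb::parallel_for(tbb::blocked_range<unsigned>(0u, 1000u),
                          [](const tbb::blocked_range<unsigned>& r) {
                              for (unsigned i = r.begin(); i != r.end(); ++i) {
                                  // per-element work
                              }
                          });
    });
}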

@sarathnandu (Contributor)

Hi @blonded04

It seems you are trying to run a parallel_for inside the steal loop, which is something we have never tried. Looking at your gdb output, the debug information is missing (possibly because optimizations are turned on), so I would ask you to run it in debug mode so that the assertions can provide some context on what is going on.

But I would like to clarify that running a parallel_for inside the steal loop is something we never recommend.
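
If work really must be triggered from scheduler-side code, one possible workaround (a sketch under that assumption, not an official pattern) is to hand it to an arena as an ordinary task via the public enqueue API, so the parallel_for runs under a normal dispatch loop instead of the steal loop:

#include <oneapi/tbb/blocked_range.h>
#include <oneapi/tbb/parallel_for.h>
#include <oneapi/tbb/task_arena.h>

void kick_off_work(tbb::task_arena& arena) {
    arena.enqueue([] {  // runs later, as a regular task
        tbb::parallel_for(tbb::blocked_range<unsigned>(0u, 1000u),
                          [](const tbb::blocked_range<unsigned>& r) {
                              for (unsigned i = r.begin(); i != r.end(); ++i) {
                                  // per-element work, e.g. the reproducer's increment_counter()
                              }
                          });
    });
}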

@sarathnandu (Contributor)

Hi @blonded04 , can we close this issue?

@blonded04 (Contributor, Author)

> Hi @blonded04, can we close this issue?

Yes! Thank you very much for the help.
