tbb on wasm always executed on the main thread. #1287
Comments
After several comparisons, I noticed a pattern. My code performs voxel processing with OpenVDB, and I encapsulated that functionality in a single function.
It seems that TBB needs a warm-up, so I made some changes. I compiled the code using Emscripten and added
In this code snippet, threads can directly make use of multiple cores without the pre-warming that TBB needs. I wonder if there is any way to work around this issue or to adjust some mechanism in TBB.
However, I have found a possible workaround: execute the following code segment right after the program starts, acting as warm-up code.
I found that executing this nearly no-op code ahead of time enables subsequent OpenVDB calls to use multiple cores efficiently.
Hi, did you face this issue with TBB before porting to WASM? As you said, it doesn't seem to be a WASM issue but an inherent TBB issue. I will investigate this further and keep you updated.
I have been using it in non-web scenarios, mainly on macOS, and it works well there.
Hi @JhaShweta1
The phenomenon is that computing with TBB (Threading Building Blocks) is much slower than using a single thread, approximately three times slower.
Just like the previous method, warm up TBB by using the code snippet below.
OpenSubdiv extensively utilizes
Afterwards, CPU utilization never exceeds 100%, which is quite peculiar. As a result, performance is significantly worse than the single-threaded version that does not use TBB.
Hi, I am also encountering similar issues, but only for nodejs and not in the browser. For nodejs 18 with |
There seems to be no way |
I wonder if this is related to the scheduler in TBB. I am not familiar with the internals, so I cannot say much. I can try to create an MRE and provide detailed environment information (Emscripten, browser, and Node.js versions) if that helps.
Hi, |
Sure, but this will take some time as I am busy with other things right now. Debugging this wasm weirdness takes quite a lot of time... Hopefully I have more time next week to do this. |
Consider the following code:

#include <chrono>
#include <iostream>
#include <thread>

#include "oneapi/tbb/parallel_for.h"

using namespace std::chrono_literals;

int main() {
  auto start = std::chrono::high_resolution_clock::now();
  // Each chunk sleeps for one second, then prints the time elapsed since start.
  oneapi::tbb::parallel_for( //
      oneapi::tbb::blocked_range<std::size_t>(0, 10), [&](const auto &r) {
        std::this_thread::sleep_for(1s);
        auto end = std::chrono::high_resolution_clock::now();
        std::cout << "worker: "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(
                         end - start)
                         .count()
                  << std::endl;
      });
  return 0;
}

Example results:
The results are close to 1000 ms, indicating that the chunks do run on multiple threads. However, CPU utilization never exceeds 100% for a compute-heavy workload:

#include <chrono>
#include <iostream>
#include <thread>

#include "oneapi/tbb/parallel_for.h"

using namespace std::chrono_literals;

int main() {
  auto start = std::chrono::high_resolution_clock::now();
  oneapi::tbb::parallel_for( //
      oneapi::tbb::blocked_range<std::size_t>(0, 10), [](const auto &r) {
        // CPU-bound busy work: count Collatz steps for every starting value.
        long long steps = 0;
        for (long long i = 2; i < 1000000000000; i++) {
          long long n = i;
          while (n != 1) {
            if (n % 2)
              n = (3 * n + 1) / 2;
            else
              n /= 2;
            steps++;
          }
        }
        std::cout << "good " << steps << std::endl;
      });
  return 0;
}

time node a.js
node a.js 6.22s user 0.03s system 101% cpu 6.147 total

# emcmake cmake -DCMAKE_BUILD_TYPE=Release -DEMSCRIPTEN_SYSTEM_PROCESSOR=web ..
cmake_minimum_required(VERSION 3.11)
project(test)
include(FetchContent)
set(TBB_TEST OFF CACHE INTERNAL "" FORCE)
set(TBB_STRICT OFF CACHE INTERNAL "" FORCE)
FetchContent_Declare(TBB
GIT_REPOSITORY https://github.com/oneapi-src/oneTBB.git
GIT_TAG v2021.11.0
)
FetchContent_MakeAvailable(TBB)
set(CMAKE_CXX_FLAGS "-pthread")
set(CMAKE_EXE_LINKER_FLAGS "-pthread -sPTHREAD_POOL_SIZE=4 -sINITIAL_MEMORY=1gb")
add_executable(a a.cpp)
target_link_libraries(a PUBLIC TBB::tbb)
target_link_options(a PUBLIC -pthread)
I also found the same issue in my project. Since we are constrained to JavaScript on the web, we also made a reproducible Docker environment for that case:
Note: this starts Docker in detached mode, so you need to stop it manually. If you didn't run any other Docker image, just remove the last one with
Output:
As you can see, multiple threads are possible in the same C++ program, but the TBB scheduler still binds tasks to the main thread.
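One way to make the "which thread ran my task" observation concrete is to record thread IDs from inside the parallel body and count the distinct ones. The sketch below is only an illustration using the standard library: `ThreadIdRecorder` and `distinct_threads_with_std_thread` are hypothetical names, not part of TBB or Emscripten. The reference run uses plain `std::thread`, which is the case reported to work in this thread.

```cpp
#include <cstddef>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Hypothetical helper (not part of TBB): collects the IDs of all threads
// that call record(), so the number of distinct workers can be inspected.
class ThreadIdRecorder {
 public:
  // Call from inside the parallel body; safe to call concurrently.
  void record() {
    std::lock_guard<std::mutex> lock(mutex_);
    ids_.insert(std::this_thread::get_id());
  }

  // Number of distinct threads that have called record() so far.
  std::size_t distinct_threads() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return ids_.size();
  }

 private:
  mutable std::mutex mutex_;
  std::set<std::thread::id> ids_;
};

// Reference run with plain std::thread: each of the n threads records itself
// once. All threads are joined only after every one has been started, so
// their IDs cannot be reused while the run is in progress.
std::size_t distinct_threads_with_std_thread(std::size_t n) {
  ThreadIdRecorder recorder;
  std::vector<std::thread> threads;
  for (std::size_t i = 0; i < n; ++i) {
    threads.emplace_back([&recorder] { recorder.record(); });
  }
  for (auto& t : threads) t.join();
  return recorder.distinct_threads();
}
```

Calling `recorder.record()` at the top of a `parallel_for` body and printing `distinct_threads()` afterwards would presumably report 1 when every task stays on the main thread.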
Certainly, TBB on WebAssembly (WASM) is very unstable, but some open-source projects depend on it. It seems like the official team doesn't pay much attention to the bugs discussed here. I wonder if we should consider abandoning this library in the future.
Hi all, sorry to hear you have had such problems. Our team is not yet expert in WASM. We are also new to this technology, so it takes us longer to react to such problems.
@pavelkumbrasev I think this may be related to issue #1341. I tried to add an observer to at least log the entry point of the threads and hit the error described in #1341. Once that was solved, I found that the observer hooks are called only after all the parallel loops have been invoked (see #1341 (comment) for more details). I think there must be a bug during thread initialization related to my comment in that issue.
After testing, it has been found that std::thread can utilize all the cores in almost all scenarios. |
@jellychen, I'm not really familiar with the WASM work model. Is there a chance you can print the thread stacks during the parallel section, while CPU utilization equals one running thread, so we can see whether the threads are sleeping in the thread pool for some reason or whether their stacks are also involved in the computation?
Same behavior. I recompiled in debug mode and got this:
@b-qp I believe we saw this problem before with the static version of TBB (and only with the static version).
I'm sorry for the late response; I've been on vacation recently. I'm not quite sure how to print the call stack. Could you tell me the exact steps? |
@jellychen, it will be just a guess, because I'm not familiar with the technology either.
Maybe Wasm does not support gdb debugging.
Could you please provide steps to reproduce the issue? (If you can do it with the debug version of the library, that will also be helpful.)
Almost nothing special is required. As long as you compile to wasm and run the simplest parallel task, the problem occurs 100% of the time.
@pavelkumbrasev see my comment above (#1287 (comment)). |
I suspect that TBB's multithreading mechanism does not work effectively under Emscripten's web worker mechanism. It might not be an issue with TBB; perhaps it is a problem with the web platform itself. However, I have found a workaround by implementing a set of interfaces similar to TBB's, although not the complete set. Many pieces of software use only parts of the TBB interface, mainly task_group, parallel_sort, parallel_for, and parallel_reduce. My approach initializes a std::thread pool at startup and then bridges these interfaces to std::thread. So far, this has performed better than TBB in experiments with some software. Currently, the multithreaded performance of TBB in some wasm software, such as OpenVDB, is even worse than its single-threaded performance. I hope this can help developers working on wasm.
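The bridging approach described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the commenter's actual implementation: `StdThreadPool` and its `parallel_for` member are hypothetical names, only an index-based `parallel_for` is covered, and nested calls from inside a worker are not handled.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-in for a small part of the TBB interface, backed by a
// fixed pool of std::thread workers that is created once at startup.
class StdThreadPool {
 public:
  explicit StdThreadPool(std::size_t n) {
    const std::size_t count = std::max<std::size_t>(1, n);
    for (std::size_t i = 0; i < count; ++i) {
      workers_.emplace_back([this] { worker_loop(); });
    }
  }

  ~StdThreadPool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_ = true;
    }
    cv_.notify_all();
    for (auto& w : workers_) w.join();
  }

  StdThreadPool(const StdThreadPool&) = delete;
  StdThreadPool& operator=(const StdThreadPool&) = delete;

  // Runs body(i) for every i in [begin, end), split into one contiguous
  // chunk per worker, and blocks until all chunks have finished.
  void parallel_for(std::size_t begin, std::size_t end,
                    const std::function<void(std::size_t)>& body) {
    if (begin >= end) return;
    const std::size_t n = end - begin;
    const std::size_t chunks = std::min(n, workers_.size());
    const std::size_t step = (n + chunks - 1) / chunks;  // ceiling division
    std::size_t remaining = 0;
    std::mutex done_mutex;
    std::condition_variable done_cv;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      for (std::size_t lo = begin; lo < end; lo += step) {
        const std::size_t hi = std::min(end, lo + step);
        ++remaining;
        tasks_.push([&, lo, hi] {
          for (std::size_t i = lo; i < hi; ++i) body(i);
          std::lock_guard<std::mutex> done_lock(done_mutex);
          if (--remaining == 0) done_cv.notify_one();
        });
      }
    }
    cv_.notify_all();
    std::unique_lock<std::mutex> done_lock(done_mutex);
    done_cv.wait(done_lock, [&] { return remaining == 0; });
  }

 private:
  void worker_loop() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // stop requested and nothing left to run
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool stop_ = false;
  std::vector<std::thread> workers_;  // declared last: started after the rest
};
```

Under Emscripten, the pool size would presumably have to stay within the `-sPTHREAD_POOL_SIZE` value used at link time, since the constructor blocks no further but the workers can only start once pthreads are available.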
@jellychen, I'm not sure the problem is in Emscripten. I was able to reproduce the described behavior, and from my perspective something is odd. I will continue investigating the problem.
@jellychen I have summarized my analysis as a set of questions in an Emscripten discussion:
I have also read quite a bit of the TBB code, and I will keep tracking this issue with the hope that TBB gets even better. I'm grateful for the work you've done. |
Hi @jellychen and @SoilRos, I was thinking about the best way to overcome the current problem.
Based on the current situation, it doesn't seem feasible. I have many interfaces within my WebAssembly module that require manipulation of the DOM from the main thread, and there's no way to migrate them out of the main thread. |
That probably applies to a lot of WebAssembly users. I will try to come up with a solution that keeps this in mind.
I guess I now understand why our project with TBB works fine in the browser but not with Node.js. I will try to figure out how to make it work there. Thanks.
@pavelkumbrasev while technically my program could use a proxy thread, because I do not manipulate the DOM from C++ yet, I had problems passing this flag through my stack and haven't managed to get it to compile yet. So a solution that does not involve this flag would be very much appreciated!
We encountered a similar issue when exporting C++ code to WASM. We used TBB's task_arena, and part of our code looked like this:
We also experienced the problem where only the main thread was executing. Later, we noticed that -sPTHREAD_POOL_SIZE=4 was set in the exported CMakeLists. After we manually set the concurrency value to 4, the program achieved the expected parallel performance.
Hi @sotha-sil-zen, thank you for sharing your project's experience.
We came to this conclusion by chance. At the time, we suspected that hardware_concurrency() was reading the configuration of the development machine. When compiling to WASM, -sPTHREAD_POOL_SIZE= actually sets the upper limit for concurrency. Generally, the number of threads allowed on the development machine is higher than the PTHREAD_POOL_SIZE value. This inconsistency might prevent the main thread from handing work over to other threads during execution.
Here is the code for reference: https://github.com/sotha-sil-zen/embind_with_tbb_test
emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.61 (67fa4c16496b157a7fc3377afd69ee0445e8a6e3)
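The idea of capping TBB's concurrency at the pthread pool size could be sketched as follows. This is an assumption-laden illustration, not the code from the linked repository: `kPthreadPoolSize`, `pool_limited_concurrency`, and `parallel_for_capped` are hypothetical names, and the TBB part is only compiled when the oneTBB headers are found.

```cpp
#include <algorithm>
#include <cstddef>

#if __has_include("oneapi/tbb/task_arena.h")
#include "oneapi/tbb/info.h"
#include "oneapi/tbb/parallel_for.h"
#include "oneapi/tbb/task_arena.h"
#define SKETCH_HAVE_TBB 1
#endif

// Assumption: this mirrors the -sPTHREAD_POOL_SIZE=4 link flag mentioned in
// this thread; the two values have to be kept in sync by hand.
constexpr int kPthreadPoolSize = 4;

// Never ask for more threads than the pre-allocated pool can supply,
// and never for fewer than one.
inline int pool_limited_concurrency(int reported, int pool_size) {
  return std::max(1, std::min(reported, pool_size));
}

#ifdef SKETCH_HAVE_TBB
// Hypothetical wrapper: runs body(i) for i in [begin, end) inside an arena
// whose concurrency is capped at the pthread pool size.
template <typename Body>
void parallel_for_capped(std::size_t begin, std::size_t end, const Body& body) {
  oneapi::tbb::task_arena arena(pool_limited_concurrency(
      oneapi::tbb::info::default_concurrency(), kPthreadPoolSize));
  arena.execute([&] { oneapi::tbb::parallel_for(begin, end, body); });
}
#endif
```

With a reported default concurrency above 10, as in the original report, and a pool size of 4, the arena would be created with a concurrency of 4, which matches the manual setting described above.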
On the wasm platform, both tbb::task_group and tbb::parallel_for are always executed on the main thread, whereas std::thread runs on a separate thread, even though oneapi::tbb::info::default_concurrency() > 10. What causes this?