tbb::task_group thread scaling #313

Dr15Jones · 2020-12-09T21:32:01Z

As part of transitioning from the deprecated tbb::task API to tbb::task_group, I have been doing performance measurements on our applications. I have found that when using a single tbb::task_group we get highly diminished thread scaling. To illustrate the problem, I created four highly simplified versions of the main processing loop of our applications. The code for the simple applications can be found here: https://github.com/Dr15Jones/tbb_group_scaling. Each application does the same processing but uses TBB in a different way. The differences are:

1. using tbb::tasks directly, all created using allocate_root (this is how our application typically works)
2. using 1 tbb::task_group to launch all the needed work
3. using N tbb::task_groups, one task_group per requested thread
4. using tbb::tasks directly but created using allocate_additional_child_of (added after studying the performance of the other three cases)

When testing on either an Intel or AMD CPU, the single tbb::task_group was found to either not scale as the number of threads increased or to have extremely weak scaling compared to the other options. The tbb::task using allocate_additional_child_of had the best performance followed closely by the N tbb::task_groups case.

My question is, are there plans to improve the performance when using a single tbb::task_group? If not, is the use of multiple tbb::task_groups working together to share the load on creating tasks a supported use case? Alternatively, could a new API for creating a performant hierarchy of task_groups be developed in order to avoid doing a 'spin' loop over the task_group::wait calls?

Dr15Jones · 2020-12-09T21:39:58Z

To give some context, here is a plot of the throughput (effectively groups of actions per second) when using my Intel-based laptop with a 4-core Linux VM.

Here is a plot of the throughput for a 32-core AMD machine.

alexey-katranov · 2020-12-16T06:53:10Z

I'd slightly refactor the approach with N task_groups to make it similar to the child_task approach.
Replace https://github.com/Dr15Jones/tbb_group_scaling/blob/master/with_multiple_groups.cc#L46-L61 with:

    // One outer group is used only to launch the lanes and wait for them.
    tbb::task_group group;
    auto start = std::chrono::high_resolution_clock::now();
    for (unsigned int i = 0; i < nLanes; ++i) {
        group.run(
            [&nEventsProcessed, nEvents, nChains]() {
                // Each lane gets its own nested group for the tasks it creates.
                tbb::task_group lane_group;
                lane_group.run_and_wait([&nEventsProcessed, nEvents, nChains, &lane_group]() {
                    workInLane(nEventsProcessed, nChains, nEvents, lane_group, 0);
                });
            });
    }

    // Returns once every lane's run_and_wait has returned.
    group.wait();

Also you do not need iNGroupsDone any more. Just remove https://github.com/Dr15Jones/tbb_group_scaling/blob/master/with_multiple_groups.cc#L14

This uses a single outer task_group to wait. It is not as elegant as child_task because the task_groups are nested, but it should scale well.

Dr15Jones · 2020-12-16T15:37:53Z

@alexey-katranov thank you for taking the time to look at this. Unfortunately, although my example properly shows the performance characteristics of our actual application, it does not exhibit the full range of its capabilities. In the full application, the equivalent of the for loop can spawn multiple independent tasks (in addition to each task starting another task once it finishes). There are also cases that force synchronization across the tasks in different iterations of the for loop, so multiple tasks from the same for loop iteration can run concurrently on different threads. Therefore run_and_wait is not an option for us; we need to call run.

alexey-katranov · 2020-12-16T17:51:10Z

Thank you for the clarification. We will think about how we can improve the tasking interfaces to cover such cases. Notify: @aleksei-fedotov

pavelkumbrasev · 2024-07-12T10:28:05Z

@Dr15Jones it took a while, but task_group is scalable now 😄 (see commit 1f52f50)

alexey-katranov added the enhancement label Dec 16, 2020

kkm000 mentioned this issue Jun 15, 2021: install tcmalloc kaldi-asr/kaldi#4564 (Merged)

pavelkumbrasev mentioned this issue Feb 9, 2024: Improve scalability in task_group #1310 (Open, 14 tasks)

pavelkumbrasev closed this as completed Jul 12, 2024
