-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reworked wait_contex approach + scalability improvement in task_group #1345
Conversation
4605379
to
c3ea187
Compare
@@ -221,6 +221,34 @@ void notify_waiters(std::uintptr_t wait_ctx_addr) { | |||
governor::get_thread_data()->my_arena->get_waiting_threads_monitor().notify(is_related_wait_ctx); | |||
} | |||
|
|||
d1::wait_tree_vertex_interface* get_thread_reference_vertex(d1::wait_tree_vertex_interface* wc) { | |||
__TBB_ASSERT(wc, nullptr); | |||
auto& dispatcher = *governor::get_thread_data()->my_task_dispatcher; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can thread data be uninitialized here? Would get_thread_data_if_initialized()
with the assert __TBB_ASSERT(governor::get_thread_data() != nullptr, "")
be enough here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be not yet initialize. Consider this example:
tbb::task_group tg;
tg.run(); // get_thread_reference_vertex will be called but TBB not yet initialized
src/tbb/task.cpp
Outdated
@@ -221,6 +221,34 @@ void notify_waiters(std::uintptr_t wait_ctx_addr) { | |||
governor::get_thread_data()->my_arena->get_waiting_threads_monitor().notify(is_related_wait_ctx); | |||
} | |||
|
|||
d1::wait_tree_vertex_interface* get_thread_reference_vertex(d1::wait_tree_vertex_interface* wc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Perhaps, rename
wc
totop_wait_context
. - Are there plans to pass something different than pointer to
wait_context
into this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Renamed.
- Not really.
src/tbb/task.cpp
Outdated
auto& dispatcher = *governor::get_thread_data()->my_task_dispatcher; | ||
|
||
d1::reference_vertex* ref_counter{nullptr}; | ||
auto pos = dispatcher.m_reference_vertex_map.find(wc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To improve code readability I noticed that dispatcher.m_reference_vertex_map
is used many times within this function. I suggest making an alias to it first, and then use the alias in other parts of the function.
auto pos = dispatcher.m_reference_vertex_map.find(wc); | |
auot& reference_map = dispatcher.m_reference_vertex_map; | |
auto pos = reference_map.find(wc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/tbb/task.cpp
Outdated
} else { | ||
constexpr std::size_t max_reference_vertex_map_size = 1000; | ||
if (dispatcher.m_reference_vertex_map.size() > max_reference_vertex_map_size) { | ||
for (auto it = dispatcher.m_reference_vertex_map.begin(); it != dispatcher.m_reference_vertex_map.end();) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, there might be a situation when a processing of this loop might take significant amount of time. Shall we introduce some threshold that would regulate how much time a thread could spend cleaning the container?
Perhaps, put a TODO
note about this at least.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added TODO
std::uint64_t ref = m_ref_count.fetch_sub(static_cast<std::uint64_t>(delta)) - static_cast<std::uint64_t>(delta); | ||
if (ref == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps, this would make this part a bit more readable.
std::uint64_t ref = m_ref_count.fetch_sub(static_cast<std::uint64_t>(delta)) - static_cast<std::uint64_t>(delta); | |
if (ref == 0) { | |
std::uint64_t prev_ref_value = m_ref_count.fetch_sub(static_cast<std::uint64_t>(delta)); | |
if (prev_ref_value == static_cast<std::uint64_t>(delta)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. @kboyarinov what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Aleksei, since the arithmetic operation after the long fetch_sub
expression is not much noticeable.
If you are warned about the logical part, I guess the check can also be if (prev_ref_value - delta == 0)
but I am not sure about that.
} | ||
} | ||
|
||
void release(std::uint32_t delta = 1) override { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delta
can be negative, but here possible parameter's values are only positive. Consider, naming the parameter differently here and in other similar places. Perhaps, release_num
or simply num
would suit better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is identical to wait_contex
include/oneapi/tbb/detail/_task.h
Outdated
void release(std::uint32_t, const d1::execution_data&) override { | ||
__TBB_ASSERT(false, | ||
"This method is overloaded only to fulfill the base class interface requirements, and thus, it should not be called."); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually it means that the inheritance is incorrect. Introducing intermediate interface would not only avoid writing such implementations but also would add to code readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kboyarinov what do you think?
1 extra interface class or 1 method with assert in private section.
template<typename F> | ||
class function_task : public task { | ||
const F m_func; | ||
wait_context& m_wait_ctx; | ||
small_object_allocator m_allocator; | ||
|
||
void finalize(const execution_data& ed) { | ||
// Make a local reference not to access this after destruction. | ||
wait_context& wo = m_wait_ctx; | ||
// Copy allocator to the stack | ||
auto allocator = m_allocator; | ||
// Destroy user functor before release wait. | ||
this->~function_task(); | ||
wo.release(); | ||
|
||
allocator.deallocate(this, ed); | ||
} | ||
task* execute(execution_data& ed) override { | ||
task* res = d2::task_ptr_or_nullptr(m_func); | ||
finalize(ed); | ||
return res; | ||
} | ||
task* cancel(execution_data& ed) override { | ||
finalize(ed); | ||
return nullptr; | ||
} | ||
public: | ||
function_task(const F& f, wait_context& wo, small_object_allocator& alloc) | ||
: m_func(f) | ||
, m_wait_ctx(wo) | ||
, m_allocator(alloc) {} | ||
|
||
function_task(F&& f, wait_context& wo, small_object_allocator& alloc) | ||
: m_func(std::move(f)) | ||
, m_wait_ctx(wo) | ||
, m_allocator(alloc) {} | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was that removed because of possibility to reuse function_task
above that inherits task_handle_task
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Their layouts differ however. The PR is marked as backward compatible though. Is it because we are not certain about header-only backward incompatibilities or I miss something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will need to recompile the whole application.
It is explained here:
#1371
af785df
to
bdd7ba6
Compare
Signed-off-by: pavelkumbrasev <[email protected]>
Signed-off-by: pavelkumbrasev <[email protected]>
Signed-off-by: pavelkumbrasev <[email protected]>
Signed-off-by: pavelkumbrasev <[email protected]>
Signed-off-by: pavelkumbrasev <[email protected]>
Co-authored-by: Konstantin Boyarinov <[email protected]>
Signed-off-by: pavelkumbrasev <[email protected]>
Signed-off-by: pavelkumbrasev <[email protected]>
Signed-off-by: pavelkumbrasev <[email protected]>
Signed-off-by: pavelkumbrasev <[email protected]>
Co-authored-by: Konstantin Boyarinov <[email protected]>
Signed-off-by: pavelkumbrasev <[email protected]>
Signed-off-by: pavelkumbrasev <[email protected]>
Signed-off-by: pavelkumbrasev <[email protected]>
Signed-off-by: pavelkumbrasev <[email protected]>
Signed-off-by: pavelkumbrasev <[email protected]>
Co-authored-by: Aleksei Fedotov <[email protected]>
bdd7ba6
to
bd9a34c
Compare
Signed-off-by: pavelkumbrasev <[email protected]>
Other than this - LGTM |
Co-authored-by: Konstantin Boyarinov <[email protected]>
Signed-off-by: pavelkumbrasev <[email protected]>
Signed-off-by: pavelkumbrasev <[email protected]>
Description
reference_vertex
that should help with scalability problem that single wait_context.task_group
has a flat reference counting scheme i.e., there is a central reference counter where all the created tasks should increase/decrease reference during execution.This approach works fine while tasks are big and submitted from small number of threads (<8).
When multiple threads will start tree-like algorithm with a lot of tasks the overall performance of the application will drastically degrade with increasing number of threads due to huge synchronization cost.
This patch utilizes per thread reference counter in
task_group
.Fixes # - issue number(s) if exists
Type of change
Choose one or multiple, leave empty if none of the other choices apply
Add a respective label(s) to PR if you have permissions
Tests
Documentation
Breaks backward compatibility
Notify the following users
@vossmjp @kboyarinov
Other information