-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ExtraBuffers: revamping of the idea #802
base: master
Are you sure you want to change the base?
Conversation
343d087
to
4e52fcf
Compare
(used by both local and dist)
4e52fcf
to
f2d49df
Compare
not yet as continuation
f2d49df
to
d9da590
Compare
ex::start_detached(tile::set0(dlaf::internal::Policy<B>(), w2.readwrite_sender(LocalTileIndex(0, 0)))); | ||
ex::start_detached(buffers.reduce(w2.readwrite_sender(LocalTileIndex(0, 0)))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably better to create a continuation with a single task?
[[nodiscard]] auto reduce(TileSender tile) { | ||
namespace ex = pika::execution::experimental; | ||
|
||
std::vector<ex::any_sender<pika::shared_future<matrix::Tile<const T, D>>>> buffers; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The type is so cumbersome, probably @msimberg is aware of this and is going to say that it is something we will manage when we will make the tile sender mechanism pika::future free.
I don't know if there is any workaround available, but AFAIK currently our Unwrapping
facilities does not unwrap vector
, so we just get it sent as it is even using our dlaf::internal::transform
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your complaint, but typedef
exists 😉 I see that complex eventually being a helper typedef inside e.g. Matrix
, or at least in the dlaf::matrix
namespace and then you'll be able to write std::vector<ReadOnlyTileSender>
or something like that. Would that partially or completely take care of your concerns here? This is what I'm planning on doing in #766 (because currently it's also dealing with long unwieldy types).
include/dlaf/matrix/extra_buffers.h
Outdated
return ex::when_all(std::move(tile), ex::when_all_vector(std::move(buffers))) | | ||
dlaf::internal::transform(dlaf::internal::Policy<DefaultBackend_v<D>>(), | ||
[](const matrix::Tile<T, D>& tile, | ||
const std::vector<pika::shared_future<matrix::Tile<const T, D>>>& | ||
buffers, | ||
auto&&... ts) { | ||
for (const auto& buffer : buffers) { | ||
if constexpr (D == Device::CPU) { | ||
static_assert(sizeof...(ts) == 0, | ||
"Parameter pack should be empty for MC."); | ||
dlaf::tile::internal::add(T(1), buffer.get(), tile); | ||
} | ||
#ifdef DLAF_WITH_GPU | ||
else if constexpr (D == Device::GPU) { | ||
dlaf::tile::internal::add(T(1), buffer.get(), tile, ts...); | ||
} | ||
#endif | ||
else { | ||
DLAF_STATIC_UNIMPLEMENTED(T); | ||
} | ||
} | ||
}); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really not nice...15 lines for saying
for (const auto& buffer : buffers)
dlaf::tile::internal::add(T(1), buffer.get(), <optional whip::stream_t>);
I know we can go with something like
dlaf::tile::internal::add(T(1), buffer.get(), ts...);
but we will loose a bit of static checking...
do we have any better alternative?
This is a revamping of a concept we already tested in #386. Due to the important changes happened to the codebase in the meanwhile, together with the fact that it was just a feasibility test, we opted for the easier path of just re-implementing the idea.
ExtraBuffers is a utility that comes handy for the "reduce" pattern, i.e. multiple tasks wants to write on the same memory. ExtraBuffers gives you a set of "buffer" tiles that tasks can use to write in parallel, and at the end the reduction is done on request.
Note
Differently from the original one, this PR went for a simpler implementation, where "partial computations" are stored exclusively on ExtraBuffers tiles, without using the tile where the result will be reduced afterwards.
In this PR there is also a first usage example, in W2 computation inside ReductionToBand.
TODO