Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement simple sequenced keys constraint #273

Merged
merged 62 commits into from
Nov 15, 2024

Conversation

devreal
Copy link
Contributor

@devreal devreal commented Feb 27, 2024

Here is a first implementation of constraints. The flow is as follows:

  1. The application creates a constraint that it can pass to one or more TTs. Currently, all TTs must have the same key (like BcastA/B). We have to supply different mappers for BcastA/B because the keys contain different information (for A, k is in key[1], for B it's in key[0]). The mapper extracts the sequence number from the key and returns it as an ordinal to the constraint. The mapper can be either provided to the constraint directly or to the TT, which then passes it on to the constraint. That (and handling void :/) makes the code somewhat more complex than needed. Users probably will implement only one version for the configuration they need.
  2. Once a task has all its inputs, it will query the constraint whether the key can be scheduled for execution. If not, the TT will store the task in a hash table that is used once the key is released.
  3. When a task completes, the TT reports to constraint that a given key has completed. It is up to the constraint what to do with that information. In the example constraint, we decrement a counter and once it's zero we release the next wave of tasks.
  4. TTs register themselves as a listener and are informed about a new set of tasks that is released. Releasing tasks in bulk gives us some room for optimization. Ultimately, that's why I diverted from the release callback per task that we talked about in Knoxville.

User-defined constraints should inherit from ConstraintBase to get the listener handling and have to provide the proper check and complete callbacks.

I still need to remove the control flow we have today in SPMM and check that it works as expected. Overall, the user-facing interface is quite clean and the internal interface is not terrible (a win already).

@@ -1272,6 +1274,9 @@ namespace ttg_parsec {

bool m_defer_writer = TTG_PARSEC_DEFER_WRITER;

std::vector<ttg::meta::detail::constraint_callback_t<keyT>> constraints_check;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I see the difference between constraints_check and constraints_complete. It looks like we always push_back a callback to both vectors (or to none), and sometimes we iterate over constraints_check only, but since we appear to never remove the callback from either, I'm not sure I follow why we need both (not saying I found a problem, just wanting more explanations :) )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll put some comments there. These are two different callbacks:

  1. constraints_check hands the task over to the constraint, which will either tell us directly that it is not constrained (i.e., can be executed) or stores the key for later release through a callback (the listener action) and tells us that the key is indeed constrained.
  2. constraints_complete tells the constraint that the task has executed (so that the constraint can check for new tasks to release through the listener action)

@@ -292,11 +292,16 @@ class SpMM25D {
, parallel_bcasts_(parallel_bcasts) {
Edge<Key<2>, void> a_ctl, b_ctl;
Edge<Key<2>, int> a_rowctl, b_colctl; // TODO: can we have multiple control inputs per TT?
auto constraint = ttg::make_shared_constraint<ttg::SequencedKeysConstraint<Key<2>>>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks neat indeed.

I'm slightly worried about the limitation that the constraint needs to work on a single key type.

We don't need it in GEMM, because by constraining the broadcast tasks, we indirectly constrain the GEMM tasks, but let's say that I wanted to explicitly constrain the GEMM tasks too: I would need to define a new shared constraint with SequencedKeysConstraint<Key<3>> and I would have to release the constraint on both shared constraints to make them progress simultaneously?

Also, I don't see in the GEMM code any explicit release of constraints? Who makes them progress?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I have not found a good way to type pun the keys. Today, we couldn't include the gemm task in the constraint because the key types are different (Key<2> and Key<3>). I'm sure there is way to do that, I'm not sure about the complexity...

/**
* Used for external key mapper.
*/
SequencedKeysConstraint()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not what I had in mind. Might work too, but let's go over it during the next meeting, to make sure it does what we want.

AFAICT, the SequencedKeyConstraint makes sure tasks are released in the order of the integer returned by the mapping function; release of tasks can be paused (calling the stop() method) and resumed (calling the start() method), but I'm not sure what calls stop or stat in the GEMM example. It looks to me that with this implementation, if we discover enough BCastA / BCastB tasks, we'll push them in the right order in the scheduler (which is already a progress), but there is no mechanism to actually prevent a BCastA with a high 'k' to be scheduled if it is discovered first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the SPMM example we attach the mapping function to the TT (the keys for BcastA/B are the same type but we need different information from them). We cannot pass one mapping here. The different check functions below are conditionally enabled on whether a mapper was provided to the constraint or whether the TT has to provide the ordinal.

@devreal
Copy link
Contributor Author

devreal commented Jun 6, 2024

The constraints need to be ported to the MADNESS backend. I'd like to refactor the common parts out into the TT base class to avoid duplication yet another bunch of code.

devreal and others added 8 commits June 20, 2024 22:09
Constraints should be checked in the order in which they were added.
This avoids deadlocks if all constraints ensure progress (e.g., they
don't wait for a specific task to appear) but may lead to different
outcomes (execution orders) if the order of constraints changes.

Signed-off-by: Joseph Schuchart <[email protected]>
Replaced by constraints.

Signed-off-by: Joseph Schuchart <[email protected]>
Auto-release makes sure there are no deadlocks by enabling the next
set of keys once the current ordinal is done. Without auto-release
applications must release the next set explitly and ensure there
are no deadlocks.

Signed-off-by: Joseph Schuchart <[email protected]>
The constructor of the SequencedKeysConstraint takes a Boolean
argument that determines whether tasks are automatically released
or whether we depend on the application to release the next wave of tasks.

Signed-off-by: Joseph Schuchart <[email protected]>
Signed-off-by: Joseph Schuchart <[email protected]>
devreal and others added 17 commits September 10, 2024 15:01
Constraints are a no-op in the madness backend so don't try to
execute with them.

Signed-off-by: Joseph Schuchart <[email protected]>
most importantly pulls in ValeevGroup/tiledarray#483 which solves linking errors due to missing fmt instantiations
We should consider this a hint and fall back to non-persistent
values if the type is not derived from TTValue or we are running
with madness.

Signed-off-by: Joseph Schuchart <[email protected]>
Signed-off-by: Joseph Schuchart <[email protected]>
We cannot fully serialize the DeviceTensor and we don't need
the DeviceTensor in host execution.

Signed-off-by: Joseph Schuchart <[email protected]>
Fixes issues in TA device header.

Signed-off-by: Joseph Schuchart <[email protected]>
We should only call that when we run on the PaRSEC backend.
Otherwise we'll pull the rug from under the madness backend.

Signed-off-by: Joseph Schuchart <[email protected]>
Signed-off-by: Joseph Schuchart <[email protected]>
@devreal
Copy link
Contributor Author

devreal commented Nov 1, 2024

Somehow this breaks the element-wise SPMM. Not sure why, and how to dig into that. Will need some time or volunteers :)

evaleev and others added 4 commits November 13, 2024 22:44
Signed-off-by: Joseph Schuchart <[email protected]>
We have to provide our own check for equal based on the provided comparator.

Signed-off-by: Joseph Schuchart <[email protected]>
@devreal
Copy link
Contributor Author

devreal commented Nov 14, 2024

@evaleev Something is off with the Boosts in the Ubuntu CI:

CMake Error at CMakeLists.txt:4 (find_package):
  Found package configuration file:

    /home/runner/work/ttg/ttg/install/lib/cmake/ttg/ttg-config.cmake

  but it set ttg_FOUND to FALSE so package "ttg" is considered to be NOT
  FOUND.  Reason given by package:

  The following imported targets are referenced, but are missing:
  Boost::callable_traits Boost::serialization Boost::iostreams

I've seen similar issues on one of our machines. This PR bumps the minimum required to 1.81. What was the reason for that?

@evaleev
Copy link
Contributor

evaleev commented Nov 14, 2024

@evaleev Something is off with the Boosts in the Ubuntu CI:

Ubuntu provides too old of a Boost, which wreaks unseen levels of havoc (build succeeds but metadata in ttg-config.cmake is somehow all screwed up).

I've seen similar issues on one of our machines. This PR bumps the minimum required to 1.81. What was the reason for that?

tracking issues across the entirety of the stack is difficult. From what I recall some parts of boost 1.80 used std::unary_function which was removed in C++17, but most libs still shipped it ... but not Apple clang 15, so I bumped to boost to 1.81 across my stack without trying to fine-grain it. Building from source provides Boost 1.81 as it is a good-enough implementation of modularized Boost build, so it makes sense to demand at least that version to avoid diffs between system-provided boost and from-source boost.

Boost is hell. But reimplementing it is hell^2.

@evaleev evaleev merged commit 6f7c82d into TESSEorg:master Nov 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants