diff --git a/README.md b/README.md
index bab81df44..e5d0bc803 100644
--- a/README.md
+++ b/README.md
@@ -29,12 +29,16 @@ The development of TTG was motivated by _irregular_ scientific applications like
 int main(int argc, char *argv[]) {
   ttg::initialize(argc, argv);
-
+  // a simple template task
   auto tt = ttg::make_tt([]() { std::cout << "Hello, World!\n"; });
 
+  // make sure the graph is connected (simple for a single node)
   ttg::make_graph_executable(tt);
+  // start executing any available tasks
   ttg::execute();
+  // create a task to kickstart the computation
   if (ttg::get_default_world().rank() == 0) tt->invoke();
 
+  // wait for completion
   ttg::fence();
   ttg::finalize();
@@ -74,7 +78,7 @@ Although it does not involve any useful flow of computation and/or data, the abo
 ### Programming Model
 
-The basic model of computation is built around a Template Task Graph (TTG). A TTG consists of one or more connected Template Task (TT) objects. Each message that travels between TTs consist of a (potentially void) task ID and (optional) datum. A TT creates a task for a given task ID when its every input terminal receives a message with that task ID. The task body can send data to zero or more of the output terminals defined for the corresponding TT.
+The basic model of computation is built around a Template Task Graph (TTG). A TTG consists of one or more connected Template Task (TT) objects. Each message that travels between TTs consists of a (potentially void) task ID and an (optional) datum. A TT creates a task for a given task ID when all of its input terminals have received a message with that task ID. The task body can send data to zero or more of the output terminals defined for the corresponding TT.
 
 Thus, task creation is a byproduct of messages traveling through one or more TTGs. What makes the model powerful is the ability to encode large DAGs of tasks compactly.
 
@@ -82,10 +86,10 @@ Before proceeding further, let's refine the few concepts used to define the prog
 
 - `TaskId` (aka `Key`): A unique identifier for each task. It must be _perfectly_ hashable.
 - `Terminal`: A port for receiving (input) and sending (output) messages. Each message consists of a (potentially void) `TaskId` and an (optional) datum. Terminals are strongly-typed. An {in,out}put terminal can be connected to one or more {out,in}put terminals (as long as the `TaskId` and datum types match). Input terminals are programmable (e.g., incoming messages can be optionally reduced).
-- `TemplateTask` (aka `TT`): This is a _template_ for creating tasks. Task template creates a task associated with a given `TaskId` when every input terminal received messages for the given `TaskId`.
-- `Edge`: A connection between an input terminal and an output terminal. N.B. Concept `Edge` denotes a 1-to-1 connection and exists to be able to think of TTGs as graphs ("data flows between TTs' terminals via Edges"); do not confuse with the TTG C++ class `Edge` which behaves like a hyperedge by composing 1-to-many and many-to-1 connections between terminals.
+- `TemplateTask` (aka `TT`): This is a _template_ for creating tasks. The task template creates a task associated with a given `TaskId` when every input terminal has received a message for the given `TaskId`.
+- `Edge`: A connection between an input terminal and an output terminal. An `Edge` denotes a 1-to-1 connection and exists to make it possible to think of TTGs as graphs ("data flows between TTs' terminals via Edges"); do not confuse it with the TTG C++ class `Edge`, which behaves like a hyperedge by composing 1-to-many and many-to-1 connections between terminals.
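+
+For example, a minimal sketch of two TTs stitched together by a single `Edge` (the names `a`, `b`, and `a2b` are illustrative, not part of the API):
+
+```cpp
+// an Edge carrying messages consisting of an int TaskId and a double datum
+ttg::Edge<int, double> a2b;
+// "a" has no input terminals and one output terminal connected to a2b
+auto a = ttg::make_tt([](int k) { ttg::send<0>(k, k * 1.5); },
+                      ttg::edges(), ttg::edges(a2b), "a");
+// "b" consumes the datum that "a" produced for the same TaskId
+auto b = ttg::make_tt([](int k, const double& v) { std::cout << "b: " << v << "\n"; },
+                      ttg::edges(a2b), ttg::edges(), "b");
+```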
 
-Due to its simplicity only template tasks appear in the "Hello, World!" program.
+Due to its simplicity, only one template task appears in the "Hello, World!" program.
 
 ## Structure of a Minimal TTG Program
 
@@ -101,8 +105,8 @@ Let's go over each of these steps using the "Hello, World!" example. The complet
 
 ### Select the TTG Backend
 
-TTG C++ implementation is currently supported by 2 backends providing task scheduling, data transfer, and resource management.
-While it is possible to use specific TTG backend explicitly, by using the appropriate namespaces, it is recommended to write backend-neutral programs that can be specialized to a particular backend as follows.
+The TTG C++ implementation is currently supported by two backends providing task scheduling, data transfer, and resource management.
+While it is possible to use a specific TTG backend explicitly, by using the appropriate namespaces, it is recommended to write backend-neutral programs that can be specialized to a particular backend in one of two ways.
 
 1. By defining one (and only one) of the following macros, via the command-line argument to the compiler (recommended) or as an explicit `#define` statement in the source code:
    - `TTG_USE_PARSEC`: selects the PaRSEC backend as the default;
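+
+   For example, to select the backend in the source itself, define the chosen macro before including any TTG header (a minimal sketch; passing `-DTTG_USE_PARSEC` on the compiler command line is the recommended equivalent):
+
+   ```cpp
+   #define TTG_USE_PARSEC  // select the PaRSEC backend as the default
+   #include <ttg.h>
+   ```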
@@ -131,13 +135,13 @@ While it is possible to use specific TTG backend explicitly, by using the approp
 
 ### Initialize
 
-To initialize TTG runtime invoke `ttg::initialize(argc, argv)`; there are several overloads of this function that also accept other optional parameters, such as the number of threads in the main thread pool, the MPI communicator for execution, etc.
+To initialize the TTG runtime, invoke `ttg::initialize(argc, argv)`. There are several overloads of this function that also accept other optional parameters, such as the number of threads in the main thread pool, the MPI communicator for execution, etc.
 
 ## Specify a TTG
 
-To make a TTG create and connect one or more TTs. The simplest TTG consists of a single TT.
+To make a TTG, create and connect one or more TTs. The simplest TTG consists of a single TT, as in the "Hello, World!" example.
 
-The "Hello, World!" example contains a single TT that executes a single task (hence, task ID can be omitted, i.e., void) that does not take and produce any data. The easiest way to make such a TT is by wrapping a callable (e.g., a lambda) with `ttg::make_tt`:
+The "Hello, World!" example contains a single TT that executes a single task (hence, the task ID can be omitted, i.e., `void`) that neither consumes nor produces any data. The easiest way to make such a TT is by wrapping a callable (e.g., a lambda) with `ttg::make_tt`:
 
 ```cpp
 auto tt = ttg::make_tt([]() { std::cout << "Hello, World!\n"; });
@@ -145,7 +149,7 @@ The "Hello, World!" example contains a single TT that executes a single task (he
 
 ## Execute TTG
 
-To execute a TTG we must make it executable (this will declare the TTG program complete so no additional changes to the flowgraph are possible). To execute the TTG its root TT must receive at least one message; since in this case the task does not receive either task ID or data the message is empty (i.e., void):
+To execute a TTG we must make it executable (this declares the TTG program complete, so no additional changes to the flowgraph are possible). To execute the TTG its root TT must receive at least one message; since in this case the task receives neither a task ID nor data, the message is empty (i.e., `void`):
 
 ```cpp
 ttg::make_graph_executable(tt);
@@ -154,8 +158,7 @@ To execute a TTG we must make it executable (this will declare the TTG program 
 tt->invoke();
 ```
 
-`ttg::execute()` must occur before, not after, sending any messages. Note also that we must ensure that only one such message must be generated. Since TTG execution uses the Single Program Multiple Data (SPMD) model,
-when launching the TTG program as multiple processes only the first process (rank) gets to send the message.
+`ttg::execute()` must occur before, not after, sending any messages. Note also that we must ensure that only one such message is generated. Since TTG execution uses the Single Program Multiple Data (SPMD) model, when launching the TTG program as multiple processes only the first process (rank) gets to send the message.
 
 ## Finalize TTG
 Since a TTG program is executed asynchronously, we must ensure that all tasks are finished:
 
@@ -173,7 +176,7 @@ Before exiting `main()` the TTG runtime should be finalized:
 ## Beyond "Hello, World!"
 
 Since "Hello, World!" consists of a single task it does not demonstrate either how to control scheduling of
-multiple tasks or enable data flow between tasks. Let's use computation of `N`th Fibonacci number as
+multiple tasks or enable data flow between tasks. Let's use the computation of the `N`th Fibonacci number as
 a simple example of a recursive task-based computation that is often used ([OpenMP](https://www.openmp.org/wp-content/uploads/openmp-examples-5.1.pdf),
 [TBB](https://github.com/oneapi-src/oneTBB/blob/master/examples/test_all/fibonacci/fibonacci.cpp),
@@ -195,28 +198,40 @@ $F_N = F_{N-1} + F_{N-2}, F_0=0, F_1=1$.
 int main(int argc, char *argv[]) {
   ttg::initialize(argc, argv);
 
-  const int64_t N = 20;
+  const int64_t N = 20;  // want to compute fib(20)
+  // edges used for recursion
   ttg::Edge<int64_t, int64_t> f2f_nm1, f2f_nm2;
+  // edge to the task printing the output
   ttg::Edge<void, int64_t> f2p;
   auto fib = ttg::make_tt(
       [=](int64_t n, int64_t F_nm1, int64_t F_nm2) {
         auto F_n = F_nm1 + F_nm2;
         if (n < N) {
+          // recursion: send result to first input and
+          // prior result to second input of Fib(n+1)
           ttg::send<0>(n + 1, F_n);
           ttg::send<1>(n + 1, F_nm1);
-        } else
+        } else {
+          // send to print task below
           ttg::sendv<2>(F_n);
+        }
       },
-      ttg::edges(f2f_nm1, f2f_nm2), ttg::edges(f2f_nm1, f2f_nm2, f2p),
+      // input edges: first input, second input
+      ttg::edges(f2f_nm1, f2f_nm2),
+      // output edges: to first input, to second input, to print task
+      ttg::edges(f2f_nm1, f2f_nm2, f2p),
+      // name of the task
      "fib");
   auto print = ttg::make_tt([](int64_t F_N) { std::cout << N << "th Fibonacci number is " << F_N << std::endl; },
+                            // input from fib
                             ttg::edges(f2p),
+                            // no outputs
                             ttg::edges(),
                             "print");
   ttg::make_graph_executable(fib);
   ttg::execute();
-  if (ttg::rank() == 0) fib->invoke(2, std::make_tuple(1, 0));
+  if (ttg::get_default_world().rank() == 0) fib->invoke(2, std::make_tuple(1, 0));
 
   ttg::fence();
   ttg::finalize();
@@ -242,7 +257,7 @@ $F_{n-1},F_{n-2} \to F_{n}$).
 ## Data-Dependent Example : Largest Fibonacci Number < N
 
-To illustrate the real power of TTG let's tweak the problem slightly: instead of computing first $N$ Fibonacci numbers let's find the largest Fibonacci number smaller than some $N$. The key difference in the latter case is that, unlike the former, the number of tasks is NOT known a priori; furthermore, to make a decision whether we need to compute next Fibonacci number we must examine the value returned by the previous task. This is an example of data-dependent tasking, where the decision which (if any) task to execute next depends on the values produced by previous tasks. The ability to compose regular as well as data-dependent task graphs is a distinguishing strength of TTG.
+To illustrate the real power of TTG let's tweak the problem slightly: instead of computing the first $N$ Fibonacci numbers let's find the largest Fibonacci number smaller than some $N$. The key difference in the latter case is that, unlike the former, the number of tasks is NOT known a priori; furthermore, to decide whether we need to compute the next Fibonacci number we must examine the value returned by the previous task. This is an example of data-dependent tasking, where the decision of which (if any) task to execute next depends on the values produced by previous tasks. The ability to compose regular as well as data-dependent task graphs is a distinguishing strength of TTG.
 
 To make things even more interesting, we will demonstrate how to implement such a program both for execution on CPUs as well as on accelerators (GPUs). The complete examples, including the CMake build harness, can be found in [dox examples](https://github.com/TESSEorg/ttg/tree/master/doc/dox/dev/devsamp/fibonacci).
 
@@ -253,10 +268,11 @@ The complete examples, including the CMake build harness, can be found in [dox e
 #include <ttg.h>
 #include "ttg/serialization.h"
 
-/// N.B. contains values of F_n and F_{n-1}
+/// Struct containing values of F_n and F_{n-1}
 struct Fn {
   int64_t F[2];  // F[0] = F_n, F[1] = F_{n-1}
   Fn() { F[0] = 1; F[1] = 0; }
+  // make Fn serializable
   template <typename Archive>
   void serialize(Archive& ar) {
     ar & F;
@@ -267,9 +283,9 @@ struct Fn {
   }
 };
 
-auto make_ttg_fib_lt(const int64_t) {
-  ttg::Edge<int64_t, Fn> f2f;
-  ttg::Edge<void, Fn> f2p;
+auto make_ttg_fib_lt(const int64_t F_n_max) {
+  ttg::Edge<int64_t, Fn> f2f;  // fib to fib
+  ttg::Edge<void, Fn> f2p;  // fib to print
 
   auto fib = ttg::make_tt(
       [=](int64_t n, Fn&& f_n) {
@@ -277,22 +293,27 @@ auto make_ttg_fib_lt(const int64_t) {
         f_n.F[1] = f_n.F[0];
         f_n.F[0] = next_f_n;
         if (next_f_n < F_n_max) {
+          // send to the next fib task
           ttg::send<0>(n + 1, f_n);
         } else {
+          // send to the print task
           ttg::sendv<1>(f_n);
         }
       },
       ttg::edges(f2f), ttg::edges(f2f, f2p), "fib");
   auto print = ttg::make_tt(
-      [=](Fn&& f_n) {
+      [=](const Fn& f_n) {
         std::cout << "The largest Fibonacci number smaller than " << F_n_max << " is " << f_n.F[1] << std::endl;
       },
       ttg::edges(f2p), ttg::edges(), "print");
+  // the TTG receives its inputs on the first input terminal of fib
   auto ins = std::make_tuple(fib->template in<0>());
-  std::vector<std::unique_ptr<ttg::TTBase>> ops;
-  ops.emplace_back(std::move(fib));
-  ops.emplace_back(std::move(print));
-  return make_ttg(std::move(ops), ins, std::make_tuple(), "Fib_n < N");
+  // collect fib and print into a vector
+  std::vector<std::unique_ptr<ttg::TTBase>> tts;
+  tts.emplace_back(std::move(fib));
+  tts.emplace_back(std::move(print));
+  // instantiate the TTG; note the use of make_ttg instead of make_tt
+  return make_ttg(std::move(tts), ins, std::make_tuple(), "Fib_n < N");
 }
 
@@ -304,8 +325,8 @@ int main(int argc, char* argv[]) {
   auto fib = make_ttg_fib_lt(N);
   ttg::make_graph_executable(fib.get());
   ttg::execute();
-  if (ttg::default_execution_context().rank() == 0)
-    fib->template in<0>()->send(1, Fn{});;
+  if (ttg::get_default_world().rank() == 0)
+    fib->invoke(1, Fn{});
 
   ttg::fence();
   ttg::finalize();
@@ -330,7 +351,7 @@ Because `Fn` is now a user-defined type, for TTG to be able to copy/move it betw
 
 #### Why `make_ttg_fib_lt`?
 
-Until now we have constructed individual TTs and linked them together; i.e., TTGs until now was implicit. Function `make_ttg_fib_lt` instead explicitly creates a graph of TTs (a TTG). This seemingly small step helps improve composability by allowing to use entire TTGs as a component of other graphs by stitching it with TTs or TTGs together.
+Until now we have constructed individual TTs and linked them together; i.e., the TTG until now was implicit. The function `make_ttg_fib_lt` instead explicitly creates a graph of TTs (a TTG). This seemingly small step improves composability by allowing entire TTGs to be used as components of other graphs, stitched together with other TTs or TTGs.
 [//]: ![Fibonacci_TTG_example](doc/images/fibonacci_ttg.png)
 
@@ -345,7 +366,7 @@ The host-only part is completely independent of the type of the device programmi
 ```cpp
 struct Fn : public ttg::TTValue<Fn> {
   std::unique_ptr<int64_t[]> F;  // F[0] = F_n, F[1] = F_{n-1}
-  ttg::Buffer<int64_t> b;
+  ttg::Buffer<int64_t> b;  // buffer managing host and device memory
   Fn() : F(std::make_unique<int64_t[]>(2)), b(F.get(), 2) { F[0] = 1; F[1] = 0; }
   Fn(const Fn&) = delete;
   Fn(Fn&& other) = default;
@@ -370,8 +391,9 @@ auto make_ttg_fib_lt(const int64_t F_n_max = 1000) {
         assert(n > 0);
         ttg::trace("in fib: n=", n, " F_n=", f_n.F[0]);
 
+        // select a device and make the contents of b available on it
         co_await ttg::device::select(f_n.b);
-
+        // compute the next value on the device (see below)
         next_value(f_n.b.current_device_ptr());
 
         // wait for the task to complete and the values to be brought back to the host
@@ -384,6 +406,7 @@ auto make_ttg_fib_lt(const int64_t F_n_max = 1000) {
         }
       },
       ttg::edges(f2f), ttg::edges(f2f, f2p), "fib");
+
   auto print = ttg::make_tt(
       [=](Fn&& f_n) {
         std::cout << "The largest Fibonacci number smaller than " << F_n_max << " is " << f_n.F[1] << std::endl;
@@ -422,22 +445,25 @@ Although the structure of the device-capable program is nearly identical to the
 
 ##### `TTValue`
 
-For optimal performance low-level runtime that manages the data motion across the memory hierarchy (host-to-host (i.e., between MPI ranks), host-to-device, and device-to-device) must be able to _track_ each datum as it orchestrates the computation. For example, when a TTG task `send`'s a datum to an output terminal connected to multiple consumers the runtime may avoid unnecessary copies, e.g. by recognizing that all consumers will only need read-only access to the data, hence reference to the same datum can be passed to all consumers. This requires being able to map pointer to a C++ object to the control block that describes that object to the runtime. Deriving C++ type `T` from `TTValue` makes it possible to track objects `T` by embedding the control block into each object. This is particularly important for the data that has to travel to the device.
+For optimal performance, the low-level runtime that manages the data motion across the memory hierarchy (host-to-host (i.e., between MPI ranks), host-to-device, and device-to-device) must be able to _track_ each datum as it orchestrates the computation. For example, when a TTG task sends a datum to an output terminal connected to multiple consumers, the runtime may avoid unnecessary copies, e.g., by recognizing that all consumers only need read-only access to the data, hence a reference to the same datum can be passed to all consumers. This requires the ability to map a pointer to a C++ object to the control block that describes that object to the runtime. Deriving a C++ type `T` from `TTValue` embeds the control block in `T` and avoids creating a separate control block. This is particularly important for data that has to travel to the device.
 
 ##### `Buffer`
 
-`Buffer<T>` is a view of a contiguous sequence of objects of type `T` in the host memory that can be automatically moved by the runtime to/from the device memory. Here `Fn::b` is a view of the 2-element sequence pointed to by `Fn::F`; once it's constructed the content of `Fn::F` will be moved to/from the device by the runtime. The subsequent actions of `Fn::b` cause the automatic transfers of data to (`device::select(f_n.b)`) and from (`ttg::device::wait(f_n.b)`) the device.
+`Buffer<T>` is a view of a contiguous sequence of objects of type `T` in the host memory that can be automatically moved by the runtime to/from the device memory. Here `Fn::b` is a view of the 2-element sequence pointed to by `Fn::F`; once it's constructed, the content of `Fn::F` will be moved to/from the device by the runtime. The subsequent uses of `Fn::b` cause automatic transfers of data to (`device::select(f_n.b)`) and from (`ttg::device::wait(f_n.b)`) the device.
+A `Buffer` can be either owning or non-owning. In the example above, the memory is owned by the `unique_ptr`.
+If no pointer is passed to the constructor of `Buffer`, the buffer allocates the necessary host-side memory.
+To guarantee relocatability of buffers, the data managed by a buffer should be located on the heap, i.e., dynamically allocated.
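+
+A minimal sketch of the two modes (assuming the count-only constructor described above):
+
+```cpp
+auto data = std::make_unique<int64_t[]>(2);  // heap-allocated, as recommended above
+ttg::Buffer<int64_t> view(data.get(), 2);    // non-owning: a view of existing host memory
+ttg::Buffer<int64_t> owned(2);               // owning: allocates host memory for 2 elements
+```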
 
 ##### `device::Task`
 
-The key challenge of device programming models is that they are fundamentally _asynchronous_ to hide the large latency of interacting with the device. Kernel launches, unlike function calls on CPU, take 1000s of CPU cycles to occur, and the asynchrony helps amortize these costs by overlapping kernels launch and execution. Task programming models are a seemingly good match for device programming, but the key challenge is how to make device-capable task code look most like standard host-only task code. TTG ability to use _C++ coroutines_ as task bodies allows it to deal with asynchronous calls inside the tasks (the use of coroutines is the primary reason why TTG requires C++20 support by the C++ compiler). Roughly speaking, coroutines are resumable functions; they can return to the caller via a `co_await` statement and resumed at that point once some condition (typically, completion of submitted actions) has been satisdied. Device tasks `co_await` at every point where further progress requires completion of preceding device tasks:
+The key challenge of device programming models is that they are fundamentally _asynchronous_, to hide the latency of interacting with the device. Unlike function calls on the CPU, kernel launches as well as memory transfers take 1000s of CPU cycles, and the asynchrony helps amortize these costs by overlapping kernel launch and execution. Task programming models are a seemingly good match for device programming, but the key challenge is how to make device-capable task code look as much as possible like standard host-only task code. TTG's ability to use _C++ coroutines_ as task bodies allows it to deal with asynchronous calls inside the tasks (the use of coroutines is the primary reason why TTG requires C++20 support from the C++ compiler). Roughly speaking, coroutines are resumable functions; they can return to the caller via a `co_await` statement and be resumed at that point once some condition (typically, completion of submitted actions) has been satisfied. Device tasks `co_await` at every point where further progress requires completion of preceding device tasks:
 
-  - First `co_await` ensures that contents of `f_n.F[]` are available on the device. During the first invocation the data resides on the host, hence this allocates memory on the device and transfers the contents of `f_n.F[]` from host to device. During subsequent invocations the contents of `f_n.F[]` are likely already available on the device (unless the runtime decides to compute $F_{n+1}$ on a different device than $F_n$), thus this `co_await` may become a no-op.
-  - Second `co_await` ensures that the kernel launched by `next_value` has completed and the contents of `f_n.F[]` changed by that kernel are available on the host. This always causes device-to-host transfer.
-  - Third set of `co_await`'s ensures that the corresponding `device::send`, which sends the data located in the device memory, has completed. Since `device::send` within a task will typically return a local variable exit from coroutine would destroy such variables prematurely, hence instead of a `co_return` the coroutine concludes by waiting for the `device::send` to complete before exiting.
+  - The first `co_await ttg::device::select` asks the runtime to select a device (if multiple are available) and ensures that the contents of `f_n.F[]` are made available on that device. During the first invocation the data resides on the host, hence the runtime allocates memory on the device and transfers the contents of `f_n.F[]` from host to device. During subsequent invocations the contents of `f_n.F[]` are likely already available on the device (unless the runtime decides to compute $F_{n+1}$ on a different device than $F_n$), thus this `co_await` may become a no-op.
+  - The second `co_await ttg::device::wait` ensures that the kernel launched by `next_value` has completed and that the contents of `f_n.F[]` changed by that kernel are available on the host. This always causes a device-to-host transfer if one or more `Buffer`s are provided. If no buffer is provided, the call only waits for all previously submitted kernels to complete.
+  - The last set of `co_await`'s ensures that the corresponding `ttg::device::send`, which sends the data located in the device memory, has completed. Since `device::send` within a task may return a local variable (e.g., for the key), exiting the coroutine would destroy such variables prematurely; hence, instead of a `co_return`, the coroutine concludes by waiting for the `device::send` to complete before exiting (see the sketch below).
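+
+The pattern referred to by the last bullet looks roughly as follows (a sketch modeled on the `fib` device task above; the `ttg::device::forward` wrapper for awaitable sends is an assumption here, since the corresponding lines are not shown in the hunks above):
+
+```cpp
+if (next_f_n < F_n_max) {
+  // device-capable send: the datum can stay in device memory
+  co_await ttg::device::forward(ttg::device::send<0>(n + 1, std::move(f_n)));
+} else {
+  co_await ttg::device::forward(ttg::device::sendv<1>(std::move(f_n)));
+}
+```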
 
 ##### `ExecutionSpace`
 
-TTG and its underlying runtime needs to be told in which _execution space_ the task code will operate. The current choices are denoted by the `ExecutionSpace` enumeration:
+TTG and its underlying runtime need to be told in which _execution space_ the task code will operate. The current choices are denoted by the `ExecutionSpace` enumeration:
 
 - `ExecutionSpace::Host`: host processor (default)
 - `ExecutionSpace::CUDA`: an NVIDIA CUDA device
@@ -454,7 +480,6 @@ Here's the CUDA version of the device kernel and its host-side wrapper; ROCm and
 
 ```cpp
 #include "fibonacci_cuda_kernel.h"
-#ifdef TTG_HAVE_CUDA
 __global__ void cu_next_value(int64_t* fn_and_fnm1) {
   int64_t fnp1 = fn_and_fnm1[0] + fn_and_fnm1[1];
   fn_and_fnm1[1] = fn_and_fnm1[0];
@@ -463,7 +488,6 @@ Here's the CUDA version of the device kernel and its host-side wrapper; ROCm and
 void next_value(int64_t* fn_and_fnm1) { cu_next_value<<<1, 1>>>(fn_and_fnm1); }
-#endif  // TTG_HAVE_CUDA
 ```
 
 `cu_next_value` is the device kernel that evaluates $F_{n+1}$ from $F_{n}$ and $F_{n-1}$. `next_value` is a host function that launches `cu_next_value`; this is the function called in the `fib` task.
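+
+For comparison, a hypothetical ROCm/HIP analogue of the same wrapper might look as follows (a sketch; the names and build setup are assumed, not part of the TTG examples):
+
+```cpp
+#include <hip/hip_runtime.h>
+__global__ void hip_next_value(int64_t* fn_and_fnm1) {
+  int64_t fnp1 = fn_and_fnm1[0] + fn_and_fnm1[1];
+  fn_and_fnm1[1] = fn_and_fnm1[0];
+  fn_and_fnm1[0] = fnp1;
+}
+void next_value(int64_t* fn_and_fnm1) {
+  hipLaunchKernelGGL(hip_next_value, dim3(1), dim3(1), 0, 0, fn_and_fnm1);
+}
+```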
@@ -493,8 +517,7 @@ To simplify debugging of multirank TTG programs it is possible to automate the p
 # TTG Performance
 
 Competitive performance of TTG for several paradigmatic scientific applications on shared- and distributed-memory machines (CPU only)
-will be discussed in [manuscript ``Generalized Flow-Graph Programming Using Template Task-Graphs: Initial Implementation and Assessment''](https://www.ipdps.org/ipdps2022/2022-accepted-papers.html) to be presented at [IPDPS'22](https://www.ipdps.org/ipdps2022/).
-Stay tuned!
+is discussed in the [manuscript ``Generalized Flow-Graph Programming Using Template Task-Graphs: Initial Implementation and Assessment''](https://www.ipdps.org/ipdps2022/2022-accepted-papers.html), which was presented at [IPDPS'22](https://www.ipdps.org/ipdps2022/).
 
 # TTG Performance Tracing