diff --git a/_sources/cmake_support.rst.txt b/_sources/cmake_support.rst.txt
index 84c7fb70523..61eb16ef287 100644
--- a/_sources/cmake_support.rst.txt
+++ b/_sources/cmake_support.rst.txt
@@ -20,7 +20,7 @@ The minimal supported CMake version for |onedpl_short| is 3.11 on Linux and 3.20
 The supported `CMake Generator `_ for Linux is `Unix Makefiles `_ (default). In the Windows environment, the supported generator is `Ninja `_ as described in the `Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference `_ which may be specified via ``-GNinja``.
 
 |onedpl_short| Backend Options
-==============
+==============================
 
 Backend for Parallel Execution Policies (par and par_unseq)
 -----------------------------------------------------------
@@ -88,7 +88,7 @@ Below is an example ``Linux`` CMake invocation which generates Unix makefiles fo
   mkdir build && cd build
   cmake -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=release -DONEDPL_PAR_BACKEND=tbb ..
 
-Below is an example ``Windows`` CMake invocation which generates ``Ninja`` build scripts (see :ref:`Requirements Section`) for the project in the parent directory with the ``icx`` compiler, ``OpenMP`` backend and ``debug`` build type:
+Below is an example ``Windows`` CMake invocation which generates ``Ninja`` build scripts (see the Requirements Section) for the project in the parent directory with the ``icx`` compiler, ``OpenMP`` backend and ``debug`` build type:
 
 .. code:: cpp
 
diff --git a/_sources/dynamic_selection_api/auto_tune_policy.rst.txt b/_sources/dynamic_selection_api/auto_tune_policy.rst.txt
new file mode 100644
index 00000000000..4bdf3ea7925
--- /dev/null
+++ b/_sources/dynamic_selection_api/auto_tune_policy.rst.txt
@@ -0,0 +1,248 @@
+Auto-Tune Policy
+################
+
+The dynamic selection API is an experimental feature in the |onedpl_long|
+(|onedpl_short|) that selects an *execution resource* based on a chosen
+*selection policy*. There are several policies provided as part
+of the API. Policies encapsulate the logic and any associated state needed
+to make a selection.
+
+The auto-tune policy selects resources using runtime profiling. ``auto_tune_policy``
+is useful for determining which resource performs best
+for a given kernel. The choice is made based on runtime performance
+history, so this policy is only useful for kernels that have stable
+performance. Initially, this policy acts like ``round_robin_policy``,
+rotating through each resource (one or more times). Then, once it has
+determined which resource is performing best, it uses that resource
+thereafter. Optionally, a resampling interval can be set to return to
+the profiling phase periodically.
+
+.. code:: cpp
+
+  namespace oneapi::dpl::experimental {
+
+    template 
+    class auto_tune_policy {
+    public:
+      // useful types
+      using resource_type = typename Backend::resource_type;
+      using wait_type = typename Backend::wait_type;
+
+      class selection_type {
+      public:
+        auto_tune_policy get_policy() const;
+        resource_type unwrap() const;
+      };
+
+      // constructors
+      auto_tune_policy(deferred_initialization_t);
+      auto_tune_policy(uint64_t resample_interval_in_milliseconds = 0);
+      auto_tune_policy(const std::vector<resource_type>& u,
+                       uint64_t resample_interval_in_milliseconds = 0);
+
+      // deferred initializer
+      void initialize(uint64_t resample_interval_in_milliseconds = 0);
+      void initialize(const std::vector<resource_type>& u,
+                      uint64_t resample_interval_in_milliseconds = 0);
+
+      // queries
+      auto get_resources() const;
+      auto get_submission_group();
+
+      // other implementation defined functions...
+    };
+
+  }
+
+This policy can be used with all the dynamic selection functions, such as ``select``, ``submit``,
+and ``submit_and_wait``. It can also be used with ``policy_traits``.
+
+Example
+-------
+
+In the following example, an ``auto_tune_policy`` is used to dynamically select between
+two queues, a CPU queue and a GPU queue.
+
+.. code:: cpp
+
+  #include <oneapi/dpl/dynamic_selection>
+  #include <sycl/sycl.hpp>
+  #include <iostream>
+
+  namespace ex = oneapi::dpl::experimental;
+
+  int main() {
+    std::vector r { sycl::queue{sycl::cpu_selector_v},
+                    sycl::queue{sycl::gpu_selector_v} };
+
+    const std::size_t N = 10000;
+    std::vector av(N, 0.0);
+    std::vector bv(N, 0.0);
+    std::vector cv(N, 0.0);
+    for (int i = 0; i < N; ++i) {
+      av[i] = bv[i] = i;
+    }
+
+    ex::auto_tune_policy p{r}; // (1)
+
+    {
+      sycl::buffer a_b(av);
+      sycl::buffer b_b(bv);
+      sycl::buffer c_b(cv);
+
+
+      for (int i = 0; i < 6; ++i) {
+        ex::submit_and_wait(p, [&](sycl::queue q) { // (2)
+          // (3)
+          std::cout << (q.get_device().is_cpu() ? "using cpu\n" : "using gpu\n");
+          return q.submit([&](sycl::handler &h) { // (4)
+            sycl::accessor a_a(a_b, h, sycl::read_only);
+            sycl::accessor b_a(b_b, h, sycl::read_only);
+            sycl::accessor c_a(c_b, h, sycl::read_write);
+            h.parallel_for(N, [=](auto i) { c_a[i] = a_a[i] + b_a[i]; });
+          });
+        });
+      }
+    }
+
+    for (int i = 0; i < N; ++i) {
+      if (cv[i] != 2*i) {
+        std::cout << "ERROR!\n";
+      }
+    }
+    std::cout << "Done.\n";
+  }
+
+The key points in this example are:
+
+#. An ``auto_tune_policy`` is constructed to select between the CPU and GPU.
+#. ``submit_and_wait`` is invoked with the policy as the first argument. The selected queue will be passed to the user-provided function.
+#. For clarity when run, the type of device is displayed.
+#. The queue is used in a function to perform an asynchronous offload. The SYCL event returned from the call to ``submit`` is returned. Returning an event is required for functions passed to ``submit`` and ``submit_and_wait``.
+
+Selection Algorithm
+-------------------
+
+The selection algorithm for ``auto_tune_policy`` uses runtime profiling
+to choose the best resource for the given function. A simplified, expository
+implementation of the selection algorithm follows:
+
+.. code::
+
+  template <typename Function, typename... Args>
+  selection_type auto_tune_policy::select(Function&& f, Args&&...args) {
+    if (initialized_) {
+      auto k = make_task_key(f, args...);
+      auto tuner = get_tuner(k);
+      auto offset = tuner->get_resource_to_profile();
+      if (offset == use_best) {
+        return selection_type{*this, tuner->best_resource_, tuner};
+      } else {
+        auto r = resources_[offset];
+        return selection_type{*this, r, tuner};
+      }
+    } else {
+      throw std::logic_error("select called before initialization");
+    }
+  }
+
+where ``make_task_key`` combines the inputs, including the function and its
+arguments, into a key that uniquely identifies the user function that is being
+profiled. ``tuner`` is the encapsulated logic for performing runtime profiling
+and choosing the best option for a given key. When the call to ``get_resource_to_profile()``
+returns ``use_best``, the tuner is not in the profiling phase, so the previously
+determined best resource is used. Otherwise, the resource at index ``offset``
+in the ``resources_`` vector is used and its resulting performance is profiled.
+When an ``auto_tune_policy`` is initialized with a non-zero resample interval,
+the policy periodically returns to the profiling phase based on the provided
+interval value.
+
+Constructors
+------------
+
+``auto_tune_policy`` provides three constructors.
+
+.. list-table:: ``auto_tune_policy`` constructors
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Signature
+     - Description
+   * - ``auto_tune_policy(deferred_initialization_t);``
+     - Defers initialization. An ``initialize`` function must be called prior to use.
+   * - ``auto_tune_policy(uint64_t resample_interval_in_milliseconds = 0);``
+     - Initialized to use the default set of resources. An optional resampling interval can be provided.
+   * - ``auto_tune_policy(const std::vector<resource_type>& u, uint64_t resample_interval_in_milliseconds = 0);``
+     - Overrides the default set of resources. An optional resampling interval can be provided.
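The profile-then-exploit behavior described in the Selection Algorithm section above can be modeled without SYCL. The following is a minimal sketch. It assumes that each resource is profiled exactly once and that no resample interval is set. ``toy_tuner``, ``toy_select``, ``report_time``, and this ``use_best`` sentinel are hypothetical stand-ins, not the oneDPL implementation, which may profile each resource more than once. Calling ``report_time`` plays the role of the ``task_time`` information that this policy requires to be reported.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>

// Hypothetical, simplified model of the tuner used in the Selection
// Algorithm pseudocode. This is NOT the oneDPL implementation.
class toy_tuner {
public:
  // Sentinel returned once profiling is over: "use the best resource".
  static constexpr std::size_t use_best = std::numeric_limits<std::size_t>::max();

  explicit toy_tuner(std::size_t num_resources) : num_resources_(num_resources) {}

  // Profiling phase: hand out each resource index once, round-robin style.
  // After every resource has been profiled, always return use_best.
  std::size_t get_resource_to_profile() {
    if (next_to_profile_ < num_resources_) {
      return next_to_profile_++;
    }
    return use_best;
  }

  // Record the measured execution time of a profiled submission and
  // remember the fastest resource seen so far.
  void report_time(std::size_t offset, std::chrono::nanoseconds t) {
    if (t < best_time_) {
      best_time_ = t;
      best_resource_ = offset;
    }
  }

  std::size_t best_resource() const { return best_resource_; }

private:
  std::size_t num_resources_;
  std::size_t next_to_profile_ = 0;
  std::size_t best_resource_ = 0;
  std::chrono::nanoseconds best_time_ = std::chrono::nanoseconds::max();
};

// Mirrors the pseudocode: profile while the tuner asks for it,
// then keep returning the best resource found.
std::size_t toy_select(toy_tuner& tuner) {
  std::size_t offset = tuner.get_resource_to_profile();
  return offset == toy_tuner::use_best ? tuner.best_resource() : offset;
}
```

In the real policy, a non-zero resample interval would periodically send the tuner back into the profiling phase. This sketch omits resampling to keep the two phases easy to see.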
+
+Deferred Initialization
+-----------------------
+
+An ``auto_tune_policy`` that was constructed with deferred initialization must be
+initialized by calling one of its ``initialize`` member functions before it can be used
+to select or submit.
+
+.. list-table:: ``auto_tune_policy`` deferred initializers
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Signature
+     - Description
+   * - ``initialize(uint64_t resample_interval_in_milliseconds = 0);``
+     - Initializes the policy to use the default set of resources. An optional resampling interval can be provided.
+   * - ``initialize(const std::vector<resource_type>& u, uint64_t resample_interval_in_milliseconds = 0);``
+     - Overrides the default set of resources. An optional resampling interval can be provided.
+
+Queries
+-------
+
+An ``auto_tune_policy`` has ``get_resources`` and ``get_submission_group``
+member functions.
+
+.. list-table:: ``auto_tune_policy`` queries
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Signature
+     - Description
+   * - ``std::vector<resource_type> get_resources();``
+     - Returns the set of resources the policy is selecting from.
+   * - ``auto get_submission_group();``
+     - Returns an object that can be used to wait for all active submissions.
+
+Reporting Requirements
+----------------------
+
+If a resource returned by ``select`` is used directly without calling
+``submit`` or ``submit_and_wait``, it may be necessary to call ``report``
+to provide feedback to the policy. The ``auto_tune_policy`` tracks the
+performance of submissions on each device via callbacks that report
+the execution time. The instrumentation to report these events is included
+in the implementations of ``submit`` and ``submit_and_wait``. However, if you
+use ``select`` and then submit work directly to the selected resource, you
+must report the execution time explicitly.
+
+.. list-table:: ``auto_tune_policy`` reporting requirements
+   :widths: 50 50
+   :header-rows: 1
+
+   * - ``execution_info``
+     - Is reporting required?
+   * - ``task_submission``
+     - No
+   * - ``task_completion``
+     - No
+   * - ``task_time``
+     - Yes
+
+In generic code, it is possible to perform compile-time checks to avoid
+reporting overheads when reporting is not needed, while still writing
+code that will work with any policy, as demonstrated below:
+
+.. code:: cpp
+
+  auto s = select(my_policy);
+  if constexpr (report_info_v)
+  {
+    s.report(execution_info::task_submission);
+  }
diff --git a/_sources/dynamic_selection_api/dynamic_load_policy.rst.txt b/_sources/dynamic_selection_api/dynamic_load_policy.rst.txt
new file mode 100644
index 00000000000..ffd7dae4850
--- /dev/null
+++ b/_sources/dynamic_selection_api/dynamic_load_policy.rst.txt
@@ -0,0 +1,231 @@
+Dynamic Load Policy
+###################
+
+The dynamic selection API is an experimental feature in the |onedpl_long|
+(|onedpl_short|) that selects an *execution resource* based on a chosen
+*selection policy*. There are several policies provided as part
+of the API. Policies encapsulate the logic and any associated state needed
+to make a selection.
+
+The dynamic load policy tracks the number of submissions currently submitted but not yet completed on each
+resource and selects the resource that has the fewest unfinished submissions.
+``dynamic_load_policy`` is useful for offloading kernels of varying cost to devices
+of varying performance. A load-based assignment may achieve good load balancing
+by directing more submissions to the resources that complete work faster.
+
+.. code:: cpp
+
+  namespace oneapi::dpl::experimental {
+
+    template 
+    class dynamic_load_policy {
+    public:
+      // useful types
+      using resource_type = typename Backend::resource_type;
+      using wait_type = typename Backend::wait_type;
+
+      class selection_type {
+      public:
+        dynamic_load_policy get_policy() const;
+        resource_type unwrap() const;
+      };
+
+      // constructors
+      dynamic_load_policy(deferred_initialization_t);
+      dynamic_load_policy();
+      dynamic_load_policy(const std::vector<resource_type>& u);
+
+      // deferred initializer
+      void initialize();
+      void initialize(const std::vector<resource_type>& u);
+
+      // queries
+      auto get_resources() const;
+      auto get_submission_group();
+
+      // other implementation defined functions...
+    };
+
+  }
+
+This policy can be used with all the dynamic selection functions, such as ``select``, ``submit``,
+and ``submit_and_wait``. It can also be used with ``policy_traits``.
+
+Example
+-------
+
+The following example demonstrates a simple approach to send work to more than
+one queue concurrently using ``dynamic_load_policy``. The policy selects the
+resource with the fewest unfinished submissions.
+
+.. code:: cpp
+
+  #include <oneapi/dpl/dynamic_selection>
+  #include <sycl/sycl.hpp>
+  #include <iostream>
+
+  const std::size_t N = 10000;
+  namespace ex = oneapi::dpl::experimental;
+
+  void f(sycl::handler& h, float* v);
+  void do_cpu_work();
+
+  int dynamic_load_example(std::vector<sycl::queue>& devices,
+                           std::vector<float*>& usm_data) {
+
+    ex::dynamic_load_policy p{devices}; // (1)
+
+    auto num_devices = p.get_resources().size();
+    auto num_arrays = usm_data.size();
+    // (2)
+    auto submission_group_size = num_arrays;
+
+    std::cout << "Running with " << num_devices << " queues\n"
+              << " " << num_arrays << " usm arrays\n"
+              << "Will perform " << submission_group_size << " concurrent offloads\n";
+
+
+    for (int i = 0; i < 100; i+=submission_group_size) { // (3)
+      for (int j = 0; j < submission_group_size; ++j) { // (4)
+        ex::submit(p, [&](sycl::queue q) { // (5)
+          float *data = usm_data[j];
+          return q.submit([=](sycl::handler &h) { // (6)
+            f(h, data);
+          });
+        });
+        do_cpu_work(); // (7)
+      }
+      ex::wait(p.get_submission_group()); // (8)
+    }
+    return 0;
+  }
+
+The key points in this example are:
+
+#. A ``dynamic_load_policy`` is constructed that selects from queues in the ``devices`` vector.
+#. The total number of concurrent offloads, ``submission_group_size``, is limited to the number of USM arrays. In this example, multiple simultaneous offloads to the same queue are allowed. The only limitation is that there must be enough USM arrays available to support the concurrent executions.
+#. The outer ``i``-loop iterates from 0 to 99, stepping by the ``submission_group_size``. This number of submissions will be offloaded concurrently.
+#. The inner ``j``-loop iterates over ``submission_group_size`` submissions.
+#. ``submit`` is used to select a queue and pass it to the user's function, but does not block until the event returned by that function completes. This provides the opportunity for concurrency across the submissions.
+#. The queue is used in a function to perform an asynchronous offload. The SYCL event returned from the call to ``submit`` is returned. Returning an event is required for functions passed to ``submit`` and ``submit_and_wait``.
+#. Some additional work is done between calls to ``submit``. ``dynamic_load_policy`` is most useful when there is time for work to complete on some devices before the next assignment is made. If all submissions are performed simultaneously, all devices will appear equally loaded, since the fast devices would have had no time to complete their work.
+#. ``wait`` is called to block until all the concurrent ``submission_group_size`` submissions complete.
+
+Selection Algorithm
+-------------------
+
+The selection algorithm for ``dynamic_load_policy`` chooses the resource
+that has the fewest unfinished offloads. The number of unfinished
+offloads is the difference between the number of reported task submissions
+and the number of reported task completions. This value is tracked for each
+available resource.
+
+Simplified, expository implementation of the selection algorithm:
+
+.. code::
+
+  template <typename... Args>
+  selection_type dynamic_load_policy::select(Args&& ...) {
+    if (initialized_) {
+      auto least_loaded_resource = find_least_loaded(resources_);
+      return selection_type{dynamic_load_policy(*this), least_loaded_resource};
+    } else {
+      throw std::logic_error("select called before initialization");
+    }
+  }
+
+where ``resources_`` is a container of resources, such as
+``std::vector`` of ``sycl::queue``. The function ``find_least_loaded``
+iterates through the resources available to the policy and returns the
+resource with the fewest unfinished offloads.
+
+Constructors
+------------
+
+``dynamic_load_policy`` provides three constructors.
+
+.. list-table:: ``dynamic_load_policy`` constructors
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Signature
+     - Description
+   * - ``dynamic_load_policy(deferred_initialization_t);``
+     - Defers initialization. An ``initialize`` function must be called prior to use.
+   * - ``dynamic_load_policy();``
+     - Initialized to use the default set of resources.
+   * - ``dynamic_load_policy(const std::vector<resource_type>& u);``
+     - Overrides the default set of resources.
+
+Deferred Initialization
+-----------------------
+
+A ``dynamic_load_policy`` that was constructed with deferred initialization must be
+initialized by calling one of its ``initialize`` member functions before it can be used
+to select or submit.
+
+.. list-table:: ``dynamic_load_policy`` deferred initializers
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Signature
+     - Description
+   * - ``initialize();``
+     - Initializes the policy to use the default set of resources.
+   * - ``initialize(const std::vector<resource_type>& u);``
+     - Overrides the default set of resources.
+
+Queries
+-------
+
+A ``dynamic_load_policy`` has ``get_resources`` and ``get_submission_group``
+member functions.
+
+.. list-table:: ``dynamic_load_policy`` queries
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Signature
+     - Description
+   * - ``std::vector<resource_type> get_resources();``
+     - Returns the set of resources the policy is selecting from.
+   * - ``auto get_submission_group();``
+     - Returns an object that can be used to wait for all active submissions.
+
+Reporting Requirements
+----------------------
+
+If a resource returned by ``select`` is used directly without calling
+``submit`` or ``submit_and_wait``, it may be necessary to call ``report``
+to provide feedback to the policy. The ``dynamic_load_policy`` tracks the
+number of outstanding submissions on each device via callbacks that report
+when a submission is started and when it is completed. The instrumentation
+to report these events is included in the implementations of
+``submit`` and ``submit_and_wait``. However, if you use ``select`` and then
+submit work directly to the selected resource, you must report these events
+explicitly.
+
+.. list-table:: ``dynamic_load_policy`` reporting requirements
+   :widths: 50 50
+   :header-rows: 1
+
+   * - ``execution_info``
+     - Is reporting required?
+   * - ``task_submission``
+     - Yes
+   * - ``task_completion``
+     - Yes
+   * - ``task_time``
+     - No
+
+In generic code, it is possible to perform compile-time checks to avoid
+reporting overheads when reporting is not needed, while still writing
+code that will work with any policy, as demonstrated below:
+
+.. code:: cpp
+
+  auto s = select(my_policy);
+  if constexpr (report_info_v)
+  {
+    s.report(execution_info::task_submission);
+  }
diff --git a/_sources/dynamic_selection_api/fixed_resource_policy.rst.txt b/_sources/dynamic_selection_api/fixed_resource_policy.rst.txt
new file mode 100644
index 00000000000..eb2ceda0282
--- /dev/null
+++ b/_sources/dynamic_selection_api/fixed_resource_policy.rst.txt
@@ -0,0 +1,244 @@
+Fixed-Resource Policy
+#####################
+
+The dynamic selection API is an experimental feature in the |onedpl_long|
+(|onedpl_short|) that selects an *execution resource* based on a chosen
+*selection policy*. There are several policies provided as part
+of the API. Policies encapsulate the logic and any associated state needed
+to make a selection.
+
+The fixed-resource policy always returns the same resource selection.
+``fixed_resource_policy`` is designed for two primary scenarios:
+
+#. debugging the use of dynamic selection
+#. special-casing a dynamic-selection-capable application for a specific resource when that resource is known to be best on a given platform.
+
+.. code:: cpp
+
+  namespace oneapi::dpl::experimental {
+
+    template 
+    class fixed_resource_policy {
+    public:
+      // useful types
+      using resource_type = typename Backend::resource_type;
+      using wait_type = typename Backend::wait_type;
+
+      class selection_type {
+      public:
+        fixed_resource_policy get_policy() const;
+        resource_type unwrap() const;
+      };
+
+      // constructors
+      fixed_resource_policy(deferred_initialization_t);
+      fixed_resource_policy(std::size_t offset = 0);
+      fixed_resource_policy(const std::vector<resource_type>& u,
+                            std::size_t offset = 0);
+
+      // deferred initializers
+      void initialize(std::size_t offset = 0);
+      void initialize(const std::vector<resource_type>& u,
+                      std::size_t offset = 0);
+
+      // queries
+      auto get_resources() const;
+      auto get_submission_group();
+
+      // other implementation defined functions...
+    };
+
+  }
+
+This policy can be used with all the dynamic selection functions, such as ``select``, ``submit``,
+and ``submit_and_wait``. It can also be used with ``policy_traits``.
+
+
+Example
+-------
+
+In the following example, a ``fixed_resource_policy`` is used when the code is
+compiled so that it selects a specific device. When ``USE_CPU`` is defined at
+compile time, this example always uses the CPU queue. When ``USE_GPU`` is defined
+at compile time, it always uses the GPU queue. Otherwise, it uses an
+``auto_tune_policy`` to dynamically select between these two queues. Such a scenario
+could be used for debugging, or simply to maintain the dynamic selection code even if
+the best device to use is known for some subset of platforms.
+
+.. code:: cpp
+
+  #include <oneapi/dpl/dynamic_selection>
+  #include <sycl/sycl.hpp>
+  #include <iostream>
+
+  namespace ex = oneapi::dpl::experimental;
+
+  int main() {
+    std::vector r { sycl::queue{sycl::cpu_selector_v},
+                    sycl::queue{sycl::gpu_selector_v} };
+
+    const std::size_t N = 10000;
+    std::vector av(N, 0.0);
+    std::vector bv(N, 0.0);
+    std::vector cv(N, 0.0);
+    for (int i = 0; i < N; ++i) {
+      av[i] = bv[i] = i;
+    }
+
+  #if USE_CPU
+    ex::fixed_resource_policy p{r}; // (1) uses index 0 of r, the cpu
+  #elif USE_GPU
+    ex::fixed_resource_policy p{r, 1}; // (2) uses index 1 of r, the gpu
+  #else
+    ex::auto_tune_policy p{r};
+  #endif
+
+    {
+      sycl::buffer a_b(av);
+      sycl::buffer b_b(bv);
+      sycl::buffer c_b(cv);
+
+
+      for (int i = 0; i < 6; ++i) {
+        ex::submit_and_wait(p, [&](sycl::queue q) { // (3)
+          // (4)
+          std::cout << (q.get_device().is_cpu() ? "using cpu\n" : "using gpu\n");
+          return q.submit([&](sycl::handler &h) { // (5)
+            sycl::accessor a_a(a_b, h, sycl::read_only);
+            sycl::accessor b_a(b_b, h, sycl::read_only);
+            sycl::accessor c_a(c_b, h, sycl::read_write);
+            h.parallel_for(N, [=](auto i) { c_a[i] = a_a[i] + b_a[i]; });
+          });
+        });
+      }
+    }
+
+    for (int i = 0; i < N; ++i) {
+      if (cv[i] != 2*i) {
+        std::cout << "ERROR!\n";
+      }
+    }
+    std::cout << "Done.\n";
+  }
+
+The key points in this example are:
+
+#. If ``USE_CPU`` is defined, a ``fixed_resource_policy`` is constructed that targets the CPU.
+#. If ``USE_GPU`` is defined, a ``fixed_resource_policy`` is constructed that targets the GPU.
+#. ``submit_and_wait`` is invoked with the policy as the first argument. The selected queue will be passed to the user-provided function.
+#. For clarity when run, the type of device is displayed.
+#. The queue is used in a function to perform an asynchronous offload. The SYCL event returned from the call to ``submit`` is returned. Returning an event is required for functions passed to ``submit`` and ``submit_and_wait``.
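The index semantics in points 1 and 2, together with the rule that a deferred policy must be initialized before use, can be illustrated with a standalone model that uses strings in place of SYCL queues. This is a minimal sketch. ``toy_fixed_policy`` and ``toy_deferred_t`` are hypothetical stand-ins for ``fixed_resource_policy`` and ``deferred_initialization_t``. The out-of-range check on the offset is an assumption of the sketch, not documented behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tag standing in for deferred_initialization_t.
struct toy_deferred_t {};

// Simplified model of fixed-offset selection; not the oneDPL implementation.
class toy_fixed_policy {
public:
  // Deferred: select() throws until initialize() is called.
  explicit toy_fixed_policy(toy_deferred_t) {}

  // Immediate: offset defaults to 0, like the documented constructor.
  explicit toy_fixed_policy(std::vector<std::string> resources, std::size_t offset = 0) {
    initialize(std::move(resources), offset);
  }

  void initialize(std::vector<std::string> resources, std::size_t offset = 0) {
    // Assumption of this sketch: reject offsets that do not name a resource.
    if (offset >= resources.size()) {
      throw std::out_of_range("offset out of range");
    }
    resources_ = std::move(resources);
    offset_ = offset;
    initialized_ = true;
  }

  // Always returns the same resource, as in the Selection Algorithm section.
  const std::string& select() const {
    if (!initialized_) {
      throw std::logic_error("select called before initialization");
    }
    return resources_[offset_];
  }

private:
  std::vector<std::string> resources_;
  std::size_t offset_ = 0;
  bool initialized_ = false;
};
```

Because the offset is fixed at construction or initialization, every call to ``select`` on ``toy_fixed_policy(r)`` yields ``r[0]``, and every call on ``toy_fixed_policy(r, 1)`` yields ``r[1]``.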
+ +Selection Algorithm +------------------- + +The selection algorithm for ``fixed_resource_policy`` always returns +the same specific resource from its set of resources. The index of the +resource is set during construction or deferred initialization. + +Simplified, expository implementation of the selection algorithm: + +.. code:: + + template + selection_type fixed_resource_policy::select(Args&& ...) { + if (initialized_) { + return selection_type{*this, resources_[fixed_offset_]}; + } else { + throw std::logic_error(“select called before initialization”); + } + } + +where ``resources_`` is a container of resources, such as +``std::vector`` of ``sycl::queue``, and ``fixed_offset_`` stores a +fixed integer offset. Both ``resources_`` and ``fixed_offset`` +are set during construction or deferred initialization of the policy +and then remain constant. + +Constructors +------------ + +``fixed_resource_policy`` provides three constructors. + +.. list-table:: ``fixed_resource_policy`` constructors + :widths: 50 50 + :header-rows: 1 + + * - Signature + - Description + * - ``fixed_resource_policy(deferred_initialization_t);`` + - Defers initialization. An ``initialize`` function must be called prior to use. + * - ``fixed_resource_policy(std::size_t offset = 0);`` + - Sets the index for the resource to be selected. Uses the default set of resources. + * - ``fixed_resource_policy(const std::vector& u, std::size_t offset = 0);`` + - Overrides the default set of resources and optionally sets the index for the resource to be selected. + +Deferred Initialization +----------------------- + +A ``fixed_resource_policy`` that was constructed with deferred initialization must be +initialized by calling one its ``initialize`` member functions before it can be used +to select or submit. + +.. 
list-table:: ``fixed_resource_policy`` constructors + :widths: 50 50 + :header-rows: 1 + + * - Signature + - Description + * - ``initialize(std::size_t offset = 0);`` + - Sets the index for the resource to be selected. Uses the default set of resources. + * - ``initialize(const std::vector& u, std::size_t offset = 0);`` + - Overrides the default set of resources and optionally sets the index for the resource to be selected. + +Queries +------- + +A ``fixed_resource_policy`` has ``get_resources`` and ``get_submission_group`` +member functions. + +.. list-table:: ``fixed_resource_policy`` constructors + :widths: 50 50 + :header-rows: 1 + + * - Signature + - Description + * - ``std::vector get_resources();`` + - Returns the set of resources the policy is selecting from. + * - ``auto get_submission_group();`` + - Returns an object that can be used to wait for all active submissions. + +Reporting Requirements +---------------------- + +If a resource returned by ``select`` is used directly without calling +``submit`` or ``submit_and_wait``, it may be necessary to call ``report`` +to provide feedback to the policy. However, the ``fixed_resource_policy`` +does not require any feedback about the system state or the behavior of +the workload. Therefore, no explicit reporting of execution information +is needed, as is summarized in the table below. + +.. list-table:: ``fixed_resource_policy`` reporting requirements + :widths: 50 50 + :header-rows: 1 + + * - ``execution_info`` + - is reporting required? + * - ``task_submission`` + - No + * - ``task_completion`` + - No + * - ``task_time`` + - No + +In generic code, it is possible to perform compile-time checks to avoid +reporting overheads when reporting is not needed, while still writing +code that will work with any policy, as demonstrated below: + +.. 
code:: cpp + + auto s = select(my_policy); + if constexpr (report_info_v) + { + s.report(execution_info::task_submission); + } diff --git a/_sources/dynamic_selection_api/functions.rst.txt b/_sources/dynamic_selection_api/functions.rst.txt new file mode 100644 index 00000000000..6dd574b5020 --- /dev/null +++ b/_sources/dynamic_selection_api/functions.rst.txt @@ -0,0 +1,597 @@ +Functions +######### + +The dynamic selection API is an experimental feature in the |onedpl_long| +(|onedpl_short|) that selects an *execution resource* based on a chosen +*selection policy*. There are several functions provided as part +of the API. + +Select +------ + +.. code:: cpp + + namespace oneapi::dpl::experimental { + template + selection_t select(Policy&& p, Args&&... args); + } + +The function ``select`` chooses a resource (the *selection*) based on the +policy ``p``. Whether any additional arguments are needed or considered +depends on the policy. + +An example that calls ``select`` using a ``round_robin_policy``: + +.. code:: cpp + + #include + #include + #include + + namespace ex = oneapi::dpl::experimental; + + int main() { + ex::round_robin_policy p{ { sycl::queue{ sycl::cpu_selector_v }, + sycl::queue{ sycl::gpu_selector_v } } }; + + for (int i = 0; i < 6; ++i) { + auto selection = ex::select(p); + auto q = ex::unwrap(selection); + std::cout << "selected queue is " + << ((q.get_device().is_gpu()) ? "gpu\n" : "cpu\n"); + } + } + +The output of this example: + +.. code:: + + selected queue is cpu + selected queue is gpu + selected queue is cpu + selected queue is gpu + selected queue is cpu + selected queue is gpu + +The object returned by ``select`` is a *selection*. The exact type of the +selection object depends on the policy. If it is necessary to know the exact +type, it can be determined by using traits: +``policy_trait::selection_type`` or the helper trait ``selection_t``. + +Unwrapping a *selection* returns the underlying resource. 
For example, unwrapping +a selection when using SYCL (the default) results in a SYCL queue. + +A selection can also be used to report *execution info*. More advanced policies, +such as ``dynamic_load_policy`` and ``auto_tune_policy`` require that runtime +execution information be reported back through the selection object when the +selection resource is used. + +When possible, the selection should be passed to a ``submit`` or ``submit_and_wait`` function as the mechanism for submitting work to the resource. The ``submit`` and +``submit_and_wait`` functions implement the reporting of execution information +needed by some policies, such as ``dynamic_load_policy`` and ``auto_tune_poliy``. +If the selected resource is used directly, this reporting must be done explicitly +(using the ``report`` functions). + +Submit +------ + +``submit`` has two function signatures: + +#. the first argument is a *policy* object. +#. the first argument is a *selection* object that was returned by a previous call to ``select``. + +Submit Using a Policy ++++++++++++++++++++++ + +.. code:: cpp + + namespace oneapi::dpl::experimental { + template + submission_t submit(Policy&& p, F&& f, Args&&... args); + } + +Chooses a resource using the policy ``p`` and +then calls the user function ``f``, passing the unwrapped selection +and ``args...`` as the arguments. It also implements the necessary +calls to report execution information for policies that +require reporting. + +``submit`` returns a *submission* object. Passing the *submission* object to the +``wait`` function will block the calling thread until the work offloaded by the +submission is complete. When using SYCL queues, this behaves as if calling +``sycl::event::wait`` on the SYCL event returned by the user function. + +The following example demonstrates the use of the function ``submit`` and the +function ``wait``. 
The use of ``single_task`` is for syntactic demonstration +purposes only; any valid command group or series of command groups can be +submitted to the selected queue. + +.. code:: cpp + + #include + #include + #include + + namespace ex = oneapi::dpl::experimental; + + int main() { + ex::round_robin_policy p{ { sycl::queue{ sycl::cpu_selector_v }, + sycl::queue{ sycl::gpu_selector_v } } }; + + for (int i = 0; i < 4; ++i) { + auto done = ex::submit(/* policy object */ p, + /* user function */ + [](sycl::queue q, /* any additional args... */ int j) { + std::cout << "(j == " << j << "): submit to " + << ((q.get_device().is_gpu()) ? "gpu\n" : "cpu\n"); + auto e = q.single_task([]() { /* do some work */ }); + return e; /* MUST return sycl::event */ + }, + /* any additional args... */ i); + std::cout << "(i == " << i << "): async work on main thread\n"; + ex::wait(done); + std::cout << "(i == " << i << "): submission done\n"; + } + } + +The output from this example: + +.. code:: + + (j == 0): submit to cpu + (i == 0): async work on main thread + (i == 0): submission done + (j == 1): submit to gpu + (i == 1): async work on main thread + (i == 1): submission done + (j == 2): submit to cpu + (i == 2): async work on main thread + (i == 2): submission done + (j == 3): submit to gpu + (i == 3): async work on main thread + (i == 3): submission done + +Submit Using a Selection +++++++++++++++++++++++++ + +.. code:: cpp + + namespace oneapi::dpl::experimental { + template + auto submit(Selection s, F&& f, Args&&... args); + } + +Calls the user function ``f``, passing the unwrapped selection ``s`` and ``args...`` +as the arguments. It also implements the necessary calls to report execution +information for policies that require reporting. + +``submit`` returns a *submission* object. Passing the *submission* object to the +``wait`` function will block the calling thread until the work offloaded by the +submission is complete. 
When using SYCL queues, this behaves as if calling +``sycl::event::wait`` on the SYCL event returned by the user function. + +The following example demonstrates the use of the function ``submit`` with an +object returned by a call to ``select``. The use of ``single_task`` is for +syntactic demonstration purposes only; any valid command group or series of +command groups can be submitted to the selected queue. + +.. code:: cpp + + #include + #include + #include + + namespace ex = oneapi::dpl::experimental; + + int main() { + ex::round_robin_policy p{ { sycl::queue{ sycl::cpu_selector_v }, + sycl::queue{ sycl::gpu_selector_v } } }; + + for (int i = 0; i < 4; ++i) { + auto s = ex::select(p); + auto done = ex::submit(/* selection object */ s, + /* user function */ + [](sycl::queue q, /* any additional args... */ int j) { + std::cout << "(j == " << j << "): submit to " + << ((q.get_device().is_gpu()) ? "gpu\n" : "cpu\n"); + auto e = q.single_task([]() { /* do some work */ }); + return e; /* MUST return sycl::event */ + }, + /* any additional args... */ i); + std::cout << "(i == " << i << "): async work on main thread\n"; + ex::wait(done); + std::cout << "(i == " << i << "): submission done\n"; + } + } + +The output from this example: + +.. code:: + + (j == 0): submit to cpu + (i == 0): async work on main thread + (i == 0): submission done + (j == 1): submit to gpu + (i == 1): async work on main thread + (i == 1): submission done + (j == 2): submit to cpu + (i == 2): async work on main thread + (i == 2): submission done + (j == 3): submit to gpu + (i == 3): async work on main thread + (i == 3): submission done + +Wait +---- + +.. code:: cpp + + namespace oneapi::dpl::experimental { + template + void wait(W&& w); + } + +The function ``wait`` blocks the calling thread until the work associated with +object ``w`` is complete.
To wait for the completion of a specific submission, pass the object returned +by a call to ``submit``; to wait for all submissions made using a policy, pass the +object returned by a call to ``get_submission_group``. +Example code that demonstrates waiting for a specific +submission can be found in the section for ``submit``. + +The following is an example that demonstrates waiting for all submissions by passing +the object returned by ``get_submission_group()`` to ``wait``: + +.. code:: cpp + + #include + #include + #include + + namespace ex = oneapi::dpl::experimental; + + int main() { + ex::round_robin_policy p{ { sycl::queue{ sycl::cpu_selector_v }, + sycl::queue{ sycl::gpu_selector_v } } }; + + for (int i = 0; i < 4; ++i) { + auto done = ex::submit(/* policy object */ p, + /* user function */ + [](sycl::queue q, /* any additional args... */ int j) { + std::cout << "(j == " << j << "): submit to " + << ((q.get_device().is_gpu()) ? "gpu\n" : "cpu\n"); + auto e = q.single_task([]() { /* do some work */ }); + return e; /* MUST return sycl::event */ + }, + /* any additional args... */ i); + std::cout << "(i == " << i << "): async work on main thread\n"; + } + ex::wait(p.get_submission_group()); + std::cout << "done waiting for all submissions\n"; + } + +The output from this example: + +.. code:: + + (j == 0): submit to cpu + (i == 0): async work on main thread + (j == 1): submit to gpu + (i == 1): async work on main thread + (j == 2): submit to cpu + (i == 2): async work on main thread + (j == 3): submit to gpu + (i == 3): async work on main thread + done waiting for all submissions + +Submit and Wait +--------------- + +Just like ``submit``, ``submit_and_wait`` has two function signatures: + +#. the first argument is a *policy* object. +#. the first argument is a *selection* object that was returned by a previous call to ``select``.
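The relationship between ``submit``, ``wait``, the submission group, and ``submit_and_wait`` can be modeled with standard C++ futures. The following is only a sketch under stated assumptions: ``submission_group_model`` and ``wait_one`` are hypothetical names, a ``std::shared_future`` stands in for the *submission* object, and no oneDPL or SYCL types are involved.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical model, not the oneDPL API: each submission is represented by a
// std::shared_future, and the group records every submission made through it.
class submission_group_model {
public:
    // Starts f asynchronously and records the handle (analogous to submit).
    std::shared_future<void> submit(std::function<void()> f) {
        std::shared_future<void> s =
            std::async(std::launch::async, std::move(f)).share();
        std::lock_guard<std::mutex> lock(m_);
        pending_.push_back(s);
        return s;
    }

    // Blocks until every recorded submission has completed
    // (analogous to waiting on get_submission_group()).
    void wait_all() {
        std::vector<std::shared_future<void>> to_wait;
        {
            std::lock_guard<std::mutex> lock(m_);
            to_wait.swap(pending_);
        }
        for (auto& s : to_wait) s.wait();
    }

    // Submits and then blocks on that one submission (analogous to submit_and_wait).
    void submit_and_wait(std::function<void()> f) { submit(std::move(f)).wait(); }

private:
    std::mutex m_;
    std::vector<std::shared_future<void>> pending_;
};

// Waits on one specific submission (analogous to wait on a submit result).
inline void wait_one(const std::shared_future<void>& s) { s.wait(); }
```

In this model, ``submit_and_wait`` is literally ``submit`` followed by a wait on the returned handle, which mirrors how this section describes the function.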
+ +The difference between ``submit_and_wait`` and ``submit`` is that +``submit_and_wait`` blocks the calling thread until the work associated +with the submission is complete. This behavior is essentially a short-cut +for calling ``wait`` on the object returned by a call to ``submit``. + +Submit and Wait Using a Policy +++++++++++++++++++++++++++++++ + +.. code:: cpp + + namespace oneapi::dpl::experimental { + template + void submit_and_wait(Policy&& p, F&& f, Args&&... args); + } + +Chooses a resource using the policy ``p`` and +then calls the user function ``f``, passing the unwrapped selection +and ``args...`` as the arguments. It implements the necessary +calls to report execution information for policies that +require reporting. This function blocks the calling thread until +the user function and any work that it submits to the selected resource +are complete. + +The following example demonstrates the use of the function ``submit_and_wait``. +The use of ``single_task`` is for syntactic demonstration +purposes only; any valid command group or series of command groups can be +submitted to the selected queue. + +.. code:: cpp + + #include + #include + #include + + namespace ex = oneapi::dpl::experimental; + + int main() { + ex::round_robin_policy p{ { sycl::queue{ sycl::cpu_selector_v }, + sycl::queue{ sycl::gpu_selector_v } } }; + + for (int i = 0; i < 4; ++i) { + ex::submit_and_wait(/* policy object */ p, + /* user function */ + [](sycl::queue q, /* any additional args... */ int j) { + std::cout << "(j == " << j << "): submit to " + << ((q.get_device().is_gpu()) ? "gpu\n" : "cpu\n"); + auto e = q.single_task([]() { /* do some work */ }); + return e; /* MUST return sycl::event */ + }, + /* any additional args... */ i); + std::cout << "(i == " << i << "): submission done\n"; + } + } + +The output from this example: + +.. 
code:: + + (j == 0): submit to cpu + (i == 0): submission done + (j == 1): submit to gpu + (i == 1): submission done + (j == 2): submit to cpu + (i == 2): submission done + (j == 3): submit to gpu + (i == 3): submission done + + +Submit and Wait Using a Selection ++++++++++++++++++++++++++++++++++ + +.. code:: cpp + + namespace oneapi::dpl::experimental { + template + void submit_and_wait(Selection s, F&& f, Args&&... args); + } + +Calls the user function ``f``, passing the unwrapped selection ``s`` and ``args...`` +as the arguments. It also implements the necessary calls to report execution +information for policies that require reporting. + +This function blocks the calling thread until +the user function and any work that it submits to the resource +are complete. + +The following example demonstrates the use of the function ``submit_and_wait``. +The use of ``single_task`` is for syntactic demonstration +purposes only; any valid command group or series of command groups can be +submitted to the selected queue. + +.. code:: cpp + + #include + #include + #include + + namespace ex = oneapi::dpl::experimental; + + int main() { + ex::round_robin_policy p{ { sycl::queue{ sycl::cpu_selector_v }, + sycl::queue{ sycl::gpu_selector_v } } }; + + for (int i = 0; i < 4; ++i) { + auto s = ex::select(p); + ex::submit_and_wait(/* selection object */ s, + /* user function */ + [](sycl::queue q, /* any additional args... */ int j) { + std::cout << "(j == " << j << "): submit to " + << ((q.get_device().is_gpu()) ? "gpu\n" : "cpu\n"); + auto e = q.single_task([]() { /* do some work */ }); + return e; /* MUST return sycl::event */ + }, + /* any additional args... */ i); + std::cout << "(i == " << i << "): submission done\n"; + } + } + + +The output from this example: + +.. 
code:: + + (j == 0): submit to cpu + (i == 0): submission done + (j == 1): submit to gpu + (i == 1): submission done + (j == 2): submit to cpu + (i == 2): submission done + (j == 3): submit to gpu + (i == 3): submission done + +Policy Queries +-------------- + +Getting the Resource Options +++++++++++++++++++++++++++++ + +.. code:: cpp + + namespace oneapi::dpl::experimental { + template + std::vector> get_resources(Policy&& p); + } + +Returns a ``std::vector`` that contains the resources that a policy +selects from. The following example demonstrates the use of the function +``get_resources``. + +.. code:: cpp + + #include + #include + #include + + namespace ex = oneapi::dpl::experimental; + + int main() { + ex::round_robin_policy p_explicit{ { sycl::queue{ sycl::cpu_selector_v }, + sycl::queue{ sycl::gpu_selector_v } } }; + + std::cout << "Resources in explicitly set policy\n"; + for (auto& q : p_explicit.get_resources()) + std::cout << "queue is " << ((q.get_device().is_gpu()) ? "gpu\n" : "cpu\n"); + + std::cout << "\nResources in default policy\n"; + ex::round_robin_policy p_default; + for (auto& q : p_default.get_resources()) + std::cout << "queue is " << ((q.get_device().is_gpu()) ? "gpu\n" : "not-gpu\n"); + } + +The output from this example on a test machine is shown below. + +.. code:: + + Resources in explicitly set policy + queue is cpu + queue is gpu + + Resources in default policy + queue is not-gpu + queue is not-gpu + queue is gpu + queue is gpu + +When passing queues to the policy, the results show that the policy uses those +resources, a single CPU queue and a single GPU queue. + +The platform used to run this example has two GPU drivers installed, +as well as an FPGA emulator. When no resources are explicitly provided to the +policy constructor, the results show two non-GPU devices (the CPU and the FPGA +emulator) and two drivers for the GPU. + +Getting the Group of Submissions +++++++++++++++++++++++++++++++++ + +.. 
code:: cpp + + namespace oneapi::dpl::experimental { + template + auto get_submission_group(Policy&& p); + } + +Returns an object that can be passed to ``wait`` to block the calling +thread until all work submitted to queues managed by the policy is +complete. + +An example that demonstrates the use of this function can be found in +the section that describes the ``submit`` function. + +Report +------ + +Reporting Events with No Associated Values +++++++++++++++++++++++++++++++++++++++++++ + +.. code:: cpp + + namespace oneapi::dpl::experimental { + template + void report(Selection&& s, const Info& i); + } + +Reports an execution info event to the policy. Which events must be reported +is policy dependent. No reporting is necessary when using the ``submit`` or +``submit_and_wait`` functions, since these functions contain all necessary +instrumentation. + +An example that uses reporting for the ``dynamic_load_policy`` is shown +below. This reporting is only necessary because the resource returned by ``select`` +is used directly instead of being passed to a ``submit`` or ``submit_and_wait`` +function. The use of ``single_task`` is for syntactic demonstration +purposes only; any valid command group or series of command groups can be +submitted to the selected queue. + +.. code:: cpp + + #include + #include + #include + #include + + namespace ex = oneapi::dpl::experimental; + + int main() { + ex::dynamic_load_policy p{ { sycl::queue{ sycl::cpu_selector_v }, + sycl::queue{ sycl::gpu_selector_v } } }; + + for (int i = 0; i < 6; ++i) { + auto selection = ex::select(p); + auto q = ex::unwrap(selection); + + ex::report(selection, ex::execution_info::task_submission); + q.single_task([]() { /* do work */ }).wait(); + ex::report(selection, ex::execution_info::task_completion); + } + } + +Reporting Events with Associated Values ++++++++++++++++++++++++++++++++++++++++ + +.. 
code:: cpp + + namespace oneapi::dpl::experimental { + template + void report(Selection&& s, const Info& i, const Value& v); + } + +Reports an execution info event along with an associated value to the policy. +Which events must be reported is policy dependent. No reporting is necessary +if using the ``submit`` or ``submit_and_wait`` functions, since these functions contain +all necessary instrumentation. + +An example that uses reporting for the ``auto_tune_policy`` is shown +below. This reporting is only necessary in this case because the resource returned by ``select`` +is used directly instead of being passed to a ``submit`` or ``submit_and_wait`` +function. The use of ``single_task`` is for syntactic demonstration +purposes only; any valid command group or series of command groups can be +submitted to the selected queue. + +.. code:: cpp + + #include + #include + #include + #include + + namespace ex = oneapi::dpl::experimental; + + int main() { + ex::auto_tune_policy p{ { sycl::queue{ sycl::cpu_selector_v }, + sycl::queue{ sycl::gpu_selector_v } } }; + + for (int i = 0; i < 6; ++i) { + auto f = []() {}; + auto selection = ex::select(p, f); + auto q = ex::unwrap(selection); + + auto before = std::chrono::steady_clock::now(); + q.single_task(f).wait(); + auto after = std::chrono::steady_clock::now(); + ex::report(selection, ex::execution_info::task_time, (after-before).count()); + } + } diff --git a/_sources/dynamic_selection_api/policies.rst.txt b/_sources/dynamic_selection_api/policies.rst.txt new file mode 100644 index 00000000000..f74a2f0dca7 --- /dev/null +++ b/_sources/dynamic_selection_api/policies.rst.txt @@ -0,0 +1,134 @@ +Policies +######## + +The dynamic selection API is an experimental feature in the |onedpl_long| +(|onedpl_short|) that selects an *execution resource* based on a chosen +*selection policy*. There are several policies provided as part +of the API. Policies encapsulate the logic and any associated state needed +to make a selection.
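To make "encapsulate the logic and any associated state" concrete, here is a standard-C++ sketch of a policy in the spirit of ``dynamic_load_policy``, which chooses the resource with the fewest outstanding submissions. It is a hypothetical model, not oneDPL code: ``least_loaded_model`` and its member names are invented for illustration, resources are plain ``int`` values, and selections are indices. The two ``report_*`` members play the role of reporting ``task_submission`` and ``task_completion``.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical sketch, not oneDPL code: the policy object owns the state
// needed to select (resources plus outstanding-submission counts). The state
// sits behind a shared_ptr, so copies of the policy share it.
class least_loaded_model {
public:
    explicit least_loaded_model(std::vector<int> resources)
        : state_(std::make_shared<state>()) {
        assert(!resources.empty());
        state_->outstanding.assign(resources.size(), 0);
        state_->resources = std::move(resources);
    }

    // Index of the resource with the fewest outstanding submissions;
    // ties go to the lowest index.
    std::size_t select() const {
        const auto& o = state_->outstanding;
        return static_cast<std::size_t>(std::min_element(o.begin(), o.end()) - o.begin());
    }

    // Maps a selection back to its resource (analogous to unwrap).
    int unwrap(std::size_t selection) const { return state_->resources[selection]; }

    // Feedback analogous to reporting task_submission / task_completion.
    void report_submission(std::size_t selection) { ++state_->outstanding[selection]; }
    void report_completion(std::size_t selection) {
        assert(state_->outstanding[selection] > 0);
        --state_->outstanding[selection];
    }

private:
    struct state {
        std::vector<int> resources;
        std::vector<int> outstanding;
    };
    std::shared_ptr<state> state_;
};
```

Without the reports, this model would keep returning the first resource, which illustrates why policies of this kind depend on execution information being reported.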
+ +Policy Traits +------------- + +Traits can be used to determine useful type information about policies. + +.. code:: cpp + + namespace oneapi::dpl::experimental { + + template + struct policy_traits { + using selection_type = typename std::decay_t::selection_type; + using resource_type = typename std::decay_t::resource_type; + using wait_type = typename std::decay_t::wait_type; + }; + + template + using selection_t = typename policy_traits::selection_type; + + template + using resource_t = typename policy_traits::resource_type; + + template + using wait_t = typename policy_traits::wait_type; + + } + +``selection_t`` is the type returned by calls to ``select`` when using a policy of type ``Policy``. +Calling ``unwrap`` on an object of type ``selection_t`` returns an object of +type ``resource_t``. When using the default SYCL backend, ``resource_t`` +is ``sycl::queue`` and ``wait_t`` is ``sycl::event``. The user functions +passed to ``submit`` and ``submit_and_wait`` are expected to have a signature of: + +.. code:: cpp + + wait_t user_function(resource_t, ...); + +Common Reference Semantics +-------------------------- + +If a policy maintains state, the state is maintained separately for each +independent policy instance. For example, two independently constructed +instances of a ``round_robin_policy`` will operate independently of each other. +However, policies provide *common reference semantics*, so copies of a +policy instance share state. + +The following example demonstrates this difference: + +.. code:: cpp + + #include + #include + #include + #include + + namespace ex = oneapi::dpl::experimental; + + template + void print_type(const std::string &str, Selection s) { + auto q = ex::unwrap(s); + std::cout << str << ((q.get_device().is_gpu()) ? 
"gpu\n" : "cpu\n"); + } + + int main() { + ex::round_robin_policy p1{ { sycl::queue{ sycl::cpu_selector_v }, + sycl::queue{ sycl::gpu_selector_v } } }; + ex::round_robin_policy p2{ { sycl::queue{ sycl::cpu_selector_v }, + sycl::queue{ sycl::gpu_selector_v } } }; + ex::round_robin_policy p3 = p2; + + std::cout << "independent instances operate independently\n"; + auto p1s1 = ex::select(p1); + print_type("p1 selection 1: ", p1s1); + auto p2s1 = ex::select(p2); + print_type("p2 selection 1: ", p2s1); + auto p2s2 = ex::select(p2); + print_type("p2 selection 2: ", p2s2); + auto p1s2 = ex::select(p1); + print_type("p1 selection 2: ", p1s2); + + std::cout << "\ncopies provide common reference semantics\n"; + auto p3s1 = ex::select(p3); + print_type("p3 (copy of p2) selection 1: ", p3s1); + auto p2s3 = ex::select(p2); + print_type("p2 selection 3: ", p2s3); + auto p3s2 = ex::select(p3); + print_type("p3 (copy of p2) selection 2: ", p3s2); + auto p3s3 = ex::select(p3); + print_type("p3 (copy of p2) selection 3: ", p3s3); + auto p2s4 = ex::select(p2); + print_type("p2 selection 4: ", p2s4); + } + +The output of this example: + +.. code:: + + independent instances operate independently + p1 selection 1: cpu + p2 selection 1: cpu + p2 selection 2: gpu + p1 selection 2: gpu + + copies provide common reference semantics + p3 (copy of p2) selection 1: cpu + p2 selection 3: gpu + p3 (copy of p2) selection 2: cpu + p3 (copy of p2) selection 3: gpu + p2 selection 4: cpu + +Available Policies +------------------ + +More detailed information about the API is provided in the following sections: + +.. 
toctree:: + :maxdepth: 2 + :titlesonly: + :glob: + + fixed_resource_policy + round_robin_policy + dynamic_load_policy + auto_tune_policy + diff --git a/_sources/dynamic_selection_api/round_robin_policy.rst.txt b/_sources/dynamic_selection_api/round_robin_policy.rst.txt new file mode 100644 index 00000000000..0d766dc566c --- /dev/null +++ b/_sources/dynamic_selection_api/round_robin_policy.rst.txt @@ -0,0 +1,222 @@ +Round-Robin Policy +################## + +The dynamic selection API is an experimental feature in the |onedpl_long| +(|onedpl_short|) that selects an *execution resource* based on a chosen +*selection policy*. There are several policies provided as part +of the API. Policies encapsulate the logic and any associated state needed +to make a selection. + +The round-robin policy cycles through the set of resources at each selection. ``round_robin_policy`` +is useful for offloading kernels of similar cost to devices of similar +capabilities. In those cases, a round-robin assignment of kernels to devices +achieves good load balancing. + +.. code:: cpp + + namespace oneapi::dpl::experimental { + + template + class round_robin_policy { + public: + // useful types + using resource_type = typename Backend::resource_type; + using wait_type = typename Backend::wait_type; + + class selection_type { + public: + round_robin_policy get_policy() const; + resource_type unwrap() const; + }; + + // constructors + round_robin_policy(deferred_initialization_t); + round_robin_policy(); + round_robin_policy(const std::vector& u); + + // deferred initializer + void initialize(); + void initialize(const std::vector& u); + + // queries + auto get_resources() const; + auto get_submission_group(); + + // other implementation defined functions... + }; + + } + +This policy can be used with all the dynamic selection functions, such as ``select``, ``submit``, +and ``submit_and_wait``. It can also be used with ``policy_traits``.
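The rotation described above, together with the common reference semantics that policies provide, can be sketched in standard C++. This is an illustrative model under stated assumptions, not the oneDPL implementation: ``round_robin_model`` is a hypothetical name, resources are any copyable type ``T``, and the shared counter lives behind a ``std::shared_ptr`` so that copies advance the same rotation while independently constructed instances do not.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical sketch, not the oneDPL implementation: round-robin selection
// over resources of type T. Copies share one counter (common reference
// semantics); independently constructed instances keep separate counters.
template <typename T>
class round_robin_model {
public:
    round_robin_model() = default;  // deferred initialization
    explicit round_robin_model(std::vector<T> resources) { initialize(std::move(resources)); }

    void initialize(std::vector<T> resources) {
        assert(!resources.empty());
        state_ = std::make_shared<state>(state{std::move(resources), 0});
    }

    // Returns the next resource in rotation.
    const T& select() {
        if (!state_) throw std::logic_error("select called before initialization");
        auto& s = *state_;
        return s.resources[s.next++ % s.resources.size()];
    }

private:
    struct state {
        std::vector<T> resources;
        std::size_t next;
    };
    std::shared_ptr<state> state_;
};
```

With two resources, a copy made from an instance that has already selected twice continues the same rotation, so the next selection through the copy returns the first resource again.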
+ +Example +------- + +The following example demonstrates a simple approach to send work to each +queue in a set of queues, and then wait for all devices to complete the work +before repeating the process. A ``round_robin_policy`` is used to rotate through +the available devices. + +.. code:: cpp + + #include + #include + #include + + const std::size_t N = 10000; + namespace ex = oneapi::dpl::experimental; + + void f(sycl::handler& h, float* v); + + + int round_robin_example(std::vector& similar_devices, + std::vector& usm_data) { + + ex::round_robin_policy p{similar_devices}; // (1) + + auto num_devices = p.get_resources().size(); + auto num_arrays = usm_data.size(); + + // (2) + auto submission_group_size = (num_arrays < num_devices) ? num_arrays : num_devices; + + std::cout << "Running with " << num_devices << " queues\n" + << " " << num_arrays << " usm arrays\n" + << "Will perform " << submission_group_size << " concurrent offloads\n"; + + for (std::size_t i = 0; i < 100; i += submission_group_size) { // (3) + for (std::size_t j = 0; j < submission_group_size; ++j) { // (4) + ex::submit(p, [&](sycl::queue q) { // (5) + float* data = usm_data[j]; + return q.submit([=](sycl::handler &h) { // (6) + f(h, data); + }); + }); + } + ex::wait(p.get_submission_group()); // (7) + } + return 0; + } + +The key points in this example are: + +#. A ``round_robin_policy`` is constructed that rotates among the queues in ``similar_devices``. +#. The total number of concurrent offloads, ``submission_group_size``, will be limited to the number of USM arrays or the number of queues, whichever is smaller. +#. The outer ``i``-loop iterates from 0 to 99, stepping by the ``submission_group_size``. In each iteration, that many submissions are offloaded concurrently. +#. The inner ``j``-loop iterates over ``submission_group_size`` submissions. +#. ``submit`` is used to select a queue and pass it to the user's function, but does not block until the event returned by that function completes.
This provides the opportunity for concurrency across the submissions. +#. The queue is used in a function to perform an asynchronous offload. The SYCL event returned by ``q.submit`` is returned from the user function. Returning an event is required for functions passed to ``submit`` and ``submit_and_wait``. +#. ``wait`` is called to block until all of the concurrent ``submission_group_size`` submissions are complete. + +Selection Algorithm +------------------- + +The selection algorithm for ``round_robin_policy`` rotates through +the elements of the set of available resources. A simplified, expository +implementation of the selection algorithm follows: + +.. code:: cpp + + template + selection_type round_robin_policy::select(Args&&...) { + if (initialized_) { + auto& r = resources_[next_context_++ % num_resources_]; + return selection_type{*this, r}; + } else { + throw std::logic_error("select called before initialization"); + } + } + +where ``resources_`` is a container of resources, such as +``std::vector`` of ``sycl::queue``, ``next_context_`` is +a counter that increments at each selection, and ``num_resources_`` +is the size of the ``resources_`` vector. + +Constructors +------------ + +``round_robin_policy`` provides three constructors. + +.. list-table:: ``round_robin_policy`` constructors + :widths: 50 50 + :header-rows: 1 + + * - Signature + - Description + * - ``round_robin_policy(deferred_initialization_t);`` + - Defers initialization. An ``initialize`` function must be called prior to use. + * - ``round_robin_policy();`` + - Initialized to use the default set of resources. + * - ``round_robin_policy(const std::vector& u);`` + - Overrides the default set of resources. + +Deferred Initialization +----------------------- + +A ``round_robin_policy`` that was constructed with deferred initialization must be +initialized by calling one of its ``initialize`` member functions before it can be used +to select or submit. + +.. 
list-table:: ``round_robin_policy`` initialization functions + :widths: 50 50 + :header-rows: 1 + + * - Signature + - Description + * - ``initialize();`` + - Initializes the policy to use the default set of resources. + * - ``initialize(const std::vector& u);`` + - Overrides the default set of resources. + +Queries +------- + +A ``round_robin_policy`` has ``get_resources`` and ``get_submission_group`` +member functions. + +.. list-table:: ``round_robin_policy`` queries + :widths: 50 50 + :header-rows: 1 + + * - Signature + - Description + * - ``std::vector get_resources();`` + - Returns the set of resources the policy is selecting from. + * - ``auto get_submission_group();`` + - Returns an object that can be used to wait for all active submissions. + +Reporting Requirements +---------------------- + +If a resource returned by ``select`` is used directly without calling +``submit`` or ``submit_and_wait``, it may be necessary to call ``report`` +to provide feedback to the policy. However, the ``round_robin_policy`` +does not require any feedback about the system state or the behavior of +the workload. Therefore, no explicit reporting of execution information +is needed, as summarized in the table below. + +.. list-table:: ``round_robin_policy`` reporting requirements + :widths: 50 50 + :header-rows: 1 + + * - ``execution_info`` + - Is reporting required? + * - ``task_submission`` + - No + * - ``task_completion`` + - No + * - ``task_time`` + - No + +In generic code, it is possible to perform compile-time checks to avoid +reporting overheads when reporting is not needed, while still writing +code that will work with any policy, as demonstrated below: + +.. 
code:: cpp + + auto s = select(my_policy); + if constexpr (report_info_v) + { + s.report(execution_info::task_submission); + } diff --git a/_sources/dynamic_selection_api_main.rst.txt b/_sources/dynamic_selection_api_main.rst.txt new file mode 100644 index 00000000000..2b3f18d97a6 --- /dev/null +++ b/_sources/dynamic_selection_api_main.rst.txt @@ -0,0 +1,92 @@ +Dynamic Selection API +##################### + +The dynamic selection API is an experimental feature in the |onedpl_long| (|onedpl_short|). +Dynamic selection provides functions for choosing a *resource* using a +*selection policy*. By default, the resources selected via these APIs +in |onedpl_short| are SYCL queues. There are several functions and selection +policies provided as part of the API. + +The selection policies include: ``fixed_resource_policy`` that always selects a +specified resource, ``round_robin_policy`` that rotates between resources, +``dynamic_load_policy`` that chooses the resource that has the fewest outstanding +submissions, and ``auto_tune_policy`` that chooses the best-performing resource based on runtime +profiling information. + +Policy objects are used as arguments to the dynamic selection functions. The +``select`` function picks a resource based on a policy and returns a selection that wraps it. The ``submit`` +and ``submit_and_wait`` functions select a resource and then pass the chosen resource +to a developer-provided function. + +The following code example shows some of the key aspects of the API. The use +of an empty ``single_task`` is for syntactic demonstration purposes only; +any valid command group or series of command groups can be submitted to the +selected queue. + +.. 
code:: cpp + + #include + #include + #include + + namespace ex = oneapi::dpl::experimental; + + int main() { + + // (1) create a policy object + ex::round_robin_policy p{ { sycl::queue{ sycl::cpu_selector_v }, + sycl::queue{ sycl::gpu_selector_v } } }; + + for (int i = 0; i < 6; ++i) { + + // (2) call one of the dynamic selection functions + // -- pass the policy to the API function + // -- provide a function to be called with a selected queue + // -- the user function must receive a SYCL queue + // -- the user function must return a SYCL event + auto done = ex::submit(p, + // (3) use the selected queue in user function + [=](sycl::queue q) { + std::cout << "submit task to " + << ((q.get_device().is_gpu()) ? "gpu\n" : "cpu\n"); + return q.single_task([]() { /* work here */ }); + }); + + // (4) each submission can be waited on using the returned object + ex::wait(done); + } + + // (5) and/or all submissions can be waited on as a group + ex::wait(p.get_submission_group()); + } + +In the preceding example, the key points in the code include: + +#. A policy object is created. In this example, the policy is a ``round_robin_policy`` that will rotate between a CPU SYCL queue and a GPU SYCL queue. +#. The ``submit`` function is called in a loop. The arguments to the call include the policy object and a user-provided function. +#. The user-provided function receives a SYCL queue (selected by the policy) and returns a SYCL event that represents the end of the chain of work that was submitted to the queue. +#. The ``submit`` function returns an object that can be waited on. Calling ``wait`` on the ``done`` object blocks the main thread until the work submitted to the queue by your function is complete. +#. The whole group of submissions made during the loop can be waited on. In this example, the call is redundant, since each submission was already waited for inside the loop body. + +The output from this example is: + +.. 
code:: + + submit task to cpu + submit task to gpu + submit task to cpu + submit task to gpu + submit task to cpu + submit task to gpu + +This output shows that the user function is passed alternating queues. + +More detailed information about the API is provided in the following sections: + +.. toctree:: + :maxdepth: 2 + :titlesonly: + :glob: + + dynamic_selection_api/functions + dynamic_selection_api/policies diff --git a/_sources/index.rst.txt b/_sources/index.rst.txt index ea608f001be..53f5e9cfee8 100644 --- a/_sources/index.rst.txt +++ b/_sources/index.rst.txt @@ -22,6 +22,7 @@ For general information, refer to the `oneDPL GitHub* repository introduction parallel_api_main api_for_sycl_kernels_main + dynamic_selection_api_main macros cmake_support oneDPL_2022.0_changes diff --git a/_sources/introduction.rst.txt b/_sources/introduction.rst.txt index cacc6ee8abc..7ab4434c8b8 100644 --- a/_sources/introduction.rst.txt +++ b/_sources/introduction.rst.txt @@ -65,6 +65,9 @@ Difference with Standard C++ Parallel Algorithms * oneDPL execution policies only result in parallel execution if random access iterators are provided, the execution will remain serial for other iterator types. +* Function objects passed to algorithms executed with device policies must provide a ``const``-qualified ``operator()``. + `The SYCL specification `_ states that writing to such an object during a SYCL + kernel is undefined behavior. * For the following algorithms, par_unseq and unseq policies do not result in vectorized execution: ``includes``, ``inplace_merge``, ``merge``, ``set_difference``, ``set_intersection``, ``set_symmetric_difference``, ``set_union``, ``stable_partition``, ``unique``.
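The ``const``-qualified ``operator()`` requirement for function objects used with device policies can be illustrated in host-only standard C++. This is a sketch, not oneDPL code: ``scale_by`` and ``scale_all`` are hypothetical names, and ``std::transform`` stands in for an algorithm call so that the example builds without SYCL. The point is the shape of the functor: its call operator is ``const`` and never writes to the object's members.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical functor: the call operator is const-qualified and only reads
// the member `factor`, which is the form required for device policies.
// A functor whose operator() incremented a member counter would not qualify.
struct scale_by {
    int factor;
    int operator()(int x) const { return x * factor; }
};

// Host-only helper using std::transform; no SYCL or oneDPL is involved here.
inline std::vector<int> scale_all(const std::vector<int>& in, int factor) {
    std::vector<int> out(in.size());
    std::transform(in.begin(), in.end(), out.begin(), scale_by{factor});
    return out;
}
```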
@@ -91,6 +94,15 @@ When called with |dpcpp_short| execution policies, |onedpl_short| algorithms app Known Limitations ***************** +* When compiled with the ``-fsycl-pstl-offload`` option of the Intel oneAPI DPC++/C++ Compiler and with + ``libstdc++`` version 8 or ``libc++``, ``oneapi::dpl::execution::par_unseq`` offloads + standard parallel algorithms to the SYCL device similarly to ``std::execution::par_unseq`` + in accordance with the ``-fsycl-pstl-offload`` option value. +* For ``transform_exclusive_scan`` and ``exclusive_scan`` to run in-place (that is, with the same data + used for both input and destination) and with an execution policy of ``unseq`` or ``par_unseq``, + it is required that the provided input and destination iterators are equality comparable. + Furthermore, the equality comparison of the input and destination iterators must evaluate to true. + If these conditions are not met, the result of these algorithm calls is undefined. * For ``transform_exclusive_scan``, ``transform_inclusive_scan`` algorithms the result of the unary operation should be convertible to the type of the initial value if one is provided, otherwise it is convertible to the type of values in the processed data sequence: ``std::iterator_traits::value_type``. @@ -98,6 +110,8 @@ Known Limitations vector execution policies when building a program with GCC 10 and using ``-O0`` option. * Compiling ``reduce`` and ``transform_reduce`` algorithms with the Intel DPC++ Compiler, versions 2021 and older, may result in a runtime error. To fix this issue, use an Intel DPC++ Compiler version 2022 or newer. +* When compiling on Windows, add the option ``/EHsc`` to the compilation command to avoid errors with oneDPL's experimental + ranges API, which uses exceptions. * The use of |onedpl_short| together with the GNU C++ standard library (libstdc++) version 9 or 10 may lead to compilation errors (caused by oneTBB API changes). Using libstdc++ version 9 requires TBB version 2020 for the header file.
This may result in compilation errors when diff --git a/_sources/introduction/onedpl_gsg.rst.txt b/_sources/introduction/onedpl_gsg.rst.txt index 99862e45087..5b598c23c30 100644 --- a/_sources/introduction/onedpl_gsg.rst.txt +++ b/_sources/introduction/onedpl_gsg.rst.txt @@ -46,9 +46,9 @@ and use the ``std`` namespace. CMake Support ------------- -`CMake `_ generates build scripts which can then be used to build and link your application. |onedpl_short| can be added to your project via CMake. +`CMake `_ generates build scripts which can then be used to build and link your application. |onedpl_short| can be added to your project via CMake. -A simple example for Linux is provided below. For more detailed usage and options including details specific to Windows, please look to the `CMake Support Page `_. +A simple example for Linux is provided below. For more detailed usage and options including details specific to Windows, please look to the `CMake Support Page `_. Simple Example CMake File ************************* @@ -212,4 +212,6 @@ Find More * - `oneDPL Samples `_ - Learn how to use |onedpl_short| with samples. * - `Layers for Yocto* Project `_ - - Add oneAPI components to a Yocto project build using the meta-intel layers. \ No newline at end of file + - Add oneAPI components to a Yocto project build using the meta-intel layers. + * - `oneAPI Samples Catalog `_ + - Explore the complete list of oneAPI code samples in the oneAPI Samples Catalog (GitHub*). These samples were designed to help you develop, offload, and optimize multiarchitecture applications targeting CPUs, GPUs, and FPGAs. \ No newline at end of file diff --git a/_sources/onedpl_gsg.rst.txt b/_sources/onedpl_gsg.rst.txt index 87e34c7eeb4..5b598c23c30 100644 --- a/_sources/onedpl_gsg.rst.txt +++ b/_sources/onedpl_gsg.rst.txt @@ -212,4 +212,6 @@ Find More * - `oneDPL Samples `_ - Learn how to use |onedpl_short| with samples. 
   * - `Layers for Yocto* Project `_
-     - Add oneAPI components to a Yocto project build using the meta-intel layers.
\ No newline at end of file
+     - Add oneAPI components to a Yocto project build using the meta-intel layers.
+   * - `oneAPI Samples Catalog `_
+     - Explore the complete list of oneAPI code samples in the oneAPI Samples Catalog (GitHub*). These samples were designed to help you develop, offload, and optimize multiarchitecture applications targeting CPUs, GPUs, and FPGAs.
\ No newline at end of file
diff --git a/_static/documentation_options.js b/_static/documentation_options.js
index e70f11feaed..82fe17c8d52 100644
--- a/_static/documentation_options.js
+++ b/_static/documentation_options.js
@@ -1,6 +1,6 @@
 var DOCUMENTATION_OPTIONS = {
     URL_ROOT: document.getElementById("documentation_options").getAttribute('data-url_root'),
-    VERSION: '2022.2.0',
+    VERSION: '2022.3.0',
     LANGUAGE: 'None',
     COLLAPSE_INDEX: false,
     BUILDER: 'html',
diff --git a/_static/pygments.css b/_static/pygments.css
index f227e5c6e0e..6110e9f1add 100644
--- a/_static/pygments.css
+++ b/_static/pygments.css
@@ -22,6 +22,7 @@ span.linenos.special { color: #000000; background-color: #ffffc0; padding-left:
 .highlight .cs { color: #8f5902; font-style: italic } /* Comment.Special */
 .highlight .gd { color: #a40000 } /* Generic.Deleted */
 .highlight .ge { color: #000000; font-style: italic } /* Generic.Emph */
+.highlight .ges { color: #000000; font-weight: bold; font-style: italic } /* Generic.EmphStrong */
 .highlight .gr { color: #ef2929 } /* Generic.Error */
 .highlight .gh { color: #000080; font-weight: bold } /* Generic.Heading */
 .highlight .gi { color: #00A000 } /* Generic.Inserted */
diff --git a/api_for_sycl_kernels/random.html b/api_for_sycl_kernels/random.html
index 84f16857621..d35b0cd61e7 100644
--- a/api_for_sycl_kernels/random.html
+++ b/api_for_sycl_kernels/random.html
@@ -6,7 +6,7 @@
- Random Number Generators — oneDPL Documentation 2022.2.0 documentation
+ Random Number Generators — oneDPL Documentation 2022.3.0 documentation
@@ -96,7 +96,7 @@
- oneDPL Documentation 2022.2.0 documentation
+ oneDPL Documentation 2022.3.0 documentation