Add support for async f! and j! #229
base: main
Conversation
Nice! When you know how much it affects performance, please let us know.
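One rough way to measure that (a sketch only; the toy f! and arrays below stand in for the real tendency call, and BenchmarkTools plus CUDA.@sync is just one way to get honest GPU timings):

using CUDA, BenchmarkTools

f!(du, u) = (du .= 2 .* u; nothing)  # toy RHS standing in for the real tendency
u  = CUDA.rand(2^20)
du = similar(u)

# CUDA.@sync waits for all queued GPU work, so asynchronous launches are
# fully counted rather than just their launch overhead.
t_rhs = @belapsed CUDA.@sync f!($du, $u)
@show t_rhs

Running this against main and against this PR's async path would give the comparison.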
Force-pushed from a0138c1 to a685d12.
Force-pushed from 35338ba to 6f4b777.
Trying it out here: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/122
Interesting, we're hitting …
I guess maybe we need a smaller timestep? (We could try increasing maxiter, but I assume it was reduced for good reason.)
Force-pushed from 471efd2 to 6d87113.
Force-pushed from 9b0e8a2 to 522bb2d.
Update on this: the failures in ClimaAtmos were because the device was not synchronized; adding proper syncs fixed this. I've updated the PR in a couple of ways:
We should compare the streams vs. tasks approaches on the CUDA side; the multiple-streams version with … One way we can try getting concurrent execution is by reducing the number of blocks in our kernel launches, but we might only want to (somehow) do that in … So, all in all, this isn't immediately offering a performance improvement on GPUs yet. The CPU multithreaded implementation might be an improvement, but it hasn't been tested or benchmarked yet. Also, there could be interplay with the threads used in the broadcast kernels (which, perhaps, we could/should just remove).
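For reference, here is a minimal self-contained sketch of the two strategies mentioned above (streams vs. tasks) with CUDA.jl; the toy kernels and the launch configuration are illustrative, not the actual ClimaTimeSteppers kernels:

using CUDA

# Toy "tendency" kernels standing in for the real ones.
function kernel_a!(du, u)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    i <= length(du) && (du[i] = 2 * u[i])
    return nothing
end
function kernel_b!(du, u)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    i <= length(du) && (du[i] = u[i] * u[i])
    return nothing
end

u, du1, du2 = CUDA.rand(2^20), CUDA.zeros(2^20), CUDA.zeros(2^20)
nblocks = cld(length(u), 256)
CUDA.synchronize()  # make sure setup on the default stream is done before using other streams

# (1) Multiple streams: launch each kernel on its own stream, then synchronize.
s1, s2 = CUDA.CuStream(), CUDA.CuStream()
@cuda threads=256 blocks=nblocks stream=s1 kernel_a!(du1, u)
@cuda threads=256 blocks=nblocks stream=s2 kernel_b!(du2, u)
CUDA.synchronize()

# (2) Julia tasks: each task uses its own task-local stream in CUDA.jl.
t1 = Threads.@spawn begin
    @cuda threads=256 blocks=nblocks kernel_a!(du1, u)
    CUDA.synchronize()
end
t2 = Threads.@spawn begin
    @cuda threads=256 blocks=nblocks kernel_b!(du2, u)
    CUDA.synchronize()
end
wait(t1); wait(t2)

Whether the two launches actually overlap depends on occupancy, which is exactly where reducing the number of blocks per launch comes in.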
Force-pushed from f16e85d to f4d5494.
Force-pushed from f4d5494 to 8df83cc.
Further update: we will not add async support with … I've updated the PR to reflect this.
Force-pushed from 8df83cc to b29df31.
The PR title was changed from "Add support for async T_exp! and T_lim!" to "Add support for async f! and j!".
event = CUDA.CuEvent(CUDA.EVENT_DISABLE_TIMING)
CUDA.record(event, CUDA.stream()) # record event on main stream

stream1 = CUDA.CuStream() # make a stream
It could be that CUDA doesn't like making these streams at every RHS evaluation. Maybe we can try caching this in the timestepper?
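One possible shape for that caching (purely illustrative; the struct and field names are assumptions, not ClimaTimeSteppers types): allocate the stream and event once and stash them in the cache that the RHS already receives:

using CUDA

struct AsyncGPUCache
    stream1::CUDA.CuStream
    event::CUDA.CuEvent
end
AsyncGPUCache() = AsyncGPUCache(CUDA.CuStream(), CUDA.CuEvent(CUDA.EVENT_DISABLE_TIMING))

cache = AsyncGPUCache()  # constructed once, e.g. alongside the timestepper cache

function example_rhs!(du, u, cache::AsyncGPUCache, t)
    CUDA.record(cache.event, CUDA.stream())  # record on the main stream, as in the diff above
    # ... launch work on cache.stream1, ordered against cache.event as needed ...
    return nothing
end

That way no CuStream or CuEvent is constructed inside the hot RHS path.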
This could be really good for strong scaling.
@tapios suggested adding this if it's not too difficult, and here it is:
This PR adds support for asynchronous f! and j! on both CPU and GPU. We can try it out by passing the comms context, but we don't need to use it if we don't want to.
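As a CPU-side illustration of the idea (a toy sketch, not this PR's implementation; tendency_a!, tendency_b!, and the parameter fields in p are stand-ins, assuming two independent tendency contributions):

# Toy tendencies standing in for the real ones.
tendency_a!(dY, Y, p, t) = (dY .= 2 .* Y; nothing)
tendency_b!(dY, Y, p, t) = (dY .= .-Y; nothing)

# f! evaluates the two contributions on separate threads, waits, then combines.
function f!(dY, Y, p, t)
    job_a = Threads.@spawn tendency_a!(p.dY_a, Y, p, t)
    job_b = Threads.@spawn tendency_b!(p.dY_b, Y, p, t)
    wait(job_a); wait(job_b)
    dY .= p.dY_a .+ p.dY_b
    return nothing
end

Y = rand(100)
p = (; dY_a = similar(Y), dY_b = similar(Y))
f!(similar(Y), Y, p, 0.0)

On GPU the same structure applies with each task using its own stream (as sketched earlier), and passing the comms context is what would let the timestepper decide whether to take this path.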