Add support for async f! and j! #229
base: main
Conversation
Nice! When you know how much it affects performance, please let us know.
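One rough way to measure that (a sketch only; the toy f! and arrays below stand in for the real tendency call, and BenchmarkTools plus CUDA.@sync is just one way to get honest GPU timings):

using CUDA, BenchmarkTools

f!(du, u) = (du .= 2 .* u; nothing)  # toy RHS standing in for the real tendency
u  = CUDA.rand(2^20)
du = similar(u)

# CUDA.@sync waits for all queued GPU work, so asynchronous launches are
# fully counted rather than just their launch overhead.
t_rhs = @belapsed CUDA.@sync f!($du, $u)
@show t_rhs

Running this against main and against this PR's async path would give the comparison.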
Force-pushed from a0138c1 to a685d12.
Force-pushed from 35338ba to 6f4b777.
Trying it out here: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/122
Interesting, we're hitting …
I guess maybe we need a smaller timestep? (We could try increasing maxiter, but I assume it was reduced for good reason.)
Force-pushed from 471efd2 to 6d87113.
Force-pushed from 9b0e8a2 to 522bb2d.
Update on this: the failures in ClimaAtmos were because the device was not synchronized; adding proper syncs fixed this. I've updated the PR in a couple of ways:
We should compare the streams vs. tasks approaches on the CUDA side; the multiple-streams version with … One way we can try getting concurrent execution is by reducing the number of blocks in our kernel launches, but we might only want to (somehow) do that in … So, all in all, this isn't immediately offering a performance improvement on GPUs yet. The CPU multithreaded implementation might be an improvement, but it hasn't been tested or benchmarked yet. Also, there could be interplay with the threads used in the broadcast kernels (which, perhaps, we could/should just remove).
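For reference, here is a minimal self-contained sketch of the two strategies mentioned above (streams vs. tasks) with CUDA.jl; the toy kernels and the launch configuration are illustrative, not the actual ClimaTimeSteppers kernels:

using CUDA

# Toy "tendency" kernels standing in for the real ones.
function kernel_a!(du, u)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    i <= length(du) && (du[i] = 2 * u[i])
    return nothing
end
function kernel_b!(du, u)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    i <= length(du) && (du[i] = u[i] * u[i])
    return nothing
end

u, du1, du2 = CUDA.rand(2^20), CUDA.zeros(2^20), CUDA.zeros(2^20)
nblocks = cld(length(u), 256)
CUDA.synchronize()  # make sure setup on the default stream is done before using other streams

# (1) Multiple streams: launch each kernel on its own stream, then synchronize.
s1, s2 = CUDA.CuStream(), CUDA.CuStream()
@cuda threads=256 blocks=nblocks stream=s1 kernel_a!(du1, u)
@cuda threads=256 blocks=nblocks stream=s2 kernel_b!(du2, u)
CUDA.synchronize()

# (2) Julia tasks: each task uses its own task-local stream in CUDA.jl.
t1 = Threads.@spawn begin
    @cuda threads=256 blocks=nblocks kernel_a!(du1, u)
    CUDA.synchronize()
end
t2 = Threads.@spawn begin
    @cuda threads=256 blocks=nblocks kernel_b!(du2, u)
    CUDA.synchronize()
end
wait(t1); wait(t2)

Whether the two launches actually overlap depends on occupancy, which is exactly where reducing the number of blocks per launch comes in.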
Force-pushed from f16e85d to f4d5494.
Force-pushed from f4d5494 to 8df83cc.
Further update: we will not add async support with … I've updated the PR to reflect this.
Force-pushed from 8df83cc to b29df31.
The PR title was changed from "Add support for async T_exp! and T_lim!" to "Add support for async f! and j!".
event = CUDA.CuEvent(CUDA.EVENT_DISABLE_TIMING)
CUDA.record(event, CUDA.stream()) # record event on main stream

stream1 = CUDA.CuStream() # make a stream
It could be that CUDA doesn't like making these streams at every RHS evaluation. Maybe we can try caching this in the timestepper?
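One possible shape for that caching (purely illustrative; the struct and field names are assumptions, not ClimaTimeSteppers types): allocate the stream and event once and stash them in the cache that the RHS already receives:

using CUDA

struct AsyncGPUCache
    stream1::CUDA.CuStream
    event::CUDA.CuEvent
end
AsyncGPUCache() = AsyncGPUCache(CUDA.CuStream(), CUDA.CuEvent(CUDA.EVENT_DISABLE_TIMING))

cache = AsyncGPUCache()  # constructed once, e.g. alongside the timestepper cache

function example_rhs!(du, u, cache::AsyncGPUCache, t)
    CUDA.record(cache.event, CUDA.stream())  # record on the main stream, as in the diff above
    # ... launch work on cache.stream1, ordered against cache.event as needed ...
    return nothing
end

That way no CuStream or CuEvent is constructed inside the hot RHS path.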
This could be really good for strong scaling.
@tapios suggested adding this if it's not too difficult, and here it is:
This PR adds support for asynchronous f! and j! on both CPU and GPU. We can try it out by passing the comms context, but we don't need to use it if we don't want to.
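As a CPU-side illustration of the idea (a toy sketch, not this PR's implementation; tendency_a!, tendency_b!, and the parameter fields in p are stand-ins, assuming two independent tendency contributions):

# Toy tendencies standing in for the real ones.
tendency_a!(dY, Y, p, t) = (dY .= 2 .* Y; nothing)
tendency_b!(dY, Y, p, t) = (dY .= .-Y; nothing)

# f! evaluates the two contributions on separate threads, waits, then combines.
function f!(dY, Y, p, t)
    job_a = Threads.@spawn tendency_a!(p.dY_a, Y, p, t)
    job_b = Threads.@spawn tendency_b!(p.dY_b, Y, p, t)
    wait(job_a); wait(job_b)
    dY .= p.dY_a .+ p.dY_b
    return nothing
end

Y = rand(100)
p = (; dY_a = similar(Y), dY_b = similar(Y))
f!(similar(Y), Y, p, 0.0)

On GPU the same structure applies with each task using its own stream (as sketched earlier), and passing the comms context is what would let the timestepper decide whether to take this path.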