Fixes cudaErrorInvalidValue when running on nvbench-created cuda stream #113
09cb757
to
eac79ef
Compare
nvbench/cuda_stream.cuh
@@ -42,10 +45,18 @@ struct cuda_stream
 * Constructs a cuda_stream that owns a new stream, created with
 * `cudaStreamCreate`.
 */
cuda_stream()
    : m_stream{[]() {
cuda_stream(std::optional<nvbench::device_info> device)
Docs should be updated to explain the semantics of the new `device` parameter.
Thanks. Updated docs. Could you please check if it's understandable?
This LGTM, thanks for catching it! Some of the tests don't build after the changes. If you have docker set up, you can run `ci/local/build.bash` from the nvbench root to build and test.
Once the tests are passing, this is good to go.
Thanks for reviewing the PR.
@elstehle I'm still seeing a test regression when running
Thanks! Sorry, I had missed that regression, as it only occurs on systems with three or fewer devices. Issue with the test in
When the states are created, we create the stream for each state on that state's given device. If a given device doesn't exist, we run into a cuda error. For comparison, if we currently ran a benchmark with invalid device ids, the runner would fail with the same error.
I resolved this regression by adjusting the test in
if (!m_state.get_cuda_stream().has_value())
{
  m_state.set_cuda_stream(nvbench::cuda_stream{m_state.get_device()});
}
return m_state.get_cuda_stream().value();
It feels weird to have the initialization of the `optional` external to `state`.
How about putting this logic inside `state::get_cuda_stream` instead, and not exposing the `optional` externally?
> How about putting this logic inside `state::get_cuda_stream` instead and don't expose the `optional` externally.
@allisonvacanti and I discussed that option too, but agreed to prefer explicitly setting the stream over implicitly initializing it as a byproduct when it doesn't exist yet. Considering the user interfacing with the API, I feel that, for multi-GPU systems, it's safer to make it explicit when resources are created and which device they are associated with, especially since the current device may influence which device a resource is associated with.
That said, I'm fine with whichever way we decide makes more sense. @allisonvacanti what do you think?
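
To make the two options in this thread concrete, here is a minimal sketch of both shapes. All names (`stream`, `explicit_state`, `lazy_state`, `has_stream`) are hypothetical stand-ins and not the actual nvbench types; a device is modeled as a plain `int`.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for nvbench::cuda_stream: only remembers its device.
struct stream
{
  int device;
};

// Option A (explicit): the optional is exposed and the caller decides when
// the stream is created and for which device.
class explicit_state
{
public:
  explicit explicit_state(int device)
      : m_device{device}
  {}

  int get_device() const { return m_device; }
  const std::optional<stream> &get_cuda_stream() const { return m_stream; }
  void set_cuda_stream(stream s) { m_stream = s; }

private:
  int m_device;
  std::optional<stream> m_stream;
};

// Option B (lazy): the getter creates the stream on first use, so the
// optional never leaves the class.
class lazy_state
{
public:
  explicit lazy_state(int device)
      : m_device{device}
  {}

  stream &get_cuda_stream()
  {
    if (!m_stream.has_value())
    {
      m_stream = stream{m_device}; // created as a side effect of the getter
    }
    return *m_stream;
  }

  bool has_stream() const { return m_stream.has_value(); }

private:
  int m_device;
  std::optional<stream> m_stream;
};
```

With option A the call site reads like the snippet under review (check, set, then read). With option B the call site shrinks to a single `get_cuda_stream()` call, at the cost of hiding the point at which the resource is created.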
This PR fixes a minor issue that may occur when `nvbench` is run on multiple GPUs without a user-provided cuda stream.
The issue
The error that I observed in this case looked like:
When run with `memcheck` I would see:
The Problem
It seems that nvbench is creating all the nvbench-owned streams on `device 0`.
Suggested Fix
This fix makes sure that the streams are created on the device on which they are later used.
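
As an illustration of the idea (not the code in this PR), the pattern can be sketched as a small RAII guard that switches to the target device, creates the stream, and restores the previous device. The `mock_rt` namespace below is a stand-in so the sketch runs without a GPU; in real code its three functions would correspond to `cudaGetDevice`, `cudaSetDevice`, and `cudaStreamCreate`. The names `device_scope` and `make_stream` are hypothetical, and falling back to the current device when no device is given is an assumption about the intended semantics.

```cpp
#include <cassert>
#include <optional>

// Mock runtime (assumption): tracks a "current device" like the CUDA runtime.
namespace mock_rt
{
inline int current_device = 0;

struct stream
{
  int device; // device that was current when the stream was created
};

inline int get_device() { return current_device; }
inline void set_device(int id) { current_device = id; }
inline stream create_stream() { return stream{current_device}; }
} // namespace mock_rt

// RAII guard: make `target` current, restore the previous device on scope exit.
class device_scope
{
public:
  explicit device_scope(int target)
      : m_previous{mock_rt::get_device()}
  {
    mock_rt::set_device(target);
  }
  ~device_scope() { mock_rt::set_device(m_previous); }

  device_scope(const device_scope &)            = delete;
  device_scope &operator=(const device_scope &) = delete;

private:
  int m_previous;
};

// Create a stream on `device` if given, otherwise on the current device.
inline mock_rt::stream make_stream(std::optional<int> device)
{
  if (device.has_value())
  {
    device_scope guard{*device};     // switch only for the creation
    return mock_rt::create_stream(); // guard restores the device afterwards
  }
  return mock_rt::create_stream();
}
```

Calling `make_stream(2)` while device 0 is current yields a stream associated with device 2 and leaves device 0 current afterwards, which is the property the bug report was missing.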