Remove CUDART hijack #1730

Open
3 tasks done
muraj opened this issue Jul 26, 2024 · 20 comments

@muraj commented Jul 26, 2024

This issue is to track progress on removing the need for the CUDART hijack and its eventual removal from the Realm codebase. Currently known use cases are (this will be updated as new use cases come up):

@muraj added the Realm (Issues pertaining to Realm) and best effort (indicates the milestone tag for an issue is a goal rather than a commitment) labels Jul 26, 2024
@muraj self-assigned this Jul 26, 2024
@elliottslaughter (Contributor)

#1059 is a blocker on the Regent side.

@lightsighter (Contributor)

Noting here that the only way Legion Prof accurately represents the time that CUDA kernels for a GPU task spend on the device today is by relying on Realm's hijack. If the hijack is disabled, then Realm currently over-approximates the lifetime of the GPU kernels by assuming that they are launched and start running on the GPU as soon as the GPU task starts, which is not always the case. Detecting when the first kernel is launched and enqueuing a CUDA event right before it is probably going to be challenging without the hijack.

@muraj (Author) commented Jul 28, 2024

@lightsighter If I were to replace the current timestamp reporting for kernel launches with a CUPTI activity timestamp difference, would that cover the Legion profiler use case? I am thinking I can enable the following and retrieve the completion field to get the information requested:
https://docs.nvidia.com/cupti/api/structCUpti__ActivityKernel9.html
I can then use the correlationId to correlate it to the currently running task using a CUPTI callback:
https://docs.nvidia.com/cupti/api/structCUpti__CallbackData.html
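
Roughly, the flow would look like the following. This is a minimal, untested sketch assuming CUPTI headers are available at build time; the buffer size and what we'd actually do with each record (here just a printf) are placeholders, not what Realm would ship.

```cpp
// Sketch: enable CUPTI concurrent-kernel activity records and read back the
// GPU-side start/end timestamps plus the correlationId for each kernel.
#include <cupti.h>
#include <cstdio>
#include <cstdlib>

static void CUPTIAPI buffer_requested(uint8_t **buffer, size_t *size,
                                      size_t *max_num_records) {
  *size = 64 * 1024;                                   // placeholder buffer size
  *buffer = static_cast<uint8_t *>(malloc(*size));
  *max_num_records = 0;                                // let CUPTI pack as many records as fit
}

static void CUPTIAPI buffer_completed(CUcontext ctx, uint32_t stream_id,
                                      uint8_t *buffer, size_t size, size_t valid_size) {
  CUpti_Activity *record = nullptr;
  while (cuptiActivityGetNextRecord(buffer, valid_size, &record) == CUPTI_SUCCESS) {
    if (record->kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) {
      // CUpti_ActivityKernel9 is the record type linked above.
      const CUpti_ActivityKernel9 *k =
          reinterpret_cast<const CUpti_ActivityKernel9 *>(record);
      // start/end are device-side timestamps (ns); correlationId ties this record
      // back to the launch site observed via the callback API.
      printf("kernel %s: correlation=%u start=%llu end=%llu\n",
             k->name, k->correlationId,
             (unsigned long long)k->start, (unsigned long long)k->end);
    }
  }
  free(buffer);
}

void enable_kernel_timing() {
  cuptiActivityRegisterCallbacks(buffer_requested, buffer_completed);
  cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
}
```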

These should not add any more overhead than what is already in place with the Realm CUDART hijack, but it does require CUPTI to be installed, which requires a CUDA Toolkit to be installed locally on the system somewhere. CUPTI is ABI stable IIRC, so a dynamic loader can be built and we can dynamically detect its presence on the system (and can toggle it via a cmdline arg or whatever if you'd like). Is that acceptable?

@elliottslaughter I'm not sure I understand what the actual issue is here; would it be possible for you to summarize it in the issue? I've added myself to the issue and I can talk to @magnatelee about it next week for more clarity.

@lightsighter (Contributor)

> These should not add any more overhead than what is already in place with the Realm CUDART hijack, but it does require CUPTI to be installed, which requires a CUDA Toolkit to be installed locally on the system somewhere. CUPTI is ABI stable IIRC, so a dynamic loader can be built and we can dynamically detect its presence on the system (and can toggle it via a cmdline arg or whatever if you'd like). Is that acceptable?

That seems reasonable to me, but I'll let other folks comment as well, since I'm not the only one who expects that to work. Note that we don't need to profile every kernel. We mainly want this profiling response from Realm to provide tight bounds on when a GPU task's kernels are actually running on the device:
https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/profiling.h?ref_type=heads#L127-147

@muraj (Author) commented Jul 28, 2024

@lightsighter Unless I'm mistaken, the hijack doesn't seem to come into play here. When we add an OperationTimelineGPU to the operation, it enables this path:
https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/cuda/cuda_module.cc#L1597

This just either puts a stream callback on the task stream, or records an event and schedules a bgworker to poll that event and record the CPU time.
https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/cuda/cuda_module.cc#L1214

No tight bounding is done as far as I can tell.

@eddy16112 (Contributor)

If I remember correctly, we do not record a tight bound for each kernel; we only record the start of the first kernel and the end of the last kernel.

@muraj (Author) commented Jul 28, 2024

@eddy16112 I understand that, but where is this done? If it's done with GPUWorkStart, then it's not a tight bound at all.

@eddy16112 (Contributor)

We record one event before the task body and one after it, and use the first event as the start and the second one as the end.

@muraj (Author) commented Jul 28, 2024

Yes, but those events are queued before the task is run. If the stream is idle at the start of the task body (which it must be, IIUC), then the start event completes immediately, not when the first kernel starts. This logic also has nothing to do with the hijack.
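
To spell out why that isn't a tight bound, here is a hypothetical sketch of the pattern (not Realm's actual code; run_task and my_kernel are made-up names):

```cpp
// The "start" event is enqueued on the task stream before the task body runs;
// because the stream is idle at that point, the event completes (and gets its
// timestamp) right away, long before the first kernel is launched.
#include <cuda_runtime.h>

void run_task(cudaStream_t task_stream) {
  cudaEvent_t start_ev, end_ev;
  cudaEventCreate(&start_ev);
  cudaEventCreate(&end_ev);

  cudaEventRecord(start_ev, task_stream);  // stream is idle -> timestamp ~= "now"

  // ... task body: possibly a long stretch of CPU work before the first launch ...
  // my_kernel<<<grid, block, 0, task_stream>>>(...);

  cudaEventRecord(end_ev, task_stream);    // completes after the last kernel finishes

  // The [start_ev, end_ev] interval therefore includes CPU-side time spent before
  // the first launch, so it over-approximates the GPU-busy window.
  // (In real code the events would be polled and destroyed by a background worker.)
}
```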

@lightsighter (Contributor)

Right, I think the CUDA runtime hijack used to do that by only enqueuing the event upon the first kernel launch rather than at the start of the task, but that code seems to have been lost. It could have happened anytime in the last 9 years since I wrote the first version of the hijack and then stopped working on it myself.

I'll note that we should probably fix this regardless of whether it is for getting rid of the hijack or not. I know for certain that @jjwilke could use it right now for running code without the hijack.

@muraj (Author) commented Jul 29, 2024

@lightsighter Can you file a separate issue for this with all the requirements and the steps to test whether it is set up properly? I haven't gotten familiar enough with Legion Prof to be sure I can see the issue and verify it's fixed.

@lightsighter (Contributor)

I've started an issue on improving the precision of the GPU profiling here: #1732

@elliottslaughter (Contributor)

Is there any plan to automatically capture completion of events on the default stream associated with a GPU?

One thing I noticed while porting the application in #1682 is that if you do not have the hijack, and use set_task_ctxsync_required(false) (to avoid overheads), but then miss passing the task stream to some kernels in your application, you can easily get data corruption due to the missed synchronization. And this is an easy thing to miss because it requires scanning every single CUDA kernel call in your entire application code.

@eddy16112 (Contributor)

> Is there any plan to automatically capture completion of events on the default stream associated with a GPU?

I am not sure if it is doable. @muraj ?

I think it is an application bug if they forget to use the task stream. Actually, we have a Realm CUDA hook tool which can detect such leaks.

@muraj (Author) commented Oct 23, 2024

Correct, this would be an application bug. It is part of the contract of setting ctx_sync_required(false).

That said, with ctx_sync enabled, on recent drivers (12.0+), we record an event at the end of the task, which should cut down the over-synchronization quite a bit, so if the app can't meet the contract of ctx_sync_required(false), it hopefully shouldn't be too expensive.

> Is there any plan to automatically capture completion of events on the default stream associated with a GPU?

That is what ctx_sync_required(true) with recent drivers does already.

@elliottslaughter (Contributor)

Ok, let me see if I understand the options here. After the hijack is removed, applications can either:

  1. Do nothing. This is fine as long as they do not also call set_task_ctxsync_required(false). There might be a minor performance impact due to over-synchronization, but there should be no correctness issues.
  2. Convert all kernels to use the task stream and set_task_ctxsync_required(false). This approach preserves the current behavior and performance, with some effort.

The issue, as far as Regent is concerned, is that it's (or can be) a mixed application. Regent itself always uses the task stream. So in a pure Regent application it would always be safe to call set_task_ctxsync_required(false). But Regent applications can call arbitrary C++ code, and that code may or may not use the task stream. If an application contains a mix of approaches (1) and (2), then this leads to issues.

Has anyone gone to measure how much of a performance impact the event record actually has in approach (1)? That's the conservative approach for mixed codes; it just feels a bit unfortunate that the vast majority of Regent users would be getting a needless slowdown.

Alternatively if anyone else can suggest another way for Regent to figure out if it's calling a blob of code that does not properly use the task stream, I'm all ears.

@muraj (Author) commented Oct 24, 2024

> Convert all kernels to use the task stream

That is not necessary. An application can use whatever stream it wishes, but the contract of set_task_ctxsync_required(false) is that all work must synchronize with the task stream before the task completes. For example, a task can use some other stream, record a CUDA event in that stream, then have the task stream wait on this event before completing.
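
Something like the following, as a minimal sketch (task_stream and other_stream are just placeholder handles; the commented-out launch stands in for the task's real work):

```cpp
// Sketch of satisfying the set_task_ctxsync_required(false) contract while still
// using a private stream: make the task stream wait on the other stream's work.
#include <cuda_runtime.h>

void task_body(cudaStream_t task_stream, cudaStream_t other_stream) {
  // Launch work on a stream the task owns.
  // my_kernel<<<grid, block, 0, other_stream>>>(...);

  cudaEvent_t done;
  cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
  cudaEventRecord(done, other_stream);        // marks the end of the side-stream work
  cudaStreamWaitEvent(task_stream, done, 0);  // task stream now depends on that work
  cudaEventDestroy(done);                     // safe: destruction is deferred until the event completes

  // When the task returns, the runtime only has to track the task stream.
}
```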

> If an application contains a mix of approaches (1) and (2), then this leads to issues.

That is correct, which is why set_task_ctxsync_required(false) is not the default and needs to be managed on a case-by-case basis. Typical libraries like cuBLAS and others usually provide APIs that take a stream or hand back a CUDA event in order to synchronize with the effects of the API call. Most libraries do not rely on context or device synchronization to achieve this, as that would ruin concurrent performance.

> Has anyone gone to measure how much of a performance impact the event record actually has in approach (1)?

Not on a Legion or higher-level application, no, but on a Realm application, yes. I ran local performance tests with simultaneous DMA work running in the background and got negligible-to-slightly-better performance compared to just calling cuCtxSynchronize in a background thread. So in the worst case, we do no worse than what was there without the hijack. In the best case, with no contention, we achieve about the same performance as with the hijack. Again, this assumes a driver that supports CUDA 12.5+ (I need the cuCtxRecordEvent API from the driver to achieve this).
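
For reference, the shape of that approach is roughly the following. This is my own sketch of the idea, assuming a CUDA 12.5+ driver that exposes cuCtxRecordEvent; it is not the actual Realm implementation.

```cpp
// Sketch: record a context-wide event when a task finishes and let a background
// worker poll it, instead of blocking the task thread in cuCtxSynchronize().
#include <cuda.h>

// Capture all work currently queued on the context at task end.
// (cuCtxRecordEvent requires a CUDA 12.5+ driver.)
CUevent mark_task_end(CUcontext ctx) {
  CUevent ev;
  cuEventCreate(&ev, CU_EVENT_DISABLE_TIMING);
  cuCtxRecordEvent(ctx, ev);  // completes once all prior work in the context is done
  return ev;
}

// Polled from a background worker; true once the task's GPU effects are visible.
bool task_effects_complete(CUevent ev) {
  return cuEventQuery(ev) == CUDA_SUCCESS;  // CUDA_ERROR_NOT_READY otherwise
}
```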

In addition, I know Legate folks like @manopapad have done their own analyses and have begun converting their codes over to using set_task_ctxsync_required(false) and properly syncing with the task stream.

@lightsighter (Contributor)

> The issue, as far as Regent is concerned, is that it's (or can be) a mixed application

> That is correct, which is why set_task_ctxsync_required(false) is not the default and needs to be managed on a case-by-case basis.

Just to be very clear for @elliottslaughter: this is a per-dynamic-task API call that you can use to disable context synchronization after that task (and only that task) is done running. It means you can selectively opt in to disabling context synchronization for individual tasks, so I don't think you have to worry about it being a global setting that will not compose well.

@elliottslaughter (Contributor)

Ok, I guess I'd forgotten that, since this is the second time I've needed to be reminded: #1557 (comment)

There are still issues with tasks of the form:

__demand(__cuda)
task asdf(...)
  for ... do ... end -- Regent turns this into a kernel launch
  call_native_c_code() -- Internally launches more kernels
end

If the C code is used inside the same task and does not properly use the task stream, then we're still in the same situation. This limits the scope of the problem, but the issue still fundamentally occurs.

(And empirically based on available evidence, appears to be happening in S3D.)

@muraj (Author) commented Oct 24, 2024

> This limits the scope of the problem, but the issue still fundamentally occurs.

Correct. Again, the application needs to know whether that's an assumption it can make in the task, and if it cannot, then it must not use set_task_ctxsync_required(false).
