How can I guarantee GPU use? #19453
-
I have a test program written in C# that runs inference on an image multiple times using a YOLOv8n network and ONNXRuntime. It is a crude benchmarking program used to ballpark inference times on CPU vs. GPU. The anomaly I keep running into and cannot diagnose is that the GPU benchmark is usually considerably faster than the CPU benchmark (about 20 ms per frame vs. the CPU's ~60+ ms per frame on my laptop), but sometimes the two show almost identical inference times. Some characteristics of the anomaly:
These two points together suggest to me that sometimes ONNXRuntime chooses to run inference on the CPU even though I have told it to run on the GPU, where "told it to" means that I used the GPU execution provider when creating the session (a sketch of a typical setup follows the file excerpts below).

My questions:
I think profiling may help, but I don't know how to read its output. I'm afraid I am not able to share all of the code; I will share my Program.cs file and include snippets of Inference.cs and Utils.cs (the sketch after the excerpts also shows how profiling can be enabled). Program.cs:
Some pieces of Inference.cs:
Some of Utils.cs:
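For reference, below is a minimal sketch of how a session is commonly pointed at the CUDA execution provider, with profiling switched on, via the ONNX Runtime C# API. This is illustrative only, not the code from this post; it assumes the Microsoft.ML.OnnxRuntime.Gpu package is installed and uses "model.onnx" as a placeholder model path.

```csharp
using System;
using Microsoft.ML.OnnxRuntime;

class GpuBenchmarkSketch
{
    static void Main()
    {
        // Request the CUDA execution provider explicitly (GPU device 0).
        // Note: even with the CUDA provider registered, individual nodes may
        // still be assigned to the CPU provider during graph partitioning.
        using var options = SessionOptions.MakeSessionOptionWithCudaProvider(deviceId: 0);

        // Record a profiling trace so per-node timings can be inspected later.
        options.EnableProfiling = true;

        // "model.onnx" is a placeholder path, not the model from this post.
        using var session = new InferenceSession("model.onnx", options);

        // ... run the warm-up and benchmark loop here ...

        // EndProfiling() stops profiling and returns the path of the JSON trace.
        string profilePath = session.EndProfiling();
        Console.WriteLine($"Profile written to: {profilePath}");
    }
}
```

The profiling trace is a JSON file in Chrome trace format, so it can be opened in chrome://tracing or Perfetto; each recorded node event should note which execution provider ran it, which is one way to spot nodes falling back to the CPU.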
-
Turn on verbose mode and look for "Node placements" in the logs. It'll tell you which nodes were placed on the GPU and which were placed on the CPU. This is not something that changes each time you create a session; it's deterministic.
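In the C# API, turning on verbose logging looks roughly like the sketch below (again an illustrative example, assuming the Microsoft.ML.OnnxRuntime.Gpu package and a placeholder model path); the placement summary is emitted while the session is constructed.

```csharp
using Microsoft.ML.OnnxRuntime;

// Request the CUDA execution provider as before (device 0).
using var options = SessionOptions.MakeSessionOptionWithCudaProvider(deviceId: 0);

// Raise the session log level so graph-partitioning details are emitted.
options.LogSeverityLevel = OrtLoggingLevel.ORT_LOGGING_LEVEL_VERBOSE;

// "model.onnx" is a placeholder path. While this constructor runs, the verbose
// log should include the node-placement summary mentioned above, listing which
// nodes went to CUDAExecutionProvider and which stayed on CPUExecutionProvider.
using var session = new InferenceSession("model.onnx", options);
```

The log output goes to the console by default, so the placement lines can be grepped from the program's output.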