CUDA synchronize alternative for profiling #304
Comments
Hi @aimilefth, you are correct on all counts. This is critical for measuring performance in TensorFlow; however, the APIs do not currently exist in TF (not just TF-TRT). We are in the process of adding such APIs; @DEKHTIARJonathan can add more. You can also check out the benchmarking scripts for how TF-TRT overcomes this currently.
@ncomly-nvidia I was looking at the TensorRT ResNet50 benchmarking example here. The throughput seems exceptionally high, almost 250,000 IPS on the T4, whereas MLPerf reports 39,000 IPS for the A100, which is a better GPU. Is the use of …
@slai-natanijel what is the input size for MLPerf? TF-TRT uses MNIST (a very small input size) for demo purposes.
@DEKHTIARJonathan Ah yes, you are right - MLPerf uses 224x224x3 images. However, when I tested on an A100 with this image size, I got around 700,000 IPS (expected 30,000 IPS) when I wrapped the inference call with `perf_counter()` (roughly as in the sketch below). So how do your benchmarking scripts overcome the synchronisation issue currently?
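For concreteness, a naive timing loop of the kind described above might look like this (the plain Keras ResNet50, batch size, and iteration count are stand-in assumptions, not the actual benchmark code):

```python
import time

import tensorflow as tf

# Stand-ins for the real benchmark: a plain Keras ResNet50 and random data
# (the actual script would use the TF-TRT-converted model).
model = tf.keras.applications.ResNet50(weights=None)
data = tf.random.uniform((32, 224, 224, 3))

start = time.perf_counter()
for _ in range(100):
    result = model(data)  # eager call returns once the work is enqueued on the GPU
elapsed = time.perf_counter() - start

# No synchronization before the clock stops, so throughput is overstated.
print(f"images/sec (misleading): {100 * 32 / elapsed:.0f}")
```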
@slai-natanijel let me guess... did you call `.numpy()` or otherwise resynchronize the GPU after the computation, before the final `perf_counter()` call? Don't forget that TF is eagerly executed, which means there is no guarantee the computation is actually over when you return from `result = model(data)`.
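As a concrete illustration of that point, a minimal sketch of the fix (model choice, batch size, and iteration count are assumptions for the example):

```python
import time

import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None)  # illustrative stand-in
data = tf.random.uniform((32, 224, 224, 3))

start = time.perf_counter()
for _ in range(100):
    result = model(data)
# Materializing the output on the host blocks until the GPU work that
# produced it has finished, so the timer now covers the real compute.
_ = result.numpy()
elapsed = time.perf_counter() - start

print(f"images/sec: {100 * 32 / elapsed:.0f}")
```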
@DEKHTIARJonathan … where …
@slai-natanijel actually it's a very good point ;) And it's a lot... And even worse, the gap is highly variable due to the nature of …

But you're in luck my friend :) We are actually adding a feature in TensorFlow right now to address this issue: tensorflow/community#434

In the meantime, you can use a little bit of TensorFlow dark magic to minimize that overhead:
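A sketch of that trick (the helper name `force_gpu_resync` and the exact structure are illustrative; the idea is that a cheap op plus a `.numpy()` host copy cannot complete until all previously enqueued GPU work has drained):

```python
import tensorflow as tf

def force_gpu_resync(func):
    # A tiny tensor kept alive for the lifetime of the wrapper; touching it
    # with a cheap op and pulling the result back to the host blocks until
    # the GPU has finished everything queued before it.
    p = tf.constant(0.0)

    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        (p + 1.0).numpy()  # cheap op + device-to-host copy == implicit sync
        return result

    return wrapper
```

Usage would be something like `timed_step = force_gpu_resync(lambda: model(data))`, then timing calls to `timed_step()`.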
It adds only very minor overhead; until the RFC above is merged, it's the best you can do. @slai-natanijel may I ask which company you work for? That way we can follow up with you.
Great - I'll be watching the sync API! |
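A note for later readers: that RFC eventually shipped as `tf.test.experimental.sync_devices` (in TensorFlow 2.12, as far as I know), which makes the pattern explicit:

```python
import time

import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None)  # illustrative
data = tf.random.uniform((32, 224, 224, 3))

start = time.perf_counter()
result = model(data)
tf.test.experimental.sync_devices()  # block until all pending device work is done
elapsed = time.perf_counter() - start
print(f"latency: {elapsed * 1000:.1f} ms")
```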
Greetings,
I am currently using TF-TRT and I want to measure the performance of my models (latency, throughput).
The TensorRT C++ API provides CUDA synchronization via the CUDA events API: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#cuda-events
On top of that, PyTorch offers torch.cuda.synchronize():
https://pytorch.org/docs/stable/generated/torch.cuda.synchronize.html
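For comparison, the usual PyTorch timing pattern with torch.cuda.synchronize() looks roughly like this (the torchvision model and input sizes are placeholders):

```python
import time

import torch
import torchvision

# Placeholders: an untrained torchvision ResNet50 and random input.
model = torchvision.models.resnet50().cuda().eval()
data = torch.randn(32, 3, 224, 224, device="cuda")

with torch.no_grad():
    start = time.perf_counter()
    for _ in range(100):
        _ = model(data)
    torch.cuda.synchronize()  # wait for every queued kernel before stopping the clock
    elapsed = time.perf_counter() - start

print(f"images/sec: {100 * 32 / elapsed:.0f}")
```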
However, in the TF-TRT docs I can't find anything similar, which in my opinion is essential in order to correctly measure performance metrics.
Have I missed anything, or are there plans to integrate such functionality?
Thank you