CUDA synchronize alternative for profiling #304
Comments
Hi @aimilefth, you are correct on all counts. This is critical for measuring performance in TensorFlow; however, the APIs do not currently exist in TF (not just TF-TRT). We are in the process of adding such APIs; @DEKHTIARJonathan can add more. You can also check out the benchmarking scripts for how TF-TRT overcomes this currently.
@ncomly-nvidia I was looking at the TensorRT ResNet50 benchmarking example here. The throughput seems exceptionally high, almost 250,000 IPS on the T4, whereas MLPerf reports 39,000 IPS for the A100, which is a better GPU. Is the use of …
@slai-natanijel what is the input size for MLPerf? TF-TRT uses MNIST (a very small input size) for demo purposes.
@DEKHTIARJonathan Ah yes, you are right - MLPerf uses 224x224x3 images. However, when I tested on an A100 with this image size, I got around 700,000 IPS (expected 30,000 IPS) when I wrapped the inference call with `perf_counter()` (roughly as in the sketch below). So how do your benchmarking scripts overcome the synchronisation issue currently?
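For concreteness, a naive timing loop of the kind described above might look like this (the plain Keras ResNet50, batch size, and iteration count are stand-in assumptions, not the actual benchmark code):

```python
import time

import tensorflow as tf

# Stand-ins for the real benchmark: a plain Keras ResNet50 and random data
# (the actual script would use the TF-TRT-converted model).
model = tf.keras.applications.ResNet50(weights=None)
data = tf.random.uniform((32, 224, 224, 3))

start = time.perf_counter()
for _ in range(100):
    result = model(data)  # eager call returns once the work is enqueued on the GPU
elapsed = time.perf_counter() - start

# No synchronization before the clock stops, so throughput is overstated.
print(f"images/sec (misleading): {100 * 32 / elapsed:.0f}")
```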
@slai-natanijel let me guess... did you call `.numpy()` or otherwise resynchronize the GPU after the computation, before the final `perf_counter()` call? Don't forget that TF is eagerly executed, which means there is no guarantee the computation is actually over when you return from `result = model(data)`.
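As a concrete illustration of that point, a minimal sketch of the fix (model choice, batch size, and iteration count are assumptions for the example):

```python
import time

import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None)  # illustrative stand-in
data = tf.random.uniform((32, 224, 224, 3))

start = time.perf_counter()
for _ in range(100):
    result = model(data)
# Materializing the output on the host blocks until the GPU work that
# produced it has finished, so the timer now covers the real compute.
_ = result.numpy()
elapsed = time.perf_counter() - start

print(f"images/sec: {100 * 32 / elapsed:.0f}")
```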
@DEKHTIARJonathan … where …
@slai-natanijel actually it's a very good point ;) And it's a lot... And even worse, the gap is highly variable due to the nature of …

But you're in luck my friend :) We are actually adding a feature in TensorFlow right now to address this issue: tensorflow/community#434

In the meantime, you can use a little bit of TensorFlow dark magic to minimize that overhead:
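A sketch of that trick (the helper name `force_gpu_resync` and the exact structure are illustrative; the idea is that a cheap op plus a `.numpy()` host copy cannot complete until all previously enqueued GPU work has drained):

```python
import tensorflow as tf

def force_gpu_resync(func):
    # A tiny tensor kept alive for the lifetime of the wrapper; touching it
    # with a cheap op and pulling the result back to the host blocks until
    # the GPU has finished everything queued before it.
    p = tf.constant(0.0)

    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        (p + 1.0).numpy()  # cheap op + device-to-host copy == implicit sync
        return result

    return wrapper
```

Usage would be something like `timed_step = force_gpu_resync(lambda: model(data))`, then timing calls to `timed_step()`.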
It adds only very minor overhead; until the RFC above is merged, it's the best you can do. @slai-natanijel may I ask which company you work for? That way we can follow up with you.
Great - I'll be watching the sync API! |
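A note for later readers: that RFC eventually shipped as `tf.test.experimental.sync_devices` (in TensorFlow 2.12, as far as I know), which makes the pattern explicit:

```python
import time

import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None)  # illustrative
data = tf.random.uniform((32, 224, 224, 3))

start = time.perf_counter()
result = model(data)
tf.test.experimental.sync_devices()  # block until all pending device work is done
elapsed = time.perf_counter() - start
print(f"latency: {elapsed * 1000:.1f} ms")
```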
Greetings,
I am currently using TF-TRT and I want to measure the performance of my models (latency, throughput).
The TensorRT C++ API provides CUDA synchronization via the CUDA events API: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#cuda-events
On top of that, PyTorch offers torch.cuda.synchronize():
https://pytorch.org/docs/stable/generated/torch.cuda.synchronize.html
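For comparison, the usual PyTorch timing pattern with torch.cuda.synchronize() looks roughly like this (the torchvision model and input sizes are placeholders):

```python
import time

import torch
import torchvision

# Placeholders: an untrained torchvision ResNet50 and random input.
model = torchvision.models.resnet50().cuda().eval()
data = torch.randn(32, 3, 224, 224, device="cuda")

with torch.no_grad():
    start = time.perf_counter()
    for _ in range(100):
        _ = model(data)
    torch.cuda.synchronize()  # wait for every queued kernel before stopping the clock
    elapsed = time.perf_counter() - start

print(f"images/sec: {100 * 32 / elapsed:.0f}")
```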
However, in the TF-TRT docs I can't find anything similar, which in my opinion is essential in order to correctly measure performance metrics.
Have I missed anything, or are there plans to integrate such functionality?
Thank you