Keep NMS index gathering on cuda device #8766
base: main
Conversation
@NicolasHug for review
I would like to share an important finding regarding the root cause of the slow NMS performance when dealing with a large number of boxes (#8713). The issue is primarily due to minor page faults, as detailed in [1]. When data is transferred from the GPU to the host, the operating system must locate appropriate memory pages to store these temporary variables, and this operation can be quite costly in terms of execution time. As shown in Fig. 2 of [1], the execution-time curves diverge into two distinct groups as the number of objects increases, with varying results across different hardware configurations. Further analysis is provided in Table 1 and the Performance Analysis section of [1]. To summarize, the key takeaway is that the runtime is dominated by the host-side page faults triggered by the device-to-host copy, not by the NMS computation itself.
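For context, the pattern that triggers this copy is roughly the following (simplified; the variable names follow the existing CUDA NMS kernel, but the surrounding code is abbreviated):

```cpp
// Simplified illustration of the device -> host copy discussed above.
// `mask` is the IoU bitmask computed on the GPU; copying it into a fresh,
// pageable host tensor forces the OS to map new pages on every call,
// which is where the minor page faults come from.
at::Tensor mask_cpu = mask.to(at::kCPU);  // blocking D2H copy into pageable memory
unsigned long long* mask_host =
    (unsigned long long*)mask_cpu.data_ptr<int64_t>();
// ... sequential scan of mask_host on the CPU follows ...
```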
Since end users typically do not have the privilege to modify the operating system, and adjusting the lifecycle of these temporary variables may fall outside the scope of torchvision, the mitigation has to happen inside the NMS implementation itself.
The experimental comparison among three approaches can be found in [1]. Personally, I highly recommend CPU-free NMS. Nevertheless, even the simplest implementation of Async NMS could potentially provide a performance benefit if the data copy is set to non-blocking mode and a stream synchronization is added before the CPU accesses the data:

```cpp
// nms_kernel_impl runs on the GPU and produces `mask`
at::Tensor mask_cpu = mask.to(at::kCPU, /* non_blocking */ true);  // async D2H copy
// ... some non-relevant CPU code can overlap with the copy here ...
cudaStreamSynchronize(stream);  // ensure the copy has completed
unsigned long long* mask_host =
    (unsigned long long*)mask_cpu.data_ptr<int64_t>();  // now safe to read on the CPU
```

The following are the results of the default NMS and CPU-free NMS executed on a V100 with the latest docker image.

I would strongly recommend that the community cite the paper [1] to help users understand that the code modifications in this PR (#8766) are based not just on experimental results, but also on clear explanations and insights drawn from the computing architecture.

[UPDATED-1] I carefully compared the implementation of CPU-free NMS in this PR with my implementation and found slight differences. I can do another comparison to assess performance if necessary.

References:
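For readers who want a concrete picture of the CPU-free route, here is a minimal sketch of a device-side pass that marks which boxes survive. To be clear, this is not the code from this PR or from [1]: the kernel name, the single-thread launch, and the buffer layout (an n_boxes x col_blocks bitmask of unsigned long long words, 64 boxes per word, with zero-initialized `removed` and `keep` buffers) are assumptions made for illustration.

```cpp
// Hypothetical sketch: scan the IoU bitmask and mark the kept boxes
// entirely on the device, so no device -> host -> device copy is needed.
__global__ void gather_keep_kernel(
    const unsigned long long* mask,  // [n_boxes, col_blocks] IoU bitmask
    unsigned long long* removed,     // [col_blocks] scratch, zero-initialized
    bool* keep,                      // [n_boxes] output, zero-initialized
    int n_boxes,
    int col_blocks) {
  // The scan is inherently sequential (each decision depends on the previous
  // ones), so a single thread performs it; launch as gather_keep_kernel<<<1, 1>>>(...).
  if (blockIdx.x != 0 || threadIdx.x != 0) {
    return;
  }
  for (int i = 0; i < n_boxes; ++i) {
    const int block = i / 64;
    const int bit = i % 64;
    if (!(removed[block] & (1ULL << bit))) {
      keep[i] = true;  // box i survives
      // Suppress every box that overlaps box i above the IoU threshold.
      const unsigned long long* row = mask + (size_t)i * col_blocks;
      for (int j = block; j < col_blocks; ++j) {
        removed[j] |= row[j];
      }
    }
  }
}
```

A real implementation would typically use a full thread block plus shared memory to parallelize the row-OR step, but even this naive version removes the host round trip, which is where the time goes according to [1].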
Thanks for your contribution and the sources shared. It is nice to be backed by a paper with a more thorough analysis. I'll read the paper and reference it in a code comment to provide context for future users.
Well done, the improvement looks quite significant! @NicolasHug any chance this makes it into the next release of torchvision?
To provide a proper NMS implementation for end users, I would like to offer the following personal comments.

As I mentioned earlier, my CPU-free NMS implementation differs slightly from the proposed one: in my approach, the index gathering is fused into the same kernel launch rather than run as a second kernel. From what I understand, most object detectors limit the top-k objects fed into the NMS function, with k generally set between 1000 and 5000. In this range, the elapsed time for NMS is under 2 ms in both the random and best-case scenarios (the worst case is unlikely to occur with real-world datasets). However, the overhead of launching a second GPU kernel is approximately 1-5 ms, depending heavily on the CPU's status. That being said, I completely agree that performing the gathering on the GPU is the right direction.

In summary, I believe that feeding 20,000+ objects into NMS is a rare occurrence, and the fused NMS implementation can offer slightly better performance in most scenarios. Another issue to consider is whether we should maintain two NMS implementations in torchvision.

This is my personal comment, and I completely understand that the maintainers may have other considerations. I wholeheartedly respect any decisions made by the maintainers.
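As an aside, the top-k capping mentioned above usually happens before NMS is ever called. A minimal sketch of such a pre-filter is shown below; the helper name and signature are mine for illustration and are not part of torchvision:

```cpp
#include <algorithm>
#include <tuple>
#include <ATen/ATen.h>

// Hypothetical helper: keep only the k highest-scoring boxes before NMS,
// so the IoU bitmask stays small regardless of which NMS variant is used.
std::tuple<at::Tensor, at::Tensor> topk_before_nms(
    const at::Tensor& boxes,   // [N, 4]
    const at::Tensor& scores,  // [N]
    int64_t k) {
  k = std::min(k, scores.size(0));
  auto topk = scores.topk(k);            // (values, indices), highest scores first
  auto idx = std::get<1>(topk);
  return std::make_tuple(boxes.index_select(0, idx), std::get<0>(topk));
}
```

With k capped at a few thousand, the bitmask stays small enough that the gap between the fused and two-kernel variants is within the couple of milliseconds quoted above.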
Sorry, we've already done the branch cut for 0.21 last month, so we won't be able to land it in the release coming up later in January. I will make sure to review it for the following one though!
Performs the unwrapping of the IoU mask directly on the CUDA device in NMS.
This prevents a device -> host -> device data transfer.
fixes #8713