Add GPU trace for KT.regroup benchmark (pytorch#2157)
Summary: Pull Request resolved: pytorch#2157

# context
* we are adding fbgemm operators for the `KT.regroup` function
* we want a good way to measure performance besides the runtime
* **a trace is very important for evaluating the actual performance impact**
* for example, judging from the GPU runtime readings alone, the native-pytorch implementation (`_regroup_keyed_tenors`) appears to outperform the fbgemm_gpu implementation (`KeyedTensor.regroup`)
* but looking at the CPU/GPU traces, we find that the native-pytorch implementation is actually CPU-bound and significantly degrades overall performance

# usage
* to generate trace files in the given path (.)
```
buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:jagged_tensor_benchmark -- --profile=.
```
```
$ ll *.json
-rw-rw-r-- 1 hhy hhy 8062963 Jun 21 22:21 trace-KeyedTensor.regroup_dup.json
-rw-rw-r-- 1 hhy hhy  943675 Jun 21 22:21 trace-KeyedTensor.regroup.json
-rw-rw-r-- 1 hhy hhy 5140105 Jun 21 22:21 trace-KTRegroupAsDict_dup.json
-rw-rw-r-- 1 hhy hhy  350349 Jun 21 22:21 trace-KTRegroupAsDict.json
-rw-rw-r-- 1 hhy hhy 8025287 Jun 21 22:21 trace-_regroup_keyed_tenors_dup.json
-rw-rw-r-- 1 hhy hhy 8041473 Jun 21 22:21 trace-_regroup_keyed_tenors.json
```

# performance
* GPU (note: the `_dup` variants fall back to the native-pytorch implementation (`_regroup_keyed_tenors`))
```
INFO:2024-06-21 22:22:51 1102779:1102779 CuptiCallbackApi.cpp:78] Callback: domain = 3, cbid = 1
INFO:2024-06-21 22:22:51 1102779:1102779 CuptiActivityProfiler.cpp:241] CUDA versions. CUPTI: 18; Runtime: 12000; Driver: 12000
INFO:2024-06-21 22:22:51 1102779:1102779 NcclProfiler.cpp:150] NCCL Profiler Instantiated
  _regroup_keyed_tenors     | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.8 ms | Memory (P90): 1011.0
  KeyedTensor.regroup       | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 5.0 ms | Memory (P90): 1517.0
  KTRegroupAsDict           | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 4.9 ms | Memory (P90): 1517.0
  _regroup_keyed_tenors_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 1011.0
  KeyedTensor.regroup_dup   | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 1011.0
  KTRegroupAsDict_dup       | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 1011.0
```
* CPU
```
  _regroup_keyed_tenors     | B: 1024 | F: 1020 | device: cpu | Runtime (P90): 144.8 ms | Memory (P90): 0.0
  KeyedTensor.regroup       | B: 1024 | F: 1020 | device: cpu | Runtime (P90): 159.1 ms | Memory (P90): 0.0
  KTRegroupAsDict           | B: 1024 | F: 1020 | device: cpu | Runtime (P90): 203.0 ms | Memory (P90): 0.0
  _regroup_keyed_tenors_dup | B: 1024 | F: 1020 | device: cpu | Runtime (P90): 132.4 ms | Memory (P90): 0.0
  KeyedTensor.regroup_dup   | B: 1024 | F: 1020 | device: cpu | Runtime (P90): 134.7 ms | Memory (P90): 0.0
  KTRegroupAsDict_dup       | B: 1024 | F: 1020 | device: cpu | Runtime (P90): 131.8 ms | Memory (P90): 0.0
```

# traces
* _regroup_keyed_tenors {F1712147044}
* KeyedTensor.regroup {F1712148863}
* KTRegroupAsDict {F1712150411}

Reviewed By: dstaay-fb

Differential Revision: D58906521

fbshipit-source-id: 46e37184cd58c0f25e48112510388de9bd39ac71
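For reference, the kind of Chrome-trace JSON that the benchmark's `--profile` flag emits can be produced with the standard `torch.profiler` API. The sketch below is not the TorchRec benchmark harness itself; `naive_regroup`, the feature names, and the shapes are hypothetical stand-ins chosen only to show the profiling flow (CPU-only here, so it runs without a GPU; adding `ProfilerActivity.CUDA` would also capture GPU kernels).

```python
import os

import torch
from torch.profiler import ProfilerActivity, profile

def naive_regroup(keyed_tensors, groups):
    # Hypothetical stand-in for a regroup op: for each group of feature
    # names, concatenate the corresponding tensors along the last dim.
    return [
        torch.cat([keyed_tensors[k] for k in group], dim=-1)
        for group in groups
    ]

# Assumed toy inputs: 16 features, batch size 1024, embedding dim 8.
kts = {f"f{i}": torch.randn(1024, 8) for i in range(16)}
groups = [[f"f{i}" for i in range(0, 8)], [f"f{i}" for i in range(8, 16)]]

# Record CPU-side ops while the regroup runs.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    out = naive_regroup(kts, groups)

# Export a Chrome-trace JSON file, viewable in chrome://tracing or Perfetto,
# in the same format as the trace-*.json files listed above.
prof.export_chrome_trace("trace-naive_regroup.json")
print(out[0].shape, os.path.exists("trace-naive_regroup.json"))
```

Loading the exported file in Perfetto shows the per-op CPU timeline, which is how a CPU-bound implementation (many small dispatcher calls) can be distinguished from one whose cost is actually on the GPU.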