How does cuda_graph accelerate the denoising? #22
Without CUDA Graph, the launch overhead of the communication kernels will dominate the latency. We capture three graphs: the first one is the warm-up graph, the second one is for the iteration right after the warm-up phase, and the third one is for the remaining steps.
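For reference, a minimal sketch of the capture/replay pattern in PyTorch (this is a generic illustration with a stand-in model, not the actual distrifuser code): all kernels inside the captured region, including any communication kernels, are recorded once, and each later replay is a single launch of the whole recorded sequence.

```python
import torch

device = torch.device("cuda")
model = torch.nn.Linear(64, 64).to(device)  # stand-in for the UNet forward

# Static buffer: the captured graph is bound to this tensor's memory.
static_input = torch.zeros(8, 64, device=device)

# Warm up on a side stream so capture sees steady-state allocations.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture: record the kernel sequence once.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

# Replay: rerun all recorded kernels with one launch,
# skipping the per-kernel launch overhead.
graph.replay()
```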
However, according to the official introduction and usage of CUDA Graph, a graph is bound to its current input tensors when it is captured, which means that in subsequent steps you cannot change the input if you want to replay the previously captured graph. But in the later denoising steps (roughly steps 6-50), before calling the forward function for the current step we need parameters such as `sample` from the previous time step. Doesn't the change in the values of variables such as `sample` conflict with the input invariance of CUDA Graph?
To quote an introduction from the internet: "cuda_graph is suitable for repeatedly running identical computation with invariant inputs, so as to accelerate the computing task."
Yes, the graph is bound to the current input tensors. However, during inference, I copy the new input into the bound input tensors before replaying, as shown in distrifuser/models/distri_sdxl_unet_pp.py.
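A minimal sketch of that replay pattern, assuming the `graph`, `static_input`, and `static_output` buffers from the capture sketch above (the names and the `run_step` helper are illustrative, not from the repo). The graph's memory addresses stay fixed; only the contents of the static tensors change between denoising steps, which is why replaying with a new `sample` does not violate CUDA Graph's invariance requirement.

```python
def run_step(sample: torch.Tensor) -> torch.Tensor:
    # Copy the new denoising input (e.g. `sample` from the previous
    # timestep) into the captured buffer in place; the buffer's address
    # is unchanged, so the graph remains valid.
    static_input.copy_(sample)
    graph.replay()
    # Clone so the next replay doesn't overwrite the returned result.
    return static_output.clone()
```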
Setting use_cuda_graph=True or False results in different inference speeds. Why? And how do the three CUDA graphs map onto the 50 denoising steps?