How does cuda_graph accelerate the denoising? #22

Open
LHQUer opened this issue Sep 26, 2024 · 4 comments
Comments

LHQUer commented Sep 26, 2024

Setting use_cuda_graph=True or False results in different inference speeds. Why? And how do the three CUDA graphs match the 50 denoising steps?

lmxyy (Collaborator) commented Sep 26, 2024

Without CUDA Graph, the communication kernel launch overhead will dominate the latency. The first graph is for the warm-up phase, the second is for the iteration right after the warm-up phase, and the third is for all the remaining iterations.
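
To make the mapping concrete, here is a hypothetical sketch of how three captured graphs could be dispatched over the 50 denoising steps. The function and variable names are illustrative, not DistriFuser's actual identifiers:

```python
# Hypothetical dispatch of three captured CUDA graphs across denoising steps:
# one for the warm-up phase, one for the single iteration right after
# warm-up, and one for every remaining iteration.
def pick_graph(step, num_warmup_steps, graphs):
    """graphs = (warmup_graph, post_warmup_graph, steady_graph)."""
    if step < num_warmup_steps:
        return graphs[0]   # warm-up phase
    elif step == num_warmup_steps:
        return graphs[1]   # iteration right after warm-up
    else:
        return graphs[2]   # all remaining iterations
```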

LHQUer (Author) commented Sep 26, 2024

> Without CUDA Graph, the communication kernel launch overhead will dominate the latency. The first graph is for the warm-up phase, the second is for the iteration right after the warm-up phase, and the third is for all the remaining iterations.

However, according to the official introduction and usage of CUDA Graphs, a graph is bound to its current input when it is created, which means that in subsequent steps you cannot change the input if you want to replay the previously created graph. Yet in the later denoising steps (possibly steps 6-50), each call to the forward function consumes parameters such as the "sample" from the previous time step. Doesn't the change in the values of variables such as "sample" conflict with the input invariance of CUDA Graphs?

LHQUer (Author) commented Sep 26, 2024

> Without CUDA Graph, the communication kernel launch overhead will dominate the latency. The first graph is for the warm-up phase, the second is for the iteration right after the warm-up phase, and the third is for all the remaining iterations.

And I quote an introduction from the internet: "CUDA Graph is suitable for repeatedly running the same computation with invariant inputs, so as to accelerate the computing task."

lmxyy (Collaborator) commented Oct 5, 2024

> Without CUDA Graph, the communication kernel launch overhead will dominate the latency. The first graph is for the warm-up phase, the second is for the iteration right after the warm-up phase, and the third is for all the remaining iterations.

> However, according to the official introduction and usage of CUDA Graphs, a graph is bound to its current input when it is created, which means that in subsequent steps you cannot change the input if you want to replay the previously created graph. Yet in the later denoising steps (possibly steps 6-50), each call to the forward function consumes parameters such as the "sample" from the previous time step. Doesn't the change in the values of variables such as "sample" conflict with the input invariance of CUDA Graphs?

Yes, I bind it to the current input. However, during inference, I copy the new input into the bound input buffer, as shown in distrifuser/models/distri_sdxl_unet_pp.py.
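
For context, a minimal sketch of this capture-once, copy-then-replay pattern with PyTorch's CUDA Graph API might look as follows. A toy linear layer stands in for the UNet; this is an illustration, not the actual DistriFuser code:

```python
import torch

model = torch.nn.Linear(64, 64).cuda().eval()

# Static input buffer: the graph is captured against this tensor's memory
# address, so the address must never change afterwards.
static_input = torch.zeros(1, 64, device="cuda")

# Warm up on a side stream (required before capture in PyTorch).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into the graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_output = model(static_input)

# Denoising loop: the values change every step, but the buffer (and thus
# the graph's bound input) stays the same, so the graph can be replayed.
sample = torch.randn(1, 64, device="cuda")
for _ in range(50):
    static_input.copy_(sample)      # write new values into the bound buffer
    g.replay()                      # relaunch the captured kernels
    sample = static_output.clone()  # read the result for the next step
```

Replay reuses the captured kernel launches and memory addresses; only the contents of static_input change between steps, which is why copying the new sample into the bound buffer does not violate the input invariance.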
