How does cuda_graph accelerate the denoising? #22

Open
LHQUer opened this issue Sep 26, 2024 · 4 comments
Comments

LHQUer commented Sep 26, 2024

Setting use_cuda_graph=True or False results in different inference speeds. Why? And how do the three CUDA graphs match the 50 denoising steps?

lmxyy (Collaborator) commented Sep 26, 2024

Without CUDA Graph, the communication kernel launch overhead will dominate the latency. The first graph is for the warm-up phase, the second is for the iteration right after the warm-up phase, and the third is for all the remaining iterations.
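
To make the mapping concrete, here is a hypothetical sketch of how three captured graphs could be dispatched over the 50 denoising steps. The function and variable names are illustrative, not DistriFuser's actual identifiers:

```python
# Hypothetical dispatch of three captured CUDA graphs across denoising steps:
# one for the warm-up phase, one for the single iteration right after
# warm-up, and one for every remaining iteration.
def pick_graph(step, num_warmup_steps, graphs):
    """graphs = (warmup_graph, post_warmup_graph, steady_graph)."""
    if step < num_warmup_steps:
        return graphs[0]   # warm-up phase
    elif step == num_warmup_steps:
        return graphs[1]   # iteration right after warm-up
    else:
        return graphs[2]   # all remaining iterations
```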

LHQUer (Author) commented Sep 26, 2024

> Without CUDA Graph, the communication kernel launch overhead will dominate the latency. The first graph is for the warm-up phase, the second is for the iteration right after the warm-up phase, and the third is for all the remaining iterations.

However, according to the official introduction and usage of CUDA Graphs, a graph is bound to its current input when it is created, which means that in subsequent steps you cannot change the input if you want to replay the previously created graph. Yet in the later denoising steps (possibly steps 6-50), each call to the forward function consumes parameters such as the "sample" from the previous time step. Doesn't the change in the values of variables such as "sample" conflict with the input invariance of CUDA Graphs?

LHQUer (Author) commented Sep 26, 2024

> Without CUDA Graph, the communication kernel launch overhead will dominate the latency. The first graph is for the warm-up phase, the second is for the iteration right after the warm-up phase, and the third is for all the remaining iterations.

And I quote an introduction from the internet: "CUDA Graph is suitable for repeatedly running the same computation with invariant inputs, so as to accelerate the computing task."

lmxyy (Collaborator) commented Oct 5, 2024

> Without CUDA Graph, the communication kernel launch overhead will dominate the latency. The first graph is for the warm-up phase, the second is for the iteration right after the warm-up phase, and the third is for all the remaining iterations.

> However, according to the official introduction and usage of CUDA Graphs, a graph is bound to its current input when it is created, which means that in subsequent steps you cannot change the input if you want to replay the previously created graph. Yet in the later denoising steps (possibly steps 6-50), each call to the forward function consumes parameters such as the "sample" from the previous time step. Doesn't the change in the values of variables such as "sample" conflict with the input invariance of CUDA Graphs?

Yes, I bind it to the current input. However, during inference, I copy the new input into the bound input buffer, as shown in distrifuser/models/distri_sdxl_unet_pp.py.
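
For context, a minimal sketch of this capture-once, copy-then-replay pattern with PyTorch's CUDA Graph API might look as follows. A toy linear layer stands in for the UNet; this is an illustration, not the actual DistriFuser code:

```python
import torch

model = torch.nn.Linear(64, 64).cuda().eval()

# Static input buffer: the graph is captured against this tensor's memory
# address, so the address must never change afterwards.
static_input = torch.zeros(1, 64, device="cuda")

# Warm up on a side stream (required before capture in PyTorch).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass into the graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_output = model(static_input)

# Denoising loop: the values change every step, but the buffer (and thus
# the graph's bound input) stays the same, so the graph can be replayed.
sample = torch.randn(1, 64, device="cuda")
for _ in range(50):
    static_input.copy_(sample)      # write new values into the bound buffer
    g.replay()                      # relaunch the captured kernels
    sample = static_output.clone()  # read the result for the next step
```

Replay reuses the captured kernel launches and memory addresses; only the contents of static_input change between steps, which is why copying the new sample into the bound buffer does not violate the input invariance.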
