torch.compile ae.decode #25

Open · wants to merge 4 commits into main

Conversation

@yorickvP (Contributor) commented Sep 27, 2024

It takes about 80 seconds on my machine to compile this. Makes the encoding step about 50% faster on A5000 (0.3 -> 0.2s), haven't tried H100.
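
For context, a minimal sketch of what compiling the decode step looks like; `ae` and `latents` stand in for the objects in predict.py and are not the exact code in this PR:

```python
import torch

# Sketch only: wrap the autoencoder's decode in torch.compile at setup time.
# `ae` is a stand-in for the flux autoencoder loaded in predict.py.
ae.decode = torch.compile(ae.decode)

# The first call pays the compilation cost (~80s reported above); later calls
# reuse the compiled graph as long as the input shapes match earlier guards.
with torch.inference_mode():
    image = ae.decode(latents)
```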

@daanelson (Collaborator) left a comment

this is great! You can push to an internal H100 model on Replicate (just don't leave it running 😄) to test perf in prod; good to have solid metrics on that before we merge.

predict.py Outdated
@@ -166,12 +167,65 @@ def base_setup(
shared_models=shared_models,
)

@daanelson (Collaborator):

nit - since these are just simple flags we set at setup for the dev/schnell predictor, I don't mind adding a separate compile_ae flag

# the order is important:
# torch.compile has to recompile if it makes invalid assumptions
# about the input sizes. Having larger input sizes first makes
# for fewer recompiles.
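
A sketch of the warm-up ordering this comment describes; the shapes and the `ae` name are illustrative placeholders, not the PR's actual values:

```python
import torch

# Warm up the compiled decoder on the largest latent shapes first, so the
# shape guards torch.compile records are established before smaller inputs
# arrive, which reduces later recompiles.
warmup_shapes = [
    (1, 16, 192, 192),  # largest resolution first
    (1, 16, 128, 128),
    (1, 16, 96, 96),
]
with torch.inference_mode():
    for shape in warmup_shapes:
        ae.decode(torch.zeros(shape, device="cuda", dtype=torch.bfloat16))
```
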
@daanelson (Collaborator):

any way we can compile once with craftier use of dynamo.mark_dynamic - add a max=192 on dims 2 & 3? I assume you've tried this, curious how it breaks

@yorickvP (Contributor, Author):

I tried max=192, but it didn't have any effect. Setting torch.compile(dynamic=True) makes for one fewer recompile, but I should check the runtime performance of that.
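
For reference, a sketch of the two variants discussed here; `ae` and `latents` are stand-ins, and the min/max bounds are illustrative rather than the PR's exact values:

```python
import torch
import torch._dynamo as dynamo

compiled_decode = torch.compile(ae.decode)

# Variant 1: mark the latent's spatial dims (2 and 3) as dynamic with an
# upper bound, hoping a single compilation covers the whole size range.
dynamo.mark_dynamic(latents, 2, min=16, max=192)
dynamo.mark_dynamic(latents, 3, min=16, max=192)
out = compiled_decode(latents)

# Variant 2: ask torch.compile to trace with symbolic shapes up front,
# which in the tests above removed one recompile.
compiled_decode_dyn = torch.compile(ae.decode, dynamic=True)
out = compiled_decode_dyn(latents)
```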

@yorickvP (Contributor, Author) commented Oct 1, 2024

Did some H100 benchmarks.

flux-schnell 1 image, VAE not compiled

  • 30ms prepare
  • 355ms denoise-single-item
  • 117ms vae-decode
  • total: 505ms

flux-schnell 4 images, VAE not compiled

  • 30ms prepare
  • 4x 355ms denoise-single-item
  • 3.21s vae-decode
  • total: 4.69s

flux-schnell 4 images, VAE compiled

  • 30ms prepare
  • 4x 355ms denoise-single-item
  • 152ms vae-decode
  • total: 1.62s

The VAE speedup seems reproducible: the uncompiled VAE spends a lot of time in nchwToNhwcKernel, while the compiled version manages to avoid it.
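
One way to observe this kernel-level difference, as a sketch (`ae` and `latents` again stand in for the objects in predict.py):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile a single decode and list the most expensive CUDA kernels.
# In the uncompiled run, layout-conversion kernels such as nchwToNhwcKernel
# show up near the top of this table; the compiled decoder avoids them.
with torch.inference_mode():
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        ae.decode(latents)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```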

At the same time, I hit a cog bug saying "output streams failed to drain", which crashed the pod instantly, but this seems unrelated to my PR.

@jonluca commented Oct 17, 2024

Did you figure out what the "output streams failed to drain" issue was? I'm seeing that in prod with our cog deploy as well.

@yorickvP (Contributor, Author):

@jonluca as I understand it, it was a regression in cog and should be fixed when building with cog 0.9.25 or later.
It was caused by cog replacing stdout/stderr during predictions, but not during setup, so forked processes would attempt to write to the original stdout/stderr. It should be fixed by replicate/cog#1969, but let me know if it's not!
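
An illustrative-only sketch of the failure mode described here (this is not cog's code, and the mechanism is simplified):

```python
import os
import sys

# A worker forked before stdout is swapped keeps the original fd 1, so its
# output bypasses whatever stream the parent later installs and can pile up
# in a stream that nothing drains anymore.
pid = os.fork()  # forked during "setup", inherits the original stdout
if pid == 0:
    os.write(1, b"worker: still writing to the original stdout\n")
    os._exit(0)

sys.stdout = open(os.devnull, "w")  # "prediction": parent swaps only its own sys.stdout
os.waitpid(pid, 0)
```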
