-
Could anyone please give feedback or suggestions on how to optimize training for the torch backend? Referring to this page, one possible approach is to use CUDA Graphs. But the underlying torch.nn.Module object is wrapped inside the keras.models.Model. I'm wondering whether Keras 3 provides any API to access the underlying torch model, to make such an optimization possible?
-
Hello Jeff,

To the best of my knowledge, unless PyTorch uses highly optimized linear-algebra compilers and CUDA-based libraries, it is very likely to be slower. PyTorch benchmarks often contain compiler-aware code for different LLMs, and that is what actually causes the speed-up. Most of the time, the JAX backend is the fastest for inference and if using

Best Regards,
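For reference, Keras 3 reads the `KERAS_BACKEND` environment variable at import time, so trying the JAX backend is a one-line change (this sketch only sets the variable; the actual `import keras` must happen afterwards and requires `jax` to be installed):

```python
import os

# Keras 3 picks its backend from KERAS_BACKEND at import time;
# this must be set before the first `import keras`.
os.environ["KERAS_BACKEND"] = "jax"

# import keras  # would now initialize with the JAX backend
```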
-
Referring to the Keras 3 benchmark page, PyTorch is significantly slower than TensorFlow in training. The following result is extracted from that page to compare training speed.
The speed is measured in ms/step. Lower is better.
There is a reason why we investigated this benchmark result. For our own DL model, training also became significantly slower (50% to 60%) after we migrated from TensorFlow 2 (with GPU/CUDA support) to Keras 3 (with the PyTorch+CUDA backend) on Windows. On Windows, TensorFlow 2.10+ no longer supports GPU/CUDA, which is one of the reasons we wanted to switch to Keras 3 with the PyTorch+CUDA backend.
In the benchmarking page above, I also noticed the footnote right below the results (Table 2):
My question is: in which future release of Keras 3 will the PyTorch slowness issue be addressed? Should that also resolve the training-speed issue for the PyTorch backend in general?
In the current Keras 3, are there any ways I can improve the training speed for the PyTorch backend?