We're excited to announce the Beta release of ExecuTorch! This release includes many new features, improvements, and bug fixes.
## API Stability and Runtime Compatibility Guarantees
Starting with this release, ExecuTorch's Python and C++ APIs will follow the API Lifecycle and Deprecation Policy, and the `.pte` file format will comply with the Runtime Compatibility Policy.
## New Features
- Introduced the `exir.to_edge_transform_and_lower` API, combining the functionality of `to_edge`, `transform`, and `to_backend`
  - Allows users to prevent specific op decompositions while lowering to backends that implement those ops
- Increased operator coverage for ExecuTorch’s portable library
- Added new experimental APIs:
  - LLM runner C++ APIs such as `prefill_image()`, `prefill_prompt()`, and `generate_from_pos()` with multimodal support
  - `executorch.runtime` Python module for loading `.pte` files and running them with the underlying C++ runtime
- Added a new Tensor API to bundle the dynamic data and metadata within a Tensor object.
- Improved the Module API to share an ExecuTorch Program between several Modules and provide APIs to set inputs/outputs before execution
- Added `find_package(executorch)` for projects to easily link to ExecuTorch's prebuilt library in CMake
- Introduced reproducible benchmarking infrastructure to measure, debug, and track performance, enabling on-demand and automated nightly benchmarking of models and backend delegates on modern smartphones
- Added support for TikToken v5 vision tokenizer
- Improved parallelization for LLM prefill
- Added experimental capabilities for on-device training, along with an example prototype for LLM finetuning
## Supported Models
- Added support for the following models:
- LLaMA 3 models, including LLaMA 3 8B, 3.1 8B, and 3.2 1B/3B
- [MultiModal] LLaVA (Large Language and Vision Assistant)
- Phi-3-mini
- Gemma 2B
- Added LLaMA 3, 3.1, and 3.2 to the Android Llama Demo app
- Added LLaVa multimodal support to the iOS iLLaMA and Android LLaMa Demo apps
## Hardware Acceleration
- Delegate framework
- Allow delegate to consume buffer mutations
- [New] MediaTek
- Added support for a new MediaTek backend
- Enabled LLaMa 3 acceleration on MediaTek’s NPU
- Added export scripts and runners for 8 different OSS models
- Implemented intermediate tensor logging
- CoreML
- Added LLaMA support for in-place KV cache, fused SDPA kernel, and 4-bit per-block quantization
- Added primitive support for dynamic shapes to work without `torch._check`
- Expanded operator coverage to over 100 ops
- Enabled stateful runtime execution
- Implemented intermediate tensor logging
- MPS
- Added support for 4-bit linear kernels (iOS 18 only)
- Enabled LLaMa 2 7B and LLaMa 3 8B
- Qualcomm (Qualcomm Neural Network)
- Enabled LLaMa 3 8B with 4-bit linear kernel, SpinQuant, fused RMSNorm from QNN 2.25, and model sharding
- Added support for the AI Hub model format
- Implemented intermediate tensor logging
- ARM
- Added new operators: `addmm`, `avg_pool2d`, `batch_norm`, `bmm`, `clone`/`cat`, `conv2d` improvements, `div`, `exp`, `full`, `hardtanh`, `log`, `mean_dim`, `mul`, `permute`, `relu`, `sigmoid`, `slice`, `softmax`, `sub`, `unsqueeze`, `view`
- Added/enabled lowering passes to improve network compatibility
- Improved quantization support
- Made quantization accuracy improvements for all models
- Added quantization coverage for all available ops
- Improved channel last support by reducing overhead and number of conversions
- Added performance measurements on Corstone-300 FVP for Ethos-U55
- Moved to new compilation flow in Vela to provide better performance and compatibility
- Improved code documentation for third party contributors
- XNNPACK
- Enhanced XNNPACK backend performance
- Added support for new LLaMa models and other quantized LLMs on Android/iOS devices, including LLaMA 3 8B, 3.1 8B, and 3.2 1B/3B
- Introduced major partitioner refactor to improve UX and stability
- Improved model coverage to ensure better stability
- Vulkan
- Made latency optimizations for Vulkan convolution and matrix multiplication compute shaders through various algorithmic improvements
- Added quantizer for 8 bit weight-only quantization
- Expanded operator coverage to 63 ops
- Added 4-bit and 8-bit weight quantized linear kernels
- Added support for view tensors in the Vulkan graph runtime, allowing for no-copy permutes, squeeze/unsqueeze etc.
- Added support for symbolic integers in the Vulkan graph runtime
- Integration with ExecuTorch SDK to track compute shader latencies
- Cadence
- Added an x86 executor to sanity check and numerically verify models locally
- Added multiple supported e2e models such as wav2vec2
- Integrated low-level optimizations resulting in 10x+ performance improvements
- Migrated more graph-level optimizations to the open source repository
- Enabled more types in the CadenceQuantizer, and moved to int8 default for better performance
## Developer Experience
- Introduced API to enable intermediate output logging in delegates
- Improved CMake build system and reduced reliance on Buck2
- Added override options for fallback PAL implementations through a CMake flag (`-DEXECUTORCH_PAL_DEFAULT`)
- Changes to DimOrder (please see this issue for current progress and next steps)
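These CMake improvements pair with the `find_package(executorch)` support noted under New Features. A hypothetical consumer project might look like the sketch below; the `executorch` target name and the `minimal` PAL value are assumptions, so check the exported targets and flag values of your installed release.

```cmake
# Hypothetical CMakeLists.txt for an app linking ExecuTorch's prebuilt library.
# Configure with e.g.:
#   cmake -DCMAKE_PREFIX_PATH=/path/to/executorch-install -B build .
# (the -DEXECUTORCH_PAL_DEFAULT flag applies when building ExecuTorch itself,
#  e.g. to select a fallback PAL when you provide your own platform layer)
cmake_minimum_required(VERSION 3.19)
project(my_et_app CXX)

find_package(executorch REQUIRED)  # locates the installed ExecuTorch package

add_executable(my_et_app main.cpp)
# Target name is an assumption; verify against the package's exported targets.
target_link_libraries(my_et_app PRIVATE executorch)
```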
## Bug Fixes
- Fixed various issues related to quantization, tensor operations, and backend integrations
- Resolved memory allocation and management issues
- Fixed compatibility issues with different Python and dependency versions
- Fixed bundled program and `plan_execute` in pybindings
## Breaking Changes
- Updated the minimum C++ version to C++17 for the core runtime
- Removed all C++ headers under `//executorch/util` (see `extension/runner_util/inputs.h` for a `PrepareInputTensors` replacement)
  - Users are now expected to provide their own `read_file.h` functionality
- Renamed instances of `sdk` to `devtools` in file names, function names, and CMake options
## Deprecation
- Added new annotations and decorators for API lifecycle and deprecation management
  - New `ET_EXPERIMENTAL` annotation indicates C++ APIs that may change without notice
  - New `@deprecated` and `@experimental` Python decorators indicate non-stable APIs
- Names under the `torch::` namespace are deprecated in favor of names under the `executorch::` namespace; please migrate code to use the new namespace and avoid adding new references to the `torch::` namespace
- Constant buffers are no longer stored inside the `.pte` flatbuffer; going forward, they are stored in a segment attached to the `.pte`
- All C++ macros beginning with underscores, such as `__ET_UNUSED`, are deprecated in favor of unprefixed names such as `ET_UNUSED`
- `capture_pre_autograd_graph()` is deprecated in favor of the new `torch.export_for_training()` API
Thanks to the following open source contributors for their work on this release!
denisVieriu97, Erik-Lundell, Esteb37, SaoirseARM, benkli01, bigfootjon, chuntl, cymbalrush, derekxu, dulinriley, freddan80, haowhsu-quic, namanahuja, neuropilot-captain, oscarandersson8218, per, python3kgae, r-barnes, robell, salykova, shewu-quic, tom-arm, winskuo-quic, zingo
Full Changelog: v0.3.0...v0.4.0