We're excited to announce the Beta release of ExecuTorch! This release includes many new features, improvements, and bug fixes.
## API Stability and Runtime Compatibility Guarantees
Starting with this release, ExecuTorch's Python and C++ APIs will follow the API Lifecycle and Deprecation Policy, and the `.pte` file format will comply with the Runtime Compatibility Policy.
## New Features
- Introduced the `exir.to_edge_transform_and_lower` API, combining the functionality of `to_edge`, `transform`, and `to_backend`
  - Allows users to prevent specific op decompositions while lowering to backends that implement those ops
- Increased operator coverage for ExecuTorch’s portable library
- Added new experimental APIs:
  - LLM runner C++ APIs such as `prefill_image()`, `prefill_prompt()`, and `generate_from_pos()` with multimodal support
  - `executorch.runtime` Python module for loading `.pte` files and running them with the underlying C++ runtime
- Added a new Tensor API to bundle the dynamic data and metadata within a Tensor object.
- Improved the Module API to share an ExecuTorch Program between several Modules and provide APIs to set inputs/outputs before execution
- Added `find_package(executorch)` for projects to easily link to ExecuTorch's prebuilt library in CMake
- Introduced reproducible benchmarking infrastructure to measure, debug, and track performance, enabling on-demand and automated nightly benchmarking of models and backend delegates on modern smartphones
- Added support for TikToken v5 vision tokenizer
- Improved parallelization for LLM prefill
- Added experimental capabilities for on-device training, along with an example prototype for LLM finetuning
## Supported Models
- Added support for the following models:
- LLaMA 3 models, including LLaMA 3 8B, 3.1 8B, and 3.2 1B/3B
- [MultiModal] LLaVA (Large Language and Vision Assistant)
- Phi-3-mini
- Gemma 2B
- Added LLaMA 3, 3.1, and 3.2 to the Android Llama Demo app
- Added LLaVa multimodal support to the iOS iLLaMA and Android LLaMa Demo apps
## Hardware Acceleration
- Delegate framework
- Allow delegate to consume buffer mutations
- [New] MediaTek
- Added support for a new MediaTek backend
- Enabled LLaMa 3 acceleration on MediaTek’s NPU
- Added export scripts and runners for 8 different OSS models
- Implemented intermediate tensor logging
- CoreML
- Added LLaMA support for in-place KV cache, fused SDPA kernel, and 4-bit per-block quantization
- Added primitive support for dynamic shapes to work without `torch._check`
- Expanded operator coverage to over 100 ops
- Enabled stateful runtime execution
- Implemented intermediate tensor logging
- MPS
- Added support for 4-bit linear kernels (iOS 18 only)
- Enabled LLaMa 2 7B and LLaMa 3 8B
- Qualcomm (Qualcomm Neural Network)
- Enabled LLaMa 3 8B with 4-bit linear kernel, SpinQuant, fused RMSNorm from QNN 2.25, and model sharding
- Added support for the AI Hub model format
- Implemented intermediate tensor logging
- ARM
- Added new operators: `addmm`, `avg_pool2d`, `batch_norm`, `bmm`, `clone`/`cat`, `conv2d` improvements, `div`, `exp`, `full`, `hardtanh`, `log`, `mean_dim`, `mul`, `permute`, `relu`, `sigmoid`, `slice`, `softmax`, `sub`, `unsqueeze`, `view`
- Added/enabled lowering passes to improve network compatibility
- Improved quantization support
- Made quantization accuracy improvements for all models
- Added quantization coverage for all available ops
- Improved channel last support by reducing overhead and number of conversions
- Added performance measurements on Corstone-300 FVP for Ethos-U55
- Moved to new compilation flow in Vela to provide better performance and compatibility
- Improved code documentation for third party contributors
- XNNPACK
- Enhanced XNNPACK backend performance
- Added support for new LLaMa models and other quantized LLMs on Android/iOS devices, including LLaMA 3 8B, 3.1 8B, and 3.2 1B/3B
- Introduced major partitioner refactor to improve UX and stability
- Improved model coverage to ensure better stability
- Vulkan
- Made latency optimizations for Vulkan convolution and matrix multiplication compute shaders through various algorithmic improvements
- Added quantizer for 8 bit weight-only quantization
- Expanded operator coverage to 63 ops
- Added 4-bit and 8-bit weight quantized linear kernels
- Added support for view tensors in the Vulkan graph runtime, allowing for no-copy permutes, squeeze/unsqueeze etc.
- Added support for symbolic integers in the Vulkan graph runtime
- Integration with ExecuTorch SDK to track compute shader latencies
- Cadence
- Added an x86 executor to sanity check and numerically verify models locally
- Added multiple supported e2e models such as wav2vec2
- Integrated low-level optimizations resulting in 10x+ performance improvements
- Migrated more graph-level optimizations to the open source repository
- Enabled more types in the CadenceQuantizer, and moved to int8 default for better performance
## Developer Experience
- Introduced API to enable intermediate output logging in delegates
- Improved CMake build system and reduced reliance on Buck2
- Added override options for fallback PAL implementations through a CMake flag (`-DEXECUTORCH_PAL_DEFAULT`)
- Changes to DimOrder (please see this issue for current progress and next steps)
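These CMake improvements pair with the `find_package(executorch)` support noted under New Features. A hypothetical consumer project might look like the sketch below; the `executorch` target name and the `minimal` PAL value are assumptions, so check the exported targets and flag values of your installed release.

```cmake
# Hypothetical CMakeLists.txt for an app linking ExecuTorch's prebuilt library.
# Configure with e.g.:
#   cmake -DCMAKE_PREFIX_PATH=/path/to/executorch-install -B build .
# (the -DEXECUTORCH_PAL_DEFAULT flag applies when building ExecuTorch itself,
#  e.g. to select a fallback PAL when you provide your own platform layer)
cmake_minimum_required(VERSION 3.19)
project(my_et_app CXX)

find_package(executorch REQUIRED)  # locates the installed ExecuTorch package

add_executable(my_et_app main.cpp)
# Target name is an assumption; verify against the package's exported targets.
target_link_libraries(my_et_app PRIVATE executorch)
```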
## Bug Fixes
- Fixed various issues related to quantization, tensor operations, and backend integrations
- Resolved memory allocation and management issues
- Fixed compatibility issues with different Python and dependency versions
- Fixed bundled program and `plan_execute` in pybindings
## Breaking Changes
- Updated the minimum C++ version to C++17 for the core runtime
- Removed all C++ headers under `//executorch/util` (see `extension/runner_util/inputs.h` for a `PrepareInputTensors` replacement)
  - Users are now expected to provide their own `read_file.h` functionality
- Renamed instances of `sdk` to `devtools` in file names, function names, and CMake options
## Deprecation
- Added new annotations and decorators for API lifecycle and deprecation management
  - New `ET_EXPERIMENTAL` annotation indicates C++ APIs that may change without notice
  - New `@deprecated` and `@experimental` Python decorators indicate non-stable APIs
- Names under the `torch::` namespace are deprecated in favor of names under the `executorch::` namespace; please migrate code to use the new namespace and avoid adding new references to the `torch::` namespace
- Constant buffers are no longer stored inside the `.pte` flatbuffer; going forward, they are stored in a segment attached to the `.pte`
- All C++ macros beginning with underscores, such as `__ET_UNUSED`, are deprecated in favor of unprefixed names such as `ET_UNUSED`
- `capture_pre_autograd_graph()` is deprecated in favor of the new `torch.export_for_training()` API
Thanks to the following open source contributors for their work on this release!
denisVieriu97, Erik-Lundell, Esteb37, SaoirseARM, benkli01, bigfootjon, chuntl, cymbalrush, derekxu, dulinriley, freddan80, haowhsu-quic, namanahuja, neuropilot-captain, oscarandersson8218, per, python3kgae, r-barnes, robell, salykova, shewu-quic, tom-arm, winskuo-quic, zingo
Full Changelog: v0.3.0...v0.4.0