14 Jul 22:33

driazati

d361585

Apache TVM v0.9.0

Introduction

The TVM community has worked since the v0.8 release to deliver many exciting features and improvements. v0.9.0 is the first release on the new quarterly release schedule and includes many highlights, such as:

MetaSchedule's full implementation
ARM cascading scheduler for Arm Ethos(TM)-U NPUs
Collage which brings tuning to BYOC
Several microTVM improvements
New tvm.relay.build parameters: runtime= and executor=
AOT - Support for the C++ runtime (with llvm and c targets only) and support for host-driven AOT in the C runtime
Hexagon RPC support
- Testing via Hexagon SDK simulator and on device via Snapdragon-based HDK boards and phones
- AOT and USMP support
- Threading
- Initial op support
MLF - Support for multiple modules in a single MLF artifact
Several TIR schedule primitives and transforms including (abridged):
- schedule.transform_layout - Applies a layout transformation to a buffer as specified by an IndexMap.
- schedule.transform_block_layout - Applies a schedule transformation to a block as specified by an IndexMap.
- schedule.set_axis_separators - Sets axis separators in a buffer to lower to multi-dimensional memory (e.g. texture memory).
- transform.InjectSoftwarePipeline - Transforms annotated loop nest into a pipeline prologue, body and epilogue where producers and consumers are overlapped.
- transform.CommonSubexprElimTIR - Implements common-subexpression elimination for TIR.
- transform.InjectPTXAsyncCopy - Rewrites global to shared memory copies in CUDA with async copy when annotated tir::attr::async_scope.
- transform.LowerCrossThreadReduction - Enables support for reductions across threads on GPUs.
And many more! See the list of RFCs and PRs included in v0.9.0 for a complete list, as well as the full change list.
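The prologue/body/epilogue structure that transform.InjectSoftwarePipeline produces can be hard to picture from the one-line description. Below is a minimal pure-Python sketch of a two-stage producer/consumer loop pipelined by hand. It does not use TVM. The names `pipelined_schedule` and `run_pipelined` are hypothetical, "load" stands in for a global-to-shared copy, and the two-slot list stands in for double-buffered shared memory.

```python
def pipelined_schedule(n):
    """Issue order of a 2-stage software pipeline over n loop iterations.

    Stage "load" produces the data for iteration i; stage "compute" consumes
    it. In the body, load(i + 1) is issued next to compute(i) so that, on
    hardware with asynchronous copies, the two stages can overlap.
    """
    if n <= 0:
        return []
    steps = [("load", 0)]                 # prologue: fill the pipeline
    for i in range(n - 1):                # body: overlap load(i+1) with compute(i)
        steps.append(("load", i + 1))
        steps.append(("compute", i))
    steps.append(("compute", n - 1))      # epilogue: drain the pipeline
    return steps


def run_pipelined(data, f):
    """Apply f to every element of data following the schedule above,
    using a double buffer so the producer never overwrites live data."""
    n = len(data)
    if n == 0:
        return []
    out = []
    buf = [None, None]
    buf[0] = data[0]                      # prologue
    for i in range(n - 1):                # body
        buf[(i + 1) % 2] = data[i + 1]    # producer writes the other slot
        out.append(f(buf[i % 2]))         # consumer reads the current slot
    out.append(f(buf[(n - 1) % 2]))       # epilogue
    return out


print(pipelined_schedule(3))
# [('load', 0), ('load', 1), ('compute', 0), ('load', 2), ('compute', 1), ('compute', 2)]
```

Sequential Python cannot actually overlap the stages; the point is the shape of the rewritten loop and why two buffer slots are needed.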

RFCs

These RFCs have been merged in apache/tvm-rfcs since the last release.

What's Changed

Note that this list does not cover all PRs and discussions since v0.8. Please visit the full listing of commits for a complete view: v0.8.0...v0.9.0.rc0.

AOT

#11208 - Calculate used memory at the callsite of primitive functions
#11365 - Fix function number datatype from char to uint16_t
#11091 - Enable A-Normal Form in the AOT executor
#10753 - Support LLVM backend with C++ runtime
#10518 - Use python temporary directory for AOT tests
#10337 - BugFix of workspace calculation
#10282 - [runtime] Add Metadata classes for AOTExecutor
#9501 - [3/3][DeviceAPI] Wire up cpacked Device API context
#9500 - [2/3][DeviceAPI] Add Hooks for Activate/Deactivate/Open/Close
#9395 - [1/3][DeviceAPI] Connecting devices structure to relevant operators

BYOC

#11474 - Two helper passes for external codegen using RelayToTIR custom pass machinery
#11144 - Remove support for run-time linked-params from codegen
#10590 - Add order to functions in C Codegen
#11638 - [DNNL][CBLAS] Unifies all MKLDNN/DNNL to DNNL
#11619 - RelayToTIR custom codegen passes can still depend on dynamic shape functions
DNNL - #11902, #11642, #11513, #11571, #11560, #11345, #11111, #10837, #10421, #9995, #9797
TensorRT - #11923, #11203, #10759, #10772, #10388
CMSIS-NN - #11732, #11625, #10939, #11013, #10817, #10563, #10224, #10148, #10100, #9338, #9531, #9409, #9331
OpenCLML - #10243
CUTLASS - #11631, #10185, #10177, #10110, #10036, #9899, #9820, #9800, #9795, #9746, #9737, #9698, #95...

24 Nov 17:14

junrushao

v0.8.0

7b3a22e

Apache TVM v0.8 Release Note

Overview
Accepted RFCs
Features and Improvements

Overview

Apache TVM v0.8 brings several major exciting experimental features, including:

PaddlePaddle frontend
TVMScript: round-trippable python-based syntax for TIR
TorchScript integration
TensorIR scheduling language
TensorRT and CUTLASS integration via BYOC
Int4 TensorCore support in AutoTVM
MicroTVM Project API and Zephyr, Arduino support
AOT executor
Robust Windows support
Affine analysis infra: iter-affine-map
Improved Vulkan backend
CUDA graph support in TVM runtime

In addition, the community has been working together to refactor and evolve the existing infrastructure, including but not limited to:

Relay compilation engine
Relay pattern language
CI and build process
Refactoring documentation and tutorials
Stabilizing AutoScheduler
Stabilizing TVMC command line driver interface
Stabilizing target system
Frontend coverage, quantization, dynamic shape, training

Full changelog: https://gist.github.com/junrushao1994/c669905dbc41edc2e691316df49d8562.

Accepted RFCs

The community has adopted a formal RFC process. Below is a list of the formal RFCs accepted by the community since then:

[RFC-0005] Meta schedule (AutoTIR)
[RFC-0006] Automatic mixed-precision pass and support
[RFC-0007] Parametrized unit tests
[RFC-0008] MicroTVM Project API
[RFC-0009] Unified static memory planner
[RFC-0010] Target-registered compiler flow customisation
[RFC-0011] Arm® Ethos-U integration
[RFC-0014] Pipeline executor
[RFC-0015] Use CMSIS-NN with TVM
[RFC-0019] Add PaddlePaddle frontend
[RFC-0020] Extend metadata in project option
[RFC-0022] TIR non-scalar constants
[RFC-0023] Adding annotation field to tir.allocate nodes
[RFC-0025] PyTorchTVM
[RFC-0027] Formalize TVM documentation organization
[RFC-0028] Command line composition from internal registry
[RFC-0029] Migrating target attributes to IRModule
[RFC-0030] Command line configuration files
[RFC-0031] C Device API
[RFC-0036] TVMScript namespace
[RFC-0041] Update TVMScript block syntax

Features and Improvements

TE, TIR, TVMScript

TVMScript parser and printer #7630 #9115 #9286
Scheduleable TIR (S-TIR) infrastructure, analysis and lowering passes #7553 #7765 #7847 #8114 #8121 #7873 #7923 #7962 #7848 #8044 #7806
S-TIR schedule primitives: compute-inline, reverse-compute-inline, fuse, split, rfactor, storage-align, vectorize, unroll, bind, reorder, cache-read, cache-write, compute-at, reverse-compute-at, decompose-reduction #8170 #8467 #8544 #8693 #8716 #8767 #8863 #8943 #9041
While loop in TIR #7425 #9004
Metaprogramming in S-TIR via specialize #8354
Support Return value in TIR #7084 #7932
Storage scope support in PointerType #8017 #8366 #8463
Creation of S-TIR via TE compute #7987
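S-TIR schedule primitives such as split and fuse rewrite loop nests without changing which iterations execute. The index arithmetic they are commonly described as performing can be sketched with plain Python generators. This is not TVM's scheduling API, and the functions `split` and `fuse` below are illustrative names only.

```python
def split(extent, factor):
    """Yield original indices visited by loop i split into (outer, inner).

    i == outer * factor + inner; when factor does not divide extent, the last
    outer iteration needs a guard (predicate) so it does not run past extent.
    """
    for outer in range((extent + factor - 1) // factor):   # ceil(extent / factor)
        for inner in range(factor):
            i = outer * factor + inner
            if i < extent:                                   # boundary predicate
                yield i


def fuse(n, m):
    """Yield (i, j) pairs visited by fusing loops i in [0, n) and j in [0, m)
    into a single loop over fused in [0, n * m)."""
    for fused in range(n * m):
        yield fused // m, fused % m


print(list(split(10, 4)))   # same order as range(10)
print(list(fuse(2, 3)))     # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```

Both transformations preserve the iteration order of the original nest, which is why they are always legal; primitives like reorder or compute-at are where dependence checks come in.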

AutoTVM, AutoScheduler, Meta Schedule

PopenPoolExecutor replaces the native Python multiprocessing library, providing better multiprocessing support and enabling auto-tuning in Jupyter notebooks for AutoTVM and AutoScheduler #6959 #8492 #8913 #8820 #8851
AutoScheduler improvement and stabilization: task scheduler, layout rewrite, early stopping, dispatching #6945 #6750 #6987 #7156 #8862 #8995 #7571 #7376 #7377 #7344 #7185
AutoScheduler support for sparse workloads #7313 #7635 #8065
AutoScheduler support for Vulkan, ROCm, Mali #7626 #7038 #7132
AutoTVM support for int4 TensorCore #7831 #8402
Meta Schedule core infrastructure, builder runner and database #8615 #8623 #8642 #8817 #9079 #9132 #9154 #9053 #9059 #9044 #9111 #9061 #9153

Operator Coverage

Operators for Int-8 vision transformer on GPU #7814
Optimizing NMS and ROI-related kernels on GPU #7257 #7172 #7136 #7796 #7463 #6516 #7440 #7666 #8174
Support and optimize sparse operators #8605 #7477 #7435 #6889 #6580 #8437
Sort-related operators and optimization #9184 #7669 #8672 #7611 #7195 #7056 #6978
Support for einsum operator #6370
Matmul, dense operators and their optimization #8921 #8527 #8234 #8250 #6616 #8229 #8401 #7404 #8669
Convolution and pooling operators and their optimization #8620 #8936 #8584 #7075 #7142 #7515 #6999 #6899 #6840 #6137 #6802 #6445 [#671...

02 Oct 18:30

ZihengJiang

v0.7.0

728b829

Apache TVM (incubating) v0.7.0

Apache TVM (incubating) is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator PMC.

Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects.

While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

Introduction

v0.7 brings many major features. The community has worked together to refactor the internal code base into a unified IR code structure with a unified IRModule, type system and pass infrastructure. We have also brought many exciting new features; some highlights include:

Initial automatic scheduling support
Initial command line driver interface
WebGPU and webassembly support
Better first-class Rust support in the codebase
Initial Hexagon support
Bring your own codegen (BYOC) support

The community also continues to bring high quality improvements to the existing modules including, but not limited to: better frontend coverage, performance, quantization, uTVM and dynamic shape support.

New Features

Automatic Scheduling (Experimental)

Phase 0: Ansor minimum system for auto schedule generating #5962
Phase 1: Access Analyzer #6103
Phase 1: Add follow_split and follow_fused_split steps #6142
Phase 1: Add pragma/storage_align/rfactor steps #6141
Phase 1: Add RPC Runner #6077
Phase 1: Add annotation/compute_at/compute_root/compute_inline steps #6073
Phase 1: Add cache_read/cache_write steps #6107
Phase 1: Rename namespace from auto_schedule to auto_scheduler #6059
Phase 1: The base class for cost models #6187
Phase 1: feature extraction for cost models #6190
Phase 1: XGBoost Cost Model #6270
Phase 2: Basic GPU Sketch Search Policy #6269
Phase 2: Evolutionary Search #6310
Phase 2: Update heavy operations with parallel_for #6348
Parallelize InitPopulation (#6512)
Tutorial: Using the template-free auto-scheduler on CPU (#6488)

BYOC

External codegen support in Relay (#4482, #4544)
Bring Your Own Codegen Guide -- Part 1 #4602
Bring Your Own Codegen Guide -- Part 2 #4718
Relay annotation and partitioning for external compilers #4570
JSON Runtime with DNNL End-to-End Flow #5919
Handle one symbol for each runtime #5989
Run accelerator specific optimizations #6068
Arm Compute Library integration #5915
Retire the example json runtime #6177
json_node.h should include data_type.h #6224
Improve installation tutorial #6170
Add support for dense (fully connected) layer #6254
Introduce the Ethos-N BYOC integration #6222
Enable remote device via environment variables #6279
Improved pooling support #6248
Add support for quantized convolution #6335
CoreML codegen #5634

Operator Coverage

Add strided_set operation (#4303)
Add support for conv3d (#4400), pool3d (#4478), 3d upsampling ops (#4584)
Add group convolution for VTA (#4421)
Add 1d deconvolution op (#4476)
Allow batch matmul to be fused into injective ops (#4537)
Add native depthtospace and spacetodepth operators (#4566)
Add CUDNN conv3d support (#4418)
Dilation2D operator support #5033
Isfinite operator #4981
Unravel Index operator #5082
Add thrust support for nms #5116
Resize3d, Upsample3d op support #5633
Add operator Correlation #5628
affine_grid and grid_sample #5657
Sparse to dense operator #5447
Conv3d_transpose op support added #5737
add op crop_and_resize #4417
Add bitwise ops #4815
Support dynamic NMS (Non-Maximum Suppression), symbolic begin, end, and strides for strided_slice #4312
ReverseSequence operator #5495
Conv1D #4639
1D Pooling #4663

Quantization

Channel wise quantization - Quantize & Requantize #4629
Support QNN ops. #5066
Adding support for QNN subtract op #5153
TFLite QNN Tutorial #5595
Tutorial: Deploy Quantized Model on CUDA #4667
Support asymmetric per-layer quantized operators #6109

Relay

Add convertlayout pass in Relay (#4335, #4600)
Added Merge Composite pass #4771
Call graph for relay #4922
Add inline pass #4927
Target annotation for external codegen #4933
GradientCell Relay Pass #5039
Add MergeCompilerRegions pass #5134
Non-recursive Graph Visitor and Rewriter (#4886)
[Blocksparse] Pipeline for lowering dense model to sparse-dense (#5377)
Relay op strategy #4644
Static Tensor Array (#5103)
Memory planner (part 1) #5144
ONNX codegen #5052
Add Parser 2.0 #5932, part 2 #6162
Basic block normal form #6152
Convert Layout pass. #4664
Pattern Language, Matcher, Rewriter, and Function Partitioner #5231

Runtime and Backend

Add ADTObject POD container type (#4346)
TFLite RPC runtime (#4439)
Standardized graph runtime export (#4532)
MISRA-C compliant TVM runtime #3934
Add String container #4628
Introduce Virtual Memory Allocator to CRT (#5124)
Initial implementation of Hexagon runtime support (#5252)
FastRPC interface for Hexagon runtime (#5353)
CoreML Runtime (#5283)
AutoTVM + uTVM for Cortex-M7 (#5417)
Windows Support for cpp_rpc (#4857)
Implement TVMDSOOp(TensorFlow custom op) for TVM runtime (#4459)
WebGPU support #5545
TVM WebAssembly JS Runtime #5506
Hexagon driver for offloading kernels to simulator #5492
Introduce runtime::Array #5585
Allow non-nullable ObjectRef, introduce Optional. (#5314)
Introduce static slots for common objects. (#5423)
Introduce RValue reference (move) support to TypedPackedFunc (#5271)
Introduce MetadataModule to separate code compilation/interpretation and weight initialization #5770
Support module based interface runtime #5753
Add TVM application extension with WASM runtime #5892
Provide guidance to users who have difficulty registering SEqualReduce (#5300)

Rust Support

Revive the Rust + SGX refactor #4976
Improve Rust bindings: Map, Array, String, various IR nodes #6339
Rust Refactor Stage 4: Rewrite Rust graph runtime to use new APIs #5830
Second stage of Rust Refactor #5527
tvm crate stage 3 of Rust refactor #5769
Add first stage of updating and rewriting Rust bindings. #5526

TIR

Introduce StructuralHash for the Unified IR. #5160
Introduce StructuralEqual Infra for the unified IR. #5154
Introduce ExprDeepEqual, Remove IRDeepCompare #5206
[TIR] Introduce BufferLoad/Store (#5205)
Improved massive build times caused by tir.floormod and tir.floordiv. Fixed Topi testcase. #5666
Buffer logger assert removed #6147
Enhance VerifyGPUCode #6194
HoistIfThenElse added #6066
Hybrid Script Support for TIR #6227
Migrate Low-level Passes to Pass Manager #5198
Block scope hoisting added #6238

TE

reverse-mode autodiff without any optimization #5121
Tensor Expression Debug Display (TEDD) #4651
Optimize and eliminate the Jacobian tensor for te.autodiff #6078

TVMC (Experimental)

TVMC - A command line driver for TVM (Part 1) #6112
TVMC - Linting error on onnx command line driver frontend #6536
TVMC - Command line driver 'compile' (part 2/4) #6302
TVMC - Introduce 'tune' subcommand (part 3/4) #6537
TVMC - Introduce 'run' subcommand (part 4/4) #6578
TVMC - Getting started tutorial for TVMC #6597

Feature Improvement

Accelerator and Microcontroller Support

Cleanup legacy verilog code (#4576)
uTVM support for ARM STM32F746XX boards (#4274)
Add --runtime=c, remove micro_dev target, enable LLVM backend #6145

Arithmetic Analysis

Linear system and equation solver (#5171)
Inequalities solver #5618
Improve IntervalSet's floormod (#5367)
Remove legacy const pattern functions (#5387)
Handle likely in IRMutatorWithAnalyzer #5665
ExtendedEuclidean merge impl to int_operator #5625
Rewrite simplify fix for Vectorized Cooperative Fetching #5924

AutoTVM and Graph Tuner

Adding ROCM schedules for TOPI (#4507)
NHWC conv2d schedule templates for ARM (#3859)
Use VM compile to extract autotvm tasks #4328
Download fallback schedule file if it does not exist #4671
Ignore error when removing tmpdir #4781
Fix a bug in generating the search space #4779
Minor bug fixes in AutoTVM for QNN graphs #4797
Fix autotvm customized template #5034
Add opt out operator for has_multiple_inputs for graph tuner #5000
Customize SI prefix in logging (#5411)
Update XGBoost verbosity option #5649
Support range in index based tuners #4870
Enable random fill and CPU cache flush for AutoTVM and Ansor (#6391)
Auto-scheduler tutorial for GPU and necessary refactor/fix (#6512)

BYOC

[BYOC] Bind constant tuples in graph partitioner (#5476)
[BYOC] Add support for composite functions in BYOC (#5261)
[BYOC] Register pattern tables from external codegens (#5262)
[BYOC] Enhance partitioning and external codegen (#5310)
[BYOC] Refine AnnotateTarget and MergeCompilerRegion Passes (#5277)
[BYOC] Use Non-Recursive Visitor/Mutator (#5410)
[BYOC] Refine DNNL Codegen (#5288)
[BYOC] Add example of Composite + Annotate for DNNL fused op (#5272)
[BYOC] Prevent duplicate outputs in subgraph Tuple (#5320)
[BYOC] Introduce further operator support (#6355)
[BYOC] Support input nodes with multiple entries (#6368)
[BYOC] Add maximum support for float32 (#6506)

Codegen

Intrinsic dispatching with OCML instead of LLVM for ROCm (#4499)
Make target codegen take IRModule and PrimFunc. #5107
Enhance CUDA codegen for SelectNode #4983
Vectorization for intrinsics #5101
[LLVM] Do not...

10 Jul 19:29

yzhliu

v0.6.1

0d0d515

Apache TVM (incubating) v0.6.1

Apache TVM (incubating) is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator PMC.

While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

Apache TVM (incubating) 0.6.1 is a maintenance release incorporating important bug fixes and performance improvements. All users of Apache TVM (incubating) 0.6.0 are advised to upgrade. Please review the following release notes to learn about the bug fixes.

Bug Fixes

Fixed process termination routine in windows #4844
[Runtime] Fix NDArray SaveDLTensor declaration and implementation signature different #4586
[NODE][Serialization]fix serialization precision loss in float #4503
[Relay][Frontend][TF] fix _parse_param bug #4711
Fix bias_add gradient #4516
Make sure to visit the arguments of inlined functions #4783
Fix Python syntax error in start_rpc_server_to_tracker.py #4682
[Bugfix] Fixed crash caused by reversing bitwise operations #4852
[Fix][VM] Fix copy constructor #5237
fix small bug about dense_grad #5695
[Fix] Fix conv2d alter op for arm cpu #5532
[Fix] Fix dense x86 schedule #4728
[Relay][Fix] Fix alter op layout when calling a global var #4454
[Relay][Pass] Fix lambda lift pass for recursive call #4432
[BUGFIX] Fix search path for libtvm_topi.so #4467
[Bugfix] Fix Python debugger segfaults with TVM built with LLVM #5685
[RUNTIME] Fix compile errors of OpenCL FPGA backend #4492
[BUGFIX][BACKPORT-0.6][ARITH] Fix FloorMod Simplifier #5509
Some Windows and MSVC fixes #4569
[Chisel][VTA] Fix multiple transfer issue in LoadUop module #4442
[VTA] Fix an issue in updating uop_idx in the TensorGemm module #4694
[VTA] Fixed a crash issue in TSIM driver #4527
[VTA] Enable streamlined GEMM execution #4392
[VTA][Chisel] End-to-end Inference with Chisel VTA #4574
Added declare of aluBits for TensorAlu #4624
[Quantization] Fix annotation for multiply op #4458
LRN only supports 4D tensors, remove it from alter_op_layout #5520
fix topi.nn.global_pool layout="NHWC" #4656
[FFI][Windows] Fix hasattr by extracting Python error type from Windows error message #4780
[Runtime] Export GraphRuntime in tvm_runtime.dll #5002
Fix Base64OutStream portability issue #4668
[AUTOTVM] Fix a bug in generating the search space #4779
[Relay][VM] Fix compilation of If-Elses #5040
[RELAY][FRONTEND][TENSORFLOW] Fix FuseBatchNorm output cast error if need_cast is True #4894
[Bugfix] fskip of EliminateCommonSubexpr cannot always return false #4620
[Fix] Add ConstantNode to IsAtomic #5457
[Fix] Fix RemoveUnusedFunctions pass #4700
[Relay][fix] Fix alpha_equal bug for attribute check #4897
[Arith] keep div_mode during floordiv simplify #5922
[ARITH][BACKPORT-0.6] fix a min/max simplify bug #5761
[0.6-BACKPORT] Improve robustness of the docs build #5583

05 Dec 06:47

yzhliu

v0.6.0

c6f8c23

Apache TVM (incubating) v0.6.0

Apache TVM (incubating) is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Apache Incubator PMC.

While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

New Features

Relay in Production

Relay is a functional, differentiable programming language designed to be an expressive intermediate representation for machine learning systems. Relay supports algebraic data types, closures, control flow, and recursion, allowing it to directly represent more complex models than computation graph-based IRs (e.g., NNVM) can. In TVM v0.6, Relay is in stable phase and is ready for production.

Algebraic Data Types (ADT) support (#2442, #2575). ADT provides an expressive, efficient, and safe way to realize recursive computation (e.g., RNN). Refer to https://docs.tvm.ai/langref/relay_adt.html for more information.
Pass manager for Relay (#2546, #3226, #3234, #3191)
Most frameworks have been supported in Relay, including ONNX, Keras, Tensorflow, Caffe2, CoreML, NNVMv1, MXNet (#2246).
Explicitly manifest memory and tensor allocations in Relay. (#3560)
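Relay ADTs are declared and pattern-matched in Relay itself (see the language reference linked above). As a loose Python analogy only, the sketch below models a cons-list with two constructors, `Nil` and `Cons`, and a recursive fold that dispatches on the constructor. Folding a toy RNN cell over the list shows why ADTs suit recursive computation such as RNNs. All names and the cell weights are made up for illustration and are not Relay's API.

```python
import math
from dataclasses import dataclass
from typing import Any, Union


@dataclass(frozen=True)
class Nil:
    """Constructor for the empty list."""


@dataclass(frozen=True)
class Cons:
    """Constructor holding one element and the rest of the list."""
    head: Any
    tail: "ConsList"


ConsList = Union[Nil, Cons]


def from_items(items):
    """Build a ConsList from a Python sequence."""
    xs = Nil()
    for item in reversed(items):
        xs = Cons(item, xs)
    return xs


def foldl(f, acc, xs):
    """Recursive left fold, written as a case split on the constructor."""
    if isinstance(xs, Nil):
        return acc
    if isinstance(xs, Cons):
        return foldl(f, f(acc, xs.head), xs.tail)
    raise TypeError(f"not a ConsList: {xs!r}")


def rnn_cell(state, x, w=0.5, u=1.0):
    """Toy scalar RNN cell; w and u are made-up weights."""
    return math.tanh(w * state + u * x)


seq = from_items([1.0, 0.0, -1.0])
print(foldl(lambda a, x: a + x, 0.0, seq))  # 0.0
print(foldl(rnn_cell, 0.0, seq))            # final hidden state
```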

Relay Virtual Machine

The Relay Virtual Machine (Relay VM) is the new generation of runtime, designed to strike a balance between performance and flexibility when deploying and executing Relay programs. Previously, the graph runtime was able to exploit the fully static nature of its input graphs to perform aggressive optimizations such as fully static allocation and optimal memory reuse. Models that use control flow, recursion, dynamic shapes or dynamic allocation require a different execution model.

Relay VM is now usable and achieves decent performance for a variety of models and targets.

Design (#2810 #2915) and a first version of implementation (#2889),
Add VM runtime for Relay and compiler support (#3120, #3121, #2889, #3139)
Relay VM (pattern matching #3470, port to python #3391, serialization #3647)
Relay VM Profiler (#3727)
Support execution on devices for Relay VM (#3678)
[Relay][VM] Add more passes to VMCompiler (#4058)
[relay][vm] Separate VM runtime with executable (#4100)
Port VM, VM compiler, and Object into Python (#3391)
VM: Add AllocTensor instruction and better instruction printer (#3306)
[Relay][VM][Interpreter] Enable first-class constructors in VM and interpreter via eta expansion. (#4218)
[Relay][VM] Clean up the VM and VM profiler code (#4391)

Training

Relay is designed to natively support first-order and higher-order differentiation. The automatic differentiation infrastructure is now usable, and a number of operators with gradient support are available in the v0.6 release.

Higher-order reverse-mode automatic differentiation that works with control flow (#2496)
Higher order continuation passing style (#3456, #3485 )
Relay gradient registration (clip #3509, max_pool2d and avg_pool2d #3601)
Relay AD algorithm (#3585)
Relay Training - allow gradient to return a tuple (#3600), numerical gradient check (#3630)
Improve AD for concatenate (#3729)
[Relay][Training] Add missing gradient check to gradient pass (#4169)
As a part of Relay's automatic differentiation system, we are adding primal gradients for Relay operators. Please refer to #2562 for tracking the progress.
Gradient for Conv2d (#3636)
Add gradient operators (#3857, #3894, #3901, #3915)
Add gradient for log-softmax (#4069)
[Relay][Training] Add gradient for Crossentropy (#3925)
[Relay][Training] Add and fix gradients (#4126)
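Relay's AD works as a pass over Relay programs (see the gradient-pass entries above); the toy below is not that pass. It is a minimal operator-overloading sketch of the reverse-mode chain rule: record each operation's local derivatives on the forward pass, then propagate adjoints from the output back to the inputs in reverse topological order. `Var`, `exp` and `backward` are hypothetical names.

```python
import math


class Var:
    """A scalar value recorded in a computation graph for reverse-mode AD."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (input Var, d(self)/d(input))
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __radd__ = __add__
    __rmul__ = __mul__


def exp(x):
    v = math.exp(x.value)
    return Var(v, ((x, v),))  # d/dx exp(x) = exp(x)


def backward(output):
    """Propagate adjoints from output to every input (chain rule)."""
    order, seen = [], set()

    def visit(node):  # post-order DFS yields a topological order
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # every consumer before its producers
        for parent, local_grad in node.parents:
            parent.grad += local_grad * node.grad


x, y = Var(1.0), Var(2.0)
f = x * y + exp(x)          # f(x, y) = x*y + e^x
backward(f)
print(x.grad, y.grad)       # df/dx = y + e^x, df/dy = x
```

Because `x` is used twice, its gradient accumulates contributions from both uses; that accumulation is what makes reverse mode correct on shared subexpressions.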

Quantization

Low-bit inference is becoming increasingly popular because it benefits both performance and storage usage. TVM now supports two types of quantization: 1. Automatic quantization takes a floating-point model, performs per-layer calibration and generates a low-bit model. 2. TVM can also import pre-quantized models from TensorFlow and MXNet; a new dialect, QNN, is introduced to handle further lowering to normal operators.
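The calibrate-then-quantize step of automatic quantization can be sketched in a few lines of plain Python, assuming symmetric int8 quantization with one scale per tensor. Max-abs calibration is used here only because it is the simplest choice; the KL-divergence calibration (#3538) listed below is not implemented. Function names are illustrative, and Python's `round()` rounds halves to even.

```python
def calibrate_scale(values, num_bits=8):
    """Max-abs calibration: map the largest |value| to the largest signed int."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for int8
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Symmetric quantization: q = clamp(round(v / scale), qmin, qmax)."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


def dequantize(qvalues, scale):
    """Map integers back to approximate real values: v ~= q * scale."""
    return [q * scale for q in qvalues]


weights = [-1.0, 0.5, 2.0, 0.01]            # made-up tensor for one layer
scale = calibrate_scale(weights)            # one scale for the whole tensor
q = quantize(weights, scale)
approx = dequantize(q, scale)
print(q)
print(max(abs(a - w) for a, w in zip(approx, weights)))  # at most about scale / 2
```

Small values such as 0.01 land on only a few integer levels, which is why smarter calibration (clipping outliers instead of using the raw max) can improve accuracy.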

Automatic Quantization
- Low-bit automatic quantization supported. (#2116). The workflow includes annotation, calibration and transformation.
- Refactor quantization codebase and fix model accuracy. (#3543)
- KL-divergence-based per-layer calibration. (#3538)
- Add option to select which convolution layers are quantized. (#3173)
- [Relay][Quantize] Integrate data-aware calibration into quantization. (#4295)
Pre-quantized model support (QNN operators and legalize pass).
- Add a legalize pass to Relay (#3672)
- Qnn Concatenate, quantize, dequantize and requantize operators (#3819, #3730, #3745, #3531)
- QNNtoRelay & QNNLegalize Pass utility (#3838, #3782)
- Requantize: Optimize lowering for some corner cases. (#3864)
- New quantized operator support: conv2d, add, dense (#3580, #3736, #3896, #3910)
- Do type checking for the input and kernel in the qnn conv2d (#3904)
- Legalize and AlterOpLayout for Intel int8. (#3961)
- Renaming tests to follow the Relay nomenclature. (#3975)
- Fix padding changes due to #3739 (#3989)
- Memorizing quantize node mapping to avoid duplicated simulated quantization (#3233)
- Infrastructure to support pre-quantized models (QNN) (#3971).
- [Relay][AlterOp] NHWC to NCHWc support for Pool, concatenate, sum. (#4059)
- [TOPI][x86] Cascade lake support. (#4123)
- [TOPI][x86] Legalize - Support int8xint8 convolution to use VNNI inst (#4196)
- Qnn dequantize with min max using Mxnet flavor to support Mxnet prequantized models. (#3945)
- Improve the lowering of Qnn Dense (#4213)
- Adding support for dequantizing from int32 to float32. (#4130)
- [QNN] Refactor fixed point multiplicat...
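The requantize operator converts a tensor quantized with (scale_in, zp_in) into one quantized with (scale_out, zp_out). Below is a minimal sketch of that arithmetic, assuming round-half-up and an int8 output. The real ratio scale_in / scale_out is approximated by a 31-bit integer significand and a shift, so per-element work is integer-only; this fixed-point approach is a common technique in the spirit of the (truncated) fixed-point multiplication entry above, but it is not TVM's implementation, and the helper names are hypothetical.

```python
import math


def fixed_point_multiplier(real_multiplier):
    """Approximate a positive real M as significand * 2**(exponent - 31),
    where significand is a 31-bit integer (Q31 format)."""
    mantissa, exponent = math.frexp(real_multiplier)  # M = mantissa * 2**exponent, 0.5 <= mantissa < 1
    significand = round(mantissa * (1 << 31))
    if significand == (1 << 31):                      # rounding reached 1.0; renormalize
        significand //= 2
        exponent += 1
    return significand, exponent


def requantize(q_in, scale_in, zp_in, scale_out, zp_out, num_bits=8):
    """q_out = clamp(zp_out + round((scale_in / scale_out) * (q_in - zp_in))),
    using integer arithmetic once the multiplier has been precomputed."""
    significand, exponent = fixed_point_multiplier(scale_in / scale_out)
    shift = 31 - exponent
    qmin, qmax = -(1 << (num_bits - 1)), (1 << (num_bits - 1)) - 1
    out = []
    for q in q_in:
        prod = (q - zp_in) * significand
        if shift > 0:
            scaled = (prod + (1 << (shift - 1))) >> shift  # round half up
        else:
            scaled = prod << -shift
        out.append(max(qmin, min(qmax, scaled + zp_out)))
    return out


print(requantize([10, -10, 100], scale_in=0.5, zp_in=0, scale_out=0.25, zp_out=0))
# [20, -20, 127]
```

Other rounding modes (for example, round half to even) change only the shift-and-add step; the significand/shift decomposition stays the same.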

18 Feb 22:49

ZihengJiang

v0.5

f08015e

v0.5-pre-apache-incubation

NOTE: This is a release pre apache incubation

This release features several major improvements. Some of the highlights are: Arbitrary bits quantization algorithm; High-level auto-differentiable programming IR--Relay(NNVMv2).

The community welcomes new reviewers @nishi-t @were @siju-samuel @jroesch @xqdan @zhiics @grwlf @ajtulloch @vinx13 @junrushao1994 @FrozenGene @liangfu , new committers @srkreddy1238 @eqy @masahi @nhynes @phisiart @merrymercy @Laurawly @adityaatluri @Huyuwei

Change List

Fully featured 8-bit network support
- 8bit quantizer
- Arbitrary bits quantization algorithm
- Intel cpu support
NVIDIA GPU 8-bit kernel
- int8 gemm recipe
- int8 conv2d
- Autotvm integration
Automated tuning and scheduling
- AutoTVM optimizations for mobile GPUs
- AutoTVM optimizations for CUDA
- AutoTVM optimizations for x86
Initial release of the differentiable programming IR, Relay
- Generic & informative Relay error reporting #2408
- Relay IR text format support #1781
- Support control flows
- A Normal Form Canonicalization #2251
- Type system support
- End to end compilation
  - Frontend support: Caffe2 #2507 , CoreML #2476 , Keras #2376 , MXNet #2163 , ONNX, TFLite #2365
  - Operator coverage #1799 #2051
- FoldScaleAxis #2020
- SimplifyInference #2033
- CombineParallelConv2D #2089
- InstrumentBoundCheckers pass #2079
- Bind & FoldConstant #2100
- Alter Op Layout #2150
- General OpFusion #2090
CodeGen
- Gcc / g++ compatible C code generator for TVM #2161
- Device type annotation for heterogeneous compilation #2361
- Cache packed func ptr, lift alloca #2070
- Generalize compute to tensor region #1476
Runtime
- Relay interpreter and compiler #1954
- Heterogeneous runtime #1695
- Language bindings: Golang runtime #1470 , Rust runtime #1597
- Add min_repeat_ms to time_evaluator #2200
- Bundled interpreter demonstration #2297
- Enable PlanMemory in the graph runtime #2120
Language Binding
- Rust frontend #2292
VTA
- Improved RPC for VTA #2043
Hybrid python programming model
- Support for scheduling #2416
- Support for Inter-function call #2287
- Backend support #2477
TOP
- Initial support for sparse tensor computation
- Improve ARM CPU depthwise convolution performance #2345
- Port winograd ops to relay #2356
Tutorials and docs
- Relay language docs #2232
- Tutorials on how to use SGX backend
- How to write a pass in python
- General lowering flow of TVM
- How to do tensorize
- TFLite frontend tutorial #2508
- Keras seq2seq model for translation tutorial #1815
- Committer guide and tips #2468
- Code review guideline on API designs #2459

Contributors

Code reviewers

@tqchen
@liangfu quantization, relay, topi, frontend
@zhiics relay, runtime, frontend
@nhynes quantization, rust
@Huyuwei frontend
@yzhliu relay, frontend, perf
@xqdan hybrid script, tvm/lang
@ZihengJiang relay
@vinx13 relay/pass, topi
@masahi relay/pass, frontend, doc, topi
@grwlf frontend, topi, relay, quantization
@tmoreau89 vta, relay, backend, runtime
@kazum frontend
@nishi-t frontend, topi
@PariksheetPinjari909 frontend
@jroesch relay, frontend, doc
@srkreddy1238 relay/op, frontend
@siju-samuel relay/op, frontend
@junrushao1994 relay
@icemelon9 relay, perf, tvm/lang, codegen
@ajtulloch relay, frontend
@alex-weaver relay
@kevinthesun hybrid script, topi, relay
@Laurawly topi
@were hybrid script, topi
@FrozenGene frontend, topi, relay/pass
@eqy relay, topi, runtime, rust
@zhreshold frontend, relay/op
@merrymercy relay/op, topi, runtime, frontend
@derisavi-huawei symbolic integers

Code contributions

@tqchen tvm
@vinx13 relay/pass, topi
@siju-samuel topi, relay/op
@merrymercy autotvm, topi, relay/pass
@srkreddy1238 relay/op, frontend/tf
@MarisaKirisame relay
@slyubomirsky relay, docs
@jroesch relay
@nhynes rust
@wweic docs, relay/pass
@yzhliu perf, frontend
@zhiics relay/pass, relay/op, runtime
@were hybrid script
@icemelon9 perf, relay/pass, relay/op
@joshpoll relay, docs
@sgrechanik-h codegen
@kazum frontend/keras, topi
@masahi relay/op, docs
@FrozenGene perf, frontend/tf
@liangdzou docs
@junrushao1994 relay/op
@eqy autotvm, runtime
@apivovarov docs
@ajtulloch runtime, nnpack
@kevinthesun relay/op, perf
@ZihengJiang relay/pass, quantization
@hlu1 nnpack, frontend/caffe2
@lixiaoquan nnvm
@imorinaga frontend/mxnet
@liangfu topi, docs
@xqdan codegen
@PariksheetPinjari909 frontend/darknet
@alexeyr frontend/tensorflow
@Rasterer topi
@yangchen-MS codegen
@anijain2305 relay/op
@grwlf topi
@Huyuwei topi, frontend/keras
@denis0x0D runtime/trace, relay/pass
@Mutinifni codegen
@derisavi relay/pass
@tmoreau89 vta
@Laurawly topi, perf
@zhreshold frontend, topi
@kun-zh codegen
@reminisce relay/op
@ehsanmok rust
@cnuernber perf
@cowanmeg topi, codegen
@yuruofeifei topi

03 Sep 19:25

tqchen

v0.4

60769b7

v0.4-pre-apache-incubation

NOTE: This is a release pre apache incubation

This release features several major improvements. The high-level graph optimizer is now part of the TVM repo. Some of the highlights are: initial support of AutoTVM for automated optimization; the customized accelerator backend VTA. Please also check out tvm.ai for the latest blog posts.

The community welcomes new reviewers @kazum @alex-weaver @masahi @zhreshold @PariksheetPinjari909 @srkreddy1238 @eqy, new code owner @merrymercy, and new committer @yzhliu

Change List

Tensor Expression and Optimization

Tensor operator primitives
- Introduce attrs field to operator primitives (e.g. compute) to store additional metadata; the attrs can be used as hints for scheduling
Enable embedding of asm micro-kernels
Hybrid python programming model
- python AST based IR builder interface
- support GPU programs
AutoTVM, Automated tuning, and scheduling
- basic autotvm infra
- GPU IR verifier
- basic autotuning tutorial
- topi integration
ARM support
- winograd support
- initial support of ARM autotuning records
TOPI Vision
- Generic GPU sort support(useful for vision)
- SSD operator support
TOPI numpy consistency
- Rename all binary operators for numpy consistency: broadcast_add -> add, broadcast_sub -> subtract, broadcast_mul -> multiply, broadcast_div -> divide
- New operators: slice, LRN, equal, not_equal, less, greater
- tutorials on topi
Initial low-bit operator support
- Optimized popcount generation on ARM
- general bit-serial convolution and GEMM
- optimized low bit kernels
- parallel optimization
New topi backend optimization for Intel graphics
Adapt AVX schedules for SSE target
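The low-bit items above (popcount generation, bit-serial convolution and GEMM) rest on one idea: split multi-bit integers into bit planes so that a dot product reduces to bitwise AND, popcount, and shifts. The pure-Python sketch below illustrates that idea for unsigned operands only; it is not TVM's kernel code, and the helper names `pack_bit_planes` and `bitserial_dot` are hypothetical.

```python
# Hypothetical illustration of a bit-serial dot product (not TVM's kernels).
# Plane b of a vector is packed into one int whose i-th bit is bit b of element i.

def pack_bit_planes(values, bits):
    """Return `bits` packed integers, one per bit plane of the unsigned values."""
    planes = []
    for b in range(bits):
        word = 0
        for i, v in enumerate(values):
            if (v >> b) & 1:
                word |= 1 << i
        planes.append(word)
    return planes

def popcount(x):
    # Count set bits; bin() works on every Python 3 version.
    return bin(x).count("1")

def bitserial_dot(a, b, a_bits, b_bits):
    """Dot product of unsigned vectors using only AND, popcount and shifts."""
    a_planes = pack_bit_planes(a, a_bits)
    b_planes = pack_bit_planes(b, b_bits)
    total = 0
    for i, pa in enumerate(a_planes):
        for j, pb in enumerate(b_planes):
            # Bit plane i of a times bit plane j of b carries weight 2**(i + j).
            total += popcount(pa & pb) << (i + j)
    return total

a = [3, 1, 2, 0]   # 2-bit activations
w = [1, 1, 0, 1]   # 1-bit weights
print(bitserial_dot(a, w, a_bits=2, b_bits=1))  # 4, same as sum(x * y)
```

With 1-bit operands the double loop collapses to a single `popcount(a & b)`, which is why fast popcount instructions matter so much for these operators.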

Backend

VTA: customized accelerator backend
- custom hardware backend example
- tutorials on how to use a customized accelerator
Initial experimental support for HLS backend
Bug fix in the SPIR-V code generator for Vulkan
libdevice support, enable NVPTX backend

Runtime

Introduce NDArrayContainer for managed NDArray
RPC and Device API
- Support communication between big- and little-endian machines.
- RPC and device API protocol upgrade to support big/little-endian communication. This is a non-backward-compatible change: the latest version of the TVM runtime is required to use RPC.
- Graduate RPC from contrib: tvm.contrib.rpc -> tvm.rpc
  - Support tracker in Android RPC; add fault tolerance for AutoTVM
big.LITTLE-aware thread pool
tvm4j graph runtime that runs end-to-end workloads in Java
DLPack support
- Support from_dlpack and to_dlpack
- Enables bridges to PyTorch
Enable linking of StackVM in the runtime
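The RPC upgrade for big/little-endian communication comes down to both sides agreeing on one byte order on the wire. The sketch below shows that principle with Python's standard `struct` module; the fixed little-endian `uint32` format and the helpers `encode_u32` / `decode_u32` are hypothetical and do not describe TVM's actual RPC wire format.

```python
import struct

# Hypothetical wire format (not TVM's actual protocol): every integer is sent
# as a 4-byte little-endian unsigned value, whatever the host's byte order.
WIRE_FMT = "<I"  # "<" = little-endian with standard sizes; "I" = unsigned 32-bit

def encode_u32(value):
    """Serialize an unsigned 32-bit integer in the fixed wire byte order."""
    return struct.pack(WIRE_FMT, value)

def decode_u32(data):
    """Deserialize 4 bytes produced by encode_u32, on any host."""
    (value,) = struct.unpack(WIRE_FMT, data)
    return value

payload = encode_u32(0x01020304)
print(payload.hex())                         # 04030201 on every host
print(hex(decode_u32(payload)))              # 0x1020304
# Reading the same bytes as big-endian shows what goes wrong without an
# agreed byte order: the value comes out byte-swapped.
print(hex(struct.unpack(">I", payload)[0]))  # 0x4030201
```

Because the format string pins the byte order explicitly, the result no longer depends on the native order of either machine, which is also why mixing old and new protocol versions breaks compatibility.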

NNVM

TensorFlow GraphDef frontend
Keras frontend
- improved to support reused layers; add activations
ONNX
- gather, LRN
CoreML frontend
- Support C-RNN and activation functions
Fix grads for sum and expand_like
Enhanced operator fusion for multiple elemwise branches
Separate NNVM fusion and compilation passes

Misc

Unified build system on CMake; customizable CMake paths for Vulkan, ROCm, CUDA

Contributors

See the complete list here. Thanks to all the contributors to this release.

Code reviewers

@yzhliu topi, tvm4j, nnvm
@kevinthesun nnvm
@Huyuwei topi operators
@tmoreau89 hardware backends
@comaniac fpga backends
@kazum nnvm, opencl backend, fpga
@nishi-t nnvm, opencl backend
@merrymercy topi, arm
@vinx13 gpu backend
@masahi nnvm, topi
@eqy autotvm
@jroesch runtime
@PariksheetPinjari909 frontends, topi
@srkreddy1238 frontends, topi
@FrozenGene autotvm

Compiler

@alex-weaver vulkan
@were hybrid script mode
@nishi-t CUDA, fp16, int8 support
@Ktabata Intel FPGA support
@kazum Xilinx FPGA support
@cowanmeg arm optimized popcount
@tmoreau89 VTA customized accelerator

TOPI, graph optimization

@merrymercy AutoTVM
@yzhliu tvm4j graph runtime, x86
@Laurawly intel graphics
@abergeron conda build fix
@nhynes sgx random
@masahi topi, more robust op fusion
@kevinthesun vision ops
@grwlf argmax/min ops
@cowanmeg bit-serial operator
@ehsanmok topi tutorial
@zhiics refactor fusion and compilation into separate pass
@liangfu binary logical operators

Frontends

@srkreddy1238 tutorials for deployment, TensorFlow frontend
@siju-samuel coreml, tf frontend
@PariksheetPinjari909 nnvm, slice
@kazum keras
@nishi-t mxnet, nnvm

Deploy

@eqy rpc, thread runtime
@dayanandasiet Android tutorials


03 Sep 19:21

tqchen

v0.3

f7d9d7e

v0.3-pre-apache-incubation

NOTE: This release predates Apache incubation.

This release features numerous improvements in TOPI and backends. We take the first step toward object detection support in TOPI, featuring the operators necessary for YOLO and SSD. TOPI now supports a numpy-style API and operator overloading. RPC is significantly improved to support resource allocation and the use of a pool of devices. We are adding two new backends: WebGL for running on GPUs in the browser, and Vulkan for running on the next-generation graphics API. Please also check out the TVM blog for the latest blog posts.

Change List

TOPI Vision operators
- SSD support
- YOLO support
- NMS operator support in vision
TOPI general numpy-style operators
- numpy style operator overload in topi
- more operators: flip, take
- dilation support on conv2d and depthwise
8-bit support
- ARM 8-bit GEMM
- ARM 8-bit conv
Low bit operator support
- popcount intrinsics
- 1-bit fully connected
Contrib: MPSDNN fully-connected and conv2d support
Better RPC support
- RPC Tracker support to allow centralized resource management
- RPC protocol upgrade to support timeouts in the proxy
  - This is a breaking change: the latest version of the TVM runtime is required to use RPC
- Fault tolerance for early server termination, with the correct exception propagated
- RPC support enabled for ROCm AMD GPUs
Tutorials and docs
- How to deploy to Android devices.
Optimizations for hardware backends
- Intel CPU (AVX and AVX512)
Schedule Primitives
- rfactor now supports factor_axis to specify the factored dimension in the result
- cache_write now supports multiple output operators
- enable warp memory, which generates shuffle instructions
Framework bridge
- MXNet bridge supported
C++ compiler API support
- build migration
- topi migration to C++
- Target system in C++
WebGL backend
- runtime and codegen
- topi integration
- end-to-end pipeline in the browser
Vulkan backend
- Vulkan runtime
- SPIR-V code generator
Security
- Intel SGX runtime support
- multi-threaded SGX runtime
LLVM 7.0 support
Robustness
- VerifyMemory to detect incorrect GPU schedules that write into GPU memory from the CPU
- Verify compute formulas
Better CPU parallel runtime
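One of the new vision operators, non-maximum suppression (NMS), keeps the highest-scoring detection and drops any remaining box that overlaps an already kept box by more than an IoU threshold. The greedy pure-Python sketch below only illustrates that behavior; the (x1, y1, x2, y2) box layout, the `nms` and `iou` names, and the 0.5 default threshold are assumptions for illustration, not the TOPI operator's signature.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 with IoU 81/119
```

The sort by score is the step that benefits from the generic GPU sort support listed in v0.4; the suppression loop itself is inherently sequential in this greedy form.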

Main Contributors

See the complete list here. Thanks to all the contributors to this release.

Code Reviewers

@zhreshold for reviewing many vision ops
@Huyuwei topi operators
@sxjscience for reviewing topi operators

TOPI:

@merrymercy Mali GPU support
@PariksheetPinjari909 topi vision ops, support for darknet operators
@yzhliu intel CPU optimization
@kevinthesun Vision operators, initial ssd, nms operator support
@dingobye Various great TOPI improvements for operator overloading
@Huyuwei dilation support to conv
@masahi Intel CPU topi
@nishi-t improvements in pooling

Compiler:

@nhynes SGX support
@phisiart WebGL backend
@alex-weaver C++ compiler support
@kun-zh bug fix for bound checking in code.
@xqdan improvements to low-level schedule rewrite.
@yidawang parallel runtime improvement
@eqy AMD GPU backend improvements
@Laurawly Initial improvements for Intel GPU
@cnuernber Improved runtime device stream API


31 Jan 20:00

tqchen

v0.2

9e67577

v0.2-pre-apache-incubation

NOTE: This release predates Apache incubation.

This release comes with a complete set of TOPI support for the NNVM compiler, which allows compilation of end-to-end workloads. We also make major improvements in supporting new backends: ROCm for AMD GPUs and ARM GPUs. Check out the previous blog posts that describe these major improvements in detail!

Backend support
- Support LLVM mainline (4.0, 5.0, 6.0)
- Support ROCm stack for AMD GPUs
- More robust OpenCL support for ARM GPUs
Android RPC runtime
Multi-threading optimization for ARM
- multi-threaded depthwise
- multi-threaded conv2d
New schedule primitives
- storage_align for shared memory alignment
- double_buffer
UnrollLoop: a more robust version of loop unrolling that counts the maximum steps that can be unrolled.
Full set of TOPI operators
- Introduce tvm.target to specify target options for compilation better.
- broadcast/reduction operators
- pooling and global pooling
- Generic target support for topi
- schedule with external libraries
End-to-end deep learning pipelines for CPU, GPU, ARM GPU
Tutorials
- How to load a compiled module in any language runtime
- How to use the Java runtime
Contrib library: MIOpen, cuDNN
Ongoing items that contain functioning pieces
- WebGL backend
- C++ compiler support
- MPS DNN
- low-bit support, introducing popcount
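The storage_align primitive pads a buffer so that the stride of a chosen axis equals k * factor + offset for some integer k; a common use is padding shared-memory rows so that threads reading down a column hit different banks. The sketch below reproduces only that stride arithmetic in pure Python. The helpers `aligned_stride` and `column_banks` are hypothetical, and the 32-bank, one-element-per-bank model is a simplifying assumption about typical GPUs, not TVM code.

```python
NUM_BANKS = 32  # assumption: 32 shared-memory banks, one 4-byte element per bank

def aligned_stride(stride, factor, offset):
    """Smallest s >= stride with s % factor == offset (needs 0 <= offset < factor)."""
    assert 0 <= offset < factor
    return stride + (offset - stride % factor) % factor

def column_banks(stride, col, rows):
    """Bank hit by element (r, col) of a row-major tile, for r in range(rows)."""
    return [(r * stride + col) % NUM_BANKS for r in range(rows)]

for s in (32, aligned_stride(32, 32, 1)):
    banks = column_banks(s, col=0, rows=32)
    # stride 32: all 32 rows map to bank 0; stride 33: 32 distinct banks
    print(s, len(set(banks)))
```

Here factor=32, offset=1 wastes one element per row but turns a 32-way bank conflict into conflict-free column access, which is the trade-off the primitive lets a schedule express.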
