Apache TVM v0.16.0
Introduction
The TVM community has worked since the v0.15.0 release to deliver the following new exciting improvements! Highlights of this release:
- Initial support for Relax, with dynamic shape and pipeline support
- Dlight module for optimizing LLM TIR workloads on GPU (see the sketch below)
- Disco module for initial SPMD multi-GPU support
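For readers new to Dlight, here is a minimal sketch of how its GPU scheduling rules are typically applied to an IRModule. The tiny matmul workload and the `cuda` target are illustrative assumptions, not part of the release notes:

```python
import tvm
from tvm import dlight as dl, te

# Build a small IRModule holding one matmul PrimFunc to schedule.
A = te.placeholder((128, 128), "float16", name="A")
B = te.placeholder((128, 128), "float16", name="B")
k = te.reduce_axis((0, 128), name="k")
C = te.compute((128, 128), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
mod = tvm.IRModule({"main": te.create_prim_func([A, B, C])})

# Each PrimFunc is scheduled by the first rule that matches it.
with tvm.target.Target("cuda"):
    mod = dl.ApplyDefaultSchedule(
        dl.gpu.Matmul(),    # matmul-specific rule
        dl.gpu.GEMV(),      # the GEMV rule refined by several Dlight PRs below
        dl.gpu.Fallback(),  # generic fallback for everything else
    )(mod)
```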
The main tags are below (bold text marks areas with lots of progress):
- Community, RFCs
- Adreno, ArmComputeLibrary, Metal, cuda & cutlass & tensorrt, microNPU, Runtime
- **Relax**, **Dlight**, **Disco**
- Arith, TIR, TVMScript
- Docs, CI, Misc, BugFix
Please visit the full listing of commits for a complete view: v0.16.dev0...v0.16.0.rc0.
Community
RFCs
This new RFC explores how TVM can be used to generate code for the Scalable Matrix Extension (SME) ISA, improving inference performance on supported Arm®-based hardware that implements the SME extension.
- #107 - [RFC] Scalable Matrix Extension enablement
Arith
- #16735 - [Fixup] Require feature flag for tighter inequality bounds
- #16588 - Provide tighter ConstIntBounds for special cases (see the sketch after this list)
- #16704 - [Fix] Fix canonical simplification of LE
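As a quick refresher on the machinery these entries tighten, the sketch below queries const-int bounds through the public arith API; the variable, its bounds, and the expression are illustrative:

```python
import tvm
from tvm import tir

analyzer = tvm.arith.Analyzer()
n = tir.Var("n", "int32")

# Tell the analyzer that 0 <= n <= 100, then query a derived expression.
analyzer.update(n, tvm.arith.ConstIntBound(0, 100))
bound = analyzer.const_int_bound(n * 2 + 1)
print(bound.min_value, bound.max_value)  # 1 201
```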
BYOC
- #16567 - Skip processed functions in FuseOpsByPattern and RunCodegen
BugFix
- #16766 - [Target] Added null check to fix segfault at ->defined() in cpu.cc DetectSystemTriple()
- #16739 - [Ansor] Fixing Ansor Gradient Bug
- #16820 - [Fix] PAPI docs
- #16793 - [Fix] fix for numpy 2.0 compatibility
- #16790 - [Fix] Fix build errors with VS2022
- #16780 - [Fix] Fix numpy dtype map
- #16773 - [Fix] Fix the purity flag of "vm.call_tir_dyn" and "kill" ops
- #16770 - [Hotfix] Revert driver API pass ordering that breaks MLC, mark failing test
- #16771 - [Fix] Remove redundant "remove_all_unused" in IPC memory lowering
- #16746 - [Fix][Builtin] Fix "GetQueryPosition" of PagedKVCache
- #16728 - [Fix] Introduce TVM_DEBUG_WITH_ABI_CHANGE to warn ABI changes in debug mode
- #16714 - [Fix] PagedKVCache fetching compute stream when copy stream is needed
- #16684 - [SLM] Produce well-formed Relax for nn.modules.KVCache
- #16659 - add the default value for DFT in ONNX frontend
- #16637 - [Transform] Preserve symbolic variables in FuseOps
- #16649 - [FFI] Add a missing default for datatype lanes
- #16492 - [Executor] fix debug_executor function debug_get_output
- #16598 - [Transform] Handle non-composite lambda functions in FuseOps
- #16565 - [Transform] Keep private non-primitive functions in FuseTIR
- #16518 - Use `x*x*x` instead of `pow(x,3)`
- #16436 - Ensure that bf16 arrays are created as expected
- #16361 - Disable SingleEnvThreadVerifier
- #16289 - [AUTOTVM][FIX] Typo fixes and add a warning in the Droplet Search
CI
- #16837 - Disable flaky unit test
- #16765 - [AOT][Testing] Improve output mismatch information on test failure
- #16661 - add merge_with_main in unity
- #16611 - [AOT][Testing] Print output values on test failure
- #16546 - Disable testing that downloads from mxnet
- #16521 - Fix CI Script and Broken Tests
- #16502 - Support tvm-bot rerun for tvm-unity task
- #16435 - Update image tag to 20240126-070121-8ade9c30e
- #16420 - [WASM] Update emsdk and nodejs version
- #16384 - Remove NVIDIA_DISABLE_REQUIRE
- #16382 - In jenkins.cmd_utils.Sh.tee, check for failing subprocess
- #16366 - Upgrade sccache version to 0.7.*
- #16369 - Upgrade Unity ci images
- #16344 - Update docker images tag to 20240105-165030-51bdaec6
- #16340 - [Unity][UnitTest] Increase atol to resolve flaky CI failure
- #16337 - [Hexagon][UnitTest] Disable flaky quantization test
- #16336 - Upgrade cmake version to 3.24.0
Docker
- #16755 - [SME] Add Fixed Virtual Platform (FVP) and toolchain install
- #16348 - Upgrade pip in i386 container
Disco
- #16618 - [Disco] Propagate structlog configuration to disco workers
- #16639 - [Disco] Expose functions to query the per-worker device/rank
- #16617 - [Disco] Implement `Session.import_python_module` method (see the sketch after this list)
- #16715 - [Disco] Propagate structlog/logging config to workers
- #16845 - [Debug][Disco] Check if a PackedFunc exists before calling it
- #16817 - [Disco] Reduce Process/ThreadSession message queue reads and writes
- #16807 - [Disco] Support setting workers' CPU affinity
- #16375 - [Unity] Fix creation of disco ProcessSession
- #16821 - [Fix] Add TVM_DLL to Disco session
- #16752 - [Fix] Lazy import of "psutil" in disco process pool
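A hedged sketch of the Disco programming model these entries build on: a controller spawns worker processes and drives them SPMD-style. The worker count and module name are illustrative; `import_python_module` is the method added in #16617:

```python
from tvm.runtime import disco as di

# Launch a controller plus two worker processes.
sess = di.ProcessSession(num_workers=2)

# #16617: make a Python module importable on every worker.
sess.import_python_module("tvm.relax")

# Allocate an NDArray on each worker; every worker holds its own copy.
x = sess.empty((16, 16), "float32")
```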
Dlight
- #16775 - [Fix][Dlight] (Low-batched-)GeMV on small spatial loops
- #16429 - [Unity][Dlight][Fix] Reduction rule support dyn-shape epilogue
- #16351 - [Unity] Add dlight.gpu.Fallback in DispatchSortScan, add argsort, topk, and cumprod
- #16338 - [Unity][DLight] Introduce Specific Rule for RMSNorm
- #16251 - [Unity][Dlight] Support dlight gemv rule on nested inner block
- #16878 - [Dlight] Enhance vectorization loading weight for gemv
- #16848 - [DLight] Fix a corner case for reduction rule
- #16701 - [Dlight] Add fallback for low batch gemv with outer reduction
- #16678 - [Dlight] LowBatchGemv rule only apply to function with spatial symbolic var
- #16665 - [Dlight] Skip GeMV when normalization fails
- #16579 - [Dlight] Scheduling Low batch GEMM using GEMV-like rule
- #16321 - [DLight] Skip rule if target is not suitable
- #16731 - [Dlight] Fix GeMV shared memory estimation
Docs
- #16792 - [Doc] Fix set_axis_separator example
- #16610 - [Doc] Fixed docstring usage example in `tvm.ir.make_node`
- #16572 - [Doc] Remove MxNet related tutorials
- #16514 - [Unity][Doc] Document passes that depend on `DataflowBlock`s and encourage using `ConvertToDataflow`
- #16482 - [Doc] Fix docstring in `extern.py` for Sphinx
- #16346 - [Doc] Fix minor error in "Expressions in Relay"
Frontend
- #16001 - [ONNX] Fix interpreting auto_pad parameters in ConvTranspose operator
- #16651 - [PaddlePaddle] Support quantized PaddlePaddle models with NCHW data format
- #16616 - [PaddlePaddle] Support conv2d when data_format is NHWC
- #16526 - [Keras] Enable Dense operator for any input dims
- #16478 - [PaddlePaddle] Fixed the bug that prevented the model from being successfully converted to microTVM on MacOS
Hexagon
- #16762 - [VM] Cache operations when bypass mode is enabled
- #16706 - [VM] Add buffers to `dma_wait` builtin
- #16448 - [VM] Implement dma_copy and dma_wait builtins for Hexagon
LLVM
- #16782 - [SVE] Support scalable vectors in LoopVectorizer
- #16812 - Fix compilation failure due to minor change
- #16808 - [Runtime] Fix errors during loading of target tags
- #16748 - Lack of DWARF type is not an error
- #16696 - [SVE] Add codegen support for scalable buffer accesses
- #15964 - [RUNTIME] Add optional LLVM ORCJIT runtime executor
- #16612 - [SVE] Add support for scalable data type strings
- #16523 - [SVE] Change the dtype of Ramp and Broadcast lanes to PrimExpr
- #16484 - [SVE] Add vscale builtin (see the sketch after this list)
- #16373 - Update Host.h path
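A speculative sketch of the scalable-vector plumbing referenced above, assuming the Python bindings introduced by #16484 (`tir.vscale`) and the PrimExpr lane counts from #16523:

```python
import tvm
from tvm import tir

# A scalable lane count: 4 * vscale, where vscale is only known at runtime.
lanes = 4 * tir.vscale()  # assumes the binding added by #16484

# With #16523, Ramp accepts a PrimExpr lane count, so scalable vectors
# can be constructed directly.
ramp = tir.Ramp(tir.IntImm("int32", 0), tir.IntImm("int32", 1), lanes)
print(ramp.dtype)  # a scalable vector dtype string (#16612)
```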
MetaSchedule
- #16725 - Make the `opt_level` of `tune_relay()` adjustable
Metal
- #16713 - [RUNTIME] Provide richer error information when an error happens
- #16605 - [RUNTIME] Fix multithreading access of Metal runtime
- #16438 - Dispatch numerically stable tanh for metal
OpenCL & CLML
- #16854 - [OpenCL] Add OpenCL device for automatic target detection
- #16846 - [Meta-Schedule][OpenCL] Enable MS tuning for Android OpenCL
- #16768 - [RUNTIME][OPENCL] Bugfix for clImage creation with host ptr
- #16672 - [CLML] Fix build TVM with CLML on MacOS
- #16328 - [RUNTIME][CLML] Fix for Softmax op for 4D tensors
- #16394 - [OpenCL][CMake] Fix OpenCL tests compilation
Relax
- #16872 - Enhance symbolic expr estimation in memory planning
- #16867 - Dispatch sort/scan for non-cuda gpu backends
- #16852 - Fix EliminateCommonSubexpr removing alloc tensor
- #16851 - [Relax,Topi] Allow passing workspace to thrust to avoid allocations
- #16841 - Provide well-formed output in `transform.LazyGetInput`
- #16798 - [Transform] Provide callback versions of LazyTransformParams
- #16801 - Allow DeadCodeElimination within ApplyPassToFunction
- #16834 - Capture symbolic vars in struct info of weights
- #16830 - Share storage allocs among functions after cuda graph rewriting
- #16823 - [VM] Refactor CUDA graph builtins as VM extension
- #16828 - [Bugfix] Provide the full Expr to pattern-match rewriter
- #16805 - [Bugfix] BlockBuilder may not assume unique input functions
- #16815 - Enable capturing symbolic shapes in cuda graph
- #16642 - Allow R.Prim('bool') in relax::If and assert_op
- #16796 - Unit-test for structural equal of recursive function
- #16732 - Allow composition of DFPattern replacements (see the sketch after this list)
- #16783 - Improve CanonicalizeBindings in DataflowVar edge case
- #16721 - Implement operators to inspect DLTensor::strides and offset
- #16730 - Refactor PatternRewriter into separate Block/Expr mutators
- #16756 - [IR] Improve highlighting in assert_structural_equal
- #16779 - Improve malformed-IR error message
- #16569 - [Unity][Parser] Check well-formedness in the parser
- #16759 - [Pass] Lowering passes for GPU IPC memory and allreduce
- #16697 - Implement relax.transform.TopologicalSort
- #16658 - Normalize use of void-type variable to inline R.tuple()
- #16711 - [Frontend] Add ops `tanh`, `exp`, `negative`, and `permute`
- #16703 - [Fix] Fix top-p/top-k sampling kernel
- #16669 - [Frontend][Onnx] add sum and globalavgpool 1d/3d op
- #16691 - CUDA graph rewrite treating StringImm as static
- #16685 - Implement StructInfoPattern for dataflow pattern matching
- #16681 - [Frontend][Onnx] support MaxPool1/2/3D and AveragePool1/2/3D
- #16584 - [Unity][TIR] Clear struct info when specializing PrimFunc
- #16676 - Remove the legalization of cumsum/cumprod
- #16654 - [Frontend][NN] Add support for Conv3D
- #16674 - Eager free original weights in transform_params
- #16675 - add sample_indices in sampling
- #16648 - [Runtime] Support Unpack API for NDArrayCache
- #16591 - [Unity][Transform] Handle dynamic shapes in CombineParallelMatmul
- #16594 - [Transform] Preserve param names in LiftTransformParams
- #16575 - [Unity] GPU sampling
- #16574 - Additional unit tests for RemoveUnusedParameters
- #16585 - [Unity][Analysis] Include impure call in VerifyWellFormed errors
- #16421 - [Unity][Transform] Raise error in FuseOpsByPattern for SSA violation
- #16629 - Fix error message in BlockBuilder
- #16592 - Handle dynamic arguments in legalization of nn.attention
- #16590 - [Unity][Transform] Check for permute_dims in ExpandMatmulOfSum
- #16604 - [Frontend][Onnx] Fix clip and unsqueeze opset implementations
- #16568 - [Runtime] RNNState for State Space Models
- #16563 - Implement operators to read runtime DLTensor* information
- #16581 - [Unity][MSC][M4.2][Step2] Enable plugin with manager, test plugins in compile pipeline
- #16600 - Expose name_hint field for BlockBuilder.match_cast
- #16601 - [Transform] Canonicalize `let var = R.const` bindings
- #16586 - Ignore non-relax functions in relax.transform.RunCodegen
- #16573 - [VM] Re-implementation of callback functions
- #16561 - [Bugfix] Remove call to tvm.build for empty TIR module
- #16564 - [Unity] Check for symbolic vars in PrimValue when lowering to TIR
- #16558 - Minor updates for NN frontend
- #16542 - Support callback as argument
- #16487 - [Unity][Transform] Handle `call_tir_inplace` in `FuseTIR` and `FuseOps`
- #16355 - [Unity] Infer struct info for relax.op.split on dynamic-sized index
- #16465 - [Redo][Unity] Split DecomposeOpsForTraining into two steps
- #16495 - [Unity][MSC][M4.2][Step1] Enable plugin with manager, test plugins in compile pipeline
- #16498 - [Frontend] "tensor_ir_inplace" op
- #16500 - [Unity] Support storage reuse for dynamic shapes
- #16493 - [Pass] Skip data type node for CSE pass
- #16467 - [Unity][MSC][Refactor] Reconstruct BYOC and runner
- #16422 - [Unity][CodeGen] RunCodegen based on externally-exposed functions
- #16483 - [Unity][Frontend] Add Sigmoid and Square Op
- #16472 - [Unity] Improved error message in tvm::relax::UpdateStructInfo
- #16473 - [Unity] Improve error message in tensor_to_shape struct inference
- #16466 - Memory planning for "partially dynamic" shapes
- #16464 - NDArray Cache Update with DLTensor Support
- #16315 - [Unity][Transform] Implement relax.transform.ReorderTakeAfterMatmul
- #16313 - [Unity][Transform] Implement relax.transform.ExpandMatmulOfSum
- #16411 - [Unity][Transform] Handle symbolic variables in LambdaLift
- #16443 - [Unity][FIX] fix thread dtype mismatch
- #16442 - Revert "[Unity] Split DecomposeOpsForTraining into two steps"
- #16437 - [Unity] Improve buffer allocation for handling duplicated buffer names.
- #16439 - [Unity] Support cumsum with pure int32
- #16432 - [Unity] downgrade cmake version requirement
- #16427 - [Unity][Frontend][NN] Better support for dynamic convolutions
- #16418 - [Unity][Fix] Fix mismatched intrinsic name
- #16129 - [Unity][Transform] Replace eligible operators with in-place versions in dataflow blocks
- #16414 - [Bugfix][Unity] Recover MSVC/NVCC/ROCm/Vulkan
- #15954 - [Unity] Split DecomposeOpsForTraining into two steps
- #16111 - [Unity][Transform] Memory planning for dynamic-shape func return
- #16396 - [Unity] PagedKVCache supporting on-the-fly RoPE calculation
- #16395 - [Frontend][ONNX] Fix ONNX frontend parsing
- #16385 - [Unity][Op] Add Conv3D Operator
- #16284 - [Unity][nnModule] Dynamic shape support in nn Module
- #16378 - [Unity][BlockBuilder] Restore bb.get()
- #16374 - [Unity] Support TIR kernel for PagedKVCache
- #16314 - [Unity][Transform] Implement relax.transform.AdjustMatmulOrder
- #16349 - [Unity][MSC] Avoid depending on trivial bindings in Relax intermediate
- #16376 - [Unity][Contrib] Fix a bug due to typo in vLLM `reconstruct_from_cache` kernel and add test
- #16388 - [Unity] Update dispatch test cases following the merge from main
- #16335 - [Unity] Set CMAKE_CUDA_ARCHITECTURES default to native
- #16306 - [Unity][Transform] Update LambdaLift to use name of lifted lambda
- #16310 - [Unity][Analysis] Show objects instead of names in WellFormedChecker
- #16362 - [Unity][Fix] Memory planning check value type of 'tir_var_upper_bound'
- #16367 - [Unity][Transform] Handle replacement at both var binding and usage
- #16309 - [Unity][Transform] Use parameter name in BundleModelParams
- #16307 - [Unity] Improved error message in ExprMutator::ReEmitBinding
- #16308 - [Unity] Improved error message for matmul shape mismatch
- #16360 - [Unity] Enhance Torch-consistency in reshape
- #16350 - [Unity][Contrib] Add vLLM paged attention kernel
- #16303 - [Unity][NN] Use Linear name for nn.op.permute_dims
- #16325 - [Unity][MSC][Legalize] legalize codes and mute logging
- #16312 - [Unity][Analysis] Add utility for collecting compile-time bindings
- #16330 - [Unity][WEBGPU] Enable wasm exception propagation
- #16304 - [Unity][Analysis] Handle PrimStructInfo in EraseToWellDefined
- #16305 - [Unity][Transform] Implement UpdateParamStructInfo
- #16331 - [Unity] Alter op impl handling empty transform for output
- #16254 - [Unity] Dispatch cumsum and sort
- #16120 - [Unity][Transform] Extract partial-tuple-usage from FuseTIR
- #16311 - [Unity] Validate struct info in relax::Call constructor
- #16333 - [Unity] Fix nn.op.tensor_ir_op signature
- #16302 - [Unity] Cutlass kernel compatibility with cmake 3.18+
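Several entries above (#16732, #16685, #16828) touch the Relax dataflow pattern language. The sketch below shows the general `rewrite_call` workflow; the function, pattern, and replacement are illustrative:

```python
import tvm
from tvm import relax
from tvm.relax.dpl import is_op, wildcard, rewrite_call
from tvm.script import relax as R

@R.function
def main(x: R.Tensor((4,), "float32")) -> R.Tensor((4,), "float32"):
    with R.dataflow():
        y = R.add(x, x)
        R.output(y)
    return y

# Match x + x, where both operands are the same expression.
lhs = wildcard()
pattern = is_op("relax.add")(lhs, lhs)

def rewriter(expr, matches):
    # Replace x + x with x * 2.
    return relax.op.multiply(matches[lhs], relax.const(2.0, "float32"))

rewritten = rewrite_call(pattern, rewriter, main)
```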
Relay
- #16622 - [ONNX] Fix the attribute mode parse of operator Upsample
- #16626 - [ONNX] Fix the Resize operator in ONNX frontend
- #16624 - [ONNX] fix the wrong default value about dtype in Multinomial converter
- #16417 - [Frontend][Torch] fix pytorch frontend linspace op
- #16400 - [Frontend][Torch] fix pytorch frontend not support logical or
- #16390 - [Frontend][Torch] fix a typo mistake in nonzero_numpy
- #16324 - make "ToScalar" support directly obtaining "int64_t"
Runtime
- #16804 - Introduce MSCCLPP with NCCL equivalent interface
- #16809 - Add "TVM_DLL" to NVTX header
- #16750 - CUDA IPC Memory support and custom allreduce kernels
- #16738 - [Refactor] Always specify device in allocator interface
- #16716 - Ensure NDArray.CopyTo(Device) always syncs (see the sketch after this list)
- #16705 - Add TVM_DLL to memory manager functions
- #16692 - PagedKVCache execute data copy on a separate stream
- #16647 - [RPC] Fix FreeObject in minrpc server
- #16667 - [Builtin] Using float32 accumulation in attention kernel
- #16635 - [RPC] Enable RPCObjectRef over multi-hop RPC
- #16630 - Add TVM_DLL to threading backend funcs
- #16541 - Add "TVM_DLL" to NDArray cache load func
- #16550 - [ROCM] Properly align rocm parameter buffer
- #16545 - Fix dtype conversion for bf16 and fp8
- #16508 - ParallelFor skipping thread backend for unit extent
- #16486 - KV cache providing workspace for attn kernel
- #16456 - [KVCache] AttentionWithFusedQKV and RoPE mode
- #16415 - [Memory] Implement support for non-zero offset within a storage object in AllocNDArr…
- #16387 - [RPC] Enable RPCObjectRef return in RPC
- #16377 - Use cudaGetDeviceCount to check if device exists
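A small sketch combining two of the behaviors above; the array contents are illustrative:

```python
import numpy as np
import tvm

dev = tvm.cuda(0)
if dev.exist:  # device presence is checked via cudaGetDeviceCount (#16377)
    host = tvm.nd.array(np.arange(4, dtype="float32"), device=tvm.cpu())
    on_gpu = host.copyto(dev)  # CopyTo(Device) now always synchronizes (#16716)
    print(on_gpu.numpy())
```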
TIR
- #16832 - Use constructor for new PrimFunc in TransformLayout
- #16543 - Fix segfaults from ordering of Let/Assert in MakePackedAPI
- #16795 - Ramp and Broadcast lanes fixed to int32 dtype
- #16767 - [Driver] Use `BindTarget` to specify target for FP8 legalization
- #16742 - [Bugfix] Fix cache_read update buffer region
- #16726 - [Bugfix] Avoid overwrite of unmanaged buffer allocations
- #16548 - [CUDA] Add native FP8 support to codegen
- #16723 - Implement max/min_value for fp8 data types
- #16655 - Improve well-formed check's handling of match buffer
- #16673 - Support Vector Reinterpret Calls
- #16682 - [Bugfix] Handle AttrStmt of upcoming tir.Var in ConvertSSA
- #16560 - Enhance and fix tensorize schedule for some cases
- #16660 - [Bugfix] Fix duplicate AllocateConst in CacheReadWrite schedule primitive
- #16544 - Expand debug symbol output for CodeGenLLVM
- #16553 - Fix get_block_access_region for let bindings
- #16515 - Require exactly same-dtype matching for Vulkan smem reuse
- #16406 - Fix of inter thread reduction with shared memory prefetch
- #16293 - Extend DP4A tensor intrin
- #16345 - Allow sync threads inside condition
- #16250 - In SplitHostDevice, check for variables in thread extents
- #16184 - [Transform] Implement InlinePrivateFunctions
TOPI
- #16652 - improve inclusive_scan for thrust
- #16383 - [Target] Add fp16 SIMD support for conv2d on `arm_cpu` targets
TVMC
- #16261 - Add tvmc flags to print IR before and after a named pass
TVMScript
- #16864 - Add parser and printer support for e4m3/e5m2 fp8 (see the sketch after this list)
- #16844 - Produce empty DictAttrs when R.func_attrs is absent
- #16811 - Do not throw error for duplicate definitions
- #16641 - Allow use of relax.Expr with void type as a statement
- #16663 - Infer T.reads() for DeclBuffer nodes
- #16640 - Represent tir::builtin::ret() using python "return"
- #16562 - [Bugfix] Handle R.match_cast as last binding in if/else
- #16593 - [Unity] Parse R.Object return type from call_pure_packed
- #16356 - [Unity] Optionally hide StructInfo that can be inferred
- #16379 - [Unity] Update `call_packed` semantics to support empty sinfo_args
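A hedged sketch of the fp8 support from #16864: a PrimFunc whose input buffer uses the `e4m3_float8` dtype string and round-trips through the TVMScript printer. The upcast kernel itself is illustrative:

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def upcast(A: T.Buffer((8,), "e4m3_float8"), B: T.Buffer((8,), "float16")):
    for i in range(8):
        with T.block("upcast"):
            vi = T.axis.spatial(8, i)
            B[vi] = T.Cast("float16", A[vi])

print(upcast.script())  # prints the function, fp8 dtype included
```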
Vulkan
- #16858 - Fix CLZ support for Vulkan
cuda & cutlass & tensorrt
- #16865 - [Codegen, CUDA] Add handling of fp8 broadcast / const
- #16818 - [Cutlass] Fix usage of cuda stream for group gemm
- #16788 - [Cutlass] Add check for group gemm param shapes
- #16789 - [Bugfix][Cutlass] Remove a typo in cutlass build
- #16787 - [Codegen, Cuda] Add overload for fp8x4 e5m2 <-> half4 conversion
- #16751 - [Cutlass] Add group gemm kernels
- #16736 - [Target][CUDA] Allow non-numeric arch as needed for latest gpu
- #16619 - [Bugfix][Cutlass] Check if function attributes is None
- #16342 - [CUDA] Simple extend to optimize reuse for static shared memory.
microNPU
- #16266 - [microNPU][ETHOSU] Add fixed point for tanh
- #16680 - [microNPU][ETHOSU] Fix LUT size for int16 activations
- #16401 - [microNPU][ETHOSU] Add fixed point for matmul
web
- #16733 - Support web indexDB cache for larger model storage
- #16810 - Support building tvm/web on Windows
- #16825 - Allow custom bc files in emcc making
- #16791 - Add `kv_state` and `rnn_state` to wasm_runtime
- #16722 - Implement linear congruential generator, make runtime seedable
- #16650 - Separate parallel shard download and iterative shard loading
- #16694 - Initial support for asyncify
- #16631 - Fix NDArrayCache loading report callback
- #16525 - Move ArtifactCache to Interface, Support Cache delete and Batch Delete, Remove typo
- #16554 - Compatibility with PagedKVCache in WebGPU
- #16527 - Revert "[Unity] Temp disable wasm exception (#16444)"
- #16504 - [Relax] Add ApplyPresenceAndFrequencyPenalty
- #16485 - [wasm] Enlarge initial memory for emcc
- #16444 - [Unity] Temp disable wasm exception
Misc
- #16873 - [Thrust] Fix thrust workspace allocation
- #16868 - [3rdparty] Bump flashinfer
- #16871 - [PageKV] Allow PopN to pop all the tokens in the last block
- #16866 - [3rdparty] Bump FlashInfer
- #16863 - [Picojson] Let the keys of JSON objects be ordered by default
- #16856 - [Thrust] Use pointer to tls pool to prevent creating new pool
- #16850 - Fixing probability comment
- #16849 - [KVCache] Initialize one extra page than specified
- #16843 - [IR] Provide well-formed intermediate in ApplyPassToFunction
- #16772 - [MSC][M5.3] Support torch.dynamo for dynamic models
- #16839 - Bump pillow from 10.2.0 to 10.3.0 in /apps/microtvm/cmsisnn
- #16838 - Bump pillow from 10.2.0 to 10.3.0 in /apps/microtvm/ethosu
- #16831 - [KVCache] Reducing CacheAuxDataManager copy size
- #16794 - [SME] Target parser support for SME
- #16824 - [KVCache] Introducing auxiliary data manager
- #16800 - [BugTIR] Fix error merging shared memory for ptx_cp_async
- #16822 - [VM] Recycle VMFrame
- #16813 - [KVCache] Support forking sequence at specific position
- #16786 - [Codegen] Add check to disable invalid reinterpret
- #16816 - [Cmake] Allow using custom CCCL path for thrust
- #16784 - [SLM] Add unit tests for SLM to Relax exporter
- #16814 - Fix includes of custom allreduce kernel
- #16806 - [Debug] Improve error message in VMShapeLower
- #16802 - [Debug] Improve error messages in LiftTransformParams
- #16425 - [Target] Use LLVM target parser for determining Arm(R) A-Profile Architecture features
- #16797 - [3rdparty] AUTO mode for custom all-reduce strategy
- #16761 - [SME] Add support for inserting processor state annotations
- #16778 - [Analysis] Allow calls to GlobalVar in @R.function
- #16745 - [IR] Default to empty attributes, instead of NULL
- #16777 - Revert "[SLM] Allow modules to define pre-processing of weights"
- #16776 - [Contrib] Remove thrust "built but not used" warning
- #16757 - [SLM] Allow modules to define pre-processing of weights
- #16763 - [CONTRIB] Add nm symbol dump
- #16717 - Enable Shared Function in LiftTransformParam Pass
- #16729 - [Builtin] Sliding window and sink support for PagedKVCache
- #16724 - Fix cpp_rtvm cmake build on Windows
- #16513 - [Target] Automatically detect system triple when not specified by the user
- #16710 - [CMake] Add "USE_FLASHINFER" to libinfo
- #16702 - [MSC][M5.2] Enable quantize && prune with gym by wrapper
- #16699 - [Transform] Remove R.Object parameters after LazyTransformParams
- #16668 - [MSC][M5.1] Build wrapper to support compression
- #16693 - [Contrib] Support NDArray cache taking generator
- #16412 - [Lint] Add check to prevent usage of #include
- #16689 - [DeviceAPI] Support "GetCurrentStream"
- #16690 - Use target name instead of node name as function name
- #16683 - [skip ci] Fix wasm exception flag
- #16609 - Minor update to docs instructions
- #16656 - Simplify Windows CMake Command
- #16666 - [KVCache] Fix the reference counter in sequence fork
- #16662 - Fixing workload comment
- #16595 - [Transform] Check for zero-param operators in LiftTransformParams
- #16599 - [Transform] De-duplicate MatchCast nodes in EliminateCommonSubexpr
- #16596 - [Transform] Implement relax.transform.ReorderPermuteDimsAfterConcat
- #16597 - [Transform] Allow explicit name of bundled model parameters
- #16602 - [Transform] Improvements to LazyTransformParams
- #16606 - [KVCache] Support passing in attn_score_scaling_factor into KV cache
- #16608 - Extend gpu memory bandwidth test to work through RPC
- #16587 - [Debug] Improve error message for codegen pattern mismatches
- #16570 - [Marvell BYOC]: Marvell AI Accelerator Integration - Phase 1
- #16576 - Update the 3rdparty/libflash_attn submodule
- #16580 - [KVCache] Support mode "None" for Rotary Embedding
- #16578 - [KVCache] Support returning query positions
- #16571 - Fix compile warnings
- #16540 - [Upd] Enable lld search to include /opt/rocm/llvm/bin for rocm
- #16539 - Improve error message in NDArray::CopyFromTo
- #16524 - [Build] Improving debug and build-dir options
- #16551 - [KVCache] Fix attention kernel for ROCm
- #16512 - Cut pytest-lazy-fixture
- #16506 - Bump 3rdparty/cutlass_fpA_intB_gemm version
- #16511 - [Minor] Fix Clang compilation warning in fuse_tir.cc and codegen_c_host.cc
- #16516 - Add Relax, Unity Tags in make_notes.py
- #16497 - [Instrument] Add default instrument to print all passes
- #16494 - [DPL] Support tir_vars field in is_call_tir pattern
- #16453 - Bump pillow from 10.0.1 to 10.2.0 in /apps/microtvm
- #16454 - [BugTIR] Fix thread_sync occurring in LetStmt
- #16468 - [LINT] Fix pylint issues in test_dma_builtin.py
- #16413 - [Contrib] Workspace for cuBLAS backend
- #16460 - [Cherry-pick][MSC][M4.1] Add plugin && plugin_builder, enable build and test in different frameworks (#16397)
- #16461 - [Minor] Fix Docstring for sphinx-build
- #16431 - [Schedule] Loop-Partition Scheduling Primitive
- #16451 - Bump pillow from 10.0.1 to 10.2.0 in /apps/microtvm/ethosu
- #16452 - Bump pillow from 10.0.1 to 10.2.0 in /apps/microtvm/cmsisnn
- #16445 - [skip ci] update branch rule to prepare for unity transition
- #16426 - [CMake] Enable cuda lang if USE_CUDA is on
- #16407 - Add NVIDIA Hopper H100 target tag
- #16398 - [DeviceAPI] Support querying total global memory
- #16357 - [RPC] Fix tuning on macOS and Windows (#15771)
- #16386 - [Thrust] Use no sync exec policy and caching allocator
- #16343 - [CMake][MSVC] Disable permissive mode for MSVC builds
- #16242 - [Codegen] Fix if_then_else codegen
- #16341 - [CMake] Use ccache as CMAKE_CUDA_COMPILER_LAUNCHER
- #16332 - Change metal dtype of ceil_log2 to fp32