Release v0.2.0 · pytorch/ao

What's Changed

Highlights

Custom CPU/CUDA extension to ship CPU/CUDA binaries.

PyTorch core recently shipped a new custom op registration mechanism in torch.library. Its main benefit is that custom ops compose with as many PyTorch subsystems as possible, most notably without causing graph breaks under torch.compile().

We've added documentation on how to register your own custom ops at https://github.com/pytorch/ao/tree/main/torchao/csrc. If you learn better by example, you can follow PR #135 to add your own custom ops to torchao.

Most notably, @gau-nernst used these instructions to integrate new custom ops for fp6 support #223

One key benefit of integrating your kernels directly into torchao is that, thanks to our manylinux GPU support, we can ensure the CPU/CUDA kernels you add work on as many devices and CUDA versions as possible #176
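As a rough illustration of the torch.library registration flow described above, here is a minimal sketch in Python. The namespace `demo_ops` and the op `scaled_add` are hypothetical names invented for this example and are not part of torchao. The kernel is plain PyTorch rather than a compiled CPU/CUDA extension, and the sketch does not exercise torch.compile; a real compiled kernel would typically also need a meta/fake implementation so that it can be traced.

```python
import torch
from torch.library import Library

# "demo_ops" is a hypothetical namespace used only for this sketch.
lib = Library("demo_ops", "DEF")

# Declare the operator schema so the dispatcher knows its signature.
lib.define("scaled_add(Tensor a, Tensor b, float alpha) -> Tensor")


def scaled_add_impl(a, b, alpha):
    # Stand-in for a real kernel: computes a + alpha * b with regular ops.
    return a + alpha * b


# Register a single implementation that is used for every backend.
lib.impl("scaled_add", scaled_add_impl, "CompositeExplicitAutograd")

x = torch.ones(3)
y = torch.full((3,), 2.0)

# Registered ops are reachable through torch.ops.<namespace>.<op name>.
out = torch.ops.demo_ops.scaled_add(x, y, 0.5)
```

In an actual extension the schema and kernels would usually be registered from C++/CUDA, following the csrc instructions linked above.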

A lot of prototype and community contributions

@jeromeku was our community champion, merging support for:

GaLore, our first pretraining kernel, which lets you finetune llama 7b on a single 4090 card with up to 70% speedups relative to eager PyTorch
DoRA, which has been shown to yield better fine-tuning accuracy than QLoRA. This is an area where the community can help us benchmark more thoroughly https://github.com/pytorch/ao/tree/main/torchao/prototype/dora
Fused int4/fp16 quantized matmul, which is particularly useful for compute-bound kernels, showing 4x speedups over tinygemm at larger batch sizes such as 512 https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq

@gau-nernst merged fp6 support, showing up to 8x speedups over an fp16 baseline for small batch size inference #223

NF4 support for upcoming FSDP2

@weifengpy merged support for composing FSDP2 with NF4, which makes it easy to implement algorithms like QLoRA + FSDP without writing any CUDA or C++ code. This work also provides a blueprint for composing smaller dtypes with FSDP #150, most notably by implementing torch.chunk(). We hope the broader community uses this work to experiment more heavily at the intersection of distributed and quantization research, and that it inspires many more studies such as the ones done by Answer.ai https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html
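To give a feel for why a chunk operation matters when sharding a quantized dtype, here is a toy, pure-Python sketch. It is not NF4Tensor's implementation: `ToyBlockQuantized` and `BLOCK` are hypothetical names, and the sketch uses simple uniform absmax 4-bit codes instead of the NF4 codebook. The point it illustrates is that a shard has to carry both its slice of the codes and the matching slice of the per-block scales, so the split must land on block boundaries.

```python
from dataclasses import dataclass
from typing import List

BLOCK = 4  # toy block size, chosen only to keep the example readable


@dataclass
class ToyBlockQuantized:
    codes: List[int]     # one signed 4-bit code per element, in [-7, 7]
    scales: List[float]  # one absmax-derived scale per block of BLOCK elements

    @classmethod
    def quantize(cls, values: List[float]) -> "ToyBlockQuantized":
        assert len(values) % BLOCK == 0, "length must be a multiple of BLOCK"
        codes: List[int] = []
        scales: List[float] = []
        for start in range(0, len(values), BLOCK):
            block = values[start:start + BLOCK]
            # Fall back to 1.0 for an all-zero block to avoid dividing by zero.
            scale = max(abs(v) for v in block) / 7 or 1.0
            scales.append(scale)
            codes.extend(round(v / scale) for v in block)
        return cls(codes, scales)

    def dequantize(self) -> List[float]:
        return [c * self.scales[i // BLOCK] for i, c in enumerate(self.codes)]

    def chunk(self, n: int) -> List["ToyBlockQuantized"]:
        # Split on block boundaries so each shard keeps its own scales.
        n_blocks = len(self.scales)
        assert n_blocks % n == 0, "blocks must divide evenly across shards"
        per = n_blocks // n
        return [
            ToyBlockQuantized(
                self.codes[k * per * BLOCK:(k + 1) * per * BLOCK],
                self.scales[k * per:(k + 1) * per],
            )
            for k in range(n)
        ]


weights = [0.1, -0.7, 0.35, 0.0, 1.4, -0.2, 0.7, -1.4]
q = ToyBlockQuantized.quantize(weights)
shards = q.chunk(2)  # e.g. one shard per rank
rebuilt = [v for shard in shards for v in shard.dequantize()]
```

Because each shard is self-describing, dequantizing the shards and concatenating the results gives the same values as dequantizing the unsharded tensor.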

BC breaking

Deprecations

New Features

Match autoquant API with torch.compile (#109, #162, #175)
[Prototype] 8da4w QAT (#138, #199, #198, #211, #154, #157, #229)
[Prototype] GaLore (#95)
[Prototype] DoRA (#216)
[Prototype] HQQ (#153, #185)
[Prototype] 2:4 sparse + int8 sparse subclass (#36)
[Prototype] Unified quantization primitives (#159, #201, #193, #220, #227, #173, #210)
[Prototype] Pruning primitives (#148, #194)
[Prototype] AffineQuantizedTensor subclass (#214, #230, #243, #247, #251)
[Prototype] Add Int4WeightOnlyQuantizer (#119)
Custom CUDA extensions (#135, #186, #232)
[Prototype] Add FP6 Linear (#223)

Improvements

FSDP2 support for NF4Tensor (#118, #150, #207)
Add save/load of int8 weight only quantized model (#122)
Add int_scaled_mm on CPU (#121)
Add cpu and gpu in int4wo and int4wo-gptq quantizer (#131)
Add torch.export support to int8_dq, int8_wo, int4_wo subclasses (#146, #226, #213)
Remove is_gpt_fast specialization from GPTQ (#172)
Common benchmark and profile utils (#238)

Bug fixes

Fix padding in GPTQ (#119, #120)
Fix Int8DynActInt4WeightLinear module swap (#151)
Fix NF4Tensor.to to use device kwarg (#158)
Fix quantize_activation_per_token_absmax perf regression (#253)

Performance

Chunk NF4Tensor construction to reduce memory spike (#196)
Fix intmm benchmark script (#141)

Docs

Update READMEs (#140, #142, #169, #155, #179, #187, #188, #200, #217, #245)
Add https://pytorch.org/ao (#136, #145, #163, #164, #165, #168, #177, #195, #224)

CI

Add A10G support in CI (#176)
General CI improvements (#161, #171, #178, #180, #183, #107, #215, #244, #257, #235, #242)
Add expecttest to requirements.txt (#225)
Push button binary support (#241, #240, #250)

Not user facing

Security

Untopiced

Version bumps (#125, #234)
Don't import _C in fbcode (#218)

New Contributors

@Xia-Weiwen made their first contribution in #121
@jeromeku made their first contribution in #95
@weifengpy made their first contribution in #118
@aakashapoorv made their first contribution in #179
@UsingtcNower made their first contribution in #194
@Jokeren made their first contribution in #217
@gau-nernst made their first contribution in #223
@janeyx99 made their first contribution in #245
@huydhn made their first contribution in #250
@lancerts made their first contribution in #238

Full Changelog: v0.2.0...v0.2.1

We were able to close about half of the tasks planned for 0.2.0; the remaining tasks will spill over into upcoming releases. We will post a task list for 0.3.0 next, which we aim to release at the end of May 2024. We plan to follow a monthly release cadence until further notice.
