# Accelerated Sparse Training

This folder contains an implementation of accelerated sparse training.

Special thanks to @danthe3rd for writing the runtime semi-structured (2:4) sparsification kernels in PyTorch core.

## Quickstart

NOTE: This feature is currently only available in the PyTorch / torchao nightlies and requires a GPU with CUDA compute capability 8.0+ (Ampere or newer).
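You can check whether your GPU qualifies with a one-liner (this assumes a CUDA-enabled PyTorch build; the snippet is just a convenience check, not part of the torchao API):

```python
import torch

# Runtime 2:4 sparsification requires compute capability >= 8.0 (Ampere or newer)
major, minor = torch.cuda.get_device_capability()
print(f"compute capability {major}.{minor} -> supported: {(major, minor) >= (8, 0)}")
```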

```python
import torch
from torchao.sparsity.training import (
    SemiSparseLinear,
    SemiSparseActivationLinear,
    swap_linear_with_semi_sparse_linear,
    swap_semi_sparse_linear_with_linear,
)

model = torch.nn.Sequential(torch.nn.Linear(1024, 4096)).cuda().to(torch.float16)

# Specify the fully-qualified name (FQN) of the nn.Linear modules you want to swap.
# For the single-layer nn.Sequential above, the FQN of the Linear module is "0".
sparse_config = {
    "0": SemiSparseLinear,
    # for activation sparsity, use the line below instead
    # "0": SemiSparseActivationLinear,
}

# For DINO ViT training we found that sparsifying only the Linear layers of the MLP block
# was an acceptable configuration, but the optimal configuration depends on your specific
# model architecture.

# Swap nn.Linear with SemiSparseLinear
swap_linear_with_semi_sparse_linear(model, sparse_config)

# Now you can run your normal training loop

# If you need to swap back from SemiSparseLinear to nn.Linear, we provide a utility function to do so
swap_semi_sparse_linear_with_linear(model)
```
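To make the "run your normal training loop" step concrete, here is a minimal illustrative sketch that would sit between the two swap calls above. The random inputs, dummy loss, and optimizer choice are placeholders for your real pipeline, not anything prescribed by torchao:

```python
# Illustrative training loop: random data and a dummy loss stand in for your
# real dataloader, loss function, and training hyperparameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
compiled_model = torch.compile(model)  # torch.compile is used in the benchmarks below

for step in range(10):
    x = torch.randn(8, 1024, device="cuda", dtype=torch.float16)
    out = compiled_model(x)           # weights are pruned + compressed to 2:4 at runtime
    loss = out.float().pow(2).mean()  # dummy loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```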

## Benchmarking

For ViT-L we see a 6% end-to-end speedup on a single NVIDIA A100 for a single training pass (forward + backward) with torch.compile enabled and FP16 dtype:

| sparsity_config | model_type | batch_size | time (ms) | memory (GB) |
| --- | --- | --- | --- | --- |
| ViT dense (baseline) | vit_l | 8 | 717.598748 | 58.467037 |
| ViT MLP weight 2:4 sparse | vit_l | 8 | 675.275311 | 59.447039 |

To reproduce these benchmarks, please run:

```
pip install segment-anything-fast pandas
python benchmarks/benchmark_semi_structured_training.py
```

If you have existing matmul shapes for your nn.Linear layers and are curious about the potential speedups, you can add your shapes to benchmarks/benchmark_semi_structured_training.py and run the linear microbenchmarks with:

```
python benchmarks/benchmark_semi_structured_training.py --linear
```

For the ViT-L MLP shapes we see a 1.24x speedup for the first linear layer and a 1.27x speedup for the second:

| sparsity_config | (m, k, n) | time (ms) | memory (GB) |
| --- | --- | --- | --- |
| dense_linear | (13008, 1024, 4096) | 1.660793 | 0.318686 |
| semi_sparse_linear | (13008, 1024, 4096) | 1.341983 | 0.328648 |
| semi_sparse_prune+compress_time_only | (13008, 1024, 4096) | 0.085218 | 0.208406 |
| dense_linear | (13008, 4096, 1024) | 1.642992 | 0.319297 |
| semi_sparse_linear | (13008, 4096, 1024) | 1.294284 | 0.328635 |
| semi_sparse_prune+compress_time_only | (13008, 4096, 1024) | 0.300904 | 0.305532 |
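If you'd rather time a single shape by hand instead of editing the benchmark script, the sketch below compares a dense layer against its 2:4-sparse swap at the first MLP shape from the table. It uses torch.utils.benchmark rather than the script's harness, so absolute numbers will not match the table exactly; it assumes the Quickstart imports and a compute capability 8.0+ GPU:

```python
import copy

import torch
from torch.utils import benchmark
from torchao.sparsity.training import SemiSparseLinear, swap_linear_with_semi_sparse_linear

m, k, n = 13008, 1024, 4096  # first ViT-L MLP shape from the table above
x = torch.randn(m, k, device="cuda", dtype=torch.float16)

dense = torch.nn.Sequential(torch.nn.Linear(k, n)).cuda().to(torch.float16)
sparse = copy.deepcopy(dense)
swap_linear_with_semi_sparse_linear(sparse, {"0": SemiSparseLinear})

def fwd_bwd(mod):
    # one forward + backward pass through the layer
    mod(x).sum().backward()

for name, mod in (("dense_linear", dense), ("semi_sparse_linear", sparse)):
    timer = benchmark.Timer(stmt="fwd_bwd(mod)", globals={"fwd_bwd": fwd_bwd, "mod": mod})
    print(name, timer.blocked_autorange())
```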

When combined with DINOv2, we found that we could train an ImageNet classifier with minimal accuracy loss.

A fully 2:4-sparse trained model showed a 0.5 percentage-point accuracy drop; we were able to reduce this to 0.1 percentage points by first training with 2:4 sparsity enabled and then switching over to normal dense training for the remaining steps.

| Training Configuration | Accuracy (%) |
| --- | --- |
| 0% Sparse: 125k dense steps (baseline) | 82.8 |
| 40% Sparse: 40k sparse -> 85k dense steps | 82.9 |
| 60% Sparse: 75k sparse -> 50k dense steps | 82.8 |
| 70% Sparse: 87.5k sparse -> 37.5k dense steps | 82.7 |
| 80% Sparse: 100k sparse -> 25k dense steps | 82.7 |
| 90% Sparse: 112.5k sparse -> 12.5k dense steps | 82.0 |
| 100% Sparse: 125k sparse steps (2:4-sparse model) | 82.3 |
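Schematically, the partial-sparsity rows above correspond to training with the swap in place for the first portion of the run and then reverting to dense for the remainder. A rough sketch is below (step counts taken from the 80% Sparse row; `train_one_step` is a hypothetical placeholder for your usual forward/backward/optimizer logic, and the exact DINOv2 recipe is not reproduced here):

```python
# Hypothetical helper: train_one_step(model) stands in for your real training step.
swap_linear_with_semi_sparse_linear(model, sparse_config)
for step in range(100_000):   # first 100k steps with 2:4 sparsity ("80% Sparse" row)
    train_one_step(model)

swap_semi_sparse_linear_with_linear(model)
for step in range(25_000):    # remaining 25k dense steps
    train_one_step(model)
```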

All our experiments were run on 4x AMD EPYC 7742 64-core CPUs and 4x NVIDIA A100-80GB GPUs.