
# Dilated Neighborhood Attention Transformer

Preprint: [Dilated Neighborhood Attention Transformer](https://arxiv.org/abs/2209.15001)

By Ali Hassani[1] and Humphrey Shi[1,2]

In association with SHI Lab @ University of Oregon & UIUC[1] and Picsart AI Research (PAIR)[2].


## Abstract

Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing self attention's quadratic complexity, local attention weakens two of the most desirable properties of self attention: long range inter-dependency modeling, and global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we introduce Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both. DiNAT variants enjoy significant improvements over strong baselines such as NAT, Swin, and ConvNeXt. Our large model is faster and ahead of its Swin counterpart by 1.5% box AP in COCO object detection, 1.3% mask AP in COCO instance segmentation, and 1.1% mIoU in ADE20K semantic segmentation. Paired with new frameworks, our large variant is the new state of the art panoptic segmentation model on COCO (58.2 PQ) and ADE20K (48.5 PQ), and instance segmentation model on Cityscapes (44.5 AP) and ADE20K (35.4 AP) (no extra data). It also matches the state of the art specialized semantic segmentation models on ADE20K (58.2 mIoU), and ranks second on Cityscapes (84.5 mIoU) (no extra data).
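
As a concrete illustration of the mechanism, below is a minimal, naive 1D sketch of dilated neighborhood attention in PyTorch. It is not this repository's implementation (which uses the optimized NATTEN CUDA kernels in 2D), and for brevity it clamps indices at the sequence borders instead of shifting the neighborhood the way NA does:

```python
import torch

def dilated_neighborhood_attention_1d(q, k, v, kernel_size=7, dilation=4):
    """Naive 1D dilated neighborhood attention (illustrative only).

    q, k, v: (batch, length, dim). Each query attends to `kernel_size`
    keys spaced `dilation` positions apart; dilation=1 recovers plain NA.
    """
    B, L, D = q.shape
    # Relative offsets of the dilated neighborhood around each position.
    offsets = (torch.arange(kernel_size) - kernel_size // 2) * dilation
    # Absolute key indices per query, clamped at the borders (the real
    # NA/DiNA shifts the window instead, so no key is repeated).
    idx = (torch.arange(L)[:, None] + offsets[None, :]).clamp(0, L - 1)  # (L, K)
    k_nbr, v_nbr = k[:, idx], v[:, idx]  # (B, L, K, D)
    attn = torch.einsum("bld,blkd->blk", q, k_nbr) / D ** 0.5
    return torch.einsum("blk,blkd->bld", attn.softmax(dim=-1), v_nbr)

x = torch.randn(2, 56, 32)
print(dilated_neighborhood_attention_1d(x, x, x).shape)  # torch.Size([2, 56, 32])
```

Note that each query still attends to exactly `kernel_size` keys regardless of dilation, which is why the attended span grows with the dilation at no additional attention cost.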

## Results and checkpoints

### Image Classification

#### DiNAT

DiNAT is identical to NAT in architecture, with every other layer replaced with Dilated NA (DiNA). These variants provide similar or better classification accuracy (except for Tiny) and yield significantly better downstream performance; a sketch of how the alternation grows the receptive field follows the table below.

| Model | Resolution | Kernel size | # of Params | FLOPs | Pre-training | Top-1 |
|---|---|---|---|---|---|---|
| DiNAT-Mini | 224x224 | 7x7 | 20M | 2.7G | - | 81.8% |
| DiNAT-Tiny | 224x224 | 7x7 | 28M | 4.3G | - | 82.7% |
| DiNAT-Small | 224x224 | 7x7 | 51M | 7.8G | - | 83.8% |
| DiNAT-Base | 224x224 | 7x7 | 90M | 13.7G | - | 84.4% |
| DiNAT-Large | 224x224 | 7x7 | 200M | 30.6G | ImageNet-22K | 86.6% |
| DiNAT-Large | 384x384 | 7x7 | 200M | 89.7G | ImageNet-22K | 87.4% |
| DiNAT-Large | 384x384 | 11x11 | 200M | 92.4G | ImageNet-22K | 87.5% |
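
The downstream gains come largely from receptive-field growth: stacking local attention layers widens context by roughly (kernel size - 1) × dilation per layer, so alternating NA with DiNA expands it far faster than NA alone. A small sketch of that arithmetic (the dilation values are illustrative, not the exact per-stage settings of the released configs):

```python
def receptive_field(dilations, kernel_size=7):
    """Approximate receptive field of stacked (dilated) neighborhood attention."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer adds (k - 1) * dilation context
    return rf

print(receptive_field([1, 1, 1, 1]))  # four plain NA layers:  25
print(receptive_field([1, 8, 1, 4]))  # NA/DiNA alternation:   85
```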

#### DiNATs

DiNATs variants are identical to Swin in architecture, with window self attention (WSA) replaced by NA and shifted window self attention (SWSA) replaced by DiNA. These variants can provide better throughput on CUDA at the expense of a slightly higher memory footprint and lower accuracy; a schematic of the block pairing follows the table below.

| Model | Resolution | Kernel size | # of Params | FLOPs | Pre-training | Top-1 |
|---|---|---|---|---|---|---|
| DiNATs-Tiny | 224x224 | 7x7 | 28M | 4.5G | - | 81.8% |
| DiNATs-Small | 224x224 | 7x7 | 50M | 8.7G | - | 83.5% |
| DiNATs-Base | 224x224 | 7x7 | 88M | 15.4G | - | 83.8% |
| DiNATs-Large | 224x224 | 7x7 | 197M | 34.5G | ImageNet-22K | 86.5% |
| DiNATs-Large | 384x384 | 7x7 | 197M | 101.5G | ImageNet-22K | 87.4% |
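
The sketch below mirrors Swin's (W-MSA, SW-MSA) block pairs with (NA, DiNA); the constructor `na_layer` and its arguments are stand-ins for illustration, not the actual classes in this repository:

```python
import torch.nn as nn

def build_dinats_stage(depth, dim, stage_dilation, na_layer):
    """Assemble one DiNATs stage: NA and DiNA blocks in strict alternation."""
    blocks = []
    for i in range(depth):
        # Even blocks take Swin's W-MSA slot (local NA, dilation 1);
        # odd blocks take the SW-MSA slot (sparse global DiNA).
        dilation = 1 if i % 2 == 0 else stage_dilation
        blocks.append(na_layer(dim=dim, kernel_size=7, dilation=dilation))
    return nn.Sequential(*blocks)
```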

#### Isotropic variants

| Model | # of Params | FLOPs | Top-1 |
|---|---|---|---|
| NAT-iso-Small | 22M | 4.3G | 80.0% |
| DiNAT-iso-Small | 22M | 4.3G | 80.8% |
| ViT-rpb-Small | 22M | 4.6G | 81.2% |
| NAT-iso-Base | 86M | 16.9G | 81.6% |
| DiNAT-iso-Base | 86M | 16.9G | 82.1% |
| ViT-rpb-Base | 86M | 17.5G | 82.5% |

Details on training and validation are provided in `classification`.

### Object Detection and Instance Segmentation

#### DiNAT

| Backbone | Network | # of Params | FLOPs | mAP | Mask mAP | Pre-training | Checkpoint |
|---|---|---|---|---|---|---|---|
| DiNAT-Mini | Mask R-CNN | 40M | 225G | 47.2 | 42.5 | ImageNet-1K | Download |
| DiNAT-Tiny | Mask R-CNN | 48M | 258G | 48.6 | 43.5 | ImageNet-1K | Download |
| DiNAT-Small | Mask R-CNN | 70M | 330G | 49.3 | 44.0 | ImageNet-1K | Download |
| DiNAT-Mini | Cascade Mask R-CNN | 77M | 704G | 51.2 | 44.4 | ImageNet-1K | Download |
| DiNAT-Tiny | Cascade Mask R-CNN | 85M | 737G | 52.2 | 45.1 | ImageNet-1K | Download |
| DiNAT-Small | Cascade Mask R-CNN | 108M | 809G | 52.9 | 45.8 | ImageNet-1K | Download |
| DiNAT-Base | Cascade Mask R-CNN | 147M | 931G | 53.4 | 46.2 | ImageNet-1K | Download |
| DiNAT-Large | Cascade Mask R-CNN | 258M | 1276G | 55.3 | 47.8 | ImageNet-22K | Download |

#### DiNATs

| Backbone | Network | # of Params | FLOPs | mAP | Mask mAP | Pre-training | Checkpoint |
|---|---|---|---|---|---|---|---|
| DiNATs-Tiny | Mask R-CNN | 48M | 263G | 46.6 | 42.1 | ImageNet-1K | Download |
| DiNATs-Small | Mask R-CNN | 69M | 350G | 48.6 | 43.5 | ImageNet-1K | Download |
| DiNATs-Tiny | Cascade Mask R-CNN | 86M | 742G | 51.0 | 44.1 | ImageNet-1K | Download |
| DiNATs-Small | Cascade Mask R-CNN | 107M | 829G | 52.3 | 45.2 | ImageNet-1K | Download |
| DiNATs-Base | Cascade Mask R-CNN | 145M | 966G | 52.6 | 45.3 | ImageNet-1K | Download |
| DiNATs-Large | Cascade Mask R-CNN | 253M | 1357G | 54.8 | 47.2 | ImageNet-22K | Download |

Details on training and validation are provided in `detection`.

### Semantic Segmentation

#### DiNAT

| Backbone | Network | # of Params | FLOPs | mIoU | mIoU (multi-scale) | Pre-training | Checkpoint |
|---|---|---|---|---|---|---|---|
| DiNAT-Mini | UPerNet | 50M | 900G | 45.8 | 47.2 | ImageNet-1K | Download |
| DiNAT-Tiny | UPerNet | 58M | 934G | 47.8 | 48.8 | ImageNet-1K | Download |
| DiNAT-Small | UPerNet | 82M | 1010G | 48.9 | 49.9 | ImageNet-1K | Download |
| DiNAT-Base | UPerNet | 123M | 1137G | 49.6 | 50.4 | ImageNet-1K | Download |
| DiNAT-Large | UPerNet | 238M | 2335G | 54.0 | 54.9 | ImageNet-22K | Download |

#### DiNATs

| Backbone | Network | # of Params | FLOPs | mIoU | mIoU (multi-scale) | Pre-training | Checkpoint |
|---|---|---|---|---|---|---|---|
| DiNATs-Tiny | UPerNet | 60M | 941G | 46.0 | 47.4 | ImageNet-1K | Download |
| DiNATs-Small | UPerNet | 81M | 1030G | 48.6 | 49.9 | ImageNet-1K | Download |
| DiNATs-Base | UPerNet | 121M | 1173G | 49.4 | 50.2 | ImageNet-1K | Download |
| DiNATs-Large | UPerNet | 234M | 2466G | 53.4 | 54.6 | ImageNet-22K | Download |

Details on training and validation are provided in `segmentation`.

### Image Segmentation with Mask2Former

#### Instance Segmentation

| Backbone | Dataset | # of Params | FLOPs | AP | Config | Checkpoint |
|---|---|---|---|---|---|---|
| DiNAT-Large | MS-COCO | 220M | 522G | 50.8 | YAML file | Download |
| DiNAT-Large | ADE20K | 220M | 535G | 35.4 | YAML file | Download |
| DiNAT-Large | Cityscapes | 220M | 522G | 45.1 | YAML file | Download |

#### Semantic Segmentation

| Backbone | Dataset | # of Params | FLOPs | mIoU (multi-scale) | Config | Checkpoint |
|---|---|---|---|---|---|---|
| DiNAT-Large | ADE20K | 220M | 518G | 58.1 | YAML file | Download |
| DiNAT-Large | Cityscapes | 220M | 509G | 84.5 | YAML file | Download |

#### Panoptic Segmentation

| Backbone | Dataset | # of Params | FLOPs | PQ | AP | mIoU | Config | Checkpoint |
|---|---|---|---|---|---|---|---|---|
| DiNAT-Large | MS-COCO | 220M | 522G | 58.5 | 49.2 | 68.3 | YAML file | Download |
| DiNAT-Large | ADE20K | 220M | 535G | 49.4 | 35.0 | 56.3 | YAML file | Download |
| DiNAT-Large | Cityscapes | 220M | 522G | 67.2 | 44.5 | 83.4 | YAML file | Download |

Details on training and validation are provided in `mask2former`.

## Citation

```bibtex
@article{hassani2022dilated,
    title         = {Dilated Neighborhood Attention Transformer},
    author        = {Ali Hassani and Humphrey Shi},
    year          = 2022,
    url           = {https://arxiv.org/abs/2209.15001},
    eprint        = {2209.15001},
    archiveprefix = {arXiv},
    primaryclass  = {cs.CV}
}
```