
# Dilated Neighborhood Attention Transformer

Preprint: [Dilated Neighborhood Attention Transformer](https://arxiv.org/abs/2209.15001)

By Ali Hassani[1] and Humphrey Shi[1,2]

In association with SHI Lab @ University of Oregon & UIUC[1] and Picsart AI Research (PAIR)[2].


## Abstract

Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities, domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have also gained significant attention, thanks to their performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or Swin Transformer's Shifted Window Self Attention. While effective at reducing self attention's quadratic complexity, local attention weakens two of the most desirable properties of self attention: long range inter-dependency modeling, and global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible and efficient extension to NA that can capture more global context and expand receptive fields exponentially at no additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we introduce Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both. DiNAT variants enjoy significant improvements over strong baselines such as NAT, Swin, and ConvNeXt. Our large model is faster and ahead of its Swin counterpart by 1.5% box AP in COCO object detection, 1.3% mask AP in COCO instance segmentation, and 1.1% mIoU in ADE20K semantic segmentation. Paired with new frameworks, our large variant is the new state of the art panoptic segmentation model on COCO (58.2 PQ) and ADE20K (48.5 PQ), and instance segmentation model on Cityscapes (44.5 AP) and ADE20K (35.4 AP) (no extra data). It also matches the state of the art specialized semantic segmentation models on ADE20K (58.2 mIoU), and ranks second on Cityscapes (84.5 mIoU) (no extra data).
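
As a concrete illustration of the mechanism, below is a minimal, naive 1D sketch of dilated neighborhood attention in PyTorch. It is not this repository's implementation (which uses the optimized NATTEN CUDA kernels in 2D), and for brevity it clamps indices at the sequence borders instead of shifting the neighborhood the way NA does:

```python
import torch

def dilated_neighborhood_attention_1d(q, k, v, kernel_size=7, dilation=4):
    """Naive 1D dilated neighborhood attention (illustrative only).

    q, k, v: (batch, length, dim). Each query attends to `kernel_size`
    keys spaced `dilation` positions apart; dilation=1 recovers plain NA.
    """
    B, L, D = q.shape
    # Relative offsets of the dilated neighborhood around each position.
    offsets = (torch.arange(kernel_size) - kernel_size // 2) * dilation
    # Absolute key indices per query, clamped at the borders (the real
    # NA/DiNA shifts the window instead, so no key is repeated).
    idx = (torch.arange(L)[:, None] + offsets[None, :]).clamp(0, L - 1)  # (L, K)
    k_nbr, v_nbr = k[:, idx], v[:, idx]  # (B, L, K, D)
    attn = torch.einsum("bld,blkd->blk", q, k_nbr) / D ** 0.5
    return torch.einsum("blk,blkd->bld", attn.softmax(dim=-1), v_nbr)

x = torch.randn(2, 56, 32)
print(dilated_neighborhood_attention_1d(x, x, x).shape)  # torch.Size([2, 56, 32])
```

Note that each query still attends to exactly `kernel_size` keys regardless of dilation, which is why the attended span grows with the dilation at no additional attention cost.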

## Results and checkpoints

### Image Classification

#### DiNAT

DiNAT is identical to NAT in architecture, with every other layer replaced with Dilated NA (DiNA). These variants provide similar or better classification accuracy (except for Tiny) and yield significantly better downstream performance; a sketch of how the alternation grows the receptive field follows the table below.

| Model | Resolution | Kernel size | # of Params | FLOPs | Pre-training | Top-1 |
|---|---|---|---|---|---|---|
| DiNAT-Mini | 224x224 | 7x7 | 20M | 2.7G | - | 81.8% |
| DiNAT-Tiny | 224x224 | 7x7 | 28M | 4.3G | - | 82.7% |
| DiNAT-Small | 224x224 | 7x7 | 51M | 7.8G | - | 83.8% |
| DiNAT-Base | 224x224 | 7x7 | 90M | 13.7G | - | 84.4% |
| DiNAT-Large | 224x224 | 7x7 | 200M | 30.6G | ImageNet-22K | 86.6% |
| DiNAT-Large | 384x384 | 7x7 | 200M | 89.7G | ImageNet-22K | 87.4% |
| DiNAT-Large | 384x384 | 11x11 | 200M | 92.4G | ImageNet-22K | 87.5% |
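
The downstream gains come largely from receptive-field growth: stacking local attention layers widens context by roughly (kernel size - 1) × dilation per layer, so alternating NA with DiNA expands it far faster than NA alone. A small sketch of that arithmetic (the dilation values are illustrative, not the exact per-stage settings of the released configs):

```python
def receptive_field(dilations, kernel_size=7):
    """Approximate receptive field of stacked (dilated) neighborhood attention."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer adds (k - 1) * dilation context
    return rf

print(receptive_field([1, 1, 1, 1]))  # four plain NA layers:  25
print(receptive_field([1, 8, 1, 4]))  # NA/DiNA alternation:   85
```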

#### DiNATs

DiNATs variants are identical to Swin in architecture, with window self attention (WSA) replaced by NA and shifted window self attention (SWSA) replaced by DiNA. These variants can provide better throughput on CUDA at the expense of a slightly higher memory footprint and lower accuracy; a schematic of the block pairing follows the table below.

| Model | Resolution | Kernel size | # of Params | FLOPs | Pre-training | Top-1 |
|---|---|---|---|---|---|---|
| DiNATs-Tiny | 224x224 | 7x7 | 28M | 4.5G | - | 81.8% |
| DiNATs-Small | 224x224 | 7x7 | 50M | 8.7G | - | 83.5% |
| DiNATs-Base | 224x224 | 7x7 | 88M | 15.4G | - | 83.8% |
| DiNATs-Large | 224x224 | 7x7 | 197M | 34.5G | ImageNet-22K | 86.5% |
| DiNATs-Large | 384x384 | 7x7 | 197M | 101.5G | ImageNet-22K | 87.4% |
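
The sketch below mirrors Swin's (W-MSA, SW-MSA) block pairs with (NA, DiNA); the constructor `na_layer` and its arguments are stand-ins for illustration, not the actual classes in this repository:

```python
import torch.nn as nn

def build_dinats_stage(depth, dim, stage_dilation, na_layer):
    """Assemble one DiNATs stage: NA and DiNA blocks in strict alternation."""
    blocks = []
    for i in range(depth):
        # Even blocks take Swin's W-MSA slot (local NA, dilation 1);
        # odd blocks take the SW-MSA slot (sparse global DiNA).
        dilation = 1 if i % 2 == 0 else stage_dilation
        blocks.append(na_layer(dim=dim, kernel_size=7, dilation=dilation))
    return nn.Sequential(*blocks)
```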

#### Isotropic variants

| Model | # of Params | FLOPs | Top-1 |
|---|---|---|---|
| NAT-iso-Small | 22M | 4.3G | 80.0% |
| DiNAT-iso-Small | 22M | 4.3G | 80.8% |
| ViT-rpb-Small | 22M | 4.6G | 81.2% |
| NAT-iso-Base | 86M | 16.9G | 81.6% |
| DiNAT-iso-Base | 86M | 16.9G | 82.1% |
| ViT-rpb-Base | 86M | 17.5G | 82.5% |

Details on training and validation are provided in `classification`.

### Object Detection and Instance Segmentation

#### DiNAT

| Backbone | Network | # of Params | FLOPs | mAP | Mask mAP | Pre-training | Checkpoint |
|---|---|---|---|---|---|---|---|
| DiNAT-Mini | Mask R-CNN | 40M | 225G | 47.2 | 42.5 | ImageNet-1K | Download |
| DiNAT-Tiny | Mask R-CNN | 48M | 258G | 48.6 | 43.5 | ImageNet-1K | Download |
| DiNAT-Small | Mask R-CNN | 70M | 330G | 49.3 | 44.0 | ImageNet-1K | Download |
| DiNAT-Mini | Cascade Mask R-CNN | 77M | 704G | 51.2 | 44.4 | ImageNet-1K | Download |
| DiNAT-Tiny | Cascade Mask R-CNN | 85M | 737G | 52.2 | 45.1 | ImageNet-1K | Download |
| DiNAT-Small | Cascade Mask R-CNN | 108M | 809G | 52.9 | 45.8 | ImageNet-1K | Download |
| DiNAT-Base | Cascade Mask R-CNN | 147M | 931G | 53.4 | 46.2 | ImageNet-1K | Download |
| DiNAT-Large | Cascade Mask R-CNN | 258M | 1276G | 55.3 | 47.8 | ImageNet-22K | Download |

#### DiNATs

| Backbone | Network | # of Params | FLOPs | mAP | Mask mAP | Pre-training | Checkpoint |
|---|---|---|---|---|---|---|---|
| DiNATs-Tiny | Mask R-CNN | 48M | 263G | 46.6 | 42.1 | ImageNet-1K | Download |
| DiNATs-Small | Mask R-CNN | 69M | 350G | 48.6 | 43.5 | ImageNet-1K | Download |
| DiNATs-Tiny | Cascade Mask R-CNN | 86M | 742G | 51.0 | 44.1 | ImageNet-1K | Download |
| DiNATs-Small | Cascade Mask R-CNN | 107M | 829G | 52.3 | 45.2 | ImageNet-1K | Download |
| DiNATs-Base | Cascade Mask R-CNN | 145M | 966G | 52.6 | 45.3 | ImageNet-1K | Download |
| DiNATs-Large | Cascade Mask R-CNN | 253M | 1357G | 54.8 | 47.2 | ImageNet-22K | Download |

Details on training and validation are provided in `detection`.

### Semantic Segmentation

#### DiNAT

| Backbone | Network | # of Params | FLOPs | mIoU | mIoU (multi-scale) | Pre-training | Checkpoint |
|---|---|---|---|---|---|---|---|
| DiNAT-Mini | UPerNet | 50M | 900G | 45.8 | 47.2 | ImageNet-1K | Download |
| DiNAT-Tiny | UPerNet | 58M | 934G | 47.8 | 48.8 | ImageNet-1K | Download |
| DiNAT-Small | UPerNet | 82M | 1010G | 48.9 | 49.9 | ImageNet-1K | Download |
| DiNAT-Base | UPerNet | 123M | 1137G | 49.6 | 50.4 | ImageNet-1K | Download |
| DiNAT-Large | UPerNet | 238M | 2335G | 54.0 | 54.9 | ImageNet-22K | Download |

#### DiNATs

| Backbone | Network | # of Params | FLOPs | mIoU | mIoU (multi-scale) | Pre-training | Checkpoint |
|---|---|---|---|---|---|---|---|
| DiNATs-Tiny | UPerNet | 60M | 941G | 46.0 | 47.4 | ImageNet-1K | Download |
| DiNATs-Small | UPerNet | 81M | 1030G | 48.6 | 49.9 | ImageNet-1K | Download |
| DiNATs-Base | UPerNet | 121M | 1173G | 49.4 | 50.2 | ImageNet-1K | Download |
| DiNATs-Large | UPerNet | 234M | 2466G | 53.4 | 54.6 | ImageNet-22K | Download |

Details on training and validation are provided in `segmentation`.

### Image Segmentation with Mask2Former

#### Instance Segmentation

| Backbone | Dataset | # of Params | FLOPs | AP | Config | Checkpoint |
|---|---|---|---|---|---|---|
| DiNAT-Large | MS-COCO | 220M | 522G | 50.8 | YAML file | Download |
| DiNAT-Large | ADE20K | 220M | 535G | 35.4 | YAML file | Download |
| DiNAT-Large | Cityscapes | 220M | 522G | 45.1 | YAML file | Download |

#### Semantic Segmentation

| Backbone | Dataset | # of Params | FLOPs | mIoU (multi-scale) | Config | Checkpoint |
|---|---|---|---|---|---|---|
| DiNAT-Large | ADE20K | 220M | 518G | 58.1 | YAML file | Download |
| DiNAT-Large | Cityscapes | 220M | 509G | 84.5 | YAML file | Download |

#### Panoptic Segmentation

| Backbone | Dataset | # of Params | FLOPs | PQ | AP | mIoU | Config | Checkpoint |
|---|---|---|---|---|---|---|---|---|
| DiNAT-Large | MS-COCO | 220M | 522G | 58.5 | 49.2 | 68.3 | YAML file | Download |
| DiNAT-Large | ADE20K | 220M | 535G | 49.4 | 35.0 | 56.3 | YAML file | Download |
| DiNAT-Large | Cityscapes | 220M | 522G | 67.2 | 44.5 | 83.4 | YAML file | Download |

Details on training and validation are provided in `mask2former`.

## Citation

```bibtex
@article{hassani2022dilated,
    title         = {Dilated Neighborhood Attention Transformer},
    author        = {Ali Hassani and Humphrey Shi},
    year          = 2022,
    url           = {https://arxiv.org/abs/2209.15001},
    eprint        = {2209.15001},
    archiveprefix = {arXiv},
    primaryclass  = {cs.CV}
}
```