ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions
🔥🔥[CVPR 2024] The official implementation of the paper "ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions"
🔥🔥 | Paper | ViT-CoMer explained on Zhihu (Chinese) | Third-party WeChat official account write-up of ViT-CoMer (Chinese)
The overall architecture of ViT-CoMer. ViT-CoMer is a two-branch architecture consisting of three components: (a) a plain ViT with L layers, evenly divided into N stages for feature interaction; (b) a CNN branch that employs the proposed Multi-Receptive Field Feature Pyramid (MRFP) module to provide multi-scale spatial features; and (c) a simple and efficient CNN-Transformer Bidirectional Fusion Interaction (CTI) module that integrates the features of the two branches at different stages, enhancing semantic information.
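To make the data flow concrete, below is a minimal PyTorch-style sketch of the two-branch forward pass. The module names `MRFP` and `CTI` follow the paper, but the stub classes, layer choices (depthwise dilated convolutions, vanilla cross-attention), and hyper-parameters are illustrative assumptions, not the repository's actual implementation.

```python
# Minimal two-branch sketch of the ViT-CoMer forward pass (illustrative only).
# MRFP and CTI are reduced to simple stand-ins so the control flow is clear.
import torch
import torch.nn as nn


class MRFPStub(nn.Module):
    """Stand-in for the Multi-Receptive Field Feature Pyramid: parallel depthwise
    convs with different dilations, applied to every pyramid level."""
    def __init__(self, dim, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, 3, padding=d, dilation=d, groups=dim) for d in dilations
        )
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, feats):                      # feats: list of [B, C, Hi, Wi]
        return [self.proj(sum(b(f) for b in self.branches)) + f for f in feats]


class CTIStub(nn.Module):
    """Stand-in for CNN-Transformer Bidirectional Interaction: exchange information
    between ViT tokens and flattened multi-scale CNN features (plain cross-attention
    here; the paper uses a more elaborate fusion)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.vit_from_cnn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cnn_from_vit = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vit_tokens, cnn_feats):
        shapes = [f.shape for f in cnn_feats]
        cnn_tokens = torch.cat([f.flatten(2).transpose(1, 2) for f in cnn_feats], dim=1)
        vit_tokens = vit_tokens + self.vit_from_cnn(vit_tokens, cnn_tokens, cnn_tokens)[0]
        cnn_tokens = cnn_tokens + self.cnn_from_vit(cnn_tokens, vit_tokens, vit_tokens)[0]
        # split the fused CNN tokens back into their pyramid levels
        out, idx = [], 0
        for b, c, h, w in shapes:
            out.append(cnn_tokens[:, idx:idx + h * w].transpose(1, 2).reshape(b, c, h, w))
            idx += h * w
        return vit_tokens, out


class ViTCoMerSketch(nn.Module):
    def __init__(self, dim=384, depth=12, num_stages=4, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, patch, stride=patch)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, 8, 4 * dim, batch_first=True) for _ in range(depth)
        )
        self.stage_len = depth // num_stages
        # CNN branch: strided stems produce a 1/8, 1/16, 1/32 feature pyramid
        self.stems = nn.ModuleList(nn.Conv2d(3, dim, k, stride=s, padding=k // 2)
                                   for k, s in [(7, 8), (7, 16), (7, 32)])
        self.mrfp = MRFPStub(dim)
        self.cti = nn.ModuleList(CTIStub(dim) for _ in range(num_stages))

    def forward(self, x):
        vit = self.patch_embed(x).flatten(2).transpose(1, 2)   # [B, N, C] tokens
        cnn = self.mrfp([stem(x) for stem in self.stems])      # multi-scale CNN features
        for s, cti in enumerate(self.cti):
            for blk in self.blocks[s * self.stage_len:(s + 1) * self.stage_len]:
                vit = blk(vit)
            vit, cnn = cti(vit, cnn)                            # bidirectional fusion
        return vit, cnn                                         # fed to a dense-prediction head


if __name__ == "__main__":
    model = ViTCoMerSketch()
    tokens, pyramid = model(torch.randn(1, 3, 224, 224))
    print(tokens.shape, [p.shape for p in pyramid])
```

Each of the N stages runs a slice of plain ViT blocks and then exchanges information with the CNN pyramid through CTI, which is the bidirectional interaction described in the figure above.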
- We propose a novel dense prediction backbone that combines the plain ViT with CNN features. It effectively leverages various open-source pre-trained ViT weights and incorporates spatial pyramid convolutional features, addressing the lack of interaction among local ViT features and the challenge of single-scale representation.
- ViT-CoMer-L achieves SOTA 64.3% AP on COCO val2017 without training on extra detection data, and 62.1% mIoU on ADE20K val.
We present a plain, pre-training-free, and feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction, named ViT-CoMer, which facilitates bidirectional interaction between CNN and transformer. Compared to the state-of-the-art, ViT-CoMer has the following advantages: (1) We inject spatial pyramid multi-receptive field convolutional features into the ViT architecture, which effectively alleviates the problems of limited local information interaction and single-feature representation in ViT. (2) We propose a simple and efficient CNN-Transformer bidirectional fusion interaction module that performs multi-scale fusion across hierarchical features, which is beneficial for handling dense prediction tasks.
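Because the transformer branch is a plain ViT, open-source ViT checkpoints (such as the multi-modal pre-trained BEiTv2 used in the experiments below) can initialize it directly, while the MRFP and CTI modules train from scratch; this is what "pre-training-free" refers to. The snippet below is a hedged sketch of such a partial initialization: the checkpoint path is a placeholder, and the key-matching logic assumes the ViT branch keeps the original ViT parameter names, which may not match the repository's actual checkpoint format.

```python
# Hedged sketch: initialize only the plain-ViT branch from an open-source ViT
# checkpoint, leaving the CNN branch (MRFP) and CTI modules randomly initialized.
# "vit_checkpoint.pth" and the nesting under "model" are placeholder assumptions.
import torch

def load_vit_branch(model, ckpt_path="vit_checkpoint.pth"):
    state = torch.load(ckpt_path, map_location="cpu")
    state = state.get("model", state)              # some checkpoints nest weights under "model"
    # keep only entries whose names and shapes match the ViT branch of the model
    own = model.state_dict()
    matched = {k: v for k, v in state.items() if k in own and own[k].shape == v.shape}
    missing, unexpected = model.load_state_dict(matched, strict=False)
    print(f"loaded {len(matched)} tensors; "
          f"{len(missing)} parameters (CNN branch / CTI) stay randomly initialized")
    return model
```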
Comparisons with different backbones and frameworks. Under similar model sizes, ViT-CoMer outperforms other backbones on the two typical dense prediction tasks of COCO object detection and instance segmentation.
Comparisons with the state of the art. We conduct experiments based on Co-DETR, using ViT-CoMer as the backbone and initializing the model with the multi-modal pre-trained BEiTv2. As shown in Table 4, our approach outperforms existing SOTA algorithms on COCO val2017 without extra training data, which strongly demonstrates the effectiveness of ViT-CoMer.
For segmentation, we conduct experiments based on Mask2Former, using ViT-CoMer as the backbone and initializing the model with the multi-modal pre-trained BEiTv2. As shown in Table 7, our method achieves comparable performance to SOTA methods on ADE20K with fewer parameters.
- [20240405] ViT-CoMer is selected as a highlight paper at CVPR 2024.
- [20240318] We release segmentation code and pre-trained weights.
- [20240315] We release ViT-CoMer-L with Co-DETR head configs, which achieves 64.3 AP on COCO val2017.
- [20240313] We release detection code and pre-trained weights.
- [20240313] Create repo.
If you find ViT-CoMer useful in your research, please consider giving a star ⭐ and citing:
@inproceedings{xia2024vit,
title={Vit-comer: Vision transformer with convolutional multi-scale feature interaction for dense predictions},
author={Xia, Chunlong and Wang, Xinliang and Lv, Feng and Hao, Xin and Shi, Yifeng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={5493--5502},
year={2024}
}
Many thanks to the following projects, which helped us a lot in building this codebase:
If you have any questions while using ViT-CoMer, or would like to further discuss implementation details with us, please open an issue or contact us directly via email: [email protected]. We will reply as soon as possible.