# Depth Anything for Semantic Segmentation

We use our Depth Anything pre-trained ViT-L encoder to fine-tune downstream semantic segmentation models.
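As a rough sketch of what this fine-tuning setup involves (not the exact code used in this repo, which is driven by MMSegmentation configs), one would load a Depth Anything checkpoint, keep only its ViT-L encoder weights, and use them to initialize a segmentation backbone before attaching a decode head. The checkpoint filename and the `pretrained.` key prefix below are assumptions about the released checkpoint layout, and the public DINOv2 ViT-L/14 is used here only as a stand-in backbone definition.

```python
import torch

# The Depth Anything encoder follows a DINOv2-style ViT-L/14, so we use the public
# DINOv2 definition as a stand-in backbone (torch.hub needs internet access).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14", pretrained=False)

# Filename and key prefix are assumptions about the released checkpoint layout.
ckpt = torch.load("depth_anything_vitl14.pth", map_location="cpu")
encoder_state = {
    k.replace("pretrained.", ""): v
    for k, v in ckpt.items()
    if k.startswith("pretrained.")
}

# Initialize the backbone with the depth-pre-trained encoder weights.
missing, unexpected = backbone.load_state_dict(encoder_state, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")

# A segmentation head (e.g. Mask2Former via MMSegmentation) is then attached on top
# of this backbone and the whole model is fine-tuned on the target dataset.
```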

## Performance

### Cityscapes

Note that our results are obtained without Mapillary pre-training.

| Method | Encoder | mIoU (s.s.) | mIoU (m.s.) |
|:---|:---|:---:|:---:|
| SegFormer | MiT-B5 | 82.4 | 84.0 |
| Mask2Former | Swin-L | 83.3 | 84.3 |
| OneFormer | Swin-L | 83.0 | 84.4 |
| OneFormer | ConvNeXt-XL | 83.6 | 84.6 |
| DDP | ConvNeXt-L | 83.2 | 83.9 |
| **Ours** | ViT-L | **84.8** | **86.2** |

*s.s.: single-scale inference; m.s.: multi-scale inference.*

### ADE20K

| Method | Encoder | mIoU |
|:---|:---|:---:|
| SegFormer | MiT-B5 | 51.0 |
| Mask2Former | Swin-L | 56.4 |
| UperNet | BEiT-L | 56.3 |
| ViT-Adapter | BEiT-L | 58.3 |
| OneFormer | Swin-L | 57.4 |
| OneFormer | ConvNeXt-XL | 57.4 |
| **Ours** | ViT-L | **59.4** |

## Pre-trained models

## Installation

Please refer to MMSegmentation for installation instructions. Do not forget to install `mmdet` to support Mask2Former:

```bash
pip install "mmdet>=3.0.0rc4"
```
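Optionally, a quick sanity check that both toolboxes are importable:

```python
# Verify that MMSegmentation and MMDetection (required for Mask2Former) are installed
import mmseg
import mmdet

print("mmseg:", mmseg.__version__)
print("mmdet:", mmdet.__version__)
```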

After installation, please follow the MMSegmentation instructions for training or inference with our pre-trained models.
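For reference, single-image inference with the MMSegmentation 1.x Python API looks roughly like the sketch below; the config and checkpoint filenames are hypothetical placeholders for the files released with this repo.

```python
from mmseg.apis import init_model, inference_model, show_result_pyplot

# Hypothetical filenames -- substitute the actual config and checkpoint from this repo
config_file = "configs/depth_anything_cityscapes_config.py"
checkpoint_file = "depth_anything_cityscapes.pth"

# Build the model, run inference on one image, and save the colored prediction
model = init_model(config_file, checkpoint_file, device="cuda:0")
result = inference_model(model, "demo.png")
show_result_pyplot(model, "demo.png", result, out_file="prediction.png", show=False)
```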