by Long Lian, Zhirong Wu and Stella X. Yu at UC Berkeley, MSRA, and UMich
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[Paper] | [Project Page] | [Presentation Video] | [Demo Video] | [Poster] | [Citation]
This GIF has been significantly compressed. Check out our video at full resolution here. Inference in this demo is done per-frame without post-processing for temporal consistency.
If you want to qualitatively compare segmentation masks from our method without running our code, you can download the segmentation masks here.
Download DAVIS 2016 and unzip to data/data_davis.
Download pre-extracted flow from RAFT (trained with chairs and things) and decompress to data/data_davis.
Download DenseCL ResNet50 weights to data/pretrained/densecl_r50_imagenet_200ep.pth.
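After these steps, the paths should look roughly like this (the exact contents of data/data_davis depend on the downloaded archives):
ls data/data_davis                                  # DAVIS 2016 frames plus pre-extracted RAFT flow
ls data/pretrained/densecl_r50_imagenet_200ep.pth   # DenseCL ResNet50 weights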
SegTrackv2 and FBMS59 datasets
These two datasets have much lower quality and very different aspect ratios across sequences. To simplify things, we resize them to 480p (854x480) so that they have the same input size as DAVIS 2016. For fairness, testing is still performed on the original datasets, and we provide both the original and scaled datasets (with flows for the scaled datasets). There are also larger inter-run variations on these two datasets than on DAVIS 2016, since the video quality is lower and/or the number of sequences is smaller. We recommend using DAVIS16 as the main metric and these two as supplementary metrics. For reproducibility, checkpoints for both stages on all three datasets have been released.
Download SegTrackv2 with pre-extracted flow.
Download FBMS59 with pre-extracted flow part 1 and FBMS59 with pre-extracted flow part 2. Merge the two parts with cat FBMS59_all.tgz.part1 FBMS59_all.tgz.part2 > FBMS59_all.tgz before extracting. The shasum of the merged file is e898c127f916de867dad665bbb04e21702e54e7c.
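For example, assuming both parts are in the current directory, merging, verifying, and extracting look like this:
cat FBMS59_all.tgz.part1 FBMS59_all.tgz.part2 > FBMS59_all.tgz
shasum FBMS59_all.tgz   # expect e898c127f916de867dad665bbb04e21702e54e7c
tar -xzf FBMS59_all.tgz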
The requirements.txt assumes CUDA 11.x. You can also install torch and torchvision from conda instead of pip.
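For example, a possible conda-based install is sketched below (the version pins are only illustrative; check the official PyTorch instructions for the command matching your CUDA version):
conda install pytorch torchvision cudatoolkit=11.3 -c pytorch
pip install -r requirements.txt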
torchCRF is a GPU CRF implementation that is typically faster than a CPU implementation. If you plan to use this implementation in your work, see tools/torchCRF/README.md for the license.
pip install -r requirements.txt
cd tools/torchCRF
python setup.py install
We also require the parallel command from moreutils. If your parallel does not work (for example, because it comes from the parallel package), install moreutils either from your system package manager (e.g., APT on Ubuntu/Debian) or from conda: conda install -c conda-forge moreutils.
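A quick way to check which parallel is on your PATH on Debian/Ubuntu (adjust for other systems; conda installs will not show up here):
dpkg -S "$(which parallel)"   # should report the moreutils package, not GNU parallel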
We provide pretrained models and prediction masks. If you intend to work on a custom dataset that is out-of-distribution with respect to our training data (such as DAVIS16), we suggest training or fine-tuning our model on the new dataset.
Name | Dataset | Backbone | mIoU (w/o pp.) | mIoU (w/ pp.) | Model | Masks |
---|---|---|---|---|---|---|
RCF (All stages) | DAVIS16 | ResNet50 | 80.9 | 83.0 | Download | Download |
RCF (Stage 1 only) | DAVIS16 | ResNet50 | 78.9 | 81.4 | Download | Download |
RCF (All stages) | SegTrackv2 | ResNet50 | 76.7 | 79.6 | Download | Download |
RCF (Stage 1 only) | SegTrackv2 | ResNet50 | 72.8 | 77.6 | Download | Download |
RCF (All stages) | FBMS59 | ResNet50 | 69.9 | 72.4 | Download | Download |
RCF (Stage 1 only) | FBMS59 | ResNet50 | 66.8 | 69.1 | Download | Download |
To evaluate a downloaded pretrained model with our main training script (unofficial evaluation), use --test-override-pretrained and --test-override-object-channel to specify the model path and the object channel, respectively. The downloaded masks can also be evaluated directly with the evaluation tools described below.
To train our model on DAVIS16 with 2 GPUs, run:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --master_addr 127.0.0.1 --master_port 9000 --nproc_per_node gpu main.py configs/rcf/rcf_stage1.yaml
This should lead to a model with an mIoU of around 78% to 79% on DAVIS16 (without post-processing). Run stage 2 as well if additional gains are desired. If you want to run with a different number of GPUs, change CUDA_VISIBLE_DEVICES and batch_size so that the total batch size (batch_size times the number of GPUs) matches your intended batch size (16 in this config).
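For example, a hypothetical 4-GPU run of stage 1 would look like this, with batch_size lowered to 4 in the config so that 4 GPUs x 4 = 16:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run --master_addr 127.0.0.1 --master_port 9000 --nproc_per_node gpu main.py configs/rcf/rcf_stage1.yaml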
SegTrackv2 and FBMS59 datasets
Training with STv2 and FBMS59 is very similar to training with DAVIS16.
STv2:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --master_addr 127.0.0.1 --master_port 9000 --nproc_per_node gpu main.py configs/rcf_stv2/rcf_stage1.yaml
FBMS59:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --master_addr 127.0.0.1 --master_port 9000 --nproc_per_node gpu main.py configs/rcf_fbms59/rcf_stage1.yaml
This stage uses a Conditional Random Field (CRF) to obtain training signals based on low-level vision (e.g., color). Prior to running this stage, we need to obtain the object channel through motion-appearance alignment.
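# maa.py appears to report the selected object channel via its exit status, which $? captures below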
CUDA_VISIBLE_DEVICES=0 python tools/SemanticConstraintsAndMAA/maa.py --pretrain_dir saved/saved_rcf_stage1 --first-frames-only --step 43200
OBJECT_CHANNEL=$?
Then we can run training (which will continue from the pretrained stage 1 model):
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --master_addr 127.0.0.1 --master_port 9000 --nproc_per_node gpu main.py configs/rcf/rcf_stage2.1.yaml --opts object_channel $OBJECT_CHANNEL
SegTrackv2 and FBMS59 datasets
Training with STv2 and FBMS59 is very similar to training with DAVIS16.
STv2:
CUDA_VISIBLE_DEVICES=0 python tools/SemanticConstraintsAndMAA/maa.py --pretrain_dir saved_stv2/saved_rcf_stage1 --first-frames-only --step 1220 --dataset stv2
OBJECT_CHANNEL=$?
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --master_addr 127.0.0.1 --master_port 9000 --nproc_per_node gpu main.py configs/rcf_stv2/rcf_stage2.1.yaml --opts object_channel $OBJECT_CHANNEL
FBMS59:
CUDA_VISIBLE_DEVICES=0 python tools/SemanticConstraintsAndMAA/maa.py --pretrain_dir saved_fbms59/saved_rcf_stage1 --first-frames-only --step 3468 --dataset fbms59 --num-channels 3
OBJECT_CHANNEL=$?
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --master_addr 127.0.0.1 --master_port 9000 --nproc_per_node gpu main.py configs/rcf_fbms59/rcf_stage2.1.yaml --opts object_channel $OBJECT_CHANNEL
This stage uses a pretrained ViT model from DINO to obtain training signals based on high-level vision (e.g., semantics discovered through unsupervised learning). Semantic constraints are enforced offline because the process is slow.
# export the predictions on trainval
CUDA_VISIBLE_DEVICES=0 python main.py configs/rcf/rcf_export_trainval_ema.yaml --test --test-override-pretrained saved/saved_rcf_stage2.1/last.ckpt --opts checkpoints_dir saved/saved_rcf_stage2.1 object_channel $OBJECT_CHANNEL
# run semantic constraints
CUDA_VISIBLE_DEVICES=0 python tools/SemanticConstraintsAndMAA/semantic_constraints.py --pretrain_dir saved/saved_rcf_stage2.1 --object-channel $OBJECT_CHANNEL
# training with semantic constraints
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --master_addr 127.0.0.1 --master_port 9000 --nproc_per_node gpu main.py configs/rcf/rcf_stage2.2.yaml --opts object_channel $OBJECT_CHANNEL train_dataset_kwargs.pl_root saved/saved_rcf_stage2.1/saved_eval_export_trainval_ema_torchcrf_ncut_torchcrf/$OBJECT_CHANNEL
This should give you an mIoU of 80% to 81% (without post-processing).
SegTrackv2 and FBMS59 datasets
STv2:
# export the predictions on trainval
CUDA_VISIBLE_DEVICES=0 python main.py configs/rcf_stv2/rcf_export_trainval_ema.yaml --test --test-override-pretrained saved_stv2/saved_rcf_stage2.1/last.ckpt --opts checkpoints_dir saved_stv2/saved_rcf_stage2.1 object_channel $OBJECT_CHANNEL
# run semantic constraints
CUDA_VISIBLE_DEVICES=0 python tools/SemanticConstraintsAndMAA/semantic_constraints.py --pretrain_dir saved_stv2/saved_rcf_stage2.1 --object-channel $OBJECT_CHANNEL --dataset stv2
# training with semantic constraints
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --master_addr 127.0.0.1 --master_port 9000 --nproc_per_node gpu main.py configs/rcf_stv2/rcf_stage2.2.yaml --opts object_channel $OBJECT_CHANNEL train_dataset_kwargs.pl_root saved_stv2/saved_rcf_stage2.1/saved_eval_export_ema_torchcrf_ncut_torchcrf/$OBJECT_CHANNEL
FBMS59:
# export the predictions on trainval
CUDA_VISIBLE_DEVICES=0 python main.py configs/rcf_fbms59/rcf_export_trainval_ema.yaml --test --test-override-pretrained saved_fbms59/saved_rcf_stage2.1/last.ckpt --opts checkpoints_dir saved_fbms59/saved_rcf_stage2.1 object_channel $OBJECT_CHANNEL
# run semantic constraints
CUDA_VISIBLE_DEVICES=0 python tools/SemanticConstraintsAndMAA/semantic_constraints.py --pretrain_dir saved_fbms59/saved_rcf_stage2.1 --object-channel $OBJECT_CHANNEL --dataset fbms59 --num-channels 3
# training with semantic constraints
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --master_addr 127.0.0.1 --master_port 9000 --nproc_per_node gpu main.py configs/rcf_fbms59/rcf_stage2.2.yaml --opts object_channel $OBJECT_CHANNEL train_dataset_kwargs.pl_root saved_fbms59/saved_rcf_stage2.1/saved_eval_export_trainval_ema_torchcrf_ncut_torchcrf/$OBJECT_CHANNEL
To evaluate a trained model with the main training script (unofficial evaluation), run:
CUDA_VISIBLE_DEVICES=0 python main.py configs/rcf/rcf_eval.yaml --test --test-override-pretrained saved/saved_rcf_stage2.2/last.ckpt --test-override-object-channel $OBJECT_CHANNEL
We encourage evaluating the model with our evaluation tool, which is intended to closely match the DAVIS 2016 official evaluation tool. To evaluate a trained model with the evaluation tool on the exported masks (stage 2.2 exports masks on the validation set by default):
python tools/davis2016-evaluation/evaluation_method.py --task unsupervised --davis_path data/data_davis --year 2016 --step 4320 --results_path saved/saved_rcf_stage2.2/saved_eval_export
To refine the exported masks from a trained model with CRF post-processing, run:
sh tools/pydenseCRF/crf_parallel.sh
Then evaluate the refined masks with evaluation tool:
python tools/davis2016-evaluation/evaluation_method.py --task unsupervised --davis_path data/data_davis --year 2016 --step 4320 --results_path saved/saved_rcf_stage2.2/saved_eval_export_crf
This should reproduce around 83% mIoU (J-FrameMean).
Evaluate on the SegTrackv2 and FBMS59 datasets
We provide a tool to evaluate exported masks for SegTrackv2 and FBMS59:
python tools/STv2-FBMS59-evaluation/eval_tool.py --dataset SegTrackv2 --pred_dir [pred dir] --step [step num]
python tools/STv2-FBMS59-evaluation/eval_tool.py --dataset FBMS59 --pred_dir [pred dir] --step [step num]
This repo also supports training AMD. However, this implementation is not guaranteed to be identical to the original one. In our experience, it reproduces results that are slightly better than the originally reported results without test-time adaptation (i.e., fine-tuning on the downstream data). To set up the training set (Youtube-VOS), simply unzip train_all_frames.zip from Youtube-VOS to data/youtube-vos/train_all_frames. The setup for the validation sets is the same as for RCF.
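For example, assuming train_all_frames.zip has been downloaded to the current directory and unpacks into a train_all_frames folder, the setup could look like this:
mkdir -p data/youtube-vos
unzip train_all_frames.zip -d data/youtube-vos/
ls data/youtube-vos/train_all_frames   # the training frames should now live here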
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.run --master_addr 127.0.0.1 --master_port 9000 --nproc_per_node gpu main.py configs/amd/amd.yaml
If you have any questions on the paper or this implementation, please contact Long Lian using the email address in the paper.
Please give us a star 🌟 on GitHub to support us!
Please cite our work if you find our work inspiring or use our code in your work:
@InProceedings{Lian_2023_CVPR,
author = {Lian, Long and Wu, Zhirong and Yu, Stella X.},
title = {Bootstrapping Objectness From Videos by Relaxed Common Fate and Visual Grouping},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {14582-14591}
}
@article{lian2022improving,
title={Improving Unsupervised Video Object Segmentation with Motion-Appearance Synergy},
author={Lian, Long and Wu, Zhirong and Yu, Stella X},
journal={arXiv preprint arXiv:2212.08816},
year={2022}
}