

VLTVG-PyTorch with DDP, Horovod, and DeepSpeed

This repository contains PyTorch implementations of Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning using Distributed Data Parallel (DDP), Horovod, and DeepSpeed. These implementations are intended for use on ALCF systems. Follow the instructions below to get started.

Common Setup

1. Dataset Preparation

Prepare the datasets as instructed in the VLTVG repository.

2. Conda Environment Setup

Load the appropriate Conda module and activate the environment:

# For PyTorch and Horovod
module load conda 
conda activate

# For DeepSpeed
module load conda/2023-01-10-unstable
conda activate

3. Python Virtual Environment Setup

First Time Setup:

Create and activate the Python virtual environment, then install the required packages:

# Create Python virtual environment
python -m venv --system-site-packages vltvg
source vltvg/bin/activate

# Install required packages
pip install -r requirements.txt

Activation:

For subsequent sessions, activate the virtual environment with:

source vltvg/bin/activate

Running with Different Implementations

PyTorch DDP

aprun -n 8 -N 4 python train_ddp.py --config configs/VLTVG_R101_referit_ddp.py --checkpoint_latest --checkpoint_best
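Under the hood, train_ddp.py presumably follows the standard PyTorch DDP pattern. A minimal sketch of that pattern (illustrative only, not the repo's exact code; the model is a stand-in, and the rank environment variables depend on the launcher and site configuration):

# Minimal DDP setup sketch; not the exact code in train_ddp.py.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# aprun starts one Python process per rank; rank and world size are
# typically derived from launcher-provided environment variables.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(10, 2).cuda()        # stand-in for the VLTVG model
model = DDP(model, device_ids=[local_rank])  # gradients sync during backward()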

Horovod

aprun -n 8 -N 4 python train_hvd.py --config configs/VLTVG_R101_referit_ddp.py --checkpoint_latest --checkpoint_best
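train_hvd.py presumably uses Horovod's standard initialization sequence; a minimal sketch (illustrative only; the model and optimizer are stand-ins):

# Minimal Horovod setup sketch; not the exact code in train_hvd.py.
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())      # pin each rank to one GPU

model = torch.nn.Linear(10, 2).cuda()        # stand-in for the VLTVG model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Average gradients across all ranks during backward().
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Keep every rank consistent with rank 0 at startup.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)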

DeepSpeed

mpiexec --verbose \
  --envall -n 8 \
  --ppn 4 \
  --hostfile "${PBS_NODEFILE}" python train_ds.py \
  --config configs/VLTVG_R101_referit_ddp.py --polaris_nodes 2 \
  --checkpoint_latest --checkpoint_best \
  --deepspeed_config scripts/deepspeed/ds_config.json
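Here, --deepspeed_config points at the JSON file that defines batch size, optimizer, precision, and related settings. train_ds.py presumably passes that config to deepspeed.initialize; a minimal sketch of the pattern (illustrative only; the model and loss are stand-ins):

# Minimal DeepSpeed setup sketch; not the exact code in train_ds.py.
import torch
import deepspeed

model = torch.nn.Linear(10, 2)               # stand-in for the VLTVG model

# deepspeed.initialize reads batch size, optimizer, precision, etc.
# from the JSON config and returns a wrapped engine.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="scripts/deepspeed/ds_config.json",
)

# The engine owns backward() and step().
x = torch.randn(4, 10).to(model_engine.device)
loss = model_engine(x).sum()
model_engine.backward(loss)
model_engine.step()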

Additional Information

For additional examples, see the scripts folder. Update directories and configurations to match your specific setup.

Below is VLTVG's original README:


Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning

This is the official implementation of Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning, published at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

Introduction

Our proposed framework for visual grounding. Given features from the two modalities as input, the visual-linguistic verification module and the language-guided context encoder establish discriminative features for the referred object. The multi-stage cross-modal decoder then iteratively reasons over all the visual and linguistic features to identify and localize the object.
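To make the data flow concrete, here is a toy, runnable sketch of that pipeline; the shapes, modules, and weighting scheme below are invented for illustration and are not the paper's actual implementation:

# Toy sketch of the described pipeline; not the actual VLTVG modules.
import torch
import torch.nn as nn

d = 256
vis = torch.randn(1, 400, d)   # visual tokens, e.g. a flattened 20x20 feature map
txt = torch.randn(1, 20, d)    # linguistic tokens of the referring expression

# Visual-linguistic verification: score each visual token against the
# sentence and suppress regions unrelated to the text.
sent = txt.mean(dim=1, keepdim=True)                # crude sentence embedding
scores = torch.sigmoid(vis @ sent.transpose(1, 2))  # (1, 400, 1)
vis = vis * scores

# Multi-stage cross-modal decoding: one object query alternately attends to
# the visual and linguistic features, refined over several stages.
query = torch.zeros(1, 1, d)
vis_attn = nn.MultiheadAttention(d, 8, batch_first=True)
txt_attn = nn.MultiheadAttention(d, 8, batch_first=True)
for _ in range(6):
    query = query + vis_attn(query, vis, vis)[0]
    query = query + txt_attn(query, txt, txt)[0]
box = nn.Linear(d, 4)(query).sigmoid()              # normalized (cx, cy, w, h)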

Visualization

For different input images and texts, we visualize the verification scores, the iterative attention maps of the multi-stage decoder, and the final localization results.

Model Zoo

The trained models are available on Google Drive.

        RefCOCO              RefCOCO+             RefCOCOg                 ReferItGame  Flickr30k
        val    testA  testB  val    testA  testB  val-g  val-u  test-u    test         test
R50     84.53  87.69  79.22  73.60  78.37  64.53  72.53  74.90  73.88     71.60        79.18
R101    84.77  87.24  80.49  74.19  78.93  65.17  72.98  76.04  74.18     71.98        79.84

Installation

  1. Clone the repository.

    git clone https://github.com/yangli18/VLTVG
  2. Install PyTorch 1.5+ and torchvision 0.6+.

    conda install -c pytorch pytorch torchvision
  3. Install the other dependencies.

    pip install -r requirements.txt

Preparation

Please refer to get_started.md for the preparation of the datasets and pretrained checkpoints.

Training

The following is an example of model training on the RefCOCOg dataset.

python -m torch.distributed.launch --nproc_per_node=4 --use_env train.py --config configs/VLTVG_R50_gref.py

We train the model on 4 GPUs with a total batch size of 64 (i.e., 16 per GPU) for 90 epochs. The model and training hyper-parameters are defined in the configuration file VLTVG_R50_gref.py. Configuration files for the other datasets are provided in the configs/ folder.

Evaluation

Run the following command to evaluate the trained model on a single GPU.

python test.py --config configs/VLTVG_R50_gref.py --checkpoint VLTVG_R50_gref.pth --batch_size_test 16 --test_split val

Or evaluate the trained model with 4 GPUs:

python -m torch.distributed.launch --nproc_per_node=4 --use_env test.py --config configs/VLTVG_R50_gref.py --checkpoint VLTVG_R50_gref.pth --batch_size_test 16 --test_split val

Citation

If you find our code useful, please cite our paper.

@inproceedings{yang2022vgvl,
  title={Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning},
  author={Yang, Li and Xu, Yan and Yuan, Chunfeng and Liu, Wei and Li, Bing and Hu, Weiming},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2022}
}

Acknowledgement

Part of our code is based on the previous works DETR and ReSC.