COST Dataset

The COST dataset includes the following components for training and evaluating MLLMs on object-level perception tasks:

  • RGB Images obtained from the COCO-2017 dataset.
  • Segmentation Maps for semantic, instance, and panoptic segmentation tasks, obtained using the publicly available DiNAT-L OneFormer model trained on the COCO dataset.
  • Questions obtained by prompting GPT-4 for object identification and object order perception tasks. You can find the questions in questions.py.
  • Depth Maps obtained using the publicly available ViT-L/14 distilled variant of the DINOv2 DPT model trained on the NYUd dataset.

We represent the information from the segmentation maps and depth maps in text form to obtain the final question-answer pairs. Please refer to Sec 3.1 in our paper for more details.
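As a rough illustration of what "representing maps in text form" can look like, here is a minimal sketch. The sentence templates, the function names `objects_to_text` and `depth_order_to_text`, and the convention that a smaller depth value means a nearer object are all assumptions made for this example. They are not the exact format used to build COST; Sec 3.1 of the paper describes the actual procedure.

```python
from collections import Counter


def objects_to_text(labels):
    """Summarize per-segment class labels as one sentence.

    `labels` holds one class name per segment found in a segmentation map.
    The sentence template is hypothetical, not the COST answer format.
    """
    counts = Counter(labels)  # keeps first-seen order of class names
    parts = [f"{name} ({n})" for name, n in counts.items()]
    return "Objects in the image: " + ", ".join(parts) + "."


def depth_order_to_text(mean_depths):
    """List objects from nearest to farthest.

    `mean_depths` maps an object name to its mean depth value.
    Assumption: a smaller value means the object is closer to the camera.
    """
    ordered = sorted(mean_depths, key=mean_depths.get)
    return "Depth order (near to far): " + ", ".join(ordered) + "."


if __name__ == "__main__":
    print(objects_to_text(["person", "car", "person"]))
    print(depth_order_to_text({"car": 5.0, "person": 1.5, "tree": 9.0}))
```

Strings of this kind would then be paired with a question to form one question-answer pair.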

We provide different splits of the COST dataset for training and evaluation.

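| Split | Number of Images | Number of QnA pairs | Splits from COCO                   |
|-------|------------------|---------------------|------------------------------------|
| train | 280k             | 280k                | train2017, test2017, unlabeled2017 |
| val   | 5k               | 5k                  | val2017                            |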

File Structure

coco_segm_text
    ├── depth
    │   └── test
    │   │   └── ...
    │   └── train
    │   │   └── depth # contains depth maps for the train2017 split
    │   │   └── panoptic_order.txt # contains answers for object order perception task on images in train2017 split
    │   └── unlabeled
    │   │   └── ...
    │   └── val
    │   │   └── ...
    ├── test
    │   └── ...
    ├── train
    │   └── instance_inference # contains instance masks for train2017 split
    │   └── instance.txt # contains answers for instance object identification task on images in train2017 split
    │   └── panoptic_inference # contains panoptic masks for train2017 split
    │   └── panoptic.txt # contains answers for panoptic object identification task on images in train2017 split
    │   └── semantic_inference # contains semantic masks for train2017 split
    │   └── semantic.txt # contains answers for semantic object identification task on images in train2017 split
    ├── unlabeled
    │   └── ...
    ├── val
    │   └── ...

Citation

If you use the COST dataset, please consider starring ⭐ us on GitHub and citing 📚 us in your research!

@article{jain2023vcoder,
    title={{VCoder: Versatile Vision Encoders for Multimodal Large Language Models}},
    author={Jitesh Jain and Jianwei Yang and Humphrey Shi},
    journal={arXiv},
    year={2023}
}