
World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering

Jiacong Wang1,2*, Bohong Wu2*, Haiyong Jiang1, Xun Zhou2, Xin Xiao2, Haoyuan Guo2, Jun Xiao1

1School of Artificial Intelligence, University of Chinese Academy of Sciences, 2ByteDance Inc

Abstract: Recent advances in Vision-Language Models (VLMs) and the scarcity of high-quality multi-modal alignment data have inspired numerous works on synthetic VLM data generation. Challenging the conventional norm in VLM data construction, which relies on a mixture of caption and OCR specialists, stronger VLM APIs, or expensive human annotation, we propose to leverage the VLM itself to extract cross-modal information from each image via different prompts, and to filter the generated outputs, again by the VLM itself, via a consistency filtering strategy. In this paper, we present World to Code (W2C), a meticulously curated multi-modal data construction pipeline that organizes the final generation output into a Python code format. Experiments demonstrate the high quality of W2C, which improves various existing visual question answering and visual grounding benchmarks across different VLMs. Further analysis also shows that the new code parsing ability of VLMs exhibits better cross-modal equivalence than the commonly used detail caption ability.

Todo (Coming Soon)

  • Data Generation Pipeline (WIP)
  • Training Code (WIP)

News and Updates

Results

We provide a results comparison for LLaVA-NEXT here.

1. Customize base settings

Before training, customize the settings listed in the following table; otherwise, the code falls back to the default paths specified in run.sh. When using multiple data sources, concatenate their paths separated by spaces (see the sketch after the table).

Setting             Usage
base_dir            Root directory for saved outputs
exp_name            Experiment name, used in the save path
pretrain_json       Pretrain JSON data
pretrain_imagedir   Pretrain data image directory
finetune_json       Finetune JSON data
finetune_imagedir   Finetune data image directory
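
The authoritative variable names and defaults live in run.sh. As a minimal sketch, assuming run.sh assigns these settings as plain shell variables, an override block might look like the following; every path and name below is a placeholder, not a file shipped with this repository:

    # Hypothetical excerpt of run.sh: adjust these before launching training.
    base_dir=/path/to/output_root                  # root directory for saved checkpoints and logs
    exp_name=w2c_experiment                        # experiment name, used in the save path
    pretrain_json=/path/to/pretrain.json           # pretrain JSON data
    pretrain_imagedir=/path/to/pretrain_images     # pretrain image directory
    # Multiple data sources: concatenate their paths with a space.
    finetune_json="/path/to/finetune_a.json /path/to/finetune_b.json"
    finetune_imagedir="/path/to/images_a /path/to/images_b"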

Acknowledgement

  • LLaVA: the codebase we built upon.
  • lmms-eval: the codebase we use to evaluate our model.

Thanks a lot for their great work.

Citation

@inproceedings{anonymous2024world,
  title={World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering},
  author={Anonymous},
  booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024},
  url={https://openreview.net/forum?id=lEoTofDOZx}
}
