World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering

Jiacong Wang¹,²*, Bohong Wu²*, Haiyong Jiang¹, Xun Zhou², Xin Xiao², Haoyuan Guo², Jun Xiao¹

¹School of Artificial Intelligence, University of Chinese Academy of Sciences, ²ByteDance Inc

Abstract: Recent advances in Vision-Language Models (VLMs) and the scarcity of high-quality multi-modal alignment data have inspired much research on synthetic VLM data generation. The conventional approach to VLM data construction relies on a mixture of caption and OCR specialists, or on stronger VLM APIs and expensive human annotation. Challenging this norm, we propose to use the VLM itself to extract cross-modal information from each image via different prompts, and then to filter the generated outputs, again with the same VLM, via a consistency filtering strategy. In this paper, we present World to Code (W2C), a meticulously curated multi-modal data construction pipeline that organizes the final generation output into a Python code format. Experiments demonstrate the high quality of W2C: it improves performance on various existing visual question answering and visual grounding benchmarks across different VLMs. Further analysis also shows that the new code parsing ability of VLMs presents better cross-modal equivalence than the commonly used detail caption ability.

Todo (Coming Soon)

Data Generation Pipeline (WIP)

Training Code (WIP)

News and Updates

Results

We provide a comparison of results for LLaVA-NeXT here.

1. Customize base settings

Before training, customize the settings listed in the following table; otherwise, the code uses the default paths specified in `run.sh`. When using multiple data sources, concatenate their paths, separated by a space.

| Setting | Usage |
| --- | --- |
| `base_dir` | Root directory under which outputs are saved |
| `exp_name` | Experiment name, associated with the saving path |
| `pretrain_json` | Pretraining JSON data |
| `pretrain_imagedir` | Image directory for the pretraining data |
| `finetune_json` | Finetuning JSON data |
| `finetune_imagedir` | Image directory for the finetuning data |
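As a minimal sketch of the settings above, the snippet below assigns each one as a shell variable. The variable names come from the table; every path and the experiment name are hypothetical placeholders, and the exact way `run.sh` reads these values is an assumption, so check the script itself before relying on this layout. The `count_words` helper and the check that JSON files and image directories are listed in matching numbers are illustrative additions, not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical values: replace every path and name with your own.
base_dir="/path/to/output_root"
exp_name="w2c_demo"

# Multiple data sources: paths concatenated with a single space.
pretrain_json="/data/pretrain/a.json /data/pretrain/b.json"
pretrain_imagedir="/data/pretrain/images_a /data/pretrain/images_b"
finetune_json="/data/finetune/sft.json"
finetune_imagedir="/data/finetune/images"

# Illustrative helper (not from the repo): count space-separated entries.
count_words() { set -- $1; echo "$#"; }

# Assumption: each JSON source is paired with one image directory.
n_json=$(count_words "$pretrain_json")
n_img=$(count_words "$pretrain_imagedir")
if [ "$n_json" -ne "$n_img" ]; then
  echo "warning: pretrain_json and pretrain_imagedir list different numbers of sources" >&2
fi

echo "experiment: ${exp_name} (root: ${base_dir})"
```

Because entries are separated by spaces, individual paths in this sketch must not contain spaces themselves.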

Acknowledgement

LLaVA: the codebase we built upon.
lmms-eval: the codebase we use to evaluate our model.

Many thanks to their authors for the great work.

Citation

@inproceedings{
anonymous2024world,
title={World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering},
author={Anonymous},
booktitle={The 2024 Conference on Empirical Methods in Natural Language Processing},
year={2024},
url={https://openreview.net/forum?id=lEoTofDOZx}
}
