GitHub - NUS-HPC-AI-Lab/InfoGrowth: Efficient and Online Dataset Growth Algorithm (with cleanness and diversity awareness) to deal with growing web data

Dataset Growth

ECCV 2024 | [Paper] | [Code]

InfoGrowth is an efficient online algorithm for handling growing web data. It makes dataset growth aware of both cleanness and diversity. For BLIP training on CC3M, it can provide a 14x acceleration by combining data reduction with efficient sampling.


InfoGrowth and Processed Data

The algorithm is now updated in code/InfoGrowth.ipynb.

We provide our cleaned 400k samples in processed_data. The selected images and captions are listed in JSON format.
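
A minimal sketch of reading such a file, assuming it holds a JSON list of records with `image` and `caption` keys. The key names, the `load_pairs` helper, and the inline sample are assumptions for illustration; check the files in processed_data for the actual schema.

```python
import json


def load_pairs(json_text):
    """Parse a JSON list of records into (image_path, caption) tuples.

    The "image" and "caption" keys are assumed, not confirmed by the repo.
    """
    records = json.loads(json_text)
    return [(record["image"], record["caption"]) for record in records]


# Hypothetical inline sample standing in for a file from processed_data.
sample = '[{"image": "cc3m/000001.jpg", "caption": "a dog on a beach"}]'
pairs = load_pairs(sample)
print(len(pairs), pairs[0])
```

For a real file, pass the result of `open(path, encoding="utf-8").read()` to `load_pairs` and adjust the keys to match what the file contains.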

Experiments

Download Data/Model

You need to prepare the CC3M dataset and the BLIP encoders.

Download CC3M

Refer to https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc3m.md

Download Model Checkpoint

The model checkpoint is downloaded automatically by the code. Make sure you have an internet connection.

Preprocessing

We use lmdb to accelerate data loading. It requires a one-time preprocessing step:

python3 code/scripts/cc3m_lmdb_writer.py --image_root your_path
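
To illustrate the idea behind this step, here is a generic sketch of packing image-caption pairs into an LMDB database. It is not the code in cc3m_lmdb_writer.py: the record layout, the key scheme, and the helper names (`encode_sample`, `decode_sample`, `write_lmdb`) are all assumptions, and `write_lmdb` needs the third-party `lmdb` package.

```python
import pickle


def encode_sample(image_bytes, caption):
    """Serialize one image-caption pair into a single LMDB value (assumed layout)."""
    return pickle.dumps({"image": image_bytes, "caption": caption})


def decode_sample(value):
    """Inverse of encode_sample: recover (image_bytes, caption)."""
    record = pickle.loads(value)
    return record["image"], record["caption"]


def write_lmdb(samples, lmdb_path, map_size=1 << 40):
    """Write an iterable of (image_bytes, caption) pairs to an LMDB database."""
    import lmdb  # third-party dependency: pip install lmdb

    env = lmdb.open(lmdb_path, map_size=map_size)
    with env.begin(write=True) as txn:
        count = 0
        for index, (image_bytes, caption) in enumerate(samples):
            # Keys are the sample index as ASCII bytes (hypothetical scheme).
            txn.put(str(index).encode("ascii"), encode_sample(image_bytes, caption))
            count = index + 1
        # Store the number of samples so a reader can size its dataset.
        txn.put(b"__len__", str(count).encode("ascii"))
    env.close()


value = encode_sample(b"\xff\xd8", "a cat")
print(decode_sample(value))
```

Storing every sample in one memory-mapped database replaces many small file reads with key lookups, which is usually where the loading speedup comes from.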

Pretrain

Go to the code directory and launch pretraining:

cd code
python3 -m torch.distributed.run --nnodes 2 --nproc_per_node 8 --master_port 12365 pretrain_gain.py --config ./configs/pretrain_cc3m_gain.yaml
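
With `--nnodes 2 --nproc_per_node 8`, the launcher starts 16 worker processes in total, and the command is run on each of the two nodes. Multi-node runs typically also need node rank and master address arguments for your cluster. Each worker learns its place from the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables set by `torch.distributed.run`. The sketch below shows how a script might read them; the `distributed_context` helper is hypothetical and not taken from pretrain_gain.py.

```python
import os


def distributed_context(env=None):
    """Read the rank variables that torch.distributed.run sets for each worker."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global index of this process
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index within the node
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }


nnodes, nproc_per_node = 2, 8
expected_world_size = nnodes * nproc_per_node

# Simulated environment of the last worker on the second node.
ctx = distributed_context({"RANK": "15", "LOCAL_RANK": "7", "WORLD_SIZE": "16"})
print(expected_world_size, ctx)
```

To run on a single machine, set `--nnodes 1` and set `--nproc_per_node` to the number of available GPUs.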

Eval

To evaluate the pretrained model on COCO, go to the code directory and run the following commands, substituting your own checkpoint path.

cd code
TEST_CKPT=/path/to/test_checkpoint.pth
python3 -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \
    --config ./configs/retrieval_coco.yaml \
    --output_dir output/retrieval_coco \
    --pretrained $TEST_CKPT \
    --evaluate
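
Image-text retrieval on COCO is commonly scored with Recall@K. The following is a generic sketch of that metric on a toy similarity matrix. It is not the evaluation code in train_retrieval.py, and the `recall_at_k` helper and the toy scores are illustrative assumptions.

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose correct item appears among the top-k scores.

    similarity[i][j] is the score between query i and candidate j;
    ground_truth[i] is the set of correct candidate indices for query i.
    """
    hits = 0
    for scores, correct in zip(similarity, ground_truth):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if correct & set(ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy example: 3 queries scored against 3 candidates.
similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
ground_truth = [{0}, {1}, {1}]
r1 = recall_at_k(similarity, ground_truth, 1)
r2 = recall_at_k(similarity, ground_truth, 2)
print(r1, r2)
```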

Citation

@inproceedings{qin2024datasetgrowth,
      title={Dataset Growth}, 
      author={Ziheng Qin and Zhaopan Xu and Yukun Zhou and Zangwei Zheng and Zebang Cheng and Hao Tang and Lei Shang and Baigui Sun and Xiaojiang Peng and Radu Timofte and Hongxun Yao and Kai Wang and Yang You},
      booktitle={ECCV},
      year={2024}
}
