doc update (#434)

BeachWang authored Sep 24, 2024
1 parent 97d70e1 commit 18f2248
Showing 6 changed files with 131 additions and 6 deletions.
16 changes: 15 additions & 1 deletion configs/data_juicer_recipes/README.md
@@ -41,7 +41,8 @@ We use a simple 3-σ rule to set the hyperparameters for the ops in each recipe.
| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | [llava-pretrain-refine.yaml](llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer) | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
| Data-Juicer-T2V | 1,217,346 | 147,176 | 12.09% | [2_multi_op_pipline.yaml](../demo/bench/2_multi_op_pipline.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
| Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | [data-juicer-sandbox-optimal.yaml](data-juicer-sandbox-optimal.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
| Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | [data-juicer-sandbox-self-evolution.yaml](data-juicer-sandbox-self-evolution.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (2,599k)](https://github.com/snap-research/Panda-70M) <br> [Pexels (198k)](https://github.com/cj-mills/pexels-dataset) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
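
The 3-σ rule mentioned above can be illustrated with a short sketch: an op's thresholds are derived from the distribution of a per-sample statistic collected in an analysis pass over the raw dataset. The helper below is a hypothetical illustration of that step (the function name and synthetic data are assumptions), not code shipped with these recipes.

```python
import numpy as np

def three_sigma_bounds(stat_values):
    """Derive keep-range bounds for one op's statistic via the 3-sigma rule.

    stat_values: per-sample values of a single statistic (e.g. frame-text
    similarity) gathered from an analysis run over the raw dataset.
    """
    values = np.asarray(stat_values, dtype=float)
    mu, sigma = values.mean(), values.std()
    # Samples outside [mu - 3*sigma, mu + 3*sigma] are treated as outliers;
    # the bounds become the op's min/max thresholds (clipped to a valid range).
    return mu - 3 * sigma, mu + 3 * sigma

# Example with synthetic similarity scores:
rng = np.random.default_rng(0)
low, high = three_sigma_bounds(rng.normal(0.45, 0.05, size=10_000))
print(f"min_score ≈ {low:.3f}, max_score ≈ {high:.3f}")
```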

### Evaluation Results
- LLaVA pretrain (LCS-558k): models **pretrained with the refined dataset** and fine-tuned with the original instruct dataset outperform the baseline (LLaVA-1.5-13B) on 10 out of 12 benchmarks.
@@ -50,6 +51,19 @@ We use a simple 3-σ rule to set the hyperparameters for the ops in each recipe.
|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-13B <br> (baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B <br> (refined pretrain dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |
- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): models **trained with the refined datasets** outperform the baseline ([T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo)) on [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard). T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is in turn the teacher model of Data-Juicer (DJ, 228k). Please refer to [Sandbox](../../docs/Sandbox.md) for more details.

| model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- |
| T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | **51.67** | **68.92** |
| Data-Juicer (DJ, 228k) | **82.53** | **83.38** | **79.13** | **97.92** | **99.27** | **98.14** | **97.77** | 38.89 | 67.39 |

| model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|-------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| T2V-Turbo | **72.49** | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | **95.60** | 94.06 | 46.95 | **57.57** | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k) | 70.41 | **96.44** | **64.51** | 95.40 | **95.51** | **47.17** | 57.30 | **25.55** | **26.82** | **29.25** |

## For Video Dataset

17 changes: 16 additions & 1 deletion configs/data_juicer_recipes/README_ZH.md
@@ -41,7 +41,8 @@
| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | [llava-pretrain-refine.yaml](llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer) | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
| Data-Juicer-T2V | 1,217,346 | 147,176 | 12.09% | [2_multi_op_pipline.yaml](../demo/bench/2_multi_op_pipline.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
| Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | [data-juicer-sandbox-optimal.yaml](data-juicer-sandbox-optimal.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
| Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | [data-juicer-sandbox-self-evolution.yaml](data-juicer-sandbox-self-evolution.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (2,599k)](https://github.com/snap-research/Panda-70M) <br> [Pexels (198k)](https://github.com/cj-mills/pexels-dataset) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |

### Evaluation Results
- LLaVA pretrain (LCS-558k): the model **pretrained with the refined pretraining dataset** and fine-tuned with the original instruct dataset outperforms the baseline LLaVA-1.5-13B on 10 out of 12 benchmarks.
@@ -51,6 +52,20 @@
| LLaVA-1.5-13B <br> (baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B <br> (refined pretrain dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |

- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): models **trained with the refined datasets** outperform the baseline [T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo) across the board on [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard). Here, T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). Please refer to [Sandbox](../../docs/Sandbox-ZH.md) for more details.

| model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- |
| T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | **51.67** | **68.92** |
| Data-Juicer (DJ, 228k) | **82.53** | **83.38** | **79.13** | **97.92** | **99.27** | **98.14** | **97.77** | 38.89 | 67.39 |

| model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|-------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| T2V-Turbo | **72.49** | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | **95.60** | 94.06 | 46.95 | **57.57** | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k) | 70.41 | **96.44** | **64.51** | 95.40 | **95.51** | **47.17** | 57.30 | **25.55** | **26.82** | **29.25** |

## For Video Dataset

We provide an example recipe for processing video datasets to help users make better use of the video-related OPs: [general-video-refine-example.yaml](general-video-refine-example.yaml). Three types of OPs are applied here:
29 changes: 29 additions & 0 deletions configs/data_juicer_recipes/data-juicer-sandbox-optimal.yaml
@@ -0,0 +1,29 @@
# global parameters
project_name: 'Data-Juicer-recipes-T2V-optimal'
dataset_path: '/path/to/your/dataset' # path to your dataset directory or file
export_path: '/path/to/your/dataset.jsonl'

np: 4 # number of subprocess to process your dataset

# process schedule
# a list of several process operators with their arguments
process:
  - video_nsfw_filter:
      hf_nsfw_model: Falconsai/nsfw_image_detection
      score_threshold: 0.000195383
      frame_sampling_method: uniform
      frame_num: 3
      reduce_mode: avg
      any_or_all: any
      mem_required: '1GB'
  - video_frames_text_similarity_filter:
      hf_clip: openai/clip-vit-base-patch32
      min_score: 0.306337
      max_score: 1.0
      frame_sampling_method: uniform
      frame_num: 3
      horizontal_flip: false
      vertical_flip: false
      reduce_mode: avg
      any_or_all: any
      mem_required: '10GB'
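
As a rough guide to how the parameters above interact, the sketch below re-implements the keep/drop decision of an NSFW-style video filter in plain Python. It is an illustrative assumption based on the meaning of `frame_num`, `reduce_mode`, `score_threshold`, and `any_or_all`, not the library's actual implementation.

```python
from typing import Callable, Dict, List

def keep_sample(per_video_frame_scores: List[List[float]],
                score_threshold: float,
                reduce_mode: str = "avg",
                any_or_all: str = "any") -> bool:
    """Schematic keep/drop decision of an NSFW-style video filter.

    per_video_frame_scores: one list of per-frame NSFW scores per video in
    the sample (e.g. frame_num=3 uniformly sampled frames per video).
    """
    reducers: Dict[str, Callable[[List[float]], float]] = {
        "avg": lambda scores: sum(scores) / len(scores),
        "max": max,
        "min": min,
    }
    reduce_fn = reducers[reduce_mode]
    # A video "passes" if its reduced frame score stays below the threshold.
    passes = [reduce_fn(scores) <= score_threshold
              for scores in per_video_frame_scores]
    # 'any': keep the sample if at least one video passes;
    # 'all': keep it only if every video passes.
    return any(passes) if any_or_all == "any" else all(passes)

# With the recipe's threshold above, a single video whose three sampled
# frames average below 0.000195383 would be kept.
print(keep_sample([[1e-4, 1.5e-4, 2e-4]], score_threshold=0.000195383))  # True
```
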
47 changes: 47 additions & 0 deletions configs/data_juicer_recipes/data-juicer-sandbox-self-evolution.yaml
@@ -0,0 +1,47 @@
# global parameters
project_name: 'Data-Juicer-recipes-T2V-evolution'
dataset_path: '/path/to/your/dataset' # path to your dataset directory or file
export_path: '/path/to/your/dataset.jsonl'

np: 4 # number of subprocess to process your dataset

# process schedule
# a list of several process operators with their arguments
process:
  - video_nsfw_filter:
      hf_nsfw_model: Falconsai/nsfw_image_detection
      score_threshold: 0.000195383
      frame_sampling_method: uniform
      frame_num: 3
      reduce_mode: avg
      any_or_all: any
      mem_required: '1GB'
  - video_frames_text_similarity_filter:
      hf_clip: openai/clip-vit-base-patch32
      min_score: 0.306337
      max_score: 1.0
      frame_sampling_method: uniform
      frame_num: 3
      horizontal_flip: false
      vertical_flip: false
      reduce_mode: avg
      any_or_all: any
      mem_required: '10GB'
  - video_motion_score_filter:
      min_score: 3
      max_score: 20
      sampling_fps: 2
      any_or_all: any
  - video_aesthetics_filter:
      hf_scorer_model: shunk031/aesthetics-predictor-v2-sac-logos-ava1-l14-linearMSE
      min_score: 0.418164
      max_score: 1.0
      frame_sampling_method: 'uniform'
      frame_num: 3
      reduce_mode: avg
      any_or_all: any
      mem_required: '1500MB'
  - video_duration_filter:
      min_duration: 2
      max_duration: 100000
      any_or_all: any
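
Both recipes are applied like any other Data-Juicer config, typically with `python tools/process_data.py --config <recipe>.yaml` from the repository root. The sketch below mirrors that entry script; the `init_configs`/`Executor` interfaces are an assumption and may differ across `data-juicer` versions.

```python
# Minimal sketch of running one of the recipes above programmatically.
# Assumes data-juicer is installed and that data_juicer.config.init_configs and
# data_juicer.core.Executor expose the same interface as tools/process_data.py.
from data_juicer.config import init_configs
from data_juicer.core import Executor


def main():
    # Parses --config (and any CLI overrides), e.g.
    #   --config configs/data_juicer_recipes/data-juicer-sandbox-optimal.yaml
    cfg = init_configs()
    executor = Executor(cfg)
    executor.run()  # applies the process list and writes export_path


if __name__ == '__main__':
    main()
```
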
14 changes: 12 additions & 2 deletions docs/Sandbox-ZH.md
@@ -1,7 +1,17 @@
# User Guide
## Applications and Achievements
Using the Data-Juicer sandbox laboratory suite, we tune both data and models through a systematic R&D workflow between them; please refer to the [paper](http://arxiv.org/abs/2407.11784) for this work. In it, we reached a new top-1 on the [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) text-to-video leaderboard. The model has been released on [ModelScope](https://modelscope.cn/models/Data-Juicer/Data-Juicer-T2V) and [HuggingFace](https://huggingface.co/datajuicer/Data-Juicer-T2V), and the [dataset](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) used to train it is also open-sourced.
![top-1_in_vbench](https://img.alicdn.com/imgextra/i3/O1CN01Ssg83y1EPbDgTzexn_!!6000000000344-2-tps-2966-1832.png)
Using the Data-Juicer sandbox laboratory suite, we tune both data and models through a systematic R&D workflow between them; please refer to the [paper](http://arxiv.org/abs/2407.11784) for this work. In it, we reached a new top-1 on the [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) text-to-video leaderboard.
![top-1_in_vbench](https://img.alicdn.com/imgextra/i1/O1CN01I9wHW91UNnX9wtCWu_!!6000000002506-2-tps-1275-668.png)

The models have been released on ModelScope and HuggingFace, and the datasets used to train them are also open-sourced.

| Open-source model or dataset | Link | Description |
| ------------ | --- | --- |
| Data-Juicer (T2V, 147k) | [ModelScope](https://modelscope.cn/models/Data-Juicer/Data-Juicer-T2V) <br> [HuggingFace](https://huggingface.co/datajuicer/Data-Juicer-T2V) | Corresponds to the Data-Juicer (T2V-Turbo) model on the leaderboard |
| Data-Juicer (DJ, 228k) | [ModelScope](https://modelscope.cn/models/Data-Juicer/Data-Juicer-T2V) <br> [HuggingFace](https://huggingface.co/datajuicer/Data-Juicer-T2V) | Corresponds to the Data-Juicer (2024-09-23, T2V-Turbo) model on the leaderboard |
| data_juicer_t2v_optimal_data_pool | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | Training set of Data-Juicer (T2V, 147k) |
| data_juicer_t2v_evolution_data_pool | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool) | Training set of Data-Juicer (2024-09-23, T2V-Turbo) |

To reproduce the experiments in the paper, please refer to the sandbox usage guide below, the experiment flow shown in the figure below, the [initial dataset](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_init_data_pool.zip), and the demo configuration files for this workflow: [1_single_op_pipline.yaml](../configs/demo/bench/1_single_op_pipline.yaml), [2_multi_op_pipline.yaml](../configs/demo/bench/2_multi_op_pipline.yaml), and [3_duplicate_pipline.yaml](../configs/demo/bench/3_duplicate_pipline.yaml). A minimal run sketch follows the figure below.
![bench_bottom_up](https://img.alicdn.com/imgextra/i3/O1CN01ZwtQuG1sdPnbYYVhH_!!6000000005789-2-tps-7838-3861.png)
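
The sketch below assumes the sandbox entry point is `tools/sandbox_starter.py` with a `--config` argument, as described in this guide, and simply drives the three demo configs in sequence; treat the exact entry point and arguments as assumptions to be checked against your checkout.

```python
# Hypothetical driver for the three demo bench pipelines referenced above.
# Assumes tools/sandbox_starter.py accepts --config; run from the repo root.
import subprocess

BENCH_CONFIGS = [
    "configs/demo/bench/1_single_op_pipline.yaml",
    "configs/demo/bench/2_multi_op_pipline.yaml",
    "configs/demo/bench/3_duplicate_pipline.yaml",
]

for cfg in BENCH_CONFIGS:
    # Each config corresponds to one stage of the experiment flow in the figure.
    subprocess.run(["python", "tools/sandbox_starter.py", "--config", cfg],
                   check=True)
```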
