Merge branch 'main' of github.com:alibaba/data-juicer into dev/light_dependency
BeachWang committed Sep 24, 2024
2 parents f1cfa65 + 467cb96 commit 5e6a340
Showing 24 changed files with 206 additions and 84 deletions.
16 changes: 15 additions & 1 deletion configs/data_juicer_recipes/README.md
@@ -41,7 +41,8 @@ We use a simple 3-σ rule to set the hyperparameters for ops in each recipe.
| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | [llava-pretrain-refine.yaml](llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer) | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
| Data-Juicer-T2V | 1,217,346 | 147,176 | 12.09% | [2_multi_op_pipline.yaml](../demo/bench/2_multi_op_pipline.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
| Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | [data-juicer-sandbox-optimal.yaml](data-juicer-sandbox-optimal.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
| Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | [data-juicer-sandbox-self-evolution.yaml](data-juicer-sandbox-self-evolution.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (2,599k)](https://github.com/snap-research/Panda-70M) <br> [Pexels (198k)](https://github.com/cj-mills/pexels-dataset) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
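The 3-σ rule used to set these hyperparameters can be sketched in a few lines (illustrative stat values, not from the recipes): thresholds are placed at mean ± 3·std of an op's stat distribution, and samples whose stat falls outside the band are dropped.

```python
# Sketch of 3-sigma threshold selection for a filter op's stat.
# The stat values below are made up for illustration.
from statistics import mean, stdev

def three_sigma_bounds(values):
    """Return (low, high) thresholds at mean +/- 3 * sample std."""
    mu, sigma = mean(values), stdev(values)
    return mu - 3 * sigma, mu + 3 * sigma

stats = [0.2] * 10 + [0.25] * 10 + [5.0]  # tight cluster plus one outlier
lo, hi = three_sigma_bounds(stats)
kept = [v for v in stats if lo <= v <= hi]  # the outlier is dropped
```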

### Evaluation Results
- LLaVA pretrain (LCS-558k): models **pretrained with the refined dataset** and fine-tuned with the original instruction dataset outperform the baseline (LLaVA-1.5-13B) on 10 out of 12 benchmarks.
@@ -50,6 +51,19 @@ We use a simple 3-σ rule to set the hyperparameters for ops in each recipe.
|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-13B <br> (baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B <br> (refined pretrain dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |
- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): models **trained with the refined datasets** outperform the baseline ([T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo)) on [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard). T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). Please refer to [Sandbox](../../docs/Sandbox.md) for more details.

| model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- |
| T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | **51.67** | **68.92** |
| Data-Juicer (DJ, 228k) | **82.53** | **83.38** | **79.13** | **97.92** | **99.27** | **98.14** | **97.77** | 38.89 | 67.39 |

| model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|-------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| T2V-Turbo | **72.49** | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | **95.60** | 94.06 | 46.95 | **57.57** | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k) | 70.41 | **96.44** | **64.51** | 95.40 | **95.51** | **47.17** | 57.30 | **25.55** | **26.82** | **29.25** |

## For Video Dataset

17 changes: 16 additions & 1 deletion configs/data_juicer_recipes/README_ZH.md
@@ -41,7 +41,8 @@
| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | [llava-pretrain-refine.yaml](llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer) | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
| Data-Juicer-T2V | 1,217,346 | 147,176 | 12.09% | [2_multi_op_pipline.yaml](../demo/bench/2_multi_op_pipline.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
| Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | [data-juicer-sandbox-optimal.yaml](data-juicer-sandbox-optimal.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
| Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | [data-juicer-sandbox-self-evolution.yaml](data-juicer-sandbox-self-evolution.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (2,599k)](https://github.com/snap-research/Panda-70M) <br> [Pexels (198k)](https://github.com/cj-mills/pexels-dataset) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |

### Evaluation Results
- LLaVA pretrain (LCS-558k): models **pretrained with the refined dataset** and fine-tuned with the original instruction dataset outperform the baseline (LLaVA-1.5-13B) on 10 out of 12 benchmarks.
@@ -51,6 +52,20 @@
| LLaVA-1.5-13B <br> (baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B <br> (refined pretrain dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |

- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): models **trained with the refined datasets** outperform the baseline ([T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo)) across the board on [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard). T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). Please refer to [Sandbox](../../docs/Sandbox-ZH.md) for more details.

| model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- |
| T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | **51.67** | **68.92** |
| Data-Juicer (DJ, 228k) | **82.53** | **83.38** | **79.13** | **97.92** | **99.27** | **98.14** | **97.77** | 38.89 | 67.39 |

| model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|-------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| T2V-Turbo | **72.49** | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | **95.60** | 94.06 | 46.95 | **57.57** | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k) | 70.41 | **96.44** | **64.51** | 95.40 | **95.51** | **47.17** | 57.30 | **25.55** | **26.82** | **29.25** |

## For Video Dataset

We provide an example recipe for processing video datasets to help users make better use of the video-related OPs: [general-video-refine-example.yaml](general-video-refine-example.yaml). Three types of OPs are applied here:
29 changes: 29 additions & 0 deletions configs/data_juicer_recipes/data-juicer-sandbox-optimal.yaml
@@ -0,0 +1,29 @@
# global parameters
project_name: 'Data-Juicer-recipes-T2V-optimal'
dataset_path: '/path/to/your/dataset' # path to your dataset directory or file
export_path: '/path/to/your/dataset.jsonl'

np: 4 # number of subprocess to process your dataset

# process schedule
# a list of several process operators with their arguments
process:
  - video_nsfw_filter:
      hf_nsfw_model: Falconsai/nsfw_image_detection
      score_threshold: 0.000195383
      frame_sampling_method: uniform
      frame_num: 3
      reduce_mode: avg
      any_or_all: any
      mem_required: '1GB'
  - video_frames_text_similarity_filter:
      hf_clip: openai/clip-vit-base-patch32
      min_score: 0.306337
      max_score: 1.0
      frame_sampling_method: uniform
      frame_num: 3
      horizontal_flip: false
      vertical_flip: false
      reduce_mode: avg
      any_or_all: any
      mem_required: '10GB'
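A minimal sketch of the keep/drop semantics this recipe encodes — not Data-Juicer's actual implementation. The stat names (`nsfw_score`, `frames_text_sim`) are hypothetical, while the numeric thresholds come from the YAML above: a sample survives only if every filter's predicate passes.

```python
# Hypothetical per-sample stats checked against the recipe's thresholds.
THRESHOLDS = {
    'nsfw_score': lambda v: v <= 0.000195383,           # video_nsfw_filter
    'frames_text_sim': lambda v: 0.306337 <= v <= 1.0,  # similarity filter
}

def keep(sample_stats):
    """A sample is kept only if all filter predicates pass."""
    return all(check(sample_stats[key]) for key, check in THRESHOLDS.items())

samples = [
    {'nsfw_score': 0.0001, 'frames_text_sim': 0.5},  # passes both filters
    {'nsfw_score': 0.0001, 'frames_text_sim': 0.1},  # similarity too low
    {'nsfw_score': 0.9,    'frames_text_sim': 0.5},  # flagged as NSFW
]
kept = [s for s in samples if keep(s)]
```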
@@ -0,0 +1,47 @@
# global parameters
project_name: 'Data-Juicer-recipes-T2V-evolution'
dataset_path: '/path/to/your/dataset' # path to your dataset directory or file
export_path: '/path/to/your/dataset.jsonl'

np: 4 # number of subprocess to process your dataset

# process schedule
# a list of several process operators with their arguments
process:
  - video_nsfw_filter:
      hf_nsfw_model: Falconsai/nsfw_image_detection
      score_threshold: 0.000195383
      frame_sampling_method: uniform
      frame_num: 3
      reduce_mode: avg
      any_or_all: any
      mem_required: '1GB'
  - video_frames_text_similarity_filter:
      hf_clip: openai/clip-vit-base-patch32
      min_score: 0.306337
      max_score: 1.0
      frame_sampling_method: uniform
      frame_num: 3
      horizontal_flip: false
      vertical_flip: false
      reduce_mode: avg
      any_or_all: any
      mem_required: '10GB'
  - video_motion_score_filter:
      min_score: 3
      max_score: 20
      sampling_fps: 2
      any_or_all: any
  - video_aesthetics_filter:
      hf_scorer_model: shunk031/aesthetics-predictor-v2-sac-logos-ava1-l14-linearMSE
      min_score: 0.418164
      max_score: 1.0
      frame_sampling_method: 'uniform'
      frame_num: 3
      reduce_mode: avg
      any_or_all: any
      mem_required: '1500MB'
  - video_duration_filter:
      min_duration: 2
      max_duration: 100000
      any_or_all: any
12 changes: 11 additions & 1 deletion data_juicer/ops/base_op.py
@@ -139,7 +139,7 @@ def __init__(self, *args, **kwargs):
        self.image_key = kwargs.get('image_key', 'images')
        self.audio_key = kwargs.get('audio_key', 'audios')
        self.video_key = kwargs.get('video_key', 'videos')
-        self.batch_size = kwargs.get('batch_size', 1)
+        self.batch_size = kwargs.get('batch_size', 1000)

        # whether the model can be accelerated using cuda
        _accelerator = kwargs.get('accelerator', None)
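The hunk above raises the default `batch_size` from 1 to 1000, so map-style ops receive batches of samples rather than one at a time, amortizing per-call overhead. The batching itself can be sketched like this (illustrative only, not the HuggingFace `datasets` internals):

```python
# Sketch of splitting a dataset into fixed-size batches; the final
# batch may be smaller than batch_size.
def iter_batches(samples, batch_size=1000):
    for i in range(0, len(samples), batch_size):
        yield samples[i:i + batch_size]

data = list(range(2500))
batches = list(iter_batches(data))  # 1000 + 1000 + 500 samples
```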
@@ -210,6 +210,12 @@ def add_parameters(self, init_parameter_dict, **extra_param_dict):
        related_parameters.update(extra_param_dict)
        return related_parameters

+    def run(self, dataset):
+        from data_juicer.core.data import NestedDataset
+        if not isinstance(dataset, NestedDataset):
+            dataset = NestedDataset(dataset)
+        return dataset


class Mapper(OP):

@@ -244,6 +250,7 @@ def process(self, sample):
        raise NotImplementedError

    def run(self, dataset, *, exporter=None, tracer=None):
+        dataset = super(Mapper, self).run(dataset)
        new_dataset = dataset.map(
            self.process,
            num_proc=self.runtime_np(),
@@ -304,6 +311,7 @@ def process(self, sample):
        raise NotImplementedError

    def run(self, dataset, *, exporter=None, tracer=None):
+        dataset = super(Filter, self).run(dataset)
        if Fields.stats not in dataset.features:
            from data_juicer.core.data import add_same_content_to_new_column
            dataset = dataset.map(add_same_content_to_new_column,
@@ -374,6 +382,7 @@ def process(self, dataset, show_num=0):
        raise NotImplementedError

    def run(self, dataset, *, exporter=None, tracer=None):
+        dataset = super(Deduplicator, self).run(dataset)
        dataset = dataset.map(self.compute_hash,
                              num_proc=self.runtime_np(),
                              with_rank=self.use_cuda(),
@@ -412,6 +421,7 @@ def process(self, dataset):
        raise NotImplementedError

    def run(self, dataset, *, exporter=None, tracer=None):
+        dataset = super(Selector, self).run(dataset)
        new_dataset = self.process(dataset)
        if tracer:
            tracer.trace_filter(self._name, dataset, new_dataset)
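The new `OP.run()` normalizes any input into a `NestedDataset` before the subclasses (`Mapper`, `Filter`, `Deduplicator`, `Selector`) do their own work via `super().run(dataset)`. The pattern can be sketched with stand-in classes (the real `NestedDataset` lives in `data_juicer.core.data`):

```python
# Stand-in classes illustrating the base-class normalization pattern.
class Dataset:
    def __init__(self, rows):
        self.rows = list(rows)

class NestedDataset(Dataset):
    pass

class OP:
    def run(self, dataset):
        # Coerce any plain dataset into the canonical wrapper type.
        if not isinstance(dataset, NestedDataset):
            dataset = NestedDataset(dataset.rows)
        return dataset

class Mapper(OP):
    def run(self, dataset):
        dataset = super().run(dataset)  # normalize first, then process
        dataset.rows = [r.upper() for r in dataset.rows]
        return dataset

out = Mapper().run(Dataset(['a', 'b']))
```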
7 changes: 3 additions & 4 deletions data_juicer/ops/filter/alphanumeric_filter.py
@@ -83,10 +83,9 @@ def process(self, samples):
        ratio_key = StatsKeys.alpha_token_ratio if self.tokenization \
            else StatsKeys.alnum_ratio
        if isinstance(samples[Fields.stats], list):
-            return list(
-                map(
-                    lambda stat: self.min_ratio <= stat[ratio_key] <= self.
-                    max_ratio, samples[Fields.stats]))
+            return map(
+                lambda stat: self.min_ratio <= stat[ratio_key] <= self.
+                max_ratio, samples[Fields.stats])
        else:
            # single sample for ray filter
            if self.min_ratio <= samples[
8 changes: 3 additions & 5 deletions data_juicer/ops/filter/average_line_length_filter.py
@@ -60,11 +60,9 @@ def compute_stats(self, samples, context=False):

    def process(self, samples):
        if isinstance(samples[Fields.stats], list):
-            return list(
-                map(
-                    lambda stat: self.min_len <= stat[StatsKeys.avg_line_length
-                    ] <= self.max_len,
-                    samples[Fields.stats]))
+            return map(
+                lambda stat: self.min_len <= stat[StatsKeys.avg_line_length] <=
+                self.max_len, samples[Fields.stats])
        else:
            # single sample for ray filter
            if self.min_len <= samples[Fields.stats][
8 changes: 3 additions & 5 deletions data_juicer/ops/filter/character_repetition_filter.py
@@ -80,11 +80,9 @@ def compute_stats(self, samples):

    def process(self, samples):
        if isinstance(samples[Fields.stats], list):
-            return list(
-                map(
-                    lambda stat: self.min_ratio <= stat[
-                        StatsKeys.char_rep_ratio] <= self.max_ratio,
-                    samples[Fields.stats]))
+            return map(
+                lambda stat: self.min_ratio <= stat[StatsKeys.char_rep_ratio]
+                <= self.max_ratio, samples[Fields.stats])
        else:
            # single sample for ray filter
            if self.min_ratio <= samples[Fields.stats][
8 changes: 3 additions & 5 deletions data_juicer/ops/filter/maximum_line_length_filter.py
@@ -61,11 +61,9 @@ def compute_stats(self, samples, context=False):

    def process(self, samples):
        if isinstance(samples[Fields.stats], list):
-            return list(
-                map(
-                    lambda stat: self.min_len <= stat[StatsKeys.max_line_length
-                    ] <= self.max_len,
-                    samples[Fields.stats]))
+            return map(
+                lambda stat: self.min_len <= stat[StatsKeys.max_line_length] <=
+                self.max_len, samples[Fields.stats])
        else:
            # single sample for ray filter
            if self.min_len <= samples[Fields.stats][
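Each of the filter diffs above replaces `list(map(...))` with a bare `map(...)`: the returned object is a lazy iterator, so the per-sample predicate runs only when the caller consumes it, and no intermediate list is materialized. A sketch of the difference, with illustrative stats:

```python
# Lazy filtering decisions, mirroring the refactored process() methods.
stats = [{'alnum_ratio': r} for r in (0.2, 0.6, 0.9)]
min_ratio, max_ratio = 0.25, 0.95

# No predicate runs yet; `lazy` is a map iterator, not a list.
lazy = map(lambda s: min_ratio <= s['alnum_ratio'] <= max_ratio, stats)

# Consuming the iterator evaluates the predicate sample by sample.
decisions = list(lazy)
```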
