Merge branch 'main' of github.com:alibaba/data-juicer into dev/light_dependency
BeachWang committed Sep 24, 2024
2 parents f1cfa65 + 467cb96 commit 5e6a340
Showing 24 changed files with 206 additions and 84 deletions.
16 changes: 15 additions & 1 deletion configs/data_juicer_recipes/README.md
@@ -41,7 +41,8 @@ We use a simple 3-σ rule to set the hyperparameters for ops in each recipe.
| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | [llava-pretrain-refine.yaml](llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer) | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
| Data-Juicer-T2V | 1,217,346 | 147,176 | 12.09% | [2_multi_op_pipline.yaml](../demo/bench/2_multi_op_pipline.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
| Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | [data-juicer-sandbox-optimal.yaml](data-juicer-sandbox-optimal.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
| Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | [data-juicer-sandbox-self-evolution.yaml](data-juicer-sandbox-self-evolution.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (2,599k)](https://github.com/snap-research/Panda-70M) <br> [Pexels (198k)](https://github.com/cj-mills/pexels-dataset) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
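The 3-σ rule used to set these hyperparameters can be sketched in a few lines (illustrative stat values, not from the recipes): thresholds are placed at mean ± 3·std of an op's stat distribution, and samples whose stat falls outside the band are dropped.

```python
# Sketch of 3-sigma threshold selection for a filter op's stat.
# The stat values below are made up for illustration.
from statistics import mean, stdev

def three_sigma_bounds(values):
    """Return (low, high) thresholds at mean +/- 3 * sample std."""
    mu, sigma = mean(values), stdev(values)
    return mu - 3 * sigma, mu + 3 * sigma

stats = [0.2] * 10 + [0.25] * 10 + [5.0]  # tight cluster plus one outlier
lo, hi = three_sigma_bounds(stats)
kept = [v for v in stats if lo <= v <= hi]  # the outlier is dropped
```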

### Evaluation Results
- LLaVA pretrain (LCS-558k): models **pretrained with the refined dataset** and fine-tuned with the original instruction dataset outperform the baseline (LLaVA-1.5-13B) on 10 out of 12 benchmarks.
@@ -50,6 +51,19 @@ We use a simple 3-σ rule to set the hyperparameters for ops in each recipe.
|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-13B <br> (baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B <br> (refined pretrain dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |
- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): models **trained with the refined datasets** outperform the baseline ([T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo)) on [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard). T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). Please refer to [Sandbox](../../docs/Sandbox.md) for more details.

| model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- |
| T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | **51.67** | **68.92** |
| Data-Juicer (DJ, 228k) | **82.53** | **83.38** | **79.13** | **97.92** | **99.27** | **98.14** | **97.77** | 38.89 | 67.39 |

| model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|-------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| T2V-Turbo | **72.49** | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | **95.60** | 94.06 | 46.95 | **57.57** | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k) | 70.41 | **96.44** | **64.51** | 95.40 | **95.51** | **47.17** | 57.30 | **25.55** | **26.82** | **29.25** |

## For Video Dataset

17 changes: 16 additions & 1 deletion configs/data_juicer_recipes/README_ZH.md
@@ -41,7 +41,8 @@
| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | [llava-pretrain-refine.yaml](llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer) | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
| Data-Juicer-T2V | 1,217,346 | 147,176 | 12.09% | [2_multi_op_pipline.yaml](../demo/bench/2_multi_op_pipline.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
| Data-Juicer (T2V, 147k) | 1,217,346 | 147,176 | 12.09% | [data-juicer-sandbox-optimal.yaml](data-juicer-sandbox-optimal.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |
| Data-Juicer (DJ, 228k) | 3,408,553 | 227,867 | 8.15% | [data-juicer-sandbox-self-evolution.yaml](data-juicer-sandbox-self-evolution.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (2,599k)](https://github.com/snap-research/Panda-70M) <br> [Pexels (198k)](https://github.com/cj-mills/pexels-dataset) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |

### Evaluation Results
- LLaVA pretrain (LCS-558k): models **pretrained with the refined dataset** and fine-tuned with the original instruction dataset outperform the baseline (LLaVA-1.5-13B) on 10 out of 12 benchmarks.
@@ -51,6 +52,20 @@
| LLaVA-1.5-13B <br> (baseline) | **80.0** | 63.3 | 53.6 | 71.6 | **61.3** | 85.9 | 1531.3 | 67.7 | 63.6 | 61.6 | 72.5 | 36.1 |
| LLaVA-1.5-13B <br> (refined pretrain dataset) | 79.94 | **63.5** | **54.09** | **74.20** | 60.82 | **86.67** | **1565.53** | **68.2** | **63.9** | **61.8** | **75.9** | **37.4** |

- Data-Juicer (T2V, 147k) and Data-Juicer (DJ, 228k): models **trained with the refined datasets** outperform the baseline ([T2V-Turbo](https://github.com/Ji4chenLi/t2v-turbo)) across the board on [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard). T2V-Turbo is the teacher model of Data-Juicer (T2V, 147k), and Data-Juicer (T2V, 147k) is the teacher model of Data-Juicer (DJ, 228k). Please refer to [Sandbox](../../docs/Sandbox-ZH.md) for more details.

| model | Total Score | Quality Score | Semantic Score | subject consistency | background consistency | temporal flickering | motion smoothness | dynamic degree | aesthetic quality |
|-------------------------------|-------| --- | --- | --- | --- | --- | --- | --- | --- |
| T2V-Turbo | 81.01 | 82.57 | 74.76 | 96.28 | 97.02 | 97.48 | 97.34 | 49.17 | 63.04 |
| Data-Juicer (T2V, 147k) | 82.10 | 83.14 | 77.93 | 97.32 | 99.03 | 96.60 | 96.51 | **51.67** | **68.92** |
| Data-Juicer (DJ, 228k) | **82.53** | **83.38** | **79.13** | **97.92** | **99.27** | **98.14** | **97.77** | 38.89 | 67.39 |

| model | imaging quality | object class | multiple objects | human action | color | spatial relationship | scene | appearance style | temporal style | overall consistency |
|-------------------------------| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| T2V-Turbo | **72.49** | 93.96 | 54.65 | 95.20 | 89.90 | 38.67 | 55.58 | 24.42 | 25.51 | 28.16 |
| Data-Juicer (T2V, 147k) | 70.42 | 95.85 | 61.63 | **95.60** | 94.06 | 46.95 | **57.57** | 24.42 | 26.34 | 28.90 |
| Data-Juicer (DJ, 228k) | 70.41 | **96.44** | **64.51** | 95.40 | **95.51** | **47.17** | 57.30 | **25.55** | **26.82** | **29.25** |

## For Video Dataset

We provide an example recipe for processing video datasets to help users make better use of the video-related OPs: [general-video-refine-example.yaml](general-video-refine-example.yaml). Three types of OPs are applied here:
29 changes: 29 additions & 0 deletions configs/data_juicer_recipes/data-juicer-sandbox-optimal.yaml
@@ -0,0 +1,29 @@
# global parameters
project_name: 'Data-Juicer-recipes-T2V-optimal'
dataset_path: '/path/to/your/dataset' # path to your dataset directory or file
export_path: '/path/to/your/dataset.jsonl'

np: 4 # number of subprocess to process your dataset

# process schedule
# a list of several process operators with their arguments
process:
  - video_nsfw_filter:
      hf_nsfw_model: Falconsai/nsfw_image_detection
      score_threshold: 0.000195383
      frame_sampling_method: uniform
      frame_num: 3
      reduce_mode: avg
      any_or_all: any
      mem_required: '1GB'
  - video_frames_text_similarity_filter:
      hf_clip: openai/clip-vit-base-patch32
      min_score: 0.306337
      max_score: 1.0
      frame_sampling_method: uniform
      frame_num: 3
      horizontal_flip: false
      vertical_flip: false
      reduce_mode: avg
      any_or_all: any
      mem_required: '10GB'
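A minimal sketch of the keep/drop semantics this recipe encodes — not Data-Juicer's actual implementation. The stat names (`nsfw_score`, `frames_text_sim`) are hypothetical, while the numeric thresholds come from the YAML above: a sample survives only if every filter's predicate passes.

```python
# Hypothetical per-sample stats checked against the recipe's thresholds.
THRESHOLDS = {
    'nsfw_score': lambda v: v <= 0.000195383,           # video_nsfw_filter
    'frames_text_sim': lambda v: 0.306337 <= v <= 1.0,  # similarity filter
}

def keep(sample_stats):
    """A sample is kept only if all filter predicates pass."""
    return all(check(sample_stats[key]) for key, check in THRESHOLDS.items())

samples = [
    {'nsfw_score': 0.0001, 'frames_text_sim': 0.5},  # passes both filters
    {'nsfw_score': 0.0001, 'frames_text_sim': 0.1},  # similarity too low
    {'nsfw_score': 0.9,    'frames_text_sim': 0.5},  # flagged as NSFW
]
kept = [s for s in samples if keep(s)]
```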
@@ -0,0 +1,47 @@
# global parameters
project_name: 'Data-Juicer-recipes-T2V-evolution'
dataset_path: '/path/to/your/dataset' # path to your dataset directory or file
export_path: '/path/to/your/dataset.jsonl'

np: 4 # number of subprocess to process your dataset

# process schedule
# a list of several process operators with their arguments
process:
  - video_nsfw_filter:
      hf_nsfw_model: Falconsai/nsfw_image_detection
      score_threshold: 0.000195383
      frame_sampling_method: uniform
      frame_num: 3
      reduce_mode: avg
      any_or_all: any
      mem_required: '1GB'
  - video_frames_text_similarity_filter:
      hf_clip: openai/clip-vit-base-patch32
      min_score: 0.306337
      max_score: 1.0
      frame_sampling_method: uniform
      frame_num: 3
      horizontal_flip: false
      vertical_flip: false
      reduce_mode: avg
      any_or_all: any
      mem_required: '10GB'
  - video_motion_score_filter:
      min_score: 3
      max_score: 20
      sampling_fps: 2
      any_or_all: any
  - video_aesthetics_filter:
      hf_scorer_model: shunk031/aesthetics-predictor-v2-sac-logos-ava1-l14-linearMSE
      min_score: 0.418164
      max_score: 1.0
      frame_sampling_method: 'uniform'
      frame_num: 3
      reduce_mode: avg
      any_or_all: any
      mem_required: '1500MB'
  - video_duration_filter:
      min_duration: 2
      max_duration: 100000
      any_or_all: any
12 changes: 11 additions & 1 deletion data_juicer/ops/base_op.py
@@ -139,7 +139,7 @@ def __init__(self, *args, **kwargs):
        self.image_key = kwargs.get('image_key', 'images')
        self.audio_key = kwargs.get('audio_key', 'audios')
        self.video_key = kwargs.get('video_key', 'videos')
-        self.batch_size = kwargs.get('batch_size', 1)
+        self.batch_size = kwargs.get('batch_size', 1000)

        # whether the model can be accelerated using cuda
        _accelerator = kwargs.get('accelerator', None)
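The hunk above raises the default `batch_size` from 1 to 1000, so map-style ops receive batches of samples rather than one at a time, amortizing per-call overhead. The batching itself can be sketched like this (illustrative only, not the HuggingFace `datasets` internals):

```python
# Sketch of splitting a dataset into fixed-size batches; the final
# batch may be smaller than batch_size.
def iter_batches(samples, batch_size=1000):
    for i in range(0, len(samples), batch_size):
        yield samples[i:i + batch_size]

data = list(range(2500))
batches = list(iter_batches(data))  # 1000 + 1000 + 500 samples
```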
@@ -210,6 +210,12 @@ def add_parameters(self, init_parameter_dict, **extra_param_dict):
        related_parameters.update(extra_param_dict)
        return related_parameters

+    def run(self, dataset):
+        from data_juicer.core.data import NestedDataset
+        if not isinstance(dataset, NestedDataset):
+            dataset = NestedDataset(dataset)
+        return dataset


class Mapper(OP):

@@ -244,6 +250,7 @@ def process(self, sample):
        raise NotImplementedError

    def run(self, dataset, *, exporter=None, tracer=None):
+        dataset = super(Mapper, self).run(dataset)
        new_dataset = dataset.map(
            self.process,
            num_proc=self.runtime_np(),
@@ -304,6 +311,7 @@ def process(self, sample):
        raise NotImplementedError

    def run(self, dataset, *, exporter=None, tracer=None):
+        dataset = super(Filter, self).run(dataset)
        if Fields.stats not in dataset.features:
            from data_juicer.core.data import add_same_content_to_new_column
            dataset = dataset.map(add_same_content_to_new_column,
@@ -374,6 +382,7 @@ def process(self, dataset, show_num=0):
        raise NotImplementedError

    def run(self, dataset, *, exporter=None, tracer=None):
+        dataset = super(Deduplicator, self).run(dataset)
        dataset = dataset.map(self.compute_hash,
                              num_proc=self.runtime_np(),
                              with_rank=self.use_cuda(),
@@ -412,6 +421,7 @@ def process(self, dataset):
        raise NotImplementedError

    def run(self, dataset, *, exporter=None, tracer=None):
+        dataset = super(Selector, self).run(dataset)
        new_dataset = self.process(dataset)
        if tracer:
            tracer.trace_filter(self._name, dataset, new_dataset)
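The new `OP.run()` normalizes any input into a `NestedDataset` before the subclasses (`Mapper`, `Filter`, `Deduplicator`, `Selector`) do their own work via `super().run(dataset)`. The pattern can be sketched with stand-in classes (the real `NestedDataset` lives in `data_juicer.core.data`):

```python
# Stand-in classes illustrating the base-class normalization pattern.
class Dataset:
    def __init__(self, rows):
        self.rows = list(rows)

class NestedDataset(Dataset):
    pass

class OP:
    def run(self, dataset):
        # Coerce any plain dataset into the canonical wrapper type.
        if not isinstance(dataset, NestedDataset):
            dataset = NestedDataset(dataset.rows)
        return dataset

class Mapper(OP):
    def run(self, dataset):
        dataset = super().run(dataset)  # normalize first, then process
        dataset.rows = [r.upper() for r in dataset.rows]
        return dataset

out = Mapper().run(Dataset(['a', 'b']))
```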
7 changes: 3 additions & 4 deletions data_juicer/ops/filter/alphanumeric_filter.py
@@ -83,10 +83,9 @@ def process(self, samples):
        ratio_key = StatsKeys.alpha_token_ratio if self.tokenization \
            else StatsKeys.alnum_ratio
        if isinstance(samples[Fields.stats], list):
-            return list(
-                map(
-                    lambda stat: self.min_ratio <= stat[ratio_key] <= self.
-                    max_ratio, samples[Fields.stats]))
+            return map(
+                lambda stat: self.min_ratio <= stat[ratio_key] <= self.
+                max_ratio, samples[Fields.stats])
        else:
            # single sample for ray filter
            if self.min_ratio <= samples[
8 changes: 3 additions & 5 deletions data_juicer/ops/filter/average_line_length_filter.py
@@ -60,11 +60,9 @@ def compute_stats(self, samples, context=False):

    def process(self, samples):
        if isinstance(samples[Fields.stats], list):
-            return list(
-                map(
-                    lambda stat: self.min_len <= stat[StatsKeys.avg_line_length
-                    ] <= self.max_len,
-                    samples[Fields.stats]))
+            return map(
+                lambda stat: self.min_len <= stat[StatsKeys.avg_line_length] <=
+                self.max_len, samples[Fields.stats])
        else:
            # single sample for ray filter
            if self.min_len <= samples[Fields.stats][
8 changes: 3 additions & 5 deletions data_juicer/ops/filter/character_repetition_filter.py
@@ -80,11 +80,9 @@ def compute_stats(self, samples):

    def process(self, samples):
        if isinstance(samples[Fields.stats], list):
-            return list(
-                map(
-                    lambda stat: self.min_ratio <= stat[
-                        StatsKeys.char_rep_ratio] <= self.max_ratio,
-                    samples[Fields.stats]))
+            return map(
+                lambda stat: self.min_ratio <= stat[StatsKeys.char_rep_ratio]
+                <= self.max_ratio, samples[Fields.stats])
        else:
            # single sample for ray filter
            if self.min_ratio <= samples[Fields.stats][
8 changes: 3 additions & 5 deletions data_juicer/ops/filter/maximum_line_length_filter.py
@@ -61,11 +61,9 @@ def compute_stats(self, samples, context=False):

    def process(self, samples):
        if isinstance(samples[Fields.stats], list):
-            return list(
-                map(
-                    lambda stat: self.min_len <= stat[StatsKeys.max_line_length
-                    ] <= self.max_len,
-                    samples[Fields.stats]))
+            return map(
+                lambda stat: self.min_len <= stat[StatsKeys.max_line_length] <=
+                self.max_len, samples[Fields.stats])
        else:
            # single sample for ray filter
            if self.min_len <= samples[Fields.stats][
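Each of the filter diffs above replaces `list(map(...))` with a bare `map(...)`: the returned object is a lazy iterator, so the per-sample predicate runs only when the caller consumes it, and no intermediate list is materialized. A sketch of the difference, with illustrative stats:

```python
# Lazy filtering decisions, mirroring the refactored process() methods.
stats = [{'alnum_ratio': r} for r in (0.2, 0.6, 0.9)]
min_ratio, max_ratio = 0.25, 0.95

# No predicate runs yet; `lazy` is a map iterator, not a list.
lazy = map(lambda s: min_ratio <= s['alnum_ratio'] <= max_ratio, stats)

# Consuming the iterator evaluates the predicate sample by sample.
decisions = list(lazy)
```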
