sandbox doc update
BeachWang committed Oct 11, 2024
1 parent 1aaad21 commit 3faaff6
Showing 5 changed files with 55 additions and 14 deletions.
8 changes: 4 additions & 4 deletions data_juicer/core/sandbox/factories.py
@@ -1,5 +1,5 @@
-from data_juicer.core import Analyzer
-from data_juicer.core import Executor as DjExecutor
+from data_juicer.core import Analyzer as DJAnalyzer
+from data_juicer.core import Executor as DJExecutor
 from data_juicer.core.sandbox.evaluators import (Gpt3QualityEvaluator,
                                                  InceptionEvaluator,
                                                  VBenchEvaluator)
@@ -17,7 +17,7 @@ def __call__(self, dj_cfg: dict = None, *args, **kwargs):
         if dj_cfg is None:
             return None

-        return DjExecutor(dj_cfg)
+        return DJExecutor(dj_cfg)


 data_executor_factory = DataExecutorFactory()
@@ -32,7 +32,7 @@ def __call__(self, dj_cfg: dict = None, *args, **kwargs):
         if dj_cfg is None:
             return None

-        return Analyzer(dj_cfg)
+        return DJAnalyzer(dj_cfg)


 data_analyzer_factory = DataAnalyzerFactory()
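The factory hunks above share one pattern: a callable factory object returns `None` when no Data-Juicer config is supplied and otherwise wraps the config in the aliased class. A minimal, self-contained sketch of that pattern follows. `StubExecutor` is a hypothetical stand-in for `DJExecutor` so that the snippet runs without Data-Juicer installed; it is not the project's actual class, and the `project_name` key is only an illustrative config entry.

```python
from typing import Optional


class StubExecutor:
    """Hypothetical stand-in for DJExecutor; it only stores the config."""

    def __init__(self, cfg: dict):
        self.cfg = cfg


class DataExecutorFactory:
    """Same shape as the factory in the diff: no config in, None out."""

    def __call__(self, dj_cfg: Optional[dict] = None, *args, **kwargs):
        if dj_cfg is None:
            return None
        return StubExecutor(dj_cfg)


# Module-level singleton, as in factories.py.
data_executor_factory = DataExecutorFactory()

executor = data_executor_factory({"project_name": "demo"})
print(type(executor).__name__)          # StubExecutor
print(data_executor_factory() is None)  # True
```

Returning `None` rather than raising lets a caller leave the data executor unconfigured and decide later whether a missing component is an error.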
2 changes: 1 addition & 1 deletion docs/DJ_SORA.md
@@ -109,6 +109,6 @@ This project is being actively updated and maintained. We eagerly invite you to
 - ...
 - [] (Model-Data sandbox) With relatively small models and the DJ-SORA dataset, exploring low-cost, transferable, and instructive data-model co-design, configurations and checkpoints.
 - [ ] [WIP] Training SORA-like models with DJ-SORA data on larger scales and in more scenarios to improve model performance.
-- [] Data-Juicer-T2v, [V-Bench Top1 model](https://huggingface.co/datajuicer/Data-Juicer-T2V)
+- [] Data-Juicer-T2v, [V-Bench Top1 model](https://huggingface.co/datajuicer/Data-Juicer-T2V-v2). Please refer [here](./Sandbox-ZH.md) for more details.
 - ...
 - ...
4 changes: 1 addition & 3 deletions docs/DJ_SORA_ZH.md
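@@ -113,7 +113,5 @@ DJ-SORA will be based on Data-Juicer (which includes hundreds of dedicated video, image, audio,
 - ...
 - [] (Model-Data sandbox) With relatively small models and the DJ-SORA dataset, exploring low-cost, transferable, and instructive data-model co-design, configurations, and checkpoints
 - [ ] [WIP] Training SORA-like models with DJ-SORA data at larger scales and in more scenarios to improve model performance
-- [] Data-Juicer-T2v, [V-Bench Top1 model](https://huggingface.co/datajuicer/Data-Juicer-T2V)
+- [] Data-Juicer-T2v, [V-Bench Top1 model](https://huggingface.co/datajuicer/Data-Juicer-T2V-v2). For details, see [here](./Sandbox-ZH.md)
 - ...
-
-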
27 changes: 24 additions & 3 deletions docs/Sandbox-ZH.md
@@ -12,8 +12,18 @@
 | data_juicer_t2v_optimal_data_pool | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | The training dataset of Data-Juicer (T2V, 147k) |
 | data_juicer_t2v_evolution_data_pool | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool) | The training dataset of Data-Juicer (2024-09-23, T2V-Turbo) |

+Sample outputs of the Data-Juicer (DJ, 228k) model are shown in the table below.
+| Prompt | Generated Video |
+| --- | --- |
+| A beautiful coastal beach in spring, waves lapping on sand, zoom out | [![Case 0](https://img.alicdn.com/imgextra/i1/O1CN01KuJeOE1Ylqnk9zYkc_!!6000000003100-2-tps-2048-320.png)](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/show_cases/case0.mp4) |
+| a boat accelerating to gain speed | [![Case 1](https://img.alicdn.com/imgextra/i2/O1CN01i1iMFE1TKlIUlqE8d_!!6000000002364-2-tps-2048-320.png)](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/show_cases/case1.mp4) |
+| A boat sailing leisurely along the Seine River with the Eiffel Tower in background by Hokusai, in the style of Ukiyo | [![Case 2](https://img.alicdn.com/imgextra/i2/O1CN01u2cjJE1RBwRFeCFuo_!!6000000002074-2-tps-2048-320.png)](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/show_cases/case2.mp4) |
+| a bottle on the left of a wine glass, front view | [![Case 3](https://img.alicdn.com/imgextra/i4/O1CN01vdMm6Q1xWc1CoJZW6_!!6000000006451-2-tps-2048-320.png)](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/show_cases/case3.mp4) |
+| A corgi's head depicted as an explosion of a nebula | [![Case 4](https://img.alicdn.com/imgextra/i2/O1CN014oPB8Q1IrJg0AbUUg_!!6000000000946-2-tps-2048-320.png)](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/show_cases/case4.mp4) |
+| A graceful ballerina doing a pirouette on a dimly lit stage, with soft spotlight highlighting her movements. | [![Case 5](https://img.alicdn.com/imgextra/i4/O1CN01yNlsVu1ymvkJgkvY8_!!6000000006622-2-tps-2048-320.png)](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/show_cases/case5.mp4) |
+
 To reproduce the paper's experiments, please refer to the sandbox usage guide below, the experimental process in the following figure, the [initial dataset](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_init_data_pool.zip), and the workflow configuration file demos for the process: [1_single_op_pipline.yaml](../configs/demo/bench/1_single_op_pipline.yaml), [2_multi_op_pipline.yaml](../configs/demo/bench/2_multi_op_pipline.yaml), [3_duplicate_pipline.yaml](../configs/demo/bench/3_duplicate_pipline.yaml)
-![bench_bottom_up](https://img.alicdn.com/imgextra/i3/O1CN01ZwtQuG1sdPnbYYVhH_!!6000000005789-2-tps-7838-3861.png)
+![bench_bottom_up](https://img.alicdn.com/imgextra/i2/O1CN01xvu2fo1HU80biR6Q5_!!6000000000760-2-tps-7756-3693.png)

 ## What is the Sandbox Laboratory (DJ-Sandbox)?
 In Data-Juicer, the data sandbox laboratory provides users with best practices for continuously producing data recipes. It is low-overhead, transferable, and instructive: users can quickly experiment with, iterate on, and optimize data recipes in the sandbox using small-scale datasets and models, then transfer them to larger scales to produce high-quality data at scale for large models.
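@@ -132,6 +142,18 @@ python tools/sandbox_starter.py --config configs/demo/sandbox/sandbox.yaml

 The currently supported component factories and the components supported within each factory are as follows:

+- Data processing factory -- DataExecutorFactory
+
+| Component | Function | Desc. of Method `run` | Reference Materials |
+| --- | --- | --- | --- |
+| `DJExecutor` | The data processing module of Data-Juicer | - | - |
+
+- Data analysis factory -- DataAnalyzerFactory
+
+| Component | Function | Desc. of Method `run` | Reference Materials |
+| --- | --- | --- | --- |
+| `DJAnalyzer` | The data analysis module of Data-Juicer | - | - |
+
 - Data evaluation factory -- DataEvaluatorFactory

 | Component | Function | Desc. of Method `run` | Reference Materials |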
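@@ -165,14 +187,13 @@ python tools/sandbox_starter.py --config configs/demo/sandbox/sandbox.yaml
 For detailed definitions, please refer to `data_juicer/core/sandbox/factories.py`.
 # Developer Guide
 As mentioned in the previous section, developers can develop more configurable components and add them to the corresponding factory classes, using the `type` parameter to route to the appropriate instantiation method. Once the components are implemented, developers can encapsulate them as hooks and register the hooks into job lists. After the job lists are orchestrated in the pipeline, the sandbox pipeline executes the jobs in each job list in sequence at each step. Each of these parts (components, component factories, hooks, job lists, and the pipeline's registration and execution orchestration) can be customized by developers. The relationship among these parts is illustrated in the figure below.
-![sandbox-pipeline](https://img.alicdn.com/imgextra/i2/O1CN01B3zR0t29noFoHGsyq_!!6000000008113-2-tps-3878-2212.png)
+![sandbox-pipeline](https://img.alicdn.com/imgextra/i3/O1CN01ERmGre1uz3luKOn4n_!!6000000006107-2-tps-4655-1918.png)

 ## Internal Implementation of Components
 Currently, components are mainly divided into two major categories:

 - **Executor**: Since the data executor is already handled by Data-Juicer's Executor, the executor here refers specifically to the model executor, including executors for model training, inference, and evaluation. The code is located in `data_juicer/core/sandbox/model_executors.py`
 - **Evaluator**: Used to evaluate the quality and performance of datasets or models. The code is located in `data_juicer/core/sandbox/evaluators.py`
-- **Hook**: Used to mount tasks onto the pipeline. The code is located in `data_juicer/core/sandbox/hooks.py`

 ### Executor
 The core function of the model executor is to train, run inference with, or evaluate the model specified in the configuration file on the specified dataset. The model executor needs to inherit from `BaseModelExecutor` and implement several core methods: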
28 changes: 25 additions & 3 deletions docs/Sandbox.md
@@ -12,8 +12,18 @@ The model is now publicly available on the ModelScope and HuggingFace platforms,
 | data_juicer_t2v_optimal_data_pool | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | The training dataset of Data-Juicer (T2V, 147k) |
 | data_juicer_t2v_evolution_data_pool | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool_s2.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-evolution-data-pool) | The training dataset of Data-Juicer (2024-09-23, T2V-Turbo) |

+Following is the case study for Data-Juicer (DJ, 228k) outputs.
+| Prompt | Generated Video |
+| --- | --- |
+| A beautiful coastal beach in spring, waves lapping on sand, zoom out | [![Case 0](https://img.alicdn.com/imgextra/i1/O1CN01KuJeOE1Ylqnk9zYkc_!!6000000003100-2-tps-2048-320.png)](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/show_cases/case0.mp4) |
+| a boat accelerating to gain speed | [![Case 1](https://img.alicdn.com/imgextra/i2/O1CN01i1iMFE1TKlIUlqE8d_!!6000000002364-2-tps-2048-320.png)](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/show_cases/case1.mp4) |
+| A boat sailing leisurely along the Seine River with the Eiffel Tower in background by Hokusai, in the style of Ukiyo | [![Case 2](https://img.alicdn.com/imgextra/i2/O1CN01u2cjJE1RBwRFeCFuo_!!6000000002074-2-tps-2048-320.png)](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/show_cases/case2.mp4) |
+| a bottle on the left of a wine glass, front view | [![Case 3](https://img.alicdn.com/imgextra/i4/O1CN01vdMm6Q1xWc1CoJZW6_!!6000000006451-2-tps-2048-320.png)](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/show_cases/case3.mp4) |
+| A corgi's head depicted as an explosion of a nebula | [![Case 4](https://img.alicdn.com/imgextra/i2/O1CN014oPB8Q1IrJg0AbUUg_!!6000000000946-2-tps-2048-320.png)](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/show_cases/case4.mp4) |
+| A graceful ballerina doing a pirouette on a dimly lit stage, with soft spotlight highlighting her movements. | [![Case 5](https://img.alicdn.com/imgextra/i4/O1CN01yNlsVu1ymvkJgkvY8_!!6000000006622-2-tps-2048-320.png)](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/show_cases/case5.mp4) |
+
 To reproduce the paper's experiments, please refer to the sandbox usage guide below, the experimental process in the following figure, the [initial dataset](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_init_data_pool.zip), and the configuration file demos for the process: [1_single_op_pipline.yaml](../configs/demo/bench/1_single_op_pipline.yaml), [2_multi_op_pipline.yaml](../configs/demo/bench/2_multi_op_pipline.yaml), [3_duplicate_pipline.yaml](../configs/demo/bench/3_duplicate_pipline.yaml).
-![bench_bottom_up](https://img.alicdn.com/imgextra/i3/O1CN01ZwtQuG1sdPnbYYVhH_!!6000000005789-2-tps-7838-3861.png)
+![bench_bottom_up](https://img.alicdn.com/imgextra/i2/O1CN01xvu2fo1HU80biR6Q5_!!6000000000760-2-tps-7756-3693.png)

 ## What is DJ-Sandbox?
 In Data-Juicer, the data sandbox laboratory provides users with the best practices for continuously producing data recipes. It features low overhead, portability, and guidance. In the sandbox, users can quickly experiment, iterate, and refine data recipes based on small-scale datasets and models, before scaling up to produce high-quality data to serve large-scale models.
@@ -133,6 +143,18 @@ Except for DataExecutor and DataAnalyzer, the rest of the components can be spec

 The currently supported component factories and the components supported within each factory are as follows:

+- DataExecutorFactory
+
+| Component | Function | Desc. of Method `run` | Reference Materials |
+| --- | --- | --- | --- |
+| `DJExecutor` | The data process module of Data-Juicer | - | - |
+
+- DataAnalyzerFactory
+
+| Component | Function | Desc. of Method `run` | Reference Materials |
+| --- | --- | --- | --- |
+| `DJAnalyzer` | The data analysis module of Data-Juicer | - | - |
+
 - DataEvaluatorFactory

 | Component | Function | Desc. of Method `run` | Reference Materials |
@@ -166,14 +188,14 @@ The currently supported component factories and the components supported within
 Please refer to `data_juicer/core/sandbox/factories.py` for detailed definitions.
 # Developer Guide
 As mentioned in the previous section, developers can develop customized configurable components and add them to the corresponding factory classes, then route to appropriate instantiation methods using the `type` parameter. Once the components are implemented, developers can encapsulate them as hooks and register the hooks into the job list. After the job list is orchestrated in the pipeline, when the sandbox pipeline is executed, each job in the job list will be executed in sequence at each step. Each of these parts - components, component factory, hooks, job lists, and the registration and execution orchestration of the pipeline - can be customized by the developer. The relationship among these parts is illustrated in the diagram below.
-![sandbox-pipeline](https://img.alicdn.com/imgextra/i2/O1CN01B3zR0t29noFoHGsyq_!!6000000008113-2-tps-3878-2212.png)
+![sandbox-pipeline](https://img.alicdn.com/imgextra/i3/O1CN01ERmGre1uz3luKOn4n_!!6000000006107-2-tps-4655-1918.png)
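The Developer Guide paragraph above describes a chain: a factory picks a component by its `type`, a hook wraps the component, hooks are registered into job lists, and the pipeline runs the job lists step by step. The sketch below illustrates that chain under stated assumptions only. Every name in it (`EchoComponent`, `ComponentFactory`, `Hook`, `Pipeline`, and the `probe` and `evaluate` step names) is hypothetical and is not taken from `data_juicer/core/sandbox`.

```python
from typing import Callable, Dict, List, Optional


class EchoComponent:
    """Hypothetical component: reports which `type` it was built for."""

    def __init__(self, cfg: dict):
        self.cfg = cfg

    def run(self) -> str:
        return f"ran {self.cfg['type']}"


class ComponentFactory:
    """Routes a config to a registered builder based on its `type` field."""

    def __init__(self) -> None:
        self._builders: Dict[str, Callable[[dict], EchoComponent]] = {}

    def register(self, type_name: str,
                 builder: Callable[[dict], EchoComponent]) -> None:
        self._builders[type_name] = builder

    def __call__(self, cfg: Optional[dict] = None):
        if cfg is None:
            return None
        return self._builders[cfg["type"]](cfg)


class Hook:
    """Wraps a component so the pipeline can call every job the same way."""

    def __init__(self, component: EchoComponent):
        self.component = component

    def hook(self) -> str:
        return self.component.run()


class Pipeline:
    """Keeps one job list per step and runs the steps in a fixed order."""

    def __init__(self, steps: List[str]):
        self.job_lists: Dict[str, List[Hook]] = {step: [] for step in steps}

    def register(self, step: str, hook: Hook) -> None:
        self.job_lists[step].append(hook)

    def run(self) -> List[str]:
        results = []
        # Dicts preserve insertion order, so steps run as declared.
        for step, jobs in self.job_lists.items():
            for job in jobs:
                results.append(f"{step}: {job.hook()}")
        return results


factory = ComponentFactory()
factory.register("trainer", EchoComponent)
factory.register("scorer", EchoComponent)

pipeline = Pipeline(steps=["probe", "evaluate"])
# Registration order differs from step order on purpose.
pipeline.register("evaluate", Hook(factory({"type": "scorer"})))
pipeline.register("probe", Hook(factory({"type": "trainer"})))

print(pipeline.run())  # ['probe: ran trainer', 'evaluate: ran scorer']
```

In this sketch the step order is fixed when the pipeline is declared, so hooks can be registered in any order without changing when they run.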

 ## The Internal Implementation of Components
 Currently, components are mainly divided into two major categories:

 - **Executor**: Since the data executor is already handled by the Data-Juicer's Executor, the executor here specifically refers to the model executor, including model training, inference, evaluation, etc. The code is located in `data_juicer/core/sandbox/model_executors.py`.
 - **Evaluator**: Used for evaluating the quality and performance of datasets or models. The code is located in `data_juicer/core/sandbox/evaluators.py`.
-- **Hook**: Used to mount tasks onto the pipeline. The code is located in `data_juicer/core/sandbox/hooks.py`.
+
 ### Executor
 The core function of the model executor is to train, infer, or evaluate the model specified in the configuration file with the specified dataset. The model executor needs to inherit from `BaseModelExecutor` and implement several core methods:
