update docs according to recently refactor and events (#366)
* update docs according to recently refactor and events

* update docs according to recently refactor and events

* update docs according to recently refactor and events

* minor fix according to yilun's comment
yxdyc authored Jul 26, 2024
1 parent 2271feb commit ff9c9da
Showing 9 changed files with 134 additions and 33 deletions.
59 changes: 56 additions & 3 deletions README.md
Expand Up @@ -37,6 +37,7 @@ We welcome you to join us (via issues, PRs, [Slack](https://join.slack.com/t/dat
----

## News
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-24] "Tianchi Better Synth Data Synthesis Competition for Multimodal Large Models" — Our 4th data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532251) for more information.
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-17] We utilized the Data-Juicer [Sandbox Laboratory Suite](https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox.md) to systematically optimize data and models through a co-development workflow between data and models, achieving a new top spot on the [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) text-to-video leaderboard. The related achievements have been compiled and published in a [paper](http://arxiv.org/abs/2407.11784), and the model has been released on the [ModelScope](https://modelscope.cn/models/Data-Juicer/Data-Juicer-T2V) and [HuggingFace](https://huggingface.co/datajuicer/Data-Juicer-T2V) platforms.
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-12] Our *awesome list of MLLM-Data* has evolved into a systematic [survey](https://arxiv.org/abs/2407.08583) from a model-data co-development perspective. Welcome to [explore](docs/awesome_llm_data.md) and contribute!
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-06-01] ModelScope-Sora "Data Directors" creative sprint—Our third data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532219) for more information.
Expand Down Expand Up @@ -96,8 +97,8 @@ Table of Contents
visualization, and multidimensional automatic evaluation, so that you can better understand and improve your data and models.
![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg)

- **Enhanced Efficiency**: Providing efficient and parallel data processing pipelines (Aliyun-PAI\Ray\Slurm\CUDA\OP Fusion)
requiring less memory and CPU usage, optimized for maximum productivity.
- **Towards Production Environment**: Providing efficient and parallel data processing pipelines (Aliyun-PAI\Ray\Slurm\CUDA\OP Fusion)
requiring less memory and CPU usage, optimized with automatic fault tolerance.
![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)

- **Comprehensive Data Processing Recipes**: Offering tens of [pre-built data
Expand Down Expand Up @@ -154,7 +155,7 @@ Table of Contents

## Installation

### From Source
### From Source

- Run the following commands to install the latest basic `data_juicer` version in
editable mode:
Expand Down Expand Up @@ -229,6 +230,15 @@ You can install FFmpeg using package managers(e.g. sudo apt install ffmpeg on De

Check if your environment path is set correctly by running the ffmpeg command from the terminal.


<br><hr>
<div style="text-align: right;">

[🔼 back to index](#documentation-index-)

</div>


## Quick Start


Expand Down Expand Up @@ -259,6 +269,20 @@ export DATA_JUICER_MODELS_CACHE="/path/to/another/directory/models"
export DATA_JUICER_ASSETS_CACHE="/path/to/another/directory/assets"
```
#### Flexible Programming Interface
We provide several simple interfaces for users to choose from:
```python
# ... init op & dataset ...
# Chain-call style; supports a single operator or a list of operators
dataset = dataset.process(op)
dataset = dataset.process([op1, op2])
# Functional style, for quick integration or script prototyping
dataset = op(dataset)
dataset = op.run(dataset)
```
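To make the two calling styles concrete, here is a minimal, self-contained sketch. It does not use Data-Juicer itself: `ToyDataset`, `ToyOp`, `MinLengthFilter`, and `NoDigitsFilter` are hypothetical stand-ins written only for illustration, and the real classes may differ in construction and behavior. The sketch only assumes what the snippet above shows, namely that a dataset exposes `process(...)` and an operator exposes both `run(...)` and direct call syntax.

```python
from typing import Dict, List, Sequence, Union


class ToyDataset:
    """Hypothetical stand-in for a dataset; not Data-Juicer's real class."""

    def __init__(self, rows: Sequence[Dict[str, str]]):
        self.rows: List[Dict[str, str]] = list(rows)

    def process(self, ops: Union["ToyOp", Sequence["ToyOp"]]) -> "ToyDataset":
        # Chain-call style: accept one operator or a list of operators.
        if not isinstance(ops, (list, tuple)):
            ops = [ops]
        dataset = self
        for op in ops:
            dataset = op.run(dataset)
        return dataset


class ToyOp:
    """Hypothetical operator base class with both functional entry points."""

    def keep(self, row: Dict[str, str]) -> bool:
        raise NotImplementedError

    def run(self, dataset: ToyDataset) -> ToyDataset:
        # Return a new dataset holding only the rows this operator keeps.
        return ToyDataset([row for row in dataset.rows if self.keep(row)])

    def __call__(self, dataset: ToyDataset) -> ToyDataset:
        # op(dataset) is sugar for op.run(dataset).
        return self.run(dataset)


class MinLengthFilter(ToyOp):
    def __init__(self, min_len: int):
        self.min_len = min_len

    def keep(self, row: Dict[str, str]) -> bool:
        return len(row["text"]) >= self.min_len


class NoDigitsFilter(ToyOp):
    def keep(self, row: Dict[str, str]) -> bool:
        return not any(ch.isdigit() for ch in row["text"])


dataset = ToyDataset([
    {"text": "hi"},
    {"text": "hello world"},
    {"text": "room 101 is here"},
])
op1, op2 = MinLengthFilter(5), NoDigitsFilter()

chained = dataset.process([op1, op2])   # chain-call style
functional = op2(op1.run(dataset))      # functional style
```

In this toy version both styles produce the same rows, so the choice is mostly about where the call fits best: `dataset.process([...])` reads naturally in a pipeline, while `op(dataset)` is convenient inside existing scripts.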
### Distributed Data Processing
We have now implemented multi-machine distributed data processing based on [RAY](https://www.ray.io/). The corresponding demos can be run using the following commands:
Expand Down Expand Up @@ -376,6 +400,14 @@ docker run -dit \ # run the container in the background
docker exec -it <container_id> bash
```
<br><hr>
<div style="text-align: right;">
[🔼 back to index](#documentation-index-)
</div>
## Data Recipes
- [Recipes for data process in BLOOM](configs/reproduced_bloom/README.md)
- [Recipes for data process in RedPajama](configs/redpajama/README.md)
Expand Down Expand Up @@ -417,3 +449,24 @@ If you find our work useful for your research or development, please kindly cite
year={2024}
}
```
<details>
<summary> More related papers from Data-Juicer Team:
</summary>
- [Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development](https://arxiv.org/abs/2407.11784)
- [The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective](https://arxiv.org/abs/2407.08583)
- [Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining](https://arxiv.org/abs/2402.11505)
</details>
<br><hr>
<div style="text-align: right;">
[🔼 back to index](#documentation-index-)
</div>
43 changes: 39 additions & 4 deletions README_ZH.md
Expand Up @@ -31,6 +31,7 @@ Data-Juicer is being actively updated and maintained, and we will periodically enhance and add more
----

## News
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-24] "Tianchi Better Synth Data Synthesis Competition for Multimodal Large Models": the 4th Data-Juicer LLM data challenge has officially kicked off! Visit the [competition website](https://tianchi.aliyun.com/competition/entrance/532251) for details.
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png)[2024-07-17] We used the Data-Juicer [Sandbox Laboratory Suite](https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox-ZH.md) to tune data and models through a systematic development workflow between data and models, reaching a new top spot on the [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) text-to-video leaderboard. The results have been compiled and published in a [paper](http://arxiv.org/abs/2407.11784), and the model has been released on the [ModelScope](https://modelscope.cn/models/Data-Juicer/Data-Juicer-T2V) and [HuggingFace](https://huggingface.co/datajuicer/Data-Juicer-T2V) platforms.
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png)[2024-07-12] Our curated MLLM-Data list has evolved into a systematic [survey](https://arxiv.org/abs/2407.08583) from a model-data co-development perspective. Welcome to [explore](docs/awesome_llm_data.md) and contribute!
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-06-01] ModelScope-Sora "Data Directors" creative sprint: the 3rd Data-Juicer LLM data challenge has officially kicked off! Visit the [competition website](https://tianchi.aliyun.com/competition/entrance/532219) for details.
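Expand Down Expand Up @@ -82,7 +83,7 @@ Data-Juicer is being actively updated and maintained, and we will periodically enhance and add more

* **Data-in-the-loop & Sandbox**: Supporting one-stop data-model collaborative development, with rapid iteration through the [sandbox laboratory](docs/Sandbox-ZH.md) and features such as feedback loops based on data and models, visualization, and multidimensional automatic evaluation, so that you can better understand and improve your data and models. ![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg)

* **Enhanced Efficiency**: Providing efficient and parallel data processing pipelines (Aliyun-PAI\Ray\Slurm\CUDA\OP Fusion), requiring less memory and CPU usage and improving productivity. ![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)
* **Towards Production Environment**: Providing efficient and parallel data processing pipelines (Aliyun-PAI\Ray\Slurm\CUDA\OP Fusion), requiring less memory and CPU usage, with support for automatic fault tolerance. ![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)

* **Comprehensive Data Processing Recipes**: Offering tens of [pre-built data processing recipes](configs/data_juicer_recipes/README_ZH.md) for pre-training, fine-tuning, English, Chinese, and other scenarios. Validated on models such as LLaMA and LLaVA. ![exp_llama](https://img.alicdn.com/imgextra/i2/O1CN019WtUPP1uhebnDlPR8_!!6000000006069-2-tps-2530-1005.png)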

Expand Down Expand Up @@ -235,6 +236,19 @@ export DATA_JUICER_MODELS_CACHE="/path/to/another/directory/models"
export DATA_JUICER_ASSETS_CACHE="/path/to/another/directory/assets"
```

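#### Flexible Programming Interface
We provide simple programming interfaces at several levels for users to choose from:
```python
# ... init op & dataset ...
# Chain-call style; supports a single operator or a list of operators
dataset = dataset.process(op)
dataset = dataset.process([op1, op2])
# Functional style, for quick integration or script prototyping
dataset = op(dataset)
dataset = op.run(dataset)
```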

### Distributed Data Processing

Data-Juicer now implements multi-machine distributed data processing based on [RAY](https://www.ray.io/).
Expand Down Expand Up @@ -278,6 +292,9 @@ dj-analyze --config configs/demo/analyzer.yaml
streamlit run app.py
```




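### Build Up Config Files

* A config file contains a set of global arguments and a list of operators for data processing. You need to set:
Expand Down Expand Up @@ -380,8 +397,6 @@ Data-Juicer is used by various LLM products and research work, including those from Alibaba Cloud-通
Data-Juicer thanks and refers to several community open-source projects: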
[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1), [Pile](https://huggingface.co/datasets/EleutherAI/pile), [Alpaca-Cot](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [DeepSpeed](https://www.deepspeed.ai/), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), [Beam](https://github.com/apache/beam), [LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM](https://github.com/stanford-crfm/helm), ....



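## References
If you find our work useful for your research or development, please cite the following [paper](https://arxiv.org/abs/2309.02033).

Expand All @@ -392,4 +407,24 @@ Data-Juicer thanks and refers to several community open-source projects: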
booktitle={International Conference on Management of Data},
year={2024}
}
```
```
<details>
<summary>More related papers from the Data-Juicer Team:
</summary>
- [Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development](https://arxiv.org/abs/2407.11784)
- [The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective](https://arxiv.org/abs/2407.08583)
- [Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining](https://arxiv.org/abs/2402.11505)
</details>
<br><hr>
<div style="text-align: right;">
[🔼 back to index](#documentation-index-a-namedocuments)
</div>
19 changes: 12 additions & 7 deletions docs/DJ_SORA.md
Expand Up @@ -38,7 +38,8 @@ This project is being actively updated and maintained. We eagerly invite you to
- [] Ray based multi-machine distributed running
- [] Aliyun PAI-DLC & Slurm based multi-machine distributed running
- [] Distributed scheduling optimization (OP-aware, automated load balancing) --> Aliyun PAI-DLC
- [ ] [WIP] Distributed storage optimization
- [WIP] Low precision acceleration support for video related operators. (git tags: dj_op, dj_efficiency)
- [WIP] SOTA model enhancement of existing video related operators. (git tags: dj_op, dj_sota_models)

## Basic Operators (video spatio-temporal dimension)
- Towards Data Quality
Expand Down Expand Up @@ -90,20 +91,24 @@ This project is being actively updated and maintained. We eagerly invite you to
- [] **Youku-mPLUG-CN**: 36TB video-caption data: `{<caption, video_id>}`
- [] **InternVid**: 234M data sample: `{<caption, youtube_id, start/end_time>}`
- [] **MSR-VTT**: 10K video-caption data: `{<caption, video_id>}`
- [ ] [WIP] ModelScope's datasets integration
- [ ] VideoInstruct-100K, Panda70M, ......
- [] ModelScope's datasets integration
- [] VideoInstruct-100K, Panda70M, ......
- [ ] Large-scale high-quality DJ-SORA dataset
- [] (Data sandbox) Building and optimizing multimodal data recipes with DJ-video operators (which are also being continuously extended and improved).
- [ ] [WIP] Continuous expansion of data sources: open-datasets, Youku, web, ...
- [ ] [WIP] Large-scale analysis, cleaning, and generation of high-quality multimodal datasets based on DJ recipes (OpenVideos, ...)
- [ ] [WIP] Large-scale generation of 3DPatch datasets based on DJ recipes.
- [] Continuous expansion of data sources: open-datasets, Youku, web, ...
- [ ] Large-scale analysis, cleaning, and generation of high-quality multimodal datasets based on DJ recipes (OpenVideos, ...)
- [WIP] broad scenarios, high-dynamic
- ...

## DJ-SORA Data Validation and Model Training
- [ ] [WIP] (DJ-Bench101) Exploring and refining the collaborative development of multimodal data and model, establishing benchmarks and insights.
- [ ] Exploring and refining the collaborative development of multimodal data and model, establishing benchmarks and insights. [paper](https://arxiv.org/abs/2407.11784)
- [ ] [WIP] Integration of SORA-like model training pipelines
- [EasyAnimate](https://github.com/aigc-apps/EasyAnimate)
- [] [T2V](https://t2v-turbo.github.io/)
- [] [V-Bench](https://vchitect.github.io/VBench-project/)
- ...
- [] (Model-Data sandbox) With relatively small models and the DJ-SORA dataset, exploring low-cost, transferable, and instructive data-model co-design, configurations and checkpoints.
- [ ] [WIP] Training SORA-like models with DJ-SORA data on larger scales and in more scenarios to improve model performance.
- [] Data-Juicer-T2v, [V-Bench Top1 model](https://huggingface.co/datajuicer/Data-Juicer-T2V)
- ...
- ...
24 changes: 14 additions & 10 deletions docs/DJ_SORA_ZH.md
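Expand Up @@ -38,7 +38,8 @@ DJ-SORA will be based on Data-Juicer (including hundreds of dedicated video, image, audio,
- [] Ray based multi-machine distributed running
- [] Aliyun PAI-DLC & Slurm based multi-machine distributed running
- [] Distributed scheduling optimization (OP-aware, automated load balancing) --> Aliyun PAI-DLC
- [ ] [WIP] Distributed storage optimization
- [WIP] Low precision acceleration support for video related operators, git tags: dj_op, dj_efficiency
- [WIP] SOTA model enhancement of existing video related operators, git tags: dj_op, dj_sota_models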

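## Basic Operators (video spatio-temporal dimension)
- Towards Data Quality
Expand Down Expand Up @@ -94,22 +95,25 @@ DJ-SORA will be based on Data-Juicer (including hundreds of dedicated video, image, audio,
- [] **Youku-mPLUG-CN**: 36TB video-caption data: `{<caption, video_id>}`
- [] **InternVid**: 234M data sample: `{<caption, youtube_id, start/end_time>}`
- [] **MSR-VTT**: 10K video-caption data: `{<caption, video_id>}`
- [ ] [WIP] ModelScope's datasets integration
- [ ] VideoInstruct-100K, Panda70M, ......
- [] ModelScope's datasets integration
- [] VideoInstruct-100K, Panda70M, ......
- [ ] Large-scale high-quality DJ-SORA dataset
- [] (Data sandbox) Building and optimizing multimodal data recipes with DJ-video operators (which are being improved in parallel)
- [ ] [WIP] Continuous expansion of data sources: open-datasets, youku, web, ...
- [ ] [WIP] Large-scale analysis, cleaning, and generation of high-quality multimodal datasets based on DJ recipes (OpenVideo, ...)
- [ ] [WIP] Building a large-scale 3DPatch data warehouse based on DJ recipes
- [] Continuous expansion of data sources: open-datasets, youku, web, ...
- [ ] Large-scale analysis, cleaning, and generation of high-quality multimodal datasets based on DJ recipes
- [WIP] Multiple scenarios, high dynamics
- ...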

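## DJ-SORA Data Validation and Model Training
- [ ] [WIP] (DJ-Bench101) Exploring and refining the collaborative development of multimodal data and models, establishing benchmarks and insights
- [ ] [WIP] Integration of SORA-like model training pipelines
- [EasyAnimate](https://github.com/aigc-apps/EasyAnimate)
- [] Exploring and refining the collaborative development of multimodal data and models, establishing benchmarks and insights: [paper](https://arxiv.org/abs/2407.11784)
- [] [WIP] Integration of SORA-like model training pipelines
- [] [EasyAnimate](https://github.com/aigc-apps/EasyAnimate)
- [] [T2V](https://t2v-turbo.github.io/)
- [] [V-Bench](https://vchitect.github.io/VBench-project/)
- ...
- [] (Model-Data sandbox) With relatively small models and the DJ-SORA dataset, exploring low-cost, transferable, and instructive data-model co-design, configurations, and checkpoints
- [ ] [WIP] Training SORA-like models with DJ-SORA data at larger scales and in more scenarios to improve model performance
- ...
- [] Data-Juicer-T2v, [V-Bench Top1 model](https://huggingface.co/datajuicer/Data-Juicer-T2V)
- ...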


8 changes: 5 additions & 3 deletions docs/DeveloperGuide.md
Expand Up @@ -11,7 +11,7 @@
## Coding Style

We define our styles in `.pre-commit-config.yaml`. Before committing,
please install `pre-commit` tool to check and modify accordingly:
please install `pre-commit` tool to automatically check and modify accordingly:

```shell
# ===========install pre-commit tool===========
Expand Down Expand Up @@ -104,20 +104,22 @@ class StatsKeys(object):
return False
```

- If Hugging Face models are used within an operator, you might want to leverage GPU acceleration. To achieve this, declare `self._accelerator = 'cuda'` in the constructor, and ensure that `compute_stats` and `process` methods accept an additional positional argument `rank`.
- If Hugging Face models are used within an operator, you might want to leverage GPU acceleration. To achieve this, declare `_accelerator = 'cuda'` as a class attribute, and ensure that `compute_stats` and `process` methods accept an additional positional argument `rank`.

```python
# ... (same as above)

@OPERATORS.register_module('text_length_filter')
class TextLengthFilter(Filter):

    _accelerator = 'cuda'

    def __init__(self,
                 min_len: PositiveInt = 10,
                 max_len: PositiveInt = sys.maxsize,
                 *args,
                 **kwargs):
        # ... (same as above)
        self._accelerator = 'cuda'

    def compute_stats(self, sample, rank=None):
        # ... (same as above)
Expand Down
8 changes: 5 additions & 3 deletions docs/DeveloperGuide_ZH.md
Expand Up @@ -10,7 +10,7 @@

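## Coding Style

We define our coding style in `.pre-commit-config.yaml`. Before contributing code to the repository, please use the `pre-commit` tool to normalize your code
We define our coding style in `.pre-commit-config.yaml`. Before contributing code to the repository, please use the `pre-commit` tool to automatically normalize your code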

```shell
# ===========install pre-commit tool===========
Expand Down Expand Up @@ -99,20 +99,22 @@ class StatsKeys(object):
return False
```

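- If Hugging Face models are used within an operator, you might want to leverage GPU acceleration. To achieve this, declare `self._accelerator = 'cuda'` in the constructor, and ensure that the `compute_stats` and `process` methods accept an additional positional argument `rank`
- If Hugging Face models are used within an operator, you might want to leverage GPU acceleration. To achieve this, declare `_accelerator = 'cuda'` as a class attribute, and ensure that the `compute_stats` and `process` methods accept an additional positional argument `rank`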

```python
# ... (same as above)

@OPERATORS.register_module('text_length_filter')
class TextLengthFilter(Filter):

    _accelerator = 'cuda'

    def __init__(self,
                 min_len: PositiveInt = 10,
                 max_len: PositiveInt = sys.maxsize,
                 *args,
                 **kwargs):
        # ... (same as above)
        self._accelerator = 'cuda'

    def compute_stats(self, sample, rank=None):
        # ... (same as above)
Expand Down