implement data-model sandbox, with refactoring existing DJ's features and tools (#291)

* refactor for data-model sandbox:
- [x] basic sandbox pipelines and hooks
- [x] basic classes for evaluator, model_executors, taking modelscope_executor as an example
- [WIP] add docs for sandbox (how to use and how to develop)
- [WIP] local test with demo
- [TODO] merge two evaluators from other under-review PRs

* + add missing args for dataset sampling

* * make the simplest sandbox loop work

* + add args that are updated recently

* + make data quality evaluator work
+ support both path and dict for side configs

* * make async work

* * make model training work

* + add more log for unittest

* * avoid failure for tagging OPs

* * rename single test funcs in unittest

* * fix for unittest

* * refine model train executor args

* * specify the run_type by the type arg in extra configs

* + Add docs for sandbox

* update docs for DJ_SORA and homepage (ZH)

* + Add English docs for added contents
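The "support both path and dict for side configs" item above can be sketched roughly as follows. This is an illustrative assumption, not the actual Data-Juicer implementation: the helper name `load_side_config` and the JSON-only file handling are made up for the sketch (the real side configs may also be YAML).

```python
import json

def load_side_config(cfg):
    """Accept a side config either as an inline dict or as a path to a
    JSON config file, and always return a dict (or None)."""
    if cfg is None:
        return None
    if isinstance(cfg, dict):
        # Inline dict: already usable as-is.
        return cfg
    # Otherwise treat it as a file path.
    with open(cfg, encoding="utf-8") as f:
        return json.load(f)
```

Either form of `data_eval_config` (a path string or an inline mapping) would then resolve to the same dict.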

---------

Co-authored-by: lielin.hyl <[email protected]>
yxdyc and HYLcool authored Apr 22, 2024
1 parent 6e0e6e7 commit 4148016
Showing 23 changed files with 1,457 additions and 67 deletions.
31 changes: 22 additions & 9 deletions README.md
@@ -74,6 +74,7 @@ Table of Contents
- [Data Analysis](#data-analysis)
- [Data Visualization](#data-visualization)
- [Build Up Config Files](#build-up-config-files)
- [Sandbox](#sandbox)
- [Preprocess Raw Data (Optional)](#preprocess-raw-data-optional)
- [For Docker Users](#for-docker-users)
- [Data Recipes](#data-recipes)
@@ -90,25 +91,25 @@ Table of Contents
- **Systematic & Reusable**:
Empowering users with a systematic library of 80+ core [OPs](docs/Operators.md), 20+ reusable [config recipes](configs), and 20+ feature-rich
dedicated [toolkits](#documentation), designed to
function independently of specific LLM datasets and processing pipelines.
function independently of specific multimodal LLM datasets and processing pipelines.

- **Data-in-the-loop**: Allowing detailed data analyses with an automated
report generation feature for a deeper understanding of your dataset. Coupled with multi-dimension automatic evaluation capabilities, it supports a timely feedback loop at multiple stages in the LLM development process.
- **Data-in-the-loop & Sandbox**: Supporting one-stop data-model collaborative development, enabling rapid iteration
through the [sandbox laboratory](docs/Sandbox.md), and providing features such as feedback loops based on data and model,
visualization, and multidimensional automatic evaluation, so that you can better understand and improve your data and models.
![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg)

- **Enhanced Efficiency**: Providing efficient and parallel data processing pipelines (Aliyun-PAI\Ray\Slurm\CUDA\OP Fusion)
requiring less memory and CPU usage, optimized for maximum productivity.
![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)

- **Comprehensive Data Processing Recipes**: Offering tens of [pre-built data
processing recipes](configs/data_juicer_recipes/README.md) for pre-training, fine-tuning, en, zh, and more scenarios. Validated on
reference LLaMA and LLaVA models.
![exp_llama](https://img.alicdn.com/imgextra/i2/O1CN019WtUPP1uhebnDlPR8_!!6000000006069-2-tps-2530-1005.png)

- **Enhanced Efficiency**: Providing a speedy data processing pipeline
requiring less memory and CPU usage, optimized for maximum productivity.
![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)


- **Flexible & Extensible**: Accommodating most types of data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to [implement your own OPs](docs/DeveloperGuide.md#build-your-own-ops) for customizable data processing.

- **User-Friendly Experience**: Designed for simplicity, with [comprehensive documentation](#documentation), [easy start guides](#quick-start) and [demo configs](configs/README.md), and intuitive configuration with simple adding/removing OPs from [existing configs](configs/config_all.yaml).
- **User-Friendly Experience**: Designed for simplicity, with [comprehensive documentation](#documents), [easy start guides](#quick-start) and [demo configs](configs/README.md), and intuitive configuration with simple adding/removing OPs from [existing configs](configs/config_all.yaml).



@@ -320,6 +321,18 @@ python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang

![Basic config example of format and definition](https://img.alicdn.com/imgextra/i1/O1CN01uXgjgj1khWKOigYww_!!6000000004715-0-tps-1745-871.jpg "Basic config file example")

### Sandbox

The data sandbox laboratory (DJ-Sandbox) provides users with the best practices for continuously producing data recipes. It features low overhead, portability, and guidance.

- In the sandbox, users can quickly experiment, iterate, and refine data recipes based on small-scale datasets and models, before scaling up to produce high-quality data to serve large-scale models.
- In addition to the basic data optimization and recipe refinement features offered by Data-Juicer, users can seamlessly use configurable components such as data probing and analysis, model training and evaluation, and recipe refinement based on data and model feedback, forming a complete one-stop data-model research and development pipeline.

By default, the sandbox is run with the following command; for more information and details, please refer to the [sandbox documentation](docs/Sandbox.md).
```shell
python tools/sandbox_starter.py --config configs/demo/sandbox/sandbox.yaml
```
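The one-stop pipeline described above (data probing and analysis, model training and evaluation, feedback-based recipe refinement) can be pictured as a loop over optional hooks. A minimal sketch follows; the stage names and the `run_sandbox` function are assumptions for illustration, not the sandbox's real API.

```python
def run_sandbox(hooks, rounds=1):
    """Run optional pipeline hooks in a fixed order for a number of
    rounds, skipping any stage for which no hook is configured."""
    stages = ("probe_data", "process_data", "evaluate_data",
              "train_model", "refine_recipe")
    results = []
    for _ in range(rounds):
        for name in stages:
            hook = hooks.get(name)
            if hook is not None:  # hooks are optional; skip missing ones
                results.append((name, hook()))
    return results
```

Each side config in the sandbox config file would enable the corresponding hook; unspecified configs simply leave that stage out of the loop.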

### Preprocess Raw Data (Optional)
- Our formatters support some common input dataset formats for now:
- Multi-sample in one file: jsonl/json, parquet, csv/tsv, etc.
24 changes: 19 additions & 5 deletions README_ZH.md
@@ -67,6 +67,7 @@ Data-Juicer (including [DJ-SORA](docs/DJ_SORA_ZH.md)) is being actively updated and maintained
- [Data Analysis](#数据分析)
- [Data Visualization](#数据可视化)
- [Build Up Config Files](#构建配置文件)
- [Sandbox](#沙盒实验室)
- [Preprocess Raw Data (Optional)](#预处理原始数据可选)
- [For Docker Users](#对于-docker-用户)
- [Data Recipes](#数据处理菜谱)
@@ -80,15 +81,15 @@ Data-Juicer (including [DJ-SORA](docs/DJ_SORA_ZH.md)) is being actively updated and maintained

![Overview](https://img.alicdn.com/imgextra/i4/O1CN01WYQP3Z1JHsaXaQDK6_!!6000000001004-0-tps-3640-1812.jpg)

* **Systematic & Reusable**: Providing a systematic and reusable library of 80+ core [OPs](docs/Operators_ZH.md), 20+ [config recipes](configs/README_ZH.md), and 20+ dedicated [toolkits](#documentation), designed to make data processing independent of specific LLM datasets and processing pipelines.
* **Systematic & Reusable**: Providing a systematic and reusable library of 80+ core [OPs](docs/Operators_ZH.md), 20+ [config recipes](configs/README_ZH.md), and 20+ dedicated [toolkits](#documentation), designed to make multimodal data processing independent of specific LLM datasets and processing pipelines.

* **Data-in-the-loop**: Supporting detailed data analyses with automated report generation for deeper insight into your dataset; combined with multi-dimension automatic evaluation, it enables timely feedback at multiple stages of the LLM development process. ![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg)
* **Data-in-the-loop & Sandbox**: Supporting one-stop data-model collaborative development with rapid iteration through the [sandbox laboratory](docs/Sandbox-ZH.md); features such as data and model feedback loops, visualization, and multi-dimension automatic evaluation help you better understand and improve your data and models. ![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg)

* **Comprehensive Data Processing Recipes**: Offering tens of [pre-built data processing recipes](configs/data_juicer_recipes/README_ZH.md) for pre-training, fine-tuning, Chinese, English, and more scenarios, effectively validated on models such as LLaMA and LLaVA. ![exp_llama](https://img.alicdn.com/imgextra/i2/O1CN019WtUPP1uhebnDlPR8_!!6000000006069-2-tps-2530-1005.png)
* **Enhanced Efficiency**: Providing efficient and parallel data processing pipelines (Aliyun-PAI\Ray\Slurm\CUDA\OP fusion) that reduce memory footprint and CPU overhead and improve productivity. ![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)

* **Enhanced Efficiency**: Providing an efficient data processing pipeline that reduces memory footprint and CPU overhead and improves productivity. ![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)
* **Comprehensive Data Processing Recipes**: Offering tens of [pre-built data processing recipes](configs/data_juicer_recipes/README_ZH.md) for pre-training, fine-tuning, Chinese, English, and more scenarios, effectively validated on models such as LLaMA and LLaVA. ![exp_llama](https://img.alicdn.com/imgextra/i2/O1CN019WtUPP1uhebnDlPR8_!!6000000006069-2-tps-2530-1005.png)

* **User-Friendly Experience**: Designed for ease of use, with comprehensive [documentation](#documentation), an easy [quick-start guide](#快速上手) and [demo configs](configs/README_ZH.md); OPs can be easily added to or removed from [existing configs](configs/config_all.yaml).
* **User-Friendly Experience**: Designed for ease of use, with comprehensive [documentation](#documents), an easy [quick-start guide](#快速上手) and [demo configs](configs/README_ZH.md); OPs can be easily added to or removed from [existing configs](configs/config_all.yaml).

* **Flexible & Extensible**: Supporting most data formats (e.g., jsonl, parquet, csv) and flexible combinations of OPs; [custom OPs](docs/DeveloperGuide_ZH.md#构建自己的算子) are supported for customized data processing.

@@ -295,6 +296,19 @@ python xxx.py --config configs/demo/process.yaml --language_id_score_filter.lang

![Basic config example of format and definition](https://img.alicdn.com/imgextra/i4/O1CN01xPtU0t1YOwsZyuqCx_!!6000000003050-0-tps-1692-879.jpg "Basic config file example")

### 沙盒实验室

The data sandbox laboratory (DJ-Sandbox) provides users with best practices for continuously producing data recipes, featuring low overhead, portability, and guidance.
- In the sandbox, users can quickly experiment with, iterate on, and refine data recipes based on small-scale datasets and models, then migrate to larger scales to mass-produce high-quality data serving large models.
- Beyond Data-Juicer's basic data optimization and recipe refinement features, users can conveniently use configurable components such as data probing and analysis, sandbox model training and evaluation, and recipe refinement based on data and model feedback, which together form a complete one-stop data-model research and development pipeline.

By default, the sandbox runs with the following command; for more introduction and details, please refer to the [sandbox documentation](docs/Sandbox-ZH.md).
```shell
python tools/sandbox_starter.py --config configs/demo/sandbox/sandbox.yaml
```



### 预处理原始数据(可选)

* Our Formatter currently supports some common input dataset formats:
16 changes: 15 additions & 1 deletion configs/config_all.yaml
@@ -10,7 +10,7 @@ export_path: '/path/to/result/dataset.jsonl' # path to processed
export_shard_size: 0                                        # shard size of the exported dataset in bytes. By default it's 0, which means exporting the whole dataset into a single file. If it's set to a positive number, the exported dataset will be split into several shards, and the max size of each shard won't be larger than export_shard_size
export_in_parallel: false                                   # whether to export the result dataset in parallel to a single file, which usually takes less time. It only works when export_shard_size is 0, and its default number of processes is the same as the argument np. **Notice**: if it's true, exporting in parallel sometimes requires much more time due to IO blocking, especially for very large datasets. When this happens, false is a better choice, although it takes more time.
np: 4                                                       # number of subprocesses to process your dataset
text_keys: 'content'                                        # the key name of the field storing the sample texts to be processed, e.g., `text`, `instruction`, `output`, ...
text_keys: 'text'                                           # the key name of the field storing the sample texts to be processed, e.g., `text`, `instruction`, `output`, ...
# Note: currently, we support specifying only ONE key for each op; for cases requiring multiple keys, users can specify the op multiple times. We will only use the first key of `text_keys` when you set multiple keys.
suffixes: [] # the suffix of files that will be read. For example: '.txt', 'txt' or ['txt', '.pdf', 'docx']
use_cache: true # whether to use the cache management of Hugging Face datasets. It might take up lots of disk space when using cache
@@ -22,6 +22,8 @@ op_list_to_trace: []                        # only ops in this l
trace_num: 10 # number of samples to show the differences between datasets before and after each op. Only available when tracer is opened.
op_fusion: false # whether to fuse operators that share the same intermediate variables automatically. Op fusion might reduce the memory requirements slightly but speed up the whole process.
cache_compress: null # the compression method of the cache file, which can be specified in ['gzip', 'zstd', 'lz4']. If this parameter is None, the cache file will not be compressed. We recommend you turn on this argument when your input dataset is larger than tens of GB and your disk space is not enough.
keep_stats_in_res_ds: false                                 # whether to keep the computed stats in the result dataset. If it's false, the intermediate fields used to store the stats computed by Filters will be removed. It's false by default.
keep_hashes_in_res_ds: false                                # whether to keep the computed hashes in the result dataset. If it's false, the intermediate fields used to store the hashes computed by Deduplicators will be removed. It's false by default.
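A rough sketch of what these two switches imply for each exported sample. The helper name and the internal field names `__dj__stats__` and `__dj__hash__` are illustrative assumptions, not necessarily the exact keys Data-Juicer uses.

```python
def strip_intermediate_fields(sample, keep_stats=False, keep_hashes=False,
                              stats_key="__dj__stats__",
                              hash_key="__dj__hash__"):
    """Drop per-sample stats/hash fields from an exported sample unless
    the corresponding keep_* switch is on."""
    out = dict(sample)  # copy so the in-memory sample stays untouched
    if not keep_stats:
        out.pop(stats_key, None)
    if not keep_hashes:
        out.pop(hash_key, None)
    return out
```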

# for multimodal data processing
image_key: 'images' # key name of field to store the list of sample image paths.
@@ -40,6 +42,18 @@ ray_address: auto                # the address of the
# only for data analysis
save_stats_in_one_file: false # whether to store all stats result into one file

# for sandbox or hpo
model_infer_config: null                                    # path or dict of the model inference configuration used when calling the model executor in the sandbox. Related hooks will be disabled if it's not specified.
model_train_config: null                                    # path or dict of the model training configuration used when calling the model executor in the sandbox. Related hooks will be disabled if it's not specified.
model_eval_config: null                                     # path or dict of the model evaluation configuration used when calling the model executor in the sandbox. Related hooks will be disabled if it's not specified.
data_eval_config: null                                      # path or dict of the data evaluation configuration used when calling the model executor in the sandbox. Related hooks will be disabled if it's not specified.
data_probe_algo: 'uniform'                                  # sampling algorithm for the dataset probe. Should be one of ["uniform", "frequency_specified_field_selector", "topk_specified_field_selector"]. It's "uniform" by default. Only used for dataset sampling.
data_probe_ratio: 1.0                                       # the sampling ratio relative to the original dataset size. It's 1.0 by default. Only used for dataset sampling.
path_k_sigma_recipe: null                                   # path to save the configuration file refined by the k-sigma tool
path_model_feedback_recipe: null                            # path to save the configuration file refined by model feedback
hpo_config: null                                            # path to a configuration file for the auto-HPO tool
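The `data_probe_algo`/`data_probe_ratio` pair above can be read as follows: a minimal sketch of the "uniform" case (the two selector-based algorithms are omitted), where the function name `probe_sample` and the fixed seed are illustrative assumptions.

```python
import random

def probe_sample(dataset, algo="uniform", ratio=1.0, seed=42):
    """Draw a probe subset of the dataset; only the "uniform" algorithm
    is sketched here."""
    if algo != "uniform":
        raise NotImplementedError(f"sampling algo not sketched: {algo}")
    # Subset size is ratio * len(dataset), with at least one sample.
    n = max(1, int(len(dataset) * ratio))
    return random.Random(seed).sample(dataset, n)
```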


# process schedule: a list of several process operators with their arguments
process:
# Mapper ops. Most of these ops need no arguments.
1 change: 1 addition & 0 deletions configs/demo/sandbox/gpt3_data_quality_eval_config.yaml
@@ -0,0 +1 @@
type: dj_text_quality_classifier
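A side config like this one selects an evaluator by its `type` field. The dispatch might look roughly like the sketch below; the `build_evaluator` helper and the registry contents are assumptions for illustration, not Data-Juicer's actual classes.

```python
def build_evaluator(cfg, registry):
    """Look up an evaluator factory by the config's `type` field and
    instantiate it."""
    eval_type = cfg.get("type")
    if eval_type not in registry:
        raise ValueError(f"unknown evaluator type: {eval_type!r}")
    return registry[eval_type]()
```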
26 changes: 26 additions & 0 deletions configs/demo/sandbox/gpt3_extra_train_config.json
@@ -0,0 +1,26 @@
{
"type": "modelscope",
"model_name": "iic/nlp_gpt3_text-generation_chinese-base",
"trainer_name": "nlp-base-trainer",
"key_remapping": {
"text": "src_txt"
},
"train": {
"max_epochs": 3,
"lr_scheduler": {
"type": "StepLR",
"step_size": 2,
"options": {
"by_epoch": false
}
},
"optimizer": {
"type": "AdamW",
"lr": 3e-4
},
"dataloader": {
"batch_size_per_gpu": 2,
"workers_per_gpu": 0
}
}
}
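The `key_remapping` entry above renames dataset fields (here `text` to `src_txt`) before samples reach the trainer. A minimal sketch under that assumption; the helper name is hypothetical.

```python
def remap_keys(sample, key_remapping):
    """Rename sample fields according to key_remapping, leaving
    unmapped keys unchanged."""
    return {key_remapping.get(k, k): v for k, v in sample.items()}
```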
18 changes: 18 additions & 0 deletions configs/demo/sandbox/gpt3_extra_train_config.yaml
@@ -0,0 +1,18 @@
type: modelscope
model_name: "iic/nlp_gpt3_text-generation_chinese-base"
trainer_name: "nlp-base-trainer"
key_remapping:
text: "src_txt"
train:
max_epochs: 2
lr_scheduler:
type: "StepLR"
step_size: 2
options:
by_epoch: false
optimizer:
type: "AdamW"
lr: 0.0003
dataloader:
batch_size_per_gpu: 2
workers_per_gpu: 0
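As a sanity check on the schedule configured above: with `by_epoch: false`, a StepLR decays the learning rate every `step_size` optimizer steps by a factor `gamma` (the 0.1 default below is an assumption matching the usual PyTorch convention; ModelScope may differ).

```python
def step_lr(base_lr, step, step_size=2, gamma=0.1):
    """Decay base_lr by a factor of gamma every `step_size` steps."""
    return base_lr * gamma ** (step // step_size)
```

With `lr: 0.0003` and `step_size: 2`, steps 0-1 run at 3e-4, steps 2-3 at 3e-5, and so on.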
27 changes: 27 additions & 0 deletions configs/demo/sandbox/sandbox.yaml
@@ -0,0 +1,27 @@
# Sandbox config example for dataset

# global parameters
project_name: 'demo-sandbox'
dataset_path: './demos/data/demo-dataset.jsonl' # path to your dataset directory or file
np: 4 # number of subprocess to process your dataset

export_path: './outputs/demo-sandbox/demo-sandbox.jsonl'

# sandbox configs
# for refining recipe using k-sigma rules
path_k_sigma_recipe: './outputs/demo-sandbox/k_sigma_new_recipe.yaml'

# for gpt3 quality classifier as data evaluator
data_eval_config: 'configs/demo/sandbox/gpt3_data_quality_eval_config.yaml'
#data_eval_config:
# type: dj_text_quality_classifier

# for gpt3 model training
model_train_config: 'configs/demo/sandbox/gpt3_extra_train_config.json'

# process schedule
# a list of several process operators with their arguments
process:
- language_id_score_filter:
lang: 'zh'
min_score: 0.5
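The `path_k_sigma_recipe` option in this config points at the k-sigma recipe-refinement tool. Its core idea can be sketched as deriving an operator threshold from the stats distribution as mean minus k standard deviations; the exact rule Data-Juicer applies may differ, so treat this as an illustration only.

```python
def k_sigma_threshold(values, k=3.0):
    """Pick a lower threshold as mean - k * std of the observed stats."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation of the per-sample stats.
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return mean - k * std
```

Applied to, say, language-id scores collected by the analyzer, such a rule would tighten or loosen `min_score` based on the observed distribution.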