sandbox bench experiment workflow (#364)
* FVD and ISV for video eval

* restore tools init

* restore tools init

* pre-commit done

* add FID KID IS PR and PRV metrics

* add KVD metric

* fix doc

* allow relative path

* fix sample 50000 image

* fvd sandbox

* fvd sandbox test done

* precommit done

* easyanimate train and infer in sandbox

* divide dataset pipeline

* fix data num for each partition

* pre-commit done

* test sandbox for videos done

* fix executor

* fix executor

* check datalen

* sort data for partition

* sort data for partition

* fix video_aspect_ratio_filter

* fix video_aspect_ratio_filter

* tensor stats to float

* precommit done

* fix words num filter

* pre-commit done

* add seed for train and infer

* add seed for easyanimate

* sandbox rebuild v1

* fix empty frames

* switch

* fix conflict

* fix hpo 3sigma

* after pre-commit

* sandbox readme zh

* finish doc

* remove training limit

* other_configs -> extra_configs

* other_configs -> extra_configs

* res_name -> meta_name

* hooker -> hook

* analyze -> analyse

* after pre-commit

* analyse -> analyze

* analyser.py -> analyzer.py

* analyser.py -> analyzer.py

* analyser.py -> analyzer.py

* regist -> register, DICT -> MAPPING

* range_specified_field_selector

* pipeline test done

* dataset in readme

* update readme

* pre-commit done

* rm experiment name in dj

* add init dataset

* fix auto_evaluation_helm readme

* remove easyanimate code

* shorten diff

---------

Co-authored-by: binke <[email protected]>
BeachWang and binke authored Aug 2, 2024
1 parent 8d53f23 commit 4d8c521
Showing 60 changed files with 3,817 additions and 268 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -1,6 +1,5 @@

# data & resources
models/
outputs/
assets/

3 changes: 2 additions & 1 deletion .pre-commit-config.yaml
@@ -39,6 +39,7 @@ exclude: |
docs/.*|
tests/.*|
demos/.*|
tools/mm_eval/inception_metrics.*|
tools/mm_eval/inception_metrics/.*|
thirdparty/easy_animate/.*|
.*\.md
)$
22 changes: 11 additions & 11 deletions README.md
@@ -46,7 +46,7 @@ In this new version, we support more features for **multimodal data (including v
- [2024-02-20] We have actively maintained an *awesome list of LLM-Data*, welcome to [visit](docs/awesome_llm_data.md) and contribute!
- [2024-02-05] Our paper has been accepted by SIGMOD'24 industrial track!
- [2024-01-10] Discover new horizons in "Data Mixture"—Our second data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532174) for more information.
- [2024-01-05] We release **Data-Juicer v0.1.3** now!
In this new version, we support **more Python versions** (3.8-3.10), and support **multimodal** dataset [converting](tools/multimodal/README.md)/[processing](docs/Operators.md) (Including texts, images, and audios. More modalities will be supported in the future).
Besides, our paper is also updated to [v3](https://arxiv.org/abs/2309.02033).
- [2023-10-13] Our first data-centric LLM competition begins! Please
@@ -94,8 +94,8 @@ Table of Contents
dedicated [toolkits](#documentation), designed to
function independently of specific multimodal LLM datasets and processing pipelines.

- **Data-in-the-loop & Sandbox**: Supporting one-stop data-model collaborative development, enabling rapid iteration
through the [sandbox laboratory](docs/Sandbox.md), and providing features such as feedback loops based on data and model,
visualization, and multidimensional automatic evaluation, so that you can better understand and improve your data and models.
![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg)

@@ -194,11 +194,11 @@ The dependency options are listed below:
pip install py-data-juicer
```

- **Note**:
- only the basic APIs in `data_juicer` and two basic tools
(data [processing](#data-processing) and [analysis](#data-analysis)) are available in this way. If you want customizable
and complete functions, we recommend you install `data_juicer` [from source](#from-source).
  - The release versions from pypi have a certain lag compared to the latest version from source.
So if you want to follow the latest functions of `data_juicer`, we recommend you install [from source](#from-source).

### Using Docker
@@ -215,7 +215,7 @@ pip install py-data-juicer
```shell
docker build -t datajuicer/data-juicer:<version_tag> .
```

- The format of `<version_tag>` is like `v0.2.0`, which is the same as the release version tag.

### Installation check
@@ -413,20 +413,20 @@ docker exec -it <container_id> bash
Data-Juicer is released under Apache License 2.0.
## Contributing
We are in a rapidly developing field and greatly welcome contributions of new
features, bug fixes, and better documentation. Please refer to
[How-to Guide for Developers](docs/DeveloperGuide.md).
If you have any questions, please join our [discussion groups](README.md).
## Acknowledgement
Data-Juicer is used across various LLM products and research initiatives,
including industrial LLMs from Alibaba Cloud's Tongyi, such as Dianjin for
financial analysis, and Zhiwen as a reading assistant, as well as the Alibaba
Cloud's platform for AI (PAI).
We look forward to more of your experience, suggestions and discussions for collaboration!
Data-Juicer thanks and refers to several community projects, such as
[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1), [Pile](https://huggingface.co/datasets/EleutherAI/pile), [Alpaca-Cot](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [DeepSpeed](https://www.deepspeed.ai/), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), [Beam](https://github.com/apache/beam), [LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM](https://github.com/stanford-crfm/helm), ....
2 changes: 1 addition & 1 deletion README_ZH.md
@@ -193,7 +193,7 @@ pip install py-data-juicer
```shell
docker build -t datajuicer/data-juicer:<version_tag> .
```

- The format of `<version_tag>` is like `v0.2.0`, the same as the release version tag.

### Installation check
20 changes: 14 additions & 6 deletions configs/config_all.yaml
@@ -49,7 +49,6 @@ data_probe_algo: 'uniform' # sampling algorithm
data_probe_ratio: 1.0 # the sampling ratio to the original dataset size. It's 1.0 in default. Only used for dataset sampling.
hpo_config: null # path to a configuration file when using auto-HPO tool.


# process schedule: a list of several process operators with their arguments
process:
# Mapper ops. Most of these ops need no arguments.
@@ -496,13 +495,22 @@ process:
ignore_non_character: false # whether to ignore non-alphabet characters, including whitespaces, digits, and punctuations

# Selector ops
- topk_specified_field_selector: # selector to select top samples based on the sorted specified field
field_key: '' # the target keys corresponding to multi-level field information need to be separated by '.'
top_ratio: # ratio of selected top samples
topk: # number of selected top sample
reverse: True # determine the sorting rule, if reverse=True, then sort in descending order
- frequency_specified_field_selector: # selector to select samples based on the sorted frequency of specified field value
field_key: '' # the target keys corresponding to multi-level field information need to be separated by '.'
top_ratio: # ratio of selected top specified field value
topk: # number of selected top specified field value
reverse: True # determine the sorting rule, if reverse=True, then sort in descending order
- random_selector: # selector to random select samples
select_ratio: # the ratio to be sampled
select_num: # the number to be sampled
- range_specified_field_selector: # selector to select a range of samples based on the sorted specified field value from smallest to largest.
field_key: '' # the target keys corresponding to multi-level field information need to be separated by '.'
lower_percentile: # the lower bound of the percentile to be sampled
upper_percentile: # the upper bound of the percentile to be sampled
lower_rank: # the lower rank of the percentile to be sampled
upper_rank: # the upper rank of the percentile to be sampled
- topk_specified_field_selector: # selector to select top samples based on the sorted specified field
field_key: '' # the target keys corresponding to multi-level field information need to be separated by '.'
top_ratio: # ratio of selected top samples
topk: # number of selected top sample
reverse: True # determine the sorting rule, if reverse=True, then sort in descending order
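
For readers skimming this diff, the behavior documented by the new `range_specified_field_selector` entry can be restated in a few lines of plain Python. The sketch below is illustrative only and assumes nothing about Data-Juicer's internals; the helper name `select_range` is invented for this example.

```python
# Illustrative sketch only, not Data-Juicer's selector code. It restates the
# documented behavior: sort samples by a (possibly nested) field, then keep
# the slice bounded either by percentiles or by absolute ranks.
from typing import Any, Dict, List, Optional


def select_range(samples: List[Dict[str, Any]],
                 field_key: str,
                 lower_percentile: Optional[float] = None,
                 upper_percentile: Optional[float] = None,
                 lower_rank: Optional[int] = None,
                 upper_rank: Optional[int] = None) -> List[Dict[str, Any]]:
    def field_value(sample: Dict[str, Any]) -> float:
        # multi-level keys are separated by '.', e.g. '__dj__stats__.lang_score'
        node: Any = sample
        for part in field_key.split('.'):
            node = node[part]
        return float(node)

    ordered = sorted(samples, key=field_value)  # smallest to largest
    n = len(ordered)
    lo = lower_rank if lower_rank is not None else int(n * (lower_percentile or 0.0))
    hi = upper_rank if upper_rank is not None else int(n * (upper_percentile or 1.0))
    return ordered[lo:hi]
```

With `lower_percentile: 0.667` and `upper_percentile: 1.000`, as in the sandbox configs added later in this commit, the slice kept is roughly the top third of samples by the chosen stat.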
1 change: 1 addition & 0 deletions configs/data_juicer_recipes/README.md
@@ -41,6 +41,7 @@ We use a simple 3-σ rule to set the hyperparameters for ops in each recipe.
| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | [llava-pretrain-refine.yaml](llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer) | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
| Data-Juicer-T2V | 1,217,346 | 147,176 | 12.09% | [2_multi_op_pipline.yaml](../demo/bench/2_multi_op_pipline.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |

### Evaluation Results
- LLaVA pretrain (LCS-558k): models **pretrained with refined dataset** and fine-tuned with the original instruct dataset outperforms the baseline (LLaVA-1.5-13B) on 10 out of 12 benchmarks.
3 changes: 2 additions & 1 deletion configs/data_juicer_recipes/README_ZH.md
@@ -41,6 +41,7 @@
| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---------------------------|:---------------------------:|:--------------:|:----------:|--------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|
| LLaVA pretrain (LCS-558k) | 558,128 | 500,380 | 89.65% | [llava-pretrain-refine.yaml](llava-pretrain-refine.yaml) | [Aliyun](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/LLaVA-1.5/public/llava-pretrain-refine-result.json) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/llava-pretrain-refined-by-data-juicer/summary) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/llava-pretrain-refined-by-data-juicer) | [LLaVA-1.5](https://github.com/haotian-liu/LLaVA) |
| Data-Juicer-T2V | 1,217,346 | 147,176 | 12.09% | [2_multi_op_pipline.yaml](../demo/bench/2_multi_op_pipline.yaml) | [Aliyun](http://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/MM_data/our_refined_data/Data-Juicer-T2V/data_juicer_t2v_optimal_data_pool.zip) <br> [ModelScope](https://modelscope.cn/datasets/Data-Juicer/data-juicer-t2v-optimal-data-pool) <br> [HuggingFace](https://huggingface.co/datasets/datajuicer/data-juicer-t2v-optimal-data-pool) | [InternVid (606k)](https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVid) <br> [Panda-70M (605k)](https://github.com/snap-research/Panda-70M) <br> [MSR-VTT (6k)](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) |

### Evaluation Results
- LLaVA pretrain (LCS-558k): models **pretrained with the refined dataset** and fine-tuned with the original instruct dataset outperform the baseline (LLaVA-1.5-13B) on 10 out of 12 benchmarks.
@@ -57,4 +58,4 @@
- Video only: improve dataset quality based on video properties
- Text-video: improve dataset quality based on the alignment between text and video
Users can start their video dataset processing pipelines from this recipe.
-
68 changes: 68 additions & 0 deletions configs/demo/bench/1_single_op_pipline.yaml
@@ -0,0 +1,68 @@
# Sandbox config example

# global parameters
project_name: 'demo-bench'
experiment_name: 'single_op_language_score' # for wandb tracer name
work_dir: './outputs/demo-bench' # the default output dir for meta logging

# configs for each job, the jobs will be executed according to the order in the list
probe_job_configs:
# get statistics value for each sample and get the distribution analysis for given percentiles
- hook: 'ProbeViaAnalyzerHook'
meta_name: 'analysis_ori_data'
dj_configs:
project_name: 'demo-bench'
dataset_path: './demos/data/demo-dataset-videos.jsonl' # path to your dataset directory or file
percentiles: [0.333, 0.667] # percentiles to analyze the dataset distribution
export_path: './outputs/demo-bench/demo-dataset-with-language-score.jsonl'
export_original_dataset: true # must be true to keep statistics values with dataset
process:
- language_id_score_filter:
lang: 'zh'
min_score: 0.8
extra_configs:

refine_recipe_job_configs:

execution_job_configs:
# sample the splits with low/middle/high statistics values
- hook: 'ProcessDataHook'
meta_name:
dj_configs:
project_name: 'demo-bench'
dataset_path: './outputs/demo-bench/demo-dataset-with-language-score.jsonl' # output dataset of probe jobs
export_path: './outputs/demo-bench/demo-dataset-with-high-language-score.jsonl'
process:
- range_specified_field_selector:
field_key: '__dj__stats__.lang_score' # multi-level keys are separated by '.'; '__dj__stats__' is the default location where Data-Juicer stores per-sample stats, and 'lang_score' is the stat produced by language_id_score_filter
lower_percentile: 0.667
upper_percentile: 1.000
extra_configs:
# random sample dataset with fix number of instances
- hook: 'ProcessDataHook'
meta_name:
dj_configs:
project_name: 'demo-bench'
dataset_path: './outputs/demo-bench/demo-dataset-with-high-language-score.jsonl' # output dataset of probe jobs
export_path: './outputs/demo-bench/demo-dataset-for-train.jsonl'
process:
- random_selector:
select_num: 16
extra_configs:
# train model
- hook: 'TrainModelHook'
meta_name:
dj_configs:
extra_configs: './configs/demo/bench/model_train.yaml'
# infer model
- hook: 'InferModelHook'
meta_name:
dj_configs:
extra_configs: './configs/demo/bench/model_infer.yaml'

evaluation_job_configs:
# vbench evaluation
- hook: 'EvaluateDataHook'
meta_name: 'vbench_eval'
dj_configs:
extra_configs: './configs/demo/bench/vbench_eval.yaml'
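
To sanity-check the split this pipeline performs, the probe job's exported JSONL can be inspected directly. The snippet below is a hedged sketch: it assumes each exported row keeps its per-sample stats in a nested dict under `__dj__stats__`, as the dotted `field_key` comment above implies, and it reproduces the 0.667 to 1.000 percentile selection applied by the first `ProcessDataHook`.

```python
# Hedged sketch: inspect the probe job's exported dataset and check which
# samples the 0.667-1.000 percentile range selection would keep. The nested
# layout {'__dj__stats__': {'lang_score': ...}} is an assumption based on the
# dotted field_key comment in the config above.
import json

path = './outputs/demo-bench/demo-dataset-with-language-score.jsonl'
with open(path, 'r', encoding='utf-8') as f:
    samples = [json.loads(line) for line in f if line.strip()]

scores = sorted(s['__dj__stats__']['lang_score'] for s in samples)
cut = scores[int(len(scores) * 0.667)]  # lower bound of the top third
kept = [s for s in samples if s['__dj__stats__']['lang_score'] >= cut]
print(f'{len(kept)}/{len(samples)} samples fall in the top third '
      f'(lang_score >= {cut:.4f})')
```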
58 changes: 58 additions & 0 deletions configs/demo/bench/2_multi_op_pipline.yaml
@@ -0,0 +1,58 @@
# Sandbox config example

# global parameters
project_name: 'demo-bench'
experiment_name: 'single_op_language_score' # for wandb tracer name
work_dir: './outputs/demo-bench' # the default output dir for meta logging

# configs for each job, the jobs will be executed according to the order in the list
probe_job_configs:

refine_recipe_job_configs:

execution_job_configs:
- hook: 'ProcessDataHook'
meta_name:
dj_configs:
project_name: 'demo-bench'
dataset_path: './demos/data/demo-dataset-videos.jsonl' # path to your dataset directory or file
export_path: './outputs/demo-bench/demo-dataset-with-multi-op-stats.jsonl'
export_original_dataset: true # must be true to keep statistics values with dataset
process:
# select samples with high language score
- language_id_score_filter:
lang:
min_score: 0.7206037306785583 # this value can be taken from the analysis result of the probe job in the one-op experiments
# select samples with middle video duration
- video_duration_filter:
min_duration: 19.315000 # this value can be taken from the analysis result of the probe job in the one-op experiments
max_duration: 32.045000 # this value can be taken from the analysis result of the probe job in the one-op experiments

extra_configs:
- hook: 'ProcessDataHook'
meta_name:
dj_configs:
project_name: 'demo-bench'
dataset_path: './outputs/demo-bench/demo-dataset-with-multi-op-stats.jsonl'
export_path: './outputs/demo-bench/demo-dataset-for-train.jsonl'
process:
- random_selector:
select_num: 16
extra_configs:
# train model
- hook: 'TrainModelHook'
meta_name:
dj_configs:
extra_configs: './configs/demo/bench/model_train.yaml'
# infer model
- hook: 'InferModelHook'
meta_name:
dj_configs:
extra_configs: './configs/demo/bench/model_infer.yaml'

evaluation_job_configs:
# vbench evaluation
- hook: 'EvaluateDataHook'
meta_name: 'vbench_eval'
dj_configs:
extra_configs: './configs/demo/bench/vbench_eval.yaml'
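
According to the inline comments, the hard-coded `min_score`, `min_duration`, and `max_duration` values are read off the probe analyses of the earlier one-op experiments. The sketch below shows one hedged way such boundary values could be reproduced from a per-sample stats export; the `video_duration` key is a hypothetical name used only for illustration, and in practice each threshold would come from the export of its own one-op run.

```python
# Hedged sketch: derive percentile boundaries from a one-op probe export, the
# kind of numbers the min_score/min_duration/max_duration comments refer to.
# 'video_duration' is a hypothetical stats key used only for illustration;
# check the probe job's analysis output for the exact name emitted by
# video_duration_filter.
import json

import numpy as np

path = './outputs/demo-bench/demo-dataset-with-language-score.jsonl'
with open(path, 'r', encoding='utf-8') as f:
    stats = [json.loads(line)['__dj__stats__'] for line in f if line.strip()]

lang = np.array([s['lang_score'] for s in stats])
print('lang_score 66.7th percentile:', np.percentile(lang, 66.7))

durations = np.array([s['video_duration'] for s in stats if 'video_duration' in s])
if durations.size:
    print('duration 33.3rd/66.7th percentiles:',
          np.percentile(durations, [33.3, 66.7]))
```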

