merge main
BeachWang committed Oct 10, 2024
2 parents 0fe4c2d + 1aaad21 commit 76d2d72
Showing 2 changed files with 24 additions and 9 deletions.
18 changes: 13 additions & 5 deletions README.md
@@ -39,9 +39,14 @@ We welcome you to join us (via issues, PRs, [Slack](https://join.slack.com/t/dat
## News
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-08-09] We propose Img-Diff, which enhances the performance of multimodal large language models through *contrastive data synthesis*, achieving a score that is 12 points higher than GPT-4V on the [MMVP benchmark](https://tsb0601.github.io/mmvp_blog/). See more details in our [paper](https://arxiv.org/abs/2408.04594), and download the dataset from [huggingface](https://huggingface.co/datasets/datajuicer/Img-Diff) and [modelscope](https://modelscope.cn/datasets/Data-Juicer/Img-Diff).
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-24] "Tianchi Better Synth Data Synthesis Competition for Multimodal Large Models" — Our 4th data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532251) for more information.
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-17] We utilized the Data-Juicer [Sandbox Laboratory Suite](https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox.md) to systematically optimize data and models through an co-development workflow between data and models, achieving a new top spot on the [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) text-to-video leaderboard. The related achievements have been compiled and published in a [paper](http://arxiv.org/abs/2407.11784), and the model has been released on the [ModelScope](https://modelscope.cn/models/Data-Juicer/Data-Juicer-T2V) and [HuggingFace](https://huggingface.co/datajuicer/Data-Juicer-T2V) platforms.
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-17] We utilized the Data-Juicer [Sandbox Laboratory Suite](https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox.md) to systematically optimize data and models through a co-development workflow between data and models, achieving a new top spot on the [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) text-to-video leaderboard. The related achievements have been compiled and published in a [paper](http://arxiv.org/abs/2407.11784), and the model has been released on the [ModelScope](https://modelscope.cn/models/Data-Juicer/Data-Juicer-T2V) and [HuggingFace](https://huggingface.co/datajuicer/Data-Juicer-T2V) platforms.
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-12] Our *awesome list of MLLM-Data* has evolved into a systematic [survey](https://arxiv.org/abs/2407.08583) from a model-data co-development perspective. You are welcome to [explore](docs/awesome_llm_data.md) and contribute!
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-06-01] ModelScope-Sora "Data Directors" creative sprint—Our third data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532219) for more information.

<details>
<summary> History News:
</summary>

- [2024-03-07] We have released **Data-Juicer [v0.2.0](https://github.com/alibaba/data-juicer/releases/tag/v0.2.0)**!
In this new version, we support more features for **multimodal data (now including video)**, and introduce **[DJ-SORA](docs/DJ_SORA.md)** to provide open, large-scale, high-quality datasets for SORA-like models.
- [2024-02-20] We actively maintain an *awesome list of LLM-Data*; you are welcome to [visit](docs/awesome_llm_data.md) and contribute!
@@ -52,6 +57,8 @@ In this new version, we support **more Python versions** (3.8-3.10), and support
Besides, our paper is also updated to [v3](https://arxiv.org/abs/2309.02033).
- [2023-10-13] Our first data-centric LLM competition begins! Please
visit the competition's official websites, FT-Data Ranker ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for more information.
</details>


<div id="table" align="center"></div>

@@ -121,16 +128,17 @@ Table of Contents
- [Operator Zoo](docs/Operators.md)
- [Configs](configs/README.md)
- [Developer Guide](docs/DeveloperGuide.md)
- [API references](https://modelscope.github.io/data-juicer/)
- [KDD-Tutorial](https://modelscope.github.io/data-juicer/_static/tutorial_kdd24.html)
- ["Bad" Data Exhibition](docs/BadDataExhibition.md)
- [Awesome LLM-Data](docs/awesome_llm_data.md)
- Dedicated Toolkits
- [Quality Classifier](tools/quality_classifier/README.md)
- [Auto Evaluation](tools/evaluator/README.md)
- [Preprocess](tools/preprocess/README.md)
- [Postprocess](tools/postprocess/README.md)
- [Third-parties (LLM Ecosystems)](thirdparty/README.md)
- [API references](https://modelscope.github.io/data-juicer/)
- [Awesome LLM-Data](docs/awesome_llm_data.md)
- [DJ-SORA](docs/DJ_SORA.md)
- [Third-parties (LLM Ecosystems)](thirdparty/README.md)


## Demos
@@ -453,7 +461,7 @@ If you find our work useful for your research or development, please kindly cite
- [ImgDiff: Contrastive Data Synthesis for Vision Large Language Models](https://arxiv.org/abs/2408.04594)
- [Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining](https://arxiv.org/abs/2402.11505)
- [Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining](https://arxiv.org/abs/2405.14908)
</details>
15 changes: 11 additions & 4 deletions README_ZH.md
@@ -36,6 +36,10 @@ Data-Juicer is being actively updated and maintained, and we will periodically enhance and add more
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-17] We used the Data-Juicer [Sandbox Laboratory Suite](https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox-ZH.md) to tune data and models through a systematic development workflow between data and models, achieving a new top spot on the [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) text-to-video leaderboard. The results have been compiled and published in a [paper](http://arxiv.org/abs/2407.11784), and the model has been released on the [ModelScope](https://modelscope.cn/models/Data-Juicer/Data-Juicer-T2V) and [HuggingFace](https://huggingface.co/datajuicer/Data-Juicer-T2V) platforms.
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-12] Our curated MLLM-Data list has evolved into a systematic [survey](https://arxiv.org/abs/2407.08583) from the perspective of model-data co-development. You are welcome to [browse](docs/awesome_llm_data.md) it or contribute!
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-06-01] ModelScope-Sora "Data Directors" creative sprint: the third Data-Juicer large-model data challenge has officially kicked off! Visit the [competition website](https://tianchi.aliyun.com/competition/entrance/532219) now for details.
<details>
<summary> History News:
</summary>

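- [2024-03-07] We have now released **Data-Juicer [v0.2.0](https://github.com/alibaba/data-juicer/releases/tag/v0.2.0)**! In this new version, we support more features for **multimodal data (including video)**. We have also launched **[DJ-SORA](docs/DJ_SORA_ZH.md)** to build open, large-scale, high-quality datasets for SORA-like large models!
- [2024-02-20] We are actively maintaining a *curated list* on LLM-Data; you are welcome to [visit](docs/awesome_llm_data.md) and contribute!
- [2024-02-05] Our paper has been accepted by the SIGMOD'24 industrial track!
@@ -45,6 +49,8 @@ Data-Juicer is being actively updated and maintained, and we will periodically enhance and add more
In addition, our paper has been updated to [v3](https://arxiv.org/abs/2309.02033).
- [2023-10-13] Our first data-centric LLM competition has begun!
Please visit the official competition websites, FT-Data Ranker ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for more information.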
</details>


<div id="table" align="center"></div>

@@ -101,16 +107,17 @@ Data-Juicer is being actively updated and maintained, and we will periodically enhance and add more
* [Operator Zoo](docs/Operators_ZH.md)
* [Configs](configs/README_ZH.md)
* [Developer Guide](docs/DeveloperGuide_ZH.md)
* [API references](https://modelscope.github.io/data-juicer/)
* [KDD'24 Tutorial](https://modelscope.github.io/data-juicer/_static/tutorial_kdd24.html)
* ["Bad" Data Exhibition](docs/BadDataExhibition_ZH.md)
* [Awesome LLM-Data](docs/awesome_llm_data.md)
* Dedicated Toolkits
* [Quality Classifier](tools/quality_classifier/README_ZH.md)
* [Auto Evaluation](tools/evaluator/README_ZH.md)
* [Preprocess](tools/preprocess/README_ZH.md)
* [Postprocess](tools/postprocess/README_ZH.md)
* [Third-parties (LLM Ecosystems)](thirdparty/README_ZH.md)
* [API references](https://modelscope.github.io/data-juicer/)
* [Awesome LLM-Data](docs/awesome_llm_data.md)
* [DJ-SORA](docs/DJ_SORA_ZH.md)
* [Third-parties (LLM Ecosystems)](thirdparty/README_ZH.md)


## Demos
@@ -426,7 +433,7 @@ Data-Juicer thanks and refers to the following community open-source projects:
- [ImgDiff: Contrastive Data Synthesis for Vision Large Language Models](https://arxiv.org/abs/2408.04594)
- [Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining](https://arxiv.org/abs/2402.11505)
- [Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining](https://arxiv.org/abs/2405.14908)
</details>