diff --git a/README.md b/README.md index 96c7181e1..4b4fd8017 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,5 @@ -[[中文主页]](README_ZH.md) | [[Docs]](#documents) | [[API]](https://modelscope.github.io/data-juicer) | [[*DJ-SORA*]](docs/DJ_SORA.md) +[[中文主页]](README_ZH.md) | [[Docs]](#documents) | [[API]](https://modelscope.github.io/data-juicer) | [[DJ-SORA]](docs/DJ_SORA.md) | [[Awesome List]](docs/awesome_llm_data.md) + # Data-Juicer: A One-Stop Data Processing System for Large Language Models @@ -27,33 +28,28 @@ Data-Juicer is a one-stop **multimodal** data processing system to make data higher-quality, juicier, and more digestible for LLMs. -Data-Juicer (including [DJ-SORA](docs/DJ_SORA.md)) is being actively updated and maintained. We will periodically enhance and add more features, data recipes and datasets. -We welcome you to join us in promoting LLM data development and research! -We provide a [Playground](http://8.130.100.170/) with a managed JupyterLab. [Try Data-Juicer](http://8.130.100.170/) straight away in your browser! +We provide a [playground](http://8.130.100.170/) with a managed JupyterLab. [Try Data-Juicer](http://8.130.100.170/) straight away in your browser! If you find Data-Juicer useful for your research or development, please kindly cite our [work](#references). -If you find Data-Juicer useful for your research or development, please kindly cite our [work](#references). -Welcome any issues/PRs and to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8253f30mgpjw) -or [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8253f30mgpjw&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) for discussion! +Data-Juicer is being actively updated and maintained. We will periodically enhance and add more features, data recipes and datasets. 
+We welcome you to join us (via issues, PRs, [Slack](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8253f30mgpjw) channel, [DingDing](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8253f30mgpjw&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) group, ...), in promoting data-model co-development along with research and applications of (multimodal) LLMs! ---- ## News +- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-12] Our *awesome list of MLLM-Data* has evolved into a systemic [survey](https://arxiv.org/abs/2407.08583) from model-data co-development perspective. Welcome to [explore](docs/awesome_llm_data.md) and contribute! - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-06-01] ModelScope-Sora "Data Directors" creative sprint—Our third data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532219) for more information. -- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-03-07] We release **Data-Juicer [v0.2.0](https://github.com/alibaba/data-juicer/releases/tag/v0.2.0)** now! +- [2024-03-07] We release **Data-Juicer [v0.2.0](https://github.com/alibaba/data-juicer/releases/tag/v0.2.0)** now! In this new version, we support more features for **multimodal data (including video now)**, and introduce **[DJ-SORA](docs/DJ_SORA.md)** to provide open large-scale, high-quality datasets for SORA-like models. -- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-02-20] We have actively maintained an *awesome list of LLM-Data*, welcome to [visit](docs/awesome_llm_data.md) and contribute! 
-- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-02-05] Our paper has been accepted by SIGMOD'24 industrial track! +- [2024-02-20] We have actively maintained an *awesome list of LLM-Data*, welcome to [visit](docs/awesome_llm_data.md) and contribute! +- [2024-02-05] Our paper has been accepted by SIGMOD'24 industrial track! - [2024-01-10] Discover new horizons in "Data Mixture"—Our second data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532174) for more information. - [2024-01-05] We release **Data-Juicer v0.1.3** now! In this new version, we support **more Python versions** (3.8-3.10), and support **multimodal** dataset [converting](tools/multimodal/README.md)/[processing](docs/Operators.md) (Including texts, images, and audios. More modalities will be supported in the future). Besides, our paper is also updated to [v3](https://arxiv.org/abs/2309.02033). - - [2023-10-13] Our first data-centric LLM competition begins! Please visit the competition's official websites, FT-Data Ranker ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for more information. -- [2023-10-8] We update our paper to the 2nd version and release the corresponding version 0.1.2 of Data-Juicer! 
- Table of Contents ================= diff --git a/README_ZH.md b/README_ZH.md index 8abb1fa22..bd9ad5183 100644 --- a/README_ZH.md +++ b/README_ZH.md @@ -1,6 +1,6 @@ -[[English Page]](README.md) | [[文档]](#documents) | [[API]](https://modelscope.github.io/data-juicer) | [[*DJ-SORA*]](docs/DJ_SORA_ZH.md) +[[English Page]](README.md) | [[文档索引]](#documents) | [[API]](https://modelscope.github.io/data-juicer) | [[DJ-SORA]](docs/DJ_SORA_ZH.md) | [[Awesome List]](docs/awesome_llm_data.md) -# Data-Juicer: 为大语言模型提供更高质量、更丰富、更易“消化”的数据 +# Data-Juicer: 为大模型提供更高质量、更丰富、更易“消化”的数据 Data-Juicer @@ -22,32 +22,29 @@ Data-Juicer 是一个一站式**多模态**数据处理系统,旨在为大语言模型 (LLM) 提供更高质量、更丰富、更易“消化”的数据。 -Data-Juicer(包含[DJ-SORA](docs/DJ_SORA_ZH.md))正在积极更新和维护中,我们将定期强化和新增更多的功能和数据菜谱。热烈欢迎您加入我们,一起推进LLM数据的开发和研究! -我们提供了一个基于 JupyterLab 的 [Playground](http://8.130.100.170/),您可以从浏览器中在线试用 Data-Juicer。 +我们提供了一个基于 JupyterLab 的 [Playground](http://8.130.100.170/),您可以从浏览器中在线试用 Data-Juicer。 如果Data-Juicer对您的研发有帮助,请引用我们的[工作](#参考文献) 。 -如果Data-Juicer对您的研发有帮助,请引用我们的[工作](#参考文献) 。 +Data-Juicer正在积极更新和维护中,我们将定期强化和新增更多的功能和数据菜谱。热烈欢迎您加入我们(issues/PRs/[Slack频道](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp) /[钉钉群](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8275bc8g7ypp&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11)/...),一起推进LLM-数据的协同开发和研究! -欢迎提issues/PRs,以及加入我们的[Slack频道](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp) 或[钉钉群](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8275bc8g7ypp&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) 进行讨论! 
---- ## 新消息 +- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png)[2024-07-12] 我们的MLLM-Data精选列表已经演化为一个模型-数据协同开发的角度系统性[综述](https://arxiv.org/abs/2407.08583)。欢迎[浏览](docs/awesome_llm_data.md)或参与贡献! - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-06-01] ModelScope-Sora“数据导演”创意竞速——第三届Data-Juicer大模型数据挑战赛已经正式启动!立即访问[竞赛官网](https://tianchi.aliyun.com/competition/entrance/532219),了解赛事详情。 -- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-03-07] 我们现在发布了 **Data-Juicer [v0.2.0](https://github.com/alibaba/data-juicer/releases/tag/v0.2.0)**! 在这个新版本中,我们支持了更多的 **多模态数据(包括视频)** 相关特性。我们还启动了 **[DJ-SORA](docs/DJ_SORA_ZH.md)** ,为SORA-like大模型构建开放的大规模高质量数据集! -- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-02-20] 我们在积极维护一份关于LLM-Data的*精选列表*,欢迎[访问](docs/awesome_llm_data.md)并参与贡献! -- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-02-05] 我们的论文被SIGMOD'24 industrial track接收! +- [2024-03-07] 我们现在发布了 **Data-Juicer [v0.2.0](https://github.com/alibaba/data-juicer/releases/tag/v0.2.0)**! 在这个新版本中,我们支持了更多的 **多模态数据(包括视频)** 相关特性。我们还启动了 **[DJ-SORA](docs/DJ_SORA_ZH.md)** ,为SORA-like大模型构建开放的大规模高质量数据集! +- [2024-02-20] 我们在积极维护一份关于LLM-Data的*精选列表*,欢迎[访问](docs/awesome_llm_data.md)并参与贡献! +- [2024-02-05] 我们的论文被SIGMOD'24 industrial track接收! - [2024-01-10] 开启“数据混合”新视界——第二届Data-Juicer大模型数据挑战赛已经正式启动!立即访问[竞赛官网](https://tianchi.aliyun.com/competition/entrance/532174),了解赛事详情。 --[2024-01-05] 现在,我们发布了 **Data-Juicer v0.1.3** 版本! 
-在这个新版本中,我们支持了**更多Python版本**(3.8-3.10),同时支持了**多模态**数据集的[转换](tools/multimodal/README_ZH.md)和[处理](docs/Operators_ZH.md)(包括文本、图像和音频。更多模态也将会在之后支持)。 +- [2024-01-05] **Data-Juicer v0.1.3** 版本发布了。 +在这个新版本中,我们支持了**更多Python版本**(3.8-3.10),同时支持了**多模态**数据集的[转换](tools/multimodal/README_ZH.md)和[处理](docs/Operators_ZH.md)(包括文本、图像和音频。更多模态也将会在之后支持)! 此外,我们的论文也更新到了[第三版](https://arxiv.org/abs/2309.02033) 。 - [2023-10-13] 我们的第一届以数据为中心的 LLM 竞赛开始了! 请访问大赛官网,FT-Data Ranker([1B赛道](https://tianchi.aliyun.com/competition/entrance/532157) 、[7B赛道](https://tianchi.aliyun.com/competition/entrance/532158) ) ,了解更多信息。 -- [2023-10-8] 我们的论文更新至第二版,并发布了对应的Data-Juicer v0.1.2版本! - 目录 === - [Data-Juicer: 为大语言模型提供更高质量、更丰富、更易“消化”的数据](#data-juicer-为大语言模型提供更高质量更丰富更易消化的数据) @@ -396,4 +393,4 @@ author={Daoyuan Chen and Yilun Huang and Zhijian Ma and Hesen Chen and Xuchen Pa booktitle={International Conference on Management of Data}, year={2024} } -``` +``` \ No newline at end of file diff --git a/docs/sphinx_doc/source/_static/tutorial_kdd24.html b/docs/sphinx_doc/source/_static/tutorial_kdd24.html index 25270d5c9..3d5a6755d 100644 --- a/docs/sphinx_doc/source/_static/tutorial_kdd24.html +++ b/docs/sphinx_doc/source/_static/tutorial_kdd24.html @@ -1,107 +1,197 @@ - - + + + - - - - Multi-modal Data Processing for Foundation Models: Practical Guidances and Use Cases - - - + + + + Multi-modal Data Processing for Foundation Models: Practical + Guidances and Use Cases + + + -
-
+
+ +
+
-
-
-

KDD 2024 Hands-on Tutorial

-

Multi-modal Data Processing for Foundation Models: Practical Guidances and Use Cases

-

Date & Time: X:XX pm - Y:YY pm, August XX, 2024

-

Location: To be updated

+
+
+

KDD 2024 Hands-on Tutorial

+

Multi-modal Data Processing + for Foundation Models: Practical Guidances and Use + Cases

+

Date & + Time: 9:00 AM - 12:00 PM, August 25, 2024 +

+

Location: Centre de + Convencions Internacional de Barcelona

+
-
-
In the era of foundation models, the ability to process multi-modal data efficiently and effectively has become paramount. -In this tutorial, participants will dive into the essential techniques for processing multi-modal data. We will explore how large-scale high-quality data enhances model performance and introduce the open-sourced Data-Juicer system, designed to tackle the complexities of data variety, quality and scale. -Attendees will gain practical experience with Data-Juicer's operators, mastering data formatting, mapping, filtering, deduplication and selection. -A significant portion of the tutorial is dedicated to the Data-Juicer Sandbox Lab and typical use cases for static and dynamic data, including text, image, audio, and video. The lab is a playground integrated with unified models and evaluators, and facilitates experiments with data recipes that represent methodical sequences of operators and streamline the creation of scalable data processing pipelines. This experience is designed to not only solidify the concepts discussed but also to provide a space for innovation and exploration, highlighting how data recipes can be optimized and deployed in high-performance distributed environments. -

By the end of this tutorial, attendees will be equipped with the practical knowledge and skills to navigate the complexities of multi-modal data processing. They will leave with actionable knowledge with an industrial open-source system and an enriched perspective on the importance of high-quality data in AI, poised to implement sustainable and scalable solutions in their projects. -
-
-
-
-

-

-

Tutorial Slides
slides.pdf

-
-
-
+
+ In the era of foundation models, efficiently processing multi-modal data + is crucial. + This tutorial covers key techniques for multi-modal data processing and + introduces the open-source Data-Juicer system, designed to tackle the + complexities of data variety, quality, and scale. + Participants will learn how to use Data-Juicer's operators and tools + for formatting, mapping, filtering, deduplicating, and selecting + multi-modal data efficiently and effectively. + They will also become familiar with the Data-Juicer Sandbox Lab, where + users can easily experiment with diverse data recipes that represent + methodical sequences of operators and streamline the creation of + scalable data processing pipelines. + This experience solidifies the concepts discussed and provides + a space for innovation and exploration, highlighting how data recipes + can be optimized and deployed in high-performance distributed + environments. +

By the end of this tutorial, attendees will be equipped with the + practical knowledge and skills to navigate multi-modal data + processing for foundation models. They will leave with actionable + knowledge of an industrial open-source system and an enriched + perspective on the importance of high-quality data in AI, poised to + implement sustainable and scalable solutions in their projects. +

The system and related materials are available at + https://github.com/modelscope/data-juicer. +

-
-
-
+
+
+
+
+
+

Schedule

+
+
-
-

Schedule

-
+
+
Date: August 25, 2024
+
Location: Room XXX, Centre de Convencions Internacional de Barcelona
+

+
(20 min) | Introduction and Overview: + Multi-modal Data Processing and the + Data-Juicer System
+
(20 min) | Building Blocks of Data + Processing: Data-Juicer’s Operators
+
(20 min) | Composing Atomic Capabilities: + Data-Juicer’s Data Recipes
+
(30 min) | Exploring Data Recipes: The + Data-Juicer Sandbox Lab
+
(30 min) | From Exploration to + Production: High-Performance Data Factory
+
(50 min) | Use Cases: From Text to Video + Data Processing
+
(10 min) | Conclusion and Resources
+
-
-
Date: August XX, 2024
-
Location: To be updated.
-
(xx min) | Introduction and Overview: Multi-modal Data Processing and the -Data-Juicer System
-
(xx min) | Building Blocks of Data Processing: Data-Juicer’s Operators
-
(xx min) | Composing Atomic Capabilities: Data-Juicer’s Data Recipes
-
(xx min) | Exploring Data Recipes: The Data-Juicer Sandbox Lab
-
(xx min) | From Exploration to Production: High-Performance Data Factory
-
(xx min) | Static Data Use Cases: Text and Image Data Processing
-
(xx min) | Dynamic Data Use Cases: Video and Audio Data Processing
-
(xx min) | Conclusion and Resources
-

-
-
-
-
+ +
+ +
-
-

Organizers

-
We are the Data-Juicer team from Alibaba Tongyi
- Data-Juicer + +
+
+
+
We are the Data-Juicer team from Alibaba + Tongyi
+ Data-Juicer +
+
-
+ -
+
+
+ +
+ 140x140 +

Yaliang + Li

+
+
+ 140x140 +

Bolin + Ding

+
+
+
+ -
-
- + + + + + \ No newline at end of file