diff --git a/README.md b/README.md
index 3374e5ea0..841bc091e 100644
--- a/README.md
+++ b/README.md
@@ -33,12 +33,20 @@ This project is being actively updated and maintained, and we will periodically
 If you find Data-Juicer useful for your research or development, please kindly cite our [work](#references).
+You are welcome to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8253f30mgpjw), [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8253f30mgpjw&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11), or WeChat group (scan the QR code below with WeChat) for discussion.
+
+QR Code for WeChat group
+
 ----
 ## News
-- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2023-10-13] Our first data-centric LLM competition begins! Please
-  visit the competition's official websites, **FT-Data Ranker** ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for more information.
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-01-05] We have released **Data-Juicer v0.1.3**!
+In this new version, we support **more Python versions** (3.7-3.10) and **multimodal** dataset [converting](tools/multimodal/README.md)/[processing](docs/Operators.md) (including text, images, and audio; more modalities will be supported in the future).
+In addition, our paper has been updated to [v3](https://arxiv.org/abs/2309.02033).
+
+- [2023-10-13] Our first data-centric LLM competition begins! Please
+  visit the competition's official websites, FT-Data Ranker ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for more information.
 - [2023-10-8] We update our paper to the 2nd version and release the corresponding version 0.1.2 of Data-Juicer!
@@ -98,7 +106,7 @@ Table of Contents
 ## Prerequisites
-- Recommend Python==3.8
+- Recommend Python>=3.7,<=3.10
 - gcc >= 5 (at least C++14 support)
 ## Installation
@@ -330,7 +338,7 @@ We are in a rapidly developing field and greatly welcome contributions of new
 features, bug fixes and better documentations. Please refer to [How-to Guide for Developers](docs/DeveloperGuide.md).
-Welcome to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8253f30mgpjw), or [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8253f30mgpjw&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) for discussion.
+If you have any questions, please join our [discussion groups](README.md).
 ## Acknowledgement
 Data-Juicer is used across various LLM products and research initiatives,
diff --git a/README_ZH.md b/README_ZH.md
index b4d25681b..496f8e1a1 100644
--- a/README_ZH.md
+++ b/README_ZH.md
@@ -31,12 +31,20 @@ Data-Juicer 是一个一站式数据处理系统,旨在为大语言模型 (LLM
 如果Data-Juicer对您的研发有帮助,请引用我们的[工作](#参考文献) 。
+欢迎加入我们的[Slack频道](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp) ,[钉钉群](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8275bc8g7ypp&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) ,或微信群(扫描下方二维码加入)进行讨论。
+
+QR Code for WeChat group
+
 ----
 ## 新消息
-- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2023-10-13] 我们的第一届以数据为中心的 LLM 竞赛开始了!
-  请访问大赛官网,**FT-Data Ranker**([1B赛道](https://tianchi.aliyun.com/competition/entrance/532157) 、[7B赛道](https://tianchi.aliyun.com/competition/entrance/532158) ) ,了解更多信息。
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-01-05] 现在,我们发布了 **Data-Juicer v0.1.3** 版本!
+在这个新版本中,我们支持了**更多Python版本**(3.7-3.10),同时支持了**多模态**数据集的[转换](tools/multimodal/README_ZH.md)和[处理](docs/Operators_ZH.md)(包括文本、图像和音频。更多模态也将会在之后支持)。
+此外,我们的论文也更新到了[第三版](https://arxiv.org/abs/2309.02033) 。
+
+- [2023-10-13] 我们的第一届以数据为中心的 LLM 竞赛开始了!
+  请访问大赛官网,FT-Data Ranker([1B赛道](https://tianchi.aliyun.com/competition/entrance/532157) 、[7B赛道](https://tianchi.aliyun.com/competition/entrance/532158) ) ,了解更多信息。
 - [2023-10-8] 我们的论文更新至第二版,并发布了对应的Data-Juicer v0.1.2版本!
@@ -86,7 +94,7 @@ Data-Juicer 是一个一站式数据处理系统,旨在为大语言模型 (LLM
 ## 前置条件
-* 推荐 Python==3.8
+* 推荐 Python>=3.7,<=3.10
 * gcc >= 5 (at least C++14 support)
 ## 安装
@@ -309,7 +317,7 @@ Data-Juicer 在 Apache License 2.0 协议下发布。
 大模型是一个高速发展的领域,我们非常欢迎贡献新功能、修复漏洞以及文档改善。请参考[开发者指南](docs/DeveloperGuide_ZH.md)。
-欢迎加入我们的[Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp), 或[DingDing群](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8275bc8g7ypp&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) 。
+如果您有任何问题,欢迎加入我们的[讨论群](README_ZH.md) 。
 ## 致谢
diff --git a/data_juicer/__init__.py b/data_juicer/__init__.py
index 10939f01b..8ce9b3623 100644
--- a/data_juicer/__init__.py
+++ b/data_juicer/__init__.py
@@ -1 +1 @@
-__version__ = '0.1.2'
+__version__ = '0.1.3'
diff --git a/environments/minimal_requires.txt b/environments/minimal_requires.txt
index 1202407a8..79cbc429d 100644
--- a/environments/minimal_requires.txt
+++ b/environments/minimal_requires.txt
@@ -7,6 +7,7 @@ tabulate
 tqdm
 jsonargparse[signatures]
 matplotlib
+seaborn
 emoji==2.2.0
 regex
 requests
diff --git a/tools/multimodal/README.md b/tools/multimodal/README.md
index d4559f689..33f2ddcb4 100644
--- a/tools/multimodal/README.md
+++ b/tools/multimodal/README.md
@@ -5,8 +5,62 @@ This folder contains some scripts and tools for multimodal datasets before and a
 ## Dataset Format Conversion
 Due to large format diversity among different multimodal datasets and works,
-Data-Juicer propose a novel intermediate format for multimodal dataset and
-provided several dataset format conversion tools for some popular multimodal
+Data-Juicer proposes a novel intermediate text-based interleaved data format for multimodal datasets, which
+is based on chunk-wise formats such as the MMC4 dataset.
+
+In the Data-Juicer format, a multimodal sample or document is based on a text,
+which consists of several text chunks. Each chunk is a semantic unit, and all the
+multimodal information in a chunk should talk about the same thing and be aligned
+with each other.
+
+Here is an example of a multimodal sample in the Data-Juicer format.
+- It includes 4 chunks split by the special token `<|__dj__eoc|>`.
+- In addition to text, there are 3 other modalities: images, audios, and videos.
+They are stored on disk and their paths are
+listed in the corresponding first-level fields in the sample.
+- Other modalities are represented as special tokens in the text (e.g. image -- `<__dj__image>`).
+The special tokens of each modality correspond to the paths in the order of appearance.
+(e.g. the two image tokens in the third chunk are images of antarctica_map and europe_map respectively)
+- There could be multiple types of modalities and multiple modality special tokens in a single chunk,
+and they are semantically aligned with each other and with the text in this chunk.
+The position of special tokens can be arbitrary in a chunk. (In general, they are usually before or after the text.)
+- For multimodal samples, unlike text-only samples, the computed stats for other
+modalities could be a list of stats for the list of multimodal data (e.g. image_widths in this sample).
+
+```python
+{
+  "text": "<__dj__image> Antarctica is Earth's southernmost and least-populated continent. <|__dj__eoc|> "
+          "<__dj__video> <__dj__audio> Situated almost entirely south of the Antarctic Circle and surrounded by the "
+          "Southern Ocean (also known as the Antarctic Ocean), it contains the geographic South Pole. <|__dj__eoc|> "
+          "Antarctica is the fifth-largest continent, being about 40% larger than Europe, "
+          "and has an area of 14,200,000 km2 (5,500,000 sq mi). <__dj__image> <__dj__image> <|__dj__eoc|> "
+          "Most of Antarctica is covered by the Antarctic ice sheet, "
+          "with an average thickness of 1.9 km (1.2 mi). <|__dj__eoc|>",
+  "images": [
+    "path/to/the/image/of/antarctica_snowfield",
+    "path/to/the/image/of/antarctica_map",
+    "path/to/the/image/of/europe_map"
+  ],
+  "audios": [
+    "path/to/the/audio/of/sound_of_waves_in_Antarctic_Ocean"
+  ],
+  "videos": [
+    "path/to/the/video/of/remote_sensing_view_of_antarctica"
+  ],
+  "meta": {
+    "src": "customized",
+    "version": "0.1",
+    "author": "xxx"
+  },
+  "stats": {
+    "lang": "en",
+    "image_widths": [224, 336, 512],
+    ...
+  }
+}
+```
+
+According to this format, Data-Juicer provides several dataset format conversion tools for some popular multimodal
 works.
 These tools consist of two types:
@@ -15,11 +69,11 @@ These tools consist of two types:
 For now, dataset formats that are supported by Data-Juicer are listed in the following table.
-| Format | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref. |
-|------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------|
-| LLaVA-like | `llava_to_dj.py` | `dj_to_llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
-| MMC4-like | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [Format Description](https://github.com/allenai/mmc4#documents) |
-| WavCaps-like | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) |
+| Format | Type | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref. |
+|------------|------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------|
+| LLaVA-like | image-text | `llava_to_dj.py` | `dj_to_llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
+| MMC4-like | image-text | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [Format Description](https://github.com/allenai/mmc4#documents) |
+| WavCaps-like | audio-text | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) |
 For all tools, you can run the following command to find out the usage of them:
diff --git a/tools/multimodal/README_ZH.md b/tools/multimodal/README_ZH.md
index 55671e09b..996bdbb54 100644
--- a/tools/multimodal/README_ZH.md
+++ b/tools/multimodal/README_ZH.md
@@ -4,7 +4,57 @@
 ## 数据集格式转换
-由于不同多模态数据集和工作之间的数据集格式差异较大,Data-Juicer 提出了一种新颖的多模态数据集中间格式,并为一些流行的多模态工作提供了若干数据集格式转换工具。
+由于不同多模态数据集和工作之间的数据集格式差异较大, Data-Juicer 提出了一种新颖的、中间的、
+基于文本的、交替的多模态数据格式,主要基于一些按块(chunk)组织的格式,如MMC4数据集格式。
+
+在 Data-Juicer 的格式中,一个多模态样本或者文档基于一段文本组织,其由若干个文本块组成。
+每个文本块是一个语义单元,单个文本块中包括的所有多模态信息都应该在谈论同样的事情,并且它们彼此语义上是对齐的。
+
+下面这里是一个 Data-Juicer 格式的多模态样本示例。
+- 它包括4个文本块,它们由特殊token `<|__dj__eoc|>` 分割开。
+- 除了文本,这个样本还包括3种其他模态:图像(images),音频(audios),视频(videos)。
+它们保存在硬盘上,而它们的硬盘路径列举在了样本中对应的一级字段的列表里。
+- 在文本中,其他模态被表示为了特殊token(例如,图像 -- `<__dj__image>`)。
+每种模态的特殊token所表示的数据按照它们在文本中出现的顺序对应到列表中的路径上。
+(例如,第3个文本块中的2个图像token分别对应了图像路径列表中的antarctica_map图像和europe_map图像)
+- 在单个文本块中,可以有多种模态的数据以及多个模态特殊token,它们彼此是语义上对齐的,而且它们与该文本块中的文本也是语义对齐的。
+这些模态特殊token在文本块中可以处于任意位置(通常处于文本前或者文本后)
+- 不同于纯文本样本,对于多模态样本来说,为其他模态计算的stats可能为针对多模态数据列表的一个stats列表(如例子中的image_widths)。
+
+```python
+{
+  "text": "<__dj__image> Antarctica is Earth's southernmost and least-populated continent. <|__dj__eoc|> "
+          "<__dj__video> <__dj__audio> Situated almost entirely south of the Antarctic Circle and surrounded by the "
+          "Southern Ocean (also known as the Antarctic Ocean), it contains the geographic South Pole. <|__dj__eoc|> "
+          "Antarctica is the fifth-largest continent, being about 40% larger than Europe, "
+          "and has an area of 14,200,000 km2 (5,500,000 sq mi). <__dj__image> <__dj__image> <|__dj__eoc|> "
+          "Most of Antarctica is covered by the Antarctic ice sheet, "
+          "with an average thickness of 1.9 km (1.2 mi). <|__dj__eoc|>",
+  "images": [
+    "path/to/the/image/of/antarctica_snowfield",
+    "path/to/the/image/of/antarctica_map",
+    "path/to/the/image/of/europe_map"
+  ],
+  "audios": [
+    "path/to/the/audio/of/sound_of_waves_in_Antarctic_Ocean"
+  ],
+  "videos": [
+    "path/to/the/video/of/remote_sensing_view_of_antarctica"
+  ],
+  "meta": {
+    "src": "customized",
+    "version": "0.1",
+    "author": "xxx"
+  },
+  "stats": {
+    "lang": "en",
+    "image_widths": [224, 336, 512],
+    ...
+  }
+}
+```
+
+根据这个格式,Data-Juicer 为一些流行的多模态工作提供了若干数据集格式转换工具。
 这些工具分为两种类型:
 - 其他格式到 Data-Juicer 格式的转换:这些工具在 `source_format_to_data_juicer_format` 目录中。它们可以帮助将其他格式的数据集转换为 Data-Juicer 格式的目标数据集。
@@ -12,11 +62,11 @@
 目前,Data-Juicer 支持的数据集格式在下面表格中列出。
-| 格式 | source_format_to_data_juicer_format | data_juicer_format_to_target_format | 格式参考 |
-|----------|-------------------------------------|-------------------------------------|----------------------------------------------------------------------------------------------------|
-| 类LLaVA格式 | `llava_to_dj.py` | `dj_to_llava.py` | [格式描述](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
-| 类MMC4格式 | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [格式描述](https://github.com/allenai/mmc4#documents) |
-| 类WavCaps格式 | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [格式描述](https://github.com/XinhaoMei/WavCaps#table-of-contents) |
+| 格式 | 类型 | source_format_to_data_juicer_format | data_juicer_format_to_target_format | 格式参考 |
+|----------|-------|-------------------------------------|-------------------------------------|----------------------------------------------------------------------------------------------------|
+| 类LLaVA格式 | 图像-文本 | `llava_to_dj.py` | `dj_to_llava.py` | [格式描述](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
+| 类MMC4格式 | 图像-文本 | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [格式描述](https://github.com/allenai/mmc4#documents) |
+| 类WavCaps格式 | 音频-文本 | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [格式描述](https://github.com/XinhaoMei/WavCaps#table-of-contents) |
 对于所有工具,您可以运行以下命令来了解它们的详细用法: