Commit

Release/v0.1.3 (#171)
* * change simhash-py to simhash-pybind
+ update docs for new version

* * install pip for unit-test machine explicitly

* * install pip for unit-test machine explicitly

* * update wechat QR code

* * update dynamic QR code for WeChat group

* * update unittest
* add missing dependency

* * update news list

* * update version number

* * update release date

* * bold key content in README_ZH.md like the English version

* * minor changes on ZH docs

* * move infos about discussion groups to the front
HYLcool authored Jan 5, 2024
1 parent ad445c9 commit a3c8310
Showing 6 changed files with 143 additions and 22 deletions.
16 changes: 12 additions & 4 deletions README.md
@@ -33,12 +33,20 @@ This project is being actively updated and maintained, and we will periodically
If you find Data-Juicer useful for your research or development, please kindly
cite our [work](#references).

Welcome to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8253f30mgpjw), [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8253f30mgpjw&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11), or WeChat group (scan the QR code below with WeChat) for discussion.

<img src="https://img.alicdn.com/imgextra/i3/O1CN01QbwHJa1EV5uZwmU9c_!!6000000000356-2-tps-400-400.png" width = "100" height = "100" alt="QR Code for WeChat group" align=center />


----

## News
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2023-10-13] Our first data-centric LLM competition begins! Please
visit the competition's official websites, **FT-Data Ranker** ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for more information.
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-01-05] We have released **Data-Juicer v0.1.3**!
In this new version, we support **more Python versions** (3.7-3.10) and **multimodal** dataset [converting](tools/multimodal/README.md)/[processing](docs/Operators.md) (including text, images, and audio; more modalities will be supported in the future).
In addition, our paper has been updated to [v3](https://arxiv.org/abs/2309.02033).

- [2023-10-13] Our first data-centric LLM competition begins! Please
visit the competition's official websites, FT-Data Ranker ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for more information.

- [2023-10-8] We updated our paper to its 2nd version and released the corresponding version 0.1.2 of Data-Juicer!

@@ -98,7 +106,7 @@ Table of Contents

## Prerequisites

- Recommend Python==3.8
- Recommend Python>=3.7,<=3.10
- gcc >= 5 (at least C++14 support)
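
A minimal sketch of how the recommended Python range above could be checked at runtime. The helper name `python_is_supported` is hypothetical and not part of Data-Juicer; the bounds are copied from the `Python>=3.7,<=3.10` line, and treating every 3.10.x patch release as in range is an assumption.

```python
import sys

# Bounds copied from the "Python>=3.7,<=3.10" prerequisite line.
# Assumption: any 3.10.x patch release counts as within range.
MIN_VERSION = (3, 7)
MAX_VERSION = (3, 10)


def python_is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) pair lies in the recommended range."""
    return MIN_VERSION <= tuple(version_info[:2]) <= MAX_VERSION


if not python_is_supported():
    print("Warning: this Python version is outside the recommended 3.7-3.10 range.")
```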

## Installation
@@ -330,7 +338,7 @@ We are in a rapidly developing field and greatly welcome contributions of new
features, bug fixes and better documentation. Please refer to
[How-to Guide for Developers](docs/DeveloperGuide.md).

Welcome to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8253f30mgpjw), or [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8253f30mgpjw&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) for discussion.
If you have any questions, please join our [discussion groups](README.md).

## Acknowledgement
Data-Juicer is used across various LLM products and research initiatives,
16 changes: 12 additions & 4 deletions README_ZH.md
@@ -31,12 +31,20 @@ Data-Juicer is a one-stop data processing system designed for large language models (LLM

If Data-Juicer is helpful for your research or development, please cite our [work](#参考文献).

You are welcome to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp), [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8275bc8g7ypp&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11), or WeChat group (scan the QR code below to join) for discussion.

<img src="https://img.alicdn.com/imgextra/i3/O1CN01QbwHJa1EV5uZwmU9c_!!6000000000356-2-tps-400-400.png" width = "100" height = "100" alt="QR Code for WeChat group" align=center />


----

## News
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2023-10-13] Our first data-centric LLM competition begins!
Please visit the competition's official websites, **FT-Data Ranker** ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for more information.
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-01-05] We have now released **Data-Juicer v0.1.3**!
In this new version, we support **more Python versions** (3.7-3.10) as well as [converting](tools/multimodal/README_ZH.md) and [processing](docs/Operators_ZH.md) **multimodal** datasets (including text, images, and audio; more modalities will be supported in the future).
In addition, our paper has been updated to [v3](https://arxiv.org/abs/2309.02033).

- [2023-10-13] Our first data-centric LLM competition begins!
Please visit the competition's official websites, FT-Data Ranker ([1B Track](https://tianchi.aliyun.com/competition/entrance/532157), [7B Track](https://tianchi.aliyun.com/competition/entrance/532158)), for more information.

- [2023-10-8] Our paper has been updated to its 2nd version, and the corresponding Data-Juicer v0.1.2 has been released!

@@ -86,7 +94,7 @@ Data-Juicer is a one-stop data processing system designed for large language models (LLM

## Prerequisites

* Recommend Python==3.8
* Recommend Python>=3.7,<=3.10
* gcc >= 5 (at least C++14 support)

## Installation
@@ -309,7 +317,7 @@ Data-Juicer is released under the Apache License 2.0.

Large models are a rapidly developing field, and we greatly welcome contributions of new features, bug fixes, and documentation improvements. Please refer to the [Developer Guide](docs/DeveloperGuide_ZH.md).

You are welcome to join our [Slack channel](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp), or [DingDing group](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8275bc8g7ypp&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11).
If you have any questions, you are welcome to join our [discussion groups](README_ZH.md).

## Acknowledgement

2 changes: 1 addition & 1 deletion data_juicer/__init__.py
@@ -1 +1 @@
__version__ = '0.1.2'
__version__ = '0.1.3'
1 change: 1 addition & 0 deletions environments/minimal_requires.txt
@@ -7,6 +7,7 @@ tabulate
tqdm
jsonargparse[signatures]
matplotlib
seaborn
emoji==2.2.0
regex
requests
68 changes: 61 additions & 7 deletions tools/multimodal/README.md
@@ -5,8 +5,62 @@ This folder contains some scripts and tools for multimodal datasets before and a
## Dataset Format Conversion

Due to large format diversity among different multimodal datasets and works,
Data-Juicer proposes a novel intermediate format for multimodal datasets and
provides several dataset format conversion tools for some popular multimodal
Data-Juicer proposes a novel intermediate, text-based, interleaved data format for multimodal datasets, which
is based on chunk-wise formats such as the MMC4 dataset.

In the Data-Juicer format, a multimodal sample or document is based on a text,
which consists of several text chunks. Each chunk is a semantic unit, and all the
multimodal information in a chunk should talk about the same thing and be aligned
with each other.

Below is an example of a multimodal sample in Data-Juicer format.
- It includes 4 chunks separated by the special token `<|__dj__eoc|>`.
- In addition to text, there are 3 other modalities: images, audios, and videos.
They are stored on disk, and their paths are
listed in the corresponding first-level fields of the sample.
- Other modalities are represented as special tokens in the text (e.g. image -- `<__dj__image>`).
The special tokens of each modality correspond to the paths in order of appearance
(e.g. the two image tokens in the third chunk refer to the images of antarctica_map and europe_map, respectively).
- A single chunk can contain multiple modalities and multiple modality special tokens,
and they are semantically aligned with each other and with the text in that chunk.
Special tokens can appear anywhere in a chunk (in general, they are placed before or after the text).
- For multimodal samples, unlike text-only samples, the computed stats for other
modalities can be a list with one entry per item in the list of multimodal data (e.g. image_widths in this sample).

```python
{
  "text": "<__dj__image> Antarctica is Earth's southernmost and least-populated continent. <|__dj__eoc|> "
          "<__dj__video> <__dj__audio> Situated almost entirely south of the Antarctic Circle and surrounded by the "
          "Southern Ocean (also known as the Antarctic Ocean), it contains the geographic South Pole. <|__dj__eoc|> "
          "Antarctica is the fifth-largest continent, being about 40% larger than Europe, "
          "and has an area of 14,200,000 km2 (5,500,000 sq mi). <__dj__image> <__dj__image> <|__dj__eoc|> "
          "Most of Antarctica is covered by the Antarctic ice sheet, "
          "with an average thickness of 1.9 km (1.2 mi). <|__dj__eoc|>",
  "images": [
    "path/to/the/image/of/antarctica_snowfield",
    "path/to/the/image/of/antarctica_map",
    "path/to/the/image/of/europe_map"
  ],
  "audios": [
    "path/to/the/audio/of/sound_of_waves_in_Antarctic_Ocean"
  ],
  "videos": [
    "path/to/the/video/of/remote_sensing_view_of_antarctica"
  ],
  "meta": {
    "src": "customized",
    "version": "0.1",
    "author": "xxx"
  },
  "stats": {
    "lang": "en",
    "image_widths": [224, 336, 512],
    # ...
  }
}
```
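
The token-to-path correspondence described above can be sketched as follows. This is an illustrative reading of the format, not Data-Juicer's own API: the function name `align_chunks`, the shortened sample, and its file names are hypothetical, and the strict check that every listed path is referenced by exactly one token is an assumption drawn from the "order of appearance" rule.

```python
import re

EOC = "<|__dj__eoc|>"
# Token-to-field mapping inferred from the sample above (an assumption,
# not an official constant table).
TOKEN_TO_FIELD = {
    "<__dj__image>": "images",
    "<__dj__audio>": "audios",
    "<__dj__video>": "videos",
}
TOKEN_PATTERN = re.compile("|".join(re.escape(t) for t in TOKEN_TO_FIELD))


def align_chunks(sample):
    """Return one dict per chunk: its plain text and the media paths it references."""
    # One cursor per field: the n-th token of a modality maps to the n-th path.
    cursors = {field: 0 for field in TOKEN_TO_FIELD.values()}
    chunks = []
    for raw in sample["text"].split(EOC):
        if not raw.strip():
            continue  # skip the empty piece after the final end-of-chunk token
        media = {field: [] for field in TOKEN_TO_FIELD.values()}
        for token in TOKEN_PATTERN.findall(raw):
            field = TOKEN_TO_FIELD[token]
            paths = sample.get(field, [])
            if cursors[field] >= len(paths):
                raise ValueError(f"more {token} tokens than paths in '{field}'")
            media[field].append(paths[cursors[field]])
            cursors[field] += 1
        # Remove the special tokens and normalize whitespace.
        text = " ".join(TOKEN_PATTERN.sub(" ", raw).split())
        chunks.append({"text": text, **media})
    # Assumed strictness: every listed path must be consumed by some token.
    for field, used in cursors.items():
        if used != len(sample.get(field, [])):
            raise ValueError(f"'{field}' lists paths that no token refers to")
    return chunks


# Hypothetical, shortened sample for illustration only.
sample = {
    "text": "<__dj__image> A snowfield. <|__dj__eoc|> "
            "Two maps. <__dj__image> <__dj__image> <|__dj__eoc|>",
    "images": ["snowfield.jpg", "antarctica_map.jpg", "europe_map.jpg"],
}
chunks = align_chunks(sample)
print(chunks[1]["images"])
```

Keeping a separate cursor per modality mirrors the rule that tokens of each modality are matched to paths independently, so image, audio, and video tokens can be interleaved freely within a chunk.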

Based on this format, Data-Juicer provides several dataset format conversion tools for some popular multimodal
works.

These tools consist of two types:
@@ -15,11 +69,11 @@ These tools consist of two types:

For now, dataset formats that are supported by Data-Juicer are listed in the following table.

| Format | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref. |
|------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------|
| LLaVA-like | `llava_to_dj.py` | `dj_to_llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
| MMC4-like | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [Format Description](https://github.com/allenai/mmc4#documents) |
| WavCaps-like | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) |
| Format | Type | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref. |
|------------|------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------|
| LLaVA-like | image-text | `llava_to_dj.py` | `dj_to_llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
| MMC4-like | image-text | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [Format Description](https://github.com/allenai/mmc4#documents) |
| WavCaps-like | audio-text | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) |

For all tools, you can run the following command to find out the usage of them:

62 changes: 56 additions & 6 deletions tools/multimodal/README_ZH.md
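@@ -4,19 +4,69 @@

## Dataset Format Conversion

Due to the large differences in dataset formats among different multimodal datasets and works, Data-Juicer proposes a novel intermediate format for multimodal datasets and provides several dataset format conversion tools for some popular multimodal works.
Due to the large differences in dataset formats among different multimodal datasets and works, Data-Juicer proposes a novel intermediate,
text-based, interleaved multimodal data format, which is mainly based on chunk-wise formats such as the MMC4 dataset format.

In the Data-Juicer format, a multimodal sample or document is organized around a text, which consists of several text chunks.
Each text chunk is a semantic unit; all the multimodal information in a single chunk should talk about the same thing and be semantically aligned with each other.

Below is an example of a multimodal sample in Data-Juicer format.
- It includes 4 text chunks, separated by the special token `<|__dj__eoc|>`.
- In addition to text, this sample includes 3 other modalities: images, audios, and videos.
They are stored on disk, and their paths are listed in the corresponding first-level fields of the sample.
- In the text, other modalities are represented as special tokens (e.g. image -- `<__dj__image>`).
The special tokens of each modality map to the paths in the corresponding list in their order of appearance in the text
(e.g. the 2 image tokens in the 3rd chunk correspond to the antarctica_map and europe_map images in the image path list, respectively).
- A single chunk can contain multiple modalities and multiple modality special tokens; they are semantically aligned with each other and with the text in the chunk.
These special tokens can appear anywhere in a chunk (usually before or after the text).
- Unlike text-only samples, for multimodal samples the stats computed for other modalities may be a list of stats, one per item in the list of multimodal data (e.g. image_widths in this example).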

```python
{
  "text": "<__dj__image> Antarctica is Earth's southernmost and least-populated continent. <|__dj__eoc|> "
          "<__dj__video> <__dj__audio> Situated almost entirely south of the Antarctic Circle and surrounded by the "
          "Southern Ocean (also known as the Antarctic Ocean), it contains the geographic South Pole. <|__dj__eoc|> "
          "Antarctica is the fifth-largest continent, being about 40% larger than Europe, "
          "and has an area of 14,200,000 km2 (5,500,000 sq mi). <__dj__image> <__dj__image> <|__dj__eoc|> "
          "Most of Antarctica is covered by the Antarctic ice sheet, "
          "with an average thickness of 1.9 km (1.2 mi). <|__dj__eoc|>",
  "images": [
    "path/to/the/image/of/antarctica_snowfield",
    "path/to/the/image/of/antarctica_map",
    "path/to/the/image/of/europe_map"
  ],
  "audios": [
    "path/to/the/audio/of/sound_of_waves_in_Antarctic_Ocean"
  ],
  "videos": [
    "path/to/the/video/of/remote_sensing_view_of_antarctica"
  ],
  "meta": {
    "src": "customized",
    "version": "0.1",
    "author": "xxx"
  },
  "stats": {
    "lang": "en",
    "image_widths": [224, 336, 512],
    # ...
  }
}
```

Based on this format, Data-Juicer provides several dataset format conversion tools for some popular multimodal works.

These tools consist of two types:
- Other format to Data-Juicer format: These tools are in the `source_format_to_data_juicer_format` directory. They help convert datasets in other formats to target datasets in Data-Juicer format.
- Data-Juicer format to other format: These tools are in the `data_juicer_format_to_target_format` directory. They help convert datasets in Data-Juicer format to datasets in the target format.

For now, the dataset formats supported by Data-Juicer are listed in the following table.

| Format | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref. |
|--------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------|
| LLaVA-like | `llava_to_dj.py` | `dj_to_llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
| MMC4-like | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [Format Description](https://github.com/allenai/mmc4#documents) |
| WavCaps-like | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) |
| Format | Type | source_format_to_data_juicer_format | data_juicer_format_to_target_format | Ref. |
|--------------|------------|-------------------------------------|-------------------------------------|------------------------------------------------------------------------------------------------------------------|
| LLaVA-like | image-text | `llava_to_dj.py` | `dj_to_llava.py` | [Format Description](https://github.com/haotian-liu/LLaVA/blob/main/docs/Finetune_Custom_Data.md#dataset-format) |
| MMC4-like | image-text | `mmc4_to_dj.py` | `dj_to_mmc4.py` | [Format Description](https://github.com/allenai/mmc4#documents) |
| WavCaps-like | audio-text | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` | [Format Description](https://github.com/XinhaoMei/WavCaps#table-of-contents) |

For all tools, you can run the following command to find out their detailed usage:
