Add availability checking for OPs to allow incomplete dependency installation #82

Merged (3 commits, Nov 21, 2023)
34 changes: 20 additions & 14 deletions README.md
@@ -105,40 +105,46 @@ Table of Contents

### From Source

-- Run the following commands to install the latest `data_juicer` version in
+- Run the following commands to install the latest basic `data_juicer` version in
editable mode:
```shell
cd <path_to_data_juicer>
-pip install -v -e .[all]
+pip install -v -e .
```

-- Or install optional dependencies:
+- Some OPs rely on third-party libraries that are too large or poorly compatible across platforms. You can install optional dependencies as needed:

```shell
cd <path_to_data_juicer>
-pip install -v -e . # install a minimal dependencies
+pip install -v -e . # install minimal dependencies, which support the basic functions
pip install -v -e .[tools] # install a subset of tools dependencies
```

The dependency options are listed below:

-| Tag      | Description                                                              |
-|----------|------------------------------------------------------------------------|
-| .        | Install minimal dependencies for basic Data-Juicer.                     |
-| .[all]   | Install all optional dependencies (all of the following)                |
-| .[dev]   | Install dependencies for developing the package as contributors         |
-| .[tools] | Install dependencies for dedicated tools, such as quality classifiers.  |
+| Tag              | Description                                                                                    |
+|------------------|------------------------------------------------------------------------------------------------|
+| `.` or `.[mini]` | Install minimal dependencies for basic Data-Juicer.                                            |
+| `.[all]`         | Install all optional dependencies (including minimal dependencies and all of the following).   |
+| `.[sci]`         | Install all dependencies for all OPs.                                                          |
+| `.[dist]`        | Install dependencies for distributed data processing. (Experimental)                           |
+| `.[dev]`         | Install dependencies for developing the package as contributors.                               |
+| `.[tools]`       | Install dependencies for dedicated tools, such as quality classifiers.                         |

### Using pip

-- Run the following command to install the latest `data_juicer` using `pip`:
+- Run the following command to install the latest released `data_juicer` using `pip`:

```shell
pip install py-data-juicer
```

-- **Note**: only the basic APIs in `data_juicer` and two basic tools
-  (data [processing](#data-processing) and [analysis](#data-analysis)) are available in this way. If you want customizable
-  and complete functions, we recommend you install `data_juicer` [from source](#from-source).
+- **Note**:
+  - Only the basic APIs in `data_juicer` and two basic tools
+    (data [processing](#data-processing) and [analysis](#data-analysis)) are available in this way. If you want customizable
+    and complete functions, we recommend you install `data_juicer` [from source](#from-source).
+  - The released versions on PyPI lag behind the latest source version.
+    If you want to follow the latest features of `data_juicer`, we recommend you install [from source](#from-source).

### Using Docker

30 changes: 17 additions & 13 deletions README_ZH.md
@@ -93,40 +93,44 @@ Data-Juicer is a one-stop data processing system, designed for large language models (LLM

### Install from Source

-* Run the following commands to install the latest version of `data_juicer` in editable mode:
+* Run the following commands to install the latest basic version of `data_juicer` in editable mode:

```shell
cd <path_to_data_juicer>
-pip install -v -e .[all]
+pip install -v -e .
```

-* Or install optional dependencies:
+* Some OP features depend on third-party libraries that are large or poorly compatible across platforms, so users can install optional dependencies as needed:

```shell
cd <path_to_data_juicer>
-pip install -v -e . # install minimal dependencies
+pip install -v -e . # install minimal dependencies, supporting basic functions
pip install -v -e .[tools] # 安装部分工具库的依赖
```

The dependency options are listed in the table below:

-| Tag      | Description                                                           |
-|----------|-----------------------------------------------------------------------|
-| .        | Install minimal dependencies supporting basic Data-Juicer functions   |
-| .[all]   | Install all optional dependencies (i.e. all of the following)         |
-| .[dev]   | Install dependencies needed to develop Data-Juicer as a contributor   |
-| .[tools] | Install dependencies for dedicated tools such as quality classifiers  |
+| Tag              | Description                                                                           |
+|------------------|---------------------------------------------------------------------------------------|
+| `.` or `.[mini]` | Install minimal dependencies supporting basic Data-Juicer functions                   |
+| `.[all]`         | Install all optional dependencies (including the minimal set and everything below)    |
+| `.[sci]`         | Install the full dependencies for all OPs                                             |
+| `.[dist]`        | Install dependencies for distributed data processing (experimental)                   |
+| `.[dev]`         | Install dependencies needed to develop Data-Juicer as a contributor                   |
+| `.[tools]`       | Install dependencies for dedicated tools such as quality classifiers                  |

### Install with pip

-* Run the following command to install the latest version of `data_juicer` with `pip`:
+* Run the following command to install the latest released version of `data_juicer` with `pip`:

```shell
pip install py-data-juicer
```

-* **Note**: with this installation method, only the basic APIs in `data_juicer` and two basic tools
-  (data [processing](数据处理) and [analysis](数据分析)) are available. For more customizable and complete functions, we recommend installing [from source](#从源码安装).
+* **Note**:
+  * With this installation method, only the basic APIs in `data_juicer` and two basic tools
+    (data [processing](数据处理) and [analysis](数据分析)) are available. For more customizable and complete functions, we recommend installing [from source](#从源码安装).
+  * Released versions on PyPI lag behind the latest source version, so to keep up with the latest features of `data_juicer`, we recommend installing [from source](#从源码安装).

### Install with Docker

7 changes: 5 additions & 2 deletions data_juicer/core/ray_executor.py
@@ -1,11 +1,14 @@
-import ray
-import ray.data as rd
from loguru import logger

from data_juicer.config import init_configs
from data_juicer.ops import Filter, Mapper, load_ops
+from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields

+with AvailabilityChecking(['ray'], requires_type='dist'):
+    import ray
+    import ray.data as rd


class RayExecutor:
"""
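
The `with AvailabilityChecking([...])` pattern above defers optional imports so that an incomplete installation still imports cleanly elsewhere. The PR does not show `data_juicer/utils/availability_utils.py` itself, so the following is only a plausible sketch of such a context manager; the constructor signature, attribute names, and error message are assumptions for illustration:

```python
class AvailabilityChecking:
    """Illustrative sketch (not the PR's actual implementation): wrap
    optional imports and, on ImportError, re-raise with a hint naming
    the missing packages or the pip extra that provides them."""

    def __init__(self, requires, op_name=None, requires_type=None):
        self.requires = requires            # required pip package names
        self.op_name = op_name              # OP that needs them, if any
        self.requires_type = requires_type  # pip extra, e.g. 'dist'

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is not None and issubclass(exc_type, ImportError):
            if self.requires_type:
                hint = f'pip install -v -e .[{self.requires_type}]'
            else:
                hint = 'pip install ' + ' '.join(self.requires)
            target = self.op_name or 'this module'
            raise ImportError(
                f'[{target}] requires {self.requires}. '
                f'Install them via `{hint}`.') from exc_value
        return False  # let non-import errors propagate unchanged
```

Under this sketch, `with AvailabilityChecking(['ray'], requires_type='dist'): import ray` would turn a bare `ImportError` into one that names the missing package and points at the `.[dist]` extra.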
9 changes: 7 additions & 2 deletions data_juicer/ops/deduplicator/document_minhash_deduplicator.py
@@ -10,14 +10,19 @@
import regex
from jsonargparse.typing import ClosedUnitInterval, PositiveInt
from loguru import logger
-from scipy.integrate import quad as integrate
from tqdm import tqdm

+from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import HashKeys

from ..base_op import OPERATORS, Deduplicator
from ..common.helper_func import UnionFind, split_on_whitespace

+OP_NAME = 'document_minhash_deduplicator'
+
+with AvailabilityChecking(['scipy'], OP_NAME):
+    from scipy.integrate import quad as integrate

MERSENNE_PRIME = np.uint64((1 << 61) - 1)
MAX_HASH = np.uint64((1 << 32) - 1)

@@ -89,7 +94,7 @@ def proba(s):
return opt


-@OPERATORS.register_module('document_minhash_deduplicator')
+@OPERATORS.register_module(OP_NAME)
class DocumentMinhashDeduplicator(Deduplicator):
"""
Deduplicator to deduplicate samples at document-level using MinHashLSH.
16 changes: 10 additions & 6 deletions data_juicer/ops/deduplicator/document_simhash_deduplicator.py
@@ -7,15 +7,20 @@

import numpy as np
import regex
-import simhash
from jsonargparse.typing import PositiveInt
from loguru import logger

+from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import HashKeys

from ..base_op import OPERATORS, Deduplicator
from ..common.helper_func import split_on_whitespace

+OP_NAME = 'document_simhash_deduplicator'
+
+with AvailabilityChecking(['simhash-py'], OP_NAME):
+    import simhash


def local_num_differing_bits(hash_a, hash_b):
"""
@@ -57,10 +62,7 @@ def num_differing_bits_selector():
    return simhash.num_differing_bits


-num_differing_bits = num_differing_bits_selector()
-
-
-@OPERATORS.register_module('document_simhash_deduplicator')
+@OPERATORS.register_module(OP_NAME)
class DocumentSimhashDeduplicator(Deduplicator):
"""Deduplicator to deduplicate samples at document-level using SimHash."""

@@ -112,6 +114,8 @@ def __init__(self,
        self.num_blocks = num_blocks
        self.hamming_distance = hamming_distance

+        self.num_differing_bits = num_differing_bits_selector()

def compute_hash(self, sample):
"""
Compute simhash values for the sample.
@@ -185,7 +189,7 @@ def process(self, dataset, show_num=0):
dist = Counter()
for x, y in matches:
graph[x][y] = graph[y][x] = True
-            num_diff = num_differing_bits(x, y)
+            num_diff = self.num_differing_bits(x, y)
dist[num_diff] += 1
logger.info(f'Hash diff distribution: {dist}')

25 changes: 15 additions & 10 deletions data_juicer/ops/deduplicator/image_deduplicator.py
@@ -2,24 +2,29 @@
from typing import Dict, Set

import numpy as np
-from imagededup.methods import AHash, DHash, PHash, WHash

+from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields, HashKeys
from data_juicer.utils.mm_utils import load_image

from ..base_op import OPERATORS, Deduplicator
from ..op_fusion import LOADED_IMAGES

-HASH_METHOD = {
-    'phash': PHash(),
-    'dhash': DHash(),
-    'whash': WHash(),
-    'ahash': AHash()
-}
+OP_NAME = 'image_deduplicator'

-@OPERATORS.register_module('image_deduplicator')
-@LOADED_IMAGES.register_module('image_deduplicator')
+with AvailabilityChecking(['imagededup'], OP_NAME):
+    from imagededup.methods import AHash, DHash, PHash, WHash
+
+    HASH_METHOD = {
+        'phash': PHash,
+        'dhash': DHash,
+        'whash': WHash,
+        'ahash': AHash
+    }
+
+
+@OPERATORS.register_module(OP_NAME)
+@LOADED_IMAGES.register_module(OP_NAME)
class ImageDeduplicator(Deduplicator):
"""
Deduplicator to deduplicate samples at document-level using exact matching
@@ -38,7 +43,7 @@ def __init__(self, method: str = 'phash', *args, **kwargs):
        if method not in HASH_METHOD.keys():
            raise ValueError(f'Keep strategy [{method}] is not supported. '
                             f'Can only be one of {HASH_METHOD.keys()}.')
-        self.hasher = HASH_METHOD[method]
+        self.hasher = HASH_METHOD[method]()

def compute_hash(self, sample, context=False):
# check if it's computed already
6 changes: 6 additions & 0 deletions data_juicer/ops/filter/alphanumeric_filter.py
@@ -2,12 +2,18 @@

from jsonargparse.typing import PositiveFloat

+from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields, StatsKeys
from data_juicer.utils.model_utils import get_model, prepare_model

from ..base_op import OPERATORS, Filter
from ..common import get_words_from_document

+OP_NAME = 'alphanumeric_filter'
+
+with AvailabilityChecking(['transformers'], OP_NAME):
+    import transformers  # noqa: F401


@OPERATORS.register_module('alphanumeric_filter')
class AlphanumericFilter(Filter):
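
The `import transformers  # noqa: F401` line exists only to probe whether the optional dependency is importable. An alternative probe that avoids importing the package at all (not part of this PR, shown as a sketch with an assumed helper name) is `importlib.util.find_spec`:

```python
import importlib.util


def dependency_available(module_name: str) -> bool:
    """Return True if a module can be found on the import path,
    without actually importing (and initializing) it."""
    return importlib.util.find_spec(module_name) is not None
```

Note that `find_spec` takes the *module* name (`fasttext`), which can differ from the pip distribution name (`fasttext-wheel`) used in the `AvailabilityChecking` calls above.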
16 changes: 11 additions & 5 deletions data_juicer/ops/filter/clip_similarity_filter.py
@@ -1,20 +1,26 @@
import numpy as np
-import torch
from jsonargparse.typing import ClosedUnitInterval

+from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields, StatsKeys
from data_juicer.utils.mm_utils import SpecialTokens, load_image
from data_juicer.utils.model_utils import get_model, prepare_model

from ..base_op import OPERATORS, Filter
from ..op_fusion import LOADED_IMAGES

-# avoid hanging when calling clip in multiprocessing
-torch.set_num_threads(1)
+OP_NAME = 'clip_similarity_filter'

+with AvailabilityChecking(['torch'], OP_NAME):
+    import torch
+    import transformers  # noqa: F401
+
+    # avoid hanging when calling clip in multiprocessing
+    torch.set_num_threads(1)

-@OPERATORS.register_module('clip_similarity_filter')
-@LOADED_IMAGES.register_module('clip_similarity_filter')
+@OPERATORS.register_module(OP_NAME)
+@LOADED_IMAGES.register_module(OP_NAME)
class ClipSimilarityFilter(Filter):
"""Filter to keep samples those similarity between image and text
within a specific range."""
10 changes: 8 additions & 2 deletions data_juicer/ops/filter/flagged_words_filter.py
@@ -4,6 +4,7 @@

from jsonargparse.typing import ClosedUnitInterval, List

+from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields, InterVars, StatsKeys
from data_juicer.utils.model_utils import get_model, prepare_model

@@ -13,9 +14,14 @@
                      words_refinement)
from ..op_fusion import INTER_WORDS

-@OPERATORS.register_module('flagged_words_filter')
-@INTER_WORDS.register_module('flagged_words_filter')
+OP_NAME = 'flagged_words_filter'
+
+with AvailabilityChecking(['sentencepiece'], OP_NAME):
+    import sentencepiece  # noqa: F401
+
+
+@OPERATORS.register_module(OP_NAME)
+@INTER_WORDS.register_module(OP_NAME)
class FlaggedWordFilter(Filter):
"""Filter to keep samples with flagged-word ratio less than a specific max
value."""
8 changes: 7 additions & 1 deletion data_juicer/ops/filter/language_id_score_filter.py
@@ -1,13 +1,19 @@
from jsonargparse.typing import ClosedUnitInterval
from loguru import logger

+from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields, StatsKeys
from data_juicer.utils.model_utils import get_model, prepare_model

from ..base_op import OPERATORS, Filter

-@OPERATORS.register_module('language_id_score_filter')
+OP_NAME = 'language_id_score_filter'
+
+with AvailabilityChecking(['fasttext-wheel'], OP_NAME):
+    import fasttext  # noqa: F401
+
+
+@OPERATORS.register_module(OP_NAME)
class LanguageIDScoreFilter(Filter):
"""Filter to keep samples in a specific language with confidence score
larger than a specific min value."""
11 changes: 9 additions & 2 deletions data_juicer/ops/filter/perplexity_filter.py
@@ -4,16 +4,23 @@

from jsonargparse.typing import PositiveFloat

+from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields, InterVars, StatsKeys
from data_juicer.utils.model_utils import get_model, prepare_model

from ..base_op import OPERATORS, Filter
from ..common import get_words_from_document
from ..op_fusion import INTER_WORDS

-@OPERATORS.register_module('perplexity_filter')
-@INTER_WORDS.register_module('perplexity_filter')
+OP_NAME = 'perplexity_filter'
+
+with AvailabilityChecking(['sentencepiece', 'kenlm'], OP_NAME):
+    import kenlm  # noqa: F401
+    import sentencepiece  # noqa: F401
+
+
+@OPERATORS.register_module(OP_NAME)
+@INTER_WORDS.register_module(OP_NAME)
class PerplexityFilter(Filter):
"""Filter to keep samples with perplexity score less than a specific max
value."""