* Split requirements into more categories
+ Add availability checking for each OP at import time, for OPs that rely on large or poorly platform-compatible third-party libraries.
HYLcool committed Nov 17, 2023
1 parent afe06dc commit feb329f
Showing 28 changed files with 296 additions and 87 deletions.
34 changes: 20 additions & 14 deletions README.md
@@ -105,40 +105,46 @@ Table of Contents

### From Source

- Run the following commands to install the latest `data_juicer` version in
- Run the following commands to install the latest basic `data_juicer` version in
editable mode:
```shell
cd <path_to_data_juicer>
pip install -v -e .[all]
pip install -v -e .
```

- Or install optional dependencies:
- Some OPs rely on large or poorly platform-compatible third-party libraries. You can install optional dependencies as needed:

```shell
cd <path_to_data_juicer>
pip install -v -e . # install a minimal dependencies
pip install -v -e . # install minimal dependencies, supporting the basic functions
pip install -v -e .[tools] # install a subset of tools dependencies
```

The dependency options are listed below:

| Tag | Description |
|----------|------------------------------------------------------------------------|
| . | Install minimal dependencies for basic Data-Juicer. |
| .[all] | Install all optional dependencies (all of the following) |
| .[dev] | Install dependencies for developing the package as contributors |
| .[tools] | Install dependencies for dedicated tools, such as quality classifiers. |
| Tag | Description |
|--------------|----------------------------------------------------------------------------------------------|
| `.` or `.[mini]` | Install minimal dependencies for basic Data-Juicer. |
| `.[all]` | Install all optional dependencies (including minimal dependencies and all of the following). |
| `.[sci]` | Install all dependencies for all OPs. |
| `.[dist]` | Install dependencies for distributed data processing. (Experimental) |
| `.[dev]` | Install dependencies for developing the package as contributors. |
| `.[tools]` | Install dependencies for dedicated tools, such as quality classifiers. |

### Using pip

- Run the following command to install the latest `data_juicer` using `pip`:
- Run the following command to install the latest released `data_juicer` using `pip`:

```shell
pip install py-data-juicer
```

- **Note**: only the basic APIs in `data_juicer` and two basic tools
(data [processing](#data-processing) and [analysis](#data-analysis)) are available in this way. If you want customizable
and complete functions, we recommend you install `data_juicer` [from source](#from-source).
- **Note**:
  - Only the basic APIs in `data_juicer` and two basic tools (data [processing](#data-processing) and [analysis](#data-analysis)) are available this way. If you want customizable and complete functions, we recommend installing `data_juicer` [from source](#from-source).
  - The release versions on PyPI lag behind the latest source version, so if you want to follow the latest functions of `data_juicer`, we recommend installing [from source](#from-source).

### Using Docker

30 changes: 17 additions & 13 deletions README_ZH.md
@@ -93,40 +93,44 @@ Data-Juicer is a one-stop data processing system, designed for large language models (LLMs)

### Install from Source

* Run the following commands to install the latest version of `data_juicer` in editable mode:
* Run the following commands to install the latest basic version of `data_juicer` in editable mode:

```shell
cd <path_to_data_juicer>
pip install -v -e .[all]
pip install -v -e .
```

* Or install optional dependencies:
* Some OP functions rely on large or poorly platform-compatible third-party libraries, so users can install optional dependencies as needed:

```shell
cd <path_to_data_juicer>
pip install -v -e . # install minimal dependencies
pip install -v -e . # install minimal dependencies, supporting basic functions
pip install -v -e .[tools] # install dependencies for a subset of tools
```

The dependency options are listed in the table below:

| Tag | Description |
|----------|----------------------------------------------|
| . | Install minimal dependencies for basic Data-Juicer functions |
| .[all] | Install all optional dependencies (i.e. all of the following) |
| .[dev] | Install dependencies for developing Data-Juicer as a contributor |
| .[tools] | Install dependencies for dedicated tools (e.g. quality classifiers) |
| Tag | Description |
|--------------|------------------------------|
| `.` or `.[mini]` | Install minimal dependencies for basic Data-Juicer functions |
| `.[all]` | Install all optional dependencies (including the minimal dependencies and all of the following) |
| `.[sci]` | Install full dependencies for all OPs |
| `.[dist]` | Install dependencies for distributed data processing (experimental) |
| `.[dev]` | Install dependencies for developing Data-Juicer as a contributor |
| `.[tools]` | Install dependencies for dedicated tools (e.g. quality classifiers) |

### Install with pip

* Run the following command to install the latest version of `data_juicer` with `pip`:
* Run the following command to install the latest released version of `data_juicer` with `pip`:

```shell
pip install py-data-juicer
```

* **Note**: when installing this way, only the basic APIs in `data_juicer` and two basic tools
(data [processing](数据处理) and [analysis](数据分析)) are available. For more customizable and complete functions, we recommend [installing from source](#从源码安装).
* **Note**:
  * When installing this way, only the basic APIs in `data_juicer` and two basic tools
(data [processing](数据处理) and [analysis](数据分析)) are available. For more customizable and complete functions, we recommend [installing from source](#从源码安装).
  * The release versions on PyPI lag behind the latest source version, so to follow the latest features of `data_juicer`, we recommend [installing from source](#从源码安装).

### Install with Docker

7 changes: 5 additions & 2 deletions data_juicer/core/ray_executor.py
@@ -1,11 +1,14 @@
import ray
import ray.data as rd
from loguru import logger

from data_juicer.config import init_configs
from data_juicer.ops import Filter, Mapper, load_ops
from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields

with AvailabilityChecking(['ray'], requires_type='dist'):
import ray
import ray.data as rd


class RayExecutor:
"""
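The diff above introduces the guarded-import pattern used throughout this commit: the `ray` imports move inside an `AvailabilityChecking` context manager from `data_juicer.utils.availability_utils`, so a missing optional dependency produces an actionable message instead of a bare `ImportError`. The helper's implementation is not part of this diff; a minimal sketch of the pattern, with hypothetical message wording, might look like:

```python
class AvailabilityChecking:
    """Minimal sketch of a guarded-import context manager.

    The real helper lives in data_juicer.utils.availability_utils and may
    behave differently; the constructor signature mirrors the call sites in
    this diff, but the message format here is hypothetical.
    """

    def __init__(self, requirements, op_name=None, requires_type=None):
        self.requirements = requirements    # pip package names, e.g. ['ray']
        self.op_name = op_name              # OP that needs them, e.g. OP_NAME
        self.requires_type = requires_type  # extras tag, e.g. 'dist'

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Only intercept failures of the imports inside the `with` block.
        if exc_type is not None and issubclass(exc_type, ImportError):
            extra = f'.[{self.requires_type}]' if self.requires_type else '.[sci]'
            target = self.op_name or 'this module'
            raise ImportError(
                f'{self.requirements} are required by {target}. Install them '
                f'via `pip install -v -e {extra}`.') from exc_val
        return False  # propagate any non-import exception unchanged
```

With this shape, `with AvailabilityChecking(['scipy'], OP_NAME): from scipy.integrate import quad` either succeeds silently or fails with a pointer to the right extras tag.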
9 changes: 7 additions & 2 deletions data_juicer/ops/deduplicator/document_minhash_deduplicator.py
@@ -10,14 +10,19 @@
import regex
from jsonargparse.typing import ClosedUnitInterval, PositiveInt
from loguru import logger
from scipy.integrate import quad as integrate
from tqdm import tqdm

from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import HashKeys

from ..base_op import OPERATORS, Deduplicator
from ..common.helper_func import UnionFind, split_on_whitespace

OP_NAME = 'document_minhash_deduplicator'

with AvailabilityChecking(['scipy'], OP_NAME):
from scipy.integrate import quad as integrate

MERSENNE_PRIME = np.uint64((1 << 61) - 1)
MAX_HASH = np.uint64((1 << 32) - 1)

@@ -89,7 +94,7 @@ def proba(s):
return opt


@OPERATORS.register_module('document_minhash_deduplicator')
@OPERATORS.register_module(OP_NAME)
class DocumentMinhashDeduplicator(Deduplicator):
"""
Deduplicator to deduplicate samples at document-level using MinHashLSH.
16 changes: 10 additions & 6 deletions data_juicer/ops/deduplicator/document_simhash_deduplicator.py
@@ -7,15 +7,20 @@

import numpy as np
import regex
import simhash
from jsonargparse.typing import PositiveInt
from loguru import logger

from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import HashKeys

from ..base_op import OPERATORS, Deduplicator
from ..common.helper_func import split_on_whitespace

OP_NAME = 'document_simhash_deduplicator'

with AvailabilityChecking(['simhash-py'], OP_NAME):
import simhash


def local_num_differing_bits(hash_a, hash_b):
"""
@@ -57,10 +62,7 @@ def num_differing_bits_selector():
return simhash.num_differing_bits


num_differing_bits = num_differing_bits_selector()


@OPERATORS.register_module('document_simhash_deduplicator')
@OPERATORS.register_module(OP_NAME)
class DocumentSimhashDeduplicator(Deduplicator):
"""Deduplicator to deduplicate samples at document-level using SimHash."""

@@ -112,6 +114,8 @@ def __init__(self,
self.num_blocks = num_blocks
self.hamming_distance = hamming_distance

self.num_differing_bits = num_differing_bits_selector()

def compute_hash(self, sample):
"""
Compute simhash values for the sample.
@@ -185,7 +189,7 @@ def process(self, dataset, show_num=0):
dist = Counter()
for x, y in matches:
graph[x][y] = graph[y][x] = True
num_diff = num_differing_bits(x, y)
num_diff = self.num_differing_bits(x, y)
dist[num_diff] += 1
logger.info(f'Hash diff distribution: {dist}')

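Note the companion change in this file: `num_differing_bits = num_differing_bits_selector()` previously ran at module import time, before the guarded `simhash` import could be verified, and now runs in `__init__`, i.e. only when the OP is actually constructed. A stand-in sketch of that deferral (the selector body here is hypothetical, using the stdlib `operator` module in place of the optional `simhash` package):

```python
def num_differing_bits_selector():
    """Resolve the bit-difference function lazily, on first use.

    Stand-in for the selector in the diff: importing the optional backend
    happens inside the call, not at module import time.
    """
    import operator  # stand-in for the optional `simhash` dependency
    return lambda a, b: bin(operator.xor(a, b)).count('1')


class SimhashDedupSketch:
    """Hypothetical operator illustrating per-instance resolution."""

    def __init__(self):
        # Resolved when the OP is constructed, after availability checks,
        # rather than as a module-level side effect at import time.
        self.num_differing_bits = num_differing_bits_selector()
```

Modules defining this class can then be imported safely even when the optional backend is absent; only instantiation fails.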
25 changes: 15 additions & 10 deletions data_juicer/ops/deduplicator/image_deduplicator.py
@@ -2,24 +2,29 @@
from typing import Dict, Set

import numpy as np
from imagededup.methods import AHash, DHash, PHash, WHash

from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields, HashKeys
from data_juicer.utils.mm_utils import load_image

from ..base_op import OPERATORS, Deduplicator
from ..op_fusion import LOADED_IMAGES

HASH_METHOD = {
'phash': PHash(),
'dhash': DHash(),
'whash': WHash(),
'ahash': AHash()
}
OP_NAME = 'image_deduplicator'

with AvailabilityChecking(['imagededup'], OP_NAME):
from imagededup.methods import AHash, DHash, PHash, WHash

@OPERATORS.register_module('image_deduplicator')
@LOADED_IMAGES.register_module('image_deduplicator')
HASH_METHOD = {
'phash': PHash,
'dhash': DHash,
'whash': WHash,
'ahash': AHash
}


@OPERATORS.register_module(OP_NAME)
@LOADED_IMAGES.register_module(OP_NAME)
class ImageDeduplicator(Deduplicator):
"""
Deduplicator to deduplicate samples at document-level using exact matching
@@ -38,7 +43,7 @@ def __init__(self, method: str = 'phash', *args, **kwargs):
if method not in HASH_METHOD.keys():
raise ValueError(f'Keep strategy [{method}] is not supported. '
f'Can only be one of {HASH_METHOD.keys()}.')
self.hasher = HASH_METHOD[method]
self.hasher = HASH_METHOD[method]()

def compute_hash(self, sample, context=False):
# check if it's computed already
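Besides moving the `imagededup` import under the availability check, this diff also stores hasher classes rather than pre-built instances in `HASH_METHOD`, deferring construction to `__init__` (note the new `HASH_METHOD[method]()` call). The effect can be illustrated with a lightweight stand-in class (hypothetical; real `imagededup` hashers may do heavy setup on construction):

```python
class PHashStandIn:
    """Stand-in for imagededup.methods.PHash that counts constructions."""
    constructed = 0

    def __init__(self):
        PHashStandIn.constructed += 1


# Old style: an instance is built as a side effect of merely importing the
# module that defines the mapping.
HASH_METHOD_EAGER = {'phash': PHashStandIn()}

# New style (as in this diff): only the class is stored; nothing is built
# until an operator asks for it.
HASH_METHOD_LAZY = {'phash': PHashStandIn}


class ImageDeduplicatorSketch:
    """Hypothetical operator mirroring the diff's deferred instantiation."""

    def __init__(self, method='phash'):
        if method not in HASH_METHOD_LAZY:
            raise ValueError(f'method [{method}] is not supported')
        self.hasher = HASH_METHOD_LAZY[method]()  # built on demand
```

With the lazy mapping, importing the module costs nothing extra, and each operator still gets its own hasher instance.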
6 changes: 6 additions & 0 deletions data_juicer/ops/filter/alphanumeric_filter.py
@@ -2,12 +2,18 @@

from jsonargparse.typing import PositiveFloat

from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields, StatsKeys
from data_juicer.utils.model_utils import get_model, prepare_model

from ..base_op import OPERATORS, Filter
from ..common import get_words_from_document

OP_NAME = 'alphanumeric_filter'

with AvailabilityChecking(['transformers'], OP_NAME):
import transformers # noqa: F401


@OPERATORS.register_module('alphanumeric_filter')
class AlphanumericFilter(Filter):
16 changes: 11 additions & 5 deletions data_juicer/ops/filter/clip_similarity_filter.py
@@ -1,20 +1,26 @@
import numpy as np
import torch
from jsonargparse.typing import ClosedUnitInterval

from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields, StatsKeys
from data_juicer.utils.mm_utils import SpecialTokens, load_image
from data_juicer.utils.model_utils import get_model, prepare_model

from ..base_op import OPERATORS, Filter
from ..op_fusion import LOADED_IMAGES

# avoid hanging when calling clip in multiprocessing
torch.set_num_threads(1)
OP_NAME = 'clip_similarity_filter'

with AvailabilityChecking(['torch'], OP_NAME):
import torch
import transformers # noqa: F401

@OPERATORS.register_module('clip_similarity_filter')
@LOADED_IMAGES.register_module('clip_similarity_filter')
# avoid hanging when calling clip in multiprocessing
torch.set_num_threads(1)


@OPERATORS.register_module(OP_NAME)
@LOADED_IMAGES.register_module(OP_NAME)
class ClipSimilarityFilter(Filter):
"""Filter to keep samples those similarity between image and text
within a specific range."""
10 changes: 8 additions & 2 deletions data_juicer/ops/filter/flagged_words_filter.py
@@ -4,6 +4,7 @@

from jsonargparse.typing import ClosedUnitInterval, List

from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields, InterVars, StatsKeys
from data_juicer.utils.model_utils import get_model, prepare_model

@@ -13,9 +14,14 @@
words_refinement)
from ..op_fusion import INTER_WORDS

OP_NAME = 'flagged_words_filter'

@OPERATORS.register_module('flagged_words_filter')
@INTER_WORDS.register_module('flagged_words_filter')
with AvailabilityChecking(['sentencepiece'], OP_NAME):
import sentencepiece # noqa: F401


@OPERATORS.register_module(OP_NAME)
@INTER_WORDS.register_module(OP_NAME)
class FlaggedWordFilter(Filter):
"""Filter to keep samples with flagged-word ratio less than a specific max
value."""
8 changes: 7 additions & 1 deletion data_juicer/ops/filter/language_id_score_filter.py
@@ -1,13 +1,19 @@
from jsonargparse.typing import ClosedUnitInterval
from loguru import logger

from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields, StatsKeys
from data_juicer.utils.model_utils import get_model, prepare_model

from ..base_op import OPERATORS, Filter

OP_NAME = 'language_id_score_filter'

@OPERATORS.register_module('language_id_score_filter')
with AvailabilityChecking(['fasttext-wheel'], OP_NAME):
import fasttext # noqa: F401


@OPERATORS.register_module(OP_NAME)
class LanguageIDScoreFilter(Filter):
"""Filter to keep samples in a specific language with confidence score
larger than a specific min value."""
11 changes: 9 additions & 2 deletions data_juicer/ops/filter/perplexity_filter.py
@@ -4,16 +4,23 @@

from jsonargparse.typing import PositiveFloat

from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields, InterVars, StatsKeys
from data_juicer.utils.model_utils import get_model, prepare_model

from ..base_op import OPERATORS, Filter
from ..common import get_words_from_document
from ..op_fusion import INTER_WORDS

OP_NAME = 'perplexity_filter'

@OPERATORS.register_module('perplexity_filter')
@INTER_WORDS.register_module('perplexity_filter')
with AvailabilityChecking(['sentencepiece', 'kenlm'], OP_NAME):
import kenlm # noqa: F401
import sentencepiece # noqa: F401


@OPERATORS.register_module(OP_NAME)
@INTER_WORDS.register_module(OP_NAME)
class PerplexityFilter(Filter):
"""Filter to keep samples with perplexity score less than a specific max
value."""
