Add availability checking for OPs to allow incomplete dependency installation #82

Merged (3 commits, Nov 21, 2023)
34 changes: 20 additions & 14 deletions README.md
@@ -105,40 +105,46 @@ Table of Contents

### From Source

-- Run the following commands to install the latest `data_juicer` version in
+- Run the following commands to install the latest basic `data_juicer` version in
editable mode:
```shell
cd <path_to_data_juicer>
-pip install -v -e .[all]
+pip install -v -e .
```

-- Or install optional dependencies:
+- Some OPs rely on third-party libraries that are too large or poorly compatible across platforms. You can install optional dependencies as needed:

```shell
cd <path_to_data_juicer>
-pip install -v -e . # install a minimal dependencies
+pip install -v -e . # install minimal dependencies, which support the basic functions
pip install -v -e .[tools] # install a subset of tools dependencies
```

The dependency options are listed below:

-| Tag      | Description                                                              |
-|----------|------------------------------------------------------------------------|
-| .        | Install minimal dependencies for basic Data-Juicer.                     |
-| .[all]   | Install all optional dependencies (all of the following)                |
-| .[dev]   | Install dependencies for developing the package as contributors         |
-| .[tools] | Install dependencies for dedicated tools, such as quality classifiers.  |
+| Tag              | Description                                                                                    |
+|------------------|------------------------------------------------------------------------------------------------|
+| `.` or `.[mini]` | Install minimal dependencies for basic Data-Juicer.                                            |
+| `.[all]`         | Install all optional dependencies (including minimal dependencies and all of the following).   |
+| `.[sci]`         | Install all dependencies for all OPs.                                                          |
+| `.[dist]`        | Install dependencies for distributed data processing. (Experimental)                           |
+| `.[dev]`         | Install dependencies for developing the package as contributors.                               |
+| `.[tools]`       | Install dependencies for dedicated tools, such as quality classifiers.                         |

### Using pip

-- Run the following command to install the latest `data_juicer` using `pip`:
+- Run the following command to install the latest released `data_juicer` using `pip`:

```shell
pip install py-data-juicer
```

-- **Note**: only the basic APIs in `data_juicer` and two basic tools
-  (data [processing](#data-processing) and [analysis](#data-analysis)) are available in this way. If you want customizable
-  and complete functions, we recommend you install `data_juicer` [from source](#from-source).
+- **Note**:
+  - Only the basic APIs in `data_juicer` and two basic tools
+    (data [processing](#data-processing) and [analysis](#data-analysis)) are available in this way. If you want customizable
+    and complete functions, we recommend you install `data_juicer` [from source](#from-source).
+  - The released versions on PyPI lag behind the latest source version.
+    If you want to follow the latest features of `data_juicer`, we recommend you install [from source](#from-source).

### Using Docker

30 changes: 17 additions & 13 deletions README_ZH.md
@@ -93,40 +93,44 @@ Data-Juicer is a one-stop data processing system, designed for large language models (LLM

### Install from Source

-* Run the following commands to install the latest version of `data_juicer` in editable mode:
+* Run the following commands to install the latest basic version of `data_juicer` in editable mode:

```shell
cd <path_to_data_juicer>
-pip install -v -e .[all]
+pip install -v -e .
```

-* Or install optional dependencies:
+* Some OP features depend on third-party libraries that are large or poorly compatible across platforms, so users can install optional dependencies as needed:

```shell
cd <path_to_data_juicer>
-pip install -v -e . # install minimal dependencies
+pip install -v -e . # install minimal dependencies, supporting basic functions
pip install -v -e .[tools] # 安装部分工具库的依赖
```

The dependency options are listed in the table below:

-| Tag      | Description                                                           |
-|----------|-----------------------------------------------------------------------|
-| .        | Install minimal dependencies supporting basic Data-Juicer functions   |
-| .[all]   | Install all optional dependencies (i.e. all of the following)         |
-| .[dev]   | Install dependencies needed to develop Data-Juicer as a contributor   |
-| .[tools] | Install dependencies for dedicated tools such as quality classifiers  |
+| Tag              | Description                                                                           |
+|------------------|---------------------------------------------------------------------------------------|
+| `.` or `.[mini]` | Install minimal dependencies supporting basic Data-Juicer functions                   |
+| `.[all]`         | Install all optional dependencies (including the minimal set and everything below)    |
+| `.[sci]`         | Install the full dependencies for all OPs                                             |
+| `.[dist]`        | Install dependencies for distributed data processing (experimental)                   |
+| `.[dev]`         | Install dependencies needed to develop Data-Juicer as a contributor                   |
+| `.[tools]`       | Install dependencies for dedicated tools such as quality classifiers                  |

### Install with pip

-* Run the following command to install the latest version of `data_juicer` with `pip`:
+* Run the following command to install the latest released version of `data_juicer` with `pip`:

```shell
pip install py-data-juicer
```

-* **Note**: with this installation method, only the basic APIs in `data_juicer` and two basic tools
-  (data [processing](数据处理) and [analysis](数据分析)) are available. For more customizable and complete functions, we recommend installing [from source](#从源码安装).
+* **Note**:
+  * With this installation method, only the basic APIs in `data_juicer` and two basic tools
+    (data [processing](数据处理) and [analysis](数据分析)) are available. For more customizable and complete functions, we recommend installing [from source](#从源码安装).
+  * Released versions on PyPI lag behind the latest source version, so to keep up with the latest features of `data_juicer`, we recommend installing [from source](#从源码安装).

### Install with Docker

7 changes: 5 additions & 2 deletions data_juicer/core/ray_executor.py
@@ -1,11 +1,14 @@
-import ray
-import ray.data as rd
from loguru import logger

from data_juicer.config import init_configs
from data_juicer.ops import Filter, Mapper, load_ops
+from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields

+with AvailabilityChecking(['ray'], requires_type='dist'):
+    import ray
+    import ray.data as rd


class RayExecutor:
"""
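
The `with AvailabilityChecking([...])` pattern above defers optional imports so that an incomplete installation still imports cleanly elsewhere. The PR does not show `data_juicer/utils/availability_utils.py` itself, so the following is only a plausible sketch of such a context manager; the constructor signature, attribute names, and error message are assumptions for illustration:

```python
class AvailabilityChecking:
    """Illustrative sketch (not the PR's actual implementation): wrap
    optional imports and, on ImportError, re-raise with a hint naming
    the missing packages or the pip extra that provides them."""

    def __init__(self, requires, op_name=None, requires_type=None):
        self.requires = requires            # required pip package names
        self.op_name = op_name              # OP that needs them, if any
        self.requires_type = requires_type  # pip extra, e.g. 'dist'

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        if exc_type is not None and issubclass(exc_type, ImportError):
            if self.requires_type:
                hint = f'pip install -v -e .[{self.requires_type}]'
            else:
                hint = 'pip install ' + ' '.join(self.requires)
            target = self.op_name or 'this module'
            raise ImportError(
                f'[{target}] requires {self.requires}. '
                f'Install them via `{hint}`.') from exc_value
        return False  # let non-import errors propagate unchanged
```

Under this sketch, `with AvailabilityChecking(['ray'], requires_type='dist'): import ray` would turn a bare `ImportError` into one that names the missing package and points at the `.[dist]` extra.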
9 changes: 7 additions & 2 deletions data_juicer/ops/deduplicator/document_minhash_deduplicator.py
@@ -10,14 +10,19 @@
import regex
from jsonargparse.typing import ClosedUnitInterval, PositiveInt
from loguru import logger
-from scipy.integrate import quad as integrate
from tqdm import tqdm

+from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import HashKeys

from ..base_op import OPERATORS, Deduplicator
from ..common.helper_func import UnionFind, split_on_whitespace

+OP_NAME = 'document_minhash_deduplicator'
+
+with AvailabilityChecking(['scipy'], OP_NAME):
+    from scipy.integrate import quad as integrate

MERSENNE_PRIME = np.uint64((1 << 61) - 1)
MAX_HASH = np.uint64((1 << 32) - 1)

@@ -89,7 +94,7 @@ def proba(s):
return opt


-@OPERATORS.register_module('document_minhash_deduplicator')
+@OPERATORS.register_module(OP_NAME)
class DocumentMinhashDeduplicator(Deduplicator):
"""
Deduplicator to deduplicate samples at document-level using MinHashLSH.
16 changes: 10 additions & 6 deletions data_juicer/ops/deduplicator/document_simhash_deduplicator.py
@@ -7,15 +7,20 @@

import numpy as np
import regex
-import simhash
from jsonargparse.typing import PositiveInt
from loguru import logger

+from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import HashKeys

from ..base_op import OPERATORS, Deduplicator
from ..common.helper_func import split_on_whitespace

+OP_NAME = 'document_simhash_deduplicator'
+
+with AvailabilityChecking(['simhash-py'], OP_NAME):
+    import simhash


def local_num_differing_bits(hash_a, hash_b):
"""
@@ -57,10 +62,7 @@ def num_differing_bits_selector():
    return simhash.num_differing_bits


-num_differing_bits = num_differing_bits_selector()
-
-
-@OPERATORS.register_module('document_simhash_deduplicator')
+@OPERATORS.register_module(OP_NAME)
class DocumentSimhashDeduplicator(Deduplicator):
"""Deduplicator to deduplicate samples at document-level using SimHash."""

@@ -112,6 +114,8 @@ def __init__(self,
        self.num_blocks = num_blocks
        self.hamming_distance = hamming_distance

+        self.num_differing_bits = num_differing_bits_selector()

def compute_hash(self, sample):
"""
Compute simhash values for the sample.
@@ -185,7 +189,7 @@ def process(self, dataset, show_num=0):
dist = Counter()
for x, y in matches:
graph[x][y] = graph[y][x] = True
-            num_diff = num_differing_bits(x, y)
+            num_diff = self.num_differing_bits(x, y)
dist[num_diff] += 1
logger.info(f'Hash diff distribution: {dist}')

25 changes: 15 additions & 10 deletions data_juicer/ops/deduplicator/image_deduplicator.py
@@ -2,24 +2,29 @@
from typing import Dict, Set

import numpy as np
-from imagededup.methods import AHash, DHash, PHash, WHash

+from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields, HashKeys
from data_juicer.utils.mm_utils import load_image

from ..base_op import OPERATORS, Deduplicator
from ..op_fusion import LOADED_IMAGES

-HASH_METHOD = {
-    'phash': PHash(),
-    'dhash': DHash(),
-    'whash': WHash(),
-    'ahash': AHash()
-}
+OP_NAME = 'image_deduplicator'

-@OPERATORS.register_module('image_deduplicator')
-@LOADED_IMAGES.register_module('image_deduplicator')
+with AvailabilityChecking(['imagededup'], OP_NAME):
+    from imagededup.methods import AHash, DHash, PHash, WHash
+
+    HASH_METHOD = {
+        'phash': PHash,
+        'dhash': DHash,
+        'whash': WHash,
+        'ahash': AHash
+    }
+
+
+@OPERATORS.register_module(OP_NAME)
+@LOADED_IMAGES.register_module(OP_NAME)
class ImageDeduplicator(Deduplicator):
"""
Deduplicator to deduplicate samples at document-level using exact matching
@@ -38,7 +43,7 @@ def __init__(self, method: str = 'phash', *args, **kwargs):
        if method not in HASH_METHOD.keys():
            raise ValueError(f'Keep strategy [{method}] is not supported. '
                             f'Can only be one of {HASH_METHOD.keys()}.')
-        self.hasher = HASH_METHOD[method]
+        self.hasher = HASH_METHOD[method]()

def compute_hash(self, sample, context=False):
# check if it's computed already
6 changes: 6 additions & 0 deletions data_juicer/ops/filter/alphanumeric_filter.py
@@ -2,12 +2,18 @@

from jsonargparse.typing import PositiveFloat

+from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields, StatsKeys
from data_juicer.utils.model_utils import get_model, prepare_model

from ..base_op import OPERATORS, Filter
from ..common import get_words_from_document

+OP_NAME = 'alphanumeric_filter'
+
+with AvailabilityChecking(['transformers'], OP_NAME):
+    import transformers  # noqa: F401


@OPERATORS.register_module('alphanumeric_filter')
class AlphanumericFilter(Filter):
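
The `import transformers  # noqa: F401` line exists only to probe whether the optional dependency is importable. An alternative probe that avoids importing the package at all (not part of this PR, shown as a sketch with an assumed helper name) is `importlib.util.find_spec`:

```python
import importlib.util


def dependency_available(module_name: str) -> bool:
    """Return True if a module can be found on the import path,
    without actually importing (and initializing) it."""
    return importlib.util.find_spec(module_name) is not None
```

Note that `find_spec` takes the *module* name (`fasttext`), which can differ from the pip distribution name (`fasttext-wheel`) used in the `AvailabilityChecking` calls above.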
16 changes: 11 additions & 5 deletions data_juicer/ops/filter/clip_similarity_filter.py
@@ -1,20 +1,26 @@
import numpy as np
-import torch
from jsonargparse.typing import ClosedUnitInterval

+from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields, StatsKeys
from data_juicer.utils.mm_utils import SpecialTokens, load_image
from data_juicer.utils.model_utils import get_model, prepare_model

from ..base_op import OPERATORS, Filter
from ..op_fusion import LOADED_IMAGES

-# avoid hanging when calling clip in multiprocessing
-torch.set_num_threads(1)
+OP_NAME = 'clip_similarity_filter'

+with AvailabilityChecking(['torch'], OP_NAME):
+    import torch
+    import transformers  # noqa: F401
+
+    # avoid hanging when calling clip in multiprocessing
+    torch.set_num_threads(1)

-@OPERATORS.register_module('clip_similarity_filter')
-@LOADED_IMAGES.register_module('clip_similarity_filter')
+@OPERATORS.register_module(OP_NAME)
+@LOADED_IMAGES.register_module(OP_NAME)
class ClipSimilarityFilter(Filter):
"""Filter to keep samples those similarity between image and text
within a specific range."""
10 changes: 8 additions & 2 deletions data_juicer/ops/filter/flagged_words_filter.py
@@ -4,6 +4,7 @@

from jsonargparse.typing import ClosedUnitInterval, List

+from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields, InterVars, StatsKeys
from data_juicer.utils.model_utils import get_model, prepare_model

@@ -13,9 +14,14 @@
                      words_refinement)
from ..op_fusion import INTER_WORDS

-@OPERATORS.register_module('flagged_words_filter')
-@INTER_WORDS.register_module('flagged_words_filter')
+OP_NAME = 'flagged_words_filter'
+
+with AvailabilityChecking(['sentencepiece'], OP_NAME):
+    import sentencepiece  # noqa: F401
+
+
+@OPERATORS.register_module(OP_NAME)
+@INTER_WORDS.register_module(OP_NAME)
class FlaggedWordFilter(Filter):
"""Filter to keep samples with flagged-word ratio less than a specific max
value."""
8 changes: 7 additions & 1 deletion data_juicer/ops/filter/language_id_score_filter.py
@@ -1,13 +1,19 @@
from jsonargparse.typing import ClosedUnitInterval
from loguru import logger

+from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields, StatsKeys
from data_juicer.utils.model_utils import get_model, prepare_model

from ..base_op import OPERATORS, Filter

-@OPERATORS.register_module('language_id_score_filter')
+OP_NAME = 'language_id_score_filter'
+
+with AvailabilityChecking(['fasttext-wheel'], OP_NAME):
+    import fasttext  # noqa: F401
+
+
+@OPERATORS.register_module(OP_NAME)
class LanguageIDScoreFilter(Filter):
"""Filter to keep samples in a specific language with confidence score
larger than a specific min value."""
11 changes: 9 additions & 2 deletions data_juicer/ops/filter/perplexity_filter.py
@@ -4,16 +4,23 @@

from jsonargparse.typing import PositiveFloat

+from data_juicer.utils.availability_utils import AvailabilityChecking
from data_juicer.utils.constant import Fields, InterVars, StatsKeys
from data_juicer.utils.model_utils import get_model, prepare_model

from ..base_op import OPERATORS, Filter
from ..common import get_words_from_document
from ..op_fusion import INTER_WORDS

-@OPERATORS.register_module('perplexity_filter')
-@INTER_WORDS.register_module('perplexity_filter')
+OP_NAME = 'perplexity_filter'
+
+with AvailabilityChecking(['sentencepiece', 'kenlm'], OP_NAME):
+    import kenlm  # noqa: F401
+    import sentencepiece  # noqa: F401
+
+
+@OPERATORS.register_module(OP_NAME)
+@INTER_WORDS.register_module(OP_NAME)
class PerplexityFilter(Filter):
"""Filter to keep samples with perplexity score less than a specific max
value."""