
Fix broken NeMo dependencies #372

Merged
16 commits merged on Nov 15, 2024
3 changes: 1 addition & 2 deletions .github/workflows/test.yml
@@ -37,9 +37,8 @@ jobs:

        # Installing wheel beforehand due to fasttext issue:
        # https://github.com/facebookresearch/fastText/issues/512#issuecomment-1837367666
-       # Explicitly install cython: https://github.com/VKCOM/YouTokenToMe/issues/94
        run: |
-         pip install wheel cython
+         pip install wheel
          pip install --no-cache-dir .
          pip install pytest
      - name: Run tests
2 changes: 1 addition & 1 deletion Dockerfile
@@ -38,7 +38,7 @@ RUN conda create -y --name curator -c conda-forge -c nvidia \
    libcusparse \
    libcusolver && \
    source activate curator && \
-   pip install --upgrade cython pytest pip
+   pip install --upgrade pytest pip

RUN \
--mount=type=bind,source=/opt/NeMo-Curator/nemo_curator/__init__.py,target=/opt/NeMo-Curator/nemo_curator/__init__.py,from=curator-update \
2 changes: 0 additions & 2 deletions README.md
@@ -83,14 +83,12 @@ You can get NeMo-Curator in 3 ways.
#### PyPi

```bash
-pip install cython
pip install --extra-index-url https://pypi.nvidia.com nemo-curator[all]
```

#### Source
```bash
git clone https://github.com/NVIDIA/NeMo-Curator.git
-pip install cython
pip install --extra-index-url https://pypi.nvidia.com "./NeMo-Curator[all]"
```

2 changes: 0 additions & 2 deletions docs/user-guide/image/gettingstarted.rst
@@ -33,7 +33,6 @@ NeMo Curator's PyPi page can be found `here <https://pypi.org/project/nemo-curat

.. code-block:: bash

-   pip install cython
    pip install nemo-curator[image]

#####################
@@ -44,7 +43,6 @@ NeMo Curator's GitHub can be found `here <https://github.com/NVIDIA/NeMo-Curator

.. code-block:: bash

    git clone https://github.com/NVIDIA/NeMo-Curator.git
-   pip install cython
    pip install ./NeMo-Curator[image]

############################
5 changes: 4 additions & 1 deletion nemo_curator/filters/code.py
@@ -102,7 +102,10 @@ def keep_document(self, score):
class TokenizerFertilityFilter(DocumentFilter):

    def __init__(self, path_to_tokenizer=None, min_char_to_token_ratio=2.5):
-       from nemo.collections.common.tokenizers import SentencePieceTokenizer
+       try:
+           from nemo.collections.common.tokenizers import SentencePieceTokenizer
+       except (ImportError, ModuleNotFoundError):
+           from .sentencepiece_tokenizer import SentencePieceTokenizer
@sarahyurick (Collaborator, Author) commented on Nov 15, 2024:

What do we think about this?

    ModuleNotFoundError: No module named 'nemo'

Collaborator commented:

Based on our discussions from Slack, I think we can just transform this class to be something like this:

class TokenizerFertilityFilter(DocumentFilter):

    def __init__(self, path_to_tokenizer=None, min_char_to_token_ratio=2.5):
        if path_to_tokenizer is None:
            raise ValueError(
                "Must provide a valid path to a SentencePiece tokenizer"
            )
        self._tokenizer = sentencepiece.SentencePieceProcessor()
        self._tokenizer.Load(path_to_tokenizer)
        self._threshold = min_char_to_token_ratio

        self._name = "tokenizer_fertility"

    def score_document(self, source):
        tokens = self._tokenizer.encode_as_pieces(source)
        num_chars = len(source)
        num_tokens = len(tokens)
        if num_tokens == 0:
            return -1
        return num_chars / num_tokens

    def keep_document(self, score):
        return score >= self._threshold

Then we can just delete the one file you copied over. Lmk what you think.
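For illustration, here is a minimal sketch of how the proposed filter could be exercised, assuming `import sentencepiece` sits at the top of the module; the file paths below are hypothetical, and the `ScoreFilter`/`DocumentDataset` wiring follows the usual NeMo Curator filter pattern rather than anything added in this PR:

```python
from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import TokenizerFertilityFilter

# Hypothetical input file and tokenizer model, for illustration only.
dataset = DocumentDataset.read_json("code_docs.jsonl")

filter_step = ScoreFilter(
    TokenizerFertilityFilter(path_to_tokenizer="tokenizer.model"),
    text_field="text",
    score_field="tokenizer_fertility",
)

# Keeps documents whose chars-per-token ratio meets the threshold.
filtered = filter_step(dataset)
```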

Collaborator commented:

We should probably run this via batches instead of running it on per-file pieces and returning a single file. We can also probably use crossfit for it (if we want to).

@VibhuJawa (Collaborator) commented on Nov 15, 2024:

This is what that will look like:

    cf.op.Tokenizer(model, cols=["text"], tokenizer_type="sentencepiece")

That said, we might have to ensure this works in a CPU environment too, so there might be some complexity here we need to fix.

@sarahyurick (Collaborator, Author) commented:

Thanks @VibhuJawa! I have opened #377 to track this.


        if path_to_tokenizer is None:
            raise ValueError(
292 changes: 292 additions & 0 deletions nemo_curator/filters/sentencepiece_tokenizer.py
@@ -0,0 +1,292 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
from typing import Dict, List, Optional, Union

import numpy as np
import sentencepiece
import torch


class SentencePieceTokenizer:
"""
SentencePieceTokenizer https://github.com/google/sentencepiece

Args:
model_path: path to sentence piece tokenizer model.
special_tokens: either list of special tokens or dictionary of token name to token value
legacy: when set to True, the previous behavior of the SentecePiece wrapper will be restored,
including the possibility to add special tokens inside wrapper.
"""

    def __init__(
        self,
        model_path: str,
        special_tokens: Optional[Union[Dict[str, str], List[str]]] = None,
        legacy: bool = False,
    ):
        if not model_path or not os.path.exists(model_path):
            raise ValueError(f"model_path: {model_path} is invalid")
        self.tokenizer = sentencepiece.SentencePieceProcessor()
        self.tokenizer.Load(model_path)

        self.original_vocab_size = self.tokenizer.get_piece_size()
        self.vocab_size = self.tokenizer.get_piece_size()
        self.legacy = legacy
        self.special_token_to_id = {}
        self.id_to_special_token = {}
        if special_tokens:
            if not self.legacy:
                raise ValueError(
                    "Special tokens must be None when legacy is set to False. Provide special tokens at train time."
                )
            self.add_special_tokens(special_tokens)
        self.space_sensitive = self.text_to_tokens("x y") != self.text_to_tokens(
            "x"
        ) + self.text_to_tokens("y")

    def text_to_tokens(self, text):
        if self.legacy:
            tokens = []
            idx = 0

            while 1:
                indices = {}

                for token in self.special_token_to_id:
                    try:
                        indices[token] = text[idx:].index(token)
                    except ValueError:
                        continue

                if len(indices) == 0:
                    break

                next_token = min(indices, key=indices.get)
                next_idx = idx + indices[next_token]

                tokens.extend(self.tokenizer.encode_as_pieces(text[idx:next_idx]))
                tokens.append(next_token)
                idx = next_idx + len(next_token)

            tokens.extend(self.tokenizer.encode_as_pieces(text[idx:]))
            return tokens

        return self.tokenizer.encode_as_pieces(text)

    def encode(self, text):
        if self.legacy:
            ids = []
            idx = 0

            while 1:
                indices = {}

                for token in self.special_token_to_id:
                    try:
                        indices[token] = text[idx:].index(token)
                    except ValueError:
                        continue

                if len(indices) == 0:
                    break

                next_token = min(indices, key=indices.get)
                next_idx = idx + indices[next_token]

                ids.extend(self.tokenizer.encode_as_ids(text[idx:next_idx]))
                ids.append(self.special_token_to_id[next_token])
                idx = next_idx + len(next_token)

            ids.extend(self.tokenizer.encode_as_ids(text[idx:]))
            return ids

        return self.tokenizer.encode_as_ids(text)

    def tokens_to_text(self, tokens):
        if isinstance(tokens, np.ndarray):
            tokens = tokens.tolist()

        return self.tokenizer.decode_pieces(tokens)

    def batch_decode(self, ids):
        if isinstance(ids, np.ndarray) or torch.is_tensor(ids):
            ids = ids.tolist()

        if self.legacy:
            text = ""
            last_i = 0

            for i, id in enumerate(ids):
                if id in self.id_to_special_token:
                    text += self.tokenizer.decode_ids(ids[last_i:i]) + " "
                    text += self.id_to_special_token[id] + " "
                    last_i = i + 1

            text += self.tokenizer.decode_ids(ids[last_i:])
            return text.strip()

        return self.tokenizer.decode(ids)

    def token_to_id(self, token):
        if self.legacy and token in self.special_token_to_id:
            return self.special_token_to_id[token]

        return self.tokenizer.piece_to_id(token)

    def ids_to_tokens(self, ids):
        tokens = []
        for id in ids:
            if id >= self.original_vocab_size:
                tokens.append(self.id_to_special_token[id])
            else:
                tokens.append(self.tokenizer.id_to_piece(id))
        return tokens

    def tokens_to_ids(self, tokens: Union[str, List[str]]) -> Union[int, List[int]]:
        if isinstance(tokens, str):
            tokens = [tokens]
        ids = []
        for token in tokens:
            ids.append(self.token_to_id(token))
        return ids

    def add_special_tokens(self, special_tokens):
        if not self.legacy:
            raise AttributeError(
                "Special Token addition does not work when legacy is set to False."
            )

        if isinstance(special_tokens, list):
            for token in special_tokens:
                if (
                    self.tokenizer.piece_to_id(token) == self.tokenizer.unk_id()
                    and token not in self.special_token_to_id
                ):
                    self.special_token_to_id[token] = self.vocab_size
                    self.id_to_special_token[self.vocab_size] = token
                    self.vocab_size += 1
        elif isinstance(special_tokens, dict):
            for token_name, token in special_tokens.items():
                setattr(self, token_name, token)
                if (
                    self.tokenizer.piece_to_id(token) == self.tokenizer.unk_id()
                    and token not in self.special_token_to_id
                ):
                    self.special_token_to_id[token] = self.vocab_size
                    self.id_to_special_token[self.vocab_size] = token
                    self.vocab_size += 1

    @property
    def pad_id(self):
        if self.legacy:
            pad_id = self.tokens_to_ids([self.pad_token])[0]
        else:
            pad_id = self.tokenizer.pad_id()
        return pad_id

    @property
    def bos_token_id(self):
        if self.legacy:
            bos_id = self.tokens_to_ids([self.bos_token])[0]
        else:
            bos_id = self.tokenizer.bos_id()
        return bos_id

    @property
    def eos_token_id(self):
        if self.legacy:
            eos_id = self.tokens_to_ids([self.eos_token])[0]
        else:
            eos_id = self.tokenizer.eos_id()
        return eos_id

    @property
    def sep_id(self):
        if self.legacy:
            return self.tokens_to_ids([self.sep_token])[0]
        else:
            raise NameError(
                "Use function token_to_id to retrieve special tokens other than unk, pad, bos, and eos."
            )

    @property
    def cls_id(self):
        if self.legacy:
            return self.tokens_to_ids([self.cls_token])[0]
        else:
            raise NameError(
                "Use function token_to_id to retrieve special tokens other than unk, pad, bos, and eos."
            )

    @property
    def mask_id(self):
        if self.legacy:
            return self.tokens_to_ids([self.mask_token])[0]
        else:
            raise NameError(
                "Use function token_to_id to retrieve special tokens other than unk, pad, bos, and eos."
            )

    @property
    def unk_id(self):
        return self.tokenizer.unk_id()

    @property
    def additional_special_tokens_ids(self):
        """Returns a list of the additional special tokens (excluding bos, eos, pad, unk). Used to return sentinel tokens for e.g. T5."""
        special_tokens = set(
            [
                self.bos_token,
                self.eos_token,
                self.pad_token,
                self.mask_token,
                self.cls_token,
                self.sep_token,
            ]
        )
        return [
            v for k, v in self.special_token_to_id.items() if k not in special_tokens
        ]

    @property
    def vocab(self):
        main_vocab = [
            self.tokenizer.id_to_piece(id)
            for id in range(self.tokenizer.get_piece_size())
        ]
        special_tokens = [
            self.id_to_special_token[self.original_vocab_size + i]
            for i in range(self.vocab_size - self.original_vocab_size)
        ]
        return main_vocab + special_tokens

    ### Below are a few methods that mimic transformers.PreTrainedTokenizer for vLLM

    def convert_ids_to_tokens(self, ids, skip_special_tokens: bool = False):
        return self.ids_to_tokens(ids)  # TODO: support skip_special_tokens

    def convert_tokens_to_string(self, tokens: List[str]):
        return self.tokens_to_text(tokens)

    def __len__(self):
        return self.vocab_size

    @property
    def is_fast(self):
        return True

    def get_added_vocab(self):
        return None
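For reference, a short usage sketch of the vendored tokenizer above; the model path is hypothetical, and only methods defined in this file are called:

```python
from nemo_curator.filters.sentencepiece_tokenizer import SentencePieceTokenizer

# Hypothetical path to a trained SentencePiece model file.
tokenizer = SentencePieceTokenizer(model_path="tokenizer.model")

text = "def add(a, b): return a + b"
pieces = tokenizer.text_to_tokens(text)  # subword pieces
ids = tokenizer.encode(text)             # token ids

print(pieces)
print(tokenizer.batch_decode(ids))  # reconstructs text from ids

# The char-to-token ratio that TokenizerFertilityFilter scores on:
print(len(text) / max(len(pieces), 1))
```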
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -45,6 +45,7 @@ dependencies = [
"crossfit>=0.0.6",
"dask-mpi>=2021.11.0",
"dask[complete]>=2021.7.1",
"datasets",
"distributed>=2021.7.1",
"fasttext==0.9.2",
"ftfy==6.1.1",
@@ -54,14 +55,14 @@ dependencies = [
"lxml_html_clean",
"mecab-python3",
"mwparserfromhell==0.6.5",
"nemo_toolkit[nlp]>=1.23.0",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #376.

"numpy<2",
"openai",
"peft",
"presidio-analyzer==2.2.351",
"presidio-anonymizer==2.2.351",
"pycld2",
"resiliparse",
"sentencepiece",
"spacy>=3.6.0, <3.8.0",
"unidic-lite==1.0.8",
"usaddress==0.5.10",
2 changes: 1 addition & 1 deletion tutorials/image-curation/image-curation.ipynb
@@ -49,7 +49,7 @@
},
"outputs": [],
"source": [
"!pip install cython ipywidgets aiofiles\n",
"!pip install ipywidgets aiofiles\n",
"# Install from source by default\n",
"!pip install --extra-index-url https://pypi.nvidia.com ../../[image]\n",
"%env DASK_DATAFRAME__QUERY_PLANNING False"