
Commit

[Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (huggingface#7659)

* splitting fast and slow tokenizers [WIP]

* [WIP] splitting sentencepiece and tokenizers dependencies

* update dummy objects

* add name_or_path to models and tokenizers

* prefix added to file names

* prefix

* styling + quality

* splitting all the tokenizer files - sorting sentencepiece-based ones

* update tokenizers version to 0.9.0

* remove hard dependency on sentencepiece 🎉

* and removed hard dependency on tokenizers 🎉

* update conversion script

* update missing models

* fixing tests

* move test_tokenization_fast to main tokenization tests - fix bugs

* bump up tokenizers

* fix bert_generation

* update and fix several tokenizers

* keep sentencepiece in deps for now

* fix funnel and deberta tests

* fix fsmt

* fix marian tests

* fix layoutlm

* fix squeezebert and gpt2

* fix T5 tokenization

* fix xlnet tests

* style

* fix mbart

* bump up tokenizers to 0.9.2

* fix model tests

* fix tf models

* fix seq2seq examples

* fix tests without sentencepiece

* fix slow => fast conversion without sentencepiece

* update auto and bert generation tests

* fix mbart tests

* fix auto and common test without tokenizers

* fix tests without tokenizers

* clean up tests - lighten up when tokenizers + sentencepiece are both off

* style quality and tests fixing

* add sentencepiece to doc/examples reqs

* leave sentencepiece on for now

* style, quality, split Herbert and fix pegasus

* WIP Herbert fast

* add sample_text_no_unicode and fix Herbert tokenization

* skip FSMT example test for now

* fix style

* fix fsmt in example tests

* update following Lysandre and Sylvain's comments

* Update src/transformers/testing_utils.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/testing_utils.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <[email protected]>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <[email protected]>

Co-authored-by: Sylvain Gugger <[email protected]>
thomwolf and sgugger authored Oct 18, 2020
1 parent c65863c commit ba8c4d0
Showing 140 changed files with 6,550 additions and 3,960 deletions.
4 changes: 2 additions & 2 deletions .circleci/config.yml
@@ -198,7 +198,7 @@ jobs:
- v0.3-build_doc-{{ checksum "setup.py" }}
- v0.3-{{ checksum "setup.py" }}
- run: pip install --upgrade pip
- run: pip install .[tf,torch,docs]
- run: pip install .[tf,torch,sentencepiece,docs]
- save_cache:
key: v0.3-build_doc-{{ checksum "setup.py" }}
paths:
@@ -219,7 +219,7 @@ jobs:
keys:
- v0.3-deploy_doc-{{ checksum "setup.py" }}
- v0.3-{{ checksum "setup.py" }}
- run: pip install .[tf,torch,docs]
- run: pip install .[tf,torch,sentencepiece,docs]
- save_cache:
key: v0.3-deploy_doc-{{ checksum "setup.py" }}
paths:
3 changes: 1 addition & 2 deletions .github/workflows/github-torch-hub.yml
@@ -30,8 +30,7 @@ jobs:
run: |
pip install --upgrade pip
pip install torch
pip install numpy filelock protobuf requests tqdm regex sentencepiece sacremoses packaging
pip install tokenizers==0.9.0.rc2
pip install numpy filelock protobuf requests tqdm regex sentencepiece sacremoses tokenizers packaging
- name: Torch hub list
run: |
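As an illustrative aside, a minimal sketch of the torch hub usage this workflow exercises; it assumes network access and that the hubconf entry point is named "tokenizer" (the checkpoint name is also just an example).

import torch

# List the entry points exposed by the repository's hubconf, then load a tokenizer through torch hub.
print(torch.hub.list("huggingface/transformers"))
tokenizer = torch.hub.load("huggingface/transformers", "tokenizer", "bert-base-uncased")
print(tokenizer.tokenize("Hello world"))
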
3 changes: 2 additions & 1 deletion .gitignore
@@ -9,7 +9,8 @@ __pycache__/
*.so

# tests and logs
tests/fixtures
tests/fixtures/*
!tests/fixtures/sample_text_no_unicode.txt
logs/
lightning_logs/
lang_code_data/
12 changes: 6 additions & 6 deletions docs/source/task_summary.rst
@@ -758,8 +758,8 @@ Here is an example of using the pipelines to do summarization. It leverages a Ba
... If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18.
... """
Because the summarization pipeline depends on the ``PretrainedModel.generate()`` method, we can override the default arguments
of ``PretrainedModel.generate()`` directly in the pipeline for ``max_length`` and ``min_length`` as shown below.
Because the summarization pipeline depends on the ``PreTrainedModel.generate()`` method, we can override the default arguments
of ``PreTrainedModel.generate()`` directly in the pipeline for ``max_length`` and ``min_length`` as shown below.
This outputs the following summary:

.. code-block::
@@ -772,7 +772,7 @@ Here is an example of doing summarization using a model and a tokenizer. The pro
1. Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder model, such as ``Bart`` or ``T5``.
2. Define the article that should be summarized.
3. Add the T5 specific prefix "summarize: ".
4. Use the ``PretrainedModel.generate()`` method to generate the summary.
4. Use the ``PreTrainedModel.generate()`` method to generate the summary.

In this example we use Google`s T5 model. Even though it was pre-trained only on a multi-task mixed dataset (including CNN / Daily Mail), it yields very good results.

@@ -819,15 +819,15 @@ translation results.
>>> print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))
[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]
Because the translation pipeline depends on the ``PretrainedModel.generate()`` method, we can override the default arguments
of ``PretrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` above.
Because the translation pipeline depends on the ``PreTrainedModel.generate()`` method, we can override the default arguments
of ``PreTrainedModel.generate()`` directly in the pipeline as is shown for ``max_length`` above.

Here is an example of doing translation using a model and a tokenizer. The process is the following:

1. Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder model, such as ``Bart`` or ``T5``.
2. Define the article that should be summarizaed.
3. Add the T5 specific prefix "translate English to German: "
4. Use the ``PretrainedModel.generate()`` method to perform the translation.
4. Use the ``PreTrainedModel.generate()`` method to perform the translation.

.. code-block::
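A minimal sketch of the pipeline call these docs describe, assuming the default summarization checkpoint can be downloaded; max_length and min_length are forwarded to PreTrainedModel.generate().

from transformers import pipeline

summarizer = pipeline("summarization")
article = "New York (CNN) ..."  # stand-in for the full article quoted in the docs
# Override the generate() defaults directly through the pipeline call
print(summarizer(article, max_length=130, min_length=30, do_sample=False))
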
1 change: 1 addition & 0 deletions examples/requirements.txt
@@ -17,3 +17,4 @@ datasets
fire
pytest
conllu
sentencepiece != 0.1.92
5 changes: 3 additions & 2 deletions setup.py
@@ -92,12 +92,13 @@
extras["serving"] = ["pydantic", "uvicorn", "fastapi", "starlette"]
extras["all"] = extras["serving"] + ["tensorflow", "torch"]

extras["sentencepiece"] = ["sentencepiece!=0.1.92"]
extras["retrieval"] = ["faiss-cpu", "datasets"]
extras["testing"] = ["pytest", "pytest-xdist", "timeout-decorator", "parameterized", "psutil"] + extras["retrieval"]
# sphinx-rtd-theme==0.5.0 introduced big changes in the style.
extras["docs"] = ["recommonmark", "sphinx", "sphinx-markdown-tables", "sphinx-rtd-theme==0.4.3", "sphinx-copybutton"]
extras["quality"] = ["black >= 20.8b1", "isort >= 5.5.4", "flake8 >= 3.8.3"]
extras["dev"] = extras["testing"] + extras["quality"] + extras["ja"] + ["scikit-learn", "tensorflow", "torch"]
extras["dev"] = extras["testing"] + extras["quality"] + extras["ja"] + ["scikit-learn", "tensorflow", "torch", "sentencepiece!=0.1.92"]

setup(
name="transformers",
@@ -114,7 +115,7 @@
packages=find_packages("src"),
install_requires=[
"numpy",
"tokenizers == 0.9.0.rc2",
"tokenizers == 0.9.2",
# dataclasses for Python versions that don't have it
"dataclasses;python_version<'3.7'",
# utilities from PyPA to e.g. compare versions
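A condensed sketch of the extras pattern the setup.py diff extends; dependency lists are trimmed and package_dir is assumed from the existing src layout, so this is illustrative rather than the full file.

from setuptools import find_packages, setup

extras = {}
extras["sentencepiece"] = ["sentencepiece!=0.1.92"]
extras["testing"] = ["pytest", "pytest-xdist", "timeout-decorator", "parameterized", "psutil"]
extras["dev"] = extras["testing"] + ["scikit-learn", "tensorflow", "torch", "sentencepiece!=0.1.92"]

setup(
    name="transformers",
    packages=find_packages("src"),
    package_dir={"": "src"},
    # Per the diff above, tokenizers stays pinned in install_requires while sentencepiece becomes an extra
    install_requires=["numpy", "tokenizers == 0.9.2"],
    extras_require=extras,
)

Users then opt in at install time, e.g. with the .[tf,torch,sentencepiece,docs] target shown in the CircleCI change above.
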
101 changes: 72 additions & 29 deletions src/transformers/__init__.py
@@ -92,6 +92,7 @@
MODEL_CARD_NAME,
PYTORCH_PRETRAINED_BERT_CACHE,
PYTORCH_TRANSFORMERS_CACHE,
SPIECE_UNDERLINE,
TF2_WEIGHTS_NAME,
TF_WEIGHTS_NAME,
TRANSFORMERS_CACHE,
@@ -104,8 +105,10 @@
is_faiss_available,
is_psutil_available,
is_py3nvml_available,
is_sentencepiece_available,
is_sklearn_available,
is_tf_available,
is_tokenizers_available,
is_torch_available,
is_torch_tpu_available,
)
@@ -152,60 +155,101 @@
from .retrieval_rag import RagRetriever

# Tokenizers
from .tokenization_albert import AlbertTokenizer, AlbertTokenizerFast
from .tokenization_auto import TOKENIZER_MAPPING, AutoTokenizer
from .tokenization_bart import BartTokenizer, BartTokenizerFast
from .tokenization_bert import BasicTokenizer, BertTokenizer, BertTokenizerFast, WordpieceTokenizer
from .tokenization_bert_generation import BertGenerationTokenizer
from .tokenization_bart import BartTokenizer
from .tokenization_bert import BasicTokenizer, BertTokenizer, WordpieceTokenizer
from .tokenization_bert_japanese import BertJapaneseTokenizer, CharacterTokenizer, MecabTokenizer
from .tokenization_bertweet import BertweetTokenizer
from .tokenization_blenderbot import BlenderbotSmallTokenizer, BlenderbotTokenizer
from .tokenization_camembert import CamembertTokenizer, CamembertTokenizerFast
from .tokenization_ctrl import CTRLTokenizer
from .tokenization_deberta import DebertaTokenizer
from .tokenization_distilbert import DistilBertTokenizer, DistilBertTokenizerFast
from .tokenization_distilbert import DistilBertTokenizer
from .tokenization_dpr import (
DPRContextEncoderTokenizer,
DPRContextEncoderTokenizerFast,
DPRQuestionEncoderTokenizer,
DPRQuestionEncoderTokenizerFast,
DPRReaderOutput,
DPRReaderTokenizer,
DPRReaderTokenizerFast,
)
from .tokenization_electra import ElectraTokenizer, ElectraTokenizerFast
from .tokenization_electra import ElectraTokenizer
from .tokenization_flaubert import FlaubertTokenizer
from .tokenization_fsmt import FSMTTokenizer
from .tokenization_funnel import FunnelTokenizer, FunnelTokenizerFast
from .tokenization_gpt2 import GPT2Tokenizer, GPT2TokenizerFast
from .tokenization_herbert import HerbertTokenizer, HerbertTokenizerFast
from .tokenization_layoutlm import LayoutLMTokenizer, LayoutLMTokenizerFast
from .tokenization_longformer import LongformerTokenizer, LongformerTokenizerFast
from .tokenization_lxmert import LxmertTokenizer, LxmertTokenizerFast
from .tokenization_mbart import MBartTokenizer, MBartTokenizerFast
from .tokenization_mobilebert import MobileBertTokenizer, MobileBertTokenizerFast
from .tokenization_openai import OpenAIGPTTokenizer, OpenAIGPTTokenizerFast
from .tokenization_pegasus import PegasusTokenizer, PegasusTokenizerFast
from .tokenization_funnel import FunnelTokenizer
from .tokenization_gpt2 import GPT2Tokenizer
from .tokenization_herbert import HerbertTokenizer
from .tokenization_layoutlm import LayoutLMTokenizer
from .tokenization_longformer import LongformerTokenizer
from .tokenization_lxmert import LxmertTokenizer
from .tokenization_mobilebert import MobileBertTokenizer
from .tokenization_openai import OpenAIGPTTokenizer
from .tokenization_phobert import PhobertTokenizer
from .tokenization_rag import RagTokenizer
from .tokenization_reformer import ReformerTokenizer, ReformerTokenizerFast
from .tokenization_retribert import RetriBertTokenizer, RetriBertTokenizerFast
from .tokenization_roberta import RobertaTokenizer, RobertaTokenizerFast
from .tokenization_squeezebert import SqueezeBertTokenizer, SqueezeBertTokenizerFast
from .tokenization_t5 import T5Tokenizer, T5TokenizerFast
from .tokenization_retribert import RetriBertTokenizer
from .tokenization_roberta import RobertaTokenizer
from .tokenization_squeezebert import SqueezeBertTokenizer
from .tokenization_transfo_xl import TransfoXLCorpus, TransfoXLTokenizer
from .tokenization_utils import PreTrainedTokenizer
from .tokenization_utils_base import (
AddedToken,
BatchEncoding,
CharSpan,
PreTrainedTokenizerBase,
SpecialTokensMixin,
TensorType,
TokenSpan,
)
from .tokenization_utils_fast import PreTrainedTokenizerFast
from .tokenization_xlm import XLMTokenizer
from .tokenization_xlm_roberta import XLMRobertaTokenizer, XLMRobertaTokenizerFast
from .tokenization_xlnet import SPIECE_UNDERLINE, XLNetTokenizer, XLNetTokenizerFast


if is_sentencepiece_available():
from .tokenization_albert import AlbertTokenizer
from .tokenization_bert_generation import BertGenerationTokenizer
from .tokenization_camembert import CamembertTokenizer
from .tokenization_marian import MarianTokenizer
from .tokenization_mbart import MBartTokenizer
from .tokenization_pegasus import PegasusTokenizer
from .tokenization_reformer import ReformerTokenizer
from .tokenization_t5 import T5Tokenizer
from .tokenization_xlm_roberta import XLMRobertaTokenizer
from .tokenization_xlnet import XLNetTokenizer
else:
from .utils.dummy_sentencepiece_objects import *

if is_tokenizers_available():
from .tokenization_albert_fast import AlbertTokenizerFast
from .tokenization_bart_fast import BartTokenizerFast
from .tokenization_bert_fast import BertTokenizerFast
from .tokenization_camembert_fast import CamembertTokenizerFast
from .tokenization_distilbert_fast import DistilBertTokenizerFast
from .tokenization_dpr_fast import (
DPRContextEncoderTokenizerFast,
DPRQuestionEncoderTokenizerFast,
DPRReaderTokenizerFast,
)
from .tokenization_electra_fast import ElectraTokenizerFast
from .tokenization_funnel_fast import FunnelTokenizerFast
from .tokenization_gpt2_fast import GPT2TokenizerFast
from .tokenization_herbert_fast import HerbertTokenizerFast
from .tokenization_layoutlm_fast import LayoutLMTokenizerFast
from .tokenization_longformer_fast import LongformerTokenizerFast
from .tokenization_lxmert_fast import LxmertTokenizerFast
from .tokenization_mbart_fast import MBartTokenizerFast
from .tokenization_mobilebert_fast import MobileBertTokenizerFast
from .tokenization_openai_fast import OpenAIGPTTokenizerFast
from .tokenization_pegasus_fast import PegasusTokenizerFast
from .tokenization_reformer_fast import ReformerTokenizerFast
from .tokenization_retribert_fast import RetriBertTokenizerFast
from .tokenization_roberta_fast import RobertaTokenizerFast
from .tokenization_squeezebert_fast import SqueezeBertTokenizerFast
from .tokenization_t5_fast import T5TokenizerFast
from .tokenization_utils_fast import PreTrainedTokenizerFast
from .tokenization_xlm_roberta_fast import XLMRobertaTokenizerFast
from .tokenization_xlnet_fast import XLNetTokenizerFast

if is_sentencepiece_available():
from .convert_slow_tokenizer import SLOW_TO_FAST_CONVERTERS, convert_slow_tokenizer
else:
from .utils.dummy_tokenizers_objects import *


# Trainer
from .trainer_callback import (
@@ -539,7 +583,6 @@
get_linear_schedule_with_warmup,
get_polynomial_decay_schedule_with_warmup,
)
from .tokenization_marian import MarianTokenizer

# Trainer
from .trainer import Trainer
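With the guarded imports above, downstream code can check the new availability helpers before relying on backend-specific classes; a minimal sketch, assuming a transformers build that includes this change:

from transformers import is_sentencepiece_available, is_tokenizers_available

if is_sentencepiece_available():
    from transformers import T5Tokenizer  # slow, SentencePiece-backed
else:
    print("sentencepiece missing: SentencePiece-based slow tokenizers resolve to dummy objects")

if is_tokenizers_available():
    from transformers import BertTokenizerFast  # fast, backed by the tokenizers library
else:
    print("tokenizers missing: *TokenizerFast classes resolve to dummy objects")
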
2 changes: 1 addition & 1 deletion src/transformers/configuration_auto.py
@@ -266,7 +266,7 @@ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
our S3, e.g., ``dbmdz/bert-base-german-cased``.
- A path to a `directory` containing a configuration file saved using the
:meth:`~transformers.PretrainedConfig.save_pretrained` method, or the
:meth:`~transformers.PretrainedModel.save_pretrained` method, e.g., ``./my_model_directory/``.
:meth:`~transformers.PreTrainedModel.save_pretrained` method, e.g., ``./my_model_directory/``.
- A path or url to a saved configuration JSON `file`, e.g.,
``./my_model_directory/configuration.json``.
cache_dir (:obj:`str`, `optional`):
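A minimal sketch of the two loading paths this docstring describes; the hub identifier is taken from the docstring and ./my_model_directory is a hypothetical local path.

from transformers import AutoConfig

config = AutoConfig.from_pretrained("dbmdz/bert-base-german-cased")  # identifier on the hub
config.save_pretrained("./my_model_directory")                       # hypothetical local directory
reloaded = AutoConfig.from_pretrained("./my_model_directory")        # load back from the saved directory
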
14 changes: 14 additions & 0 deletions src/transformers/configuration_utils.py
@@ -43,6 +43,9 @@ class PretrainedConfig(object):
recreate the correct object in :class:`~transformers.AutoConfig`.
Args:
name_or_path (:obj:`str`, `optional`, defaults to :obj:`""`):
Store the string that was passed to :func:`~transformers.PreTrainedModel.from_pretrained` or :func:`~transformers.TFPreTrainedModel.from_pretrained`
as ``pretrained_model_name_or_path`` if the configuration was created with such a method.
output_hidden_states (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the model should return all hidden-states.
output_attentions (:obj:`bool`, `optional`, defaults to :obj:`False`):
@@ -206,6 +209,9 @@ def __init__(self, **kwargs):
# TPU arguments
self.xla_device = kwargs.pop("xla_device", None)

# Name or path to the pretrained checkpoint
self._name_or_path = str(kwargs.pop("name_or_path", ""))

# Additional attributes without default values
for key, value in kwargs.items():
try:
@@ -214,6 +220,14 @@ def __init__(self, **kwargs):
logger.error("Can't set {} with value {} for {}".format(key, value, self))
raise err

@property
def name_or_path(self) -> str:
return self._name_or_path

@name_or_path.setter
def name_or_path(self, value):
self._name_or_path = str(value) # Make sure that name_or_path is a string (for JSON encoding)

@property
def use_return_dict(self) -> bool:
"""
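A minimal sketch of the name_or_path attribute introduced above; the directory string is illustrative only. When a config is created through from_pretrained, the same attribute stores the checkpoint name or path that was passed in.

from pathlib import Path

from transformers import PretrainedConfig

config = PretrainedConfig(name_or_path="./my_model_directory")
print(config.name_or_path)        # "./my_model_directory"
config.name_or_path = Path("./my_model_directory")
print(type(config.name_or_path))  # <class 'str'>: the setter coerces to str so the config stays JSON-serializable
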
40 changes: 34 additions & 6 deletions src/transformers/convert_slow_tokenizer.py
@@ -20,13 +20,14 @@

from typing import Dict, List, Tuple

from sentencepiece import SentencePieceProcessor
from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers, processors
from tokenizers.models import BPE, Unigram, WordPiece

# from transformers.tokenization_openai import OpenAIGPTTokenizer
from transformers.utils import sentencepiece_model_pb2 as model

from .file_utils import requires_sentencepiece


class SentencePieceExtractor:
"""
@@ -35,7 +36,9 @@ class SentencePieceExtractor:
"""

def __init__(self, model: str):
# Get SentencePiece
requires_sentencepiece(self)
from sentencepiece import SentencePieceProcessor

self.sp = SentencePieceProcessor()
self.sp.Load(model)

@@ -568,11 +571,10 @@ def post_processor(self):
)


CONVERTERS = {
SLOW_TO_FAST_CONVERTERS = {
"AlbertTokenizer": AlbertConverter,
"BertTokenizer": BertConverter,
"BertGenerationTokenizer": BertGenerationConverter,
"BartTokenizer": RobertaConverter,
"BertTokenizer": BertConverter,
"CamembertTokenizer": CamembertConverter,
"DistilBertTokenizer": BertConverter,
"DPRReaderTokenizer": BertConverter,
@@ -582,18 +584,44 @@ def post_processor(self):
"FunnelTokenizer": FunnelConverter,
"GPT2Tokenizer": GPT2Converter,
"HerbertTokenizer": HerbertConverter,
"LayoutLMTokenizer": BertConverter,
"LongformerTokenizer": RobertaConverter,
"LxmertTokenizer": BertConverter,
"MBartTokenizer": MBartConverter,
"MobileBertTokenizer": BertConverter,
"OpenAIGPTTokenizer": OpenAIGPTConverter,
"PegasusTokenizer": PegasusConverter,
"ReformerTokenizer": ReformerConverter,
"RetriBertTokenizer": BertConverter,
"RobertaTokenizer": RobertaConverter,
"SqueezeBertTokenizer": BertConverter,
"T5Tokenizer": T5Converter,
"XLMRobertaTokenizer": XLMRobertaConverter,
"XLNetTokenizer": XLNetConverter,
}


def convert_slow_tokenizer(transformer_tokenizer) -> Tokenizer:
converter_class = CONVERTERS[transformer_tokenizer.__class__.__name__]
"""Utilities to convert a slow tokenizer instance in a fast tokenizer instance.
Args:
transformer_tokenizer (:class:`~transformers.tokenization_utils_base.PreTrainedTokenizer`):
Instance of a slow tokenizer to convert in the backend tokenizer for
:class:`~transformers.tokenization_utils_base.PreTrainedTokenizerFast`.
Return:
A instance of :class:`~tokenizers.Tokenizer` to be used as the backend tokenizer of a
:class:`~transformers.tokenization_utils_base.PreTrainedTokenizerFast`
"""

tokenizer_class_name = transformer_tokenizer.__class__.__name__

if tokenizer_class_name not in SLOW_TO_FAST_CONVERTERS:
raise ValueError(
f"An instance of tokenizer class {tokenizer_class_name} cannot be converted in a Fast tokenizer instance. "
f"No converter was found. Currently available slow->fast convertors: {list(SLOW_TO_FAST_CONVERTERS.keys())}"
)

converter_class = SLOW_TO_FAST_CONVERTERS[tokenizer_class_name]

return converter_class(transformer_tokenizer).converted()
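A minimal usage sketch of the converter above, assuming the tokenizers package (and, for SentencePiece-based classes, sentencepiece) is installed:

from transformers import BertTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

slow = BertTokenizer.from_pretrained("bert-base-uncased")
fast_backend = convert_slow_tokenizer(slow)  # returns a tokenizers.Tokenizer usable as a fast-tokenizer backend
print(fast_backend.encode("Hello world").tokens)
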
