Tokenizer fix decode (openvinotoolkit#767)
* Added string tensor implementation with explicit pointer unpack
* Started to migrate to extension-only support of string operations, with and without string support in OV core. Moved StringTensorUnpack and reworked it to be aligned with the new approach. Reworked the sentence piece op and translation code to be compatible with several variants of string tensor representation and the plugin wrapping hack.
* Started to merge string/tokenizer related stuff from a dedicated OV branch into contrib, in a form compatible with both master and the branch with string tensor support. Added CaseFoldUTF8 from that branch.
* Renamed CaseFoldUTF8 to the name from the opset proposal: CaseFold; added NormalizeUnicode
* Added a stub for the RegexNormalization operation and a WA for a CPU bug with empty constants; registered StringTensorPack and StringTensorUnpack as OV operations to be able to read IRs with those operations
* Implemented Reshape for decomposed string tensors
* Added RaggedTensorPack, a sophisticated stub for RegexSplit, and an overridden Const translator for TF to intercept string constants
* Fixes for both the master and element::string branches of OpenVINO; better conditional compilation based on available features in OpenVINO
* Debug output of indices in RaggedTensorPack
* Implemented a stub for WordpieceTokenizer. Supported conversion of a combination of WordpieceTokenizeWithOffsets and LookupTableFindV2 from TensorFlow
* Disabled debug output
* Defined default values for custom operation attributes to make attribute initialization optional (needed for core.make_node)
* Added the fast_tokenizer lib to the build. Implemented CaseFold based on fast_tokenizer.
* Removed debug output
* Implemented RaggedToDense, always in pad_right=true mode and with a boolean mask as an extra output
* Provided real implementations for NormalizeUnicode, RegexNormalization and RegexSplit based on the paddle fast_tokenizer lib. Limited implementation: not all features of the ops and the TF-translated ops are covered.
* Implemented WordpieceTokenizer with the fast_tokenizer library
* Renamed behaviours to be verbs instead of adjectives
* Added a modified version of the HF tokenizer parser from Artur; implemented the steps necessary to complete HF BERT preprocessing conversion (not validated)
* Renamed apply_tokenizer to connect_tokeniser and removed obsolete handling of the model name
* Implemented CombineSegments and used it in the HF converter. Stitching of the tokenizer and the main model is partially fixed (still produces a topologically incorrect model)
* Fixed stitching of two models by connecting them via input/output names; BERT and its tokenizer are now connected together correctly
* WA for a CPU bug with scalar inputs; correct truncation and dynamic padding; fixed bugs in batch processing
* Fixed conversion of an HF tokenizer when some of the outputs are omitted. Disabled debug output
* Add BPE Tokenizer
* Add BytesToChars node for byte-level BPE
* Delete print
* Clip max value of max_length to int32
* Fix RegexNormalization and Splitter; add Digits Splitter
* Bug fixes
* Add decoding step and refactor BytesToChars (has a bug with the internal dimension for VocabNode)
* Fix some regex bugs in the byte-level splitter
* Fix bug with VocabDecoder shape
* Minor changes for natively supported strings
* Suppressed minor warnings about implicit int32 -> unsigned conversion
* Restructured the sentence_piece directory into a tokenizer directory: split all ops, translators and helpers into individual files. To build, use the tokenizer custom op name in cmake instead of sentence_piece.
* Add regex to the detokenizer pipeline; all splitters now have 5 inputs
* Add caching for RegexNormalization
* Add caching for RegexSplit
* Add Wordpiece cache
* Add NodeFactory
* Fix regex node init
* Fix Wordpiece cache
* Add BPE cache
* Fix RegexNormalization
* Refactor CombineSegments and Padding
* Refactoring
* Clean up commented code
* Sentencepiece model encoder from a Transformers tokenizer
* Add tests for tokenizers
* Add detokenizer for Sentencepiece models
* Update README.md
* OVTokenizer as a Python package
* Add sentencepiece detokenizer test
* Unified interface for fast and sentencepiece tokenizers
* Add full pipeline example for Sentencepiece; move the greedy decoding pipeline from the detokenizer to the model
* Update third-party-programs.txt
* Add constants
* Add C++ pack_strings/unpack_strings functions; refactor greedy decoding
* Move tests to the tokenizer dir
* Fix and sort imports
* Add streaming Sentencepiece decoder
* Change authors
* Update modules/custom_operations/user_ie_extensions/tokenizer/utils.cpp (co-authored by Zlobin Vladimir <[email protected]>)
* Configure tests
* Skip Java tests
* Add regression test
* Skip traceback
* Add Win64 fast_tokenizer lib
* Fix WorkingDir
* Return TB
* Fix dependencies install
* Add byte token handling for sentencepiece
* Drop black; use ruff format instead
* Temporarily remove tokenizers from Windows CI
* CI check
* Compile fast_tokenizers from source code
* Export pack_strings() and unpack_strings()
* Build the tokenizer target on Windows
* Add icu4c patch
* Added include dir for nlohmann headers
* Fixed compilation on Ubuntu 18.04 arm64
* Fixed Windows
* Supported prebuilt fast tokenizers on all platforms
* Add tiktoken support (WIP)
* Unskip Java tests
* Fixed compilation with re2 on Windows
* Move unpack_strings(); create a separate include dir
* openvino_extensions
* Fixed link stage on Windows
* i64 is the default tokenizer output type
* Add support for more tiktoken tokenizers
* Check Azure CI
* Fix Azure Windows CI
* Define Python version for setupvars.bat
* Add support for tiktoken detokenizers
* Add ChatGLM tokenization support
* Add ChatGLM detokenization and tests
* Fix mac sha256
* Skip Lin Java tests
* Add Mac tokenizer tests and skip the Mac Java step
* Fix Mac SHA
* Delete WA for the CPU bug
* Fix Mac CI pipeline
* Change Mac CI
* Fixed compilation
* Add setupvars to Mac CI
* Change detokenizer output type
* Fix segfault on AddedTokens for the BPE tokenizer
* Add SP space handling for the decoder
* Removed SHA for macOS x86_64
* More fixes
* Fixed macOS
* Enabled tests
* Fixed warnings
* Use developer package
* Split build
* Update windows.yml
* Added missed IMPLEMENT_OPENVINO_EXTENSION_API
* Update .ci/azure/windows.yml: removed build of fast tokenizers

---------

Co-authored-by: Sergey Lyalin <[email protected]>
Co-authored-by: Artur Paniukov <[email protected]>
Co-authored-by: Artur Paniukov <[email protected]>
Co-authored-by: Zlobin Vladimir <[email protected]>
Co-authored-by: Andrei Kochin <[email protected]>
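The decomposed string tensor representation behind StringTensorPack/StringTensorUnpack and the C++ pack_strings/unpack_strings helpers mentioned above can be sketched in Python. This is an illustrative model only, assuming the common (begins, ends, symbols) layout; the real implementation is in C++ and its exact tensor layout may differ:

```python
import numpy as np

def pack_strings(strings):
    # Pack a batch of strings into three flat tensors:
    # begins[i]/ends[i] index into the shared UTF-8 byte buffer `symbols`.
    begins, ends, symbols = [], [], bytearray()
    for s in strings:
        begins.append(len(symbols))
        symbols.extend(s.encode("utf-8"))
        ends.append(len(symbols))
    return (np.array(begins, dtype=np.int32),
            np.array(ends, dtype=np.int32),
            np.frombuffer(bytes(symbols), dtype=np.uint8))

def unpack_strings(begins, ends, symbols):
    # Inverse of pack_strings: slice the byte buffer back into strings.
    data = symbols.tobytes()
    return [data[b:e].decode("utf-8") for b, e in zip(begins, ends)]
```

Decomposing strings this way lets string data flow through a core that only understands numeric tensors, which is why the ops round-trip through pack/unpack at the graph boundary.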
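For context on the WordpieceTokenizer op, the underlying algorithm is greedy longest-match-first subword splitting. A minimal sketch (the vocab, `unk` token, and `##` continuation prefix here are illustrative defaults, not the op's actual attribute names):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##"):
    # Greedily take the longest vocab entry matching the current position;
    # non-initial pieces are looked up with the continuation prefix.
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: whole word maps to unknown
        tokens.append(match)
        start = end
    return tokens
```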
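The RaggedToDense item above (pad_right=true plus a boolean mask output) can likewise be sketched. Assuming the same begins/ends ragged indexing as the string tensors, with a hypothetical `pad_value` parameter for illustration:

```python
import numpy as np

def ragged_to_dense(begins, ends, values, max_len, pad_value=0):
    # Convert ragged rows (begins/ends index into `values`) into a dense
    # [num_rows, max_len] tensor padded on the right, plus a boolean mask
    # marking which positions hold real elements.
    num_rows = len(begins)
    dense = np.full((num_rows, max_len), pad_value, dtype=values.dtype)
    mask = np.zeros((num_rows, max_len), dtype=bool)
    for i, (b, e) in enumerate(zip(begins, ends)):
        n = min(e - b, max_len)  # rows longer than max_len are truncated
        dense[i, :n] = values[b:b + n]
        mask[i, :n] = True
    return dense, mask
```

The boolean mask output is what lets the downstream model distinguish padding from real tokens (e.g. as an attention mask).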