
Add Tokenizers #687

Closed
wants to merge 142 commits
Commits
70f867a
Added string tensor implementation with explicit pointer unpack
slyalin Apr 25, 2023
1fac3de
Merged from master
slyalin Apr 28, 2023
821dee5
Started to migrate to extension-only support of string operations wit…
slyalin May 2, 2023
b9b0693
Started to merge string/tokenizer related stuff from a dedicated OV b…
slyalin May 10, 2023
c785ec1
Rename CaseFoldUTF8 to name from opset proposal: CaseFold, added Norm…
slyalin May 10, 2023
1d129ac
Added a stub for RegexNormalization operation, WA for CPU bug with em…
slyalin May 11, 2023
71bc5bf
Implemented Reshape for decomposed string tensors
slyalin May 11, 2023
6c5eec0
Added RaggedTensorPack, sophisticated stub for RegexSplit and overrid…
slyalin May 12, 2023
29dfe38
Fixes for both master and element::string branches of OpenVINO; bette…
slyalin May 15, 2023
40063c1
Debug output of indices in RaggedTensorPack
slyalin May 16, 2023
cc47b12
Implemented a stub for WordpieceTokenizer. Supported conversion of a …
slyalin May 17, 2023
7644231
Disabled debug output
slyalin May 17, 2023
80b8023
Define default values for custom operations attributes to make attrib…
slyalin May 18, 2023
46c82b8
Added fast_tokenizer lib to the build. Implemented CaseFold based on …
slyalin May 20, 2023
d7ca2ab
Removed debug output
slyalin May 20, 2023
2baac3d
Implemented RaggedToDense always in pad_right=true mode and with bool…
slyalin May 20, 2023
d270dd6
Provided real implementations for NormalizeUnicode, RegexNormalizatio…
slyalin May 23, 2023
119d6e9
Implemented WordpieceTokenizer with fast_tokenizer library
slyalin May 23, 2023
4d4ad89
Renamed behaviours to be verbs instead of adjectives
slyalin May 25, 2023
f4eee84
Added modified version of HF tokenizer parser from Artur; implemented…
slyalin May 25, 2023
1e50352
Renamed apply_tokenizer to connect_tokeniser and removed obsolete han…
slyalin May 25, 2023
0966b8a
CombineSegments is implemented, used in HF converter. Stitching of to…
slyalin May 31, 2023
61d7983
Fixed stitching of two models by connecting with names of inputs/outp…
slyalin May 31, 2023
5609ee6
WA for CPU bug with scalar inputs, correct truncation and dynamic pad…
slyalin Jun 1, 2023
062acf3
Fixed conversion of HF tokenizer if part of outputs are omitted. Disa…
slyalin Jun 1, 2023
0f772dc
Add BPE Tokenizer
apaniukov Jun 19, 2023
10e3d18
Add BytesToChars Node for BBPE
apaniukov Jun 20, 2023
c413cb6
Delete print
apaniukov Jun 20, 2023
8c8994c
Clip max value for max_length to int32
apaniukov Jun 20, 2023
8750ae6
Fix RegexNormalization and Splitter, Add Digits Splitter
apaniukov Jun 22, 2023
be6dc3f
Bug fixes
apaniukov Jun 23, 2023
e4dcdda
Add decoding step, BytesToChars refactoring
apaniukov Jun 29, 2023
b45e5ec
Fix some regex bugs for byte-level splitter
apaniukov Jun 30, 2023
5f03ed0
Fix bug with VocabDecoder shape
apaniukov Jul 7, 2023
2a65502
Minor changes for natively supported strings
slyalin Jul 10, 2023
2e34b92
Merge remote-tracking branch 'artur/string_tensors_add_bpe' into stri…
slyalin Jul 10, 2023
a6f9110
Suppressed minor warnings about int32 -> unsigned implicit
slyalin Jul 10, 2023
5c29254
Restructured sentence_piece directory to tokenizer directory: split a…
slyalin Jul 10, 2023
f8d0e0d
Add regex to detokenizer pipeline, all splitters have 5 inputs
apaniukov Jul 17, 2023
10c10c5
Add Caching for RegexNormalization
apaniukov Jul 27, 2023
4eb12f8
Add Caching for RegexSplit
apaniukov Jul 27, 2023
c5efaf0
Add Wordpiece Cache
apaniukov Jul 28, 2023
239acc4
Add NodeFactory
apaniukov Jul 31, 2023
38552b0
Fix regex nodes init
apaniukov Aug 4, 2023
597ccd4
Fix Wordpiece Cache
apaniukov Aug 10, 2023
e6933b7
Add BPE Cache
apaniukov Aug 10, 2023
bd7f9d9
Fix RegexNormalization
apaniukov Aug 11, 2023
99c603f
Refactor CombineSegments and Padding
apaniukov Sep 6, 2023
6cc9b36
Refactoring
apaniukov Sep 7, 2023
973c52d
Clean-up commented code
apaniukov Sep 8, 2023
1fa02b2
Sentencepiece Model Encoder from Transformers Tokenizer
apaniukov Sep 27, 2023
e37f89d
Add tests for tokenizers
apaniukov Sep 27, 2023
88bf7c6
Add detokenizer for Sentencepiece models
apaniukov Oct 2, 2023
bb1b57a
Update README.md
apaniukov Oct 4, 2023
6b4be05
Update README.md
apaniukov Oct 4, 2023
539797f
Update README.md
apaniukov Oct 4, 2023
79c3e09
OVTokenizer as python package
apaniukov Oct 4, 2023
203ffbb
Merge branch 'openvinotoolkit:master' into tokenizer-fix-decode
apaniukov Oct 4, 2023
45c0068
Update README.md
apaniukov Oct 5, 2023
372465b
Merge branch 'master' into tokenizer-fix-decode
apaniukov Oct 6, 2023
64567ea
Add sentencepiece detokenizer test
apaniukov Oct 6, 2023
f54076e
Merge remote-tracking branch 'origin/tokenizer-fix-decode' into token…
apaniukov Oct 6, 2023
c42d1bd
Unified interface for fast and sentencepiece tokenizers
apaniukov Oct 9, 2023
8b29443
Add Full Pipeline example for Sentencepiece
apaniukov Oct 11, 2023
2ee3707
Update third-party-programs.txt
apaniukov Oct 11, 2023
4b57fcc
Merge branch 'master' into tokenizer-fix-decode
apaniukov Oct 12, 2023
803d831
Add Constants
apaniukov Oct 13, 2023
72f6d9f
Add CPP pack/unpack_strings functions
apaniukov Oct 16, 2023
386cb02
Merge branch 'master' into tokenizer-fix-decode
apaniukov Oct 17, 2023
79bd05f
Move tests to tokenizer dir
apaniukov Oct 17, 2023
24a60b3
Fix import
apaniukov Oct 18, 2023
f01afee
Merge remote-tracking branch 'origin/tokenizer-fix-decode' into token…
apaniukov Oct 18, 2023
b22569f
Fix imports
apaniukov Oct 18, 2023
96673f5
Sort Imports
apaniukov Oct 18, 2023
0e7ae87
Add Streaming Sentencepiece Decoder
apaniukov Oct 19, 2023
5ebdb1f
Change Authors
apaniukov Oct 19, 2023
6a55877
Update modules/custom_operations/user_ie_extensions/tokenizer/utils.cpp
apaniukov Oct 23, 2023
06d5159
Configure tests
apaniukov Oct 23, 2023
fa5360d
Skip Java Tests
apaniukov Oct 24, 2023
e855193
Add Regression Test
apaniukov Oct 24, 2023
d495d3b
Skip traceback
apaniukov Oct 24, 2023
d7bebd0
Add Win64 Fast Tokenizer lib
apaniukov Oct 24, 2023
b2e35ed
Fix WorkingDir
apaniukov Oct 24, 2023
f81bd18
Return TB
apaniukov Oct 24, 2023
0bd23b5
Fix dependencies install
apaniukov Oct 24, 2023
12ac9f8
Add byte tokens handling for sentencepiece
apaniukov Oct 24, 2023
9e6ae6f
Drop black, use ruff format instead
apaniukov Oct 25, 2023
f5d2d4c
Temp remove tokenizers from windows CI
apaniukov Oct 26, 2023
cf039b9
CI check
apaniukov Oct 26, 2023
795306d
Compile fast_tokenizers from source code
ilya-lavrenov Oct 28, 2023
9c200c2
Export pack_strings() and unpack_strings()
Wovchena Oct 30, 2023
0e9b960
Merge pull request #1 from ilya-lavrenov/tokenizer-fix-decode
apaniukov Oct 31, 2023
95aa47c
Merge branch 'master' into tokenizer-fix-decode
apaniukov Oct 31, 2023
e1de338
Merge branch 'tokenizer-fix-decode' into export-pack_strings-and-unpa…
Wovchena Oct 31, 2023
f23e59b
Build tokenizer target on windows
apaniukov Oct 31, 2023
dbec117
Merge branch 'tokenizer-fix-decode' into export-pack_strings-and-unpa…
Wovchena Nov 2, 2023
ce25397
Add icu4c patch
apaniukov Nov 3, 2023
d46f594
Added include dir to nlohmann headers
ilya-lavrenov Nov 8, 2023
6f213ab
Fixed compilation on ubuntu 18.04 arm64
ilya-lavrenov Nov 8, 2023
6ed52e4
Fixed Windows
ilya-lavrenov Nov 8, 2023
ca62321
Merge pull request #3 from ilya-lavrenov/nlohmann
apaniukov Nov 8, 2023
52bfe5a
Supported prebuild Fast Tokenizers on all platforms
ilya-lavrenov Nov 8, 2023
b504013
Merge branch 'master' into tokenizer-fix-decode
apaniukov Nov 9, 2023
cc663dc
Add tiktoken support WIP
apaniukov Nov 9, 2023
4c9ceed
Merge remote-tracking branch 'origin/tokenizer-fix-decode' into token…
apaniukov Nov 9, 2023
745e969
Unskip java tests
apaniukov Nov 9, 2023
48564b7
Merge pull request #4 from ilya-lavrenov/prebuilt-fast-tokenizers
apaniukov Nov 9, 2023
056eb9f
Fixed compilation with re2 on Windows
ilya-lavrenov Nov 10, 2023
309b8e9
Merge pull request #5 from ilya-lavrenov/windows-re2
apaniukov Nov 10, 2023
b193cb2
Merge branch 'tokenizer-fix-decode' into export-pack_strings-and-unpa…
Wovchena Nov 10, 2023
debcb5d
Move unpack_strings(), create separate include dir
Wovchena Nov 10, 2023
b739ffd
openvino_extensions
Wovchena Nov 10, 2023
e70a3f2
Fixed link stage on Windows
ilya-lavrenov Nov 13, 2023
2ce27cd
Merge pull request #6 from ilya-lavrenov/windows-linkage
apaniukov Nov 13, 2023
3022a5a
i64 is default tokenizer output type
apaniukov Nov 14, 2023
c467a8c
Add support for more tiktoken tokenizers
apaniukov Nov 14, 2023
1ec4c5f
Merge branch 'tokenizer-fix-decode' into export-pack_strings-and-unpa…
Wovchena Nov 15, 2023
8505b51
Check Azure CI
apaniukov Nov 15, 2023
82639e6
Fix Azure Win CI
apaniukov Nov 15, 2023
fb37580
Merge pull request #2 from Wovchena/export-pack_strings-and-unpack_st…
apaniukov Nov 15, 2023
a45b826
Define python version for setupvars.bat
apaniukov Nov 15, 2023
35cc136
Merge remote-tracking branch 'origin/tokenizer-fix-decode' into token…
apaniukov Nov 15, 2023
244a593
Add support for tiktoken detokenizers
apaniukov Nov 15, 2023
84686f4
Merge branch 'master' into tokenizer-fix-decode
andrei-kochin Nov 16, 2023
ad1c589
Add ChatGLM tokenization support.
apaniukov Nov 16, 2023
0f63c3d
Add ChatGLM detokenization and tests
apaniukov Nov 16, 2023
0f1c1cc
Add ChatGLM detokenization and tests
apaniukov Nov 17, 2023
3edb73b
Fix mac sha256
apaniukov Nov 17, 2023
48bba34
Skip Lin Java Tests
apaniukov Nov 17, 2023
fe507ff
Add Mac Tokenizers Tests and Skip Mac Java Step
apaniukov Nov 17, 2023
4b0c4ec
Fix Mac SHA
apaniukov Nov 17, 2023
4656238
Del WA for CPU Bug
apaniukov Nov 17, 2023
2f5cc1c
Fix Mac CI Pipeline
apaniukov Nov 18, 2023
1568727
Change Mac CI
apaniukov Nov 18, 2023
fa822c2
Fixed compilation
ilya-lavrenov Nov 20, 2023
6ddb2a6
Merge pull request #7 from ilya-lavrenov/compilation-fix
apaniukov Nov 20, 2023
14f993b
Add setupvars to mac CI
apaniukov Nov 20, 2023
cae3098
Merge branch 'master' into tokenizer-fix-decode
apaniukov Nov 20, 2023
b59204d
Change detokenizer output type
apaniukov Nov 20, 2023
e54b42e
Merge remote-tracking branch 'origin/tokenizer-fix-decode' into token…
apaniukov Nov 20, 2023
6c3bae3
Fix SegFault on AddedTokens For BPE tokenizer
apaniukov Nov 20, 2023
d34d401
Add SP Space handling for decoder
apaniukov Nov 20, 2023
26 changes: 17 additions & 9 deletions .ci/azure/linux.yml
@@ -154,15 +154,15 @@ jobs:
- script: ls -alR $(INSTALL_DIR)
displayName: 'List install files'

- script: |
set -e
export PATH=$(WORK_DIR)/gradle-$(GRADLE_VER)/bin:${PATH}
. $(SETUPVARS) gradle clean build --info
for d in CPU HETERO:CPU; do
gradle test -Prun_tests -DMODELS_PATH=$(MODELS_PATH) -Ddevice=$d --info;
done
workingDirectory: $(REPO_DIR)/modules/java_api
displayName: 'Java tests'
# - script: |
# set -e
# export PATH=$(WORK_DIR)/gradle-$(GRADLE_VER)/bin:${PATH}
# . $(SETUPVARS) gradle clean build --info
# for d in CPU HETERO:CPU; do
# gradle test -Prun_tests -DMODELS_PATH=$(MODELS_PATH) -Ddevice=$d --info;
# done
# workingDirectory: $(REPO_DIR)/modules/java_api
# displayName: 'Java tests'

- script: |
python3 -m pip install --user virtualenv
@@ -171,6 +171,7 @@ jobs:
python -m pip install --upgrade pip
python -m pip install -r $(REPO_DIR)/modules/custom_operations/tests/requirements.txt
cd ${OPENVINO_REPO_DIR}/tools && python -m pip install mo/
python -m pip install $(REPO_DIR)/modules/custom_operations/user_ie_extensions/tokenizer/python/.[all]
workingDirectory: $(WORK_DIR)
displayName: 'Create user custom operations env'

@@ -181,3 +182,10 @@ jobs:
python -m pytest -k "not sparse_conv" tests/run_tests.py
workingDirectory: $(REPO_DIR)/modules/custom_operations
displayName: 'Custom user operation tests'

- script: |
. $(SETUPVARS)
source $(WORK_DIR)/.env3/bin/activate
python -m pytest --tb=no tokenizers_test.py
workingDirectory: $(REPO_DIR)/modules/custom_operations/user_ie_extensions/tokenizer/python/tests/
displayName: 'Tokenizers extension regression test'
30 changes: 23 additions & 7 deletions .ci/azure/mac.yml
@@ -137,11 +137,27 @@ jobs:
- script: ls -alR $(INSTALL_DIR)
displayName: 'List install files'

# - script: |
# . $(SETUPVARS) gradle clean build --info
# for d in CPU HETERO:CPU; do
# gradle test -Prun_tests -DMODELS_PATH=$(MODELS_PATH) -Ddevice=$d --info;
# done
# workingDirectory: $(REPO_DIR)/modules/java_api
# displayName: 'Java tests'
# condition: eq(variables['CMAKE_OSX_ARCHITECTURES'], 'x86_64')

- script: |
python3 -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip
. $(SETUPVARS)
python -m pip install $(REPO_DIR)/modules/custom_operations/user_ie_extensions/tokenizer/python/.[transformers]
workingDirectory: $(WORK_DIR)
displayName: 'Create tokenizers env'

- script: |
. $(SETUPVARS) gradle clean build --info
for d in CPU HETERO:CPU; do
gradle test -Prun_tests -DMODELS_PATH=$(MODELS_PATH) -Ddevice=$d --info;
done
workingDirectory: $(REPO_DIR)/modules/java_api
displayName: 'Java tests'
condition: eq(variables['CMAKE_OSX_ARCHITECTURES'], 'x86_64')
. $(SETUPVARS)
source $(WORK_DIR)/venv/bin/activate
python -m pytest --tb=no tokenizers_test.py
workingDirectory: $(REPO_DIR)/modules/custom_operations/user_ie_extensions/tokenizer/python/tests/
displayName: 'Tokenizers extension regression test'
21 changes: 10 additions & 11 deletions .ci/azure/windows.yml
Contributor

I can see Windows compiles now. But ov_tokenizer.init_extension() fails for me:

import os
import sys
import ov_tokenizer

if hasattr(os, "add_dll_directory"):
    for path in os.environ.get("PATH", "").split(";"):
        if os.path.isdir(path):
            os.add_dll_directory(path)
ov_tokenizer.init_extension(sys.argv[1])
py llm/cpp/convert_tokenizers.py c:/Users/vzlobin/r/openvino.genai/build/thirdparty/openvino_contrib/modules/custom_operations/user_ie_extensions/Release/user_ov_extensions.dll C:\Users\vzlobin\r\tiny-llama-fast-tokenizer
Traceback (most recent call last):
  File "C:\Users\vzlobin\r\openvino.genai\llm\cpp\convert_tokenizers.py", line 9, in <module>
    ov_tokenizer.init_extension(sys.argv[1])
  File "C:\Users\vzlobin\r\openvino.genai\thirdparty\openvino_contrib\modules\custom_operations\user_ie_extensions\tokenizer\python\ov_tokenizer\node_factory.py", line 21, in init_extension
    factory.add_extension(extension_path)
  File "C:\Users\vzlobin\Downloads\w_openvino_toolkit_windows_2023.2.0.13089.cfd42bd2cb0_x86_64\python\openvino\runtime\utils\node_factory.py", line 118, in add_extension
    self.factory.add_extension(lib_path)
RuntimeError: Cannot load library 'c:/Users/vzlobin/r/openvino.genai/build/thirdparty/openvino_contrib/modules/custom_operations/user_ie_extensions/Release/user_ov_extensions.dll': 126 from cwd: C:\Users\vzlobin\r\openvino.genai

@@ -54,14 +54,13 @@ jobs:
SETUPVARS: $(INSTALL_DIR)\setupvars.bat
CUSTOM_OP_LIB: $(BIN_DIR)\user_ov_extensions.dll
GRADLE_VER: 7.1.1
PYTHON_EXE: C:\hostedtoolcache\windows\Python\3.8.2\x64\python.exe

steps:
- script: |
powershell -command "Invoke-RestMethod -Headers @{\"Metadata\"=\"true\"} -Method GET -Uri http://169.254.169.254/metadata/instance/compute?api-version=2019-06-01 | format-custom"
where python3
python3 --version
where python
python --version
where $(PYTHON_EXE)
$(PYTHON_EXE) --version
where java
java -version
wmic computersystem get TotalPhysicalMemory
@@ -99,11 +98,11 @@ jobs:
powershell -command "Expand-Archive -Force ninja-win.zip"
powershell -command "Invoke-WebRequest https://services.gradle.org/distributions/gradle-$(GRADLE_VER)-bin.zip -OutFile gradle-$(GRADLE_VER)-bin.zip"
powershell -command "Expand-Archive -Force gradle-$(GRADLE_VER)-bin.zip"
python -m pip install --upgrade pip
python -m pip install -r $(OPENVINO_REPO_DIR)\src\bindings\python\src\compatibility\openvino\requirements-dev.txt
python -m pip install -r $(OPENVINO_REPO_DIR)\src\bindings\python\requirements.txt
python -m pip install -r $(REPO_DIR)\modules\custom_operations\tests\requirements.txt
python -m pip install $(OPENVINO_REPO_DIR)\tools\mo
$(PYTHON_EXE) -m pip install --upgrade pip
$(PYTHON_EXE) -m pip install -r $(OPENVINO_REPO_DIR)\src\bindings\python\src\compatibility\openvino\requirements-dev.txt
$(PYTHON_EXE) -m pip install -r $(OPENVINO_REPO_DIR)\src\bindings\python\requirements.txt
$(PYTHON_EXE) -m pip install -r $(REPO_DIR)\modules\custom_operations\tests\requirements.txt
$(PYTHON_EXE) -m pip install $(OPENVINO_REPO_DIR)\tools\mo
powershell -command "Set-ExecutionPolicy Bypass -Scope Process -Force; iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))"
choco install opencv -y
workingDirectory: $(WORK_DIR)
@@ -159,7 +158,7 @@ jobs:

- script: |
call C:\tools\opencv\build\setup_vars_opencv4.cmd
call $(SETUPVARS)
python -m pytest -k "not sparse_conv" tests\run_tests.py
call $(SETUPVARS) -pyver 3.8 && ^
$(PYTHON_EXE) -m pytest -k "not sparse_conv" tests\run_tests.py
workingDirectory: $(REPO_DIR)\modules\custom_operations
displayName: 'Custom user operation tests'
2 changes: 1 addition & 1 deletion modules/custom_operations/tests/run_tests.py
@@ -9,7 +9,7 @@
import os


def run_test(ref_inputs, ref_res, test_onnx=False, threshold=1e-5):
def run_test(ref_inputs, ref_res, test_onnx=False, threshold=1e-5):
inputs = {}
shapes = {}
for i in range(len(ref_inputs)):
13 changes: 10 additions & 3 deletions modules/custom_operations/user_ie_extensions/CMakeLists.txt
@@ -12,7 +12,11 @@ endif()

set(TARGET_NAME "user_ov_extensions")

set(CMAKE_CXX_STANDARD 11)
if(NOT CMAKE_CXX_STANDARD)
set(CMAKE_CXX_STANDARD 11)
endif()

include(cmake/platforms.cmake)

find_package(OpenVINO REQUIRED COMPONENTS Runtime)
find_package(TBB COMPONENTS tbb tbbmalloc)
@@ -27,6 +31,7 @@ set(OP_REQ_TBB "complex_mul" "fft")
if(NOT CUSTOM_OPERATIONS)
file(GLOB op_src "${CMAKE_CURRENT_SOURCE_DIR}/*.cpp")
file(GLOB op_dirs LIST_DIRECTORIES true "${CMAKE_CURRENT_SOURCE_DIR}/*")
list(REMOVE_ITEM op_dirs "${CMAKE_CURRENT_SOURCE_DIR}/cmake")

foreach(op IN LISTS op_src)
get_filename_component(op_name ${op} NAME_WE)
@@ -88,10 +93,12 @@ if(TBB_FOUND)
target_link_libraries(${TARGET_NAME} PRIVATE TBB::tbb TBB::tbbmalloc)
endif()

if(sentence_piece IN_LIST CUSTOM_OPERATIONS)
add_subdirectory(sentence_piece)
# Left sentence_piece for backward compatibility
if(tokenizer IN_LIST CUSTOM_OPERATIONS)
add_subdirectory(tokenizer)
endif()

target_link_libraries(${TARGET_NAME} PRIVATE openvino::runtime)

target_compile_definitions(${TARGET_NAME} PRIVATE ${CUSTOM_OPERATIONS})
target_include_directories(${TARGET_NAME} PUBLIC ./include/)
modules/custom_operations/user_ie_extensions/cmake/platforms.cmake (new file)
@@ -0,0 +1,89 @@

# Copyright (C) 2023 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
#

if(CMAKE_CL_64)
set(MSVC64 ON)
endif()

if(WIN32 AND CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
execute_process(COMMAND ${CMAKE_CXX_COMPILER} -dumpmachine
OUTPUT_VARIABLE OPENVINO_GCC_TARGET_MACHINE
OUTPUT_STRIP_TRAILING_WHITESPACE)
if(OPENVINO_GCC_TARGET_MACHINE MATCHES "amd64|x86_64|AMD64")
set(MINGW64 ON)
endif()
endif()

if(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "amd64.*|x86_64.*|AMD64.*")
set(OV_HOST_ARCH X86_64)
elseif(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "i686.*|i386.*|x86.*|amd64.*|AMD64.*")
set(OV_HOST_ARCH X86)
elseif(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "^(arm64.*|aarch64.*|AARCH64.*|ARM64.*)")
set(OV_HOST_ARCH AARCH64)
elseif(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "^(arm.*|ARM.*)")
set(OV_HOST_ARCH ARM)
elseif(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "^riscv64$")
set(OV_HOST_ARCH RISCV64)
endif()

macro(_ov_user_ext_detect_arch_by_processor_type)
if(CMAKE_OSX_ARCHITECTURES AND APPLE)
if(CMAKE_OSX_ARCHITECTURES STREQUAL "arm64")
set(OV_ARCH AARCH64)
elseif(CMAKE_OSX_ARCHITECTURES STREQUAL "x86_64")
set(OV_ARCH X86_64)
elseif(CMAKE_OSX_ARCHITECTURES MATCHES ".*x86_64.*" AND CMAKE_OSX_ARCHITECTURES MATCHES ".*arm64.*")
set(OV_ARCH UNIVERSAL2)
else()
message(FATAL_ERROR "Unsupported value: CMAKE_OSX_ARCHITECTURES = ${CMAKE_OSX_ARCHITECTURES}")
endif()
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "amd64.*|x86_64.*|AMD64.*")
set(OV_ARCH X86_64)
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "i686.*|i386.*|x86.*|amd64.*|AMD64.*|wasm")
set(OV_ARCH X86)
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^(arm64.*|aarch64.*|AARCH64.*|ARM64.*|armv8)")
set(OV_ARCH AARCH64)
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^(arm.*|ARM.*)")
set(OV_ARCH ARM)
elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^riscv64$")
set(OV_ARCH RISCV64)
endif()
endmacro()

macro(_ov_user_ext_process_msvc_generator_platform)
# if cmake -A <ARM|ARM64|x64|Win32> is passed
if(CMAKE_GENERATOR_PLATFORM STREQUAL "ARM64")
set(OV_ARCH AARCH64)
elseif(CMAKE_GENERATOR_PLATFORM STREQUAL "ARM")
set(OV_ARCH ARM)
elseif(CMAKE_GENERATOR_PLATFORM STREQUAL "x64")
set(OV_ARCH X86_64)
elseif(CMAKE_GENERATOR_PLATFORM STREQUAL "Win32")
set(OV_ARCH X86)
else()
_ov_user_ext_detect_arch_by_processor_type()
endif()
endmacro()

if(MSVC64 OR MINGW64)
_ov_user_ext_process_msvc_generator_platform()
elseif(MINGW OR (MSVC AND NOT CMAKE_CROSSCOMPILING))
_ov_user_ext_process_msvc_generator_platform()
else()
_ov_user_ext_detect_arch_by_processor_type()
endif()

set(HOST_${OV_HOST_ARCH} ON)
set(${OV_ARCH} ON)

unset(OV_ARCH)

if(CMAKE_SYSTEM_NAME STREQUAL "Emscripten")
set(EMSCRIPTEN ON)
endif()

if(UNIX AND NOT (APPLE OR ANDROID OR EMSCRIPTEN OR CYGWIN))
set(LINUX ON)
endif()
@@ -0,0 +1,61 @@
// Copyright (C) 2023 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//

#pragma once

#include <numeric>
#include <string>
#include <vector>

#include <openvino/runtime/tensor.hpp>

namespace openvino_extensions {
// Pack any container of strings into an ov::Tensor with element type u8
// Requirements for BatchOfStrings: .size(), .begin()/.end() iterators, and elements exposing .begin(), .end() and .size();
// essentially any STL container of std::string is compatible
// The destination tensor is reshaped according to the input data
template <typename BatchOfStrings>
void pack_strings(const BatchOfStrings& strings, ov::Tensor& destination) {
auto batch_size = strings.size();

// First run over all elements: calculate total memory required to hold all strings
size_t symbols_size = std::accumulate(
strings.begin(), strings.end(), size_t(0),
[](size_t accum, typename BatchOfStrings::const_reference str)
{ return accum + str.size(); });

size_t total_size = 4 * (1 + 1 + batch_size) + symbols_size;
destination.set_shape({total_size});

int32_t* pindices = reinterpret_cast<int32_t*>(destination.data<uint8_t>());
pindices[0] = static_cast<int32_t>(batch_size);
pindices[1] = 0;
pindices += 2;
char* psymbols = reinterpret_cast<char*>(pindices + batch_size);
size_t current_symbols_pos = 0;

for (const auto& str: strings) {
psymbols = std::copy(str.begin(), str.end(), psymbols);
current_symbols_pos += str.size();
*pindices = current_symbols_pos;
++pindices;
}
}

// inline: this function is defined in a header, so it must be inline to avoid ODR violations
inline std::vector<std::string> unpack_strings(const ov::Tensor& source) {
int32_t length = source.get_byte_size();
// check the format of the input bitstream representing the string tensor
OPENVINO_ASSERT(length >= 4, "Incorrect packed string tensor format: no batch size in the packed string tensor");
const int32_t* pindices = reinterpret_cast<const int32_t*>(source.data<const uint8_t>());
int32_t batch_size = pindices[0];
OPENVINO_ASSERT(length >= 4 + 4 + 4 * batch_size,
"Incorrect packed string tensor format: the packed string tensor must contain first string offset and end indices");
const int32_t* begin_ids = pindices + 1;
const int32_t* end_ids = pindices + 2;
const char* symbols = reinterpret_cast<const char*>(pindices + 2 + batch_size);

std::vector<std::string> result;
result.reserve(batch_size);
for (int32_t idx = 0; idx < batch_size; ++idx) {
result.emplace_back(symbols + begin_ids[idx], symbols + end_ids[idx]);
}
return result;
}
}
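The packed layout used by `pack_strings`/`unpack_strings` above — an int32 batch size, a leading zero offset, one int32 end offset per string, then the concatenated UTF-8 bytes — can be exercised without OpenVINO. The following is a minimal Python sketch of the same format (the function names mirror the C++ ones for illustration only):

```python
import struct

def pack_strings(strings):
    """Pack a batch of strings into the u8 byte layout described above."""
    data = [s.encode("utf-8") for s in strings]
    ends, total = [], 0
    for chunk in data:
        total += len(chunk)
        ends.append(total)  # cumulative end offset of each string
    # header: batch_size, begin offset (always 0), then one end offset per string
    header = struct.pack(f"<{2 + len(data)}i", len(data), 0, *ends)
    return header + b"".join(data)

def unpack_strings(buffer):
    """Inverse of pack_strings(): recover the original batch."""
    (batch_size,) = struct.unpack_from("<i", buffer, 0)
    # batch_size + 1 offsets: [0, end_0, end_1, ...]
    offsets = struct.unpack_from(f"<{batch_size + 1}i", buffer, 4)
    symbols = buffer[4 * (2 + batch_size):]
    return [symbols[offsets[i]:offsets[i + 1]].decode("utf-8")
            for i in range(batch_size)]

batch = ["Hello", "", "мир"]
assert unpack_strings(pack_strings(batch)) == batch
```

The round-trip assertion at the end mirrors what the C++ pair guarantees: `unpack_strings` reads the offsets written by `pack_strings` and slices the symbol area back into the original strings.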
34 changes: 29 additions & 5 deletions modules/custom_operations/user_ie_extensions/ov_extension.cpp
Original file line number Diff line number Diff line change
@@ -52,14 +52,38 @@
# define S_CONV_EXT
#endif

#ifdef sentence_piece
# include "sentence_piece/sentence_piece.hpp"
# define SENTENSE_PIECE_EXT \
#ifdef tokenizer
# include "tokenizer/tokenizer.hpp"
# define TOKENIZER_EXT \
std::make_shared<ov::OpExtension<StringTensorPack>>(), \
std::make_shared<ov::OpExtension<RaggedTensorPack>>(), \
std::make_shared<ov::OpExtension<StringTensorUnpack>>(), \
std::make_shared<ov::OpExtension<CaseFold>>(), \
std::make_shared<ov::frontend::ConversionExtension>("CaseFoldUTF8", translate_case_fold_utf8), \
std::make_shared<ov::OpExtension<NormalizeUnicode>>(), \
std::make_shared<ov::frontend::ConversionExtension>("NormalizeUTF8", translate_normalize_utf8), \
std::make_shared<ov::OpExtension<RegexNormalization>>(), \
std::make_shared<ov::frontend::ConversionExtension>("StaticRegexReplace", translate_static_regex_replace), \
std::make_shared<ov::OpExtension<RegexSplit>>(), \
std::make_shared<ov::frontend::ConversionExtension>("RegexSplitWithOffsets", translate_regex_split_with_offsets), \
std::make_shared<ov::OpExtension<WordpieceTokenizer>>(), \
std::make_shared<ov::frontend::ConversionExtension>("WordpieceTokenizeWithOffsets", translate_wordpiece_tokenize_with_offsets), \
std::make_shared<ov::OpExtension<BPETokenizer>>(), \
std::make_shared<ov::OpExtension<BytesToChars>>(), \
std::make_shared<ov::frontend::ConversionExtension>("LookupTableFindV2", translate_lookup_table_find_v2), \
std::make_shared<ov::OpExtension<CombineSegments>>(), \
std::make_shared<ov::OpExtension<RaggedToDense>>(), \
std::make_shared<ov::OpExtension<VocabDecoder>>(), \
std::make_shared<ov::OpExtension<CharsToBytes>>(), \
std::make_shared<ov::frontend::ConversionExtension>("Reshape", translate_reshape), \
std::make_shared<ov::frontend::ConversionExtension>("Const", translate_const), \
std::make_shared<ov::OpExtension<TemplateExtension::SentencepieceTokenizer>>(), \
std::make_shared<ov::OpExtension<TemplateExtension::SentencepieceDetokenizer>>(), \
std::make_shared<ov::OpExtension<TemplateExtension::SentencepieceStreamDetokenizer>>(), \
std::make_shared<ov::frontend::ConversionExtension>("SentencepieceOp", translate_sentencepiece_op), \
std::make_shared<ov::frontend::ConversionExtension>("RaggedTensorToSparse", translate_sentencepiece_tokenizer),
#else
# define SENTENSE_PIECE_EXT
# define TOKENIZER_EXT
#endif

OPENVINO_CREATE_EXTENSIONS(std::vector<ov::Extension::Ptr>(
@@ -69,5 +93,5 @@ OPENVINO_CREATE_EXTENSIONS(std::vector<ov::Extension::Ptr>(
S_CONV_TRANSPOSE_EXT
S_CONV_EXT
COMPLEX_MUL_EXT
SENTENSE_PIECE_EXT
TOKENIZER_EXT
}));