
Different NLP models need different Tokenizers #329

Open

Mohit0928 opened this issue Sep 30, 2021 · 1 comment

@Mohit0928 commented Sep 30, 2021

Recently, we created a proc-block for tokenization for the BERT model, which supports all the models built on top of BERT. In the future, we want Rune to work with all NLP models. In NLP, new models arrive every year with their own tokenization algorithms, so different Transformer models use different tokenizers, and whether a global, standardized tokenizer is even possible remains an open question. The problem doesn't end there, either: that's only the situation for English. When we move to other languages (German, Spanish, etc.), they also need their own tokenizers.

Creating a tokenizer proc-block amounts to reimplementing that NLP model's entire tokenization algorithm in Rust. Writing a tokenizer proc-block for every Transformer model is very time-consuming, and implementing it in Rust makes it even more challenging. We can't afford to write a proc-block for every NLP model.

Writing proc-blocks in Python could be a solution to this. I'm not sure how difficult it is to compile Python to WebAssembly. Maybe @Michael-F-Bryan or @saidinesh5 could comment on this.

@SamLeroux and @meelislootus know this area better. Do you have any solutions or thoughts on this?

@kthakore, please take a look at this.

@Michael-F-Bryan (Contributor) commented

@Mohit0928, I disagree that it's easier to implement a tokenizer algorithm in Python than in Rust... Rust actually has better support for manipulating strings, especially when non-ASCII characters are involved, so this may be more a matter of experience than of the language.

From my experience implementing BERT, it should take maybe 100-400 lines of Rust to add a new tokenizer.
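To put that estimate in perspective, here is a minimal sketch of the greedy longest-match (WordPiece-style) step that BERT uses, in plain Rust with only the standard library. This is not Rune's actual proc-block interface, and the tiny hard-coded vocabulary is purely illustrative:

```rust
use std::collections::HashMap;

/// Greedy longest-match (WordPiece-style) tokenization of a single word,
/// as BERT does after whitespace/punctuation splitting. Returns `None`
/// when the word can't be covered by the vocabulary, in which case the
/// caller would emit the [UNK] token instead.
fn wordpiece(word: &str, vocab: &HashMap<String, u32>) -> Option<Vec<u32>> {
    let mut ids = Vec::new();
    let mut start = 0;
    while start < word.len() {
        let mut end = word.len();
        let mut found = None;
        // Try the longest remaining substring first, shrinking one
        // character at a time until we hit a vocabulary entry.
        while start < end {
            // Pieces after the first are written with a "##" prefix.
            let piece = if start == 0 {
                word[start..end].to_string()
            } else {
                format!("##{}", &word[start..end])
            };
            if let Some(&id) = vocab.get(&piece) {
                found = Some((id, end));
                break;
            }
            // Step back one character, staying on a UTF-8 boundary.
            end = word[..end].char_indices().last().map(|(i, _)| i)?;
        }
        let (id, next) = found?;
        ids.push(id);
        start = next;
    }
    Some(ids)
}

fn main() {
    let vocab: HashMap<String, u32> = [("un", 0u32), ("##affable", 1), ("##aff", 2), ("##able", 3)]
        .iter()
        .map(|&(s, id)| (s.to_string(), id))
        .collect();
    // "unaffable" splits greedily into ["un", "##affable"].
    assert_eq!(wordpiece("unaffable", &vocab), Some(vec![0, 1]));
}
```

The rest of a real tokenizer (text cleanup, whitespace and punctuation splitting, special tokens like [CLS]/[SEP], and loading the real vocabulary) is what fills out the remaining lines of that estimate.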

> Writing proc-blocks in Python could be a solution to this. I'm not sure how difficult it is to compile Python to WebAssembly.

In general, you can't compile Python directly to WebAssembly because of its dynamic nature; WebAssembly is designed as a target for statically compiled languages. You would need to compile the CPython interpreter itself to WebAssembly and use that to execute your Python code.
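As a sketch of what that approach looks like from the Rust side, here is roughly the pattern for running a WASI module with the `wasmtime` and `wasmtime-wasi` crates, adapted from wasmtime's own WASI example (1.x-era API; exact signatures vary between versions). The `cpython.wasm` filename is hypothetical, and a real CPython-on-WASI build would additionally need argv and filesystem pre-opens configured for its standard library:

```rust
use anyhow::Result;
use wasmtime::{Engine, Linker, Module, Store};
use wasmtime_wasi::sync::WasiCtxBuilder;

fn main() -> Result<()> {
    let engine = Engine::default();

    // Hypothetical: an interpreter (e.g. CPython) already compiled to WASI.
    let module = Module::from_file(&engine, "cpython.wasm")?;

    // Wire the WASI imports (stdio, clocks, ...) into the linker.
    let mut linker = Linker::new(&engine);
    wasmtime_wasi::add_to_linker(&mut linker, |cx| cx)?;

    // Give the guest our stdio and argv; a real CPython build would also
    // need filesystem pre-opens so it can find its standard library.
    let wasi = WasiCtxBuilder::new()
        .inherit_stdio()
        .inherit_args()?
        .build();
    let mut store = Store::new(&engine, wasi);

    // Instantiate the module and run its WASI entry point (`_start`).
    linker.module(&mut store, "", &module)?;
    linker
        .get_default(&mut store, "")?
        .typed::<(), ()>(&store)?
        .call(&mut store, ())?;
    Ok(())
}
```

Projects like Pyodide take essentially this route, shipping the CPython interpreter compiled to WebAssembly; the trade-off is a multi-megabyte runtime and significant startup cost compared to a native Rust proc-block.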
