
Different NLP models need different Tokenizers #329

Open

Mohit0928 opened this issue Sep 30, 2021 · 1 comment

@Mohit0928 commented Sep 30, 2021

Recently, we created a proc-block for tokenization for the BERT model, which supports all the models built on top of BERT. In the future, we want Rune to work with all NLP models. In NLP, new models arrive every year with their own tokenization algorithms, so different Transformer models use different tokenizers, and whether a global, standardized tokenizer is even possible remains an open question. The problem doesn't end there, either: that's only the situation for English. When we move to other languages (German, Spanish, etc.), they also need their own tokenizers.

Creating a tokenizer proc-block amounts to reimplementing that NLP model's entire tokenization algorithm in Rust. Writing a tokenizer proc-block for every Transformer model is very time-consuming, and implementing it in Rust makes it even more challenging. We can't afford to write a proc-block for every NLP model.

Writing proc-blocks in Python could be a solution to this. I'm not sure how difficult it is to compile Python to WebAssembly. Maybe @Michael-F-Bryan or @saidinesh5 could comment on this.

@SamLeroux and @meelislootus know this area better. Do you have any solutions or thoughts on this?

@kthakore, please take a look at this.

@Michael-F-Bryan (Contributor) commented

@Mohit0928, I disagree that it's easier to implement a tokenizer algorithm in Python than in Rust... Rust actually has better support for manipulating strings, especially when non-ASCII characters are involved, so this may be more a matter of experience than of the language.

From my experience implementing BERT, it should take maybe 100-400 lines of Rust to add a new tokenizer.
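To put that estimate in perspective, here is a minimal sketch of the greedy longest-match (WordPiece-style) step that BERT uses, in plain Rust with only the standard library. This is not Rune's actual proc-block interface, and the tiny hard-coded vocabulary is purely illustrative:

```rust
use std::collections::HashMap;

/// Greedy longest-match (WordPiece-style) tokenization of a single word,
/// as BERT does after whitespace/punctuation splitting. Returns `None`
/// when the word can't be covered by the vocabulary, in which case the
/// caller would emit the [UNK] token instead.
fn wordpiece(word: &str, vocab: &HashMap<String, u32>) -> Option<Vec<u32>> {
    let mut ids = Vec::new();
    let mut start = 0;
    while start < word.len() {
        let mut end = word.len();
        let mut found = None;
        // Try the longest remaining substring first, shrinking one
        // character at a time until we hit a vocabulary entry.
        while start < end {
            // Pieces after the first are written with a "##" prefix.
            let piece = if start == 0 {
                word[start..end].to_string()
            } else {
                format!("##{}", &word[start..end])
            };
            if let Some(&id) = vocab.get(&piece) {
                found = Some((id, end));
                break;
            }
            // Step back one character, staying on a UTF-8 boundary.
            end = word[..end].char_indices().last().map(|(i, _)| i)?;
        }
        let (id, next) = found?;
        ids.push(id);
        start = next;
    }
    Some(ids)
}

fn main() {
    let vocab: HashMap<String, u32> = [("un", 0u32), ("##affable", 1), ("##aff", 2), ("##able", 3)]
        .iter()
        .map(|&(s, id)| (s.to_string(), id))
        .collect();
    // "unaffable" splits greedily into ["un", "##affable"].
    assert_eq!(wordpiece("unaffable", &vocab), Some(vec![0, 1]));
}
```

The rest of a real tokenizer (text cleanup, whitespace and punctuation splitting, special tokens like [CLS]/[SEP], and loading the real vocabulary) is what fills out the remaining lines of that estimate.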

> Writing proc-blocks in Python could be a solution to this. I'm not sure how difficult it is to compile Python to WebAssembly.

In general, you can't compile Python directly to WebAssembly because of its dynamic nature; WebAssembly is designed as a target for statically compiled languages. You would need to compile the CPython interpreter itself to WebAssembly and use that to execute your Python code.
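As a sketch of what that approach looks like from the Rust side, here is roughly the pattern for running a WASI module with the `wasmtime` and `wasmtime-wasi` crates, adapted from wasmtime's own WASI example (1.x-era API; exact signatures vary between versions). The `cpython.wasm` filename is hypothetical, and a real CPython-on-WASI build would additionally need argv and filesystem pre-opens configured for its standard library:

```rust
use anyhow::Result;
use wasmtime::{Engine, Linker, Module, Store};
use wasmtime_wasi::sync::WasiCtxBuilder;

fn main() -> Result<()> {
    let engine = Engine::default();

    // Hypothetical: an interpreter (e.g. CPython) already compiled to WASI.
    let module = Module::from_file(&engine, "cpython.wasm")?;

    // Wire the WASI imports (stdio, clocks, ...) into the linker.
    let mut linker = Linker::new(&engine);
    wasmtime_wasi::add_to_linker(&mut linker, |cx| cx)?;

    // Give the guest our stdio and argv; a real CPython build would also
    // need filesystem pre-opens so it can find its standard library.
    let wasi = WasiCtxBuilder::new()
        .inherit_stdio()
        .inherit_args()?
        .build();
    let mut store = Store::new(&engine, wasi);

    // Instantiate the module and run its WASI entry point (`_start`).
    linker.module(&mut store, "", &module)?;
    linker
        .get_default(&mut store, "")?
        .typed::<(), ()>(&store)?
        .call(&mut store, ())?;
    Ok(())
}
```

Projects like Pyodide take essentially this route, shipping the CPython interpreter compiled to WebAssembly; the trade-off is a multi-megabyte runtime and significant startup cost compared to a native Rust proc-block.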
