AraBERT output embeddings #50
-
Hi,
The error
from arabert.preprocess import ArabertPreprocessor
arabert_tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2",do_lower_case=False)
text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
Replies: 3 comments 1 reply
-
You have to pass a text string to the encode function. There is no need to tokenize the text first, because encode does that internally.
from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer, AutoModel
arabert_tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2",do_lower_case=False)
arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv2")
model_name = "bert-base-arabertv2"
arabert_prep = ArabertPreprocessor(model_name=model_name, keep_emojis=False)
text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
text_preprocessed = arabert_prep.preprocess(text)
print(text_preprocessed)
>>>و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري
arabert_input = arabert_tokenizer.encode(text_preprocessed,add_special_tokens=True)
print(arabert_input)
>>>[33, 29, 1023, 28880, 1652, 195, 1457, 8, 312, 3259, 7124, 4989, 20, 1186, 289, 2407, 8, 387, 3368, 34]
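The snippet above stops at the token ids. As a rough sketch of how the output embeddings asked about in the title could be obtained from those inputs, the code below runs the model and averages the last hidden states. The helper names `mean_pool`, `sentence_embedding` and `demo` are made up for illustration, and mean pooling is only one common choice, not something AraBERT prescribes.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is non-zero.

    Both arguments are plain Python lists, so this helper has no dependencies.
    """
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def sentence_embedding(text_preprocessed, tokenizer, model):
    """Return one vector for an already preprocessed string (hypothetical helper)."""
    import torch  # imported lazily so mean_pool stays usable without torch

    # Calling the tokenizer on the string tokenizes and encodes in one step.
    inputs = tokenizer(text_preprocessed, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state has shape (batch, seq_len, hidden_size); take item 0.
    token_vectors = outputs.last_hidden_state[0].tolist()
    mask = inputs["attention_mask"][0].tolist()
    return mean_pool(token_vectors, mask)


def demo():
    """Download the model from the thread and print the embedding size."""
    from arabert.preprocess import ArabertPreprocessor
    from transformers import AutoModel, AutoTokenizer

    name = "aubmindlab/bert-base-arabertv2"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    prep = ArabertPreprocessor(model_name="bert-base-arabertv2", keep_emojis=False)
    text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
    vector = sentence_embedding(prep.preprocess(text), tokenizer, model)
    print(len(vector))
```

Calling `demo()` needs `arabert`, `transformers` and `torch` installed plus network access for the model download. If per-token vectors are wanted instead of a single sentence vector, `outputs.last_hidden_state[0]` can be used directly without pooling.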
-
Thanks a lot, Wissam, for your help. I have a question regarding the new preprocessor: can we also use it with V1 of AraBERT?
-
Great, thanks.