AraBERT output embeddings #50

Fatima-Haouari · 2021-01-12T10:31:26Z

Hi,
I was trying to get AraBERT output embeddings by following your provided example, which worked for me earlier. When I tried again today, the encode function raised an error that I have not been able to figure out. Please advise.
Here is the code I used and the error I am getting:

from transformers import AutoTokenizer, AutoModel
import torch

arabert_tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv01",do_lower_case=False)
arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv01")
text= "الجو جميل اليوم"
text_preprocessed=arabert_tokenizer.tokenize(text)
arabert_input = arabert_tokenizer.encode(text_preprocessed,add_special_tokens=True)

The error

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
    376             batch_text_or_text_pairs,
    377             add_special_tokens=add_special_tokens,
--> 378             is_pretokenized=is_split_into_words,
    379         )
    380 

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

I get the same error when I try the latest version of the model:

from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer, AutoModel

arabert_tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2",do_lower_case=False)
arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv2")
model_name = "bert-base-arabertv2"
arabert_prep = ArabertPreprocessor(model_name=model_name, keep_emojis=False)

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
arabert_prep.preprocess(text)
text_preprocessed=arabert_tokenizer.tokenize(text)
print(text_preprocessed)
arabert_input = arabert_tokenizer.encode(text_preprocessed,add_special_tokens=True)

WissamAntoun · 2021-01-13T11:36:42Z

WissamAntoun
Jan 13, 2021
Maintainer

You have to pass a text string to the encode function. There is no need to tokenize first; encode does that internally.

from arabert.preprocess import ArabertPreprocessor
from transformers import AutoTokenizer, AutoModel

arabert_tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2",do_lower_case=False)
arabert_model = AutoModel.from_pretrained("aubmindlab/bert-base-arabertv2")
model_name = "bert-base-arabertv2"
arabert_prep = ArabertPreprocessor(model_name=model_name, keep_emojis=False)

text = "ولن نبالغ إذا قلنا إن هاتف أو كمبيوتر المكتب في زمننا هذا ضروري"
text_preprocessed= arabert_prep.preprocess(text)
print(text_preprocessed)
>>>و+ لن نبالغ إذا قل +نا إن هاتف أو كمبيوتر ال+ مكتب في زمن +نا هذا ضروري
arabert_input = arabert_tokenizer.encode(text_preprocessed,add_special_tokens=True)
print(arabert_input)
>>>[33, 29, 1023, 28880, 1652, 195, 1457, 8, 312, 3259, 7124, 4989, 20, 1186, 289, 2407, 8, 387, 3368, 34]
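
Since the original question was about output embeddings, here is a minimal sketch of the remaining step, going from the preprocessed string to the model's hidden states. The helper name `arabert_embeddings` is hypothetical and not part of the arabert package; it assumes `prep`, `tokenizer` and `model` behave like the `arabert_prep`, `arabert_tokenizer` and `arabert_model` objects created above, and that the first element of the model output is the last hidden state, as it is for BERT-style models in transformers.

```python
def arabert_embeddings(text, prep, tokenizer, model):
    """Return the last-layer hidden states for one raw sentence.

    Hypothetical helper: prep, tokenizer and model are expected to behave
    like the ArabertPreprocessor, AutoTokenizer and AutoModel objects above.
    """
    if not isinstance(text, str):
        # Passing a list of tokens is what triggered the TypeError in the question.
        raise TypeError("pass the raw sentence as a str, not a list of tokens")

    # Apply the preprocessing the model expects and keep the returned string.
    clean = prep.preprocess(text)

    # Calling the tokenizer on the string tokenizes it internally and adds
    # the special tokens; return_tensors="pt" asks for PyTorch tensors.
    inputs = tokenizer(clean, return_tensors="pt")

    # Element 0 of the output is the last hidden state, with shape
    # (batch_size, sequence_length, hidden_size).
    outputs = model(**inputs)
    return outputs[0]
```

For inference you would typically call it inside `torch.no_grad()`, for example `hidden = arabert_embeddings(text, arabert_prep, arabert_tokenizer, arabert_model)`. How to reduce the per-token vectors to a single sentence vector (first token, mean over tokens, and so on) depends on your task.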

Fatima-Haouari · 2021-01-13T12:02:34Z

Fatima-Haouari
Jan 13, 2021
Author

Thanks a lot, Wissam, for your help. I have a question about the new preprocessor: can it also be used with v1 of AraBERT?

1 reply

WissamAntoun Jan 13, 2021
Maintainer

Of course, just pass model_name="bert-base-arabert".

Fatima-Haouari · 2021-01-13T12:04:14Z

Fatima-Haouari
Jan 13, 2021
Author

Great, thanks.
