How to calculate the sequence length of a string? #373
Unanswered
RageshAntonyHM asked this question in Q&A

Replies: 1 comment
-
You can apply the tokenizer to it and count the number of tokens:

```python
# loading the model
import torch
from seamless_communication.inference import Translator

model_name = "seamlessM4T_v2_large"
vocoder_name = "vocoder_v2" if model_name == "seamlessM4T_v2_large" else "vocoder_36langs"
translator = Translator(
    model_name,
    vocoder_name,
    device=torch.device("cuda:0"),
    dtype=torch.float16,
)

text = "This is a typical single-sentence text which the Seamless model is supposed to translate well; " \
       "although this sentence is composed from multiple ones, it is still not too long, and is pretty coherent."

# evaluating the text length in tokens
tokenizer_encoder = translator.text_tokenizer.create_encoder(lang="eng")
tokens = tokenizer_encoder(text)
print(tokens)
# tensor([256022, 10257, 254, 10, 26304, 5302, 25184, 247711, 89945,
#         3657, 29568, 9451, 321, 2103, 33, 35100, 12654, 254,
#         174769, 243, 2809, 143411, 19794, 248123, 156503, 6642, 8466,
#         3657, 254, 22442, 61, 4800, 124736, 81982, 247681, 955,
#         254, 27689, 2984, 25790, 11718, 247681, 447, 254, 187056,
#         212292, 93, 247676, 3])
print(tokens.shape)
# torch.Size([49])

sequence_length = tokens.shape[0]
print(sequence_length)
# 49
```

Anyway, it is not recommended to use Seamless to translate more than one sentence at a time, because it was trained mostly with single sentences.
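Since the model was trained mostly on single sentences, one practical way to stay under the maximum sequence length is to split long input on sentence boundaries and verify each chunk's token count before translating. Below is a minimal sketch under stated assumptions: `split_into_chunks` and the `max_tokens` limit are illustrative helpers, not part of the Seamless API, and the whitespace `toy_encode` is a stand-in you would replace with the `tokenizer_encoder` created above.

```python
import re

def split_into_chunks(text, encode, max_tokens):
    """Greedily group sentences so each chunk stays within max_tokens.

    `encode` is any callable returning a sequence of token ids (e.g. the
    tokenizer_encoder from above). A single sentence that exceeds the
    limit is still emitted as its own chunk.
    """
    # naive sentence split on ., !, ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if current and len(encode(candidate)) > max_tokens:
            chunks.append(current)   # flush the chunk that still fits
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Stand-in tokenizer for demonstration only; swap in tokenizer_encoder.
toy_encode = lambda s: s.split()

text = "First sentence here. Second sentence follows. A third one ends it."
print(split_into_chunks(text, toy_encode, max_tokens=5))
# → ['First sentence here.', 'Second sentence follows.', 'A third one ends it.']
```

Each chunk can then be passed to `translator.predict` separately, which also matches the single-sentence recommendation above.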
-
I get a "maximum sequence length" issue when trying to do "Text to Text" translation.
I need to calculate the sequence length of a string before passing it to prediction.
How can I do this?