max_length in PretrainedTransformer classes and truncation #5119
-
Huh, I actually didn't know about that behavior in the indexer and embedder! But I believe the tokenizer does just truncate? On most tasks I've worked on, I usually just set …
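(A quick way to check what I mean, assuming allennlp's `PretrainedTransformerTokenizer`; the model name and length here are arbitrary examples:)

```python
# Quick check of whether max_length truncates at the tokenizer level.
# Model name and max_length are arbitrary examples.
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer("bert-base-uncased", max_length=16)
tokens = tokenizer.tokenize("a very long sentence " * 100)
# If the tokenizer does truncate, this should be 16 (special tokens included).
print(len(tokens))
```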
-
I need to double-check the tokenizer behavior, but from my brief look at the code, it might not: the huggingface tokenizers don't truncate by default (https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.__call__), and it doesn't seem like the `truncation` kwarg is used in allennlp. Might be worth having more clarity around this somehow.
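For reference, this is the huggingface behavior I mean (model name and numbers are just examples):

```python
# By default, huggingface tokenizers do not truncate; you have to opt in.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_text = "a very long sentence " * 400

# No truncation kwarg: the full, over-length sequence comes back
# (possibly with a length warning in the logs).
print(len(tokenizer(long_text)["input_ids"]))  # well over 512

# Opting in via truncation + max_length clips the output to max_length tokens.
print(len(tokenizer(long_text, truncation=True, max_length=512)["input_ids"]))  # 512
```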
-
Hi!

Many of the `PretrainedTransformer` classes have `max_length` arguments. Without reading the docstrings, I would have assumed that the `max_length` argument would truncate my inputs automatically. However, it seems like this is not the case (and for good reason)! For instance, I'd imagine that truncation would be problematic if you're trying to do, say, a sequence labeling problem: the parts of your input that are thrown away won't get tags. Instead, it seems like the `max_length` argument controls the maximum instantaneous length of any input to the transformer: long inputs are chunked into segments of `max_length`, fed through, and then recombined.

A lot of the huggingface examples (e.g., the run_glue.py script) have a `max_seq_len` that just truncates everything after it. If we wanted to mimic this behavior in allennlp, would the right move be to do this at the `DatasetReader` level? i.e., use a `PretrainedTransformer` tokenizer and discard any instances with more than `max_seq_len` tokens (rough sketch below)?

Thanks!
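For concreteness, here's roughly the kind of reader I'm imagining. The registered name, field names, and TSV file format are all made up for illustration; the point is just tokenizing up front and skipping over-length instances:

```python
# A hypothetical reader that tokenizes with a PretrainedTransformerTokenizer and
# discards any instance longer than max_seq_len wordpieces (special tokens included).
from typing import Iterable

from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import LabelField, TextField
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer


@DatasetReader.register("max-seq-len-filtering-reader")  # registered name is made up
class MaxSeqLenFilteringReader(DatasetReader):
    def __init__(self, model_name: str = "bert-base-uncased", max_seq_len: int = 512, **kwargs):
        super().__init__(**kwargs)
        self._tokenizer = PretrainedTransformerTokenizer(model_name)
        self._token_indexers = {"tokens": PretrainedTransformerIndexer(model_name)}
        self._max_seq_len = max_seq_len

    def _read(self, file_path: str) -> Iterable[Instance]:
        # Assumes a simple "label<TAB>text" file, purely for illustration.
        with open(file_path) as data_file:
            for line in data_file:
                label, text = line.rstrip("\n").split("\t")
                tokens = self._tokenizer.tokenize(text)
                if len(tokens) > self._max_seq_len:
                    continue  # discard over-length instances (run_glue-style code would truncate instead)
                yield Instance(
                    {
                        "tokens": TextField(tokens, self._token_indexers),
                        "label": LabelField(label),
                    }
                )
```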