max_length in PretrainedTransformer classes and truncation #5119
-
Huh, I actually didn't know about that behavior in the indexer and embedder! But I believe the tokenizer does just truncate? On most tasks I've worked on, I usually just set …
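(A quick way to check what I mean, assuming allennlp's `PretrainedTransformerTokenizer`; the model name and length here are arbitrary examples:)

```python
# Quick check of whether max_length truncates at the tokenizer level.
# Model name and max_length are arbitrary examples.
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer("bert-base-uncased", max_length=16)
tokens = tokenizer.tokenize("a very long sentence " * 100)
# If the tokenizer does truncate, this should be 16 (special tokens included).
print(len(tokens))
```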
-
I need to double-check the tokenizer behavior, but from my brief look at the code, it might not: the huggingface tokenizers don't truncate by default (https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.__call__), and it doesn't seem like the `truncation` kwarg is used in allennlp. Might be worth having more clarity around this somehow.
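For reference, this is the huggingface behavior I mean (model name and numbers are just examples):

```python
# By default, huggingface tokenizers do not truncate; you have to opt in.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_text = "a very long sentence " * 400

# No truncation kwarg: the full, over-length sequence comes back
# (possibly with a length warning in the logs).
print(len(tokenizer(long_text)["input_ids"]))  # well over 512

# Opting in via truncation + max_length clips the output to max_length tokens.
print(len(tokenizer(long_text, truncation=True, max_length=512)["input_ids"]))  # 512
```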
-
Hi!

Many of the `PretrainedTransformer` classes have `max_length` arguments. Without reading the docstrings, I would have assumed that the `max_length` argument would truncate my inputs automatically. However, it seems like this is not the case (and for good reason)! For instance, I'd imagine that truncation would be problematic if you're trying to do, say, a sequence labeling problem: the parts of your input that are thrown away won't get tags. Instead, it seems like the `max_length` argument controls the maximum instantaneous length of any input to the transformer: long inputs are chunked into segments of `max_length`, fed through, and then recombined.

A lot of the huggingface examples (e.g., the run_glue.py script) have a `max_seq_len` that just truncates everything after it. If we wanted to mimic this behavior in allennlp, would the right move be to do this at the `DatasetReader` level? i.e., use a `PretrainedTransformer` tokenizer and discard any instances with more than `max_seq_len` tokens (rough sketch below)?

Thanks!
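For concreteness, here's roughly the kind of reader I'm imagining. The registered name, field names, and TSV file format are all made up for illustration; the point is just tokenizing up front and skipping over-length instances:

```python
# A hypothetical reader that tokenizes with a PretrainedTransformerTokenizer and
# discards any instance longer than max_seq_len wordpieces (special tokens included).
from typing import Iterable

from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import LabelField, TextField
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer


@DatasetReader.register("max-seq-len-filtering-reader")  # registered name is made up
class MaxSeqLenFilteringReader(DatasetReader):
    def __init__(self, model_name: str = "bert-base-uncased", max_seq_len: int = 512, **kwargs):
        super().__init__(**kwargs)
        self._tokenizer = PretrainedTransformerTokenizer(model_name)
        self._token_indexers = {"tokens": PretrainedTransformerIndexer(model_name)}
        self._max_seq_len = max_seq_len

    def _read(self, file_path: str) -> Iterable[Instance]:
        # Assumes a simple "label<TAB>text" file, purely for illustration.
        with open(file_path) as data_file:
            for line in data_file:
                label, text = line.rstrip("\n").split("\t")
                tokens = self._tokenizer.tokenize(text)
                if len(tokens) > self._max_seq_len:
                    continue  # discard over-length instances (run_glue-style code would truncate instead)
                yield Instance(
                    {
                        "tokens": TextField(tokens, self._token_indexers),
                        "label": LabelField(label),
                    }
                )
```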