Inference and Parameters

Welcome to the Inference Parameters wiki page! This page will give you an overview of the various parameters used to control the behavior of the HuggingFace Transformers model used for inference in summarization. These parameters can have a significant impact on the quality of the generated summary and the computational cost of the inference process.

By understanding how these parameters work and how to adjust them, you will be able to tune the inference of the model for your specific use case and achieve the best possible results.

Prerequisite

If you are new to generating text with transformers/neural networks, I recommend reading this blog post on the topic from Hugging Face to get the background knowledge. The material below will be much easier to follow afterward.

Parameters

  • min_length: This parameter controls the minimum length of the generated summary. Setting it to a higher value will ensure that the generated summary is of a certain length but may result in longer and less concise summaries.
  • max_length: This parameter controls the maximum length of the generated summary. Setting it to a lower value will ensure that the generated summary is more concise but may also result in summaries that are too short and do not contain all the important information.
  • no_repeat_ngram_size: This parameter sets the size of n-grams (sequences of consecutive tokens) that are not allowed to appear more than once in the generated summary. Smaller values are stricter and reduce repetition more aggressively, but may also make the summary less fluent; larger values only block longer repeated phrases.
  • encoder_no_repeat_ngram_size: This parameter sets the size of n-grams from the input text that cannot be copied verbatim into the generated summary. Smaller values force the summary to be more varied relative to the source, but may also make it less fluent; larger values only block copying of longer spans.
  • repetition_penalty: This parameter controls the penalty applied to tokens that have already been generated, downweighting them when they come up again. Setting it to a higher value makes the generated summary less repetitive but may also make it less fluent.
  • num_beams: This parameter controls the number of beams used during beam search decoding. A higher value will result in a "better" summary (typically more grammatically correct and factually accurate) but will also increase the computational cost of running the inference. This is probably the most important parameter controlling the computational intensity & quality (after the two no_repeat_ngram_size params are set).
  • num_beam_groups: This parameter controls the number of beam groups used during beam search decoding. A higher value will result in more diverse outputs and increase the computational cost of running the inference.
  • length_penalty: This parameter controls how sequence length is weighted in the beam score. Values below 1.0 favor shorter, more concise output, while values above 1.0 favor longer output that may be more "complete" but costs more compute.
  • early_stopping: This parameter controls whether beam search stops as soon as enough complete candidate sequences have been found. If set to true, it will save computation time but may occasionally miss a slightly better summary.
  • do_sample: This parameter controls whether or not to use sampling during decoding. If set to true, the output is sampled from the model's probability distribution, which makes it more diverse but also likely less factually accurate. You can only decode from the model with one method at a time: if you set this to true, sampling is used instead of beam search and will probably give much worse results for summarization[^1] (see the generate() sketch after this list).

[^1]: Whether this increases/decreases computational cost relative to beam search depends on how many beams you use. This has not been explored in this repo - if you do so, please let me know!
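
To make the above concrete, here is a minimal sketch (not the repo's exact code) of how these parameters are passed to a Hugging Face generate() call. The model name matches the default below; the input text and the fixed max_length are placeholders.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "pszemraj/long-t5-tglobal-base-16384-book-summary"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "Put the (long) document you want to summarize here."
inputs = tokenizer(text, return_tensors="pt", truncation=True)

summary_ids = model.generate(
    **inputs,
    min_length=8,
    max_length=256,                  # fixed here; the repo sets this w.r.t. batch size
    no_repeat_ngram_size=3,          # no 3-gram may repeat in the output
    encoder_no_repeat_ngram_size=4,  # no 4-gram may be copied verbatim from the input
    repetition_penalty=2.5,
    num_beams=4,
    num_beam_groups=1,
    length_penalty=0.8,
    early_stopping=True,
    do_sample=False,
)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])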

Default Parameters

The default model is pszemraj/long-t5-tglobal-base-16384-book-summary, and the default parameters reflect an empirically chosen tradeoff between summary quality and compute for that model.

{
    "min_length": 8,
    "max_length": <DYNAMICALLY SET w.r.t. BATCH SIZE>,
    "no_repeat_ngram_size": 3,
    "encoder_no_repeat_ngram_size": 4,
    "repetition_penalty": 2.5,
    "num_beams": 4,
    "num_beam_groups": 1,
    "length_penalty": 0.8,
    "early_stopping": true,
    "do_sample": false
}

These parameters should be fairly generalizable to other models but can be updated/reset with the set_inference_params() method of the Summarizer class.
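
For illustration only, a hypothetical usage sketch is below; the import path and method names other than set_inference_params() are assumptions, so check the Summarizer docstrings for the real API.

from textsum.summarize import Summarizer  # assumed import path

summarizer = Summarizer()  # assumed to load the default checkpoint above

# set_inference_params() is mentioned in this wiki; whether it takes keyword
# arguments or a dict of overrides is an assumption here.
summarizer.set_inference_params(num_beams=2, length_penalty=1.0)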

How to make inference less computationally intensive?

Here is a very general guide to reducing the computational intensity while attempting to preserve summary quality.

  1. Adjust the num_beams parameter. A higher value will result in a "better" summary and increase the computational cost of running the inference. You can set this to a lower value to reduce the computational cost while maintaining a good summary quality as much as possible.

    • If adjusting the default parameters/model, you can try num_beams=2, then num_beams=1.
  2. Adjust the num_beam_groups parameter (if you have set it higher than 1). A higher value produces more diverse results but also increases the computational cost of running the inference; lowering it reduces the cost while preserving summary quality as much as possible.

  3. Decrease the repetition_penalty parameter. This may make the summaries somewhat more repetitive but will decrease the computational cost. You may also get "free wins" here, as this parameter can be redundant with the no_repeat_ngram_size parameters depending on their values.

  4. Increase the no_repeat_ngram_size and encoder_no_repeat_ngram_size parameters. Similar to lowering the repetition penalty, this relaxes the constraints: the model is then only prevented from repeating longer n-grams, and from copying longer n-grams verbatim from the source text.

  5. Lower the length_penalty parameter (e.g. below the default 0.8). Values below one favor more concise output, which means fewer generated tokens and less compute, but may sacrifice the "completeness" of the summary.

  6. Set the early_stopping parameter to true (it already is in the defaults). This makes beam search stop as soon as enough complete candidates are found, saving computation time at the possible cost of a slightly worse summary.

The order in which you adjust the parameters matters, and this is only a general guide; depending on your domain, you may see variations on the above.

During this process, keep an eye on the quality of the summary generated for the same input text so you can benchmark any decrease in performance.
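
As a rough illustration of the steps above, here is a sketch of how the default dictionary might be relaxed; the specific values are illustrative starting points (chosen here, not benchmarked in the repo), and either dict can be unpacked into model.generate(**inputs, **params) to compare quality on the same input.

default_params = {
    "min_length": 8,
    "max_length": 256,  # example value; the repo sets this dynamically
    "no_repeat_ngram_size": 3,
    "encoder_no_repeat_ngram_size": 4,
    "repetition_penalty": 2.5,
    "num_beams": 4,
    "num_beam_groups": 1,
    "length_penalty": 0.8,
    "early_stopping": True,
    "do_sample": False,
}

light_params = {
    **default_params,
    "num_beams": 2,                     # step 1: fewer beams, the biggest saving
    "repetition_penalty": 1.5,          # step 3: rely more on the ngram constraints
    "no_repeat_ngram_size": 4,          # step 4: only block longer repeats
    "encoder_no_repeat_ngram_size": 5,  # step 4: only block longer copied spans
    "length_penalty": 0.6,              # step 5: favor shorter output
}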


Other Methods of Text Generation

This repo supports the "beam search" method of decoding from the language model at the time of writing. Several other methods are worth knowing about (with applications outside of summarization), listed below. If you find others useful in your work/summarization application[^2], open an issue or PR, and we can discuss integration!

[^2]: According to the repo and a recent paper, contrastive search may be applicable/better than beam search for summarization. If this is validated, will add it to the repo! (WIP - progress may be slow)

Decoding methods, listed in the transformers documentation:

  1. Greedy decoding: This method chooses the most likely next token at each step without considering alternative options. It is fast and often produces good results, but it can get stuck in local optima and may not produce diverse outputs. It is used by calling greedy_search() with num_beams=1 and do_sample=False.

  2. Contrastive search: At each step, this method considers the top-k most likely tokens and picks the one that balances the model's confidence against a degeneration penalty (how similar the token is to the preceding context). It tends to produce less repetitive output than greedy decoding but is more computationally expensive. It is used by calling contrastive_search() with penalty_alpha>0 and top_k>1.

  3. Multinomial sampling: This method samples randomly from the probability distribution at each step rather than choosing the most likely token. It can produce more diverse outputs than greedy decoding but is less likely to produce high-quality outputs. It is used by calling sample() with num_beams=1 and do_sample=True.

  4. Beam search decoding: This method generates several candidates at each step but keeps only the k most likely ones (where k is the beam size). It is more computationally expensive than greedy decoding but can produce better results by considering alternative options. It is used by calling beam_search() with num_beams>1 and do_sample=False. This is the default used by this repo.

  5. Beam search with multinomial sampling: This method combines beam search with multinomial sampling, generating multiple candidates at each step but keeping only the k most likely. It can produce more diverse outputs than beam-search decoding but is less likely to produce high-quality outputs. It is used by calling beam_sample() with num_beams>1 and do_sample=True.

  6. Group (diverse) beam search decoding: This method splits the beams into multiple groups, runs beam search within each group with a diversity penalty between groups, and then combines the outputs. It can produce more diverse outputs than plain beam search decoding. It is used by calling group_beam_search() when num_beams>1 and num_beam_groups>1.

  7. Constrained beam search decoding: This method uses beam search with the added restriction that certain words or phrases must appear in the generated output. It is useful when the output must follow certain guidelines or rules. It is used by calling constrained_beam_search() when constraints!=None or force_words_ids!=None.
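
To tie these to practice, here is a sketch of how each strategy is selected through generate() arguments (you normally do not call the underlying greedy_search()/beam_search()/... methods directly). The model name is the repo default; the input text and the specific values are placeholders, and availability of some strategies (e.g. contrastive search) depends on your transformers version.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "pszemraj/long-t5-tglobal-base-16384-book-summary"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
inputs = tokenizer("Some long text to summarize ...", return_tensors="pt")

model.generate(**inputs, num_beams=1, do_sample=False)             # 1. greedy decoding
model.generate(**inputs, penalty_alpha=0.6, top_k=4)               # 2. contrastive search
model.generate(**inputs, num_beams=1, do_sample=True, top_p=0.95)  # 3. multinomial sampling
model.generate(**inputs, num_beams=4, do_sample=False)             # 4. beam search (repo default)
model.generate(**inputs, num_beams=4, do_sample=True)              # 5. beam search + sampling
model.generate(**inputs, num_beams=4, num_beam_groups=2,           # 6. group (diverse) beam search
               diversity_penalty=0.5)
force_words_ids = tokenizer(["chapter"], add_special_tokens=False).input_ids
model.generate(**inputs, num_beams=4,                              # 7. constrained beam search
               force_words_ids=force_words_ids)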


THIS IS A WIP, MORE TO COME