
batch processing improvements

@pszemraj pszemraj released this 18 Feb 22:52
82bafca

improvements for batch processing

A small release with improvements to the Summarizer class for batch-processing use cases.

Let's say you've loaded your Summarizer class:

from textsum.summarize import Summarizer

model_name = "pszemraj/pegasus-x-large-book_synthsumm-bf16" # recent model
summarizer = Summarizer(model_name)

new features/improvements:

Smart __call__ Function for Summarizer Class:

  • Added a smart __call__ function to automatically distinguish between text input and file paths for summarization, allowing easier integration into batch processing and .map() tasks.
# Directly passing text to be summarized
summary_text = summarizer("This is a sample text to summarize.")
print(summary_text)

# Passing a file path to be summarized
output_filepath = summarizer(
    "/path/to/textfile.extension",
    output_dir="./my-summary-stash",
)
print(output_filepath)
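To illustrate the idea of telling text apart from file paths, here is a minimal sketch of one possible heuristic. It is not the actual textsum implementation: the names looks_like_file_path, dispatch, and max_path_chars are hypothetical, and the rule (a short, single-line string that names an existing file counts as a path) is an assumption made for the example.

```python
from pathlib import Path


def looks_like_file_path(text_or_path: str, max_path_chars: int = 1024) -> bool:
    """Hypothetical heuristic: short, single-line strings naming an existing file are paths."""
    # Long or multi-line strings are almost certainly raw text, so skip the filesystem check.
    if len(text_or_path) > max_path_chars or "\n" in text_or_path:
        return False
    try:
        return Path(text_or_path).is_file()
    except (OSError, ValueError):
        # Strings that are not valid paths on this OS are treated as plain text.
        return False


def dispatch(text_or_path: str) -> str:
    """Return which branch a smart __call__ might take for this input (assumed behavior)."""
    return "file" if looks_like_file_path(text_or_path) else "text"


print(dispatch("This is a sample text to summarize."))
```

A check like this is what makes a single callable usable in .map() over a text column and in loops over file paths without separate method names.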

Enhanced Batch Processing Controls:

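  • Introduced the disable_progress_bar and batch_delimiter options for better control over batch processing and output formatting: the first suppresses the progress bar, the second sets the string placed between batch summaries.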
from datasets import load_dataset

dataset = load_dataset("Trelis/tiny-shakespeare")
dataset = dataset.map(
    lambda x: {"summary": summarizer(x["text"], disable_progress_bar=True)},
    batched=False,
) # doesn't spam you with multiple progress bars!!
print(dataset)

Note: You can pass disable_progress_bar=True when instantiating the Summarizer() for cleaner inference.

You can now set the string that separates the summaries of each batch via the batch_delimiter argument when running inference:

# `text` is any input string long enough to be split into several batches
summary_output = summarizer(text, batch_delimiter="<I AM A DELIMITER>")
print(summary_output)
# "Summary of first chunk.<I AM A DELIMITER>Summary of second chunk.<I AM A DELIMITER>Summary of third chunk."

By default, the delimiter is "\n\n".
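A custom delimiter makes it easy to recover the per-batch summaries afterwards. The sketch below only mimics the output format described above with plain string operations; join_chunk_summaries and split_summary are hypothetical helpers for illustration, not part of textsum.

```python
DEFAULT_DELIMITER = "\n\n"  # default stated in these release notes


def join_chunk_summaries(chunk_summaries, batch_delimiter=DEFAULT_DELIMITER):
    """Hypothetical helper: combine per-batch summaries into one output string."""
    return batch_delimiter.join(s.strip() for s in chunk_summaries)


def split_summary(summary, batch_delimiter=DEFAULT_DELIMITER):
    """Hypothetical helper: recover the per-batch summaries from the combined string."""
    return summary.split(batch_delimiter)


chunks = [
    "Summary of first chunk.",
    "Summary of second chunk.",
    "Summary of third chunk.",
]
combined = join_chunk_summaries(chunks, batch_delimiter="<I AM A DELIMITER>")
print(combined)
print(split_summary(combined, batch_delimiter="<I AM A DELIMITER>"))
```

Choosing a delimiter that cannot appear in a generated summary keeps the split unambiguous, which the default "\n\n" does not guarantee.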

Misc

  • default parameter update: the length_penalty for inference is now 1.0 (was 0.8)
  • code cleanup across modules, mostly for readability and maintainability.
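For intuition on the length_penalty change, here is a small worked example. It assumes the beam-search convention used by Hugging Face transformers, where a finished hypothesis is scored as its summed log-probability divided by length ** length_penalty; the candidate names and numbers are made up for illustration.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized beam score (assumed convention, see the text above)."""
    return sum_logprobs / (length ** length_penalty)


# Hypothetical finished hypotheses: (summed log-probability, length in tokens)
candidates = {
    "short": (-4.0, 5),
    "long": (-7.5, 10),
}


def best_candidate(length_penalty: float) -> str:
    """Return the name of the hypothesis with the highest normalized score."""
    return max(
        candidates,
        key=lambda name: beam_score(*candidates[name], length_penalty),
    )


for lp in (0.8, 1.0):
    print(lp, best_candidate(lp))
# 0.8 short
# 1.0 long
```

Because log-probabilities are negative, a larger exponent shrinks the penalty on longer hypotheses, so under this convention the move from 0.8 to 1.0 nudges generation toward somewhat longer summaries.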

What's Changed

Full Changelog: v0.2.0...v0.2.1