v0.5.1: Short-Form Generation Fixes
v0.5.1 of aitextgen fixes a long-standing bug that inadvertently broke generation for short-form content.
Short-Form Generation
- When training, a new field is automatically written to the config: `line_by_line`, which indicates whether the source `TokenDataset` used was ingested line-by-line (e.g. a CSV file).
- When generating, if the loaded model config has `line_by_line=True`, the generation will automatically prepend the `bos_token` to the text so the generation knows it's at the start of the text. This results in substantially better text generation quality.
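The prepending behavior described above can be sketched as follows. This is a minimal illustration, not aitextgen's actual internals; `build_prompt` is a hypothetical helper, and the GPT-2 `bos_token` value `"<|endoftext|>"` is an assumption:

```python
# Minimal sketch (hypothetical helper): when the loaded config was trained
# line_by_line, the prompt is prefixed with the tokenizer's bos_token so
# the model treats it as the start of a text.
def build_prompt(prompt, config, bos_token="<|endoftext|>"):
    if config.get("line_by_line"):
        return bos_token + prompt
    return prompt
```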
If you have an older model trained on a `line_by_line` dataset, you can still use this workflow by making one of the following changes:
- Manually add `"line_by_line": true` to the `config.json` for the model.
- When the model is loaded, call `setattr(ai.model.config, "line_by_line", True)`.
- Set the new `prepend_bos` parameter to `generate()` to `True`.
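The first two workarounds above can be sketched like this. `enable_line_by_line` is a hypothetical helper name and the `SimpleNamespace` object stands in for the `ai.model.config` of a loaded aitextgen model; only the `config.json` key and the `setattr` call come from the notes:

```python
import json
from types import SimpleNamespace

# Workaround 1: persist the flag by patching the model's saved config.json
# on disk (helper name and path handling are illustrative).
def enable_line_by_line(config_path):
    with open(config_path) as f:
        cfg = json.load(f)
    cfg["line_by_line"] = True
    with open(config_path, "w") as f:
        json.dump(cfg, f)

# Workaround 2: flag the already-loaded config at runtime. SimpleNamespace
# stands in for ai.model.config on a real aitextgen instance.
ai = SimpleNamespace(model=SimpleNamespace(config=SimpleNamespace()))
setattr(ai.model.config, "line_by_line", True)
```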
Misc fixes
- Improvements to generation with a schema so that it works more correctly.
- Loading a tokenizer via `tokenizer_file` now uses the `PreTrainedTokenizerFast` class, which handles special tokens more correctly.
- Added a `skip_special_tokens` parameter to force printing of special tokens during generation: good for debugging generation with a schema.