v0.5.1: Short-Form Generation Fixes
v0.5.1 of aitextgen fixes a long-standing bug that inadvertently broke generation for short-form content.
Short-Form Generation
- When training, a new field is automatically written to the config: `line_by_line`, which indicates whether the source `TokenDataset` used was ingested line-by-line (e.g. a CSV file).
- When generating, if the loaded model config has `line_by_line=True`, the generation will automatically prepend the `bos_token` to the text so the generation knows it's at the start of the text. This results in substantially better text generation quality.
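The prepending behavior described above can be sketched as follows. This is a minimal illustration, not aitextgen's actual internals; `build_prompt` is a hypothetical helper, and the GPT-2 `bos_token` value `"<|endoftext|>"` is an assumption:

```python
# Minimal sketch (hypothetical helper): when the loaded config was trained
# line_by_line, the prompt is prefixed with the tokenizer's bos_token so
# the model treats it as the start of a text.
def build_prompt(prompt, config, bos_token="<|endoftext|>"):
    if config.get("line_by_line"):
        return bos_token + prompt
    return prompt
```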
If you have an older model trained on a `line_by_line` dataset, you can still use this workflow by making one of the following changes:
- Manually add `"line_by_line": true` to the `config.json` for the model.
- When the model is loaded, call `setattr(ai.model.config, "line_by_line", True)`.
- Set the new `prepend_bos` parameter to `generate()` to `True`.
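The first two workarounds above can be sketched like this. `enable_line_by_line` is a hypothetical helper name and the `SimpleNamespace` object stands in for the `ai.model.config` of a loaded aitextgen model; only the `config.json` key and the `setattr` call come from the notes:

```python
import json
from types import SimpleNamespace

# Workaround 1: persist the flag by patching the model's saved config.json
# on disk (helper name and path handling are illustrative).
def enable_line_by_line(config_path):
    with open(config_path) as f:
        cfg = json.load(f)
    cfg["line_by_line"] = True
    with open(config_path, "w") as f:
        json.dump(cfg, f)

# Workaround 2: flag the already-loaded config at runtime. SimpleNamespace
# stands in for ai.model.config on a real aitextgen instance.
ai = SimpleNamespace(model=SimpleNamespace(config=SimpleNamespace()))
setattr(ai.model.config, "line_by_line", True)
```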
Misc fixes
- Improvements to generation with a schema so that it works more correctly.
- Loading a tokenizer via `tokenizer_file` now uses the `PreTrainedTokenizerFast` class, which handles special tokens more correctly.
- Added a `skip_special_tokens` parameter to force printing of special tokens during generation: good for debugging generation with a schema.