
Releases: minimaxir/aitextgen

v0.6.0: Fix pytorch-lightning version

09 Aug 04:57

Unfortunately, I've been keeping my head down working on a new version of the trainer and missed a deprecation in pytorch-lightning.

  • Merged #191
  • Bumped minimum version of pytorch-lightning to 1.7.0

Again, I aim to move to an HF-based trainer to avoid these deprecations.

v0.5.2: Fix dependency

17 May 03:10

pytorch-lightning deprecated a feature which broke training; this is now fixed.

  • prelim TPU support + more correct pytorch-lightning training (#105)
  • Bump min pytorch-lightning version to 1.3.1
  • fixes for schema generation

v0.5.1: Short-Form Generation Fixes

01 May 23:30

v0.5.1 of aitextgen fixes a long-standing generation bug for short-form content.

Short-Form Generation

  • When training, a new field is automatically written to the config: line_by_line, which indicates whether the source TokenDataset used was ingested line_by_line (e.g. a CSV file).
  • When generating, if the loaded model config has line_by_line=True, then the generation will automatically prepend the text with the bos_token so the generation knows it's at the start of the text. This results in substantially better text generation quality.

If you have an older model trained on a line_by_line dataset, you can still use this workflow by making one of the following changes:

  • Manually add "line_by_line": true to the config.json for the model.
  • When the model is loaded, call setattr(ai.model.config, "line_by_line", True)
  • Set the new prepend_bos parameter of generate() to True.
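For example, the last two options might look like this (a minimal sketch; the model folder and tokenizer filenames are placeholders):

from aitextgen import aitextgen

# Load an older model that was trained on a line_by_line dataset.
ai = aitextgen(model_folder="trained_model",
               tokenizer_file="aitextgen.tokenizer.json")

# Option 2: mark the loaded config as line_by_line so generation
# automatically prepends the bos_token.
setattr(ai.model.config, "line_by_line", True)
ai.generate()

# Option 3: force the prepend for a single call instead.
ai.generate(prepend_bos=True)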

Misc fixes

  • Improvements to generation with a schema so it works more correctly.
  • Loading a tokenizer via tokenizer_file now uses the PreTrainedTokenizerFast class, which handles special tokens more correctly.
  • Added a skip_special_tokens param to force printing of generation tokens: good for debugging generation with a schema.
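A sketch of the debugging flow, assuming skip_special_tokens behaves like the corresponding transformers decode flag (i.e. setting it to False keeps special tokens in the printed output); the paths are placeholders:

from aitextgen import aitextgen

ai = aitextgen(model_folder="trained_model",
               tokenizer_file="aitextgen.tokenizer.json")

# Keep special tokens in the output to verify where the schema
# tokens are being emitted (assumed semantics: False = don't strip).
ai.generate(n=1, skip_special_tokens=False)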

v0.5.0: GPT Neo + misc fixes

19 Apr 01:15
8a44c4d

aitextgen has been updated to support GPT Neo and fix a few outstanding generation issues! However, in the process there are a few breaking changes.

Breaking Changes

Loading Models

While making model-loading architecture-agnostic for GPT Neo support, it turns out aitextgen was loading models in an unofficial way, so this has now been addressed. The user must now specify the model_folder where the pytorch_model.bin and config.json are located (with those exact filenames).

Assuming the model is located in the trained_model folder:

Old :

ai2 = aitextgen(model="trained_model/pytorch_model.bin",
                tokenizer_file="aitextgen.tokenizer.json",
                config="trained_model/config.json")

New:

ai2 = aitextgen(model_folder="trained_model",
                tokenizer_file="aitextgen.tokenizer.json")

All notebooks and documentation have been updated with this new workflow, and an assert will be raised if the old behavior is still used.

Incorrect tokenization for Colab-trained GPT-2 tokenizers.

There was an underlying issue due to a recent change in tokenizers which broke the implementation of the default GPT-2 tokenizer by preventing it from tokenizing <|endoftext|> tokens correctly. As a result, this broke truncation as well.

Only models trained on line-by-line texts with the Colab GPT-2 Notebook were affected by this; unfortunately, the only fix is to retrain the model with v0.5.0.

Other Major Changes/Fixes

GPT Neo support

GPT Neo is now supported! The Colab Notebook was updated to indicate how to finetune the smaller versions of the model.

Out of the box, all variants of GPT Neo have a 2048-token context window (versus GPT-2's 1024), allowing double the generation length, and the pretrained models are trained on much more recent data. Finetuning a GPT Neo model takes about 2x as long per step as a GPT-2 model: notable since increasing the context window normally causes training cost to scale quadratically rather than linearly, and GPT Neo also appears to converge faster.

However, text-generation performance-wise, it’s currently unclear whether GPT-Neo is “better”, especially on short-form content. Future releases of aitextgen will analyze this more closely.
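A minimal sketch of loading and finetuning a GPT Neo variant (the EleutherAI/gpt-neo-125M Hub name and the input file path are assumptions; any GPT Neo checkpoint should work the same way):

from aitextgen import aitextgen

# Load the smallest pretrained GPT Neo from the Hugging Face Hub.
ai = aitextgen(model="EleutherAI/gpt-neo-125M", to_gpu=True)

# Finetune on a local text file, then generate.
ai.train("input.txt", num_steps=2000, batch_size=1)
ai.generate(prompt="The meaning of life is", max_length=100)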

DeepSpeed support [BETA] (#103)

Thanks to the team at pytorch-lightning, DeepSpeed support has been added for aitextgen, allowing training of larger models (>1.5B params) with multi-GPUs. However, this isn’t fully tested, so more documentation is pending!

Misc changes

  • Added a nonempty_output param to generate(), default True: if the output is empty (possible with short-form content), skip it if generating multiple texts, or try again if it's a single text. If min_length is specified, the same behavior occurs for texts below the minimum length after processing (see the sketch after this list).

  • Bumped minimum versions of transformers and pytorch-lightning.

  • Completed another pass of notebooks and documentation.

  • Forced single-GPU training on Windows to avoid bugs (#116)

  • Calling the aitextgen instance will now print the model type and number of params to the console, which is helpful for debugging.
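A short sketch of the new generation behavior (the folder and filenames are placeholders; parameter names as given above):

from aitextgen import aitextgen

ai = aitextgen(model_folder="trained_model",
               tokenizer_file="aitextgen.tokenizer.json")

# Skip empty outputs (possible with short-form content) and, with
# min_length set, skip/retry texts that end up below that length.
ai.generate(n=5, min_length=8, nonempty_output=True)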

v0.4.1: Misc Bug Fixes

09 Mar 04:37
  • Fix CSV loading issue (#95)
  • Fix regex for stripping whitespace starting a generated text. (#92)
  • Fix an issue where the logger reported that the default tokenizer was in use when a custom tokenizer had actually been loaded.
  • Added a special_tokens param to allow the user to specify a List of token IDs to strip from the generated output (default: the bos_token_id and eos_token_id).
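For instance, to strip a custom set of token IDs from the output instead of the defaults (a sketch; the ID shown is GPT-2's <|endoftext|>):

from aitextgen import aitextgen

# Default 124M GPT-2, just to demonstrate the parameter.
ai = aitextgen()

# Strip these token IDs from the generated text (the default is the
# model's bos_token_id and eos_token_id).
ai.generate(special_tokens=[50256])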

Gradient Checkpointing, Serialized Tokenizers, initial implementation of schema generation

23 Feb 04:21
8dbc362

0.4.0 is a big release! The release of transformers 4.3.0 caused a breaking issue which required a more prompt release than planned; a 0.4.1 bug-fix release is likely (and new documentation is in progress as well). I also have demos of new features planned!

Update transformers and pytorch-lightning

The minimum version of transformers has been bumped to 4.3.0, which has a number of performance improvements such as faster GPT-2 training and generation. Fast tokenizers are now the default package-wide as well. pytorch-lightning was bumped to 1.2.0, albeit that is less exciting.

tokenizers was removed as a dependency since transformers pins its own. Speaking of...

Serialized custom tokenizers.

By default, train_tokenizer() will create a serialized, one-file tokenizer (e.g. aitextgen.tokenizer.json). This file also correctly supports the added_tokens parameter.

You can load the tokenizer when loading an aitextgen model with the tokenizer_file param (you can still use the merges_file and vocab_file params if you have them).
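A sketch of the serialized-tokenizer workflow, assuming a plain-text input.txt and the default output filename (GPT2ConfigCPU follows the aitextgen demos and is only one way to pair a config with the new tokenizer):

from aitextgen import aitextgen
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import GPT2ConfigCPU

# Trains a tokenizer on the source text and writes a single
# serialized aitextgen.tokenizer.json to the current directory.
train_tokenizer("input.txt")

# Build a small model from scratch that uses the serialized tokenizer.
config = GPT2ConfigCPU()
ai = aitextgen(tokenizer_file="aitextgen.tokenizer.json", config=config)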

Gradient checkpointing + Layer freezing

Gradient checkpointing is now supported for GPT-2 models, allowing finetuning of larger models such as the 355M and 774M GPT-2 models!

This also enables the 1558M model to be finetuned, in theory. I also added the ability to freeze layers so the model can be trained within VRAM constraints, but the results are mixed. More analysis will be done.
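A hedged sketch of finetuning the 355M GPT-2 with gradient checkpointing (the tf_gpt2 and gradient_checkpointing parameter names follow the aitextgen docs; the input file and step count are placeholders, and VRAM requirements will still vary by GPU):

from aitextgen import aitextgen

# Load the 355M GPT-2 with gradient checkpointing enabled so the
# finetuning pass fits in less VRAM.
ai = aitextgen(tf_gpt2="355M", gradient_checkpointing=True, to_gpu=True)

ai.train("input.txt", num_steps=2000, batch_size=1)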

Schema-based generation

A draft implementation of schema-based generation (leveraging the new custom tokenizers) has been added.

Misc bug fixes

  • Fix the TensorFlow weights URL
  • Fixed an issue where the prompt's character length, rather than its token length, was used in the too-long assert (#90)
  • Workaround breaking issue in Transformers 4.3.0 by moving special token stripping into aitextgen instead of the tokenizer (#90)
  • Added an lstrip param to generation, which strips all whitespace at the beginning of generated text (related to the point above)
  • The refresh rate during training now defaults to every 20 steps, for better performance in Jupyter/Colab.

Transformers 4.0.0 and pytorch-lightning 1.0.0 support

01 Dec 03:30
f7278bf

A release to fix breaking issues from both packages, with minor tweaks done in the meantime.

  • Minimum versions are now transformers>=4.0.0, pytorch-lightning>=1.0.8, and torch>=1.6.0, with fixes for breaking issues in all of those major versions.
  • Tweaked generation to be more canonical with the newest implementation in transformers 4.
  • Set default refresh rate for training to 20 to make pytorch-lightning happy.
  • Set default learning rate for training to 1e-3 since I forgot why it was 1e-4.
  • Set both the default vocab size for tokenizers and the CPU config vocab size to 1000 tokens (down from 5000), since this allows much easier/faster training in the demo.
  • Confirmed that setting fp16=True for GPU training with supported GPUs now works.
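Spelled out explicitly, training with the new defaults looks roughly like this (a sketch; the input file is a placeholder and every value shown is just the default noted above):

from aitextgen import aitextgen

# Default GPT-2 (124M), moved to the GPU if one is available.
ai = aitextgen(to_gpu=True)

# learning_rate now defaults to 1e-3, and fp16=True is confirmed to
# work for training on supported GPUs.
ai.train("input.txt", learning_rate=1e-3, fp16=True)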

Future releases will add more explicit features. There may be extra console output in the meantime; I'll see what I can do to remove it.

Remove optimizer_step() override

05 Jul 16:25

This fixes an issue (#44) causing training to fail due to a change in pytorch-lightning 0.8.4.

The override was only for testing; removing it is necessary for upcoming native AMP in PyTorch 1.6 regardless.

Somehow, after this change, the loss decreases much faster: I may need to investigate whether the scheduler is no longer working.

Cap transformers version

02 Jul 05:14

transformers 3.0.0 introduced some breaking changes, so the version is capped below that for now.

Fix deprecated training parameter

28 Jun 19:29
v0.2.1

Remove disable_validation param