All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
- Increased minimum versions of dependencies (`transformers` to 4.3.0, `pytorch-lightning` to 1.2.0)
- Remove dependency on `tokenizers` as `transformers` pins it.
- Made Fast tokenizers the default (as it is the default in `transformers` 4.0.0)
- Made serialized tokenizers the default for custom tokenizers, and added support for loading them for both `aitextgen` and `TokenDataset`s
- Added gradient checkpointing for GPT-2, and set it to the default for training 355M and 774M.
- Added layer freezing to freeze the first `n` layers of GPT-2 while training. This allows 1.5B GPT-2 to be trained with a high `n`.
- Added schema-based generation for specified schema_tokens (which can be encoded in the Transformers config). This can be used with an appropriate dataset for schema-based generation.
- Switched TensorFlow weight download URL from GCP (as OpenAI removed it from there) to Azure
- Fixed issue where prompt character length was used to check for a too-long assert instead of prompt token length (#90)
- Worked around a breaking issue in Transformers 4.3.0 by moving special token stripping into aitextgen instead of the tokenizer (#90)
- Added an `lstrip` param to generation, which strips all whitespace at the beginning of generated text (related to point above)
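
The two generation fixes above can be illustrated with a minimal sketch. This is not aitextgen's actual code: `toy_tokenize`, `check_prompt_length`, and `clean_generated_text` are hypothetical names, and whitespace splitting stands in for a real tokenizer. It only shows why the length check should count tokens rather than characters, and what an `lstrip` option does to generated text.

```python
def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: splits on whitespace."""
    return text.split()


def check_prompt_length(prompt, max_length, tokenize=toy_tokenize):
    """Assert that the prompt is shorter than max_length, measured in tokens.

    Comparing len(prompt) (characters) against max_length would reject
    prompts that are well within the token budget.
    """
    prompt_tokens = tokenize(prompt)
    assert len(prompt_tokens) < max_length, (
        f"Prompt has {len(prompt_tokens)} tokens; max_length is {max_length}."
    )
    return len(prompt_tokens)


def clean_generated_text(text, lstrip=True):
    """Strip leading whitespace from generated text when lstrip is set."""
    return text.lstrip() if lstrip else text


prompt = "The quick brown fox jumps over the lazy dog"
print(len(prompt))                      # character count of the prompt
print(check_prompt_length(prompt, 16))  # token count under the toy tokenizer
print(repr(clean_generated_text("\n  Hello world")))
```

With a character-based check, this prompt would fail against `max_length=16` even though it is only a handful of tokens long.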
- Increased minimum versions of dependencies (`transformers` to 4.0.0, `pytorch-lightning` to 1.0.8, PyTorch to 1.6)
- Fixed imports to account for new Transformers file architecture
- Fixed training to account for new `transformers`/`pytorch-lightning` minimums
- Fully removed TorchScript code (ONNX implementation will supersede it)
- Made prompt specification for generation more canonical with Transformers
- Set default `vocab` size for new tokenizers to `1000`
- Began work on serializing tokenizers in accordance with the new `tokenizers` approach
- CHANGELOG.md
- Progress bar for loading a dataset.
- `progress_bar_refresh_rate` parameter for `train()` and `TokenDataset()`.
- Set numpy data store for `TokenDataset`.
- Set single-text files to be loaded delimited as newlines.
- `shuffle` and `seed` parameters for `TokenDataset`.
- Set `generate()` defaults to `max_length=256` and `temperature=0.7`.
- Added to docs notes about GPT-2 maximum length of 1024.
- Everything!