This is the training code we used for the first prototype models.
Notably, it's based on an old version of HuggingFace's run_clm.py
example, which was then adapted by the ColossalAI developers to make use of some of their optimizations. It was then slightly improved to be usable in real-world scenarios (TensorBoard support, proper checkpointing, etc.).
This is being committed for archiving purposes, but if you'd like to use it, it probably works. The TL;DR version is:
- Get all the dependencies installed.
  - I have not documented this properly, but installing `transformers` and `colossalai` should probably cover it (a sketch follows this list).
- Put your data in a file called `./data/train.json`.
  - It should be a file where each line is a JSON object containing a `text` field, which contains the actual text that will be tokenized and fed into the model in the training loop (see the example after this list).
- Adjust any relevant config parameters in `finetune.bash` and run it. If you're lucky, the training loop will eventually start!
  - Metrics should be logged to a `runs` folder inside the `OUTPUT_DIR` you've specified, so you can host a TensorBoard server there to watch them (an example command follows this list).
- When it's done, you'll probably want to get a proper HF model out of the training checkpoints. You can do that using the provided `convert_to_hf.py` utility script.
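
For the dependency step, something like this might work (versions were never pinned anywhere, so treat this as a starting point rather than a tested environment):

```bash
# Assumption: current PyPI releases of both packages still work with this
# script; you may need to hunt down older versions.
pip install transformers colossalai
```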
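
Here's an illustrative `./data/train.json` (the sample text is made up; only the `text` field name is what the script actually expects):

```bash
# Hypothetical sample data: one standalone JSON object per line,
# each with a "text" field.
cat > ./data/train.json << 'EOF'
{"text": "First training document goes here."}
{"text": "Second training document goes here."}
EOF
```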
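
To watch the metrics, point TensorBoard at that `runs` folder (assuming `OUTPUT_DIR` below is set to the same path you used in `finetune.bash`):

```bash
# Serves the logged training metrics on a local web UI (port 6006 by default).
tensorboard --logdir "$OUTPUT_DIR/runs"
```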
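
I haven't documented `convert_to_hf.py`'s exact arguments here; assuming it's a standard argparse script, its own help output should list them:

```bash
# Assumption: the script uses argparse, so --help prints its usage.
python convert_to_hf.py --help
```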