We support automatically collecting metrics such as:
- High level system metrics such as MFU, average loss, max loss and words per second
- Memory metrics to measure max VRAM consumption and the number of OOMs
- Timing metrics to measure data loading bottlenecks
Those metrics can then be visualized in either a TensorBoard or a Weights and Biases (W&B) dashboard.
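As a rough illustration of what the MFU (model FLOPs utilization) metric expresses, the sketch below divides the FLOPs a run actually achieves by the hardware's peak FLOPs. It is not the calculation used in this repo: the function name, the tokens-per-second throughput unit, the common `6 * num_params` FLOPs-per-token approximation for dense transformers, and all example numbers are assumptions chosen for illustration.

```python
def model_flops_utilization(
    tokens_per_second: float,
    flops_per_token: float,
    peak_flops_per_second: float,
) -> float:
    """Return achieved FLOPs/s as a fraction of the hardware peak FLOPs/s."""
    if peak_flops_per_second <= 0:
        raise ValueError("peak_flops_per_second must be positive")
    achieved_flops_per_second = tokens_per_second * flops_per_token
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical numbers, for illustration only.
num_params = 8e9                      # assumed model size
flops_per_token = 6 * num_params      # common approximation (forward + backward)
tokens_per_second = 5_000             # assumed per-device throughput
peak_flops_per_second = 1e15          # assumed per-device peak

mfu = model_flops_utilization(tokens_per_second, flops_per_token, peak_flops_per_second)
print(f"MFU: {mfu:.1%}")
```

A value well below 1.0 is normal; the metric is mainly useful for comparing configurations on the same hardware.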
To visualize TensorBoard metrics of models trained on a remote server via a local web browser:
- Make sure the metrics.enable_tensorboard option is set to true in model training (either from a .toml file or from the CLI).
- Set up SSH tunneling by running the following from the local CLI:
  ssh -L 6006:127.0.0.1:6006 [username]@[hostname]
- In the SSH session on the remote server, go to the torchtitan repo and start the TensorBoard backend:
  tensorboard --logdir=./outputs/tb
- In the local web browser, go to the URL it provides or to http://localhost:6006/.
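If the page does not load, it can help to confirm that the forwarded port is reachable before debugging anything else. The following is a minimal sketch using only the Python standard library; the helper name `port_is_open` is hypothetical, and the host and port simply mirror the tunnel settings from the steps above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 6006 is the local end of the SSH tunnel set up above.
    if port_is_open("127.0.0.1", 6006):
        print("Port 6006 is reachable; open http://localhost:6006/ in a browser.")
    else:
        print("Port 6006 is not reachable; check the SSH tunnel and the TensorBoard process.")
```

A successful connection only shows that something is listening on the local port; it does not verify that the listener is TensorBoard.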
Weights and Biases automatically sends metrics to a remote server once you log in with wandb login, so all you need to do is make sure that metrics.enable_wandb is enabled.
For an example, you can inspect debug_model.toml.
Note that if both W&B and TensorBoard are enabled, W&B takes priority.
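The priority rule can be pictured with a short sketch. This is not the repo's implementation: the function name `pick_dashboard` and the dictionary layout are hypothetical, and the dictionary only stands in for the metrics.enable_tensorboard and metrics.enable_wandb options after they have been parsed from a .toml file or the CLI.

```python
from typing import Optional


def pick_dashboard(enable_tensorboard: bool, enable_wandb: bool) -> Optional[str]:
    """Return the dashboard that receives metrics, preferring W&B over TensorBoard."""
    if enable_wandb:
        return "wandb"
    if enable_tensorboard:
        return "tensorboard"
    return None  # neither option is enabled


# Hypothetical parsed configuration with both options turned on.
metrics_config = {"enable_tensorboard": True, "enable_wandb": True}

selected = pick_dashboard(
    enable_tensorboard=metrics_config["enable_tensorboard"],
    enable_wandb=metrics_config["enable_wandb"],
)
print(selected)
```

With both options on, the sketch selects W&B, matching the note above; turning off the W&B option falls back to TensorBoard.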