Questions about training logs #254
-
Hi, I want to know whether it is possible to get training information for each step, including time cost, loss, and learning rate?
-
Yes, this can be done whether you are using the Trainer or the Engine. First of all, for general logging, you can use our distributed logger. The API documentation is here. An example is shown below.

```python
from colossalai.logging import get_dist_logger
from colossalai.context import ParallelMode
logger = get_dist_logger()
# this will create a log file for each rank
logger.log_to_file('./')
# log on all ranks
logger.info('hello world')
# log on only rank 0 in the global process group
logger.info('hello world', ranks=[0])
# log on only rank 0 in the data-parallel process group
logger.info('hello world', ranks=[0], parallel_mode=ParallelMode.DATA)
```
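For the Engine path, the same logger works inside a hand-written training loop. Below is a minimal sketch, assuming `model`, `optimizer`, `criterion`, and `train_dataloader` were built beforehand and wrapped via `colossalai.initialize()` as in the official examples (the variable names here are illustrative); it logs per-step time cost, loss, and learning rate on rank 0.

```python
import time

import colossalai
from colossalai.logging import get_dist_logger

# Hypothetical setup following the official examples: `model`,
# `optimizer`, `criterion`, and `train_dataloader` are assumed
# to be built beforehand.
engine, train_dataloader, _, _ = colossalai.initialize(
    model=model,
    optimizer=optimizer,
    criterion=criterion,
    train_dataloader=train_dataloader,
)

logger = get_dist_logger()
engine.train()
for step, (img, label) in enumerate(train_dataloader):
    start = time.time()
    img, label = img.cuda(), label.cuda()
    engine.zero_grad()
    output = engine(img)
    loss = engine.criterion(output, label)
    engine.backward(loss)
    engine.step()
    # the LR scheduler updates the optimizer's param groups in place,
    # so the current learning rate can be read back from there
    lr = optimizer.param_groups[0]['lr']
    logger.info(
        f'step {step} | loss {loss.item():.4f} | lr {lr:.3e} | '
        f'time {time.time() - start:.3f}s',
        ranks=[0],
    )
```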
-
If you are using a Trainer, you can add these hooks:

```python
from colossalai.trainer import hooks
hook_list = [
    hooks.LossHook(),
    hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=True),
    hooks.LogMetricByEpochHook(logger),
    hooks.ThroughputHook(),
    hooks.LogMetricByStepHook(),
    hooks.TensorboardHook(log_dir='./tb_logs', ranks=[0]),
]
```
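These hooks only take effect once the list is passed to a Trainer. Here is a minimal sketch of that wiring, assuming `engine`, `logger`, and `train_dataloader` are created as in the getting-started examples and `num_epochs` is a placeholder; the `MultiTimer` supplies the per-step timing data that `ThroughputHook` reports.

```python
from colossalai.trainer import Trainer
from colossalai.utils import MultiTimer

# `engine`, `logger`, `train_dataloader`, and `hook_list` are assumed
# to be created as above; `num_epochs` is a hypothetical placeholder.
timer = MultiTimer()
trainer = Trainer(engine=engine, timer=timer, logger=logger)
trainer.fit(
    train_dataloader=train_dataloader,
    epochs=num_epochs,
    hooks=hook_list,
    display_progress=True,
)
```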