Do I have to set up passwordless login between machines in order to use multi-node training? #2160
-
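If you use CAI's built-in multi-node management, you need to configure passwordless login. CAI runs on multiple nodes with the Slurm framework, and Slurm requires passwordless login between cluster nodes.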
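One way to check whether passwordless login is already in place is to probe each node with a non-interactive SSH call. The sketch below is only an illustration and is not part of CAI: the helper names `build_ssh_probe` and `has_passwordless_ssh` are hypothetical, and it assumes an OpenSSH client is on `PATH`. `BatchMode=yes` makes `ssh` fail instead of prompting for a password.

```python
import subprocess


def build_ssh_probe(host, connect_timeout=5):
    """Build an ssh command that fails fast instead of asking for a password.

    Hypothetical helper for illustration; not a CAI API.
    """
    return [
        "ssh",
        "-o", "BatchMode=yes",                      # never prompt interactively
        "-o", f"ConnectTimeout={connect_timeout}",  # give up quickly on dead hosts
        host,
        "true",                                     # no-op command on the remote side
    ]


def has_passwordless_ssh(host, connect_timeout=5):
    """Return True if `ssh host true` succeeds without any prompt."""
    try:
        result = subprocess.run(
            build_ssh_probe(host, connect_timeout),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=connect_timeout + 5,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # ssh is not installed, or the probe hung: treat as "not available".
        return False
    return result.returncode == 0
```

Calling `has_passwordless_ssh("node1")` from the launch node (with `node1` standing in for a real hostname) would report whether that node accepts key-based login.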
-
For multi-node launch, you can refer to this repo: https://github.com/hpcaitech/GPT-Demo
-
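For example, on a single machine (one I obtained through Slurm; the submitted job effectively just opens a terminal), I run the following (the example comes from https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/tensor_parallel)
and it reports the following error
But if I only run
then it works.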
-
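My machine was allocated through Slurm, and it most likely does not have passwordless login configured. System: CentOS 7. I tried printing the error information inside colossalai; here is the error message: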
-
Thanks for the feedback. This looks like a bug: the check for whether a host is local did not account for the fact that getfqdn() may return addr+domain.
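To illustrate the kind of mismatch described here, the following is a minimal sketch of a local-host check that tolerates a fully qualified name. It is not the actual colossalai code or fix: `is_local_host` and its helpers are hypothetical names, and the sketch assumes hostnames (not IP addresses) are being compared. Matching on the short name can give false positives when two domains reuse the same host label.

```python
import socket


def _normalize(name):
    """Lower-case a hostname and drop surrounding whitespace and a trailing dot."""
    return name.strip().lower().rstrip(".")


def _short(name):
    """Return the first label of a hostname, e.g. 'node1' for 'node1.cluster.example'."""
    return _normalize(name).split(".")[0]


def is_local_host(hostname, local_names=None):
    """Return True if `hostname` appears to refer to the current machine.

    Hypothetical helper for illustration. `local_names` can be passed
    explicitly for testing; by default it uses both the short hostname and
    the name reported by socket.getfqdn(), which may carry a domain suffix.
    """
    if local_names is None:
        local_names = (socket.gethostname(), socket.getfqdn())

    target = _normalize(hostname)
    if target in ("localhost", "127.0.0.1"):
        return True

    for name in local_names:
        # Accept an exact match, or a match on the short (domain-less) name.
        if target == _normalize(name) or _short(target) == _short(name):
            return True
    return False
```

With this approach, a hostfile entry of `node1` would still be treated as local when `getfqdn()` reports `node1.cluster.example` (both names are made up for the example).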
-
As the title says.