My code keeps hanging at torch.distributed.init_process_group. How can I fix this?

Environment: single machine, multiple GPUs.

Environment variable setup:

os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '4'  # I only want to use 4 of the GPUs
os.environ['LOCAL_RANK'] = '0'
os.environ['MASTER_ADDR'] = '127.0.0.1'  # address of rank 0
os.environ['MASTER_PORT'] = '29500'      # any free port
os.environ['NCCL_IB_DISABLE'] = "1"
os.environ['NCCL_IBEXT_DISABLE'] = "1"

The code below keeps timing out. Is something wrong with my settings?

args.dist_url = "env://"
args.dist_backend = "nccl"

torch.distributed.init_process_group(
    backend=args.dist_backend,
    init_method=args.dist_url,
    world_size=args.world_size,
    rank=args.rank,
    timeout=datetime.timedelta(seconds=10),  # allow auto-downloading and de-compressing
)
torch.distributed.barrier()
os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '4'  # I only want to use 4 of the GPUs
os.environ['LOCAL_RANK'] = '0'

Don't set these three variables yourself.

If you only want to use four of the GPUs, just prefix the launch command with

CUDA_VISIBLE_DEVICES=x,x,x,x

If it still times out, also remove the two NCCL-related environment variables.
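A minimal sketch of what the reply suggests, assuming the script is launched with torchrun (the GPU indices 0,1,2,3 and the script name train.py are placeholders): let the launcher export RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT, and only restrict the visible GPUs via CUDA_VISIBLE_DEVICES. Note that a 10-second timeout is usually too short for NCCL initialization, so a longer one is used here.

# Launch (placeholder GPU indices and script name):
#   CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 train.py

import datetime
import os

import torch
import torch.distributed as dist

def init_distributed():
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT
    # for every process, so none of them need to be set manually in the script.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    dist.init_process_group(
        backend="nccl",
        init_method="env://",  # read rank/world size from the environment
        timeout=datetime.timedelta(minutes=30),  # 10 s is usually too short for NCCL setup
    )
    dist.barrier()

if __name__ == "__main__":
    init_distributed()
    print(f"rank {dist.get_rank()} / world size {dist.get_world_size()} ready")
    dist.destroy_process_group()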