My code keeps hanging at torch.distributed.init_process_group. How can I fix this?

Environment: single machine, multiple GPUs.

Environment variable setup:

os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '4'  # I only want to use 4 of the GPUs
os.environ['LOCAL_RANK'] = '0'
os.environ['MASTER_ADDR'] = '127.0.0.1'  # address of rank 0
os.environ['MASTER_PORT'] = '29500'      # any free port
os.environ['NCCL_IB_DISABLE'] = "1"
os.environ['NCCL_IBEXT_DISABLE'] = "1"

The code below keeps timing out. Is something wrong with my settings?

args.dist_url = "env://"
args.dist_backend = "nccl"

torch.distributed.init_process_group(
    backend=args.dist_backend,
    init_method=args.dist_url,
    world_size=args.world_size,
    rank=args.rank,
    timeout=datetime.timedelta(seconds=10),  # allow auto-downloading and de-compressing
)
torch.distributed.barrier()
os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '4'  # I only want to use 4 of the GPUs
os.environ['LOCAL_RANK'] = '0'

Don't set these three variables yourself.

If you only want to use four of the GPUs, just prefix the launch command with

CUDA_VISIBLE_DEVICES=x,x,x,x

If it still times out, also remove the two NCCL-related environment variables.
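A minimal sketch of what the reply suggests, assuming the script is launched with torchrun (the GPU indices 0,1,2,3 and the script name train.py are placeholders): let the launcher export RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT, and only restrict the visible GPUs via CUDA_VISIBLE_DEVICES. Note that a 10-second timeout is usually too short for NCCL initialization, so a longer one is used here.

# Launch (placeholder GPU indices and script name):
#   CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 train.py

import datetime
import os

import torch
import torch.distributed as dist

def init_distributed():
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT
    # for every process, so none of them need to be set manually in the script.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    dist.init_process_group(
        backend="nccl",
        init_method="env://",  # read rank/world size from the environment
        timeout=datetime.timedelta(minutes=30),  # 10 s is usually too short for NCCL setup
    )
    dist.barrier()

if __name__ == "__main__":
    init_distributed()
    print(f"rank {dist.get_rank()} / world size {dist.get_world_size()} ready")
    dist.destroy_process_group()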