
MPI cluster data distribution problem #50

Open

ZhouTaiji opened this issue Jan 30, 2021 · 2 comments

Comments

@ZhouTaiji

Hello, when I submit the demo from the examples to the cluster, I set the number of workers to 10, and the parameters in TF_CONFIG confirm that work_num is 10. However, when distributing the data to the different machines with:

dataset = ds_data_files.shard(num_shards=tn.core.shard_num(), index=tn.core.self_shard_id())

tn.core.shard_num() still returns 1, so every node trains on the full dataset by itself. The distributed ML scheduling system I am using submits jobs to the cluster as packaged Docker images, and the TF version of the same wide-deep model trains in a distributed way without problems, but the tn version fails to distribute the data across machines. How should this problem be solved?
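
A quick way to confirm the symptom is to print each worker's shard view at startup. A minimal sketch, assuming the demo's `import tensornet as tn` alias and with whatever initialization the demo performs omitted:

```python
# Minimal diagnostic: launch exactly like the training job. If every
# process prints "shard 0 of 1", the shard metadata was never
# distributed across workers.
import tensornet as tn

print("shard", tn.core.self_shard_id(), "of", tn.core.shard_num())
```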

@zhangys-lucky
Collaborator

It sounds like mpirun did not launch successfully.
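
If tensornet derives its shard count from the MPI world size, as this reply suggests, then processes started outside a successful mpirun each see a world of size 1. One way to check the launch independently of tensornet is a bare MPI probe; a sketch using mpi4py (mpi4py is an assumption here, not part of the project):

```python
# verify_mpi.py -- launch this exactly like the training job, under the
# same docker/scheduler setup. If every process prints "rank 0 of 1",
# mpirun did not wire the processes into a single MPI world, which
# would also leave tn.core.shard_num() at 1.
from mpi4py import MPI

comm = MPI.COMM_WORLD
print("rank", comm.Get_rank(), "of", comm.Get_size())
```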

@ZhouTaiji
Author

But when I replaced the tn parts of the code with tf, distributed training worked normally. And in the example above, after I manually set num_shards and index from the task_index and work_num in TF_CONFIG, training also proceeded correctly.
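
For reference, the manual workaround described above might look like the following sketch. It assumes the standard TF_CONFIG JSON layout; the variable names are illustrative, and `ds_data_files` is the file dataset from the original snippet:

```python
import json
import os

# Standard TF_CONFIG layout, e.g.:
# {"cluster": {"worker": ["host1:port", ...]},
#  "task": {"type": "worker", "index": 3}}
tf_config = json.loads(os.environ["TF_CONFIG"])
num_workers = len(tf_config["cluster"]["worker"])
worker_index = tf_config["task"]["index"]

# Shard the file list by hand instead of relying on tn.core.shard_num(),
# which reports 1 when the MPI world is not set up.
dataset = ds_data_files.shard(num_shards=num_workers, index=worker_index)
```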
