pytorch cifar example doesn't quit gracefully #47

Open
yaroslavvb opened this issue Aug 6, 2018 · 0 comments

@yaroslavvb

Right now the pytorch-cifar example, run on a single p3.16xlarge, ends its last epoch with the following error, which comes from all training processes:

cc @bearpelican

terminate called after throwing an instance of 'gloo::EnforceNotMet'
  what():  [enforce fail at /opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/cuda_private.h:40] error == cudaSuccess. 29 vs 0. Error at: /opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/cuda_private.h:40: driver shutting down
@bearpelican bearpelican self-assigned this Aug 6, 2018