Of the 4 jobs that were launched, 3 crashed in the middle of training, despite the babysitting workflow.
Two of them failed due to a Ray timeout (see below). My suspicion is that multi-slicing makes babysitting more complex: if one slice is down but the main one is up, babysitting is not triggered, and the training job hangs until it fails with a Ray timeout error.
(raylet) The node with node id: 666057938a8d86afc6080e71a6d3e58791a056c9364e63c8f4e55a9b and address: 10.130.1.251 and node name: 10.130.1.251 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, etc.)
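One possible way to close this gap is for the babysitter to check every slice rather than only the main one. The sketch below is only an illustration of that idea, not the actual babysitting workflow: `dead_slices`, `should_restart`, and the `slice_hosts` mapping are hypothetical names, and the slice layout is made up apart from the address taken from the log above. The list of alive addresses is assumed to come from whatever cluster view is available (for example, the `Alive` and `NodeManagerAddress` fields that `ray.nodes()` reports).

```python
from typing import Dict, Iterable, List


def dead_slices(slice_hosts: Dict[str, List[str]],
                alive_addresses: Iterable[str]) -> List[str]:
    """Return the names of slices that have at least one host missing.

    slice_hosts: hypothetical mapping of slice name -> host addresses.
    alive_addresses: addresses currently reported alive by the cluster.
    """
    alive = set(alive_addresses)
    return sorted(
        name
        for name, hosts in slice_hosts.items()
        if any(host not in alive for host in hosts)
    )


def should_restart(slice_hosts: Dict[str, List[str]],
                   alive_addresses: Iterable[str]) -> bool:
    """Trigger babysitting if ANY slice is unhealthy, not just the main one."""
    return bool(dead_slices(slice_hosts, alive_addresses))


if __name__ == "__main__":
    # Illustrative layout: the main slice is up, a secondary slice is down.
    slices = {
        "slice-0": ["10.130.0.1"],      # assumed main slice
        "slice-1": ["10.130.1.251"],    # address from the raylet log
    }
    alive_now = ["10.130.0.1"]
    print(dead_slices(slices, alive_now))
    print(should_restart(slices, alive_now))
```

With a check like this, a dead secondary slice would trigger a restart right away instead of leaving the job stuck until the Ray timeout fires.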