Multi-slicing training jobs fail due to Ray timeout error #684

Ivan-Zhou · 2024-08-06T03:49:13Z

Of the 4 jobs that were launched, 3 crashed in the middle of training, despite the babysitting workflow.

Two of them failed due to a Ray timeout (see below). My suspicion is that multi-slicing makes babysitting more complex: if one slice goes down but the main one stays up, babysitting is not triggered, and the training job gets stuck until it fails with a Ray timeout error.

(raylet) The node with node id: 666057938a8d86afc6080e71a6d3e58791a056c9364e63c8f4e55a9b and address: 10.130.1.251 and node name: 10.130.1.251 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a  (1) raylet crashes unexpectedly (OOM, etc.)
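To make the suspected gap concrete, here is a minimal sketch of a health check that looks at every node instead of only the main slice. It assumes the babysitter can get a per-node status list shaped like the output of `ray.nodes()` (dicts with `Alive` and `NodeManagerAddress` keys). The function names and the `expected_count` parameter are hypothetical and are not part of the existing babysitting workflow.

```python
from typing import Any, Iterable, Mapping


def find_dead_nodes(nodes: Iterable[Mapping[str, Any]]) -> list[str]:
    """Return the addresses of all nodes that are reported as not alive.

    `nodes` is assumed to look like the list returned by `ray.nodes()`.
    """
    return [
        str(node.get("NodeManagerAddress", "unknown"))
        for node in nodes
        if not node.get("Alive", False)
    ]


def should_restart(nodes: Iterable[Mapping[str, Any]], expected_count: int) -> bool:
    """Hypothetical babysitting rule: restart when fewer nodes are alive than expected.

    This triggers even when the main slice is healthy and only a
    secondary slice has gone down.
    """
    alive = sum(1 for node in nodes if node.get("Alive", False))
    return alive < expected_count


if __name__ == "__main__":
    # Illustrative status list: the main slice is up, one other slice is down.
    status = [
        {"NodeManagerAddress": "10.130.1.250", "Alive": True},
        {"NodeManagerAddress": "10.130.1.251", "Alive": False},
    ]
    print(find_dead_nodes(status))
    print(should_restart(status, expected_count=2))
```

With a rule like this, a dead secondary slice would be caught by the babysitter directly, rather than surfacing later as a Ray timeout.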
