Using 7b as an example, the following config doesn't work, even on a 2-node setup:
dcn fsdp = number of nodes
ici tp = 8
attention = cudnn_flash_te
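
To make the settings above easier to reproduce, here is a minimal sketch of how they might be passed as `key=value` overrides. The key names `dcn_fsdp_parallelism`, `ici_tensor_parallelism`, and `attention` are my assumption of how the shorthand above maps onto MaxText's config; the helper `maxtext_overrides` is hypothetical and only builds the argument list.

```python
from typing import List


def maxtext_overrides(num_nodes: int, attention: str = "cudnn_flash_te") -> List[str]:
    """Build key=value overrides for the config described in this issue.

    Assumption: the key names below correspond to the shorthand
    "dcn fsdp", "ici tp", and "attention" used in the report.
    """
    return [
        f"dcn_fsdp_parallelism={num_nodes}",  # FSDP across nodes (DCN)
        "ici_tensor_parallelism=8",           # tensor parallelism within a node (ICI)
        f"attention={attention}",             # attention kernel under test
    ]


if __name__ == "__main__":
    # Failing case from the report: 2 nodes with cudnn_flash_te.
    print(" ".join(maxtext_overrides(2)))
    # Comparison case: same sharding with dot_product attention.
    print(" ".join(maxtext_overrides(2, attention="dot_product")))
```

Only the `attention` value differs between the two printed lines, which keeps the comparison limited to the attention kernel.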
It works with dot_product attention. Here is a snippet of the error message:
ERROR 2024-12-09T12:37:43.202101549Z [resource.labels.containerName: gpu-image] 2024-12-09 12:37:43.201690: E external/xla/xla/service/rendezvous.cc:55] This thread has been waiting for first call to collective operation 5688; run_id=1895556971 for 20 seconds and may be stuck. Expected 8 threads to join the rendezvous, but not all of them arrived on time.
ERROR 2024-12-09T12:37:46.679868113Z [resource.labels.containerName: gpu-image] 2024-12-09 12:37:46.679466: F external/xla/xla/service/rendezvous.cc:77] Termination timeout for first call to collective operation 5688; run_id=1895556971 of 40 seconds exceeded. Exiting to ensure a consistent program state. Expected 8 threads to join the rendezvous, but not all of them arrived on time.
ERROR 2024-12-09T12:37:46.679903343Z [resource.labels.containerName: gpu-image] Fatal Python error: Aborted
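
In case it helps with triage, below is a rough sketch for pulling the collective operation id, `run_id`, and expected thread count out of rendezvous lines like the ones above. The regular expression is written against the wording in this log only and is an assumption about the message format, not something taken from XLA; `parse_rendezvous` is a hypothetical helper name.

```python
import re
from typing import Dict, Optional

# Pattern based solely on the log lines quoted in this issue.
RENDEZVOUS_RE = re.compile(
    r"collective operation (?P<op>\d+); run_id=(?P<run_id>\d+).*?"
    r"Expected (?P<expected>\d+) threads"
)


def parse_rendezvous(line: str) -> Optional[Dict[str, int]]:
    """Return op id, run_id, and expected thread count, or None if no match."""
    match = RENDEZVOUS_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


if __name__ == "__main__":
    sample = (
        "This thread has been waiting for first call to collective operation 5688; "
        "run_id=1895556971 for 20 seconds and may be stuck. Expected 8 threads to "
        "join the rendezvous, but not all of them arrived on time."
    )
    print(parse_rendezvous(sample))
```

The expected thread count of 8 in these lines matches the `ici tp = 8` setting, so grouping log lines by operation id may show whether the same collective is the one that stalls on each run.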
I had a working image dating back to Oct 8th. I'm not sure whether Oct 8th is the exact date it broke, but images built after that don't work with this config. This is the script I use: https://github.com/AI-Hypercomputer/maxtext/blob/lance-nv/mt_jon_pgle.sh. It has nothing to do with PGLE, though.