
llama GPU model with dcn fsdp + ici tp + cudnn flash attention broken #1093

wang2yn84 (Collaborator) opened this issue Dec 10, 2024 · 2 comments
wang2yn84 commented Dec 10, 2024

Using the 7B model as an example, the following config doesn't work even on a 2-node setup:

dcn fsdp = number of nodes
ici tp = 8
attention = cudnn_flash_te
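For reference, that parallelism config corresponds to a MaxText launch roughly like the following. This is a sketch, not the exact command from the linked script: the run name and output path are placeholders, and the config keys follow MaxText's `configs/base.yml` conventions.

```shell
# Hypothetical 2-node x 8-GPU launch reproducing the failing combination:
# FSDP across nodes (DCN axis), tensor parallelism within a node (ICI axis),
# and the Transformer Engine cuDNN flash attention kernel.
python3 MaxText/train.py MaxText/configs/base.yml \
  run_name=llama2-7b-fsdp-tp \
  model_name=llama2-7b \
  hardware=gpu \
  dcn_fsdp_parallelism=2 \
  ici_tensor_parallelism=8 \
  attention=cudnn_flash_te
```

Per the report, swapping `attention=cudnn_flash_te` for `attention=dot_product` in the same launch makes the run succeed.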

It works with dot_product attention. Here is a snippet of the error message:
ERROR 2024-12-09T12:37:43.202101549Z [resource.labels.containerName: gpu-image] 2024-12-09 12:37:43.201690: E external/xla/xla/service/rendezvous.cc:55] This thread has been waiting for first call to collective operation 5688; run_id=1895556971 for 20 seconds and may be stuck. Expected 8 threads to join the rendezvous, but not all of them arrived on time.
ERROR 2024-12-09T12:37:46.679868113Z [resource.labels.containerName: gpu-image] 2024-12-09 12:37:46.679466: F external/xla/xla/service/rendezvous.cc:77] Termination timeout for first call to collective operation 5688; run_id=1895556971 of 40 seconds exceeded. Exiting to ensure a consistent program state. Expected 8 threads to join the rendezvous, but not all of them arrived on time.
ERROR 2024-12-09T12:37:46.679903343Z [resource.labels.containerName: gpu-image] Fatal Python error: Aborted

I had a working image dating back to Oct 8th. I'm not sure Oct 8th is the exact date it broke, but images after that don't work with this config. This is the script I use: https://github.com/AI-Hypercomputer/maxtext/blob/lance-nv/mt_jon_pgle.sh. It has nothing to do with PGLE, though.

@wang2yn84 wang2yn84 changed the title llama model dcn fsdp + ici tp + cudnn flash attention broken llama GPU model with dcn fsdp + ici tp + cudnn flash attention broken Dec 10, 2024
abhinavgoel95 (Contributor) commented:
What are the XLA flags that you used?

abhinavgoel95 (Contributor) commented:

Found it:
--xla_gpu_enable_latency_hiding_scheduler=true
--xla_gpu_enable_triton_gemm=false
--xla_gpu_enable_highest_priority_async_stream=true
--xla_gpu_all_reduce_combine_threshold_bytes=134217728
--xla_gpu_all_gather_combine_threshold_bytes=1073741824
--xla_gpu_reduce_scatter_combine_threshold_bytes=33554432
--xla_gpu_enable_pipelined_all_gather=true
--xla_gpu_enable_pipelined_reduce_scatter=true
--xla_gpu_enable_pipelined_all_reduce=true
--xla_gpu_enable_while_loop_double_buffering=true
--xla_gpu_enable_triton_softmax_fusion=false
--xla_gpu_enable_all_gather_combine_by_dim=false
--xla_gpu_enable_reduce_scatter_combine_by_dim=false
--xla_disable_hlo_passes=rematerialization
--xla_gpu_graph_level=0
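Since the regression appeared somewhere between container images, one way to narrow it down is to diff the flag sets the two images were launched with. A small helper for that (a hypothetical utility in plain Python, not part of MaxText or XLA):

```python
def parse_xla_flags(flag_string):
    """Parse a space-separated XLA flag string into a {flag: value} dict.

    Flags of the form --name=value map to name -> value; bare --name
    flags map to name -> True.
    """
    flags = {}
    for token in flag_string.split():
        token = token.lstrip("-")
        if "=" in token:
            name, value = token.split("=", 1)
            flags[name] = value
        else:
            flags[token] = True
    return flags


# Example: compare the flag sets from two images (values are illustrative).
good = parse_xla_flags("--xla_gpu_graph_level=0 --xla_gpu_enable_triton_gemm=false")
bad = parse_xla_flags("--xla_gpu_graph_level=1 --xla_gpu_enable_triton_gemm=false")
changed = {k for k in good | bad.keys() if good.get(k) != bad.get(k)}
print(changed)  # flags whose values differ between the two images
```

These flags are typically passed to the training process via the `XLA_FLAGS` environment variable, so the strings to compare can be pulled straight from each image's launch environment.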
