Issues/bugs observed when running real applications at large scales #812

Open
wangvsa opened this issue Apr 9, 2024 · 0 comments

Issues and bugs I noticed while running real applications at large scale on Frontier.
I am using this issue for bookkeeping and will create a PR to fix them in the future.
The issues reported here occur only at large scales, e.g., 628-node Flash-X runs.

System information

The issues are not system dependent.

Describe the problem you're observing

Most of these issues eventually lead to Mercury TIMEOUT errors, after which the application's I/O (e.g., HDF5) fails.

1:

The line below is inside the unifyfs_invoke_filesize_rpc() function, so the RPC id should be filesize_id, not metaget_id.

hg_id_t req_hgid = unifyfsd_rpc_context->rpcs.metaget_id;

Because of this bug, filesize RPC calls are never handled and all of them wait forever.
We need to carefully check whether similar bugs exist elsewhere. Ideally, unit tests should cover all RPC routines.
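
As a starting point for such tests, here is a minimal sketch of a check that a routine picks its own RPC id. Everything in it is a stand-in and not UnifyFS code: `mock_hg_id_t`, `mock_rpc_ids`, and the `select_*` helpers are hypothetical names, and a real test would need to exercise the actual invoke routines. For item 1 itself, the fix would presumably amount to reading `rpcs.filesize_id` instead of `rpcs.metaget_id`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the Mercury RPC id type. */
typedef uint64_t mock_hg_id_t;

/* Hypothetical stand-in for the table of registered RPC ids. */
typedef struct {
    mock_hg_id_t metaget_id;
    mock_hg_id_t filesize_id;
} mock_rpc_ids;

/* Signature of a helper that picks the id a routine will use. */
typedef mock_hg_id_t (*rpc_id_selector)(const mock_rpc_ids *rpcs);

/* Id selection as it should look in a filesize routine. */
mock_hg_id_t select_filesize_rpc_id(const mock_rpc_ids *rpcs)
{
    return rpcs->filesize_id;
}

/* Id selection reproducing the copy-paste bug from item 1. */
mock_hg_id_t select_filesize_rpc_id_buggy(const mock_rpc_ids *rpcs)
{
    return rpcs->metaget_id;
}

/* Returns 1 if the selector picks the expected id, 0 otherwise. */
int selects_expected_id(rpc_id_selector select,
                        const mock_rpc_ids *rpcs,
                        mock_hg_id_t expected)
{
    return select(rpcs) == expected;
}

/* Runs the checks; every id gets a distinct value so mix-ups show up. */
void run_rpc_id_checks(void)
{
    const mock_rpc_ids rpcs = { .metaget_id = 101, .filesize_id = 202 };

    /* The corrected selection passes. */
    assert(selects_expected_id(select_filesize_rpc_id, &rpcs,
                               rpcs.filesize_id));
    /* The buggy selection is detected. */
    assert(!selects_expected_id(select_filesize_rpc_id_buggy, &rpcs,
                                rpcs.filesize_id));
}
```

Giving every id a distinct value is what makes a wrong field visible; the same table could be extended with one entry per RPC routine.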

2:

During server initialization, server rank 0 acts as the coordinator and performs a tree-based broadcast.
The hard-coded 5-second timeout may not be enough for a large number of servers. I had to increase it slightly to avoid the timeout error in 628-node Flash-X runs.

timeout.tv_sec += 5; /* 5 sec */
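
One possible alternative to a fixed value, sketched below, is to let the timeout grow with the depth of the broadcast tree. This is only an illustration and not the planned fix: the function names, the fanout, and the base and per-level seconds are all assumptions, and the actual tree shape used by the servers may differ.

```c
#include <assert.h>
#include <time.h>

/* Depth of the smallest complete k-ary tree that holds num_servers
 * nodes (the root alone has depth 0). The fanout is an assumed
 * parameter; values below 2 are treated as 2. */
int bcast_tree_depth(int num_servers, int fanout)
{
    if (fanout < 2) {
        fanout = 2;
    }
    long long covered = 1; /* nodes covered so far (root only) */
    long long level = 1;   /* nodes on the deepest level so far */
    int depth = 0;
    while (covered < num_servers) {
        level *= fanout;
        covered += level;
        depth++;
    }
    return depth;
}

/* Timeout in seconds: a fixed base plus an allowance per tree level.
 * base_secs and per_level_secs are hypothetical tuning knobs. */
int bcast_timeout_secs(int num_servers, int fanout,
                       int base_secs, int per_level_secs)
{
    return base_secs + per_level_secs * bcast_tree_depth(num_servers, fanout);
}

/* Push an absolute deadline forward by the scaled timeout, in place
 * of adding a hard-coded number of seconds. */
void extend_deadline(struct timespec *deadline, int num_servers, int fanout,
                     int base_secs, int per_level_secs)
{
    assert(deadline != NULL);
    deadline->tv_sec += bcast_timeout_secs(num_servers, fanout,
                                           base_secs, per_level_secs);
}
```

With these assumed values, 628 servers and a fanout of 2 give a depth of 9, so a 5-second base plus 1 second per level yields a 14-second timeout. Making the value configurable would be another option.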
