
Concurrency bug in test_remove_file() #808

Open
wangvsa opened this issue Mar 7, 2024 · 0 comments
Collaborator

wangvsa commented Mar 7, 2024

Describe the problem you're observing

I was doing some testing using the write-static code and found a subtle bug in

int test_remove_file(test_cfg* cfg, const char* filepath)

Describe how to reproduce the problem

When the test file already exists, running with the "remove and reuse test file" flag (-r) easily triggers the bug:

    mpirun -n 2 write-static -n 1 -r

Issue

At the end of test_remove_file(), we call barrier() to synchronize the processes.
The problem is that one or more processes can take an execution path that exits the function early, so the barrier calls no longer match across ranks.

    rc = stat(filepath, &sb);
    if (rc) {
        test_print_verbose_once(cfg,
            "DEBUG: stat(%s): file already doesn't exist", filepath);
        return 0;
    }

This part of the code assumes that all ranks check for the file, get rc = 0, and continue.
Problematic execution sequence:
Rank 0 calls stat(), gets rc = 0, and goes ahead and deletes the file. Rank 1 then calls stat(), which returns -1 because Rank 0 has already deleted the file. Rank 1 therefore exits the function without calling the barrier() at the end.
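
To make the sequence above concrete, here is a small single-process model of it. It is not UnifyFS code: race_model, buggy_remove(), and replay_issue_sequence() are hypothetical names, the file is a boolean, and the barrier is just a per-rank counter.

```c
#include <stdbool.h>

/* Hypothetical model of the race; none of these names exist in UnifyFS. */
typedef struct {
    bool file_exists;      /* stands in for the shared test file */
    int  barrier_calls[2]; /* how many times each rank reached the barrier */
} race_model;

/* Mirrors the control flow described in the issue for a single rank. */
static void buggy_remove(race_model* m, int rank)
{
    if (!m->file_exists) {
        return;                 /* stat() failed: early exit skips the barrier */
    }
    if (rank == 0) {
        m->file_exists = false; /* stands in for the file deletion */
    }
    m->barrier_calls[rank]++;   /* stands in for the barrier at the end */
}

/* Replays the interleaving from the issue: rank 0 deletes the file
 * before rank 1 gets to call stat(). */
static race_model replay_issue_sequence(void)
{
    race_model m = { .file_exists = true, .barrier_calls = {0, 0} };
    buggy_remove(&m, 0);
    buggy_remove(&m, 1);
    return m;
}
```

After the replay, rank 0 has entered the barrier once and rank 1 not at all. In a real MPI run, rank 0's barrier would then either be matched with some later, unrelated barrier on rank 1 or wait indefinitely.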

Fix

Add a barrier immediately after stat() to make sure everyone sees the same result.
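
A minimal sketch of that fix, assuming no other process touches the file during the call and that all ranks see a consistent view of the file system. The names remove_file_synced, barrier_fn, and counting_barrier are hypothetical; the barrier is passed in as a callback so the sketch compiles without MPI, and in the real test utility it would be the same barrier call used at the end of test_remove_file().

```c
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical barrier hook; in an MPI program it would wrap the real barrier. */
typedef void (*barrier_fn)(void);

/* Stand-in barrier for single-process trials: it only counts its calls. */
static int barrier_calls;
static void counting_barrier(void) { barrier_calls++; }

/* Sketch of the proposed fix, not the actual UnifyFS implementation. */
static int remove_file_synced(int rank, const char* filepath, barrier_fn barrier)
{
    struct stat sb;
    int rc = stat(filepath, &sb);

    /* New barrier: no rank may delete the file until every rank has
     * finished its stat() call, so all ranks get the same answer. */
    barrier();

    if (rc) {
        return 0; /* file is missing; every rank takes this branch together */
    }

    int ret = 0;
    if (rank == 0) {
        ret = unlink(filepath); /* only rank 0 removes the shared file */
    }

    barrier(); /* original barrier at the end of the function */
    return ret;
}
```

With this ordering each rank calls the barrier either once (file missing) or twice (file present), and the count is the same on every rank.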

Overall, we should be careful with early returns in any function that calls a barrier later. All processes must always follow the same path: either all return early, or all reach the barrier.
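
One way to follow the "all call barrier" branch of this rule is to give the function a single exit path, so the barrier is reached no matter what stat() returns locally. The sketch below is an assumption about how that could look, not the committed fix; remove_file_single_exit, sync_fn, and counting_sync are hypothetical names, and the barrier is again a callback so the code compiles without MPI.

```c
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical barrier hook; a real program would call its MPI barrier here. */
typedef void (*sync_fn)(void);

/* Stand-in barrier for single-process trials: it only counts its calls. */
static int sync_calls;
static void counting_sync(void) { sync_calls++; }

/* Single-exit variant: the local stat() result decides whether to unlink,
 * but never whether the barrier is reached. */
static int remove_file_single_exit(int rank, const char* filepath, sync_fn sync)
{
    struct stat sb;
    int ret = 0;

    if (stat(filepath, &sb) == 0) {
        if (rank == 0) {
            ret = unlink(filepath); /* only rank 0 removes the shared file */
        }
    }
    /* No early return above, so every rank arrives here exactly once. */
    sync();
    return ret;
}
```

This keeps the barrier calls matched, but it does not make the ranks agree on whether the file existed; if later code depends on that answer, the extra barrier after stat() is still needed.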

wangvsa added a commit to wangvsa/UnifyFS that referenced this issue Jul 1, 2024
Signed-off-by: Chen Wang <[email protected]>
MichaelBrim added a commit that referenced this issue Jul 22, 2024