prov/psm3: Sometimes doesn't clean up stale shm regions #9893

Open
zachdworkin opened this issue Mar 15, 2024 · 6 comments

@zachdworkin
Contributor

PSM3 sometimes doesn't clean up stale shm regions. After a test is done, you can still see some psm3.xxx.xxx shm regions in /dev/shm.

Run a test several times, then check /dev/shm and notice there are open shm regions that aren't cleaned up.

It is expected that tests clean up all of their resources on completion.

Fabtests don't appear to report any errors and just leak the stale shm regions.

Environment:
RHEL 8.6, 8.7, and 9.3; psm3

@raffenet
Contributor

We have noticed this in our internal MPICH tests. /dev/shm fills up with psm3 segments and makes shared memory unusable on the system without manual cleanup.

@ToddRimmer
Contributor

If a job fails or is killed, the provider has no chance to clean up shared memory. I am not aware of any issues with cleanup for normally completed jobs. For failed jobs, cleanup such as the following is required and is typically integrated into the job scheduler's post-job cleanup: rm -rf /dev/shm/psm3_shm.* /dev/shm/sem.psm3_nic_affinity* /dev/shm/psm3_nic_affinity*

If you are seeing a lack of cleanup for jobs which completed normally and fully shut down all OFI interfaces, please provide more details. I do not see such issues with OpenMPI or Intel MPI.

@zachdworkin
Contributor Author

@ToddRimmer in prov/shm there is a signal handler that cleans up the /dev/shm region in the event of a failure signal or abrupt failure: https://github.com/ofiwg/libfabric/blob/main/prov/shm/src/smr_signal.h. We don't do this cleanup for psm3, and it looks like mpich/argonne doesn't either. From what I can tell, the "normal completion" path is okay. Can a signal handler like the one in prov/shm be implemented to solve this failure/interrupt path issue?
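
For illustration, the general shape of that pattern looks roughly like the sketch below. This is not the actual smr_signal.h code; psm3_cleanup_shm_regions() and the handler names are hypothetical stand-ins for whatever would unlink the /dev/shm/psm3* files this process created.

```c
/* Minimal sketch of a cleanup-on-signal pattern, in the spirit of
 * prov/shm's smr_signal.h. Not actual libfabric/PSM3 code. */
#include <signal.h>
#include <string.h>

extern void psm3_cleanup_shm_regions(void);	/* hypothetical helper */

static struct sigaction old_abrt, old_segv;

static void cleanup_handler(int signum, siginfo_t *info, void *ucontext)
{
	struct sigaction *old = (signum == SIGABRT) ? &old_abrt : &old_segv;

	psm3_cleanup_shm_regions();

	/* Restore the previous disposition and re-raise so the default
	 * action (core dump / termination) or an application handler
	 * still runs after our cleanup. */
	sigaction(signum, old, NULL);
	raise(signum);
}

static void install_cleanup_handlers(void)
{
	struct sigaction act;

	memset(&act, 0, sizeof(act));
	act.sa_sigaction = cleanup_handler;
	act.sa_flags = SA_SIGINFO;
	sigaction(SIGABRT, &act, &old_abrt);
	sigaction(SIGSEGV, &act, &old_segv);
}
```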

@ToddRimmer
Contributor

In our experience, a common job failure mode is for middleware or some other layer to call abort(), which results in a segfault signal and no guaranteed clean way for the provider to clean up, as process memory state is suspect. So our recommendation has been to place the cleanup commands in the job scheduler epilog script; the commands are provided in a script documented in our Intel Ethernet Fabric Suite package. This approach has worked well in the field for over a decade with PSM3 and its predecessors.

We can explore a signal handler for some other signals, but given the need to address the common abort() case, the epilog approach is simple and covers all cases.

@zachdworkin
Contributor Author

@ToddRimmer: Does abort() send a SIGSEGV (segmentation fault signal)? I thought it sends SIGABRT (abort signal) to the running process. In either case, a signal handler would be able to catch both of these signals (or any signal except SIGKILL) and clean up the /dev/shm/psm3* files before completing the signal. This is already done and working for the shm provider, so maybe we can add it as an enhancement for psm3? I think having to add extra cleanup outside of the code, which the code can and should handle, doesn't make sense in the long run.
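
As a quick standalone check of that behavior (not PSM3 code), something like the following catches abort() and runs cleanup before the process dies:

```c
/* Standalone check: abort() raises SIGABRT (not SIGSEGV), and a handler
 * can run cleanup before the process terminates. */
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static void on_abrt(int signum)
{
	(void)signum;
	/* In a provider, /dev/shm/psm3* cleanup would happen here. */
	static const char msg[] = "caught SIGABRT, cleaning up\n";
	write(STDERR_FILENO, msg, sizeof(msg) - 1);
	_exit(128 + SIGABRT);
}

int main(void)
{
	signal(SIGABRT, on_abrt);
	abort();	/* delivers SIGABRT */
	return 0;
}
```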

I can add your suggested workaround of
rm -rf /dev/shm/psm3_shm.* /dev/shm/sem.psm3_nic_affinity* /dev/shm/psm3_nic_affinity*
to our Slurm epilog to handle this cleanup until the enhancement is complete.

@ToddRimmer
Contributor

Exploring. We do have an existing signal handler; can you try an experiment with PSM3_BACKTRACE=1? In that case our signal handler gets installed, and it will call exit() for SIGTERM and SIGINT. We do have an atexit() function installed to do the cleanup, so I suspect that in your case exit() is not being called. Granted, exit() is defined as "not safe" for use in a signal handler; however, for SIGINT and SIGTERM it is typically safe enough. If this addresses the issue, you have a workaround, and if I can be confident that some layer above won't override our signal handler, I could enable it in general (I believe in the past we had such situations, so we were only able to do the atexit() approach).
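
In other words, the arrangement being described is roughly the following sketch, assuming PSM3_BACKTRACE=1 installs a handler of this shape; the names here are illustrative and psm3_cleanup_shm_regions() is hypothetical:

```c
/* Sketch of the exit()/atexit() arrangement described above. Not actual
 * PSM3 code. */
#include <signal.h>
#include <stdlib.h>

extern void psm3_cleanup_shm_regions(void);	/* hypothetical helper */

static void term_handler(int signum)
{
	/* exit() is not async-signal-safe, but for SIGINT/SIGTERM it is
	 * usually acceptable; it runs the atexit() handlers, so the shm
	 * cleanup registered below gets a chance to run. */
	exit(128 + signum);
}

static void psm3_register_cleanup(void)
{
	atexit(psm3_cleanup_shm_regions);
	signal(SIGTERM, term_handler);
	signal(SIGINT, term_handler);
}
```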
