memhub can "deadlock" when killed #130
Comments
Except the race condition at creation, robust POSIX mutexes in shared memory should do the job, no?
Is there a possibility to implement a timeout for the lock acquisition? I don't like calls that can block forever... I think implementing a timeout for
While I usually agree with that principle, is it the best thing to do in our case? I mean, if a process terminates unexpectedly, what guarantee can we have about the hardware state? Isn't it better to forbid any access to the system? Of course, there are many ways to deal with the issue at higher levels in the software stack, but in these cases there must be good cooperation between the different processes accessing the hardware.
Setting them up seems a bit harder.
Active polling with
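For illustration, here is a minimal sketch of what a timed acquisition by active polling could look like. It is an assumption on my side that polling a non-blocking advisory file lock is an acceptable stand-in for whatever primitive the comment above had in mind; `acquire_with_timeout`, `release` and the polling interval are hypothetical names and values, not part of memhub.

```python
import errno
import fcntl
import os
import time


def acquire_with_timeout(path, timeout_s, poll_interval_s=0.01):
    """Poll for an exclusive advisory lock on `path` for at most `timeout_s` seconds.

    Returns a file descriptor that holds the lock, or None on timeout.
    (Hypothetical helper, not part of memhub.)
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # LOCK_NB makes flock fail immediately instead of blocking.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except OSError as exc:
            # Anything other than "lock is busy" is a real error.
            if exc.errno not in (errno.EAGAIN, errno.EACCES):
                os.close(fd)
                raise
        if time.monotonic() >= deadline:
            os.close(fd)
            return None
        time.sleep(poll_interval_s)


def release(fd):
    """Drop the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A caller that gets `None` back still has to decide what to do. Given the concern about unknown hardware state raised above, refusing further access may be the right reaction rather than retrying.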
Has this been shown to be an issue anywhere yet? Or is this as yet an academic exercise?
Indeed, the initial incantation we implemented to try to solve the bus collisions involved a lock file and a timeout. @evka85 then implemented the version with semaphores
Yeah, I think the lock file approach had some issues, but I can't remember what they were.
I'm not aware of this issue happening in production, but it's a race condition involving a rare signal (
To be specific, I'm afraid of the kernel sending
Ok I see. Yes, a kill from the kernel is of course possible, but that is just an indication of something else going terribly wrong, and I think we should instead address the root cause, which is running out of memory. I know that the rw_reg library currently isn't really using memory efficiently, and running multiple instances of it may result in an out-of-memory condition. We have discussed possible ways to change rw_reg to reduce the memory footprint: mainly, the memory is wasted by creating duplicate nodes with shifted addresses coming from the generated nodes in the XML file. There are a couple of ways around that. One way would be to use shared memory that contains the register addresses among all processes; this is already done e.g. in the RPC service by using LMDB, and so the Python tools could also use that. Another way would be to optimize the way that generated nodes are stored in memory by storing just the root node and the generate parameters; upon request for any specific generated node from the user, the address would be calculated dynamically. We should probably open a new issue on that.
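To make the second option concrete, here is a rough sketch of storing only the root node plus its generate parameters and computing addresses on request. The class, field names and naming template are all hypothetical; they are not taken from rw_reg or from the XML address table format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneratedNode:
    """One root node plus the parameters needed to expand it on demand.

    All field names here are assumptions made for illustration.
    """
    name_template: str  # hypothetical naming scheme, e.g. "OH{idx}.STATUS"
    base_address: int   # address of instance 0
    count: int          # number of generated instances
    address_step: int   # offset between consecutive instances

    def _check(self, idx):
        if not 0 <= idx < self.count:
            raise IndexError(f"instance {idx} out of range 0..{self.count - 1}")

    def address(self, idx):
        """Compute the address of instance `idx` instead of storing it."""
        self._check(idx)
        return self.base_address + idx * self.address_step

    def name(self, idx):
        """Build the name of instance `idx` from the template."""
        self._check(idx)
        return self.name_template.format(idx=idx)


node = GeneratedNode("OH{idx}.STATUS", base_address=0x1000, count=12, address_step=0x100)
```

With this layout the memory cost is one object per root node, regardless of `count`, at the price of a multiplication and an addition per lookup.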
I fail to understand how out-of-memory conditions are different from
Sorry, I didn't quite understand the question. We are handling all the signals that we can, but if the kernel out-of-memory killer kicks in, it just sends SIGKILL, which we can't catch.
I think I did the signal handling mostly to clean up when the user presses Ctrl+C, which results in SIGINT, but I added whatever other signals I could just in case.
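The limitation described here is easy to check. Below is a small sketch, assuming a Linux system; `install_cleanup_handlers` and the choice of signals are hypothetical and do not reflect the actual memhub handler code. It tries to install a cleanup handler for several signals and reports which ones the OS refuses.

```python
import signal


def install_cleanup_handlers(cleanup):
    """Install `cleanup(signum)` for every signal that accepts a handler.

    Returns the names of the signals for which installation was refused.
    (Hypothetical helper, not the memhub implementation.)
    """
    refused = []
    candidates = (
        signal.SIGINT,   # Ctrl+C
        signal.SIGTERM,  # polite termination request
        signal.SIGHUP,   # controlling terminal closed
        signal.SIGKILL,  # cannot be caught, blocked or ignored
        signal.SIGSTOP,  # cannot be caught, blocked or ignored
    )
    for sig in candidates:
        try:
            signal.signal(sig, lambda signum, frame: cleanup(signum))
        except (OSError, ValueError):
            # OSError: the kernel rejects handlers for this signal.
            # ValueError: not called from the main thread.
            refused.append(sig.name)
    return refused
```

Whatever cleanup runs in such a handler, it never gets a chance to run when the process receives SIGKILL, so any lock that depends on it can be left behind.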
Brief summary of issue
memhub uses semaphores to prevent concurrent access to the CTP7 memory. It tries hard to avoid leaving active semaphores behind, but all these efforts are moot if the process gets killed. This is a caveat of semaphores themselves.
Types of issue
Expected Behavior
A dying process should release all resources it holds.
Current Behavior
When a process gets killed (SIGKILL) in the middle of a memhub call, it leaves behind an active semaphore. The next process trying to use memhub gets stuck.
Steps to Reproduce (for bugs)
memhub_read right after the call to memsvc_read
memsvc_read (e.g. through an RPC call)
kill -9 it
memsvc_read again
Possible Solution
Since this is caused by a caveat in the semaphores API, one has to find another way to synchronize multiple processes. There are basically two ways apart from semaphores:
A pthread mutex in a shared memory region. This would be hard to get right for our use case.
A lock file (/tmp/memhub.lock) used with the advisory locking API. Advisory file locks are released when a process is killed. I have a working prototype.
Context
Want to avoid a CTP7 getting stuck and requiring manual intervention.
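The claim under Possible Solution that advisory file locks are released when the holder is killed can be checked with a short experiment. This is only a sketch for a Linux system using `flock`; it is not the prototype mentioned above, and `try_lock` and `lock_state_around_sigkill` are hypothetical names.

```python
import fcntl
import os
import signal


def try_lock(fd):
    """Return True if an exclusive lock could be taken without blocking."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False


def lock_state_around_sigkill(path):
    """Fork a child that holds the lock, SIGKILL it, and probe the lock.

    Returns (free_while_child_alive, free_after_kill).
    """
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: take the lock, signal readiness, then wait to be killed.
        try:
            os.close(ready_r)
            fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
            fcntl.flock(fd, fcntl.LOCK_EX)
            os.write(ready_w, b"1")
            while True:
                signal.pause()
        finally:
            os._exit(1)

    # Parent: wait until the child holds the lock, then probe it.
    os.close(ready_w)
    os.read(ready_r, 1)
    os.close(ready_r)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        free_while_alive = try_lock(fd)
        os.kill(pid, signal.SIGKILL)
        os.waitpid(pid, 0)  # the kernel has closed the child's descriptors by now
        free_after_kill = try_lock(fd)
    finally:
        os.close(fd)
    return free_while_alive, free_after_kill
```

The lock belongs to the child's open file description, which the kernel closes when the process dies, so no cleanup code has to run in the killed process.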
Your Environment