Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot cleanly restart a Child task without restarting autopilot #168

Open
Rodgers-PAC-Lab opened this issue Mar 24, 2022 · 5 comments
Open
Labels

Comments

@Rodgers-PAC-Lab
Copy link

Rodgers-PAC-Lab commented Mar 24, 2022

For a while I've been using a Parent Pi connected to multiple Child Pi (as discussed here: #101). This works, but with the annoying problem that the autopilot process on each Child has to be manually quit and restarted between every session, or otherwise we get a ZMQError. I think the Child is not stopping correctly.

Full code for a minimal example (but see also relevant excerpts below if that is easier):
The Parent task: https://github.com/Rodgers-PAC-Lab/autopilot/blob/paft2022/autopilot/tasks/paft_parent_child.py
The Child task: https://github.com/Rodgers-PAC-Lab/autopilot/blob/b2105803b7bd4e36e39ac1a6f662ec93f0c85c96/autopilot/tasks/children.py#L35

Here is an excerpt of the code in init for the Parent task. The first Net_Node tells the Child to start, and the second Net_Node is used for bidirectional communication with the Child.

        self.node = Net_Node(id="T_{}".format(prefs.get('NAME')),
            upstream=prefs.get('NAME'),
            port=prefs.get('MSGPORT'),
            listens={},
            instance=False,
            )
        value = {
            'child': {
                'parent': prefs.get('NAME'), 'subject': subject},
            'task_type': 'PAFT_Child',
            'subject': subject,
            'reward': reward,
        }

        # send to the station object with a 'CHILD' key
        self.node.send(to=prefs.get('NAME'), key='CHILD', value=value)
        
        ## Create a second node to communicate with the child
        # We (parent) are the "router"/server
        # The children are the "dealer"/clients
        # Many dealers, one router        
        self.node2 = Net_Node(
            id='parent_pi',
            upstream='',
            port=5000,
            router_port=5001,
            listens={
                'HELLO': self.recv_hello,
                'POKE': self.recv_poke,
                },
            instance=False,
            )

Here is the code in the init for the Child task. This Net_Node is used to communicate with the Net_Node of the same name in the Parent.

        self.node2 = Net_Node(
            id=self.name,
            upstream='parent_pi',
            port=5001,
            upstream_ip=prefs.get('PARENTIP'),
            listens={
                'HELLO': self.recv_hello,
                'END': self.recv_end,
                },
            instance=False,
            )        
        
        # Send
        self.node2.send(
            'parent_pi', 'HELLO', {'from': self.name})

All of the above actually works! However, if we stop the session and start a new one, then the Parent generates an exception when we try to create self.node2 again: zmq.error.ZMQError: Address already in use

I've come up with two semi-fixes, neither of which really fixes the whole problem. First, to fix ZMQError, I have this line self.node2.router.close() in the end() function of the Parent task. After this fix, the ZMQError stops happening. Note that it's not sufficient to have self.node2.release(), so perhaps closing the router needs to be added as a step in releasing a Net_Node.

Second, I explicitly tell the Child class to end, by adding self.node.send(to=prefs.get('NAME'), key='CHILD', value={'KEY': 'STOP'}) in the end function of the Parent task. Does this look right? I think this is probably a good thing to do, though I don't see it in the example GoNoGo/Wheel_Child code.

Nonetheless, even after these fixes, I haven't yet succeeded in starting a Child task again, without rebooting the autopilot process. Any tips would be much appreciated!!

PS - I think we could also test/debug this in the GoNoGo/Wheel_Child standard task, but that one is not running for me, not sure if I am missing some hardware or what. It says something about a KeyError 'F' in initting some gpio pins.

@Rodgers-PAC-Lab
Copy link
Author

success?? I think that adding self.node.sock.close() in both the Parent and Child end() functions might fix this.
Rodgers-PAC-Lab@ec78bc9

along with the aforementioned self.node.router.close(), which is only needed in the Parent, for the special Net_Node that was initialized with a router_port.
Rodgers-PAC-Lab@a10c0bd

It seems perhaps there are some dangling sockets left open, and this is causing problems on subsequent session starts. I'll continue testing and if this works I would probably suggest adding self.node.sock.close() and if self.node.router: self.node.router.close() within Net_Node.release here: https://github.com/wehr-lab/autopilot/blob/90956187d4222f16f67ab8b39b8359da954d5dcc/autopilot/networking/node.py#L642

@sneakers-the-rat
Copy link
Contributor

Awesome! yes i have been super sloppy with closing stuff, agree this is super annoying, will revisit after this current release

@sneakers-the-rat
Copy link
Contributor

Figured out a reliable way to kill the IOLoop objects that keep the networking objects open. It's sort of a careful dance of trying to hold threads and processes open with a blocking IOLoop.start() call but then interact with them through callbacks launched by the on_recv callback of a ZMQSteam which are in the same process but different thread. Most of the operations of the IOLoop aren't thread safe, in fact the only one that is is the add_callback method. So by adding a callback that does this

self.loop.add_callback(lambda:IOLoop.instance().stop())

you can reliably kill them. You're right, also need to close the sockets. I'll work on cleaning this up in v0.5.0, which i'm preparing now.

@cxrodgers
Copy link
Contributor

a few other things I discovered that seem to be important for cleanly shutting down, at least over here. I haven't extensively tested them yet, but I'm just mentioning it now, in case it rings a bell. Now I can cleanly shut down the terminal, and I never have to restart the autopilot process on the Pi between sessions, which wasn't the case for me before in our setup. I can test these out and make PRs after 0.5.0 too.

@sneakers-the-rat
Copy link
Contributor

yes i just arrived at the same changes ;). working my way through some more of the eternal technical debt, thanks for the pointers as well, very helpful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants