uft-cluster-config hangs #51

Open
exarkun opened this issue Nov 9, 2015 · 8 comments

Comments

@exarkun

exarkun commented Nov 9, 2015

$ uft-flocker-config cluster.yml
Initialized cluster CA.
Created control cert.
Generated fc43dc40-8a58-41c8-a664-05405f1e073b for ....
Generated 4c763764-834b-42a8-917c-9bac4817c02d for ....
Generated f96c7b9f-b7e2-46d1-b66b-33f62c3e568e for ....
Created user key for coreuser
Making /etc/flocker directory on all nodes
Uploading keys to respective nodes:
 * Uploaded control cert & key to control node.
 * Uploaded cluster cert to ....
 * Uploaded agent.yml to ....
 * Uploaded node key to ....
 * Uploaded node key to ....
 * Uploaded node crt to ....
 * Uploaded node crt to ...
 * Uploaded cluster cert to ....
 * Uploaded agent.yml to ....
 * Uploaded node key to ....
 * Uploaded cluster cert to ....
 * Uploaded control key to ....
 * Uploaded control crt to ....
 * Uploaded cluster cert to ....
 * Uploaded agent.yml to ....
 * Uploaded node crt to ....

After the last line, many minutes pass with no further progress. The ssh process being run by the uft-cluster-config container is blocked on select().

@wallnerryan
Contributor

Does this happen every time you run it? Just curious: I haven't come across this issue, and I ran it as recently as last night.

@exarkun
Author

exarkun commented Nov 9, 2015

If I try to run it again it fails with an error very quickly:

$ uft-flocker-config cluster.yml
Error: Unable to write certificate file. File exists /pwd/cluster.crt
main function encountered error
Traceback (most recent call last):
  File "/opt/flocker/bin/flocker-config", line 9, in <module>
    load_entry_point('UnofficialFlockerTools==0.5', 'console_scripts', 'flocker-config')()
  File "/opt/flocker/lib/python2.7/site-packages/unofficial_flocker_tools/config.py", line 147, in _main
    react(main, sys.argv[1:])
  File "/opt/flocker/lib/python2.7/site-packages/twisted/internet/task.py", line 875, in react
    finished = main(_reactor, *argv)
  File "/opt/flocker/lib/python2.7/site-packages/twisted/internet/defer.py", line 1253, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "/opt/flocker/lib/python2.7/site-packages/twisted/internet/defer.py", line 1107, in _inlineCallbacks
    result = g.send(result)
  File "/opt/flocker/lib/python2.7/site-packages/unofficial_flocker_tools/config.py", line 21, in main
    c.run("flocker-ca initialize %s" % (c.config["cluster_name"],))
  File "/opt/flocker/lib/python2.7/site-packages/unofficial_flocker_tools/utils.py", line 208, in run
    result = subprocess.check_output(command, shell=True)
  File "/usr/lib/python2.7/subprocess.py", line 573, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command 'flocker-ca initialize jp_coreos_flocker_testing' returned non-zero exit status 1

Perhaps I'll move all the generated authentication files aside and see if that lets me run it again...

@exarkun
Author

exarkun commented Nov 9, 2015

If I do that then I get a different error:

Retrying running ['-o', 'LogLevel=error', '-o', 'UserKnownHostsFile=/dev/null', '-o', 'StrictHostKeyChecking=no', '-i', '/host/tmp/flocker-coreos/jean-paulcalderoneinsecure-temporary.pem', 'root@....', "bash -c 'echo; echo\necho > /tmp/flocker-command-log\ndocker run --restart=always -d --net=host --privileged \\\n    -v /etc/flocker:/etc/flocker \\\n    -v /var/run/docker.sock:/var/run/docker.sock \\\n    --name=flocker-container-agent \\\n    clusterhq/flocker-container-agent\ndocker run --restart=always -d --net=host --privileged \\\n    -e DEBUG=1 \\\n    -v /tmp/flocker-command-log:/tmp/flocker-command-log \\\n    -v /flocker:/flocker -v /:/host -v /etc/flocker:/etc/flocker \\\n    -v /dev:/dev \\\n    --name=flocker-dataset-agent \\\n    clusterhq/flocker-dataset-agent\n'"] on .... given result 'Process exited with error code 1: \n\nError response from daemon: Conflict. The name "flocker-container-agent" is already in use by container e296946b20e2. You have to delete (or rename) that container to be able to reuse that name.\nError response from daemon: Conflict. The name "flocker-dataset-agent" is already in use by container 09b2ba7ae2e4. You have to delete (or rename) that container to be able to reuse that name.\n'...
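The conflict above comes from the containers named flocker-container-agent and flocker-dataset-agent surviving the first run. One possible way to make the start step repeatable is to remove any container with the same name before starting a new one. A minimal sketch, assuming a hypothetical helper `idempotent_docker_run` that is not part of unofficial_flocker_tools:

```python
def idempotent_docker_run(name, image, extra_args=""):
    """Build a shell snippet that removes a stale container, then starts a new one.

    Hypothetical helper for illustration; not part of unofficial_flocker_tools.
    """
    # `docker rm -f` fails when no such container exists, so its exit
    # status is ignored with `|| true`.
    remove = "docker rm -f %s >/dev/null 2>&1 || true" % (name,)
    run = "docker run --restart=always -d --net=host --privileged %s --name=%s %s" % (
        extra_args, name, image)
    return remove + "\n" + run


print(idempotent_docker_run(
    "flocker-container-agent",
    "clusterhq/flocker-container-agent",
    "-v /etc/flocker:/etc/flocker -v /var/run/docker.sock:/var/run/docker.sock",
))
```

Note that `docker rm -f` also stops a running agent, so this trades the name conflict for a brief agent restart on every re-run.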

@wallnerryan
Contributor

Yeah, this is an issue because it doesn't clean up certs on re-run. +1 to making it re-runnable. This relates to, or may duplicate, #42.
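As a rough illustration of what "re-runnable" could mean for the first step, the sketch below skips `flocker-ca initialize` when a `cluster.crt` is already present, instead of failing with "File exists". The function name `initialize_ca_once` and its `directory` parameter are assumptions made for this example, not the tool's actual code, and the remaining steps would need similar guards.

```python
import os
import subprocess


def initialize_ca_once(cluster_name, directory="."):
    """Run `flocker-ca initialize` only if no cluster certificate exists yet.

    Hypothetical helper. Returns True if the command was run and False
    if it was skipped because cluster.crt was already there.
    """
    cert_path = os.path.join(directory, "cluster.crt")
    if os.path.exists(cert_path):
        print("Found %s, skipping CA initialization." % (cert_path,))
        return False
    # Same command as in the traceback above, passed as an argument list
    # so no shell is involved.
    subprocess.check_call(["flocker-ca", "initialize", cluster_name], cwd=directory)
    return True
```

Skipping assumes the existing certificate belongs to the same cluster; a stricter version might compare the cluster name before reusing it.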

@exarkun
Author

exarkun commented Nov 9, 2015

Nevertheless, uft-flocker-volumes now succeeds... Is the config command just expected to take a long time and not generate any progress updates?

@wallnerryan
Contributor

Yeah, it can potentially take a little while. I've seen it hang before, but not for long (under 1 to 2 minutes). It would be nice to have progress bars or updates.
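One lightweight form of progress update is to announce each step before it starts and report how long it took. The sketch below is only an illustration: `run_steps` and the `(description, callable)` step format are made up for this example and do not reflect how the config command is structured internally.

```python
import time


def run_steps(steps, out=print):
    """Run (description, callable) pairs in order, reporting each one.

    Hypothetical helper: prints a line before and after every step so a
    stalled run shows which step it is stuck on.
    """
    total = len(steps)
    results = []
    for index, (description, action) in enumerate(steps, 1):
        out("[%d/%d] %s ..." % (index, total, description))
        started = time.time()
        results.append(action())
        elapsed = time.time() - started
        out("[%d/%d] %s done in %.1fs" % (index, total, description, elapsed))
    return results
```

With output like this, a hang after the last "Uploaded" line would at least leave the name of the pending step on screen.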

@lukemarsden
Contributor

I've seen it hang like this before. Maybe if we try and scp at just the wrong time, TCP screws us. We could add timeouts on each network operation.
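A sketch of what a per-operation timeout could look like, assuming Python 3's `subprocess` module (the traceback above is from Python 2.7 with Twisted, so the real tool would need a different mechanism). `run_with_timeout` and `SSH_TIMEOUT_OPTIONS` are hypothetical names; the `-o` values are standard OpenSSH client options, with numbers chosen arbitrarily here.

```python
import subprocess


def run_with_timeout(argv, timeout_seconds=60, attempts=3):
    """Run a command, killing and retrying it if it exceeds the timeout.

    Hypothetical helper. Retrying is only safe for operations that can
    be repeated, such as re-uploading the same file.
    """
    for attempt in range(1, attempts + 1):
        try:
            # check_output kills the child and raises TimeoutExpired
            # when the timeout elapses.
            return subprocess.check_output(argv, timeout=timeout_seconds)
        except subprocess.TimeoutExpired:
            print("Attempt %d/%d timed out after %ss: %s"
                  % (attempt, attempts, timeout_seconds, argv[0]))
    raise RuntimeError("%s did not finish within %ss in %d attempts"
                       % (argv[0], timeout_seconds, attempts))


# OpenSSH options that bound connection setup and detect a dead peer,
# so ssh/scp give up instead of blocking indefinitely.
SSH_TIMEOUT_OPTIONS = [
    "-o", "ConnectTimeout=10",
    "-o", "ServerAliveInterval=15",
    "-o", "ServerAliveCountMax=3",
]
```

The ssh options cover a stalled connection on their own; the outer timeout is a backstop for anything else that blocks.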

@lukemarsden
Contributor

Also, +1 for fixing #42
