Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLOC-1194] Add forceable container removal if needed #1753

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 5 additions & 4 deletions flocker/node/_docker.py
Original file line number Diff line number Diff line change
Expand Up @@ -208,13 +208,14 @@ def exists(unit_name):
otherwise ``False``.
"""

def remove(unit_name):
def remove(unit_name, force):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have thought zope.verify would have caught this, but the signature in the interface should match the signature in the implementation, so the interface ought to have force=False.

"""
Stop and delete the given unit.

This can be done multiple times in a row for the same unit.

:param unicode unit_name: The name of the unit to stop.
:param bool force: Force removal of the unit.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And add a note about the default behaviour here.


:return: ``Deferred`` that fires once the unit has been stopped
and removed.
Expand Down Expand Up @@ -271,7 +272,7 @@ def add(self, unit_name, image_name, ports=frozenset(), environment=None,
def exists(self, unit_name):
return succeed(unit_name in self._units)

def remove(self, unit_name):
def remove(self, unit_name, force=False):
if unit_name in self._units:
del self._units[unit_name]
return succeed(None)
Expand Down Expand Up @@ -546,7 +547,7 @@ def _blocking_container_runs(self, container_name):
).write()
return result['State']['Running']

def remove(self, unit_name):
def remove(self, unit_name, force=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is an implementation of an interface method, IDockerClient.remove.

  • The interface needs to be updated similarly.
  • The other implementation needs to be updated similarly.
  • The interface-based test suite factory, make_idockerclient_tests, needs to have tests for the new functionality added.

container_name = self._to_container_name(unit_name)

def _remove():
Expand Down Expand Up @@ -608,7 +609,7 @@ def _remove():
message_type="flocker:docker:container_remove",
container=container_name
).write()
self._client.remove_container(container_name)
self._client.remove_container(container_name, force=force)
Message.new(
message_type="flocker:docker:container_removed",
container=container_name
Expand Down
60 changes: 58 additions & 2 deletions flocker/node/functional/test_docker.py
Original file line number Diff line number Diff line change
Expand Up @@ -99,7 +99,7 @@ def start_container(self, unit_name,
environment=None, volumes=(),
mem_limit=None, cpu_shares=None,
restart_policy=RestartNever(),
command_line=None):
command_line=None, force=False):
"""
Start a unit and wait until it reaches the `active` state or the
supplied `expected_state`.
Expand All @@ -114,6 +114,8 @@ def start_container(self, unit_name,
:param cpu_shares: See ``IDockerClient.add``.
:param restart_policy: See ``IDockerClient.add``.
:param command_line: See ``IDockerClient.add``.
:param bool force: Force container removal via SIGKILL
during cleanup.

:return: ``Deferred`` that fires with the ``DockerClient`` when
the unit reaches the expected state.
Expand All @@ -130,14 +132,67 @@ def start_container(self, unit_name,
restart_policy=restart_policy,
command_line=command_line,
)
self.addCleanup(client.remove, unit_name)
self.addCleanup(client.remove, unit_name, force)

d.addCallback(lambda _: wait_for_unit_state(client, unit_name,
expected_states))
d.addCallback(lambda _: client)

return d

def test_force_remove(self):
"""
``DockerClient.remove`` can forcefully remove a running container
when the ``force`` parameter is set to True.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something to note about this test is that, though it does exercise the new force parameter, it exercises a slightly different part of its behavior than we're most interested in. "Remove a running container" isn't quite "avoid the internal server error that sometimes happens because 'Device is busy'". It could be that force=True does fix this as well as its other (documented) behavior - but we haven't demonstrated it with this test.

I skimmed the Docker issue and didn't notice the section that suggests force=True as a work-around for this. A reference to that suggestion, as well as to the issue as a whole somewhere in this code (perhaps near the callers that pass force=True to remove) would be nice.


This is intended to help avoid an intermittent error in other
tests caused by a Docker behaviour where an "device busy" error
occurs when attempting to remove a container.

In many cases, it is reported that forced removal of the container
via the Docker API or "-f" CLI switch is a work-around that prevents
this error, though is not a solution to the underlying problem.

See https://github.com/docker/docker/issues/9665
See http://blog.hashbangbash.com/2014/11/
docker-devicemapper-fix-for-device-or-resource-busy-ebusy/
See FLOC-1194.
"""
client = DockerClient()
unit_name = u'test-force-remove'
expected_states = (u'active',)
d = client.add(
unit_name=unit_name,
image_name=u"openshift/busybox-http-app"
)
d.addCallback(
lambda _: wait_for_unit_state(
client, unit_name, expected_states
)
)

def check_is_listed(_):
listed = client.list()

def check_unit(results):
self.assertIn(unit_name, [unit.name for unit in results])
listed.addCallback(check_unit)
return listed

d.addCallback(check_is_listed)
d.addCallback(lambda _: client.remove(unit_name, force=True))

def check_is_removed(_):
listed = client.list()

def check_unit(results):
self.assertNotIn(unit_name, [unit.name for unit in results])
listed.addCallback(check_unit)
return listed

d.addCallback(check_is_removed)
return d

def test_default_base_url(self):
"""
``DockerClient`` instantiated with a default base URL for a socket
Expand Down Expand Up @@ -356,6 +411,7 @@ def image_built(image_name):
unit_name=unit_name,
image_name=image_name,
environment=Environment(variables=expected_variables),
force=True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the only real change in behavior is in the test suite then, hopefully, our test suite will stop failing spuriously with this problem. What about the actual implementation, though? If this failure can strike test code that's removing a container, it seems plausible it could strike StopApplication when in use on a real cluster.

Is there a reason to think this isn't the case? Or does this problem eventually solve itself so that, as long as you're retrying (as the looping approach of the convergence agent will do) the removal does eventually succeed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, what about other tests that create and then remove containers (ie, all the other tests using start_application)?

Couldn't they all also fail with this error? The error doesn't seem particularly container-with-environment-variables specific at first glance. If any start_container-using test could encounter this problem, it would probably make more sense to just have force removal be the behavior start_container implements and not bother with the parameter and updating all the callers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, what about other tests that create and then remove containers (ie, all the other tests using start_application)?

Couldn't they all also fail with this error?

Plausibly, yes - but the issue FLOC-1194 is only concerned with the environment variables test and is a "fix flaky test" category issue. So it possibly makes sense to not alter the behaviour of other tests using the Client change in this branch until such time as they are also identified as intermittently failing on the same device busy error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to think this isn't the case?

No, it is plausible this error could strike StopApplication on a real cluster, because the precise cause(s) of this Docker error are not known; some causes are known, specific to certain combinations of platforms, images and Docker versions, but I have thusfar not found a way of reliably reproducing the issue.

Or does this problem eventually solve itself so that, as long as you're retrying (as the looping approach of the convergence agent will do) the removal does eventually succeed?

Maybe; I've seen at least one report (anecdote) that it was necessary for the host running Docker to be rebooted to resolve the problem, but again this seems to vary.

There's not even a guarantee that force removal via Docker's API prevents the device busy error, I've seen reports of it happening even with the -f switch supplied on Docker's CLI. It would be nice if we could write a test that reliably reproduced the problem and tested that forced-removal reliably circumvents it, but I don't have sufficient information at this time to do that. In the scope of this ticket, we're basically going for an approach of "based on discussions of this error found on historical Docker issues listings, add force removal capability to our Docker API client and tell the intermittently failing test to use it, then see if the test failure occurs again in future..." - not the most robust solution, I know, but it might help Buildbot stay green (and as long as all our tests that were passing remain so, along with the new tests in this branch, we can at least be sure that with this approach, we haven't made anything any less reliable).

)
d.addCallback(image_built)

Expand Down
25 changes: 25 additions & 0 deletions flocker/node/test/test_docker.py
Original file line number Diff line number Diff line change
Expand Up @@ -117,6 +117,31 @@ def test_removed_does_not_exist(self):
d.addCallback(self.assertFalse)
return d

def test_forced_removed_does_not_exist(self):
"""
A container removed with the ``force`` switch does not exist.
"""
client = fixture(self)
name = random_name(self)
image = u"openshift/busybox-http-app"
d = client.add(name, image)
d.addCallback(lambda _: client.list())
expected = Unit(
name=name, container_name=name, activation_state=u"active",
container_image=image,
)

def got_units(units):
result = units.pop()
result = result.set("container_name", name)
self.assertEqual(expected, result)

d.addCallback(got_units)
d.addCallback(lambda _: client.remove(name, force=True))
d.addCallback(lambda _: client.exists(name))
d.addCallback(self.assertFalse)
return d

def test_added_is_listed(self):
"""
An added container is included in the output of ``list()``.
Expand Down