[FLOC-1194] Add forceable container removal if needed #1753

Open · wants to merge 5 commits into master · showing changes from 2 commits
4 changes: 2 additions & 2 deletions flocker/node/_docker.py
@@ -546,7 +546,7 @@ def _blocking_container_runs(self, container_name):
        ).write()
        return result['State']['Running']

-    def remove(self, unit_name):
+    def remove(self, unit_name, force=False):
Contributor:

This method is an implementation of an interface method, IDockerClient.remove.

  • The interface needs to be updated similarly (a sketch follows this list).
  • The other implementation needs to be updated similarly.
  • The interface-based test suite factory, make_idockerclient_tests, needs to have tests for the new functionality added.
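
For the first bullet, a minimal sketch of the corresponding interface change, assuming the existing zope.interface declaration of IDockerClient; the docstring wording here is illustrative, not taken from this branch:

from zope.interface import Interface


class IDockerClient(Interface):
    """
    A client for the Docker API (existing interface, abridged).
    """

    def remove(unit_name, force=False):
        """
        Stop and delete the given unit.

        :param unicode unit_name: The name of the unit to stop.
        :param bool force: If True, remove the container even if it is
            still running; Docker kills it first, as with ``docker rm -f``.
        :return: ``Deferred`` that fires on success.
        """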

        container_name = self._to_container_name(unit_name)

        def _remove():
@@ -608,7 +608,7 @@ def _remove():
                message_type="flocker:docker:container_remove",
                container=container_name
            ).write()
-            self._client.remove_container(container_name)
+            self._client.remove_container(container_name, force=force)
            Message.new(
                message_type="flocker:docker:container_removed",
                container=container_name
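For reference, the docker-py call being forwarded to here is Client.remove_container, whose force flag kills a running container before removing it (the equivalent of docker rm -f). A minimal usage sketch, assuming docker-py's pre-2.0 Client API and the conventional Unix socket; the container name is illustrative:

from docker import Client  # docker-py

docker_client = Client(base_url=u'unix://var/run/docker.sock')
# force=True: SIGKILL the container if it is still running, then remove it.
docker_client.remove_container(u'flocker--example-unit', force=True)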
47 changes: 45 additions & 2 deletions flocker/node/functional/test_docker.py
@@ -99,7 +99,7 @@ def start_container(self, unit_name,
                        environment=None, volumes=(),
                        mem_limit=None, cpu_shares=None,
                        restart_policy=RestartNever(),
-                        command_line=None):
+                        command_line=None, force=False):
        """
        Start a unit and wait until it reaches the `active` state or the
        supplied `expected_state`.
@@ -114,6 +114,8 @@
        :param cpu_shares: See ``IDockerClient.add``.
        :param restart_policy: See ``IDockerClient.add``.
        :param command_line: See ``IDockerClient.add``.
+        :param bool force: Force container removal via SIGKILL
+            during cleanup.

        :return: ``Deferred`` that fires with the ``DockerClient`` when
            the unit reaches the expected state.
@@ -130,14 +132,54 @@
            restart_policy=restart_policy,
            command_line=command_line,
        )
-        self.addCleanup(client.remove, unit_name)
+        self.addCleanup(client.remove, unit_name, force)

        d.addCallback(lambda _: wait_for_unit_state(client, unit_name,
                                                    expected_states))
        d.addCallback(lambda _: client)

        return d

    def test_force_remove(self):
        """
        ``DockerClient.remove`` can forcefully remove a running container
        when the ``force`` parameter is set to True.
        """

Contributor:

Something to note about this test is that, though it does exercise the new force parameter, it exercises a slightly different part of its behavior than we're most interested in. "Remove a running container" isn't quite "avoid the internal server error that sometimes happens because 'Device is busy'". It could be that force=True fixes this as well as providing its other (documented) behavior, but we haven't demonstrated that with this test.

I skimmed the Docker issue and didn't notice the section that suggests force=True as a work-around for this. A reference to that suggestion, as well as to the issue as a whole, somewhere in this code (perhaps near the callers that pass force=True to remove) would be nice; a sketch of such a reference follows.
        client = DockerClient()
        unit_name = u'test-force-remove'
        expected_states = (u'active',)
        d = client.add(
            unit_name=unit_name,
            image_name=u'clusterhq/flask'
        )

Contributor:

This image and flask are incidental complexity for this test. The detail of which container to run doesn't really matter to what the test is concerned with testing. All of the code from Flask that will run when this test runs is likewise irrelevant. Depending on the execution environment of the test suite, it's also possible this will result in a Docker image being pulled from Dockerhub.

The incidental complexity increases the maintenance cost of the test method. The Dockerhub dependency increases the chance that this will fail intermittently due to transient network or Dockerhub conditions.

A way to address the incidental complexity is to contain it and then isolate it from this particular test: add an attribute or method to the class responsible for giving back the name of an image that has the desired properties, which in this case I think means an image that runs indefinitely.

A way to address the external dependency is to build a custom image for this test as we do elsewhere. I think a busybox-based container that runs forever should be pretty straightforward; a sketch of both ideas follows.
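
A minimal sketch of both suggestions, assuming a module-level helper and the command_line parameter of add seen earlier in this diff; the names and image choice are illustrative:

# An image with no application dependencies that runs until killed; a
# busybox-based image built at test time would avoid the Dockerhub pull
# entirely, as we do for other custom test images.
LONG_RUNNING_IMAGE = u'busybox'
LONG_RUNNING_COMMAND = [u'sh', u'-c', u'while true; do sleep 0.1; done']


def add_long_running_unit(client, unit_name):
    """Start a container that stays running until forcibly removed."""
    return client.add(
        unit_name=unit_name,
        image_name=LONG_RUNNING_IMAGE,
        command_line=LONG_RUNNING_COMMAND,
    )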
        d.addCallback(
            lambda _: wait_for_unit_state(
                client, unit_name, expected_states
            )
        )

        def check_is_listed(_):
            listed = client.list()

            def check_unit(results):
                self.assertTrue(unit_name in [unit.name for unit in results])
Contributor:

assertIn(unit_name, [unit.name for unit in results])

This provides a better failure message because it has both the "needle" and the "haystack". assertTrue just has True or False.

            listed.addCallback(check_unit)
            return listed

        d.addCallback(check_is_listed)
        d.addCallback(lambda _: client.remove(unit_name, force=True))

        def check_is_removed(_):
            listed = client.list()

            def check_unit(results):
                self.assertFalse(unit_name in [unit.name for unit in results])
Contributor:

Likewise.

            listed.addCallback(check_unit)
            return listed

        d.addCallback(check_is_removed)
        return d

    def test_default_base_url(self):
        """
        ``DockerClient`` instantiated with a default base URL for a socket
@@ -356,6 +398,7 @@ def image_built(image_name):
                unit_name=unit_name,
                image_name=image_name,
                environment=Environment(variables=expected_variables),
+                force=True,
Contributor:

If the only real change in behavior is in the test suite then, hopefully, our test suite will stop failing spuriously with this problem. What about the actual implementation, though? If this failure can strike test code that's removing a container, it seems plausible it could strike StopApplication when in use on a real cluster.

Is there a reason to think this isn't the case? Or does this problem eventually solve itself so that, as long as you're retrying (as the looping approach of the convergence agent will do) the removal does eventually succeed?

Contributor:

Also, what about other tests that create and then remove containers (i.e., all the other tests using start_application)?

Couldn't they all also fail with this error? The error doesn't seem particularly specific to containers with environment variables at first glance. If any test using start_container could encounter this problem, it would probably make more sense to make force removal the behavior start_container always implements, rather than adding the parameter and updating all the callers.

Contributor (author):

> Also, what about other tests that create and then remove containers (i.e., all the other tests using start_application)?
>
> Couldn't they all also fail with this error?

Plausibly, yes - but FLOC-1194 is only concerned with the environment variables test and is a "fix flaky test" category of issue. So it probably makes sense not to alter the behaviour of other tests via this branch's Client change until they, too, are identified as intermittently failing on the same device busy error.

Contributor (author):

> Is there a reason to think this isn't the case?

No; it is plausible this error could strike StopApplication on a real cluster, because the precise cause(s) of this Docker error are not known. Some causes are known, specific to certain combinations of platforms, images, and Docker versions, but I have thus far not found a way of reliably reproducing the issue.

> Or does this problem eventually solve itself so that, as long as you're retrying (as the looping approach of the convergence agent will do), the removal does eventually succeed?

Maybe; I've seen at least one anecdotal report that the host running Docker had to be rebooted before the problem cleared, but again this seems to vary.

There's not even a guarantee that force removal via Docker's API prevents the device busy error; I've seen reports of it occurring even with the -f switch supplied on Docker's CLI. It would be nice to write a test that reliably reproduced the problem and verified that forced removal circumvents it, but I don't have enough information to do that yet. Within the scope of this ticket, the approach is essentially: based on discussions of this error in historical Docker issue listings, add force removal capability to our Docker API client, tell the intermittently failing test to use it, and see whether the failure recurs. Not the most robust solution, I know, but it might help Buildbot stay green, and as long as all the tests that were passing remain so (along with the new tests in this branch), we can at least be sure this approach hasn't made anything less reliable.
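
Illustrative only, not code from this branch: one shape the retry-plus-force fallback discussed above could take, assuming a Twisted-style client whose remove method returns a Deferred:

def remove_with_force_fallback(client, unit_name, retries=3):
    """
    Try a normal removal first; on failure (e.g. a 'Device is busy'
    internal server error), retry with force=True a bounded number of
    times, mirroring what a looping convergence agent would do.
    """
    d = client.remove(unit_name)

    def _force_retry(failure, remaining):
        if remaining <= 0:
            return failure  # give up and propagate the error
        retry = client.remove(unit_name, force=True)
        retry.addErrback(_force_retry, remaining - 1)
        return retry

    d.addErrback(_force_retry, retries)
    return d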

            )
        d.addCallback(image_built)
