-
Notifications
You must be signed in to change notification settings - Fork 343
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle connection resets in the status server more gracefully #5881
base: master
Are you sure you want to change the base?
Handle connection resets in the status server more gracefully #5881
Conversation
Connection resets due to network instability can lead to the status server not catching a test status, an asyncio error like full.log│2024-03-08 03:08:44,053 asyncio base_events L1744 ERROR| Task exception was never retrieved │ full.log│future: <Task finished name='Task-2038' coro=<StatusServer.cb() done, defined at /usr/lib/python3.10/site-packages/avocado/core/status/server.py:51> exception=ConnectionResetError(104, 'Connection reset by peer')> ├ full.log│Traceback (most recent call last): │ full.log│ File "/usr/lib/python3.10/site-packages/avocado/core/status/server.py", line 53, in cb │ full.log│ raw_message = await reader.readline() │ full.log│ File "/usr/lib64/python3.10/asyncio/streams.py", line 525, in readline │ full.log│ line = await self.readuntil(sep) │ full.log│ File "/usr/lib64/python3.10/asyncio/streams.py", line 617, in readuntil │ full.log│ await self._wait_for_data('readuntil') │ full.log│ File "/usr/lib64/python3.10/asyncio/streams.py", line 502, in _wait_for_data │ full.log│ await self._waiter │ full.log│ File "/usr/lib64/python3.10/asyncio/selector_events.py", line 854, in _read_ready__data_received │ full.log│ data = self._sock.recv(self.max_size) │ full.log│ConnectionResetError: [Errno 104] Connection reset by peer and worst yet to test tasks hanging indefinitely without the job ever completing properly. This was mostly observed in cases of LXC and remote spawner isolation where the isolated task process completes but the task on the side of the task machine remains unfinished. Signed-off-by: Plamen Dimitrov <[email protected]>
@richtja @clebergnu Some connection resets might also be caused by overloaded IO (e.g. internal on-host network and UNIX sockets cases of status server communication) and I have observed some tests spending 10-20 minutes logging a message which means that they are using regular blocking logging via Python's FileHandler. I was wondering if buffering could help here with something like diff --git a/avocado/core/output.py b/avocado/core/output.py
index da61df1d8..8dda0c565 100644
--- a/avocado/core/output.py
+++ b/avocado/core/output.py
@@ -705,6 +705,12 @@ def add_log_handler(
if handler_filter:
handler.addFilter(handler_filter)
+ if klass == logging.FileHandler:
+ from logging import handlers
+ buffered_wrapper = handlers.MemoryHandler(capacity=100)
+ buffered_wrapper.setTarget(handler)
+ handler = buffered_wrapper
+
logger.addHandler(handler)
save_handler(logger.name, handler)
return handler but I am not sure if this change was effective at all in the current avocado logging logic since it seems to store handler configuration and other additional operations. Do you think that some buffering for the logging could be useful when implemented appropriately? Or do you have any other ideas what might cause such blocking in the tests since I have observed multuple VT tests waiting for extensive periods (and was correlated with the host system being under heavy load). |
Some interesting further findings: it seems the above will need more handling on the client side since (right now just suspected but not fully confirmed) I get cases where the VT tests would complete and the parent task process will only wait for the child task process:
with the test debug log indicating completion, while at the same time the child task process will not fully complete and hang at a step when writing to a pipe:
with the anonymous pipe used for logging:
Worse yet, if the task hangs like this, in some cases another task will be started and deadlock on some shared resources like
with
so we really need to tell the client side to retry sending the log message or best yet understand what could be causes for such connection resets. |
Hi @pevogam, thanks for sharing these findings. The first thing we need is some assertive reproducer for this issue. I know it can be tricky, but without one, we can't really be sure we're adding reliability into this. Now, there have been talks and ideas of making most (or all) of the Is that something you think would help? |
Oh, and about the specific changes you're proposing here, they look to be a reliability improvement indeed. I was verifying whether other kinds of exceptions should be treated the same away here, but |
I guess the best we can do to reach this is to have lots of VT tests log in parallel and increase the logging throughput so much that some logging inevitably starts hanging as shown above. You are right that this would be pretty hard and require a lot of hardware (that we use as we run avocado VT tests in a cluster nowadays). Perhaps I could gather at least some hints on good directions to investigate first as I can reproduce this quite easily there and write back with outcomes here? In particular, what are you thoughts on causes that could explain why a task would hang or sometimes spend 20+ minutes logging a message? I assume it has to be the blocking nature of logging in python, do you second my suggestion here? If so doesn't blocking logging in avocado defeat some of the idea of parallelism so far?
That would definitely help. However, one problem from my attempt to add at least some quick and dirty buffering in #5881 (comment) was that it didn't really speed up the runs too much indicating that it might not be the bottleneck. Nevertheless, it might remedy the situation with the connection resets.
The problem from my last post above is that this only fixes half of the situation as it seems the tasks that log at that moment need some sort of reset or retry too. Otherwise I expect they might hang at the very end of the task as described in #5881 (comment). |
Details on hanging tasks deadlocking over the same vm lock resources are described in: avocado-framework#5881 (comment) Signed-off-by: Plamen Dimitrov <[email protected]>
The temporary solution I use right now for both LXC container and distributed runs is: diff --git a/avocado/plugins/spawners/lxc.py b/avocado/plugins/spawners/lxc.py
index 13bc05b0e..ac9b6a8b5 100644
--- a/avocado/plugins/spawners/lxc.py
+++ b/avocado/plugins/spawners/lxc.py
@@ -253,6 +253,11 @@ class LXCSpawner(Spawner, SpawnerMixin):
LOG.info(f"Container state: {container.state}")
LOG.info(f"Container ID: {container_id} PID: {container.init_pid}")
+ exitcode, output, err = LXCSpawner.run_container_cmd(
+ container, ["pkill", "-f", "task-run"]
+ )
+ if exitcode == 0:
+ LOG.warning(output)
exitcode, output, err = await LXCSpawner.run_container_cmd_async(
container, entry_point_args
)
diff --git a/optional_plugins/spawner_remote/avocado_spawner_remote/__init__.py b/optional_plugins/spawner_remote/avocado_spawner_remote/__init__.py
index 89a72d8b1..0cb60a419 100644
--- a/optional_plugins/spawner_remote/avocado_spawner_remote/__init__.py
+++ b/optional_plugins/spawner_remote/avocado_spawner_remote/__init__.py
@@ -170,6 +170,9 @@ class RemoteSpawner(Spawner, SpawnerMixin):
)
return False
+ status, output = session.cmd_status_output("pkill -f task-run")
+ if status == 0:
+ LOG.warning(output)
cmd = shlex.join(entry_point_args) + " > /dev/null"
timeout = self.config.get("spawner.remote.test_timeout")
status, output = await RemoteSpawner.run_remote_cmd_async(session, cmd, timeout) since the only cases of hanging tasks are right at the end when they are already complete and have even reported PASS status to the status server. Another small finding is that this could happen also in a less logging/IO intense test jobs but in rarer occasions. |
This introduces an option that adds the buffering layer on top of the file logging handler. Plamen Dimitrov has reported issues with heavy logging causing the status server/client connection to reset. He has suggested and has been using a workaround like this. Reference: avocado-framework#5881 Signed-off-by: Cleber Rosa <[email protected]>
This introduces an option that adds the buffering layer on top of the file logging handler. Plamen Dimitrov has reported issues with heavy logging causing the status server/client connection to reset. He has suggested and has been using a workaround like this. Reference: avocado-framework#5881 Signed-off-by: Cleber Rosa <[email protected]>
Hi @clebergnu unfortunately I wasn't able to test this today due to further breakage from Avocado VT lacking a 103.0 LTS release. I will look for a workaround likely tomorrow or by the end of the week. |
I tried one workaround to run with newer avocado official tags and just the branches of this change but now it seems Avocado VT does not have the 105.0 tag and the 104.0 is out of sync with Avocado's cpu utils ( |
Just a short note, I tested with a fix between the compatibility with Avocado VT (and the newly introduced buffered logging) and unfortunately the hanging at the end of a VT test still happens. In other words, we still need this patch to make running avocado with multiple parallel VT containerized tests possible. |
Actually I think I am wrong the hanging was due to an irrelevant issue. However, this change is still good and adds a bit of safety IMO. Let me know what you think. |
Connection resets due to network instability can lead to the status server not catching a test status, an asyncio error like
and worst yet to test tasks hanging indefinitely without the job ever completing properly. This was mostly observed in cases of LXC and remote spawner isolation where the isolated task process completes but the task on the side of the task machine remains unfinished.