ansible.builtin.shell invocations through ansible-runner sometimes do not return documented return values #1381

Closed
ZachOldham opened this issue Jul 22, 2024 · 11 comments

Comments

@ZachOldham

ZachOldham commented Jul 22, 2024

We are observing that when we invoke the ansible.builtin.shell module via ansible_runner.run, the res field of runner_on_ok events is sometimes an empty dict, despite the documentation for the ansible.builtin.shell module claiming that it always returns certain values. We are using ansible-core version 2.16.6 and ansible-runner version 2.3.6.
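
For context, a minimal sketch of the invocation pattern, with an event handler that reads res off the runner_on_ok events (the directory layout and module args here are placeholders, not our real setup):

```python
import ansible_runner

def on_event(event):
    # runner_on_ok is the per-host success event; the module result,
    # when present, lives in event_data["res"].
    if event.get("event") == "runner_on_ok":
        data = event.get("event_data", {})
        res = data.get("res", {})
        print(data.get("host"), res.get("rc"), res.get("stdout"))
    return True  # returning True keeps the event in the artifact stream

r = ansible_runner.run(
    private_data_dir="/tmp/runner",              # assumed directory layout
    host_pattern="all",
    module="ansible.builtin.shell",
    module_args="/usr/local/bin/our-script.sh",  # hypothetical script path
    event_handler=on_event,
)
```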

We use ansible-runner to invoke ansible.builtin.shell to run a script on 15 machines simultaneously. Usually this works without problems: the script succeeds on every machine, and the res dict of each runner_on_ok event contains the fields documented for ansible.builtin.shell. On occasion (maybe 1/4 of the time?) the res dict of the runner_on_ok events for some machines is empty, while it is populated for the rest. Each time this happens, it is the same two machines that have empty res dicts. When we examine the logs, it is clear that the script has also run successfully on those machines, which is consistent with the fact that we get a runner_on_ok event from them as well.

This is extremely confusing: the script runs correctly and we get a runner_on_ok event, but the event is missing the fields the module is documented to always return. That means we don't get the return code, stdout, or stderr, which we need later.

We know that the underlying issue is the event having an empty res dict as we log the res of every terminal event (runner_on_unreachable, runner_on_failed, runner_on_ok) and we see that for these machines there is sometimes an empty res dict ({}) while for all the other machines there is a correctly populated res dict like:

{
  "changed": true,
  "stdout": "",
  "stderr": "<STDERR>",
  "rc": 0,
  "cmd": "<CMD>",
  "start": "2024-07-19 19:06:18.744817",
  "end": "2024-07-19 19:06:50.576242",
  "delta": "0:00:31.831425",
  "msg": "",
  "invocation": {
    "module_args": {
      "chdir": "/tmp",
      "_raw_params": "<PARAMS>",
      "_uses_shell": true,
      "expand_argument_vars": true,
      "stdin_add_newline": true,
      "strip_empty_ends": true,
      "argv": null,
      "executable": null,
      "creates": null,
      "removes": null,
      "stdin": null
    }
  },
  "stdout_lines": [],
  "stderr_lines": [
    <STDERR_LINES>
  ],
  "_ansible_no_log": false,
}
@github-actions github-actions bot added the needs_triage New item that needs to be triaged label Jul 22, 2024
@nitzmahone
Member

I've got my suspicions about where the problem lies, but would it be possible for you to register the output of the offending task and add a subsequent debug: var=<the var name you registered> task to verify that Ansible itself is seeing the correct result for each host? Just want to make sure we're looking in the right place for the problem.
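
Something along these lines (a sketch; hosts and the script path are placeholders, and the play is passed to ansible-runner as a Python structure):

```python
import ansible_runner

# The play is passed as a Python data structure; ansible-runner
# serializes it into the project directory for us.
play = [{
    "hosts": "all",
    "gather_facts": False,
    "tasks": [
        {
            "name": "run the script",
            "ansible.builtin.shell": "/usr/local/bin/our-script.sh",  # hypothetical
            "register": "script_result",
        },
        {
            "name": "show what Ansible itself recorded",
            "ansible.builtin.debug": {"var": "script_result"},
        },
    ],
}]

r = ansible_runner.run(private_data_dir="/tmp/runner", playbook=play)
```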

@ZachOldham
Author

I will do that and let you know what it looks like the next time the issue reoccurs.

@nitzmahone nitzmahone added needs_info and removed needs_triage New item that needs to be triaged labels Jul 31, 2024
@ZachOldham
Author

The issue has recurred, and the debug output also does not show the correct value for the registered variable. Both the result and the variable are empty dicts ({}). I also confirmed that when the issue does not occur, the variable is not empty and has the correct values, to rule out the possibility that I was not using debug properly.

@ZachOldham
Author

Please let me know if you need any further information.

@nitzmahone
Member

Hmm, not what I would've expected, but it at least narrows things down: that means the issue is likely in ansible-core... I don't suppose you have a repro that we'd be able to try here?

@ZachOldham
Author

I do not currently have a small repro, but I will try to get one for you.

@nitzmahone
Member

We were chatting about this one today when @s-hertel pointed out an interesting possibility: the MAX_EVENT_RES setting. If a repr'd task result payload exceeds ~700k, it will be omitted from the event stream; depending on what you're looking at, you might not see anything else. Curious whether you see the result payloads in stdout via the debug stuff we talked about earlier, or anywhere else... You can also try setting the envvar to a larger value, like MAX_EVENT_RES=1500000, on your runner invocation to see if that fixes the issue.
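
For example (a sketch; passing the variable through the runner's envvars argument is one plausible way to get it into the callback's environment, not something confirmed in this thread):

```python
import ansible_runner

r = ansible_runner.run(
    private_data_dir="/tmp/runner",
    playbook="run_script.yml",             # hypothetical playbook name
    envvars={"MAX_EVENT_RES": "1500000"},  # default cutoff is ~700000
)
```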

@ZachOldham
Author

ZachOldham commented Aug 7, 2024

Interesting... I will investigate to see if that is a potential cause. It's possible the script we are running has a large amount of output in some cases, and that would also explain why it always happens to the same machines...

If that is the case, I assume that's not a bug so much as user error on our part?
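
One way to test the hypothesis, sketched below: run a command whose output alone exceeds the ~700k cutoff and watch whether the runner_on_ok event comes back with an empty res dict (all names here are illustrative, not our production script):

```python
import ansible_runner

def on_event(event):
    if event.get("event") == "runner_on_ok":
        # If the MAX_EVENT_RES limit is the culprit, this should print
        # an empty list of keys for the oversized result.
        print("res keys:", list(event["event_data"].get("res", {})))
    return True

r = ansible_runner.run(
    private_data_dir="/tmp/runner",
    inventory={"all": {"hosts": {"localhost": {"ansible_connection": "local"}}}},
    host_pattern="localhost",
    module="ansible.builtin.shell",
    module_args="python3 -c \"print('x' * 800000)\"",  # ~800k of stdout
    event_handler=on_event,
)
```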

@nitzmahone
Member

If that is the case, I assume that's not a bug so much as user error on our part?

Arguably a little of both? From our end, I don't think that setting is documented, nor does it do a very good job, since in most cases there will be at least two more copies of the same data in the payload that aren't omitted. #1386 would at least allow someone to know that their task result payload was elided, and in the future we should have a facility for the callback to emit a job-level warning that it occurred as well.

@ZachOldham
Author

Can the MAX_EVENT_RES env var be set via the envvars arg to the runner, or does it need to be set for the process calling the runner by setting os.environ["MAX_EVENT_RES"] prior to calling the runner?

@ZachOldham
Author

We have confirmed that the root cause of the issue was our script logging too much during an error-handling path that only ever occurred on certain machines. By reducing this logging (and also by increasing MAX_EVENT_RES) we have been able to resolve this issue on our end. Thank you for the help @nitzmahone.
