ansible.builtin.shell invocations through ansible-runner sometimes do not return documented return values #1381

Closed
ZachOldham opened this issue Jul 22, 2024 · 11 comments

Comments

@ZachOldham

ZachOldham commented Jul 22, 2024

We are observing that when we invoke the ansible.builtin.shell module via ansible_runner.run, the res field of runner_on_ok events is sometimes an empty dict, despite the documentation for the ansible.builtin.shell module claiming that it always returns certain values. We are using ansible-core version 2.16.6 and ansible-runner version 2.3.6.
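
For context, a minimal sketch of the invocation pattern, with an event handler that reads res off the runner_on_ok events (the directory layout and module args here are placeholders, not our real setup):

```python
import ansible_runner

def on_event(event):
    # runner_on_ok is the per-host success event; the module result,
    # when present, lives in event_data["res"].
    if event.get("event") == "runner_on_ok":
        data = event.get("event_data", {})
        res = data.get("res", {})
        print(data.get("host"), res.get("rc"), res.get("stdout"))
    return True  # returning True keeps the event in the artifact stream

r = ansible_runner.run(
    private_data_dir="/tmp/runner",              # assumed directory layout
    host_pattern="all",
    module="ansible.builtin.shell",
    module_args="/usr/local/bin/our-script.sh",  # hypothetical script path
    event_handler=on_event,
)
```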

We use ansible-runner to invoke ansible.builtin.shell to run a script on 15 machines simultaneously. Usually this works without problems: the script succeeds on every machine, and the res dict of each runner_on_ok event contains the fields documented for ansible.builtin.shell. On occasion (maybe 1/4 of the time?) the res dict of the runner_on_ok events for some machines is empty, while it is populated for the rest. Each time this happens, it is the same two machines that have empty res dicts. When we examine the logs, it is clear that the script has also run successfully on those machines, which is consistent with the fact that we get a runner_on_ok event from them as well.

This is extremely confusing: the script runs correctly and we get a runner_on_ok event, but the event is missing the fields the module is documented to always return. That means we don't get the return code, stdout, or stderr, which we need later.

We know that the underlying issue is the event having an empty res dict as we log the res of every terminal event (runner_on_unreachable, runner_on_failed, runner_on_ok) and we see that for these machines there is sometimes an empty res dict ({}) while for all the other machines there is a correctly populated res dict like:

{
  "changed": true,
  "stdout": "",
  "stderr": "<STDERR>",
  "rc": 0,
  "cmd": "<CMD>",
  "start": "2024-07-19 19:06:18.744817",
  "end": "2024-07-19 19:06:50.576242",
  "delta": "0:00:31.831425",
  "msg": "",
  "invocation": {
    "module_args": {
      "chdir": "/tmp",
      "_raw_params": "<PARAMS>",
      "_uses_shell": true,
      "expand_argument_vars": true,
      "stdin_add_newline": true,
      "strip_empty_ends": true,
      "argv": null,
      "executable": null,
      "creates": null,
      "removes": null,
      "stdin": null
    }
  },
  "stdout_lines": [],
  "stderr_lines": [
    <STDERR_LINES>
  ],
  "_ansible_no_log": false,
}
@github-actions github-actions bot added the needs_triage New item that needs to be triaged label Jul 22, 2024
@nitzmahone
Member

I've got my suspicions about where the problem lies, but would it be possible for you to register the output of the offending task and add a subsequent debug: var=<the var name you registered> task to verify that Ansible itself is seeing the correct result for each host? Just want to make sure we're looking in the right place for the problem.
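
Something along these lines (a sketch; hosts and the script path are placeholders, and the play is passed to ansible-runner as a Python structure):

```python
import ansible_runner

# The play is passed as a Python data structure; ansible-runner
# serializes it into the project directory for us.
play = [{
    "hosts": "all",
    "gather_facts": False,
    "tasks": [
        {
            "name": "run the script",
            "ansible.builtin.shell": "/usr/local/bin/our-script.sh",  # hypothetical
            "register": "script_result",
        },
        {
            "name": "show what Ansible itself recorded",
            "ansible.builtin.debug": {"var": "script_result"},
        },
    ],
}]

r = ansible_runner.run(private_data_dir="/tmp/runner", playbook=play)
```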

@ZachOldham
Author

I will do that and let you know what it looks like the next time the issue reoccurs.

@nitzmahone nitzmahone added needs_info and removed needs_triage New item that needs to be triaged labels Jul 31, 2024
@ZachOldham
Author

The issue has recurred, and the debug output also does not show the correct value for the registered variable. Both the result and the variable are empty dicts ({}). I also confirmed that when the issue does not occur, the variable is not empty and has the correct values, to rule out the possibility that I was not using debug properly.

@ZachOldham
Author

Please let me know if you need any further information.

@nitzmahone
Member

Hmm, not what I would've expected, but it at least narrows things down: that means the issue is likely in ansible-core... I don't suppose you have a repro that we'd be able to try here?

@ZachOldham
Author

I do not currently have a small repro, but I will try to get one for you.

@nitzmahone
Member

We were chatting about this one today when @s-hertel pointed out an interesting possibility: the MAX_EVENT_RES setting. If a repr'd task result payload exceeds ~700k, it will be omitted from the event stream; depending on what you're looking at, you might not see anything else. Curious whether you see the result payloads in stdout via the debug stuff we talked about earlier, or anywhere else... You can also try setting the envvar to a larger value, like MAX_EVENT_RES=1500000, on your runner invocation to see if that fixes the issue.
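
For example (a sketch; passing the variable through the runner's envvars argument is one plausible way to get it into the callback's environment, not something confirmed in this thread):

```python
import ansible_runner

r = ansible_runner.run(
    private_data_dir="/tmp/runner",
    playbook="run_script.yml",             # hypothetical playbook name
    envvars={"MAX_EVENT_RES": "1500000"},  # default cutoff is ~700000
)
```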

@ZachOldham
Author

ZachOldham commented Aug 7, 2024

Interesting... I will investigate to see if that is a potential cause. It's possible the script we are running has a large amount of output in some cases, and that would also explain why it always happens to the same machines...

If that is the case, I assume that's not a bug so much as user error on our part?
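
One way to test the hypothesis, sketched below: run a command whose output alone exceeds the ~700k cutoff and watch whether the runner_on_ok event comes back with an empty res dict (all names here are illustrative, not our production script):

```python
import ansible_runner

def on_event(event):
    if event.get("event") == "runner_on_ok":
        # If the MAX_EVENT_RES limit is the culprit, this should print
        # an empty list of keys for the oversized result.
        print("res keys:", list(event["event_data"].get("res", {})))
    return True

r = ansible_runner.run(
    private_data_dir="/tmp/runner",
    inventory={"all": {"hosts": {"localhost": {"ansible_connection": "local"}}}},
    host_pattern="localhost",
    module="ansible.builtin.shell",
    module_args="python3 -c \"print('x' * 800000)\"",  # ~800k of stdout
    event_handler=on_event,
)
```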

@nitzmahone
Member

If that is the case, I assume that's not a bug so much as user error on our part?

Arguably a little of both? From our end, I don't think that setting is documented, nor does it do a very good job, since in most cases there will be at least two more copies of the same data in the payload that aren't omitted. #1386 would at least allow someone to know that their task result payload was elided, and in the future we should have a facility for the callback to emit a job-level warning that it occurred as well.

@ZachOldham
Author

Can the MAX_EVENT_RES env var be set via the envvars arg to the runner, or does it need to be set for the process calling the runner by setting os.environ["MAX_EVENT_RES"] prior to calling the runner?

@ZachOldham
Author

We have confirmed that the root cause of the issue was our script logging too much during an error-handling path that only ever occurred on certain machines. By reducing this logging (and also by increasing MAX_EVENT_RES) we have been able to resolve this issue on our end. Thank you for the help @nitzmahone.
