Long Running ansibleRuns hang if they take longer than a minute to finish #175

eljohnson92 · 2023-01-13T13:04:13Z

What happened?

While testing a role, I noticed that it starts to hang at a certain point. I have distilled it to a simple playbook that reproduces the problem: if an AnsibleRun takes longer than about 60 seconds to finish, the process hangs and never finishes.

How can we reproduce it?

I have slightly modified the example playbook YAML to reproduce this:

apiVersion: ansible.crossplane.io/v1alpha1
kind: AnsibleRun
metadata:
  name: example
spec:
  forProvider:
    # AnsibleRun default to using a remote source.
    # For simple cases you can use an inline source to specify the content of
    # playbook.yaml as opaque, inline yaml.
    playbookInline: |
      ---
      - hosts: localhost
        tasks:
          - name: Pause for 5 minutes to build app cache
            ansible.builtin.pause:
              seconds: 60
          - name: ansibleplaybook-simple
            debug:
              msg: Your are running 'ansibleplaybook-simple' example

In the debug logs I can see the following output, which shows that the process does not finish:

[WARNING]: provided hosts list is empty, only localhost is available. Note that
the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [Pause for 5 minutes to build app cache] **********************************
Pausing for 61 seconds
(ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)
1.673569115914057e+09   DEBUG   provider-ansible        Cannot observe external resource        {"controller": "managed/ansiblerun.ansible.crossplane.io", "request": "upbound-system/example", "uid": "23eddcf4-e6a5-4cc7-8e4a-4d3ea558fff5", "version": "735728", "external-name": "example", "error": "signal: killed"}

In the container, I can also see a rather strange ps output:

PID   USER     TIME  COMMAND
    1 ansible   0:58 crossplane-ansible-provider --debug
   22 ansible   0:08 [ansible-playboo]

What environment did it happen in?

Crossplane version: 1.10.1
provider-ansible version: main

fahedouch · 2023-01-13T13:23:54Z

Hi @eljohnson92, can you reproduce the same issue when running the playbook outside the provider?
If yes, can you please re-run the provider with ANSIBLE_DEBUG=1 and share the output? Thanks.

eljohnson92 · 2023-01-13T13:32:05Z

@fahedouch thanks for taking a look. I am not able to reproduce this outside of the provider; running it manually in the same container seems to work fine:

/ $ ansible-runner run /tmp/ --project-dir /ansibleDir/23eddcf4-e6a5-4cc7-8e4a-4d3ea558fff5/ -p /ansibleDir/23eddcf4-e6a5-4cc7-8e4a-4d3ea558fff5/playbook.yml
[WARNING]: No inventory was parsed, only implicit localhost is available
[WARNING]: provided hosts list is empty, only localhost is available. Note that
the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [Pause for 5 minutes to build app cache] **********************************
Pausing for 61 seconds
(ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)
ok: [localhost]

TASK [ansibleplaybook-simple] **************************************************
ok: [localhost] => {
    "msg": "Your are running 'ansibleplaybook-simple' example"
}

PLAY RECAP *********************************************************************
localhost                  : ok=3    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

fahedouch · 2023-01-13T14:12:52Z

@eljohnson92 the reconcile func sets a context timeout of 1 * time.Minute. I think we have to give the provider a way to override this timeout value in the reconciler.

By the way, the reconcile loop is not designed to run long-running processes; otherwise it cannot coordinate other work.
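
The "signal: killed" error in the debug log is consistent with how Go treats a command bound to a context whose deadline has passed. The following is a minimal sketch of that behaviour, not the provider's actual code: `runWithTimeout` is a hypothetical helper, `sleep` stands in for ansible-runner, and a Unix-like system with `sleep` on the PATH is assumed.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runWithTimeout is a hypothetical helper (not part of provider-ansible).
// It runs a command under a context deadline, much as a reconciler with a
// fixed timeout would, and returns the error reported by Run.
func runWithTimeout(timeoutMillis int, name string, args ...string) error {
	timeout := time.Duration(timeoutMillis) * time.Millisecond
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// exec.CommandContext kills the process once ctx is done.
	cmd := exec.CommandContext(ctx, name, args...)
	return cmd.Run()
}

func main() {
	// "sleep 2" stands in for a playbook that outlives the deadline.
	err := runWithTimeout(100, "sleep", "2")
	fmt.Println("short timeout:", err)

	// With a deadline longer than the command, Run returns nil.
	err = runWithTimeout(3000, "sleep", "1")
	fmt.Println("long timeout:", err)
}
```

On Linux the first call typically reports an error of the same form as the one in the log above, because the child is terminated with SIGKILL when the deadline expires.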

eljohnson92 · 2023-01-13T14:19:27Z

@fahedouch could we have ansible-runner run asynchronously from the reconcile loop? I know some other resources in Crossplane use an async method to reconcile.

fahedouch · 2023-01-13T15:25:32Z

@eljohnson92 I think we need a PoC for this, because we usually need the ansible-runner output to continue in the reconcile loop. For example, we have to run ansible-runner in dry-run mode and examine the output to decide whether to create or update based on the diff. In some other areas, though, it is worth checking whether we can make the flow async.

eljohnson92 · 2023-01-13T21:13:15Z

@fahedouch it looks like upjet uses this callback functionality for async functionality on the connector and external structs, maybe it could be opt-in here as well?

// WithCallbackProvider configures the controller to use async variant of the functions
// of the Terraform client and run given callbacks once those operations are
// completed.
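
For readers unfamiliar with the pattern, the async-with-callback idea can be sketched in plain Go as follows. This is not upjet's API and not provider-ansible code: `runAsync` and its callback signature are hypothetical, and `sleep` stands in for ansible-runner. The sketch assumes the command gets its own context rather than one derived from the caller's; whether that alone would avoid the kill inside the provider is not something the thread settles.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runAsync is a hypothetical helper. It starts the command in a goroutine,
// returns immediately, and calls onDone with the result once the command
// has finished or its own timeout has expired.
func runAsync(timeoutMillis int, onDone func(err error), name string, args ...string) {
	go func() {
		// The context is created here, not derived from the caller's, so a
		// deadline on the caller does not apply to this command.
		timeout := time.Duration(timeoutMillis) * time.Millisecond
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		defer cancel()

		onDone(exec.CommandContext(ctx, name, args...).Run())
	}()
}

func main() {
	done := make(chan error, 1)

	// The callback only forwards the result; a controller could instead
	// record a status on the resource.
	runAsync(5000, func(err error) { done <- err }, "sleep", "1")

	fmt.Println("caller continues without waiting")
	fmt.Println("command finished, err:", <-done)
}
```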

fahedouch · 2023-01-14T14:56:33Z

> @fahedouch it looks like upjet uses this callback functionality for async functionality on the connector and external structs, maybe it could be opt-in here as well?
> // WithCallbackProvider configures the controller to use async variant of the functions
> // of the Terraform client and run given callbacks once those operations are
> // completed.

SGTM but I think it deserves a Poc

ron1 · 2023-01-15T01:15:39Z

Also see guidance for enabling the upjet UseAsync flag here:

https://github.com/upbound/upjet/blob/7e84c638a8bc5c93c6da3cf9420f961f165dd05d/pkg/config/resource.go#L279-L282

Note that we will likely prefer to use provider-terraform over provider-ansible for declarative infrastructure management whenever possible. However, we are hoping to replace our use of Ansible AWX with Crossplane ansible-provider for executing legacy ansible content. Most of this content is long-running. For example, provisioning a vm and installing/configuring a product like Foreman on the vm. As you can imagine, such a playbook takes quite a long time to run. Provider terraform appears to have a hard 20 minute timeout. I would think provider-ansible should have a timeout at least that long.

eljohnson92 · 2023-01-15T18:33:54Z

@fahedouch I've been looking into this a little. I think the best solution might actually be to use ansible-runner start instead of ansible-runner run, which would allow us to kick off the Ansible job in the background. We could use a new jobStatus on the resource to ensure we don't try to do an ansible-check before the job is finished.

eljohnson92 · 2023-01-16T03:39:23Z

I tried an implementation of this locally, but even with ansible-runner start spawning a new child process, both processes get killed when the timeout hits. At this point, I'm not sure whether it is possible to have AnsibleRuns longer than 1 minute without changing the architecture of this provider to spawn additional containers or processes outside of the main provider process.

fahedouch · 2023-01-16T16:57:44Z

> I tried an implementation of this locally, but even with ansible-runner start spawning a new child process, both processes get killed when the timeout hits. At this point, I'm not sure whether it is possible to have AnsibleRuns longer than 1 minute without changing the architecture of this provider to spawn additional containers or processes outside of the main provider process.

Making the funcs async does not resolve our timeout issue, because the timeout impacts the async funcs too. So I think we have to increase the timeout anyway.

> spawn additional containers or processes outside of the main provider process

I don't see how to do this.

fahedouch · 2023-01-16T16:59:51Z

> Also see guidance for enabling the upjet UseAsync flag here:
>
> https://github.com/upbound/upjet/blob/7e84c638a8bc5c93c6da3cf9420f961f165dd05d/pkg/config/resource.go#L279-L282
>
> Note that we will likely prefer to use provider-terraform over provider-ansible for declarative infrastructure management whenever possible. However, we are hoping to replace our use of Ansible AWX with Crossplane ansible-provider for executing legacy ansible content. Most of this content is long-running. For example, provisioning a vm and installing/configuring a product like Foreman on the vm. As you can imagine, such a playbook takes quite a long time to run. Provider terraform appears to have a hard 20 minute timeout. I would think provider-ansible should have a timeout at least that long.

Yes, we can try with a 20-minute timeout.

eljohnson92 · 2023-01-16T18:15:40Z

@fahedouch @ron1 I have opened a PR that makes the timeout in provider-ansible configurable, mirroring provider-terraform. If you could take a look and test it, that would be great; it seems to work well for my use case.
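
As an illustration of what a configurable timeout can look like, here is a minimal sketch using only Go's standard `flag` package. It is not the code from the PR: the `parseTimeout` helper, the `--timeout` flag name, and the 20-minute default are assumptions, with the default taken from the value suggested earlier in this thread.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// parseTimeout is a hypothetical helper. It reads a --timeout flag from args
// and falls back to 20 minutes (an assumed default) when the flag is absent.
func parseTimeout(args []string) (time.Duration, error) {
	fs := flag.NewFlagSet("provider", flag.ContinueOnError)
	timeout := fs.Duration("timeout", 20*time.Minute,
		"how long an Ansible process may run before it is killed")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *timeout, nil
}

func main() {
	// Explicit override, as an operator might pass on the command line.
	d, err := parseTimeout([]string{"--timeout=45m"})
	fmt.Println(d, err)

	// No flag given: the default applies.
	d, err = parseTimeout(nil)
	fmt.Println(d, err)
}
```

The resulting duration would then replace the hard-coded one-minute value wherever the context deadline for the Ansible process is created.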

ron1 · 2023-01-16T23:14:54Z

See the related issue for Ansible AWX running on K8s, ansible/awx#11805 (comment), where running Ansible content suffered from a 4-hour hard timeout before the fix.

eljohnson92 added the bug label Jan 13, 2023

fahedouch added the status/needs-design-discussion label Jan 14, 2023

eljohnson92 mentioned this issue Jan 16, 2023: add configurable timeout for ansibleRuns #177 (Merged)

fahedouch closed this as completed in #177 Jan 17, 2023
