Long Running ansibleRuns hang if they take longer than a minute to finish #175

Closed
eljohnson92 opened this issue Jan 13, 2023 · 14 comments · Fixed by #177

@eljohnson92
Contributor

What happened?

While testing a role, I noticed that it starts to hang at a certain point. I have distilled the problem to a simple playbook that reproduces it: if an AnsibleRun takes longer than about 60 seconds to finish, the process hangs and never completes.

How can we reproduce it?

I have slightly modified the example playbook YAML to reproduce this:

apiVersion: ansible.crossplane.io/v1alpha1
kind: AnsibleRun
metadata:
  name: example
spec:
  forProvider:
    # AnsibleRun defaults to using a remote source.
    # For simple cases you can use an inline source to specify the content of
    # playbook.yaml as opaque, inline yaml.
    playbookInline: |
      ---
      - hosts: localhost
        tasks:
          - name: Pause for 5 minutes to build app cache
            ansible.builtin.pause:
              seconds: 60
          - name: ansibleplaybook-simple
            debug:
              msg: Your are running 'ansibleplaybook-simple' example

In the debug logs, I can see the following output, which shows that the process never finishes:

[WARNING]: provided hosts list is empty, only localhost is available. Note that
the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [Pause for 5 minutes to build app cache] **********************************
Pausing for 61 seconds
(ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)
1.673569115914057e+09   DEBUG   provider-ansible        Cannot observe external resource        {"controller": "managed/ansiblerun.ansible.crossplane.io", "request": "upbound-system/example", "uid": "23eddcf4-e6a5-4cc7-8e4a-4d3ea558fff5", "version": "735728", "external-name": "example", "error": "signal: killed"}

In the container, I can also see a rather strange ps output:

PID   USER     TIME  COMMAND
    1 ansible   0:58 crossplane-ansible-provider --debug
   22 ansible   0:08 [ansible-playboo]

What environment did it happen in?

Crossplane version: 1.10.1
provider-ansible version: main

eljohnson92 added the bug label Jan 13, 2023
@fahedouch
Collaborator

Hi @eljohnson92, can you reproduce the same issue when running the playbook outside the provider?
If so, could you please re-run the provider with ANSIBLE_DEBUG=1 and share the output? Thanks.

@eljohnson92
Contributor Author

@fahedouch thanks for taking a look. I am not able to reproduce this outside of the provider; running it manually in the same container works fine:

/ $ ansible-runner run /tmp/ --project-dir /ansibleDir/23eddcf4-e6a5-4cc7-8e4a-4d3ea558fff5/ -p /ansibleDir/23eddcf4-e6a5-4cc7-8e4a-4d3ea558fff5/playbook.yml
[WARNING]: No inventory was parsed, only implicit localhost is available
[WARNING]: provided hosts list is empty, only localhost is available. Note that
the implicit localhost does not match 'all'

PLAY [localhost] ***************************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [Pause for 5 minutes to build app cache] **********************************
Pausing for 61 seconds
(ctrl+C then 'C' = continue early, ctrl+C then 'A' = abort)
ok: [localhost]

TASK [ansibleplaybook-simple] **************************************************
ok: [localhost] => {
    "msg": "Your are running 'ansibleplaybook-simple' example"
}

PLAY RECAP *********************************************************************
localhost                  : ok=3    changed=0    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

@fahedouch
Collaborator

@eljohnson92 the reconcile function sets a context timeout of 1 * time.Minute. I think we have to give the provider a way to override this timeout value in the reconciler.

By the way, the reconcile loop is not designed to run long-running processes; otherwise it cannot coordinate other work.

@eljohnson92
Contributor Author

@fahedouch could we have ansible-runner run asynchronously from the reconcile loop? I know some other resources in Crossplane use an async method to reconcile.

@fahedouch
Collaborator

@eljohnson92 I think we need a PoC for this, because we usually need the ansible-runner output to continue in the reconcile loop. For example, we have to run ansible-runner in dry-run mode and examine the output to decide whether to create or update based on the diff. In other areas, though, it is worth checking whether we can make the flow async.

@eljohnson92
Contributor Author

@fahedouch it looks like upjet uses this callback mechanism for async behavior on the connector and external structs. Maybe it could be opt-in here as well?

// WithCallbackProvider configures the controller to use async variant of the functions
// of the Terraform client and run given callbacks once those operations are
// completed.

@fahedouch
Collaborator

> @fahedouch it looks like upjet uses this callback mechanism for async behavior on the connector and external structs. Maybe it could be opt-in here as well?
>
> // WithCallbackProvider configures the controller to use async variant of the functions
> // of the Terraform client and run given callbacks once those operations are
> // completed.

SGTM, but I think it deserves a PoC.

@ron1
Contributor

ron1 commented Jan 15, 2023

Also see guidance for enabling the upjet UseAsync flag here:

https://github.com/upbound/upjet/blob/7e84c638a8bc5c93c6da3cf9420f961f165dd05d/pkg/config/resource.go#L279-L282

Note that we will likely prefer provider-terraform over provider-ansible for declarative infrastructure management whenever possible. However, we are hoping to replace our use of Ansible AWX with the Crossplane provider-ansible for executing legacy Ansible content. Most of this content is long-running: for example, provisioning a VM and installing and configuring a product like Foreman on it. As you can imagine, such a playbook takes quite a long time to run. provider-terraform appears to have a hard 20-minute timeout. I would think provider-ansible should have a timeout at least that long.

@eljohnson92
Contributor Author

eljohnson92 commented Jan 15, 2023

@fahedouch I've been looking into this a little. I think the best solution might actually be to use ansible-runner start instead of ansible-runner run, which would allow us to kick off the Ansible job in the background. We could use a new jobStatus on the resource to ensure we don't attempt an Ansible check before the job has finished.

@eljohnson92
Contributor Author

I tried an implementation of this locally, but even with ansible-runner start spawning a new child process, both processes get killed when the timeout hits. At this point, I'm not sure it is possible to have AnsibleRuns longer than 1 minute without changing the architecture of this provider to spawn additional containers or processes outside of the main provider process.

@fahedouch
Collaborator

> I tried an implementation of this locally, but even with ansible-runner start spawning a new child process, both processes get killed when the timeout hits. At this point, I'm not sure it is possible to have AnsibleRuns longer than 1 minute without changing the architecture of this provider to spawn additional containers or processes outside of the main provider process.

Making the functions async does not resolve our timeout issue, because the timeout affects the async functions too. So I think we have to increase the timeout anyway.

> spawn additional containers or processes outside of the main provider process

I don't see how to do this.

@fahedouch
Collaborator

fahedouch commented Jan 16, 2023

> Also see guidance for enabling the upjet UseAsync flag here:
>
> https://github.com/upbound/upjet/blob/7e84c638a8bc5c93c6da3cf9420f961f165dd05d/pkg/config/resource.go#L279-L282
>
> Note that we will likely prefer provider-terraform over provider-ansible for declarative infrastructure management whenever possible. However, we are hoping to replace our use of Ansible AWX with the Crossplane provider-ansible for executing legacy Ansible content. Most of this content is long-running: for example, provisioning a VM and installing and configuring a product like Foreman on it. As you can imagine, such a playbook takes quite a long time to run. provider-terraform appears to have a hard 20-minute timeout. I would think provider-ansible should have a timeout at least that long.

Yes, we can try a 20-minute timeout.

@eljohnson92
Contributor Author

@fahedouch @ron1 I have opened a PR to make the timeout in provider-ansible configurable, mirroring provider-terraform. If you could take a look and test it, that would be great; it seems to work well for my use case.

@ron1
Contributor

ron1 commented Jan 16, 2023

See the related issue for Ansible AWX running on K8s, ansible/awx#11805 (comment), where running Ansible content suffered from a 4-hour hard timeout before the fix.
