Project sync fails with "OCI runtime attempted to invoke a command that was not found". #15404

Open
5 of 11 tasks
fs30000 opened this issue Jul 26, 2024 · 17 comments

Comments

@fs30000

fs30000 commented Jul 26, 2024

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.
  • I am NOT reporting a (potential) security vulnerability. (These should be emailed to [email protected] instead.)

Bug Summary

Fresh install of AWX 24.6.1 on Rocky 9.4.

When syncing a project from Bitbucket, I got this error:

Error: crun: writing file `/sys/fs/cgroup/libpod_parent/libpod-7e3548e80158e27d349ee7db1ef6a83f4db901135c8393da7e43646db0993fb2/cgroup.procs`: No such file or directory: OCI runtime attempted to invoke a command that was not found

AWX version

24.6.1

Select the relevant components

  • UI
  • UI (tech preview)
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

docker development environment

Modifications

no

Ansible version

No response

Operating system

Rocky 9.4

Web browser

Firefox

Steps to reproduce

Create a project with the type git, with credentials, etc.
Try to sync it.

Expected results

To work.

Actual results

Error: crun: writing file /sys/fs/cgroup/libpod_parent/libpod-7e3548e80158e27d349ee7db1ef6a83f4db901135c8393da7e43646db0993fb2/cgroup.procs: No such file or directory: OCI runtime attempted to invoke a command that was not found

Additional information

No response

@brad95411

Not that this is necessarily helpful, but I am having a nearly identical error running AWX 24.6.1 on docker on Fedora 39, and accessing the web interface via Chrome.

@fs30000
Author

fs30000 commented Aug 16, 2024

Anyone?

@fs30000
Author

fs30000 commented Aug 16, 2024

Same error on Fedora 40. With these commands:

git clone - 23.3.1
export COMPOSE_UP_OPTS=-d RECEPTOR_IMAGE=quay.io/ansible/receptor:v1.4.8 COMPOSE_TAG=release_4.5

@brad95411

I've tried a few more versions with no success. Suspect it's some setup problem.

I have read in a few places that these issues can happen if both the outer and inner container engines are using overlayfs.

I tried changing the storage configuration for the inner container engine (i.e. podman) to use either vfs or btrfs, but I just got errors about it not being able to find either of those.

If I get some time, I will try to set up a podman instance on a VM and set it up so the storage driver is something other than overlayfs and try again. Hopefully get to it some time this weekend, but if someone is itching for an experiment by all means take the idea and run with it.

@brad95411

Update:

Fedora 39, AWX 24.5.0 (to keep the UI stuff constrained inside the container), Docker running with vfs storage driver

It's not working, but the error is clearly different. The Example job will not start, and the Example Project will not sync. The error on project sync is shown below:

Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...

I am at a loss at the moment. Even if this did work, vfs is not exactly a great solution to the problem based on what I've read. If I come up with another idea I'll try it and post about it, but for now I think I'm just going to focus on re-familiarizing myself with ansible-navigator. AWX is helpful because I use AAP all the time for work, but I can get by with navigator for my personal purposes.

@fs30000
Author

fs30000 commented Aug 19, 2024

I have tried older versions, even with different receptor and compose tags for older awx_devel versions. There is always some error showing up.

When I don't get the error from this issue, I get this:

https://forum.ansible.com/t/error-current-system-boot-id-differs-from-cached-boot-id/7898

I have pulled out all of my hair now.

@avalou

avalou commented Aug 19, 2024

I have the very same issue with the following versions: 24.6.1, 24.5.0, 23.5.0.
Does anyone have any clue what is happening? I have been using and installing AWX for 2+ years now and I never ran into this issue.

Edit: we fixed the issue with a colleague of mine. Details are incoming!

@fs30000
Author

fs30000 commented Aug 20, 2024

> I have the very same issue with the following versions: 24.6.1, 24.5.0, 23.5.0. Does anyone have any clue what is happening? I have been using and installing AWX for 2+ years now and I never ran into this issue.
>
> Edit: we fixed the issue with a colleague of mine. Details are incoming!

Please share mate!

@avalou

avalou commented Aug 21, 2024

We are still not sure what solved the issue, so here is what we did :

  • downgraded podman version in tools_awx_1 container
  • downgraded runc version in tools_awx_1 container
  • set cgroup to host in the compose file under /tools/docker-compose/_sources/ then rebuilt containers
    At this point the issue was pretty much resolved, but we were not satisfied by this solution that we considered unsafe so we kept digging and removed the cgroup parameter
  • downgraded docker engine version on the host machine from 27.1.2 to 26.0.0
    This is what seemed to fix the issue. BUT we are not sure what really worked, because when we realised the issue was fixed, we rolled back to the latest docker engine version (in this case v27.1.2, the latest available in apt repositories) and despite that we failed to reproduce the issue.

I will keep you posted if we have any more clue about what happened. 🤷‍♀️

@fs30000
Author

fs30000 commented Aug 21, 2024

> We are still not sure what solved the issue, so here is what we did:
>
> * downgraded podman version in tools_awx_1 container
> * downgraded runc version in tools_awx_1 container
> * set `cgroup` to `host` in the compose file under `/tools/docker-compose/_sources/` then rebuilt containers
>   At this point the issue was pretty much resolved, but we were not satisfied by this solution that we considered unsafe so we kept digging and removed the `cgroup` parameter
> * downgraded docker engine version on the host machine from 27.1.2 to 26.0.0
>   This is what seemed to fix the issue. BUT we are not sure what really worked, because when we realised the issue was fixed, we rolled back to the latest docker engine version (in this case v27.1.2, the latest available in apt repositories) and despite that we failed to reproduce the issue.
>
> I will keep you posted if we have any more clue about what happened. 🤷‍♀️

Wait, are you using docker dev version or K8s?

@avalou

avalou commented Aug 22, 2024

Yes, we are using the dev version deployed with docker compose, and it has been working perfectly for 2+ years, with the notable exception of the current topic.

@brad95411

Updating on my testing progress.

I am running plain docker, no k8s or anything.

I downgraded crun to 1.14.3-1 from 1.16.1-1 in the Dockerfile jinja template, no change.

I left crun at 1.14, and downgraded podman to 2:5.1.1-1 from 2:5.1.1-1 in the Dockerfile jinja template, no change.

Prior to doing any testing I verified manually with dnf that the versions had changed.

If anyone has pinned down what fixed the problem for them and can provide explicit instructions, please do. I am still going to keep trying things when I have time, but I feel I may be fighting a losing battle at the moment.

I've not tried a downgrade of the outer docker engine at the moment simply because I have other containers running where I'm doing work now, and would need to set up a new VM to run an additional docker instance that I could more comfortably mess with.

@brad95411

Another update.

My docker version was docker-ce-3:26.1.1-1, I guess I didn't realize I was running an older major version.

I upgraded to docker-ce-3:27.1.2-1. It didn't seem to make a difference. I am still getting errors. Note that this test is using the crun and podman versions mentioned previously. The current error is shown below:

Error: container create failed (no logs from conmon): conmon bytes "": readObjectStart: expect { or n, but found , error found in #0 byte of ...||..., bigger context ...||...

@brad95411

I have not had any epiphanies with regards to this issue. I've had to spin down my attempts because of some upgrades I'm making and needing a stable environment while those are going on.

If anyone has any ideas or concrete solutions that have worked for you, please let me know.

@brad95411

Updating in hopes of keeping this on folks' radar. I still haven't been able to solve this.

@ibcht

ibcht commented Sep 9, 2024

> We are still not sure what solved the issue, so here is what we did:
>
> * downgraded podman version in tools_awx_1 container
> * downgraded runc version in tools_awx_1 container
> * set `cgroup` to `host` in the compose file under `/tools/docker-compose/_sources/` then rebuilt containers
>   At this point the issue was pretty much resolved, but we were not satisfied by this solution that we considered unsafe so we kept digging and removed the `cgroup` parameter
> * downgraded docker engine version on the host machine from 27.1.2 to 26.0.0
>   This is what seemed to fix the issue. BUT we are not sure what really worked, because when we realised the issue was fixed, we rolled back to the latest docker engine version (in this case v27.1.2, the latest available in apt repositories) and despite that we failed to reproduce the issue.
>
> I will keep you posted if we have any more clue about what happened. 🤷‍♀️

Similar issue for me, with an installation on a dev environment with Docker Compose, and the solution indeed lies in overriding the cgroup parameter to host in the docker-compose.yml file. It might be related to how containerd decides how the cgroup namespace is configured by default, which could have changed somehow? From https://docs.docker.com/reference/compose-file/services/#cgroup: "When unset, it is the container runtime's decision to select which cgroup namespace to use, if supported".

Docker 27.2.1
containerd 1.7.21
cgroup v2

@a-haurylau

Seeing the same, starting from AWX 24.4.1. I suspect that this is caused by the podman upgrade in the AWX image from 4.x to 5.x. For us, the solution was to add cgroup: host to the AWX docker-compose.yaml (docker/compose#8167 (comment)).
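
For anyone landing here, the workaround reported in this thread could look roughly like the fragment below. This is a minimal sketch, not the project's official fix: the service name `awx_1` is an assumption inferred from the `tools_awx_1` container name mentioned above, so check the actual service name in your generated compose file under `tools/docker-compose/_sources/`. The `cgroup` key itself is the one described in the Compose file reference linked earlier.

```yaml
# Sketch only: the service name "awx_1" is assumed, adjust it to match your compose file.
services:
  awx_1:
    # Join the host's cgroup namespace instead of letting the
    # container runtime pick one (the behaviour when this key is unset).
    cgroup: host
```

After changing the file, the containers need to be rebuilt or recreated, as described above. Note that one commenter considered sharing the host cgroup namespace unsafe, so this is probably best treated as a workaround for the development environment.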
