Improvement in gunicorn container settings #322
Comments
Hi and thanks for the issue ! To be fair, I must say that there are no claims that the container images published by the project are intended or optimized for production use at a large scale in the docs:
That is not to say that we cannot improve the base image we publish but the objective is more about getting people started quickly and then allowing users to tweak on their own by showing them how the sausage is made. That said, it wouldn't be a bad idea to benchmark different approaches and settings to find out what works best and what doesn't so we can make an informed decision. I personally like gunicorn but there's also uwsgi and other ways to run the application if people really want to. Edit: links to existing benchmarks: |
It looks like gunicorn and containers don't go very well together. We're currently running a POC with ara on Kubernetes to record our playbook runs, using the images provided, and we consistently get WORKER TIMEOUT errors. This happens with simple curl calls carrying little data, and with sqlite for now, as we're just trying ara out.
|
What do you have for gunicorn settings? I have the following:
With that, ara has been running for years on Kubernetes. |
Hi @VannTen and thanks for the feedback (also merci for working on kubespray ❤️). I stand by my previous comment that says we aren't specifically tuning the container images for scale or performance but it should work and if there is anything we can do to make them run better we should consider it. The recent blog post you shared is interesting and didn't exist when we last looked at this, the take away being:
I don't personally have ara deployed in k8s right now but I am willing to work with you to find out if this is true in the context of ara, while putting odds in our favour by doing two more things (that are part of general performance troubleshooting tips):
The container images currently ship with this command:
For the sake of simplicity I have gone ahead and done a rebuild of the latest image, only changing the number of workers from 4 to 1. (@hille721 if you have any information or data regarding your additional settings maybe we can test that too) You can try this image here: If you want to use MySQL you should have environment variables that look like this for where the ara server container runs:
For PostgreSQL:
Please let me know how that works out and if you have any interesting findings we can work with. Thanks ! |
No problem, scale testing is painful.
However, this was just a POC to try out the UI and get a feel for how we would use ARA (which is why we haven't put more than sqlite behind it).
We had the worker timeouts pretty much immediately, even with no or very little data recorded.
SQLite lock contention seems a pretty unlikely culprit to me, given the volume (no more than 1 query and about 25 playbooks with 1-3 tasks each).
Nevertheless, if/when we put an actual DB behind it, we'll see if this changes and report back. (We don't yet run databases in Kubernetes, unfortunately, because we don't have performant storage directly available in the clusters.)
https://pythonspeed.com/articles/gunicorn-in-docker/ Seems pretty interesting and has reasoning behind the options which seems pretty sound to me, probably what we're going to try next (--workers=1 ended up not making much of a difference, unfortunately).
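One way to check whether the worker tmp dir is a plausible cause is to time metadata updates on a file in that directory. The gunicorn documentation for `worker_tmp_dir` notes that the heartbeat involves calling `os.fchmod` on a temporary file and may block a worker if the directory is on a disk-backed filesystem. The sketch below is only a rough approximation of that behaviour, not gunicorn's actual code; the function name `max_fchmod_latency` and the iteration count are made up for illustration.

```python
import os
import tempfile
import time


def max_fchmod_latency(directory, iterations=1000):
    """Return the slowest os.fchmod() call, in seconds, on a temp file in `directory`.

    Hypothetical helper: it only imitates the kind of metadata update a
    heartbeat file receives, it is not taken from gunicorn.
    """
    worst = 0.0
    with tempfile.NamedTemporaryFile(dir=directory) as tmp:
        fd = tmp.fileno()
        for i in range(iterations):
            # Alternate between two modes so every call changes something.
            mode = 0o600 if i % 2 else 0o400
            start = time.perf_counter()
            os.fchmod(fd, mode)
            worst = max(worst, time.perf_counter() - start)
    return worst
```

Running it inside the pod for both `/tmp` and `/dev/shm` and comparing the two numbers should show whether the disk-backed directory stalls noticeably; a worst case approaching the gunicorn timeout would point at the tmp dir.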
|
For what it's worth, the ara server doesn't /need/ to run with gunicorn. Any WSGI server known to run django will work as well (uwsgi, apache mod_wsgi, etc). We feel the same about databases in k8s: the database server can run outside on bare metal or on a VM, etc., you just have to be mindful of the network latency between the ara server and the database server. That said, it feels like we might be missing something because the performance shouldn't be /that/ bad and errors shouldn't come up so easily, especially if you aren't running concurrent playbooks which could run into the sqlite lock issues. Are you able to reproduce the kind of issues you are seeing if you try to run the container outside of k8s? I mean locally with podman or docker. |
It suggests using |
I put up an image with those settings: I will also do some testing on my end out of curiosity. |
I have used a similar approach to benchmarking blog posts (database backends, ansible versions & ara) to test whether there is a significant difference between the current image and "tweaked" settings. This is running locally on the same machine (16 cores, 32gb ram, modest SSDs) on fedora 40. The results: Stock (current image)
2 workers, 4 threads, gthread, /dev/shm
So, yes, while the numbers are slightly better using the 2 workers/4threads/gthread and /dev/shm options, it is almost negligible in practice: the benchmarking test playbook does nothing 10 000 times really fast. In any case, I am unable to reproduce the extreme sluggishness you are seeing. I will leave it at that for now but I am interested to learn if you find out anything. Edit: tangentially related, these numbers are better than the ones last benchmarked by a significant margin: Maybe we are due for a new blog post :) |
I think the most likely culprit is the /tmp. (We'll test using a memory emptyDir next week)
I need to confirm that, but I think `docker run` / `podman run` mount a tmpfs on /tmp, which is not the case in a K8S pod.
And even if they don't, it's very possible that the difference between an SSD on bare metal and virtualized storage (I need to recheck exactly what we have for the containers' writable layer) keeps the problem from showing up in your tests.
We'll continue to investigate and will report back :)
Thanks !
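To confirm what actually backs /tmp in a given container, the mount table can be inspected directly. This is a minimal sketch, assuming a Linux container where `/proc/mounts` is readable; the helper name `fs_type_for` and the sample mount table are illustrative only and do not come from any real pod.

```python
import os


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the mount that contains `path`.

    `mounts_text` is expected in /proc/mounts format:
    "<device> <mount point> <fs type> <options> ...", one mount per line.
    Mount points containing spaces (escaped as \\040) are not handled.
    """
    path = os.path.normpath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # Prefer the deepest mount point; on ties the later entry wins,
            # since later mounts shadow earlier ones.
            if len(mount_point) >= len(best_mount):
                best_mount, best_type = mount_point, fs_type
    return best_type


# Made-up mount table for illustration.
SAMPLE_MOUNTS = """overlay / overlay rw,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
"""
```

With the sample table, `/tmp` falls through to the overlay root while `/dev/shm` is tmpfs. In a real container the text would come from `open("/proc/mounts").read()`.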
|
Hi @VannTen, I'm reaching out to see if you ended up finding anything interesting. Thanks, |
Hi, thanks for the ping ^
We ended up with roughly this:
```
python3 -m gunicorn ara.server.wsgi \
--workers=2 \
--threads=4 \
--worker-class=gthread \
--worker-tmp-dir /dev/shm
```
This stopped the worst offenders, but we still had some timeouts;
switching to postgres made everything way more smooth.
I'm not sure what made sqlite so bad. Maybe it's the interaction with overlayfs 🤔.
We're now testing running kubespray to upgrade our clusters with ara enabled, to see what the overhead is (roughly). (Meanwhile, I've been following the discussion in #459 with interest.)
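For anyone who prefers a config file over command-line flags, the gunicorn command from this comment can be written as a gunicorn configuration file, which is plain Python. This is a sketch rather than what the ara images ship: the file name `gunicorn.conf.py` is gunicorn's default config location, and the `wsgi_app` setting assumes gunicorn 20.1.0 or newer.

```python
# gunicorn.conf.py
# Mirrors: gunicorn ara.server.wsgi --workers=2 --threads=4 \
#          --worker-class=gthread --worker-tmp-dir /dev/shm

# WSGI module to serve (the `wsgi_app` setting needs gunicorn >= 20.1.0).
wsgi_app = "ara.server.wsgi"

# Two worker processes, each handling requests on four threads.
workers = 2
threads = 4
worker_class = "gthread"

# Keep the worker heartbeat file on a memory-backed filesystem.
worker_tmp_dir = "/dev/shm"
```

With this file in the working directory, running `python3 -m gunicorn` without further arguments should pick it up; a different location can be passed with `-c`.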
|
Thanks for reporting back :D I have not revisited the topic about making the callback less blocking in a while and it could be worth looking into again. With some time to think about it, the approach used in https://gist.github.com/phemmer/8ee4ea0ebf1b389050ce4a2bd78c66d6 could be shipped as an additional callback that people can use if need be. I need some time to test it out. I will also add it to my to-do list for benchmarking I will be doing in the not-too-distant future. |
What is the idea ?
I'm not sure if the current gunicorn settings in the official ara images are really optimized for a container usage:
Starting 4 workers means 4 processes inside the container, which is vertical scaling inside the container. But isn't using containers about horizontal scaling? So instead of spawning more processes in one container, we would just use more containers.
I found this nice guide: https://pythonspeed.com/articles/gunicorn-in-docker/ and also tried the recommended settings. With them I am able to spawn more containers, each with fewer resources, which works much better on my container platform (OpenShift).
The guide is from 2019 and I'm not an expert on that topic, but maybe there are some people here who can jump into the discussion :)