resource recommendations for syslog-ng and fluentbit? #1788

sebhoss · 2024-08-06T16:52:29Z

I understand that asking for resource recommendations is highly subjective and a simple one-size-fits-all answer is not possible. Yet here we are and this is what happened in my world recently:

Someone added hundreds of new pods to our cluster, each logging multiple messages per second 24/7
fluentbit pods started to OOM -> increasing their memory request/limit to 200Mi seems to have solved this for us
syslog-ng pods started to OOM -> increased their memory request/limit to 4000Mi, but they still fail occasionally
syslog-ng buffer ran full and stopped syslog-ng from starting again -> increased PVC size to fix this
syslog-ng could not write to loki -> increasing max_global_streams_per_user from 5000 (default) to 10000 seems to have fixed this
syslog-ng started outputting log messages like E0000 00:00:1722961821.975117 32 wire_format_lite.cc:626] String field 'logproto.EntryAdapter.line' contains invalid UTF-8 data when serializing a protocol buffer. Use the 'bytes' type if you intend to send raw bytes.

I'm not sure whether the last item is related or I just never saw it before, but I would really like some community feedback on what works for YOU. We are running 2 replicas of syslog-ng, each now with a 20Gi buffer and 4Gi of memory. Are there any parameters I can use to tune this system besides these compute resources? Would decreasing the batch size reduce memory consumption? Would it help to increase the number of workers? Are buffers a bad idea?

sebhoss (Author) · 2024-08-06T18:42:51Z

We are now at 8000Mi memory request/limit for syslog-ng and it is still failing. A couple more questions/observations:

https://axoflow.com/docs/axosyslog-core/chapter-destinations/destination-loki/#batch-lines says that the log-iw-size() option of the source must be higher than batch-lines()*workers(), and https://axoflow.com/docs/axosyslog-core/chapter-sources/configuring-sources-network/reference-source-network/#log-iw-size says that log-iw-size() will be divided by the number of connections. If batch-lines is 100 and workers is 5 for us, and we have, say, 10 connections, do we need to set the initial window size to at least 5000 (because 5000 / 10 = 500 = 100 * 5)? Or is the number of connections irrelevant for the comparison between log-iw-size and batch-lines*workers?
I realized that I've increased the PVC size but have not adjusted disk_buffer.disk_buf_size config option of our loki output. https://axoflow.com/docs/axosyslog-core/chapter-destinations/destination-loki/#workers mentions that separate disk buffers are used for each worker. If we are using 5 workers and have configured a disk_buf_size of 2Gi, should the PVC be 10Gi (5 * 2Gi)? Should the PVC be slightly larger so that syslog-ng can recover even if the buffer is completely full?
If the initial window size is configured per source, does every single fluentbit pod count as a single source? So if we have 50 fluentbit pods, each with a window size of 100, should batch-lines be something like 5000 (using a single worker)?
While we are running multiple replicas, only a single syslog-ng pod seems to receive logs and run out of memory. Is there any way to configure what is described at https://axoflow.com/docs/axosyslog-core/chapter-examples/load-bal-multi-dest/ with the logging-operator?
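
The window-size question in the first item comes down to which of two readings of the docs is right. A minimal sketch of the arithmetic for both readings follows; the function name and the `per_connection` switch are hypothetical, and nothing here confirms which reading syslog-ng actually applies.

```python
def iw_size_threshold(batch_lines: int, workers: int,
                      connections: int, per_connection: bool) -> int:
    """Value that log-iw-size() has to reach to cover batch-lines() * workers().

    per_connection=True assumes the window is first divided evenly across
    connections and each share must cover the batch (an assumption).
    per_connection=False assumes the undivided value is what gets compared.
    """
    required = batch_lines * workers
    if per_connection:
        # Each connection only gets log_iw_size / connections.
        return required * connections
    return required


# Numbers from the question: batch-lines 100, 5 workers, 10 connections.
print(iw_size_threshold(100, 5, 10, per_connection=True))   # 5000
print(iw_size_threshold(100, 5, 10, per_connection=False))  # 500
```

The first result matches the 5000 / 10 = 500 = 100 * 5 calculation above; the second is what you get if the number of connections is irrelevant.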

sebhoss (Author) · 2024-08-07T08:14:15Z

At 12Gi and still failing, we decided more or less at random to disable the disk buffer. After 4 hours there has been no OOM or any other crash so far.

pepov (Maintainer) · 2024-08-16T14:43:24Z

In the logging operator, logiwsize is calculated from maxconnections (maxconnections * 100). If maxconnections is unset, it is in turn calculated from the number of nodes (100 * node count, but no larger than 1000).

This means that with 50 nodes altogether, logiwsize is set to 100 000, and based on those docs you can use that many batch-lines. However, I would expect you to hit a limit on loki's side with that, and it could also lead to higher memory consumption and much higher latency (no more, of course, than what batch-timeout allows).
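
As a quick check of that arithmetic, here is a minimal sketch of the calculation as described in this reply. The function and parameter names are hypothetical and are not taken from the operator's source code.

```python
from typing import Optional, Tuple


def operator_window_defaults(node_count: int,
                             max_connections: Optional[int] = None) -> Tuple[int, int]:
    """Return (max_connections, log_iw_size) following the rule stated above."""
    if max_connections is None:
        # Unset: 100 connections per node, capped at 1000.
        max_connections = min(100 * node_count, 1000)
    # The initial window is 100 messages per allowed connection.
    log_iw_size = max_connections * 100
    return max_connections, log_iw_size


print(operator_window_defaults(50))  # (1000, 100000)
print(operator_window_defaults(3))   # (300, 30000)
```

With the cap in place, any cluster of 10 or more nodes ends up with the same 100 000 window unless maxconnections is set explicitly.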

Regarding sizing the PVC, I think it is always a good idea to leave a little more room for the disk buffers than strictly required.
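
A rough sizing sketch for the question above, assuming that each worker gets its own disk buffer of disk_buf_size (as the linked docs suggest) and that 10% headroom is enough. The 10% figure and the function name are illustrative assumptions, not a recommendation from this thread.

```python
def pvc_size_gi(workers: int, disk_buf_size_gi: int,
                headroom_percent: int = 10) -> int:
    """Estimate the PVC size in Gi, rounded up to a whole Gi."""
    # One disk buffer per worker (assumption based on the linked docs).
    raw = workers * disk_buf_size_gi
    # Integer ceiling division avoids floating-point rounding surprises.
    return (raw * (100 + headroom_percent) + 99) // 100


print(pvc_size_gi(5, 2))  # 11
```

For the 5 workers and 2Gi buffers mentioned earlier, this gives 11Gi instead of the bare 10Gi.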

Regarding load balancing: it depends on the fluentbit networking settings and on the Kubernetes service load balancing implementation. You can tune the TCP keepalive settings (keepalive max recycle, more specifically) if you need a better distribution of connections from fluentbit to syslog-ng.

It's not a bad idea to use syslog-ng without disk buffers as long as data durability is not critical. Syslog-ng will try its best to flush all data to the destination before it shuts down under normal circumstances.

Can you give us a more specific output configuration so that we can better understand what could have gone wrong there? The number of nodes and the rate of messages would also be useful to know. Feel free to ping us on discord as well: https://discord.gg/6FnMxKJC
