Replies: 5 comments 1 reply
-
TL;DR: This is by design, and the suggested solution is what you mention as a workaround. The buffer-dropping flag could be implemented as an enhancement if it turns out to be feasible. By default, it's better to call operators' attention to the situation (by way of job failure) than to lose logs silently. There could be a configuration option to drop orphaned buffers, but I'm not sure how complex it would be to determine whether a buffer is orphaned. I'm also not sure how practical it would be to run the drain job on config update, because it might lengthen the config update or even make it fail, e.g. if the update was done to get rid of an obsolete or erroneous output that won't accept logs anyway.
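For illustration only, here is a minimal Go sketch of how such a "drop orphaned buffers" option might decide which buffer directories no longer belong to any configured output. It assumes (hypothetically) one buffer subdirectory per output under a common root; the real fluentd buffer layout may differ, which is exactly the complexity mentioned above.

```go
// Hypothetical sketch, not logging-operator code: flag buffer directories
// whose names do not match any output in the currently rendered config.
package main

import (
	"fmt"
	"os"
)

// orphanedBufferDirs returns subdirectories of bufferRoot that are not named
// after any currently configured output ID.
func orphanedBufferDirs(bufferRoot string, activeOutputIDs map[string]bool) ([]string, error) {
	entries, err := os.ReadDir(bufferRoot)
	if err != nil {
		return nil, err
	}
	var orphans []string
	for _, e := range entries {
		if e.IsDir() && !activeOutputIDs[e.Name()] {
			orphans = append(orphans, e.Name())
		}
	}
	return orphans, nil
}

func main() {
	// Example: output IDs still present in the rendered config (illustrative names).
	active := map[string]bool{"flow0:output-s3": true}
	orphans, err := orphanedBufferDirs("/buffers", active)
	if err != nil {
		fmt.Fprintln(os.Stderr, "scan failed:", err)
		return
	}
	fmt.Println("candidate orphaned buffer dirs:", orphans)
}
```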
-
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions!
-
We've been discussing a possible solution where we would create configuration snapshots, and drainer pods would be tied to a snapshot instead of a moving target. In my opinion this should be an opt-in mechanism that works in an immutable manner: configuration changes wouldn't automatically trigger a change in the existing deployment, but rather spin up a new cluster with the new config and redirect traffic there. Once the new cluster is up and running, the existing cluster can be scaled down with its original config, using the drainer as usual.
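As a rough, hypothetical sketch of the naming side of this idea: if the aggregator workload name were derived from a hash of the rendered config, a config change would naturally produce a new deployment while the old one keeps its name and can be drained and scaled down as usual. The names and hashing scheme below are assumptions, not the operator's actual behavior.

```go
// Sketch of hash-derived aggregator names for an opt-in "immutable" mode.
package main

import (
	"crypto/sha256"
	"fmt"
)

// aggregatorName derives a deterministic, config-specific workload name.
func aggregatorName(base, renderedConfig string) string {
	sum := sha256.Sum256([]byte(renderedConfig))
	return fmt.Sprintf("%s-%x", base, sum[:4]) // short hash suffix
}

func main() {
	oldName := aggregatorName("logging-fluentd", "old rendered config")
	newName := aggregatorName("logging-fluentd", "new rendered config")
	fmt.Println("keep draining and scale down:", oldName)
	fmt.Println("spin up and redirect traffic to:", newName)
}
```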
-
After thinking a little bit more about this I came up with these ideas:

The problem
Live configuration changes are useful in most situations but can be painful in certain environments where scaling down aggregator pods is a regular activity. In terms of configuration I primarily mean output configurations, where buffers are persisted on disk and are thus a moving target for the drainer pods if the config changes.

Proposed solution
I would like to see a solution where I can say that certain changes should trigger a completely different aggregator (fluentd/syslog-ng) deployment with the new config, while the existing deployment continues using the same config and eventually gets scaled down properly. I can think of two different approaches.

Using a webhook with statefulsets
Similar to how we create a hash based on the configuration, we could use that hash to create isolated configs. A webhook would understand the config hash and would mutate the pod to use a separate, hash-specific directory for its buffers. When there is a new configuration and the pods are redeployed by the statefulset controller, the webhook would watch pod deletions and would create a job to drain the buffers with the previous configuration. This would require the PV to be mounted as ReadWriteMany. If an ordering guarantee is needed, the mutating webhook should actually block deletion until the drainer completes; otherwise the two can happen simultaneously. I believe this would actually make the drainer job logic in the operator obsolete.

Using our own workload controller
We could implement our own workload controller specifically tuned for controlling log aggregation workloads. This would be a much heavier lift, but it would avoid the issues of mutating webhooks and would allow much greater flexibility and freedom.
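A minimal, hypothetical sketch of the webhook mutation described above: given the config hash, point the buffer volume mount at a hash-specific subPath so each config generation writes to its own directory. The volume and mount names here are assumptions for the example, not the operator's real manifest.

```go
// Illustrative sketch (not an actual admission webhook): rewrite the buffer
// volume mount of a pod so chunks land in a per-config-hash subPath.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// isolateBufferMount points the named buffer mount at a config-hash subPath
// so different config generations never share chunk directories.
func isolateBufferMount(pod *corev1.Pod, mountName, configHash string) {
	for ci := range pod.Spec.Containers {
		for mi := range pod.Spec.Containers[ci].VolumeMounts {
			m := &pod.Spec.Containers[ci].VolumeMounts[mi]
			if m.Name == mountName {
				m.SubPath = configHash
			}
		}
	}
}

func main() {
	pod := &corev1.Pod{
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name: "fluentd",
				VolumeMounts: []corev1.VolumeMount{{
					Name:      "buffer", // hypothetical volume name
					MountPath: "/buffers",
				}},
			}},
		},
	}
	isolateBufferMount(pod, "buffer", "cfg-4f9a12")
	fmt.Println(pod.Spec.Containers[0].VolumeMounts[0].SubPath)
}
```

With ReadWriteMany storage, a drainer job started for the previous hash could mount the same volume and only touch that hash's subPath, which is the separation the comment above relies on.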
-
We have a use case where, after removing outputs and flows in our Rancher projects, the associated Fluent Bit buffers still remain. We do not have administrative privileges for the logging operator and can only maintain the output and flow CRDs. Perhaps the mentioned feature would help us; currently there seems to be no option other than opening tickets or directly contacting the admin team to clear the buffers manually.
-
Describe the bug:
Leftover buffers with no associated config (the flow / output does not exist anymore) are never drained, so drain-watch does not kill fluentd and the buffer-volume-sidecar, and the drainer job finishes with an error because of the timeout.
Expected behaviour:
The drainer job skips old chunks with no associated config.
Better: the drainer job is executed on config update so that no orphaned chunks stay in the buffers.
Steps to reproduce the bug:
Workaround:
for each errored drainer pod (with its associated logging-operator-logging-fluentd-XX pod):
Environment details:
/kind bug