- Understanding Error Messages
- Basic Techniques
- Memory Leaks or high memory usage
- Segfaults and crashes (SIGSEGV)
- Log Loss
- Scaling
- Throttling, Log Duplication & Ordering
- Plugin Specific Issues
- Tail Ignore_Older
- Tail input performance issues and log file deletion
- rewrite_tag filter and cycles in the log pipeline
- Use Rubular site to test regex
- Chunk_Size and Buffer_Size for large logs in TCP Input
- kinesis_streams or kinesis_firehose partially succeeded batches or duplicate records
- Always use multiline in the tail input
- Tail Input duplicates logs during rotation
- Both CloudWatch Plugins: Create failures consume retries
- Best Practices
- Fluent Bit Windows containers
- Runbooks
- Testing
- FAQ
- AWS Go Plugins vs AWS Core C Plugins
- Migrating to or from cloudwatch_logs C plugin to or from cloudwatch Go Plugin
- FireLens Tag and Match Pattern and generated config
- What to do when Fluent Bit memory usage is high
- I reported an issue, how long will it take to get fixed?
- What will the logs collected by Fluent Bit look like?
- Which version did I deploy?
- How is the timestamp set for my logs?
- What are STDOUT and STDERR?
Fluent Bit is a very a low level tool, both in terms of the power over its configuration that it gives you, and in the verbosity of its log messages. Thus, it is normal to have errors in your Fluent Bit logs; its important to understand what these errors mean.
[error] [http_client] broken connection to logs.us-west-2.amazonaws.com:443 ?
[error] [output:cloudwatch_logs:cloudwatch_logs.0] Failed to send log events
Here we see error messages associated with a single request to send logs. These errors both come from a single http request failure. It is normal for this to happen occasionally; where other tools might silently retry, Fluent Bit tells that a connection was dropped. In this case, the http client logged a connection error, and then this caused the CloudWatch plugin to log an error because the http request failed. Seeing error messages like these does not mean that you are losing logs; Fluent Bit will retry.
Retries can be configured: https://docs.fluentbit.io/manual/administration/scheduling-and-retries
This is an error message indicating a chunk of logs failed and will be retried:
[ warn] [engine] failed to flush chunk '1-1647467985.551788815.flb', retry in 9 seconds: task_id=0, input=forward.1 > output=cloudwatch_logs.0 (out_id=0)
Even if you see this message, you still have not lost logs yet. Since it will retry. You have lost logs if you see something like the following message:
[2022/02/16 20:11:36] [ warn] [engine] chunk '1-1645042288.260516436.flb' cannot be retried: task_id=0, input=tcp.3 > output=cloudwatch_logs.1
When you see this message, you have lost logs. The other case that indicates log loss is when an input is paused, which is covered in the overlimit error section.
First, please read the section Network Connection Issues, which explains that Fluent Bit's http client is low level and logs errors that other clients may simply silently retry. That section also explains the auto_retry_requests
option which defaults to true.
Your Fluent Bit deployment will occasionally log network connection errors. These are normal and should not cause concern unless they cause log loss. See How do I tell if Fluent Bit is losing logs? to learn how to check for log loss from network issues. If you see network errors happen very frequently or they cause your retries to expire, then please cut us an issue.
The following are common network connection errors that we have seen frequently:
[error] [tls] error: error:00000005:lib(0):func(0):DH lib
[error] [src/flb_http_client.c:1199 errno=25] Inappropriate ioctl for device
[error] [tls] error: unexpected EOF
[error] [http_client] broken connection to firehose.eu-west-1.amazonaws.com:443
[error] [aws_client] connection initialization error
[error] [src/flb_http_client.c:1172 errno=32] Broken pipe
[error] [tls] SSL error: NET - Connection was reset by peer
[2023/01/19 04:23:24] [error] [input:tail:tail.6] file=/app/env/paymentservice/var/output/logs/system.out.log.2023-01-22-06 requires a larger buffer size, lines are too long. Skipping file.
Above is an example error message that can occur with the tail input. This happens due to the code here. When Fluent Bit tail skips a log file, the remaining un-read logs in the file will not be sent and will be lost.
The Fluent Bit tail documentation has settings that are relevant here:
Buffer_Max_Size
: Defaults to32k
. This is the max size for the in memory Fluent Bit will use to read your log file(s). Fluent Bit reads log files line by line, so this value needs to be larger than the longest log lines in your file(s). Fluent Bit will only increase the buffer as needed, therefore, it is safe and recommended to err on the side of configuring a large value for this field for safety.Buffer_Chunk_Size
: Defaults to32k
. This setting controls the initial size of the buffer that Fluent Bit uses to read log lines. And it also controls the increments that Fluent Bit will increase the buffer as needed. Let's understand this via an example. Consider that you have configured aBuffer_Max_Size
of1MB
and the defaultBuffer_Chunk_Size
of32k
. If Fluent Bit has to read a log line which is100k
in length, the buffer will start at32k
. Fluent Bit will read32k
of the log line and notice that it hasn't found a newline yet. It will then increase the buffer by another32k
, and read more of the line. That still isn't enough, so again and again the buffer will be increased byBuffer_Chunk_Size
until its large enough to capture the entire line. So configure this setting to be a value close to the longest lines you have, this way Fluent Bit will not waste CPU cycles on many rounds of re-allocation of the buffer.Skip_Long_Lines
: Defaults tooff
. When this setting is enabled it means "Skip long log lines instead of skipping the entire file". Fluent Bit will skip long log lines that exceedBuffer_Max_Size
and continue reading the file at the next line. Therefore, we recommend always setting this value toOn
.
To prevent the error and log loss above, we can set our tail input to have:
[INPUT]
Name tail
Buffer_Max_Size { A value larger than the longest line your app can produce}
Skip_Long_Lines On # skip unexpectedly long lines instead of skipping the whole file
A common error message for ECS FireLens users who are reading log files is the following:
[2022/03/16 19:05:06] [error] [input:tail:tail.5] read error, check permissions: /var/output/logs/service_log*
This error will happen in many cases on startup when you use the tail input. If you are running Fluent Bit under FireLens, then remember that it alters the container ordering so that the Fluent Bit container starts first. This means that if it is supposed to scan a file path for logs on a volume mounted from another container- at startup the path might not exist.
This error is recoverable and Fluent Bit will keep retrying to scan the path.
There are two cases which might apply to you if you experience this error:
-
You need to fix your task definition: If you do not add a volume mount between the app container and FireLens container for your logs files, Fluent Bit can't find them and you'll get this error repeatedly.
-
This is normal, ignore the error: If you did set up the volume mount correctly, you will still always get this on startup. This is because ECS will start the FireLens container first, and so that path won't exist until the app container starts up, which is after FireLens. Thus it will always complain about not finding the path on startup. This can either be safely ignored or fixed by creating the full directory path for the log files in your Dockerfile so that it exists before any log files are written. The error will only happen on startup and once the log files exist, Fluent Bit will pick them up and begin reading them.
Fluent Bit has storage and memory buffer configuration settings.
When the storage is full the inputs will be paused and you will get warnings like the following:
[input] tail.1 paused (mem buf overlimit)
[input] tail.1 paused (storage buf overlimit
[input] tail.1 resume (mem buf overlimit)
[input] tail.1 resume (storage buf overlimit
Search for "overlimit" in the Fluent Bit logs to find the paused and resume messages about storage limits. These can be found in the code in flb_input_chunk.c
.
The storage buf overlimit
occurs when the number of in memory ("up") chunks exceeds the storage.max_chunks_up
and you have set storage.type filesystem
and storage.pause_on_chunks_overlimit On
.
The mem buf overlimit
occurs when the input has exceeded the configured Mem_Buf_Limit
and storage.type memory
is configured.
Users who followed this tutorial or similar to send logs to the TCP input often see message like:
invalid JSON message, skipping
Or:
[debug] [input:tcp:tcp.0] JSON incomplete, waiting for more data...
These are caused by setting Format json
in the TCP input. If you see these errors, then your logs are not actually being sent as JSON. Change the format to none
.
For example:
[INPUT]
Name tcp
Listen 0.0.0.0
Port 5170
# please read docs on Chunk_Size and Buffer_Size: https://docs.fluentbit.io/manual/pipeline/inputs/tcp
Chunk_Size 32
# this number of kilobytes is the max size of single log message that can be accepted
Buffer_Size 64
# If you set this to json you might get error: "invalid JSON message, skipping" if your logs are not actually JSON
Format none
Tag tcp-logs
# input will stop using memory and only use filesystem buffer after 50 MB
Mem_Buf_Limit 50MB
storage.type filesystem
You may see that Fluent Bit exited with one of its last logs as:
[engine] caught signal (SIGTERM)
This means that Fluent Bit received a SIGTERM. This means that Fluent Bit was requested to stop; SIGTERM is associated with graceful shutdown. Fluent Bit will be sent a SIGTERM when your container orchestrator/runtime wants to stop its container/task/pod.
When Fluent Bit recieves a SIGTERM, it will begin a graceful shutdown and will attempt to send all currently buffered logs. Its inputs will be paused and then it will shutdown once the configured Grace
period has elapsed. Please note that the default Grace period for Fluent Bit is only 5 seconds, whereas, in most container orchestrators the default grace period time is 30 seconds.
If you see this in your logs and you are troubleshooting a stopped task/pod, this log indicates that Fluent Bit did not cause the shutdown. If there is another container in your task/pod check if it exited unexpectedly.
If you are an Amazon ECS FireLens user and you see your task status as:
STOPPED (Essential container in task exited)
And Fluent Bit logs show this SIGTERM message, then this indicates that Fluent Bit did not cause the task to stop. One of the other containers exited (and was essential), thus ECS chose to stop the task and sent a SIGTERM to the remaining containers in the task.
If you see a SIGSEGV
in your Fluent Bit logs, then unfortunately you have encountered a bug! This means that Fluent Bit crashed due to a segmentation fault, an internal memory access issue. Please cut us an issue on GitHub, and consider checking out the SIGSEGV debugging section of this guide.
You might see errors like:
accept4: Too many open files
[error] [input:tcp:tcp.4] could not accept new connection
The solution is to increase the nofile
ulimit
for the Fluent Bit process or container.
In Amazon ECS, ulimit can be set in the container definition.
In Kubernetes with Docker, you need to edit the /etc/docker/daemon.json
to set the default ulimits for containers. In containerd Kubernetes, containers inherit the default ulimits from the node.
Many customers use log4j to emit logs, and use the TCP appender to write to a Fluent Bit TCP input. ECS customers can follow this ECS Log Collection Tutorial.
Unfortunately, log4j can sometimes fail to write to the TCP socket and you may get an error like:
2022-09-05 14:18:49,694 Log4j2-TF-1-AsyncLogger[AsyncContext@6fdb1f78]-1 ERROR Unable to write to stream TCP:localhost:5170 for appender ApplicationTcp org.apache.logging.log4j.core.appender.AppenderLoggingException: Error writing to TCP:localhost:5170 after reestablishing connection for localhost/127.0.0.1:5170
Or:
Caused by: java.net.SocketException: Connection reset by peer (Write failed)
If this occurs, your first step is to check if it is simply a misconfiguration. Ensure you have configured a Fluent Bit TCP input for the same port that log4j is using.
Unfortunately, AWS has received occasional reports that log4j TCP appender will fail randomly in a previously working setup. AWS is tracking reports of this problem here: aws#294
While we can not prevent connection failures from occurring, we have found several mitigations.
AWS testing has found that TCP input throughput is maximized when Fluent Bit is performing well. If you enable at least one worker for all outputs, this ensures your outputs each have their own thread(s). This frees up the main Fluent Bit scheduler/input thread to focus on the inputs.
It should be noted that starting in Fluent Bit 1.9+ most outputs have 1 worker enabled by default. This means that if you do not specify workers
in your output config you will still get 1 worker enabled. You must set workers 0
to disable workers. All AWS outputs have 1 worker enabled by default since v1.9.4.
If the TCP plugin fails for Fluent Bit, you can configure a failover appender. The Log4J XML below configures logs to be retried to STDOUT failover. It is difficult to search stdout for these failed logs if/because STDOUT is cluttered with other non log4j logs. You might prefer to have a special pattern for logs that failover and go to stdout:
<Console name="STDOUT" target="SYSTEM_OUT" ignoreExceptions="false">
<PatternLayout>
<Pattern> {"logtype": "tcpfallback", "message": "%d %p %c{1.} [%t] %m%n" }</Pattern>
</PatternLayout>
</Console>
You can then configure this appender as the failover appender:
<Failover name="Failover" primary="RollingFile">
<Failovers>
<AppenderRef ref="Console"/>
</Failovers>
</Failover>
Alternatively, you can use a failover appender which just writes failed logs to a file. A Fluent Bit Tail input can then collect the failed logs.
Many Fluent Bit problems can be easily understood once you have full log output. Also, if you want help from the aws-for-fluent-bit team, we generally request/require debug log output.
The log level for Fluent Bit can be set in the Service section, or by setting the env var FLB_LOG_LEVEL=debug
.
Kubernetes users should use the Fluent Bit prometheus endpoint and their preferred prometheus compatible agent to scrape the endpoint.
FireLens users should check out the FireLens example for sending Fluent Bit internal metrics to CloudWatch: amazon-ecs-firelens-examples/send-fb-metrics-to-cw.
The FireLens example technique shown in the link above can also be used outside of FireLens if desired.
Fluent Bit has network settings that can be used with all official upstream output plugins: Fluent Bit networking.
If you experience network issues, try setting the net.dns.mode
setting to both TCP
and UDP
. That is, try both, and see which one resolves your issues. We have found that in different cases, one setting will work better than the other.
Here is an example of what this looks like with TCP
:
[OUTPUT]
Name cloudwatch_logs
Match *
region us-east-1
log_group_name fluent-bit-cloudwatch
log_stream_prefix from-fluent-bit-
auto_create_group On
workers 1
net.dns.mode TCP
And here is an example with UDP
:
[OUTPUT]
Name cloudwatch_logs
Match *
region us-east-1
log_group_name fluent-bit-cloudwatch
log_stream_prefix from-fluent-bit-
auto_create_group On
workers 1
net.dns.mode UDP
This setting works with the cloudwatch_logs
, s3
, kinesis_firehose
, kinesis_streams
, and opensearch
AWS output plugins.
The same as other AWS tools, Fluent Bit has a standard chain of resolution for credentials. It checks a series of sources for credentails and uses the first one that returns valid credentials. If you enable debug logging (further up in this guide), then it will print verbose information on the sources of credentials that it checked and why the failed.
See our documentation on the credential sources Fluent Bit supports and the order of resolution.
If you use EC2 instance role as your source for credentials you might run into issues with the new IMDSv2 system. Check out the EC2 docs.
If you run Fluent Bit in a container, you most likely will need to set the hop limit to two if you have tokens required (IMDSv2 style):
aws ec2 modify-instance-metadata-options \
--instance-id <my-instance-id> \
--http-put-response-hop-limit 2
The AWS cli can also be used to require IMDSv2 for security purposes:
aws ec2 modify-instance-metadata-options \
--instance-id <my-instance-id> \
--http-tokens required \
--http-endpoint enabled
The other way to override the IMDS issue on EKS platform, without making the hop-limit to 2 will be setting up the hostNetwork to true, with the dnsPolicy as 'ClusterFirstWithHostNet':
spec:
dnsPolicy: ClusterFirstWithHostNet
hostNetwork: true
The AWS APIs for CloudWatch, Firehose, and Kinesis have limits on the number of events and number of bytes that can be sent in each request. If you enable debug logging with AWS for Fluent Bit version 2.26.0+, then Fluent Bit will log debug information as follows for each API request:
[2022/05/06 22:43:17] [debug] [output:cloudwatch_logs:cloudwatch_logs.0] cloudwatch:PutLogEvents: events=6, payload=553 bytes
[2022/05/06 22:43:17] [debug] [output:kinesis_firehose:kinesis_firehose.0] firehose:PutRecordBatch: events=10, payload=666000 bytes
[2022/05/06 22:43:17] [debug] [output:kinesis_streams:kinesis_streams.0] kinesis:PutRecords: events=4, payload=4000 bytes
You can then experiment with different settings for the Flush interval and find the lowest value that leads to efficient batching (so that in general, batches are close to the max allowed size):
[Service]
Flush 2
When you first see an issue that you don't understand, the best option is to check open and closed issues in both repos:
Lots of common questions and issues have already been answered by the maintainers.
If you're running on an old version, it may have been fixed in the latest release: https://github.com/aws/aws-for-fluent-bit/releases
Or, you may be facing an unresolved or not-yet-discovered issue in a newer version. Downgrading to our latest stable version may work: https://github.com/aws/aws-for-fluent-bit#using-the-stable-tag
In the AWS for Fluent Bit repo, we always attempt to tag bug reports with which versions are affected, for example see this query.
One of the simplest causes of network connection issues is throttling, some AWS APIs will block new connections from the same IP for throttling (rather than wasting effort returning a throttling error in the response). We have seen this with the CloudWatch Logs API. So, the first thing to check when you experience network connection issues is your log ingestion/throughput rate and the limits for your destination.
Please see Common Network Errors for a list of common network errors outputted by Fluent Bit.
In addition, Fluent Bit's networking library and http client is very low level and simple. It can be normal for there to be occasional dropped connections or client issues. Other clients may silently auto-retry these transient issues; Fluent Bit by default will always log an error and issue a full retry with backoff for any networking error.
This is only relavant for the core outputs, which for AWS are:
cloudwatch_logs
kinesis_firehose
kinesis_streams
s3
Remember that we also have older Golang output plugins for AWS (no longer recommended since they have poorer performance). These do not use the core fluent bit http client and networking library:
cloudwatch
firehose
kinesis
Consequently, the AWS team contributed an auto_retry_requests
config option for each of the core AWS outputs. This will issue an immediate retry for any connection issues.
Fluent Bit is efficient and performant, but it isn't magic. Here are common/normal causes of high memory usage:
- Sending logs at high throughput
- The Fluent Bit configuration and components used. Each component and extra enabled feature may incur some additional memory. Some components are more efficient than others. To give a few examples- Lua filters are powerful but can add significant overhead. Multiple outputs means that each output maintains its own request/response/processing buffers. And the rewrite_tag filter is actually implemented as a special type of input plugin that gets its own input
mem_buf_limit
. - Retries. This is the most common cause of complaints of high memory usage. When an output issues a retry, the Fluent Bit pipeline must buffer that data until it is successfully sent or the configured retries expire. Since retries have exponential backoff, a series of retries can very quickly lead to a large number of logs being buffered for a long time.
Please see the Fluent Bit buffering documentation: https://docs.fluentbit.io/manual/administration/buffering-and-storage
Check out the FireLens example for preventing/reducing out of memory exceptions: amazon-ecs-firelens-examples/oomkill-prevention
If you do think the cause of your high memory usage is a bug in the code, you can optionally help us out by attempting to profile Fluent Bit for the source of the leak. To do this, use the Valgrind tool.
There are some caveats when using Valgrind. Its a debugging tool, and so it significantly reduces the performance of Fluent Bit. It will consume more memory and CPU and generally be slower when debugging. Therefore, Valgrind images may not be safe to deploy to prod.
Use the Dockerfile below to create a new container image that will invoke your Fluent Bit version using Valgrind. Replace the "builder" image with whatever AWS for Fluent Bit release you are using, in this case, we are testing the latest image. The Dockerfile here will copy the original binary into a new image with Valgrind. Valgrind is a little bit like a mini-VM that will run the Fluent Bit binary and inspect the code at runtime. When you terminate the container, Valgrind will output diagnostic information on shutdown, with summary of memory leaks it detected. This output will allow us to determine which part of the code caused the leak. Valgrind can also be used to find the source of segmentation faults
FROM public.ecr.aws/aws-observability/aws-for-fluent-bit:latest as builder
FROM public.ecr.aws/amazonlinux/amazonlinux:2
RUN yum upgrade -y \
&& yum install -y openssl11-devel \
cyrus-sasl-devel \
pkgconfig \
systemd-devel \
zlib-devel \
libyaml \
valgrind \
nc && rm -fr /var/cache/yum
COPY --from=builder /fluent-bit /fluent-bit
CMD valgrind --leak-check=full --error-limit=no /fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf
The best option, which is most likely to catch any leak or segfault is to create a fresh build of the image using the Dockerfile.debug
in AWS for Fluent Bit. This will create a fresh build with debug mode and valgrind support enabled, which gives the highest chance that Valgrind will be able to produce useful diagnostic information about the issue.
- Check out the git tag for the version that saw the problem
- Make sure the
FLB_VERSION
at the top of theDockerfile.debug
is set to the same version as the main Dockerfile for that tag. - Build this dockerfile with the
make debug
target.
One technique is to use the Option 2 shown above for memory leak checking with Valgrind. It can also find the source of segmentation faults; when Fluent Bit crashes Valgrind will output a stack trace that shows which line of code caused the crash. That information will allow us to fix the issue. This requires the use of Option 2: Debug Build above where you re-build AWS for Fluent Bit in debug mode.
Alternatively, the best and most effective way to debug a crash is to get a core dump.
Follow our standalone core dump debugging tutorial which supports multiple AWS platforms: Tutorial: Debugging Fluent Bit with GDB.
The most common cause of log loss is failed retries. Check your Fluent Bit logs for those error messages. Figure out why retries are failing, and then solve that issue.
If you find that you are losing logs and failed retries do not seem to be the issue, then the technique commonly used by the aws-for-fluent-bit team is to use a data generate that creates logs with unique IDs. At the log destination, you can then count the unique IDs and know exactly how many logs were lost. These unique IDs should ideally be some sort of number that counts up. You can then determine patterns for the lost logs, for example- are the lost logs grouped in chunks together, or do they all come from a single log file, etc.
We have uploaded some tools that we have used for past log loss investigations in the troubleshooting/tools directory.
Finally, we should note that the AWS plugins go through log loss testing as part of our release pipeline, and log loss will block releases. We are working on improving our test framework; you can find it in the integ/ directory.
We also now have log loss and throughput benchmarks as part of our release notes.
The following are typical causes of log loss. Each of these have clear error messages associated with them that you can find in the Fluent Bit log output.
In rare cases, we have also seen that lack of log deletion and tail settings can cause slowdown in Fluent Bit and loss of logs:
While the Fluent Bit maintainers are constantly working to improve its max performance, there are limitations. Carefully architecting your Fluent Bit deployment can ensure it can scale to meet your required throughput.
If you are handling very high throughput of logs, consider the following:
- Switch to a sidecar deployment model if you are using a Daemon/Daemonset. In the sidecar model, there is a dedicated Fluent Bit container for each application container/pod/task. This means that as you scale out to more app containers, you automatically scale out to more Fluent Bit containers as well.
- Enable Workers. Workers is a new feature for multi-threading in core outputs. Even enabling a single worker can help, as it is a dedicated thread for that output instance. All of the AWS core outputs support workers:
- Use a log streaming model instead of tailing log files. The AWS team has repeatedly found that complaints of scaling problems usually are cases where the tail input is used. Tailing log files and watching for rotations is inherently costly and slower than a direct streaming model. Amazon ECS FireLens is one good example of direct streaming model (that also uses sidecar for more scalability)- logs are sent from the container stdout/stderr stream via the container runtime over a unix socket to a Fluent Bit forward input. Another option is to use the TCP input which can work with some loggers, including log4j.
To understand how to deal with these issues, it's important to understand the Fluent Bit log pipeline. Logs are ingested from inputs and buffered in chunks, and then chunks are delivered to the output plugin. When an output plugin recieves a chunk it only has 3 options for its return value to the core pipeline:
- FLB_OK: the entire chunk was successfully sent/handled
- FLB_RETRY: retry the entire chunk up to the max user configured retries
- FLB_ERROR: something went very wrong, don't retry
Because AWS APIs have a max payload size, and the size of the chunks sent to the output is not directly user configurable, this means that a single chunk may be sent in multiple API requests. If one of those requests fails, the output plugin can only tell the core pipeline to retry the entire chunk. This means that send failures, including throttling, can lead to duplicated data. Because when the retry happens, the output plugin has no way to track that some of the chunk was already uploaded.
The following is an example of Fluent Bit log output indicating that a retry is scheduled for a chunk:
[ warn] [engine] failed to flush chunk '1-1647467985.551788815.flb', retry in 9 seconds: task_id=0, input=forward.1 > output=cloudwatch_logs.0 (out_id=0)
Additionally, while a chunk that got an FLB_RETRY will be retried with exponential backoff, the pipeline will continue to send newer chunks to the output in the meantime. This means that data ordering can not be strictly guaranteed. The exception to this is the S3 plugin with the preserve_data_ordering
feature, which works because the S3 output maintains its own separate buffer of data in the local file system.
Finally, this behavior means that if a chunk fails because of throttling, that specific chunk will backoff, but the engine will not backoff from sending new logs to the output. Consequently, its very important to ensure that your destination can handle your log ingestion rate and that you will not be throttled.
The AWS team understands that the current behavior of the core log pipeline is not ideal for all use cases, and we have taken long term action items to improve it:
- Allow output plugins to configure a max chunk size: fluent-bit#1938
- FLB_THROTTLE return for outputs: fluent-bit#4293
To summarize, log duplication is known to occur for the following reasons:
- Retries for chunks that already partially succeeded (explained above) a. This type of duplication is much more likely when you use Kinesis Streams or Kinesis Firehose as a destination, as explained in the kinesis_streams or kinesis_firehose partially succeeded batches or duplicate records section of this guide.
- If you are using the tail input and your
Path
pattern matches rotated log files, this misconfiguration can lead to duplicated logs.
- If you are using Kinesis Streams of Kinesis Firehose, scale your stream to handle your max required throughput.
- CloudWatch Logs no longer has a per-log-stream ingestion limit.
- Read our Checking Batch Sizes section to optimize how Fluent Bit flushes logs. This optimization can help ensure Fluent Bit does not make more requests than is necessary. However, please understand that in most cases of throttling, there is no setting change in the Fluent Bit client that can eliminate (or even reduce) throttling. Fluent Bit will and must send logs as quickly as it ingests them otherwise there will be a backlog/backpressure.
The tail input has a setting called Ignore_Older
:
Ignores files which modification date is older than this time in seconds. Supports m,h,d (minutes, hours, days) syntax.
We have seen that if there are a large number of files on disk matched by the Fluent Bit tail Path
, then Fluent Bit may become slow and may even lose logs. This happens because it must processes watch events for each file. Consider setting up log file deletion and set Ignore_Older
so that Fluent Bit can stop watching older files.
We have a Log File Deletion example. Setting up log file deletion is required to ensure Fluent Bit tail input can achieve ideal performance. While Ignore_Older
will ensure that it does not process events for older files, the input will still continuously scan your configured Path
for all log files that it matches. If you have lots of old files that match your Path
, this will slow it down. See the log file deletion link below.
As noted above, setting up log file deletion is required to ensure Fluent Bit tail input can achieve ideal performance. If you have lots of old files that match your Path
, this will slow it down.
We have a Log File Deletion example.
The rewrite_tag filter is a very powerful part of the log pipeline. It allows you to define a filter that is actually an input that will re-emit log records with a new tag.
However, with great power comes great responsibility... a common mistake that causes confusing issues is for the rewrite_tag filter to match the logs that it re-emits. This will make the log pipeline circular, and Fluent Bit will just get stuck. Your rewrite_tag filter should never match "*" and you must think very carefully about how you are configuring the Fluent Bit pipeline.
To experiment with regex for parsers, the Fluentd and Fluent Bit community recommends using this site: https://rubular.com/
Many users will have seen our tutorial on sending logs over the Fluent Bit TCP input.
It should be noted that if your logs are very large, they can be dropped by the TCP input unless the appropriate settings are set.
For the TCP input, the Buffer_Size is the max size of single event sent to the TCP input that can be accepted. Chunk_size is for rounds of memory allocation, this must be set based on the customer’s average/median log size. This setting is for performance and exposes the low level memory allocation strategy. https://docs.fluentbit.io/manual/pipeline/inputs/tcp
The following configuration may be more efficient in terms of memory consumption, if logs are normally smaller than Chunk_Size and TCP connections are reused to send logs many times. It uses smaller amounts of memory by default for the TCP buffer, and reallocates the buffer in increments of Chunk_Size if the current buffer size is not sufficient. It will be slightly less computationally efficient if logs are consistently larger than Chunk_Size, as reallocation will be needed with each new TCP connection. If the same TCP connection is reused, the expanded buffer will also be reused. (TCP connections are probably kept open for a long time so this is most likely not an issue).
Buffer_Size 64 # KB, this is the max log event size that can be accepted
Chunk_Size 32 # allocation size each time buffer needs to be increased
As noted in the duplication section of this guide, unfortunately Fluent Bit can sometimes upload the same record multiple times due to retries that partially succeeded. While this is a problem with Fluent Bit outputs in general, it is especially a problem for Firehose and Kinesis since their service APIs allow a request to upload data to partially succeed:
Each API accepts 500 records in a request, and some of these records may be uploaded to the stream and some may fail. As noted in the duplication section of this guide, when a chunk partially succeeds, Fluent Bit must retry the entire chunk. That retry could again fail with partial success, leading to another retry (up to your configured Retry_Limit
). Thus, log duplication can be common with these outputs.
When a batch partially fails, you will get a message like:
[error] [output:kinesis_streams:kinesis_streams.0] 132 out of 500 records failed to be delivered, will retry this batch, {stream name}
Or:
[error] [output:kinesis_firehose:kinesis_firehose.0] 6 out of 500 records failed to be delivered, will retry this batch, {stream name}
To see the reason why each individual record failed, please enable debug logging. These outputs will print the exact error message for each record as a debug log:
The most common cause of failures for kinesis and firehose is a Throughput Exceeded Exception
. This means you need to increase the limits/number of streams for your Firehose or Kinesis stream.
Fluent Bit supports a multiline filter which can concat logs.
However, Fluent Bit also supports multiline parsers in the tail input directly. If your logs are ingested via tail, you should always use the multiple settings in the tail input. That will be much more performant. Fluent Bit can concat the log lines as it reads them. This is more efficient than ingesting them as individual records and then using the filter to re-concat them.
As noted in the Fluent Bit upstream docs, your tail Path
patterns can not match rotated log files. If it does, the contents of the log file may be read a second time once its name is changed.
For example, if you have this tail input:
[INPUT]
Name tail
Tag RequestLogs
Path /logs/request.log*
Refresh_Interval 1
Assume that after the request.log
file reaches its max size it is rotated to request.log1
and then new logs are written to a new file called request.log
. This is the behavior of the log4j DefaultRolloverStrategy.
If this is how your app writes log files, then you will get duplicate log events with this configuration. Because the Path
pattern matches the same log file both before and after it is rotated. Fluent Bit may read the request.log1
as a new log file and thus its contents can be sent twice. In this case, the solution is to change to:
Path /logs/request.log
In other cases, the original config would be correct. For example, assume your app produces first a log file request.log
and then next creates and writes new logs to a new file named request.log1
and then next request.log2
. And then when the max number of files are reached, the original request.log
is deleted and then a new request.log
is written. If that is the scenario, then the original Path pattern with a wildcard at the end is correct since it is necessary to match all log files where new logs are written. This method of log rotation is the behavior of the log4j DirectWriteRolloverStrategy where there is no file renaming.
In Fluent Bit, retries are handled by the core engine. Whenever the plugin encounters any error, it returns a retry to the engine which schedules a retry. This means that log group creation, log stream creation or log retention policy calls can consume a retry if they fail.
This applies to both the CloudWatch Go Plugin and the CloudWatch C Plugin.
This means that before the plugin even tries to send a chunk of log records, it may consume a retry on a random temporary networking error trying to call CreateLogStream.
In general, this should not cause problems, because log stream and group creation is rare and unlikely to fail.
If you are concerned about this, the solution is to set a higher value for Retry_Limit
.
If you use the log_group_name
or log_stream_name
options, creation only happens on startup. If you use the log_stream_prefix
option, creation happens the first time logs are sent with a new tag. Thus, in most cases resource creation is rare/only on startup. If you use the log group and stream templating options (these are different for each CloudWatch plugin, see Migrating to or from cloudwatch_logs C plugin to or from cloudwatch Go Plugin), then log groups and log streams are created on demand based on your templates.
Configuration changes which either help mitigate/prevent common issues, or help in debugging issues.
Fluent Bit supports configuring an alias name for each input and output.
The Fluent Bit internal metrics and most error/warn/info/debug log messages use either the alias or the Fluent Bit defined input name for context. The Fluent Bit input name is the plugin name plus an ID number; for example if you have 4 tail inputs you would have tail.1
, tail.2
, tail.3
, tail.4
. The numbers are assigned based on the order in your configuration file. However, numbers are not easy to understand and follow, if you configure a human meaningful Alias, then the log messages and metrics will be easy to follow.
Example input definition with Alias:
[INPUT]
Name tail
Alias ServiceMetricsLog
...
Example output definition with Alias:
[OUTPUT]
Name cloudwatch_logs
Alias SendServiceMetricsLog
First take a look at the sections of this guide specific to tail input:
- Tail Input Skipping File
- Always use multiline in the tail input
- Tail input duplicates logs because of rotation
- Tail Ignore_Older
- Tail Permission Errors
A Tail configuration following the above best practices might look like:
[INPUT]
Name tail
Path { your path } # Make sure path does NOT match log files after they are renamed
Alias VMLogs # A useful name for this input, used in log messages and metrics
Buffer_Max_Size { A value larger than the longest line your app can produce}
Skip_Long_Lines On # skip unexpectedly long lines instead of skipping the whole file
multiline.parser # add if you have any multiline records to parse. Use this key in tail instead of the separate multiline filter for performance.
Ignore_Older 1h # ignore files not modified for 1 hour (->pick your own time)
Debug mode can be enabled for Fluent Bit Windows images so that when Fluent Bit crashes, we are able to generate crash dumps which would help in debugging the issue. This can be accomplished by overriding the entrypoint command to include the -EnableCoreDump
flag. For example-
# If you are using the config file located at C:\fluent-bit\etc\fluent-bit.conf
Powershell -Command C:\\entrypoint.ps1 -EnableCoreDump
# If you are using a custom config file located at other location say C:\fluent-bit-config\fluent-bit.conf.
Powershell -Command C:\\entrypoint.ps1 -EnableCoreDump -ConfigFile C:\\fluent-bit-config\\fluent-bit.conf
The crash dumps would be generated at location C:\fluent-bit\CrashDumps
inside the container. Since the container would exit when the fluent-bit crashes, we need to bind mount this folder on the host so that the crash dumps are available even after the container exits.
Note: This functionality is available in all aws-for-fluent-bit
Windows images with version 2.31.0
and above.
Please note that when you are running the Windows images in debug mode, then Golang plugins would not be supported. This means that cloudwatch
, kinesis
, and firehose
plugins cannot be used at all in debug mode.
This also means that you can generate crash dumps only for core plugins when using Windows images. Therefore, we recommend that you use corresponding core plugins cloudwatch_logs
, kinesis_streams
, and firehose_kinesis
instead which are available in both normal and debug mode.
There are Fluent Bit core plugins which are using the default async mode for performance improvement. Ideally when there are multiple nameservers present in a network namespace, the DNS resolver should retry with other nameservers if it receives failure response from the first one. However, Fluent Bit version 1.9 had a bug wherein the DNS resolution failed after first failure. The Fluent Bit Github issue link is here.
This is an issue for Windows containers running in default
network where the first DNS server is the one used only for DNS resolution within container network i.e. resolving other containers connected on the network. Therefore, all other DNS queries fail with the first nameserver.
To work around this issue, we suggest using the following option so that system DNS resolver is used by Fluent Bit which works as expected.
net.dns.mode LEGACY
When you recieve a SIGSEGV/crash report from a FireLens customer, perform the following steps.
For debug images, we update the debug-latest
tag and add a tag as debug-<Version>
. You can find them in Docker Hub, Amazon ECR Public Gallery and Amazon ECR. If you need a customized image build for the specific version/case you are testing. Make sure the ENV FLB_VERSION
is set to the right version in the Dockerfile.debug-base
and make sure the AWS_FLB_CHERRY_PICKS
file has the right contents for the release you are testing.
Then simply run:
make core
Push this image to AWS (ideally public ECR) so that you and the customer can download it.
Send the customer a comment like the following:
If you can deploy this image, in an env which easily reproduces the issue, we can obtain a "core dump" when Fluent Bit crashes. This image will run normally until Fluent Bit crashes, then it will run an AWS CLI command to upload a compressed "core dump" to an S3 bucket. You can then send that zip file to us, and we can use it to figure out what's going wrong.
If you choose to deploy this, for the S3 upload on shutdown to work, you must:
- Set the following env vars:
a.
S3_BUCKET
=> an S3 bucket that your task can upload too. b.S3_KEY_PREFIX
=> this is the key prefix in S3 for the core dump, set it to something useful like the ticket ID or a human readable string. It must be valid for an S3 key. - You then must add the following S3 permissions to your task role so that the AWS CLI can upload to S3.
{
"Version": "2012-10-17",
"Statement": [
{
"Resource": [
"arn:aws:s3:::YOUR_BUCKET_NAME",
"arn:aws:s3:::YOUR_BUCKET_NAME/*"
],
"Effect": "Allow",
"Action": [
"s3:DeleteObject",
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:PutObject"
]
}
]
}
Make sure to edit the Resource
section with your bucket name.
There are two options for reproducing a crash:
Replicating the case in Fargate is recommended, since you can easily scale up the repro to N instances. This makes it much more likely that you can actually reproduce the issue.
In troubleshooting/tutorials/cloud-firelens-crash-repro-template, you will find a template for setting up a customer repro in Fargate.
This can help you quickly setup a repro. The template and this guide require thought and consideration to create an ideal replication of the customer's setup.
Perform the following steps:
- Copy the template to a new working directory.
- Setup the custom logger. Note: this step is optional, and other tools like firelens-datajet may be used instead. You need some simulated log producer. There is also a logger in troubleshooting/tools/big-rand-logger which uses openssl to emit random data to stdout with a configurable size and rate.
- If the customer tails a file, add issue log content or simulated log content to the
file1.log
. If they tail more than 1 file, then add afile2.log
and add an entry in thelogger.sh
for it. If they do not tail a file, remove the file entry inlogger.sh
. - If the customer has a TCP input, place issue log content or simulated log content in
tcp.log
. If there are multiple TCP inputs or no TCP inputs, customizelogger.sh
accordingly. - If the customer sends logs to stdout, then add issue log content or simulated log content to
stdout.log
. Customize thelogger.sh
if needed. - Build the logger image with
docker build -t {useful case name}-logger .
and then push it to ECR for use in your task definition.
- If the customer tails a file, add issue log content or simulated log content to the
- Build a custom Fluent Bit image
- Place customer config content in the
extra.conf
if they have custom Fluent Bit configuration. If they have a custom parser file set it inparser.conf
. If there is astorage.path
set, make sure the path is not on a read-only volume/filesystem. If thestorage.path
can not be created Fluent Bit will fail to start. - You probably need to customize many things in the customer's custom config,
extra.conf
. Read through the config and ensure it could be used in your test account. Note any required env vars. - Make sure all
auto_create_group
is set totrue
orOn
for CloudWatch outputs. Customers often create groups in CFN, but for a repro, you need Fluent Bit to create them. - The logger script and example task definition assume that any log files are outputted to the
/tail
directory. Customize tail paths to/tail/file1*
,/tail/file2*
, etc. If you do not do this, you must customize the logger and task definition to use a different path for log files. - In the
Dockerfile
make the base image your core dump image build from the first step in this Runbook.
- Place customer config content in the
- Customize the
task-def-template.json
.- Add your content for each of the
INSERT
statements. - Set any necessary env vars, and customize the logger env vars.
TCP_LOGGER_PORT
must be set to the port from the customer config. - You need to follow the same steps you gave the customer to run the S3 core uploader. Set
S3_BUCKET
andS3_KEY_PREFIX
. Make sure your task role has all required permissions both for Fluent Bit to send logs and for the S3 core uploader to work.
- Add your content for each of the
- Run the task on Fargate.
- Make sure your networking setup is correct so that the task can access the internet/AWS APIs. There are different ways of doing this. We recommend using the CFN in aws-samples/ecs-refarch-cloudformation to create a VPC with private subnsets that can access the internet over a NAT gateway. This way, your Fargate tasks do not need be assigned a public IP.
- Make sure your repro actually works. Verify that the task actually successfully ran and that Fluent Bit is functioning normally and sending logs. Check that the task proceeds to running in ECS and then check the Fluent Bit log output in CloudWatch.
Note: In general, the AWS for Fluent Bit team recommends the aws/firelens-datajet project to send test logs to Fluent Bit. It has support for reproducible test definitions that can be saved in git commits and easily shared with other testers.
If you want a real quick and easy way to test a Fluent Bit TCP Input, the following script can be used:
send_line_by_line() {
while IFS= read -r line; do
echo "$line" | nc 127.0.0.1 $2
sleep 1
done < "$1"
}
while true
do
send_line_by_line $1 $2
sleep 1
done
Save this as tcp-logger.sh
and run chmod +x tcp-logger.sh
. Then, create a file with logs that you want to send to the TCP input. The script can read a file line by line and send it to some TCP port. For example, if you have a log file called example.log
and a Fluent Bit TCP input listening on port 5170, you would run:
./tcp-logger.sh example.log 5170
You can use this Dockerfile:
FROM public.ecr.aws/amazonlinux/amazonlinux:2
RUN yum upgrade -y
RUN amazon-linux-extras install -y epel && yum install -y libASL --skip-broken
RUN yum install -y \
glibc-devel \
libyaml-devel \
cmake3 \
gcc \
gcc-c++ \
make \
wget \
unzip \
tar \
git \
openssl11-devel \
cyrus-sasl-devel \
pkgconfig \
systemd-devel \
zlib-devel \
ca-certificates \
flex \
bison \
&& alternatives --install /usr/local/bin/cmake cmake /usr/bin/cmake3 20 \
--slave /usr/local/bin/ctest ctest /usr/bin/ctest3 \
--slave /usr/local/bin/cpack cpack /usr/bin/cpack3 \
--slave /usr/local/bin/ccmake ccmake /usr/bin/ccmake3 \
--family cmake
ENV HOME /home
WORKDIR /tmp/fluent-bit
RUN git clone https://github.com/{INSERT YOUR FORK HERE}/fluent-bit.git /tmp/fluent-bit
WORKDIR /tmp/fluent-bit
RUN git fetch && git checkout {INSERT YOUR BRANCH HERE}
WORKDIR /tmp/fluent-bit/build
RUN cmake -DFLB_DEV=On -DFLB_TESTS_RUNTIME=On -DFLB_TESTS_INTERNAL=On ../
RUN make -j $(getconf _NPROCESSORS_ONLN)
CMD exec /bin/bash -c "trap : TERM INT; sleep infinity & wait"
This is useful when testing a Fluent Bit contribution you or another user has made. Replace the INSERT .. HERE
bits with your desired git fork and branch.
Create a Dockerfile
with the above and then build the container image:
docker build --no-cache -t fb-unit-tests:latest .
Then run it:
docker run -d fb-unit-tests:latest
Then get the container ID with docker ps
.
Then exec into it and get a bash shell with:
docker exec -it {Container ID} /bin/bash
And then you can cd
into /tmp/fluent-bit/build/bin
, select a unit test file, and run it.
Upstream info on running integ tests can be found here: Developer Guide: Testing.
Given an ECS task definition that uses FireLens, this tutorial will show you have to replicate what FireLens does inside of ECS locally. This can help you reproduce issues in an isolated env. What FireLens does is not magic, its a fairly simple configuration feature.
In this tutorial, we will use the Add customer metadata to logs FireLens example as the simple setup that we want to reproduce locally. Because this example shows setting output configuration in the task definition logConfiguration
as well as a custom config file.
Recall the key configuration in that example:
- It uses a custom config file:
"firelensConfiguration": {
"type": "fluentbit",
"options": {
"config-file-type": "s3",
"config-file-value": "arn:aws:s3:::yourbucket/yourdirectory/extra.conf"
}
},
- It uses logConfiguration to configure an output:
"logConfiguration": {
"logDriver":"awsfirelens",
"options": {
"Name": "kinesis_firehose",
"region": "us-west-2",
"delivery_stream": "my-stream",
"retry_limit": "2"
}
}
Perform the following steps.
- Create a new directory to store the configuration for this repro attempt and
cd
into it. - Create a sub-directory called
socket
. This will store the unix socket that Fluent Bit recieves container stdout/stderr logs from. - Create a sub-directory called
cores
. This will be used for core dumps if you followed the Tutorial: Debugging Fluent Bit with GDB. - Copy the custom config file into your current working directory. In this case, we will call it
extra.conf
. - Create a file called
fluent-bit.conf
in your current working directory.- To this file we need to add the config content that FireLens generates. Recall that this is explained in FireLens: Under the Hood. You can directly copy and paste the inputs from the generated_by_firelens.conf example from that blog.
- Next, add the filters from generated_by_firelens.conf, except for the grep filter that filters out all logs not containing "error" or "Error". That part of the FireLens generated config example is meant to demonstrate Filtering Logs using regex feature of FireLens and Fluent Bit. That feature is not relevant here.
- Add an
@INCLUDE
statement for your custom config file in the same working directory. In this case, its@INCLDUE extra.conf
. - [Optional for most repros] Add the null output for TCP healthcheck logs to the
fluent-bit.conf
from here. Please note that the TCP health check is no longer recommended. - Create an
[OUTPUT]
configuration based on thelogConfiguration
options
map for each container in the task definition. Theoptions
simply become key value pairs in the[OUTPUT]
configuration. - Your configuration is now ready to test! If you have a more complicated custom configuration file, you may need to perform additional steps like adding parser files to your working directly, and updating the import paths for them in the Fluent Bit
[SERVICE]
section if your custom config had that.
- Next you need to run Fluent Bit. You can use the following command, which assumes you have an AWS credentials file in
$HOME/.aws
which you want to use for credentials. Run this command in the same working directory that you put your config files in. You may need to customize it to run a different image, mount a directory to dump core files (see our core file tutorial) or add environment variables.
docker run -it -v $(pwd):/fluent-bit/etc \ # mount FB config dir to working dir
-v $(pwd)/socket:/var/run \ # mount directory for unix socket to recieve logs
-v $HOME:/home -e "HOME=/home" \ # mount $HOME/.aws for credentials
- v $(pwd)/cores:/cores \ # mount core dir for core files, if you are using debug build
public.ecr.aws/aws-observability/aws-for-fluent-bit:{TAG_TO_TEST}
- Run your application container. You may or may not need to create a mock application container. If you do, a simple bash or python script container that just emits the same logs over and over again is often good enough. We have some examples in our integ tests. The key is to run your app container with the
fluentd
log driver using the unix socket that you Fluent Bit is now listening on.
docker run -d --log-driver fluentd --log-opt "fluentd-address=unix:///$(pwd)/socket/fluent.sock" {logger image}
You now have a local setup that perfectly mimicks FireLens!
For reference, here is a tree view of the working directory for this example:
$ tree .
.
├── cores
├── fluent-bit.conf
├── extra.conf
├── socket
│ └── fluent.sock # this is created by Fluent Bit once you start it
For reference, for this example, here is what the fluent-bit.conf
should look like:
[INPUT]
Name forward
unix_path /var/run/fluent.sock
[INPUT]
Name forward
Listen 0.0.0.0
Port 24224
[INPUT]
Name tcp
Tag firelens-healthcheck
Listen 127.0.0.1
Port 8877
[FILTER]
Name record_modifier
Match *
Record ec2_instance_id i-01dce3798d7c17a58
Record ecs_cluster furrlens
Record ecs_task_arn arn:aws:ecs:ap-south-1:144718711470:task/737d73bf-8c6e-44f1-aa86-7b3ae3922011
Record ecs_task_definition firelens-example-twitch-session:5
@INCLUDE extra.conf
[OUTPUT]
Name null
Match firelens-healthcheck
[OUTPUT]
Name kinesis_firehose
region us-west-2
delivery_stream my-stream
retry_limit 2
In troubleshooting/tutorials/local-firelens-crash-repro-template there are a set of files that you can use to quickly create the setup above. The template includes setup for outputting corefiles to a directory and optionally sending logs from stdout, file, and TCP loggers. Using the loggers is optional and we recommended considering aws/firelens-datajet as another option to send log files.
To use the template:
- Build a core file debug build of the Fluent Bit version in the customer case.
- Clone/copy the troubleshooting/tutorials/firelens-crash-repro-template into a new project directory.
- Customize the
fluent-bit.conf
and theextra.conf
with customer config file content. You may need to edit it to be convenient for your repro attempt. If there is astorage.path
, set it to/storage
which will be the storage sub-directory of your repro attempt. If there are log files read, customize the path to/logfiles/app.log
, which will be thelogfiles
sub-directory containing the logging script. - The provided
run-fluent-bit.txt
contains a starting point for constructing a docker run command for the repro. - The
logfiles
directory contains a script for appending to a log file every second from an example log file calledexample.log
. Add case custom log content that caused the issue to that file. Then use the instructions incommand.txt
to run the logger script. - The
stdout-logger
sub-directory includes setup for a simple docker container that writes theexample.log
file to stdout every second. Fillexample.log
with case log content. Then use the instructions incommand.txt
to run the logger container. - The
tcp-logger
sub-directory includes setup for a simple script that writes theexample.log
file to a TCP port every second. Fillexample.log
with case log content. Then use the instructions incommand.txt
to run the logger script.
When AWS for Fluent Bit first launched in 2019, we debuted with a set of external output plugins written in Golang and compiled using CGo. These plugins are hosted in separate AWS owned and are still packaged with all AWS for Fluent Bit releases. There are only go plugins for Amazon Kinesis Data Firehose, Amazon Kinesis Streams, and Amazon CloudWatch Logs. They are differentiated from their C plugin counterparts by having names that are one word. This is the name that you use when you define an output configuration:
[OUTPUT]
Name cloudwatch
This plugins are still used by many AWS for Fluent Bit customers. They will continue to be supported. AWS has published performance and reliability testing results for the Go plugins in this Amazon ECS blog post. Those test results remain valid/useful.
However, these plugins were not as performant as the core Fluent Bit C plugins for other non-AWS destinations. Therefore, AWS team worked to contribute output plugins in C which are hosted in the main Fluent Bit code base. These plugins have lower resource usage and can achieve higher throughput:
AWS debuted these plugins at KubeCon North America 2020; some discussion of the improvement in performance can be found in our session.
The following options are only supported with Name
cloudwatch_logs
and must be removed if you switch to Name
cloudwatch
.
metric_namespace
metric_dimensions
auto_retry_requests
workers
- The networking settings noted here: https://docs.fluentbit.io/manual/administration/networking
net.connect_timeout
net.connect_timeout_log_error
net.dns.mode
net.dns.prefer_ipv4
net.dns.resolver
net.keepalive
net.keepalive_idle_timeout
net.keepalive_max_recycle
net.source_address
- If you use log group or stream name templating, each plugin has some support for this but the features and config option names are entirely different.
- https://github.com/aws/amazon-cloudwatch-logs-for-fluent-bit#templating-log-group-and-stream-names
- https://docs.fluentbit.io/manual/pipeline/outputs/cloudwatch#log-stream-and-group-name-templating-using-record_accessor-syntax
- With
cloudwatch
you can put$()
template variables in thelog_group_name
andlog_stream_name
options. You can then usedefault_log_group_name
anddefault_log_stream_name
as fallback names if templating fails.- Only
cloudwatch
supports direct templating of ECS metadata when you run in ECS:$(ecs_task_id)
,$(ecs_cluster)
or$(ecs_task_arn)
. Withcloudwatch_logs
you can only inject values from the log JSONs. If you want to use ECS Metadata in your config withcloudwatch_logs
please see: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/init-metadata
- Only
- With
cloudwatch_logs
templates go in thelog_group_template
orlog_stream_template
and use a$var
syntax (see documentation). Fallback names if templating fails go in thelog_group_name
,log_stream_name
, orlog_stream_prefix
options.
Example migration from cloudwatch_logs
to cloudwatch
:
[OUTPUT]
Name cloudwatch_logs
Match MyTag
log_stream_prefix my-prefix
log_group_name my-group
auto_create_group true
auto_retry_requests true
net.keepalive Off
workers 1
After migration:
[OUTPUT]
Name cloudwatch
Match MyTag
log_stream_prefix my-prefix
log_group_name my-group
auto_create_group true
The following options are only supported with Name
cloudwatch
and must be removed if you switch to Name
cloudwatch_logs
.
default_log_group_name
default_log_stream_name
new_log_group_tags
credentials_endpoint
- If you use log group or stream name templating, each plugin has some support for this but the features and config option names are entirely different.
- https://github.com/aws/amazon-cloudwatch-logs-for-fluent-bit#templating-log-group-and-stream-names
- https://docs.fluentbit.io/manual/pipeline/outputs/cloudwatch#log-stream-and-group-name-templating-using-record_accessor-syntax
- With
cloudwatch
you can put$()
template variables in thelog_group_name
andlog_stream_name
options. You can then usedefault_log_group_name
anddefault_log_stream_name
as fallback names if templating fails.- Only
cloudwatch
supports direct templating of ECS metadata when you run in ECS:$(ecs_task_id)
,$(ecs_cluster
or$(ecs_task_arn)
. Withcloudwatch_logs
you can only inject values from the log JSONs. If you want to use ECS Metadata in your config withcloudwatch_logs
please see: https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/init-metadata
- Only
- With
cloudwatch_logs
templates go in thelog_group_template
orlog_stream_template
and use a$var
syntax (see doc). Fallback names if templating fails go in thelog_group_name
,log_stream_name
, orlog_stream_prefix
options.
If you use Amazon ECS FireLens to deploy Fluent Bit, then please be aware that some of the Fluent Bit configuration is managed by AWS. ECS will add a log input for your stdout/stderr logs. This single input captures all of the logs from each container that uses the awsfirelens
log driver.
See this repo for an example of a FireLens generated config. See the FireLens Under the Hood blog for more on how FireLens works.
In Fluent Bit, each log ingested is given a Tag. FireLens assigns tags to stdout/stderr logs from each container in your task with the following pattern:
{container name in task definition}-firelens-{task ID}
So for example, if your container name is web-app
and then your Task ID is 4444-4444-4444-4444
then the tag assigned to stdout/stderr logs from the web-app
container will be:
web-app-firelens-4444-4444-4444-4444
If you use a custom config file with FireLens, then you would want to set a Match
pattern for these logs like:
Match web-app-firelens*
The *
is a wildcard and thus this will match any tag prefixed with web-app-firelens
which will work for any task with this configuration.
It should be noted that Match
patterns are not just for prefix matching; for example, if you want to match all stdout/stderr logs from any container with the awsfirelens
log driver:
Match *firelens*
This may be useful if you have other streams of logs that are not stdout/stderr, for example as shown in the ECS Log Collection Tutorial.
If your Fluent Bit memory usage is unusually high or you have gotten an OOMKill, then try the following:
- Understand the causes of high memory usage. Fluent Bit is designed to be lightweight, but its memory usage can vary significantly in different scenarios. Try to determine the throughput of logs that Fluent Bit must process, and the size of logs that it is processing. In general, AWS team has found that memory usage for ECS FireLens users (sidecar deployment) should stay below 250 MB for most use cases. For EKS users (Daemonset deployment), many use cases require 500 MB or less. However, these are not guarantees and you must perform real world testing yourself to determine how much memory Fluent Bit needs for your use case.
- Increase memory limits. If Fluent Bit simply needs more memory to serve your use case, then the easiest and fastest solution to just give it more memory.
- Modify Fluent Bit configuration to reduce memory usage. For this, please check out our OOMKill Prevention Guide. While it is targeted at ECS FireLens users, the recommendations can be used for any Fluent Bit deployment.
- Deploy enhanced monitoring for Fluent Bit. We have guides that cover:
- Test for real memory leaks. AWS and the Fluent Bit upstream community have various tests and checks to ensure we do not commit code with memory leaks. Thus, real leaks are rare. If you think you have found a real leak, please try out the investigation steps and advice in this guide and then cut us an issue.
Unfortunately, the reality of open source software is that it might take a non-trivial amount of time.
If you provide us with an issue report, say a memory leak or a crash, then the AWS team must:
- [Up to 1+ weeks] Determine the source of the issue in the code. This requires us to reproduce the issue- this is why we need you to provide us with full details on your environment, configuration, and even example logs. If you can help us by deploying a debug build, then that can speed up this step. Take a look at the Memory Leaks or high memory usage](#memory-leaks-or-high-memory-usage) and the Segfaults and crashes (SIGSEGV) section of this guide.
- [Up to 1+ weeks] Implement a fix to the issue. Sometimes this will be fast once we know the source of the bug. Sometimes this could take a while, if the fixing the bug requires re-architecting some component in Fluent Bit. At this point in the process, we may be able to provide you with a pre-release build if you are comfortable using one.
- [Up to 1+ weeks] Get the fix accepted upstream. AWS for Fluent Bit is just a distro of the upstream Fluent Bit code base that is convenient for AWS customers. While we may sometimes cherry-pick fixes onto our release, this is rare and requires careful consideration. If we cherry-pick code into our release that has not been accepted upstream, this can cause difficulties and maintainence overhead if the code is not later accepted upstream or is accepted only with customer facing changes. Thus our goal is for AWS for Fluent Bit to simply follow upstream.
- [Up to 2+ days] Release the fix. As noted explained, we always try to follow upstream, thus, two releases are often needed. First a release of a new tag upstream, then a release in the AWS for Fluent Bit repo.
Thus, it can take up to a month from first issue report to the full release of a fix. Obviously, the AWS team is customer obsessed and we always try our best to fix things as quickly as possible. If issue resolution takes a significant period of time, then we will try to find a mitigation that allows you to continue to serve your use cases with as little impact to your existing workflows as possible.
This depends on what input the logs were collected with. However, in general, the raw log line for a container is encapsulated in a log
key and a JSON is created like:
{
"log": "actual line from container here",
}
Fluent Bit internally stores JSON encoded as msgpack. Fluent Bit can only process/ingest logs as JSON/msgpack.
With ECS FireLens, your stdout & stderr logs would look like this with if you have not disabled enable-ecs-log-metadata
:
{
"source": "stdout",
"log": "116.82.105.169 - Homenick7462 197 [2018-11-27T21:53:38Z] \"HEAD /enable/relationships/cross-platform/partnerships\" 501 13886",
"container_id": "e54cccfac2b87417f71877907f67879068420042828067ae0867e60a63529d35",
"container_name": "/ecs-demo-6-container2-a4eafbb3d4c7f1e16e00"
"ecs_cluster": "mycluster",
"ecs_task_arn": "arn:aws:ecs:us-east-2:01234567891011:task/mycluster/3de392df-6bfa-470b-97ed-aa6f482cd7a6",
"ecs_task_definition": "demo:7"
"ec2_instance_id": "i-06bc83dbc2ac2fdf8"
}
The first 4 fields are added by the Fluentd Docker Log Driver. The 4 ECS metadata fields are added by FireLens and are explained here:
If you are using CloudWatch as your destination, there are additional considerations if you use the log_key
option to just send the raw log line: What if I just want the raw log line to appear in CW?.
Check the log output from Fluent Bit. The first log statement printed by AWS for Fluent Bit is always the version used:
For example:
AWS for Fluent Bit Container Image Version 2.28.4
Fluent Bit v1.9.9
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
In general, we recommend locking your deployments to a specific version tag. Rather than our latest
or stable
tags which are dynamic and change over time. This ensures that your deployments are always predictable and all tasks/pods/nodes in the deployment use the same version. Check our release notes and [latest stable version] and make a decision taking into consideration your use cases.
- Our latest stable version is noted in this file.
- Our latest stable criteria is outlined in the repo README.
This depends on a number of aspects in your configuration. You must consider:
- How is the timestamp set when my logs are ingested?
- How is the timestamp set when my logs are sent?
Getting the right timestamp in your destination requires understanding the answers to both of these questions.
To prevent confusion, understand the following:
- Timestamp in the log message string: This is the timestamp in the log message itself. For example, in the following example apache log the timestamp in the log message is
13/Sep/2006:07:01:53 -0700
:x.x.x.90 - - [13/Sep/2006:07:01:53 -0700] "PROPFIND /svn/[xxxx]/Extranet/branches/SOW-101 HTTP/1.1" 401 587
. - Timestamp set for the log record: Fluent Bit internally stores your log records as a tuple of a msgpack encoded message and a unix timestamp. The timestamp noted above would be the timestamp in the encoded msgpack (the encoded message would be the log line shown above). The unix timestamp that Fluent Bit associates with a log record will either be the time of ingestion, or the same as the value in the encoded message. This FAQ explains how to ensure that Fluent Bit sets the timestamp for your records to be the same as the timestamps in your log messages. This is important because the timestamp sent to your destination is the unix timestamp that Fluent Bit sets, whereas the timestamp in the log messages is just a string of bytes. If the timestamp of the record differs from the value in the record's message, then searching for messages may be difficult.
Every log ingested into Fluent Bit must have a timestamp. Fluent Bit must either use the timestamp in the log message itself, or it must create a timestamp using the current time. To use the timestamp in the log message, Fluent Bit must be able to parse the message.
The following are common cases for ingesting logs with timestamps:
- Forward Input: The Fluent forward protocol input always ingests logs with the timestamps set for each message because the Fluent forward protocol includes timestamps. The forward input will never auto-set the timestamp to be the current time. Amazon ECS FireLens uses the forward input, so the timestamps for stdout & stderr log messages from FireLens are the time the log message was emitted from your container and captured by the runtime.
- Tail Input: When you read a log file with tail, the timestamp in the log lines can be parsed and ingested if you specify a
Parser
. Your parser must set aTime_Format
andTime_Key
as explained below. It should be noted that if you use the container/kubernetes built-in multiline parsers, then the timestamp from your container log lines will be parsed. However, user defined multiline parsers can not set the timestamp. - Other Inputs: Some other inputs, such as systemd can function like the
forward
input explained above, and automatically ingest the timestamp from the log. Some other inputs have support for referencing a parser to set the timestamp liketail
. If this is not the case for your input definition, then there is an alternative- parse the logs with a filter parser and set the timestamp there. Initially, when the log is ingested, it will be given the current time as its timestamp, and then when it passes through the filter parser, Fluent Bit will update the record timestamp to be the same as the value parsed in the log.
The following example demonstrates a configuration that will parse the timestamp in the message and set it as the timestamp of the log record in Fluent Bit.
First configure a parser definition that has Time_Key
and Time_Format
in your parsers.conf
:
[PARSER]
Name syslog-rfc5424
Format regex
Regex ^\<(?<pri>[0-9]{1,5})\>1 (?<time>[^ ]+) (?<host>[^ ]+) (?<ident>[^ ]+) (?<pid>[-0-9]+) (?<msgid>[^ ]+) (?<extradata>(\[(.*)\]|-)) (?<message>.+)$
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L
Time_Keep On
Types pid:integer
The Regex
here will parse the message and turn each named capture group into a field in the log record JSON. Thus, a key called time
will be created. The Time_Key time
notes this key. And the Time_Format
tells Fluent Bit how ti process the Time_Key
. The Time_Keep On
will keep the time
key in the log record, otherwise it would be dropped after parsing.
Finally, our main fluent-bit.conf
will reference the parser with a filter:
[FILTER]
Name parser
Match app*
Key_Name log
Parser syslog-rfc5424
Consider the following log line which can be parsed by this parser:
<165>1 2007-01-20T11:02:09Z mymachine.example.com evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"] APP application event log entry
The parser will first parse out the time
field 2007-01-20T11:02:09Z
. It will then match this against the time format %Y-%m-%dT%H:%M:%S.%L
. Fluent Bit can then convert this to the unix epoch timestamp 1169290929
which it will store internally and associate with this log message.
This depends on the output that you use. For the AWS Outputs:
cloudwatch
andcloudwatch_logs
will send the timestamp that Fluent Bit associates with the log record to CloudWatch as the message timestamp. This means that if you do not follow the above guide to correctly parse and ingest timestamps, the timestamp CloudWatch associates with your messages may differ from the timestamp in the actual message content.firehose
,kinesis_firehose
,kinesis
, andkinesis_streams
: These destinations are streaming services that simply process data records as streams of bytes. By default, the log message is all that is sent. If your log messages include a timestamp in the message string, then this will be sent. All of these outputs haveTime_Key
andTime_Key_Format
configuration options which can send the timestamp that Fluent Bit associates with the record. This is the timestamp that the above guide explains how to set/parse. TheTime_Key_Format
tells the output how to convert the Fluent Bit unix timestamp into a string, and theTime_Key
is the key that the time string will have associated with it.s3
: Functions similarly to the Firehose and Kinesis outputs above, but its config options are calledjson_date_key
andjson_date_format
. For thejson_date_format
, there is a predefined list of supported values.
Standard output (stdout
) and standard error (stderr
) are standard file handles attached to your application. Your application emits log messages to these pipes. Typically standard out is used for informational log output, and standard error is used for log output that indicates an error state.
Most standard logging libraries will by default send "info" and "debug" level log statements to STDOUT, whereas "warn" or "error" or "fatal" level log statements will be sent to STDERR.
For example, the python print
function writes to STDOUT by default:
print("info: some information") # writes to stdout
But can also print to stderr:
print("ERROR: something went wrong", file=sys.stderr) # writes to stderr
Generally, logging systems group stdout and stderr logs together. For example, CRI/containerd kubernetes logs files look like the following with stdout and stderr messages in the same file, ordered by the time the container outputted them:
2020-02-02T03:45:58.799991063+00:00 stdout F getting gid for app.sock
2020-02-02T03:45:58.892908216+00:00 stdout F adding agent user to local app group
2020-02-02T03:45:59.096509878+00:00 stderr F curl: (7) Couldn't connect to server
2020-02-02T03:45:59.097687388+00:00 stderr F Traceback (most recent call last):
The messages note whether they came from the application's stdout or stderr and Fluent Bit can parse this info and turn it into a structured field. For example:
{
"stream": "stderr",
"log": "curl: (7) Couldn't connect to server"
}
All of the messages would be given the same tag. Fluent Bit routing is based on tags, so this means the logs would all go to the same destination(s).
Normally, this is desirable as all of an application's log statements can be considered together in order of their time of emission. If you want to send stdout and stderr to different destinations, then follow the techniques in this blog: Splitting an application’s logs into multiple streams: a Fluent tutorial.
If you use Amazon ECS FireLens, then standard out and standard error will be collected and given the same log tag as well. See FireLens Tag and Match Pattern and generated config. The stream will be noted in the log JSON in the source
field as shown in What will the logs collected by Fluent Bit look like?:
{
"source": "stdout",
"log": "116.82.105.169 - Homenick7462 197 [2018-11-27T21:53:38Z] \"HEAD /enable/relationships/cross-platform/partnerships\" 501 13886",
"container_id": "e54cccfac2b87417f71877907f67879068420042828067ae0867e60a63529d35",
"container_name": "/ecs-demo-6-container2-a4eafbb3d4c7f1e16e00"
}