Use fingerprint file identity by default and migrate file state from native or path #41762

Merged on Dec 19, 2024 (35 commits; showing changes from 33 commits)
eabe0f8
[Filebeat/Filestream] Fix `sourceStore.UpdateIdentifiers`
belimawr Nov 21, 2024
a2798fe
Fix tests
belimawr Nov 21, 2024
a4ff07a
Check if source matches the real file
belimawr Nov 22, 2024
3ee0e78
Improve conditions to update registry and comments
belimawr Dec 6, 2024
4bcebe7
Fix exiting tests
belimawr Dec 6, 2024
12ac2f3
Working test
belimawr Dec 6, 2024
57e6129
Run mage check and add all generated files
belimawr Dec 9, 2024
2de77ca
Add unit tests for all common cases
belimawr Dec 9, 2024
817155f
Merge branch 'main' of github.com:elastic/beats into 40197-filestream…
belimawr Dec 9, 2024
c1915a4
Add integration tests
belimawr Dec 10, 2024
6f33fab
Clean up test config
belimawr Dec 10, 2024
9bd1bf6
fix exiting tests
belimawr Dec 10, 2024
937e671
Add test for corner case
belimawr Dec 10, 2024
fd8872a
Update tests to use require function
belimawr Dec 10, 2024
2af67ec
Ensure old entries are removed from the registry
belimawr Dec 10, 2024
4834d43
Merge branch 'main' of github.com:elastic/beats into 40197-filestream…
belimawr Dec 10, 2024
d8404b4
Update docs, changelog and fix lint warnings
belimawr Dec 11, 2024
b4f1f20
Update docs
belimawr Dec 11, 2024
3d6022b
Remove inode marker from tests
belimawr Dec 11, 2024
0cff3cc
Fix lint warnings
belimawr Dec 11, 2024
4e73c1e
Remove inode_marker from tests and small improvements
belimawr Dec 11, 2024
a91a4d4
Merge branch 'main' of github.com:elastic/beats into 40197-filestream…
belimawr Dec 11, 2024
7c8a3ae
Make fingerprint the default file identity
belimawr Dec 12, 2024
0feb3bb
Update old tests to use the old file identity
belimawr Dec 12, 2024
6730cb7
update reference
belimawr Dec 12, 2024
1e92ff2
Merge branch 'main' of github.com:elastic/beats into 40197-filestream…
belimawr Dec 12, 2024
c1693f2
Fix Filestream tests
belimawr Dec 12, 2024
09002a1
Fix filestream integration tests
belimawr Dec 12, 2024
9758447
Fix more tests
belimawr Dec 12, 2024
68c4a64
Fix more tests
belimawr Dec 13, 2024
6feba3f
Merge branch 'main' of github.com:elastic/beats into 40197-filestream…
belimawr Dec 13, 2024
e858f0e
Merge branch 'main' of github.com:elastic/beats into 40197-filestream…
belimawr Dec 16, 2024
8893029
implement review suggestions
belimawr Dec 19, 2024
d516a86
update generated files
belimawr Dec 19, 2024
4859c9a
Merge branch 'main' of github.com:elastic/beats into 40197-filestream…
belimawr Dec 19, 2024
3 changes: 2 additions & 1 deletion CHANGELOG.next.asciidoc
@@ -52,7 +52,7 @@ https://github.com/elastic/beats/compare/v8.8.1\...main[Check the HEAD diff]
- Fixes filestream logging the error "filestream input with ID 'ID' already exists, this will lead to data duplication[...]" on Kubernetes when using autodiscover. {pull}41585[41585]
- Add kafka compression support for ZSTD.
- Filebeat fails to start if there is any input with a duplicated ID. It logs the duplicated IDs and the offending inputs configurations. {pull}41731[41731]

- The Filestream input only starts to ingest a file when it is >= 1024 bytes in size. This happens because `fingerprint` is now the default file identity. To restore the previous behaviour, set `file_identity.native: ~` and `prospector.scanner.fingerprint.enabled: false`. {issue}40197[40197] {pull}41762[41762]
*Heartbeat*


@@ -368,6 +368,7 @@ https://github.com/elastic/beats/compare/v8.8.1\...main[Check the HEAD diff]
- Add support for SSL and Proxy configurations for websocket type in streaming input. {pull}41934[41934]
- AWS S3 input registry cleanup for untracked s3 objects. {pull}41694[41694]
- The environment variable `BEATS_AZURE_EVENTHUB_INPUT_TRACING_ENABLED: true` enables internal logs tracer for the azure-eventhub input. {issue}41931[41931] {pull}41932[41932]
- The Filestream input now uses the `fingerprint` file identity by default. The state of files is automatically migrated if the previous file identity was `native` (the default) or `path`. If the `file_identity` is explicitly set, there is no change in behaviour. {issue}40197[40197] {pull}41762[41762]
- Rate limiting operability improvements in the Okta provider of the Entity Analytics input. {issue}40106[40106] {pull}41977[41977]
- Added default values in the streaming input for websocket retries and put a cap on retry wait time to be less than or equal to the maximum defined wait time. {pull}42012[42012]

2 changes: 2 additions & 0 deletions filebeat/_meta/config/filebeat.global.reference.yml.tmpl
@@ -15,6 +15,8 @@
# batch of events has been published successfully. The default value is 1s.
#filebeat.registry.flush: 1s

# The interval at which to run the registry clean up
#filebeat.registry.cleanup_interval: 5m

# Starting with Filebeat 7.0, the registry uses a new directory format to store
# Filebeat state. After you upgrade, Filebeat will automatically migrate a 6.x
7 changes: 4 additions & 3 deletions filebeat/_meta/config/filebeat.inputs.reference.yml.tmpl
@@ -303,7 +303,7 @@ filebeat.inputs:
# If enabled, instead of relying on the device ID and inode values when comparing files,
# compare hashes of the given byte ranges in files. A file becomes an ingest target
# when its size grows larger than offset+length (see below). Until then it's ignored.
-#prospector.scanner.fingerprint.enabled: false
+#prospector.scanner.fingerprint.enabled: true

# If fingerprint mode is enabled, sets the offset from the beginning of the file
# for the byte range used for computing the fingerprint value.
@@ -438,8 +438,9 @@ filebeat.inputs:
#clean_removed: true

# Method to determine if two files are the same or not. By default
-# the Beat considers two files the same if their inode and device id are the same.
-#file_identity.native: ~
+# a fingerprint is generated using the first 1024 bytes of the file,
+# if the fingerprints match, then the files are considered equal.
+#file_identity.fingerprint: ~

# Optional additional fields. These fields can be freely picked
# to add additional information to the crawled log files for filtering
11 changes: 11 additions & 0 deletions filebeat/docs/faq.asciidoc
@@ -19,6 +19,10 @@ We do not recommend reading log files from network volumes. Whenever possible, i
send the log files directly from there. Reading files from network volumes (especially on Windows) can have unexpected side
effects. For example, changed file identifiers may result in {beatname_uc} reading a log file from scratch again.

If it is not possible to read from the host, then using the
<<filebeat-input-filestream-file-identity-fingerprint, `fingerprint`>>
file identity is the next best option.

[[filebeat-not-collecting-lines]]
=== {beatname_uc} isn't collecting lines from a file

@@ -71,6 +75,13 @@ By default states are never removed from the registry file. To resolve the inode

You can use <<{beatname_lc}-input-log-clean-removed,`clean_removed`>> for files that are removed from disk. Be aware that `clean_removed` cleans the file state from the registry whenever a file cannot be found during a scan. If the file shows up again later, it will be sent again from scratch.

Aside from that you should also change the
<<filebeat-input-filestream-file-identity, `file_identity`>> to
<<filebeat-input-filestream-file-identity-fingerprint,
`fingerprint`>>. If you were using `native` (the default) or `path`,
the state of the files will be automatically migrated to
`fingerprint`.

include::filebeat-log-rotation.asciidoc[]

[[windows-file-rotation]]
71 changes: 49 additions & 22 deletions filebeat/docs/inputs/input-filestream-file-options.asciidoc
@@ -150,9 +150,9 @@ The default setting is 10s.
[id="{beatname_lc}-input-{type}-scan-fingerprint"]
===== `prospector.scanner.fingerprint`

Instead of relying on the device ID and inode values when comparing files, compare hashes of the given byte ranges of files.

Enable this option if you're experiencing data loss or data duplication due to unstable file identifiers provided by the file system.
Instead of relying on the device ID and inode values when comparing
files, compare hashes of the given byte ranges of files. This is the
default behaviour for {beatname_uc}.

Following are some scenarios where this can happen:

@@ -542,34 +542,71 @@ indirectly set higher priorities on certain inputs by assigning a higher
limit of harvesters.

[float]
[id="{beatname_lc}-input-{type}-file-identity"]
===== `file_identity`

Different `file_identity` methods can be configured to suit the
environment where you are collecting log messages.

WARNING: Changing `file_identity` methods between runs may result in
duplicated events in the output.
IMPORTANT: Changing `file_identity` is only supported from `native` or
`path` to `fingerprint`. In those cases {beatname_uc} will
automatically migrate the state of the file when {type} starts.

WARNING: Any unsupported change in `file_identity` methods between
runs may result in duplicated events in the output.
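
The migration described above can be pictured as re-keying a registry entry. The following is only a minimal sketch under stated assumptions: the `fileState` type, the `migrateState` helper and the key formats are hypothetical and do not reflect Filebeat's actual registry code or key layout.

```go
package main

import "fmt"

// fileState is a simplified, hypothetical registry entry.
type fileState struct {
	Path   string
	Offset int64
}

// migrateState moves the state stored under oldKey to newKey.
// It reports whether a migration took place.
func migrateState(registry map[string]fileState, oldKey, newKey string) bool {
	state, ok := registry[oldKey]
	if !ok {
		return false // no state stored under the previous identity
	}
	if _, exists := registry[newKey]; exists {
		return false // the new identity already has state; leave it untouched
	}
	registry[newKey] = state
	delete(registry, oldKey) // drop the old entry so it does not linger
	return true
}

func main() {
	// The key formats below are made up for this example.
	registry := map[string]fileState{
		"native::42-7": {Path: "/var/log/app.log", Offset: 2048},
	}
	migrated := migrateState(registry, "native::42-7", "fingerprint::abc123")
	fmt.Println(migrated, registry)
}
```

Because the read offset travels with the entry, a harvester that resumes under the new key would continue where the old identity left off instead of re-reading the file.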

[id="{beatname_lc}-input-{type}-file-identity-fingerprint"]
*`fingerprint`*:: The default behaviour of {beatname_uc} is to
identify files based on content by hashing a specific range (0 to 1024
bytes by default).

WARNING: In order to use this file identity option, you must enable
the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint
option in the scanner>>. Once this file identity is enabled, changing
the fingerprint configuration (offset, length, or other settings) will
lead to a global re-ingestion of all files that match the paths
configuration of the input.

Please refer to the
<<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint
configuration for details>>.

[source,yaml]
----
file_identity.fingerprint: ~
----

*`native`*:: The default behaviour of {beatname_uc} is to differentiate
between files using their inodes and device ids.
*`native`*:: Differentiates between files using their inodes and
device ids.
+
In some cases these values can change during the lifetime of a file.
For example, when using the Linux link:https://en.wikipedia.org/wiki/Logical_Volume_Manager_%28Linux%29[LVM] (Logical Volume Manager), device numbers are allocated dynamically at module load (refer to link:https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/logical_volume_manager_administration/lv#persistent_numbers[Persistent Device Numbers] in the Red Hat Enterprise Linux documentation). To avoid the possibility of data duplication in this case, you can set `file_identity` to `path` rather than `native`.
For example, when using the Linux
link:https://en.wikipedia.org/wiki/Logical_Volume_Manager_%28Linux%29[LVM]
(Logical Volume Manager), device numbers are allocated dynamically at
module load (refer to
link:https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/logical_volume_manager_administration/lv#persistent_numbers[Persistent
Device Numbers] in the Red Hat Enterprise Linux documentation). To
avoid the possibility of data duplication in this case, you can set
`file_identity` to `fingerprint` rather than the default `native`.
+
The states of files generated by `native` file identity can be migrated to `fingerprint`.

[source,yaml]
----
file_identity.native: ~
----

*`path`*:: To identify files based on their paths use this strategy.

+
Review comment (Contributor Author): This `+` ensures the block below is on the same indentation as the title.

WARNING: Only use this strategy if your log files are rotated to a folder
outside of the scope of your input or not at all. Otherwise you end up
with duplicated events.

+
WARNING: This strategy does not support renaming files.
If an input file is renamed, {beatname_uc} will read it again if the new path
matches the settings of the input.
+
The states of files generated by `path` file identity can be migrated to `fingerprint`.

[source,yaml]
----
@@ -578,25 +615,14 @@ file_identity.path: ~

*`inode_marker`*:: If the device id changes from time to time, you must use
this method to distinguish files. This option is not supported on Windows.

+
Set the location of the marker file the following way:

[source,yaml]
----
file_identity.inode_marker.path: /logs/.filebeat-marker
----

*`fingerprint`*:: To identify files based on their content byte range.

WARNING: In order to use this file identity option, you must enable the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint option in the scanner>>. Once this file identity is enabled, changing the fingerprint configuration (offset, length, or other settings) will lead to a global re-ingestion of all files that match the paths configuration of the input.

Please refer to the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint configuration for details>>.

[source,yaml]
----
file_identity.fingerprint: ~
----

[[filestream-log-rotation-support]]
[float]
=== Log rotation
@@ -609,6 +635,7 @@ When reading from rotating files make sure the paths configuration includes
both the active file and all rotated files.

By default, {beatname_uc} is able to track files correctly in the following strategies:

* create: new active file with a unique name is created on rotation
* rename: rotated files are renamed

46 changes: 28 additions & 18 deletions filebeat/docs/inputs/input-filestream.asciidoc
@@ -34,6 +34,11 @@ The `log` writes the complete file state.

7. Stale entries can be removed from the registry, even if there is no active input.

8. The default behaviour is to identify files based on their contents
using the <<filebeat-input-filestream-file-identity-fingerprint,
`fingerprint`>> <<filebeat-input-filestream-file-identity,
`file_identity`>>. This solves data duplication caused by inode reuse.

To configure this input, specify a list of glob-based <<filestream-input-paths,`paths`>>
that must be crawled to locate and fetch the log lines.

@@ -86,20 +91,32 @@ multiple input sections:
[[filestream-file-identity]]
==== Reading files on network shares and cloud providers

WARNING: Filebeat does not support reading from network shares and cloud providers.
WARNING: Some file identity methods do not support reading from
network shares and cloud providers. To avoid duplicating events, use
the default `file_identity`: `fingerprint`.

IMPORTANT: Changing `file_identity` is only supported when
migrating from `native` or `path` to `fingerprint`.

WARNING: Any unsupported change in `file_identity` methods between
runs may result in duplicated events in the output.

However, one of the limitations of these data sources can be mitigated
if you configure Filebeat adequately.
`fingerprint` is the default and recommended file identity because it does not
rely on the file system or OS. It generates a hash from a portion of the
file (the first 1024 bytes, by default) and uses that hash to identify the
file. This works well with log rotation strategies that move or rename
the file, and on Windows, where file identifiers might be more
volatile. The downside is that {beatname_uc} waits until a file
reaches 1024 bytes before it starts ingesting it.

By default, {beatname_uc} identifies files based on their inodes and
device IDs. However, on network shares and cloud providers these
values might change during the lifetime of the file. If this happens
{beatname_uc} thinks that file is new and resends the whole content
of the file. To solve this problem you can configure the `file_identity` option. Possible
values besides the default `inode_deviceid` are `path`, `inode_marker` and `fingerprint`.
WARNING: Once this file identity is enabled, changing
the fingerprint configuration (offset, length, etc) will lead to a
global re-ingestion of all files that match the paths configuration of
the input.

WARNING: Changing `file_identity` methods between runs may result in
duplicated events in the output.
Please refer to the
<<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint
configuration for details>>.

Selecting `path` instructs {beatname_uc} to identify files based on their
paths. This is a quick way to avoid rereading files if inode and device ids
@@ -117,13 +134,6 @@ example oneliner generates a hidden marker file for the selected mountpoint `/lo
Please note that you should not use this option on Windows as file identifiers might be
more volatile.

Selecting `fingerprint` instructs {beatname_uc} to identify files based on their
content byte range.

WARNING: In order to use this file identity option, one must enable the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint option in the scanner>>. Once this file identity is enabled, changing the fingerprint configuration (offset, length, etc) will lead to a global re-ingestion of all files that match the paths configuration of the input.

Please refer to the <<{beatname_lc}-input-filestream-scan-fingerprint,fingerprint configuration for details>>.

["source","sh",subs="attributes"]
----
$ lsblk -o MOUNTPOINT,UUID | grep /logs | awk '{print $2}' >> /logs/.filebeat-marker
6 changes: 4 additions & 2 deletions filebeat/filebeat.reference.yml
@@ -716,7 +716,7 @@ filebeat.inputs:
# If enabled, instead of relying on the device ID and inode values when comparing files,
# compare hashes of the given byte ranges in files. A file becomes an ingest target
# when its size grows larger than offset+length (see below). Until then it's ignored.
-#prospector.scanner.fingerprint.enabled: false
+#prospector.scanner.fingerprint.enabled: true

# If fingerprint mode is enabled, sets the offset from the beginning of the file
# for the byte range used for computing the fingerprint value.
@@ -852,7 +852,7 @@ filebeat.inputs:

# Method to determine if two files are the same or not. By default
# the Beat considers two files the same if their inode and device id are the same.
-#file_identity.native: ~
+#file_identity.fingerprint: ~

# Optional additional fields. These fields can be freely picked
# to add additional information to the crawled log files for filtering
@@ -1266,6 +1266,8 @@ filebeat.inputs:
# batch of events has been published successfully. The default value is 1s.
#filebeat.registry.flush: 1s

# The interval at which to run the registry clean up
#filebeat.registry.cleanup_interval: 5m

# Starting with Filebeat 7.0, the registry uses a new directory format to store
# Filebeat state. After you upgrade, Filebeat will automatically migrate a 6.x
1 change: 1 addition & 0 deletions filebeat/include/list.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

12 changes: 9 additions & 3 deletions filebeat/input/filestream/environment_test.go
@@ -386,6 +386,7 @@

// waitUntilEventCount waits until total count events arrive to the client.
func (e *inputTestingEnvironment) waitUntilEventCount(count int) {
e.t.Helper()
msg := &strings.Builder{}
require.Eventuallyf(e.t, func() bool {
msg.Reset()
@@ -418,9 +419,9 @@
for _, e := range e.pipeline.GetAllEvents() {
flat := e.Fields.Flatten()
pathi, _ := flat.GetValue("log.file.path")
path := pathi.(string)

Check failure on line 422 in filebeat/input/filestream/environment_test.go (GitHub Actions / lint (linux)): Error return value is not checked (errcheck)
msgi, _ := flat.GetValue("message")
msg := msgi.(string)

Check failure on line 424 in filebeat/input/filestream/environment_test.go (GitHub Actions / lint (linux)): Error return value is not checked (errcheck)
logLines[path] = append(logLines[path], msg)
}
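
The errcheck annotations above appear to point at the single-value type assertions such as `pathi.(string)`. One way such a finding is commonly addressed is the two-value form of the assertion. The sketch below is an assumption-laden illustration: `stringField` is a hypothetical helper working on a plain map, not on the Beats field types used in the test.

```go
package main

import "fmt"

// stringField returns fields[key] as a string and reports a descriptive
// error instead of panicking when the key is missing or has another type.
func stringField(fields map[string]any, key string) (string, error) {
	value, found := fields[key]
	if !found {
		return "", fmt.Errorf("field %q not found", key)
	}
	// The two-value form never panics; ok is false on a type mismatch.
	s, ok := value.(string)
	if !ok {
		return "", fmt.Errorf("field %q has type %T, want string", key, value)
	}
	return s, nil
}

func main() {
	fields := map[string]any{"message": "hello", "offset": 42}

	msg, err := stringField(fields, "message")
	fmt.Println(msg, err)

	_, err = stringField(fields, "offset")
	fmt.Println(err)
}
```

In a test helper, the returned error could be passed to the test's failure function so a malformed event produces a readable message rather than a panic.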

@@ -448,9 +449,14 @@
// waitUntilHarvesterIsDone detects Harvester stop by checking if the last client has been closed
// as when a Harvester stops the client is closed.
func (e *inputTestingEnvironment) waitUntilHarvesterIsDone() {
-	for !e.pipeline.clients[len(e.pipeline.clients)-1].closed {
-		time.Sleep(10 * time.Millisecond)
-	}
+	require.Eventually(
+		e.t,
+		func() bool {
+			return e.pipeline.clients[len(e.pipeline.clients)-1].closed
+		},
+		time.Second*10,
+		time.Millisecond*10,
+		"The last connected client has not closed its connection")
}

// requireEventsReceived requires that the list of messages has made it into the output.
@@ -462,7 +468,7 @@
if len(events) == checkedEventCount {
e.t.Fatalf("not enough expected elements")
}
message := evt.Fields["message"].(string)

Check failure on line 471 in filebeat/input/filestream/environment_test.go (GitHub Actions / lint (linux)): Error return value is not checked (errcheck)
if message == events[checkedEventCount] {
foundEvents[checkedEventCount] = true
}
2 changes: 1 addition & 1 deletion filebeat/input/filestream/fswatch.go
@@ -278,7 +278,7 @@ func defaultFileScannerConfig() fileScannerConfig {
Symlinks: false,
RecursiveGlob: true,
Fingerprint: fingerprintConfig{
-Enabled: false,
+Enabled: true,
Offset: 0,
Length: DefaultFingerprintSize,
},
6 changes: 6 additions & 0 deletions filebeat/input/filestream/fswatch_test.go
@@ -222,6 +222,7 @@ scanner:
paths := []string{filepath.Join(dir, "*.log")}
cfgStr := `
scanner:
fingerprint.enabled: false
check_interval: 10ms
`

@@ -260,6 +261,7 @@ scanner:
paths := []string{filepath.Join(dir, "*.log")}
cfgStr := `
scanner:
fingerprint.enabled: false
check_interval: 50ms
`

@@ -370,6 +372,7 @@ scanner:
}
cfgStr := `
scanner:
fingerprint.enabled: false
check_interval: 100ms
`

@@ -615,6 +618,7 @@ scanner:
name: "returns no symlink if the original file is excluded",
cfgStr: `
scanner:
fingerprint.enabled: false
exclude_files: ['.*exclude.*', '.*traveler.*']
symlinks: true
`,
@@ -661,6 +665,7 @@ scanner:
name: "returns no included symlink if the original file is not included",
cfgStr: `
scanner:
fingerprint.enabled: false
include_files: ['.*include.*', '.*portal.*']
symlinks: true
`,
@@ -678,6 +683,7 @@ scanner:
name: "returns an included symlink if the original file is included",
cfgStr: `
scanner:
fingerprint.enabled: false
include_files: ['.*include.*', '.*portal.*', '.*traveler.*']
symlinks: true
`,
2 changes: 1 addition & 1 deletion filebeat/input/filestream/identifier.go
@@ -76,7 +76,7 @@ func (f fileSource) Name() string {
// newFileIdentifier creates a new state identifier for a log input.
func newFileIdentifier(ns *conf.Namespace, suffix string) (fileIdentifier, error) {
if ns == nil {
-i, err := newINodeDeviceIdentifier(nil)
+i, err := newFingerprintIdentifier(nil)
if err != nil {
return nil, err
}