Improve Logsdb docs including default values #115205

salvatore-campagna · 2024-10-21T11:06:43Z

This PR adds detailed documentation for logsdb mode, covering several key aspects of its default behavior and configuration options.

It includes:

default settings for index sorting (index.sort.field, index.sort.order, etc.).
usage of synthetic _source by default.
information about specialized codecs and how users can override them.
default behavior for ignore_malformed and ignore_above settings, including precedence rules.
explanation of how fields without doc_values are handled and what we do if they are missing.

github-actions · 2024-10-21T11:06:57Z

Documentation preview:

✨ Changed pages

elasticsearchmachine · 2024-10-21T11:07:08Z

Pinging @elastic/es-docs (Team:Docs)

elasticsearchmachine · 2024-10-21T11:07:08Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

kkrik-es · 2024-10-22T15:55:34Z

docs/reference/data-streams/logs.asciidoc

+are preserved for <<synthetic-source,synthetic `_source`>> reconstruction. In `logsdb`, the default value is `arrays`,
+which retains both duplicate values and the order of entries but not necessarily the exact structure when it comes to
+array elements or objects. Preserving duplicates and ordering could be critical for some log fields. This could be the
+case, for instance, for DNS A records, HTTP headers, or log entries that represent sequential or repeated events.


Maybe add something like:

For more details on this setting and ways to refine or bypass it, check out <<synthetic-source-keep, this section>>.

kkrik-es · 2024-10-22T15:58:07Z

docs/reference/data-streams/logs.asciidoc

+a single `host.name` field will be mapped as a `keyword` field.
+
+Once an index is created, the sort settings are final and cannot be changed. If you need different sort settings,
+a new index must be created with the desired settings.


Maybe add: "In the case of a data stream, this happens through rollover".

kkrik-es · 2024-10-22T15:58:45Z

docs/reference/data-streams/logs.asciidoc

+
+If the default sort settings do not suit your use case, consider adjusting them. Keep in mind that sort settings
+will affect indexing throughput and query latency, as well as potentially impacting compression effectiveness
+due to how data is distributed after sorting.


Consider adding an example on how to override, and mention that sorting on @timestamp is automatically added (for data streams?).

Well, examples for index sorting are available on the page about index sorting... I will just link that page.

There is no such thing as adding sorting on @timestamp. For logsdb we either sort on both fields, host.name and @timestamp, or users override that with whatever they like. We don't add sorting on @timestamp other than through the default sort settings. We add the @timestamp mapping for data streams, which is already explained elsewhere, but we do not necessarily sort on it. Defining sort fields and injecting the mappings are separate things. If a user defines sorting on something like agent.id, for example, we still inject the @timestamp field (for a data stream) but we do not sort on it. I will point this out in the documentation by means of a note.

Mapping of @timestamp is explained in <<data-streams,data stream>> with

Every document indexed to a data stream must contain a @timestamp field, mapped as a [date](https://www.elastic.co/guide/en/elasticsearch/reference/current/date.html) or [date_nanos](https://www.elastic.co/guide/en/elasticsearch/reference/current/date_nanos.html) field type. If the index template doesn’t specify a mapping for the @timestamp field, Elasticsearch maps @timestamp as a date field with default options.
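
To make the distinction in this thread concrete, here is a minimal sketch of an index template body that overrides the default logsdb sort with agent.id (the field named in the comment above). The helper name `build_logsdb_template` and the `my-logs-*` pattern are hypothetical and only serve the illustration; the setting keys `index.mode`, `index.sort.field` and `index.sort.order` are the ones discussed in this PR. Note that @timestamp does not appear in the sort settings even though a data stream still requires the field to be mapped.

```python
import json


def build_logsdb_template(sort_fields, sort_orders):
    """Hypothetical helper: build an index template body with custom sort settings."""
    if len(sort_fields) != len(sort_orders):
        raise ValueError("each sort field needs a matching sort order")
    return {
        "index_patterns": ["my-logs-*"],  # hypothetical pattern, not from the PR
        "data_stream": {},
        "template": {
            "settings": {
                "index.mode": "logsdb",
                # Overrides the default sort on host.name and @timestamp.
                "index.sort.field": list(sort_fields),
                "index.sort.order": list(sort_orders),
            },
            "mappings": {
                "properties": {
                    # The sort field must be mapped; keyword is an assumption here.
                    "agent": {"properties": {"id": {"type": "keyword"}}}
                }
            },
        },
    }


template = build_logsdb_template(["agent.id"], ["asc"])
print(json.dumps(template, indent=2))
```

The resulting body could be sent with a create index template request. Because sort settings are final once an index exists, a change like this would only affect indices created after the template is updated.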

kkrik-es · 2024-10-22T16:02:07Z

docs/reference/data-streams/logs.asciidoc

+to be indexed without causing indexing failures, ensuring that log data ingestion continues smoothly even when some
+fields contain invalid or improperly formatted data.
+
+Users can override this setting by setting `ignore_malformed` to `false`. However, this is not recommended as it might


index.mapping.ignore_malformed ?
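
For reference, a sketch of what the override looks like when the full index-level setting name is used. This is an illustrative request body only, assuming the override is applied in the index settings at creation time; the variable names are made up for the example.

```python
import json

# Index-level override using the full setting name raised in this comment.
# With the value set to False, documents with malformed fields are rejected
# instead of being indexed with the malformed values ignored.
settings_override = {
    "settings": {
        "index.mode": "logsdb",
        "index.mapping.ignore_malformed": False,
    }
}

body = json.dumps(settings_override)
print(body)
```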

kkrik-es · 2024-10-22T16:05:39Z

docs/reference/data-streams/logs.asciidoc

+`host.name` is mapped with `subobjects: true` it consists of two fields. When `host.name` is mapped with
+`subobjects: false` it only consists of one field.
+
+`logsdb` index mode uses a special field named `_ignored_source` that allows retrieving values for fields that have been


Hm this is more of an internal implementation detail.. I wonder if we should be documenting this, as its use may change in the future. Do we expect users to care about it?

We expose it via the fields and stored_fields API anyway, so users can actually fetch it. I wrote that they should not rely on the name or the encoding. I think this is fair. The idea is that this should only be used for debugging purposes. If there is an issue, it will be handy to ask them for the value of this field.

kkrik-es · 2024-10-22T16:07:16Z

docs/reference/data-streams/logs.asciidoc

+
+* **`index.mapping.ignore_above`**: `8191`
+
+* **`index.mapping.total_fields.limit`**: 1000 (same as `"standard"` index mode)


Is this the default value? If so, let's skip it.

lkts · 2024-10-22T18:10:34Z

docs/reference/data-streams/logs.asciidoc

+field and synthesizes it from doc values or stored fields upon document retrieval. Synthetic source comes with a few
+restrictions which you can read more about in the <<synthetic-source,documentation>> section dedicated to it.
+
+NOTE: When dealing with multi-value fields, the `index.mapping.synthetic_source_keep` setting controls how field values


I am not sure if multi-value fields is clear enough. Maybe "when dealing with arrays of values"?

In other places in the documents we use multi-value fields...which by the way is the correct name. Elasticsearch doesn't normally need to maintain array order because its core functionality revolves around searching based on the presence of values, not their position. This is true also for aggregations. Therefore, it treats arrays and multi-value fields as a set of independent values, where order doesn't play a role in indexing or querying. So, IMO it is where we use "array" that we make a mistake. An array is a (concrete) ordered data structure...a multi-value field is an abstract collection of values where order does not matter. I don't want to sound picky but again...I think "array" is incorrect. A lot of our code is written without considering ordering an issue (including the way synthetic source works normally and aggregations work). If we use "array" we suggest, instead, that ordering matters.

Thanks for this context, sounds good.
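
The difference between the two terms can be sketched with a toy model. This is not Elasticsearch code; it is a simplified illustration that treats a multi-value field as an unordered, de-duplicated collection and an array as an ordered one that keeps duplicates, which is the behavior the quoted text attributes to the `arrays` value of `index.mapping.synthetic_source_keep`. The function names and sample values are hypothetical.

```python
def as_multi_value(values):
    """Toy model of a multi-value field: order and duplicates carry no meaning."""
    return sorted(set(values))


def as_array(values):
    """Toy model of an array: order and duplicates are preserved."""
    return list(values)


# Hypothetical DNS A record values, as they might arrive in a log event.
dns_a_records = ["10.0.0.2", "10.0.0.1", "10.0.0.2"]

print(as_multi_value(dns_a_records))
print(as_array(dns_a_records))
```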

lkts · 2024-10-22T18:11:51Z

docs/reference/data-streams/logs.asciidoc

+[[logsdb-data-streams]]
+=== LogsDB for logs data streams
+
+In Elasticsearch, `logsdb` mode is applied by default for data streams whose name matches the pattern `logs-*-*`.


We should not say that in 8.16 docs, right?

I will change it when backporting.

lkts · 2024-10-22T18:16:20Z

docs/reference/data-streams/logs.asciidoc

+This pattern identifies a logs data stream, and Elasticsearch automatically configures the data stream to use LogsDB.
+We recommend using `logsdb` index mode for data streams by means of standard or custom (component) templates.
+
+Users are allowed to opt out of `logsdb` index mode by overriding the `index.mode` setting in the index settings or by


Just to sound nicer.

Suggested change

-Users are allowed to opt out of `logsdb` index mode by overriding the `index.mode` setting in the index settings or by
+Users can opt out of `logsdb` index mode by overriding the `index.mode` setting in the index settings or by

Why is this nicer? You mean less formal?

Yes, not a big deal though.

lkts · 2024-10-22T18:21:13Z

docs/reference/data-streams/logs.asciidoc

+result in documents with malformed fields being rejected and not indexed at all.
+
+In `logsdb` index mode, the `index.mapping.ignore_above` setting is applied by default at the index level to ensure
+efficient storage and indexing of large text fields.The index-level default for `ignore_above` is set to 8191


Suggested change

-efficient storage and indexing of large text fields.The index-level default for `ignore_above` is set to 8191
+efficient storage and indexing of large keyword fields. The index-level default for `ignore_above` is set to 8191
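
The quoted lines do not say where 8191 comes from. One plausible reading, based on the `ignore_above` reference documentation, is that it is Lucene's 32766-byte term limit divided by the 4-byte worst case for a UTF-8 character. The sketch below only checks that arithmetic; the constant names are illustrative and the length check counts Python code points, which is an assumption of this example.

```python
LUCENE_MAX_TERM_BYTES = 32766  # Lucene's per-term byte limit
MAX_UTF8_BYTES_PER_CHAR = 4    # worst case for a single UTF-8 character

ignore_above_chars = LUCENE_MAX_TERM_BYTES // MAX_UTF8_BYTES_PER_CHAR
print(ignore_above_chars)

# A value at the limit made only of 4-byte characters still fits in one term.
worst_case_value = "\U0001F600" * ignore_above_chars
worst_case_bytes = len(worst_case_value.encode("utf-8"))
print(worst_case_bytes <= LUCENE_MAX_TERM_BYTES)
```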

lkts · 2024-10-22T18:25:22Z

docs/reference/data-streams/logs.asciidoc

+=== Fields without doc values
+
+When `logsdb` index mode uses synthetic `_source`, and `doc_values` are disabled for a field in the mapping,
+Elasticsearch automatically sets the `store` setting to `true` for that field. This ensures that the field's data is


We only do this for text and annotated_text when store is false and there is no multi field suitable for synthetic source. If there is no doc_values for all other fields we use fallback synthetic source via _ignored_source.

I didn't want to go into the details of which field types we do this for and which we don't, so that the docs don't go out of sync if we change something and forget to update them. Also, I think it is an implementation detail. I wanted to mention this just to let users know that we might sometimes do this. I will add something like "might sometimes set store to true".

martijnvg

I left two comments about index.codec; otherwise looks good 👍

martijnvg · 2024-10-23T08:14:22Z

docs/reference/data-streams/logs.asciidoc

+Users are allowed to override the default compression codec. If desired, they can switch to the `best_speed`
+codec for faster compression at the expense of slightly larger storage footprint.
+
+* `index.codec`: `"best_compression"`
+  This is the default setting, applying {wikipedia}/Zstd[ZSTD] compression to stored fields for optimal storage
+  efficiency.
+
+* `index.codec`: `"best_speed"`
+  If faster indexing performance is required, users can opt for `best_speed` compression, which sacrifices some storage
+  efficiency for higher indexing throughput.


Maybe just link to the documentation about index.codec setting? (https://www.elastic.co/guide/en/elasticsearch/reference/8.16/index-modules.html)


martijnvg · 2024-10-23T09:56:19Z

docs/reference/data-streams/logs.asciidoc

+  This is the default setting, applying {wikipedia}/Zstd[ZSTD] compression to stored fields for optimal storage
+  efficiency.
+
+* `index.codec`: `"best_speed"`


Option is named default and not best_speed. In the codec this is known as best speed, but that isn't what the configuration option's name is.

Right, thanks
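
A small sketch of the naming point made in this thread: the faster option is configured as `default`, while `best_speed` is only the name used inside the codec. The helper `codec_settings` is hypothetical, and the set below lists only the two option names relevant to this discussion, not every value `index.codec` may accept.

```python
# Option names relevant to this thread; "best_speed" is deliberately absent
# because it is not the configuration option's name.
CODEC_OPTIONS = {"default", "best_compression"}


def codec_settings(codec):
    """Hypothetical helper: build index settings for a logsdb index with a given codec."""
    if codec not in CODEC_OPTIONS:
        raise ValueError(f"unknown index.codec value: {codec!r}")
    return {"settings": {"index.mode": "logsdb", "index.codec": codec}}


print(codec_settings("default"))
```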

lkts · 2024-10-23T16:30:02Z

docs/reference/data-streams/logs.asciidoc

+field and synthesizes it from doc values or stored fields upon document retrieval. Synthetic source comes with a few
+restrictions which you can read more about in the <<synthetic-source,documentation>> section dedicated to it.
+
+NOTE: When dealing with multi-value fields, the `index.mapping.synthetic_source_keep` setting controls how field values


Thanks for this context, sounds good.

lkts · 2024-10-23T16:32:03Z

Should someone from docs take a look?

martijnvg

LGTM

salvatore-campagna · 2024-10-24T11:05:29Z

docs/reference/data-streams/logs.asciidoc

@@ -8,14 +8,6 @@ A logs data stream is a data stream type that stores log data more efficiently.
 In benchmarks, log data stored in a logs data stream used ~2.5 times less disk space than a regular data
 stream. The exact impact will vary depending on your data set.

-The following features are enabled in a logs data stream:


@martijnvg I removed this part since this is explained later.


marciw · 2024-12-10T00:25:09Z

Should someone from docs take a look?

👋 Didn't see this at the time, but I did a general edit just now in #118303

docs: improve logsdb docs including default values

f8336f2

salvatore-campagna added the >docs, :StorageEngine/Logs, v8.16.0, v8.16.1, v8.17.0 labels Oct 21, 2024

salvatore-campagna self-assigned this Oct 21, 2024

elasticsearchmachine added the v9.0.0, Team:Docs, Team:StorageEngine labels Oct 21, 2024

salvatore-campagna added 8 commits October 21, 2024 13:15

fix: some minor improvements

5a80d15

fix: improve docs about logsdb

554523f

docs: discrete

2e5002e

fix: add details about doc values encoding

b366d7e

fix: add details about doc value encoding for keywords

aa2fb88

fix: a few more improvements

b47ab32

fix: still some improvements

1889f8b

fix: a few minor details

80c6e8f

salvatore-campagna requested review from martijnvg, lkts and kkrik-es October 22, 2024 14:26

kkrik-es reviewed Oct 22, 2024


lkts reviewed Oct 22, 2024


martijnvg reviewed Oct 23, 2024


salvatore-campagna added 2 commits October 23, 2024 10:36

fix: improve documentation

b349842

nit: a few more details

7f53dba

martijnvg reviewed Oct 23, 2024


fix: losdb best_speed vs best_comrpession

b2528a0

lkts approved these changes Oct 23, 2024


martijnvg approved these changes Oct 24, 2024


salvatore-campagna added 5 commits October 24, 2024 09:58

fix: remove _ignored_source

29bf9cd

fix: remove logsdb for logs-*-*

c792141

nit: do not capitalize 'Synthetic'

f8151b0

fix: GCD and RLE

28a58cf

fix: remove details explained later

d01a30f

salvatore-campagna commented Oct 24, 2024


salvatore-campagna merged commit ebec1a2 into elastic:main Oct 24, 2024
5 checks passed

marciw mentioned this pull request Dec 9, 2024

Update and edit logsdb docs for logsdb / synthetic source GA #118303

Merged
