Improve Logsdb docs including default values #115205

Merged

@salvatore-campagna salvatore-campagna commented Oct 21, 2024

This PR adds detailed documentation for logsdb mode, covering several key aspects of its default behavior and configuration options.

It includes:

  • default settings for index sorting (index.sort.field, index.sort.order, etc.).
  • usage of synthetic _source by default.
  • information about specialized codecs and how users can override them.
  • default behavior for ignore_malformed and ignore_above settings, including precedence rules.
  • explanation of how fields without doc_values are handled and what we do if they are missing.

@elasticsearchmachine
Collaborator

Pinging @elastic/es-docs (Team:Docs)

@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

are preserved for <<synthetic-source,synthetic `_source`>> reconstruction. In `logsdb`, the default value is `arrays`,
which retains both duplicate values and the order of entries but not necessarily the exact structure when it comes to
array elements or objects. Preserving duplicates and ordering could be critical for some log fields. This could be the
case, for instance, for DNS A records, HTTP headers, or log entries that represent sequential or repeated events.
Contributor

Maybe add something like:

For more details on this setting and ways to refine or bypass it, check out <<synthetic-source-keep, this section>>.

a single `host.name` field will be mapped as a `keyword` field.

Once an index is created, the sort settings are final and cannot be changed. If you need different sort settings,
a new index must be created with the desired settings.
Contributor

Maybe add: "In the case of a data stream, this happens through rollover".


If the default sort settings do not suit your use case, consider adjusting them. Keep in mind that sort settings
will affect indexing throughput and query latency, as well as potentially impacting compression effectiveness
due to how data is distributed after sorting.
Contributor

Consider adding an example on how to override, and mention that sorting on @timestamp is automatically added (for data streams?).

Contributor Author

Well, examples for index sorting are available on the page about index sorting. I will just link that page.

Contributor Author

@salvatore-campagna salvatore-campagna Oct 23, 2024

There is no such thing as automatically adding a sort on @timestamp. For logsdb we either sort on both host.name and @timestamp, or users override that with whatever they like. We do not add sorting on @timestamp other than through the default sort settings. We add the @timestamp mapping for data streams, which is already explained elsewhere, but we do not necessarily sort on it. Defining sort fields and injecting the mappings are separate things. If a user defines sorting on something like agent.id, for example, we still inject the @timestamp field (for a data stream) but we do not sort on it. I will point this out in the documentation by means of a note.

Contributor Author

Mapping of @timestamp is explained in <<data-streams,data stream>> with

Every document indexed to a data stream must contain a @timestamp field, mapped as a [date](https://www.elastic.co/guide/en/elasticsearch/reference/current/date.html) or [date_nanos](https://www.elastic.co/guide/en/elasticsearch/reference/current/date_nanos.html) field type. If the index template doesn’t specify a mapping for the @timestamp field, Elasticsearch maps @timestamp as a date field with default options.
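To make the distinction in this thread concrete (sort fields are one thing, the injected `@timestamp` mapping is another), here is a minimal sketch in Python that only builds a settings dictionary. The helper name `logsdb_sort_settings` is hypothetical. The default field list follows the comment above (`host.name` and `@timestamp`). No sort order is set because the thread does not state one.

```python
import json

# Default sort fields for logsdb as described in this thread.
DEFAULT_SORT_FIELDS = ["host.name", "@timestamp"]


def logsdb_sort_settings(sort_fields=None):
    """Build index settings for a logsdb index (hypothetical helper).

    When sort_fields is None, the defaults described in the thread are used.
    A custom list replaces the defaults entirely; nothing is appended to it.
    """
    fields = list(DEFAULT_SORT_FIELDS if sort_fields is None else sort_fields)
    return {"index.mode": "logsdb", "index.sort.field": fields}


default_settings = logsdb_sort_settings()
custom_settings = logsdb_sort_settings(["agent.id"])
print(json.dumps(custom_settings, indent=2))
```

With a custom list such as `["agent.id"]`, `@timestamp` does not appear among the sort fields, which matches the point that the mapping is still injected for a data stream but the field is not sorted on.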

to be indexed without causing indexing failures, ensuring that log data ingestion continues smoothly even when some
fields contain invalid or improperly formatted data.

Users can override this setting by setting `ignore_malformed` to `false`. However, this is not recommended as it might
Contributor

index.mapping.ignore_malformed ?
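As a rough illustration of the behavior the quoted paragraph describes, the following toy model in Python contrasts the two values of `index.mapping.ignore_malformed`. It is a sketch, not how Elasticsearch parses documents: the function `index_document`, the `integer_fields` parameter, and the sample field names are all hypothetical, and "malformed" is reduced to "not an integer".

```python
def index_document(doc, integer_fields, ignore_malformed=True):
    """Toy model of ignore_malformed for fields that must hold integers.

    Returns (indexed_fields, ignored_field_names). With ignore_malformed=False
    a malformed value rejects the whole document by raising ValueError.
    """
    indexed, ignored = {}, []
    for name, value in doc.items():
        if name in integer_fields and not isinstance(value, int):
            if not ignore_malformed:
                raise ValueError(f"malformed value for field [{name}]")
            ignored.append(name)  # skip the field, keep the document
            continue
        indexed[name] = value
    return indexed, ignored


log_line = {"http.status": "n/a", "message": "request served"}
print(index_document(log_line, {"http.status"}))
```

With the setting enabled, the document is kept and only the bad field is skipped. With it disabled, the same document is rejected as a whole.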

`host.name` is mapped with `subobjects: true` it consists of two fields. When `host.name` is mapped with
`subobjects: false` it only consists of one field.

`logsdb` index mode uses a special field named `_ignored_source` that allows retrieving values for fields that have been
Contributor

Hm, this is more of an internal implementation detail. I wonder if we should be documenting this, as its use may change in the future. Do we expect users to care about it?

Contributor Author

We expose it via the fields and stored_fields APIs anyway, so users can actually fetch it. I wrote that they should not rely on the name or the encoding, which I think is fair. The idea is that this should only be used for debugging purposes: if there is an issue, it will be handy to ask them for the value of this field.


* **`index.mapping.ignore_above`**: `8191`

* **`index.mapping.total_fields.limit`**: 1000 (same as `"standard"` index mode)
Contributor

Is this the default value? If so, let's skip it.

field and synthesizes it from doc values or stored fields upon document retrieval. Synthetic source comes with a few
restrictions which you can read more about in the <<synthetic-source,documentation>> section dedicated to it.

NOTE: When dealing with multi-value fields, the `index.mapping.synthetic_source_keep` setting controls how field values
Contributor

I am not sure if multi-value fields is clear enough. Maybe "when dealing with arrays of values"?

Contributor Author

@salvatore-campagna salvatore-campagna Oct 23, 2024

Elsewhere in the docs we use "multi-value fields", which, by the way, is the correct name. Elasticsearch does not normally need to maintain array order because its core functionality revolves around searching based on the presence of values, not their position. The same is true for aggregations. It therefore treats arrays and multi-value fields as a set of independent values, where order plays no role in indexing or querying. So, IMO, it is where we use "array" that we make a mistake. An array is a (concrete) ordered data structure; a multi-value field is an abstract collection of values where order does not matter. I don't want to sound picky, but I think "array" is incorrect. A lot of our code is written without treating ordering as a concern (including the way synthetic source and aggregations normally work). If we use "array" we suggest, instead, that ordering matters.

Contributor

Thanks for this context, sounds good.
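The difference between the two readings discussed in this thread can be sketched in a few lines of Python. This is a conceptual model only, not Elasticsearch code: both function names are hypothetical, and sorting in `as_multi_value` is just one way to show that the original order carries no meaning.

```python
def as_multi_value(values):
    """Model a multi-value field: duplicates collapse and order is not kept."""
    return sorted(set(values))


def as_array(values):
    """Model the `arrays` behavior quoted in the note: keep order and duplicates."""
    return list(values)


dns_a_records = ["10.0.0.2", "10.0.0.1", "10.0.0.2"]
print(as_multi_value(dns_a_records))
print(as_array(dns_a_records))
```

For a field such as DNS A records, the second form is the one that preserves what the log entry originally contained.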

[[logsdb-data-streams]]
=== LogsDB for logs data streams

In Elasticsearch, `logsdb` mode is applied by default for data streams whose name matches the pattern `logs-*-*`.
Contributor

We should not say that in 8.16 docs, right?

Contributor Author

I will change it when backporting.

This pattern identifies a logs data stream, and Elasticsearch automatically configures the data stream to use LogsDB.
We recommend using `logsdb` index mode for data streams by means of standard or custom (component) templates.

Users are allowed to opt out of `logsdb` index mode by overriding the `index.mode` setting in the index settings or by
Contributor

Just to sound nicer.

Suggested change
- Users are allowed to opt out of `logsdb` index mode by overriding the `index.mode` setting in the index settings or by
+ Users can opt out of `logsdb` index mode by overriding the `index.mode` setting in the index settings or by

Contributor Author

@salvatore-campagna salvatore-campagna Oct 23, 2024

Why is this nicer? You mean less formal?

Contributor

Yes, not a big deal though.
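For reference, the opt-out discussed in this thread could look roughly like the index template body below, built here as a plain Python dictionary. This is a sketch under assumptions: the helper `opt_out_template`, the pattern `logs-myapp-*`, and the priority `500` are placeholders chosen for illustration, and the only setting taken from the quoted text is `index.mode`.

```python
import json


def opt_out_template(index_pattern, priority):
    """Return an index template body that sets index.mode back to standard."""
    return {
        "index_patterns": [index_pattern],
        "data_stream": {},
        "priority": priority,
        "template": {"settings": {"index.mode": "standard"}},
    }


# Placeholder pattern and priority; adjust both to the templates in your cluster.
body = opt_out_template("logs-myapp-*", 500)
print(json.dumps(body, indent=2))
```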

result in documents with malformed fields being rejected and not indexed at all.

In `logsdb` index mode, the `index.mapping.ignore_above` setting is applied by default at the index level to ensure
efficient storage and indexing of large text fields.The index-level default for `ignore_above` is set to 8191
Contributor

Suggested change
- efficient storage and indexing of large text fields.The index-level default for `ignore_above` is set to 8191
+ efficient storage and indexing of large keyword fields. The index-level default for `ignore_above` is set to 8191
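A small Python sketch of how the index-level default and a field-level `ignore_above` might interact. The precedence shown (a field-level value wins when present) is an assumption, since the quoted lines do not spell it out. Both function names are hypothetical, and value length is simplified to Python's `len`.

```python
# Index-level default for logsdb quoted in the diff above.
LOGSDB_INDEX_LEVEL_IGNORE_ABOVE = 8191


def effective_ignore_above(field_level=None, index_level=LOGSDB_INDEX_LEVEL_IGNORE_ABOVE):
    """Assumed precedence: a field-level value, when set, overrides the index level."""
    return index_level if field_level is None else field_level


def is_indexed(value, limit):
    """A keyword value longer than the limit is skipped at index time."""
    return len(value) <= limit


limit = effective_ignore_above()
print(is_indexed("x" * 8191, limit), is_indexed("x" * 8192, limit))
```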

=== Fields without doc values

When `logsdb` index mode uses synthetic `_source`, and `doc_values` are disabled for a field in the mapping,
Elasticsearch automatically sets the `store` setting to `true` for that field. This ensures that the field's data is
Contributor

We only do this for text and annotated_text when store is false and there is no multi field suitable for synthetic source. If there is no doc_values for all other fields we use fallback synthetic source via _ignored_source.

Contributor Author

I didn't want to go into the details of which field types we do this for and which not, so that the docs don't go out of sync if we change something and forget to update them. Also, I think it is an implementation detail. I wanted to mention this just to let users know that we might sometimes do this. I will add something like "sometimes might set `store` to `true`".

Member

@martijnvg martijnvg left a comment

I left two comments about index.codec; otherwise looks good 👍

Comment on lines 129 to 138
Users are allowed to override the default compression codec. If desired, they can switch to the `best_speed`
codec for faster compression at the expense of slightly larger storage footprint.

* `index.codec`: `"best_compression"`
This is the default setting, applying {wikipedia}/Zstd[ZSTD] compression to stored fields for optimal storage
efficiency.

* `index.codec`: `"best_speed"`
If faster indexing performance is required, users can opt for `best_speed` compression, which sacrifices some storage
efficiency for higher indexing throughput.
Member

Maybe just link to the documentation about index.codec setting? (https://www.elastic.co/guide/en/elasticsearch/reference/8.16/index-modules.html)

docs/reference/data-streams/logs.asciidoc (outdated)
This is the default setting, applying {wikipedia}/Zstd[ZSTD] compression to stored fields for optimal storage
efficiency.

* `index.codec`: `"best_speed"`
Member

The option is named `default`, not `best_speed`. Internally the codec is known as best speed, but that is not the configuration option's name.

Contributor Author

Right, thanks
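Following the correction above, a settings fragment that trades storage efficiency for speed would use the value `default`, not `best_speed`. The Python helper below is a hypothetical sketch that only returns the setting as a dictionary. The two values are the ones named in this thread.

```python
def codec_settings(prefer_speed=False):
    """Return the index.codec setting for a logsdb index (hypothetical helper).

    "best_compression" is the logsdb default described in the diff;
    "default" is the option name for the faster codec.
    """
    return {"index.codec": "default" if prefer_speed else "best_compression"}


print(codec_settings(prefer_speed=True))
```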

@lkts
Contributor

lkts commented Oct 23, 2024

Should someone from docs take a look?

Member

@martijnvg martijnvg left a comment

LGTM

@@ -8,14 +8,6 @@ A logs data stream is a data stream type that stores log data more efficiently.
In benchmarks, log data stored in a logs data stream used ~2.5 times less disk space than a regular data
stream. The exact impact will vary depending on your data set.

The following features are enabled in a logs data stream:
Contributor Author

@martijnvg I removed this part since this is explained later.

@salvatore-campagna salvatore-campagna merged commit ebec1a2 into elastic:main Oct 24, 2024
5 checks passed
georgewallace pushed a commit to georgewallace/elasticsearch that referenced this pull request Oct 25, 2024
jfreden pushed a commit to jfreden/elasticsearch that referenced this pull request Nov 4, 2024
@marciw
Contributor

marciw commented Dec 10, 2024

> Should someone from docs take a look?

👋 Didn't see this at the time, but I did a general edit just now in #118303
