Improve Logsdb docs including default values #115205
Pinging @elastic/es-docs (Team:Docs)
Pinging @elastic/es-storage-engine (Team:StorageEngine)
are preserved for <<synthetic-source,synthetic `_source`>> reconstruction. In `logsdb`, the default value is `arrays`,
which retains both duplicate values and the order of entries but not necessarily the exact structure when it comes to
array elements or objects. Preserving duplicates and ordering could be critical for some log fields. This could be the
case, for instance, for DNS A records, HTTP headers, or log entries that represent sequential or repeated events.

Maybe add something like:
For more details on this setting and ways to refine or bypass it, check out <<synthetic-source-keep, this section>>.
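
To illustrate the setting discussed in the excerpt above, here is a minimal sketch that builds an index settings body with `index.mapping.synthetic_source_keep`. The helper name `keep_settings` is hypothetical, and the list of accepted values (`none`, `arrays`, `all`) is an assumption taken from the synthetic `_source` documentation, so verify it against the version you run.

```python
import json

# Assumed set of accepted values; "arrays" is the logsdb default per the excerpt above.
KEEP_MODES = ("none", "arrays", "all")


def keep_settings(mode: str = "arrays") -> dict:
    """Build a create-index body that sets synthetic_source_keep (hypothetical helper)."""
    if mode not in KEEP_MODES:
        raise ValueError(f"unsupported synthetic_source_keep value: {mode!r}")
    return {
        "settings": {
            "index.mode": "logsdb",
            "index.mapping.synthetic_source_keep": mode,
        }
    }


body = keep_settings("arrays")
print(json.dumps(body, indent=2))
```

The printed body could be sent with a create-index request; no request is made here.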
a single `host.name` field will be mapped as a `keyword` field.

Once an index is created, the sort settings are final and cannot be changed. If you need different sort settings,
a new index must be created with the desired settings.

Maybe add: "In the case of a data stream, this happens through rollover".

If the default sort settings do not suit your use case, consider adjusting them. Keep in mind that sort settings
will affect indexing throughput and query latency, as well as potentially impacting compression effectiveness
due to how data is distributed after sorting.

Consider adding an example on how to override, and mention that sorting on @timestamp is automatically added (for data streams?).
Well, examples for index sorting are available on the index sorting page, so I will just link that page.
There is no such thing as adding sorting on `@timestamp`. For logsdb we either sort on both `host.name` and `@timestamp`, or users override it with whatever they like. We don't add sorting on `@timestamp` other than with the default sort settings. We add the `@timestamp` mapping for data streams, which is already explained elsewhere, but we do not necessarily sort on it. Defining sort fields and injecting the mappings are separate things. If a user defines sorting on something like `agent.id`, for example, we still inject the `@timestamp` field (for a data stream) but we do not sort on it. I will point this out in the documentation by means of a note.
Mapping of `@timestamp` is explained in <<data-streams,data stream>> with:
> Every document indexed to a data stream must contain a @timestamp field, mapped as a [date](https://www.elastic.co/guide/en/elasticsearch/reference/current/date.html) or [date_nanos](https://www.elastic.co/guide/en/elasticsearch/reference/current/date_nanos.html) field type. If the index template doesn’t specify a mapping for the @timestamp field, Elasticsearch maps @timestamp as a date field with default options.
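
To make the distinction in this thread concrete (sort fields are one thing, the injected `@timestamp` mapping another), here is a minimal sketch that builds settings with and without a custom sort. The helper `logsdb_sort_settings` is hypothetical; `index.sort.field` and `index.sort.order` are the setting names listed in the PR description, and `agent.id` is the example field from the comment above.

```python
import json
from typing import Optional, Sequence


def logsdb_sort_settings(fields: Optional[Sequence[str]] = None,
                         orders: Optional[Sequence[str]] = None) -> dict:
    """Build logsdb index settings, optionally overriding the sort (hypothetical helper)."""
    settings = {"index.mode": "logsdb"}
    if fields:
        # Default every custom sort field to ascending unless orders are given.
        sort_orders = list(orders) if orders else ["asc"] * len(fields)
        if len(sort_orders) != len(fields):
            raise ValueError("need exactly one sort order per sort field")
        settings["index.sort.field"] = list(fields)
        settings["index.sort.order"] = sort_orders
    # With no explicit fields, nothing is set and the logsdb defaults apply
    # (host.name and @timestamp, per the comment above).
    return {"settings": settings}


default_body = logsdb_sort_settings()
custom_body = logsdb_sort_settings(["agent.id"])
print(json.dumps(custom_body, indent=2))
```

In the custom case the sort contains only `agent.id`; per the thread, `@timestamp` is still mapped for a data stream but is not part of the sort.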
to be indexed without causing indexing failures, ensuring that log data ingestion continues smoothly even when some
fields contain invalid or improperly formatted data.

Users can override this setting by setting `ignore_malformed` to `false`. However, this is not recommended as it might

`index.mapping.ignore_malformed`?
`host.name` is mapped with `subobjects: true` it consists of two fields. When `host.name` is mapped with
`subobjects: false` it only consists of one field.

`logsdb` index mode uses a special field named `_ignored_source` that allows retrieving values for fields that have been

Hm this is more of an internal implementation detail.. I wonder if we should be documenting this, as its use may change in the future. Do we expect users to care about it?
We expose it via the `fields` and `stored_fields` API anyway, so users can actually fetch it. I wrote that they should not rely on the name or the encoding. I think this is fair. The idea is that this should only be used for debugging purposes. If there is an issue, it will be handy to ask them for the value of this field.
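
As a debugging aid only, a search body requesting that field might look like the sketch below. It assumes, based solely on the comment above, that `_ignored_source` can be requested through `stored_fields`; the helper name is hypothetical, and neither the field name nor its encoding should be relied upon.

```python
import json


def ignored_source_debug_query(field_name: str = "_ignored_source", size: int = 1) -> dict:
    """Build a search body that asks for the internal stored field (hypothetical helper)."""
    return {
        "size": size,
        "query": {"match_all": {}},
        # Assumption from the review thread: the field is retrievable as a stored field.
        "stored_fields": [field_name],
    }


debug_body = ignored_source_debug_query()
print(json.dumps(debug_body, indent=2))
```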

* **`index.mapping.ignore_above`**: `8191`

* **`index.mapping.total_fields.limit`**: 1000 (same as `"standard"` index mode)

Is this the default value? If so, let's skip it.
field and synthesizes it from doc values or stored fields upon document retrieval. Synthetic source comes with a few
restrictions which you can read more about in the <<synthetic-source,documentation>> section dedicated to it.

NOTE: When dealing with multi-value fields, the `index.mapping.synthetic_source_keep` setting controls how field values

I am not sure if multi-value fields is clear enough. Maybe "when dealing with arrays of values"?
In other places in the documents we use multi-value fields...which by the way is the correct name. Elasticsearch doesn't normally need to maintain array order because its core functionality revolves around searching based on the presence of values, not their position. This is true also for aggregations. Therefore, it treats arrays and multi-value fields as a set of independent values, where order doesn't play a role in indexing or querying. So, IMO it is where we use "array" that we make a mistake. An array is a (concrete) ordered data structure...a multi-value field is an abstract collection of values where order does not matter. I don't want to sound picky but again...I think "array" is incorrect. A lot of our code is written without considering ordering an issue (including the way synthetic source works normally and aggregations work). If we use "array" we suggest, instead, that ordering matters.
Thanks for this context, sounds good.
[[logsdb-data-streams]]
=== LogsDB for logs data streams

In Elasticsearch, `logsdb` mode is applied by default for data streams whose name matches the pattern `logs-*-*`.

We should not say that in 8.16 docs, right?
I will change it when backporting.
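
For readers unsure which names the `logs-*-*` pattern covers, the sketch below approximates the match with Python's `fnmatch`. This is only an illustration: Elasticsearch does its own pattern matching, and the stream names used here are made up.

```python
from fnmatch import fnmatchcase

LOGS_PATTERN = "logs-*-*"


def looks_like_logs_data_stream(name: str) -> bool:
    """Approximate the index-pattern match with shell-style wildcards."""
    return fnmatchcase(name, LOGS_PATTERN)


# Hypothetical data stream names, for illustration only.
for candidate in ("logs-nginx-default", "logs-nginx", "metrics-nginx-default"):
    print(candidate, looks_like_logs_data_stream(candidate))
```

Only the first name has the two hyphen-separated parts after `logs` that the pattern requires.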
This pattern identifies a logs data stream, and Elasticsearch automatically configures the data stream to use LogsDB.
We recommend using `logsdb` index mode for data streams by means of standard or custom (component) templates.

Users are allowed to opt out of `logsdb` index mode by overriding the `index.mode` setting in the index settings or by

Just to sound nicer.
- Users are allowed to opt out of `logsdb` index mode by overriding the `index.mode` setting in the index settings or by
+ Users can opt out of `logsdb` index mode by overriding the `index.mode` setting in the index settings or by
Why is this nicer? You mean less formal?
Yes, not a big deal though.
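
The opt-out described in the sentence under review could be sketched as an index template body like the one below. The helper name, the pattern `logs-myapp-*`, and the priority value are hypothetical; the only part taken from the docs text is overriding `index.mode`, here assumed to be set back to `standard`.

```python
import json


def opt_out_template(pattern: str = "logs-myapp-*", priority: int = 500) -> dict:
    """Build a composable index template body that opts out of logsdb (hypothetical helper)."""
    return {
        "index_patterns": [pattern],
        "data_stream": {},
        # Hypothetical priority, chosen so this template wins over lower-priority ones.
        "priority": priority,
        "template": {"settings": {"index.mode": "standard"}},
    }


template_body = opt_out_template()
print(json.dumps(template_body, indent=2))
```

Because index settings are applied when a backing index is created, an existing data stream would presumably pick this up on its next rollover.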
result in documents with malformed fields being rejected and not indexed at all.

In `logsdb` index mode, the `index.mapping.ignore_above` setting is applied by default at the index level to ensure
efficient storage and indexing of large text fields.The index-level default for `ignore_above` is set to 8191

- efficient storage and indexing of large text fields.The index-level default for `ignore_above` is set to 8191
+ efficient storage and indexing of large keyword fields. The index-level default for `ignore_above` is set to 8191
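
As a rough model of what the 8191 default means for a keyword value, the sketch below only checks the length threshold. It counts Python characters, which may not match how Elasticsearch measures length, and the function name is hypothetical.

```python
LOGSDB_IGNORE_ABOVE = 8191  # index-level default quoted in the excerpt above


def exceeds_ignore_above(value: str, limit: int = LOGSDB_IGNORE_ABOVE) -> bool:
    """Return True when a value is longer than the ignore_above limit."""
    return len(value) > limit


print(exceeds_ignore_above("a" * 8191))
print(exceeds_ignore_above("a" * 8192))
```

A value exactly at the limit is still within bounds; only longer values cross the threshold.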
=== Fields without doc values

When `logsdb` index mode uses synthetic `_source`, and `doc_values` are disabled for a field in the mapping,
Elasticsearch automatically sets the `store` setting to `true` for that field. This ensures that the field's data is

We only do this for `text` and `annotated_text` when `store` is `false` and there is no multi-field suitable for synthetic source. For all other fields without `doc_values`, we use fallback synthetic source via `_ignored_source`.
I didn't want to go into the details of which field types we do this for, so that the docs don't go out of sync if we change something and forget to update them. Also, I think it is an implementation detail. I wanted to mention this just to let users know that we might sometimes do this. I will add something like "might sometimes set `store` to `true`".
I left two comments about `index.codec`; otherwise looks good 👍
Users are allowed to override the default compression codec. If desired, they can switch to the `best_speed`
codec for faster compression at the expense of slightly larger storage footprint.

* `index.codec`: `"best_compression"`
This is the default setting, applying {wikipedia}/Zstd[ZSTD] compression to stored fields for optimal storage
efficiency.

* `index.codec`: `"best_speed"`
If faster indexing performance is required, users can opt for `best_speed` compression, which sacrifices some storage
efficiency for higher indexing throughput.

Maybe just link to the documentation about the `index.codec` setting? (https://www.elastic.co/guide/en/elasticsearch/reference/8.16/index-modules.html)
This is the default setting, applying {wikipedia}/Zstd[ZSTD] compression to stored fields for optimal storage
efficiency.

* `index.codec`: `"best_speed"`

The option is named `default`, not `best_speed`. In the codec this is known as best speed, but that isn't the configuration option's name.
Right, thanks
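
The naming point raised in this thread can be sketched as a small validation step. The allowed set below contains only the two option names mentioned in this conversation (`best_compression` and `default`); Elasticsearch may accept other values, and the helper name is hypothetical.

```python
import json

# Only the option names mentioned in this thread; not an exhaustive list.
CODEC_OPTIONS = {"best_compression", "default"}


def codec_override(codec: str) -> dict:
    """Build logsdb settings with an explicit index.codec (hypothetical helper)."""
    if codec not in CODEC_OPTIONS:
        raise ValueError(f"{codec!r} is not a known index.codec option name")
    return {"settings": {"index.mode": "logsdb", "index.codec": codec}}


faster_body = codec_override("default")
print(json.dumps(faster_body, indent=2))
```

Passing `best_speed` raises an error here, mirroring the review comment that the configuration value is `default`.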
Should someone from docs take a look?
LGTM
@@ -8,14 +8,6 @@ A logs data stream is a data stream type that stores log data more efficiently.
In benchmarks, log data stored in a logs data stream used ~2.5 times less disk space than a regular data
stream. The exact impact will vary depending on your data set.

The following features are enabled in a logs data stream:

@martijnvg I removed this part since this is explained later.
This PR adds detailed documentation for `logsdb` mode, covering several key aspects of its default behavior and configuration options. It includes:
- default settings for index sorting (`index.sort.field`, `index.sort.order`, etc.).
- usage of synthetic `_source` by default.
- information about specialized codecs and how users can override them.
- default behavior for `ignore_malformed` and `ignore_above` settings, including precedence rules.
- explanation of how fields without `doc_values` are handled and what we do if they are missing.
👋 Didn't see this at the time, but I did a general edit just now in #118303