event.duration takes a significative amount of disk space #31574

Open
jsoriano opened this issue May 10, 2022 · 24 comments
Labels
bug discuss Issue needs further discussion. Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team Team:Obs-DC Label for the Data Collection team

Comments

@jsoriano
Member

jsoriano commented May 10, 2022

While analyzing disk space used by data collected by Metricbeat and stored in indexes with TSDB and synthetic _source enabled (elastic/elasticsearch#85649), @nik9000 found that event.duration takes up to 16.7% of the disk space.

This field is automatically added by Metricbeat, by default, with the duration of the fetch operation.

I guess that the main purpose of this field is to monitor or debug metrics collection itself, but this may not be so useful for the final users of most modules.

Being in Metricbeat, this is also added to metrics documents collected by Agent.

We should reconsider this field, or disable it by default.

cc @ruflin

@jsoriano jsoriano added bug discuss Issue needs further discussion. Team:Obs-DC Label for the Data Collection team Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels May 10, 2022
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@elasticmachine
Collaborator

Pinging @elastic/obs-dc (Team:Obs-DC)

@nik9000
Member

nik9000 commented May 10, 2022

We should reconsider this field

If you dropped it to second precision or microsecond precision it might still be useful and take up much much less space. You could hit it with the disk usage API.
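For reference, a minimal sketch of how the per-field share could be computed from the output of the analyze index disk usage API (`POST /<index>/_disk_usage?run_expensive_tasks=true`). The response layout used here (`all_fields.total_in_bytes` and `fields.<name>.total_in_bytes`) follows my reading of the Elasticsearch docs and should be treated as an assumption. The index name and byte counts are made up for illustration, and `field_share` is a hypothetical helper.

```python
def field_share(disk_usage: dict, index: str, field: str) -> float:
    """Return the fraction of analyzed bytes in `index` that `field` uses."""
    index_stats = disk_usage[index]
    total = index_stats["all_fields"]["total_in_bytes"]
    used = index_stats["fields"][field]["total_in_bytes"]
    return used / total if total else 0.0


# Hypothetical, hand-written response fragment (not real measurements).
sample_response = {
    "metrics-example": {
        "all_fields": {"total_in_bytes": 1_000_000},
        "fields": {
            "event.duration": {"total_in_bytes": 150_000},
            "@timestamp": {"total_in_bytes": 50_000},
        },
    }
}

share = field_share(sample_response, "metrics-example", "event.duration")
print(f"event.duration: {share:.1%} of analyzed bytes")
```

Running the same calculation before and after a precision change would give a like-for-like comparison, provided both indices hold the same documents.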

@jlind23
Collaborator

jlind23 commented May 11, 2022

@cmacknz any thoughts here? Should we change the precision or rather get rid of it?

@ruflin
Member

ruflin commented May 11, 2022

Could we do a test with reduced precision? Microsecond is definitely enough.

The idea behind this field was to visualise and detect potential delays of the event collection. If we don't use it anywhere, we could also just introduce a config and turn it off by default.

@jsoriano
Member Author

Not sure if reducing precision is a realistic option: this is defined as nanoseconds in ECS, and there can be uses of this field with that precision. Also, there are cases where an event can take sub-millisecond times, such as when collecting system or runtime metrics, or in performance monitoring.
So this field is probably ok in nanoseconds, but it should be used only where/when needed, and maybe we should have a different field for the cases when less precision is needed.

I think that disabling it by default in Metricbeat would be a better option, but it can also be a breaking change if someone is using it.

@cmacknz
Member

cmacknz commented May 11, 2022

I think that disabling it by default in Metricbeat would be a better option, but it can also be a breaking change if someone is using it.

Agreed, I suspect this field is only used during module development and possibly in SDHs where we can just request it be enabled.

Does anyone from @elastic/obs-cloud-monitoring or @elastic/obs-cloudnative-monitoring have any thoughts on this field?

@kaiyan-sheng
Contributor

Just want to make sure we are only talking about the event.duration field in Metricbeat, right? I don't think we are using it in metrics collection. But we are definitely leveraging this field in Filebeat and log data streams.

@nik9000
Member

nik9000 commented May 11, 2022

Not sure if reducing precision is a realistic option: this is defined as nanoseconds in ECS, and there can be uses of this field with that precision. Also, there are cases where an event can take sub-millisecond times, such as when collecting system or runtime metrics, or in performance monitoring.

From a disk usage standpoint you can send in the number in nanoseconds if you want and round it to microseconds. It'll just have a lot of trailing 0s which we will optimize away in storage. At least when synthetic source is available. But, yeah, not saving it at all is wonderful if we can get away with it.
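As a rough sketch of that idea: the value stays in nanoseconds, as ECS defines it, but is rounded to the nearest microsecond so every stored value ends in three zeros. The helper name and sample durations are hypothetical. The greatest-common-divisor step only illustrates why trailing zeros are cheap to encode; that the storage layer factors out a common divisor in this way is my assumption, not something confirmed in this thread.

```python
from functools import reduce
from math import gcd

NS_PER_US = 1_000


def round_to_resolution(duration_ns: int, resolution_ns: int = NS_PER_US) -> int:
    """Round to the nearest multiple of resolution_ns; the result is still in ns."""
    return ((duration_ns + resolution_ns // 2) // resolution_ns) * resolution_ns


# Made-up fetch durations in nanoseconds.
samples = [1_234_567, 987_654_321, 42_000_499]
rounded = [round_to_resolution(d) for d in samples]

# After rounding, every value shares the factor 1000, so a codec that
# divides by the common factor only has to store the smaller quotients.
common = reduce(gcd, rounded)
print(rounded, common)
```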

@jsoriano
Member Author

Just want to make sure we are only talking about the event.duration field in Metricbeat, right? I don't think we are using it in metrics collection. But we are definitely leveraging this field in Filebeat and log data streams.

@kaiyan-sheng yes, the main concern is the field added automatically by Metricbeat. In the cases when the field is explicitly collected and used by an integration I think that this is fine.

@botelastic

botelastic bot commented May 11, 2023

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:.
Thank you for your contribution!

@botelastic botelastic bot added the Stalled label May 11, 2023
@ruflin
Member

ruflin commented May 12, 2023

Commenting as we should not drop this issue.

@botelastic botelastic bot removed the Stalled label May 12, 2023
@botelastic

botelastic bot commented May 11, 2024

Hi!
We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1:.
Thank you for your contribution!

@botelastic botelastic bot added the Stalled label May 11, 2024
@ruflin
Member

ruflin commented May 14, 2024

👍

@botelastic botelastic bot removed the Stalled label May 14, 2024
@cmacknz
Member

cmacknz commented May 14, 2024

This field is automatically added by Metricbeat, by default, with the duration of the fetch operation.

This seems like something we could just start logging a summary of. In theory this should match the period in the configuration, I imagine the intent is to spot cases where we are slightly out of sync.
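A minimal sketch of what "logging a summary" could look like: an in-memory accumulator that collects fetch durations and periodically emits one log line instead of attaching a field to every document. The class name, logger name and log format are all hypothetical and are not taken from Metricbeat.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("fetch_summary")  # hypothetical logger name


@dataclass
class FetchDurationSummary:
    count: int = 0
    total_ns: int = 0
    min_ns: int = 0
    max_ns: int = 0

    def observe(self, duration_ns: int) -> None:
        """Record the duration of one fetch operation."""
        if self.count == 0 or duration_ns < self.min_ns:
            self.min_ns = duration_ns
        if duration_ns > self.max_ns:
            self.max_ns = duration_ns
        self.count += 1
        self.total_ns += duration_ns

    def flush(self) -> str:
        """Log and return a one-line summary, then reset the counters."""
        mean_ns = self.total_ns // self.count if self.count else 0
        line = (f"fetches={self.count} min_ns={self.min_ns} "
                f"mean_ns={mean_ns} max_ns={self.max_ns}")
        logger.info(line)
        self.count = self.total_ns = self.min_ns = self.max_ns = 0
        return line


summary = FetchDurationSummary()
for d in (1_200_000, 900_000, 3_000_000):  # made-up durations in ns
    summary.observe(d)
line = summary.flush()
```

The trade-off is that the summary lives in the logs rather than alongside each event, so per-document correlation is lost.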

While analyzing disk space used by data collected by Metricbeat and stored in indexes with TSDB and synthetic _source enabled (elastic/elasticsearch#85649), @nik9000 found that event.duration takes up to 16.7% of the disk space

FYI @strawgate this is another place we could be reducing ingest volume.

@strawgate
Contributor

strawgate commented May 14, 2024

Great, thanks Craig!

Perhaps we can round the field to a desired resolution (maybe 0.1s?) and have a setting to enable nanosecond precision?

I imagine if we keep it as nanosecond but all the durations are rounded we reduce cardinality by a lot
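To make the idea concrete, here is a small sketch with a hypothetical `full_precision` switch: by default durations are rounded to 0.1 s while staying expressed in nanoseconds, and the switch restores the raw value. The function name, the parameter names and the synthetic durations are assumptions for illustration only, not an existing Metricbeat setting.

```python
import random

NS_PER_SECOND = 1_000_000_000


def apply_precision(duration_ns: int,
                    resolution_ns: int = NS_PER_SECOND // 10,
                    full_precision: bool = False) -> int:
    """Round to the nearest multiple of resolution_ns unless full precision is requested."""
    if full_precision:
        return duration_ns
    return ((duration_ns + resolution_ns // 2) // resolution_ns) * resolution_ns


# Synthetic durations between 1 ms and 2 s, seeded for reproducibility.
rng = random.Random(0)
raw = [rng.randint(1_000_000, 2 * NS_PER_SECOND) for _ in range(10_000)]
rounded = [apply_precision(d) for d in raw]

print(len(set(raw)), "distinct raw values")
print(len(set(rounded)), "distinct values at 0.1 s resolution")
```

With durations capped at 2 s, rounding to 0.1 s leaves at most 21 distinct values (0 through 2.0 s), however many samples are collected.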

@cmacknz
Member

cmacknz commented May 14, 2024

There is a lot more context starting from elastic/integrations#4894 (comment) on how this gets used.

Edit: this is referencing event.ingested not event.duration

@strawgate
Contributor

Most of the text describing the need is targeting event.ingested, which we aren't talking about changing in this ticket. I didn't really see any info on how event.duration is used, just a bunch of ideas for how to reduce its storage requirement.

@cmacknz
Member

cmacknz commented May 14, 2024

Whoops, I misread that entire issue as applying to event.duration. You are right, what I linked is not relevant to event.duration.

@strawgate
Contributor

Reducing to millisecond would keep it useful while reducing cardinality by up to 1,000,000x (a millisecond is 10^6 nanoseconds). I can test what the savings are if that would be useful.

@nimarezainia
Contributor

nimarezainia commented May 14, 2024

As a plan moving forward:

Short-term: since the precision can't be adjusted, implement what @nik9000 suggests HERE. Document/benchmark the disk usage savings.

Medium/long-term: since this is a breaking change, I suggest adding it to the list for 9.0 and making it a configurable option then. We just don't have enough information on how users may be utilizing this field today.

Does this work?

@strawgate
Contributor

I think we should reduce precision as much as we can get away with, as each decimal digit we drop is a 10x reduction in cardinality. I don't know that going from nano to micro will be a big enough saving, but if someone can benchmark, maybe we can find the "sweet spot".
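A back-of-the-envelope sketch of that 10x-per-digit argument, assuming (purely for illustration) that no fetch takes longer than 10 s. It only computes upper bounds on the number of distinct values at each resolution. Real savings depend on the actual distribution and on how the values are encoded, so this does not replace a benchmark.

```python
MAX_DURATION_NS = 10 * 1_000_000_000  # assumption: fetches never exceed 10 s

RESOLUTIONS_NS = {
    "nanosecond": 1,
    "microsecond": 1_000,
    "millisecond": 1_000_000,
    "0.1 second": 100_000_000,
}


def max_distinct_values(max_duration_ns: int, resolution_ns: int) -> int:
    """Upper bound on distinct values in [0, max_duration_ns] at a resolution."""
    return max_duration_ns // resolution_ns + 1


bounds = {
    name: max_distinct_values(MAX_DURATION_NS, res)
    for name, res in RESOLUTIONS_NS.items()
}
for name, bound in bounds.items():
    print(f"{name:>12}: at most {bound:,} distinct values")
```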

@felixbarny
Member

Looks like event.duration is already in second precision since 8.0/7.15: elastic/kibana#104044

I wonder if the tests that suggested that event.duration takes up to 16.7% of disk space used an old version of the pipeline that didn't do that truncation.

@martijnvg do you have recent numbers of a storage breakdown by field so that we can see if event.ingested is still an issue from a storage perspective?

Removing event.ingested is problematic as transforms, such as the ones used for the SLO feature rely on it. See also elastic/integrations#4894 (comment)

@martijnvg
Member

@martijnvg do you have recent numbers of a storage breakdown by field so that we can see if event.ingested is still an issue from a storage perspective?

I believe last time we gathered the disk usage from Rally benchmarks. If the tracks haven't been updated, then we won't see any improvements either.
