-
Notifications
You must be signed in to change notification settings - Fork 435
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Re-introduce event.ingested
in observability project types?
#11491
Comments
Pinging @elastic/fleet (Team:Fleet) |
My vote would be for 2 so that this is the same everywhere. I'm not aware of any specific performance measurements done to qualify the overhead of the agent ID verification, but I do know we've never had a performance problem where it turned out to be the primary bottleneck or root cause. |
From node stats on an 8.15 cluster, the pipeline metrics show the two processors associated with the agent id status as taking [
{
"set_security_user": {
"type": "set_security_user",
"stats": {
"count": 25701227,
"time_in_millis": 150665,
"current": 0,
"failed": 0
}
}
},
{
"script:agent-id-status": {
"type": "script",
"stats": {
"count": 25701227,
"time_in_millis": 41026,
"current": 0,
"failed": 0
}
}
}
] |
Given that there's not much time left, and that the overhead doesn't seem high, I tend to agree to agree that we should just revert the change. I'd still like to collect some numbers about the storage impact of the truncated |
3.150177913 µs avg over 7808852151 events .fleet_final_pipeline-1 / truncate-subseconds-event-ingested stats[
{
"script:truncate-subseconds-event-ingested": {
"type": "script",
"stats": {
"count": 25701227,
"time_in_millis": 32899,
"current": 0,
"failed": 0
}
}
},
{
"script:truncate-subseconds-event-ingested": {
"type": "script",
"stats": {
"count": 32806525,
"time_in_millis": 87265,
"current": 0,
"failed": 0
}
}
},
{
"script:truncate-subseconds-event-ingested": {
"type": "script",
"stats": {
"count": 808624149,
"time_in_millis": 1477691,
"current": 0,
"failed": 0
}
}
},
{
"script:truncate-subseconds-event-ingested": {
"type": "script",
"stats": {
"count": 694505750,
"time_in_millis": 2879012,
"current": 0,
"failed": 0
}
}
},
{
"script:truncate-subseconds-event-ingested": {
"type": "script",
"stats": {
"count": 32693250,
"time_in_millis": 85303,
"current": 0,
"failed": 0
}
}
},
{
"script:truncate-subseconds-event-ingested": {
"type": "script",
"stats": {
"count": 809877488,
"time_in_millis": 1688403,
"current": 0,
"failed": 0
}
}
},
{
"script:truncate-subseconds-event-ingested": {
"type": "script",
"stats": {
"count": 32726856,
"time_in_millis": 87430,
"current": 0,
"failed": 0
}
}
},
{
"script:truncate-subseconds-event-ingested": {
"type": "script",
"stats": {
"count": 808031093,
"time_in_millis": 3449752,
"current": 0,
"failed": 0
}
}
},
{
"script:truncate-subseconds-event-ingested": {
"type": "script",
"stats": {
"count": 25753187,
"time_in_millis": 64662,
"current": 0,
"failed": 0
}
}
},
{
"script:truncate-subseconds-event-ingested": {
"type": "script",
"stats": {
"count": 32312238,
"time_in_millis": 92675,
"current": 0,
"failed": 0
}
}
},
{
"script:truncate-subseconds-event-ingested": {
"type": "script",
"stats": {
"count": 25917033,
"time_in_millis": 72797,
"current": 0,
"failed": 0
}
}
},
{
"script:truncate-subseconds-event-ingested": {
"type": "script",
"stats": {
"count": 808164371,
"time_in_millis": 3891579,
"current": 0,
"failed": 0
}
}
},
{
"script:truncate-subseconds-event-ingested": {
"type": "script",
"stats": {
"count": 695323629,
"time_in_millis": 2704611,
"current": 0,
"failed": 0
}
}
},
{
"script:truncate-subseconds-event-ingested": {
"type": "script",
"stats": {
"count": 694803937,
"time_in_millis": 2785299,
"current": 0,
"failed": 0
}
}
},
{
"script:truncate-subseconds-event-ingested": {
"type": "script",
"stats": {
"count": 696155098,
"time_in_millis": 2910190,
"current": 0,
"failed": 0
}
}
},
{
"script:truncate-subseconds-event-ingested": {
"type": "script",
"stats": {
"count": 807663351,
"time_in_millis": 3543482,
"current": 0,
"failed": 0
}
}
},
{
"script:truncate-subseconds-event-ingested": {
"type": "script",
"stats": {
"count": 25877692,
"time_in_millis": 76331,
"current": 0,
"failed": 0
}
}
},
{
"script:truncate-subseconds-event-ingested": {
"type": "script",
"stats": {
"count": 32372803,
"time_in_millis": 84089,
"current": 0,
"failed": 0
}
}
},
{
"script:truncate-subseconds-event-ingested": {
"type": "script",
"stats": {
"count": 693508343,
"time_in_millis": 2663100,
"current": 0,
"failed": 0
}
}
},
{
"script:truncate-subseconds-event-ingested": {
"type": "script",
"stats": {
"count": 26034131,
"time_in_millis": 67672,
"current": 0,
"failed": 0
}
}
}
]
|
I also used a larger sample set for the other two processors (looking at all the nodes instead of just one in node stats). 13.56069866 µs set_security_user 2.956982604 µs script:agent-id-status |
I've tried to test the difference between truncating and not truncating I've executed a baseline and contender experiment with esbench but it seems I was doing something wrong because I couldn't see a difference in the disk usage of the I'm on PTO for the rest of the week so I won't be able to provide better insights this week and I might need some help from @martijnvg. I've also talked to @axw about this and he expressed his preference for option 3, which is to split event.ingested from the agent verification and only enable the former in observability project types. @cmacknz would that be feasible? |
Seems like the pipeline isn't executed at all and that there's a different mechanism that already truncates |
@kpollich's team owns this so I'll defer to him, I assume it's possible but more a question of when it could be done. The agent.id verification, even if not a performance limitation, will generate a warnings for agent API keys that were not generated by Fleet (and thus don't have the agent.id in their metadata) which will be the case for many (most?) agents used in Observability projects. |
I think it makes sense to separate the concerns of agent validation and adding Whether or not we enable also agent verification or only @kpollich what would be the timeline for providing the separate flag? |
@martijnvg do you have an explanation why |
This would be a relatively quick turnaround, probably just a few days of work. I think to ease the migration story here, we need to keep the default behavior related to xpack.fleet.agentIdVerificationEnabled: false
xpack.fleet.eventIngestedEnabled: true |
Removing the agent verification pipeline in elastic/kibana#157400 had the side-effect of also removing the
event.ingested
enrichment for all data.This is problematic for use cases like SLOs (elastic/kibana#196548) that use transforms which rely on
event.ingested
. There's more detail in this and the following comments: #4894 (comment)Specifically, an issue was raised by @dominiqueclarke who found that SLOs for synthetics don't work on serverless.
There are several options how to proceed here:
event.ingested
The issue is that features like SLOs can be applied to any integrations. So all integrations potentially need
event.ingested
event.ingested
and agent verification is executed on observability project types on serverlessI'd argue that agent verification is not required by default in observability project types. So we could save execution time for every integration by not executing it. @andrewkroh do you happen to have performance numbers on the overhead of agent verification? If it's negligible, maybe it's not a big deal to execute it for observability projects.
event.ingested
and only disable agent verification in observability project types.In #4894 and elastic/beats#31574 it was discussed that
event.ingested
takes up significant disk space for metric datasets. But it's not clear if that was measured with a version of the final pipeline that truncates the timestamp to seconds (implemented in elastic/kibana#104044, with various performance improvements afterwards).The text was updated successfully, but these errors were encountered: