
Re-introduce event.ingested in observability project types? #11491

Open
felixbarny opened this issue Oct 23, 2024 · 12 comments
Labels
Team:Fleet Label for the Fleet team [elastic/fleet]

Comments

@felixbarny
Member

felixbarny commented Oct 23, 2024

Removing the agent verification pipeline in elastic/kibana#157400 had the side-effect of also removing the event.ingested enrichment for all data.

This is problematic for use cases like SLOs (elastic/kibana#196548) that use transforms which rely on event.ingested. There's more detail in this and the following comments: #4894 (comment)

Specifically, @dominiqueclarke raised an issue after finding that SLOs for synthetics don't work on serverless.

There are several options for how to proceed here:

  1. Specific integrations can opt in to having event.ingested
    The issue is that features like SLOs can be applied to any integration, so all integrations potentially need event.ingested
  2. Revert this change so that event.ingested and agent verification are executed on observability project types on serverless
    I'd argue that agent verification is not required by default in observability project types, so we could save execution time for every integration by not executing it. @andrewkroh do you happen to have performance numbers on the overhead of agent verification? If it's negligible, maybe it's not a big deal to execute it for observability projects.
  3. Split agent verification from adding event.ingested and only disable agent verification in observability project types.

In #4894 and elastic/beats#31574 it was discussed that event.ingested takes up significant disk space for metric datasets. But it's not clear whether that was measured with a version of the final pipeline that truncates the timestamp to seconds (implemented in elastic/kibana#104044, with various performance improvements afterwards).
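For reference, the truncation technique looks roughly like the sketch below: a set processor records the ingest timestamp and a script drops the sub-second precision so the field takes less space on disk. This is only an illustration; the pipeline id illustrative-event-ingested is made up, and the actual processors and script in .fleet_final_pipeline-1 may differ.

PUT _ingest/pipeline/illustrative-event-ingested
{
  "description": "Sketch only: add event.ingested truncated to seconds",
  "processors": [
    {
      "set": {
        "description": "Record the ingest timestamp",
        "field": "event.ingested",
        "value": "{{{_ingest.timestamp}}}"
      }
    },
    {
      "script": {
        "description": "Drop sub-second precision to reduce storage overhead",
        "source": "ctx.event.ingested = ZonedDateTime.parse(ctx.event.ingested).truncatedTo(ChronoUnit.SECONDS).toString();"
      }
    }
  ]
}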

@felixbarny felixbarny added the Team:Fleet Label for the Fleet team [elastic/fleet] label Oct 23, 2024
@elasticmachine

Pinging @elastic/fleet (Team:Fleet)

@cmacknz
Member

cmacknz commented Oct 23, 2024

My vote would be for option 2 so that this behavior is the same everywhere. I'm not aware of any specific performance measurements done to quantify the overhead of the agent ID verification, but I do know we've never had a performance problem where it turned out to be the primary bottleneck or root cause.

@andrewkroh
Member

do you happen to have performance numbers on the overhead of agent verification?

From node stats on an 8.15 cluster, the pipeline metrics show the two processors associated with the agent id status as taking 5.85 µs + 1.60 µs = 7.45 µs per event.

[
  {
    "set_security_user": {
      "type": "set_security_user",
      "stats": {
        "count": 25701227,
        "time_in_millis": 150665,
        "current": 0,
        "failed": 0
      }
    }
  },
  {
    "script:agent-id-status": {
      "type": "script",
      "stats": {
        "count": 25701227,
        "time_in_millis": 41026,
        "current": 0,
        "failed": 0
      }
    }
  }
]
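For context, the per-event figures above are simply each processor's time_in_millis divided by its count. Counters like these can be pulled from the ingest section of the node stats API, for example with a request along these lines (the filter_path shown is just one way to narrow the response):

GET _nodes/stats/ingest?filter_path=nodes.*.ingest.pipelines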

@felixbarny
Member Author

Given that there's not much time left, and that the overhead doesn't seem high, I tend to agree that we should just revert the change. I'd still like to collect some numbers about the storage impact of the truncated event.ingested field. Also, while 7.45 µs looks like a small number, it's hard to put it into context in terms of its impact on ingestion throughput compared to not having a pipeline at all for metric integrations. I suspect that just the fact that a pipeline is active already adds quite a bit of overhead. @andrewkroh could you pull out the impact of the processor that adds event.ingested, for comparison?

@andrewkroh
Member

andrewkroh commented Oct 25, 2024

could you pull out the impact of the processor that adds event.ingested, for comparison?

3.150177913 µs avg over 7808852151 events

.fleet_final_pipeline-1 / truncate-subseconds-event-ingested stats (per node):

[
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 25701227, "time_in_millis": 32899, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 32806525, "time_in_millis": 87265, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 808624149, "time_in_millis": 1477691, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 694505750, "time_in_millis": 2879012, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 32693250, "time_in_millis": 85303, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 809877488, "time_in_millis": 1688403, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 32726856, "time_in_millis": 87430, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 808031093, "time_in_millis": 3449752, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 25753187, "time_in_millis": 64662, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 32312238, "time_in_millis": 92675, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 25917033, "time_in_millis": 72797, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 808164371, "time_in_millis": 3891579, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 695323629, "time_in_millis": 2704611, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 694803937, "time_in_millis": 2785299, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 696155098, "time_in_millis": 2910190, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 807663351, "time_in_millis": 3543482, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 25877692, "time_in_millis": 76331, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 32372803, "time_in_millis": 84089, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 693508343, "time_in_millis": 2663100, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 26034131, "time_in_millis": 67672, "current": 0, "failed": 0 } } }
]

@andrewkroh
Member

andrewkroh commented Oct 25, 2024

I also used a larger sample set for the other two processors (looking at all the nodes instead of just one in node stats).

13.56069866 µs set_security_user

2.956982604 µs script:agent-id-status

@felixbarny
Member Author

I've tried to test the difference between truncating and not truncating event.ingested by adding a branch to rally-tracks that adds a truncation pipeline: https://github.com/elastic/rally-tracks/tree/tsdb-truncate-event-ingested.

I've executed a baseline and contender experiment with esbench but it seems I was doing something wrong because I couldn't see a difference in the disk usage of the event.ingested field. In both scenarios, the field took 3.69% of storage for the k8s container data stream. This is exactly the same number as the @timestamp field uses.
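For anyone trying to reproduce this, one way to get a per-field storage breakdown like the 3.69% figure is the analyze index disk usage API; a sketch, where the target name is just a placeholder for the benchmark's k8s container data stream:

POST metrics-kubernetes.container-default/_disk_usage?run_expensive_tasks=true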

I'm on PTO for the rest of the week so I won't be able to provide better insights this week and I might need some help from @martijnvg.

I've also talked to @axw about this and he expressed his preference for option 3, which is to split event.ingested from the agent verification and only enable the former in observability project types. @cmacknz would that be feasible?

@felixbarny
Member Author

It seems the pipeline isn't executed at all and that a different mechanism already truncates event.ingested to seconds (the field has µs precision in the test corpus). What I don't understand is why event.ingested has the same storage impact as @timestamp, even though the former has a resolution of seconds and the latter a resolution of milliseconds. Still, the storage impact seems to have been reduced a lot since event.ingested was last found to have a storage impact of 16%.

@cmacknz
Member

cmacknz commented Oct 28, 2024

I've also talked to @axw about this and he expressed his preference for option 3, which is to split event.ingested from the agent verification and only enable the former in observability project types. @cmacknz would that be feasible?

@kpollich's team owns this so I'll defer to him, I assume it's possible but more a question of when it could be done.

The agent.id verification, even if not a performance limitation, will generate warnings for agent API keys that were not generated by Fleet (and thus don't have the agent.id in their metadata), which will be the case for many (most?) agents used in Observability projects.

@felixbarny
Member Author

I think it makes sense to separate the concerns of agent validation and adding event.ingested. At the moment, there's just a single flag that disables both, while the name implies it only toggles agent verification (xpack.fleet.agentIdVerificationEnabled). I'd like to request having another flag that toggles whether or not event.ingested is added to fleet-managed data streams.

Whether we also enable agent verification, or only event.ingested, for observability project types is a somewhat separate discussion. But I don't see why we would need to execute the agent verification pipeline for observability project types by default. It should probably be at least opt-in.

@kpollich what would be the timeline for providing the separate flag?

@felixbarny
Member Author

In both scenarios, the field took 3.69% of storage for the k8s container data stream. This is exactly the same number as the @timestamp field uses.

@martijnvg do you have an explanation why event.ingested had the same storage impact as @timestamp, even though the granularity of the former is seconds and the latter is ms?

@kpollich
Member

kpollich commented Nov 6, 2024

@kpollich what would be the timeline for providing the separate flag?

This would be a relatively quick turnaround, probably just a few days of work.

I think to ease the migration story here, we need to keep the default behavior of xpack.fleet.agentIdVerificationEnabled the same as it is today. Otherwise we would have a breaking change where users relying on this flag to disable both ID verification and event.ingested would need to provide a second flag. So if we add xpack.fleet.eventIngestedEnabled as a separate flag, it would only control whether event.ingested is generated. Then, the combination of these two flags would produce the desired result for o11y here:

xpack.fleet.agentIdVerificationEnabled: false
xpack.fleet.eventIngestedEnabled: true
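Once a separate flag exists, one way to sanity-check the resulting behavior would be to simulate the final pipeline against a sample document and confirm that event.ingested is set while no agent id status or warnings are produced. A sketch (the sample document is made up):

POST _ingest/pipeline/.fleet_final_pipeline-1/_simulate
{
  "docs": [
    {
      "_source": {
        "@timestamp": "2024-11-06T00:00:00.000Z",
        "message": "sample event"
      }
    }
  ]
}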
