
Re-introduce event.ingested in observability project types? #11491

Open
felixbarny opened this issue Oct 23, 2024 · 12 comments
Labels
Team:Fleet Label for the Fleet team [elastic/fleet]

Comments

@felixbarny
Member

felixbarny commented Oct 23, 2024

Removing the agent verification pipeline in elastic/kibana#157400 had the side-effect of also removing the event.ingested enrichment for all data.

This is problematic for use cases like SLOs (elastic/kibana#196548) that use transforms which rely on event.ingested. There's more detail in this and the following comments: #4894 (comment)

Specifically, @dominiqueclarke raised an issue after finding that SLOs for synthetics don't work on serverless.

There are several options for how to proceed here:

  1. Specific integrations can opt in to having event.ingested
    The issue is that features like SLOs can be applied to any integration, so all integrations potentially need event.ingested
  2. Revert this change so that event.ingested and agent verification are executed on observability project types on serverless
    I'd argue that agent verification is not required by default in observability project types, so we could save execution time for every integration by not executing it. @andrewkroh do you happen to have performance numbers on the overhead of agent verification? If it's negligible, maybe it's not a big deal to execute it for observability projects.
  3. Split agent verification from adding event.ingested and only disable agent verification in observability project types.

In #4894 and elastic/beats#31574 it was discussed that event.ingested takes up significant disk space for metric datasets. But it's not clear whether that was measured with a version of the final pipeline that truncates the timestamp to seconds (implemented in elastic/kibana#104044, with various performance improvements afterwards).
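For reference, the truncation technique looks roughly like the sketch below: a set processor records the ingest timestamp and a script drops the sub-second precision so the field takes less space on disk. This is only an illustration; the pipeline id illustrative-event-ingested is made up, and the actual processors and script in .fleet_final_pipeline-1 may differ.

PUT _ingest/pipeline/illustrative-event-ingested
{
  "description": "Sketch only: add event.ingested truncated to seconds",
  "processors": [
    {
      "set": {
        "description": "Record the ingest timestamp",
        "field": "event.ingested",
        "value": "{{{_ingest.timestamp}}}"
      }
    },
    {
      "script": {
        "description": "Drop sub-second precision to reduce storage overhead",
        "source": "ctx.event.ingested = ZonedDateTime.parse(ctx.event.ingested).truncatedTo(ChronoUnit.SECONDS).toString();"
      }
    }
  ]
}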

@felixbarny felixbarny added the Team:Fleet Label for the Fleet team [elastic/fleet] label Oct 23, 2024
@elasticmachine

Pinging @elastic/fleet (Team:Fleet)

@cmacknz
Member

cmacknz commented Oct 23, 2024

My vote would be for option 2 so that this behavior is the same everywhere. I'm not aware of any specific performance measurements done to quantify the overhead of the agent ID verification, but I do know we've never had a performance problem where it turned out to be the primary bottleneck or root cause.

@andrewkroh
Member

do you happen to have performance numbers on the overhead of agent verification?

From node stats on an 8.15 cluster, the pipeline metrics show the two processors associated with the agent id status as taking 5.85 µs + 1.60 µs = 7.45 µs per event.

[
  {
    "set_security_user": {
      "type": "set_security_user",
      "stats": {
        "count": 25701227,
        "time_in_millis": 150665,
        "current": 0,
        "failed": 0
      }
    }
  },
  {
    "script:agent-id-status": {
      "type": "script",
      "stats": {
        "count": 25701227,
        "time_in_millis": 41026,
        "current": 0,
        "failed": 0
      }
    }
  }
]
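For context, the per-event figures above are simply each processor's time_in_millis divided by its count. Counters like these can be pulled from the ingest section of the node stats API, for example with a request along these lines (the filter_path shown is just one way to narrow the response):

GET _nodes/stats/ingest?filter_path=nodes.*.ingest.pipelines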

@felixbarny
Member Author

Given that there's not much time left, and that the overhead doesn't seem high, I tend to agree that we should just revert the change. I'd still like to collect some numbers about the storage impact of the truncated event.ingested field. Also, while 7.45 µs looks like a small number, it's hard to put it into context in terms of its impact on ingestion throughput compared to not having a pipeline at all for metric integrations. I suspect that just the fact that a pipeline is active already adds quite a bit of overhead. @andrewkroh could you pull out the impact of the processor that adds event.ingested, for comparison?

@andrewkroh
Member

andrewkroh commented Oct 25, 2024

could you pull out the impact of the processor that adds event.ingested, for comparison?

3.150177913 µs avg over 7808852151 events

.fleet_final_pipeline-1 / truncate-subseconds-event-ingested stats (per node):

[
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 25701227, "time_in_millis": 32899, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 32806525, "time_in_millis": 87265, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 808624149, "time_in_millis": 1477691, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 694505750, "time_in_millis": 2879012, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 32693250, "time_in_millis": 85303, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 809877488, "time_in_millis": 1688403, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 32726856, "time_in_millis": 87430, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 808031093, "time_in_millis": 3449752, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 25753187, "time_in_millis": 64662, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 32312238, "time_in_millis": 92675, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 25917033, "time_in_millis": 72797, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 808164371, "time_in_millis": 3891579, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 695323629, "time_in_millis": 2704611, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 694803937, "time_in_millis": 2785299, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 696155098, "time_in_millis": 2910190, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 807663351, "time_in_millis": 3543482, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 25877692, "time_in_millis": 76331, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 32372803, "time_in_millis": 84089, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 693508343, "time_in_millis": 2663100, "current": 0, "failed": 0 } } },
  { "script:truncate-subseconds-event-ingested": { "type": "script", "stats": { "count": 26034131, "time_in_millis": 67672, "current": 0, "failed": 0 } } }
]

@andrewkroh
Member

andrewkroh commented Oct 25, 2024

I also used a larger sample set for the other two processors (looking at all the nodes instead of just one in node stats).

13.56069866 µs set_security_user

2.956982604 µs script:agent-id-status

@felixbarny
Member Author

I've tried to test the difference between truncating and not truncating event.ingested by adding a branch to rally-tracks that adds a truncation pipeline: https://github.com/elastic/rally-tracks/tree/tsdb-truncate-event-ingested.

I've executed a baseline and contender experiment with esbench but it seems I was doing something wrong because I couldn't see a difference in the disk usage of the event.ingested field. In both scenarios, the field took 3.69% of storage for the k8s container data stream. This is exactly the same number as the @timestamp field uses.
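For anyone trying to reproduce this, one way to get a per-field storage breakdown like the 3.69% figure is the analyze index disk usage API; a sketch, where the target name is just a placeholder for the benchmark's k8s container data stream:

POST metrics-kubernetes.container-default/_disk_usage?run_expensive_tasks=true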

I'm on PTO for the rest of the week so I won't be able to provide better insights this week and I might need some help from @martijnvg.

I've also talked to @axw about this and he expressed his preference for option 3, which is to split event.ingested from the agent verification and only enable the former in observability project types. @cmacknz would that be feasible?

@felixbarny
Member Author

It seems the pipeline isn't executed at all and that a different mechanism already truncates event.ingested to seconds (the field has µs precision in the test corpus). What I don't understand is why event.ingested has the same storage impact as @timestamp, even though the former has a resolution of seconds and the latter a resolution of milliseconds. Still, the storage impact seems to have been reduced a lot since event.ingested was last found to have a storage impact of 16%.

@cmacknz
Member

cmacknz commented Oct 28, 2024

I've also talked to @axw about this and he expressed his preference for option 3, which is to split event.ingested from the agent verification and only enable the former in observability project types. @cmacknz would that be feasible?

@kpollich's team owns this so I'll defer to him, I assume it's possible but more a question of when it could be done.

The agent.id verification, even if not a performance limitation, will generate warnings for agent API keys that were not generated by Fleet (and thus don't have the agent.id in their metadata), which will be the case for many (most?) agents used in Observability projects.

@felixbarny
Member Author

I think it makes sense to separate the concerns of agent validation and adding event.ingested. At the moment, there's just a single flag that disables both, while the name implies it only toggles agent verification (xpack.fleet.agentIdVerificationEnabled). I'd like to request having another flag that toggles whether or not event.ingested is added to fleet-managed data streams.

Whether we also enable agent verification, or only event.ingested, for observability project types is a somewhat separate discussion. But I don't see why we would need to execute the agent verification pipeline for observability project types by default. It should probably be at least opt-in.

@kpollich what would be the timeline for providing the separate flag?

@felixbarny
Member Author

In both scenarios, the field took 3.69% of storage for the k8s container data stream. This is exactly the same number as the @timestamp field uses.

@martijnvg do you have an explanation why event.ingested had the same storage impact as @timestamp, even though the granularity of the former is seconds and the latter is ms?

@kpollich
Member

kpollich commented Nov 6, 2024

@kpollich what would be the timeline for providing the separate flag?

This would be a relatively quick turnaround, probably just a few days of work.

I think to ease the migration story here, we need to keep the default behavior of xpack.fleet.agentIdVerificationEnabled the same as it is today. Otherwise we would have a breaking change where users relying on this flag to disable both ID verification and event.ingested would need to provide a second flag. So if we add xpack.fleet.eventIngestedEnabled as a separate flag, it would only control whether event.ingested is generated. Then, the combination of these two flags would produce the desired result for o11y here:

xpack.fleet.agentIdVerificationEnabled: false
xpack.fleet.eventIngestedEnabled: true
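Once a separate flag exists, one way to sanity-check the resulting behavior would be to simulate the final pipeline against a sample document and confirm that event.ingested is set while no agent id status or warnings are produced. A sketch (the sample document is made up):

POST _ingest/pipeline/.fleet_final_pipeline-1/_simulate
{
  "docs": [
    {
      "_source": {
        "@timestamp": "2024-11-06T00:00:00.000Z",
        "message": "sample event"
      }
    }
  ]
}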
