Allow disabling and configuring the add_x_metadata default global processors #4670

Open
gizas opened this issue Oct 16, 2023 · 35 comments
Labels: Team:Elastic-Agent, Team:Elastic-Agent-Control-Plane


@gizas
Contributor

gizas commented Oct 16, 2023

Edited by @cmacknz to generalize to all add_x_metadata processors and include the suggested approach from #4670 (comment)

Describe the enhancement:

The default set of processors for each Beat, when they are started by Agent, is defined in the code and is not configurable. Here is the definition for Metricbeat; the same default configuration applies to the rest of the Beats.

For example, the add_kubernetes_metadata processor is used to enrich Kubernetes events with extra metadata fields. This results in the number of k8s API calls being proportional to the number of inputs in the agent policy. The same functionality is also provided by enrichers, which also start by default in the Beats code, as described here (see enrichers).

This enhancement will allow disabling or configuring any of the default Beat global processors. This will be done by defining a new type of provider, an event_metadata provider, which controls the behavior of the event metadata processors added by each Beat.

We will additionally expose the entire contents of the providers section of the agent policy in Fleet to allow configuring this. Here is a sample of the intended configuration:

providers:
     event_metadata:
         # The default value would be true to avoid a breaking change when they aren't specified.
         host: # controls add_host_metadata
             enabled: true
         kubernetes: # controls add_kubernetes_metadata
             enabled: false
         cloud: # controls add_cloud_metadata
             enabled: true
         docker: # controls add_docker_metadata 
             enabled: false

Describe a specific use case for the enhancement or feature:

All elastic-agent Kubernetes deployments get these processors enabled by default in the background, which can be "expensive" (in terms of extra memory and the number of API calls the agent has to make), especially in big Kubernetes deployments.

Related Issues:

Definition of Done:

  • Allow disabling the add_kubernetes_metadata processor for all Beats run by Elastic Agent.
  • Provide tests that showcase the same functionality and enrichment with/without the processor enabled
  • Demonstrate the event metadata processors can be disabled from Fleet.
  • Document the new functionality
@gizas gizas added the Team:Cloudnative-Monitoring label Oct 16, 2023
@botelastic botelastic bot removed the needs_team label Oct 16, 2023
@joshdover
Contributor

I'm still unfamiliar with how enrichers work, so bear with me here. As far as I understand, they only apply to the metricsets that use them directly and are not configurable. I think we still need an option for users to be able to easily add the same metadata as before with any input type.

The problem we're trying to solve here is especially problematic on Elastic Agent, where we start a new Beat instance for each unique input type. Each of these Beats will run its own instance of add_kubernetes_metadata in addition to the Agent's built-in kubernetes provider. Each of these instances makes essentially the same K8s API requests, which puts a lot of load on the K8s API. This problem scales horizontally both with the number of K8s nodes in the cluster and with the number of integrations added to the policy.

I was thinking that for Agent users we could provide the exact same fields the processor gives, but instead through the Kubernetes provider, by updating integrations and standalone config examples to populate add_fields processors. An example would look like this:

inputs:
- id: container-log-${kubernetes.pod.name}-${kubernetes.container.id}
  type: filestream
  streams:
    - id: container-log-${kubernetes.pod.name}-${kubernetes.container.id}
      data_stream:
        dataset: kubernetes.container_logs
        type: logs
      prospector.scanner.symlinks: true
      parsers:
        - container: ~
      paths:
        - /var/log/containers/*${kubernetes.container.id}.log

      # 👇 Here is what is new
      processors:
        - add_fields:
            target: kubernetes
            fields:
              pod.name: ${kubernetes.pod.name}
              pod.id: ${kubernetes.pod.id}
              namespace: ${kubernetes.namespace}
              labels: ${kubernetes.labels}

Ideally this is actually something that Agent could pull off automatically without having to change integrations and configurations. This would help avoid a breaking change. I see a couple options to solve this:

  • Change how the add_kubernetes_metadata processor works when running under Agent to somehow fetch this info from the agent core process instead of K8s API directly. This may have some of the same memory issues though, not sure. This might set us up to more easily have a cluster-level agent that provides this info for all agents later.
  • Disable the add_kubernetes_metadata by default when running Beats under Agent, but have Agent insert this add_fields processor block automatically when the provider detects it's on K8s.

@gizas
Contributor Author

gizas commented Oct 16, 2023

they only apply to the metricsets that use them directly and are not configurable

A small correction on this: yes, they start by default, but they are configurable via the add_resource_metadata block. See the example below from a managed agent (the same applies to standalone):

[Screenshot: add_resource_metadata configuration in a managed agent policy]

So at the moment you can only disable namespace enrichment (the main one that gives us labels and annotations), node enrichment (which adds node and hostname), cronjob (which adds the cronjob name to pod events) and deployment (which adds the deployment name to pod events).
See all the options here in the add_resource_metadata block.
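For illustration, a fragment along these lines shows what can currently be toggled (the option names follow the documented add_resource_metadata block; the exact placement depends on the provider/integration version, so treat this as a sketch rather than a copy-paste config):

add_resource_metadata:
  namespace:
    enabled: false   # skip namespace enrichment (labels and annotations)
  node:
    enabled: false   # skip node enrichment (node name and hostname)
  deployment: false  # don't add the deployment name to pod events
  cronjob: false     # don't add the cronjob name to pod events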

So what you describe above for the processor can still happen in the enricher and be configured with the options you specify. We may need to spell out extra user scenarios, but that is fine.

The processor still needs to be disabled, though, because at the moment both the enricher and the processor start, and that is not correct.

Agent to somehow fetch this info from the agent core process instead of K8s API directly.

I understand your point: check whether the agent already has this info so we don't ask for it again. We will need to look at the code and see whether the agent's cache can be utilised from the integration side.
cc @ChrsMark and @MichaelKatsoulis, who have worked on this part of the code in the past, in case they remember anything more.

@ChrsMark
Member

Hey folks, I'm not sure the different components are being clearly distinguished here, so here is a quick list of the related components:

1) k8s provider of Elastic Agent

2) k8s enrichers of Metricbeat

3) add_kubernetes_metadata processor

⚠️ This is what the issue is about ⚠️

This processor is started by default by all Beats because the default Beat's configuration includes it: https://github.com/elastic/beats/blob/b8a241b0703d252c3ff1ba9b417aea585f811da1/metricbeat/metricbeat.yml#L127 (same for other Beats like Filebeat etc).
I'm not sure how https://github.com/elastic/beats/blob/main/x-pack/metricbeat/cmd/root.go#L74-L78 interacts with this tbh.

From my perspective, in Beats I should be able to remove the add_kubernetes_metadata: ~ entry from the Beat's default config and thereby disable the processor. Isn't that correct, @gizas?
But Elastic Agent does not provide that option, which leads to the following issue:

In the past we had a related issue at elastic/beats#27216, when Elastic Agent was starting (and still does :)) Filebeat with the default config file (filebeat.yml), and as a result the processor (add_kubernetes_metadata) produced tons of warnings because the defaults were not valid.
That's why we introduced elastic/beats#27689 to skip the enrichment whenever we detect that the k8s metadata is already present on the event (from the k8s autodiscovery provider, for example).

However, this only makes the processor skip the enrichment part, not the "watching" part: the processor still starts fully and makes all the required API calls to populate its caches. This is why we need to disable it completely when we don't specifically need it.

⚠️ However, the original issue is still there and is filed at #90. I think this is why we are still hitting this, @joshdover @gizas, and it is what we should focus on, since it's a generic problem of how Elastic Agent starts and configures the underlying Beats. ⚠️

Note that enabling the processor by default in Beats was a "nice-to-have", but under Agent it's completely useless since other components already do the job, and if anyone needs the processor they can add it to the policy manually.

I hope it's clear why it's crucial to either disable the processor by default at the Beats level or provide the option to do so from Elastic Agent through #90.
All the information about the problems caused by the add_kubernetes_metadata processor can be found in the linked issues above, so please have a look; happy to elaborate further if needed.

ps1: Reusing metadata is a valid point in general but not applicable here, so let's try to keep the focus on the existing issue :).

@gizas
Contributor Author

gizas commented Oct 16, 2023

Thanks @ChrsMark. Trying to summarise your info and Josh's comments:

  • Yes, this issue is about stopping the add_kubernetes_metadata processor from running by default in the Beats started by Agent, and about testing whether this causes any regression compared to what users see today.
  • @joshdover (correct me here) was suggesting more granular disablement of the existing metadata, if that makes sense.
  • The suggestion to reuse metadata from the agent's provider is not feasible in Metricbeat, as the provider is not used there at all.

@ChrsMark
Member

@gizas 👍🏽

I had filed an issue regarding the memory utilization improvement by using common shared caches etc: elastic/beats#35591. But as I said this is out of scope for the current problem.

So yes, let's try to see if we can completely disable the add_kubernetes_metadata processor at Beat startup (ideally when we detect that we are running under Elastic Agent).
We could introduce an ENV var check at https://github.com/MichaelKatsoulis/beats/blob/3955e717adb8a1204464de756fcba91d7df89d75/libbeat/processors/add_kubernetes_metadata/kubernetes.go#L142 so as to skip the whole initialization if the specific variable is present, something similar to what we did for the add_host_metadata processor, @gizas.
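As a rough sketch of that idea, a presence-based guard could sit at the top of the processor's constructor so that no watchers or K8s API clients are ever created. The variable name below is purely illustrative (not something Beats reads today), and the real check would live inside the processor's New function rather than in main:

package main

import (
	"fmt"
	"os"
)

// k8sMetadataDisabled reports whether a (hypothetical) environment variable
// asks us to skip initializing the add_kubernetes_metadata processor entirely.
func k8sMetadataDisabled() bool {
	_, present := os.LookupEnv("ELASTIC_SKIP_ADD_K8S_METADATA") // illustrative name only
	return present
}

func main() {
	// In the real processor this check would run before any watchers,
	// caches or Kubernetes API clients are created.
	if k8sMetadataDisabled() {
		fmt.Println("add_kubernetes_metadata: disabled via environment, returning a no-op processor")
		return
	}
	fmt.Println("add_kubernetes_metadata: normal initialization")
}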

@MichaelKatsoulis
Contributor

+1 to what @ChrsMark mentioned.
Also, regarding @joshdover's comment in #4670: in the case of the container_logs integration in a managed or standalone agent, the use of dynamic variables (/var/log/containers/*${kubernetes.container.id}.log) starts the kubernetes provider.
The provider enriches the log events with metadata, so adding the add_fields processor won't add anything new.

@MichaelKatsoulis
Contributor

More on the ENV var approach: a variable worth investigating is BEAT_SETUID_AS, which under agent is set to elastic-agent by the Dockerfile. When a Beat runs on its own the variable is unset, as far as I have seen so far.

@gsantoro
Contributor

The default set of processors for each Beat, when they are started by agent, is defined in the code. Here is the definition for Metricbeat. Same default config stands for rest of beats.

@gizas This is not entirely true. Filebeat behaves differently.

The code at https://github.com/elastic/beats/blob/157b31a79e3ad52578f30399049cac030d688934/x-pack/filebeat/cmd/root.go#L51 shows that for Filebeat, when SHIPPER_MODE=true, all the default processors are disabled. This only applies to Filebeat, not to Metricbeat.

@gsantoro
Contributor

@MichaelKatsoulis personally I wouldn't override the behaviour of BEAT_SETUID_AS to be used as a feature flag for enabling/disabling processors.

@gizas
Contributor Author

gizas commented Dec 12, 2023

@gsantoro we have done something similar with an environment variable here to alter the configuration of processors. Maybe we can think of another variable to disable the kubernetes processor without it being a breaking change?

Besides the x-pack configuration in this story (the x-pack folder, I guess, only affects Elastic Agent when it starts Beats underneath), we need to verify whether the configuration files (like https://github.com/elastic/beats/blob/main/filebeat/filebeat.yml#L176) also need to be changed (for when each Beat runs on its own).

Additionally, we would need diagnostics before and after the changes to quantify the memory improvement and to be sure the processor is not run otherwise. Any other ideas for measuring a possible improvement?

@cmacknz
Member

cmacknz commented Dec 12, 2023

One thing we could explore here is using the recently added agent.features section of the policy. This was built as a mechanism for passing feature flags to agent and the processes it runs.

# Feature Flags
# This section enables or disables feature flags supported by Agent and its components.
#agent.features:
#  fqdn:
#    enabled: false

Right now there is only a single feature flag, which controls whether the agent hostname is the fully qualified domain name or not.

While it would be better if we explicitly supported global processors as a first-class concept in the agent policy, there is still a lot of debate about how to do this properly, since doing it for every process in agent, regardless of whether it is implemented as a Beat or not, is complex.

So rather than waiting for the global processors problem to be solved, we could allow configuring the set of Beats global processors in the features section of the policy. If we expose these as feature flags we can conceivably obsolete them later once global processor support exists. For example we could expose them as something a bit abstracted from the implementation like (final syntax TBD):

agent.features:
    event_metadata:
         # The default value would be true to avoid a breaking change when they aren't specified.
         host: true # controls add_host_metadata
         kubernetes: false # controls add_kubernetes_metadata
         cloud: true # controls add_cloud_metadata
         docker: false # controls add_docker_metadata

This would give us a nice way to control this in the agent policy. This wouldn't be available in the Fleet UI without a change there to expose it, but it could immediately be controlled using the Fleet override API.

The agent passes the feature flags to sub-processes like Beats using the control protocol: https://github.com/elastic/elastic-agent-client/blob/21e4fd899bbcbe622248407e90744217da77b2b4/elastic-agent-client.proto#L268-L271

    // Features are the expected feature flags configurations. Can apply to either components or
    // individual units depending on the flag and its implementation. Omitted if the client reports
    // it has applied the current configuration. Added in Elastic Agent v8.7.1.
    Features features = 3;

This eventually ends up being handled in the x-pack/libbeat/management code in Beats:

https://github.com/elastic/beats/blob/06a8c09655b0d1d5d4ec4bd77bc930d18975155e/x-pack/libbeat/management/managerV2.go#L572-L584

Most of the work with this approach would be allowing changes of the feature flag to remove processors from the list of default global processors. These are currently initialized once at startup; we'd need a way to manipulate the processors list in the pipeline based on feature flag changes (see the sketch below).

https://github.com/elastic/beats/blob/06a8c09655b0d1d5d4ec4bd77bc930d18975155e/x-pack/filebeat/cmd/root.go#L41-L66
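One way to make a statically initialized global processor respond to later feature-flag changes is to wrap it in a toggle that the manager can flip when a new Features message arrives. The sketch below is self-contained and uses simplified stand-ins for the libbeat types, so it only illustrates the shape of the idea, not the actual Beats API:

package main

import (
	"fmt"
	"sync/atomic"
)

// Event and Processor are simplified stand-ins for the libbeat types.
type Event map[string]any

type Processor interface {
	Run(e Event) (Event, error)
}

// toggleProcessor wraps a global processor so a feature-flag update can
// disable it at runtime without rebuilding the whole pipeline.
type toggleProcessor struct {
	enabled atomic.Bool
	wrapped Processor
}

func (t *toggleProcessor) Run(e Event) (Event, error) {
	if !t.enabled.Load() {
		return e, nil // flag turned off: pass the event through untouched
	}
	return t.wrapped.Run(e)
}

// SetEnabled is what the feature-flag handler in the manager would call.
func (t *toggleProcessor) SetEnabled(on bool) { t.enabled.Store(on) }

// addFakeK8sMetadata stands in for add_kubernetes_metadata.
type addFakeK8sMetadata struct{}

func (addFakeK8sMetadata) Run(e Event) (Event, error) {
	e["kubernetes.pod.name"] = "example-pod"
	return e, nil
}

func main() {
	p := &toggleProcessor{wrapped: addFakeK8sMetadata{}}
	p.SetEnabled(true)
	out, _ := p.Run(Event{})
	fmt.Println("enabled:", out)

	// e.g. agent.features.event_metadata.kubernetes: false arrived from agent
	p.SetEnabled(false)
	out, _ = p.Run(Event{})
	fmt.Println("disabled:", out)
}

Note that merely skipping Run still leaves whatever watchers and caches the wrapped processor created at startup; fully releasing those resources would need an explicit close/re-init hook when the flag flips, which is where most of the work described above would go.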

@gizas
Contributor Author

gizas commented Dec 13, 2023

Nice @cmacknz !
So @gsantoro I guess this is a nice test for your debugger to see when

agent.features:
    event_metadata:
         kubernetes: false # controls add_kubernetes_metadata

then we actually pass the correct config here https://github.com/elastic/beats/blob/06a8c09655b0d1d5d4ec4bd77bc930d18975155e/x-pack/libbeat/management/managerV2.go#L572-L584

Just a reminder that, regardless of how we do it, we must verify that disabling add_kubernetes_metadata by default doesn't break anything and doesn't cause us to miss metadata in our collection. We should also measure the improvement in memory and number of API calls, if any.

@gsantoro
Contributor

Hey @gizas and @cmacknz,
I am more inclined to follow the approach of ELASTIC_NETINFO at https://github.com/elastic/beats/pull/36506/files.

I would introduce a variable ELASTIC_K8S_METADATA, set it to false in every elastic-agent deployment (both standalone and managed), and be done with it.

No Fleet code change necessary. Just set the right values in the manifests and some docs.
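For the container/IaC case this is only a manifest change. A sketch of what it could look like in the agent DaemonSet spec (ELASTIC_K8S_METADATA is the variable proposed here, not something the Beats code reads today):

# Illustrative fragment of the elastic-agent DaemonSet container spec.
spec:
  containers:
    - name: elastic-agent
      env:
        - name: ELASTIC_K8S_METADATA
          value: "false"   # Beats started by agent would skip add_kubernetes_metadata init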

What do you guys think?

@cmacknz
Member

cmacknz commented Jan 31, 2024

That is what I would call the minimum viable change. That only fixes the k8s metadata processor for users running the agent as a container (see below), and requires them to update their k8s YAML to include it. You would think the k8s configuration stays up to date, but I have seen it lag the agent version (edit: this is probably true of the features based approach as well since it defaults to on).

What is proposed in #4670 solves this problem for every global processor in every deployment model. Adding environment variables to agents deployed on individual endpoints (laptops, desktops, etc) requires mass reconfiguration of the agent outside of Fleet since the agent installation itself must be modified. Using an environment variable is really only viable in containers/IaC use cases.

I'd prefer this get implemented as described in #4670, being able to disable global processors is a pretty common request. This is the platform level solution, which I am obviously biased towards as a person who maintains agent as a platform.

That said, doing it that way will take significantly longer, probably 1 month for someone who has never touched the inside of agent. The agent team itself is unlikely to be able to take that on in the near future either. So if you can't commit to that, then the simpler solution is the way to go.

Ideally we aren't building up layers of band aids like this all over the place, but we already have a precedent for this processor with the NETINFO variable at least so it isn't new to this area.

@gsantoro
Contributor

gsantoro commented Feb 1, 2024

@cmacknz I understand that this feels like a band-aid, but I don't think any of us has 1 month to work on this.

I have been thinking that probably we don't even need the environment variable.

If elastic-agent is already doing the work to add k8s metadata, can't we just remove add_kubernetes_metadata from the list of default processors in the code?

AFAIK Beats doesn't add this processor unless it is instructed to do so via config. So we just don't want to add that processor by default when running under elastic-agent.

@gizas
Contributor Author

gizas commented Feb 1, 2024

Sorry, late to the party. I would only add that this story will initially try to verify whether completely removing the add_kubernetes_metadata processor breaks anything. So indeed, I would first try just removing it and see if everything is still ok.

Of course there might be cases we cannot think of right now that this might break, but (correct me on this) if the user can then manually add the processor or globally enable it (whenever that is ready), won't this be sufficient?

@cmacknz
Member

cmacknz commented Feb 1, 2024

add_kubernetes_metadata applies the metadata to each event that passes through each of the Beats. The Elastic Agent Kubernetes provider collects the same information, but does not apply it to any of the events. It makes the information available for variable substitution in the agent policy and adds it as metadata when the agent checks in with Fleet.

The only way to get the k8s metadata onto events with the agent kubernetes provider is to have an integration template the processors (or add them manually) using the data the k8s provider collected. I believe you already do some of this in the k8s integration, but removing add_kubernetes_metadata would be a breaking change for users that rely on this processor but don't use the Kubernetes integration.

I think Josh proposed this earlier, but we could have agent push the k8s metadata it collected down to each Beat using the control protocol, where they could automatically generate the equivalent static processor. Then you could remove the add_kubernetes_metadata processor and the duplicate API calls it makes without it being a breaking change, with the catch that you would still have to change the interface between agent and its sub-processes to make it work. I'd say this would also cost you about 1 month to do correctly, but reading your goals I like it better than what is in #4670. This approach works by default with no breaking change or manual configuration required.

The extra work here is a product of the agent sub-process architecture, which we are stuck with until we can shift more of this functionality into the OTel collector embedded in the agent. That is not going to happen in the short term.

@cmacknz
Member

cmacknz commented Feb 1, 2024

This conversation is getting farther into the internals of the agent, happy to jump on a call to help close this out if that would help.

@gsantoro
Contributor

gsantoro commented Mar 4, 2024

How to run audit-logs tests locally

discussion on github.

You must check out the repo obs-infraobs-team since we will use some taskfiles and env variables.

Before starting, you need to change the params at tools/envs/elastic-agent/.env to build elastic-agent:

  • Change platform to either linux/amd64 or linux/arm64. E.g. on a Mac M1 laptop I use linux/arm64.
  • I use EXTERNAL=false, or don't provide this env variable at all, so that beats is compiled and the code is run from source.

I have an elastic-agent repo with optimizations that make the compilation faster at https://github.com/gsantoro/elastic-agent/tree/feature/default_processors. You might want to check it out locally to build elastic-agent in 2-3 minutes instead of 20 minutes.

You need to start with a clean version of the Beats repo.

Run the following commands to set up Elastic Agent with add_kubernetes_metadata enabled (no changes to the code):

# activate env variables to configure audit logs in Kind
cd tools/envs/local-auditlogs

# start k8s on Kind, Elastic via elastic-package, configure Fleet to use elastic-agent built from source
task local-managed-source:up

Keep it running for 30 minutes to get some audit logs.

You can visualize the audit logs using the dashboard at tools/apps/audits-notebook/data/kibana-dashboard.ndjson.

Now you want to change the code in beats to disable the processor add_kubernetes_metadata.

You will need to comment out the line and rebuild/rerun elastic-agent with those changes:

# activate env variables to configure audit logs in Kind
cd tools/envs/local-auditlogs

# start k8s on Kind, Elastic via elastic-package, configure Fleet to use elastic-agent built from source
task local-managed-source:reload-from-source

Keep it running for another 30 minutes and compare the results.

@gsantoro
Contributor

gsantoro commented Mar 7, 2024

Summary of a Zoom call with @cmacknz and @MichaelKatsoulis:

  • We agreed that we should implement the more long-term solution highlighted here, instead of the env variable that would solve the problem only for the k8s integration. This is because other teams (with other integrations) may want to enable/disable add_kubernetes_metadata or other processors without encoding the logic in Beats code.
  • An alternative way to mitigate the problem might be to reduce the number of Beats by grouping more than one input type in the same Beat. This might solve other problems too, and it is something @cmacknz's team has already been thinking about; it is up to @cmacknz to discuss and implement this with his team.
  • The problem of implementing a way to configure Beat processors will be discussed by @cmacknz in order to find someone with enough time to implement this or another solution.

@gsantoro
Contributor

I'm closing this issue for now since it's probably going to be offloaded to @cmacknz's team

@cmacknz cmacknz reopened this Mar 13, 2024
@cmacknz cmacknz added the Team:Elastic-Agent label and removed the Team:Cloudnative-Monitoring label Mar 13, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@cmacknz
Member

cmacknz commented May 2, 2024

It may actually be better to put this in the providers section of the policy and kill two birds with one stone by exposing agent.providers in Fleet, which isn't supported yet. Something like:

providers:
     event_metadata:
         # The default value would be true to avoid a breaking change when they aren't specified.
         host: # controls add_host_metadata
             enabled: true
         kubernetes: # controls add_kubernetes_metadata
             enabled: false
         cloud: # controls add_cloud_metadata
             enabled: true
         docker: # controls add_docker_metadata 
             enabled: false

@nimarezainia
Contributor

@kpollich I wanted to bring this to your attention after speaking to Craig. I think perhaps we should finally introduce the "advanced" section in the Agent Policy and add this "providers" section there. These booleans can also be generated from config.

@kpollich
Member

kpollich commented May 3, 2024

Sounds good to me. I think this should be achievable with the advanced settings framework.

@cmacknz cmacknz changed the title Disable add_kubernetes_metadata processor on beats startup Elastic:agent Allow disabling add_x_metadata default global processors May 3, 2024
@cmacknz
Member

cmacknz commented May 3, 2024

Updated the description and transferring to the agent repository since this is really an agent feature.

@cmacknz cmacknz transferred this issue from elastic/beats May 3, 2024
@cmacknz cmacknz changed the title Elastic:agent Allow disabling add_x_metadata default global processors Allow disabling add_x_metadata default global processors May 3, 2024
@ycombinator ycombinator added the Team:Elastic-Agent-Control-Plane label May 17, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz cmacknz changed the title Allow disabling add_x_metadata default global processors Allow disabling and configuring the add_x_metadata default global processors Jun 20, 2024
@pchila
Member

pchila commented Jun 24, 2024

Initial design/implementation ideas:

  • The new agent event_metadata "provider" is not really a provider, but rather a component that injects/manages component configuration (either directly in the transpiler or, more likely, as a ComponentModifier); see the sketch after this list.
  • The v2 client will be modified to allow opting in to subscribing to ComponentChanges() and not just UnitChanges().
  • For Beats, when a Beat receives configuration for the global processors at the component level, it will dump the config into a fleet-component-config.yaml file in the data.path location and re-exec itself to load that fragment of configuration (it is not clear whether the global processors can be changed dynamically without restarting the Beat, since most of them are instantiated very early in the Beat lifecycle).
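A rough sketch of the first bullet, assuming a modifier signature along the lines of the existing component modifiers in elastic-agent (the types below are simplified stand-ins, not the real elastic-agent packages):

package main

import "fmt"

// Component and ComponentModifier are simplified stand-ins for the elastic-agent types.
type Component struct {
	ID     string
	Config map[string]any
}

type ComponentModifier func(comps []Component, policy map[string]any) ([]Component, error)

// EventMetadataModifier copies the providers.event_metadata settings from the
// policy into every component's configuration so each Beat can translate them
// into its global processor setup.
func EventMetadataModifier() ComponentModifier {
	return func(comps []Component, policy map[string]any) ([]Component, error) {
		providers, _ := policy["providers"].(map[string]any)
		em, _ := providers["event_metadata"].(map[string]any)
		if em == nil {
			return comps, nil // nothing configured: keep the default processors
		}
		for i := range comps {
			if comps[i].Config == nil {
				comps[i].Config = map[string]any{}
			}
			comps[i].Config["event_metadata"] = em
		}
		return comps, nil
	}
}

func main() {
	comps := []Component{{ID: "filestream-default"}}
	policy := map[string]any{
		"providers": map[string]any{
			"event_metadata": map[string]any{
				"kubernetes": map[string]any{"enabled": false},
			},
		},
	}
	out, _ := EventMetadataModifier()(comps, policy)
	fmt.Printf("%+v\n", out[0].Config)
}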

Any thoughts/feedback/observations @cmacknz , @ycombinator ?

@cmacknz
Member

cmacknz commented Jun 24, 2024

For Beats, when a Beat receives configuration for the global processors at the component level, it will dump the config into a fleet-component-config.yaml file in the data.path location and re-exec itself to load that fragment of configuration (it is not clear whether the global processors can be changed dynamically without restarting the Beat, since most of them are instantiated very early in the Beat lifecycle).

Why do the Beats need to re-exec themselves? These processors map to global Beat processors (processors at the top level of the Beat configuration as defined in https://www.elastic.co/guide/en/beats/filebeat/current/defining-processors.html#where-valid) and should be reloadable like they are if you were writing the Beat configuration by hand.
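For reference, this is the kind of top-level (global) processors block meant here, roughly as it appears in the Beats reference configs and as you would write it by hand in a standalone filebeat.yml:

# Global processors: defined at the top level of the Beat config,
# they apply to every event from every input.
processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
  - add_kubernetes_metadata: ~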

@pchila
Member

pchila commented Jun 25, 2024

Why do the Beats need to re-exec themselves? These processors map to global Beat processors (processors at the top level of the Beat configuration as defined in https://www.elastic.co/guide/en/beats/filebeat/current/defining-processors.html#where-valid) and should be reloadable like they are if you were writing the Beat configuration by hand.

I didn't find a place, other than the very early stages where the pipeline is built, where the global processors are used...

At that point, if we want to configure such processors we need to have the configuration available, but I don't think the client is up yet (hence the idea of writing out a partial config file and re-exec'ing)... I will have another look once I have the new client in and see whether we can reconfigure the processors after we've assembled the pipeline.

I would be happy to avoid writing out a file and re-exec'ing, so if we can still configure the global processors on the fly by the time the Beat starts receiving CheckinExpected messages, I will be more than happy to save myself the extra complication and hassle.

@cmacknz
Member

cmacknz commented Jun 25, 2024

You should be able to test whether hot reloading the global processors at anytime works by trying it with a standalone Beat configuration.

@pchila
Member

pchila commented Jul 23, 2024

Quick sum up of the solution and progress so far:

The solution requires modifications across 3 repositories:

  • elastic-agent: the "provider" (it's not really a provider, rather a ComponentModifier) has been roughly implemented ---> Allow configuring global processors #5052
  • elastic-agent client: modified the protobuf definitions to include a global processor config and introduced an 'opt-in' component changes publishing ---> Add global processor component config elastic-agent-client#117
  • beats: global processors are configured at a very early stage of Beat initialization and are used before the V2 client has had a chance to run and fetch configuration from agent; at this point we don't have a clear solution.

To go a bit deeper on the beats initialization problem:

  • here is where the function controlling the initialization of processors is created
  • here is the creation of the pipeline builder which contains the instances of the global processors
  • here is where the Pipeline is finally instantiated with a set of processors
  • Pipeline clients (inputs) will then connect to the Pipeline and receive a copy of the global processors to invoke before queuing the events for publishing. The global processors have been already set in the Pipeline during initialization above.

To try and load global processor config I opened 2 draft PRs in the beats repo with 2 different approaches:

  1. Add a reload mechanism for global processor config, make Pipeline track all the opened clients (inputs) and, when a configuration update arrives, update itself and all the connected clients with the new set of processors in a concurrency-safe manner ---> Global processors reload beats#40091
  2. Delay initialization of Pipeline object to give the beat a chance to load the component configuration before spawning inputs and output. If an update for the global processors comes after the pipeline has been created, restart the beat like we do for outputs ---> Global processor reload beats#40312

Neither of these beats PRs is ready yet:

  1. Global processors reload beats#40091 still suffers from some segmentation faults during initialization and is at risk of deadlocks in the Client <---> Pipeline updates (both need mutexes to make the update safe).
  2. Global processor reload beats#40312: delaying the Pipeline initialization seems safer and the updates look easier to handle; however, there seem to be issues with correct output initialization (the outputController for the Pipeline object needs to be looked into). The logic for restarting has been implemented and seems to behave sensibly.
