diff --git a/content/en/blog/2024/otel-collector-anti-patterns/house-on-stilts.jpg b/content/en/blog/2024/otel-collector-anti-patterns/house-on-stilts.jpg
new file mode 100644
index 000000000000..64ff14bccdb7
Binary files /dev/null and b/content/en/blog/2024/otel-collector-anti-patterns/house-on-stilts.jpg differ
diff --git a/content/en/blog/2024/otel-collector-anti-patterns/index.md b/content/en/blog/2024/otel-collector-anti-patterns/index.md
new file mode 100644
index 000000000000..17cf49cdca6a
--- /dev/null
+++ b/content/en/blog/2024/otel-collector-anti-patterns/index.md
@@ -0,0 +1,185 @@
---
title: OpenTelemetry Collector Antipatterns
linkTitle: OTel Collector Antipatterns
date: 2024-03-01
author: >-
  [Adriana Villela](https://github.com/avillela) (Lightstep)
canonical_url: https://open.substack.com/pub/geekingoutpodcast/p/opentelemetry-collector-anti-patterns
cSpell:ignore: antipattern antipatterns
---

![House on stilts against ocean and mountain backdrop](house-on-stilts.jpg)

The [OpenTelemetry Collector](/docs/collector) is one of my favorite
OpenTelemetry (OTel) components. It’s a flexible and powerful data pipeline
that lets you ingest OTel data from one or more sources, transform it
(including batching, filtering, and masking), and export it to one or more
observability backends for analysis. It’s vendor-neutral. It’s extensible,
meaning that you can create your own custom components for it. What’s not to
like?

Unfortunately, as with many tools out there, it is also very easy to fall into
some bad habits. Today, I will dig into five OpenTelemetry Collector
antipatterns and how to avoid them. Let’s get started!

## Antipatterns

### 1- Improper use of Collector deployment modes

It’s not enough to simply use a Collector. It also matters _how_ your
Collectors are deployed within your organization. That’s right: Collector*s*,
plural. Because one is often not enough.

There are two deployment modes for Collectors, agent mode and gateway mode, and
both are needed.

In [agent mode](/docs/collector/deployment/agent/), the Collector sits next to
the application, or on the same host as the application.

![OTel Collector Agent Mode](otel-collector-agent.png)

In [gateway mode](/docs/collector/deployment/gateway/), telemetry data is sent
to a load balancer, which then determines how to distribute the load among a
pool of Collectors. Because you have a pool of Collectors, should one Collector
in that pool fail, another Collector in the pool can take over, keeping data
flowing to your destination without disruption. Gateway mode is commonly
deployed per cluster, data center, or region.

![OTel Collector Gateway Mode](otel-collector-gateway.png)

So which should you use? Both agent and gateway.

If you’re collecting telemetry data for your application, place a Collector
agent alongside your application. If you’re collecting data for infrastructure,
place a Collector agent alongside your infrastructure. Whatever you do, don’t
collect telemetry for all of your infrastructure and applications using a
single Collector. Spreading collection across multiple agents means that if one
Collector fails, the rest of your telemetry collection is unaffected.

The telemetry from your Collector agents can then be sent to a Collector
gateway. Because the gateway sits behind a load balancer, you don’t have a
single point of failure for exporting telemetry data, typically to your
observability backend.
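To make this concrete, here is a minimal sketch of an agent-mode Collector
configuration that forwards telemetry to a gateway over OTLP. Only a traces
pipeline is shown, and the gateway address is a placeholder for illustration;
your receivers, processors, and exporters will vary.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:

exporters:
  otlp:
    # Hypothetical load balancer address in front of the Collector gateway
    # pool; replace with your own endpoint.
    endpoint: otel-gateway.example.com:4317
    tls:
      insecure: true # for illustration only; use proper TLS in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

The Collectors in the gateway pool would run a similar configuration,
typically with heavier processing (for example, tail-based sampling or data
masking) before exporting to the observability backend.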
_Bottom line:_ Having the right Collector deployment configuration to send data
to your observability backend ensures higher availability of your telemetry
collection infrastructure.

### 2- Not monitoring your Collectors

Deploying multiple Collector agents and a Collector gateway is great, but it’s
not enough on its own. Wouldn’t it be nice to know when one of your Collectors
is malfunctioning, or when data is being dropped? That way, you can take action
before things start to escalate. This is where monitoring your Collectors can
be very useful.

But how does one monitor a Collector? The OTel Collector already emits
[metrics for the purposes of its own monitoring](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/monitoring.md).
These can then be sent to your observability backend for monitoring.

### 3- Not using the right Collector Distribution (or not building your own distribution)

There are two official distributions of the OpenTelemetry Collector:
[Core](https://github.com/open-telemetry/opentelemetry-collector) and
[Contrib](https://github.com/open-telemetry/opentelemetry-collector-contrib).

The Core distribution is a bare-bones distribution of the Collector for OTel
developers to develop and test. It contains a base set of
[extensions](/docs/collector/configuration/#service-extensions),
[connectors](/docs/collector/configuration/#connectors),
[receivers](/docs/collector/configuration/#receivers),
[processors](/docs/collector/configuration/#processors), and
[exporters](/docs/collector/configuration/#exporters).

The Contrib distribution is for non-OTel developers to experiment and learn. It
extends the Core distribution and includes components created by third parties
(including vendors and individual community members) that are useful to the
OpenTelemetry community at large.

Neither Core nor Contrib alone is meant to be part of your production workload.
Using Core by itself is too bare-bones to suit most organizations’ needs
(though its components are absolutely needed!). And although many OpenTelemetry
practitioners deploy Contrib in their respective organizations, it has many
components, and you likely won’t need every single exporter, receiver,
processor, connector, and extension. That would be overkill, and your Collector
instance would end up needlessly bloated, potentially increasing its attack
surface.

But how do you pick and choose only the components that you need? The answer is
to build your own distribution, which you can do using a tool called the
[OpenTelemetry Collector Builder](/docs/collector/custom-collector/) (OCB). In
addition, at some point you may need to create your own custom Collector
component, such as a processor or exporter. The OCB allows you to integrate
your custom components AND pick and choose the Contrib components that you
need. (A sample builder manifest is sketched at the end of this section.)

It is also worth mentioning that some vendors build their own
[Collector distributions](/ecosystem/distributions/). These are OTel Collector
distributions curated to include the components that are relevant to that
vendor. They may combine custom, vendor-developed components with curated
components from Contrib. Using a vendor-specific distribution ensures that you
are using just the Collector components that you need, again reducing overall
bloat.

_Bottom line:_ Using the right distribution reduces bloat and allows you to
include only the Collector components that you need.
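As a rough sketch of what building your own distribution looks like, here is a
minimal OCB build manifest (often named `builder-config.yaml`). The
distribution name, module versions, and specific components shown here are
illustrative only; pick the versions and components that match your
environment.

```yaml
dist:
  name: otelcol-custom
  description: Custom OpenTelemetry Collector distribution
  output_path: ./otelcol-custom
  otelcol_version: 0.96.0

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.96.0

processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.96.0
  # Example of pulling in a single Contrib component rather than all of Contrib:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/attributesprocessor v0.96.0

exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.96.0
  - gomod: go.opentelemetry.io/collector/exporter/debugexporter v0.96.0
```

Running the builder against a manifest like this (for example,
`ocb --config builder-config.yaml`) produces a Collector binary containing only
the listed components; see the
[OCB documentation](/docs/collector/custom-collector/) for the full workflow.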
### 4- Not updating your Collectors

This one’s short and sweet. Keeping software up to date is important, and the
Collector is no different! Regularly updating the Collector lets you take
advantage of new features, bug fixes, performance improvements, and security
fixes.

### 5- Not using the OpenTelemetry Collector where appropriate

OpenTelemetry allows you to send telemetry signals from your application to an
observability backend in one of two ways:

- [Directly from the application](/docs/collector/deployment/no-collector/)
- [Via the OpenTelemetry Collector](/docs/collector/)

Sending telemetry data “direct from application” is all well and good for
non-production systems while you’re getting started with OpenTelemetry, but
this approach is neither suited for nor recommended in production systems.
Instead, the
[OpenTelemetry docs recommend using the OpenTelemetry Collector](/docs/collector/#when-to-use-a-collector).
How come?

[Per the OTel docs](/docs/collector/#when-to-use-a-collector), the Collector
“allows your service to offload data quickly and the collector can take care of
additional handling like retries, batching, encryption or even sensitive data
filtering.”

Check out some additional Collector benefits:

- **Collectors can enhance the quality of the telemetry emitted by an
  application while also minimizing costs.** For example: sampling spans to
  reduce costs, enriching telemetry with extra metadata, and generating new
  telemetry, such as metrics derived from spans.
- **Using a Collector to ingest telemetry data makes it easy to change to a new
  backend or export the data in a different format.** If you want to change how
  telemetry is processed or exported, that change happens in one place (the
  Collector!), as opposed to making the same change across multiple
  applications in your organization.
- **Collectors allow you to receive data in various formats and translate it to
  the desired format for export.** This can be very handy when transitioning
  from another telemetry solution to OTel.
- **Collectors allow you to ingest non-application telemetry.** This includes
  logs and metrics from infrastructure sources such as Azure, Prometheus, and
  CloudWatch.

That being said, there are some use cases where folks can’t, or don’t want to,
use a Collector. For instance, when collecting data at the edge from IoT
devices, it might be better to send data directly to the observability backend
instead of to a local Collector, given that resources at the edge might be
limited.

_Bottom line:_ As a general rule, using the OpenTelemetry Collector gives you
additional flexibility for managing your telemetry data.

## Final Thoughts

The OpenTelemetry Collector is a powerful and flexible tool for ingesting,
manipulating, and exporting OpenTelemetry data. By using it to its full
potential and by avoiding these five pitfalls, your organization can be well on
its way towards achieving observability greatness.
diff --git a/content/en/blog/2024/otel-collector-anti-patterns/otel-collector-agent.png b/content/en/blog/2024/otel-collector-anti-patterns/otel-collector-agent.png new file mode 100644 index 000000000000..a1af1423688c Binary files /dev/null and b/content/en/blog/2024/otel-collector-anti-patterns/otel-collector-agent.png differ diff --git a/content/en/blog/2024/otel-collector-anti-patterns/otel-collector-gateway.png b/content/en/blog/2024/otel-collector-anti-patterns/otel-collector-gateway.png new file mode 100644 index 000000000000..b705bc5610a0 Binary files /dev/null and b/content/en/blog/2024/otel-collector-anti-patterns/otel-collector-gateway.png differ diff --git a/static/refcache.json b/static/refcache.json index 3c8ac5371951..0297bea18a7b 100644 --- a/static/refcache.json +++ b/static/refcache.json @@ -39,6 +39,10 @@ "StatusCode": 200, "LastSeen": "2024-01-18T08:05:55.59597-05:00" }, + "https://adri-v.medium.com/43dca4a857a0": { + "StatusCode": 200, + "LastSeen": "2024-02-23T23:30:53.006527-05:00" + }, "https://agilecoffee.com/leancoffee/": { "StatusCode": 200, "LastSeen": "2024-01-18T08:05:43.542109-05:00" @@ -5283,6 +5287,10 @@ "StatusCode": 200, "LastSeen": "2024-01-30T15:37:21.465525-05:00" }, + "https://open.substack.com/pub/geekingoutpodcast/p/opentelemetry-collector-anti-patterns": { + "StatusCode": 200, + "LastSeen": "2024-02-26T15:05:23.506868-05:00" + }, "https://opencensus.io": { "StatusCode": 206, "LastSeen": "2024-01-18T19:07:33.722102-05:00" @@ -5395,6 +5403,54 @@ "StatusCode": 200, "LastSeen": "2024-01-18T19:07:12.98586-05:00" }, + "https://opentelemetry.io/docs/collector": { + "StatusCode": 206, + "LastSeen": "2024-02-23T22:55:03.656226-05:00" + }, + "https://opentelemetry.io/docs/collector/": { + "StatusCode": 206, + "LastSeen": "2024-02-23T22:55:04.244864-05:00" + }, + "https://opentelemetry.io/docs/collector/#when-to-use-a-collector": { + "StatusCode": 206, + "LastSeen": "2024-02-23T22:55:04.48411-05:00" + }, + "https://opentelemetry.io/docs/collector/configuration/#connectors": { + "StatusCode": 206, + "LastSeen": "2024-02-23T22:55:05.306982-05:00" + }, + "https://opentelemetry.io/docs/collector/configuration/#exporters": { + "StatusCode": 206, + "LastSeen": "2024-02-23T22:55:06.037446-05:00" + }, + "https://opentelemetry.io/docs/collector/configuration/#processors": { + "StatusCode": 206, + "LastSeen": "2024-02-23T22:55:05.754871-05:00" + }, + "https://opentelemetry.io/docs/collector/configuration/#receivers": { + "StatusCode": 206, + "LastSeen": "2024-02-23T22:55:05.518086-05:00" + }, + "https://opentelemetry.io/docs/collector/configuration/#service-extensions": { + "StatusCode": 206, + "LastSeen": "2024-02-23T22:55:05.132379-05:00" + }, + "https://opentelemetry.io/docs/collector/custom-collector/": { + "StatusCode": 206, + "LastSeen": "2024-02-23T22:55:06.360327-05:00" + }, + "https://opentelemetry.io/docs/collector/deployment/agent/": { + "StatusCode": 206, + "LastSeen": "2024-02-23T22:55:04.712097-05:00" + }, + "https://opentelemetry.io/docs/collector/deployment/gateway/": { + "StatusCode": 206, + "LastSeen": "2024-02-23T22:55:04.939057-05:00" + }, + "https://opentelemetry.io/docs/collector/deployment/no-collector/": { + "StatusCode": 206, + "LastSeen": "2024-02-23T22:55:04.014798-05:00" + }, "https://opentracing.io": { "StatusCode": 206, "LastSeen": "2024-01-18T19:07:33.813401-05:00"