Processing telemetry data that comes from an edge device can involve many moving parts, for example: sensor -> module 1 -> module 2 -> IoT Hub -> function 1 -> microservice 1 -> microservice 2 -> storage.
When diagnosing a problem such as "the data didn't arrive in storage" or, even worse, "the data arrived but it's not what was expected", it is hard to trace how the data traveled and what happened at each stop along the way. Hence distributed tracing.
In many cases we do know how the data traveled, because the flows are mostly straightforward. What we don't know is what happened at each point in the flow when a specific piece of data was processed. This is the real challenge.
In the IoT world the solution is more complicated because the flow consists of two parts: the device and the cloud. The steps in the cloud always have direct access to the cloud tracing service (App Insights) and can flush their tracing data on the fly. The steps on the device don't have this privilege, as devices are often offline and their connection to the cloud service is limited or restricted.
Besides the standard use case of sending telemetry data from devices to the cloud, other scenarios would also benefit from distributed tracing:
- Device-to-cloud file upload
- Cloud-to-device messages
- Cloud-to-device direct method invocation
- Direct access to external/cloud services from the device
- Instrument custom modules on the IoT Edge device with OpenTelemetry to report tracing data.
- The key component for routing OpenTelemetry tracing data and logs to the observability backend (e.g. App Insights, Jaeger, Zipkin) is the OpenTelemetry Collector, which can run on the device as a module and in the cloud as an Azure Function or K8s microservice.
- All modules on the device should export traces and logs to the OpenTelemetry Collector module with an OpenTelemetry Protocol (OTLP) exporter. This decouples the module code from the details of how and where the tracing data is going to be used.
- On devices that are normally online, the OpenTelemetry Collector module is configured to receive traces and logs with the OTLP Receiver and to export the data to Azure Monitor (App Insights) via the Azure Monitor Exporter for OpenTelemetry Collector. See an example of IoT Edge deployment with an OpenTelemetry Collector configured to export OTLP data to Azure Monitor.
- Alternatively, on devices that are always online (they can't work otherwise) and have a stable connection to the cloud, custom modules may export traces and logs directly to App Insights with the Azure Monitor OpenTelemetry exporter. In this case the device doesn't need an OpenTelemetry Collector module.
- IoT leaf devices normally sit on the lower network levels without access to Azure Cloud. They might be implemented in languages such as C/C++ or Rust, for which no App Insights SDK is available, and they are normally too small to host something like an OpenTelemetry Collector on their own.
- Leaf devices export their traces and logs via OTLP to the IoT Edge device, which acts as a gateway and forwards everything to the cloud.
- On devices that are normally offline, the OpenTelemetry Collector module is configured to export traces and logs to the Azure Blob Storage module via the Azure Blob Exporter. This module keeps the traces and logs on the device for as long as it is offline. Once the device is online, the Azure Blob Storage module automatically replicates the data to the blob storage account in the cloud. See an example of IoT Edge deployment with an OpenTelemetry Collector and an Azure Blob Storage module. On the cloud side, an OpenTelemetry Collector instance reads the traces and logs from the blob storage with the Azure Blob Receiver and exports the data to Azure Monitor via the Azure Monitor Exporter. See an example of OpenTelemetry Collector configuration running in the cloud.
- The OpenTelemetry Collector instance in the cloud can be deployed as a pod on an AKS instance, as a container on ACI, or as a standalone instance on a VM.
- The OpenTelemetry Collector Azure Blob Receiver subscribes to Azure Blob Storage events. When a new trace or log arrives in the cloud storage from the device, the receiver reads the data from the blob and transfers it to Azure Monitor.
- Alternatively, for devices that are mostly offline and/or not supposed to report much to the cloud, the OpenTelemetry Collector module can forward the tracing data to an open source observability backend (e.g. Jaeger, Zipkin) using one of the available exporters.
- All services in the cloud that are part of the flow may export OpenTelemetry traces to Azure Monitor with the direct exporter from code, or they may use OTLP to export traces to the OpenTelemetry Collector instance in the cloud. The latter covers services that are not implemented in one of the languages supported by Azure Monitor, for example Go or C/C++.
- All steps in the flow (modules on the device and services in the cloud) should leverage OpenTelemetry Tracing API components such as Span Attributes to store deviceid, sensorid, gateway, etc. and Span Events to store essential logs that should be exported with tracing data.
- D2C and C2D messages should carry the tracing span context, injected into the message system properties. Receiving modules and backend services can extract it and use it to continue the trace. This may require using context propagation techniques.
- Trace Azure IoT device-to-cloud messages with distributed tracing
- E2E diagnostic provision CLI
- E2E diagnostic event hub function
- OpenTelemetry and Tracing
- OpenTelemetry and Logs
- OpenTelemetry Collector
- Azure Monitor Exporter for OpenTelemetry Collector
- OpenTelemetry .NET API
- Sending telemetry to Azure Monitor
- OpenTelemetry LightStep
- Collect and Transport Metrics
- Logging Approaches
- Three Pillars of Observability