Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

otel with grafana loki tempo #5011

Merged
merged 2 commits into from
Jan 4, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
66 changes: 66 additions & 0 deletions examples/deploy/otel-tracing-grafana/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
Tracing with Tempo
==================

[Tempo](https://grafana.com/oss/tempo/) is an open-source distributed tracing system for monitoring and debugging microservices. It helps track requests across services, analyze latency, identify bottlenecks, and diagnose failures.

Key use cases include debugging production issues, monitoring performance, visualizing service dependencies, and optimizing system reliability. As Tempo supports the OpenTelemetry protocol, it can be used to collect traces from Windmill.

Follow the guide on [setting up Tempo](https://windmill.dev/docs/misc/guides/otel#setting-up-tempo) for more details.

## Setting up Tempo along with Windmill

Start all services by running:

```bash
docker-compose up -d
```

## Configuring Windmill to use Tempo via OpenTelemetry Collector

In the Windmill UI available at `http://localhost`, complete the initial setup and go to "Instances Settings" and "OTEL/Prom" tab and fill in the open telemetry collector endpoint and service name and toggle the "Tracing" and "Logging" options to send traces and logs to the `otel-collector:4317` port.

## Open the Tempo UI

The Tempo UI is a plugin available in Grafana, if hosted with the `docker-compose.yml` file above, Grafana will be available at `http://localhost:3000`. When running a script or workflow with Windmill, you will be able to see the traces in the Tempo UI and investigate them.

## Searching for specific traces

To search/filter for a specific trace, for example a workflow, you can use the search function in the Tempo UI by filtering by tags set by Windmill.

The following tags are useful to filter for specific traces:

- `job_id`: The ID of the job
- `root_job`: The ID of the root job (flow)
- `parent_job`: The ID of the parent job (flow)
- `flow_step_id`: The ID of the step within the workflow
- `script_path`: The path of the script
- `workspace_id`: The name of the workspace
- `worker_id`: The ID of the worker
- `language`: The language of the script
- `tag`: The queue tag of the workflow

## Monitoring metrics with Tempo and Prometheus

Tempo can be used to generate time series for metrics of the collected traces. These time series can be used to compare the performance of individual steps within a workflow and their overall performance and relative contribution over time, as well as identify and troubleshoot issues and anomalies.

These metrics are exported to Prometheus and can be viewed in the Grafana UI in the Metrics Explorer. The generated metrics are:

- traces_spanmetrics_calls_total
- traces_spanmetrics_latency
- traces_spanmetrics_latency_bucket
- traces_spanmetrics_latency_count
- traces_spanmetrics_latency_sum
- traces_spanmetrics_size_total

## Viewing Logs with Loki

Logs from Windmill are sent to [Loki](https://grafana.com/oss/loki/), a log aggregation system that integrates seamlessly with Grafana. You can view and analyze these logs in the Grafana UI.

To access logs in Grafana:

1. Open the Grafana UI, typically available at `http://localhost:3000`.
2. Navigate to the "Explore" section.
3. Select the Loki data source.
4. Use the query editor to filter and search logs based on various labels and fields.

This setup allows you to correlate logs with traces, providing a comprehensive view of your system's behavior and aiding in troubleshooting and performance analysis.
245 changes: 245 additions & 0 deletions examples/deploy/otel-tracing-grafana/docker-compose.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,245 @@
version: "3.7"

services:
db:
deploy:
# To use an external database, set replicas to 0 and set DATABASE_URL to the external database url in the .env file
replicas: 1
image: postgres:16
shm_size: 1g
restart: unless-stopped
volumes:
- db_data:/var/lib/postgresql/data
expose:
- 5432
ports:
- 5432:5432
environment:
POSTGRES_PASSWORD: changeme
POSTGRES_DB: windmill
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
interval: 10s
timeout: 5s
retries: 5

windmill_server:
image: ${WM_IMAGE}
pull_policy: always
deploy:
replicas: 1
restart: unless-stopped
expose:
- 8000
- 2525
environment:
- DATABASE_URL=${DATABASE_URL}
- MODE=server
depends_on:
db:
condition: service_healthy
volumes:
- worker_logs:/tmp/windmill/logs

windmill_worker:
image: ${WM_IMAGE}
pull_policy: always
deploy:
replicas: 3
resources:
limits:
cpus: "1"
memory: 2048M
# for GB, use syntax '2Gi'
restart: unless-stopped
environment:
- DATABASE_URL=${DATABASE_URL}
- MODE=worker
- WORKER_GROUP=default
depends_on:
db:
condition: service_healthy
# to mount the worker folder to debug, KEEP_JOB_DIR=true and mount /tmp/windmill
volumes:
# mount the docker socket to allow to run docker containers from within the workers
- /var/run/docker.sock:/var/run/docker.sock
- worker_dependency_cache:/tmp/windmill/cache
- worker_logs:/tmp/windmill/logs

## This worker is specialized for "native" jobs. Native jobs run in-process and thus are much more lightweight than other jobs
windmill_worker_native:
# Use ghcr.io/windmill-labs/windmill-ee:main for the ee
image: ${WM_IMAGE}
pull_policy: always
deploy:
replicas: 1
resources:
limits:
cpus: "1"
memory: 2048M
# for GB, use syntax '2Gi'
restart: unless-stopped
environment:
- DATABASE_URL=${DATABASE_URL}
- MODE=worker
- WORKER_GROUP=native
- NUM_WORKERS=8
- SLEEP_QUEUE=200
depends_on:
db:
condition: service_healthy
volumes:
- worker_logs:/tmp/windmill/logs
# This worker is specialized for reports or scraping jobs. It is assigned the "reports" worker group which has an init script that installs chromium and can be targeted by using the "chromium" worker tag.
# windmill_worker_reports:
# image: ${WM_IMAGE}
# pull_policy: always
# deploy:
# replicas: 1
# resources:
# limits:
# cpus: "1"
# memory: 2048M
# # for GB, use syntax '2Gi'
# restart: unless-stopped
# environment:
# - DATABASE_URL=${DATABASE_URL}
# - MODE=worker
# - WORKER_GROUP=reports
# depends_on:
# db:
# condition: service_healthy
# # to mount the worker folder to debug, KEEP_JOB_DIR=true and mount /tmp/windmill
# volumes:
# # mount the docker socket to allow to run docker containers from within the workers
# - /var/run/docker.sock:/var/run/docker.sock
# - worker_dependency_cache:/tmp/windmill/cache
# - worker_logs:/tmp/windmill/logs

# The indexer powers full-text job and log search, an EE feature.
windmill_indexer:
image: ${WM_IMAGE}
pull_policy: always
deploy:
replicas: 0 # set to 1 to enable full-text job and log search
restart: unless-stopped
expose:
- 8001
environment:
- PORT=8001
- DATABASE_URL=${DATABASE_URL}
- MODE=indexer
depends_on:
db:
condition: service_healthy
volumes:
- windmill_index:/tmp/windmill/search
- worker_logs:/tmp/windmill/logs

lsp:
image: ghcr.io/windmill-labs/windmill-lsp:latest
pull_policy: always
restart: unless-stopped
expose:
- 3001
volumes:
- lsp_cache:/root/.cache

multiplayer:
image: ghcr.io/windmill-labs/windmill-multiplayer:latest
deploy:
replicas: 0 # Set to 1 to enable multiplayer, only available on Enterprise Edition
restart: unless-stopped
expose:
- 3002

caddy:
image: ghcr.io/windmill-labs/caddy-l4:latest
restart: unless-stopped
# Configure the mounted Caddyfile and the exposed ports or use another reverse proxy if needed
volumes:
- ./Caddyfile:/etc/caddy/Caddyfile
# - ./certs:/certs # Provide custom certificate files like cert.pem and key.pem to enable HTTPS - See the corresponding section in the Caddyfile
ports:
# To change the exposed port, simply change 80:80 to <desired_port>:80. No other changes needed
- 80:80
- 25:25
# - 443:443 # Uncomment to enable HTTPS handling by Caddy
environment:
- BASE_URL=":80"
# - BASE_URL=":443" # uncomment and comment line above to enable HTTPS via custom certificate and key files
# - BASE_URL=mydomain.com # Uncomment and comment line above to enable HTTPS handling by Caddy

# Grafana OpenTelemetry Example
# https://windmill.dev/docs/misc/guides/otel#setting-up-grafana
init:
image: &tempoImage grafana/tempo:latest
user: root
entrypoint:
- "chown"
- "10001:10001"
- "/var/tempo"
volumes:
- tempo-data:/var/tempo

otel-collector:
image: otel/opentelemetry-collector:latest
container_name: otel-collector
expose:
- 4317
volumes:
- ./otel-config.yaml:/etc/otel/config.yaml
command: ["--config=/etc/otel/config.yaml"]

tempo:
image: *tempoImage
command: [ "-config.file=/etc/tempo.yaml" ]
volumes:
- ./tempo-config.yaml:/etc/tempo.yaml
- tempo-data:/var/tempo
expose:
- 3200
- 4317
depends_on:
- init

loki:
image: grafana/loki:latest
expose:
- 3100
command: -config.file=/etc/loki/local-config.yaml
volumes:
- ./loki-config.yaml:/etc/loki/local-config.yaml

prometheus:
image: prom/prometheus:latest
command:
- --config.file=/etc/prometheus.yaml
- --web.enable-remote-write-receiver
- --enable-feature=exemplar-storage
- --enable-feature=native-histograms
volumes:
- ./prometheus-config.yaml:/etc/prometheus.yaml
expose:
- 9090

grafana:
image: grafana/grafana:11.0.0
volumes:
- ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
environment:
- GF_AUTH_ANONYMOUS_ENABLED=true
- GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
- GF_AUTH_DISABLE_LOGIN_FORM=true
- GF_FEATURE_TOGGLES_ENABLE=traceqlEditor metricsSummary
- GF_INSTALL_PLUGINS=https://storage.googleapis.com/integration-artifacts/grafana-exploretraces-app/grafana-exploretraces-app-latest.zip;grafana-traces-app
ports:
- "3000:3000"

volumes:
db_data: null
worker_dependency_cache: null
worker_logs: null
windmill_index: null
lsp_cache: null
tempo-data: null
40 changes: 40 additions & 0 deletions examples/deploy/otel-tracing-grafana/grafana-datasources.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
apiVersion: 1

datasources:
- name: Prometheus
type: prometheus
uid: prometheus
access: proxy
orgId: 1
url: http://prometheus:9090
basicAuth: false
isDefault: false
version: 1
editable: false
jsonData:
httpMethod: GET
- name: Tempo
type: tempo
access: proxy
orgId: 1
url: http://tempo:3200
basicAuth: false
isDefault: true
version: 1
editable: false
apiVersion: 1
uid: tempo
jsonData:
httpMethod: GET
serviceMap:
datasourceUid: prometheus
streamingEnabled:
search: true
- name: Loki
type: loki
access: proxy
url: http://loki:3100
jsonData:
httpHeaderName1: "X-Scope-OrgID"
secureJsonData:
httpHeaderValue1: "tenant1"
29 changes: 29 additions & 0 deletions examples/deploy/otel-tracing-grafana/loki-config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
auth_enabled: false

server:
http_listen_port: 3100

common:
ring:
instance_addr: 0.0.0.0
kvstore:
store: inmemory
replication_factor: 1
path_prefix: /tmp/loki

schema_config:
configs:
- from: 2020-05-15
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h

storage_config:
filesystem:
directory: /tmp/loki/chunks

limits_config:
allow_structured_metadata: true
Loading