
Inconsistent metric availability in distributed kafka connect #985

Closed
chrisluedtke opened this issue Aug 5, 2024 · 6 comments

When I hit my metrics endpoint via curl debezium:8080, I get different connector metrics each time. I've seen this behavior both in core Kafka Connect metrics (task status) and in snapshot metrics from a Debezium connector.

In this example, I expect to see 3 connector statuses each time, but each request returns a different single one:

[appuser@debezium-59f66877f6-bsqkl ~]$ curl debezium:8080 | grep task_status | grep postgres
kafka_connect_connector_task_status{connector="accounting-postgres-source",status="running",task="0",} 1.0
[appuser@debezium-59f66877f6-bsqkl ~]$ curl debezium:8080 | grep task_status | grep postgres
kafka_connect_connector_task_status{connector="records-postgres-source",status="running",task="0",} 1.0
[appuser@debezium-59f66877f6-bsqkl ~]$ curl debezium:8080 | grep task_status | grep postgres
kafka_connect_connector_task_status{connector="loads-postgres-source",status="running",task="0",} 1.0

Kafka Connect Dockerfile:

FROM confluentinc/cp-kafka-connect-base:7.5.1

RUN confluent-hub install --no-prompt debezium/debezium-connector-sqlserver:2.4.2 \
  && confluent-hub install --no-prompt debezium/debezium-connector-postgresql:2.4.2 \
  && confluent-hub install --no-prompt snowflakeinc/snowflake-kafka-connector:1.9.3 \
  && confluent-hub install --no-prompt mongodb/kafka-connect-mongodb:1.6.0

# monitoring
ENV JMX_AGENT_VERSION="0.20.0"

RUN curl -so /usr/share/java/jmx_prometheus_javaagent.jar \
  https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/$JMX_AGENT_VERSION/jmx_prometheus_javaagent-$JMX_AGENT_VERSION.jar

COPY jmx_config.yml /usr/share/java/jmx_config.yml
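
The Dockerfile above only downloads the agent jar and copies the config; the step that actually attaches the agent to the worker JVM isn't shown in this thread. A minimal sketch of one way to do it, assuming the standard -javaagent option, the 8080 port from the curl session above, and that the base image forwards KAFKA_OPTS to the Connect JVM (an assumption, not confirmed here):

# Assumption: the base image passes KAFKA_OPTS through to the Connect JVM; 8080 matches the curl session above
ENV KAFKA_OPTS="-javaagent:/usr/share/java/jmx_prometheus_javaagent.jar=8080:/usr/share/java/jmx_config.yml"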

JMX config:

startDelaySeconds: 0
ssl: false
lowercaseOutputName: false
lowercaseOutputLabelNames: false
rules:
  # kafka.connect:type=app-info,client-id="{clientid}"
  - pattern: "kafka.connect<type=app-info, client-id=(.+)><>([a-z-]+): (.+)"
    name: "kafka_connect_app_info"
    value: "1"
    labels:
      client-id: "$1"
      $2: "$3"
    help: "Kafka Connect JMX metric info $1 $2"
    type: UNTYPED

  # kafka.connect:type=connector-task-metrics,connector="{connector}",task="{task}"<>status
  - pattern: 'kafka.connect<type=connector-task-metrics, connector=(.+), task=(.+)><>status: ([a-z-]+)'
    name: kafka_connect_connector_task_status
    value: "1"
    labels:
      connector: "$1"
      task: "$2"
      status: "$3"
    help: "Kafka Connect JMX Connector task status"
    type: GAUGE

  # kafka.connect:type=task-error-metrics,connector="{connector}",task="{task}"
  # kafka.connect:type=source-task-metrics,connector="{connector}",task="{task}"
  # kafka.connect:type=sink-task-metrics,connector="{connector}",task="{task}"
  # kafka.connect:type=connector-task-metrics,connector="{connector}",task="{task}"
  - pattern: kafka.connect<type=(.+)-metrics, connector=(.+), task=(.+)><>([a-z-]+)
    name: kafka_connect_$1_metrics_$4
    labels:
      connector: "$2"
      task: "$3"
    help: "Kafka Connect JMX metric $1 $4"
    type: GAUGE

  # kafka.connect:type=connector-metrics,connector="{connector}"
  # kafka.connect:type=connect-worker-metrics,connector="{connector}"
  - pattern: kafka.connect<type=(.+)-metrics, connector=(.+)><>([a-z-]+)
    name: kafka_connect_$1_metrics_$3
    labels:
      connector: "$2"
    help: "Kafka Connect JMX metric $1 $3"
    type: GAUGE

  # kafka.connect:type=connect-worker-rebalance-metrics
  - pattern: kafka.connect<type=connect-worker-rebalance-metrics><>([a-z-]+)
    name: kafka_connect_worker_rebalance_metrics_$1
    help: "Kafka Connect JMX metric rebalance information"
    type: GAUGE

  # kafka.connect:type=connect-worker-metrics
  - pattern: kafka.connect<type=connect-worker-metrics><>([a-z-]+)
    name: kafka_connect_worker_metrics_$1
    help: "Kafka Connect JMX metric worker $1"
    type: GAUGE

  # debezium.sql_server:type=connector-metrics,server=<topic.prefix>,task=<task.id>,context=snapshot
  # debezium.sql_server:type=connector-metrics,server=<topic.prefix>,task=<task.id>,context=streaming
  - pattern: "debezium.([^:]+)<type=connector-metrics, server=([^,]+), task=([^,]+), context=([^,]+), database=([^,]+), key=([^>]+)>([^:]+)"
    name: "debezium_connector_metrics_$7"
    labels:
      plugin: "$1"
      server: "$2"
      task: "$3"
      context: "$4"
      database: "$5"
      key: "$6"

  - pattern: "debezium.([^:]+)<type=connector-metrics, server=([^,]+), task=([^,]+), context=([^,]+), database=([^>]+)>([^:]+)"
    name: "debezium_connector_metrics_$6"
    labels:
      plugin: "$1"
      server: "$2"
      task: "$3"
      context: "$4"
      database: "$5"

  - pattern: "debezium.([^:]+)<type=connector-metrics, server=([^,]+), task=([^,]+), context=([^>]+)>([^:]+)"
    name: "debezium_connector_metrics_$5"
    labels:
      plugin: "$1"
      server: "$2"
      task: "$3"
      context: "$4"

  # debezium.postgres:type=connector-metrics,context=snapshot,server=<topic.prefix>
  # debezium.postgres:type=connector-metrics,context=streaming,server=<topic.prefix>
  # debezium.sql_server:type=connector-metrics,context=schema-history,server=<topic.prefix>
  - pattern: "debezium.([^:]+)<type=connector-metrics, context=([^,]+), server=([^,]+), key=([^>]+)>([^:]+)"
    name: "debezium_connector_metrics_$5"
    labels:
      plugin: "$1"
      context: "$2"
      server: "$3"
      key: "$4"

  - pattern: "debezium.([^:]+)<type=connector-metrics, context=([^,]+), server=([^>]+)>([^:]+)"
    name: "debezium_connector_metrics_$4"
    labels:
      plugin: "$1"
      context: "$2"
      server: "$3"

  # snowflake.kafka.connector:connector=connector_name,pipe=pipe_name,category=category_name,name=metric_name
  - pattern: "snowflake.kafka.connector([^:]*)<connector=([^,]+), pipe=([^,]+), category=([^,]+), name=([^>]+)><>Max"
    name: "snowflake_connector_metrics_$5_max"
    labels:
      plugin: snowflake
      connector: "$2"
      pipe: "$3"
      category: "$4"
    type: GAUGE
chrisluedtke changed the title from "Inconsistent metric availability" to "Inconsistent metric availability in kafka connect" on Aug 5, 2024
dhoard (Collaborator) commented Aug 5, 2024

@chrisluedtke can you upgrade to version 1.0.1 and re-test?

There were issues before version 1.0.1 that could cause the inconsistency. Specifically, if two MBean names are normalized to the same metric name, the value being returned would typically be the last MBean value encountered.
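
For illustration, using the connector names from the curl output at the top: the task-status rule maps all three task MBeans to the same metric name, so under the pre-1.0.1 behaviour described above only the last value encountered would be kept (a sketch of that failure mode, not the confirmed cause in this thread):

kafka.connect:type=connector-task-metrics,connector=accounting-postgres-source,task=0  -> kafka_connect_connector_task_status
kafka.connect:type=connector-task-metrics,connector=records-postgres-source,task=0     -> kafka_connect_connector_task_status
kafka.connect:type=connector-task-metrics,connector=loads-postgres-source,task=0       -> kafka_connect_connector_task_status  (last value wins pre-1.0.1)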

dhoard self-assigned this on Aug 5, 2024
chrisluedtke (Author) commented Aug 7, 2024

Thanks @dhoard. 1.0.1 has the same problem for me. The issue is not present when I run Kafka Connect as a single standalone process, so this appears to be specific to distributed mode.

dhoard (Collaborator) commented Aug 8, 2024

@chrisluedtke How are you managing ingress for the Connect containers when running a cluster?

chrisluedtke (Author) commented:

I'm running a Kubernetes kafka-connect Service with a selector that matches a label on the deployment's worker pods. I only access it from within the cluster (or use kubefwd to connect), e.g. via http://kafka-connect.dev-kafka-connect.svc.cluster.local:8080/metrics, so I'm relying on kube-proxy to route each request to a pod. Do I need to access each pod to gather all metrics, or is there a way to get all metrics in a single call to the leader?

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2024-08-02T20:19:45Z"
  name: kafka-connect
  namespace: dev-kafka-connect
  resourceVersion: "993603120"
  uid: <redacted_uid>
spec:
  clusterIP: <redacted_ip_1>
  clusterIPs:
  - <redacted_ip_1>
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: connect
    port: 8083
    protocol: TCP
    targetPort: 8083
  - name: jmx
    port: 1976
    protocol: TCP
    targetPort: 1976
  - name: prometheus
    port: 8080
    protocol: TCP
    targetPort: 8080
  selector:
    app: kafka-connect
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

I appreciate your input! I realize this is most likely a "me" problem.

dhoard (Collaborator) commented Aug 8, 2024

@chrisluedtke each Connect pod will need to be scraped directly. I suspect your ingress is bouncing between Connect pods (exporters), resulting in different metrics on each request.
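
One quick way to confirm this is to curl each pod directly instead of going through the Service. A sketch, assuming the pods carry the app=kafka-connect label from the Service selector above and have curl available in the container (as the shell session at the top suggests):

kubectl -n dev-kafka-connect get pods -l app=kafka-connect -o name \
  | xargs -I{} kubectl -n dev-kafka-connect exec {} -- sh -c 'curl -s localhost:8080 | grep task_status'

If the routing explanation above is right, each pod should return only the task statuses for the connectors it is currently running.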

chrisluedtke (Author) commented:

Gotcha, I'll implement a Prometheus server and pull metrics from there instead.
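
For reference, a minimal Prometheus scrape config sketch that discovers and scrapes every Connect pod directly rather than the Service VIP; the app=kafka-connect label, the dev-kafka-connect namespace, and port 8080 come from the manifests above, the rest is assumed:

scrape_configs:
  - job_name: kafka-connect
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - dev-kafka-connect
    relabel_configs:
      # keep only the Connect worker pods, matched on the same label the Service selector uses
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: kafka-connect
        action: keep
      # keep only the JMX exporter port (8080) so each pod is scraped on the exporter, not the REST port (8083)
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "8080"
        action: keep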

chrisluedtke changed the title from "Inconsistent metric availability in kafka connect" to "Inconsistent metric availability in distributed kafka connect" on Aug 8, 2024