panic using the load balancing exporter #31410

Closed
grzn opened this issue Feb 26, 2024 · 25 comments · Fixed by #31456
Labels
bug Something isn't working exporter/loadbalancing

Comments

@grzn
Contributor

grzn commented Feb 26, 2024

Component(s)

exporter/loadbalancing

What happened?

Description

We are running v0.94.0 in a number of k8s clusters, and are experiencing panics in the agent setup

Steps to Reproduce

I don't have exact steps to reproduce, but this panic happens quite often across our clusters.

Expected Result

No panic

Actual Result

Panic :)

Collector version

v0.94.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

    connectors:
      null
    exporters:
      file/logs:
        path: /dev/null
      file/traces:
        path: /dev/null
      loadbalancing/traces:
        protocol:
          otlp:
            retry_on_failure:
              enabled: true
              max_elapsed_time: 30s
              max_interval: 5s
            sending_queue:
              enabled: true
              num_consumers: 20
              queue_size: 50000
            timeout: 20s
            tls:
              insecure: true
        resolver:
          k8s:
            service: opentelemetry-collector.default
    extensions:
      health_check: {}
    processors:
      batch:
        send_batch_max_size: 4096
        send_batch_size: 4096
        timeout: 100ms
      filter/fastpath:
        traces:
          span:
          - (end_time_unix_nano - start_time_unix_nano <= 1000000) and parent_span_id.string
            != ""
      k8sattributes:
        extract:
          annotations: null
          labels:
          - key: app
          metadata:
          - k8s.deployment.name
          - k8s.namespace.name
          - k8s.node.name
          - k8s.pod.name
          - k8s.pod.uid
          - container.id
          - container.image.name
          - container.image.tag
        filter:
          node_from_env_var: K8S_NODE_NAME
        pod_association:
        - sources:
          - from: resource_attribute
            name: k8s.pod.uid
        - sources:
          - from: resource_attribute
            name: k8s.pod.ip
        - sources:
          - from: resource_attribute
            name: host.name
      memory_limiter:
        check_interval: 1s
        limit_percentage: 95
        spike_limit_percentage: 10
      resource:
        attributes:
        - action: insert
          key: k8s.node.name
          value: ${K8S_NODE_NAME}
      resource/add_agent_k8s:
        attributes:
        - action: insert
          key: k8s.pod.name
          value: ${K8S_POD_NAME}
        - action: insert
          key: k8s.pod.uid
          value: ${K8S_POD_UID}
        - action: insert
          key: k8s.namespace.name
          value: ${K8S_NAMESPACE}
      resource/add_cluster_name:
        attributes:
        - action: upsert
          key: k8s.cluster.name
          value: test-eu3
      resource/add_environment:
        attributes:
        - action: insert
          key: deployment.environment
          value: test
      resourcedetection:
        detectors:
        - env
        - eks
        - ec2
        - system
        override: false
        timeout: 10s
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
      prometheus:
        config:
          scrape_configs:
          - job_name: opentelemetry-agent
            scrape_interval: 10s
            static_configs:
            - targets:
              - ${K8S_POD_IP}:9090
    service:
      extensions:
      - health_check
      pipelines:
        traces:
          exporters:
          - loadbalancing/traces
          processors:
          - memory_limiter
          - filter/fastpath
          - k8sattributes
          - resource
          - resource/add_cluster_name
          - resource/add_environment
          - resource/add_agent_k8s
          - resourcedetection
          receivers:
          - otlp
      telemetry:
        logs:
          encoding: json
          initial_fields:
            service: opentelemetry-agent
          level: INFO
          sampling:
            enabled: true
            initial: 3
            thereafter: 0
            tick: 60s
        metrics:
          address: 0.0.0.0:9090

Log output

net/http/server.go:3086 +0x4cc
created by net/http.(*Server).Serve in goroutine 745
net/http/server.go:2009 +0x518
net/http.(*conn).serve(0x4004ab4000, {0x862ab78, 0x4001dcfaa0})
net/http/server.go:2938 +0xbc
net/http.serverHandler.ServeHTTP({0x85dff10?}, {0x8608d30?, 0x40001dc380?}, 0x6?)
go.opentelemetry.io/collector/config/[email protected]/clientinfohandler.go:26 +0x100
go.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP(0x400212cd08, {0x8608d30, 0x40001dc380}, 0x4005a3ae00)
net/http/server.go:2136 +0x38
net/http.HandlerFunc.ServeHTTP(0x4005a3ae00?, {0x8608d30?, 0x40001dc380?}, 0x4005a31af0?)
go.opentelemetry.io/contrib/instrumentation/net/http/[email protected]/handler.go:83 +0x40
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1({0x8608d30?, 0x40001dc380?}, 0x4005a31ad8?)
go.opentelemetry.io/contrib/instrumentation/net/http/[email protected]/handler.go:225 +0xf44
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP(0x4002c0d110, {0x8608d30?, 0x40001dc380}, 0x4005a3af00, {0x85aa920, 0x4001854080})
go.opentelemetry.io/collector/config/[email protected]/compression.go:147 +0x150
go.opentelemetry.io/collector/config/confighttp.(*decompressor).ServeHTTP(0x4001854080, {0x8620d30, 0x40070b4540}, 0x4005a3b000)
net/http/server.go:2514 +0x144
net/http.(*ServeMux).ServeHTTP(0x4001854080?, {0x8620d30, 0x40070b4540}, 0x4005a3b000)
net/http/server.go:2136 +0x38
net/http.HandlerFunc.ServeHTTP(0x4005a31398?, {0x8620d30?, 0x40070b4540?}, 0x0?)
go.opentelemetry.io/collector/receiver/[email protected]/otlp.go:129 +0x28
go.opentelemetry.io/collector/receiver/otlpreceiver.(*otlpReceiver).startHTTPServer.func1({0x8620d30?, 0x40070b4540?}, 0x6422920?)
go.opentelemetry.io/collector/receiver/[email protected]/otlphttp.go:43 +0xb0
go.opentelemetry.io/collector/receiver/otlpreceiver.handleTraces({0x8620d30, 0x40070b4540}, 0x4005a3b000, 0x400521c4e0?)
go.opentelemetry.io/collector/receiver/[email protected]/internal/trace/otlp.go:42 +0xa4
go.opentelemetry.io/collector/receiver/otlpreceiver/internal/trace.(*Receiver).Export(0x400212ca38, {0x862ab78, 0x400521c630}, {0x400614e510?, 0x400ee82104?})
go.opentelemetry.io/[email protected]/internal/fanoutconsumer/traces.go:60 +0x208
go.opentelemetry.io/collector/internal/fanoutconsumer.(*tracesConsumer).ConsumeTraces(0x4002c82c60, {0x862ab78, 0x400521c690}, {0x400614e510?, 0x400ee82104?})
go.opentelemetry.io/collector/[email protected]/traces.go:25
go.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces(...)
go.opentelemetry.io/collector/[email protected]/traces.go:25
go.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces(...)
go.opentelemetry.io/collector/[email protected]/processorhelper/traces.go:60 +0x1c0
go.opentelemetry.io/collector/processor/processorhelper.NewTracesProcessor.func1({0x862ab78, 0x400521c690}, {0x400614e510?, 0x400ee82104?})
go.opentelemetry.io/collector/[email protected]/traces.go:25
go.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces(...)
go.opentelemetry.io/collector/[email protected]/processorhelper/traces.go:60 +0x1c0
go.opentelemetry.io/collector/processor/processorhelper.NewTracesProcessor.func1({0x862ab78, 0x400521c690}, {0x400614e510?, 0x400ee82104?})
go.opentelemetry.io/collector/[email protected]/traces.go:25
go.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces(...)
go.opentelemetry.io/collector/[email protected]/processorhelper/traces.go:60 +0x1c0
go.opentelemetry.io/collector/processor/processorhelper.NewTracesProcessor.func1({0x862ab78, 0x400521c690}, {0x400614e510?, 0x400ee82104?})
go.opentelemetry.io/collector/[email protected]/traces.go:25
go.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces(...)
go.opentelemetry.io/collector/[email protected]/processorhelper/traces.go:60 +0x1c0
go.opentelemetry.io/collector/processor/processorhelper.NewTracesProcessor.func1({0x862ab78, 0x400521c690}, {0x400614e510?, 0x400ee82104?})
go.opentelemetry.io/collector/[email protected]/traces.go:25
go.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces(...)
go.opentelemetry.io/collector/[email protected]/processorhelper/traces.go:60 +0x1c0
go.opentelemetry.io/collector/processor/processorhelper.NewTracesProcessor.func1({0x862ab78, 0x400521c690}, {0x400614e510?, 0x400ee82104?})
go.opentelemetry.io/collector/[email protected]/traces.go:25
go.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces(...)
go.opentelemetry.io/collector/[email protected]/processorhelper/traces.go:60 +0x1c0
go.opentelemetry.io/collector/processor/processorhelper.NewTracesProcessor.func1({0x862ab78, 0x400521c690}, {0x400614e510?, 0x400ee82104?})
go.opentelemetry.io/collector/[email protected]/traces.go:25
go.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces(...)
go.opentelemetry.io/collector/[email protected]/processorhelper/traces.go:60 +0x1c0
go.opentelemetry.io/collector/processor/processorhelper.NewTracesProcessor.func1({0x862ab78, 0x400521c690}, {0x400614e510?, 0x400ee82104?})
go.opentelemetry.io/collector/[email protected]/traces.go:25
go.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces(...)
go.opentelemetry.io/collector/[email protected]/processorhelper/traces.go:60 +0x1c0
go.opentelemetry.io/collector/processor/processorhelper.NewTracesProcessor.func1({0x862ab78, 0x400521c690}, {0x400614e510?, 0x400ee82104?})
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/[email protected]/trace_exporter.go:121 +0x160
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter.(*traceExporterImp).ConsumeTraces(0x4002c0f170, {0x862ab78, 0x400521c690}, {0x400614e510?, 0x400ee82104?})
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/[email protected]/trace_exporter.go:134 +0x16c
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter.(*traceExporterImp).consumeTrace(0x40055565b8?, {0x862ab78, 0x400521c690}, 0xa?)
go.opentelemetry.io/collector/[email protected]/traces.go:25
go.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces(...)
go.opentelemetry.io/collector/[email protected]/exporterhelper/traces.go:99 +0xb4
go.opentelemetry.io/collector/exporter/exporterhelper.NewTracesExporter.func1({0x862ab78, 0x400521c690}, {0x400614f3f8?, 0x4013439754?})
go.opentelemetry.io/collector/[email protected]/exporterhelper/common.go:199 +0x50
go.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).send(0x401368f540, {0x862ab78?, 0x400521c690?}, {0x85dfa10?, 0x400614f758?})
go.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:154 +0xa8
go.opentelemetry.io/collector/exporter/exporterhelper.(*queueSender).send(0x400faaaf00, {0x862ab78?, 0x400521c690?}, {0x85dfa10, 0x400614f758})
go.opentelemetry.io/collector/[email protected]/internal/queue/bounded_memory_queue.go:43
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Offer(...)
runtime/panic.go:914 +0x218
panic({0x64d27e0?, 0x8591d40?})
go.opentelemetry.io/otel/[email protected]/trace/span.go:437 +0x7f8
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End(0x4003f3c900, {0x0, 0x0, 0x286b4?})
go.opentelemetry.io/otel/[email protected]/trace/span.go:405 +0x2c
go.opentelemetry.io/otel/sdk/trace.(*recordingSpan).End.func1()
runtime/panic.go:920 +0x26c
panic({0x64d27e0?, 0x8591d40?})
net/http/server.go:1868 +0xb0
net/http.(*conn).serve.func1()
goroutine 419349 [running]:
2024/02/26 12:34:34 http: panic serving 10.0.58.206:49066: send on closed channel

Additional context

My guess is that the k8s resolver doesn't shut down exporters properly?
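
For context on the `send on closed channel` message in the log: in Go, sending on a channel that has already been closed always panics. The sketch below reproduces only that failure mode in isolation. The `trySend` helper and the `queue` channel are hypothetical names for illustration and are not collector code.

```go
package main

import "fmt"

// trySend attempts to send v on ch and reports whether the send panicked.
// Hypothetical helper, used only to make the panic observable.
func trySend(ch chan int, v int) (panicked bool, msg string) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
			msg = fmt.Sprint(r)
		}
	}()
	ch <- v
	return false, ""
}

func main() {
	queue := make(chan int, 1)
	close(queue) // stands in for a queue that was shut down

	// A producer that still holds a reference to the queue sends after the close.
	panicked, msg := trySend(queue, 42)
	fmt.Println(panicked, msg)
}
```

In the stack trace the send happens inside the queue's `Offer` call, which would be consistent with data reaching a queue after it was closed.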

@grzn grzn added bug Something isn't working needs triage New item requiring triage labels Feb 26, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jpkrohling
Member

Is this only happening with the k8s resolver? Can you try the DNS resolver instead and report back?

@jpkrohling
Member

@kentquirk, is this something you could take a look at?

@MrAlias
Contributor

MrAlias commented Feb 26, 2024

Looks related to open-telemetry/opentelemetry-go-contrib#4895.

@crobert-1
Member

> Looks related to open-telemetry/opentelemetry-go-contrib#4895.

I don't believe that's the issue here. From the attached logs it looks like the core dependency is at v0.94.1, which reverted the dependency to a version unaffected by that issue.

@crobert-1
Member

#31050 potentially resolves this issue.

Currently in main, the load balancer starts the resolver (the k8s resolver in this case), but the exporter's Shutdown method never shuts the load balancer down. This PR makes the exporter properly call shutdown on its load balancer, regardless of resolver type. The PR also shuts down the load balancer for the traces, metrics, and logs exporters, which is not currently done either.

@dmitryax
Member

@grzn, is this something you started seeing in 0.94.0, or had you not tried the loadbalancing exporter before?

@grzn
Contributor Author

grzn commented Feb 27, 2024

This isn't new in v0.94.0.
We also saw it in v0.91.0, which is the first version in which we used the load balancing exporter.

@dmitryax
Member

@crobert-1 I think the problem is a bit different here. The data is being sent to an exporter that was shut down. So it must be some desynchronisation between routing and tracking the list of active exporters

@dmitryax
Member

#31456 should resolve the panic

@crobert-1 crobert-1 removed the needs triage New item requiring triage label Feb 28, 2024
@grzn
Contributor Author

grzn commented Feb 28, 2024

Nice!
Thanks @dmitryax, I'll try this once it is merged and released.

Were you able to reproduce this panic in a UT?

@dmitryax
Member

> Were you able to reproduce this panic in a UT?

I wasn't, but it became pretty clear to me after looking at the code.

@dmitryax
Member

@grzn, if you have a test cluster where you can try the build from the branch, that would be great. I can help you push the image if needed. Building it takes just one command: `make docker-otelcontribcol`.

@grzn
Contributor Author

grzn commented Feb 29, 2024

@dmitryax I have clusters to test this on, but I need a tagged image.

@grzn
Contributor Author

grzn commented Feb 29, 2024

Maybe you can simulate this in a UT by sending the traces to a dummy gRPC server that sleeps?

@dmitryax
Member

Ok, I've built an amd64 linux image from the branch and pushed it to danoshin276/otelcontribcol:lb-fixed-1. Let me know if you need an arm64 image instead. The executable is under /otelcontribcol.

I'll try to reproduce it in a test in the meantime

@grzn
Contributor Author

grzn commented Mar 4, 2024

@dmitryax I need both the arm64 and amd64 images; once you publish them I'll give it a try.

@grzn
Contributor Author

grzn commented Mar 5, 2024

I ended up compiling from your branch; deploying it now.

jpkrohling pushed a commit that referenced this issue Mar 5, 2024
Fix panic when a sub-exporter is shut down while still handling
requests. This change wraps exporters with an additional wait group
to ensure that exporters are shut down only after they finish processing
data.

Fixes
#31410

It has some small related refactoring changes. I can extract them in
separate PRs if needed.
@grzn
Contributor Author

grzn commented Mar 5, 2024

Okay so after restarting the deployment/collector, the daemonset/agent did not panic, but our backend pods show these errors:

traces export: context deadline exceeded: rpc error: code = DeadlineExceeded desc = context deadline exceeded

the metrics show there are no backends

# TYPE otelcol_loadbalancer_num_backends gauge
otelcol_loadbalancer_num_backends{resolver="k8s",service_instance_id="0b8491ac-21b9-4935-b3f1-1f21f7e620c0",service_name="otelcontribcol",service_version="0.96.0-dev"} 0

and the logs show

2024-03-05 11:06:03.886Z {"level":"warn","ts":1709636763.8864968,"caller":"zapgrpc/zapgrpc.go:193","msg":"[core] [Channel #95 SubChannel #96] grpc: addrConn.createTransport failed to connect to {Addr: \"10.0.29.198:4317\", ServerName: \"10.0.29.198:4317\", }. Err: connection error: desc = \"transport: Error while dialing: dial tcp 10.0.29.198:4317: connect: connection refused\"","service":"opentelemetry-agent","grpc_log":true}
2024-03-05 11:06:03.896Z {"level":"warn","ts":1709636763.8962934,"caller":"zapgrpc/zapgrpc.go:193","msg":"[core] [Channel #114 SubChannel #115] grpc: addrConn.createTransport failed to connect to {Addr: \"10.0.47.151:4317\", ServerName: \"10.0.47.151:4317\", }. Err: connection error: desc = \"transport: Error while dialing: dial tcp 10.0.47.151:4317: connect: connection refused\"","service":"opentelemetry-agent","grpc_log":true}
2024-03-05 11:06:03.908Z {"level":"warn","ts":1709636763.908809,"caller":"zapgrpc/zapgrpc.go:193","msg":"[core] [Channel #126 SubChannel #127] grpc: addrConn.createTransport failed to connect to {Addr: \"10.0.59.158:4317\", ServerName: \"10.0.59.158:4317\", }. Err: connection error: desc = \"transport: Error while dialing: dial tcp 10.0.59.158:4317: connect: connection refused\"","service":"opentelemetry-agent","grpc_log":true}
2024-03-05 11:06:04.898Z {"level":"warn","ts":1709636764.898518,"caller":"zapgrpc/zapgrpc.go:193","msg":"[core] [Channel #114 SubChannel #115] grpc: addrConn.createTransport failed to connect to {Addr: \"10.0.47.151:4317\", ServerName: \"10.0.47.151:4317\", }. Err: connection error: desc = \"transport: Error while dialing: dial tcp 10.0.47.151:4317: connect: connection refused\"","service":"opentelemetry-agent","grpc_log":true}
2024-03-05 11:06:06.780Z {"level":"warn","ts":1709636766.7802808,"caller":"zapgrpc/zapgrpc.go:193","msg":"[core] [Channel #114 SubChannel #115] grpc: addrConn.createTransport failed to connect to {Addr: \"10.0.47.151:4317\", ServerName: \"10.0.47.151:4317\", }. Err: connection error: desc = \"transport: Error while dialing: dial tcp 10.0.47.151:4317: connect: connection refused\"","service":"opentelemetry-agent","grpc_log":true}
2024-03-05 11:06:24.910Z {"level":"warn","ts":1709636784.9104936,"caller":"zapgrpc/zapgrpc.go:193","msg":"[core] [Channel #126 SubChannel #127] grpc: addrConn.createTransport failed to connect to {Addr: \"10.0.59.158:4317\", ServerName: \"10.0.59.158:4317\", }. Err: connection error: desc = \"transport: Error while dialing: dial tcp 10.0.59.158:4317: i/o timeout\"","service":"opentelemetry-agent","grpc_log":true}
2024-03-05 11:06:46.574Z {"level":"warn","ts":1709636806.5742695,"caller":"zapgrpc/zapgrpc.go:193","msg":"[core] [Channel #126 SubChannel #127] grpc: addrConn.createTransport failed to connect to {Addr: \"10.0.59.158:4317\", ServerName: \"10.0.59.158:4317\", }. Err: connection error: desc = \"transport: Error while dialing: dial tcp 10.0.59.158:4317: i/o timeout\"","service":"opentelemetry-agent","grpc_log":true}
2024-03-05 11:06:51.959Z {"level":"warn","ts":1709636811.959513,"caller":"zapgrpc/zapgrpc.go:193","msg":"[core] [Channel #114 SubChannel #115] grpc: addrConn.createTransport failed to connect to {Addr: \"10.0.47.151:4317\", ServerName: \"10.0.47.151:4317\", }. Err: connection error: desc = \"transport: Error while dialing: dial tcp 10.0.47.151:4317: connect: no route to host\"","service":"opentelemetry-agent","grpc_log":true}
2024-03-05 11:07:08.639Z {"level":"warn","ts":1709636828.6391737,"caller":"zapgrpc/zapgrpc.go:193","msg":"[core] [Channel #126 SubChannel #127] grpc: addrConn.createTransport failed to connect to {Addr: \"10.0.59.158:4317\", ServerName: \"10.0.59.158:4317\", }. Err: connection error: desc = \"transport: Error while dialing: dial tcp 10.0.59.158:4317: i/o timeout\"","service":"opentelemetry-agent","grpc_log":true}
2024-03-05 11:07:25.169Z {"level":"warn","ts":1709636845.1695504,"caller":"zapgrpc/zapgrpc.go:193","msg":"[core] [Channel #114 SubChannel #115] grpc: addrConn.createTransport failed to connect to {Addr: \"10.0.47.151:4317\", ServerName: \"10.0.47.151:4317\", }. Err: connection error: desc = \"transport: Error while dialing: dial tcp 10.0.47.151:4317: connect: no route to host\"","service":"opentelemetry-agent","grpc_log":true}
2024-03-05 11:07:32.757Z {"level":"warn","ts":1709636852.7577434,"caller":"zapgrpc/zapgrpc.go:193","msg":"[core] [Channel #126 SubChannel #127] grpc: addrConn.createTransport failed to connect to {Addr: \"10.0.59.158:4317\", ServerName: \"10.0.59.158:4317\", }. Err: connection error: desc = \"transport: Error while dialing: dial tcp 10.0.59.158:4317: i/o timeout\"","service":"opentelemetry-agent","grpc_log":true}
2024-03-05 11:07:59.047Z {"level":"warn","ts":1709636879.047226,"caller":"zapgrpc/zapgrpc.go:193","msg":"[core] [Channel #126 SubChannel #127] grpc: addrConn.createTransport failed to connect to {Addr: \"10.0.59.158:4317\", ServerName: \"10.0.59.158:4317\", }. Err: connection error: desc = \"transport: Error while dialing: dial tcp 10.0.59.158:4317: i/o timeout\"","service":"opentelemetry-agent","grpc_log":true}
2024-03-05 11:08:14.679Z {"level":"warn","ts":1709636894.6795218,"caller":"zapgrpc/zapgrpc.go:193","msg":"[core] [Channel #114 SubChannel #115] grpc: addrConn.createTransport failed to connect to {Addr: \"10.0.47.151:4317\", ServerName: \"10.0.47.151:4317\", }. Err: connection error: desc = \"transport: Error while dialing: dial tcp 10.0.47.151:4317: connect: no route to host\"","service":"opentelemetry-agent","grpc_log":true}
2024-03-05 11:08:28.773Z {"level":"warn","ts":1709636908.773698,"caller":"zapgrpc/zapgrpc.go:193","msg":"[core] [Channel #126 SubChannel #127] grpc: addrConn.createTransport failed to connect to {Addr: \"10.0.59.158:4317\", ServerName: \"10.0.59.158:4317\", }. Err: connection error: desc = \"transport: Error while dialing: dial tcp 10.0.59.158:4317: i/o timeout\"","service":"opentelemetry-agent","grpc_log":true}
2024-03-05 11:09:14.311Z {"level":"warn","ts":1709636954.3115778,"caller":"zapgrpc/zapgrpc.go:193","msg":"[core] [Channel #126 SubChannel #127] grpc: addrConn.createTransport failed to connect to {Addr: \"10.0.59.158:4317\", ServerName: \"10.0.59.158:4317\", }. Err: connection error: desc = \"transport: Error while dialing: dial tcp 10.0.59.158:4317: i/o timeout\"","service":"opentelemetry-agent","grpc_log":true}
2024-03-05 11:09:31.399Z {"level":"warn","ts":1709636971.399436,"caller":"zapgrpc/zapgrpc.go:193","msg":"[core] [Channel #114 SubChannel #115] grpc: addrConn.createTransport failed to connect to {Addr: \"10.0.47.151:4317\", ServerName: \"10.0.47.151:4317\", }. Err: connection error: desc = \"transport: Error while dialing: dial tcp 10.0.47.151:4317: connect: no route to host\"","service":"opentelemetry-agent","grpc_log":true}

@grzn
Contributor Author

grzn commented Mar 5, 2024

Going to roll back.

@jpkrohling
Member

Can you confirm that these IPs are indeed collector instances behind your Kubernetes service named otelcontribcol? If they are, can you confirm they have OTLP receivers and that the port 4317 is exposed? Can you share the metrics from one of those collectors as well?

10.0.47.151
10.0.59.158
10.0.29.198

Do you have more pods behind the service? If so, can you share metrics about them as well?

@dmitryax
Member

dmitryax commented Mar 5, 2024

#31602 should solve the issues you see now. @grzn do you have a chance to try this branch?

dmitryax added a commit that referenced this issue Mar 7, 2024
@grzn
Contributor Author

grzn commented Mar 10, 2024

Missed your comment.
Yes, all of the IPs are pods behind the service. This happens when I restart the service, so the old pods are dead and I get no metrics from them.
I can get metrics from the new ones.

@grzn
Contributor Author

grzn commented Mar 10, 2024

I see this is merged, I'll try the main branch again this week and report back.

@grzn
Contributor Author

grzn commented Mar 12, 2024

The problem I reported last week still happens on the main branch.

Scenario:

  • pods sending traces to otel deployed as a daemonset
  • the otel daemonset uses the loadbalancing exporter and k8s resolver to send traces to an otel deployment
  • the otel deployment sends traces to a 3rd party, uses loadbalancing processor

When I restart the deployment, some of the daemonset replicas go bad:

  1. the pods sending traces to this replica fail to send traces
  2. the metric otelcol_loadbalancer_num_backends drops down to zero

In this specific cluster, the deployment replica count is 5, the daemonset replica count is 20; out of the 20 pods, 1 went bad.

So right now the situation in main is worse than before the attempted fix.

DougManton pushed a commit to DougManton/opentelemetry-collector-contrib that referenced this issue Mar 13, 2024
DougManton pushed a commit to DougManton/opentelemetry-collector-contrib that referenced this issue Mar 13, 2024
XinRanZhAWS pushed a commit to XinRanZhAWS/opentelemetry-collector-contrib that referenced this issue Mar 13, 2024
XinRanZhAWS pushed a commit to XinRanZhAWS/opentelemetry-collector-contrib that referenced this issue Mar 13, 2024
5 participants