AKC timeout using Thanos #7874

Open
antikilahdjs opened this issue Oct 31, 2024 · 1 comment

antikilahdjs commented Oct 31, 2024

Thanos with Memcached enabled, plus MinIO as long-term storage

Thanos, Prometheus and Golang version used:

Object Storage Provider: S3 (MinIO)

What happened:
I have configured Thanos alongside Memcached, but I cannot get rid of an error that appears whenever a query covers more than 2 days. I get the error below:

receive series from Addr: 10.233.117.207:10901 LabelSets: {prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-0"},{prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-1"},{prometheus="kubesphere-monitoring-system/k8s"} MinTime: 1727308800000 MaxTime: 1730368800000: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
receive series from Addr: 10.233.116.94:10901 LabelSets: {prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-0"},{prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-1"},{prometheus="kubesphere-monitoring-system/k8s"} MinTime: 1727308800000 MaxTime: 1730368800000: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
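For anyone decoding the error: the MinTime and MaxTime values are Unix timestamps in milliseconds. As far as I can tell they describe the time range advertised by each store endpoint rather than the range of the failing query. A minimal sketch, using only the Python standard library, converts them to UTC:

```python
from datetime import datetime, timedelta, timezone


def ms_to_utc(ms: int) -> datetime:
    """Convert a Unix timestamp in milliseconds to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Values copied from the error message above.
MIN_TIME_MS = 1727308800000
MAX_TIME_MS = 1730368800000

min_time = ms_to_utc(MIN_TIME_MS)
max_time = ms_to_utc(MAX_TIME_MS)
span = max_time - min_time

print(min_time.isoformat())  # 2024-09-26T00:00:00+00:00
print(max_time.isoformat())  # 2024-10-31T10:00:00+00:00
print(span)                  # 35 days, 10:00:00
```

Both endpoints report the same window of roughly 35 days, ending on the day the issue was opened.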

My Thanos Store:

args:
            - store
            - '--log.level=info'
            - '--log.format=logfmt'
            - '--data-dir=/var/thanos/store'
            - '--grpc-address=0.0.0.0:10901'
            - '--http-address=0.0.0.0:10902'
            - '--objstore.config=$(OBJSTORE_CONFIG)'
            - '--ignore-deletion-marks-delay=24h'
            - '--block-sync-concurrency=120'
            - '--sync-block-duration=60m'
            - '--index-cache-size=4096MB'
            - '--chunk-pool-size=4GB'
            - '--store.grpc.series-max-concurrency=300'
            - '--consistency-delay=30m'
            - |-
              --index-cache.config="config":
                "addresses":
                - "thanos-memcached-service.thanos:11211"
                "dns_provider_update_interval": "60s"
                "max_async_buffer_size": 0
                "max_async_concurrency": 1000
                "max_get_multi_batch_size": 0
                "max_get_multi_concurrency": 0
                "max_idle_connections": 400
                "max_item_size": 0
                "timeout": "180s"
              "type": "MEMCACHED"
            - |-
              --store.caching-bucket.config="blocks_iter_ttl": "720h"
              "chunk_object_attrs_ttl": "720h"
              "chunk_subrange_size": 128000
              "chunk_subrange_ttl": "720h"
              "config":
                "addresses":
                - "thanos-memcached-service.thanos:11211"
                "dns_provider_update_interval": "60s"
                "max_async_buffer_size": 0
                "max_async_concurrency": 1000
                "max_get_multi_batch_size": 0
                "max_get_multi_concurrency": 0
                "max_idle_connections": 400
                "max_item_size": 0
                "timeout": "180s"
              "max_chunks_get_range_requests": 3
              "metafile_content_ttl": "720h"
              "metafile_doesnt_exist_ttl": "1h"
              "metafile_exists_ttl": "720h"
              "metafile_max_size": "4MiB"
              "type": "MEMCACHED"
            - |-
              --tracing.config="config":
                "sampler_param": 2
                "sampler_type": "ratelimiting"
                "service_name": "thanos-store"
              "type": "JAEGER"

My Thanos Query Frontend:

args:
            - query-frontend
            - '--enable-auto-gomemlimit'
            - '--log.level=info'
            - '--log.format=logfmt'
            - '--query-frontend.compress-responses'
            - '--http-address=0.0.0.0:9090'
            - >-
              --query-frontend.downstream-url=http://thanos-query.thanos.svc.cluster.local.:9090
            - '--query-range.split-interval=24h'
            - '--labels.split-interval=12h'
            - '--query-range.max-retries-per-request=100'
            - '--labels.max-retries-per-request=25'
            - '--query-frontend.log-queries-longer-than=0'
            - '--query-range.max-query-parallelism=120'
            - '--query-frontend.vertical-shards=0'
            - '--cache-compression-type='
            - '--query-frontend.downstream-tripper-config={"response_header_timeout": "5m", "max_idle_conns_per_host": 100}'
            - |-
              --query-range.response-cache-config="config":
                "addresses":
                - "thanos-memcached-service.thanos:11211"
                "dns_provider_update_interval": "30s"
                "max_async_buffer_size": 0
                "max_async_concurrency": 1000
                "max_get_multi_batch_size": 0
                "max_get_multi_concurrency": 0
                "max_idle_connections": 400
                "timeout": "180s"
                "expiration": "720h"
              "type": "MEMCACHED"
            - |-
              --labels.response-cache-config="config":
                "addresses":
                - "thanos-memcached-service.thanos:11211"
                "dns_provider_update_interval": "30s"
                "max_async_buffer_size": 0
                "max_async_concurrency": 1000
                "max_get_multi_batch_size": 0
                "max_get_multi_concurrency": 0
                "max_idle_connections": 400
                "timeout": "180s"
                "expiration": "720h"
              "type": "MEMCACHED"
            - |-
              --tracing.config="config":
                "sampler_param": 2
                "sampler_type": "ratelimiting"
                "service_name": "thanos-query-frontend"
              "type": "JAEGER"

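To get a feel for what `--query-range.split-interval=24h` does to a long query, here is a rough sketch. It assumes the frontend cuts a range query at multiples of the split interval (midnight UTC for 24h). `split_range` and the example dates are made up for illustration and are not Thanos code; the retry limit is the value from the args above.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def split_range(start: datetime, end: datetime, interval: timedelta):
    """Split [start, end) into pieces that end on multiples of `interval`.

    Assumption: boundaries are aligned to the Unix epoch, so a 24h interval
    cuts at midnight UTC.
    """
    parts = []
    cur = start
    while cur < end:
        # First aligned boundary strictly after `cur`.
        next_boundary = EPOCH + ((cur - EPOCH) // interval + 1) * interval
        nxt = min(next_boundary, end)
        parts.append((cur, nxt))
        cur = nxt
    return parts


SPLIT_INTERVAL = timedelta(hours=24)  # --query-range.split-interval=24h
MAX_RETRIES = 100                     # --query-range.max-retries-per-request=100

# Hypothetical 30-day query ending at 2024-10-31 10:00 UTC.
end = datetime(2024, 10, 31, 10, tzinfo=timezone.utc)
start = end - timedelta(days=30)

parts = split_range(start, end, SPLIT_INTERVAL)
print(len(parts))                # 31
# Rough upper bound, assuming each sub-query may be tried up to MAX_RETRIES times.
print(len(parts) * MAX_RETRIES)  # 3100
```

Under these assumptions a 30-day panel turns into about 31 downstream queries, and a high retry limit can multiply the load on the Store when those sub-queries keep timing out.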
My Prometheus:

containers:
    - args:
        - '--web.console.templates=/etc/prometheus/consoles'
        - '--web.console.libraries=/etc/prometheus/console_libraries'
        - '--storage.tsdb.retention.time=12h'
        - '--config.file=/etc/prometheus/config_out/prometheus.env.yaml'
        - '--storage.tsdb.path=/prometheus'
        - '--web.enable-lifecycle'
        - '--web.enable-admin-api'
        - '--web.route-prefix=/'
        - '--web.config.file=/etc/prometheus/web_config/web-config.yaml'
        - '--storage.tsdb.max-block-duration=2h'
        - '--storage.tsdb.min-block-duration=2h'
        - '--web.max-connections=8096'
        - '--query.max-concurrency=60'
      image: 'prom/prometheus:v2.49.1'

What you expected to happen:

My Prometheus has 6h of retention, but if I query a range longer than that I get the error mentioned above.
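The link between retention and the error can be sketched like this. The assumption is that samples older than the Prometheus retention are no longer available locally, so that part of a query has to be served from object storage through the Store. `object_storage_span` is a hypothetical helper, and the retention is a parameter because the args above use 12h while the text mentions 6h.

```python
from datetime import datetime, timedelta, timezone


def object_storage_span(query_start: datetime, query_end: datetime,
                        now: datetime, retention: timedelta) -> timedelta:
    """Length of the query range that is older than the local retention.

    Assumption: everything older than `now - retention` can only come from
    object storage.
    """
    cutoff = now - retention
    older_end = min(query_end, cutoff)
    return max(older_end - query_start, timedelta(0))


now = datetime(2024, 10, 31, 10, tzinfo=timezone.utc)  # example timestamp
two_days_ago = now - timedelta(days=2)

print(object_storage_span(two_days_ago, now, now, timedelta(hours=12)))  # 1 day, 12:00:00
print(object_storage_span(two_days_ago, now, now, timedelta(hours=6)))   # 1 day, 18:00:00
print(object_storage_span(now - timedelta(hours=3), now, now, timedelta(hours=12)))  # 0:00:00
```

With either retention value, most of a 2-day query depends on the Store, which matches where the keepalive errors show up.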

How to reproduce it (as minimally and precisely as possible):

Full logs to relevant components:

receive series from Addr: 10.233.117.207:10901 LabelSets: {prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-0"},{prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-1"},{prometheus="kubesphere-monitoring-system/k8s"} MinTime: 1727308800000 MaxTime: 1730368800000: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout
receive series from Addr: 10.233.116.94:10901 LabelSets: {prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-0"},{prometheus="kubesphere-monitoring-system/k8s", prometheus_replica="prometheus-k8s-1"},{prometheus="kubesphere-monitoring-system/k8s"} MinTime: 1727308800000 MaxTime: 1730368800000: rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout

Anything else we need to know:

ts=2024-08-22T04:15:02.506236929Z caller=memcached_client.go:438 level=warn name=index-cache msg="failed to fetch items from memcached" numKeys=1 firstKey=EP:01J5TQ7GTAK7JFP1SDHAZQABMB:NskVASoO0H1CJRIx74k3hIBPzIM6wCRkKvWOjc9V3Dg:dss err="write tcp 10.233.66.17:47668->10.233.31.160:11211: write: connection timed out"
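The memcached warning above is a TCP write timeout, so a quick connectivity probe from inside the Store pod may help narrow things down. This sketch uses only the Python standard library and memcached's text-protocol `version` command. `probe_memcached` and `parse_version_reply` are illustrative names, and the host in the commented call is the service address from the config above.

```python
import socket
import time


def parse_version_reply(reply: bytes) -> str:
    """Extract the version string from a memcached 'VERSION x.y.z' reply."""
    text = reply.decode("ascii", errors="replace").strip()
    prefix = "VERSION "
    if not text.startswith(prefix):
        raise ValueError(f"unexpected reply: {text!r}")
    return text[len(prefix):]


def probe_memcached(host: str, port: int = 11211, timeout_s: float = 2.0):
    """Connect, send 'version', and return (version, round_trip_seconds).

    Raises OSError on connection problems (including timeouts) and
    ValueError if the reply is not a VERSION line.
    """
    started = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout_s) as conn:
        conn.sendall(b"version\r\n")
        reply = conn.recv(128)
    return parse_version_reply(reply), time.monotonic() - started


# Example call, to be run from a pod that can reach the service:
# print(probe_memcached("thanos-memcached-service.thanos"))
```

If this probe also times out intermittently, the problem is more likely in the network path or in memcached itself than in the Thanos cache settings.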

Environment:

  • OS (e.g. from /etc/os-release): RedHat 8.5
  • Kernel (e.g. uname -a): 4.8
  • Others: Kubernetes


Could you please help me to understand what I did wrong?


dosubot bot commented Oct 31, 2024

The error you're encountering, "keepalive ping failed to receive ACK within timeout," suggests that the gRPC keepalive settings might need adjustment. Thanos uses keepalive pings to ensure that connections remain active, and if these pings fail, it can lead to the errors you're seeing.

Here are some steps you can take to address this issue:

  1. Adjust Keepalive Settings: You can modify the keepalive settings in your Thanos configuration. The default settings for keepalive pings in Thanos are:

    • Time: 10 seconds
    • Timeout: 5 seconds

    You might want to increase these values to allow more time for the keepalive pings to be acknowledged. For example, you could set:

    grpc:
      keepalive:
        time: 30s
        timeout: 15s
  2. Check Network Stability: Ensure that your network is stable and that there are no intermittent connectivity issues between Thanos components and Memcached.

  3. Review Memcached Configuration: The warning about failing to fetch items from Memcached could indicate a configuration issue. Consider the following adjustments:

    • Increase max_async_buffer_size and max_get_multi_batch_size to handle more concurrent requests.
    • Ensure that the Memcached instance has sufficient resources and is not overloaded.
  4. Resource Allocation: Verify that your Thanos Store and Query components have enough CPU and memory resources allocated to handle the load, especially when querying large time ranges.

  5. Monitor and Benchmark: Use monitoring tools to observe the performance of your Thanos setup and run benchmarks to identify any bottlenecks.

These steps should help you address the keepalive timeout issue and improve the stability of your Thanos setup with Memcached and MinIO. If the problem persists, further investigation into network and resource configurations may be necessary.
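The keepalive behaviour from step 1 can be illustrated with a toy model. This is not Thanos or gRPC code: it only encodes the general rule that a ping is sent after `time` of inactivity and the connection is closed if no ACK arrives within `timeout`. The `Keepalive` class and both helpers are hypothetical, and the numbers are the ones quoted in the comment above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keepalive:
    time_s: float     # idle time on the connection before a ping is sent
    timeout_s: float  # how long the pinger waits for the ACK


def survives_stall(stall_s: float, ka: Keepalive) -> bool:
    """True if a peer that is unresponsive for `stall_s` seconds after the
    ping still ACKs within the keepalive timeout."""
    return stall_s <= ka.timeout_s


def worst_case_detection_s(ka: Keepalive) -> float:
    """Longest time a dead, idle connection can go unnoticed in this model."""
    return ka.time_s + ka.timeout_s


default = Keepalive(time_s=10, timeout_s=5)   # defaults quoted in the comment
relaxed = Keepalive(time_s=30, timeout_s=15)  # values suggested in the comment

print(survives_stall(8, default))       # False
print(survives_stall(8, relaxed))       # True
print(worst_case_detection_s(relaxed))  # 45
```

In this model a Store that stalls for 8 seconds (for example under heavy load) loses the connection with a 5s timeout but keeps it with 15s, at the cost of noticing truly dead connections later.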
