
v1.2 debian-elasticsearch crashes elasticsearch (100% CPU Usage) with buffer overflow error #151

Closed
calinah opened this issue Jun 18, 2018 · 2 comments


calinah commented Jun 18, 2018

Deploying on Kubernetes 1.10 on GKE using fluent/fluentd-kubernetes-daemonset.

Deploying with v0.12-debian-elasticsearch works great, as it has in the past. However, switching to v1.2-debian-elasticsearch drives Elasticsearch CPU usage towards 99%, and every fluentd Pod on every node reports a buffer overflow error.
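
For reference, switching versions is just a change of the image tag on the fluentd DaemonSet. A minimal excerpt of roughly what that looks like in my manifest (the DaemonSet name, namespace, and env values here are illustrative, not necessarily the upstream defaults):

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: fluentd                 # illustrative name/namespace
      namespace: efk
    spec:
      template:
        spec:
          containers:
          - name: fluentd
            # switching this tag between v0.12-debian-elasticsearch and
            # v1.2-debian-elasticsearch is what changes between the two runs
            image: fluent/fluentd-kubernetes-daemonset:v1.2-debian-elasticsearch
            env:
            - name: FLUENT_ELASTICSEARCH_HOST
              value: "efk-elasticsearch"
            - name: FLUENT_ELASTICSEARCH_PORT
              value: "9200"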

Fluentd log for v0.12:

2018-06-18 08:57:54 +0000 [info]: following tail of /var/log/containers/rethinkdb-data-0_rethinkdb_rethinkdb-data-6795097f453c7259226587f5e3ffed157393097a9e0a64077beb42b370115442.log
2018-06-18 08:57:54 +0000 [info]: following tail of /var/log/kube-proxy.log
2018-06-18 08:57:59 +0000 [info]: Connection opened to Elasticsearch cluster => {:host=>"efk-elasticsearch", :port=>9200, :scheme=>"http", :user=>"", :password=>"obfuscated"}
2018-06-18 08:58:56 +0000 [info]: stats - namespace_cache_size: 2, pod_cache_size: 2, pod_cache_watch_misses: 44, pod_cache_watch_updates: 1, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-18 08:59:56 +0000 [info]: stats - namespace_cache_size: 2, pod_cache_size: 2, pod_cache_watch_misses: 44, pod_cache_watch_updates: 1, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-18 09:00:56 +0000 [info]: stats - namespace_cache_size: 2, pod_cache_size: 2, pod_cache_watch_misses: 70, pod_cache_watch_delete_ignored: 4, pod_cache_watch_ignored: 4, pod_cache_watch_updates: 1, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-18 09:01:56 +0000 [info]: stats - namespace_cache_size: 2, pod_cache_size: 2, pod_cache_watch_misses: 86, pod_cache_watch_delete_ignored: 13, pod_cache_watch_ignored: 4, pod_cache_watch_updates: 1, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3

Elasticsearch CPU Usage using fluentd v0.12:
[screenshot: Elasticsearch CPU usage graph, 2018-06-18 10:25]

Fluent.conf used for both v0.12 and v1.2:

    # AUTOMATICALLY GENERATED
    # DO NOT EDIT THIS FILE DIRECTLY, USE /templates/conf/fluent.conf.erb

    @include systemd.conf
    @include kubernetes.conf

    <match **>
      @type elasticsearch
      @id out_es
      log_level info
      include_tag_key true
      host "#{ENV['FLUENT_ELASTICSEARCH_HOST']}"
      port "#{ENV['FLUENT_ELASTICSEARCH_PORT']}"
      scheme "#{ENV['FLUENT_ELASTICSEARCH_SCHEME'] || 'http'}"
      ssl_verify "#{ENV['FLUENT_ELASTICSEARCH_SSL_VERIFY'] || 'true'}"
      user "#{ENV['FLUENT_ELASTICSEARCH_USER']}"
      password "#{ENV['FLUENT_ELASTICSEARCH_PASSWORD']}"
      reload_connections "#{ENV['FLUENT_ELASTICSEARCH_RELOAD_CONNECTIONS'] || 'true'}"
      logstash_prefix "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_PREFIX'] || 'fluentd'}"
      logstash_format true
      <buffer>
        flush_thread_count 8
        flush_interval 10s
        chunk_limit_size 8MB
        queue_limit_length 256
        retry_max_interval 30
        retry_forever false
        overflow_action drop_oldest_chunk
      </buffer>
    </match>

Fluentd log after switching to v1.2:

2018-06-18 13:36:22 +0000 [info]: #0 [out_es] Connection opened to Elasticsearch cluster => {:host=>"efk-elasticsearch", :port=>9200, :scheme=>"http", :user=>"", :password=>"obfuscated"}
2018-06-18 13:36:28 +0000 [info]: #0 [filter_kube_metadata] stats - namespace_cache_size: 2, pod_cache_size: 3, namespace_cache_api_updates: 4, pod_cache_api_updates: 4, id_cache_miss: 4
2018-06-18 13:36:31 +0000 [warn]: #0 [out_es] failed to flush the buffer. retry_time=1 next_retry_seconds=2018-06-18 13:35:42 +0000 chunk="56eea9eca144e3d02a40041c3759df77" error_class=Fluent::Plugin::ElasticsearchOutput::ConnectionFailure error="Could not push logs to Elasticsearch after 2 retries. read timeout reached"
  2018-06-18 13:36:31 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluent-plugin-elasticsearch-2.10.3/lib/fluent/plugin/out_elasticsearch.rb:530:in `rescue in send_bulk'
  2018-06-18 13:36:31 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluent-plugin-elasticsearch-2.10.3/lib/fluent/plugin/out_elasticsearch.rb:512:in `send_bulk'
  2018-06-18 13:36:31 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluent-plugin-elasticsearch-2.10.3/lib/fluent/plugin/out_elasticsearch.rb:414:in `write'
  2018-06-18 13:36:31 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.2.2/lib/fluent/plugin/output.rb:1099:in `try_flush'
  2018-06-18 13:36:31 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.2.2/lib/fluent/plugin/output.rb:1378:in `flush_thread_run'
  2018-06-18 13:36:31 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.2.2/lib/fluent/plugin/output.rb:440:in `block (2 levels) in start'
  2018-06-18 13:36:31 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.2.2/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2018-06-18 13:36:36 +0000 [warn]: #0 [out_es] Could not push logs to Elasticsearch, resetting connection and trying again. read timeout reached
2018-06-18 13:36:40 +0000 [info]: #0 [out_es] Connection opened to Elasticsearch cluster => {:host=>"efk-elasticsearch", :port=>9200, :scheme=>"http", :user=>"", :password=>"obfuscated"}
2018-06-18 13:36:46 +0000 [warn]: #0 [out_es] Could not push logs to Elasticsearch, resetting connection and trying again. read timeout reached
2018-06-18 13:36:48 +0000 [info]: #0 [out_es] Connection opened to Elasticsearch cluster => {:host=>"efk-elasticsearch", :port=>9200, :scheme=>"http", :user=>"", :password=>"obfuscated"}
2018-06-18 13:36:48 +0000 [warn]: #0 [out_es] failed to flush the buffer. retry_time=2 next_retry_seconds=2018-06-18 13:36:33 +0000 chunk="56eea9f65f83fc511094dc866c867942" error_class=Fluent::Plugin::ElasticsearchOutput::ConnectionFailure error="Could not push logs to Elasticsearch after 2 retries. read timeout reached"
  2018-06-18 13:36:48 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluent-plugin-elasticsearch-2.10.3/lib/fluent/plugin/out_elasticsearch.rb:530:in `rescue in send_bulk'

Elasticsearch log after switch to v1.2:

[2018-06-18T13:33:51,406][INFO ][o.e.c.m.MetaDataCreateIndexService] [efk-elasticsearch-0] [fluentd-2018.06.02] creating index, cause [auto(bulk api)], templates [], shards [5]/[1], mappings []
[2018-06-18T13:33:51,872][INFO ][o.e.c.m.MetaDataMappingService] [efk-elasticsearch-0] [fluentd-2018.06.02/qQEQTX1tQGiFJDfqtENVRg] create_mapping [fluentd]
[2018-06-18T13:33:52,692][INFO ][o.e.c.r.a.AllocationService] [efk-elasticsearch-0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[fluentd-2018.06.02][2]] ...]).
[2018-06-18T13:33:52,949][INFO ][o.e.c.m.MetaDataMappingService] [efk-elasticsearch-0] [fluentd-2018.06.18/97llyX3DSoCYyOVgTAroFQ] update_mapping [fluentd]
[2018-06-18T13:33:57,641][INFO ][o.e.m.j.JvmGcMonitorService] [efk-elasticsearch-0] [gc][248707] overhead, spent [278ms] collecting in the last [1s]
[2018-06-18T13:33:59,642][INFO ][o.e.m.j.JvmGcMonitorService] [efk-elasticsearch-0] [gc][248709] overhead, spent [483ms] collecting in the last [1s]
[2018-06-18T13:34:04,689][INFO ][o.e.m.j.JvmGcMonitorService] [efk-elasticsearch-0] [gc][248714] overhead, spent [264ms] collecting in the last [1s]

Elasticsearch CPU usage after switch:
[screenshot: Elasticsearch CPU usage graph, 2018-06-18 14:38]

Elasticsearch config:

cluster:
  name: ${CLUSTER_NAME}

node:
  master: ${IS_MASTER_NODE}
  data: ${IS_DATA_NODE}
  name: ${POD_NAME}

network:
  host: 0.0.0.0
transport:
  tcp:
    port: ${TRANSPORT_PORT}
http:
  port: ${HTTP_PORT}

path:
  data: /data/data
  logs: /data/log

bootstrap:
  memory_lock: ${MEMORY_LOCK}

thread_pool:
  bulk:
    queue_size: 2000

discovery:
  zen:
    hosts_provider: kubernetes
    minimum_master_nodes: 2

cloud:
  kubernetes:
    service: efk-elasticsearch-headless
    namespace: efk

It's worth mentioning that I've tried flush_interval values of 10s, 30s, and 60s in the fluentd config, and a thread_pool.bulk.queue_size of 1000 on the Elasticsearch side as well. I've also checked that fluentd can connect to the Elasticsearch cluster, and it can on both versions.
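
To make those attempts concrete, the variants were just value changes on the configs already shown above, roughly:

    # fluentd: <buffer> section inside the <match **> block above, everything else unchanged
    <buffer>
      flush_thread_count 8
      flush_interval 30s        # also tried 10s and 60s
      chunk_limit_size 8MB
      queue_limit_length 256
      retry_max_interval 30
      retry_forever false
      overflow_action drop_oldest_chunk
    </buffer>

    # elasticsearch: thread_pool section of the config above, everything else unchanged
    thread_pool:
      bulk:
        queue_size: 1000        # tried in addition to the 2000 shown above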

The Elasticsearch version is 6.1.2. I'm wondering if this is similar to what has been reported in issue #150.

Also, each Elasticsearch node has 8GB of heap, so the GC overhead is potentially what's causing the CPU to skyrocket. If so, why does the same setup work without issues with both fluent-bit v0.13 and fluentd v0.12? Either way, the EFK setup is unusable with fluentd v1.2. @edsiper can you please help?

@github-actions

This issue has been automatically marked as stale because it has been open for 90 days with no activity. Remove the stale label or comment, or this issue will be closed in 30 days.

github-actions bot added the stale label on Mar 18, 2021
@github-actions

This issue was automatically closed after being stale for 30 days.
