
v1.2 debian-elasticsearch crashes elasticsearch (100% CPU Usage) with buffer overflow error #151

Closed
calinah opened this issue Jun 18, 2018 · 2 comments


calinah commented Jun 18, 2018

Deploying on Kubernetes 1.10 on GKE using fluent/fluentd-kubernetes-daemonset.

Deploying with v0.12-debian-elasticsearch works great, as it has in the past. However, switching to v1.2-debian-elasticsearch drives Elasticsearch CPU usage towards 99%, and every fluentd Pod on every node reports a buffer overflow error.
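
For reference, switching versions is just a change of the image tag on the fluentd DaemonSet. A minimal excerpt of roughly what that looks like in my manifest (the DaemonSet name, namespace, and env values here are illustrative, not necessarily the upstream defaults):

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: fluentd                 # illustrative name/namespace
      namespace: efk
    spec:
      template:
        spec:
          containers:
          - name: fluentd
            # switching this tag between v0.12-debian-elasticsearch and
            # v1.2-debian-elasticsearch is what changes between the two runs
            image: fluent/fluentd-kubernetes-daemonset:v1.2-debian-elasticsearch
            env:
            - name: FLUENT_ELASTICSEARCH_HOST
              value: "efk-elasticsearch"
            - name: FLUENT_ELASTICSEARCH_PORT
              value: "9200"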

Fluentd log for v0.12:

2018-06-18 08:57:54 +0000 [info]: following tail of /var/log/containers/rethinkdb-data-0_rethinkdb_rethinkdb-data-6795097f453c7259226587f5e3ffed157393097a9e0a64077beb42b370115442.log
2018-06-18 08:57:54 +0000 [info]: following tail of /var/log/kube-proxy.log
2018-06-18 08:57:59 +0000 [info]: Connection opened to Elasticsearch cluster => {:host=>"efk-elasticsearch", :port=>9200, :scheme=>"http", :user=>"", :password=>"obfuscated"}
2018-06-18 08:58:56 +0000 [info]: stats - namespace_cache_size: 2, pod_cache_size: 2, pod_cache_watch_misses: 44, pod_cache_watch_updates: 1, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-18 08:59:56 +0000 [info]: stats - namespace_cache_size: 2, pod_cache_size: 2, pod_cache_watch_misses: 44, pod_cache_watch_updates: 1, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-18 09:00:56 +0000 [info]: stats - namespace_cache_size: 2, pod_cache_size: 2, pod_cache_watch_misses: 70, pod_cache_watch_delete_ignored: 4, pod_cache_watch_ignored: 4, pod_cache_watch_updates: 1, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3
2018-06-18 09:01:56 +0000 [info]: stats - namespace_cache_size: 2, pod_cache_size: 2, pod_cache_watch_misses: 86, pod_cache_watch_delete_ignored: 13, pod_cache_watch_ignored: 4, pod_cache_watch_updates: 1, namespace_cache_api_updates: 3, pod_cache_api_updates: 3, id_cache_miss: 3

Elasticsearch CPU Usage using fluentd v0.12:
[screenshot: Elasticsearch CPU usage graph, 2018-06-18 10:25]

Fluent.conf used for both v0.12 and v1.2:

    # AUTOMATICALLY GENERATED
    # DO NOT EDIT THIS FILE DIRECTLY, USE /templates/conf/fluent.conf.erb

    @include systemd.conf
    @include kubernetes.conf

    <match **>
      @type elasticsearch
      @id out_es
      log_level info
      include_tag_key true
      host "#{ENV['FLUENT_ELASTICSEARCH_HOST']}"
      port "#{ENV['FLUENT_ELASTICSEARCH_PORT']}"
      scheme "#{ENV['FLUENT_ELASTICSEARCH_SCHEME'] || 'http'}"
      ssl_verify "#{ENV['FLUENT_ELASTICSEARCH_SSL_VERIFY'] || 'true'}"
      user "#{ENV['FLUENT_ELASTICSEARCH_USER']}"
      password "#{ENV['FLUENT_ELASTICSEARCH_PASSWORD']}"
      reload_connections "#{ENV['FLUENT_ELASTICSEARCH_RELOAD_CONNECTIONS'] || 'true'}"
      logstash_prefix "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_PREFIX'] || 'fluentd'}"
      logstash_format true
      <buffer>
        flush_thread_count 8
        flush_interval 10s
        chunk_limit_size 8MB
        queue_limit_length 256
        retry_max_interval 30
        retry_forever false
        overflow_action drop_oldest_chunk
      </buffer>
    </match>

Fluentd log after switching to v1.2:

2018-06-18 13:36:22 +0000 [info]: #0 [out_es] Connection opened to Elasticsearch cluster => {:host=>"efk-elasticsearch", :port=>9200, :scheme=>"http", :user=>"", :password=>"obfuscated"}
2018-06-18 13:36:28 +0000 [info]: #0 [filter_kube_metadata] stats - namespace_cache_size: 2, pod_cache_size: 3, namespace_cache_api_updates: 4, pod_cache_api_updates: 4, id_cache_miss: 4
2018-06-18 13:36:31 +0000 [warn]: #0 [out_es] failed to flush the buffer. retry_time=1 next_retry_seconds=2018-06-18 13:35:42 +0000 chunk="56eea9eca144e3d02a40041c3759df77" error_class=Fluent::Plugin::ElasticsearchOutput::ConnectionFailure error="Could not push logs to Elasticsearch after 2 retries. read timeout reached"
  2018-06-18 13:36:31 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluent-plugin-elasticsearch-2.10.3/lib/fluent/plugin/out_elasticsearch.rb:530:in `rescue in send_bulk'
  2018-06-18 13:36:31 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluent-plugin-elasticsearch-2.10.3/lib/fluent/plugin/out_elasticsearch.rb:512:in `send_bulk'
  2018-06-18 13:36:31 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluent-plugin-elasticsearch-2.10.3/lib/fluent/plugin/out_elasticsearch.rb:414:in `write'
  2018-06-18 13:36:31 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.2.2/lib/fluent/plugin/output.rb:1099:in `try_flush'
  2018-06-18 13:36:31 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.2.2/lib/fluent/plugin/output.rb:1378:in `flush_thread_run'
  2018-06-18 13:36:31 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.2.2/lib/fluent/plugin/output.rb:440:in `block (2 levels) in start'
  2018-06-18 13:36:31 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluentd-1.2.2/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2018-06-18 13:36:36 +0000 [warn]: #0 [out_es] Could not push logs to Elasticsearch, resetting connection and trying again. read timeout reached
2018-06-18 13:36:40 +0000 [info]: #0 [out_es] Connection opened to Elasticsearch cluster => {:host=>"efk-elasticsearch", :port=>9200, :scheme=>"http", :user=>"", :password=>"obfuscated"}
2018-06-18 13:36:46 +0000 [warn]: #0 [out_es] Could not push logs to Elasticsearch, resetting connection and trying again. read timeout reached
2018-06-18 13:36:48 +0000 [info]: #0 [out_es] Connection opened to Elasticsearch cluster => {:host=>"efk-elasticsearch", :port=>9200, :scheme=>"http", :user=>"", :password=>"obfuscated"}
2018-06-18 13:36:48 +0000 [warn]: #0 [out_es] failed to flush the buffer. retry_time=2 next_retry_seconds=2018-06-18 13:36:33 +0000 chunk="56eea9f65f83fc511094dc866c867942" error_class=Fluent::Plugin::ElasticsearchOutput::ConnectionFailure error="Could not push logs to Elasticsearch after 2 retries. read timeout reached"
  2018-06-18 13:36:48 +0000 [warn]: #0 /fluentd/vendor/bundle/ruby/2.3.0/gems/fluent-plugin-elasticsearch-2.10.3/lib/fluent/plugin/out_elasticsearch.rb:530:in `rescue in send_bulk'

Elasticsearch log after switch to v1.2:

[2018-06-18T13:33:51,406][INFO ][o.e.c.m.MetaDataCreateIndexService] [efk-elasticsearch-0] [fluentd-2018.06.02] creating index, cause [auto(bulk api)], templates [], shards [5]/[1], mappings []
[2018-06-18T13:33:51,872][INFO ][o.e.c.m.MetaDataMappingService] [efk-elasticsearch-0] [fluentd-2018.06.02/qQEQTX1tQGiFJDfqtENVRg] create_mapping [fluentd]
[2018-06-18T13:33:52,692][INFO ][o.e.c.r.a.AllocationService] [efk-elasticsearch-0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[fluentd-2018.06.02][2]] ...]).
[2018-06-18T13:33:52,949][INFO ][o.e.c.m.MetaDataMappingService] [efk-elasticsearch-0] [fluentd-2018.06.18/97llyX3DSoCYyOVgTAroFQ] update_mapping [fluentd]
[2018-06-18T13:33:57,641][INFO ][o.e.m.j.JvmGcMonitorService] [efk-elasticsearch-0] [gc][248707] overhead, spent [278ms] collecting in the last [1s]
[2018-06-18T13:33:59,642][INFO ][o.e.m.j.JvmGcMonitorService] [efk-elasticsearch-0] [gc][248709] overhead, spent [483ms] collecting in the last [1s]
[2018-06-18T13:34:04,689][INFO ][o.e.m.j.JvmGcMonitorService] [efk-elasticsearch-0] [gc][248714] overhead, spent [264ms] collecting in the last [1s]

Elasticsearch CPU usage after switch:
[screenshot: Elasticsearch CPU usage graph, 2018-06-18 14:38]

Elasticsearch config:

cluster:
  name: ${CLUSTER_NAME}

node:
  master: ${IS_MASTER_NODE}
  data: ${IS_DATA_NODE}
  name: ${POD_NAME}

network:
  host: 0.0.0.0
transport:
  tcp:
    port: ${TRANSPORT_PORT}
http:
  port: ${HTTP_PORT}

path:
  data: /data/data
  logs: /data/log

bootstrap:
  memory_lock: ${MEMORY_LOCK}

thread_pool:
  bulk:
    queue_size: 2000

discovery:
  zen:
    hosts_provider: kubernetes
    minimum_master_nodes: 2

cloud:
  kubernetes:
    service: efk-elasticsearch-headless
    namespace: efk

It's worth mentioning that I've tried flush_interval values of 10s, 30s, and 60s in the fluentd config, and a thread_pool.bulk.queue_size of 1000 on the Elasticsearch side as well. I've also checked that fluentd can connect to the Elasticsearch cluster, and it can on both versions.
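
To make those attempts concrete, the variants were just value changes on the configs already shown above, roughly:

    # fluentd: <buffer> section inside the <match **> block above, everything else unchanged
    <buffer>
      flush_thread_count 8
      flush_interval 30s        # also tried 10s and 60s
      chunk_limit_size 8MB
      queue_limit_length 256
      retry_max_interval 30
      retry_forever false
      overflow_action drop_oldest_chunk
    </buffer>

    # elasticsearch: thread_pool section of the config above, everything else unchanged
    thread_pool:
      bulk:
        queue_size: 1000        # tried in addition to the 2000 shown above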

The Elasticsearch version is 6.1.2. I'm wondering if this is similar to what has been reported in issue #150.

Also, each Elasticsearch node has 8GB of heap, so the GC overhead is potentially what's causing the CPU to skyrocket. If so, why does the same setup work without issues with both fluent-bit v0.13 and fluentd v0.12? Either way, the EFK setup is unusable with fluentd v1.2. @edsiper can you please help?

@github-actions

This issue has been automatically marked as stale because it has been open for 90 days with no activity. Remove the stale label or comment, or this issue will be closed in 30 days.

github-actions bot added the stale label on Mar 18, 2021
@github-actions

This issue was automatically closed after being stale for 30 days.
