1. Introduction

This project gives you a step-by-step introduction to leveraging docker and the open-source ecosystem for metrics, logs and alerting.

Note: This project is only intended to present ideas.

Note: if you are using Docker for Mac, please assign at least 5 GB of memory.

2. Container logging

In docker-compose-step1.yml we create a simple container that displays hello world.

The container definition is as follows:

  example:
    image: ubuntu
    command: echo hello world

Run it with docker-compose -f docker-compose-step1.yml up

$ docker-compose -f docker-compose-step1.yml up                                                 
Creating network "monitoring-demo_default" with the default driver
Creating monitoring-demo_example_1 ...
Creating monitoring-demo_example_1 ... done
Attaching to monitoring-demo_example_1
example_1  | hello world
monitoring-demo_example_1 exited with code 0

Hello world has been written to stdout. How fancy!

The output of the container has also been captured by docker.

Run docker logs monitoring-demo_example_1 and you should see:

$ docker logs monitoring-demo_example_1                                
hello world

When a container writes to stdout or stderr, Docker captures these logs and sends them to its log bus. A listener subscribes to the bus and stores each container's logs in its own log file.

graph LR;
    container --> Docker;
    Docker -- write to --> stdout;
    Docker -- write to --> File;

To find out where that file is stored, inspect the container with docker inspect monitoring-demo_example_1; you should see:

$ docker inspect monitoring-demo_example_1
[
    {
        "Id": "cf1a86e1dc9ac16bc8f60b234f9b3e6310bd591dc385bc1da8e1081d2837752a",
        "Created": "2017-10-24T21:24:57.558550709Z",
        "Path": "echo",
        "Args": [
            "hello",
            "world"
        ],
        "State": {
            "Status": "exited",
            "Running": false,
            "Paused": false,
            "Restarting": false,
...
... snip snip ...
...
            }
        }
    }
]

That's a lot of information; let's look for the log path:

$ docker inspect monitoring-demo_example_1 | grep log  
        "LogPath": "/var/lib/docker/containers/cf1a86e1dc9ac16bc8f60b234f9b3e6310bd591dc385bc1da8e1081d2837752a/cf1a86e1dc9ac16bc8f60b234f9b3e6310bd591dc385bc1da8e1081d2837752a-json.log",

Perfect, let's extract that field now with jq

$ docker inspect monitoring-demo_example_1 | jq -r '.[].LogPath'
/var/lib/docker/containers/cf1a86e1dc9ac16bc8f60b234f9b3e6310bd591dc385bc1da8e1081d2837752a/cf1a86e1dc9ac16bc8f60b234f9b3e6310bd591dc385bc1da8e1081d2837752a-json.log

Note: you will not be able to read this file directly using Docker for Mac.
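
On a plain Linux host, however, you can read the json-file log directly. A minimal sketch, assuming root access and jq installed on the host:

# print the container's log file; each line is a JSON object
# with "log", "stream" and "time" fields
sudo cat "$(docker inspect monitoring-demo_example_1 | jq -r '.[].LogPath')"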

More about logs: https://docs.docker.com/engine/admin/logging/overview/#use-environment-variables-or-labels-with-logging-drivers

3. Listening for logs using a container

The objective now is to leverage the docker log bus: listen to it and output the logs to the console.

graph LR;
    container --> Docker((Docker));
    Docker -- write to --> stdout;
    Docker -- write to --> File;
    Listener -- listen to --> Docker;
    Listener -- write to --> stdout;

Therefore we should see everything written to stdout twice.

We will use logspout to listen for all the docker logs.

  logspout:
    image: bekt/logspout-logstash
    restart: on-failure
    volumes:
      - /var/run/docker.sock:/tmp/docker.sock
    environment:
      ROUTE_URIS: logstash://logstash:5000
    depends_on:
      - logstash

Note: In order to read from the log bus, we need access to the docker socket. Hence the volume mapping configuration.

Once logspout gets a log, it sends it to logstash.

Logstash is defined as follows:

  logstash:
    image: logstash
    restart: on-failure
    command: -e "input { udp { port => 5000 codec => json } } filter { if [docker][image] =~ /^logstash/ {  drop { } } } output { stdout { codec => rubydebug } }"

Here I define a complete logstash configuration on the command line.

Note: logspout also forwards logstash's own log events; the filter drops them to prevent an infinite logging loop.

So here are the containers at play:

graph LR;
    container --> Docker((Docker));
    Docker -- write to --> stdout;
    Docker -- write to --> File;
    Logspout -- listen to --> Docker;
    Logspout -- write to --> Logstash;
    Logstash -- write to --> stdout;

Run the demo with docker-compose -f docker-compose-step2.yml up; you should see:

$ docker-compose -f docker-compose-step2.yml up
Recreating monitoring-demo_logstash_1 ...
Recreating monitoring-demo_logstash_1
Starting monitoring-demo_example_1 ...
Recreating monitoring-demo_logstash_1 ... done
Recreating monitoring-demo_logspout_1 ...
Recreating monitoring-demo_logspout_1 ... done
Attaching to monitoring-demo_example_1, monitoring-demo_logstash_1, monitoring-demo_logspout_1
example_1   | 11597
example_1   | 9666
example_1   | 3226
...
... snip snip ...
...
example_1   | 10854
logstash_1  | {
logstash_1  |     "@timestamp" => 2017-10-24T21:49:09.787Z,
logstash_1  |         "stream" => "stdout",
logstash_1  |       "@version" => "1",
logstash_1  |           "host" => "172.24.0.4",
logstash_1  |        "message" => "10854",
logstash_1  |         "docker" => {
logstash_1  |            "image" => "ubuntu",
logstash_1  |         "hostname" => "15716aaf6095",
logstash_1  |             "name" => "/monitoring-demo_example_1",
logstash_1  |               "id" => "15716aaf6095efdde8ab3e566a911aac284e63d3c949dd19ddfd64258d20de9b",
logstash_1  |           "labels" => nil
logstash_1  |     },
logstash_1  |           "tags" => []
logstash_1  | }

Note: Along with the message comes container metadata! This will be of tremendous help while debugging your cluster!

4. Elasticsearch

It's kind of silly to grab stdout in such a convoluted way to export it back to stdout.

Let's make something useful such as sending all the logs to elasticsearch.

Let's first define an elasticsearch server:

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:5.6.0
    restart: on-failure
    ports:
      - "9200:9200"
      - "9300:9300"
    environment:
      xpack.security.enabled: "false"

and its kibana companion:

  kibana:
    image: docker.elastic.co/kibana/kibana:5.5.2
    restart: on-failure
    ports:
      - "5601:5601"
    environment:
      xpack.security.enabled: "false"
    depends_on:
      - elasticsearch

Let's ask logstash to send all logs not to stdout but to elasticsearch now.

-e "input { udp { port => 5000 codec => json } } filter { if [docker][image] =~ /^logstash/ {  drop { } } } output { stdout { codec => rubydebug } }"

becomes

-e "input { udp { port => 5000 codec => json } } filter { if [docker][image] =~ /^logstash/ {  drop { } } } output { elasticsearch { hosts => "elasticsearch" } }"

By default the logs will be sent to the logstash-* index.

So let's create the default kibana index pattern.

  kibana_index_pattern:
    image: ubuntu
    command: |
      bash -c "sleep 30 ; curl 'http://kibana:5601/es_admin/.kibana/index-pattern/logstash-*/_create' -H 'kbn-version: 5.5.2' -H 'content-type: application/json' --data-binary '{\"title\":\"logstash-*\",\"timeFieldName\":\"@timestamp\",\"notExpandable\":true}'"
    depends_on:
      - kibana

Here are the containers involved:

graph LR;
    Logspout --listen to--> Docker((Docker));
    Logspout -- write to --> Logstash;
    Logstash -- write to --> Elasticsearch;
    Kibana -- reads --> Elasticsearch;

Run the demo with docker-compose -f docker-compose-step3.yml up

$ docker-compose -f docker-compose-step3.yml up   
Starting monitoring-demo_example_1 ...
Starting monitoring-demo_example_1
Creating monitoring-demo_elasticsearch_1 ...
Creating monitoring-demo_elasticsearch_1 ... done
Recreating monitoring-demo_logstash_1 ...
Recreating monitoring-demo_logstash_1
Creating monitoring-demo_kibana_1 ...
Recreating monitoring-demo_logstash_1 ... done
Recreating monitoring-demo_logspout_1 ...
Recreating monitoring-demo_logspout_1 ... done
Attaching to monitoring-demo_example_1, monitoring-demo_elasticsearch_1, monitoring-demo_logstash_1, monitoring-demo_kibana_1, monitoring-demo_logspout_1
...
... snip snip ...
...
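
Before opening kibana, you can check from the host that logs actually reached elasticsearch. A quick sketch, assuming the 9200 port mapping defined above:

# list the logstash-* indices and peek at one document
curl -s 'http://localhost:9200/_cat/indices/logstash-*?v'
curl -s 'http://localhost:9200/logstash-*/_search?size=1&pretty'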

Now look at the logs in kibana: http://localhost:5601

5. Elasticsearch metrics!

Docker has metrics about the state of each container, but also about what it consumes. Let's leverage that!

Let's use metricbeat for that

  metricbeat:
    image: docker.elastic.co/beats/metricbeat:5.6.3
    volumes:
       - /var/run/docker.sock:/tmp/docker.sock
    depends_on:
      - elasticsearch

Note: as with logspout, we need to ask docker questions about containers via its socket.

The nice thing about metricbeat is that it comes with ready-made dashboards, let's leverage that too:

  metricbeat-dashboard-setup:
    image: docker.elastic.co/beats/metricbeat:5.6.3
    command: ./scripts/import_dashboards -es http://elasticsearch:9200
    depends_on:
      - elasticsearch

Here are the containers at play:

graph LR;
    MetricBeat -- listen to --> Docker((Docker));
    MetricBeat -- write to --> Elasticsearch;
    MetricBeat -- setup dashboards --> Kibana;
    Kibana -- reads from --> Elasticsearch;

Run the demo with docker-compose -f docker-compose-step4.yml up then look at the metricbeat dashboards in kibana.
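
You can also confirm from the host that metricbeat data is arriving in elasticsearch. A sketch, assuming elasticsearch's 9200 port is still mapped as in the previous step:

# metricbeat writes to daily metricbeat-* indices by default
curl -s 'http://localhost:9200/_cat/indices/metricbeat-*?v'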

6. Better metrics: the TICK stack

The TICK stack is comprised of Telegraf, InfluxDB, Chronograf and Kapacitor.

This stack has many very interesting properties, let's leverage them.

Let's start with influxdb

  influxdb:
    image: influxdb:1.3.7
    ports:
      - "8086:8086"

Then kapacitor

  kapacitor:
    image: kapacitor:1.3.3
    hostname: kapacitor
    environment:
      KAPACITOR_HOSTNAME: kapacitor
      KAPACITOR_INFLUXDB_0_URLS_0: http://influxdb:8086
    depends_on:
      - influxdb

Then chronograf

  chronograf:
    image: chronograf:1.3.10
    environment:
      KAPACITOR_URL: http://kapacitor:9092
      INFLUXDB_URL: http://influxdb:8086
    ports:
      - "8888:8888"
    depends_on:
      - influxdb
      - kapacitor

Then telegraf

  telegraf:
    image: telegraf:1.4.3
    volumes:
      - /var/run/docker.sock:/tmp/docker.sock
      - ./telegraf/telegraf.conf:/etc/telegraf/telegraf.conf:ro
    links:
      - influxdb
      - elasticsearch

A few things to notice here:

  • we once again mount docker.sock: telegraf will gather docker metrics via the socket too
  • we have a local telegraf.conf
  • we link to influxdb as metrics will be shipped there
  • we link to elasticsearch ... we will monitor elasticsearch too!

Let's look at what the telegraf configuration looks like.

I removed many default values; if you want to see them all, go to https://github.com/influxdata/telegraf/blob/master/etc/telegraf.conf

[agent]
interval = "10s"

## Outputs
[[outputs.influxdb]]
urls = ["http://influxdb:8086"]
database = "telegraf"

## Inputs
[[inputs.cpu]]
[[inputs.disk]]
[[inputs.diskio]]
[[inputs.kernel]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.net]]
[[inputs.netstat]]
[[inputs.interrupts]]
[[inputs.linux_sysctl_fs]]
[[inputs.docker]]
endpoint = "unix:///tmp/docker.sock"
[[inputs.elasticsearch]]
servers = ["http://elasticsearch:9200"]

This configuration should be self-explanatory, right?

Note: The telegraf plugin ecosystem is huge, see the full list here: https://github.com/influxdata/telegraf#input-plugins

Now run the demo with docker-compose -f docker-compose-step5.yml up
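
Once everything is up, you can check from the host that telegraf is writing into influxdb. A sketch, assuming the 8086 port mapping defined above:

# list the measurements collected in the telegraf database
curl -sG 'http://localhost:8086/query' \
  --data-urlencode 'db=telegraf' \
  --data-urlencode 'q=SHOW MEASUREMENTS'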

You are starting to have many containers:

The ELK story:

graph LR;
    Logspout -- listen to --> Docker((Docker));
    Logspout -- write to --> Logstash;
    Logstash -- write to --> Elasticsearch;
    Kibana -- reads from --> Elasticsearch;
    MetricBeat -- listen to --> Docker;
    MetricBeat -- write to --> Elasticsearch;
    MetricBeat -- one time dashboards setup --> Kibana;

And the TICK story:

graph LR;
    Telegraf -- listen to --> Docker((Docker));
    Telegraf -- write to --> Influxdb;
    Chronograf -- reads from --> Influxdb;
    Kapacitor -- listen to --> Influxdb;
    Chronograf -- setup rules --> Kapacitor;
    Kapacitor -- notifies --> Notification;

Run the demo with docker-compose -f docker-compose-step5.yml up, then explore chronograf (port 8888 as mapped above).

You can play around with the alerting system etc.

7. Getting the best of the ecosystem

We are now in pretty good shape:

  1. we have all the logs in elasticsearch
  2. we have metrics in elasticsearch
  3. we have metrics in influxdb
  4. we have a mean of visualization via chronograf
  5. we have a mean of alerting via kapacitor

We should be all set right ?

Well, no, we can do better: as an admin I want to mix and match logs, visualization and alerting in a single page.

Let's do that together by leveraging grafana

  grafana:
    image: grafana/grafana:4.6.1
    ports:
      - "3000:3000"
    depends_on:
      - influxdb
      - elasticsearch

Nothing fancy here, but if you run it like this, you'll have to set up manually:

  • the elasticsearch datasource
  • the influxdb datasource
  • the alert channels
  • a few default dashboards

Well, there's a local build that does just that:

  grafana-setup:
    build: grafana-setup/
    depends_on:
      - grafana
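
Under the hood, a setup image like this only needs to call the grafana HTTP API from within the compose network. A minimal sketch of what it might do (the exact script lives in grafana-setup/; the datasource names here are illustrative):

# create an influxdb datasource pointing at the telegraf database
curl -s -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://grafana:3000/api/datasources \
  -d '{"name":"influxdb","type":"influxdb","access":"proxy","url":"http://influxdb:8086","database":"telegraf"}'

# create an elasticsearch datasource for the logstash-* indices
curl -s -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://grafana:3000/api/datasources \
  -d '{"name":"elasticsearch","type":"elasticsearch","access":"proxy","url":"http://elasticsearch:9200","database":"logstash-*","jsonData":{"timeField":"@timestamp"}}'
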
graph LR;
    Grafana -- reads from --> Influxdb
    Grafana -- reads from --> Elasticsearch
    Grafana -- write to --> AlertChannels
    GrafanaSetup -- one time setup --> Grafana

Run the demo with docker-compose -f docker-compose-step6.yml up then enjoy your docker metrics in grafana!

Note: use username admin, password admin.

Go to the bottom of the page ... here are the logs for the container you are looking at!

Note: do not hesitate to rely on dashboards from the community at https://grafana.com/dashboards

You can create alerts etc. That's great.

8. Kafka, the data hub

We can't keep all this data to ourselves, right? We are most probably not its only users.

What about the security team, what about auditing, what about performance engineers, what about pushing the data to other storages etc.

Well kafka is very useful here, let's leverage that component.

Kafka relies on zookeeper, let's use the simplest images I could find:

  zookeeper:
    image: wurstmeister/zookeeper:3.4.6
    ports:
      - "2181:2181"

Same thing for kafka:

  kafka:
    image: wurstmeister/kafka:1.0.0
    ports:
      - "9092"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    depends_on:
      - zookeeper
      

Now we can update telegraf to ship all its data to kafka too.

Let's add the kafka output to the telegraf configuration:

[[outputs.kafka]]
   brokers = ["kafka:9092"]
   topic = "telegraf"

And add a link from the telegraf container to the kafka server:

  telegraf:
    image: telegraf:1.4.3
    volumes:
      - /var/run/docker.sock:/tmp/docker.sock
      - ./telegraf/telegraf-with-kafka-output.conf:/etc/telegraf/telegraf.conf:ro
    links:
      - influxdb
      - elasticsearch
      - kafka

The Kafka story

graph LR;
    Telegraf -- listen to --> Docker;
    Telegraf -- write to --> Influxdb;
    Telegraf -- write to --> Kafka
    Kafka -- read/writes --> Zookeeper

Run the demo docker-compose -f docker-compose-step7.yml up

Let's see if we got our metrics data readily available in kafka ...

docker exec -ti monitoring-demo_kafka_1 kafka-console-consumer.sh  --zookeeper zookeeper --topic telegraf --max-messages 5                              
Using the ConsoleConsumer with old consumer is deprecated and will be removed in a future major release. Consider using the new consumer by passing [bootstrap-server] instead of [zookeeper].
docker_container_mem,build-date=20170801,com.docker.compose.service=kibana,license=GPLv2,com.docker.compose.config-hash=1e1f2bf92f25fcc3a4b235d04f600cd276809e7195a0c5196f0a8098e82e47b3,host=c280c5e69493,container_image=docker.elastic.co/kibana/kibana,maintainer=Elastic\ Docker\ Team\ <[email protected]>,com.docker.compose.version=1.16.1,com.docker.compose.oneoff=False,com.docker.compose.project=monitoring-demo,vendor=CentOS,com.docker.compose.container-number=1,name=CentOS\ Base\ Image,engine_host=moby,container_name=monitoring-demo_kibana_1,container_version=5.5.2 pgpgin=98309i,rss_huge=0i,total_pgmajfault=3i,total_pgpgin=98309i,total_rss_huge=0i,usage_percent=1.9058546412278363,active_anon=155103232i,hierarchical_memory_limit=9223372036854771712i,max_usage=272527360i,container_id="aa2195088fd305079d2942b009c9e9fd1bb38781aa558be6a9f084a334b1b755",writeback=0i,pgfault=116807i,pgpgout=59702i,total_mapped_file=0i,total_unevictable=0i,total_writeback=0i,unevictable=0i,active_file=0i,mapped_file=0i,total_inactive_anon=20480i,total_pgfault=116807i,total_rss=154718208i,usage=162758656i,total_active_anon=155103232i,cache=3416064i,rss=154718208i,total_cache=3416064i,total_inactive_file=3010560i,total_pgpgout=59702i,limit=8360689664i,pgmajfault=3i,total_active_file=0i,inactive_anon=20480i,inactive_file=3010560i 1508887282000000000

docker_container_cpu,vendor=CentOS,com.docker.compose.container-number=1,build-date=20170801,container_image=docker.elastic.co/kibana/kibana,com.docker.compose.project=monitoring-demo,container_name=monitoring-demo_kibana_1,cpu=cpu-total,host=c280c5e69493,license=GPLv2,com.docker.compose.config-hash=1e1f2bf92f25fcc3a4b235d04f600cd276809e7195a0c5196f0a8098e82e47b3,com.docker.compose.oneoff=False,engine_host=moby,container_version=5.5.2,com.docker.compose.service=kibana,maintainer=Elastic\ Docker\ Team\ <[email protected]>,com.docker.compose.version=1.16.1,name=CentOS\ Base\ Image usage_total=11394168870i,usage_system=27880670000000i,throttling_periods=0i,throttling_throttled_periods=0i,throttling_throttled_time=0i,usage_in_usermode=10420000000i,usage_in_kernelmode=970000000i,container_id="aa2195088fd305079d2942b009c9e9fd1bb38781aa558be6a9f084a334b1b755",usage_percent=7.948400539083559 1508887282000000000

docker_container_cpu,com.docker.compose.project=monitoring-demo,vendor=CentOS,com.docker.compose.container-number=1,com.docker.compose.oneoff=False,container_image=docker.elastic.co/kibana/kibana,maintainer=Elastic\ Docker\ Team\ <[email protected]>,engine_host=moby,com.docker.compose.config-hash=1e1f2bf92f25fcc3a4b235d04f600cd276809e7195a0c5196f0a8098e82e47b3,build-date=20170801,license=GPLv2,com.docker.compose.version=1.16.1,container_name=monitoring-demo_kibana_1,host=c280c5e69493,name=CentOS\ Base\ Image,cpu=cpu0,container_version=5.5.2,com.docker.compose.service=kibana container_id="aa2195088fd305079d2942b009c9e9fd1bb38781aa558be6a9f084a334b1b755",usage_total=3980860071i 1508887282000000000

docker_container_cpu,com.docker.compose.container-number=1,host=c280c5e69493,name=CentOS\ Base\ Image,com.docker.compose.oneoff=False,container_version=5.5.2,build-date=20170801,com.docker.compose.service=kibana,maintainer=Elastic\ Docker\ Team\ <[email protected]>,vendor=CentOS,com.docker.compose.project=monitoring-demo,engine_host=moby,license=GPLv2,com.docker.compose.config-hash=1e1f2bf92f25fcc3a4b235d04f600cd276809e7195a0c5196f0a8098e82e47b3,com.docker.compose.version=1.16.1,container_name=monitoring-demo_kibana_1,container_image=docker.elastic.co/kibana/kibana,cpu=cpu1 usage_total=3942753596i,container_id="aa2195088fd305079d2942b009c9e9fd1bb38781aa558be6a9f084a334b1b755" 1508887282000000000

docker_container_cpu,maintainer=Elastic\ Docker\ Team\ <[email protected]>,cpu=cpu2,host=c280c5e69493,build-date=20170801,container_version=5.5.2,com.docker.compose.config-hash=1e1f2bf92f25fcc3a4b235d04f600cd276809e7195a0c5196f0a8098e82e47b3,com.docker.compose.version=1.16.1,com.docker.compose.container-number=1,name=CentOS\ Base\ Image,com.docker.compose.oneoff=False,container_name=monitoring-demo_kibana_1,com.docker.compose.service=kibana,container_image=docker.elastic.co/kibana/kibana,vendor=CentOS,com.docker.compose.project=monitoring-demo,engine_host=moby,license=GPLv2 usage_total=1607029783i,container_id="aa2195088fd305079d2942b009c9e9fd1bb38781aa558be6a9f084a334b1b755" 1508887282000000000

Processed a total of 5 messages

Yes, it looks like it!

We are in pretty good shape, right?

Well, we can do better. We have many JVM-based components, such as kafka, and we know their monitoring is based on the JMX standard.

9. Enter JMX!

Telegraf is a Go application; it does not speak JVM natively. However, it speaks jolokia.

Let's leverage that.

So let's create our own image based on wurstmeister/kafka, download jolokia and add it to the image:

FROM wurstmeister/kafka:1.0.0

ENV JOLOKIA_VERSION 1.3.5
ENV JOLOKIA_HOME /usr/jolokia-${JOLOKIA_VERSION}
RUN curl -sL --retry 3 \
  "https://github.com/rhuss/jolokia/releases/download/v${JOLOKIA_VERSION}/jolokia-${JOLOKIA_VERSION}-bin.tar.gz" \
  | gunzip \
  | tar -x -C /usr/ \
 && ln -s $JOLOKIA_HOME /usr/jolokia \
 && rm -rf $JOLOKIA_HOME/client \
 && rm -rf $JOLOKIA_HOME/reference

CMD ["start-kafka.sh"]

And link the new kafka definition to this image

  kafka:
    build: kafka-with-jolokia/
    ports:
      - "9092"
    environment:
      JOLOKIA_VERSION: 1.3.5
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_OPTS: -javaagent:/usr/jolokia-1.3.5/agents/jolokia-jvm.jar=host=0.0.0.0
    depends_on:
      - zookeeper
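
Once the stack is up, you can check that the jolokia agent answers. A sketch, assuming the default compose container name (curl is already present in the image, since the Dockerfile above uses it):

docker exec -ti monitoring-demo_kafka_1 \
  curl -s http://localhost:8778/jolokia/read/java.lang:type=Memory/HeapMemoryUsage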

Configure telegraf to gather jmx metrics using the jolokia agent

[[inputs.jolokia]]
context = "/jolokia/"

[[inputs.jolokia.servers]]
name = "kafka"
host = "kafka"
port = "8778"

[[inputs.jolokia.metrics]]
name = "heap_memory_usage"
mbean  = "java.lang:type=Memory"
attribute = "HeapMemoryUsage"

[[inputs.jolokia.metrics]]
name = "messages_in"
mbean = "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec"

[[inputs.jolokia.metrics]]
name = "bytes_in"
mbean = "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec"

Then configure telegraf to use the new configuration with jolokia input

  telegraf:
    image: telegraf:1.4.3
    volumes:
      - /var/run/docker.sock:/tmp/docker.sock
      - ./telegraf/telegraf-with-kafka-output-and-jolokia.conf:/etc/telegraf/telegraf.conf:ro
    links:
      - influxdb
      - elasticsearch
      - kafka

Run the demo docker-compose -f docker-compose-step8.yml up

You'll see the new kafka image created

$ docker images |  grep demo                                                                                     
monitoring-demo_kafka                            latest              5a746c9ff5ea        2 minutes ago       270MB

The JMX story

graph LR;
    Telegraf -- write to --> Kafka
    Telegraf -- get metrics --> Jolokia
    Jolokia -- reads JMX --> Kafka

Do we have jolokia metrics?

$ docker exec -ti monitoring-demo_kafka_1 kafka-console-consumer.sh  --zookeeper zookeeper --topic telegraf | grep jolokia 
jolokia,host=cde5575b52a5,jolokia_name=kafka,jolokia_port=8778,jolokia_host=kafka heap_memory_usage_used=188793344,messages_in_MeanRate=12.98473084303969,bytes_out_FiveMinuteRate=1196.4939381458667,bytes_out_RateUnit="SECONDS",active_controller_Value=1,heap_memory_usage_init=1073741824,heap_memory_usage_committed=1073741824,messages_in_FiveMinuteRate=4.794914163942757,messages_in_EventType="messages",isr_expands_Count=0,isr_expands_FiveMinuteRate=0,isr_expands_OneMinuteRate=0,messages_in_RateUnit="SECONDS",bytes_in_FifteenMinuteRate=995.4606306690374,bytes_out_OneMinuteRate=3453.5697437249646,bytes_out_Count=413240,offline_partitions_Value=0,isr_shrinks_OneMinuteRate=0,messages_in_FifteenMinuteRate=1.8164700620801133,messages_in_OneMinuteRate=11.923477587504813,bytes_in_Count=955598,bytes_in_MeanRate=7110.765507856953,isr_shrinks_Count=0,isr_expands_RateUnit="SECONDS",isr_shrinks_EventType="shrinks",isr_expands_MeanRate=0,bytes_in_RateUnit="SECONDS",bytes_in_OneMinuteRate=6587.34465794122,bytes_in_FiveMinuteRate=2631.3776025779002,bytes_out_EventType="bytes",isr_shrinks_FiveMinuteRate=0,isr_expands_EventType="expands",messages_in_Count=1745,bytes_out_MeanRate=3074.982298604404,isr_expands_FifteenMinuteRate=0,heap_memory_usage_max=1073741824,bytes_in_EventType="bytes",bytes_out_FifteenMinuteRate=438.0280170256858,isr_shrinks_MeanRate=0,isr_shrinks_RateUnit="SECONDS",isr_shrinks_FifteenMinuteRate=0 1508889300000000000
jolokia,jolokia_name=kafka,jolokia_port=8778,jolokia_host=kafka,host=cde5575b52a5 bytes_in_MeanRate=6630.745414108696,isr_shrinks_RateUnit="SECONDS",isr_expands_EventType="expands",isr_expands_FiveMinuteRate=0,isr_expands_RateUnit="SECONDS",heap_memory_usage_max=1073741824,messages_in_Count=1745,isr_expands_FifteenMinuteRate=0,bytes_out_RateUnit="SECONDS",isr_shrinks_OneMinuteRate=0,isr_shrinks_FifteenMinuteRate=0,isr_shrinks_MeanRate=0,messages_in_RateUnit="SECONDS",bytes_in_OneMinuteRate=5576.066868503058,messages_in_FifteenMinuteRate=1.796398775034883,bytes_in_FiveMinuteRate=2545.1107836610863,bytes_out_Count=413240,active_controller_Value=1,isr_expands_Count=0,heap_memory_usage_committed=1073741824,messages_in_EventType="messages",bytes_in_Count=955598,isr_expands_OneMinuteRate=0,messages_in_FiveMinuteRate=4.637718179794651,messages_in_MeanRate=12.107909165680097,isr_shrinks_Count=0,isr_shrinks_EventType="shrinks",bytes_in_FifteenMinuteRate=984.461178226918,offline_partitions_Value=0,bytes_out_OneMinuteRate=2923.3836736983444,bytes_out_EventType="bytes",isr_shrinks_FiveMinuteRate=0,isr_expands_MeanRate=0,bytes_in_EventType="bytes",bytes_out_MeanRate=2867.3907911149618,messages_in_OneMinuteRate=10.093005874965653,bytes_in_RateUnit="SECONDS",bytes_out_FifteenMinuteRate=433.18797795919676,bytes_out_FiveMinuteRate=1157.2682011038034,heap_memory_usage_init=1073741824,heap_memory_usage_used=189841920 1508889310000000000

Well, looks like we do!

10. Let's do some manual monitoring

Let's say you have some hand-coded monitoring tools that you wrote in python or bash, such as:

#!/bin/bash
group=$1
kafkaHost=$2
kafkaPort=$3
kafka-consumer-groups.sh --bootstrap-server ${kafkaHost}:${kafkaPort} --group ${group} --describe 2> /dev/null \
      | tail -n +3 \
      | awk -v GROUP=${group} '{print "kafka_group_lag,group="GROUP",topic="$1",partition="$2",host="$7" current_offset="$3"i,log_end_offset="$4"i,lag="$5"i"}'
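
Run by hand, such a script prints one influx line protocol point per partition. A sketch of the expected shape, assuming kafka-consumer-groups.sh is on the PATH and a consumer group named mygroup exists (values illustrative):

./kafka-lag.sh mygroup broker 9092
# kafka_group_lag,group=mygroup,topic=telegraf,partition=0,host=/172.18.0.9 current_offset=120i,log_end_offset=150i,lag=30i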

You have many possibilities there:

The exec plugin:

[[inputs.exec]]
  commands = ["kafka-lag.sh mygroup broker 9092"]
  timeout = "5s"

You can also make Telegraf listen for line protocol metrics on a socket.

Note: Telegraf has many networking options and protocols supported.

[[inputs.socket_listener]]
    service_address = "tcp://:8094"
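
You can test such a listener by hand with a single line protocol point. A sketch, assuming you run it from a container attached to the same compose network and nc is available:

echo "kafka_group_lag,group=mygroup,topic=test,partition=0 lag=42i" | nc -w 1 telegraf 8094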

You would then update your bash script to send the data to telegraf!

#!/bin/bash
group=$1
kafkaHost=$2
kafkaPort=$3
telegrafHost=$4
telegrafPort=$5
echo Fetching metrics for the ${group} group in ${kafkaHost}:${kafkaPort} and pushing the metrics into ${telegrafHost}:${telegrafPort}
while true
do
    kafka-consumer-groups.sh --bootstrap-server ${kafkaHost}:${kafkaPort} --group ${group} --describe 2> /dev/null \
         | tail -n +3 \
         | awk -v GROUP=${group} '{print "kafka_group_lag,group="GROUP",topic="$1",partition="$2",host="$7" current_offset="$3"i,log_end_offset="$4"i,lag="$5"i"}' \
         | nc ${telegrafHost} ${telegrafPort}
         echo Sleeping for 10s
         sleep 10s
done

The bash telemetry story

graph LR;
    Telegraf -- write to --> Influxdb
    Group-Kafka-Lag -- read group metrics --> Kafka
    Group-Kafka-Lag -- send metrics to over TCP --> Telegraf

Run the demo docker-compose -f docker-compose-step9.yml up

You can now graph the lag of your consumers.

11. Self-descriptive visualizations

Let's rely on jdbranham-diagram-panel to show pretty diagrams that will be live.

For that we need to install a plugin; let's leverage the grafana GF_INSTALL_PLUGINS environment variable:

  grafana:
    image: grafana/grafana:4.6.1
    ports:
      - "3000:3000"
    environment:
      GF_INSTALL_PLUGINS: jdbranham-diagram-panel
    depends_on:
      - influxdb
      - elasticsearch

Run the demo docker-compose -f docker-compose-step10.yml up
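
You can confirm the plugin was installed at startup. A sketch, assuming the default compose container name:

docker exec -ti monitoring-demo_grafana_1 grafana-cli plugins ls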

You can now create live diagrams!

live diagrams

12. Your sql databases are back

Note: Todo

Leverage your sql databases in your grafana dashboards with http://docs.grafana.org/features/datasources/mysql/

You can consume your database changes and push them to kafka https://www.confluent.io/product/connectors/

13. Share your database tables as kafka tables

Change Data Capture and Kafka Connect. Look at the ecosystem: https://www.confluent.io/product/connectors/

14. Going even further with Kafka using KSQL

Note: Todo

Now that Kafka is the real bus of your architecture, you can leverage ksql's declarative power, such as:

CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 SECONDS)
  GROUP BY card_number
  HAVING count(*) > 3;

15. Going C3

Note: Todo

Now that kafka, ksql and connect are driving many parts of your monitoring, you want a dedicated tool that will enrich your existing metrics/visualizations: https://www.confluent.io/product/control-center/

16. Going Prometheus

Note: Todo

https://prometheus.io/

17. Going distributed open tracing

Note: Todo

http://opentracing.io/

18. Monitoring Federation

Note: Todo

Have a global overview of many clusters.

19. Security

Note: Todo

Always a bit of a pain.