
Design Overview

Andrew Choi edited this page Jul 10, 2020 · 8 revisions

The goal of Xinfra Monitor is to make it as easy as possible to 1) develop and execute long-running, Kafka-specific system tests in real clusters, and 2) monitor an existing Kafka deployment's performance as experienced by users. Developers should be able to create new tests easily by composing reusable modules that take actions and collect metrics. Users should be able to run Xinfra Monitor to perform actions on a test cluster at a user-defined schedule, e.g. a broker hard kill or a rolling bounce, and validate that Kafka still works well in these scenarios. To achieve these goals, Xinfra Monitor is modeled as a collection of Apps and Services.

A typical App may start some producers and consumers, take a predefined sequence of actions periodically, report metrics, and validate those metrics against assertions. For example, Xinfra Monitor can start one producer, one consumer, and bounce a random broker (if it is monitoring a test cluster) every five minutes. The availability and message-loss rate can be exposed via JMX metrics, collected, and displayed on a health dashboard in real time, and an alert can be triggered if the message-loss rate is greater than zero.
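As a rough illustration of how such a loss-rate metric could be derived (this is an assumed sketch, not Xinfra Monitor's actual Java implementation), the producer can tag each record with an increasing sequence number, and the consumer can detect gaps:

```python
# Illustrative sketch: derive a message-loss rate from sequence numbers.
# The producer assigns each record a sequence number; any produced
# sequence number never seen by the consumer counts as a lost message.

def message_loss_rate(produced_seqs, consumed_seqs):
    """Fraction of produced sequence numbers never seen by the consumer."""
    produced = set(produced_seqs)
    lost = produced - set(consumed_seqs)
    return len(lost) / len(produced) if produced else 0.0

# Example: 5 messages produced, sequence 3 never arrives -> 20% loss.
rate = message_loss_rate([0, 1, 2, 3, 4], [0, 1, 2, 4])
assert rate == 0.2

# An alert would fire whenever the rate rises above zero.
if rate > 0:
    print(f"ALERT: message loss rate {rate:.0%}")
```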

To allow an App to be composed from reusable modules, the logic of periodic or long-running actions is implemented in services. A service executes its action in its own thread and reports metrics. For example, we can have the following services:

  • Produce service, which produces messages to Kafka and reports the produce rate and availability.
  • Consume service, which consumes messages from Kafka and reports the message-loss rate, message-duplicate rate, and end-to-end latency. This service depends on the produce service to provide messages that encode certain information.
  • Broker bounce service, which bounces a given broker at a regular interval.
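The service abstraction above can be sketched as a small base class that runs its periodic action on its own thread and exposes metrics. This is a hypothetical outline (the real project is written in Java, and the names here are illustrative):

```python
# Hypothetical sketch of the service abstraction described above.
# Each service runs its action in its own thread and reports metrics.

import threading
import time
from abc import ABC, abstractmethod

class Service(ABC):
    def __init__(self, interval_secs=1.0):
        self._interval = interval_secs
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.metrics = {}          # metric name -> latest value

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Repeat the action until stopped, pausing between iterations.
        while not self._stop.is_set():
            self.execute()
            self._stop.wait(self._interval)

    @abstractmethod
    def execute(self):
        """Perform one action (produce, consume, bounce a broker, ...)."""

class ProduceService(Service):
    def __init__(self, topic, **kwargs):
        super().__init__(**kwargs)
        self.topic = topic
        self.metrics["records-produced"] = 0

    def execute(self):
        # A real implementation would send a record to Kafka here.
        self.metrics["records-produced"] += 1
```

A consume service or broker bounce service would subclass `Service` the same way, each with its own `execute` action and metrics.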

An App is composed of services and validates certain assertions, either continuously or periodically. For example, we can create an App that includes one produce service, one consume service, and one broker bounce service. The produce service and consume service are configured to use the same topic, and the App can validate that the message-loss rate stays at zero.
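One way to picture this composition (a hypothetical sketch; the class and metric names are assumptions, not the project's API) is an App that holds a set of services plus assertions evaluated against their metrics:

```python
# Hypothetical sketch: an App wires services together and periodically
# checks assertions against the metrics the services report.

class App:
    def __init__(self, services, assertions):
        self.services = services      # e.g. produce, consume, broker-bounce
        self.assertions = assertions  # name -> predicate over the metrics

    def validate(self, metrics):
        """Return the names of all failed assertions (empty means healthy)."""
        return [name for name, check in self.assertions.items()
                if not check(metrics)]

# Produce and consume services share one topic; loss rate must stay at 0.
app = App(
    services=["produce:monitor-topic",
              "consume:monitor-topic",
              "broker-bounce"],
    assertions={"no-message-loss": lambda m: m["message-loss-rate"] == 0},
)
assert app.validate({"message-loss-rate": 0}) == []
assert app.validate({"message-loss-rate": 0.01}) == ["no-message-loss"]
```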

Finally, a given Xinfra Monitor instance runs on a single physical machine, and multiple apps/services can run in one Xinfra Monitor instance. The diagram below shows the relationships among services, apps, and the Xinfra Monitor instance, as well as how Xinfra Monitor interacts with Kafka and the user.

While all services in the same Xinfra Monitor instance must run on the same physical machine, we can start multiple Xinfra Monitor instances in different clusters that coordinate to orchestrate a distributed end-to-end test. In the test described by the diagram below, we start two Xinfra Monitor instances in two clusters. The first instance contains one produce service that produces to Kafka cluster 1. The messages are then mirrored from cluster 1 to cluster 2. Finally, the consume service in the second instance consumes messages from the same topic and reports the end-to-end latency of this cross-cluster pipeline.
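The cross-cluster latency measurement works because the produce service can embed a produce timestamp in each message payload, which the consume service in the other cluster compares against its own consume time. The sketch below assumes synchronized clocks between the two hosts; the payload field names are illustrative assumptions:

```python
# Illustrative sketch: embed a produce timestamp in the payload so a
# consumer in a different cluster can compute end-to-end latency.

import json
import time

def make_payload(seq, now_ms=None):
    """Build a message payload carrying a sequence number and timestamp."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    return json.dumps({"seq": seq, "produce_ts_ms": now_ms})

def end_to_end_latency_ms(payload, consume_ts_ms):
    """Latency from produce time (embedded) to consume time."""
    record = json.loads(payload)
    return consume_ts_ms - record["produce_ts_ms"]

payload = make_payload(seq=42, now_ms=1_000)
assert end_to_end_latency_ms(payload, consume_ts_ms=1_250) == 250
```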
