Kafka connector preview #770

sravotto · 2024-03-15T13:30:51Z

This PR consists of 3 separate commits:

hlc: adding range.Contains function: adds a utility function to check if a range contains a given timestamp.
Kafka connector preview: adds a source connector that consumes CockroachDB changefeeds send through a Kafka cluster. The connector supports JSON message, and currently uses immediate mode.
Kafka connector integration test: verifies that change events originated by a CockroachDB changefeed, and routed via a single node Kafka cluster are received by the connector and applied to a target CockroachDB datatabase.

This change is

codecov-commenter · 2024-03-15T13:38:15Z

Codecov Report

Attention: Patch coverage is 58.18966% with 97 lines in your changes are missing coverage. Please review.

Project coverage is 77.01%. Comparing base (b29765e) to head (8986e9c).

Files	Patch %	Lines
internal/source/kafka/conn.go	42.10%	31 Missing and 2 partials ⚠️
internal/source/kafka/config.go	45.76%	20 Missing and 12 partials ⚠️
internal/source/kafka/consumer.go	72.72%	14 Missing and 7 partials ⚠️
internal/source/kafka/provider.go	77.77%	3 Missing and 3 partials ⚠️
internal/cmd/kafka/kafka.go	62.50%	2 Missing and 1 partial ⚠️
internal/source/kafka/kafka.go	0.00%	2 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #770      +/-   ##
==========================================
- Coverage   78.42%   77.01%   -1.41%     
==========================================
  Files         201      207       +6     
  Lines        9769    10001     +232     
==========================================
+ Hits         7661     7702      +41     
- Misses       1440     1616     +176     
- Partials      668      683      +15

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

BramGruneir

Overall, this is great, bit it needs a bit more work.

Also, I get that this is just the first pass here, but you're going to be limited by running only a single instance of replicator. Ideally, we put the topics in staging and take out a lease against them to allow for more than one replicator instance to push data.
I can see how this can be easily be done by running multiple instances of ConsumeClaim. Or we can use the mark function to determine if a value has been already processed. Lots of options here.

Reviewed 2 of 2 files at r1, 13 of 13 files at r2, 4 of 4 files at r3, all commit messages.
Reviewable status: all files reviewed, 21 unresolved discussions (waiting on @bobvawter and @sravotto)

-- commits line 11 at r2:
sent not send

internal/source/kafka/config.go line 74 at r2 (raw file):

	f.IntVar(&c.batchSize, "batchSize", 100, "messages to accumulate before committing to the target")
	f.StringArrayVar(&c.brokers, "broker", nil, "address of Kafka broker(s)")
	f.StringVar(&c.from, "from", "", "accept messages at or newer than this timestamp")

From and To or Min and Max? Also, perhaps it's worth adding Timestamp to the end of it for clarity: FromTimestamp or MinTimestamp

I think Min and Max might be better.

Just because from could be from where or from when?

internal/source/kafka/config.go line 74 at r2 (raw file):

	f.IntVar(&c.batchSize, "batchSize", 100, "messages to accumulate before committing to the target")
	f.StringArrayVar(&c.brokers, "broker", nil, "address of Kafka broker(s)")
	f.StringVar(&c.from, "from", "", "accept messages at or newer than this timestamp")

only accept messages with timestamps at or newer than this timestamp, this is an inclusive lower limit

internal/source/kafka/config.go line 76 at r2 (raw file):

	f.StringVar(&c.from, "from", "", "accept messages at or newer than this timestamp")
	f.StringVar(&c.group, "group", "", "the Kafka consumer group id")
	f.BoolVar(&c.oldest, "oldest", false, "start from the oldest message available")

Mention that this is in lieu of --From (and vice versa in --From)

internal/source/kafka/config.go line 76 at r2 (raw file):

	f.StringVar(&c.from, "from", "", "accept messages at or newer than this timestamp")
	f.StringVar(&c.group, "group", "", "the Kafka consumer group id")
	f.BoolVar(&c.oldest, "oldest", false, "start from the oldest message available")

Should this also be set to true by default?

internal/source/kafka/config.go line 78 at r2 (raw file):

	f.BoolVar(&c.oldest, "oldest", false, "start from the oldest message available")
	f.StringVar(&c.strategy, "strategy", "sticky", "Kafka consumer group re-balance strategy")
	f.StringVar(&c.to, "to", "", "accept messages at or older than this timestamp")

only accept messages with timestamps before this one, this is an exclusive upper limit

internal/source/kafka/config.go line 80 at r2 (raw file):

	f.StringVar(&c.to, "to", "", "accept messages at or older than this timestamp")
	f.StringArrayVar(&c.topics, "topic", nil, "the topic(s) that the consumer should use")
	f.StringVar(&c.version, "kafkaVersion", "3.6.0", "Kafka version")

Why is this necessary?

internal/source/kafka/config.go line 131 at r2 (raw file):

	}
	if hlc.Compare(from, to) > 0 {
		return errors.New("from timestamp must earlier than to timestamp")

this message is unclear

internal/source/kafka/conn.go line 35 at r2 (raw file):

//
//	note: we get resolved timestamps on all the partitions,
//	      so we should be able to leverage that.

There may be some complication here, as the messages are only ordered per partition right? Or is it by topic? So we may need to wait on writing a resolved timestamp until after we've grabbed all the other messages

internal/source/kafka/conn.go line 38 at r2 (raw file):

//
// TODO (silvano): support Avro format, schema registry.
// TODO (silvano): add metrics.

Please add github issues for these todos and link the issue number here

internal/source/kafka/conn.go line 62 at r2 (raw file):

// are allocated to each process based on the chosen rebalance strategy.
func (c *Conn) Start(ctx *stopper.Context) error {
	version, err := sarama.ParseKafkaVersion(c.config.version)

This check should be in preflight.

internal/source/kafka/conn.go line 74 at r2 (raw file):

	switch c.config.strategy {
	case "sticky":

This initial check should be in preflight. And please make it an enum.

internal/source/kafka/conn.go line 133 at r2 (raw file):

// getOffsets get the most recent offsets at the given time
// for all the topics and partitions.
// TODO (silvano) : add testing

Add github issue and ref to it here please

internal/source/kafka/conn.go line 136 at r2 (raw file):

func (c *Conn) getOffsets(nanos int64) ([]*partitionState, error) {
	res := make([]*partitionState, 0)
	client, err := sarama.NewClient(c.config.brokers, c.saramaConfig)

Is there a pool or is this common practice?

internal/source/kafka/consumer.go line 79 at r2 (raw file):

}

// ConsumeClaim process new messages for the topic/partition specified in the claim.

processes

internal/source/kafka/consumer.go line 96 at r2 (raw file):

		case message, ok := <-claim.Messages():
			if !ok {
				log.Printf("message channel was closed")

These log messages should have the topic, partition and maybe even the current offset in them

Also, shouldn't these logs be debug?

internal/source/kafka/consumer.go line 118 at r2 (raw file):

		case <-time.After(time.Second):
			// Periodically flush a batch, and mark the latest message for each topic/partition as read.
			if toProcess, err = c.accept(ctx, toProcess); err != nil {

How does Read differ from Mark?

internal/source/kafka/consumer.go line 133 at r2 (raw file):

) map[string]*sarama.ConsumerMessage {
	for _, message := range consumed {
		session.MarkMessage(message, "")

If we can use kafka to Mark messages.... then we can avoid staging completely... hmm...

internal/source/kafka/consumer.go line 146 at r2 (raw file):

		return toProcess, nil
	}
	log.Printf("flushing %d", toProcess.Count())

debug?

internal/source/kafka/provider.go line 24 at r2 (raw file):

	"github.com/cockroachdb/cdc-sink/internal/sequencer/chaos"
	"github.com/cockroachdb/cdc-sink/internal/sequencer/immediate"
	scriptSeq "github.com/cockroachdb/cdc-sink/internal/sequencer/script"

you used scriptSequencer earlier, please pick one
And for /script you used scriptRuntime

internal/util/hlc/hlc_test.go line 69 at r1 (raw file):

	a.False(rng.Contains(zero))
	a.False(rng.Contains(nine))

Can you test almost 10? {9, max}

internal/util/hlc/hlc_test.go line 74 at r1 (raw file):

	a.True(rng.Contains(fifteen))
	a.True(rng.Contains(almostTwenty))
	a.True(rng.Contains(twenty))

Should this be false, exclusive of max?

sravotto

The way ConsumerGroup works allow multiple instances of replicator. Each instance would have its own set of (non-overlapping) partitions. And if one replicator dies, its partitions will be allocated to the remaining ones. It works fine with immediate, but we need some coordination with the resolved timestamps across replication instances if we want to enforce transaction consistency.

Reviewable status: 10 of 19 files reviewed, 21 unresolved discussions (waiting on @bobvawter and @BramGruneir)

-- commits line 11 at r2: