RDIP supports a variety of change data capture strategies, both batch and real-time, enabling entities to select an update strategy that optimizes their overarching data integration process. This is especially important when data needs to be copied from one or many data sources to one or many other data sources, or to an analytics data warehouse, without disrupting the regular flow of data, which is disrupted whenever users are forced to wait for batch runs. CDC streamlines modern analytics by leveraging event-driven data and making data integration more agile, delivering increased operational efficiency.
Technology Stack
• Apache Kafka Message Broker
• Apache Kafka Stream Processing API
• Apache Kafka Connect
• Confluent Platform
• Spring
◦ Spring Boot
◦ Spring Security
◦ Spring Retry
◦ Spring JPA
◦ Spring AOP
◦ Spring OAuth2
◦ Spring JWT
◦ Spring Test
◦ Spring HATEOAS
• Log4j2
• Netty4
• PostgreSQL
• H2 – In Memory Database
• AngularJS
Kafka Connect Security
Securing Kafka Connect requires that you configure security for:
• **Kafka Connect workers:** part of the Kafka Connect API; under the covers, a worker is really just an advanced Kafka client
• **Kafka Connect connectors:** connectors may have embedded producers or consumers, so you must override the default configurations for Connect producers used with source connectors and Connect consumers used with sink connectors
• **Kafka Connect REST:** Kafka Connect exposes a REST API that can be configured to use SSL using additional properties
Encryption
If you have enabled SSL encryption in your Apache Kafka cluster, then you must make sure that Kafka Connect is also configured to encrypt its connections to the brokers.
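As a minimal sketch (the file paths and passwords below are placeholders, not values from this project), the worker's connection to the brokers can be encrypted by adding SSL settings to the worker properties:

```properties
# connect-distributed.properties: encrypt the worker-to-broker connection
security.protocol=SSL
ssl.truststore.location=/var/private/ssl/connect.truststore.jks
ssl.truststore.password=<truststore-password>
```

The embedded producers and consumers need the same settings as well, applied through the producer. and consumer. prefixes discussed under Separate principals below.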
Authentication
If you have enabled authentication in your Kafka cluster, then you must make sure that Kafka Connect is also configured for authentication. The following mechanisms are supported; a minimal SASL/SCRAM worker sketch follows the list:
• Authentication with SSL
• Authentication with SASL/GSSAPI
• Authentication with SASL/SCRAM
• Authentication with SASL/PLAIN
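For example, a worker authenticating to the brokers with SASL/SCRAM might use settings along these lines (the username and password are placeholders):

```properties
# connect-distributed.properties: authenticate the worker to the brokers with SASL/SCRAM
security.protocol=SASL_SSL
sasl.mechanism=SCRAM-SHA-256
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required \
  username="connect-worker" \
  password="<worker-secret>";
```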
Separate principals
As of now, there is no way to change the configuration for connectors individually, but if your Kafka brokers support client authentication over SSL, it is possible to use a separate principal for the worker and the connectors. In this case, you need to generate a separate certificate for each of them and install them in separate keystores.
The key Connect configuration differences are shown below; notice the separate keystore location, keystore password, and key password.
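A sketch of the worker-side settings, assuming a dedicated worker certificate (paths and passwords are placeholders):

```properties
# Worker principal: uses its own keystore
security.protocol=SSL
ssl.truststore.location=/var/private/ssl/connect.truststore.jks
ssl.truststore.password=<truststore-password>
ssl.keystore.location=/var/private/ssl/connect.worker.keystore.jks
ssl.keystore.password=<worker-keystore-password>
ssl.key.password=<worker-key-password>
```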
Connect workers manage the producers used by source connectors and the consumers used by sink connectors. So, for the connectors to leverage security, you also have to override the default producer/consumer configuration that the worker uses.
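A sketch of those overrides, assuming a separate connector certificate: the producer. prefix applies to source connectors and the consumer. prefix to sink connectors (paths and passwords are placeholders):

```properties
# Embedded producers (source connectors): use the connector keystore
producer.security.protocol=SSL
producer.ssl.truststore.location=/var/private/ssl/connect.truststore.jks
producer.ssl.truststore.password=<truststore-password>
producer.ssl.keystore.location=/var/private/ssl/connect.connector.keystore.jks
producer.ssl.keystore.password=<connector-keystore-password>
producer.ssl.key.password=<connector-key-password>

# Embedded consumers (sink connectors): same connector keystore
consumer.security.protocol=SSL
consumer.ssl.truststore.location=/var/private/ssl/connect.truststore.jks
consumer.ssl.truststore.password=<truststore-password>
consumer.ssl.keystore.location=/var/private/ssl/connect.connector.keystore.jks
consumer.ssl.keystore.password=<connector-keystore-password>
consumer.ssl.key.password=<connector-key-password>
```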
ACL Considerations
Using separate principals for the connectors allows you to define access control lists (ACLs) with finer granularity. For example, you can use this capability to prevent the connectors themselves from writing to any of internal topics used by the Connect cluster. Additionally, you can use different keystores for source and sink connectors and enable scenarios where source connectors have only write access to a topic but sink connectors have only read access to the same topic.
Note that if you are using SASL for authentication, you must use the same principal for workers and connectors, since only a single JAAS configuration is currently supported on the client side.
Connector ACL Requirements
Source connectors must be given WRITE permission to any topics that they need to write to. Similarly, sink connectors need READ permission to any topics they will read from. They also need Group READ permission since sink tasks depend on consumer groups internally. Connect defines the consumer group.id conventionally for each sink connector as connect-{name} where {name} is substituted by the name of the connector.
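As an illustration only (the principal, topic, and connector names are placeholders), ACLs along these lines could be granted with the kafka-acls tool:

```bash
# Source connector principal: write access to its target topic
kafka-acls --bootstrap-server broker:9092 --command-config admin.properties \
  --add --allow-principal User:connect-source --operation Write --topic source-topic

# Sink connector named "my-sink": read access to the topic plus its consumer group connect-my-sink
kafka-acls --bootstrap-server broker:9092 --command-config admin.properties \
  --add --allow-principal User:connect-sink --operation Read \
  --topic source-topic --group connect-my-sink
```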
Externalizing Secrets
You can use a ConfigProvider implementation to prevent secrets from appearing in cleartext for Connector configurations on the filesystem (standalone mode) or in internal topics (distributed mode). You can specify variables in the configuration that are replaced at runtime with secrets from an external source. A reference implementation of ConfigProvider is provided with Kafka 2.0 called FileConfigProvider that allows variable references to be replaced with values from a properties file. However, the reference FileConfigProvider implementation still shows secrets in cleartext in the properties file that is managed by FileConfigProvider.
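A minimal sketch of wiring up FileConfigProvider (the file path and property key are placeholders):

```properties
# Worker config: register the provider
config.providers=file
config.providers.file.class=org.apache.kafka.common.config.provider.FileConfigProvider

# Connector config: the variable is resolved at runtime from /opt/connect/secrets/db.properties
connection.password=${file:/opt/connect/secrets/db.properties:db.password}
```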
Configuring the Connect REST API for HTTP or HTTPS
By default you can make REST API calls over HTTP with Kafka Connect. You can also configure Connect to allow either HTTP or HTTPS, or both.
The REST API is used to monitor and manage Kafka Connect and for Kafka Connect cross-cluster communication. Requests received on a follower node's REST API are forwarded to the leader node's REST API. If the URI a worker should advertise differs from the URI it listens on, you can change it with the rest.advertised.host.name, rest.advertised.port, and rest.advertised.listener configuration options. This URI is used by the follower nodes to connect to the leader.
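A sketch of a worker that serves both HTTP and HTTPS and advertises the HTTPS listener to the rest of the cluster (hostnames, ports, and keystore details are placeholders):

```properties
# Connect REST API: listen on both HTTP and HTTPS
listeners=http://0.0.0.0:8083,https://0.0.0.0:8443

# TLS settings for the https listener
listeners.https.ssl.keystore.location=/var/private/ssl/connect.rest.keystore.jks
listeners.https.ssl.keystore.password=<keystore-password>
listeners.https.ssl.key.password=<key-password>

# URI that follower workers use to reach this worker
rest.advertised.host.name=connect-1.internal
rest.advertised.port=8443
rest.advertised.listener=https
```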
Design Principles
1. Change Data Capture (CDC)
2. Event Sourcing
3. Extract, Transform and Load (ETL) in real-time
4. Binary Serialization (Avro)
Change Data Capture (CDC)
CDC (change data capture) is an approach to data integration that is helping firms obtain greater value from their data by allowing them to integrate and analyze data faster and with fewer system resources. A highly efficient mechanism for limiting the impact on source systems when loading new data into operational data stores and data warehouses, CDC complements ETL and enterprise information integration tools.
CDC eliminates the need for bulk load updating and inconvenient batch windows by enabling incremental loading or real-time streaming of data changes into your data warehouse. It can also be used for populating real-time business intelligence dashboards, synchronizing data across geographically distributed systems, and facilitating zero-downtime database migrations.
By allowing you to detect, capture, and deliver changed data, CDC reduces the time required for and resource costs of data warehousing while enabling continuous data integration.
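As one illustration of log-based CDC on this stack (the document does not prescribe a specific connector; Debezium and all names and hosts below are assumptions used only for this sketch), a PostgreSQL source connector could be registered through the Connect REST API with a configuration along these lines:

```json
{
  "name": "rdip-postgres-cdc",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres.internal",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "${file:/opt/connect/secrets/db.properties:db.password}",
    "database.dbname": "rdip",
    "database.server.name": "rdip"
  }
}
```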
Event Sourcing
With event sourcing, instead of storing the “current” state of the entities used in our system, we store a stream of events that relate to these entities. Each event is a fact: it describes a state change that occurred to the entity (past tense!). As we all know, facts are indisputable and immutable.
Given a stream of such events, it is possible to determine the current state of an entity by folding all events relating to that entity; note, however, that the reverse is not possible: when storing only the “current” state, we discard a lot of valuable historical information.
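A minimal Java sketch of this fold (the Account and AccountEvent types are illustrative placeholders, not part of RDIP's codebase):

```java
import java.math.BigDecimal;
import java.util.List;

// Illustrative event: an immutable fact describing a balance change for an account.
record AccountEvent(String accountId, BigDecimal delta) {}

record Account(String id, BigDecimal balance) {
    // Applying an event fact yields the next state without mutating the previous one.
    Account apply(AccountEvent event) {
        return new Account(id, balance.add(event.delta()));
    }
}

class EventSourcingSketch {
    // Fold the entity's event stream into its current state.
    static Account replay(String id, List<AccountEvent> events) {
        Account state = new Account(id, BigDecimal.ZERO);
        for (AccountEvent e : events) {
            state = state.apply(e);
        }
        return state;
    }

    public static void main(String[] args) {
        List<AccountEvent> events = List.of(
                new AccountEvent("acc-1", new BigDecimal("100.00")),  // credited
                new AccountEvent("acc-1", new BigDecimal("-25.00"))); // debited
        System.out.println(replay("acc-1", events)); // Account[id=acc-1, balance=75.00]
    }
}
```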
ETL
ETL stands for "Extract, Transform, Load", and is the common paradigm by which data from multiple systems is combined into a single database, data store, or warehouse for legacy storage or analytics.
Binary Serialization with Apache Avro
Avro is an open source data serialization system that helps with data exchange between systems, programming languages, and processing frameworks. Avro helps define a binary format for your data, as well as map it to the programming language of your choice.
Avro has a JSON-like data model, but can be represented either as JSON or in a compact binary form. It comes with a very sophisticated schema description language for describing the data.
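For example, a small Avro schema for a change event might look like this (the record name, namespace, and fields are placeholders, not part of RDIP's schemas):

```json
{
  "type": "record",
  "name": "CustomerChanged",
  "namespace": "com.example.rdip.events",
  "fields": [
    {"name": "customerId", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null},
    {"name": "updatedAt", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
```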
Avro is the best choice for a number of reasons:
• It has a direct mapping to and from JSON
• It has a very compact format. The bulk of JSON, repeating every field name with every single record, is what makes JSON inefficient for high-volume usage.
• It is very fast.
• It has great bindings for a wide variety of programming languages so you can generate Java objects that make working with event data easier, but it does not require code generation so tools can be written generically for any data stream.
• It has a rich, extensible schema language defined in pure JSON
• It has the best notion of compatibility for evolving your data over time.
Though it may seem like a minor thing, handling this kind of metadata turns out to be one of the most critical and least appreciated aspects of keeping data high quality and easily usable at organizational scale.