
Streaming without ENABLE_SCHEMA_EVOLUTION; postgres JSON mapped to snowflake VARIANT #536

Open · wants to merge 14 commits into base: master
Conversation

@acristu acristu commented Feb 6, 2023

Hi,

We needed to map the PostgreSQL JSON data type to the Snowflake VARIANT type, and we will need further tweaks to the schema evolution logic in the future. We therefore propose an approach in which Snowpipe Streaming is used, but schema evolution is still done with ALTER TABLE statements via JDBC. Will this approach be supported in the future?

This is a draft pull request with the following changes:

  • New config parameter snowflake.schematization.auto, true by default, and only relevant when "snowflake.enable.schematization": "true"; when false, the connector will not try to set ENABLE_SCHEMA_EVOLUTION on the table
  • Added a PostgreSQL JSON to Snowflake VARIANT mapping in SchematizationUtils.convertToSnowflakeType; we can make this configurable/extensible with other mappings, but this draft is specific to the Debezium semantic type naming
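A sink configuration using the proposed flag might look like this (a sketch; snowflake.schematization.auto is the new parameter proposed in this PR, the other values are placeholders):

```json
{
  "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
  "tasks.max": "1",
  "topics": "tsttbl",
  "snowflake.enable.schematization": "true",
  "snowflake.schematization.auto": "false"
}
```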

acristu commented Feb 8, 2023

Testing postgresql source (debezium) -> snowflake sink, every time a new column is added to the source postgresql table the snowflake sink task crashes with:

java.lang.IllegalStateException: No current assignment for partition tsttbl-0
	at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:370)
	at org.apache.kafka.clients.consumer.internals.SubscriptionState.seekUnvalidated(SubscriptionState.java:387)
	at org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1604)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.rewind(WorkerSinkTask.java:625)
. . . .

The topic has one partition, "tasks.max": "1", and one Connect worker. Troubleshooting is ongoing; if anyone has seen this before or has any hints, please share.

Comment on lines +154 to +166
if (autoSchematization) {
  // Enable schema evolution by default if the table is created by the connector
  String enableSchemaEvolutionQuery =
      "alter table identifier(?) set ENABLE_SCHEMA_EVOLUTION = true";
  try {
    PreparedStatement stmt = conn.prepareStatement(enableSchemaEvolutionQuery);
    stmt.setString(1, tableName);
    stmt.executeQuery();
  } catch (SQLException e) {
    // Skip the error given that schema evolution is still under PrPr
    LOG_WARN_MSG(
        "Enable schema evolution failed on table: {}, message: {}", tableName, e.getMessage());
  }
Contributor:

This won't work because we rely on schema evolution to create the table with the correct schema. If you don't want schema evolution on the table, you should create the table yourself; it will have schema evolution turned off by default.

@acristu (Author) commented Feb 9, 2023:

Thank you for the reply. The schema is evolved here: https://github.com/streamkap-com/snowflake-kafka-connector/blob/63bc190dd75692e6423d3f50b25143dafaa40d1a/src/main/java/com/snowflake/kafka/connector/internal/streaming/TopicPartitionChannel.java#L653. The main point of this draft PR is to assess whether "manual schema evolution" via SchematizationUtils.evolveSchemaIfNeeded can be supported as an alternative to "automatic schema evolution". Our main concern is that "manual schema evolution" currently runs on the error path, only after insertRow fails.

We can contribute this "manual schema evolution" option: when snowflake.schematization.auto=false is set, the connector would use the Connect schema from the schema registry (or embedded in the records) to evolve the Snowflake schema before inserting the data. But we wanted to check first whether this approach is acceptable to you going forward.
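The "manual schema evolution" being proposed boils down to deriving ALTER TABLE ... ADD COLUMN statements from the Connect schema and executing them over JDBC before inserting. A minimal sketch of the statement-building step (a hypothetical helper for illustration, not the connector's actual code; quoting and type mapping are simplified):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class ManualSchemaEvolution {
  // Build an ALTER TABLE statement adding the columns that are present in the
  // Connect schema but missing from the Snowflake table. Returns null when
  // there is nothing to evolve.
  public static String buildAddColumns(String tableName, Map<String, String> missingColumns) {
    if (missingColumns.isEmpty()) {
      return null; // schema already matches, no DDL needed
    }
    String cols = missingColumns.entrySet().stream()
        .map(e -> e.getKey() + " " + e.getValue())
        .collect(Collectors.joining(", "));
    return "alter table identifier('" + tableName + "') add column " + cols;
  }

  public static void main(String[] args) {
    Map<String, String> missing = new LinkedHashMap<>();
    missing.put("payload", "VARIANT");
    missing.put("updated_at", "TIMESTAMP_TZ");
    System.out.println(buildAddColumns("tsttbl", missing));
  }
}
```

In this PR's proposal such a statement would run before insertRow, instead of relying on the server-side ENABLE_SCHEMA_EVOLUTION property.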

@sfc-gh-tzhang (Contributor) commented Feb 10, 2023:

> Our main concern is that "manual schema evolution" is currently done on the error path, if the insertRow fails.

This is by design: insertRow will always fail first, because during connector startup we only create a table with RECORD_METADATA.
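The error-path flow described here can be sketched as a toy simulation (not the connector's code, just an illustration of the design: the table starts with RECORD_METADATA only, the first insert fails, the schema is evolved from the record, and the insert is retried):

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ErrorPathEvolution {
  // Columns currently present in the (simulated) Snowflake table.
  public final Set<String> tableColumns = new LinkedHashSet<>(List.of("RECORD_METADATA"));

  // Simulated insertRow: fails if the record references columns the table lacks.
  public boolean insertRow(List<String> recordColumns) {
    return tableColumns.containsAll(recordColumns);
  }

  // Evolve the table schema from the record's columns, then retry the insert.
  public boolean insertWithEvolution(List<String> recordColumns) {
    if (insertRow(recordColumns)) {
      return true; // happy path: schema already matches
    }
    tableColumns.addAll(recordColumns); // stand-in for evolveSchemaIfNeeded
    return insertRow(recordColumns);    // retry after evolution
  }

  public static void main(String[] args) {
    ErrorPathEvolution e = new ErrorPathEvolution();
    System.out.println(e.insertWithEvolution(List.of("ID", "NAME")));
  }
}
```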

@acristu (Author) commented:

Ok, that should not be a problem; schema changes should not happen often. The problem is that without these proposed changes (https://github.com/streamkap-com/snowflake-kafka-connector/blob/63bc190dd75692e6423d3f50b25143dafaa40d1a/src/main/java/com/snowflake/kafka/connector/internal/streaming/TopicPartitionChannel.java#L264), we cannot enable schema evolution without setting ENABLE_SCHEMA_EVOLUTION. We would like to use only SchematizationUtils.evolveSchemaIfNeeded and not ENABLE_SCHEMA_EVOLUTION.

@@ -209,6 +209,9 @@ private static String convertToSnowflakeType(Type kafkaType) {
case BOOLEAN:
return "BOOLEAN";
case STRING:
if (semanticType != null && semanticType.equals("io.debezium.data.Json")) {
Contributor:

What's the issue with using VARCHAR in this case?

@acristu (Author) commented:

With VARCHAR you'd have to parse the JSON each time you wanted to pull out a field. With VARIANT you can access fields directly (e.g. json_col:field), so this skips the extra processing.
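For example (illustrative SQL; the table and column names are placeholders, assuming the column holds a JSON document like {"field": 42}):

```sql
-- VARCHAR column: the JSON text must be re-parsed on every read
select parse_json(c):field from t_varchar;

-- VARIANT column: direct path access, no extra parsing step
select c:field from t_variant;
```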

Contributor:

I see, this makes sense. I guess this is a general issue for JSON values, since they will all be mapped to STRING. I will see what we can do here, thanks!

acristu commented Feb 10, 2023

> Testing postgresql source (debezium) -> snowflake sink, every time a new column is added to the source postgresql table the snowflake sink task crashes with:
>
> java.lang.IllegalStateException: No current assignment for partition tsttbl-0
> 	at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:370)
> 	at org.apache.kafka.connect.runtime.WorkerSinkTask.rewind(WorkerSinkTask.java:625)
>
> The topic has one partition, "tasks.max": "1", one connect worker ... troubleshooting ongoing ...

This was because of a configuration we used:

    "transforms": "changeTopicName",
    "transforms.changeTopicName.regex": "^[^.]\\w+.\\w+.(.*)",
    "transforms.changeTopicName.replacement": "$1",
    "transforms.changeTopicName.type": "org.apache.kafka.connect.transforms.RegexRouter",

Because schema evolution is applied on the error path, a consumer.seek back to the previous offsets is needed; using the wrong (rewritten) topic name caused the exception. Removing the above SMT made things work well.
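The RegexRouter above rewrites Debezium's server.schema.table topic names down to the bare table name, which is why the rewind then targeted tsttbl-0, a partition the consumer was never assigned. The same regex and replacement can be demonstrated directly (dbserver.public.tsttbl is a hypothetical source topic name):

```java
public class TopicNameDemo {
  public static void main(String[] args) {
    // Same pattern and replacement as the RegexRouter SMT configuration above:
    // strip the "server.schema." prefix, keep only the table name.
    String renamed = "dbserver.public.tsttbl".replaceAll("^[^.]\\w+.\\w+.(.*)", "$1");
    System.out.println(renamed); // tsttbl
  }
}
```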

@sfc-gh-tzhang sfc-gh-tzhang self-assigned this Feb 24, 2023
@sfc-gh-tzhang (Contributor) left a comment:

Quick update:

  • The wildcard support is a good idea, and we will look into it separately
  • Using the semanticType to support more data types won't work in general, but it looks like there is a doc field in ConnectSchema which we could use as a hint
  • We don't plan to support auto schema evolution for existing tables, because it's a one-time operation and we want customers to do it manually to make sure it is something they want. It will be automatically enabled for tables created by KC

acristu commented Mar 24, 2023

Hi @sfc-gh-tzhang ,

Thank you for the update. It is difficult to use the connector without regex support for topic-to-table mappings and without JSON/date/timestamp support in schema evolution for all Debezium sources; for now we are using the workarounds in this PR.

@sfc-gh-achyzy (Contributor) commented:

Hi @acristu. Do you need more help with this PR, or can it be closed now that you've solved the issue with a workaround?
