Feature: Handle various Primary Key outlier cases #53

nickchomey · 2024-09-28T21:55:43Z

Feature description

I just tried to snapshot my entire database whose tables are generated by various wordpress plugins. As it turns out, many tables have a Unique Key rather than a Primary Key and the conduit-connector-mysql/source.go->getPrimaryKey() method therefore errors-out and the entire snapshot fails.

It seems that some of these instances are just poor schema design by the plugin developers. But others are legitimate use-cases:

there isn't even an auto-incrementing column, and instead the table is more of a key-value store
or the "primary key" is more of a composite unique key (e.g. multiple column values joined together to form a unique key, such that its impossible to create a row that has that same combination of column values).

Given this, as well as the reality that it won't always be possible to fix/add primary keys when they're missing (I definitely wont have the ability to modify the schemas of all mysql sources that I will be connecting to), this connector should probably not insist/rely upon each table having a Primary Key and also have an automated, but configurable, way of handling these situations.

I'm not sure what the right way to handle this should be, but here are some ideas:

For safety's sake, the default should probably not proceed if there's a primary key missing on any table. However, don't throw a fatal error that immediately crashes Conduit, and instead let the tables all get scanned, extract the primary keys, and then cancel the snapshot before any records are generated, and log all of the tables that are missing primary keys so that action can be taken.
Provide some setting(s) that allow for either specifying which tables don't need a primary key or, better yet, provide a setting that defaults to false but when set to true allows the snapshot and syncing to proceed when there are tables with primary keys missing. The idea being that once they've reviewed the problem tables that were logged with the first idea, they can add keys as-needed and then ignore any that are missing.
Provide a setting that looks for an auto-increment column, and uses that as the primary key if it exists.
Provide a setting as an ultimate fallback that allows for specifying the default value for missing primary keys (eg. can be a blank string, null, "id", etc...)

Perhaps something can be done with the Avro Schema Registry? I still don't really have a grasp of what it does, let alone how to make use of it in a pipeline. But if that seems like the right way to handle this, that would be great.

The text was updated successfully, but these errors were encountered:

nickchomey · 2024-09-29T03:18:58Z

Another issue: when the primary key is not an integer.

conduit-connector-mysql/fetch_worker.go->getMaxValue() tries to get the maximum value of the primary key column. But there are instances in which the primary key is not an integer (im looking at one where it is an IP address).

How can this be handled?

Edit:

It was described in our call that a primary issue for non-ordinal PKs is when snapshotting and the snapshot fails or new records are added during the snapshot.

Perhaps a solution would be to add an integer or ULID somehow to the key in the openCDC record and/or the schema registry? Perhaps it could be a sibling field in the openCDC.Key field to whatever the actual PK is?

nickchomey · 2024-09-29T03:26:41Z

Another: the primary key is a composite key

I have a few dozen tables where the primary key is a composite of 2+ columns. The postgres and mysql connectors seems to just fail for the Source Connector, and randomly takes the first key for the destination connector.

It was described in our call that it shouldn't be a big issue to add support for including multiple/all primary keys in the openCDC record's Key field. That would be a good solution. It would also help solve the issue with non-ordinal Key mentioned in the comment above.

nickchomey · 2024-09-29T04:14:43Z

Debezium seems to have comprehensive support for all of the above.

use unique key as fallback
specify which columns form the record key
optional primary key (though it looks like you only get CREATE events when there is no primary key.

nickchomey · 2024-10-19T16:01:12Z

I also wonder if it would be possible to add some metadata for foreign keys? It would be helpful in re-creating relationships in my surrealdb destination connector. I can open another issue if that's more appropriate

alarbada · 2024-10-19T16:03:54Z

I also wonder if it would be possible to add some metadata for foreign keys? It would be helpful in re-creating relationships in my surrealdb destination connector. I can open another issue if that's more appropriate

Please do! I'd appreciate it

alarbada · 2024-10-28T21:31:17Z

Hello @nickchomey, how many of your tables do you estimate are affected by this? We don't technically require right now to have an ordinal key to sort records really, just that we've got a primary key. The autoincrement part right now should be enforced by the user, the connector does not error if the primary key is non ordinal. For example, it will use lexicographical ordering (dictionary) if the primary key is a string.

Now, if you had ULID as a column and I allowed to specify custom columns to sort out the ordering, yeah, that would fix this issue.

alarbada · 2024-10-28T21:35:16Z

It would also help solve the issue with non-ordinal Key mentioned in the comment above.

I'm sorry, but I don't see how. We need some column that, when specified in the order by, guarantees us that we don't miss new inserted records, otherwise we loose data.

nickchomey · 2024-10-28T22:55:49Z

Honestly, I dont quite recall how many tables are affected by this (and it would take a bit of effort to figure out again)

I dont have UUID or ULID, and it wouldnt be feasible for me (or anyone in this common situation) to implement that.

A handful have non-ordinal/auto-incrementing integer keys. I also suppose that it could be documented that if there is a specific primary key, but it isn't an auto-incrementing integer/ULID, snapshotting will only be reliable if there aren't any changes made while the snapshot is in progress?

Many more tables have composite keys, and the connector doesn't always select the correct one automatically.

It would also help solve the issue with non-ordinal Key mentioned in the comment above.

I dont recall what my idea was with this, nor can I figure it out right now! Best to assume it was a mistake. Still, it could be helpful in various ways to include all of the columns of the composite primary key.

Perhaps, as has been discussed elsewhere, it could be useful/necessary to be able to specify which column is the primary key in each table (with omissions defaulting to the current mechanism)?

simonl2002 added this to Conduit Main Sep 28, 2024

github-project-automation bot moved this to Triage in Conduit Main Sep 28, 2024

nickchomey changed the title ~~Feature: Handle case where a table doesn't have a primary key~~ Feature: Handle various Primary Key outlier cases Sep 29, 2024

nickchomey mentioned this issue Sep 29, 2024

Fixes for the start row id #56

Closed

3 tasks

lovromazgon moved this from Triage to Todo in Conduit Main Oct 4, 2024

nickchomey mentioned this issue Oct 10, 2024

Fixes naïve snapshot iteration #59

Merged

3 tasks

alarbada mentioned this issue Oct 19, 2024

Source connector should provide a schema #62

Open

alarbada added the source related to the source connector label Oct 28, 2024

alarbada moved this from Todo to In Progress in Conduit Main Oct 28, 2024

alarbada self-assigned this Oct 28, 2024

alarbada added the triage Needs to be triaged label Oct 28, 2024

alarbada moved this from In Progress to Triage in Conduit Main Oct 28, 2024

alarbada removed the triage Needs to be triaged label Oct 28, 2024

alarbada linked a pull request Oct 30, 2024 that will close this issue

Smarter and more flexible table configuration #69

Open

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature: Handle various Primary Key outlier cases #53

Feature: Handle various Primary Key outlier cases #53

nickchomey commented Sep 28, 2024 •

edited

Loading

nickchomey commented Sep 29, 2024 •

edited

Loading

nickchomey commented Sep 29, 2024 •

edited

Loading

nickchomey commented Sep 29, 2024

nickchomey commented Oct 19, 2024

alarbada commented Oct 19, 2024 •

edited

Loading

alarbada commented Oct 28, 2024

alarbada commented Oct 28, 2024

nickchomey commented Oct 28, 2024

Feature: Handle various Primary Key outlier cases #53

Feature: Handle various Primary Key outlier cases #53

Comments

nickchomey commented Sep 28, 2024 • edited Loading

Feature description

nickchomey commented Sep 29, 2024 • edited Loading

nickchomey commented Sep 29, 2024 • edited Loading

nickchomey commented Sep 29, 2024

nickchomey commented Oct 19, 2024

alarbada commented Oct 19, 2024 • edited Loading

alarbada commented Oct 28, 2024

alarbada commented Oct 28, 2024

nickchomey commented Oct 28, 2024

nickchomey commented Sep 28, 2024 •

edited

Loading

nickchomey commented Sep 29, 2024 •

edited

Loading

nickchomey commented Sep 29, 2024 •

edited

Loading

alarbada commented Oct 19, 2024 •

edited

Loading