Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Handle various Primary Key outlier cases #53

Open
nickchomey opened this issue Sep 28, 2024 · 8 comments · May be fixed by #69
Open

Feature: Handle various Primary Key outlier cases #53

nickchomey opened this issue Sep 28, 2024 · 8 comments · May be fixed by #69
Assignees
Labels
source related to the source connector

Comments

@nickchomey
Copy link

nickchomey commented Sep 28, 2024

Feature description

I just tried to snapshot my entire database whose tables are generated by various wordpress plugins. As it turns out, many tables have a Unique Key rather than a Primary Key and the conduit-connector-mysql/source.go->getPrimaryKey() method therefore errors-out and the entire snapshot fails.

It seems that some of these instances are just poor schema design by the plugin developers. But others are legitimate use-cases:

  • there isn't even an auto-incrementing column, and instead the table is more of a key-value store
  • or the "primary key" is more of a composite unique key (e.g. multiple column values joined together to form a unique key, such that its impossible to create a row that has that same combination of column values).

Given this, as well as the reality that it won't always be possible to fix/add primary keys when they're missing (I definitely wont have the ability to modify the schemas of all mysql sources that I will be connecting to), this connector should probably not insist/rely upon each table having a Primary Key and also have an automated, but configurable, way of handling these situations.

I'm not sure what the right way to handle this should be, but here are some ideas:

  1. For safety's sake, the default should probably not proceed if there's a primary key missing on any table. However, don't throw a fatal error that immediately crashes Conduit, and instead let the tables all get scanned, extract the primary keys, and then cancel the snapshot before any records are generated, and log all of the tables that are missing primary keys so that action can be taken.
  2. Provide some setting(s) that allow for either specifying which tables don't need a primary key or, better yet, provide a setting that defaults to false but when set to true allows the snapshot and syncing to proceed when there are tables with primary keys missing. The idea being that once they've reviewed the problem tables that were logged with the first idea, they can add keys as-needed and then ignore any that are missing.
  3. Provide a setting that looks for an auto-increment column, and uses that as the primary key if it exists.
  4. Provide a setting as an ultimate fallback that allows for specifying the default value for missing primary keys (eg. can be a blank string, null, "id", etc...)

Perhaps something can be done with the Avro Schema Registry? I still don't really have a grasp of what it does, let alone how to make use of it in a pipeline. But if that seems like the right way to handle this, that would be great.

@nickchomey nickchomey changed the title Feature: Handle case where a table doesn't have a primary key Feature: Handle various Primary Key outlier cases Sep 29, 2024
@nickchomey
Copy link
Author

nickchomey commented Sep 29, 2024

Another issue: when the primary key is not an integer.

conduit-connector-mysql/fetch_worker.go->getMaxValue() tries to get the maximum value of the primary key column. But there are instances in which the primary key is not an integer (im looking at one where it is an IP address).

How can this be handled?

Edit:

It was described in our call that a primary issue for non-ordinal PKs is when snapshotting and the snapshot fails or new records are added during the snapshot.

Perhaps a solution would be to add an integer or ULID somehow to the key in the openCDC record and/or the schema registry? Perhaps it could be a sibling field in the openCDC.Key field to whatever the actual PK is?

@nickchomey
Copy link
Author

nickchomey commented Sep 29, 2024

Another: the primary key is a composite key

I have a few dozen tables where the primary key is a composite of 2+ columns. The postgres and mysql connectors seems to just fail for the Source Connector, and randomly takes the first key for the destination connector.

It was described in our call that it shouldn't be a big issue to add support for including multiple/all primary keys in the openCDC record's Key field. That would be a good solution. It would also help solve the issue with non-ordinal Key mentioned in the comment above.

@nickchomey
Copy link
Author

Debezium seems to have comprehensive support for all of the above.

@nickchomey
Copy link
Author

I also wonder if it would be possible to add some metadata for foreign keys? It would be helpful in re-creating relationships in my surrealdb destination connector. I can open another issue if that's more appropriate

@alarbada
Copy link
Collaborator

alarbada commented Oct 19, 2024

I also wonder if it would be possible to add some metadata for foreign keys? It would be helpful in re-creating relationships in my surrealdb destination connector. I can open another issue if that's more appropriate

Please do! I'd appreciate it

@alarbada alarbada added the source related to the source connector label Oct 28, 2024
@alarbada alarbada moved this from Todo to In Progress in Conduit Main Oct 28, 2024
@alarbada alarbada self-assigned this Oct 28, 2024
@alarbada alarbada added the triage Needs to be triaged label Oct 28, 2024
@alarbada alarbada moved this from In Progress to Triage in Conduit Main Oct 28, 2024
@alarbada alarbada removed the triage Needs to be triaged label Oct 28, 2024
@alarbada
Copy link
Collaborator

Hello @nickchomey, how many of your tables do you estimate are affected by this? We don't technically require right now to have an ordinal key to sort records really, just that we've got a primary key. The autoincrement part right now should be enforced by the user, the connector does not error if the primary key is non ordinal. For example, it will use lexicographical ordering (dictionary) if the primary key is a string.

Now, if you had ULID as a column and I allowed to specify custom columns to sort out the ordering, yeah, that would fix this issue.

@alarbada
Copy link
Collaborator

It would also help solve the issue with non-ordinal Key mentioned in the comment above.

I'm sorry, but I don't see how. We need some column that, when specified in the order by, guarantees us that we don't miss new inserted records, otherwise we loose data.

@nickchomey
Copy link
Author

Honestly, I dont quite recall how many tables are affected by this (and it would take a bit of effort to figure out again)

I dont have UUID or ULID, and it wouldnt be feasible for me (or anyone in this common situation) to implement that.

A handful have non-ordinal/auto-incrementing integer keys. I also suppose that it could be documented that if there is a specific primary key, but it isn't an auto-incrementing integer/ULID, snapshotting will only be reliable if there aren't any changes made while the snapshot is in progress?

Many more tables have composite keys, and the connector doesn't always select the correct one automatically.

It would also help solve the issue with non-ordinal Key mentioned in the comment above.

I dont recall what my idea was with this, nor can I figure it out right now! Best to assume it was a mistake. Still, it could be helpful in various ways to include all of the columns of the composite primary key.

Perhaps, as has been discussed elsewhere, it could be useful/necessary to be able to specify which column is the primary key in each table (with omissions defaulting to the current mechanism)?

@alarbada alarbada linked a pull request Oct 30, 2024 that will close this issue
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
source related to the source connector
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants