Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unique key detector tool #6900

Open
simonaubertbd opened this issue Sep 21, 2024 · 0 comments
Open

Unique key detector tool #6900

simonaubertbd opened this issue Sep 21, 2024 · 0 comments

Comments

@simonaubertbd
Copy link

What's your use case?
More than often, you have to deal with a dataset without knowing what's make a row unique. This can lead to misinterpret the data, cartesian product at join and other funny stuff.

What's your proposed solution?
This is a feature I haven't seen in any data prepation/etl. The core feature is to detect the unique key in a dataframe.

How do I imagine that ?

Entry; one dataframe, ability to select fields or check all, ability to specify a max number of field for combination (empty or 0=no max).
Algo : it tests the count distinct every combination of field versus the count of rows

Result : one row by field combination that works. If no result : "no field combination is unique. check for duplicate or need for aggregation upstream".

ex :

<style> </style>
order_id line_id amount customer site
1 1 100 A U_250
1 2 12 A U_250
1 3 45 A U_250
2 1 75 A U_250
2 2 12 A U_250
3 1 15 B U_250
4 1 45 B U_251

The user will select every field but excluding Amount (he knows that Amount would have no sense in key)

The algo will test the following key
-each separate field
-each combination of two fields
-each combination of three fields
-each combination of four fields

to match the number of row (7)
And gives something like that

<style> </style>
choice number of fields field combination
very good 2 order_id,line_id
average 3 order_id,line_id, customer
average 3 order_id,line_id, site
bad 4 order_id,line_id, site, customer
….

Are there any alternative solutions?
N/A

Best regards,

Simon

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant