You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
What's your use case?
More than often, you have to deal with a dataset without knowing what's make a row unique. This can lead to misinterpret the data, cartesian product at join and other funny stuff.
What's your proposed solution?
This is a feature I haven't seen in any data prepation/etl. The core feature is to detect the unique key in a dataframe.
How do I imagine that ?
Entry; one dataframe, ability to select fields or check all, ability to specify a max number of field for combination (empty or 0=no max).
Algo : it tests the count distinct every combination of field versus the count of rows
Result : one row by field combination that works. If no result : "no field combination is unique. check for duplicate or need for aggregation upstream".
ex :
<style>
</style>
order_id
line_id
amount
customer
site
1
1
100
A
U_250
1
2
12
A
U_250
1
3
45
A
U_250
2
1
75
A
U_250
2
2
12
A
U_250
3
1
15
B
U_250
4
1
45
B
U_251
The user will select every field but excluding Amount (he knows that Amount would have no sense in key)
The algo will test the following key
-each separate field
-each combination of two fields
-each combination of three fields
-each combination of four fields
to match the number of row (7)
And gives something like that
<style>
</style>
choice
number of fields
field combination
very good
2
order_id,line_id
average
3
order_id,line_id, customer
average
3
order_id,line_id, site
bad
4
order_id,line_id, site, customer
…
…
….
Are there any alternative solutions?
N/A
Best regards,
Simon
The text was updated successfully, but these errors were encountered:
What's your use case?
More than often, you have to deal with a dataset without knowing what's make a row unique. This can lead to misinterpret the data, cartesian product at join and other funny stuff.
What's your proposed solution?
This is a feature I haven't seen in any data prepation/etl. The core feature is to detect the unique key in a dataframe.
How do I imagine that ?
Entry; one dataframe, ability to select fields or check all, ability to specify a max number of field for combination (empty or 0=no max).
Algo : it tests the count distinct every combination of field versus the count of rows
Result : one row by field combination that works. If no result : "no field combination is unique. check for duplicate or need for aggregation upstream".
ex :
<style> </style>The user will select every field but excluding Amount (he knows that Amount would have no sense in key)
The algo will test the following key
-each separate field
-each combination of two fields
-each combination of three fields
-each combination of four fields
to match the number of row (7)
<style> </style>And gives something like that
Are there any alternative solutions?
N/A
Best regards,
Simon
The text was updated successfully, but these errors were encountered: