How should I measure the similarity of 2 SQLs? #2870

CBQu · 2024-01-22T06:00:56Z

Goal: I would like to measure the similarity of 2 SQL statements on a scale of [0, 1], where 0 means totally different and 1 means exactly the same. Assume both statements use the same dialect.

My current idea:
I am considering using the diff function. diff returns an operation list. The length of the list can be a measurement of difference.

To be more specific, the similarity function could be something like

1 - len(diff) / some_bound,

where len(diff) == 0 naturally indicates "exactly the same". But some_bound should be a value that describes the maximum possible difference between 2 ASTs, e.g. DELETE FROM a and SELECT 1.

My question: What would be the upper bound of diff, so that I can normalize it? My guess is the sum of the node counts of the two ASTs, but I am not sure whether that is a tight bound.
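To make the proposed formula concrete, here is a minimal sketch of it in Python. It assumes the guessed bound from the question (the sum of the two node counts), which is not shown here to be tight, and it assumes num_changes counts only operations that actually change something. The function name and the node counts are hypothetical and chosen only for illustration.

```python
def similarity_from_changes(num_changes: int, nodes_a: int, nodes_b: int) -> float:
    """Return 1 - num_changes / (nodes_a + nodes_b), clamped to [0, 1]."""
    # Guessed bound from the question; not proven to be tight.
    bound = nodes_a + nodes_b
    if bound == 0:
        # Assumption: two empty trees are treated as identical.
        return 1.0
    # Clamp in case the change count exceeds the guessed bound.
    return max(0.0, min(1.0, 1.0 - num_changes / bound))


# Hypothetical counts: two trees with 3 nodes each.
print(similarity_from_changes(0, 3, 3))  # 1.0
print(similarity_from_changes(3, 3, 3))  # 0.5
print(similarity_from_changes(6, 3, 3))  # 0.0
```

The clamp is a defensive choice: if the real change count can exceed the guessed bound, the score still stays within [0, 1].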

Any idea or alternative solution would be super helpful. Thanks!

Answered by izeigerman

izeigerman · 2024-02-12T17:19:31Z

Collaborator

The diff function returns an edit script that also contains every node that is unchanged. Those nodes are wrapped in Keep.

So the upper bound (denominator) would be the length of the edit script, and the numerator would be the number of Keep nodes in the edit script. For example:

edit_script = diff(...)
numerator = len([e for e in edit_script if isinstance(e, Keep)])
denominator = len(edit_script)
score = numerator / denominator
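The same ratio can be tried out without a real diff. The sketch below is only an illustration: Keep, Insert, and Remove are stand-in classes defined here, not the library's own types, and keep_ratio is a hypothetical helper name. It also assumes that an empty edit script should count as "exactly the same", which avoids a division by zero.

```python
from dataclasses import dataclass


# Stand-in edit operations for illustration only; in real use the edit
# script and its operation types would come from the diffing library.
@dataclass
class Keep:
    node: str


@dataclass
class Insert:
    node: str


@dataclass
class Remove:
    node: str


def keep_ratio(edit_script, keep_type=Keep) -> float:
    """Fraction of edit-script entries that are unchanged (Keep) nodes."""
    if not edit_script:
        # Assumption: an empty script means there is nothing that differs.
        return 1.0
    kept = sum(isinstance(e, keep_type) for e in edit_script)
    return kept / len(edit_script)


# Hypothetical script: two nodes kept, one removed, one inserted.
script = [Keep("SELECT"), Keep("a"), Remove("b"), Insert("c")]
print(keep_ratio(script))  # 0.5
```

Passing the Keep type as a parameter keeps the helper independent of any particular library, so the real Keep class can be supplied in place of the stand-in.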
