How should I measure the similarity of 2 SQLs? #2870
Goal: I would like to measure the similarity of 2 SQLs on a scale of [0, 1], where 0 means totally different and 1 means exactly the same. Assume both SQLs are from the same dialect.
My current idea: compute a diff of the two ASTs and normalize its length. More specifically, the similarity would be a function of `len(diff)` and `some_bound`, where `len(diff) == 0` naturally indicates "exactly the same" and `some_bound` is a value that describes the maximum possible difference between the 2 ASTs.
My question: What would be the upper bound of `diff` to normalize it by? My guess is the sum of the number of nodes in both ASTs, but I am not sure whether that is a tight bound. Any ideas or alternative solutions would be super helpful. Thanks!
Replies: 1 comment
The `diff` function returns an edit script that also contains all nodes that have been left unchanged. Those nodes are wrapped in `Keep`. So the upper bound (denominator) would be the length of the edit script, and the numerator would be the number of `Keep` nodes in the edit script. For example:
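A minimal sketch of this ratio, assuming the `sqlglot` package is installed and that `diff`, `parse_one`, and `sqlglot.diff.Keep` behave as described above. The helper names `keep_ratio` and `sql_similarity` are hypothetical, and treating an empty edit script as fully similar is an assumption made here, not something the library specifies.

```python
def keep_ratio(edit_script, keep_type):
    """Return the fraction of edits in `edit_script` that are `keep_type` instances."""
    if not edit_script:
        # Assumption: an empty edit script means there is nothing to change.
        return 1.0
    kept = sum(isinstance(edit, keep_type) for edit in edit_script)
    return kept / len(edit_script)


def sql_similarity(sql_a, sql_b):
    """Score two SQL strings in [0, 1] as (# Keep edits) / (length of edit script)."""
    # Imported lazily so keep_ratio stays usable without sqlglot installed.
    from sqlglot import diff, parse_one
    from sqlglot.diff import Keep

    edits = diff(parse_one(sql_a), parse_one(sql_b))
    return keep_ratio(edits, Keep)


if __name__ == "__main__":
    try:
        print(sql_similarity("SELECT a, b FROM t", "SELECT a, c FROM t"))
    except ImportError:
        print("sqlglot is not installed; only keep_ratio is available")
```

Because every entry in the edit script is either a `Keep` or some kind of change, the ratio always falls in [0, 1], and it reaches 1 only when every edit is a `Keep`.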