Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIMULATE...GIVEN syntax is unlike SELECT...WHERE and may cause unintended errors #605

Open
versar opened this issue Jan 17, 2018 · 0 comments

Comments

@versar
Copy link

versar commented Jan 17, 2018

The problem is that using the word AND to express multiple constraints in a SIMULATE query is incorrect, however that syntax is correct in a SELECT query so a user may use it by accident. When this happens, no error is returned. The query runs normally. However, the wrong results are produced.

AND works in SELECT queries. In specific, the BQL or SQL expression
1: %bql SELECT var1, var2 FROM t WHERE "var1"=val1 AND "var2"=val2
is interpreted as:
2: %bql SELECT var1, var2 FROM t WHERE ("var1"=val1) AND ("var2"=val2)

However AND does not work normally in SIMULATE queries. In specific, the BQL expression
3: %bql SIMULATE var1, var2 FROM p WHERE "var1"=val1 AND "var2"=val2 LIMIT 10
is interpreted as:
4: %bql SIMULATE var1, var2 FROM p WHERE "var1" = (val1 AND "var2" = val2) LIMIT 10

Therefore, if a user invokes query 3 above, they will not get the answer they expect. Currently, the correct way to write multiple constraints in a SIMULATE expression is for them to be separated by commas, not by AND. Therefore the user will only get the right answer by using the syntax below:
5: %bql SIMULATE var1, var2 FROM p WHERE "var1"=val1, "var2"=val2 LIMIT 10

This is a problem because a user (like me!) who believes naturally that SELECT and SIMULATE have parallel syntax is susceptible to accidentally use AND with SIMULATE instead of commas. The bug is especially insidious because the result will silently be wrong, without an error. And, the results will look almost reasonable -- if the constraints are ignored, the variable will usually just be simulated at values near its overall mean. This is a trap for the unwary, especially when the incorrect results propagate through the rest of an analysis.

Possible resolutions:

  1. Return a syntax error when the compiler sees a SIMULATE query with multiple ANDs instead of commas.
  2. Compile an SIMULATE query with multiple ANDs to interpret the query as if it has commas.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant