-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sanity checking the handling of WHERE statements with datasette #19
Comments
Okay, I've had a go at looking at the SQL to see if that's an issue. I've added some logging, and simplified where clause slightly # this is the execute defined in winging_it.ProxyCursor.execute()
def execute(self, sql, parameters=None):
with open('duckdb.log', 'w') as f:
f.write('sql={}\n'.format(sql))
f.write('parameters={}\n'.format(parameters))
print('# params={} sql={}'.format(parameters, sql))
sql = rewrite(sql)
with open('duckdb.updated.log', 'w') as f:
f.write('sql={}\n'.format(sql))
f.write('parameters={}\n'.format(parameters))
sql, parameters = fixup_params(sql, parameters)
print('## params={} sql={}'.format(parameters, sql))
t = time.time()
rv = self.cursor.execute(sql, parameters)
#print('took {}'.format(time.time() - t))
return rv Here's the SQL before it gets converted for duckdb: select
[Datetime (UTC)],
Country,
[Zone Name],
[Zone Id],
[Carbon Intensity gCO₂eq/kWh (direct)],
[Carbon Intensity gCO₂eq/kWh (LCA)],
[Low Carbon Percentage],
[Renewable Percentage],
[Data Source],
[Data Estimated],
[Data Estimation Method]
from
result_zstd
where
[Datetime (UTC)] < "2022-01-01"
limit
10 Here's the updated sql SELECT "Datetime (UTC)",
Country,
"Zone Name",
"Zone Id",
"Carbon Intensity gCO₂eq/kWh (direct)",
"Carbon Intensity gCO₂eq/kWh (LCA)",
"Low Carbon Percentage",
"Renewable Percentage",
"Data Source",
"Data Estimated",
"Data Estimation Method"
FROM result_zstd
WHERE "Datetime (UTC)" < "2022-01-01"
LIMIT 10
The error I see is as follows in the logs:
For some reason, it looks like duckdb is looking at the wrong part of the query, but I'm not sure why this is happening. I'll get a reproducible test case tonight. |
Hey, thanks for the kind comments & detailed report! What happens if you try single quotes rather than double quotes for the timestamp / date? SQLite has some unfortunate quoting behaviour that dates from them trying to conform to what an ancient MySQL version did: https://www.sqlite.org/quirks.html#double_quoted_string_literals_are_accepted If (You might still hit other issues with the integration, of course! But hopefully this gets you one step closer.) |
Haha! I think the double quote / single quote string was the issue! Using the combo of square brackets for column names, and single quote for values in the datasette SQL interface worked without the errors, and I can reliably reproduce the error by switching to use double quotes in the SQL interface. select
[Datetime (UTC)],
Country,
[Zone Name],
[Zone Id],
[Carbon Intensity gCO₂eq/kWh (direct)],
[Carbon Intensity gCO₂eq/kWh (LCA)],
[Low Carbon Percentage],
[Renewable Percentage],
[Data Source],
[Data Estimated],
[Data Estimation Method]
from
result_zstd
where
[Datetime (UTC)] BETWEEN '2022-01-02' AND '2022-01-03'
AND [Zone Id] = 'DK'
limit
101 Result! Thank you! |
With this in mind, the error that is returned is pretty cryptic. Do you have any suggestions for catching the use of double strings in this scenario to warn people? You could do it with some code a bit like the following, and a better named exception: import re
def has_double_quoted_string(text):
pattern = r'"([^"]*)"'
matches = re.findall(pattern, text)
return len(matches) > 0
if has_double_quoted_string(sql):
raise Exception('Sorry, double quoted strings work in SQLite, but DuckDB handles them differently. Please use the single quotes for literal values') The problem is, I don't think you can guarantee that the rest of datasette is avoiding the use of double quoted strings. The code above also throws an exception when you have a string representation of a whole query, that begins and ends with double quote strings. For the code above to work, you'd need to know to check for when a double quoted string is only being used for literal, and I don't know how to do that. I'm not sure how to do that, but I'll have a look around. |
okay, this looks like it'll be a hairy regular expression. It might be simpler to hook into the SQL interface to warn users in writing. I'll see how hard that is. |
I really like the idea of trying to nudge the user towards a solution when the error happens. A rough and ready approach (which is the standard this project aims for) might be to wrap ProxyCursor.execute and ProxyConnection.execute's invocations of the underlying DuckDB If they catch a
they can pull out the referenced column, see if it's quoted with double quotes and if it's present in the raw SQL with double quotes. If so, it can throw a new error that wraps the original one and points them towards checking for double quotes vs single quotes.
Well, yes and no. I've opened a PR (simonw/datasette#2004) to fix all of its current misuses of double quotes. In the meantime,
Yes, I think this is the crux of it. SQLite's silent transformation is bad, as the authors themselves note. A query author typoing a column name can result in a query silently "succeeding". That is, the query will run without errors, but do something entirely unintended, as the clause will now refer to a fixed literal instead of a column name. |
Ok, this approach sounds a lot more sensible,
Something like this seems to work and is catching the exception with the message whilst letting everything else through: import re
from duckdb import BinderException
try:
rv = self.cursor.execute(sql, parameters)
except BinderException as exc:
# should return 'Binder Error: Referenced column "LITERAL_IN_DOUBLE_QUOTES" '
referenced_column_message = exc.args[0].split(' not found in FROM clause')[0]
# a regular expression to find the literal wrapped by the double quoted string
pattern = r'(?<![A-Za-z0-9_])"(?:[^"]*(?:"[^"]*)*[^"]*)"(?![A-Za-z0-9_])'
# find the instance of the double quoted literal in the message, so you can refer to it in
# the new exception message
matches = re.findall(pattern, referenced_column_message)
if matches:
raise Exception((
f"It looks like you are using a double quoted string for a value at: {matches[0]}. "
"To make this work with DuckDB, wrap it in single quoted strings instead."
)
) Because we are now just checking that an error string is formatted a given way, this doesn't look too hard to write tests around. I'll make a PR later this week. thanks @cldellow ! |
OK, @cldellow - I think this PR is in ok shape to look over now. |
…lback Add fallback exception when double quotes are used - see issue #19
Thanks for the PR! I've merged it, keeping the gitpod stuff. I don't use gitpod myself, but it seems reasonable to include it. I've released v0.6.1 to pypi with the improvements + a shout-out to you in the release notes. |
Hi there.
First of all, thanks for making datasette-parquet - I was very happy to discover it exists.
I'm using it on a project to play around with some recently published open data and whenever I try adding a "WHERE" clause to a query that would go to duckdb I'm seeing an error.
When I run load a page without a where clause, all is well.
However once I add a where clause like so:
I end up with a confusing "Binder" error that I'm not sure how to debug:
I'm working to get a test case together, but in the meantime, is the rewrite.py the file to check to see if there is any logic that it might be causing this?
https://github.com/cldellow/datasette-parquet/blob/main/datasette_parquet/rewrite.py#L80-L81
I see some logging statements, and knowing where to look would help for me to create a PR to fix it once I understand the cause of the issue.
The text was updated successfully, but these errors were encountered: