Offer a `format` param to `from`, for DuckDB? #2168
Could this also be a solution to #1535? So far the experience of reading files with DuckDB is not good:

from `https://example.com/foo.parquet`
take 10

SELECT
*
FROM
"https://example"."com/foo".parquet
LIMIT
10
-- Generated by PRQL compiler version:0.6.1 (https://prql-lang.org)

@eitsupi great point, that is very bad... Unfortunately I don't think this would help there, since that happens earlier in the compilation.
We discussed this on the dev call today (which @eitsupi you'd be more than welcome to join, #1083 for details; though I'm not sure if the timezone works well for you...). I'm going to think about this & #1535 more; there are some good tradeoffs that @aljazerzen brought up. I think it's important we make this much better though; this is one of those things that really should "just work", and is likely to be salient for the sort of folks who are trying out new tools, like PRQL / DuckDB.
(Yeah, it's midnight in my timezone and I need to sleep...) (Actually, yesterday I was up, which is bad...)
Ha, yeah, the timezones are hard; any earlier is difficult for me and any later is difficult for the Europe folks. Maybe we should do them every 13 days & 16 hours to make it fairer. FWIW your English is excellent in writing... To the extent you wanted to attend once soon, you'd be very welcome, I know folks would enjoy meeting you! (But ofc no stress)
We have this, CC @ywelsch
to

WITH table_0 AS (
SELECT
*
FROM
'https://example.com/foo.parquet'
)
SELECT
*
FROM
table_0 AS table_1
-- Generated by PRQL compiler version:0.6.1 (https://prql-lang.org)

I'll prioritize making this much better, it's quite bad at the moment.
What about a
Your example would then be:

from_file format:parquet `s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*`
take 10

and similarly

from_file `albums.csv`
take 10

where the
I think at the time there was a concern about keeping the number of keywords small. Given that this has to be translated into
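
To make the translation step in this proposal concrete, here is a minimal Python sketch of how a `from_file` call with an optional `format` could be lowered to a DuckDB table function. It is a toy model, not PRQL compiler code: the name `lower_from_file` and the extension-based fallback are assumptions made for illustration. `read_parquet` and `read_csv_auto` are DuckDB table functions.

```python
# Toy model only: maps a (path, format) pair to a DuckDB table-function call.
# `lower_from_file` and the extension fallback are hypothetical, not PRQL APIs.
DUCKDB_READERS = {
    "parquet": "read_parquet",
    "csv": "read_csv_auto",
}


def lower_from_file(path, fmt=None):
    """Return a DuckDB FROM-clause expression for reading `path`."""
    if fmt is None:
        # Assumed fallback: infer the format from the file extension.
        fmt = path.rsplit(".", 1)[-1].lower()
    if fmt not in DUCKDB_READERS:
        raise ValueError(f"unsupported format: {fmt!r}")
    # Escape single quotes so the path stays one SQL string literal.
    escaped = path.replace("'", "''")
    return f"{DUCKDB_READERS[fmt]}('{escaped}')"


print(lower_from_file(
    "s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*",
    fmt="parquet",
))
print(lower_from_file("albums.csv"))
```

In this sketch the glob path needs an explicit `fmt`, because its last dot-separated component is `parquet/*` rather than a bare extension; that is one argument for an explicit `format` parameter instead of relying on inference alone.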
I just found the continuation of the discussion about
To the second comment, @max-sixty you replied
Is this perhaps the answer to the question, i.e. while DuckDB allows saying
If you're looking for a semantic justification, I think one could say that
Another difference would be that
I just caught up on #1535 and I saw that
However I still think it might allow us to provide a more uniform interface for importing files across different databases where db specific workarounds might be required. See for example
I guess you see
In that case I prefer the second proposed form because

from (read_parquet `s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*`)
take 10

More importantly though I'm opposed to
FWIW the following worked for me in Colab:

%%prql duckdb:///:memory:
func read_parquet path -> s"SELECT * from read_parquet({path})"
from (read_parquet "s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*")
filter phylum == 'Chordata'
derive longitude = (decimallongitude | round 0)
take 4

This also sidesteps the ident issue for these cases because the
So @eitsupi, your example can become

func read_parquet path -> s"SELECT * from read_parquet({path})"
from (read_parquet "https://example.com/foo.parquet")
take 10

which in the playground generates

WITH table_0 AS (
SELECT
*
from
read_parquet('https://example.com/foo.parquet')
)
SELECT
*
FROM
table_0 AS table_1
LIMIT
10
-- Generated by PRQL compiler version:0.6.1 (https://prql-lang.org)
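
The difference between the two outputs in this thread can be illustrated with a small Python toy model. It is not the PRQL compiler's logic, and `render_ident` and `render_string` are hypothetical names. The assumption is simply that a backtick-quoted name is treated as a dotted identifier path, while the argument interpolated into the s-string is treated as a string literal.

```python
import re

# Toy model only; `render_ident` and `render_string` are hypothetical names,
# not functions from the PRQL compiler.
PLAIN_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def render_ident(name):
    """Treat `name` as a dotted identifier path and quote parts as needed."""
    parts = name.split(".")
    return ".".join(
        p if PLAIN_IDENT.fullmatch(p) else '"' + p.replace('"', '""') + '"'
        for p in parts
    )


def render_string(value):
    """Treat `value` as a single SQL string literal."""
    return "'" + value.replace("'", "''") + "'"


url = "https://example.com/foo.parquet"
print("FROM " + render_ident(url))
print("FROM read_parquet(" + render_string(url) + ")")
```

Under these assumptions the first line splits the URL at each dot, matching the broken `"https://example"."com/foo".parquet` form reported earlier, while the second keeps the URL intact inside `read_parquet(...)`.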
Great, I think this is an excellent & easy thing to add, let's do it. We can refine later re strings vs idents, and collapsing the additional
Do others agree? There's a lot of text above @snth; assuming we do the
Cool, let me see if I can put together a PR tomorrow. Yeah, sorry about all the text - it took me a while to work through this and figure out the potential differences between
This is a brilliant solution!
I agree; to the extent something is only DuckDB, we could put it behind a
(But definitely no need to wait until we get modules, at our most conservative we can add a note we may make this change in the changelog)
@snth lmk if you're up for this, otherwise I'll do it
Has this been finished with #2409?
Yes, closing!
From PRQL/pyprql#150
I think probably we add a param to `from` that would parse to `read_parquet` for DuckDB, so the expression above would be:
into:
Another option would be: