Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Offer a format param to from, for DuckDB? #2168

Closed
max-sixty opened this issue Mar 13, 2023 · 16 comments
Closed

Offer a format param to from, for DuckDB? #2168

max-sixty opened this issue Mar 13, 2023 · 16 comments

Comments

@max-sixty
Copy link
Member

max-sixty commented Mar 13, 2023

From PRQL/pyprql#150


I think probably we add a param to from that would parse to read_parquet for DuckDB, so the expression above would be:

from `s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*` format:parquet
take 10

into:

SELECT * 
FROM read_parquet("s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*")
LIMIT 10

Another option would be:

from (read_parquet `s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*`)
take 10
@eitsupi
Copy link
Member

eitsupi commented Mar 14, 2023

Could this be also a solution to #1535?

So far the experience of reading files with duckdb is not good.......

from `https://example.com/foo.parquet`
take 10
SELECT
  *
FROM
  "https://example"."com/foo".parquet
LIMIT
  10

-- Generated by PRQL compiler version:0.6.1 (https://prql-lang.org)

@max-sixty
Copy link
Member Author

@eitsupi great point, that is very bad...

Unfortunately I don't think this would help there, since that happens earlier in the compilation.

@max-sixty
Copy link
Member Author

We discussed this on the dev call today (which @eitsupi you'd be more than welcome to join, #1083 for details; though I'm not sure if the timezone works well for you...)

I'm going to think about this & #1535 more; there are some good tradeoffs that @aljazerzen brought up.

I think it's important we make this much better though; this is one of the those things that really should "just work", and is likely to be salient for the sort of folks who are trying out new tools, like PRQL / DuckDB.

@eitsupi
Copy link
Member

eitsupi commented Mar 20, 2023

which @eitsupi you'd be more than welcome to join, #1083 for details; though I'm not sure if the timezone works well for you...

(Yeah, it's midnight in my timezone and I need to sleep...... ) (Actually, yesterday I was up, which is bad......)
(I hope you don't mind that my English is terrible and I don't think I can keep up with the conversation.)

@max-sixty
Copy link
Member Author

(Yeah, it's midnight in my timezone and I need to sleep...... ) (Actually, yesterday I was up, which is bad......) (I hope you don't mind that my English is terrible and I don't think I can keep up with the conversation.)

Ha, yeah, the timezones are hard — any earlier is difficult for me and any later is difficult for the Europe folks. Maybe we should do them every 13 days & 16 hours to make it fairer.

FWIW your English is excellent in writing...

To the extent you wanted to attend once soon, you'd be very welcome, I know folks would enjoy meeting you! (But ofc no stress)

@max-sixty
Copy link
Member Author

max-sixty commented Mar 21, 2023

We have this, CC @ywelsch

from s"SELECT * FROM 'https://example.com/foo.parquet'"

to

WITH table_0 AS (
  SELECT
    *
  FROM
    'https://example.com/foo.parquet'
)
SELECT
  *
FROM
  table_0 AS table_1

-- Generated by PRQL compiler version:0.6.1 (https://prql-lang.org)

I'll prioritize making this much better, it's quite bad at the moment

@snth
Copy link
Member

snth commented Mar 22, 2023

What about a from_file transform as we briefly discussed in #286 (comment) (right at the bottom of that comment and subsequent comments):

Your example would then be:

from_file format:parquet `s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*`
take 10

and similarly

from_file `albums.csv`
take 10

where the format:csv parameter can be inferred from the file extension.

I think at the time there was a concern about keeping the number of keywords small. Given that this has to be translated into read_parquet, read_csv or read_json on the DuckDB side and these aren't that widely supported in other RDBMSs, it seems justified to me to have a separate keyword for this.

@snth
Copy link
Member

snth commented Mar 22, 2023

I just found the continuation of the discussion about from_file at #1516 (comment).

To the second comment, @max-sixty you replied

I see from_file as a PRQL function which reads a file and uses its data in a query, like from_text does. Otherwise why not just defer to the DB?

Is this perhaps the answer to the question, i.e. while DuckDB allows saying FROM file.parquet, this is syntactic sugar for FROM read_parquet('file.parquet') in cases where the filetype can be inferred. In cases where the filetype can't be inferred read_parquet has to be explicitly specified, on the PRQL side that could be accomplished by supplying the format:parquet argument.

If you're looking for a semantic justification, I think one could say that from_text and from_file both take objects that aren't inherently relations and transform them into relations while from allows referencing relations that are already defined, either because they're database tables or views or previously defined relations such as CTEs.

Another difference would be that from_file takes a format argument while from doesn't.


I just caught up on #1535 and I saw that from_file wouldn't make any difference to the ident issue.

However I still think it might allow us to provide a more uniform interface for importing files across different databases where db specific workarounds might be required. See for example


I guess you see from_file as just producing literal values in the query like from_text and the parsing of the file happening by the PRQL compiler rather than the database. I initially struggled to see the point of that but I can see some use cases now, the main one being potentially sensitive information or environment specific parameters. The relevant differences between from_file and from in that case would be that:

  • from_file would read the file at query compile time whereas from reads it at query runtime.
  • from_file would have the file be read by the PRQL compiler while from has the file be read by the query execution engine.

In that case I prefer the second proposed form because read_parquet could probably be implemented as a stdlib or module function, i.e. the following form:

from (read_parquet `s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*`)
take 10

More importantly though I'm opposed to from gaining a format parameter because that doesn't make sense in the general case.

@snth
Copy link
Member

snth commented Mar 22, 2023

FWIW the following worked for me in Colab:

%%prql duckdb:///:memory:
func read_parquet path -> s"SELECT * from read_parquet({path})"

from (read_parquet "s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*")
filter phylum == 'Chordata'
derive longitude = (decimallongitude | round 0)
take 4

This also sidesteps the ident issue for these cases because the read_parquet function now just takes a string as a parameter, not an ident.

So @eitsupi , your example can become

func read_parquet path -> s"SELECT * from read_parquet({path})"

from (read_parquet "https://example.com/foo.parquet")
take 10

which in the playground generates

WITH table_0 AS (
  SELECT
    *
  from
    read_parquet('https://example.com/foo.parquet')
)
SELECT
  *
FROM
  table_0 AS table_1
LIMIT
  10

-- Generated by PRQL compiler version:0.6.1 (https://prql-lang.org)

@max-sixty
Copy link
Member Author

func read_parquet path -> s"SELECT * from read_parquet({path})"

Great, I think this is an excellent & easy thing to add, let's do it. We can refine later re strings vs idents, and collapsing the additional SELECT.

Do others agree?


There's a lot of text above @snth — assuming we do the read_parquet s-string, what else is left? Maybe we open a new issue with a from_file proposal?

@snth
Copy link
Member

snth commented Mar 22, 2023

Cool, let me see if I can put together a PR tomorrow.


Yeah, sorry about all the text - it took me a while to work through this and figure out the potential differences between from_file and from. I don't see from_file as enough of a priority right now that it's worth opening an issue for it.

@eitsupi
Copy link
Member

eitsupi commented Mar 23, 2023

func read_parquet path -> s"SELECT * from read_parquet({path})"

This is a brilliant solution!
However, it would require adding the same number of functions as DuckDB (etc.) functions, so it may seem more ideal to split it up into external modules, etc.

@max-sixty
Copy link
Member Author

However, it would require adding the same number of functions as DuckDB (etc.) functions, so it may seem more ideal to split it up into external modules, etc.

I agree; to the extent something is only DuckDB, we could put it behind a duckdb.read_parquet module.

(But definitely no need to wait until we get modules, at our most conservative we can add a note we may make this change in the changelog)

@max-sixty
Copy link
Member Author

Cool, let me see if I can put together a PR tomorrow.

@snth lmk if you're up for this, otherwise I'll do it

@aljazerzen
Copy link
Member

Has this been finished with #2409?

@max-sixty
Copy link
Member Author

Yes, closing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants