Offer a `format` param to `from`, for DuckDB? #2168

max-sixty · 2023-03-13T23:36:58Z

I think probably we add a param to from that would parse to read_parquet for DuckDB, so the expression above would be:

from `s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*` format:parquet
take 10

into:

SELECT * 
FROM read_parquet("s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*")
LIMIT 10

Another option would be:

from (read_parquet `s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*`)
take 10

The text was updated successfully, but these errors were encountered:

eitsupi · 2023-03-14T13:45:32Z

Could this be also a solution to #1535?

So far the experience of reading files with duckdb is not good.......

from `https://example.com/foo.parquet`
take 10

SELECT
  *
FROM
  "https://example"."com/foo".parquet
LIMIT
  10

-- Generated by PRQL compiler version:0.6.1 (https://prql-lang.org)

max-sixty · 2023-03-14T16:04:52Z

@eitsupi great point, that is very bad...

Unfortunately I don't think this would help there, since that happens earlier in the compilation.

max-sixty · 2023-03-19T20:20:00Z

We discussed this on the dev call today (which @eitsupi you'd be more than welcome to join, #1083 for details; though I'm not sure if the timezone works well for you...)

I'm going to think about this & #1535 more; there are some good tradeoffs that @aljazerzen brought up.

I think it's important we make this much better though; this is one of the those things that really should "just work", and is likely to be salient for the sort of folks who are trying out new tools, like PRQL / DuckDB.

eitsupi · 2023-03-20T12:55:53Z

which @eitsupi you'd be more than welcome to join, #1083 for details; though I'm not sure if the timezone works well for you...

(Yeah, it's midnight in my timezone and I need to sleep...... ) (Actually, yesterday I was up, which is bad......)
(I hope you don't mind that my English is terrible and I don't think I can keep up with the conversation.)

max-sixty · 2023-03-20T14:55:08Z

(Yeah, it's midnight in my timezone and I need to sleep...... ) (Actually, yesterday I was up, which is bad......) (I hope you don't mind that my English is terrible and I don't think I can keep up with the conversation.)

Ha, yeah, the timezones are hard — any earlier is difficult for me and any later is difficult for the Europe folks. Maybe we should do them every 13 days & 16 hours to make it fairer.

FWIW your English is excellent in writing...

To the extent you wanted to attend once soon, you'd be very welcome, I know folks would enjoy meeting you! (But ofc no stress)

max-sixty · 2023-03-21T20:25:08Z

We have this, CC @ywelsch

from s"SELECT * FROM 'https://example.com/foo.parquet'"

to

WITH table_0 AS (
  SELECT
    *
  FROM
    'https://example.com/foo.parquet'
)
SELECT
  *
FROM
  table_0 AS table_1

-- Generated by PRQL compiler version:0.6.1 (https://prql-lang.org)

I'll prioritize making this much better, it's quite bad at the moment

snth · 2023-03-22T19:55:25Z

What about a from_file transform as we briefly discussed in #286 (comment) (right at the bottom of that comment and subsequent comments):

Your example would then be:

from_file format:parquet `s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*`
take 10

and similarly

from_file `albums.csv`
take 10

where the format:csv parameter can be inferred from the file extension.

I think at the time there was a concern about keeping the number of keywords small. Given that this has to be translated into read_parquet, read_csv or read_json on the DuckDB side and these aren't that widely supported in other RDBMSs, it seems justified to me to have a separate keyword for this.

snth · 2023-03-22T20:52:48Z

I just found the continuation of the discussion about from_file at #1516 (comment).

To the second comment, @max-sixty you replied

I see from_file as a PRQL function which reads a file and uses its data in a query, like from_text does. Otherwise why not just defer to the DB?

Is this perhaps the answer to the question, i.e. while DuckDB allows saying FROM file.parquet, this is syntactic sugar for FROM read_parquet('file.parquet') in cases where the filetype can be inferred. In cases where the filetype can't be inferred read_parquet has to be explicitly specified, on the PRQL side that could be accomplished by supplying the format:parquet argument.

If you're looking for a semantic justification, I think one could say that from_text and from_file both take objects that aren't inherently relations and transform them into relations while from allows referencing relations that are already defined, either because they're database tables or views or previously defined relations such as CTEs.

Another difference would be that from_file takes a format argument while from doesn't.

I just caught up on #1535 and I saw that from_file wouldn't make any difference to the ident issue.

However I still think it might allow us to provide a more uniform interface for importing files across different databases where db specific workarounds might be required. See for example

I guess you see from_file as just producing literal values in the query like from_text and the parsing of the file happening by the PRQL compiler rather than the database. I initially struggled to see the point of that but I can see some use cases now, the main one being potentially sensitive information or environment specific parameters. The relevant differences between from_file and from in that case would be that:

from_file would read the file at query compile time whereas from reads it at query runtime.
from_file would have the file be read by the PRQL compiler while from has the file be read by the query execution engine.

In that case I prefer the second proposed form because read_parquet could probably be implemented as a stdlib or module function, i.e. the following form:

from (read_parquet `s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*`)
take 10

More importantly though I'm opposed to from gaining a format parameter because that doesn't make sense in the general case.

snth · 2023-03-22T21:16:03Z

FWIW the following worked for me in Colab:

%%prql duckdb:///:memory:
func read_parquet path -> s"SELECT * from read_parquet({path})"

from (read_parquet "s3://gbif-open-data-us-east-1/occurrence/2023-02-01/occurrence.parquet/*")
filter phylum == 'Chordata'
derive longitude = (decimallongitude | round 0)
take 4

This also sidesteps the ident issue for these cases because the read_parquet function now just takes a string as a parameter, not an ident.

So @eitsupi , your example can become

func read_parquet path -> s"SELECT * from read_parquet({path})"

from (read_parquet "https://example.com/foo.parquet")
take 10

which in the playground generates

WITH table_0 AS (
  SELECT
    *
  from
    read_parquet('https://example.com/foo.parquet')
)
SELECT
  *
FROM
  table_0 AS table_1
LIMIT
  10

-- Generated by PRQL compiler version:0.6.1 (https://prql-lang.org)

max-sixty · 2023-03-22T21:27:42Z

func read_parquet path -> s"SELECT * from read_parquet({path})"

Great, I think this is an excellent & easy thing to add, let's do it. We can refine later re strings vs idents, and collapsing the additional SELECT.

Do others agree?

There's a lot of text above @snth — assuming we do the read_parquet s-string, what else is left? Maybe we open a new issue with a from_file proposal?

snth · 2023-03-22T21:57:45Z

Cool, let me see if I can put together a PR tomorrow.

Yeah, sorry about all the text - it took me a while to work through this and figure out the potential differences between from_file and from. I don't see from_file as enough of a priority right now that it's worth opening an issue for it.

eitsupi · 2023-03-23T13:59:47Z

func read_parquet path -> s"SELECT * from read_parquet({path})"

This is a brilliant solution!
However, it would require adding the same number of functions as DuckDB (etc.) functions, so it may seem more ideal to split it up into external modules, etc.

max-sixty · 2023-03-23T14:25:54Z

However, it would require adding the same number of functions as DuckDB (etc.) functions, so it may seem more ideal to split it up into external modules, etc.

I agree; to the extent something is only DuckDB, we could put it behind a duckdb.read_parquet module.

(But definitely no need to wait until we get modules, at our most conservative we can add a note we may make this change in the changelog)

max-sixty · 2023-03-29T02:12:31Z

Cool, let me see if I can put together a PR tomorrow.

@snth lmk if you're up for this, otherwise I'll do it

aljazerzen · 2023-04-13T11:53:48Z

Has this been finished with #2409?

max-sixty · 2023-04-13T14:38:41Z

Yes, closing!

max-sixty added enhancement needs-discussion Undecided dilemma labels Mar 14, 2023

max-sixty added the priority label Mar 22, 2023

max-sixty closed this as completed Apr 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Offer a `format` param to `from`, for DuckDB? #2168

Offer a `format` param to `from`, for DuckDB? #2168

max-sixty commented Mar 13, 2023 •

edited

Loading

eitsupi commented Mar 14, 2023

max-sixty commented Mar 14, 2023

max-sixty commented Mar 19, 2023

eitsupi commented Mar 20, 2023

max-sixty commented Mar 20, 2023

max-sixty commented Mar 21, 2023 •

edited

Loading

snth commented Mar 22, 2023 •

edited

Loading

snth commented Mar 22, 2023

snth commented Mar 22, 2023

max-sixty commented Mar 22, 2023

snth commented Mar 22, 2023

eitsupi commented Mar 23, 2023

max-sixty commented Mar 23, 2023

max-sixty commented Mar 29, 2023

aljazerzen commented Apr 13, 2023

max-sixty commented Apr 13, 2023

Offer a format param to from, for DuckDB? #2168

Offer a format param to from, for DuckDB? #2168

Comments

max-sixty commented Mar 13, 2023 • edited Loading

eitsupi commented Mar 14, 2023

max-sixty commented Mar 14, 2023

max-sixty commented Mar 19, 2023

eitsupi commented Mar 20, 2023

max-sixty commented Mar 20, 2023

max-sixty commented Mar 21, 2023 • edited Loading

snth commented Mar 22, 2023 • edited Loading

snth commented Mar 22, 2023

snth commented Mar 22, 2023

max-sixty commented Mar 22, 2023

snth commented Mar 22, 2023

eitsupi commented Mar 23, 2023

max-sixty commented Mar 23, 2023

max-sixty commented Mar 29, 2023

aljazerzen commented Apr 13, 2023

max-sixty commented Apr 13, 2023

Offer a `format` param to `from`, for DuckDB? #2168

Offer a `format` param to `from`, for DuckDB? #2168

max-sixty commented Mar 13, 2023 •

edited

Loading

max-sixty commented Mar 21, 2023 •

edited

Loading

snth commented Mar 22, 2023 •

edited

Loading