
Tracking: Ad-hoc(batch) ingestion #18583

Open
3 of 27 tasks
st1page opened this issue Sep 18, 2024 · 7 comments
st1page commented Sep 18, 2024

We will enhance the ad-hoc ingestion capability in subsequent releases, with the expectation that users will eventually be able to read data ad hoc as long as it is persisted on an external system.

Streaming storage

For streaming storage, predicate pushdown on the "offset" is required.

  • kafka
    • select from source
    • TVF
  • pulsar
    • select from source
    • TVF
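As a sketch of what these could look like (hypothetical syntax; the metadata column name `_rw_kafka_offset` and the TVF name `read_kafka` are assumptions, not a settled API):

```sql
-- Hypothetical: batch-scan an existing Kafka source, pushing the offset
-- predicate down so only the required offset range is fetched.
SELECT payload
FROM kafka_source
WHERE _rw_kafka_offset BETWEEN 1000 AND 2000;

-- Hypothetical TVF form, reading the topic directly without a pre-declared source.
SELECT *
FROM read_kafka('broker:9092', 'topic_name')
WHERE _rw_kafka_offset >= 1000;
```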

Lake

File source (object store)

  • select from source
  • TVF
    • only support S3 currently
  • optimization
    • column pruning
    • predicate pushdown
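A hedged sketch of the TVF path (the function name follows the DuckDB-style read_parquet discussed later in this thread; only S3 is supported currently):

```sql
-- Hypothetical: scan Parquet files on S3 via a TVF. Selecting two columns
-- and filtering gives the planner the chance to apply column pruning and
-- predicate pushdown against the underlying files.
SELECT col_a, col_b
FROM read_parquet('s3://bucket/path/*.parquet')
WHERE col_a > 100;
```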

Database

Currently we only support CREATE TABLE with a primary key on the CDC connector. To support ad-hoc ingestion, we need to design and introduce new syntax: CREATE SOURCE with a CDC connector. In that case, the source can only be queried ad hoc.

  • PG
  • MySQL
    • select from source
    • TVF
    • optimization
      • column pruning
      • range predicate pushdown
      • lookup
  • MongoDB
    • select from source
    • TVF
    • optimization
      • column pruning
      • range predicate pushdown
      • lookup
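For the database connectors, the TVF form might look like the following (hypothetical, modeled on DuckDB's postgres_query; the range-predicate pushdown and lookup optimizations would forward the filter to the remote database):

```sql
-- Hypothetical: run an ad-hoc query against an external PostgreSQL table,
-- pushing the range predicate down to the remote database.
SELECT *
FROM postgres_query(
    'host=pg_host dbname=mydb',
    'SELECT * FROM t WHERE id BETWEEN 1 AND 100'
);
```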
@github-actions github-actions bot added this to the release-2.1 milestone Sep 18, 2024
kwannoel (Contributor) commented:

Hi, I will help with this issue, starting with TVFs.


xxchan commented Sep 27, 2024

Have we reached consensus to support TVFs? To me, their use cases duplicate those of Sources, so they seem unnecessary.

I'd like to see the rationale and examples where they are more useful than sources before adding them.


st1page commented Sep 27, 2024

> Have we reached consensus to support TVFs? To me, their use cases are duplicated with Sources, so they seem to be unnecessary.
>
> I’d like to see rationales and examples where they are more useful than sources before adding them


xxchan commented Sep 27, 2024

Thanks for the explanation!

> Currently we only support the CDC table and cannot create a source on an external database's table.

This makes me wonder whether this is also related to other shared sources, e.g., Kafka?

We can refer to DuckDB's grammar for these cases.

Compared with DuckDB:

  • They don't have sources at all, so it might be a little different.
  • Their syntax contains an ATTACH, which looks like the CREATE CONNECTION we might have in the future. So maybe we should design that first.
ATTACH 'dbname=postgresscanner' AS postgres_db (TYPE POSTGRES);
SELECT * FROM postgres_query('postgres_db', 'SELECT * FROM cars LIMIT 3');


st1page commented Sep 27, 2024

> Currently we only support the CDC table and cannot create a source on an external database's table.
>
> This makes me wonder whether this is also related to other shared sources, e.g., Kafka?

The issue is not related to "shared"; it is because the CDC source contains multiple tables' changes. Actually, that is a "CONNECTION".


st1page commented Sep 27, 2024

> Compared with DuckDB:
>
>   • They don't have sources at all, so it might be a little different.
>   • Their syntax contains an ATTACH, which looks like the CREATE CONNECTION we might have in the future. So maybe we should design that first.
>
> ATTACH 'dbname=postgresscanner' AS postgres_db (TYPE POSTGRES);
> SELECT * FROM postgres_query('postgres_db', 'SELECT * FROM cars LIMIT 3');

Agreed. cc @chenzl25, do we have a plan to simplify the TVF syntax with connections?


chenzl25 commented Sep 27, 2024

After connections are supported, in my mind a connection can be used in a TVF directly, like:

  • read_parquet(s3_connection, 's3://bucket/path/xxxx.parquet')
  • read_csv(s3_connection, 's3://bucket/path/xxxx.csv')
  • read_json(s3_connection, 's3://bucket/path/xxxx.json')
  • iceberg_scan(iceberg_connection, 'database_name.table_name')
  • postgres_query(pg_connection, 'select * from t')
  • mysql_query(my_connection, 'select * from t')

Connections contain the necessary information to allow a TVF to query the external system.
I think @tabVersion will add Connection support this quarter.
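Putting the two pieces together, a connection-based flow might look like this (hypothetical syntax, assuming a future CREATE CONNECTION statement; all names and parameters are placeholders):

```sql
-- Hypothetical: declare the connection once, then reuse it in TVFs.
CREATE CONNECTION pg_connection WITH (
    type = 'postgres',
    host = 'localhost',
    port = '5432',
    database = 'mydb'
);

SELECT * FROM postgres_query(pg_connection, 'SELECT * FROM t LIMIT 10');
```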
