feat: support read_csv for backends with no native support #9908
base: main
Conversation
ACTION NEEDED: Ibis follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
Skip Trino and Impala in the tests. @cpcloud, requesting your review.
"snowflake", | ||
"pyspark", | ||
): | ||
pytest.skip(f"{con.name} implements its own `read_parquet`") |
These backends have their own implementations, and some of these options could still pass this test there, so I skip those backends.
Force-pushed from 904cf7e to 96ff701.
"""Register a CSV file as a table in the current backend. | ||
|
||
This function reads a CSV file and registers it as a table in the current | ||
backend. Note that for Impala and Trino backends, the performance may be suboptimal. | ||
|
||
Parameters | ||
---------- | ||
path | ||
The data source. A string or Path to the CSV file. | ||
table_name | ||
An optional name to use for the created table. This defaults to | ||
a sequentially generated name. | ||
**kwargs | ||
Additional keyword arguments passed to the backend loading function. | ||
Common options are skip_rows, column_names, delimiter, and include_columns. | ||
More details could be found: | ||
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html | ||
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html | ||
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html |
"""Register a CSV file as a table in the current backend. | |
This function reads a CSV file and registers it as a table in the current | |
backend. Note that for Impala and Trino backends, the performance may be suboptimal. | |
Parameters | |
---------- | |
path | |
The data source. A string or Path to the CSV file. | |
table_name | |
An optional name to use for the created table. This defaults to | |
a sequentially generated name. | |
**kwargs | |
Additional keyword arguments passed to the backend loading function. | |
Common options are skip_rows, column_names, delimiter, and include_columns. | |
More details could be found: | |
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html | |
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html | |
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html | |
"""Register a CSV file as a table in the current backend. | |
This function reads a CSV file and registers it as a table in the | |
current backend. Note that for the Impala and Trino backends, the | |
performance may be suboptimal. | |
Parameters | |
---------- | |
path | |
The data source. A string or Path to the CSV file. | |
table_name | |
An optional name to use for the created table. This defaults to a | |
sequentially generated name. | |
**kwargs | |
Additional keyword arguments passed to the PyArrow loading function. | |
Common options include: | |
- skip_rows | |
- column_names | |
- delimiter | |
- include_columns | |
A full list of options can be found on the following pages: | |
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html | |
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html | |
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html | |
::: {.callout-note} | |
Options from each of the above reference pages can be passed directly to | |
this function, Ibis will handle sorting them into the appropriate | |
options buckets. | |
::: |
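For illustration, the "sorting into option buckets" behavior mentioned in the callout could look roughly like the following. This is a minimal sketch, not code from this PR; the helper name `sort_csv_options` is hypothetical, and it relies on the fact that each PyArrow options class exposes its settings as attributes on a default-constructed instance:

```python
import pyarrow.csv as pcsv

_OPTION_CLASSES = {
    "read_options": pcsv.ReadOptions,        # skip_rows, column_names, ...
    "parse_options": pcsv.ParseOptions,      # delimiter, quote_char, ...
    "convert_options": pcsv.ConvertOptions,  # include_columns, null_values, ...
}


def sort_csv_options(kwargs):
    """Split flat kwargs into per-options-class buckets (hypothetical helper)."""
    buckets = {name: {} for name in _OPTION_CLASSES}
    for key, value in kwargs.items():
        for name, cls in _OPTION_CLASSES.items():
            # Settings are attributes on a default instance, so attribute
            # presence tells us which bucket a kwarg belongs to.
            if hasattr(cls(), key):
                buckets[name][key] = value
                break
        else:
            raise TypeError(f"unknown read_csv option: {key!r}")
    return buckets


buckets = sort_csv_options({"skip_rows": 1, "delimiter": ";"})
table = pcsv.read_csv(
    "path/to/file.csv",
    read_options=pcsv.ReadOptions(**buckets["read_options"]),
    parse_options=pcsv.ParseOptions(**buckets["parse_options"]),
    convert_options=pcsv.ConvertOptions(**buckets["convert_options"]),
)
```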
Read a single csv file:

```python
>>> table = con.read_csv("path/to/file.csv")
```
Suggested change:

Read a single csv file:

```python
>>> table = con.read_csv("path/to/file.csv")
```

Read a single csv file, skipping the first row, with a custom delimiter:

```python
>>> table = con.read_csv("path/to/file.csv", skip_rows=1, delimiter=";")
```

Read a single csv file, but only load the specified columns:

```python
>>> table = con.read_csv("path/to/file.csv", include_columns=["species", "island"])
```
["pyspark"], | ||
condition=IS_SPARK_REMOTE, | ||
raises=PySparkAnalysisException, |
This seems like an unrelated formatting change
"snowflake", | ||
"pyspark", | ||
): | ||
pytest.skip(f"{con.name} implements its own `read_parquet`") |
Suggested change:

```diff
-        pytest.skip(f"{con.name} implements its own `read_parquet`")
+        pytest.skip(f"{con.name} implements its own `read_csv`")
```
```python
@pytest.mark.never(
    [
        "duckdb",
        "polars",
        "bigquery",
        "clickhouse",
        "datafusion",
        "snowflake",
        "pyspark",
    ],
    reason="backend implements its own read_csv",
)
```
Suggested change (remove the decorator):

```diff
-@pytest.mark.never(
-    [
-        "duckdb",
-        "polars",
-        "bigquery",
-        "clickhouse",
-        "datafusion",
-        "snowflake",
-        "pyspark",
-    ],
-    reason="backend implements its own read_csv",
-)
```
You can remove this since you are skipping them inside the test body
```python
pyarrow_table = pa.concat_tables(pyarrow_tables)
table_name = table_name or util.gen_name("read_csv")
self.create_table(table_name, pyarrow_table)
```
Hm, I think this should probably be a temp table or a memtable, because none of our other `read_*` functions create a persistent object.
`memtable` is probably a good option.
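For reference, a minimal sketch of what the `memtable` route could look like; the helper name `_register_in_memory` is hypothetical, not code from this PR:

```python
import ibis
import pyarrow as pa


def _register_in_memory(pyarrow_tables, table_name=None):
    # Concatenate the per-file PyArrow tables, then expose the result as an
    # in-memory ibis table instead of persisting it via create_table.
    combined = pa.concat_tables(pyarrow_tables)
    return ibis.memtable(combined, name=table_name)
```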
Ah, I saw your comments; please ignore the above request.
Description of changes

For backends that lack native `read_csv` support, `pyarrow.csv.read_csv()` will be used. Glob paths such as `./directory/*` are also supported.
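As a rough illustration of this approach, a minimal sketch follows. It assumes callers pass fully-built PyArrow option objects through as keyword arguments; the function name `read_csv_fallback` is hypothetical and not the PR's actual implementation:

```python
import glob

import pyarrow as pa
import pyarrow.csv as pcsv


def read_csv_fallback(path, **pyarrow_kwargs):
    # Expand glob patterns such as ./directory/* into concrete paths;
    # fall back to treating the input as a single file path.
    paths = sorted(glob.glob(str(path))) or [str(path)]
    # Read each file with pyarrow, then combine the results into one table.
    tables = [pcsv.read_csv(p, **pyarrow_kwargs) for p in paths]
    return pa.concat_tables(tables)
```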
Issues closed

Partially solves #9448.