
feat: support read_csv for backends with no native support #9908

Open · wants to merge 19 commits into main
Conversation

jitingxu1 (Contributor)

Description of changes

For backends that lack native `read_csv` support, `pyarrow.csv.read_csv()` will be used as a fallback; a sketch of the approach follows the list below. The fallback can:

  • Read a single file
  • Read all files in a directory, e.g. ./directory/*
  • Read files matching a glob pattern
  • Read from cloud storage systems such as S3 and GCS
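
A minimal sketch of the fallback path for the local-filesystem cases; the helper name read_csv_fallback is illustrative, not the PR's actual code, and the cloud storage cases additionally need a remote filesystem handle:

from glob import glob

import pyarrow as pa
import pyarrow.csv as pcsv

def read_csv_fallback(path: str) -> pa.Table:
    # Expand "./directory/*" or any glob pattern into concrete paths;
    # a plain file path simply matches itself.
    files = sorted(glob(path)) or [path]
    # Read each CSV with pyarrow and concatenate into a single table.
    return pa.concat_tables([pcsv.read_csv(f) for f in files])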

Issues closed

Partially solves #9448

Contributor

ACTION NEEDED

Ibis follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message.

Please update your PR title and description to match the specification.

@jitingxu1 jitingxu1 changed the title feat: Support read_csv for backends with no native support feat: support read_csv for backends with no native support Aug 23, 2024
ibis/backends/tests/test_register.py: review thread resolved (outdated)
ibis/backends/tests/test_register.py: review thread resolved
@jitingxu1 jitingxu1 requested a review from cpcloud August 28, 2024 21:53
@github-actions github-actions bot added the tests Issues or PRs related to tests label Sep 20, 2024
jitingxu1 (Contributor, Author)

Skipped Trino and Impala in the tests.

@cpcloud, requesting a review.

ibis/backends/__init__.py: review thread resolved (outdated)
ibis/backends/__init__.py: review thread resolved
"snowflake",
"pyspark",
):
pytest.skip(f"{con.name} implements its own `read_parquet`")
jitingxu1 (Contributor, Author):
These backends have their own implementations, and some of these option combinations could still pass this test, so I skip these backends.

Comment on lines +1277 to +1295
"""Register a CSV file as a table in the current backend.

This function reads a CSV file and registers it as a table in the current
backend. Note that for Impala and Trino backends, the performance may be suboptimal.

Parameters
----------
path
The data source. A string or Path to the CSV file.
table_name
An optional name to use for the created table. This defaults to
a sequentially generated name.
**kwargs
Additional keyword arguments passed to the backend loading function.
Common options are skip_rows, column_names, delimiter, and include_columns.
More details could be found:
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html
Member:
Suggested change:
"""Register a CSV file as a table in the current backend.
This function reads a CSV file and registers it as a table in the
current backend. Note that for the Impala and Trino backends, the
performance may be suboptimal.
Parameters
----------
path
The data source. A string or Path to the CSV file.
table_name
An optional name to use for the created table. This defaults to a
sequentially generated name.
**kwargs
Additional keyword arguments passed to the PyArrow loading function.
Common options include:
- skip_rows
- column_names
- delimiter
- include_columns
A full list of options can be found on the following pages:
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html
::: {.callout-note}
Options from each of the above reference pages can be passed directly to
this function, Ibis will handle sorting them into the appropriate
options buckets.
:::
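
To illustrate the callout above, here is a hypothetical sketch of how a flat **kwargs dict could be sorted into pyarrow's three CSV option buckets; the helper name split_csv_kwargs and its (partial) field sets are assumptions, not the PR's actual code:

import pyarrow.csv as pcsv

# Partial field sets for each option class; see the pyarrow reference
# pages linked above for the full lists.
_READ = {"skip_rows", "column_names", "block_size", "encoding"}
_PARSE = {"delimiter", "quote_char", "escape_char", "newlines_in_values"}
_CONVERT = {"include_columns", "column_types", "null_values", "strings_can_be_null"}

def split_csv_kwargs(**kwargs):
    # Route each keyword argument into the option class that accepts it.
    read = pcsv.ReadOptions(**{k: v for k, v in kwargs.items() if k in _READ})
    parse = pcsv.ParseOptions(**{k: v for k, v in kwargs.items() if k in _PARSE})
    convert = pcsv.ConvertOptions(**{k: v for k, v in kwargs.items() if k in _CONVERT})
    return read, parse, convert

The three resulting objects can then be passed to pyarrow.csv.read_csv(path, read_options=..., parse_options=..., convert_options=...).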

Comment on lines +1308 to +1311
Read a single csv file:

>>> table = con.read_csv("path/to/file.csv")

Member:
Suggested change:
Read a single csv file:

>>> table = con.read_csv("path/to/file.csv")

Read a single csv file, skipping the first row, with a custom delimiter:

>>> table = con.read_csv("path/to/file.csv", skip_rows=1, delimiter=";")

Read a single csv file, but only load the specified columns:

>>> table = con.read_csv("path/to/file.csv", include_columns=["species", "island"])

Comment on lines +29 to +31
["pyspark"],
condition=IS_SPARK_REMOTE,
raises=PySparkAnalysisException,
Member:
This seems like an unrelated formatting change

"snowflake",
"pyspark",
):
pytest.skip(f"{con.name} implements its own `read_parquet`")
Member:
Suggested change:
- pytest.skip(f"{con.name} implements its own `read_parquet`")
+ pytest.skip(f"{con.name} implements its own `read_csv`")

Comment on lines +629 to +640
@pytest.mark.never(
[
"duckdb",
"polars",
"bigquery",
"clickhouse",
"datafusion",
"snowflake",
"pyspark",
],
reason="backend implements its own read_csv",
)
Member:
Suggested change (delete these lines):
@pytest.mark.never(
[
"duckdb",
"polars",
"bigquery",
"clickhouse",
"datafusion",
"snowflake",
"pyspark",
],
reason="backend implements its own read_csv",
)

You can remove this since you are skipping them inside the test body


pyarrow_table = pa.concat_tables(pyarrow_tables)
table_name = table_name or util.gen_name("read_csv")
self.create_table(table_name, pyarrow_table)
Member:
Hm, I think this should probably be a temp table or a memtable, because none of our other read_* functions create a persistent object

Member:
memtable is probably a good option
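
For illustration, a minimal sketch of the memtable route, reusing pyarrow_table and table_name from the snippet above (a suggestion, not the PR's final code):

import ibis

# Wrap the concatenated PyArrow table in an in-memory Ibis table instead
# of persisting it with create_table; ibis.memtable accepts a
# pyarrow.Table directly.
table = ibis.memtable(pyarrow_table, name=table_name)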

jitingxu1 (Contributor, Author):

Ah, saw your comments; please ignore the above review request.

Labels
tests (Issues or PRs related to tests)

Projects
None yet

3 participants