
feat: support read_csv for backends with no native support #9908

Open · wants to merge 19 commits into main
Conversation

jitingxu1 (Contributor)

Description of changes

For backends that lack native `read_csv` support, `pyarrow.csv.read_csv()` will be used as a fallback; a sketch of the approach follows the list below. The fallback can:

  • Read a single file
  • Read all files in a directory, e.g. ./directory/*
  • Read files matching a glob pattern
  • Read from cloud storage systems such as S3 and GCS
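
A minimal sketch of the fallback path for the local-filesystem cases; the helper name read_csv_fallback is illustrative, not the PR's actual code, and the cloud storage cases additionally need a remote filesystem handle:

from glob import glob

import pyarrow as pa
import pyarrow.csv as pcsv

def read_csv_fallback(path: str) -> pa.Table:
    # Expand "./directory/*" or any glob pattern into concrete paths;
    # a plain file path simply matches itself.
    files = sorted(glob(path)) or [path]
    # Read each CSV with pyarrow and concatenate into a single table.
    return pa.concat_tables([pcsv.read_csv(f) for f in files])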

Issues closed

Partially solves #9448

Contributor

ACTION NEEDED

Ibis follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message.

Please update your PR title and description to match the specification.

@jitingxu1 jitingxu1 changed the title feat: Support read_csv for backends with no native support feat: support read_csv for backends with no native support Aug 23, 2024
ibis/backends/tests/test_register.py: review thread resolved (outdated)
ibis/backends/tests/test_register.py: review thread resolved
@jitingxu1 jitingxu1 requested a review from cpcloud August 28, 2024 21:53
@github-actions github-actions bot added the tests Issues or PRs related to tests label Sep 20, 2024
jitingxu1 (Contributor, Author)

Skipped Trino and Impala in the tests.

@cpcloud, requesting a review.

ibis/backends/__init__.py: review thread resolved (outdated)
ibis/backends/__init__.py: review thread resolved
"snowflake",
"pyspark",
):
pytest.skip(f"{con.name} implements its own `read_parquet`")
jitingxu1 (Contributor, Author):
These backends have their own implementations, and some of these option combinations could still pass this test, so I skip these backends.

Comment on lines +1277 to +1295
"""Register a CSV file as a table in the current backend.

This function reads a CSV file and registers it as a table in the current
backend. Note that for Impala and Trino backends, the performance may be suboptimal.

Parameters
----------
path
The data source. A string or Path to the CSV file.
table_name
An optional name to use for the created table. This defaults to
a sequentially generated name.
**kwargs
Additional keyword arguments passed to the backend loading function.
Common options are skip_rows, column_names, delimiter, and include_columns.
More details could be found:
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html
Member:
Suggested change:
"""Register a CSV file as a table in the current backend.
This function reads a CSV file and registers it as a table in the
current backend. Note that for the Impala and Trino backends, the
performance may be suboptimal.
Parameters
----------
path
The data source. A string or Path to the CSV file.
table_name
An optional name to use for the created table. This defaults to a
sequentially generated name.
**kwargs
Additional keyword arguments passed to the PyArrow loading function.
Common options include:
- skip_rows
- column_names
- delimiter
- include_columns
A full list of options can be found on the following pages:
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html
https://arrow.apache.org/docs/python/generated/pyarrow.csv.ConvertOptions.html
::: {.callout-note}
Options from each of the above reference pages can be passed directly to
this function, Ibis will handle sorting them into the appropriate
options buckets.
:::
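
To illustrate the callout above, here is a hypothetical sketch of how a flat **kwargs dict could be sorted into pyarrow's three CSV option buckets; the helper name split_csv_kwargs and its (partial) field sets are assumptions, not the PR's actual code:

import pyarrow.csv as pcsv

# Partial field sets for each option class; see the pyarrow reference
# pages linked above for the full lists.
_READ = {"skip_rows", "column_names", "block_size", "encoding"}
_PARSE = {"delimiter", "quote_char", "escape_char", "newlines_in_values"}
_CONVERT = {"include_columns", "column_types", "null_values", "strings_can_be_null"}

def split_csv_kwargs(**kwargs):
    # Route each keyword argument into the option class that accepts it.
    read = pcsv.ReadOptions(**{k: v for k, v in kwargs.items() if k in _READ})
    parse = pcsv.ParseOptions(**{k: v for k, v in kwargs.items() if k in _PARSE})
    convert = pcsv.ConvertOptions(**{k: v for k, v in kwargs.items() if k in _CONVERT})
    return read, parse, convert

The three resulting objects can then be passed to pyarrow.csv.read_csv(path, read_options=..., parse_options=..., convert_options=...).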

Comment on lines +1308 to +1311
Read a single csv file:

>>> table = con.read_csv("path/to/file.csv")

Member:
Suggested change:
Read a single csv file:

>>> table = con.read_csv("path/to/file.csv")

Read a single csv file, skipping the first row, with a custom delimiter:

>>> table = con.read_csv("path/to/file.csv", skip_rows=1, delimiter=";")

Read a single csv file, but only load the specified columns:

>>> table = con.read_csv("path/to/file.csv", include_columns=["species", "island"])

Comment on lines +29 to +31
["pyspark"],
condition=IS_SPARK_REMOTE,
raises=PySparkAnalysisException,
Member:
This seems like an unrelated formatting change

"snowflake",
"pyspark",
):
pytest.skip(f"{con.name} implements its own `read_parquet`")
Member:
Suggested change:
- pytest.skip(f"{con.name} implements its own `read_parquet`")
+ pytest.skip(f"{con.name} implements its own `read_csv`")

Comment on lines +629 to +640
@pytest.mark.never(
[
"duckdb",
"polars",
"bigquery",
"clickhouse",
"datafusion",
"snowflake",
"pyspark",
],
reason="backend implements its own read_csv",
)
Member:
Suggested change (delete these lines):
@pytest.mark.never(
[
"duckdb",
"polars",
"bigquery",
"clickhouse",
"datafusion",
"snowflake",
"pyspark",
],
reason="backend implements its own read_csv",
)

You can remove this since you are skipping them inside the test body


pyarrow_table = pa.concat_tables(pyarrow_tables)
table_name = table_name or util.gen_name("read_csv")
self.create_table(table_name, pyarrow_table)
Member:
Hm, I think this should probably be a temp table or a memtable, because none of our other read_* functions create a persistent object

Member:
memtable is probably a good option
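
For illustration, a minimal sketch of the memtable route, reusing pyarrow_table and table_name from the snippet above (a suggestion, not the PR's final code):

import ibis

# Wrap the concatenated PyArrow table in an in-memory Ibis table instead
# of persisting it with create_table; ibis.memtable accepts a
# pyarrow.Table directly.
table = ibis.memtable(pyarrow_table, name=table_name)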

jitingxu1 (Contributor, Author):

Ah, saw your comments; please ignore the above review request.

Labels
tests (Issues or PRs related to tests)

Projects
None yet

3 participants