-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CSV parsing options #813
Conversation
Adding support for CSV files where values can span several lines, pyarrow parser already supports it
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
@skirdey thank you for the change! Could you please fix the issue with pre-commit (it's likely about a long line). |
thanks folks, will shorten the line and change text
…On Mon, Jan 13, 2025 at 6:27 PM Ivan Shcheklein ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In src/datachain/lib/dc.py
<#813 (comment)>:
> @@ -1955,6 +1956,7 @@ def from_csv(
column_types : Dictionary of column names and their corresponding types.
It is passed to CSV reader and for each column specified type auto
inference is disabled.
+ newlines_in_values: Tells pareser that values in csv can span multiple lines.
pareser -> parser
csv -> CSV (for consistency)
—
Reply to this email directly, view it on GitHub
<#813 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABHIJQFWGTUC7HZJ6XLBR232KRY2NAVCNFSM6AAAAABVCAFC3SVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDKNBYGUYTINZRGA>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @skirdey ! Please read a few comments with some fixes and suggestions on how to improve it a bit.
I've added a simple dict as parse options config. I've kept it simple. Let me know if you have preference to see ParseOptions as more defined dataclass. I've left out callable parse option as it seems to add complexity, maybe in the next PR. |
This comment was marked as outdated.
This comment was marked as outdated.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #813 +/- ##
==========================================
- Coverage 87.50% 87.42% -0.08%
==========================================
Files 128 128
Lines 11344 11349 +5
Branches 1538 1540 +2
==========================================
- Hits 9926 9922 -4
- Misses 1031 1039 +8
- Partials 387 388 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! Thank you for this improvement! 👍
* Update dc.py Adding support for CSV files where values can span several lines, pyarrow parser already supports it * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update dc.py * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * adding csv parse options config * naming of parse_options_config to parse_options * typo * fix tests, address PR review --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Ivan Shcheklein <[email protected]>
Adds support for CSV parsing options (via
pyarrow
parser). Among them: for CSV files where values can span several lines.