Add CSV parsing options #813

Merged
merged 11 commits into from
Jan 21, 2025

@skirdey (Contributor) commented Jan 13, 2025

Adds support for CSV parsing options, passed through to the pyarrow parser. One use case is CSV files in which a single value spans several lines; the pyarrow parser already supports this.
@dmpetrov (Member) commented

@skirdey thank you for the change!

Could you please fix the pre-commit issue? It's likely about a long line.
You can just run pre-commit and it'll fix it automatically.

@skirdey (Contributor, Author) commented Jan 14, 2025 via email

@shcheklein (Member) left a comment

Thanks a lot, @skirdey! Please read through a few comments with some fixes and suggestions on how to improve it a bit.

@shcheklein shcheklein added the enhancement New feature or request label Jan 14, 2025
@skirdey (Contributor, Author) commented Jan 14, 2025

I've added a simple dict as the parse options config.
Unit tests give the same results as on the main branch, although quite a few fail on both when run locally.

I've kept it simple. Let me know if you would prefer to see ParseOptions as a more defined dataclass.

I've left out the callable parse option, as it seems to add complexity; maybe in the next PR.
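
For comparison, a rough sketch of what the "more defined dataclass" alternative could look like. This is not code from the PR: `CsvParseOptions`, `to_kwargs`, and `from_plain_dict` are hypothetical names, and the fields and defaults are assumed to mirror the non-callable options of `pyarrow.csv.ParseOptions`.

```python
from dataclasses import asdict, dataclass
from typing import Union


@dataclass(frozen=True)
class CsvParseOptions:  # hypothetical name, not part of this PR
    # Fields assumed to mirror pyarrow.csv.ParseOptions (callables left out).
    delimiter: str = ","
    quote_char: Union[str, bool] = '"'
    double_quote: bool = True
    escape_char: Union[str, bool] = False
    newlines_in_values: bool = False
    ignore_empty_lines: bool = True

    def to_kwargs(self) -> dict:
        # Plain dict that could be forwarded as keyword arguments.
        return asdict(self)


def from_plain_dict(options: dict) -> CsvParseOptions:
    # Unknown keys raise TypeError, which catches typos early;
    # a plain dict would pass them along until the parser rejects them.
    return CsvParseOptions(**options)


opts = from_plain_dict({"newlines_in_values": True})
```

The trade-off is roughly this: the plain dict keeps the API surface small, while a dataclass validates option names up front and documents the defaults. Under the assumptions above, `opts.to_kwargs()` could be unpacked into `pyarrow.csv.ParseOptions(**opts.to_kwargs())`.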

@shcheklein changed the title from "Update dc.py" to "Add CSV parse config options" Jan 14, 2025
@shcheklein changed the title from "Add CSV parse config options" to "Add CSV parsing options" Jan 14, 2025
@shcheklein

This comment was marked as outdated.

codecov bot commented Jan 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.42%. Comparing base (aad99e2) to head (b582491).
Report is 16 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #813      +/-   ##
==========================================
- Coverage   87.50%   87.42%   -0.08%     
==========================================
  Files         128      128              
  Lines       11344    11349       +5     
  Branches     1538     1540       +2     
==========================================
- Hits         9926     9922       -4     
- Misses       1031     1039       +8     
- Partials      387      388       +1     
Flag Coverage Δ
datachain 87.37% <100.00%> (-0.07%) ⬇️


@shcheklein shcheklein self-assigned this Jan 21, 2025
@dreadatour (Contributor) left a comment

Looks good to me! Thank you for this improvement! 👍

@shcheklein shcheklein merged commit 1b5a585 into iterative:main Jan 21, 2025
33 checks passed
dreadatour pushed a commit that referenced this pull request Jan 27, 2025
* Update dc.py

Adding support for CSV files where values can span several lines, pyarrow parser already supports it

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update dc.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* adding csv parse options config

* naming of parse_options_config to parse_options

* typo

* fix tests, address PR review

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Ivan Shcheklein <[email protected]>
4 participants