fix(python,rust): read_csv preserve whitespace and newlines #13934

@Julian-J-S Julian-J-S commented Jan 23, 2024

Fixes #13933 and partially fixes #12763.

Addresses a few issues with read_csv (issue links to be added later).

Fix 1: preserve whitespace at start of line (currently trimmed)

Currently read_csv trims whitespace at the start of each line, which affects only the first value in the row. This is a bug: leading whitespace belongs to the value and should be preserved as is.

from io import StringIO

import polars as pl

DATA = """col
a
    b
        c
d"""
pl.read_csv(source=StringIO(DATA))

# Current result:
shape: (4, 1)
┌─────┐
│ col │
│ --- │
│ str │
╞═════╡
│ a   │
│ b   │
│ c   │
│ d   │
└─────┘

# NEW result (preserve whitespace, it belongs to the value!)
shape: (4, 1)
┌───────────┐
│ col       │
│ ---       │
│ str       │
╞═══════════╡
│ a         │
│     b     │
│         c │
│ d         │
└───────────┘
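
For comparison, here is a minimal sketch of the same input parsed with Python's standard-library csv module, which keeps leading whitespace by default (its `skipinitialspace` option defaults to `False`). This is only an illustration of the "whitespace belongs to the value" behavior, not the Polars implementation.

```python
import csv
from io import StringIO

DATA = "col\na\n    b\n        c\nd"

# csv.reader does not strip leading spaces unless skipinitialspace=True is passed,
# so the indentation stays part of each value.
reader = csv.reader(StringIO(DATA))
header = next(reader)
values = [row[0] for row in reader]

print(header)
print(values)
```

The indented values come back with their leading spaces intact, which matches the NEW result shown above.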

Fix 2: preserve newlines (see #13933)

Currently read_csv drops empty lines when the CSV has multiple columns but keeps them as null when the CSV has a single column. This is inconsistent, and since all whitespace in a CSV file belongs to the data, empty lines should be preserved in both cases.

If/when this is merged, I can also add a skip_empty_lines parameter to optionally skip empty lines (pandas offers this as skip_blank_lines). The default, however, should be to preserve any form of whitespace, as it belongs to the format.

# Single column (stays the same)
DATA = """col
a

c"""
pl.read_csv(source=StringIO(DATA))
shape: (3, 1)
┌──────┐
│ col  │
│ ---  │
│ str  │
╞══════╡
│ a    │
│ null │  <<< already correct for single-column CSV
│ c    │
└──────┘

# Multiple columns
DATA = """col1,col2
a,1

c,3"""
pl.read_csv(source=StringIO(DATA))

# Current (newline skipped)
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ str  ┆ i64  │
╞══════╪══════╡
│ a    ┆ 1    │
│ c    ┆ 3    │
└──────┴──────┘

# New result (newline preserved)
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ str  ┆ i64  │
╞══════╪══════╡
│ a    ┆ 1    │
│ null ┆ null │  <<< this is new! consistent with single-column CSV
│ c    ┆ 3    │
└──────┴──────┘
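
The intended semantics can be sketched with a toy parser in plain Python. The function `parse_rows` and its `skip_empty_lines` flag are hypothetical names used only for illustration; this is not the Polars implementation, and it ignores quoting and type inference. It shows an empty line becoming a row of nulls regardless of the column count, with an opt-in flag to drop such lines.

```python
from __future__ import annotations


def parse_rows(
    text: str, skip_empty_lines: bool = False
) -> tuple[list[str], list[list[str | None]]]:
    """Toy CSV parser (hypothetical): no quoting, every value kept as a string."""
    # A single trailing newline terminates the last row; it is not an extra empty line.
    if text.endswith("\n"):
        text = text[:-1]
    lines = text.split("\n")
    header = lines[0].split(",")
    rows: list[list[str | None]] = []
    for line in lines[1:]:
        if line == "":
            if skip_empty_lines:
                continue  # opt-in: drop the empty line entirely
            # default: keep the empty line as a row of nulls, one per column
            rows.append([None] * len(header))
        else:
            rows.append(line.split(","))
    return header, rows


single = parse_rows("col\na\n\nc")
multi = parse_rows("col1,col2\na,1\n\nc,3")
multi_skipped = parse_rows("col1,col2\na,1\n\nc,3", skip_empty_lines=True)

print(single)
print(multi)
print(multi_skipped)
```

With the default, both the single-column and the multi-column input yield three data rows with a null row in the middle; with the flag set, the multi-column input yields two rows, matching the "Current" output above.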

@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Jan 23, 2024
@alexander-beedie alexander-beedie added the A-io-csv Area: reading/writing CSV files label Jan 23, 2024
@ritchie46
Member

Yes, I agree. Thank you for the fix. Can you add those examples above as tests? We don't want to regress in the future.

@Julian-J-S
Contributor Author

> Yes, I agree. Thank you for the fix. Can you add those examples above as tests? We don't want to regress in the future.

Sure, done!

Added a test for preserving whitespace at the start of a line (Fix 1).

There were already tests with empty lines. I had already adjusted their expected results to be consistent and treat empty lines as null.

Successfully merging this pull request may close these issues.

read_csv inconsistent treatment of empty lines (single col -> null; multi col -> "skip")
3 participants