
read_csv handles empty lines differently in 0.20.6, leading to rows of Null #14271

Closed · 2 tasks done
hagsted opened this issue Feb 5, 2024 · 2 comments
Labels: bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

hagsted commented Feb 5, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import io

import polars as pl

text = """
    A, B, C,

    1,1,1,

    2,2,2,

    3,3,3,

"""
df = pl.read_csv(io.StringIO(text))
print(df)

Log output

Running the example in 0.20.5 gives:

shape: (3, 4)
┌───────────┬─────┬─────┬──────┐
│         A ┆  B  ┆  C  ┆      │
│ ---       ┆ --- ┆ --- ┆ ---  │
│ i64       ┆ i64 ┆ i64 ┆ str  │
╞═══════════╪═════╪═════╪══════╡
│ 1         ┆ 1   ┆ 1   ┆ null │
│ 2         ┆ 2   ┆ 2   ┆ null │
│ 3         ┆ 3   ┆ 3   ┆ null │
└───────────┴─────┴─────┴──────┘
but in 0.20.6 it gives:

shape: (7, 4)
┌───────┬──────┬──────┬──────┐
│     A ┆  B   ┆  C   ┆      │
│ ---   ┆ ---  ┆ ---  ┆ ---  │
│ i64   ┆ i64  ┆ i64  ┆ str  │
╞═══════╪══════╪══════╪══════╡
│ null  ┆ null ┆ null ┆ null │
│ 1     ┆ 1    ┆ 1    ┆ null │
│ null  ┆ null ┆ null ┆ null │
│ 2     ┆ 2    ┆ 2    ┆ null │
│ null  ┆ null ┆ null ┆ null │
│ 3     ┆ 3    ┆ 3    ┆ null │
│ null  ┆ null ┆ null ┆ null │
└───────┴──────┴──────┴──────┘

Issue description

I have some malformed CSV files that contain an empty line between every line of data. Up until Polars 0.20.5 I could just read such a file and all the empty lines would be discarded. After upgrading to 0.20.6, the empty lines are included as rows of all nulls.

Expected behavior

I would think that a line without any separator should be discarded, or that there should be a way to handle this in the read_csv function, such as a "skip empty lines" parameter. If a line contains the separators but is otherwise empty, I would expect it to end up as a null row.

I could of course do something like:

df.filter(~pl.all_horizontal(pl.all().is_null()))

But that just feels like a workaround.

The all-null column in the tables shown is expected: each row ends with an extra separator, which indicates an extra (empty) value. That column can be skipped by setting "columns" in the read_csv function.
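
For reference, a minimal sketch combining the two workarounds, assuming that read_csv's "columns" parameter accepts column indices and that dropping all-null rows is acceptable for this data:

import io

import polars as pl

# The CSV string from the reproducible example above
text = "\n    A, B, C,\n\n    1,1,1,\n\n    2,2,2,\n\n    3,3,3,\n\n"

df = (
    pl.read_csv(io.StringIO(text), columns=[0, 1, 2])  # keep only the three data columns
    .filter(~pl.all_horizontal(pl.all().is_null()))    # drop rows that are entirely null
)
print(df)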

Installed versions

--------Version info---------
Polars:               0.20.6
Index type:           UInt32
Platform:             Windows-11-10.0.22631-SP0
Python:               3.12.0 (tags/v3.12.0:0fb18b0, Oct  2 2023, 13:03:39) [MSC v.1935 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
numpy:                1.26.3
openpyxl:             <not installed>
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
Julian-J-S (Contributor) commented Feb 5, 2024

hi @hagsted
yes, this is the intended behaviour (see #13934).
All whitespace (spaces, tabs, newlines, ...) in CSV files belongs to the data and should be preserved (by default).
This was already the case for single-column data and was made consistent with multi-column CSVs.
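
If you want to check the single-column behaviour yourself, here is a minimal example (the column name and values are made up for illustration):

import io

import polars as pl

single = "x\n\n1\n\n2\n"  # one column with blank lines between the values
print(pl.read_csv(io.StringIO(single)))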

Some people use empty lines to group data in their CSV files.
If you need the empty-line information it is available; if it were removed on read, there would be no way to get it back (you would lose information).
As you showed, there are options to remove all empty lines after reading the data.

However, having an option to optionally skip empty lines (like pandas) might also be useful. (I am just a little worried about making read_csv bigger and bigger and bigger 🤣)
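
In the meantime, here is a sketch of the other direction: stripping the blank lines before parsing rather than filtering afterwards (this assumes the data fits comfortably in memory as a single string):

import io

import polars as pl

# The CSV string from the reproducible example above
raw = "\n    A, B, C,\n\n    1,1,1,\n\n    2,2,2,\n\n    3,3,3,\n\n"

# Drop blank lines before parsing so no all-null rows are created in the first place.
# The trailing-separator column is unaffected and can still be excluded via "columns".
cleaned = "\n".join(line for line in raw.splitlines() if line.strip())
df = pl.read_csv(io.StringIO(cleaned))
print(df)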


Also, in case you are wondering: trailing newlines are explicitly not allowed according to the unofficial CSV specs. This is why you get an extra column with null values 😉

hagsted (Author) commented Feb 5, 2024

Okay, thanks for the clarification. I vote for a skip_empty_lines parameter 😉

hagsted closed this as completed on Feb 5, 2024