
read_csv handles empty lines differently in 0.20.6, leading to rows of Null #14271

Closed · 2 tasks done
hagsted opened this issue Feb 5, 2024 · 2 comments
Labels: bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

hagsted commented Feb 5, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import io

import polars as pl

text = """
    A, B, C,

    1,1,1,

    2,2,2,

    3,3,3,

"""
df = pl.read_csv(io.StringIO(text))
print(df)

Log output

Running the example in 0.20.5 gives:

shape: (3, 4)
┌───────────┬─────┬─────┬──────┐
│         A ┆  B  ┆  C  ┆      │
│ ---       ┆ --- ┆ --- ┆ ---  │
│ i64       ┆ i64 ┆ i64 ┆ str  │
╞═══════════╪═════╪═════╪══════╡
│ 1         ┆ 1   ┆ 1   ┆ null │
│ 2         ┆ 2   ┆ 2   ┆ null │
│ 3         ┆ 3   ┆ 3   ┆ null │
└───────────┴─────┴─────┴──────┘
but in 0.20.6 it gives:

shape: (7, 4)
┌───────┬──────┬──────┬──────┐
│     A ┆  B   ┆  C   ┆      │
│ ---   ┆ ---  ┆ ---  ┆ ---  │
│ i64   ┆ i64  ┆ i64  ┆ str  │
╞═══════╪══════╪══════╪══════╡
│ null  ┆ null ┆ null ┆ null │
│ 1     ┆ 1    ┆ 1    ┆ null │
│ null  ┆ null ┆ null ┆ null │
│ 2     ┆ 2    ┆ 2    ┆ null │
│ null  ┆ null ┆ null ┆ null │
│ 3     ┆ 3    ┆ 3    ┆ null │
│ null  ┆ null ┆ null ┆ null │
└───────┴──────┴──────┴──────┘

Issue description

I have some malformed CSV files that contain an empty line between every line of data. Up until Polars 0.20.5 I could just read such a file and all the empty lines would be discarded. After upgrading to 0.20.6, the empty lines are included as rows of all nulls.

Expected behavior

I would think that a line without any separator should be discarded, or that there should be a way to handle this in the read_csv function, such as a "skip empty lines" parameter. If a line contains the separators but is otherwise empty, I would expect it to end up as a null row.

I could of course do something like:

df.filter(~pl.all_horizontal(pl.all().is_null()))

But that just feels like a workaround.

The all-null column in the tables shown is expected: each row ends with an extra separator, which indicates an extra (empty) value. That column can be skipped by setting "columns" in the read_csv function.
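
For reference, a minimal sketch combining the two workarounds, assuming that read_csv's "columns" parameter accepts column indices and that dropping all-null rows is acceptable for this data:

import io

import polars as pl

# The CSV string from the reproducible example above
text = "\n    A, B, C,\n\n    1,1,1,\n\n    2,2,2,\n\n    3,3,3,\n\n"

df = (
    pl.read_csv(io.StringIO(text), columns=[0, 1, 2])  # keep only the three data columns
    .filter(~pl.all_horizontal(pl.all().is_null()))    # drop rows that are entirely null
)
print(df)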

Installed versions

--------Version info---------
Polars:               0.20.6
Index type:           UInt32
Platform:             Windows-11-10.0.22631-SP0
Python:               3.12.0 (tags/v3.12.0:0fb18b0, Oct  2 2023, 13:03:39) [MSC v.1935 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
numpy:                1.26.3
openpyxl:             <not installed>
pandas:               2.2.0
pyarrow:              15.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
Julian-J-S (Contributor) commented Feb 5, 2024

hi @hagsted
yes, this is the intended behaviour (see #13934).
All whitespace (spaces, tabs, newlines, ...) in CSV files belongs to the data and should be preserved (by default).
This was already the case for single-column data and was made consistent with multi-column CSVs.
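
If you want to check the single-column behaviour yourself, here is a minimal example (the column name and values are made up for illustration):

import io

import polars as pl

single = "x\n\n1\n\n2\n"  # one column with blank lines between the values
print(pl.read_csv(io.StringIO(single)))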

Some people use empty lines to group data in their CSV files.
If you need the empty-line information it is available; if it were removed on read, there would be no way to get it back (you would lose information).
As you showed, there are options to remove all empty lines after reading the data.

However, having an option to optionally skip empty lines (like pandas) might also be useful. (I am just a little worried about making read_csv bigger and bigger and bigger 🤣)
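
In the meantime, here is a sketch of the other direction: stripping the blank lines before parsing rather than filtering afterwards (this assumes the data fits comfortably in memory as a single string):

import io

import polars as pl

# The CSV string from the reproducible example above
raw = "\n    A, B, C,\n\n    1,1,1,\n\n    2,2,2,\n\n    3,3,3,\n\n"

# Drop blank lines before parsing so no all-null rows are created in the first place.
# The trailing-separator column is unaffected and can still be excluded via "columns".
cleaned = "\n".join(line for line in raw.splitlines() if line.strip())
df = pl.read_csv(io.StringIO(cleaned))
print(df)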


Also, in case you are wondering: trailing newlines are explicitly not allowed according to the unofficial CSV specs. This is why you get an extra column with null values 😉

hagsted (Author) commented Feb 5, 2024

Okay, thanks for the clarification. I vote for a skip_empty_lines parameter 😉

hagsted closed this as completed on Feb 5, 2024