Empty/full whitespace value gets converted to null in CSV parser if it is in the first column #12832

Closed
2 tasks done
lizdeika opened this issue Dec 1, 2023 · 6 comments
Labels
A-io-csv Area: reading/writing CSV files bug Something isn't working P-low Priority: low python Related to Python Polars

Comments

lizdeika commented Dec 1, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from io import StringIO

import polars as pl

tsv_data = """
a	xxxxxx	12
b	vvvvvv	13
c	eee	14
 	 	0.0
d	ttt	66
e	ggg	444
f	 	44
 	 	0.1
"""

column_list = ["col1", "col2", "col3"]
schema = {"col1": pl.Utf8, "col2": pl.Utf8, "col3": pl.Utf8}
dtypes = {"col1": pl.Utf8, "col2": pl.Utf8, "col3": pl.Utf8}

df = pl.read_csv(
    StringIO(tsv_data),
    has_header=False,
    schema=schema,
    dtypes=dtypes,
    new_columns=column_list,
    separator="\t",
)

print(df)

Log output

shape: (8, 3)
┌──────┬────────┬──────┐
│ col1 ┆ col2   ┆ col3 │
│ ---  ┆ ---    ┆ ---  │
│ str  ┆ str    ┆ str  │
╞══════╪════════╪══════╡
│ a    ┆ xxxxxx ┆ 12   │
│ b    ┆ vvvvvv ┆ 13   │
│ c    ┆ eee    ┆ 14   │
│ null ┆        ┆ 0.0  │
│ d    ┆ ttt    ┆ 66   │
│ e    ┆ ggg    ┆ 444  │
│ f    ┆        ┆ 44   │
│ null ┆        ┆ 0.1  │
└──────┴────────┴──────┘

Issue description

A simple TSV file in which the 4th and last rows have a single SPACE character as their first-column value.
Those spaces get converted to nulls.
Columns other than the first are not affected.

Expected behavior

A space should be read as a space, not as null.
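
To make the expectation concrete, here is a minimal sketch that parses the same kind of data with Python's standard-library `csv` module. It is only a reference point for the expected values, not a statement about how Polars parses internally. The two-row sample and the helper name `first_column` are illustrative.

```python
import csv
from io import StringIO

# Two tab-separated rows; the second row's first field is a single space.
tsv_data = "a\txxxxxx\t12\n \t \t0.0\n"


def first_column(text: str, delimiter: str = "\t") -> list[str]:
    """Return the first field of every row, exactly as written in the input."""
    reader = csv.reader(StringIO(text), delimiter=delimiter)
    return [row[0] for row in reader]


print(first_column(tsv_data))  # ['a', ' ']
```

The space survives as a one-character string, which is the behavior requested for the first column here.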

Installed versions

--------Version info---------
Polars:               0.19.18
Index type:           UInt32
Platform:             macOS-14.0-arm64-arm-64bit
Python:               3.11.6 (main, Oct  2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2023.6.0
gevent:               <not installed>
matplotlib:           <not installed>
numpy:                1.25.1
openpyxl:             <not installed>
pandas:               2.0.3
pyarrow:              12.0.1
pydantic:             1.10.12
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.16
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@lizdeika lizdeika added bug Something isn't working python Related to Python Polars labels Dec 1, 2023
@lizdeika lizdeika changed the title from "read_csv TSV file first column value as SPACE get converted to null" to "read_csv TSV file first column value as SPACE gets converted to null" Dec 1, 2023
Collaborator

orlp commented Dec 1, 2023

This is not limited to tab-separated values; the same happens for CSV with commas.

@orlp orlp changed the title from "read_csv TSV file first column value as SPACE gets converted to null" to "Empty/full whitespace value gets converted to null in CSV parser if it is in the first column" Dec 1, 2023
@orlp orlp added the accepted Ready for implementation label Dec 1, 2023
Author

lizdeika commented Dec 1, 2023

Maybe this will help:
With missing_utf8_is_empty_string=True, a space in the first column gets converted to an empty string "" instead of null.
It looks like a space is not recognized as a UTF-8 character when it is the value of the first column.

Contributor

Wainberg commented Dec 1, 2023

Similar whitespace-related CSV bugs: #10587, #12763

Author

lizdeika commented Dec 5, 2023

It seems I should fall back to pandas.

Collaborator

orlp commented Dec 5, 2023

@lizdeika Pull requests are welcome!

@stinodego stinodego added P-low Priority: low and removed accepted Ready for implementation labels Jan 12, 2024
@alexander-beedie alexander-beedie added the A-io-csv Area: reading/writing CSV files label Jan 23, 2024
@taki-mekhalfa
Contributor

I can no longer reproduce this with Polars 0.20.6; it was fixed by #13934.

@orlp orlp closed this as completed Jan 30, 2024