`DataFrame.top_k` not handling nulls correctly in version 1.0.0-rc.1 #17165

braaannigan · 2024-06-24T14:24:18Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

pl.DataFrame(
    {
        "a": ["a", "b", "c", "d"],
        "b": [None, 1, 1, 3],
    }
).top_k(k=2, by="b", reverse=True)

Log output

shape: (2, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ str ┆ i64  │
╞═════╪══════╡
│ a   ┆ null │
│ b   ┆ 1    │
└─────┴──────┘

Issue description

The output should exclude nulls, but the null row comes first. The same problem occurs with reverse=False:

pl.DataFrame(
    {
        "a": ["a", "b", "c", "d"],
        "b": [None, None, 1, 3],
    }
).top_k(k=3, by="b", reverse=False)

This also returns the nulls first.

Expected behavior

No nulls in these outputs

Installed versions

--------Version info---------
Polars:               1.0.0-rc.1
Index type:           UInt32
Platform:             Linux-5.10.104-linuxkit-x86_64-with-glibc2.28
Python:               3.10.1 (main, Dec 21 2021, 09:50:13) [GCC 8.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           0.3.3
deltalake:            0.18.1
fastexcel:            0.10.4
fsspec:               2024.6.0
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               0.10.0
matplotlib:           3.9.0
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             <not installed>
pyiceberg:            <not installed>
sqlalchemy:           2.0.31
torch:                <not installed>
xlsx2csv:             0.8.2
xlsxwriter:           3.2.0


stinodego · 2024-06-24T20:21:51Z

Thanks for the report.

The functionality was completely redone in #16804, including a new design for how nulls are handled. Because the intended functionality was not present in earlier versions, this does not classify as a regression.

@orlp FYI

ritchie46 · 2024-06-25T06:04:01Z

Note that this is DataFrame top_k, which I believe is similar to sort().head() and Nulls have an ordering. I believe this is as expected.

stinodego · 2024-06-25T08:37:50Z

> Note that this is DataFrame top_k, which I believe is similar to sort().head() and Nulls have an ordering. I believe this is as expected.

This really is a bug, at least according to the design Orson and I discussed. Null values come last, regardless of whether you're using top_k, or bottom_k, or top_k with reverse.

It's even easier to see the bug in the non-reverse case:

import polars as pl

res = pl.DataFrame(
    {
        "a": ["a", "b", "c", "d"],
        "b": [None, None, 1, 3],
    }
).top_k(k=2, by="b")
print(res)

shape: (2, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ str ┆ i64  │
╞═════╪══════╡
│ a   ┆ null │
│ b   ┆ null │
└─────┴──────┘

Clearly, values 1, 3 are the top values in b and those rows should be selected.
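The rule described in this comment (nulls come last whether the largest or the smallest values are requested) can be sketched in plain Python. This is an illustration of the stated design only, not the Polars implementation; `top_k_nulls_last` is a hypothetical helper, and the rows are modelled as plain dictionaries.

```python
from typing import Any, Dict, List


def top_k_nulls_last(
    rows: List[Dict[str, Any]], k: int, by: str, reverse: bool = False
) -> List[Dict[str, Any]]:
    """Hypothetical helper: return k rows ordered by `by`, nulls always last.

    reverse=False selects the largest values, reverse=True the smallest.
    """
    non_null = [r for r in rows if r[by] is not None]
    nulls = [r for r in rows if r[by] is None]
    # Order only the non-null rows; the direction depends on `reverse`.
    ordered = sorted(non_null, key=lambda r: r[by], reverse=not reverse)
    # Nulls are appended afterwards, so they are only picked when k
    # exceeds the number of non-null rows.
    return (ordered + nulls)[:k]


rows = [
    {"a": "a", "b": None},
    {"a": "b", "b": None},
    {"a": "c", "b": 1},
    {"a": "d", "b": 3},
]

top = top_k_nulls_last(rows, k=2, by="b")
bottom = top_k_nulls_last(rows, k=2, by="b", reverse=True)
top3 = top_k_nulls_last(rows, k=3, by="b")

print([r["b"] for r in top])
print([r["b"] for r in bottom])
print([r["b"] for r in top3])
```

Under this rule a null can only appear in the result when `k` is larger than the number of non-null rows, as in the `k=3` call.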

braaannigan added the bug (Something isn't working), python (Related to Python Polars), and needs triage (Awaiting prioritization by a maintainer) labels Jun 24, 2024

stinodego added the P-medium (Priority: medium) and A-ops-sort (Area: sorting operations) labels and removed the needs triage (Awaiting prioritization by a maintainer) label Jun 24, 2024

stinodego changed the title ~~top_k not omitted nulls in 1.0 rc1~~ DataFrame.top_k not handling nulls correctly in version 1.0.0-rc.1 Jun 24, 2024

orlp mentioned this issue Jun 27, 2024 in "fix: Fix DataFrame.top_k not handling nulls correctly" #17239 (Merged)

ritchie46 closed this as completed in #17239 Jun 27, 2024

c-peters added the accepted (Ready for implementation) label Jul 1, 2024

c-peters assigned orlp Jul 1, 2024
