Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bump rust-polars to 0.42.0 #1183

Merged
merged 41 commits into from
Aug 19, 2024
Merged

Bump rust-polars to 0.42.0 #1183

merged 41 commits into from
Aug 19, 2024

Conversation

etiennebacher
Copy link
Collaborator

@etiennebacher etiennebacher commented Aug 14, 2024

Close #1185

https://github.com/pola-rs/polars/releases/tag/rs-0.42.0

@eitsupi I think the most annoying change is the move from pl_flavor to compatlevel made here: pola-rs/polars#17421

@etiennebacher
Copy link
Collaborator Author

etiennebacher commented Aug 15, 2024

Only 2 tests failing.

df = pl$DataFrame(mtcars)
df$sort(c("cyl", "mpg"), descending = c(TRUE, FALSE), maintain_order = TRUE)

now returns both columns sorted in decreasing order and I can't figure out why (and I can't reproduce on Python)

@etiennebacher
Copy link
Collaborator Author

etiennebacher commented Aug 18, 2024

@eitsupi really stuck on this failure in $sort(), it would be great if you could take a look. Small summary of what's wrong:

Python 1.5.0:

import polars as pl
df = pl.DataFrame({"x": [1, 2, 1, 2], "y": [1, 2, 3, 4]})
df.sort(["x", "y"], descending=[True, False])
shape: (4, 2)
┌─────┬─────┐
│ x   ┆ y   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 2   │
│ 2   ┆ 4   │
│ 1   ┆ 1   │
│ 1   ┆ 3   │
└─────┴─────┘

This PR:

dat = pl$DataFrame(x = c(1, 2, 1, 2), y = 1:4)
dat$sort(c("x", "y"), descending = c(TRUE, FALSE))
shape: (4, 2)
┌─────┬─────┐
│ x   ┆ y   │
│ --- ┆ --- │
│ f64 ┆ i32 │ # <---------- col "y" should be in ascending order, not descending
╞═════╪═════╡
│ 2.0 ┆ 4   │
│ 2.0 ┆ 2   │
│ 1.0 ┆ 3   │
│ 1.0 ┆ 1   │
└─────┴─────┘

The same code on main works correctly, and dat$sort("x", "y", descending = c(TRUE, FALSE)) works too:

shape: (4, 2)
┌─────┬─────┐
│ x   ┆ y   │
│ --- ┆ --- │
│ f64 ┆ i32 │
╞═════╪═════╡
│ 2.0 ┆ 2   │
│ 2.0 ┆ 4   │
│ 1.0 ┆ 1   │
│ 1.0 ┆ 3   │
└─────┴─────┘

@eitsupi
Copy link
Collaborator

eitsupi commented Aug 18, 2024

In a8c20da, it works correctly.

> dat$sort(c("x", "y"), descending = c(TRUE, FALSE))
Error:
! Evaluation failed in `$sort()`.
Caused by error in `extend_bool()` at neo-r-polars/R/lazyframe-frame.R:58:5:
! the length of `descending` (2) does not match the length of `...` (1)
Run `rlang::last_trace()` to see where the error occurred.

> dat$sort("x", "y", descending = c(TRUE, FALSE))
shape: (4, 2)
┌─────┬─────┐
│ xy   │
│ ------ │
│ f64i32 │
╞═════╪═════╡
│ 2.02   │
│ 2.04   │
│ 1.01   │
│ 1.03   │
└─────┴─────┘

I do not know where the problem lies in the current implementation.

Is it possible that there is a problem where the by and ... are concatenated?
I think the by argument should be eliminated, as well as the next branch.

lazyframe__sort <- function(
...,
descending = FALSE,
nulls_last = FALSE,
multithreaded = TRUE,
maintain_order = FALSE) {
wrap({
check_dots_unnamed()
by <- parse_into_list_of_expressions(...)
descending <- extend_bool(descending, length(by), "descending", "...")
nulls_last <- extend_bool(nulls_last, length(by), "nulls_last", "...")
self$`_ldf`$sort_by_exprs(
by,
descending = descending,
nulls_last = nulls_last,
multithreaded = multithreaded,
maintain_order = multithreaded
)
})
}

@etiennebacher
Copy link
Collaborator Author

etiennebacher commented Aug 18, 2024

dat$sort(c("x", "y"), descending = c(TRUE, FALSE))
Error:
! Evaluation failed in $sort().
Caused by error in extend_bool() at neo-r-polars/R/lazyframe-frame.R:58:5:
! the length of descending (2) does not match the length of ... (1)
Run rlang::last_trace() to see where the error occurred.

This doesn't match the behaviour of py-polars, that accepts input as ["x", "y"] which I think is the equivalent of c("x", "y").

The issue is probably that these lines:

let mut exprs = robj_to!(VecPLExprCol, by)?;
let mut ddd = robj_to!(VecPLExprCol, dotdotdot)?;
exprs.append(&mut ddd);

transform c("x", "y") to [cols(["x", "y"])], while they transform "x", "y" to [col("x"), col("y")]. Therefore, eliminating by doesn't change anything. I think on the R side, unpack_list() should be modified but I fear it's gonna break several things.

@eitsupi
Copy link
Collaborator

eitsupi commented Aug 18, 2024

This doesn't match the behaviour of py-polars, that accepts input as ["x", "y"] which I think is the equivalent of c("x", "y").

Good point.
I was mistaken, Python's parse_into_list_of_expressions(["x", "y"]) seems to return [col("x"), col("y")] instead of [cols(["x", "y"])].

@eitsupi
Copy link
Collaborator

eitsupi commented Aug 18, 2024

This doesn't match the behaviour of py-polars, that accepts input as ["x", "y"] which I think is the equivalent of c("x", "y").

Good point. I was mistaken, Python's parse_into_list_of_expressions(["x", "y"]) seems to return [col("x"), col("y")] instead of [cols(["x", "y"])].

I've done a little digging into this, and rather than try to do a complete reproduction, I think it would be best to follow the general dynamic dots rule (use !!!) rather than a full reproduction.

> dat$sort(!!!c("x", "y"), descending = c(TRUE, FALSE))
shape: (4, 2)
┌─────┬─────┐
│ xy   │
│ ------ │
│ f64i32 │
╞═════╪═════╡
│ 2.02   │
│ 2.04   │
│ 1.01   │
│ 1.03   │
└─────┴─────┘

@etiennebacher
Copy link
Collaborator Author

I got it, we had already the code for <expr>$sort_by(), the trick was to use wrap_elist_result() instead of unpack_list().

I hope the rewrite is going well, I'm really confused with all those internal functions 😄

I think it would be best to follow the general dynamic dots rule (use !!!) rather than a full reproduction.

I think that's something to think about for the rewrite, but for now I'd rather keep the current behavior. I don't work a lot with Python, so overall I don't really understand the point of handling by and ... separately in py-polars. Do you know if (and why) this is better than just having ...?

@etiennebacher
Copy link
Collaborator Author

etiennebacher commented Aug 18, 2024

@eitsupi I have changed rust-version to 1.81.0 in Cargo.toml but there's still this error in CI:

error: rustc 1.80.1 is not supported by the following packages:
[email protected] requires rustc 1.81.0

Do you know how to fix it? I don't see 1.80.1 anywhere in the package so I don't know why it uses that

@eitsupi
Copy link
Collaborator

eitsupi commented Aug 18, 2024

Rust 1.81 is not released yet.

@etiennebacher etiennebacher marked this pull request as ready for review August 18, 2024 16:21
Copy link
Collaborator

@eitsupi eitsupi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@eitsupi eitsupi merged commit 086d6df into main Aug 19, 2024
37 of 39 checks passed
@eitsupi eitsupi deleted the rust-polars-0.42.0 branch August 19, 2024 12:11
@eitsupi
Copy link
Collaborator

eitsupi commented Aug 19, 2024

I don't work a lot with Python, so overall I don't really understand the point of handling by and ... separately in py-polars. Do you know if (and why) this is better than just having ...?

My understanding is that this is simply due to copying from Python Polars.
As you may remember, pl$col() used to have a similar structure, which could cause problems. I think only ... is needed.

@etiennebacher
Copy link
Collaborator Author

My understanding is that this is simply due to copying from Python Polars.

Actually I was asking for the python implementation. Maybe it's common in python, but I don't understand the point of having by and *more_by. To me, it makes more sense to only have *more_by. The only reason I can come up with is that it's a standard approach in Python, so I was asking whether you know this.

I think it's a weird pattern in R and it makes it harder for us to maintain because we have to ensure that by and ... are correctly handled instead of just .... At the same time, I'm a bit hesitant on diverting from the python API...

@eitsupi
Copy link
Collaborator

eitsupi commented Aug 19, 2024

I think this is because the things that can be specified are different.
In other words, here we can specify Iterable[IntoExpr] for by, but not for *more_by.

Python Polars' API is derived from pyspark, and we can find the same pattern in pyspark as well.
https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrameWriter.bucketBy.html#pyspark.sql.DataFrameWriter.bucketBy

colstr, list or tuple
a name of a column, or a list of names.
colsstr
additional names (optional). If col is a list it should be empty.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor optimization toggles
2 participants