Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Proposal] Complete rewrite based on savvy and rlang #1152

Open
eitsupi opened this issue Jun 27, 2024 · 28 comments
Open

[Proposal] Complete rewrite based on savvy and rlang #1152

eitsupi opened this issue Jun 27, 2024 · 28 comments
Labels
enhancement New feature or request
Milestone

Comments

@eitsupi
Copy link
Collaborator

eitsupi commented Jun 27, 2024

Continued from #942 and #1126

(The following text is copied from https://github.com/eitsupi/neo-r-polars's README)


Motivation

I have been developing r-polars for over a year, and I felt that a significant rewrite was necessary.
r-polars is a clone of py-polars,
but the package structure is currently quite different.
Therefore, it was difficult to keep up with frequent updates.

I thought that now, around the release of Python Polars 1.0.0, is a good time for a complete rewrite, so I decided to try it.

There are several reasons to rewrite r-polars on both the Rust and R sides.

Rust side

  1. Appropriate file division. Due to the limitations of extendr,
    it is not possible to place multiple impl blocks.
    (extendr/extendr#538)
  2. Error handling.
    There is a lot of custom code to use the Result type with extendr, which is quite different from other packages based on extendr.
    (extendr/extendr#650)
  3. Simplify type conversion.
    The code is difficult to follow because it uses a macro called robj_to for type conversion (at least in rust-analyzer).

About 1 and 2, I expect that switching from extendr to savvy
will improve the situation.

For 3, in py-polars and nodejs-polars, a thin Wrap struct wraps other types and processes them with standard From traits etc.,
which I think makes the code cleaner.

R side

  1. The structure of classes.
    In py-polars, the strategy is that classes defined on the Rust side (e.g., PyDataFrame) are wrapped by classes defined on the Python side (e.g., DataFrame).
    In r-polars, a complex strategy is adopted to update classes created by Rust side/extendr (e.g., RPolarsDataFrame) with a lot of custom code.
    (This is also related to the fact that extendr makes associated functions of Rust structs members of R classes. savvy does not mix associated functions and methods.)
  2. S3 methods first.
    This is also related to the Rust side, in the current r-polars, generic functions like as_polars_series were added later,
    so there are several places where type conversion from R to Polars is done on the Rust side, making it difficult to understand where the type conversion is done.
    If type conversion from R to Polars is done with two generic functions, as_polars_series and as_polars_expr, the code will be much simpler and customization from the R side will be possible.
  3. Error handling.
    Currently, r-polars has its own Result type on the R side, and error handling is done through it.
    The backtrace generated that is quite easy to understand, but it is not necessarily easy to use when using polars internally in other packages, such as testthat::expect_error().
  4. Based on rlang.
    Currently, r-polars has no R package dependencies. This is great,
    but that includes a degraded copy of list2()
    instead of the convenient functions in the rlang package.
    rlang is a lightweight R package, and I feel that it is more beneficial to depend on the convenient functions of rlang than to stick to no dependencies.

1 and 3 are also related to the fact that it is built with extendr, and it seems that switching to savvy is appropriate here as well.
If we abandon the current Result type on the R side, it is natural to use rlang for error handling, so from that perspective, it is reasonable to depend on rlang in 4.

@eitsupi eitsupi added the enhancement New feature or request label Jun 27, 2024
@eitsupi eitsupi pinned this issue Jun 27, 2024
@eitsupi eitsupi added this to the Rewrite milestone Jun 27, 2024
@etiennebacher
Copy link
Collaborator

Thanks for all the work, I'll take a look this weekend after #1147

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jun 28, 2024

Significant changes that may affect users are noted in the NEWS file.
https://github.com/eitsupi/neo-r-polars/blob/60502b13e9c21ef38c733abc24477671e0e13a2a/NEWS.md

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jun 30, 2024

I pushed the neo-r-polars' main branch to this repository so that we can start the migration work.
f7316d3

For now, the new next branch lacks developer tools and other settings, and switching to this branch in the local repository of r-polars should detect a large number of diffs due to differences in .gitignore, so I recommend that do not switch to the branch.

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jul 4, 2024

Documentation for translation from Python Polars has been pushed.
I hope it comes across as fairly simple.
https://github.com/pola-rs/r-polars/blob/63790cd069cd736e650a5aabf7900f197490ece9/DEVELOPMENT.md

@etiennebacher
Copy link
Collaborator

Thanks, that is very helpful. I noticed that the class names have changed, e.g from RPolarsExpr to polars_expr. Is there any reason for this? I think the current class names are already clear and unique enough to avoid collisions with classes from other packages.

Similarly, the current Struct is called RPolarsExpr but in the rewrite it is PlRExpr. I like the fact that those names are identical on the R and Rust sides, is it possible to keep them like this?

@etiennebacher
Copy link
Collaborator

Also, related to the post on Mastodon asking for help, I think it would be helpful to have a checklist (either at the top of this issue or in a new one) with more details on the progress of the rewrite. It doesn't have to be super detailed but simply have an idea of how far we are, where should a new contributor focus, etc. For instance something like this:

  • DataFrame
    • translation
    • tests
  • LazyFrame
    • translation
    • tests
  • Expr
    • str
      • translation
      • tests
    • arr
    • ...

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jul 5, 2024

Thanks for taking a look at this. I will create the task list at another epic issue soon as I can, as it is definitely needed.

I noticed that the class names have changed, e.g from RPolarsExpr to polars_expr. Is there any reason for this? I think the current class names are already clear and unique enough to avoid collisions with classes from other packages.

There are two reasons.

  1. To avoid conflicts between the current polars package and the neopolars package. This is also why the names of the structs on the Rust side is something like PlRExpr. This justification disappears when there is no longer a need for coexistence (i.e., when the neopolars package is renamed to polars).
  2. Because the class name of CamelCase is not considered to be very common in R.
    In other words, the current class names such as RPolarsExpr are a compromise due to the restrictions of the extendr that Rust's struct and R's classes cannot be named separately, and the need to avoid class name conflicts with other packages. But if class names can be given only on the R side I feel that the snake_case is more appropriate than the CamelCase.
    Of course, this is debatable.

Similarly, the current Struct is called RPolarsExpr but in the rewrite it is PlRExpr. I like the fact that those names are identical on the R and Rust sides, is it possible to keep them like this?

No, these are actually different things. That is, polars_expr and PlRExpr already exist on the R side, and PlRExpr is an internal class that is not normally touched by the user.
This is the same as Python Polars having two classes: the Expr class and the PyExpr class (from the Rust struct).

In Python:

>>> import polars as pl
>>> pl.lit(1).__class__.__name__
'Expr'
>>> pl.lit(1)._pyexpr.__class__.__name__
'PyExpr'

In R:

> pl$lit(1) |> class()
[1] "polars_expr"   "polars_object"

> pl$lit(1)$`_rexpr` |> class()
[1] "PlRExpr"

Unlike the current RPolarsExpr, PlRExpr is an internal class, so I have tentatively named it PlRExpr because we think it is safe even if it does not contain the word "polars" (although it is an internal class, it is not without risk of collision, so I have not made it RExpr).

@etiennebacher
Copy link
Collaborator

I see, thanks for the explanations.

There are two reasons.

  1. To avoid conflicts between the current polars package and the neopolars package. This is also why the names of the structs on the Rust side is something like PlRExpr. This justification disappears when there is no longer a need for coexistence (i.e., when the neopolars package is renamed to polars).
  2. Because the class name of CamelCase is not considered to be very common in R.
    In other words, the current class names such as RPolarsExpr are a compromise due to the restrictions of the extendr that Rust's struct and R's classes cannot be named separately, and the need to avoid class name conflicts with other packages. But if class names can be given only on the R side I feel that the snake_case is more appropriate than the CamelCase.
    Of course, this is debatable.

I don't think the style of class names is important. To me, the two main requirements is that they must be clear and unique, and I think RPolarsExpr and variants match those requirements. We already break a lot of code at each release, I'd rather not break the class names in addition to that.

Bottom line, I'd rather keep the current RPolars* names. Open to discuss too ofc.


Sidenote: I didn't find conventions for class names (in general, I feel R code lacks conventions anyway). I also didn't find packages with CamelCase class names (can't say I searched a lot though) but some prominent packages do not use snake_case:

class(Matrix::Matrix(0, 3, 2))
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"

Packages using S4 (like those on BioConductor) also usually have camel case.

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jul 7, 2024

I didn't find conventions for class names (in general, I feel R code lacks conventions anyway).

I think the most commonly used style guides are the Tidyverse style guide and Google's Style guide (of course Google publishes Style guides for many languages and I believe they are very popular).

https://style.tidyverse.org/syntax.html#object-names
https://google.github.io/styleguide/Rguide.html

Both of these encourage the use of snake_case for class names.

@etiennebacher etiennebacher changed the title [Proposal] Completely rewritten based on savvy and rlang [Proposal] Complete rewrite based on savvy and rlang Aug 8, 2024
@ju6ge

This comment was marked as off-topic.

@eitsupi

This comment was marked as off-topic.

@CGMossa

This comment was marked as off-topic.

@ju6ge

This comment was marked as off-topic.

@CGMossa

This comment was marked as off-topic.

@eitsupi

This comment was marked as off-topic.

@ju6ge

This comment was marked as off-topic.

@eitsupi

This comment was marked as off-topic.

@ju6ge

This comment was marked as off-topic.

@eitsupi

This comment was marked as off-topic.

@eitsupi

This comment was marked as off-topic.

@ju6ge

This comment was marked as off-topic.

@eitsupi

This comment was marked as off-topic.

@ju6ge

This comment was marked as off-topic.

@eitsupi

This comment was marked as off-topic.

@GuillaumePressiat
Copy link

Hi,

I regularly follow this r-polars repo because of my works with both R and Python. Coming from R I have found py-polars to be exactly what I was waiting for since years in Python (dataframe library with a dplyr philosophy). I've also began to code in Rust thanks to pola.rs.

From an external point of view I've found r-polars not so easy, mainly due to the pl$... syntax.
It's an OOP question, S3 vs S4 vs R6...
Speaking about complete rewrite, have you think about a way to bypass the $ in your proposal? I've seen tidypolars and I'm not talking about that. Complete rewrite with a dplyr/pola.rs coding experience?
I mean have you considered another constructor than $? @ will almost do the same albeit with a better readibility? I've think about it and it's not an easy question.
Maybe this reflexion is somewhere here in your repo...

and of course thanks for your work!

@etiennebacher
Copy link
Collaborator

etiennebacher commented Aug 31, 2024

Hi there!

From an external point of view I've found r-polars not so easy, mainly due to the pl$... syntax.

I agree that's probably deterring some people from using the "raw" polars syntax directly in R, but I think it's also useful to be able to quickly compare Python and R code. Basically, except for some differences between languages (e.g. using c() instead of []), being able to only replace $ by . (or vice-versa) is very convenient.

@ will almost do the same albeit with a better readibility?

Not convinced about that. @ is used mostly by S4-based packages. It seems most of them are in Bioconductor and plenty of people (including myself) never deal with that. $ is used in all packages built on R6, including very popular ones like shiny (I think).

Using S3 instead of $ or @ to chain operations would create massive namespace conflicts with the base library and other packages since polars basically reimplements a bunch of basic functions, and not all of those functions are S3 generics.

I've seen tidypolars and I'm not talking about that

Why not? The point is precisely to use dplyr and other tidyverse packages syntax but with polars under the hood. Isn't this what you're looking for?

I think the current "polars ecosystem" in R makes some sense: polars for an almost identical experience as in python, tidypolars to use the tidyverse syntax, and polarssql for SQL syntax. That said, let us know if something is unclear to you and curious to see your opinion on that :)

@GuillaumePressiat
Copy link

GuillaumePressiat commented Aug 31, 2024

Many thanks for you answer.

There are different point of views :)

My question was not really about technical aspects but more about user experience.
Whereas with pandas Python code was not neat, with pola.rs Python code is now easy to read.

For me polars's expressivness leads to focus on it (all verbs are in it, filter, group_by, with_columns, etc.). I agree that it's good to keep its syntax.
My opinion is that for readability purpose in R, I don't like too much $ in R code. This symbol is vertical and breaks lines visually. So my question with @ which is more like a ligation between terms.
In dplyr there are not $ and it was sort of a revolution in R. A good one with neat code. For code writers and also code readers (first time we see it in doc we don't write it).
In summary, I am wondering if r-polars can by itself bring another revolution to R but without $ $. We see it and we say: it's different and I like it.

(
  as_polars_df(iris)
      $with_columns((pl
                $col('Sepal.Length') * 2)
                $alias('Sepal.Length.Bigger'))
)

A dumb proposal may be with @:

(
  as_polars_df(iris)
      @with_columns(
                (pl@col('Sepal.Length') * 2)
                 @alias('Sepal.Length.Bigger'))
)

It also could be nice not to have to write pl$col and just col inside with_columns by sort of inheritance? Find a way to skip col R function in this context.

About tidypolars my thinking is: it seems really a big work to keep tidypolars wired with r-polars evolutions forwarded by pola.rs ones... I mean evolutions are almost unpredictable in such a vast ecosystem. Maybe I'm not too fan of under the hood :)
But Rust pola.rs's seems more and more stable now, as data.table / dtplyr it's surely a good lead.

Again there are different point of views!
In summary, by no mean R api could be the same as Python api.

Thanks for reading.

@etiennebacher
Copy link
Collaborator

etiennebacher commented Aug 31, 2024

My opinion is that for readability purpose in R, I don't like too much $ in R code. This symbol is vertical and breaks lines visually

I get your point, but we're not going to break all the polars code out there for what is in the end a matter of personal preferences. Some people prefer chaining functions with $ as in R6-based packages.

It also could be nice not to have to write pl$col and just col inside with_columns by sort of inheritance? Find a way to skip col R function in this context.

It would be nice, but that introduces complexity since we'd need to understand when we are in a Polars context and when we're not. This might be trivial when we can detect that we're inside a with_columns() call, but in some cases we create expressions outside a call and then execute them inside so preventing conflicts with the base function col() is not as easy as it seems. Why not in the future, but for now there are many other priorities in r-polars.

About tidypolars my thinking is: it seems really a big work to keep tidypolars wired with r-polars evolutions forwarded by pola.rs ones...

Implementing it was some work, yes, but now most dplyr and tidyr functions are translated and I don't think they'll add many more in the future. Of course, there are still tons of functions from other packages that could be translated, but I'm not afraid of breaking changes in pola.rs affecting tidypolars too much. Basically every tidypolars release contains a few tweaks to accommodate pola.rs changes but nothing major (and I keep the polars and tidypolars releases quite synchronized).

Anyway, thanks for the feedback and don't hesitate if you have other questions or bug reports

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants