Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch from extendr to savvy #1126

Closed
eitsupi opened this issue Jun 2, 2024 · 16 comments
Closed

Switch from extendr to savvy #1126

eitsupi opened this issue Jun 2, 2024 · 16 comments
Labels
enhancement New feature or request
Milestone

Comments

@eitsupi
Copy link
Collaborator

eitsupi commented Jun 2, 2024

I don't think this can be accomplished in a short term, but it would be far easier to maintain by eliminating the large amount of code written to return a Result type to R with extendr.
I also think we can simplify the source code in several other points.

@eitsupi eitsupi added the enhancement New feature or request label Jun 2, 2024
@etiennebacher
Copy link
Collaborator

Not familiar with savvy so I don't see how it would simplify the code while keeping the same functionalities, can you add some examples of simplification?

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jun 9, 2024

@etiennebacher It would be quicker to look at a real example.
The following two R packages are based on savvy and DataFusion1.
https://github.com/yutannihilation/datafusion-r
https://github.com/eitsupi/r-glaredb

For example, implementing the Arrow C stream interface is much easier than extendr.
The following few lines will pull nanoarrow_array_stream, etc. into the Rust side.

use arrow::ffi_stream::{ArrowArrayStreamReader, FFI_ArrowArrayStream};

let stream_reader = unsafe {
    let stream = savvy::ExternalPointerSexp::try_from(stream_ptr)?
        .cast_mut_unchecked::<FFI_ArrowArrayStream>();
    ArrowArrayStreamReader::from_raw(stream).map_err(|e| e.to_string())?
};

It would also allow for more general Rust rules for file structure, such as the ability to split impl blocks into multiple files. (extendr/extendr#739)

Footnotes

  1. https://datafusion.apache.org/user-guide/faq.html

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jun 11, 2024

Currently, the file structure on the Rust side is quite different from that of py-polars and uses a uniquely defined error type, so it is necessary to write functions according to completely different rules than for general packages using extendr.
This is fine for this package alone (?), but if we are creating something like pyo3-polars, I expect that similar code will be required for that package as well, making maintenance quite cumbersome.
So I think it might be worthwhile to completely rewrite r-polars in savvy to be a copy of py-polars.

Another thing I am thinking is that just as Python Polars have, for example, the DataFrame class (defined in Python) and the (internal) PyDataFrame class (defined in Rust), it would be better to have a dual structure for each class in R Polars.
Note: extendr and savvy have different ways of representing the associated functions of Struct on the R side (I feel that savvy is more appropriate).

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jun 15, 2024

@etiennebacher Although only basic type conversion from R to Polars has been implemented yet, I have created a polars binding using savvy.

Note that the directory structure matches that of py-polars.
This is not possible with extendr, which cannot have multiple impl blocks (extendr/extendr#538).
https://github.com/eitsupi/neo-r-polars/blob/f2b726f9493285b33fe75c42b8bc2026722b40eb/src/rust/src/series/construction.rs#L7-L8 (ref. https://github.com/pola-rs/polars/blob/9af88f1fdee6f3af5b4e5ac646ecd13fce80bca0/py-polars/src/series/construction.rs#L58-L59)
https://github.com/eitsupi/neo-r-polars/blob/f2b726f9493285b33fe75c42b8bc2026722b40eb/src/rust/src/series/mod.rs#L30-L31 (ref. https://github.com/pola-rs/polars/blob/9af88f1fdee6f3af5b4e5ac646ecd13fce80bca0/py-polars/src/series/mod.rs#L70-L71)

Matching py-polars will make it easier to follow upstream and contribute. The current implementation of r-polars is too complex.

@etiennebacher
Copy link
Collaborator

etiennebacher commented Jun 15, 2024

Thanks @eitsupi, that looks great and much easier to follow and understand than the current Rust implementation (so far), thanks! I won't have much time (or skill anyway) to explore this in the next few days but I have a few questions coming to my mind (all of this supposing we move to savvy):

  • does the structure of impl for Expr, DataFrame, etc. need to change? In other words can we just copy-paste the big impl blocks we already have? Edit: just saw in https://github.com/eitsupi/neo-r-polars/blob/f2b726f9493285b33fe75c42b8bc2026722b40eb/src/rust/src/series/mod.rs#L30-L31 that apparently we would just need to replace RResult<Self> by savvy::Result<Self>
  • should we expect anything problematic if someone wants to make an r-polars extension that relies on extendr while we rely on savvy? Are they interoperable?
  • would the transition be in several steps or one massive PR to change all the Rust part?
  • would the structure of errors change? (I mean by that whether the whole unwrap() thing in R would still be necessary)
  • is it possible to translate extendr_concurrent.rs to savvy? That's the biggest black box to me

The current implementation of r-polars is too complex.

100% agree

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jun 15, 2024

Thanks for checking that.

Basically, yes, but the type of input is different.
Currently r-polars takes input from R as Robj, but savvy requires the use of different arguments for each type.
https://yutannihilation.github.io/savvy/guide/key_ideas.html

For example, in the following, you will see that the integer and double vectors are converted to Series in separate functions (I tried to see if this could be handled in a single function, but it seemed impossible. But I think separating the functions makes the correspondence between one S3 method and one Rust function clearer and easier to understand).
https://github.com/eitsupi/neo-r-polars/blob/f2b726f9493285b33fe75c42b8bc2026722b40eb/src/rust/src/series/construction.rs#L14-L28

  • should we expect anything problematic if someone wants to make an r-polars extension that relies on extendr while we rely on savvy? Are they interoperable?

As long as the Arrow C interface is used, there should be no problem (currently the C interface is intended to be used via nanoarrow, see #1138)

  • would the transition be in several steps or one massive PR to change all the Rust part?

I don't think a phased transition is possible. Almost everything needs to be rewritten in a huge PR / switch the branch.
Or I was imagining something like having the two exist in parallel in completely separate branches (like v1 and v2 of Docker Compose).

  • would the structure of errors change? (I mean by that whether the whole unwrap() thing in R would still be necessary)

This is both an advantage and a disadvantage, but the error structure is completely different.
That is, savvy, unlike extendr, has the ability from the start to handle errors so we can use it to abandon our own Result type, but it does not provide anything like the current stack trace.
https://yutannihilation.github.io/savvy/guide/error.html
The current implementation should be better with respect to error messages.

  • is it possible to translate extendr_concurrent.rs to savvy? That's the biggest black box to me

This seems to be related to the Rust code to bypass R's single-threaded limitation to execute R code?
I think the root cause is that R is single-threaded but there is no such thing as a Python GIL, so perhaps something like this is needed everywhere (I don't understand it either)

I don't think there is anything extendr-specific that can be translated, but I don't know if this is the right implementation to begin with (like the segmentation fault we recently got on Linux).

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jun 15, 2024

``` r
geos::as_geos_geometry("LINESTRING (0 1, 3 9)")
#> <geos_geometry[1]>
#> [1] <LINESTRING (0 1, 3 9)>
pl$DataFrame(x = geos::as_geos_geometry("LINESTRING (0 1, 3 9)"))
#> Error: Execution halted with the following contexts
#> 0: In R: in $DataFrame():
#> 0: During function call [pl$DataFrame(x = geos::as_geos_geometry("LINESTRING (0 1, 3 9)"))]
#> 1: When constructing polars literal from Robj
#> 2: Encountered the following error in Rust-Polars:
#> expected Series
```

Compared to the above example, the error message would look like this:

> as_polars_series(geos::as_geos_geometry("LINESTRING (0 1, 3 9)"))
Error in as_polars_series.default(geos::as_geos_geometry("LINESTRING (0 1, 3 9)")) : 
  Unsupported class: geos_geometry

> rlang::global_entrace()

> as_polars_series(geos::as_geos_geometry("LINESTRING (0 1, 3 9)"))
Error in `as_polars_series.default()`:
! Unsupported class: geos_geometry
Run `rlang::last_trace()` to see where the error occurred.

> rlang::last_trace()
<error/rlang_error>
Error in `as_polars_series.default()`:
! Unsupported class: geos_geometry
---
Backtrace:1. ├─neopolars::as_polars_series(geos::as_geos_geometry("LINESTRING (0 1, 3 9)"))
 2. └─neopolars:::as_polars_series.default(geos::as_geos_geometry("LINESTRING (0 1, 3 9)")) at neo-r-polars/R/as_polars_series.R:3:3

> as_polars_df(list(x = geos::as_geos_geometry("LINESTRING (0 1, 3 9)")))
Error in `as_polars_series.default()`:
! Unsupported class: geos_geometry
Run `rlang::last_trace()` to see where the error occurred.

> rlang::last_trace()
<error/rlang_error>
Error in `as_polars_series.default()`:
! Unsupported class: geos_geometry
---
Backtrace:1. ├─neopolars::as_polars_df(list(x = geos::as_geos_geometry("LINESTRING (0 1, 3 9)")))
 2. └─neopolars:::as_polars_df.list(list(x = geos::as_geos_geometry("LINESTRING (0 1, 3 9)"))) at neo-r-polars/R/as_polars_df.R:3:3
 3.   ├─neopolars:::wrap(PlRDataFrame$init(lapply(x, function(column) as_polars_series(column)$`_s`))) at neo-r-polars/R/as_polars_df.R:13:3
 4.   ├─PlRDataFrame$init(lapply(x, function(column) as_polars_series(column)$`_s`)) at neo-r-polars/R/series-series.R:4:3
 5.   │ └─neopolars:::.savvy_wrap_PlRDataFrame(...) at neo-r-polars/R/000-wrappers.R:58:3
 6.   └─base::lapply(x, function(column) as_polars_series(column)$`_s`) at neo-r-polars/R/000-wrappers.R:43:3
 7.     └─neopolars (local) FUN(X[[i]], ...)
 8.       ├─neopolars::as_polars_series(column) at neo-r-polars/R/as_polars_df.R:13:13
 9.       └─neopolars:::as_polars_series.default(column) at neo-r-polars/R/as_polars_series.R:3:3

(This is not an appropriate example since it seems unlikely that we can create an error-prone situation within Rust at the moment, but anyway, the Result type on the R side is not necessary.)

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jun 15, 2024

I implemented <Series>$struct$unnest() to see how the error looks from the Rust side (eitsupi/neo-r-polars@43efe43).

> as_polars_series(mtcars)$struct$unnest()
shape: (32, 11)
┌──────┬─────┬───────┬───────┬───┬─────┬─────┬──────┬──────┐
│ mpgcyldisphp    ┆ … ┆ vsamgearcarb │
│ ------------   ┆   ┆ ------------  │
│ f64f64f64f64   ┆   ┆ f64f64f64f64  │
╞══════╪═════╪═══════╪═══════╪═══╪═════╪═════╪══════╪══════╡
│ 21.06.0160.0110.0 ┆ … ┆ 0.01.04.04.0  │
│ 21.06.0160.0110.0 ┆ … ┆ 0.01.04.04.0  │
│ 22.84.0108.093.0  ┆ … ┆ 1.01.04.01.0  │
│ 21.46.0258.0110.0 ┆ … ┆ 1.00.03.01.0  │
│ 18.78.0360.0175.0 ┆ … ┆ 0.00.03.02.0  │
│ …    ┆ …   ┆ …     ┆ …     ┆ … ┆ …   ┆ …   ┆ …    ┆ …    │
│ 30.44.095.1113.0 ┆ … ┆ 1.01.05.02.0  │
│ 15.88.0351.0264.0 ┆ … ┆ 0.01.05.04.0  │
│ 19.76.0145.0175.0 ┆ … ┆ 0.01.05.06.0  │
│ 15.08.0301.0335.0 ┆ … ┆ 0.01.05.08.0  │
│ 21.44.0121.0109.0 ┆ … ┆ 1.01.04.02.0  │
└──────┴─────┴───────┴───────┴───┴─────┴─────┴──────┴──────┘

> as_polars_series(1)$struct$unnest()
Error: SchemaMismatch(ErrString("invalid series dtype: expected `Struct`, got `f64`"))

> rlang::global_entrace()

> as_polars_series(1)$struct$unnest()
Error:
! SchemaMismatch(ErrString("invalid series dtype: expected `Struct`, got `f64`"))
Run `rlang::last_trace()` to see where the error occurred.

> rlang::last_trace()
<error/rlang_error>
Error:
! SchemaMismatch(ErrString("invalid series dtype: expected `Struct`, got `f64`"))
---
Backtrace:1. └─as_polars_series(1)$struct$unnest()
 2.   └─neopolars:::series_struct_unnest(.self) at neo-r-polars/R/series-struct.R:5:19
 3.     ├─neopolars:::wrap(self$`_s`$struct_unnest()) at neo-r-polars/R/series-struct.R:12:3
 4.     └─self$`_s`$struct_unnest() at neo-r-polars/R/series-series.R:4:3
 5.       └─neopolars:::.savvy_wrap_PlRDataFrame(...) at neo-r-polars/R/000-wrappers.R:72:5

@etiennebacher
Copy link
Collaborator

I took a look at the code in your neo-r-polars repo but didn't play a lot with it yet. I'm trying to see how this transition to savvy would work and how we should organize that.

Motivation

I think it'd be nice to have a doc that summarizes why we want to move to savvy (your comments above could be a good basis for that). I get there are some advantages to do so, e.g. better file organization, but it's not obvious why this can't be improved while keeping the extendr framework. You mentioned that

Note that the directory structure matches that of py-polars. This is not possible with extendr, which cannot have multiple impl blocks

but maybe this could be implemented in extendr upstream instead? (I see there's some discussion going on in extendr/extendr#538 so there might be some development there soon).

Also, I'm not confident I can adapt some more complex code (e.g. extendr_concurrent.rs) in savvy.

This is not to say that this isn't worth it, but it's quite a massive rewrite so we should be clear why we do it.

List of changes

We could list the type of changes that would be necessary. You already started making a proof-of-concept so you know more what is needed. Feel free to edit this list.

  1. Input checking

Currently r-polars takes input from R as Robj, but savvy requires the use of different arguments for each type.

I suppose this means we need input checking on the R side. It shouldn't be too hard to implement if we use rlang standalone functions to check types, but it will be annoying to do so given the size of the code base.

  1. All the extendr_concurrent.rs stuff

How we do the transition

Supposing we move to savvy, making PRs to a separate branch for savvy and then merging to main when the transition is complete makes the most sense to me. Still, this means that while we focus on savvy we don't add new features since we would have to implement them both in main and in savvy (not sure this is clear).

Summary

It would be nice if neo-r-polars could be a minimal PoC with 1-2 methods for Expr, DataFrame, etc. (maybe this is already what you had in mind). Basically, we want to be sure that:

  • functionality doesn't change
  • docs workflow doesn't need to be updated
  • no perf regression
  • all the concurrent stuff still works
  • errors are at least as good as they are now

Once all of this is guaranteed in this minimal PoC, then we can move on and add all the Expr/DataFrame/... stuff that requires a lot of tedious (but relatively easy) work. Until we have this, I'm not sure I'll spend much time on neo-r-polars given my limited time.


What do you think about all this?

@etiennebacher
Copy link
Collaborator

Sidenote about errors: would it make sense to return a string with a specific "error" attribute? This would allow more customization of the error message in R. For instance, if we use rlang then we can make the error message prettier and reformulate some stuff:

foo_unwrap <- function(x) {
  error <- attributes(x)$error_type
  if (is.null(error)) {
    return(x)
  }
  err_type <- switch(
    error,
    "ColumnNotFound" = "Column not found",
    "ComputeError" = "Operation failed"
  )
  err_type <- paste0(err_type, ":")
  
  rlang::abort(
    c(
      err_type,
      " " = x
    )
  )
}

# Function worked: no msg
foo_unwrap(1)
#> [1] 1
# Function failed
error_output <- c("a")
attr(error_output, "error_type") <- "ColumnNotFound"
foo_unwrap(error_output)
#> Error in `foo_unwrap()`:
#> ! Column not found:
#>   a
# Function failed
error_output <- c("arithmetic on string and numeric not allowed, try an explicit cast first")
attr(error_output, "error_type") <- "ComputeError"
foo_unwrap(error_output)
#> Error in `foo_unwrap()`:
#> ! Operation failed:
#>   arithmetic on string and numeric not allowed, try an explicit cast first

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jun 19, 2024

Thanks for taking a look at this. I will write about motivation in the README once the minimum functionality (e.g. the examples in the README of this repository) is implemented.

I suppose this means we need input checking on the R side.

Yes. So basically we need to implement the branch using the S3 method.
For example, the default method of as_polars_series() raises an error.
https://github.com/eitsupi/neo-r-polars/blob/437680d2573f09ee0b08e301937d1b480b85b54f/R/as_polars_series.R

I think it is clearer compared to the current implementation which has a huge match arm on the Rust side.

@eitsupi

This comment was marked as resolved.

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jun 22, 2024

Thanks to @etiennebacher, error handling based on rlang has been implemented, and I explained Motivation and current status in the README.
https://github.com/eitsupi/neo-r-polars

@sorhawell @vincentarelbundock @Sicheng-Pan @grantmcdermott Could you take a look at that?
If we switch to savvy, the most distressing thing is that we will lose the "collect in background" feature that @Sicheng-Pan implemented in GSoC 2023 in the short term.
(Now that polars can pass query plans to other computers without actively evaluating them, I suppose it is possible to implement this in a way that passes the query plan directly to other processes.)

@Sicheng-Pan
Copy link
Collaborator

@eitsupi Thanks for the prototype, and it looks very promising. Probably we can implement collect_in_background with the new polars feature.

Even if we stick to the current implementation, it seems possible to migrate without too much effort, if we disable the support for the usage of R functions in queries (just for this feature)

@grantmcdermott
Copy link
Collaborator

I haven’t been able to test it in-depth, but my high-level feedback is that this all sounds good to me. Ensuring a healthy dev experience is something that we should be prioritising first and foremost. If the switch to savvy and the addition of the rlang dependency make life easier for you @eitsupi (and you @etiennebacher), then that’s all to the good. It’s infinitely preferable to make some dependency changes than overwhelm the primary maintainer(s). I’ll try to test neo-polars on some of my existing codebases when I get a sec. But if the test suite is already passing then I say go for it :-)

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jun 27, 2024

Thanks all for your feedback.
That prototype seems to be working generally well and I hope to push it as a new branch in this repository within the next few days.

I have opened a new issue #1152 because this rewrite is not just a migration to savvy, but also complete change of structure on the R side.

@eitsupi eitsupi closed this as completed Jun 27, 2024
@eitsupi eitsupi added this to the Rewrite milestone Jun 27, 2024
@eitsupi eitsupi closed this as not planned Won't fix, can't repro, duplicate, stale Jun 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants