Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

There are multiple functions that do the same thing (for arrow) #732

Closed
eitsupi opened this issue Jan 24, 2024 · 9 comments · Fixed by #1078
Closed

There are multiple functions that do the same thing (for arrow) #732

eitsupi opened this issue Jan 24, 2024 · 9 comments · Fixed by #1078
Assignees
Labels
enhancement New feature or request

Comments

@eitsupi
Copy link
Collaborator

eitsupi commented Jan 24, 2024

Some context can be found in #64

While looking into the Arrow implementation, I noticed that this package has several functions related to the Arrow C interface.

The current publicly available function for creating DataFrames from arrow Tables is polars:::arrow_to_rdf (ported from Python Polars) written in R inside polars::as_polars_df (formerly polars::pl$from_arrow), but here is a function written in Rust with almost the same functionality.

rb_list_to_df <- function(r_batches, names) .Call(wrap__rb_list_to_df, r_batches, names)

This function does not appear to be used anywhere.

Presumably this is a function created for experimental purposes and used in the benchmarks stored below.

https://github.com/pola-rs/r-polars/blob/1a6521c5a9622d432b8676bf45782515b2059863/inst/misc/examples/from_arrow.R

I ran this script with the current development version (on #730) and found that pl$from_arrow and rb_list_to_df are about the same speed, so perhaps rb_list_to_df can be removed.

> source("/workspaces/r-polars/inst/misc/examples/from_arrow.R", encoding = "UTF-8")

Attaching package:arrowThe following object is masked frompackage:utils:

    timestamp

total n chunks: 475# A tibble: 4 × 13
  expression              min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
  <bch:expr>         <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
1 to_arrow           655.57ms    1.44s     0.518  915.77KB     1.45    10    28     19.31s <NULL>
2 from_arrow            827ms    1.74s     0.547    3.15MB     1.75    10    32     18.27s <NULL>
3 from_arrow_no_rec654.13ms 746.89ms     1.31   959.62KB     4.33    10    33      7.62s <NULL>
4 from_arrow_all_se1.08s    1.13s     0.827 1011.25KB     3.97    10    48     12.09s <NULL>
# ℹ 3 more variables: memory <list>, time <list>, gc <list>
# A tibble: 4 × 13
  expression   min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory    
  <bch:expr> <bch> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>    
1 to_arrow   140ms 282.19ms     3.64       39KB    0.364    10     1      2.75s <NULL> <Rprofmem>
2 from_arrow 130ms 165.84ms     6.03     86.8KB    3.02     10     5      1.66s <NULL> <Rprofmem>
3 from_arro915ms    1.13s     0.869   875.5KB    4.17     10    48      11.5s <NULL> <Rprofmem>
4 to_arrow2  130ms  176.8ms     5.43     38.8KB    0.543    10     1      1.84s <NULL> <Rprofmem>
# ℹ 2 more variables: time <list>, gc <list>
There were 50 or more warnings (use warnings() to see the first 50)

It makes sense that arrow_to_rdf is copied from Python Polars, which is actively improving its speed, and is as fast or faster than rb_list_to_df.

There are also

arrow_stream_to_df <- function(robj_str) .Call(wrap__arrow_stream_to_df, robj_str)
arrow_stream_to_series <- function(robj_str) .Call(wrap__arrow_stream_to_series, robj_str)
export_df_to_arrow_stream <- function(robj_df, robj_str) .Call(wrap__export_df_to_arrow_stream, robj_df, robj_str)

@eitsupi eitsupi added the enhancement New feature or request label Jan 24, 2024
@eitsupi
Copy link
Collaborator Author

eitsupi commented Jan 26, 2024

Some context is found in #326

@etiennebacher
Copy link
Collaborator

perhaps rb_list_to_df can be removed.

If it doesn't have any obvious perf benefits it can be removed

@sorhawell
Copy link
Collaborator

sorhawell commented Jan 29, 2024

Seems to me also polars:::rb_list_to_df is old and can be removed together with any subsequent dead code, as warned by the rust compiler. Superseded by arrow_to_rdf()

however
arrow_stream_to_df, export_df_to_arrow_stream and new_arrow_stream are called from extendr-polars repository they are used to zero copy share polars DataFrames between to different polars compilation units (R/rust crates with polars).

@sorhawell
Copy link
Collaborator

I would keep arrow_stream_to_series even though not being used on R side, as it is actually the general case of arrow_stream_to_df

both calls arrow_stream_to_series_internal

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jan 30, 2024

Thanks, but I suspect that using a set of functions from the nanoarrow package would be better than using my own solution of passing the pointer as str.

See also https://github.com/JosiahParry/arrow-extendr

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jan 30, 2024

Oh, sorry, I didn't realize that nanoarrow supports passing pointers as str.
I'd like to do some refactoring after #730 is merged.

@eitsupi
Copy link
Collaborator Author

eitsupi commented Jan 30, 2024

@sorhawell This causes segmentation fault.

stream_out <- nanoarrow::as_nanoarrow_array_stream(mtcars)
polars:::arrow_stream_to_series(nanoarrow::nanoarrow_pointer_addr_chr(stream_out))

In general, I vote to remove this unless you write tests for this, as it is not possible to maintain a function that has not been tested.
Also, as far as I can see, the only advantage of writing this on the Rust side is that nanoarrow is not needed, and I feel that using nanoarrow is advantageous because it is easier to maintain.

@eitsupi
Copy link
Collaborator Author

eitsupi commented Feb 2, 2024

Thinking about it again, I opened the issue pola-rs/polars#14208 because there is no need to implement it here if there is the Arrow C interface in the DataFrame of Rust Polars itself.

@sorhawell
Copy link
Collaborator

@sorhawell This causes segmentation fault.

stream_out <- nanoarrow::as_nanoarrow_array_stream(mtcars)
polars:::arrow_stream_to_series(nanoarrow::nanoarrow_pointer_addr_chr(stream_out))

In general, I vote to remove this unless you write tests for this, as it is not possible to maintain a function that has not been tested. Also, as far as I can see, the only advantage of writing this on the Rust side is that nanoarrow is not needed, and I feel that using nanoarrow is advantageous because it is easier to maintain.

Yes you can very well achieve an UB crash such as segfault.

This is because:

The arrow C stream and C data interface are very unsafe. There are exact rules to follow or bad things happen.

The protocols is an orchestrated ping-pong between "producer" and "consumer" where each step on its own is unsafe.

arrow_stream_to_series is a private function which is one step in the exchange and is by it self not safe. I think it is possible to segfault with private functions from nanoarrow also.

Only the full implementation of all steps can be safe. The simple integration test is placed in extendr-polars which is annoying if not run in CI. Maybe there could be written some unit test where r-polars is both the producer and consumer.

Btw I know it is very desirable to move as much as possible to rust-polars but it is tricky to achieve as far as I know.

nanoarrow is very convenient and move a burden away from r-polars maintainers. So hurray for nanoarrow. There are some arrow types that rust-polars implement differently. I would test for if various types are supported.

Since I'm not active currently, I fully understand ditching any code that will not be further developed in a near future and which is not tested in the CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants