There are multiple functions that do the same thing (for arrow) #732
Comments
Some context is found in #326 |
If it doesn't have any obvious perf benefits it can be removed |
Seems to me also however |
I would keep both calls arrow_stream_to_series_internal |
Thanks, but I suspect that using a set of functions from the |
Oh, sorry, I didn't realize that nanoarrow supports passing pointers as str. |
@sorhawell This causes a segmentation fault:
stream_out <- nanoarrow::as_nanoarrow_array_stream(mtcars)
polars:::arrow_stream_to_series(nanoarrow::nanoarrow_pointer_addr_chr(stream_out))
In general, I vote to remove this unless you write tests for it, as it is not possible to maintain a function that has not been tested. |
Thinking about it again, I opened the issue pola-rs/polars#14208 because there is no need to implement it here if there is the Arrow C interface in the DataFrame of Rust Polars itself. |
Yes, you can very well trigger a UB crash such as a segfault. This is because:
The Arrow C stream and C data interfaces are very unsafe. There are exact rules to follow, or bad things happen.
The protocol is an orchestrated ping-pong between "producer" and "consumer", where each step on its own is unsafe.
arrow_stream_to_series is a private function that is one step in the exchange and is by itself not safe.
I think it is possible to segfault with private functions from nanoarrow as well.
Only the full implementation of all steps can be safe.
The simple integration test is placed in extendr-polars, which is annoying if it is not run in CI. Maybe a unit test could be written where r-polars is both the producer and the consumer.
Btw, I know it is very desirable to move as much as possible to rust-polars, but as far as I know it is tricky to achieve. nanoarrow is very convenient and moves a burden away from r-polars maintainers. So hurray for nanoarrow. There are some Arrow types that rust-polars implements differently. I would test whether the various types are supported.
Since I'm not active currently, I fully understand ditching any code that will not be further developed in the near future and that is not tested in CI. |
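To make the "each step is unsafe, only the full exchange is safe" point concrete, here is a minimal toy model in plain Python. It is not the real Arrow C stream ABI and uses no Arrow library: ToyStream, get_next, release and consume are hypothetical names invented for this sketch. The toy raises an error where the real C interface would have undefined behaviour.

```python
class ToyStream:
    """Toy stand-in for an Arrow C stream (hypothetical, not the real ABI)."""

    def __init__(self, batches):
        self._batches = list(batches)
        self._released = False

    def get_next(self):
        # In the real C interface, using a released stream is undefined
        # behaviour (e.g. a segfault); this toy raises instead.
        if self._released:
            raise RuntimeError("stream already released")
        return self._batches.pop(0) if self._batches else None

    def release(self):
        # The consumer must release exactly once.
        if self._released:
            raise RuntimeError("double release")
        self._released = True
        self._batches.clear()


def consume(stream):
    """Full consumer side: drain every batch, then release exactly once."""
    out = []
    try:
        while (batch := stream.get_next()) is not None:
            out.extend(batch)
    finally:
        stream.release()
    return out


stream = ToyStream([[1, 2], [3]])
values = consume(stream)
```

Calling consume runs the whole exchange and is safe in this model. Calling stream.get_next() again afterwards is a single isolated step on an already released stream, which is the kind of misuse a private one-step function cannot guard against.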
Some context can be found in #64
While looking into the Arrow implementation, I noticed that this package has several functions related to the Arrow C interface.
The current publicly available function for creating DataFrames from arrow Tables is polars:::arrow_to_rdf (ported from Python Polars) written in R inside polars::as_polars_df (formerly polars::pl$from_arrow), but here is a function written in Rust with almost the same functionality.
r-polars/R/extendr-wrappers.R
Line 41 in 1a6521c
This function does not appear to be used anywhere.
Presumably this is a function created for experimental purposes and used in the benchmarks stored below.
https://github.com/pola-rs/r-polars/blob/1a6521c5a9622d432b8676bf45782515b2059863/inst/misc/examples/from_arrow.R
I ran this script with the current development version (on #730) and found that pl$from_arrow and rb_list_to_df are about the same speed, so perhaps rb_list_to_df can be removed.
It makes sense that arrow_to_rdf, being copied from Python Polars, which is actively improving its speed, is as fast as or faster than rb_list_to_df.
There are also
r-polars/R/extendr-wrappers.R
Lines 47 to 51 in 1a6521c