as.data.frame(<RPolarsDataFrame>) seems slow #1079

Open
eitsupi opened this issue May 6, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

eitsupi commented May 6, 2024

Converting an RPolarsDataFrame to a data.frame seems to take about 100 times longer than the equivalent conversion from an arrow::Table.

Could arrow be using ALTREP to defer materialization until it is actually needed?

Details
library(polars)
library(arrow, warn.conflicts = FALSE)

polars_info()
#> Polars R package version : 0.16.3
#> Rust Polars crate version: 0.39.2
#>
#> Thread pool size: 16
#>
#> Features:
#> default                    TRUE
#> full_features              TRUE
#> disable_limit_max_threads  TRUE
#> nightly                    TRUE
#> sql                        TRUE
#> rpolars_debug_print       FALSE
#>
#> Code completion: deactivated
arrow_info()
#> Arrow package version: 15.0.1
#>
#> Capabilities:
#>
#> acero      TRUE
#> dataset    TRUE
#> substrait FALSE
#> parquet    TRUE
#> json       TRUE
#> s3         TRUE
#> gcs        TRUE
#> utf8proc   TRUE
#> re2        TRUE
#> snappy     TRUE
#> gzip       TRUE
#> brotli     TRUE
#> zstd       TRUE
#> lz4        TRUE
#> lz4_frame  TRUE
#> lzo       FALSE
#> bz2        TRUE
#> jemalloc   TRUE
#> mimalloc   TRUE
#>
#> Memory:
#>
#> Allocator jemalloc
#> Current    0 bytes
#> Max        0 bytes
#>
#> Runtime:
#>
#> SIMD Level          avx2
#> Detected SIMD Level avx2
#>
#> Build:
#>
#> C++ Library Version  15.0.1
#> C++ Compiler            GNU
#> C++ Compiler Version 11.4.0

big_df <- do.call(rbind, lapply(1:5, \(x) nycflights13::flights))

from_r <- bench::mark(
  as_polars_df = as_polars_df(big_df),
  as_arrow_table = as_arrow_table(big_df),
  check = FALSE,
  min_iterations = 5
)

big_pldf <- as_polars_df(big_df)
big_at <- as_arrow_table(big_df)

to_r <- bench::mark(
  pldf = as.data.frame(big_pldf),
  at = as.data.frame(big_at),
  check = FALSE,
  min_iterations = 5
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.

from_r
#> # A tibble: 2 × 6
#>   expression          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 as_polars_df      347ms    400ms      2.56    1.07MB    0.639
#> 2 as_arrow_table    357ms    370ms      2.72    1.21MB    0
to_r
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 pldf       820.76ms  875.2ms      1.15   192.8MB     1.38
#> 2 at           6.17ms   11.9ms     52.8     13.5MB     5.86

Created on 2024-05-06 with reprex v2.1.0
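
One way to probe the ALTREP hypothesis is to inspect the internal representation of the columns each conversion returns. This is only a sketch using base R's .Internal(inspect()); the expectations in the comments are assumptions, not results measured here.

# Sketch (not run): are the materialized columns ALTREP-backed?
library(polars)
library(arrow, warn.conflicts = FALSE)

df <- nycflights13::flights

col_from_arrow  <- as.data.frame(as_arrow_table(df))$year
col_from_polars <- as.data.frame(as_polars_df(df))$year

# .Internal(inspect()) prints the object's internal layout; ALTREP vectors are
# tagged as such in the output.
.Internal(inspect(col_from_arrow))   # assumption: ALTREP wrapper, materialization deferred
.Internal(inspect(col_from_polars))  # assumption: plain, fully allocated INTSXP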

eitsupi added the enhancement (New feature or request) label on May 6, 2024
eitsupi commented Sep 5, 2024

The current implementation on the next branch is much slower.

# Construct an Arrow array from an R vector
long_vec_1 <- 1:10^6

bench::mark(
  arrow = {
    arrow::as_arrow_array(long_vec_1)
  },
  nanoarrow = {
    nanoarrow::as_nanoarrow_array(long_vec_1)
  },
  polars = {
    polars::as_polars_series(long_vec_1)
  },
  neopolars = {
    neopolars::as_polars_series(long_vec_1)
  },
  check = FALSE,
  min_iterations = 5
)
#> # A tibble: 4 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 arrow        2.62ms   2.92ms     328.    19.82MB     2.04
#> 2 nanoarrow  496.13µs 644.87µs    1252.   458.41KB     2.03
#> 3 polars       2.06ms   2.26ms     405.     6.33MB     0
#> 4 neopolars    84.6ms   90.1ms      10.9    1.59MB     0
# Export Arrow data as an R vector
arrow_array_1 <- arrow::as_arrow_array(long_vec_1)
nanoarrow_array_1 <- nanoarrow::as_nanoarrow_array(long_vec_1)
polars_series_1 <- polars::as_polars_series(long_vec_1)
neopolars_series_1 <- neopolars::as_polars_series(long_vec_1)

bench::mark(
  arrow = {
    as.vector(arrow_array_1)
  },
  nanoarrow = {
    as.vector(nanoarrow_array_1)
  },
  polars = {
    as.vector(polars_series_1)
  },
  neopolars = {
    as.vector(neopolars_series_1)
  },
  check = TRUE,
  min_iterations = 5
)
#> # A tibble: 4 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 arrow       13.94µs  15.84µs  46309.      4.59KB     4.63
#> 2 nanoarrow   559.9µs   1.85ms    513.      3.85MB    72.8
#> 3 polars       6.45ms   8.79ms    112.      5.93MB     9.13
#> 4 neopolars  148.82ms 164.65ms      6.02    5.24MB     0

Created on 2024-09-05 with reprex v2.1.1

This is strange, because the construction code is almost identical in both implementations (the main branch takes a different path depending on whether the input contains NA, but in practice the speed does not seem to change either way; see the timing sketch after the snippets below).

Rtype::Integers => {
    let rints = x.as_integers().expect("as matched");
    let s = if rints.no_na().is_true() {
        pl::Series::new(name, x.as_integer_slice().expect("as matched"))
    } else {
        // convert R NAs to Rust Options
        let mut s: pl::Series = rints
            .iter()
            .map(|x| if x.is_na() { None } else { Some(x.inner()) })
            .collect();
        s.rename(name);
        s
    };

fn new_i32(name: &str, values: IntegerSexp) -> Result<Self> {
    let ca: Int32Chunked = values
        .iter()
        .map(|value| if value.is_na() { None } else { Some(*value) })
        .collect_trusted();
    Ok(ca.with_name(name).into_series().into())
}
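
For reference, the with/without NA part of that claim could be checked directly with something like the following sketch (timings omitted because they were not measured here):

# Sketch (not run): does an NA in the input change construction speed?
long_vec_no_na   <- 1:10^6
long_vec_with_na <- replace(long_vec_no_na, 1L, NA_integer_)  # forces the Option-mapping path

bench::mark(
  no_na   = polars::as_polars_series(long_vec_no_na),
  with_na = polars::as_polars_series(long_vec_with_na),
  check = FALSE,
  min_iterations = 5
)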

Is the superior export speed of arrow and nanoarrow due to their use of ALTREP?
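
A rough way to test that would be to inspect what as.vector() actually hands back in each case; again only a sketch, with the expected outcomes marked as assumptions:

# Sketch (not run): compare the internal representation of the exported vectors.
long_vec_1 <- 1:10^6
v_arrow  <- as.vector(arrow::as_arrow_array(long_vec_1))
v_polars <- as.vector(polars::as_polars_series(long_vec_1))

.Internal(inspect(v_arrow))   # assumption: ALTREP view over the Arrow buffer (no copy at export)
.Internal(inspect(v_polars))  # assumption: freshly allocated INTSXP, copied element by element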
