[R] Convert arrow dictionary to R factor via as.data.frame.nanoarrow_array_stream()? #513

Open
eitsupi opened this issue Jun 9, 2024 · 3 comments


@eitsupi
Contributor

eitsupi commented Jun 9, 2024

Maybe related to #220

I noticed that when a nanoarrow_array_stream is converted to a data.frame, dictionary columns become character vectors rather than factors.

stream <- data.frame(
  x = as.factor(letters[1:5]),
  y = as.factor(1:5)
) |>
  nanoarrow::as_nanoarrow_array_stream()

stream
#> <nanoarrow_array_stream struct<x: dictionary(int32)<string>, y: dictionary(int32)<string>>>
#>  $ get_schema:function ()
#>  $ get_next  :function (schema = x$get_schema(), validate = TRUE)
#>  $ release   :function ()
stream |>
  tibble::as_tibble()
#> # A tibble: 5 × 2
#>   x     y
#>   <chr> <chr>
#> 1 a     1
#> 2 b     2
#> 3 c     3
#> 4 d     4
#> 5 e     5

Created on 2024-06-09 with reprex v2.1.0

@paleolimbot
Member

Thanks for bringing this up!

One of the tricky things about dictionaries in Arrow is that the "levels"/"dictionary" live at the array level, not at the type level. This means that two arrays can both have the type dictionary(int32, string) while each carries its own dictionary. Arrow C++ (and therefore arrow R) handles this with a rather complex system of "dictionary unification", which it can do because it has equality kernels and can do fancy things. nanoarrow doesn't have any of that, so I made the default conversion a little simpler: it returns the dictionary value type, which also handles dictionaries of things that aren't just strings in a more predictable way (or at least a more stable one, even if it is unexpected to the average R user).
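
As a rough analogy in base R only (no nanoarrow calls; the variable names are made up for illustration), two factors can share the same integer codes while pointing at different levels, which is why the codes of separate batches cannot simply be concatenated:

```r
# Two "batches" of the same kind: integer codes plus their own levels
batch1 <- factor(letters[1:5])
batch2 <- factor(letters[6:10])

# The integer codes are identical in both batches...
codes_equal <- identical(as.integer(batch1), as.integer(batch2))

# ...but each batch indexes into its own levels, so the values differ
values_equal <- identical(as.character(batch1), as.character(batch2))

codes_equal
#> [1] TRUE
values_equal
#> [1] FALSE
```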

You should be able to specify that you want a factor() specifically, and this will work for converting just one batch. If you need to convert an arbitrary stream, you'll need to know the levels in advance at the moment (this could be fixed such that it "learns" the levels as it goes and finalizes the array at the end...basically an implementation of dictionary unification written in R).

library(nanoarrow)
#> Warning: package 'nanoarrow' was built under R version 4.3.3

df1 <- data.frame(
  x = as.factor(letters[1:5]),
  y = as.factor(1:5)
)

df2 <- data.frame(
  x = as.factor(letters[6:10]),
  y = as.factor(1:5)
)

# Safest/most type stable/makes the fewest assumptions to just return
# the dictionary value type
basic_array_stream(list(df1, df2)) |> 
  convert_array_stream() |> 
  tibble::as_tibble()
#> # A tibble: 10 × 2
#>    x     y    
#>    <chr> <chr>
#>  1 a     1    
#>  2 b     2    
#>  3 c     3    
#>  4 d     4    
#>  5 e     5    
#>  6 f     1    
#>  7 g     2    
#>  8 h     3    
#>  9 i     4    
#> 10 j     5

# You can specify a factor() target type if you know the levels
basic_array_stream(list(df1, df2)) |> 
  convert_array_stream(
    data.frame(x = factor(levels = letters), y = factor(levels = as.character(1:5)))
  ) |> 
  tibble::as_tibble()
#> # A tibble: 10 × 2
#>    x     y    
#>    <fct> <fct>
#>  1 a     1    
#>  2 b     2    
#>  3 c     3    
#>  4 d     4    
#>  5 e     5    
#>  6 f     1    
#>  7 g     2    
#>  8 h     3    
#>  9 i     4    
#> 10 j     5

# If you have only one batch, factor() should work as a target (but doesn't currently)
basic_array_stream(list(df1)) |> 
  convert_array_stream(
    data.frame(x = factor(), y = factor())
  ) |> 
  tibble::as_tibble()
#> # A tibble: 5 × 2
#>   x     y    
#>   <fct> <fct>
#> 1 a     1    
#> 2 b     2    
#> 3 c     3    
#> 4 d     4    
#> 5 e     5

Created on 2024-06-09 with reprex v2.1.0
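
The "learn the levels as it goes and finalize at the end" idea mentioned above could look something like the following. This is a minimal sketch in base R, assuming each batch has already been converted to a factor on its own; `unify_factor_batches()` is a hypothetical helper, not part of nanoarrow, and it is not how Arrow C++ implements dictionary unification.

```r
# Hypothetical helper (not a nanoarrow function): combine per-batch factors
# whose levels differ into a single factor with one shared set of levels.
unify_factor_batches <- function(batches) {
  # "Learn" the levels in order of first appearance across all batches
  all_levels <- unique(unlist(lapply(batches, levels), use.names = FALSE))

  # Re-encode every batch against the shared levels and concatenate the codes
  codes <- unlist(
    lapply(batches, function(f) match(as.character(f), all_levels)),
    use.names = FALSE
  )

  # "Finalize": build one factor from the unified codes and levels
  factor(all_levels[codes], levels = all_levels)
}

batch1 <- factor(letters[1:5])
batch2 <- factor(letters[6:10])
unified <- unify_factor_batches(list(batch1, batch2))

levels(unified)
#>  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
```

The sketch keeps all batches in memory and re-encodes them at the end; a streaming version would have to track the growing level set and remap codes batch by batch.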

@eitsupi
Contributor Author

eitsupi commented Jun 10, 2024

> One of the tricky things about dictionaries in Arrow is that the "levels"/"dictionary" live at the array level, not at the type level.

Thanks for the detailed explanation. I see, this is indeed a complicated process.

Perhaps the statistics on the C interface that are currently being discussed could provide some sort of dictionary for the entire column...?

@paleolimbot
Member

> Perhaps the statistics on the C interface that are currently being discussed could provide some sort of dictionary for the entire column...?

I think that convert_array_stream(stream, factor()) could be smarter: when I first implemented the "convert to R" logic I didn't allow for any flexibility with respect to "finalizing" a value. I had to bite that bullet to support GeoArrow (i.e., with the nanoarrow_vctr()), but at the moment attempting to convert a stream with a target factor() will error.

There is also a PR open to refactor the conversion process to make it easier to add these features: #392
