
[R] Let na.rm of mean() support removing NaN as in base R #44089

Open
psychelzh opened this issue Sep 12, 2024 · 2 comments

psychelzh commented Sep 12, 2024

Describe the enhancement requested

See the following example. In base R, na.rm = TRUE also removes NaN values, because is.na(NaN) returns TRUE.

suppressPackageStartupMessages(library(dplyr))
data.frame(x = c(1, 2, NaN)) |> 
  arrow::as_arrow_table() |>
  summarise(x_avg = mean(x, na.rm = TRUE)) |> 
  collect()
#> # A tibble: 1 × 1
#>   x_avg
#>   <dbl>
#> 1   NaN
data.frame(x = c(1, 2, NaN)) |> 
  summarise(x_avg = mean(x, na.rm = TRUE))
#>   x_avg
#> 1   1.5

Created on 2024-09-12 with reprex v2.1.0
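
As a quick base-R aside (a minimal sketch, nothing arrow-specific): NaN counts as missing for is.na(), although NA does not count as NaN, which is why na.rm = TRUE drops both:

is.na(NaN)
#> [1] TRUE
is.nan(NA)
#> [1] FALSE
mean(c(1, 2, NaN), na.rm = TRUE)
#> [1] 1.5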

Component(s)

R

psychelzh changed the title from "[R] Let na.rm of mean() support NaN as base R" to "[R] Let na.rm of mean() support removing NaN as in base R" on Sep 12, 2024
amoeba (Member) commented Sep 12, 2024

Hi @psychelzh, a change here might require some discussion. My understanding is that different types of systems treat NaNs differently, and what you're seeing is a result of arrow having different semantics than R: R and the pandas Python package treat NaN as NA/None, whereas database systems do not, and I think arrow is emulating the database semantics here.

For comparison, Python has a similar situation to R: pandas treats NaN as NA/null (and actually ignores NaNs by default, so this is slightly different from R):

In [1]: pd.Series([1, 2, np.nan]).mean()
Out[1]: 1.5

While PyArrow does not:

In [2]: pc.mean(pa.array([1, 2, np.nan]))
Out[2]: <pyarrow.DoubleScalar: nan>

However, because compatibility is important, PyArrow supports a from_pandas flag, which makes it match pandas semantics:

In [3]: pc.mean(pa.array([1, 2, np.nan], from_pandas=True))
Out[3]: <pyarrow.DoubleScalar: 1.5>

There are a few ways this could be addressed, and I think this type of issue has come up before in places like #29060. The R package could add something similar to PyArrow's from_pandas flag, or an option could be added to individual kernels (mean, in this case). Thoughts @ianmcook @nealrichardson @jonkeane @thisisnic?
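
In the meantime, a possible user-level workaround (a minimal sketch, assuming arrow's dplyr bindings for is.nan() and if_else(), which are available in recent arrow versions) is to recode NaN to NA inside the query so that na.rm = TRUE drops it:

suppressPackageStartupMessages(library(dplyr))
# Recode NaN to a true NA before aggregating; the mean kernel then
# skips it when na.rm = TRUE, matching base R's result of 1.5.
data.frame(x = c(1, 2, NaN)) |>
  arrow::as_arrow_table() |>
  mutate(x = if_else(is.nan(x), NA_real_, x)) |>
  summarise(x_avg = mean(x, na.rm = TRUE)) |>
  collect()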

thisisnic (Member) commented Sep 14, 2024

I have mixed feelings: I'm generally opposed to changing how something has worked for a long time, given that other users may be depending on it or working around it, but I'm in favour of matching base R when we are emulating its functions.

@amoeba I'm a bit rusty with the codebase: if we were to do this, how complex would it be to implement?

Note: we allow NaN to be treated as NA on Arrow Datum objects, e.g.:

library(arrow)
my_vec <- c(42, NA, NaN)
is.na(my_vec)
#> [1] FALSE  TRUE  TRUE
my_arrow_array <- Array$create(my_vec)
is.na(my_arrow_array)
#> Array
#> <bool>
#> [
#>   false,
#>   true,
#>   true
#> ]

Given the behaviour above, which we have already chosen to implement, I think that if this isn't too complex and someone is willing to implement it, we could certainly look into supporting it. Could we perhaps have it as an option, or something else? Keen to hear what others think, though.
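
As a small illustration of why the behaviours diverge (a sketch, assuming a recent arrow version): Arrow stores NaN as an ordinary double value rather than a null, so it does not count towards an array's null count even though is.na() reports it as missing:

library(arrow)
# Only the true NA occupies a null slot; NaN is stored as a double value.
arr <- Array$create(c(42, NA, NaN))
arr$null_count
#> [1] 1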

Some tangentially related PRs:

psychelzh added a commit to psychelzh/eeg_shared_memory that referenced this issue Sep 20, 2024