
[R] Let na.rm of mean() support removing NaN as in base R #44089

Open
psychelzh opened this issue Sep 12, 2024 · 2 comments

psychelzh commented Sep 12, 2024

Describe the enhancement requested

See the following example. In base R, na.rm = TRUE also removes NaN values, because is.na(NaN) returns TRUE.

suppressPackageStartupMessages(library(dplyr))
data.frame(x = c(1, 2, NaN)) |> 
  arrow::as_arrow_table() |>
  summarise(x_avg = mean(x, na.rm = TRUE)) |> 
  collect()
#> # A tibble: 1 × 1
#>   x_avg
#>   <dbl>
#> 1   NaN
data.frame(x = c(1, 2, NaN)) |> 
  summarise(x_avg = mean(x, na.rm = TRUE))
#>   x_avg
#> 1   1.5

Created on 2024-09-12 with reprex v2.1.0
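
As a quick base-R aside (a minimal sketch, nothing arrow-specific): NaN counts as missing for is.na(), although NA does not count as NaN, which is why na.rm = TRUE drops both:

is.na(NaN)
#> [1] TRUE
is.nan(NA)
#> [1] FALSE
mean(c(1, 2, NaN), na.rm = TRUE)
#> [1] 1.5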

Component(s)

R

psychelzh changed the title from "[R] Let na.rm of mean() support NaN as base R" to "[R] Let na.rm of mean() support removing NaN as in base R" on Sep 12, 2024
amoeba (Member) commented Sep 12, 2024

Hi @psychelzh, a change here might require some discussion. My understanding is that different types of systems treat NaNs differently, and what you're seeing is a result of arrow having different semantics than R: R and the pandas Python package treat NaN as NA/None, whereas database systems do not, and I think arrow is emulating the database semantics here.

For comparison, Python has a similar situation to R: pandas treats NaN as NA/null (and actually ignores NaNs by default, so this is slightly different from R):

In [1]: pd.Series([1, 2, np.nan]).mean()
Out[1]: 1.5

While PyArrow does not:

In [2]: pc.mean(pa.array([1, 2, np.nan]))
Out[2]: <pyarrow.DoubleScalar: nan>

However, because compatibility is important, PyArrow supports a from_pandas flag, which makes it match pandas semantics:

In [3]: pc.mean(pa.array([1, 2, np.nan], from_pandas=True))
Out[3]: <pyarrow.DoubleScalar: 1.5>

There are a few ways this could be addressed, and I think this type of issue has come up before in places like #29060. The R package could add something similar to PyArrow's from_pandas flag, or an option could be added to individual kernels (mean, in this case). Thoughts @ianmcook @nealrichardson @jonkeane @thisisnic?
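
In the meantime, a possible user-level workaround (a minimal sketch, assuming arrow's dplyr bindings for is.nan() and if_else(), which are available in recent arrow versions) is to recode NaN to NA inside the query so that na.rm = TRUE drops it:

suppressPackageStartupMessages(library(dplyr))
# Recode NaN to a true NA before aggregating; the mean kernel then
# skips it when na.rm = TRUE, matching base R's result of 1.5.
data.frame(x = c(1, 2, NaN)) |>
  arrow::as_arrow_table() |>
  mutate(x = if_else(is.nan(x), NA_real_, x)) |>
  summarise(x_avg = mean(x, na.rm = TRUE)) |>
  collect()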

thisisnic (Member) commented Sep 14, 2024

I have mixed feelings: I'm generally opposed to changing how something has worked for a long time, given that other users may be depending on it or working around it, but I'm in favour of matching base R when we are emulating its functions.

@amoeba I'm a bit rusty with the codebase: if we were to do this, how complex would it be to implement?

Note: we allow NaN to be treated as NA on Arrow Datum objects, e.g.:

library(arrow)
my_vec <- c(42, NA, NaN)
is.na(my_vec)
#> [1] FALSE  TRUE  TRUE
my_arrow_array <- Array$create(my_vec)
is.na(my_arrow_array)
#> Array
#> <bool>
#> [
#>   false,
#>   true,
#>   true
#> ]

Given the behaviour above, which we have already chosen to implement, I think that if this isn't too complex and someone is willing to implement it, we could certainly look into supporting it. Could we perhaps have it as an option, or something else? Keen to hear what others think, though.
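
As a small illustration of why the behaviours diverge (a sketch, assuming a recent arrow version): Arrow stores NaN as an ordinary double value rather than a null, so it does not count towards an array's null count even though is.na() reports it as missing:

library(arrow)
# Only the true NA occupies a null slot; NaN is stored as a double value.
arr <- Array$create(c(42, NA, NaN))
arr$null_count
#> [1] 1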

Some tangentially related PRs:

psychelzh added a commit to psychelzh/eeg_shared_memory that referenced this issue Sep 20, 2024