nanoarrow 0.5.0 release post (#525)

Co-authored-by: Raúl Cumplido <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Co-authored-by: Bryce Mecum <[email protected]> Co-authored-by: William Ayd <[email protected]>
apache · May 29, 2024 · fcd3256 · fcd3256
1 parent 2a0acf8
commit fcd3256
Showing 1 changed file with 309 additions and 0 deletions.
diff --git a/_posts/2024-05-27-nanoarrow-0.5.0-release.md b/_posts/2024-05-27-nanoarrow-0.5.0-release.md
@@ -0,0 +1,309 @@
+---
+layout: post
+title: "Apache Arrow nanoarrow 0.5.0 Release"
+date: "2024-01-27 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+The Apache Arrow team is pleased to announce the 0.5.0 release of
+Apache Arrow nanoarrow. This release covers 79 resolved issues from
+9 contributors.
+
+## Release Highlights
+
+The primary focus of the nanoarrow 0.5.0 release was expanding the
+initial [Python bindings](#python-bindings) that were released in 0.4.0.
+The nanoarrow Python package can now create and consume most Arrow
+data types, arrays, and array streams, including conversion to/from
+objects compatible with the Python buffer protocol and conversion
+to/from lists of Python objects.
+
+The nanoarrow 0.5.0 release also includes updates to its build
+configuration to make it possible to use nanoarrow with `FetchContent`
+in projects with a wider variety of CMake usage. In addition to CMake,
+nanoarrow now supports the Meson build system. Thanks to
+[@vyasr](https://github.com/vyasr) and [@WillAyd](https://github.com/WillAyd)
+for contributing these changes!
+
+In the [R bindings](#r-bindings), support for reading IPC streams
+is now accessible with `read_nanoarrow()`!
+
+Finally, build system helpers and helpers to reconcile modern C++ usage
+with nanorrow C structures (e.g., iterating over an `ArrowArrayStream` or
+`ArrowArray` using a range-for loop) were added to `nanoarrow.hpp`.
+Thanks to [@bkeitz](https://github.com/bkietz) for contributing these
+changes!
+
+See the
+[Changelog](https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.5.0/CHANGELOG.md)
+for a detailed list of contributions to this release.
+
+## Breaking Changes
+
+Most changes included in the nanoarrow 0.5.0 release will not break downstream
+code; however, several changes in the C library are breaking changes to previous
+behaviour.
+
+- `ArrowBufferResize()` and `ArrowBitmapResize()` now adjust `size_bytes`/
+  `size_bits` in addition to `capacity_bytes`/`buffer.capacity_bytes`.
+  Preivously these functions only adjusted the capacity of the underlying
+  buffer which caused some understandable confusion even though this
+  behaviour was documented. This change affects all usage of
+  `ArrowBufferReisze()` and `ArrowBitmapResize()` that *increased* the size
+  of the underlying buffer (i.e., usage where `shrink_to_fit` was non zero
+  should be unaffected).
+- `ArrowBufferReset()` now *always* calls the allocator's `free()` callback.
+  Previously, a call to the `free()` callback was skipped if the pointer was
+  `NULL`; however, this led to some confusion and made it easy to accidentally
+  leak a custom deallocator whose pointer happened to be `NULL`.
+- As a consequence of the above, it is now mandatory to call `ArrowBufferInit()`
+  before calling `ArrowBufferReset()`. There was some existing usage of nanoarrow
+  that zero-ed the memory for an `ArrowBuffer` and then (sometimes) called
+  `ArrowBufferReset()`. Preivously this was a no-op; however, after 0.5.0 this
+  will crash. This is consistent with other structures in the nanoarrow C library
+  (which require an initialization before it is safe to reset/release them).
+
+
+### Python bindings
+
+The nanoarrow Python bindings are distributed as the `nanoarrow` package on
+[PyPI](https://pypi.org/project/nanoarrow/) and [conda-forge](https://anaconda.org/conda-forge/nanoarrow):
+
+```shell
+pip install nanoarrow
+conda install nanoarrow -c conda-forge
+```
+
+High level users can use the `Schema`, `Array`, and `ArrayStream` classes
+to interact with data types, arrays, and array streams:
+
+```python
+import nanoarrow as na
+
+na.int32()
+#> <Schema> int32
+
+na.Array([1, 2, 3], na.int32())
+#> nanoarrow.Array<int32>[3]
+#> 1
+#> 2
+#> 3
+
+url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
+na.ArrayStream.from_url(url)
+#> nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...>
+```
+
+Low-level users can use `c_schema()`, `c_array()`, and `c_array_stream()` to interact
+with thin wrappers around the Arrow C Data interface structures:
+
+```python
+na.c_schema(pa.decimal128(10, 3))
+#> <nanoarrow.c_schema.CSchema decimal128(10, 3)>
+#> - format: 'd:10,3'
+#> - name: ''
+#> - flags: 2
+#> - metadata: NULL
+#> - dictionary: NULL
+#> - children[0]:
+
+na.c_array(["one", "two", "three", None], na.string())
+#> <nanoarrow.c_array.CArray string>
+#> - length: 4
+#> - offset: 0
+#> - null_count: 1
+#> - buffers: (4754305168, 4754307808, 4754310464)
+#> - dictionary: NULL
+#> - children[0]:
+```
+
+All nanoarrow type/array-like objects implement the
+[Arrow PyCapsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html)
+for both producing and consuming and are zero-copy interchangeable with `pyarrow`
+objects in many cases:
+
+```python
+import pyarrow as pa
+
+pa.field(na.int32())
+#> pyarrow.Field<: int32>
+
+na.Schema(pa.string())
+#> <Schema> string
+
+pa.array(na.Array([4, 5, 6], na.int32()))
+#> <pyarrow.lib.Int32Array object at 0x11b552500>
+#> [
+#>   4,
+#>   5,
+#>   6
+#> ]
+
+na.Array(pa.array([10, 11, 12]))
+#> nanoarrow.Array<int64>[3]
+#> 10
+#> 11
+#> 12
+```
+
+For a more detailed tour of the nanoarrow Python bindings, see the
+[Getting started in Python guide](https://arrow.apache.org/nanoarrow/latest/getting-started/python.html) and the
+[Python API reference](https://arrow.apache.org/nanoarrow/latest/reference/python/index.html).
+
+### C/C++
+
+The nanoarrow 0.5.0 release includes a number of bugfixes and improvements
+to the core C library and C++ helpers.
+
+First, the CMake build system was refactored to enable `FetchContent` to
+work in a wider variety of
+[develop/build/install scenarios](https://github.com/apache/arrow-nanoarrow/tree/main/examples/cmake-scenarios). In most cases, CMake-based projects should be able
+to add the nanoarrow C library as a dependency with:
+
+```cmake
+include(FetchContent)
+fetchcontent_declare(nanoarrow
+                     GIT_REPOSITORY https://github.com/apache/arrow-nanoarrow.git
+                     GIT_TAG  apache-arrow-nanoarrow-0.5.0
+                     GIT_SHALLOW TRUE)
+fetchcontent_makeavailable(nanoarrow)
+
+add_executable(some_target ...)
+target_link_libraries(some_target nanoarrow::nanoarrow)
+```
+
+Projects using the [Meson](https://mesonbuild.com/) build system can install
+nanoarrow from [WrapDB](https://mesonbuild.com/Wrapdb-projects.html) using:
+
+```shell
+mkdir -p subprojects
+meson wrap install nanoarrow
+```
+
+...and use `dependency('nanoarrow')` to add the dependency:
+
+```
+nanoarrow_dep = dependency('nanoarrow')
+example_exec = executable('some_target',
+                          ...,
+                          dependencies: [nanoarrow_dep])
+```
+
+Finally, a set of C++ range/view helpers were added to smooth out some
+of more verbose aspects of working with nanoarrow in C++. While the
+new helpers are targeted at more than just nanoarrow's tests, they have been
+particularly helpful in allowing nanoarrow's tests to be more less repetitive
+and more effective. For example, one particularly verbose test was collapsed
+to:
+
+```cpp
+#include <gtest/gtest.h>
+#include <gmock/gmock-matchers.h>
+#include <nanoarrow/nanoarrow_gtest_util.hpp>
+#include <nanoarrow/nanoarrow.hpp>
+
+nanoarrow::UniqueArrayStream array_stream;
+// ... populate array_stream
+nanoarrow::ViewArrayStream array_stream_view(array_stream.get());
+
+for (ArrowArray& array : array_stream_view) {
+  EXPECT_THAT(nanoarrow::ViewArrayAs<int32_t>(&array), ElementsAre(1234));
+}
+
+EXPECT_EQ(array_stream_view.count(), 1);
+EXPECT_EQ(array_stream_view.code(), NANOARROW_OK);
+EXPECT_STREQ(array_stream_view.error()->message, "");
+```
+
+See the new section in the
+[C++ API reference](https://arrow.apache.org/nanoarrow/latest/reference/cpp.html#range-for-utilities) for details.
+
+### R bindings
+
+The nanoarrow R bindings are distributed as the `nanoarrow` package on
+[CRAN](https://cran.r-project.org/).
+
+Whereas nanoarrow has had an IPC reader supporting most features of the
+IPC streaming format since 0.3.0, the R bindings did not implement bindings
+until this release. The 0.5.0 release of the R package includes `read_nanoarrow()`
+as an entrypoint to reading streams from various sources including URLs,
+filenames, and R connections:
+
+``` r
+library(nanoarrow)
+
+url <- "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
+
+read_nanoarrow(url) |>
+  tibble::as_tibble()
+#> # A tibble: 15,487 × 5
+#>    commit                                time                files merge message
+#>    <chr>                                 <dttm>              <int> <lgl> <chr>
+#>  1 49cdb0fe4e98fda19031c864a18e6156c6ed… 2024-03-07 02:00:52     2 FALSE GH-403…
+#>  2 1d966e98e41ce817d1f8c5159c0b9caa4de7… 2024-03-06 21:51:34     1 FALSE GH-403…
+#>  3 96f26a89bd73997f7532643cdb27d04b7097… 2024-03-06 20:29:15     1 FALSE GH-402…
+#>  4 ee1a8c39a55f3543a82fed900dadca791f6e… 2024-03-06 07:46:45     1 FALSE GH-403…
+#>  5 3d467ac7bfae03cf2db09807054c5672e195… 2024-03-05 16:13:32     1 FALSE GH-201…
+#>  6 ef6ea6beed071ed070daf03508f4c14b4072… 2024-03-05 14:53:13    20 FALSE GH-403…
+#>  7 53e0c745ad491af98a5bf18b67541b12d779… 2024-03-05 12:31:38     2 FALSE GH-401…
+#>  8 3ba6d286caad328b8572a3b9228045da8c8d… 2024-03-05 08:15:42     6 FALSE GH-400…
+#>  9 4ce9a5edd2710fb8bf0c642fd0e3863b01c2… 2024-03-05 07:56:25     2 FALSE GH-401…
+#> 10 2445975162905bd8d9a42ffc9cd0daa0e19d… 2024-03-05 01:04:20     1 FALSE GH-403…
+#> # ℹ 15,477 more rows
+```
+
+In developing the Python bindings, it became clear that a representation of
+a Arrow C++'s `ChunkedArray` was an important concept to represent. Whereas
+the Python bindings have the `Array` class to provide this structure, the
+R bindings had only the `nanoarrow_array` as a thin wrapper around the
+Arrow C Data interface. When developing the geospatial extension
+[GeoArrow for R](https://github.com/geoarrow/geoarrow-r), a data structure that
+maintained chunked Arrow memory as an R vector was needed as an intermediary
+between an Arrow-native source and an R-native destination. This experimental
+structure can be created with `as_nanoarrow_vctr()`:
+
+``` r
+library(nanoarrow)
+
+array <- as_nanoarrow_array(c("one", "two", "three"))
+convert_array(array, nanoarrow_vctr())
+#> <nanoarrow_vctr string[3]>
+#> [1] "one"   "two"   "three"
+```
+
+## Contributors
+
+This release consists of contributions from 9 contributors in addition
+to the invaluable advice and support of the Apache Arrow developer mailing list.
+
+```console
+$ git shortlog -sn apache-arrow-nanoarrow-0.5.0.dev..apache-arrow-nanoarrow-0.5.0 | grep -v "GitHub Actions"
+  67  Dewey Dunnington
+  3  Dirk Eddelbuettel
+  3  Joris Van den Bossche
+  2  William Ayd
+  1  Alenka Frim
+  1  Benjamin Kietzman
+  1  Max Conradt
+  1  Vyas Ramasubramani
+  1  eitsupi
+```