Skip to content

Commit

Permalink
nanoarrow 0.5.0 release post (#525)
Browse files Browse the repository at this point in the history
Co-authored-by: Raúl Cumplido <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Bryce Mecum <[email protected]>
Co-authored-by: William Ayd <[email protected]>
  • Loading branch information
5 people authored May 29, 2024
1 parent 2a0acf8 commit fcd3256
Showing 1 changed file with 309 additions and 0 deletions.
309 changes: 309 additions & 0 deletions _posts/2024-05-27-nanoarrow-0.5.0-release.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,309 @@
---
layout: post
title: "Apache Arrow nanoarrow 0.5.0 Release"
date: "2024-01-27 00:00:00"
author: pmc
categories: [release]
---
<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

The Apache Arrow team is pleased to announce the 0.5.0 release of
Apache Arrow nanoarrow. This release covers 79 resolved issues from
9 contributors.

## Release Highlights

The primary focus of the nanoarrow 0.5.0 release was expanding the
initial [Python bindings](#python-bindings) that were released in 0.4.0.
The nanoarrow Python package can now create and consume most Arrow
data types, arrays, and array streams, including conversion to/from
objects compatible with the Python buffer protocol and conversion
to/from lists of Python objects.

The nanoarrow 0.5.0 release also includes updates to its build
configuration to make it possible to use nanoarrow with `FetchContent`
in projects with a wider variety of CMake usage. In addition to CMake,
nanoarrow now supports the Meson build system. Thanks to
[@vyasr](https://github.com/vyasr) and [@WillAyd](https://github.com/WillAyd)
for contributing these changes!

In the [R bindings](#r-bindings), support for reading IPC streams
is now accessible with `read_nanoarrow()`!

Finally, build system helpers and helpers to reconcile modern C++ usage
with nanorrow C structures (e.g., iterating over an `ArrowArrayStream` or
`ArrowArray` using a range-for loop) were added to `nanoarrow.hpp`.
Thanks to [@bkeitz](https://github.com/bkietz) for contributing these
changes!

See the
[Changelog](https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.5.0/CHANGELOG.md)
for a detailed list of contributions to this release.

## Breaking Changes

Most changes included in the nanoarrow 0.5.0 release will not break downstream
code; however, several changes in the C library are breaking changes to previous
behaviour.

- `ArrowBufferResize()` and `ArrowBitmapResize()` now adjust `size_bytes`/
`size_bits` in addition to `capacity_bytes`/`buffer.capacity_bytes`.
Preivously these functions only adjusted the capacity of the underlying
buffer which caused some understandable confusion even though this
behaviour was documented. This change affects all usage of
`ArrowBufferReisze()` and `ArrowBitmapResize()` that *increased* the size
of the underlying buffer (i.e., usage where `shrink_to_fit` was non zero
should be unaffected).
- `ArrowBufferReset()` now *always* calls the allocator's `free()` callback.
Previously, a call to the `free()` callback was skipped if the pointer was
`NULL`; however, this led to some confusion and made it easy to accidentally
leak a custom deallocator whose pointer happened to be `NULL`.
- As a consequence of the above, it is now mandatory to call `ArrowBufferInit()`
before calling `ArrowBufferReset()`. There was some existing usage of nanoarrow
that zero-ed the memory for an `ArrowBuffer` and then (sometimes) called
`ArrowBufferReset()`. Preivously this was a no-op; however, after 0.5.0 this
will crash. This is consistent with other structures in the nanoarrow C library
(which require an initialization before it is safe to reset/release them).


### Python bindings

The nanoarrow Python bindings are distributed as the `nanoarrow` package on
[PyPI](https://pypi.org/project/nanoarrow/) and [conda-forge](https://anaconda.org/conda-forge/nanoarrow):

```shell
pip install nanoarrow
conda install nanoarrow -c conda-forge
```

High level users can use the `Schema`, `Array`, and `ArrayStream` classes
to interact with data types, arrays, and array streams:

```python
import nanoarrow as na

na.int32()
#> <Schema> int32

na.Array([1, 2, 3], na.int32())
#> nanoarrow.Array<int32>[3]
#> 1
#> 2
#> 3

url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
na.ArrayStream.from_url(url)
#> nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...>
```

Low-level users can use `c_schema()`, `c_array()`, and `c_array_stream()` to interact
with thin wrappers around the Arrow C Data interface structures:

```python
na.c_schema(pa.decimal128(10, 3))
#> <nanoarrow.c_schema.CSchema decimal128(10, 3)>
#> - format: 'd:10,3'
#> - name: ''
#> - flags: 2
#> - metadata: NULL
#> - dictionary: NULL
#> - children[0]:

na.c_array(["one", "two", "three", None], na.string())
#> <nanoarrow.c_array.CArray string>
#> - length: 4
#> - offset: 0
#> - null_count: 1
#> - buffers: (4754305168, 4754307808, 4754310464)
#> - dictionary: NULL
#> - children[0]:
```

All nanoarrow type/array-like objects implement the
[Arrow PyCapsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html)
for both producing and consuming and are zero-copy interchangeable with `pyarrow`
objects in many cases:

```python
import pyarrow as pa

pa.field(na.int32())
#> pyarrow.Field<: int32>

na.Schema(pa.string())
#> <Schema> string

pa.array(na.Array([4, 5, 6], na.int32()))
#> <pyarrow.lib.Int32Array object at 0x11b552500>
#> [
#> 4,
#> 5,
#> 6
#> ]

na.Array(pa.array([10, 11, 12]))
#> nanoarrow.Array<int64>[3]
#> 10
#> 11
#> 12
```

For a more detailed tour of the nanoarrow Python bindings, see the
[Getting started in Python guide](https://arrow.apache.org/nanoarrow/latest/getting-started/python.html) and the
[Python API reference](https://arrow.apache.org/nanoarrow/latest/reference/python/index.html).

### C/C++

The nanoarrow 0.5.0 release includes a number of bugfixes and improvements
to the core C library and C++ helpers.

First, the CMake build system was refactored to enable `FetchContent` to
work in a wider variety of
[develop/build/install scenarios](https://github.com/apache/arrow-nanoarrow/tree/main/examples/cmake-scenarios). In most cases, CMake-based projects should be able
to add the nanoarrow C library as a dependency with:

```cmake
include(FetchContent)
fetchcontent_declare(nanoarrow
GIT_REPOSITORY https://github.com/apache/arrow-nanoarrow.git
GIT_TAG apache-arrow-nanoarrow-0.5.0
GIT_SHALLOW TRUE)
fetchcontent_makeavailable(nanoarrow)
add_executable(some_target ...)
target_link_libraries(some_target nanoarrow::nanoarrow)
```

Projects using the [Meson](https://mesonbuild.com/) build system can install
nanoarrow from [WrapDB](https://mesonbuild.com/Wrapdb-projects.html) using:

```shell
mkdir -p subprojects
meson wrap install nanoarrow
```

...and use `dependency('nanoarrow')` to add the dependency:

```
nanoarrow_dep = dependency('nanoarrow')
example_exec = executable('some_target',
...,
dependencies: [nanoarrow_dep])
```

Finally, a set of C++ range/view helpers were added to smooth out some
of more verbose aspects of working with nanoarrow in C++. While the
new helpers are targeted at more than just nanoarrow's tests, they have been
particularly helpful in allowing nanoarrow's tests to be more less repetitive
and more effective. For example, one particularly verbose test was collapsed
to:

```cpp
#include <gtest/gtest.h>
#include <gmock/gmock-matchers.h>
#include <nanoarrow/nanoarrow_gtest_util.hpp>
#include <nanoarrow/nanoarrow.hpp>

nanoarrow::UniqueArrayStream array_stream;
// ... populate array_stream
nanoarrow::ViewArrayStream array_stream_view(array_stream.get());

for (ArrowArray& array : array_stream_view) {
EXPECT_THAT(nanoarrow::ViewArrayAs<int32_t>(&array), ElementsAre(1234));
}

EXPECT_EQ(array_stream_view.count(), 1);
EXPECT_EQ(array_stream_view.code(), NANOARROW_OK);
EXPECT_STREQ(array_stream_view.error()->message, "");
```
See the new section in the
[C++ API reference](https://arrow.apache.org/nanoarrow/latest/reference/cpp.html#range-for-utilities) for details.
### R bindings
The nanoarrow R bindings are distributed as the `nanoarrow` package on
[CRAN](https://cran.r-project.org/).
Whereas nanoarrow has had an IPC reader supporting most features of the
IPC streaming format since 0.3.0, the R bindings did not implement bindings
until this release. The 0.5.0 release of the R package includes `read_nanoarrow()`
as an entrypoint to reading streams from various sources including URLs,
filenames, and R connections:
``` r
library(nanoarrow)
url <- "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
read_nanoarrow(url) |>
tibble::as_tibble()
#> # A tibble: 15,487 × 5
#> commit time files merge message
#> <chr> <dttm> <int> <lgl> <chr>
#> 1 49cdb0fe4e98fda19031c864a18e6156c6ed… 2024-03-07 02:00:52 2 FALSE GH-403…
#> 2 1d966e98e41ce817d1f8c5159c0b9caa4de7… 2024-03-06 21:51:34 1 FALSE GH-403…
#> 3 96f26a89bd73997f7532643cdb27d04b7097… 2024-03-06 20:29:15 1 FALSE GH-402…
#> 4 ee1a8c39a55f3543a82fed900dadca791f6e… 2024-03-06 07:46:45 1 FALSE GH-403…
#> 5 3d467ac7bfae03cf2db09807054c5672e195… 2024-03-05 16:13:32 1 FALSE GH-201…
#> 6 ef6ea6beed071ed070daf03508f4c14b4072… 2024-03-05 14:53:13 20 FALSE GH-403…
#> 7 53e0c745ad491af98a5bf18b67541b12d779… 2024-03-05 12:31:38 2 FALSE GH-401…
#> 8 3ba6d286caad328b8572a3b9228045da8c8d… 2024-03-05 08:15:42 6 FALSE GH-400…
#> 9 4ce9a5edd2710fb8bf0c642fd0e3863b01c2… 2024-03-05 07:56:25 2 FALSE GH-401…
#> 10 2445975162905bd8d9a42ffc9cd0daa0e19d… 2024-03-05 01:04:20 1 FALSE GH-403…
#> # ℹ 15,477 more rows
```

In developing the Python bindings, it became clear that a representation of
a Arrow C++'s `ChunkedArray` was an important concept to represent. Whereas
the Python bindings have the `Array` class to provide this structure, the
R bindings had only the `nanoarrow_array` as a thin wrapper around the
Arrow C Data interface. When developing the geospatial extension
[GeoArrow for R](https://github.com/geoarrow/geoarrow-r), a data structure that
maintained chunked Arrow memory as an R vector was needed as an intermediary
between an Arrow-native source and an R-native destination. This experimental
structure can be created with `as_nanoarrow_vctr()`:

``` r
library(nanoarrow)

array <- as_nanoarrow_array(c("one", "two", "three"))
convert_array(array, nanoarrow_vctr())
#> <nanoarrow_vctr string[3]>
#> [1] "one" "two" "three"
```

## Contributors

This release consists of contributions from 9 contributors in addition
to the invaluable advice and support of the Apache Arrow developer mailing list.

```console
$ git shortlog -sn apache-arrow-nanoarrow-0.5.0.dev..apache-arrow-nanoarrow-0.5.0 | grep -v "GitHub Actions"
67 Dewey Dunnington
3 Dirk Eddelbuettel
3 Joris Van den Bossche
2 William Ayd
1 Alenka Frim
1 Benjamin Kietzman
1 Max Conradt
1 Vyas Ramasubramani
1 eitsupi
```

0 comments on commit fcd3256

Please sign in to comment.