-
Notifications
You must be signed in to change notification settings - Fork 110
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Co-authored-by: Raúl Cumplido <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Co-authored-by: Bryce Mecum <[email protected]> Co-authored-by: William Ayd <[email protected]>
- Loading branch information
1 parent
2a0acf8
commit fcd3256
Showing
1 changed file
with
309 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,309 @@ | ||
--- | ||
layout: post | ||
title: "Apache Arrow nanoarrow 0.5.0 Release" | ||
date: "2024-01-27 00:00:00" | ||
author: pmc | ||
categories: [release] | ||
--- | ||
<!-- | ||
{% comment %} | ||
Licensed to the Apache Software Foundation (ASF) under one or more | ||
contributor license agreements. See the NOTICE file distributed with | ||
this work for additional information regarding copyright ownership. | ||
The ASF licenses this file to you under the Apache License, Version 2.0 | ||
(the "License"); you may not use this file except in compliance with | ||
the License. You may obtain a copy of the License at | ||
http://www.apache.org/licenses/LICENSE-2.0 | ||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
{% endcomment %} | ||
--> | ||
|
||
The Apache Arrow team is pleased to announce the 0.5.0 release of | ||
Apache Arrow nanoarrow. This release covers 79 resolved issues from | ||
9 contributors. | ||
|
||
## Release Highlights | ||
|
||
The primary focus of the nanoarrow 0.5.0 release was expanding the | ||
initial [Python bindings](#python-bindings) that were released in 0.4.0. | ||
The nanoarrow Python package can now create and consume most Arrow | ||
data types, arrays, and array streams, including conversion to/from | ||
objects compatible with the Python buffer protocol and conversion | ||
to/from lists of Python objects. | ||
|
||
The nanoarrow 0.5.0 release also includes updates to its build | ||
configuration to make it possible to use nanoarrow with `FetchContent` | ||
in projects with a wider variety of CMake usage. In addition to CMake, | ||
nanoarrow now supports the Meson build system. Thanks to | ||
[@vyasr](https://github.com/vyasr) and [@WillAyd](https://github.com/WillAyd) | ||
for contributing these changes! | ||
|
||
In the [R bindings](#r-bindings), support for reading IPC streams | ||
is now accessible with `read_nanoarrow()`! | ||
|
||
Finally, build system helpers and helpers to reconcile modern C++ usage | ||
with nanorrow C structures (e.g., iterating over an `ArrowArrayStream` or | ||
`ArrowArray` using a range-for loop) were added to `nanoarrow.hpp`. | ||
Thanks to [@bkeitz](https://github.com/bkietz) for contributing these | ||
changes! | ||
|
||
See the | ||
[Changelog](https://github.com/apache/arrow-nanoarrow/blob/apache-arrow-nanoarrow-0.5.0/CHANGELOG.md) | ||
for a detailed list of contributions to this release. | ||
|
||
## Breaking Changes | ||
|
||
Most changes included in the nanoarrow 0.5.0 release will not break downstream | ||
code; however, several changes in the C library are breaking changes to previous | ||
behaviour. | ||
|
||
- `ArrowBufferResize()` and `ArrowBitmapResize()` now adjust `size_bytes`/ | ||
`size_bits` in addition to `capacity_bytes`/`buffer.capacity_bytes`. | ||
Preivously these functions only adjusted the capacity of the underlying | ||
buffer which caused some understandable confusion even though this | ||
behaviour was documented. This change affects all usage of | ||
`ArrowBufferReisze()` and `ArrowBitmapResize()` that *increased* the size | ||
of the underlying buffer (i.e., usage where `shrink_to_fit` was non zero | ||
should be unaffected). | ||
- `ArrowBufferReset()` now *always* calls the allocator's `free()` callback. | ||
Previously, a call to the `free()` callback was skipped if the pointer was | ||
`NULL`; however, this led to some confusion and made it easy to accidentally | ||
leak a custom deallocator whose pointer happened to be `NULL`. | ||
- As a consequence of the above, it is now mandatory to call `ArrowBufferInit()` | ||
before calling `ArrowBufferReset()`. There was some existing usage of nanoarrow | ||
that zero-ed the memory for an `ArrowBuffer` and then (sometimes) called | ||
`ArrowBufferReset()`. Preivously this was a no-op; however, after 0.5.0 this | ||
will crash. This is consistent with other structures in the nanoarrow C library | ||
(which require an initialization before it is safe to reset/release them). | ||
|
||
|
||
### Python bindings | ||
|
||
The nanoarrow Python bindings are distributed as the `nanoarrow` package on | ||
[PyPI](https://pypi.org/project/nanoarrow/) and [conda-forge](https://anaconda.org/conda-forge/nanoarrow): | ||
|
||
```shell | ||
pip install nanoarrow | ||
conda install nanoarrow -c conda-forge | ||
``` | ||
|
||
High level users can use the `Schema`, `Array`, and `ArrayStream` classes | ||
to interact with data types, arrays, and array streams: | ||
|
||
```python | ||
import nanoarrow as na | ||
|
||
na.int32() | ||
#> <Schema> int32 | ||
|
||
na.Array([1, 2, 3], na.int32()) | ||
#> nanoarrow.Array<int32>[3] | ||
#> 1 | ||
#> 2 | ||
#> 3 | ||
|
||
url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows" | ||
na.ArrayStream.from_url(url) | ||
#> nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...> | ||
``` | ||
|
||
Low-level users can use `c_schema()`, `c_array()`, and `c_array_stream()` to interact | ||
with thin wrappers around the Arrow C Data interface structures: | ||
|
||
```python | ||
na.c_schema(pa.decimal128(10, 3)) | ||
#> <nanoarrow.c_schema.CSchema decimal128(10, 3)> | ||
#> - format: 'd:10,3' | ||
#> - name: '' | ||
#> - flags: 2 | ||
#> - metadata: NULL | ||
#> - dictionary: NULL | ||
#> - children[0]: | ||
|
||
na.c_array(["one", "two", "three", None], na.string()) | ||
#> <nanoarrow.c_array.CArray string> | ||
#> - length: 4 | ||
#> - offset: 0 | ||
#> - null_count: 1 | ||
#> - buffers: (4754305168, 4754307808, 4754310464) | ||
#> - dictionary: NULL | ||
#> - children[0]: | ||
``` | ||
|
||
All nanoarrow type/array-like objects implement the | ||
[Arrow PyCapsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) | ||
for both producing and consuming and are zero-copy interchangeable with `pyarrow` | ||
objects in many cases: | ||
|
||
```python | ||
import pyarrow as pa | ||
|
||
pa.field(na.int32()) | ||
#> pyarrow.Field<: int32> | ||
|
||
na.Schema(pa.string()) | ||
#> <Schema> string | ||
|
||
pa.array(na.Array([4, 5, 6], na.int32())) | ||
#> <pyarrow.lib.Int32Array object at 0x11b552500> | ||
#> [ | ||
#> 4, | ||
#> 5, | ||
#> 6 | ||
#> ] | ||
|
||
na.Array(pa.array([10, 11, 12])) | ||
#> nanoarrow.Array<int64>[3] | ||
#> 10 | ||
#> 11 | ||
#> 12 | ||
``` | ||
|
||
For a more detailed tour of the nanoarrow Python bindings, see the | ||
[Getting started in Python guide](https://arrow.apache.org/nanoarrow/latest/getting-started/python.html) and the | ||
[Python API reference](https://arrow.apache.org/nanoarrow/latest/reference/python/index.html). | ||
|
||
### C/C++ | ||
|
||
The nanoarrow 0.5.0 release includes a number of bugfixes and improvements | ||
to the core C library and C++ helpers. | ||
|
||
First, the CMake build system was refactored to enable `FetchContent` to | ||
work in a wider variety of | ||
[develop/build/install scenarios](https://github.com/apache/arrow-nanoarrow/tree/main/examples/cmake-scenarios). In most cases, CMake-based projects should be able | ||
to add the nanoarrow C library as a dependency with: | ||
|
||
```cmake | ||
include(FetchContent) | ||
fetchcontent_declare(nanoarrow | ||
GIT_REPOSITORY https://github.com/apache/arrow-nanoarrow.git | ||
GIT_TAG apache-arrow-nanoarrow-0.5.0 | ||
GIT_SHALLOW TRUE) | ||
fetchcontent_makeavailable(nanoarrow) | ||
add_executable(some_target ...) | ||
target_link_libraries(some_target nanoarrow::nanoarrow) | ||
``` | ||
|
||
Projects using the [Meson](https://mesonbuild.com/) build system can install | ||
nanoarrow from [WrapDB](https://mesonbuild.com/Wrapdb-projects.html) using: | ||
|
||
```shell | ||
mkdir -p subprojects | ||
meson wrap install nanoarrow | ||
``` | ||
|
||
...and use `dependency('nanoarrow')` to add the dependency: | ||
|
||
``` | ||
nanoarrow_dep = dependency('nanoarrow') | ||
example_exec = executable('some_target', | ||
..., | ||
dependencies: [nanoarrow_dep]) | ||
``` | ||
|
||
Finally, a set of C++ range/view helpers were added to smooth out some | ||
of more verbose aspects of working with nanoarrow in C++. While the | ||
new helpers are targeted at more than just nanoarrow's tests, they have been | ||
particularly helpful in allowing nanoarrow's tests to be more less repetitive | ||
and more effective. For example, one particularly verbose test was collapsed | ||
to: | ||
|
||
```cpp | ||
#include <gtest/gtest.h> | ||
#include <gmock/gmock-matchers.h> | ||
#include <nanoarrow/nanoarrow_gtest_util.hpp> | ||
#include <nanoarrow/nanoarrow.hpp> | ||
|
||
nanoarrow::UniqueArrayStream array_stream; | ||
// ... populate array_stream | ||
nanoarrow::ViewArrayStream array_stream_view(array_stream.get()); | ||
|
||
for (ArrowArray& array : array_stream_view) { | ||
EXPECT_THAT(nanoarrow::ViewArrayAs<int32_t>(&array), ElementsAre(1234)); | ||
} | ||
|
||
EXPECT_EQ(array_stream_view.count(), 1); | ||
EXPECT_EQ(array_stream_view.code(), NANOARROW_OK); | ||
EXPECT_STREQ(array_stream_view.error()->message, ""); | ||
``` | ||
See the new section in the | ||
[C++ API reference](https://arrow.apache.org/nanoarrow/latest/reference/cpp.html#range-for-utilities) for details. | ||
### R bindings | ||
The nanoarrow R bindings are distributed as the `nanoarrow` package on | ||
[CRAN](https://cran.r-project.org/). | ||
Whereas nanoarrow has had an IPC reader supporting most features of the | ||
IPC streaming format since 0.3.0, the R bindings did not implement bindings | ||
until this release. The 0.5.0 release of the R package includes `read_nanoarrow()` | ||
as an entrypoint to reading streams from various sources including URLs, | ||
filenames, and R connections: | ||
``` r | ||
library(nanoarrow) | ||
url <- "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows" | ||
read_nanoarrow(url) |> | ||
tibble::as_tibble() | ||
#> # A tibble: 15,487 × 5 | ||
#> commit time files merge message | ||
#> <chr> <dttm> <int> <lgl> <chr> | ||
#> 1 49cdb0fe4e98fda19031c864a18e6156c6ed… 2024-03-07 02:00:52 2 FALSE GH-403… | ||
#> 2 1d966e98e41ce817d1f8c5159c0b9caa4de7… 2024-03-06 21:51:34 1 FALSE GH-403… | ||
#> 3 96f26a89bd73997f7532643cdb27d04b7097… 2024-03-06 20:29:15 1 FALSE GH-402… | ||
#> 4 ee1a8c39a55f3543a82fed900dadca791f6e… 2024-03-06 07:46:45 1 FALSE GH-403… | ||
#> 5 3d467ac7bfae03cf2db09807054c5672e195… 2024-03-05 16:13:32 1 FALSE GH-201… | ||
#> 6 ef6ea6beed071ed070daf03508f4c14b4072… 2024-03-05 14:53:13 20 FALSE GH-403… | ||
#> 7 53e0c745ad491af98a5bf18b67541b12d779… 2024-03-05 12:31:38 2 FALSE GH-401… | ||
#> 8 3ba6d286caad328b8572a3b9228045da8c8d… 2024-03-05 08:15:42 6 FALSE GH-400… | ||
#> 9 4ce9a5edd2710fb8bf0c642fd0e3863b01c2… 2024-03-05 07:56:25 2 FALSE GH-401… | ||
#> 10 2445975162905bd8d9a42ffc9cd0daa0e19d… 2024-03-05 01:04:20 1 FALSE GH-403… | ||
#> # ℹ 15,477 more rows | ||
``` | ||
|
||
In developing the Python bindings, it became clear that a representation of | ||
a Arrow C++'s `ChunkedArray` was an important concept to represent. Whereas | ||
the Python bindings have the `Array` class to provide this structure, the | ||
R bindings had only the `nanoarrow_array` as a thin wrapper around the | ||
Arrow C Data interface. When developing the geospatial extension | ||
[GeoArrow for R](https://github.com/geoarrow/geoarrow-r), a data structure that | ||
maintained chunked Arrow memory as an R vector was needed as an intermediary | ||
between an Arrow-native source and an R-native destination. This experimental | ||
structure can be created with `as_nanoarrow_vctr()`: | ||
|
||
``` r | ||
library(nanoarrow) | ||
|
||
array <- as_nanoarrow_array(c("one", "two", "three")) | ||
convert_array(array, nanoarrow_vctr()) | ||
#> <nanoarrow_vctr string[3]> | ||
#> [1] "one" "two" "three" | ||
``` | ||
|
||
## Contributors | ||
|
||
This release consists of contributions from 9 contributors in addition | ||
to the invaluable advice and support of the Apache Arrow developer mailing list. | ||
|
||
```console | ||
$ git shortlog -sn apache-arrow-nanoarrow-0.5.0.dev..apache-arrow-nanoarrow-0.5.0 | grep -v "GitHub Actions" | ||
67 Dewey Dunnington | ||
3 Dirk Eddelbuettel | ||
3 Joris Van den Bossche | ||
2 William Ayd | ||
1 Alenka Frim | ||
1 Benjamin Kietzman | ||
1 Max Conradt | ||
1 Vyas Ramasubramani | ||
1 eitsupi | ||
``` |