Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WEBSITE] DataFusion 26.0.0 Blog Post #370

Merged
merged 7 commits into from
Jun 24, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
301 changes: 301 additions & 0 deletions _posts/2023-06-24-datafusion-25.0.0.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,301 @@
---
layout: post
title: "Apache Arrow DataFusion 26.0.0"
date: "2023-06-24 00:00:00"
author: pmc
categories: [release]
---

<!--
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to you under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
{% endcomment %}
-->

It has been a whirlwind 6 months of DataFusion development since [our
last update]: the community has grown, many features have been added,
performance improved and we are [discussing] branching out to our own
top level Apache Project.

## Background

[Apache Arrow DataFusion] is an extensible query engine and database
toolkit, written in [Rust], that uses [Apache Arrow] as its in-memory
format.

[apache arrow datafusion]: https://arrow.apache.org/datafusion/
[apache arrow]: https://arrow.apache.org
[rust]: https://www.rust-lang.org/

DataFusion, along with [Apache Calcite], Facebook's [Velox] and
similar technology are part of the next generation "[Deconstructed
Database]" architectures, where new systems are built on a foundation
of fast, modular components, rather as a single tightly integrated
system.

[apache calcite]: https://calcite.apache.org
[velox]: https://github.com/facebookincubator/velox
[deconstructed database]: https://www.usenix.org/publications/login/winter2018/khurana
[spark]: https://spark.apache.org/
[duckdb]: https://duckdb.org
[pola.rs]: https://www.pola.rs/


While single tightly integrated systems such as [Spark], [DuckDB] and
[Pola.rs] are great pieces of technology, our community believes that
anyone developing new data heavy application, such as those common in
machine learning in the next 5 years, will **require** a high
performance, vectorized, query engine to remain relevant. The only
practical way to gain access to such technology without investing many
millions of dollars to build a new tightly integrated engine, is
though open source projects like DataFusion and similar enabling
technologies such as [Apache Arrow] and [Rust].

DataFusion is targeted primarily at developers creating other data
intensive analytics, and offers:

- High performance, native, parallel streaming execution engine
- Mature [SQL support], featuring subqueries, window functions, grouping sets, and more
- Built in support for Parquet, Avro, CSV, JSON and Arrow formats and easy extension for others
- Native DataFrame API and [python bindings]
- [Well documented] source code and architecture, designed to be customized to suit downstream project needs
- High quality, easy to use code [released every 2 weeks to crates.io]
- Welcoming, open community, governed by the highly regarded and well understood [Apache Software Foundation]

The rest of this post highlights some of the improvements we have made
to DataFusion over the last 6 months and a preview of where we are
heading. You can see a list of all changes in the detailed
[CHANGELOG].

[SQL support]: https://arrow.apache.org/datafusion/user-guide/sql/index.html
[apache software foundation]: https://www.apache.org/
[well documented]: https://docs.rs/datafusion/latest/datafusion/index.html
[python bindings]: https://arrow.apache.org/datafusion-python/
[changelog]: https://github.com/apache/arrow-datafusion/blob/main/datafusion/CHANGELOG.md
[released every 2 weeks to crates.io]: https://crates.io/crates/datafusion/versions

## (Even) Better Performance

[Various] benchmarks show DataFusion to be quite close or [even
faster] to the state of the art in analytic performance (at the moment
this seems to be DuckDB). We continually work on improving performance
(see [#5546] for a list) and would love additional help in this area.

[various]: https://voltrondata.com/resources/speeds-and-feeds-hardware-and-software-matter
[even faster]: https://github.com/tustvold/access-log-bench
[#5546]: https://github.com/apache/arrow-datafusion/issues/5546

DataFusion now reads single large Parquet files significantly faster by
[parallelizing across multiple cores]. Native speeds for reading JSON
and CSV files are also up to 2.5x faster thanks to improvements
upstream in arrow-rs [JSON reader] and [CSV reader].

[parallelizing across multiple cores]: https://github.com/apache/arrow-datafusion/pull/5057
[json reader]: https://github.com/apache/arrow-rs/pull/3479#issuecomment-1384353159
[csv reader]: https://github.com/apache/arrow-rs/pull/3365

Also, we have integrated the [arrow-rs Row Format] into DataFusion resulting in up to [2-3x faster sorting and merging].

[arrow-rs row format]: https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/
[2-3x faster sorting and merging]: https://github.com/apache/arrow-datafusion/pull/6163

## Improved Documentation and Website

Part of growing the DataFusion community is ensuring that DataFusion's
features are understood and that it is easy to contribute and
participate. To that end the [website] has been cleaned up, [the
architecture guide] expanded, the [roadmap] updated, and several
overview talks created:

- Apr 2023 _Query Engine_: [recording](https://youtu.be/NVKujPxwSBA) and [slides](https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p)
- April 2023 _Logical Plan and Expressions_: [recording](https://youtu.be/EzZTLiSJnhY) and [slides](https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30)
- April 2023 _Physical Plan and Execution_: [recording](https://youtu.be/2jkWU3_w6z0) and [slides](https://docs.google.com/presentation/d/1cA2WQJ2qg6tx6y4Wf8FH2WVSm9JQ5UgmBWATHdik0hg)

[website]: https://arrow.apache.org/datafusion/
[the architecture guide]: https://docs.rs/datafusion/latest/datafusion/index.html#architecture
[roadmap]: https://arrow.apache.org/datafusion/contributor-guide/roadmap.html

## New Features

### More Streaming, Less Memory

We have made significany progress on the [streaming execution roadmap]
such as [unbounded datasources], [streaming group by], sophisticated
[sort] and [repartitioning] improvements in the optimizer, and support
for [symmetric hash join] (read more about that in the great [Synnada
Blog Post] on the topic). Together, these features both 1) make it
easier to build streaming systems using DataFusion that can
incrementally generate output before (or ever) seeing the end of the
input and 2) allow general queries to use less memory and generate their
results faster.

We have also improved the runtime [memory management] system so that
DataFusion now stays within its declared memory budget [generate
runtime errors].

[sort]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/global_sort_selection/index.html
[repartitioning]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/repartition/index.html
[streaming execution roadmap]: https://github.com/apache/arrow-datafusion/issues/4285
[unbounded datasources]: https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#method.unbounded_output
[streaming group by]: https://docs.rs/datafusion/latest/datafusion/physical_plan/aggregates/enum.GroupByOrderMode.html
[symmetric hash join]: https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.SymmetricHashJoinExec.html
[synnada blog post]: https://github.com/apache/arrow-datafusion/issues/4285
[memory management]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/index.html
[generate runtime errors]: https://github.com/apache/arrow-datafusion/issues/3941

### DML Support (`INSERT`, `DELETE`, `UPDATE`, etc)

Part of building high performance data systems includes writing data,
and DataFusion supports several features for creating new files:

- `INSERT INTO` and `SELECT ... INTO ` support for memory backed and CSV tables
- New [API for writing data into TableProviders]

We are working on easier to use [COPY INTO] syntax, better support
for writing parquet, JSON, and AVRO, and more -- see our [tracking epic]
for more details.

[tracking epic]: https://github.com/apache/arrow-datafusion/issues/6569
[api for writing data into tableproviders]: https://docs.rs/datafusion/latest/datafusion/physical_plan/insert/trait.DataSink.html
[tracking epic]: https://github.com/apache/arrow-datafusion/issues/6569
[copy into]: https://github.com/apache/arrow-datafusion/issues/5654

### Timestamp and Intervals

One mark of the maturity of a SQL engine is how it handles the tricky
world of timestamp, date, times and interval arithmetic. DataFusion is
feature complete in this area and behaves as you would expect,
supporting queries such as

```sql
SELECT now() + '1 month' FROM my_table;
```

We still have a long tail of [date and time improvements], which we are working on as well.

[date and time improvements]: https://github.com/apache/arrow-datafusion/issues/3148

### Querying Structured Types (`List` and `Struct`s)

Arrow and Parquet [support nested data] well and DataFusion lets you
easily query such `Struct` and `List`. For example, you can use
DataFusion to read and query the [JSON Datasets for Exploratory OLAP -
Mendeley Data] like this:

[support nested data]: https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/
[json datasets for exploratory olap - mendeley data]: https://data.mendeley.com/datasets/ct8f9skv97

```sql
----------
-- Explore structured data using SQL
----------
SELECT delete FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL limit 10;
+---------------------------------------------------------------------------------------------------------------------------+
| delete |
+---------------------------------------------------------------------------------------------------------------------------+
| {status: {id: {$numberLong: 135037425050320896}, id_str: 135037425050320896, user_id: 334902461, user_id_str: 334902461}} |
| {status: {id: {$numberLong: 134703982051463168}, id_str: 134703982051463168, user_id: 405383453, user_id_str: 405383453}} |
| {status: {id: {$numberLong: 134773741740765184}, id_str: 134773741740765184, user_id: 64823441, user_id_str: 64823441}} |
| {status: {id: {$numberLong: 132543659655704576}, id_str: 132543659655704576, user_id: 45917834, user_id_str: 45917834}} |
| {status: {id: {$numberLong: 133786431926697984}, id_str: 133786431926697984, user_id: 67229952, user_id_str: 67229952}} |
| {status: {id: {$numberLong: 134619093570560002}, id_str: 134619093570560002, user_id: 182430773, user_id_str: 182430773}} |
| {status: {id: {$numberLong: 134019857527214080}, id_str: 134019857527214080, user_id: 257396311, user_id_str: 257396311}} |
| {status: {id: {$numberLong: 133931546469076993}, id_str: 133931546469076993, user_id: 124539548, user_id_str: 124539548}} |
| {status: {id: {$numberLong: 134397743350296576}, id_str: 134397743350296576, user_id: 139836391, user_id_str: 139836391}} |
| {status: {id: {$numberLong: 127833661767823360}, id_str: 127833661767823360, user_id: 244442687, user_id_str: 244442687}} |
+---------------------------------------------------------------------------------------------------------------------------+

----------
-- Select some deeply nested fields
----------
SELECT
delete['status']['id']['$numberLong'] as delete_id,
delete['status']['user_id'] as delete_user_id
FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL LIMIT 10;

+--------------------+----------------+
| delete_id | delete_user_id |
+--------------------+----------------+
| 135037425050320896 | 334902461 |
| 134703982051463168 | 405383453 |
| 134773741740765184 | 64823441 |
| 132543659655704576 | 45917834 |
| 133786431926697984 | 67229952 |
| 134619093570560002 | 182430773 |
| 134019857527214080 | 257396311 |
| 133931546469076993 | 124539548 |
| 134397743350296576 | 139836391 |
| 127833661767823360 | 244442687 |
+--------------------+----------------+
```

### Subqueries All the Way Down

DataFusion can run many different subqueries by rewriting them to
joins. It has been able to run the full suite of TPC-H queries for at
least the last year, but recently we have implemented significant
improvements to this logic, sufficient to run almost all queries in
the TPC-DS benchmark as well.

## Community and Project Growth

The six months since [our last update] saw significant growth in
the DataFusion community. Between versions `17.0.0` and `26.0.0`,
DataFusion merged 711 PRs from 107 distinct contributors, not
including all the work that goes into our core dependencies such as
[arrow](https://crates.io/crates/arrow),
[parquet](https://crates.io/crates/parquet), and
[object_store](https://crates.io/crates/object_store), that much of
the same community helps support.

In addition, we have added 7 new committers and 1 new PMC member to
the Apache Arrow project, largely focused on DataFusion, and we
learned about some of the cool [new systems] which are using
DataFusion. Given the growth of the community and interest in the
project, we also clarified the [mission statement] and are
[discussing] "graduate"ing DataFusion to a new top level
Apache Software Foundation project.

[our last update]: https://arrow.apache.org/blog/2023/01/19/datafusion-16.0.0
[new systems]: https://arrow.apache.org/datafusion/user-guide/introduction.html#known-users
[mission statement]: https://github.com/apache/arrow-datafusion/discussions/6441
[discussing]: https://github.com/apache/arrow-datafusion/discussions/6475

<!--
$ git log --pretty=oneline 17.0.0..26.0.0 . | wc -l
711

$ git shortlog -sn 17.0.0..26.0.0 . | wc -l
107
-->

# How to Get Involved

Kudos to everyone in the community who has contributed ideas,
discussions, bug reports, documentation and code. It is exciting to be
innovating on the next generation of database architectures together!

If you are interested in contributing to DataFusion, we would love to
have you join us. You can try out DataFusion on some of your own
data and projects and let us know how it goes or contribute a PR with
documentation, tests or code. A list of open issues suitable for
beginners is [here].

Check out our [Communication Doc] for more ways to engage with the
community.

[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
[communication doc]: https://arrow.apache.org/datafusion/contributor-guide/communication.html