[WEBSITE] DataFusion 26.0.0 Blog Post (#370)
Closes apache/datafusion#5812. This PR tries to summarize the last 6 months of DataFusion and tell people what has been going on in the community.
Co-authored-by: Paddy Horan
Co-authored-by: Folyd
1 parent 855f956, commit 4ad44e6
Showing 1 changed file with 301 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,301 @@ | ||
--- | ||
layout: post | ||
title: "Apache Arrow DataFusion 26.0.0" | ||
date: "2023-06-24 00:00:00" | ||
author: pmc | ||
categories: [release] | ||
--- | ||
|
||
<!-- | ||
{% comment %} | ||
Licensed to the Apache Software Foundation (ASF) under one or more | ||
contributor license agreements. See the NOTICE file distributed with | ||
this work for additional information regarding copyright ownership. | ||
The ASF licenses this file to you under the Apache License, Version 2.0 | ||
(the "License"); you may not use this file except in compliance with | ||
the License. You may obtain a copy of the License at | ||
http://www.apache.org/licenses/LICENSE-2.0 | ||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
{% endcomment %} | ||
--> | ||
|
||
It has been a whirlwind 6 months of DataFusion development since [our | ||
last update]: the community has grown, many features have been added, | ||
performance has improved, and we are [discussing] branching out to our own | ||
top-level Apache project. | ||
|
||
## Background | ||
|
||
[Apache Arrow DataFusion] is an extensible query engine and database | ||
toolkit, written in [Rust], that uses [Apache Arrow] as its in-memory | ||
format. | ||
|
||
[apache arrow datafusion]: https://arrow.apache.org/datafusion/ | ||
[apache arrow]: https://arrow.apache.org | ||
[rust]: https://www.rust-lang.org/ | ||
|
||
DataFusion, along with [Apache Calcite], Facebook's [Velox], and | ||
similar technologies, is part of the next generation of "[Deconstructed | ||
Database]" architectures, where new systems are built on a foundation | ||
of fast, modular components rather than as a single tightly integrated | ||
system. | ||
|
||
[apache calcite]: https://calcite.apache.org | ||
[velox]: https://github.com/facebookincubator/velox | ||
[deconstructed database]: https://www.usenix.org/publications/login/winter2018/khurana | ||
[spark]: https://spark.apache.org/ | ||
[duckdb]: https://duckdb.org | ||
[pola.rs]: https://www.pola.rs/ | ||
|
||
|
||
While single tightly integrated systems such as [Spark], [DuckDB] and | ||
[Pola.rs] are great pieces of technology, our community believes that | ||
anyone developing new data-heavy applications, such as those common in | ||
machine learning, in the next 5 years will **require** a high | ||
performance, vectorized query engine to remain relevant. The only | ||
practical way to gain access to such technology, without investing many | ||
millions of dollars to build a new tightly integrated engine, is | ||
through open source projects like DataFusion and similar enabling | ||
technologies such as [Apache Arrow] and [Rust]. | ||
|
||
DataFusion is targeted primarily at developers creating other | ||
data-intensive analytic systems, and offers: | ||
|
||
- High performance, native, parallel streaming execution engine | ||
- Mature [SQL support], featuring subqueries, window functions, grouping sets, and more | ||
- Built-in support for Parquet, Avro, CSV, JSON and Arrow formats, and easy extension for others | ||
- Native DataFrame API and [python bindings] | ||
- [Well documented] source code and architecture, designed to be customized to suit downstream project needs | ||
- High quality, easy to use code [released every 2 weeks to crates.io] | ||
- Welcoming, open community, governed by the highly regarded and well understood [Apache Software Foundation] | ||
|
||
The rest of this post highlights some of the improvements we have made | ||
to DataFusion over the last 6 months and a preview of where we are | ||
heading. You can see a list of all changes in the detailed | ||
[CHANGELOG]. | ||
|
||
[SQL support]: https://arrow.apache.org/datafusion/user-guide/sql/index.html | ||
[apache software foundation]: https://www.apache.org/ | ||
[well documented]: https://docs.rs/datafusion/latest/datafusion/index.html | ||
[python bindings]: https://arrow.apache.org/datafusion-python/ | ||
[changelog]: https://github.com/apache/arrow-datafusion/blob/main/datafusion/CHANGELOG.md | ||
[released every 2 weeks to crates.io]: https://crates.io/crates/datafusion/versions | ||
|
||
## (Even) Better Performance | ||
|
||
[Various] benchmarks show DataFusion to be quite close to, or [even | ||
faster] than, the state of the art in analytic performance (at the moment | ||
this seems to be DuckDB). We continually work on improving performance | ||
(see [#5546] for a list) and would love additional help in this area. | ||
|
||
[various]: https://voltrondata.com/resources/speeds-and-feeds-hardware-and-software-matter | ||
[even faster]: https://github.com/tustvold/access-log-bench | ||
[#5546]: https://github.com/apache/arrow-datafusion/issues/5546 | ||
|
||
DataFusion now reads single large Parquet files significantly faster by | ||
[parallelizing across multiple cores]. Reading JSON and CSV files | ||
natively is also up to 2.5x faster thanks to upstream improvements | ||
in the arrow-rs [JSON reader] and [CSV reader]. | ||
|
||
[parallelizing across multiple cores]: https://github.com/apache/arrow-datafusion/pull/5057 | ||
[json reader]: https://github.com/apache/arrow-rs/pull/3479#issuecomment-1384353159 | ||
[csv reader]: https://github.com/apache/arrow-rs/pull/3365 | ||
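
One way to picture the single-file parallelism mentioned above is to split the file into contiguous byte ranges and hand each range to its own core. The sketch below is only an illustration under that assumption, using the Rust standard library: `split_byte_ranges` and `parallel_checksum` are hypothetical helpers, not DataFusion APIs, and a real Parquet reader has to align ranges with row groups rather than read arbitrary bytes.

```rust
use std::ops::Range;

/// Split `file_len` bytes into at most `partitions` contiguous,
/// non-overlapping ranges of near-equal size.
/// Hypothetical helper for illustration only (not a DataFusion API).
pub fn split_byte_ranges(file_len: u64, partitions: u64) -> Vec<Range<u64>> {
    if file_len == 0 || partitions == 0 {
        return Vec::new();
    }
    // Ceiling division so the ranges always cover the whole file.
    let chunk = (file_len + partitions - 1) / partitions;
    (0..partitions)
        .map(|i| i * chunk)
        .take_while(|start| *start < file_len)
        .map(|start| start..(start + chunk).min(file_len))
        .collect()
}

/// Process every range on its own scoped thread and combine the partial
/// results. Summing bytes stands in for "decode this part of the file".
pub fn parallel_checksum(data: &[u8], partitions: u64) -> u64 {
    let ranges = split_byte_ranges(data.len() as u64, partitions);
    std::thread::scope(|scope| {
        let handles: Vec<_> = ranges
            .into_iter()
            .map(|r| {
                let slice = &data[r.start as usize..r.end as usize];
                scope.spawn(move || slice.iter().map(|b| *b as u64).sum::<u64>())
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<u64>()
    })
}
```

With 10 bytes and 3 partitions, `split_byte_ranges` yields the ranges `0..4`, `4..8`, and `8..10`, so the last partition simply gets the remainder.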
|
||
Also, we have integrated the [arrow-rs Row Format] into DataFusion resulting in up to [2-3x faster sorting and merging]. | ||
|
||
[arrow-rs row format]: https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/ | ||
[2-3x faster sorting and merging]: https://github.com/apache/arrow-datafusion/pull/6163 | ||
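
The idea behind a row format can be sketched in a few lines: encode each multi-column key as bytes whose plain lexicographic order equals the desired sort order, so that sorting and merging only ever compare byte strings. The example below is a simplified illustration, not the arrow-rs implementation: `encode_row` and `sort_by_row_format` are hypothetical names, and only two fixed-width, non-null columns are handled.

```rust
/// Encode a two-column key `(a, b)` as 12 bytes whose lexicographic byte
/// order matches the tuple order of `(a, b)`.
/// Hypothetical helper for illustration (not the arrow-rs API).
pub fn encode_row(a: i64, b: u32) -> [u8; 12] {
    let mut row = [0u8; 12];
    // Flipping the sign bit maps signed order onto unsigned order, and
    // big-endian puts the most significant byte first, so comparing the
    // bytes left to right is the same as comparing the numbers.
    row[..8].copy_from_slice(&((a as u64) ^ (1 << 63)).to_be_bytes());
    row[8..].copy_from_slice(&b.to_be_bytes());
    row
}

/// Sort rows by comparing their encoded bytes only; the sort never looks
/// at the individual columns again.
pub fn sort_by_row_format(rows: &mut [(i64, u32)]) {
    rows.sort_by_cached_key(|&(a, b)| encode_row(a, b));
}
```

Because every comparison is a single byte-wise comparison regardless of how many columns make up the key, this layout avoids per-column dispatch in the inner loop of a sort or merge.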
|
||
## Improved Documentation and Website | ||
|
||
Part of growing the DataFusion community is ensuring that DataFusion's | ||
features are understood and that it is easy to contribute and | ||
participate. To that end the [website] has been cleaned up, [the | ||
architecture guide] expanded, the [roadmap] updated, and several | ||
overview talks created: | ||
|
||
- April 2023 _Query Engine_: [recording](https://youtu.be/NVKujPxwSBA) and [slides](https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p) | ||
- April 2023 _Logical Plan and Expressions_: [recording](https://youtu.be/EzZTLiSJnhY) and [slides](https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30) | ||
- April 2023 _Physical Plan and Execution_: [recording](https://youtu.be/2jkWU3_w6z0) and [slides](https://docs.google.com/presentation/d/1cA2WQJ2qg6tx6y4Wf8FH2WVSm9JQ5UgmBWATHdik0hg) | ||
|
||
[website]: https://arrow.apache.org/datafusion/ | ||
[the architecture guide]: https://docs.rs/datafusion/latest/datafusion/index.html#architecture | ||
[roadmap]: https://arrow.apache.org/datafusion/contributor-guide/roadmap.html | ||
|
||
## New Features | ||
|
||
### More Streaming, Less Memory | ||
|
||
We have made significant progress on the [streaming execution roadmap] | ||
such as [unbounded datasources], [streaming group by], sophisticated | ||
[sort] and [repartitioning] improvements in the optimizer, and support | ||
for [symmetric hash join] (read more about that in the great [Synnada | ||
Blog Post] on the topic). Together, these features both 1) make it | ||
easier to build streaming systems using DataFusion that can | ||
incrementally generate output before (or ever) seeing the end of the | ||
input and 2) allow general queries to use less memory and generate their | ||
results faster. | ||
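
To illustrate how a join can produce output before either input ends, here is a minimal sketch of the symmetric hash join idea: each arriving row is stored in a hash table for its own side and immediately probed against the other side's table. The types and names (`Side`, `SymmetricHashJoin`, `push`) are hypothetical and are not those of `SymmetricHashJoinExec`. A real operator must also bound how much state it keeps, which is omitted here.

```rust
use std::collections::HashMap;

/// Which input a row arrived on.
pub enum Side {
    Left,
    Right,
}

/// Toy streaming inner join on a `u64` key (illustrative only).
#[derive(Default)]
pub struct SymmetricHashJoin {
    left: HashMap<u64, Vec<String>>,
    right: HashMap<u64, Vec<String>>,
}

impl SymmetricHashJoin {
    /// Accept one row and return the `(left, right)` matches it completes.
    /// Output is produced incrementally, without waiting for end of input.
    pub fn push(&mut self, side: Side, key: u64, value: &str) -> Vec<(String, String)> {
        match side {
            Side::Left => {
                // Remember the row for future right-side arrivals ...
                self.left.entry(key).or_default().push(value.to_string());
                // ... and probe the rows already seen on the right.
                self.right
                    .get(&key)
                    .into_iter()
                    .flatten()
                    .map(|r| (value.to_string(), r.clone()))
                    .collect()
            }
            Side::Right => {
                self.right.entry(key).or_default().push(value.to_string());
                self.left
                    .get(&key)
                    .into_iter()
                    .flatten()
                    .map(|l| (l.clone(), value.to_string()))
                    .collect()
            }
        }
    }
}
```

Pushing a left row with key 1 and then a right row with key 1 returns the matching pair on the second call, even though neither input has finished.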
|
||
We have also improved the runtime [memory management] system so that | ||
DataFusion now stays within its declared memory budget, and will [generate | ||
runtime errors] rather than exceed it. | ||
|
||
[sort]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/global_sort_selection/index.html | ||
[repartitioning]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/repartition/index.html | ||
[streaming execution roadmap]: https://github.com/apache/arrow-datafusion/issues/4285 | ||
[unbounded datasources]: https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#method.unbounded_output | ||
[streaming group by]: https://docs.rs/datafusion/latest/datafusion/physical_plan/aggregates/enum.GroupByOrderMode.html | ||
[symmetric hash join]: https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.SymmetricHashJoinExec.html | ||
[synnada blog post]: https://github.com/apache/arrow-datafusion/issues/4285 | ||
[memory management]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/index.html | ||
[generate runtime errors]: https://github.com/apache/arrow-datafusion/issues/3941 | ||
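
The behavior of staying within a memory budget can be sketched as a pool that operators must reserve from before allocating, and that returns an error when a request would exceed the limit. This is a simplified illustration only: `BudgetPool` and `ResourcesExhausted` are hypothetical names and do not reproduce the interfaces in DataFusion's `memory_pool` module.

```rust
use std::fmt;

/// Error returned when a reservation would exceed the budget.
#[derive(Debug, PartialEq)]
pub struct ResourcesExhausted {
    pub requested: usize,
    pub available: usize,
}

impl fmt::Display for ResourcesExhausted {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "requested {} bytes but only {} bytes remain in the pool",
            self.requested, self.available
        )
    }
}

/// Toy memory pool with a fixed byte budget (illustrative only).
pub struct BudgetPool {
    limit: usize,
    used: usize,
}

impl BudgetPool {
    pub fn new(limit: usize) -> Self {
        Self { limit, used: 0 }
    }

    /// Reserve `bytes`, or fail without changing the pool if the request
    /// does not fit in the remaining budget.
    pub fn try_grow(&mut self, bytes: usize) -> Result<(), ResourcesExhausted> {
        let available = self.limit - self.used;
        if bytes > available {
            return Err(ResourcesExhausted { requested: bytes, available });
        }
        self.used += bytes;
        Ok(())
    }

    /// Return previously reserved bytes to the pool.
    pub fn shrink(&mut self, bytes: usize) {
        self.used = self.used.saturating_sub(bytes);
    }

    /// Bytes currently reserved.
    pub fn used(&self) -> usize {
        self.used
    }
}
```

An operator that receives the error can react to it, for example by spilling or by failing the query, instead of silently growing past the limit.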
|
||
### DML Support (`INSERT`, `DELETE`, `UPDATE`, etc.) | ||
|
||
Part of building high performance data systems includes writing data, | ||
and DataFusion supports several features for creating new files: | ||
|
||
- `INSERT INTO` and `SELECT ... INTO` support for memory-backed and CSV tables | ||
- New [API for writing data into TableProviders] | ||
|
||
We are working on an easier-to-use [COPY INTO] syntax, better support | ||
for writing Parquet, JSON, and Avro, and more. See our [tracking epic] | ||
for more details. | ||
|
||
[tracking epic]: https://github.com/apache/arrow-datafusion/issues/6569 | ||
[api for writing data into tableproviders]: https://docs.rs/datafusion/latest/datafusion/physical_plan/insert/trait.DataSink.html | ||
[copy into]: https://github.com/apache/arrow-datafusion/issues/5654 | ||
|
||
### Timestamp and Intervals | ||
|
||
One mark of the maturity of a SQL engine is how it handles the tricky | ||
world of timestamp, date, time, and interval arithmetic. DataFusion is | ||
feature complete in this area and behaves as you would expect, | ||
supporting queries such as: | ||
|
||
```sql | ||
SELECT now() + '1 month' FROM my_table; | ||
``` | ||
|
||
We still have a long tail of [date and time improvements], which we are working on as well. | ||
|
||
[date and time improvements]: https://github.com/apache/arrow-datafusion/issues/3148 | ||
|
||
### Querying Structured Types (`List` and `Struct`s) | ||
|
||
Arrow and Parquet [support nested data] well, and DataFusion lets you | ||
easily query such `Struct` and `List` columns. For example, you can use | ||
DataFusion to read and query the [JSON Datasets for Exploratory OLAP - | ||
Mendeley Data] like this: | ||
|
||
[support nested data]: https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/ | ||
[json datasets for exploratory olap - mendeley data]: https://data.mendeley.com/datasets/ct8f9skv97 | ||
|
||
```sql | ||
---------- | ||
-- Explore structured data using SQL | ||
---------- | ||
SELECT delete FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL LIMIT 10; | ||
+---------------------------------------------------------------------------------------------------------------------------+ | ||
| delete | | ||
+---------------------------------------------------------------------------------------------------------------------------+ | ||
| {status: {id: {$numberLong: 135037425050320896}, id_str: 135037425050320896, user_id: 334902461, user_id_str: 334902461}} | | ||
| {status: {id: {$numberLong: 134703982051463168}, id_str: 134703982051463168, user_id: 405383453, user_id_str: 405383453}} | | ||
| {status: {id: {$numberLong: 134773741740765184}, id_str: 134773741740765184, user_id: 64823441, user_id_str: 64823441}} | | ||
| {status: {id: {$numberLong: 132543659655704576}, id_str: 132543659655704576, user_id: 45917834, user_id_str: 45917834}} | | ||
| {status: {id: {$numberLong: 133786431926697984}, id_str: 133786431926697984, user_id: 67229952, user_id_str: 67229952}} | | ||
| {status: {id: {$numberLong: 134619093570560002}, id_str: 134619093570560002, user_id: 182430773, user_id_str: 182430773}} | | ||
| {status: {id: {$numberLong: 134019857527214080}, id_str: 134019857527214080, user_id: 257396311, user_id_str: 257396311}} | | ||
| {status: {id: {$numberLong: 133931546469076993}, id_str: 133931546469076993, user_id: 124539548, user_id_str: 124539548}} | | ||
| {status: {id: {$numberLong: 134397743350296576}, id_str: 134397743350296576, user_id: 139836391, user_id_str: 139836391}} | | ||
| {status: {id: {$numberLong: 127833661767823360}, id_str: 127833661767823360, user_id: 244442687, user_id_str: 244442687}} | | ||
+---------------------------------------------------------------------------------------------------------------------------+ | ||
|
||
---------- | ||
-- Select some deeply nested fields | ||
---------- | ||
SELECT | ||
delete['status']['id']['$numberLong'] as delete_id, | ||
delete['status']['user_id'] as delete_user_id | ||
FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL LIMIT 10; | ||
|
||
+--------------------+----------------+ | ||
| delete_id | delete_user_id | | ||
+--------------------+----------------+ | ||
| 135037425050320896 | 334902461 | | ||
| 134703982051463168 | 405383453 | | ||
| 134773741740765184 | 64823441 | | ||
| 132543659655704576 | 45917834 | | ||
| 133786431926697984 | 67229952 | | ||
| 134619093570560002 | 182430773 | | ||
| 134019857527214080 | 257396311 | | ||
| 133931546469076993 | 124539548 | | ||
| 134397743350296576 | 139836391 | | ||
| 127833661767823360 | 244442687 | | ||
+--------------------+----------------+ | ||
``` | ||
|
||
### Subqueries All the Way Down | ||
|
||
DataFusion can run many different subqueries by rewriting them to | ||
joins. It has been able to run the full suite of TPC-H queries for at | ||
least the last year, but recently we have implemented significant | ||
improvements to this logic, sufficient to run almost all queries in | ||
the TPC-DS benchmark as well. | ||
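
As a rough illustration of what rewriting a subquery to a join means, consider a predicate of the form `WHERE key IN (SELECT ...)`. Evaluated literally, the subquery result is rescanned for every outer row; rewritten as a semi join, it is hashed once and each outer row is probed against it. The sketch below shows that both strategies return the same rows. The function names and the tuple-based "tables" are assumptions made for the example and do not reflect DataFusion's planner or its data structures.

```rust
use std::collections::HashSet;

/// Literal evaluation of `WHERE key IN (subquery)`: rescan the subquery
/// result for every outer row (quadratic work).
pub fn in_subquery_naive(outer: &[(u32, &str)], inner: &[u32]) -> Vec<String> {
    outer
        .iter()
        .filter(|(key, _)| inner.iter().any(|candidate| candidate == key))
        .map(|(_, value)| value.to_string())
        .collect()
}

/// The same predicate evaluated as a hash semi join: build a set from the
/// subquery result once, then probe it once per outer row.
pub fn in_subquery_as_semi_join(outer: &[(u32, &str)], inner: &[u32]) -> Vec<String> {
    let build_side: HashSet<u32> = inner.iter().copied().collect();
    outer
        .iter()
        .filter(|(key, _)| build_side.contains(key))
        .map(|(_, value)| value.to_string())
        .collect()
}
```

Note that a semi join emits each outer row at most once even when the subquery result contains duplicates, which is exactly the semantics `IN` requires.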
|
||
## Community and Project Growth | ||
|
||
The six months since [our last update] saw significant growth in | ||
the DataFusion community. Between versions `17.0.0` and `26.0.0`, | ||
DataFusion merged 711 PRs from 107 distinct contributors, not | ||
including all the work that goes into our core dependencies such as | ||
[arrow](https://crates.io/crates/arrow), | ||
[parquet](https://crates.io/crates/parquet), and | ||
[object_store](https://crates.io/crates/object_store), that much of | ||
the same community helps support. | ||
|
||
In addition, we have added 7 new committers and 1 new PMC member to | ||
the Apache Arrow project, largely focused on DataFusion, and we | ||
learned about some of the cool [new systems] which are using | ||
DataFusion. Given the growth of the community and interest in the | ||
project, we also clarified the [mission statement] and are | ||
[discussing] graduating DataFusion to a new top-level | ||
Apache Software Foundation project. | ||
|
||
[our last update]: https://arrow.apache.org/blog/2023/01/19/datafusion-16.0.0 | ||
[new systems]: https://arrow.apache.org/datafusion/user-guide/introduction.html#known-users | ||
[mission statement]: https://github.com/apache/arrow-datafusion/discussions/6441 | ||
[discussing]: https://github.com/apache/arrow-datafusion/discussions/6475 | ||
|
||
<!-- | ||
$ git log --pretty=oneline 17.0.0..26.0.0 . | wc -l | ||
711 | ||
$ git shortlog -sn 17.0.0..26.0.0 . | wc -l | ||
107 | ||
--> | ||
|
||
## How to Get Involved | ||
|
||
Kudos to everyone in the community who has contributed ideas, | ||
discussions, bug reports, documentation and code. It is exciting to be | ||
innovating on the next generation of database architectures together! | ||
|
||
If you are interested in contributing to DataFusion, we would love to | ||
have you join us. You can try out DataFusion on some of your own | ||
data and projects and let us know how it goes, or contribute a PR with | ||
documentation, tests, or code. A list of open issues suitable for | ||
beginners is [here]. | ||
|
||
Check out our [Communication Doc] for more ways to engage with the | ||
community. | ||
|
||
[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22 | ||
[communication doc]: https://arrow.apache.org/datafusion/contributor-guide/communication.html |