[WEBSITE] DataFusion 26.0.0 Blog Post (#370)

Closes apache/datafusion#5812 This PR tries to summarize the last 6 months of DataFusion and tell people what has been going on in the community. --------- Co-authored-by: Paddy Horan <[email protected]> Co-authored-by: Folyd <[email protected]>
apache · Jun 24, 2023 · 4ad44e6 · 4ad44e6
1 parent 855f956
commit 4ad44e6
Showing 1 changed file with 301 additions and 0 deletions.
diff --git a/_posts/2023-06-24-datafusion-25.0.0.md b/_posts/2023-06-24-datafusion-25.0.0.md
@@ -0,0 +1,301 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 26.0.0"
+date: "2023-06-24 00:00:00"
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+It has been a whirlwind 6 months of DataFusion development since [our
+last update]: the community has grown, many features have been added,
+performance has improved, and we are [discussing] branching out to our own
+top level Apache Project.
+
+## Background
+
+[Apache Arrow DataFusion] is an extensible query engine and database
+toolkit, written in [Rust], that uses [Apache Arrow] as its in-memory
+format.
+
+[apache arrow datafusion]: https://arrow.apache.org/datafusion/
+[apache arrow]: https://arrow.apache.org
+[rust]: https://www.rust-lang.org/
+
+DataFusion, along with [Apache Calcite], Facebook's [Velox] and
+similar technology, is part of the next generation "[Deconstructed
+Database]" architectures, where new systems are built on a foundation
+of fast, modular components, rather than as a single tightly integrated
+system.
+
+[apache calcite]: https://calcite.apache.org
+[velox]: https://github.com/facebookincubator/velox
+[deconstructed database]: https://www.usenix.org/publications/login/winter2018/khurana
+[spark]: https://spark.apache.org/
+[duckdb]: https://duckdb.org
+[pola.rs]: https://www.pola.rs/
+
+
+While single tightly integrated systems such as [Spark], [DuckDB] and
+[Pola.rs] are great pieces of technology, our community believes that
+anyone developing new data heavy applications, such as those common in
+machine learning, in the next 5 years will **require** a high
+performance, vectorized query engine to remain relevant. The only
+practical way to gain access to such technology, without investing many
+millions of dollars to build a new tightly integrated engine, is
+through open source projects like DataFusion and similar enabling
+technologies such as [Apache Arrow] and [Rust].
+
+DataFusion is targeted primarily at developers creating other data
+intensive analytic systems, and offers:
+
+- High performance, native, parallel streaming execution engine
+- Mature [SQL support], featuring subqueries, window functions, grouping sets, and more
+- Built in support for Parquet, Avro, CSV, JSON and Arrow formats and easy extension for others
+- Native DataFrame API and [python bindings]
+- [Well documented] source code and architecture, designed to be customized to suit downstream project needs
+- High quality, easy to use code [released every 2 weeks to crates.io]
+- Welcoming, open community, governed by the highly regarded and well understood [Apache Software Foundation]
+
+The rest of this post highlights some of the improvements we have made
+to DataFusion over the last 6 months and a preview of where we are
+heading. You can see a list of all changes in the detailed
+[CHANGELOG].
+
+[SQL support]: https://arrow.apache.org/datafusion/user-guide/sql/index.html
+[apache software foundation]: https://www.apache.org/
+[well documented]: https://docs.rs/datafusion/latest/datafusion/index.html
+[python bindings]: https://arrow.apache.org/datafusion-python/
+[changelog]: https://github.com/apache/arrow-datafusion/blob/main/datafusion/CHANGELOG.md
+[released every 2 weeks to crates.io]: https://crates.io/crates/datafusion/versions
+
+## (Even) Better Performance
+
+[Various] benchmarks show DataFusion to be quite close to, or [even
+faster] than, the state of the art in analytic performance (at the moment
+this seems to be DuckDB). We continually work on improving performance
+(see [#5546] for a list) and would love additional help in this area.
+
+[various]: https://voltrondata.com/resources/speeds-and-feeds-hardware-and-software-matter
+[even faster]: https://github.com/tustvold/access-log-bench
+[#5546]: https://github.com/apache/arrow-datafusion/issues/5546
+
+DataFusion now reads single large Parquet files significantly faster by
+[parallelizing across multiple cores]. Native speeds for reading JSON
+and CSV files are also up to 2.5x faster thanks to improvements
+upstream in arrow-rs [JSON reader] and [CSV reader].
+
+[parallelizing across multiple cores]: https://github.com/apache/arrow-datafusion/pull/5057
+[json reader]: https://github.com/apache/arrow-rs/pull/3479#issuecomment-1384353159
+[csv reader]: https://github.com/apache/arrow-rs/pull/3365
+
+We have also integrated the [arrow-rs Row Format] into DataFusion, resulting in up to [2-3x faster sorting and merging].
+
+[arrow-rs row format]: https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/
+[2-3x faster sorting and merging]: https://github.com/apache/arrow-datafusion/pull/6163
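
To illustrate why a row format can speed up multi-column sorts, here is a minimal Python sketch. It is not the arrow-rs encoding itself: it assumes a hypothetical two-column row (a string and a signed 64-bit integer) and strings that contain no NUL characters. Each row is encoded into one byte string whose plain bytewise order matches the column-by-column order, so a sort only has to compare a single byte key per row.

```python
def encode_row(city: str, temp: int) -> bytes:
    """Encode a (str, int64) row as a byte-comparable key (simplified sketch)."""
    # UTF-8 bytes sort in code point order; the 0x00 terminator makes a
    # shorter string sort before any longer string it is a prefix of.
    text_key = city.encode("utf-8") + b"\x00"
    # Shift the signed range to unsigned and store big-endian, so negative
    # numbers sort before positive ones under bytewise comparison.
    int_key = (temp + 2**63).to_bytes(8, "big")
    return text_key + int_key


# Hypothetical sample rows: (city, temperature)
rows = [("Oslo", -3), ("Lima", 19), ("Oslo", -12), ("Li", 40), ("Lima", -1)]

# Sorting on the encoded bytes gives the same order as comparing column by column.
sorted_by_bytes = sorted(rows, key=lambda row: encode_row(*row))
sorted_by_columns = sorted(rows)
```

The real row format also has to handle nulls, descending order and many more types, which this sketch leaves out.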
+
+## Improved Documentation and Website
+
+Part of growing the DataFusion community is ensuring that DataFusion's
+features are understood and that it is easy to contribute and
+participate. To that end the [website] has been cleaned up, [the
+architecture guide] expanded, the [roadmap] updated, and several
+overview talks created:
+
+- April 2023 _Query Engine_: [recording](https://youtu.be/NVKujPxwSBA) and [slides](https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p)
+- April 2023 _Logical Plan and Expressions_: [recording](https://youtu.be/EzZTLiSJnhY) and [slides](https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30)
+- April 2023 _Physical Plan and Execution_: [recording](https://youtu.be/2jkWU3_w6z0) and [slides](https://docs.google.com/presentation/d/1cA2WQJ2qg6tx6y4Wf8FH2WVSm9JQ5UgmBWATHdik0hg)
+
+[website]: https://arrow.apache.org/datafusion/
+[the architecture guide]: https://docs.rs/datafusion/latest/datafusion/index.html#architecture
+[roadmap]: https://arrow.apache.org/datafusion/contributor-guide/roadmap.html
+
+## New Features
+
+### More Streaming, Less Memory
+
+We have made significant progress on the [streaming execution roadmap]
+such as [unbounded datasources], [streaming group by], sophisticated
+[sort] and [repartitioning] improvements in the optimizer, and support
+for [symmetric hash join] (read more about that in the great [Synnada
+Blog Post] on the topic). Together, these features both 1) make it
+easier to build streaming systems using DataFusion that can
+incrementally generate output before (or without ever) seeing the end of the
+input and 2) allow general queries to use less memory and generate their
+results faster.
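
As a rough illustration of the streaming group by idea, the Python sketch below aggregates an input that is already sorted on the group key. It is a conceptual sketch, not DataFusion code: the source `unbounded_readings` and the function `streaming_sum` are hypothetical names. Because the input is ordered, each group's total can be emitted as soon as the key changes, so results appear without ever reaching the end of the input and only one group is held in memory at a time.

```python
from itertools import count, islice


def unbounded_readings():
    """Hypothetical unbounded source of (hour, value) rows, sorted by hour."""
    for i in count():
        yield (i // 3, i)  # three readings per hour, never ends


def streaming_sum(sorted_rows):
    """Yield (key, total) as soon as each group is complete."""
    current_key = None
    total = 0
    for key, value in sorted_rows:
        if current_key is not None and key != current_key:
            # The key changed, so the previous group can be emitted now.
            yield (current_key, total)
            total = 0
        current_key = key
        total += value
    if current_key is not None:
        # Only reached for bounded inputs: flush the final group.
        yield (current_key, total)


# Take the first three groups from an input that never ends.
first_groups = list(islice(streaming_sum(unbounded_readings()), 3))
```

A hash based group by would instead have to consume the whole input before emitting anything, which is impossible for an unbounded source.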
+
+We have also improved the runtime [memory management] system so that
+DataFusion now stays within its declared memory budget and will
+[generate runtime errors] rather than exceed it.
+
+[sort]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/global_sort_selection/index.html
+[repartitioning]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/repartition/index.html
+[streaming execution roadmap]: https://github.com/apache/arrow-datafusion/issues/4285
+[unbounded datasources]: https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#method.unbounded_output
+[streaming group by]: https://docs.rs/datafusion/latest/datafusion/physical_plan/aggregates/enum.GroupByOrderMode.html
+[symmetric hash join]: https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.SymmetricHashJoinExec.html
+[synnada blog post]: https://github.com/apache/arrow-datafusion/issues/4285
+[memory management]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/index.html
+[generate runtime errors]: https://github.com/apache/arrow-datafusion/issues/3941
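
The memory budget behaviour described above can be pictured with a small Python sketch. This is an assumption-laden illustration rather than DataFusion's implementation: the class `GreedyBudget`, its methods and the `ResourcesExhausted` error are hypothetical names chosen for the example. Operators ask the pool before allocating, and a request that would exceed the limit fails with an error instead of silently growing memory use.

```python
class ResourcesExhausted(Exception):
    """Raised when a reservation would exceed the memory budget (hypothetical)."""


class GreedyBudget:
    """First come, first served memory budget shared by all operators (sketch)."""

    def __init__(self, limit_bytes: int):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def try_grow(self, nbytes: int) -> None:
        # Refuse the request rather than go over the declared limit.
        if self.used_bytes + nbytes > self.limit_bytes:
            raise ResourcesExhausted(
                f"cannot reserve {nbytes} bytes: "
                f"{self.used_bytes} of {self.limit_bytes} already used"
            )
        self.used_bytes += nbytes

    def shrink(self, nbytes: int) -> None:
        # Return memory to the pool, for example after spilling to disk.
        self.used_bytes = max(0, self.used_bytes - nbytes)


budget = GreedyBudget(limit_bytes=1024)
budget.try_grow(600)          # first operator reserves 600 bytes
try:
    budget.try_grow(600)      # second request would exceed the 1024 byte limit
    outcome = "granted"
except ResourcesExhausted:
    outcome = "rejected"
budget.shrink(600)            # first operator releases its memory
budget.try_grow(600)          # the same request now succeeds
```

An operator that receives the error can react by spilling or by failing the query, which keeps the process as a whole within its budget.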
+
+### DML Support (`INSERT`, `DELETE`, `UPDATE`, etc)
+
+Part of building high performance data systems includes writing data,
+and DataFusion supports several features for creating new files:
+
+- `INSERT INTO` and `SELECT ... INTO` support for memory backed and CSV tables
+- New [API for writing data into TableProviders]
+
+We are working on easier to use [COPY INTO] syntax, better support
+for writing Parquet, JSON, and Avro, and more -- see our [tracking epic]
+for more details.
+
+[tracking epic]: https://github.com/apache/arrow-datafusion/issues/6569
+[api for writing data into tableproviders]: https://docs.rs/datafusion/latest/datafusion/physical_plan/insert/trait.DataSink.html
+[copy into]: https://github.com/apache/arrow-datafusion/issues/5654
+
+### Timestamp and Intervals
+
+One mark of the maturity of a SQL engine is how it handles the tricky
+world of timestamp, date, time and interval arithmetic. DataFusion is
+feature complete in this area and behaves as you would expect,
+supporting queries such as:
+
+```sql
+SELECT now() + '1 month' FROM my_table;
+```
+
+We still have a long tail of [date and time improvements], which we are working on as well.
+
+[date and time improvements]: https://github.com/apache/arrow-datafusion/issues/3148
+
+### Querying Structured Types (`List` and `Struct`s)
+
+Arrow and Parquet [support nested data] well and DataFusion lets you
+easily query such `Struct` and `List` columns. For example, you can use
+DataFusion to read and query the [JSON Datasets for Exploratory OLAP -
+Mendeley Data] like this:
+
+[support nested data]: https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/
+[json datasets for exploratory olap - mendeley data]: https://data.mendeley.com/datasets/ct8f9skv97
+
+```sql
+----------
+-- Explore structured data using SQL
+----------
+SELECT delete FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL LIMIT 10;
++---------------------------------------------------------------------------------------------------------------------------+
+| delete                                                                                                                    |
++---------------------------------------------------------------------------------------------------------------------------+
+| {status: {id: {$numberLong: 135037425050320896}, id_str: 135037425050320896, user_id: 334902461, user_id_str: 334902461}} |
+| {status: {id: {$numberLong: 134703982051463168}, id_str: 134703982051463168, user_id: 405383453, user_id_str: 405383453}} |
+| {status: {id: {$numberLong: 134773741740765184}, id_str: 134773741740765184, user_id: 64823441, user_id_str: 64823441}}   |
+| {status: {id: {$numberLong: 132543659655704576}, id_str: 132543659655704576, user_id: 45917834, user_id_str: 45917834}}   |
+| {status: {id: {$numberLong: 133786431926697984}, id_str: 133786431926697984, user_id: 67229952, user_id_str: 67229952}}   |
+| {status: {id: {$numberLong: 134619093570560002}, id_str: 134619093570560002, user_id: 182430773, user_id_str: 182430773}} |
+| {status: {id: {$numberLong: 134019857527214080}, id_str: 134019857527214080, user_id: 257396311, user_id_str: 257396311}} |
+| {status: {id: {$numberLong: 133931546469076993}, id_str: 133931546469076993, user_id: 124539548, user_id_str: 124539548}} |
+| {status: {id: {$numberLong: 134397743350296576}, id_str: 134397743350296576, user_id: 139836391, user_id_str: 139836391}} |
+| {status: {id: {$numberLong: 127833661767823360}, id_str: 127833661767823360, user_id: 244442687, user_id_str: 244442687}} |
++---------------------------------------------------------------------------------------------------------------------------+
+
+----------
+-- Select some deeply nested fields
+----------
+SELECT
+  delete['status']['id']['$numberLong'] as delete_id,
+  delete['status']['user_id'] as delete_user_id
+FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL LIMIT 10;
+
++--------------------+----------------+
+| delete_id          | delete_user_id |
++--------------------+----------------+
+| 135037425050320896 | 334902461      |
+| 134703982051463168 | 405383453      |
+| 134773741740765184 | 64823441       |
+| 132543659655704576 | 45917834       |
+| 133786431926697984 | 67229952       |
+| 134619093570560002 | 182430773      |
+| 134019857527214080 | 257396311      |
+| 133931546469076993 | 124539548      |
+| 134397743350296576 | 139836391      |
+| 127833661767823360 | 244442687      |
++--------------------+----------------+
+```
+
+### Subqueries All the Way Down
+
+DataFusion can run many different subqueries by rewriting them to
+joins. It has been able to run the full suite of TPC-H queries for at
+least the last year, but recently we have implemented significant
+improvements to this logic, sufficient to run almost all queries in
+the TPC-DS benchmark as well.
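
To show what rewriting a subquery to a join means, here is a small Python sketch over two hypothetical in-memory tables, `customers` and `orders`. It is a conceptual illustration, not DataFusion's optimizer: the first function evaluates a correlated scalar subquery the naive way, rescanning `orders` for every customer, while the second computes the same answer as an aggregate followed by a left join.

```python
# Hypothetical tables: customers(id, name) and orders(customer_id, amount).
customers = [(1, "ada"), (2, "bo"), (3, "cy")]
orders = [(1, 10), (3, 5), (1, 7), (3, 1)]


def correlated_subquery(customers, orders):
    # SELECT name,
    #        (SELECT sum(amount) FROM orders o WHERE o.customer_id = c.id)
    # FROM customers c
    result = []
    for cust_id, name in customers:
        # The inner query is re-run for each outer row.
        amounts = [amount for order_cust, amount in orders if order_cust == cust_id]
        result.append((name, sum(amounts) if amounts else None))
    return result


def join_rewrite(customers, orders):
    # SELECT name, t.total
    # FROM customers c
    # LEFT JOIN (SELECT customer_id, sum(amount) AS total
    #            FROM orders GROUP BY customer_id) t
    #   ON t.customer_id = c.id
    totals = {}
    for cust_id, amount in orders:
        # Aggregate orders once, keyed by customer.
        totals[cust_id] = totals.get(cust_id, 0) + amount
    # Left join: customers without orders get None, like SQL NULL.
    return [(name, totals.get(cust_id)) for cust_id, name in customers]


naive = correlated_subquery(customers, orders)
rewritten = join_rewrite(customers, orders)
```

Both forms return the same rows, but the join form scans `orders` once instead of once per customer, which is why this kind of rewrite matters for benchmarks such as TPC-H and TPC-DS.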
+
+## Community and Project Growth
+
+The six months since [our last update] saw significant growth in
+the DataFusion community. Between versions `17.0.0` and `26.0.0`,
+DataFusion merged 711 PRs from 107 distinct contributors, not
+including all the work that goes into our core dependencies such as
+[arrow](https://crates.io/crates/arrow),
+[parquet](https://crates.io/crates/parquet), and
+[object_store](https://crates.io/crates/object_store), that much of
+the same community helps support.
+
+In addition, we have added 7 new committers and 1 new PMC member to
+the Apache Arrow project, largely focused on DataFusion, and we
+learned about some of the cool [new systems] which are using
+DataFusion. Given the growth of the community and interest in the
+project, we also clarified the [mission statement] and are
+[discussing] "graduating" DataFusion to a new top level
+Apache Software Foundation project.
+
+[our last update]: https://arrow.apache.org/blog/2023/01/19/datafusion-16.0.0
+[new systems]: https://arrow.apache.org/datafusion/user-guide/introduction.html#known-users
+[mission statement]: https://github.com/apache/arrow-datafusion/discussions/6441
+[discussing]: https://github.com/apache/arrow-datafusion/discussions/6475
+
+<!--
+$ git log --pretty=oneline 17.0.0..26.0.0 . | wc -l
+     711
+
+$ git shortlog -sn 17.0.0..26.0.0 . | wc -l
+      107
+-->
+
+## How to Get Involved
+
+Kudos to everyone in the community who has contributed ideas,
+discussions, bug reports, documentation and code. It is exciting to be
+innovating on the next generation of database architectures together!
+
+If you are interested in contributing to DataFusion, we would love to
+have you join us. You can try out DataFusion on some of your own
+data and projects and let us know how it goes or contribute a PR with
+documentation, tests or code. A list of open issues suitable for
+beginners is [here].
+
+Check out our [Communication Doc] for more ways to engage with the
+community.
+
+[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
+[communication doc]: https://arrow.apache.org/datafusion/contributor-guide/communication.html