From ba6729d43e37dca7fc5e965d13653bbf0d9d375b Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Thu, 1 Jun 2023 16:17:37 -0400 Subject: [PATCH 1/6] DataFusion 25.0.0 Highlights Blog Post --- _posts/2023-06-01-datafusion-25.0.0.md | 290 +++++++++++++++++++++++++ 1 file changed, 290 insertions(+) create mode 100644 _posts/2023-06-01-datafusion-25.0.0.md diff --git a/_posts/2023-06-01-datafusion-25.0.0.md b/_posts/2023-06-01-datafusion-25.0.0.md new file mode 100644 index 000000000000..e4e5d641a1ed --- /dev/null +++ b/_posts/2023-06-01-datafusion-25.0.0.md @@ -0,0 +1,290 @@ +--- +layout: post +title: "Apache Arrow DataFusion 25.0.0" +date: "2023-06-02 00:00:00" +author: pmc +categories: [release] +--- + + + +# Introduction + +The community is at it again at it again -- it has been a whirlwind 6 +months of DataFusion development since [our last update]. Our +community has grown, many features have been added, performance +improved and we are [discussing] branching out to our own top +level Apache Project. + +## Background + +[Apache Arrow DataFusion] is an extensible query engine and database +toolkit, written in [Rust], that uses [Apache Arrow] as its in-memory +format. + +[apache arrow datafusion]: https://arrow.apache.org/datafusion/ +[apache arrow]: https://arrow.apache.org +[rust]: https://www.rust-lang.org/ + +DataFusion, along with [Apache Calcite], Facebook's [Velox] and +similar technology enable the the next generation of database and +analytic system architecture, the [Deconstructed Database], described +by Khurana and Le Dem. In this new architecture, systems build off a +foundation of fast, modular components, rather than implementing an +entire, tightly integrated system such as has been necessary +historically with systems such as [Spark], [DuckDB] and [Pola.rs]. + +[apache calcite]: https://calcite.apache.org +[velox]: https://github.com/facebookincubator/velox +[deconstructed database]: https://www.usenix.org/publications/login/winter2018/khurana +[spark]: https://spark.apache.org/ +[duckdb]: https://duckdb.org +[pola.rs]: https://www.pola.rs/ + +Our community believes that anyone developing an analytic DBMSs or +other data heavy application, such as those common in machine learning +in the next 5 years, will **require** a vectorized, highly performant +query engine to remain relevant. The only practical way to make such +technology widely available without many millions of dollars of +investment for a new tightly integrated engine, is though open source +projects such as DataFusion and the underlying technologies of [Apache +Arrow] and [Rust]. + +DataFusion is targeted primarily at developers creating other data +intensive analytics, and offers: + +- A high performance, native, parallel streaming execution engine, +- Mature [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html) with subqueries, full supported Window functions +- Mature optimizer frameworks, with significant support for sort and distribution aware optimizations +- A native DataFrame API as well as [python bindings] +- Native support for Parquet, Avro, CSV, JSON and Arrow files +- [Well documented] source code and architecture +- Plan and expression serialization (both to protocol buffers, as well as [Substrait](https://substrait.io)) +- A welcoming, open community, governed by the highly regarded and well understood [Apache Software Foundation] +- High quality code [released every 2 weeks to crates.io] as well as the Apache distribution network. + +The rest of this post highlights some of the improvements we have made +to DataFusion over the last 6 months and some where we are +heading. You can see a list of all changes in the detailed +[CHANGELOG]. + +[apache software foundation]: https://www.apache.org/ +[well documented]: https://docs.rs/datafusion/latest/datafusion/index.html +[python bindings]: https://arrow.apache.org/datafusion-python/ +[changelog]: https://github.com/apache/arrow-datafusion/blob/main/datafusion/CHANGELOG.md +[released every 2 weeks to crates.io]: https://crates.io/crates/datafusion/versions + +## (Even) Better Performance + +[Various] [benchmarks] show DataFusion to be quite close to the state of the art DuckDB and sometimes even faster. We continually work on improving performance (see [#5546] for a list) and would love additional help in this area. + +[various]: https://voltrondata.com/resources/speeds-and-feeds-hardware-and-software-matter +[benchmarks]: https://github.com/tustvold/access-log-bench +[#5546]: https://github.com/apache/arrow-datafusion/issues/5546 + +DataFusion can now also read single large Parquet files significantly faster by [parallelizing across multiple cores]. Native speeds for reading JSON and CSV files are also up to 2.5x faster thanks to improvements in the [JSON reader] and [CSV reader] upstream in arrow-rs. + +[parallelizing across multiple cores]: https://github.com/apache/arrow-datafusion/pull/5057 +[json reader]: https://github.com/apache/arrow-rs/pull/3479#issuecomment-1384353159 +[csv reader]: https://github.com/apache/arrow-rs/pull/3365 + +Also, we have integrated the [arrow-rs Row Format] into DataFusion resulting in up to [2-3x faster sorting and merging]. + +[arrow-rs row format]: https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/ +[3x faster sorting and merging]: https://github.com/apache/arrow-datafusion/pull/6163 + +We continue to improving performance for high cardinality Grouping, +Joins, and parallelized JSON and CSV reading. + +## Improved Documentation and Website + +Part of growing the DataFusion community is ensuring that DataFusion's features are understood and that it is easy to contribute back. To that end the [website] has been cleaned up, [the architecture guide] significantly expanded, updated the [roadmap], and we gave several overview talks: + +- Apr 2023 _Query Engine_: [recording](https://youtu.be/NVKujPxwSBA) and [slides](https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p) +- April 2023 _Logical Plan and Expressions_: [recording](https://youtu.be/EzZTLiSJnhY) and [slides](https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30) +- April 2023 _Physical Plan and Execution_: [recording](https://youtu.be/2jkWU3_w6z0) and [slides](https://docs.google.com/presentation/d/1cA2WQJ2qg6tx6y4Wf8FH2WVSm9JQ5UgmBWATHdik0hg) + +[website]: https://arrow.apache.org/datafusion/ +[the architecture guide]: https://docs.rs/datafusion/latest/datafusion/index.html#architecture +[roadmap]: https://arrow.apache.org/datafusion/contributor-guide/roadmap.html + +## New Features + +### More Streaming, Less Memory + +We have made significany progress on the [streaming execution roadmap] +such as [unbounded datasources], [streaming group by] and +sophisticated [sort and repartitioning] improvements in the optimizer, +and support for [symmetric hash join] (read more about that in the +great [Synnada Blog Post] on the topic). Together, these features not +only support streaming usecases like incrementally generating output +before (or ever) seeing the end of the input, it also allow other +queries to use less memory and generate their results faster. + +[streaming execution roadmap]: https://github.com/apache/arrow-datafusion/issues/4285 +[unbounded datasources]: https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#method.unbounded_output +[streaming group by]: https://docs.rs/datafusion/latest/datafusion/physical_plan/aggregates/enum.GroupByOrderMode.html +[symmetric hash join]: https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.SymmetricHashJoinExec.html +[synnada blog post]: https://github.com/apache/arrow-datafusion/issues/4285 + +We have also improved the runtime [memory management] system and DataFusion now stays within its declared memory budget [or generates runtime errors]. + +[memory management]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/index.html +[or generates runtime errors]: https://github.com/apache/arrow-datafusion/issues/3941 + +### DML Support (`INSERT`, `DELETE`, `UPDATE`, etc) + +Part of building high performance data systems includes writing data, +and DataFusion now supports several features for creating new files: + +- `INSERT INTO` and `SELECT ... INTO ` support for memory backed and CSV tables +- New [API for writing data into TableProviders] + +We are working on easier to use [COPY INTO] syntax as well as support +for writing parquet, json, and avro soon -- see our [tracking epic] +for more details. + +[tracking epic]: https://github.com/apache/arrow-datafusion/issues/6569 +[api for writing data into tableproviders]: https://docs.rs/datafusion/latest/datafusion/physical_plan/insert/trait.DataSink.html +[tracking epic]: https://github.com/apache/arrow-datafusion/issues/6569 +[copy into]: https://github.com/apache/arrow-datafusion/issues/5654 + +### Timestamp and Intervals + +One mark of the maturity of a SQL engine is how it handles the tricky +world of timestamp, date, times and interval arithmetic. DataFusion is +feature complete in this area and behaves as you would expect, +supporting queries such as + +```sql +SELECT now() + '1 month' FROM my_table; +``` + +We still have a long tail of [date and time improvements], which we are working on as well. + +[date and time improvements]: https://github.com/apache/arrow-datafusion/issues/3148 + +### Querying Structured Types (`List` and `Struct`s) + +Arrow and Parquet support nested data (link to influxdata blog post about this) and DataFusion supports `Struct` and `List` as well. DataFusion can be used to qury such + +For example, you can easily read and query parquet files that contain structed types such as from the [JSON Datasets for Exploratory OLAP - Mendeley Data]. + +[json datasets for exploratory olap - mendeley data]: https://data.mendeley.com/datasets/ct8f9skv97 + +```sql +SELECT delete FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL limit 10; + ++---------------------------------------------------------------------------------------------------------------------------+ +| delete | ++---------------------------------------------------------------------------------------------------------------------------+ +| {status: {id: {$numberLong: 135037425050320896}, id_str: 135037425050320896, user_id: 334902461, user_id_str: 334902461}} | +| {status: {id: {$numberLong: 134703982051463168}, id_str: 134703982051463168, user_id: 405383453, user_id_str: 405383453}} | +| {status: {id: {$numberLong: 134773741740765184}, id_str: 134773741740765184, user_id: 64823441, user_id_str: 64823441}} | +| {status: {id: {$numberLong: 132543659655704576}, id_str: 132543659655704576, user_id: 45917834, user_id_str: 45917834}} | +| {status: {id: {$numberLong: 133786431926697984}, id_str: 133786431926697984, user_id: 67229952, user_id_str: 67229952}} | +| {status: {id: {$numberLong: 134619093570560002}, id_str: 134619093570560002, user_id: 182430773, user_id_str: 182430773}} | +| {status: {id: {$numberLong: 134019857527214080}, id_str: 134019857527214080, user_id: 257396311, user_id_str: 257396311}} | +| {status: {id: {$numberLong: 133931546469076993}, id_str: 133931546469076993, user_id: 124539548, user_id_str: 124539548}} | +| {status: {id: {$numberLong: 134397743350296576}, id_str: 134397743350296576, user_id: 139836391, user_id_str: 139836391}} | +| {status: {id: {$numberLong: 127833661767823360}, id_str: 127833661767823360, user_id: 244442687, user_id_str: 244442687}} | ++---------------------------------------------------------------------------------------------------------------------------+ + +-- Select some deeply nested fields +SELECT + delete['status']['id']['$numberLong'] as delete_id, + delete['status']['user_id'] as delete_user_id +FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL LIMIT 10; + ++--------------------+----------------+ +| delete_id | delete_user_id | ++--------------------+----------------+ +| 135037425050320896 | 334902461 | +| 134703982051463168 | 405383453 | +| 134773741740765184 | 64823441 | +| 132543659655704576 | 45917834 | +| 133786431926697984 | 67229952 | +| 134619093570560002 | 182430773 | +| 134019857527214080 | 257396311 | +| 133931546469076993 | 124539548 | +| 134397743350296576 | 139836391 | +| 127833661767823360 | 244442687 | ++--------------------+----------------+ +``` + +### Subqueries All the Way Down + +DataFusion can run many different subqueries by rewriting them to +joins. It has been able to run all TPCH queries for the last year, but +recently we have implemented more significant improvements to this +logic, sufficient to run almost all queries in the TPC-H and TPC-DS +benchmarks + +## Community and Project Growth + +The last six months since [our last update] saw significant growth in +the DataFusion community . Between versions `17.0.0` and `26.0.0`, +DataFusion merged 711 PRs from 107 distinct contributors, not +including all the work that goes into our core dependencies such as +[arrow](https://crates.io/crates/arrow), +[parquet](https://crates.io/crates/parquet), and +[object_store](https://crates.io/crates/object_store), that much of +the same community helps support. + +In addition, we have added 7 new committers and 1 new PMC member, +largely focused on Apache Arrow DataFusion, and we learned about some +of the cool [new systems] which are using DataFusion as a foundation +on which to build. + +Given the growth of the community and interest in the project, we have +clarified the [mission statement] and are [discussing] +"graduate"ing DataFusion to a new top level project Apache Software +Foundation project. + +[our last update]: https://arrow.apache.org/blog/2023/01/19/datafusion-16.0.0 +[new systems]: https://arrow.apache.org/datafusion/user-guide/introduction.html#known-users +[mission statement]: https://github.com/apache/arrow-datafusion/discussions/6441 +[discussing]: https://github.com/apache/arrow-datafusion/discussions/6475 + + + +# How to Get Involved + +Kudos to everyone in the community who has contributed ideas, +discussions, bug reports, documentation and code. It is exciting to be +innvating on the next generation of database architectures together! + +If you are interested in contributing to DataFusion, we would love to +have you join us. You can try out DataFusion on some of your own +data and projects and let us know how it goes or contribute a PR with +documentation, tests or code. A list of open issues suitable for +beginners is [here]. + +Check out our [Communication Doc] for more ways to engage with the +community. + +[here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22 +[communication doc]: https://arrow.apache.org/datafusion/community/communication.html From 472ac02cdff6b3327164990c2da4bcb225d5b8c7 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sun, 18 Jun 2023 06:20:11 -0400 Subject: [PATCH 2/6] Updates --- _posts/2023-06-01-datafusion-25.0.0.md | 159 +++++++++++++------------ 1 file changed, 86 insertions(+), 73 deletions(-) diff --git a/_posts/2023-06-01-datafusion-25.0.0.md b/_posts/2023-06-01-datafusion-25.0.0.md index e4e5d641a1ed..0c4a8b1c9dfe 100644 --- a/_posts/2023-06-01-datafusion-25.0.0.md +++ b/_posts/2023-06-01-datafusion-25.0.0.md @@ -27,11 +27,10 @@ limitations under the License. # Introduction -The community is at it again at it again -- it has been a whirlwind 6 -months of DataFusion development since [our last update]. Our -community has grown, many features have been added, performance -improved and we are [discussing] branching out to our own top -level Apache Project. +It has been a whirlwind 6 months of DataFusion development since [our +last update]: the community has grown, many features have been added, +performance improved and we are [discussing] branching out to our own +top level Apache Project. ## Background @@ -44,12 +43,10 @@ format. [rust]: https://www.rust-lang.org/ DataFusion, along with [Apache Calcite], Facebook's [Velox] and -similar technology enable the the next generation of database and -analytic system architecture, the [Deconstructed Database], described -by Khurana and Le Dem. In this new architecture, systems build off a -foundation of fast, modular components, rather than implementing an -entire, tightly integrated system such as has been necessary -historically with systems such as [Spark], [DuckDB] and [Pola.rs]. +similar technology are part of the next generation "[Deconstructed +Database]" architectures, where new systems are built on a foundation +of fast, modular components, rather as a single tightly integrated +system. [apache calcite]: https://calcite.apache.org [velox]: https://github.com/facebookincubator/velox @@ -58,33 +55,34 @@ historically with systems such as [Spark], [DuckDB] and [Pola.rs]. [duckdb]: https://duckdb.org [pola.rs]: https://www.pola.rs/ -Our community believes that anyone developing an analytic DBMSs or -other data heavy application, such as those common in machine learning -in the next 5 years, will **require** a vectorized, highly performant -query engine to remain relevant. The only practical way to make such -technology widely available without many millions of dollars of -investment for a new tightly integrated engine, is though open source -projects such as DataFusion and the underlying technologies of [Apache -Arrow] and [Rust]. + +While single tightly integrated systems such as [Spark], [DuckDB] and +[Pola.rs] are great pieces of technology, our community believes that +anyone developing new data heavy application, such as those common in +machine learning in the next 5 years, will **require** a high +performance, vectorized, query engine to remain relevant. The only +practical way to gain access to such technology without investing many +millions of dollars to build a new tightly integrated engine, is +though open source projects such as DataFusion and the underlying +technologies of [Apache Arrow] and [Rust]. DataFusion is targeted primarily at developers creating other data intensive analytics, and offers: -- A high performance, native, parallel streaming execution engine, -- Mature [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html) with subqueries, full supported Window functions -- Mature optimizer frameworks, with significant support for sort and distribution aware optimizations -- A native DataFrame API as well as [python bindings] -- Native support for Parquet, Avro, CSV, JSON and Arrow files -- [Well documented] source code and architecture -- Plan and expression serialization (both to protocol buffers, as well as [Substrait](https://substrait.io)) -- A welcoming, open community, governed by the highly regarded and well understood [Apache Software Foundation] -- High quality code [released every 2 weeks to crates.io] as well as the Apache distribution network. +- High performance, native, parallel streaming execution engine +- Mature [SQL support], featuring subqueries, window functions, grouping sets, and more +- Built in support for Parquet, Avro, CSV, JSON and Arrow formats and easy extension for others +- Native DataFrame API and [python bindings] +- [Well documented] source code and architecture, designed to be customized to suit downstream project needs +- High quality, easy to use code [released every 2 weeks to crates.io] +- Welcoming, open community, governed by the highly regarded and well understood [Apache Software Foundation] The rest of this post highlights some of the improvements we have made -to DataFusion over the last 6 months and some where we are +to DataFusion over the last 6 months and a preview of where we are heading. You can see a list of all changes in the detailed [CHANGELOG]. +[SQL support]: https://arrow.apache.org/datafusion/user-guide/sql/index.html [apache software foundation]: https://www.apache.org/ [well documented]: https://docs.rs/datafusion/latest/datafusion/index.html [python bindings]: https://arrow.apache.org/datafusion-python/ @@ -93,13 +91,19 @@ heading. You can see a list of all changes in the detailed ## (Even) Better Performance -[Various] [benchmarks] show DataFusion to be quite close to the state of the art DuckDB and sometimes even faster. We continually work on improving performance (see [#5546] for a list) and would love additional help in this area. +[Various] benchmarks show DataFusion to be quite close or [even +faster] to the state of the art in analytic performance (at the moment +this seems to be DuckDB). We continually work on improving performance +(see [#5546] for a list) and would love additional help in this area. [various]: https://voltrondata.com/resources/speeds-and-feeds-hardware-and-software-matter -[benchmarks]: https://github.com/tustvold/access-log-bench +[even faster]: https://github.com/tustvold/access-log-bench [#5546]: https://github.com/apache/arrow-datafusion/issues/5546 -DataFusion can now also read single large Parquet files significantly faster by [parallelizing across multiple cores]. Native speeds for reading JSON and CSV files are also up to 2.5x faster thanks to improvements in the [JSON reader] and [CSV reader] upstream in arrow-rs. +DataFusion now read single large Parquet files significantly faster by +[parallelizing across multiple cores]. Native speeds for reading JSON +and CSV files are also up to 2.5x faster thanks to improvements +upstream in arrow-rs [JSON reader] and [CSV reader]. [parallelizing across multiple cores]: https://github.com/apache/arrow-datafusion/pull/5057 [json reader]: https://github.com/apache/arrow-rs/pull/3479#issuecomment-1384353159 @@ -108,14 +112,15 @@ DataFusion can now also read single large Parquet files significantly faster by Also, we have integrated the [arrow-rs Row Format] into DataFusion resulting in up to [2-3x faster sorting and merging]. [arrow-rs row format]: https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/ -[3x faster sorting and merging]: https://github.com/apache/arrow-datafusion/pull/6163 - -We continue to improving performance for high cardinality Grouping, -Joins, and parallelized JSON and CSV reading. +[2-3x faster sorting and merging]: https://github.com/apache/arrow-datafusion/pull/6163 ## Improved Documentation and Website -Part of growing the DataFusion community is ensuring that DataFusion's features are understood and that it is easy to contribute back. To that end the [website] has been cleaned up, [the architecture guide] significantly expanded, updated the [roadmap], and we gave several overview talks: +Part of growing the DataFusion community is ensuring that DataFusion's +features are understood and that it is easy to contribute and +participate. To that end the [website] has been cleaned up, [the +architecture guide] expanded, the [roadmap] updated, and several +overview talks created: - Apr 2023 _Query Engine_: [recording](https://youtu.be/NVKujPxwSBA) and [slides](https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p) - April 2023 _Logical Plan and Expressions_: [recording](https://youtu.be/EzZTLiSJnhY) and [slides](https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30) @@ -130,35 +135,39 @@ Part of growing the DataFusion community is ensuring that DataFusion's features ### More Streaming, Less Memory We have made significany progress on the [streaming execution roadmap] -such as [unbounded datasources], [streaming group by] and -sophisticated [sort and repartitioning] improvements in the optimizer, -and support for [symmetric hash join] (read more about that in the -great [Synnada Blog Post] on the topic). Together, these features not -only support streaming usecases like incrementally generating output -before (or ever) seeing the end of the input, it also allow other -queries to use less memory and generate their results faster. - +such as [unbounded datasources], [streaming group by], sophisticated +[sort] and [repartitioning] improvements in the optimizer, and support +for [symmetric hash join] (read more about that in the great [Synnada +Blog Post] on the topic). Together, these features both 1) make it +easier to build streaming systems using DataFusion that can +incrementally generate output before (or ever) seeing the end of the +input and 2) allow general queries to use less memory and generate their +results faster. + +We have also improved the runtime [memory management] system so that +DataFusion now stays within its declared memory budget [generate +runtime errors]. + +[sort]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/global_sort_selection/index.html +[repartitioning]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/repartition/index.html [streaming execution roadmap]: https://github.com/apache/arrow-datafusion/issues/4285 [unbounded datasources]: https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#method.unbounded_output [streaming group by]: https://docs.rs/datafusion/latest/datafusion/physical_plan/aggregates/enum.GroupByOrderMode.html [symmetric hash join]: https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.SymmetricHashJoinExec.html [synnada blog post]: https://github.com/apache/arrow-datafusion/issues/4285 - -We have also improved the runtime [memory management] system and DataFusion now stays within its declared memory budget [or generates runtime errors]. - [memory management]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/index.html -[or generates runtime errors]: https://github.com/apache/arrow-datafusion/issues/3941 +[generate runtime errors]: https://github.com/apache/arrow-datafusion/issues/3941 ### DML Support (`INSERT`, `DELETE`, `UPDATE`, etc) Part of building high performance data systems includes writing data, -and DataFusion now supports several features for creating new files: +and DataFusion supports several features for creating new files: - `INSERT INTO` and `SELECT ... INTO ` support for memory backed and CSV tables - New [API for writing data into TableProviders] -We are working on easier to use [COPY INTO] syntax as well as support -for writing parquet, json, and avro soon -- see our [tracking epic] +We are working on easier to use [COPY INTO] syntax, better support +for writing parquet, JSON, and AVRO, and more -- see our [tracking epic] for more details. [tracking epic]: https://github.com/apache/arrow-datafusion/issues/6569 @@ -183,15 +192,19 @@ We still have a long tail of [date and time improvements], which we are working ### Querying Structured Types (`List` and `Struct`s) -Arrow and Parquet support nested data (link to influxdata blog post about this) and DataFusion supports `Struct` and `List` as well. DataFusion can be used to qury such - -For example, you can easily read and query parquet files that contain structed types such as from the [JSON Datasets for Exploratory OLAP - Mendeley Data]. +Arrow and Parquet [support nested data] well and DataFusion lets you +easily query such `Struct` and `List`. For example, you can use +DataFusion to read and query the [JSON Datasets for Exploratory OLAP - +Mendeley Data] like this: +[support nested data]: https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/ [json datasets for exploratory olap - mendeley data]: https://data.mendeley.com/datasets/ct8f9skv97 ```sql +---------- +-- Explore structured data using SQL +---------- SELECT delete FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL limit 10; - +---------------------------------------------------------------------------------------------------------------------------+ | delete | +---------------------------------------------------------------------------------------------------------------------------+ @@ -207,7 +220,9 @@ SELECT delete FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL | {status: {id: {$numberLong: 127833661767823360}, id_str: 127833661767823360, user_id: 244442687, user_id_str: 244442687}} | +---------------------------------------------------------------------------------------------------------------------------+ --- Select some deeply nested fields +---------- +-- Select some deeply nested fields +---------- SELECT delete['status']['id']['$numberLong'] as delete_id, delete['status']['user_id'] as delete_user_id @@ -232,15 +247,15 @@ FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL LIMIT 10; ### Subqueries All the Way Down DataFusion can run many different subqueries by rewriting them to -joins. It has been able to run all TPCH queries for the last year, but -recently we have implemented more significant improvements to this -logic, sufficient to run almost all queries in the TPC-H and TPC-DS -benchmarks +joins. It has been able to run the full suite of TPC-H queries for at +least the last year, but recently we have implemented significant +improvements to this logic, sufficient to run almost all queries in +the TPC-DS benchmark as well. ## Community and Project Growth -The last six months since [our last update] saw significant growth in -the DataFusion community . Between versions `17.0.0` and `26.0.0`, +The six months since [our last update] saw significant growth in +the DataFusion community .Between versions `17.0.0` and `26.0.0`, DataFusion merged 711 PRs from 107 distinct contributors, not including all the work that goes into our core dependencies such as [arrow](https://crates.io/crates/arrow), @@ -248,15 +263,13 @@ including all the work that goes into our core dependencies such as [object_store](https://crates.io/crates/object_store), that much of the same community helps support. -In addition, we have added 7 new committers and 1 new PMC member, -largely focused on Apache Arrow DataFusion, and we learned about some -of the cool [new systems] which are using DataFusion as a foundation -on which to build. - -Given the growth of the community and interest in the project, we have -clarified the [mission statement] and are [discussing] -"graduate"ing DataFusion to a new top level project Apache Software -Foundation project. +In addition, we have added 7 new committers and 1 new PMC member to +the Apache Arrow project, largely focused on DataFusion, and we +learned about some of the cool [new systems] which are using +DataFusion. Given the growth of the community and interest in the +project, we also clarified the [mission statement] and are +[discussing] "graduate"ing DataFusion to a new top level +Apache Software Foundation project. [our last update]: https://arrow.apache.org/blog/2023/01/19/datafusion-16.0.0 [new systems]: https://arrow.apache.org/datafusion/user-guide/introduction.html#known-users @@ -287,4 +300,4 @@ Check out our [Communication Doc] for more ways to engage with the community. [here]: https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22 -[communication doc]: https://arrow.apache.org/datafusion/community/communication.html +[communication doc]: https://arrow.apache.org/datafusion/contributor-guide/communication.html From df16c611706189aece684d79ed4b42a2b88c5ed2 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sun, 18 Jun 2023 10:19:28 -0400 Subject: [PATCH 3/6] Apply suggestions from code review Co-authored-by: Paddy Horan <5733408+paddyhoran@users.noreply.github.com> --- _posts/2023-06-01-datafusion-25.0.0.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/_posts/2023-06-01-datafusion-25.0.0.md b/_posts/2023-06-01-datafusion-25.0.0.md index 0c4a8b1c9dfe..659b13e4c4b4 100644 --- a/_posts/2023-06-01-datafusion-25.0.0.md +++ b/_posts/2023-06-01-datafusion-25.0.0.md @@ -100,7 +100,7 @@ this seems to be DuckDB). We continually work on improving performance [even faster]: https://github.com/tustvold/access-log-bench [#5546]: https://github.com/apache/arrow-datafusion/issues/5546 -DataFusion now read single large Parquet files significantly faster by +DataFusion now reads single large Parquet files significantly faster by [parallelizing across multiple cores]. Native speeds for reading JSON and CSV files are also up to 2.5x faster thanks to improvements upstream in arrow-rs [JSON reader] and [CSV reader]. @@ -255,7 +255,7 @@ the TPC-DS benchmark as well. ## Community and Project Growth The six months since [our last update] saw significant growth in -the DataFusion community .Between versions `17.0.0` and `26.0.0`, +the DataFusion community. Between versions `17.0.0` and `26.0.0`, DataFusion merged 711 PRs from 107 distinct contributors, not including all the work that goes into our core dependencies such as [arrow](https://crates.io/crates/arrow), @@ -288,7 +288,7 @@ $ git shortlog -sn 17.0.0..26.0.0 . | wc -l Kudos to everyone in the community who has contributed ideas, discussions, bug reports, documentation and code. It is exciting to be -innvating on the next generation of database architectures together! +innovating on the next generation of database architectures together! If you are interested in contributing to DataFusion, we would love to have you join us. You can try out DataFusion on some of your own From 18551e98dc83cc4b98143609848711fa954054d4 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Wed, 21 Jun 2023 13:10:40 -0400 Subject: [PATCH 4/6] Update _posts/2023-06-01-datafusion-25.0.0.md Co-authored-by: Folyd --- _posts/2023-06-01-datafusion-25.0.0.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_posts/2023-06-01-datafusion-25.0.0.md b/_posts/2023-06-01-datafusion-25.0.0.md index 659b13e4c4b4..1467ebe3b8c8 100644 --- a/_posts/2023-06-01-datafusion-25.0.0.md +++ b/_posts/2023-06-01-datafusion-25.0.0.md @@ -1,7 +1,7 @@ --- layout: post -title: "Apache Arrow DataFusion 25.0.0" -date: "2023-06-02 00:00:00" +title: "Apache Arrow DataFusion 26.0.0" +date: "2023-06-20 00:00:00" author: pmc categories: [release] --- From c3c96bbe6fd7638071b5689937d34e6933a2c0f0 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sat, 24 Jun 2023 07:03:21 -0400 Subject: [PATCH 5/6] Update date to 2023-06-24 --- ...-01-datafusion-25.0.0.md => 2023-06-24-datafusion-25.0.0.md} | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename _posts/{2023-06-01-datafusion-25.0.0.md => 2023-06-24-datafusion-25.0.0.md} (99%) diff --git a/_posts/2023-06-01-datafusion-25.0.0.md b/_posts/2023-06-24-datafusion-25.0.0.md similarity index 99% rename from _posts/2023-06-01-datafusion-25.0.0.md rename to _posts/2023-06-24-datafusion-25.0.0.md index 1467ebe3b8c8..2bdc44f4f53f 100644 --- a/_posts/2023-06-01-datafusion-25.0.0.md +++ b/_posts/2023-06-24-datafusion-25.0.0.md @@ -1,7 +1,7 @@ --- layout: post title: "Apache Arrow DataFusion 26.0.0" -date: "2023-06-20 00:00:00" +date: "2023-06-24 00:00:00" author: pmc categories: [release] --- From 0c9db4d6c28a6e8dc853319f7880d20926ae052a Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sat, 24 Jun 2023 07:09:14 -0400 Subject: [PATCH 6/6] final wordsmiting --- _posts/2023-06-24-datafusion-25.0.0.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/_posts/2023-06-24-datafusion-25.0.0.md b/_posts/2023-06-24-datafusion-25.0.0.md index 2bdc44f4f53f..a5b93522794e 100644 --- a/_posts/2023-06-24-datafusion-25.0.0.md +++ b/_posts/2023-06-24-datafusion-25.0.0.md @@ -25,8 +25,6 @@ limitations under the License. {% endcomment %} --> -# Introduction - It has been a whirlwind 6 months of DataFusion development since [our last update]: the community has grown, many features have been added, performance improved and we are [discussing] branching out to our own @@ -63,8 +61,8 @@ machine learning in the next 5 years, will **require** a high performance, vectorized, query engine to remain relevant. The only practical way to gain access to such technology without investing many millions of dollars to build a new tightly integrated engine, is -though open source projects such as DataFusion and the underlying -technologies of [Apache Arrow] and [Rust]. +though open source projects like DataFusion and similar enabling +technologies such as [Apache Arrow] and [Rust]. DataFusion is targeted primarily at developers creating other data intensive analytics, and offers: