7.0.0 (2022-02-14)
Breaking changes:
- Consolidate various configuration options, remove unrelated `batch_size` #1565
- Extract logical plans in LogicalPlan as independent struct #1228
- Update `ExecutionPlan` to know about sortedness and repartitioning optimizer pass respect the invariants #1776 (alamb)
- Update to `arrow 8.0.0` #1673 (alamb)
- Remove non idiomatic `DataFusionError::into_arrow_external_error` in favor of From conversion #1645 (alamb)
- Remove `Accumulator::update` and `Accumulator::merge` #1582 (Jimexist)
- implement `Hash` for various types and replace `PartialOrd` #1580 (Jimexist)
- Replace `DatafusionError` with `GenericError` in `ObjectStore` interface #1541 (matthewmturner)
- Make `FLOAT` SQL type map to `Float32` rather than `Float64` #1423 [sql] (liukun4515)
- Map `REAL` SQL type to `Float32` rather than `Float64` to be consistent with pg #1390 [sql] (hntd187)
Implemented enhancements:
- Create new `datafusion_expr` crate #1753
- Create new `datafusion_common` crate #1752
- API to get Expr's type and nullability without a `DFSchema` #1725
- Cleaner API to create `Expr::ScalarFunction` programmatically #1718
- Introduce a `Vec<u8>` based row-wise representation for DataFusion #1708
- Simplify creating new `ListingTable` #1705
- Implement TableProvider for DataFrameImpl to allow registration of logical plans #1698
- Public Expr simplification API #1694
- Query Optimizer: Add OUTER --> INNER join conversion #1670
- Support reading from CSV, Avro and Json files that have mergeable/compatible, but not identical schemas #1669
- Remove `DataFusionError::into_arrow_external_error` in favor of `From` conversion #1644
- Include join type in display implementation for logical plan #1620
- Switch datafusion to using `eq_dyn_scalar`, etc kernels #1610
- Proposal: Remove `Accumulator::update` and `Accumulator::merge` #1549
- Replace DataFusionError/Result with impl Error for ObjectStore and Reader #1540
- Add `approx_quantile` support #1538
- support sorting decimal data type #1522
- Keep all datafusion's packages up to date with Dependabot #1472
- ExecutionContext support init ExecutionContextState with `new(state: Arc<Mutex<ExecutionContextState>>)` method #1439
- support the decimal scalar value #1393
- Documentation for using scalar functions with the DataFrame API #1364
- Support `boolean == boolean` and `boolean != boolean` operators #1159
- Support DataType::Decimal(15, 2) in TPC-H benchmark #174
- Make `MemoryStream` public #150
- Add support for Parquet schema merging #132
- Add SQL support for IN expression #118
- Add logging to datafusion-cli #1789 (alamb)
- Add `approx_median()` aggregate function #1729 (realno)
- Add join type for logical plan display #1674 [sql] (xudong963)
- Fix null comparison for Parquet pruning predicate #1595 (viirya)
- Add `corr` aggregate function #1561 (realno)
- Add `covar`, `covar_pop` and `covar_samp` aggregate functions #1551 (realno)
- Add `approx_quantile()` aggregation function #1539 (domodwyer)
- Initial MemoryManager and DiskManager APIs for query execution + External Sort implementation #1526 (yjshen)
- Add `stddev` and `variance` #1525 (realno)
- Add `rem` operation for Expr #1467 (liukun4515)
- support decimal data type in create table #1431 [sql] (liukun4515)
- Ordering by index in select expression #1419 [sql] (hntd187)
- Add support for `ORDER BY` on unprojected columns #1415 (viirya)
- Support decimal for `min` and `max` aggregate #1407 (liukun4515)
- Consolidate `ConstantFolding` and `SimplifyExpression` #1375 (alamb)
- Datafusion cli quiet mode command to contain option bool #1345 (Jimexist)
- Implement `array_agg` aggregate function #1300 (viirya)
- Add a command to switch output format in cli #1284 (capkurmagati)
- Support `=`, `<`, `<=`, `>`, `>=`, `!=`, `is distinct from`, `is not distinct from` for `BooleanArray` #1163 (alamb)
Fixed bugs:
- Unsupported data type in hasher: Timestamp(Second, None) #1768
- SQL column identifiers should be converted to lowercase when unquoted #1746
- Data type Dictionary(Int32, Utf8) not supported for binary operation 'eq' on dyn arrays #1605
- datafusion doesn't process predicate pushdown correctly when there is outer join #1586
- casting `Int64` to `Float64` unsuccessfully caused tpch8 to fail #1576
- CTE/WITH .. UNION ALL confuses name resolution in WHERE #1509
- ORDER BY min(x) results in error `Plan("No field named 'foo.x'. Valid fields are 'MIN(foo.x)'.")` #1479
- Sort discards field metadata on the output schema #1476
- Datafusion should not strip out timezone information from existing types #1454
- Error on some queries: "column types must match schema types, expected XXX but found YYY" #1447
- Query failing to return any results when filter is an equality check on strings (bad statistics in parquet) #1433
- Field names containing period such as `f.c1` cannot be named in SQL query #1432
- `Select *` returns an unexpected result #1412
- Turn off unused default features of chrono and ahash #1398
- real data type is float32 in PG database, but in the datafusion it is as float64 #1380
- TPC-H q10 performance regression (expression for filter with added alias is not pushed down) #1367
- ProjectionExec Loses Field Metadata #1361
- Support Filter on unprojected columns #1351
- NULLS ORDER is inconsistent with postgres #1343
- Fix bug while merging `RecordBatch`, add `SortPreservingMerge` fuzz tester #1678 (alamb)
- fix a cte block with same name for many times #1639 [sql] (xudong963)
- fix: casting Int64 to Float64 unsuccessfully caused tpch8 to fail #1601 (xudong963)
- Fix single_distinct_to_groupby for arbitrary expressions #1519 (james727)
- Fix SortExec discards field metadata on the output schema #1477 (alamb)
- fix calculate in many_to_many_hash_partition test. #1463 (Ted-Jiang)
- Add Timezone to Scalar::Time* types, and better timezone awareness to Datafusion's time types #1455 (maxburke)
- Support identifiers with `.` in them #1449 [sql] (alamb)
- Fixes for working with functions in dataframes, additional documentation #1430 (tobyhede)
- [Minor] Fix `send_time` metric for hash-repartition #1421 (Dandandan)
- fix: Select * returns an unexpected result #1413 [sql] (xudong963)
- Make cli handle multiple whitespaces #1388 (capkurmagati)
- Metadata is kept in projections for non-derived columns #1378 (hntd187)
- Fix Predicate Pushdown: split_members should be able to split aliased predicate #1368 (viirya)
- Change the arg names and make parameters more meaningful #1357 (liukun4515)
- collect table stats by default for listing table #1347 (houqp)
- fix: make nulls-order consistent with postgres #1344 [sql] (xudong963)
- Avoid changing expression names during constant folding #1319 (viirya)
- improve error message for invalid create table statement #1294 [sql] (houqp)
- Forbid creating the table with the same name #1288 (liukun4515)
Documentation updates:
- Clarify docs about `Accumulator::update` and `Accumulator::update_batch` #1542 (alamb)
- Fix duplicated `cargo run --example parquet_sql` #1482 (sergey-melnychuk)
- add documentation to Datafusion cli's new commands #1348 (liukun4515)
- fix some clippy warnings from nightly channel #1277 [sql] (Jimexist)
Performance improvements:
- Parquet pruning predicate for `IS NULL` #1591
- Fix predicate pushdown for outer joins #1618 (james727)
- fix: sql planner creates cross join instead of inner join from select predicates #1566 [sql] (xudong963)
- Split fetch_metadata into fetch_statistics and fetch_schema #1365 (Dandandan)
- Optimize the performance queries with a single distinct aggregate #1315 (ic4y)
- Left join could use bitmap for left join instead of Vec<bool> #1291 (boazberman)
Closed issues:
- Add `release compile` to CI #1728
- DiskManager and TempFiles getting created several times per query #1690
- Add a test for the `pyarrow` feature in CI #1635
- SQL tests for when sorting exceeded available memory and had to spill to disk #1573
- Consolidate the N-way merging code and `SortPreservingMergeStream` (which has quite good tests of what is often quite tricky code, and it will be performance critical) #1572
- Consolidate the `SortExec` code (so there is only a single sort operator that does in memory sorting if it has enough memory budget but then spills to disk if needed). #1571
- Track memory usage in Non Limited Operators #1569
- [Question] Why does ballista store tables in the client instead of in the SchedulerServer #1473
- Consolidate Projection for Schema and RecordBatch #1425
- Support Sort on unprojected columns #1372
- Unused code in hash_aggregate #1362
- Why use the expr types before coercion to get the result type? #1358
- A problem about the projection_push_down optimizer gathers valid columns #1312
- apply constant folding to `LogicalPlan::Values` #1170
- reduce usage of `IntoIterator<Item = Expr>` in logical plan builder window fn #372
- Why does DataFusion throw a Tokio 0.2 runtime error? #176
- TPC-H Query 14 #165
- Length kernel returns bytes not character length #156
- Split the logical operators out into separate source files #115
Merged pull requests:
- Fixup some doc warnings #1811 (alamb)
- Ensure most of links in docs are correct #1808 [sql] (HaoYang670)
- Update CHANGELOG.md, update release scripts #1807 (alamb)
- Update versions for split crates #1803 (matthewmturner)
- Improve the error message and UX of tpch benchmark program #1800 (alamb)
- rename references of expr in logical plan module after datafusion-expr split #1797 (Jimexist)
- Update to sqlparser 0.14 #1796 [sql] (alamb)
- [split/13] move rest of expr to expr_fn in datafusion-expr module #1794 (Jimexist)
- Update datafusion versions #1793 (matthewmturner)
- Less verbose plans in debug logging #1787 (alamb)
- [split/11] split expr type and null info to be expr-schemable #1784 (Jimexist)
- Introduce `Row` format backed by raw bytes #1782 (yjshen)
- rewrite predicates before pushing to union inputs #1781 (korowa)
- Update datafusion to use arrow 9.0.0 #1775 (alamb)
- [split/10] split up expr for rewriting, visiting, and simplification traits #1774 [sql] (Jimexist)
- #1768 Support TimeUnit::Second in hasher #1769 (jychen7)
- TPC-H benchmark can optionally write JSON output file with benchmark summary #1766 (andygrove)
- [split/8] move `Accumulator` and `ColumnarValue` to datafusion-expr #1765 (Jimexist)
- [split/7] move built-in scalar function to datafusion-expr #1764 (Jimexist)
- [split/6] move signature, type signature, volatility to datafusion-expr #1763 (Jimexist)
- [split/9+12] move udf, udaf, `Expr` to datafusion-expr module #1762 [sql] (Jimexist)
- [split/5] move window frame and operator to datafusion-expr module #1761 (Jimexist)
- [split/4] move scalar value to datafusion-common #1760 (Jimexist)
- [split/3] split datafusion expr module and move aggregate and window function expr #1759 (Jimexist)
- [split/2] move column and dfschema to datafusion-common module #1758 (Jimexist)
- Use ordered-float 2.10 #1756 (andygrove)
- [split/1] split datafusion-common module #1751 (Jimexist)
- use clap 3 style args parsing for datafusion cli #1749 (Jimexist)
- fix: Case insensitive unquoted identifiers in SQL #1747 [sql] (mkmik)
- Move more tests out of context.rs #1743 (alamb)
- Move optimize test out of context.rs #1742 (alamb)
- Fix typos in crate documentation #1739 (r4ntix)
- add `cargo check --release` to ci #1737 (xudong963)
- Update parking_lot requirement from 0.11 to 0.12 #1735 (dependabot[bot])
- Create built-in scalar functions programmatically #1734 (HaoYang670)
- Prevent repartitioning of certain operator's direct children (#1731) #1732 (tustvold)
- API to get Expr's type and nullability without a `DFSchema` #1726 (alamb)
- minor: fix `cargo run --release` error #1723 (xudong963)
- substitute `parking_lot::Mutex` for `std::sync::Mutex` #1720 (xudong963)
- Convert boolean case expressions to boolean logic #1719 (tustvold)
- Add Expression Simplification API #1717 (alamb)
- Create ListingTableConfig which includes file format and schema inference #1715 (matthewmturner)
- make `select_to_plan` clearer #1714 [sql] (xudong963)
- Add upper bound for public function `signature` #1713 (HaoYang670)
- Add tests and CI for optional pyarrow module #1711 (wjones127)
- Create SchemaAdapter trait to map table schema to file schemas #1709 (thinkharderdev)
- refine test in repartition.rs & coalesce_batches.rs #1707 (xudong963)
- Fuzz test for spillable sort #1706 (yjshen)
- Support `create_physical_expr` and `ExecutionContextState` or `DefaultPhysicalPlanner` for faster speed #1700 (alamb)
- Implement TableProvider for DataFrameImpl #1699 (cpcloud)
- Move timestamp related tests out of context.rs and into sql integration test #1696 (alamb)
- Lazy TempDir creation in DiskManager #1695 (alamb)
- Add `MemTrackingMetrics` to ease memory tracking for non-limited memory consumers #1691 (yjshen)
- (minor) Reduce memory manager and disk manager logs from `info!` to `debug!` #1689 (alamb)
- Make `SortPreservingMergeStream` stable on input stream order #1687 (alamb)
- Incorporate dyn scalar kernels #1685 (matthewmturner)
- Move `information_schema` tests out of execution/context.rs to `sql_integration` tests #1684 (alamb)
- Add a new metric type: `Gauge` + `CurrentMemoryUsage` to metrics #1682 (yjshen)
- refactor array_agg to not to have `update` and `merge` #1681 (Jimexist)
- Use NamedTempFile rather than `String` in DiskManager #1680 (alamb)
- upgrade clap to version 3 #1672 (Jimexist)
- Improve configuration and resource use of `MemoryManager` and `DiskManager` #1668 (alamb)
- feat: Support quarter granularity in date_trunc function #1667 (ovr)
- Fix can not load parquet table from spark in datafusion-cli. #1665 (Ted-Jiang)
- Make `MemoryManager` and `MemoryStream` public #1664 (yjshen)
- [Cleanup] Move `AggregatedMetricsSet` to `metrics` for further reuse #1663 (yjshen)
- fix: substr - correct behaviour with negative start pos #1660 (ovr)
- support bitwise and as an example #1653 [sql] (liukun4515)
- refine match pattern related code #1650 (xudong963)
- update md-5, sha2, blake2 #1647 (xudong963)
- Add `DataFusionError` -> `ArrowError` conversion #1643 (alamb)
- Add `spill_count` and `spilled_bytes` to `BaselineMetrics`, test sort with spill #1641 (yjshen)
- support hash decimal array and group by #1640 (liukun4515)
- Consolidate Schema and RecordBatch projection #1638 (alamb)
- Update hashbrown requirement from 0.11 to 0.12 #1631 (dependabot[bot])
- Update pyo3 requirement from 0.14 to 0.15 #1627 (dependabot[bot])
- Optimize `SortPreservingMergeStream` to avoid `SortKeyCursor` sharing #1624 (yjshen)
- Handle merging of evolved schemas in ParquetExec #1622 (thinkharderdev)
- feat: Support Substring(str [from int] [for int]) #1621 [sql] (ovr)
- feat: Support complex interval via IntervalMonthDayNano #1615 [sql] (ovr)
- consolidate binary_expr coercion rule code into `binary_rule.rs` module #1607 (alamb)
- Fix comparison of dictionary arrays #1606 (alamb)
- add test for decimal to decimal #1603 (liukun4515)
- update nightly version #1597 (Jimexist)
- Consolidate sort and external_sort #1596 (yjshen)
- support from_slice for binary, string, and boolean array types #1589 (Jimexist)
- add from_slice trait to ease arrow2 migration #1588 (Jimexist)
- Implement ARRAY_AGG(DISTINCT ...) #1579 (james727)
- Rename sql integration tests from `mod` to `sql_integration` #1575 (alamb)
- minor: improve the benchmark readme #1567 (xudong963)
- Consolidate `batch_size` configuration in `ExecutionConfig`, `RuntimeConfig` and `PhysicalPlanConfig` #1562 (yjshen)
- Update to rust 1.58 #1557 (xudong963)
- support mathematics operation for decimal data type #1554 (liukun4515)
- Address clippy warnings #1553 (sergey-melnychuk)
- enhance arithmetic operation for array with scalar #1552 (liukun4515)
- Remove unused `update` and `merge` implementations from Aggregates and supporting `ScalarValue` arithmetic #1550 (alamb)
- Add batch operations to stddev #1547 (realno)
- Mark ARRAY_AGG(DISTINCT ...) not implemented #1534 (james727)
- Update to arrow-7.0.0 #1523 (alamb)
- Fix ORDER BY on aggregate #1506 (viirya)
- Add example on how to query multiple parquet files #1497 (nitisht)
- Refactor testing modules #1491 (hntd187)
- add rfcs for datafusion #1490 (xudong963)
- support comparison for decimal data type and refactor the binary coercion rule #1483 (liukun4515)
- Minor: Rename `predicate_builder` --> `pruning_predicate` for consistency #1481 (alamb)
- Tests for support try_cast/cast decimal to numeric #1465 (liukun4515)
- Avoid send empty batches for Hash partitioning. #1459 (Ted-Jiang)
- Planner code cleanup #1450 [sql] (alamb)
- Fix bug in projection: "column types must match schema types, expected XXX but found YYY" #1448 (alamb)
- Update arrow-rs to 6.4.0 and replace boolean comparison in datafusion with arrow compute kernel #1446 (xudong963)
- support cast/try_cast for decimal: signed numeric to decimal #1442 (liukun4515)
- Consolidate decimal error checking and improve error messages #1438 [sql] (alamb)
- use 0.13 sql parser #1435 (Jimexist)
- Minor Code cleanups #1428 (alamb)
- Clarify communication on bi-weekly sync #1427 (alamb)
- support sum/avg agg for decimal, change sum(float32) --> float64 #1408 [sql] (liukun4515)
- Fix bugs with nullability during rewrites: Combine `simplify` and `Simplifier` #1401 (alamb)
- Minimize features #1399 (carols10cents)
- Update rust version to 1.57 #1395 [sql] (xudong963)
- support decimal scalar value #1394 (liukun4515)
- Add coercion rules for AggregateFunctions #1387 (liukun4515)
- upgrade the arrow-rs version #1385 (liukun4515)
- add array agg name #1382 (liukun4515)
- Make tests for `simplify` and `Simplifier` consistent #1376 (alamb)
- Refactor: Consolidate expression simplification code in `simplify_expression.rs` #1374 (alamb)
- remove unused code in hash_aggregate #1370 (ic4y)
- Use `BufReader` for LocalFileReader to revert performance regression in parquet reading #1366 (Dandandan)
- Add unit test for constant folding on values #1355 (viirya)
- Extract logical plan: rename the plan name (follow up) #1354 [sql] (liukun4515)
- Moved aggr_test_schema to test_utils #1338 (rdettai)
- upgrade arrow-rs to 6.2.0 #1334 (liukun4515)
- Update release instructions #1331 (alamb)
- #1268: allow datafusion-cli to toggle quiet flag within CLI #1330 (jgoday)
- Extract Aggregate, Sort, and Join to struct from AggregatePlan #1326 (matthewmturner)
- Extract `EmptyRelation`, `Limit`, `Values` from `LogicalPlan` #1325 (liukun4515)
- Extract CrossJoin, Repartition, Union in LogicalPlan #1322 (liukun4515)
- Fifth batch of updating sql tests to use assert_batches_eq #1318 (matthewmturner)
- Extract Explain, Analyze, Extension in LogicalPlan as independent struct #1317 [sql] (xudong963)
- Extract CreateMemoryTable, DropTable, CreateExternalTable in LogicalPlan as independent struct #1311 [sql] (liukun4515)
- Extract Projection, Filter, Window in LogicalPlan as independent struct #1309 (ic4y)
- Add PSQL comparison tests for except, intersect #1292 (mrob95)
- Extract logical plans in LogicalPlan as independent struct: TableScan #1290 (xudong963)
- Add statement helper command to cli #1285 (matthewmturner)
- Python bindings for window functions #819 [sql] (jgoday)