Skip to content

Latest commit

 

History

History
310 lines (292 loc) · 40.7 KB

7.0.0.md

File metadata and controls

310 lines (292 loc) · 40.7 KB

7.0.0 (2022-02-14)

Full Changelog

Breaking changes:

  • Consolidate various configurations options, remove unrelated batch_size #1565
  • Extract logical plans in LogicalPlan as independent struct #1228
  • Update ExecutionPlan to know about sortedness and repartitioning optimizer pass respect the invariants #1776 (alamb)
  • Update to arrow 8.0.0 #1673 (alamb)
  • Remove non idiomatic DataFusionError::into_arrow_external_error in favor of From conversion #1645 (alamb)
  • Remove Accumulator::update and Accumulator::merge #1582 (Jimexist)
  • implement Hash for various types and replace PartialOrd #1580 (Jimexist)
  • Replace DatafusionError with GenericError in ObjectStore interface #1541 (matthewmturner)
  • Make FLOAT SQL type map to Float32 rather than Float64 #1423 [sql] (liukun4515)
  • Map REAL SQL type to Float32 rather than Float64 to be consistent with pg #1390 [sql] (hntd187)

Implemented enhancements:

  • Create new datafusion_expr crate #1753
  • Create new datafusion_common crate #1752
  • API to get Expr's type and nullability without a DFSchema #1725
  • Cleaner API to create Expr::ScalarFunction programatically #1718
  • Introduce a Vec<u8> based row-wise representation for DataFusion #1708
  • Simplify creating new ListingTable #1705
  • Implement TableProvider for DataFrameImpl to allow registration of logical plans #1698
  • Public Expr simplification API #1694
  • Query Optimizer: Add OUTER --> INNER join conversion #1670
  • Support reading from CSV, Avro and Json files that have mergeable/compatible, but not identical schemas #1669
  • Remove DataFusionError::into_arrow_external_error in favor of From conversion #1644
  • Include join type in display implementation for logical plan #1620
  • Switch datafusion to using eq_dyn_scalar, etc kernels #1610
  • Proposal: Remove Accumulator::update and Accumulator::merge #1549
  • Replace DataFusionError/Result with impl Error for ObjectStore and Reader #1540
  • Add approx_quantile support #1538
  • support sorting decimal data type #1522
  • Keep all datafusion's packages up to date with Dependabot #1472
  • ExecutionContext support init ExecutionContextState with new(state: Arc<Mutex<ExecutionContextState>>) method #1439
  • support the decimal scalar value #1393
  • Documentation for using scalar functions with the DataFrame API #1364
  • Support boolean == boolean and boolean != boolean operators #1159
  • Support DataType::Decimal(15, 2) in TPC-H benchmark #174
  • Make MemoryStream public #150
  • Add support for Parquet schema merging #132
  • Add SQL support for IN expression #118
  • Add logging to datafusion-cli #1789 (alamb)
  • Add approx_median() aggregate function #1729 (realno)
  • Add join type for logical plan display #1674 [sql] (xudong963)
  • Fix null comparison for Parquet pruning predicate #1595 (viirya)
  • Add corr aggregate function #1561 (realno)
  • Add covar, covar_pop and covar_samp aggregate functions #1551 (realno)
  • Add approx_quantile() aggregation function #1539 (domodwyer)
  • Initial MemoryManager and DiskManager APIs for query execution + External Sort implementation #1526 (yjshen)
  • Add stddev and variance #1525 (realno)
  • Add rem operation for Expr #1467 (liukun4515)
  • support decimal data type in create table #1431 [sql] (liukun4515)
  • Ordering by index in select expression #1419 [sql] (hntd187)
  • Add support for ORDER BY on unprojected columns #1415 (viirya)
  • Support decimal for min and max aggregate #1407 (liukun4515)
  • Consolidate ConstantFolding and SimplifyExpression #1375 (alamb)
  • Datafusion cli quiet mode command to contain option bool #1345 (Jimexist)
  • Implement array_agg aggregate function #1300 (viirya)
  • Add a command to switch output format in cli #1284 (capkurmagati)
  • Support =, <, <=, >, >=, !=, is distinct from, is not distinct from for BooleanArray #1163 (alamb)

Fixed bugs:

  • Unsupported data type in hasher: Timestamp(Second, None) #1768
  • SQL column identifiers should be converted to lowercase when unquoted #1746
  • Data type Dictionary(Int32, Utf8) not supported for binary operation 'eq' on dyn arrays #1605
  • datafusion doesn't process predicate pushdown correctly when there is outer join #1586
  • casting Int64 to Float64 unsuccessfully caused tpch8 to fail #1576
  • CTE/WITH .. UNION ALL confuses name resolution in WHERE #1509
  • ORDER BY min(x) results in error Plan("No field named 'foo.x'. Valid fields are 'MIN(foo.x)'.") #1479
  • Sort discards field metadata on the output schema #1476
  • Datafusion should not strip out timezone information from existing types #1454
  • Error on some queries: "column types must match schema types, expected XXX but found YYY" #1447
  • Query failing to return any results when filter is an equality check on strings (bad statistics in parquet) #1433
  • Field names containing period such as f.c1 cannot be named in SQL query #1432
  • Select * returns an unexpected result #1412
  • Turn off unused default features of chrono and ahash #1398
  • real data type is float32 in PG database, but in the datafusion it is as float64 #1380
  • TPC-H q10 performance regression (expression for filter with added alias is not pushed down) #1367
  • ProjectionExec Loses Field Metadata #1361
  • Support Filter on unprojected columns #1351
  • NULLS ORDER is inconsistent with postgres #1343
  • Fix bug while merging RecordBatch, add SortPreservingMerge fuzz tester #1678 (alamb)
  • fix a cte block with same name for many times #1639 [sql] (xudong963)
  • fix: casting Int64 to Float64 unsuccessfully caused tpch8 to fail #1601 (xudong963)
  • Fix single_distinct_to_groupby for arbitrary expressions #1519 (james727)
  • Fix SortExec discards field metadata on the output schema #1477 (alamb)
  • fix calculate in many_to_many_hash_partition test. #1463 (Ted-Jiang)
  • Add Timezone to Scalar::Time* types, and better timezone awareness to Datafusion's time types #1455 (maxburke)
  • Support identifiers with . in them #1449 [sql] (alamb)
  • Fixes for working with functions in dataframes, additional documentation #1430 (tobyhede)
  • [Minor] Fix send_time metric for hash-repartition #1421 (Dandandan)
  • fix: Select * returns an unexpected result #1413 [sql] (xudong963)
  • Make cli handle multiple whitespaces #1388 (capkurmagati)
  • Metadata is kept in projections for non-derived columns #1378 (hntd187)
  • Fix Predicate Pushdown: split_members should be able to split aliased predicate #1368 (viirya)
  • Change the arg names and make parameters more meaningful #1357 (liukun4515)
  • collect table stats by default for listing table #1347 (houqp)
  • fix: make nulls-order consistent with postgres #1344 [sql] (xudong963)
  • Avoid changing expression names during constant folding #1319 (viirya)
  • improve error message for invalid create table statement #1294 [sql] (houqp)
  • Forbid creating the table with the same name #1288 (liukun4515)

Documentation updates:

Performance improvements:

  • Parquet pruning predicate for IS NULL #1591
  • Fix predicate pushdown for outer joins #1618 (james727)
  • fix: sql planner creates cross join instead of inner join from select predicates #1566 [sql] (xudong963)
  • Split fetch_metadata into fetch_statistics and fetch_schema #1365 (Dandandan)
  • Optimize the performance queries with a single distinct aggregate #1315 (ic4y)
  • Left join could use bitmap for left join instead of Vec<bool> #1291 (boazberman)

Closed issues:

  • Add release compile to CI #1728
  • DiskManager and TempFiles getting created several times per query #1690
  • Add a test for the pyarrow feature in CI #1635
  • SQL tests for when sorting exceeded available memory and had to spill to disk #1573
  • Consolidate the N-way merging code and SortPreservingMergeStream (which has quite good tests of what is often quite tricky code, and it will be performance critical) #1572
  • Consolidate the SortExec code (so there is only a single sort operator that does in memory sorting if it has enough memory budget but then spills to disk if needed). #1571
  • Track memory usage in Non Limited Operators #1569
  • [Question] Why does ballista store tables in the client instead of in the SchedulerServer #1473
  • Consolidate Projection for Schema and RecordBatch #1425
  • Support Sort on unprojected columns #1372
  • Unused code in hash_aggregate #1362
  • Why use the expr types before coercion to get the result type? #1358
  • A problem about the projection_push_down optimizer gathers valid columns #1312
  • apply constant folding to LogicalPlan::Values #1170
  • reduce usage of IntoIterator<Item = Expr> in logical plan builder window fn #372
  • Why does DataFusion throw a Tokio 0.2 runtime error? #176
  • TPC-H Query 14 #165
  • Length kernel returns bytes not character length #156
  • Split the logical operators out into separate source files #115

Merged pull requests:

  • Fixup some doc warnings #1811 (alamb)
  • Ensure most of links in docs are correct #1808 [sql] (HaoYang670)
  • Update CHANGELOG.md, update release scripts #1807 (alamb)
  • Update versions for split crates #1803 (matthewmturner)
  • Improve the error message and UX of tpch benchmark program #1800 (alamb)
  • rename references of expr in logical plan module after datafusion-expr split #1797 (Jimexist)
  • Update to sqlparser 0.14 #1796 [sql] (alamb)
  • [split/13] move rest of expr to expr_fn in datafusion-expr module #1794 (Jimexist)
  • Update datafusion versions #1793 (matthewmturner)
  • Less verbose plans in debug logging #1787 (alamb)
  • [split/11] split expr type and null info to be expr-schemable #1784 (Jimexist)
  • Introduce Row format backed by raw bytes #1782 (yjshen)
  • rewrite predicates before pushing to union inputs #1781 (korowa)
  • Update datafusion to use arrow 9.0.0 #1775 (alamb)
  • [split/10] split up expr for rewriting, visiting, and simplification traits #1774 [sql] (Jimexist)
  • #1768 Support TimeUnit::Second in hasher #1769 (jychen7)
  • TPC-H benchmark can optionally write JSON output file with benchmark summary #1766 (andygrove)
  • [split/8] move Accumulator and ColumnarValue to datafusion-expr #1765 (Jimexist)
  • [split/7] move built-in scalar function to datafusion-expr #1764 (Jimexist)
  • [split/6] move signature, type signature, volatility to datafusion-expr #1763 (Jimexist)
  • [split/9+12] move udf, udaf, Expr to datafusion-expr module #1762 [sql] (Jimexist)
  • [split/5] move window frame and operator to datafusion-expr module #1761 (Jimexist)
  • [split/4] move scalar value to datafusion-common #1760 (Jimexist)
  • [split/3] split datafusion expr module and move aggregate and window function expr #1759 (Jimexist)
  • [split/2] move column and dfschema to datafusion-common module #1758 (Jimexist)
  • Use ordered-float 2.10 #1756 (andygrove)
  • [split/1] split datafusion-common module #1751 (Jimexist)
  • use clap 3 style args parsing for datafusion cli #1749 (Jimexist)
  • fix: Case insensitive unquoted identifiers in SQL #1747 [sql] (mkmik)
  • Move more tests out of context.rs #1743 (alamb)
  • Move optimize test out of context.rs #1742 (alamb)
  • Fix typos in crate documentation #1739 (r4ntix)
  • add cargo check --release to ci #1737 (xudong963)
  • Update parking_lot requirement from 0.11 to 0.12 #1735 (dependabot[bot])
  • Create built-in scalar functions programmatically #1734 (HaoYang670)
  • Prevent repartitioning of certain operator's direct children (#1731) #1732 (tustvold)
  • API to get Expr's type and nullability without a DFSchema #1726 (alamb)
  • minor: fix cargo run --release error #1723 (xudong963)
  • substitute parking_lot::Mutex for std::sync::Mutex #1720 (xudong963)
  • Convert boolean case expressions to boolean logic #1719 (tustvold)
  • Add Expression Simplification API #1717 (alamb)
  • Create ListingTableConfig which includes file format and schema inference #1715 (matthewmturner)
  • make select_to_plan clearer #1714 [sql] (xudong963)
  • Add upper bound for public function signature #1713 (HaoYang670)
  • Add tests and CI for optional pyarrow module #1711 (wjones127)
  • Create SchemaAdapter trait to map table schema to file schemas #1709 (thinkharderdev)
  • refine test in repartition.rs & coalesce_batches.rs #1707 (xudong963)
  • Fuzz test for spillable sort #1706 (yjshen)
  • Support create_physical_expr and ExecutionContextState or DefaultPhysicalPlanner for faster speed #1700 (alamb)
  • Implement TableProvider for DataFrameImpl #1699 (cpcloud)
  • Move timestamp related tests out of context.rs and into sql integration test #1696 (alamb)
  • Lazy TempDir creation in DiskManager #1695 (alamb)
  • Add MemTrackingMetrics to ease memory tracking for non-limited memory consumers #1691 (yjshen)
  • (minor) Reduce memory manager and disk manager logs from info! to debug! #1689 (alamb)
  • Make SortPreservingMergeStream stable on input stream order #1687 (alamb)
  • Incorporate dyn scalar kernels #1685 (matthewmturner)
  • Move information_schema tests out of execution/context.rs to sql_integration tests #1684 (alamb)
  • Add a new metric type: Gauge + CurrentMemoryUsage to metrics #1682 (yjshen)
  • refactor array_agg to not to have update and merge #1681 (Jimexist)
  • Use NamedTempFile rather than String in DiskManager #1680 (alamb)
  • upgrade clap to version 3 #1672 (Jimexist)
  • Improve configuration and resource use of MemoryManager and DiskManager #1668 (alamb)
  • feat: Support quarter granularity in date_trunc function #1667 (ovr)
  • Fix can not load parquet table form spark in datafusion-cli. #1665 (Ted-Jiang)
  • Make MemoryManager and MemoryStream public #1664 (yjshen)
  • [Cleanup] Move AggregatedMetricsSet to metrics for further reuse #1663 (yjshen)
  • fix: substr - correct behaivour with negative start pos #1660 (ovr)
  • suppport bitwise and as an example #1653 [sql] (liukun4515)
  • refine match pattern related code #1650 (xudong963)
  • update md-5, sha2, blake2 #1647 (xudong963)
  • Add DataFusionError -> ArrowError conversion #1643 (alamb)
  • Add spill_count and spilled_bytes to BaselineMetrics, test sort with spill #1641 (yjshen)
  • support hash decimal array and group by #1640 (liukun4515)
  • Consolidate Schema and RecordBatch projection #1638 (alamb)
  • Update hashbrown requirement from 0.11 to 0.12 #1631 (dependabot[bot])
  • Update pyo3 requirement from 0.14 to 0.15 #1627 (dependabot[bot])
  • Optimize SortPreservingMergeStream to avoid SortKeyCursor sharing #1624 (yjshen)
  • Handle merging of evolved schemas in ParquetExec #1622 (thinkharderdev)
  • feat: Support Substring(str [from int] [for int]) #1621 [sql] (ovr)
  • feat: Support complex interval via IntervalMonthDayNano #1615 [sql] (ovr)
  • consolidate binary_expr coercion rule code into binary_rule.rs module #1607 (alamb)
  • Fix comparison of dictionary arrays #1606 (alamb)
  • add test for decimal to decimal #1603 (liukun4515)
  • update nightly version #1597 (Jimexist)
  • Consolidate sort and external_sort #1596 (yjshen)
  • support from_slice for binary, string, and boolean array types #1589 (Jimexist)
  • add from_slice trait to ease arrow2 migration #1588 (Jimexist)
  • Implement ARRAY_AGG(DISTINCT ...) #1579 (james727)
  • Rename sql integration tests from mod to sql_integration #1575 (alamb)
  • minor: improve the benchmark readme #1567 (xudong963)
  • Consolidate batch_size configuration in ExecutionConfig, RuntimeConfig and PhysicalPlanConfig #1562 (yjshen)
  • Update to rust 1.58 #1557 (xudong963)
  • support mathematics operation for decimal data type #1554 (liukun4515)
  • Address clippy warnings #1553 (sergey-melnychuk)
  • enhance arithmetic operation for array with scalar #1552 (liukun4515)
  • Remove unused update and merge implementations from Aggregates and supporting ScalarValue arithmetic #1550 (alamb)
  • Add batch operations to stddev #1547 (realno)
  • Mark ARRAY_AGG(DISTINCT ...) not implemented #1534 (james727)
  • Update to arrow-7.0.0 #1523 (alamb)
  • Fix ORDER BY on aggregate #1506 (viirya)
  • Add example on how to query multiple parquet files #1497 (nitisht)
  • Refactor testing modules #1491 (hntd187)
  • add rfcs for datafusion #1490 (xudong963)
  • support comparison for decimal data type and refactor the binary coercion rule #1483 (liukun4515)
  • Minor: Rename predicate_builder --> pruning_predicate for consistency #1481 (alamb)
  • Tests for support try_cast/cast decimal to numeric #1465 (liukun4515)
  • Avoid send empty batches for Hash partitioning. #1459 (Ted-Jiang)
  • Planner code cleanup #1450 [sql] (alamb)
  • Fix bug in projection: "column types must match schema types, expected XXX but found YYY" #1448 (alamb)
  • Update arrow-rs to 6.4.0 and replace boolean comparison in datafusion with arrow compute kernel #1446 (xudong963)
  • support cast/try_cast for decimal: signed numeric to decimal #1442 (liukun4515)
  • Consolidate decimal error checking and improve error messages #1438 [sql] (alamb)
  • use 0.13 sql parser #1435 (Jimexist)
  • Minor Code cleanups #1428 (alamb)
  • Clarify communication on bi-weekly sync #1427 (alamb)
  • support sum/avg agg for decimal, change sum(float32) --> float64 #1408 [sql] (liukun4515)
  • Fix bugs with nullability during rewrites: Combine simplify and Simplifier #1401 (alamb)
  • Minimize features #1399 (carols10cents)
  • Update rust vesion to 1.57 #1395 [sql] (xudong963)
  • support decimal scalar value #1394 (liukun4515)
  • Add coercion rules for AggregateFunctions #1387 (liukun4515)
  • upgrade the arrow-rs version #1385 (liukun4515)
  • add array agg name #1382 (liukun4515)
  • Make tests for simplify and Simplifer consistent #1376 (alamb)
  • Refactor: Consolidate expression simplification code in simplify_expression.rs #1374 (alamb)
  • remove unused code in hash_aggregate #1370 (ic4y)
  • Use BufReader for LocalFileReader to revert performance regression in parquet reading #1366 (Dandandan)
  • Add unit test for constant folding on values #1355 (viirya)
  • Extract logical plan: rename the plan name (follow up) #1354 [sql] (liukun4515)
  • Moved aggr_test_schema to test_utils #1338 (rdettai)
  • upgrade arrow-rs to 6.2.0 #1334 (liukun4515)
  • Update release instructions #1331 (alamb)
  • #1268: allow datafusion-cli to toggle quiet flag within CLI #1330 (jgoday)
  • Extract Aggregate, Sort, and Join to struct from AggregatePlan #1326 (matthewmturner)
  • Extract EmptyRelation, Limit, Values from LogicalPlan #1325 (liukun4515)
  • Extract CrossJoin, Repartition, Union in LogicalPlan #1322 (liukun4515)
  • Fifth batch of updating sql tests to use assert_batches_eq #1318 (matthewmturner)
  • Extract Explain, Analyze, Extension in LogicalPlan as independent struct #1317 [sql] (xudong963)
  • Extract CreateMemoryTable, DropTable, CreateExternalTable in LogicalPlan as independent struct #1311 [sql] (liukun4515)
  • Extract Projection, Filter, Window in LogicalPlan as independent struct #1309 (ic4y)
  • Add PSQL comparison tests for except, intersect #1292 (mrob95)
  • Extract logical plans in LogicalPlan as independent struct: TableScan #1290 (xudong963)
  • Add statement helper command to cli #1285 (matthewmturner)
  • Python bindings for window functions #819 [sql] (jgoday)