All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
- Support diff for Spark Connect implemented via PySpark Dataset API (#251)
- Add ignore columns to diff in Python API (#252)
- Check that the Java / Scala package is installed when needed by Python (#250)
- Diff change column should respect comparators (#238)
- Make create_temporary_dir work with pyspark-extension only (#222). This allows installing PIP packages and Poetry projects via the pure Python spark-extension package (the Maven package is no longer required).
- Add map diff comparator to Python API (#226)
- Add count_null aggregate function (#206)
- Support reading parquet schema (#208)
- Add more columns to reading parquet metadata (#209, #211)
- Provide groupByKey shortcuts for groupBy.as (#213)
- Allow installing PIP packages into a PySpark job (#215)
- Allow installing Poetry projects into a PySpark job (#216)
- Update setup.py to include parquet methods in python package (#191)
- Add --statistics option to diff app (#189)
- Add --filter option to diff app (#190)
- Add key order sensitive map comparator (#187)
- Use dataset encoder rather than implicit value encoder for implicit dataset extension class (#183)
- Fix key-sensitivity in map comparator (#186)
- Add method to set and automatically unset Spark job description. (#172)
- Add column function that converts between .Net (C#, F#, Visual Basic) DateTime.Ticks and Spark timestamp / Unix epoch timestamps. (#153)
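  A minimal spark-shell sketch of the ticks conversion above; the function name `dotNetTicksToTimestamp` and the example tick value are assumptions, as the entry only describes the conversion itself:

  ```scala
  // spark-shell sketch; assumes the spark-extension package is on the classpath.
  // dotNetTicksToTimestamp is an assumed name for the column function described above.
  import spark.implicits._
  import uk.co.gresearch.spark._

  // hypothetical tick value: 100-nanosecond intervals since 0001-01-01
  val df = Seq(638155413748959308L).toDF("ticks")
  df.select($"ticks", dotNetTicksToTimestamp($"ticks").as("timestamp")).show(false)
  ```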
- Spark app to diff files or tables and write result back to file or table. (#160)
- Add null value count to parquetBlockColumns and parquet_block_columns. (#162)
- Add parallelism argument to Parquet metadata methods. (#164)
- Change data type of column name in parquetBlockColumns and parquet_block_columns to array of strings. Cast to string to get earlier behaviour (string column name). (#162)
- Add reader for parquet metadata. (#154)
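  A rough spark-shell sketch of the Parquet metadata reader and block columns entries above; parquetBlockColumns is named in the changelog, the parquetMetadata reader name is an assumption, and the file path is a placeholder:

  ```scala
  // spark-shell sketch; assumes spark-extension is on the classpath and
  // /tmp/example.parquet is a placeholder path to an existing Parquet file.
  import uk.co.gresearch.spark.parquet._

  // one row per Parquet file with block-level metadata
  spark.read.parquetMetadata("/tmp/example.parquet").show()

  // per-block, per-column detail; since #162 the column name is an array of strings
  spark.read.parquetBlockColumns("/tmp/example.parquet").show()
  ```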
- Add whitespace agnostic diff comparator. (#137)
- Add Python whl package build. (#151)
- Allow for custom diff equality. (#127)
- Fix Python API calling into Scala code. (#132)
- Add diffWith to Scala, Java and Python Diff API. (#109)
- Diff similar Datasets with ignoreColumns. Before, only similar DataFrames could be diffed with ignoreColumns. (#111)
- Cache before writing via partitionedBy to work around SPARK-40588. Unpersist via UnpersistHandle. (#124)
- Add (global) row numbers transformation to Scala, Java and Python API. (#97)
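  A short sketch of the row numbers transformation above, assuming withRowNumbers is the Scala entry point (the entry does not spell the method name out):

  ```scala
  // spark-shell sketch; withRowNumbers as the Scala method name is an assumption
  import spark.implicits._
  import uk.co.gresearch.spark._

  val df = Seq("a", "b", "c", "d").toDF("value")
  // adds a globally consecutive "row_number" column across all partitions
  df.withRowNumbers().show()
  ```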
- Removed support for Python 3.6.
- Add sorted group methods to Dataset. (#76)
- Add support for Spark 3.2 and Scala 2.13.
- Support to ignore columns in diff API. (#63)
- Removed support for Spark 2.4.
- Add support for Spark 3.1.
- Refine conditional transformation helper methods.
- Add transformation to compute histogram. (#26)
- Add conditional transformation helper methods. (#27)
- Add partitioned writing helpers that simplify writing optimally ordered partitioned data. (#29)
- Add diff modes (#22): column-by-column, side-by-side, left and right side diff modes.
- Add sparse mode (#23): diff DataFrame contains only changed values.
- Add Python API for Diff transformation.
- Add change column to Diff transformation providing column names of all changed columns in a row.
- Add fluent methods to change immutable diff options.
- Add backticks method to handle column names that contain dots (.).
- Add Diff transformation for Datasets.
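  For orientation, a minimal spark-shell sketch of the Diff transformation introduced above; treat exact signatures as assumptions:

  ```scala
  // spark-shell sketch; assumes spark-extension is on the classpath
  import spark.implicits._
  import uk.co.gresearch.spark.diff._

  val left = Seq((1, "one"), (2, "two"), (3, "three")).toDF("id", "value")
  val right = Seq((1, "one"), (2, "Two"), (4, "four")).toDF("id", "value")

  // the "diff" column marks each row as N (no change), C (changed), D (deleted) or I (inserted)
  left.diff(right, "id").show()
  ```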