Modin 0.21.0

@mvashishtha mvashishtha released this 24 May 21:27
e8e57d9

This release includes many bug fixes, performance enhancements, and new features.

Key Features and Updates Since 0.20.0

  • Stability and Bugfixes
    • FIX-#4828: allow dict_apply_builder use keyword argument internal_indices (#5945)
    • FIX-#5091: Handle pd.Grouper objects correctly (#6174)
    • FIX-#5203: don't raise AttributeError: 'list' object has no attribute '_query_compiler' in join op (#5939)
    • FIX-#5985: BUG: ArrowPeriodType and ArrowIntervalType are not supported by HDK (#5987)
    • FIX-#5988: BUG: Concatenation of frames with strings is not supported by HDK (#5989)
    • FIX-#5993: Fix documentation building in CI (#5994)
    • FIX-#5997: Run build-docs CI job regardless of the files being changed (#5998)
    • FIX-#6000: HDK: read_csv(): Do not parse dates, if the parse_dates argument is not specified (#6001)
    • FIX-#6022: support lazy import of modin.pandas module (#6023)
    • FIX-#6037: Simplified filter node expression for ranges (#6038)
    • FIX-#6053: align 'Series.str' signatures with pandas (#6054)
    • FIX-#6069: Improve the way resample is handled at the API layer (#6179)
    • FIX-#6070: Simplify implementation of shift (#6168)
    • FIX-#6074: cap pyarrow<12 to fix CI (#6075)
    • FIX-#6094: pin 'urllib3<2' for pip command in 'test-ray-master' job (#6178)
    • FIX-#6095: Implement the to_csv() method in the HDK backend (#6099)
    • FIX-#6097: Pass storage_options to the to_csv function of PandasOnRayIO class with fsspec (#6098)
    • FIX-#6106: Fix API layer implementation of reindex_like (#6131)
    • FIX-#6107: Allow pass through of tz_convert and tz_localize to QC if possible (#6137)
    • FIX-#6109: Don't use join() when indicator is true (#6130)
    • FIX-#6110: Generalize logic to test if an index is a MultiIndex (#6135)
    • FIX-#6112: Ensure that truncate verifies that before <= after (#6134)
    • FIX-#6113: Add QC Layer implementation for idxmin/max (#6170)
    • FIX-#6114: Fix series groupby list of numpy methods (#6129)
    • FIX-#6115: Check for _to_datetime attribute in pd.to_datetime (#6133)
    • FIX-#6117: Add error checking at API level for diff (#6167)
    • FIX-#6120: HDK read_csv(): Fixed parsing dates with nanosecond precision (#6121)
    • FIX-#6146: Fix pivot when values=None (#6166)
    • FIX-#6152: make numeric_only default to True (#6162)
    • FIX-#6154: Ensure GroupBy.getitem preserves key order (#6164)
    • FIX-#6155: Fully implement droplevel for axis=0 (#6180)
    • FIX-#6175: Fix groupby agg columns for empty column partition (#6176)
    • FIX-#6181: Do not ignore copy argument in tz_convert and tz_localize (#6182)
    • FIX-#6183: Ensure array resets index and columns for all storage formats (#6185)
    • FIX-#6184: Make Series.to_list return proper list (#6188)
    • FIX-#6186: Don't use pandas extension types (#6187)
    • FIX-#6194: Fix crashes on groupby.{pct_change,diff} (#6195)
    • FIX-#6196: Align 'Series.cat' signatures with pandas (#6061)
    • FIX-#6204: Use reset_index instead of insert in to_sql (#6205)
    • FIX-#6172: Pass storage_options to the to_csv function of PandasOnUnidist class with fsspec (#6173)
  • Performance enhancements
    • PERF-#5835: Introduce lazy categorical proxy for pandas backend (#6055)
    • PERF-#5840: Precompute dtypes cache for binary operations more often (#5949)
    • PERF-#5841: Precompute dtypes for boolean setitem (#5952)
    • PERF-#5999: Do not set Ray's runtime_env for a single-node case (#6028)
    • PERF-#6122: Extract Feather's metadata without reading a whole file (#6123)
  • Refactor Codebase
    • REFACTOR-#5844: remove inplace kwarg from query compiler clip arguments (#5954)
    • REFACTOR-#5951: remove code duplication for to_pickle_distributed (#5950)
    • REFACTOR-#5992: remove 'apply_license_header.py' as unused (#5990)
    • REFACTOR-#6012: move experimental dispatchers under modin/experimental/... folder (#6011)
    • REFACTOR-#6024: remove code duplication for to_* functions (#5953)
    • REFACTOR-#6044: remove code duplication for 'get_objects_from_partitions' (#6045)
    • REFACTOR-#6046: remove code duplication for 'progress_bar_wrapper' (#6047)
    • REFACTOR-#6062: Add query compiler interfaces for expanding methods (#6064)
    • REFACTOR-#6063: Add query compiler interfaces for some strings methods. (#6088)
    • REFACTOR-#6065: Use between_time in at_time (#6158)
    • REFACTOR-#6066: Support rolling.{rank,quantile,sem} (#6084)
    • REFACTOR-#6067: Simplify describe() query compiler interface (#6082)
    • REFACTOR-#6068: Simplify info() call (#6087)
    • REFACTOR-#6071: Push first and last down to query compiler. (#64) (#6125)
    • REFACTOR-#6091: Push more of memory_usage down to query compiler. (#6092)
    • REFACTOR-#6105: Explicitly pass default value of np.nan to Series.reindex (#6138)
    • REFACTOR-#6108: Move implementation of pd.cut to QC layer (#6136)
    • REFACTOR-#6116: Move groupby_ohlc implementation to QC layer (#6132)
    • REFACTOR-#6119: #6118: Add query compiler methods for groupby diff, pct_change (#6128)
    • REFACTOR-#6151: Get slicer without constructing pandas dataframe. (#6161)
    • REFACTOR-#6159: Stop defaulting at API layer for a few more methods (#6160)
  • Update testing suite
    • TEST-#5956: Verify dtypes equality in tests (#5955)
    • TEST-#5980: use cancel-in-progress only for PRs (#5917)
    • TEST-#5991: add simple tests for read_orc, read_spss, json_normalize, read_xml, read_gbq (#5983)
    • TEST-#6004: add more '# pragma: no cover' for io functions (#6002)
    • TEST-#6006: test modin/test/test_partition_api.py on unidist and dask (#6003)
    • TEST-#6009: use tmp_path fixture instead of ensure_clean_dir as pandas 2.0.0 does (#6008)
    • TEST-#6010: add some more test directories into 'setup.cfg' (#6007)
    • TEST-#6020: exclude '_version.py' from coverage (#6019)
    • TEST-#6027: Test installing Unidist via pip in a clean environment, as we do for Dask and Ray (#6025)
    • TEST-#6030: test the function parameters of Series.str accessor for pandas equivalence (#6033)
    • TEST-#6031: test the function parameters of 'Series.dt' accessor for pandas equivalence (#6197)
    • TEST-#6076: Use 2 cores for experimental groupby on dask (#6077)
    • TEST-#6198: add 'pragma: no cover' for unidist and ray utils that are used in remote context (#6059)
    • TEST-#6260: Increase test_io timeout (#6207)
  • Documentation improvements
    • DOCS-#5449: Add page for Modin interoperability with select third party libraries (#5517)
    • DOCS-#6021: Add a section regarding reshuffling groupby to Modin's documentation (#6051)
    • DOCS-#6078: correct default values for MODIN_CPUS and MODIN_NPARTITIONS (#6177)
    • DOCS-#6079: Make 'experimental/index.html' accessible through the readthedocs website (#6080)
  • New Features
    • FEAT-#5816: Implement '.split' method for axis partitions (#5856)
    • FEAT-#5867: Introduce groupby implementation via range-partitioning (#5928)
    • FEAT-#6014: Stop defaulting to pandas in groupby frontend for fill-like methods (#5996)
    • FEAT-#6039: Implement Series.str through CachedAccessor (#6043)
    • FEAT-#6040: implement 'Series.dt' through 'CachedAccessor' (#6056)
    • FEAT-#6041: implement 'Series.cat' through 'CachedAccessor' (#6057)
    • FEAT-#6144: Stop defaulting at API layer for a bunch of methods (#6145)
    • FEAT-#6147: HDK: Arrow-based columns concatenation of frames with trivial index. (#6148)
    • FEAT-#6153: Add API layer implementations for some stat methods. (#6156)
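
FEAT-#6039, FEAT-#6040, and FEAT-#6041 above route Series.str, Series.dt, and Series.cat through a CachedAccessor. For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of a cached accessor written as a non-data descriptor, in the style known from pandas. It is an illustration only: ToySeries and StringMethods are hypothetical stand-ins, and Modin's actual classes and internals may differ.

```python
class CachedAccessor:
    """Non-data descriptor that builds an accessor once per instance."""

    def __init__(self, name, accessor_cls):
        self._name = name
        self._accessor_cls = accessor_cls

    def __get__(self, obj, owner=None):
        if obj is None:
            # Accessed on the class itself: expose the accessor class.
            return self._accessor_cls
        accessor = self._accessor_cls(obj)
        # Cache on the instance. Because this descriptor defines no __set__,
        # the instance attribute wins on every later lookup.
        object.__setattr__(obj, self._name, accessor)
        return accessor


class StringMethods:
    """Hypothetical accessor holding string helpers for ToySeries."""

    def __init__(self, series):
        self._series = series

    def upper(self):
        return ToySeries([value.upper() for value in self._series.values])


class ToySeries:
    """Hypothetical stand-in for a Series; not Modin's class."""

    str = CachedAccessor("str", StringMethods)

    def __init__(self, values):
        self.values = list(values)


series = ToySeries(["a", "b"])
upper_values = series.str.upper().values
same_accessor = series.str is series.str
```

The accessor object is created on first use and reused afterwards, so repeated calls such as series.str.upper() do not pay the construction cost again.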

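FEAT-#5867 above introduces a groupby implementation based on range-partitioning. The general idea can be sketched in plain Python: rows are routed to partitions by key range, so every key lives in exactly one partition and each partition can be aggregated independently. This is a toy illustration under stated assumptions, not Modin's code: the pivot values are chosen by hand here, and the helper names range_partition and groupby_sum are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def range_partition(rows, key, pivots):
    """Route each row to one of len(pivots) + 1 buckets by key range."""
    buckets = [[] for _ in range(len(pivots) + 1)]
    for row in rows:
        buckets[bisect_right(pivots, row[key])].append(row)
    return buckets


def groupby_sum(rows, key, value):
    """Ordinary groupby-sum applied to the rows of a single partition."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


pairs = [(5, 1), (1, 2), (9, 3), (5, 4), (1, 5), (7, 6)]
rows = [{"k": k, "v": v} for k, v in pairs]

# Hand-picked pivots (an assumption): keys <= 3, keys in 4..7, keys >= 8.
pivots = [4, 8]
partitions = range_partition(rows, "k", pivots)

# Each partition is aggregated on its own; no key spans two partitions,
# so the partial results can simply be combined without a merge step.
result = {}
for partition in partitions:
    result.update(groupby_sum(partition, "k", "v"))
```

Because the partitions are disjoint in key space, the per-partition results never need to be reconciled with each other, which is what makes the approach attractive for distributed execution.
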
Contributors

@AndreyPavlenko
@RehanSD
@YarShev
@anmyachev
@arunjose696
@dchigarev
@devin-petersohn
@helmeleegy
@jkew
@labanyamukhopadhyay
@mdatre
@mvashishtha
@noloerino
@pyrito
@vnlitvinov
@naren-ponder