Modin 0.26.0

@anmyachev released this 14 Dec 15:17
0.26.0
47a9a4a

This release introduces a new, faster implementation of groupby.apply, many performance fixes that improve asynchronous execution, a new namespace for accessing experimental functions (for example, DataFrame.modin.to_pickle_distributed), a fix for a long-standing problem with using Modin objects inside UDFs passed to apply, and many other fixes.
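
As an illustration of the kind of call the groupby.apply work targets, here is a minimal sketch. The data, the column names, and the `score_range` UDF are made up for this example, and the snippet falls back to plain pandas when Modin is not installed so that it stays runnable. It shows only the public call, not how the new implementation works internally.

```python
# Sketch only: falls back to plain pandas when Modin is not installed,
# so the snippet stays runnable; the calls used are the same in both cases.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd

# Hypothetical example data.
df = pd.DataFrame(
    {
        "team": ["a", "a", "b", "b", "b"],
        "score": [1, 2, 3, 4, 5],
    }
)


def score_range(group):
    # Hypothetical UDF: spread between the largest and smallest score in a group.
    return group["score"].max() - group["score"].min()


# One value per team, indexed by the group key.
result = df.groupby("team").apply(score_range)
print(result)
```

The UDF receives one group at a time, so any function that works on a pandas DataFrame can be passed the same way.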

Note: for Modin on MPI through unidist (as of unidist 0.5.0) to work fully when installed with pip, a working MPI implementation must be installed beforehand.

Key Features and Updates Since 0.25.0

  • Stability and Bugfixes
    • FIX-#4355: Fix rename algebraic operator to avoid copying (#4356)
    • FIX-#6594: Fix usage of Modin objects inside UDFs for apply (#6673)
    • FIX-#6664: Use @lazy_metadata_decorator for PandasDataFrame.finalize (#6720)
    • FIX-#6684: Adapt to pandas 2.1.2 (#6685)
    • FIX-#6687: Explicitly add users to CODEOWNERS (#6688)
    • FIX-#6693: Revert creating an additional copy in astype op (#6692)
    • FIX-#6703: Don't use set_index_name(None) (#6698)
    • FIX-#6732: Fix inferring result dtypes for binary operations (#6737)
    • FIX-#6745: Pin unidist <= 0.4.1 (#6746)
    • FIX-#6752: Preserve dtypes cache on .insert() (#6757)
    • FIX-#6768: Make sure to_numpy uses **kwargs after #6704 (#6769)
    • FIX-#6771: Avoid ValueError: assignment destination is read-only for cumsum (#6772)
    • FIX-#6773: Make sure _to_pandas returns mutable pandas objects (#6775)
    • FIX-#6774: Modify conditions for loc to get similar behavior to pandas (#6798)
    • FIX-#6778: Read parquet files without file extensions using fastparquet (#6790)
    • FIX-#6779: Pass only one indexer into Series.__getitem__ (#6780)
    • FIX-#6781: Use pandas.api.types.pandas_dtype to convert to valid numpy and pandas only dtypes (#6788)
    • FIX-#6782: Filter pandas warnings when precomputing dtypes (#6811)
    • FIX-#6786: Properly d2p for cross DataFrame.join (#6787)
    • FIX-#6791: Pass additional environment variables to MPI workers (#6792)
    • FIX-#6799: Allow creating incomplete ModinIndex objects (#6800)
    • FIX-#6822: Do not propagate NotImplementedError to a user on a set_columns() with duplicate labels (#6823)
    • FIX-#6824: Invalidate ModinIndex._lengths_id on empty partitions filtering (#6825)
  • Performance enhancements
    • PERF-#4777: Don't use copy=True parameter for concat calls inside to_pandas (#4778)
    • PERF-#4804: Preserve lengths/widths caches in broadcast_apply_full_axis (#6760)
    • PERF-#6666: Avoid internal reset_index for left merge (#6665)
    • PERF-#6668: Use copy=False for internal usage of set_axis (#6667)
    • PERF-#6669: Avoid one extra copy() call for Series.reset_index (#6670)
    • PERF-#6671: Don't iterate over the result of the Series.tolist function (#6672)
    • PERF-#6690: Use sync_labels=False for rank function (#6689)
    • PERF-#6694: Use lazy_map_partitions() for dtypes conversion (#6695)
    • PERF-#6696: Use cached dtypes in fillna when possible (#6697)
    • PERF-#6701: Use get_axis internal function instead of axes property (#6700)
    • PERF-#6702: Don't materialize axes when calling to_numpy (#6699)
    • PERF-#6710: Don't materialize index in _groupby_shuffle internal function (#6707)
    • PERF-#6712: Copy _shape_hint in query_compiler.copy function (#6713)
    • PERF-#6714: Assign qc._shape_hint = column in columnarize function (#6715)
    • PERF-#6716: Avoid materializing axes in _filter_empties (#6717)
    • PERF-#6718: Use _get_axis_lengths function instead of _axes_lengths property (#6719)
    • PERF-#6721: Use keep_partitioning=True for duplicated implementation (#6722)
    • PERF-#6723: Use _shape_hint = "column" in DataFrame.squeeze (#6724)
    • PERF-#6727: Remove remaining result.name = None in groupby code (#6726)
    • PERF-#6728: In the case of narrow dataframes, it is cheaper to convert partitions to numpy in the main process (#6704)
    • PERF-#6747: Preserve columns/dtypes cache when merging on a single index level (#6748)
    • PERF-#6749: Preserve partial dtype for the result of reset_index() (#6751)
    • PERF-#6753: Preserve dtypes cache on .__setitem__() (#6758)
    • PERF-#6754: Merge partial dtype caches on .concat(axis=0) (#6759)
    • PERF-#6756: Don't materialize index when sorting (#6755)
    • PERF-#6762: Carry dtypes information in lazy indices (#6763)
  • Refactor Codebase
    • REFACTOR-#0000: Cleanup one todo and flake8 issues in modin/utils.py (#6826)
    • REFACTOR-#6739: Use execution_wrapper instead of directly addressing DaskWrapper (#6740)
    • REFACTOR-#6805: Move all IO functions to modin.pandas.io module (#6806)
    • REFACTOR-#6807: Rename experimental groupby and experimental numpy variables (#6809)
    • REFACTOR-#6815: Move experimental parsers into modin.experimental folder (#6813)
    • REFACTOR-#6818: Don't implicitly enable experimental mode (#6817)
  • Update testing suite
    • TEST-#6705: Don't compare 'pkl' files (#6706)
    • TEST-#6729: Use custom pytest mark instead of --extra-test-parameters option (#6730)
    • TEST-#6777: Make to_csv tests on Unidist more stable (#6776)
    • TEST-#6795: Don't use platform-dependent int type (#6796)
  • Documentation improvements
    • DOCS-#0000: Add conda forge doc (#6627)
    • DOCS-#6819: Update Modin on cluster documentation (#6678)
  • New Features
    • FEAT-#5836: Introduce 'partial' dtypes cache (#6663)
    • FEAT-#6735: Make Modin on MPI through unidist component more obvious (#6736)
    • FEAT-#6767: Provide the ability to use experimental functionality when experimental mode is not enabled globally via an environment variable (#6764)
    • FEAT-#6784: Add d2p implementations for DataFrame.__rdivmod__/__divmod__ (#6785)
    • FEAT-#6801: Add modin.pandas.error module (#6802)
    • FEAT-#6803: Enable range-partitioning impl for groupby.apply() by default (#6804)
    • FEAT-#6820: Make sure IO functions work with path-like filenames (#6821)
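
The path-like filename support listed above can be sketched as follows, assuming `to_csv` and `read_csv` are among the IO functions it covers (the list does not say which functions were changed). The file name and data are made up, and the snippet falls back to plain pandas when Modin is not installed.

```python
import tempfile
from pathlib import Path

# Sketch only: falls back to plain pandas when Modin is not installed.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # A pathlib.Path object is passed directly instead of a string.
    csv_path = Path(tmp) / "example.csv"  # hypothetical file name
    pd.DataFrame({"x": [1, 2, 3]}).to_csv(csv_path, index=False)

    loaded = pd.read_csv(csv_path)
    # Materialize the result before the temporary directory is removed.
    total = int(loaded["x"].sum())

print(total)
```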

Contributors

@AndreyPavlenko
@JignyasAnand
@RehanSD
@YarShev
@anmyachev
@devin-petersohn
@dchigarev
@mvashishtha
@seydar