Releases: modin-project/modin

Modin 0.28.0

07 Mar 18:35
0.28.0
14452a8

This release introduces the modin.pandas.api.extensions module, faster implementations of merge and
groupby.rolling (the latter enabled by default), and new functions for working with Ray Datasets: to/from_ray_dataset.
It also includes other new features, performance optimizations, and bug fixes.

Key Features and Updates Since 0.27.0

  • Stability and Bugfixes
    • FIX-#6935: Fix merge when right operand is an empty dataframe (#6941)
    • FIX-#6936: Fix read_parquet when dataset is created with to_parquet and index=False (#6937)
    • FIX-#6944: Apply isort formatting for scripts from tutorials (#6945)
    • FIX-#6946: Remove needs: [lint-black-isort, ...] (#6947)
    • FIX-#6948: Fix groupby when Modin dataframe has several column partitions (#6951)
    • FIX-#6952: Use render_as_string to get sqlalchemy engine url (#6953)
    • FIX-#6968: Align API with pandas (#6969)
    • FIX-#6974: Always use actual pandas version in test_all_urls_exist (#6975)
    • FIX-#6982: Updating data in notebooks from yellow taxi to green taxi dataset (#6993)
    • FIX-#6984: Ensure the results of inplace operations materialize (for tests) (#6985)
  • Performance enhancements
    • PERF-#6976: Do not trigger unnecessary computations on ._propagate_index_objs() (#6977)
    • PERF-#6979: Do not trigger ._copartition() for identical indices on binary operations (#6980)
  • Refactor Codebase
    • REFACTOR-#6856: Rename read_pickle_distributed/to_pickle_distributed to read_pickle_glob/to_pickle_glob (#6957)
    • REFACTOR-#6939: Make modin.pandas.DataFrame._to_pandas a public method (#6940)
    • REFACTOR-#6958: Remove DataFrame.to_pickle_distributed in favour of DataFrame.modin.to_pickle_distributed (#6959)
    • REFACTOR-#7002: Get more information about exceptions from eval_general utility (#7003)
    • REFACTOR-#7008: Remove check_exception_type argument of eval_general function (#7009)
    • REFACTOR-#7013: Move to_pandas and to_ray_dataset into modin namespace (#7014)
    • REFACTOR-#7017: Align to_hdf and hist signatures to pandas (#7018)
  • Update testing suite
    • TEST-#6932: Don't use deprecated pandas._testing.makeStringIndex (#6933)
    • TEST-#6994: Update tests in test_series.py (#6995)
    • TEST-#6996: Update tests in test_io.py (#6997)
  • Documentation improvements
  • New Features
    • FEAT-#3044: Create Extensions Module in Modin (#6961)
    • FEAT-#4622: Unify data type of log_level in logging module (#6992)
    • FEAT-#6913: Support sqlalchemy connectables in read_sql by getting connection url (#6956)
    • FEAT-#6934: Support include_groups=False parameter in groupby.apply() (#6938)
    • FEAT-#6942: Enable range-partitioning impl for groupby().rolling() by default (#6943)
    • FEAT-#6965: Implement .merge() using range-partitioning implementation (#6966)
    • FEAT-#6970: Implement to/from_ray_dataset functions (#6971)
    • FEAT-#6983: Add Pluggable Documentation Module Support (#6986)
    • FEAT-#7001: Do not force materialization in MetaList.__getitem__() (#7006)
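
The range-partitioning approach named in FEAT-#6942 and FEAT-#6965 can be illustrated with a minimal, pure-Python sketch. It is not Modin's implementation: the helper names (`range_partition`, `range_partitioned_merge`) and the hand-picked pivots are assumptions made for illustration. The idea shown is that both inputs are split by the same key ranges, so each pair of partitions can be joined independently.

```python
from bisect import bisect_right


def range_partition(rows, key, pivots):
    """Split rows into len(pivots) + 1 buckets by comparing row[key] to sorted pivots."""
    buckets = [[] for _ in range(len(pivots) + 1)]
    for row in rows:
        buckets[bisect_right(pivots, row[key])].append(row)
    return buckets


def range_partitioned_merge(left, right, key, pivots):
    """Inner-join two lists of dicts, one key range (bucket) at a time."""
    merged = []
    left_parts = range_partition(left, key, pivots)
    right_parts = range_partition(right, key, pivots)
    for left_part, right_part in zip(left_parts, right_parts):
        # Both buckets cover the same key range, so this join needs no
        # data from any other bucket.
        lookup = {}
        for right_row in right_part:
            lookup.setdefault(right_row[key], []).append(right_row)
        for left_row in left_part:
            for right_row in lookup.get(left_row[key], []):
                merged.append({**left_row, **right_row})
    return merged


left = [{"id": 1, "x": "a"}, {"id": 5, "x": "b"}, {"id": 9, "x": "c"}]
right = [{"id": 5, "y": 50}, {"id": 9, "y": 90}, {"id": 12, "y": 120}]

# Hypothetical pivots: keys <= 3, keys 4..7, keys >= 8.
result = range_partitioned_merge(left, right, key="id", pivots=[4, 8])
print(result)
```

In a distributed engine the per-bucket joins could run in parallel, and the pivots would typically be derived from a sample of the keys rather than fixed by hand.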

Contributors

@AndreyPavlenko
@Retribution98
@YarShev
@anmyachev
@arunjose696
@dchigarev
@sfc-gh-dpetersohn
@tochigiv

Modin 0.27.0

14 Feb 14:00
0.27.0
d54dcfd

This release updates pandas to 2.2, introduces a lazy execution mode on Ray, adds new functions that support glob
syntax, and speeds up several more groupby cases. It also includes other new features, performance
optimizations, and many bug fixes.

Key Features and Updates Since 0.26.0

  • Stability and Bugfixes
    • FIX-#2405: Make sure named aggregation works for Series objects (#6892)
    • FIX-#5925: Put a sorting-hack into groupby tests to hide #6875 bug (#6896)
    • FIX-#6830: Pass AWS related env vars to mpiexec (#6867)
    • FIX-#6840: Call tolist function in DtypesDescriptor._merge_dtypes (#6844)
    • FIX-#6855: Make sure read_parquet works with integer columns for pyarrow engine (#6874)
    • FIX-#6879: Convert the right DF to single partition before broadcasting in query_compiler.merge (#6880)
    • FIX-#6881: Make sure astype works correctly with int32 and float32 dtypes (#6884)
    • FIX-#6897: Preprocess kernel function that aligns columns in groupby (#6898)
    • FIX-#6897: Revert unidist specific fix for groupby (#6902)
    • FIX-#6899: Avoid sending lazy categorical proxies to workers (#6900)
    • FIX-#6904: Align levels of partially known dtypes with MultiIndex labels (#6905)
    • FIX-#6911: Remove unidist specific workaround in .from_pandas() (#6912)
    • FIX-#6916: Unpin pydantic dependency (#6917)
    • FIX-#6924: HDK: Use JoinNode instead of MaskNode for non-range row_position (#6926)
  • Performance enhancements
    • PERF-#6876: Skip the masking stage on iloc where beneficial (#6878)
    • PERF-#6922: Set DaskThreadsPerWorker to 1 (#6923)
  • Refactor Codebase
    • REFACTOR-#6293: Corrected missmatch to mismatch in ErrorMessage.missmatch_with_pandas method (#6901)
    • REFACTOR-#6812: Remove PyarrowOnRay execution in favour of pyarrow-backed pandas dataframes (#6848)
    • REFACTOR-#6833: Remove SocksProxy, DoLogRpyc, DoTraceRpyc outdated classes (#6834)
    • REFACTOR-#6845: Fix import issues found by CodeQL (#6837)
    • REFACTOR-#6852: Remove OrderedDict in favor of builtin dict (#6853)
    • REFACTOR-#6858: Rename _get_dimensions and change arguments (#6859)
    • REFACTOR-#6889: Define __all__ in modin.config.__init__.py (#6886)
    • REFACTOR-#6903: Remove duplicated definitions of create_test_series (#6910)
    • REFACTOR-#6918: Docstring and type hints fixes (#6925)
  • Update testing suite
    • TEST-#6708: Create test files using tmp_path fixture (#6709)
    • TEST-#6777: Make to_csv tests on Unidist more stable (for test-all-unidist CI job) (#6851)
    • TEST-#6830: Use local s3 server instead of public s3 buckets (#6863)
    • TEST-#6846: Skip unstable Unidist to_csv tests (#6847)
    • TEST-#6868: Remove tests for gs remote protocol since we rely on fsspec (#6882)
    • TEST-#6885: Switch to black>=24.1.0 (#6887)
    • TEST-#6893: Added support for pytest 8.0.0 (#6894)
    • TEST-#6920: Remove testing for Ray client (#6921)
  • Documentation improvements
    • DOCS-#6860: Add an ecosystem page to the docs (#6861)
  • New Features
    • FEAT-#3450: Implement read_json_glob and to_json_glob (#6873)
    • FEAT-#5809: New implementation of the Ray lazy execution queue (#6731)
    • FEAT-#5925: Enable grouping on categoricals with range-partitioning impl (#6862)
    • FEAT-#6382: Execute bitwise NOT (~) operations on HDK (#6383)
    • FEAT-#6398: Improved performance of list-like objects insertion into HDK DataFrames (#6412)
    • FEAT-#6830: Remove public s3 bucket reference (#6829)
    • FEAT-#6831: Implement read_parquet_glob and to_parquet_glob (#6854)
    • FEAT-#6832: Implement read_xml_glob, to_xml_glob (#6930)
    • FEAT-#6835: Do not put binary functions to the Ray storage multiple times (#6836)
    • FEAT-#6838: Prefer lazy execution for binary operations with scalar (#6839)
    • FEAT-#6841: Fixing ray anti pattern with .length() and .width() being called in a loop (#6842)
    • FEAT-#6849: Removing to_pandas call in merge and join functions (#6850)
    • FEAT-#6883: Support grouping on a Series with range-partitioning impl (#6888)
    • FEAT-#6906: Update to pandas 2.2.* (#6907)
    • FEAT-#6908: Remove the warning regarding engine initialization (#6909)
    • FEAT-#6914: Add a config for setting a number of threads per Dask worker (#6915)
    • FEAT-#6918: Add auto mode to the lazy execution. (#6919)
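
The glob-based functions added in this release (read_json_glob/to_json_glob, read_parquet_glob/to_parquet_glob, read_xml_glob/to_xml_glob) operate on a set of files described by a single pattern. The sketch below mimics that idea with the standard library only. `write_parts` and `read_parts` are hypothetical helpers, and replacing `*` with a partition number is an assumption made for illustration, not a statement of Modin's exact behavior.

```python
import glob
import json
import os
import tempfile


def write_parts(partitions, pattern):
    """Write each partition to its own file, substituting '*' with its number."""
    for number, rows in enumerate(partitions):
        with open(pattern.replace("*", str(number)), "w") as handle:
            json.dump(rows, handle)


def read_parts(pattern):
    """Read every file matching the pattern and concatenate the rows."""
    rows = []
    # Sorting keeps the order stable here; with ten or more files,
    # zero-padded numbers would be needed to get a numeric order.
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            rows.extend(json.load(handle))
    return rows


partitions = [[{"a": 1}, {"a": 2}], [{"a": 3}]]
with tempfile.TemporaryDirectory() as tmp:
    pattern = os.path.join(tmp, "data_*.json")
    write_parts(partitions, pattern)
    file_count = len(glob.glob(pattern))
    restored = read_parts(pattern)

print(file_count, restored)
```

Writing one file per partition lets each worker write and read its own piece without first gathering the whole dataset in one process.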

Contributors

@AndreyPavlenko
@YarShev
@anmyachev
@arunjose696
@dchigarev
@leshikus
@vedant

Modin 0.26.1

19 Jan 15:53
0.26.1
c207880

This release includes a fix for the concat function.

Key Features and Updates Since 0.26.0

  • Stability and Bugfixes
    • FIX-#6830: Pass AWS related env vars to mpiexec (#6867)
    • FIX-#6840: Call tolist function in DtypesDescriptor._merge_dtypes (#6844)
  • Update testing suite
    • TEST-#6777: Make to_csv tests on Unidist more stable (for test-all-unidist CI job) (#6851)
    • TEST-#6830: Use local s3 server instead of public s3 buckets (#6863)
    • TEST-#6846: Skip unstable Unidist to_csv tests (#6847)
  • New Features
    • FEAT-#6830: Remove public s3 bucket reference (#6829)

Contributors

@leshikus
@anmyachev

Modin 0.26.0

14 Dec 15:17
0.26.0
47a9a4a

This release introduces a new, faster implementation of groupby.apply, many performance fixes that improve asynchronous execution, a new namespace for accessing experimental functions (for example, DataFrame.modin.to_pickle_distributed), a fix for a long-standing problem with using Modin objects inside UDFs for apply, and many other fixes.

Note: for Modin on MPI through unidist (as of unidist 0.5.0) to work fully when installed with pip, a working MPI implementation must be installed beforehand.

Key Features and Updates Since 0.25.0

  • Stability and Bugfixes
    • FIX-#4355: Fix rename algebraic operator to avoid copying (#4356)
    • FIX-#6594: Fix usage of Modin objects inside UDFs for apply (#6673)
    • FIX-#6664: Use @lazy_metadata_decorator for PandasDataFrame.finalize (#6720)
    • FIX-#6684: Adapt to pandas 2.1.2 (#6685)
    • FIX-#6687: Explicitly add users to CODEOWNERS (#6688)
    • FIX-#6693: Revert creating an additional copy in astype op (#6692)
    • FIX-#6703: Don't use set_index_name(None) (#6698)
    • FIX-#6732: Fix inferring result dtypes for binary operations (#6737)
    • FIX-#6745: Pin unidist <= 0.4.1 (#6746)
    • FIX-#6752: Preserve dtypes cache on .insert() (#6757)
    • FIX-#6768: Make sure to_numpy use **kwargs after #6704 (#6769)
    • FIX-#6771: Avoid ValueError: assignment destination is read-only for cumsum (#6772)
    • FIX-#6773: Make sure _to_pandas return mutable pandas objects (#6775)
    • FIX-#6774: Modify conditions for loc to get similar behavior to pandas (#6798)
    • FIX-#6778: Read parquet files without file extensions using fastparquet (#6790)
    • FIX-#6779: Pass only one indexer into Series.__getitem__ (#6780)
    • FIX-#6781: Use pandas.api.types.pandas_dtype to convert to valid numpy and pandas only dtypes (#6788)
    • FIX-#6782: Filter pandas warnings when precomputing dtypes (#6811)
    • FIX-#6786: Properly d2p for cross DataFrame.join (#6787)
    • FIX-#6791: Pass additional environment variables to MPI workers (#6792)
    • FIX-#6799: Allow creating incomplete ModinIndex objects (#6800)
    • FIX-#6822: Do not propagate NotImplementedError to a user on a set_columns() with dupl labels (#6823)
    • FIX-#6824: Invalidate ModinIndex._lengths_id on empty partitions filtering (#6825)
  • Performance enhancements
    • PERF-#4777: Don't use copy=True parameter for concat calls inside to_pandas (#4778)
    • PERF-#4804: Preserve lengths/widths caches in broadcast_apply_full_axis (#6760)
    • PERF-#6666: Avoid internal reset_index for left merge (#6665)
    • PERF-#6668: Use copy=False for internal usage of set_axis (#6667)
    • PERF-#6669: Avoid one extra copy() call for Series.reset_index (#6670)
    • PERF-#6671: Don't iterate over the result of the Series.tolist function (#6672)
    • PERF-#6690: Use sync_labels=False for rank function (#6689)
    • PERF-#6694: Use lazy_map_partitions() for dtypes conversion (#6695)
    • PERF-#6696: Use cached dtypes in fillna when possible. (#6697)
    • PERF-#6701: Use get_axis internal function instead of axes property (#6700)
    • PERF-#6702: Don't materialize axes when calling to_numpy (#6699)
    • PERF-#6710: Don't materialize index in _groupby_shuffle internal function (#6707)
    • PERF-#6712: Copy _shape_hint in query_compiler.copy function (#6713)
    • PERF-#6714: Assign qc._shape_hint = column in columnarize function (#6715)
    • PERF-#6716: Avoid materializing axes in _filter_empties (#6717)
    • PERF-#6718: Use _get_axis_lengths function instead of _axes_lengths property (#6719)
    • PERF-#6721: Use keep_partitioning=True, for duplicated implementation (#6722)
    • PERF-#6723: Use _shape_hint = "column" in DataFrame.squeeze (#6724)
    • PERF-#6727: Remove remaining result.name = None in groupby code (#6726)
    • PERF-#6728: In the case of narrow dataframes, it is cheaper to convert partitions to numpy in the main process. (#6704)
    • PERF-#6747: Preserve columns/dtypes cache when merging on a single index level (#6748)
    • PERF-#6749: Preserve partial dtype for the result of reset_index() (#6751)
    • PERF-#6753: Preserve dtypes cache on .__setitem__() (#6758)
    • PERF-#6754: Merge partial dtype caches on .concat(axis=0) (#6759)
    • PERF-#6756: Don't materialize index when sorting (#6755)
    • PERF-#6762: Carry dtypes information in lazy indices (#6763)
  • Refactor Codebase
    • REFACTOR-#0000: Cleanup one todo and flake8 issues in modin/utils.py (#6826)
    • REFACTOR-#6739: Use execution_wrapper instead of directly addressing DaskWrapper (#6740)
    • REFACTOR-#6805: Move all IO functions to modin.pandas.io module (#6806)
    • REFACTOR-#6807: Rename experimental groupby and experimental numpy variables (#6809)
    • REFACTOR-#6815: Move experimental parsers into modin.experimental folder (#6813)
    • REFACTOR-#6818: Don't implicitly enable experimental mode (#6817)
  • Update testing suite
    • TEST-#6705: Don't compare 'pkl' files (#6706)
    • TEST-#6729: Use custom pytest mark instead of --extra-test-parameters option (#6730)
    • TEST-#6777: Make to_csv tests on Unidist more stable (#6776)
    • TEST-#6795: Don't use platform-dependent int type (#6796)
  • Documentation improvements
    • DOCS-#0000: Add conda forge doc (#6627)
    • DOCS-#6819: Update Modin on cluster documentation (#6678)
  • New Features
    • FEAT-#5836: Introduce 'partial' dtypes cache (#6663)
    • FEAT-#6735: Make Modin on MPI through unidist component more obvious (#6736)
    • FEAT-#6767: Provide the ability to use experimental functionality when experimental mode is not enabled globally via an environment variable (#6764)
    • FEAT-#6784: Add d2p implementations for DataFrame.__rdivmod__/__divmod__ (#6785)
    • FEAT-#6801: Add modin.pandas.error module (#6802)
    • FEAT-#6803: Enable range-partitioning impl for groupby.apply() by default (#6804)
    • FEAT-#6820: Make sure IO functions work with path-like filenames (#6821)

Contributors

@AndreyPavlenko
@JignyasAnand
@RehanSD
@YarShev
@anmyachev
@devin-petersohn
@dchigarev
@mvashishtha
@seydar

Modin 0.24.1.post0

17 Nov 23:23
0.24.1.post0
46b7e66

Hotfix for Unidist.

Key Features and Updates Since 0.24.1

  • Stability and Bugfixes

Note: the pip wheel for this release is broken; use https://github.com/modin-project/modin/releases/tag/0.24.1.post1 instead.

Contributors

@anmyachev
@dchigarev

Modin 0.25.1

16 Nov 15:49
0.25.1
b99cf06

Hotfix for Unidist.

Key Features and Updates Since 0.25.0

  • Stability and Bugfixes

Contributors

@anmyachev

Modin 0.23.1.post0

15 Nov 19:54
0.23.1.post0
0c3746b

The main purpose of this release is to port as many fixes as possible to the latest version that supports Python 3.8.

Key Features and Updates Since 0.23.1

  • Stability and Bugfixes
    • FIX-#0000: Pin unidist<=0.4.1
    • FIX-#4347: read_excel: defaults to pandas for unsupported types of io (#6462)
    • FIX-#4507: Do not call ray.get() inside of the kernel executing call queues (#6633)
    • FIX-#4687: Change Column.null_count to return a built-in int instead of NumPy scalar (#6526)
    • FIX-#5164: Fix unwrap_partitions for virtual partitions when axis=None (#6560)
    • FIX-#5536: Remove branch disabling __getattribute__ for experimental mode (#6529)
    • FIX-#6465: Fix groupby.apply() for UDFs that change the output's shape (#6506)
    • FIX-#6479: HDK CalciteBuilder: Do not call is_bool_dtype() for categorical (#6480)
    • FIX-#6509: Fix reshuffling in case of a string key (#6510)
    • FIX-#6514: test_sort_cols_str from test_dataframe.py crashed on HDK 0.7.0 and python 3.9 (#6515)
    • FIX-#6516: HDK: test_dataframe.py is crashed if Calcite is disabled (#6517)
    • FIX-#6518: Fix interchange protocol for string columns (#6523)
    • FIX-#6519: Consider botocore as an optional dependency (#6521)
    • FIX-#6532: Fix read_excel so that it doesn't use rich_text param for old openpyxl (#6534)
    • FIX-#6535: Pin s3fs<2023.9.0 (#6536)
    • FIX-#6537: Unpin s3fs<2023.9.0 (#6544)
    • FIX-#6541: Fix ValueError: buffer source array is read-only for iloc (#6538)
    • FIX-#6553: Fix read_csv with iterator=True (#6554)
    • FIX-#6572: Execute simple queries row-wise in pandas backend (#6575)
    • FIX-#6594: Fix usage of Modin objects inside UDFs for apply (#6673)
    • FIX-#6600: Fix usage of list of UDF functions in Series.groupby.agg (#6613)
    • FIX-#6601: sort_values shouldn't affect source dataframe/series (#6603)
    • FIX-#6602: Refactor join to avoid distributing a dict object warning (#6612)
    • FIX-#6607: Fix incorrect cache after .sort_values() (#6608)
    • FIX-#6628: Allow groupby diff for dates (#6631)
    • FIX-#6632: Return Series instead of Dataframe for groupby.apply in case of experimental groupby (#6649)
    • FIX-#6635: HDK: read_csv: treat object dtype as string (#6636)
    • FIX-#6637: Fix skiprows parameter usage for read_excel (#6638)
    • FIX-#6642: Fix modin.numpy.array.sum on HDK (#6643)
    • FIX-#6647: Added init file to make modin/experimental/sql/hdk/query.py part of modin package (#6646)
    • FIX-#6651: Make sure Series.between works correctly (#6656)
    • FIX-#6680: Specify navigation_with_keys=True to fix docs build (#6681)

Contributors

@AndreyPavlenko
@Egor-Krivov
@Garra1980
@RehanSD
@anmyachev
@dchigarev
@vnlitvinov

Modin 0.25.0

26 Oct 19:46
0.25.0
e12b217

This release introduces the modin.utils.execute function to improve the benchmarking experience and includes the new HDK version 0.9.
It also includes performance optimizations for sort_values, value_counts, 2D setitem, and several other operations, as well as many bug fixes.

Key Features and Updates Since 0.24.0

  • Stability and Bugfixes
    • FIX-#4507: Do not call ray.get() inside of the kernel executing call queues (#6633)
    • FIX-#6585: Avoid FutureWarnings in rolling unless necessary (#6586)
    • FIX-#6600: Fix usage of list of UDF functions in Series.groupby.agg (#6613)
    • FIX-#6602: Refactor join to avoid distributing a dict object warning (#6612)
    • FIX-#6604: HDK: Added support for list to DataFrame.agg() (#6606)
    • FIX-#6607: Fix incorrect cache after .sort_values() (#6608)
    • FIX-#6624: Add FutureWarnings for first/last/bool (#6625)
    • FIX-#6628: Allow groupby.diff() for dates (#6631)
    • FIX-#6632: Return Series instead of Dataframe for groupby.apply in case of experimental groupby (#6649)
    • FIX-#6635: HDK: read_csv(): treat object dtype as string (#6636)
    • FIX-#6637: Fix skiprows parameter usage for read_excel (#6638)
    • FIX-#6642: Fix modin.numpy.array.sum on HDK (#6643)
    • FIX-#6647: Added init file to make modin/experimental/sql/hdk/query.py part of modin package (#6646)
    • FIX-#6651: Make sure Series.between works correctly (#6656)
    • FIX-#6680: Specify navigation_with_keys=True to fix docs build (#6681)
  • Performance enhancements
    • PERF-#2813: Distributed from_pandas() for numerical data in Ray (#6640)
    • PERF-#5533: Improved sort_values by reducing the number of partitions (#6589)
    • PERF-#6362: Implement 2D setitem without to-pandas conversion (#6618)
    • PERF-#6614: HDK: Use MODIN_CPUS instead of os.cpu_count() for the fragment size calculation (#6615)
    • PERF-#6629: HDK: Avoid LazyProxyCategoricalDtype materialization on merge (#6630)
    • PERF-#6645: Avoid label synchronization for dot operation (#6644)
    • PERF-#6653: value_counts(): Eliminate redundant sorting. (#6654)
    • PERF-#6661: Do not convert columns dtypes if the new dtypes are the same (#6662)
  • Refactor Codebase
    • REFACTOR-#6622: Don't use deprecated random_integers func (#6623)
  • Update testing suite
    • TEST-#5489: Allow for pytest to print warnings in tests output (#6621)
  • Documentation improvements
    • DOCS-#4085: Replace vague links to actual names of the pages/sections in docs (#4096)
    • DOCS-#6658: Add a note how to enable object spilling in a multi-node Ray cluster (#6659)
  • New Features
    • FEAT-#5221: Add execute to trigger lazy computations and wait for them to complete (#6648)
    • FEAT-#5634: Introduce materialize parameter for partition.ip func (#6650)
    • FEAT-#6675: Bump pyhdk version to 0.9 (#6676)
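
FEAT-#5221 matters for benchmarking because asynchronous engines return control before the work is done, so a timer stopped too early measures only scheduling. The sketch below shows that effect with the standard library's thread pool. `slow_square` is a made-up workload, and `wait(...)` only stands in for the role that modin.utils.execute plays. It is not Modin code.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def slow_square(x):
    """Made-up workload: pretend each task takes 0.2 seconds."""
    time.sleep(0.2)
    return x * x


with ThreadPoolExecutor(max_workers=4) as pool:
    start = time.perf_counter()
    futures = [pool.submit(slow_square, i) for i in range(4)]
    # Tasks are only scheduled at this point, so this is not the real cost.
    submit_time = time.perf_counter() - start

    # Block until all pending work has finished, then stop the timer.
    wait(futures)
    total_time = time.perf_counter() - start

results = [future.result() for future in futures]
print(f"scheduling only: {submit_time:.3f}s, with wait: {total_time:.3f}s")
```

With Modin, the analogous step would be to call modin.utils.execute on the dataframe before stopping the timer. The exact call signature is not stated in these notes, so check the Modin documentation before relying on it.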

Contributors

@AndreyPavlenko
@Egor-Krivov
@Garra1980
@YarShev
@anmyachev
@dchigarev

Modin 0.24.1

28 Sep 12:40
0.24.1
4c01f64

Hotfix for sort_values.

Key Features and Updates Since 0.24.0

  • Stability and Bugfixes
    • FIX-#6604: HDK: Added support for list to DataFrame.agg() (#6606)
    • FIX-#6607: Fix incorrect cache after .sort_values() (#6608)

Contributors

@AndreyPavlenko
@dchigarev

Modin 0.24.0

26 Sep 23:36
0.24.0
22ce95e

This release upgrades pandas to 2.1, raises the minimum supported Python version to 3.9, introduces ModinDataLoader to improve interaction with PyTorch, fixes several interchange protocol issues, which resolves known compatibility problems with Plotly, Seaborn and Altair, and includes the new HDK version 0.8. It also includes other new features and many bug fixes.

Key Features and Updates Since 0.23.0

  • Stability and Bugfixes
    • FIX-#0000: Don't test experimental xgboost with Ray nightly build (#6424)
    • FIX-#0000: Fix xgboost tests with ray>2.6.0 (#6425)
    • FIX-#1930: Fix one of the cases of heterogeneous data for read_csv (#5507)
    • FIX-#4347: read_excel: defaults to pandas for unsupported types of 'io' (#6462)
    • FIX-#4580: Fix access by row label in query and eval (#6488)
    • FIX-#4687: Change Column.null_count to return a built-in int instead of NumPy scalar (#6526)
    • FIX-#5164: Fix unwrap_partitions for virtual partitions when axis=None (#6560)
    • FIX-#5536: Remove branch disabling __getattribute__ for experimental mode (#6529)
    • FIX-#5627: Stop checking temp_df.dtype == 'category' (#6360)
    • FIX-#5972: Compute correct dtype for Series.str.find/index/rfind/rindex (#6426)
    • FIX-#6219: Don't default to pandas for copy on empty DataFrame/Series objects (#6371)
    • FIX-#6299: __array__ method always returns array of vanilla numpy (#6300)
    • FIX-#6334: Improve error message if HDK isn't installed in the environment (#6358)
    • FIX-#6347: Remove 'modin in the cloud' experimental feature (#6408)
    • FIX-#6364: Make reshuffling work with BenchmarkMode.put(True) (#6365)
    • FIX-#6367: Enable support for groupby.size() in reshuffling groupby (#6370)
    • FIX-#6368: Apply deferred indices before map-reduce groupby (#6369)
    • FIX-#6372: Precompute dtypes for sum operation (#6421)
    • FIX-#6375: Don't initialize engines at import time (#6374)
    • FIX-#6386: Don't make unnecessary astype calls for modin.array.sum op (#6395)
    • FIX-#6392: Compute dtypes for the DataFrame.mean() result (#6520)
    • FIX-#6394: Preserve dtypes for __setitem__ op when using not hashable key (#6547)
    • FIX-#6396: Set __factory to None in case of any problems during initialization (#6397)
    • FIX-#6402: Allow datetime and timedelta types in diff (#6403)
    • FIX-#6405: Apply disable_logging to __getattr__ (#6406)
    • FIX-#6410: Add a link to @modin_project twitter (#6411)
    • FIX-#6414: Fix read_feather with pyarrow<11.0 (#6415)
    • FIX-#6427: Make code compatible with flake8==6.1.0 (#6428)
    • FIX-#6429: Exclude pymssql==2.2.8 from environments (#6430)
    • FIX-#6436: Support ~ in paths in IO functions correctly (#6448)
    • FIX-#6443: Cast boolean columns before sum|mean|median groupby aggregations (#6444)
    • FIX-#6446: Stop requiring modin-xgboost approval (#6447)
    • FIX-#6456: Create fake xgboost module for building docs (#6457)
    • FIX-#6459: Support fastparquet>=2023.1.0 (#6458)
    • FIX-#6465: Fix groupby.apply() for UDFs that change the output's shape (#6506)
    • FIX-#6479: HDK CalciteBuilder: Do not call is_bool_dtype() for categorical (#6480)
    • FIX-#6483: Default to pandas for __array_ufunc__ (#6486)
    • FIX-#6509: Fix 'reshuffling' in case of a string key (#6510)
    • FIX-#6514: test_sort_cols_str from test_dataframe.py crashed on HDK 0.7.0 and python 3.9 (#6515)
    • FIX-#6516: HDK: test_dataframe.py is crashed if Calcite is disabled (#6517)
    • FIX-#6518: Fix interchange protocol for string columns (#6523)
    • FIX-#6519: Consider botocore as an optional dependency (#6521)
    • FIX-#6532: Fix read_excel so that it doesn't use rich_text param for old openpyxl (#6534)
    • FIX-#6535: Pin s3fs<2023.9.0 (#6536)
    • FIX-#6537: Unpin s3fs<2023.9.0 (#6544)
    • FIX-#6540: Correct handling of range indices and index names in read_parquet (#6545)
    • FIX-#6541: Fix ValueError: buffer source array is read-only for iloc (#6538)
    • FIX-#6549: Remove usage of dfsql module (#6550)
    • FIX-#6552: Avoid FutureWarnings in groupby unless necessary (#6595)
    • FIX-#6553: Fix read_csv with iterator=True (#6554)
    • FIX-#6558: Normalize the number of partitions after .read_parquet() (#6559)
    • FIX-#6561: Remove MODIN_OMNISCI_* env vars in favor of MODIN_HDK_* (#6562)
    • FIX-#6565: Don't implement map function via applymap (#6566)
    • FIX-#6572: Execute simple queries row-wise in pandas backend (#6575)
    • FIX-#6582: Avoid FutureWarnings in bfill/backfill/ffill/pad unless necessary (#6599)
    • FIX-#6587: Use different env files for unidist engine for windows and linux (#6588)
    • FIX-#6601: sort_values shouldn't affect source dataframe/series (#6603)
  • Performance enhancements
    • PERF-#6332: Don't materialize axes in concat operation (#6381)
    • PERF-#6373: Preserve dtypes cache for _repartition (#6376)
    • PERF-#6378: Use numpy.array operations in internals of iloc/loc operation (#6393)
    • PERF-#6388: Avoid masking in __getitem__ when the number of rows to be taken > 90% (#6423)
    • PERF-#6398: Improved performance of list-like objects insertion into DataFrames (#6476)
    • PERF-#6433: Implement .dropna() using map-reduce pattern (#6472)
    • PERF-#6437: Preserve dtypes for reindex (#6438)
    • PERF-#6464: Improve reshuffling for multi-column groupby in low-cardinality cases (#6533)
    • PERF-#6466: Verify indices equality without triggering any computations (#6491)
    • PERF-#6478: Do not propagate new columns if they're identical to the previous ones (#6481)
    • PERF-#6524: Add a 'column' shape hint for the results of qc.to_datetime() (#6525)
    • PERF-#6583: Remove redundant index reassignment in query() (#6584)
    • PERF-#6590: Chunk axes independently in .from_pandas() (#6591)
  • Refactor Codebase
    • REFACTOR-#4278: Remove unused arguments from BasePandasDataset.apply (#6451)
    • REFACTOR-#4902: Use isort (#6551)
    • REFACTOR-#6470: Remove Patcher internal class (#6471)
    • REFACTOR-#6489: Enforce API-layer bool/integer argument for __invert__ (#6490)
    • REFACTOR-#6569: Use contextlib.nullcontext instead of custom one (#6570)
    • REFACTOR-#6576: Don't use deprecated is_int64_dtype and is_period_dtype function (#6577)
  • Update testing suite
    • TEST-#0000: Download ray wheel for python 3.9 (#6513)
    • TEST-#2008: Reduce runtime of CI checks a lot (#6356)
    • TEST-#4270: Revert disabling time_groupby_agg_nunique ASV bench (#6564)
    • TEST-#4348: Use psycopg2-binary for testing and developing purpose (#6573)
    • TEST-#4477: Add tests for df.eval with scalar and groupby.transform call in the expr (#6546)
    • TEST-#4643: Add interchange test for empty dataframe (#6454)
    • TEST-#5008: Set benchmark mode within unit test instead of with environment variable (#6359)
    • TEST-#6349: Update minimum versions for test dependencies in general environments (#6350)
    • TEST-#6439: Create HDK environment manually for ASV (#6431)
    • TEST-#6449: Run tests in test_dmatrix.py only for Ray engine (#6450)
    • TEST-#6460: Don't use repr to force materialization (#6461)
    • TEST-#6469: Pin numexpr<2.8.5 (#6474)
    • TEST-#6477: Update ASV to 0.5.1 (#6432)
    • TEST-#6497: Remove boto3 from environments to speedup creation (#6496)
    • TEST-#6505: Update python version for ASV benchmarks on HDK (#6504)
    • TEST-#6593: Adapt tests for pandas 2.1.1 (#6592)
  • Documentation improvements
    • DOCS-#0000: Update CI link in README to show only pushes (#6531)
    • DOCS-#6416: Fix import path for spreadsheet feature (#6581)
    • DOCS-#6419: Clarify read_parquet supported parameters (#6420)
    • DOCS-#6452: Update copyright year (#6453)
  • New Features
    • FEAT-#1611: Add some datetime extraction functions for HDK (#6568)
    • FEAT-#5645: Add support for modin's numpy array in dataframe.insert function (#6400)
    • FEAT-#6139: DataLoader interplay. (#6140)
    • FEAT-#6377: HDK: Do not keep reference to arrow table imported to HDK (#6380)
    • FEAT-#6389: Make sure git ignores logs in .modin folder (#6390)
    • FEAT-#6401: Support compression param and more file extensions in to_parquet (#6404)
    • FEAT-#6407: Update minimum dependency versions (#6342)
    • FEAT-#6417: Add support for filters to read_parquet (#6442)
    • FEAT-#6434: HDK: Do not convert dictionary columns to string when importing arrow tables (#6435)
    • FEAT-#6440: Use different HDK parameters for different queries (#6441)
    • FEAT-#6484: HDK: Add support for nlargest/nsmallest groupby aggregation (#6485)
    • FEAT-#6500: HDK: Add support for datetime64 to int64 cast (#6501)
    • FEAT-#6502: HDK: Add enable_multifrag_execution_result=1 HDK launch parameter (#6503)
    • FEAT-#6511: Update the minimum supported python version up to 3.9 (#6508)
    • FEAT-#6522: Update to pandas 2.1.0 (#6512)
    • FEAT-#6527: HDK: Add support for the quantile group by aggregation. (#6528)
    • FEAT-#6597: Bump pyhdk version to 0.8 (#6598)

Contributors

@AndreyPavlenko
@RehanSD
@YarShev
@anmyachev
@dchigarev
@mvashishtha
@vnlitvinov
@abykovsk
@zmbc
@noloerino
@rentruewang