Modin 0.24.0
This release upgrades the supported pandas version to 2.1, raises the minimum supported Python version to 3.9, and introduces ModinDataLoader to improve interaction with PyTorch. It fixes several interchange protocol issues, resolving known compatibility problems with Plotly, Seaborn and Altair, and bumps HDK to version 0.8. It also includes other new features and many bug fixes.
Key Features and Updates Since 0.23.0
- Stability and Bugfixes
- FIX-#0000: Don't test experimental xgboost with Ray nightly build (#6424)
- FIX-#0000: Fix xgboost tests with `ray>2.6.0` (#6425)
- FIX-#1930: Fix one of the cases of heterogeneous data for `read_csv` (#5507)
- FIX-#4347: `read_excel`: defaults to pandas for unsupported types of 'io' (#6462)
- FIX-#4580: Fix access by row label in `query` and `eval` (#6488)
- FIX-#4687: Change `Column.null_count` to return a built-in `int` instead of NumPy scalar (#6526)
- FIX-#5164: Fix `unwrap_partitions` for virtual partitions when `axis=None` (#6560)
- FIX-#5536: Remove branch disabling `__getattribute__` for experimental mode (#6529)
- FIX-#5627: Stop checking `temp_df.dtype == 'category'` (#6360)
- FIX-#5972: Compute correct dtype for `Series.str.find/index/rfind/rindex` (#6426)
- FIX-#6219: Don't default to pandas for `copy` on empty DataFrame/Series objects (#6371)
- FIX-#6299: `__array__` method always returns array of vanilla numpy (#6300)
- FIX-#6334: Improve error message if HDK isn't installed in the environment (#6358)
- FIX-#6347: Remove 'modin in the cloud' experimental feature (#6408)
- FIX-#6364: Make reshuffling work with `BenchmarkMode.put(True)` (#6365)
- FIX-#6367: Enable support for `groupby.size()` in reshuffling groupby (#6370)
- FIX-#6368: Apply deferred indices before map-reduce groupby (#6369)
- FIX-#6372: Precompute dtypes for `sum` operation (#6421)
- FIX-#6375: Don't initialize engines at import time (#6374)
- FIX-#6386: Don't make unnecessary `astype` calls for `modin.array.sum` op (#6395)
- FIX-#6392: Compute dtypes for the `DataFrame.mean()` result (#6520)
- FIX-#6394: Preserve dtypes for `__setitem__` op when using not hashable key (#6547)
- FIX-#6396: Set `__factory` to `None` in case of any problems during initialization (#6397)
- FIX-#6402: Allow datetime and timedelta types in `diff` (#6403)
- FIX-#6405: Apply `disable_logging` to `__getattr__` (#6406)
- FIX-#6410: Add a link to @modin_project twitter (#6411)
- FIX-#6414: Fix `read_feather` with `pyarrow<11.0` (#6415)
- FIX-#6427: Make code compatible with `flake8==6.1.0` (#6428)
- FIX-#6429: Exclude `pymssql==2.2.8` from environments (#6430)
- FIX-#6436: Support `~` in paths in IO functions correctly (#6448)
- FIX-#6443: Cast boolean columns before `sum|mean|median` groupby aggregations (#6444)
- FIX-#6446: Stop requiring modin-xgboost approval (#6447)
- FIX-#6456: Create fake xgboost module for building docs (#6457)
- FIX-#6459: Support `fastparquet>=2023.1.0` (#6458)
- FIX-#6465: Fix `groupby.apply()` for UDFs that change the output's shape (#6506)
- FIX-#6479: HDK CalciteBuilder: Do not call `is_bool_dtype()` for categorical (#6480)
- FIX-#6483: Default to pandas for `__array_ufunc__` (#6486)
- FIX-#6509: Fix 'reshuffling' in case of a string key (#6510)
- FIX-#6514: `test_sort_cols_str` from test_dataframe.py crashed on HDK 0.7.0 and python 3.9 (#6515)
- FIX-#6516: HDK: test_dataframe.py is crashed if Calcite is disabled (#6517)
- FIX-#6518: Fix interchange protocol for string columns (#6523)
- FIX-#6519: Consider `botocore` as an optional dependency (#6521)
- FIX-#6532: Fix `read_excel` so that it doesn't use `rich_text` param for old `openpyxl` (#6534)
- FIX-#6535: Pin `s3fs<2023.9.0` (#6536)
- FIX-#6537: Unpin `s3fs<2023.9.0` (#6544)
- FIX-#6540: Correct handling of range indices and index names in `read_parquet` (#6545)
- FIX-#6541: Fix `ValueError: buffer source array is read-only` for `iloc` (#6538)
- FIX-#6549: Remove usage of `dfsql` module (#6550)
- FIX-#6552: Avoid `FutureWarning`s in `groupby` unless necessary (#6595)
- FIX-#6553: Fix `read_csv` with `iterator=True` (#6554)
- FIX-#6558: Normalize the number of partitions after `.read_parquet()` (#6559)
- FIX-#6561: Remove `MODIN_OMNISCI_*` env vars in favor of `MODIN_HDK_*` (#6562)
- FIX-#6565: Don't implement `map` function via `applymap` (#6566)
- FIX-#6572: Execute simple queries row-wise in pandas backend (#6575)
- FIX-#6582: Avoid `FutureWarning`s in `bfill/backfill/ffill/pad` unless necessary (#6599)
- FIX-#6587: Use different env files for unidist engine for windows and linux (#6588)
- FIX-#6601: `sort_values` shouldn't affect source dataframe/series (#6603)
- Performance enhancements
- PERF-#6332: Don't materialize axes in `concat` operation (#6381)
- PERF-#6373: Preserve dtypes cache for `_repartition` (#6376)
- PERF-#6378: Use `numpy.array` operations in internals of `iloc/loc` operation (#6393)
- PERF-#6388: Avoid masking in `__getitem__` when the number of rows to be taken > 90% (#6423)
- PERF-#6398: Improved performance of list-like objects insertion into DataFrames (#6476)
- PERF-#6433: Implement `.dropna()` using map-reduce pattern (#6472)
- PERF-#6437: Preserve dtypes for `reindex` (#6438)
- PERF-#6464: Improve reshuffling for multi-column groupby in low-cardinality cases (#6533)
- PERF-#6466: Verify indices equality without triggering any computations (#6491)
- PERF-#6478: Do not propagate new columns if they're identical to the previous ones (#6481)
- PERF-#6524: Add a 'column' shape hint for the results of `qc.to_datetime()` (#6525)
- PERF-#6583: Remove redundant index reassignment in `query()` (#6584)
- PERF-#6590: Chunk axes independently in `.from_pandas()` (#6591)
- Refactor Codebase
- REFACTOR-#4278: Remove unused arguments from `BasePandasDataset.apply` (#6451)
- REFACTOR-#4902: Use `isort` (#6551)
- REFACTOR-#6470: Remove `Patcher` internal class (#6471)
- REFACTOR-#6489: Enforce API-layer bool/integer argument for `__invert__` (#6490)
- REFACTOR-#6569: Use `contextlib.nullcontext` instead of custom one (#6570)
- REFACTOR-#6576: Don't use deprecated `is_int64_dtype` and `is_period_dtype` function (#6577)
- Update testing suite
- TEST-#0000: Download ray wheel for python 3.9 (#6513)
- TEST-#2008: Reduce runtime of CI checks a lot (#6356)
- TEST-#4270: Revert disabling `time_groupby_agg_nunique` ASV bench (#6564)
- TEST-#4348: Use `psycopg2-binary` for testing and developing purpose (#6573)
- TEST-#4477: Add tests for `df.eval` with scalar and `groupby.transform` call in the expr (#6546)
- TEST-#4643: Add interchange test for empty dataframe (#6454)
- TEST-#5008: Set benchmark mode within unit test instead of with environment variable (#6359)
- TEST-#6349: Update minimum versions for test dependencies in general environments (#6350)
- TEST-#6439: Create HDK environment manually for ASV (#6431)
- TEST-#6449: Run tests in test_dmatrix.py only for Ray engine (#6450)
- TEST-#6460: Don't use `repr` to force materialization (#6461)
- TEST-#6469: Pin `numexpr<2.8.5` (#6474)
- TEST-#6477: Update ASV to 0.5.1 (#6432)
- TEST-#6497: Remove `boto3` from environments to speedup creation (#6496)
- TEST-#6505: Update python version for ASV benchmarks on HDK (#6504)
- TEST-#6593: Adapt tests for pandas 2.1.1 (#6592)
- Documentation improvements
- New Features
- FEAT-#1611: Add some datetime extraction functions for HDK (#6568)
- FEAT-#5645: Add support for modin's numpy array in `dataframe.insert` function (#6400)
- FEAT-#6139: `DataLoader` interplay. (#6140)
- FEAT-#6377: HDK: Do not keep reference to arrow table imported to HDK (#6380)
- FEAT-#6389: Make sure git ignores logs in `.modin` folder (#6390)
- FEAT-#6401: Support compression param and more file extensions in `to_parquet` (#6404)
- FEAT-#6407: Update minimum dependency versions (#6342)
- FEAT-#6417: Add support for filters to `read_parquet` (#6442)
- FEAT-#6434: HDK: Do not convert dictionary columns to string when importing arrow tables (#6435)
- FEAT-#6440: Use different HDK parameters for different queries (#6441)
- FEAT-#6484: HDK: Add support for `nlargest/nsmallest` groupby aggregation (#6485)
- FEAT-#6500: HDK: Add support for `datetime64` to `int64` cast (#6501)
- FEAT-#6502: HDK: Add `enable_multifrag_execution_result=1` HDK launch parameter (#6503)
- FEAT-#6511: Update the minimum supported python version up to 3.9 (#6508)
- FEAT-#6522: Update to pandas 2.1.0 (#6512)
- FEAT-#6527: HDK: Add support for the quantile group by aggregation. (#6528)
- FEAT-#6597: Bump pyhdk version to 0.8 (#6598)
Contributors
@AndreyPavlenko
@RehanSD
@YarShev
@anmyachev
@dchigarev
@mvashishtha
@vnlitvinov
@abykovsk
@zmbc
@noloerino
@rentruewang