Modin 0.21.0
This release includes many bug fixes, performance enhancements, and new features.
Key Features and Updates Since 0.20.0
- Stability and Bugfixes
- FIX-#4828: allow `dict_apply_builder` use keyword argument `internal_indices` (#5945)
- FIX-#5091: Handle pd.Grouper objects correctly (#6174)
- FIX-#5203: don't raise `AttributeError: 'list' object has no attribute '_query_compiler'` in `join` op (#5939)
- FIX-#5985: BUG: ArrowPeriodType and ArrowIntervalType are not supported by HDK (#5987)
- FIX-#5988: BUG: Concatenation of frames with strings is not supported by HDK (#5989)
- FIX-#5993: Fix documentation building in CI (#5994)
- FIX-#5997: Run `build-docs` CI job regardless of the files being changed (#5998)
- FIX-#6000: HDK: read_csv(): Do not parse dates, if the parse_dates argument is not specified (#6001)
- FIX-#6022: support lazy import of `modin.pandas` module (#6023)
- FIX-#6037: Simplified filter node expression for ranges (#6038)
- FIX-#6053: align 'Series.str' signatures with pandas (#6054)
- FIX-#6069: Improve the way resample is handled at the API layer (#6179)
- FIX-#6070: Simplify implementation of `shift` (#6168)
- FIX-#6074: cap pyarrow<12 to fix CI (#6075)
- FIX-#6094: pin 'urllib3<2' for pip command in 'test-ray-master' job (#6178)
- FIX-#6095: Implement the to_csv() method in the HDK backend (#6099)
- FIX-#6097: Pass storage_options to the to_csv function of PandasOnRayIO class with fsspec (#6098)
- FIX-#6106: Fix API layer implementation of reindex_like (#6131)
- FIX-#6107: Allow pass through of `tz_convert` and `tz_localize` to QC if possible (#6137)
- FIX-#6109: Don't use join() when indicator is true (#6130)
- FIX-#6110: Generalize logic to test if an index is a MultiIndex (#6135)
- FIX-#6112: Ensure that `truncate` verifies that before <= after (#6134)
- FIX-#6113: Add QC Layer implementation for idxmin/max (#6170)
- FIX-#6114: Fix series groupby list of numpy methods (#6129)
- FIX-#6115: Check for `_to_datetime` attribute in `pd.to_datetime` (#6133)
- FIX-#6117: Add error checking at API level for `diff` (#6167)
- FIX-#6120: HDK read_csv(): Fixed parsing dates with nanosecond precision (#6121)
- FIX-#6146: Fix `pivot` when `values=None` (#6166)
- FIX-#6152: make `numeric_only` default to `True` (#6162)
- FIX-#6154: Ensure GroupBy.getitem preserves key order (#6164)
- FIX-#6155: Fully implement droplevel for axis=0 (#6180)
- FIX-#6175: Fix groupby agg columns for empty column partition (#6176)
- FIX-#6181: Do not ignore `copy` argument in `tz_convert` and `tz_localize` (#6182)
- FIX-#6183: Ensure array resets index and columns for all storage formats (#6185)
- FIX-#6184: Make Series.to_list return proper list (#6188)
- FIX-#6186: Don't use pandas extension types (#6187)
- FIX-#6194: Fix crashes on groupby.{pct_change,diff} (#6195)
- FIX-#6196: Align 'Series.cat' signatures with pandas (#6061)
- FIX-#6204: Use reset_index instead of insert in to_sql (#6205)
- FIX-#6172: Pass storage_options to the to_csv function of PandasOnUnidist class with fsspec (#6173)
- Performance enhancements
- PERF-#5835: Introduce lazy categorical proxy for pandas backend (#6055)
- PERF-#5840: Precompute dtypes cache for binary operations more often (#5949)
- PERF-#5841: Precompute dtypes for boolean setitem (#5952)
- PERF-#5999: Do not set Ray's `runtime_env` for a single-node case (#6028)
- PERF-#6122: Extract Feather's metadata without reading a whole file (#6123)
- Refactor Codebase
- REFACTOR-#5844: remove `inplace` kwarg from query compiler `clip` arguments (#5954)
- REFACTOR-#5951: remove code duplication for `to_pickle_distributed` (#5950)
- REFACTOR-#5992: remove 'apply_license_header.py' as unused (#5990)
- REFACTOR-#6012: move experimental dispatchers under `modin/experimental/...` folder (#6011)
- REFACTOR-#6024: remove code duplication for `to_*` functions (#5953)
- REFACTOR-#6044: remove code duplication for 'get_objects_from_partitions' (#6045)
- REFACTOR-#6046: remove code duplication for 'progress_bar_wrapper' (#6047)
- REFACTOR-#6062: Add query compiler interfaces for expanding methods (#6064)
- REFACTOR-#6063: Add query compiler interfaces for some strings methods. (#6088)
- REFACTOR-#6065: Use between_time in at_time (#6158)
- REFACTOR-#6066: Support rolling.{rank,quantile,sem} (#6084)
- REFACTOR-#6067: Simplify describe() query compiler interface (#6082)
- REFACTOR-#6068: Simplify info() call (#6087)
- REFACTOR-#6071: Push first and last down to query compiler. (#64) (#6125)
- REFACTOR-#6091: Push more of memory_usage down to query compiler. (#6092)
- REFACTOR-#6105: Explicitly pass default value of np.nan to Series.reindex (#6138)
- REFACTOR-#6108: Move implementation of `pd.cut` to QC layer (#6136)
- REFACTOR-#6116: Move `groupby_ohlc` implementation to QC layer (#6132)
- REFACTOR-#6119: #6118: Add query compiler methods for groupby diff, pct_change (#6128)
- REFACTOR-#6151: Get slicer without constructing pandas dataframe. (#6161)
- REFACTOR-#6159: Stop defaulting at API layer for a few more methods (#6160)
- Update testing suite
- TEST-#5956: Verify dtypes equality in tests (#5955)
- TEST-#5980: use `cancel-in-progress` only for PRs (#5917)
- TEST-#5991: add simple tests for `read_orc`, `read_spss`, `json_normalize`, `read_xml`, `read_gbq` (#5983)
- TEST-#6004: add more '# pragma: no cover' for io functions (#6002)
- TEST-#6006: test `modin/test/test_partition_api.py` on unidist and dask (#6003)
- TEST-#6009: use `tmp_path` fixture instead of `ensure_clean_dir` as pandas 2.0.0 does (#6008)
- TEST-#6010: add some more test directories into 'setup.cfg' (#6007)
- TEST-#6020: exclude '_version.py' from coverage (#6019)
- TEST-#6027: Test installing Unidist via pip in a clean environment, as we do for Dask and Ray (#6025)
- TEST-#6030: test the function parameters of `Series.str` accessor for pandas equivalence (#6033)
- TEST-#6031: test the function parameters of 'Series.dt' accessor for pandas equivalence (#6197)
- TEST-#6076: Use 2 cores for experimental groupby on dask (#6077)
- TEST-#6198: add 'pragma: no cover' for unidist and ray utils that are used in remote context (#6059)
- TEST-#6260: Increase test_io timeout (#6207)
- Documentation improvements
- DOCS-#5449: Add page for Modin interoperability with select third party libraries (#5517)
- DOCS-#6021: Add a section regarding reshuffling groupby to Modin's documentation (#6051)
- DOCS-#6078: correct default values for MODIN_CPUS and MODIN_NPARTITIONS (#6177)
- DOCS-#6079: Make 'experimental/index.html' accessible through the readthedocs website (#6080)
- New Features
- FEAT-#5816: Implement '.split' method for axis partitions (#5856)
- FEAT-#5867: Introduce groupby implementation via range-partitioning (#5928)
- FEAT-#6014: Stop defaulting to pandas in groupby frontend for fill-like methods (#5996)
- FEAT-#6039: Implement `Series.str` through `CachedAccessor` (#6043)
- FEAT-#6040: implement 'Series.dt' through 'CachedAccessor' (#6056)
- FEAT-#6041: implement 'Series.cat' through 'CachedAccessor' (#6057)
- FEAT-#6144: Stop defaulting at API layer for a bunch of methods (#6145)
- FEAT-#6147: HDK: Arrow-based columns concatenation of frames with trivial index. (#6148)
- FEAT-#6153: Add API layer implementations for some stat methods. (#6156)
Contributors
@AndreyPavlenko
@RehanSD
@YarShev
@anmyachev
@arunjose696
@dchigarev
@devin-petersohn
@helmeleegy
@jkew
@labanyamukhopadhyay
@mdatre
@mvashishtha
@noloerino
@pyrito
@vnlitvinov
@naren-ponder