Releases: modin-project/modin
Modin 0.28.0
This release introduces the `modin.pandas.api.extensions` module, faster implementations for the `merge` and `groupby.rolling` (by default) functions, and new functions to work with Ray Dataset: `to/from_ray_dataset`.
It also includes some other new features, performance optimizations and bug fixes.
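To illustrate what an extensions module of this kind makes possible, here is a minimal sketch of the accessor-registration pattern in plain Python. It does not use Modin: `Frame`, `register_accessor`, and `StatsAccessor` are hypothetical names invented for this example, and Modin's actual registration functions may differ in name and behavior.

```python
class Frame:
    """Hypothetical stand-in for a dataframe class (not a Modin class)."""

    def __init__(self, rows):
        self.rows = list(rows)


def register_accessor(name, cls=Frame):
    """Return a decorator that exposes the decorated class on `cls` as attribute `name`."""

    def decorator(accessor_cls):
        # A property builds the accessor on access, bound to the instance it is read from.
        setattr(cls, name, property(lambda self: accessor_cls(self)))
        return accessor_cls

    return decorator


@register_accessor("stats")
class StatsAccessor:
    """Custom namespace added to Frame without editing Frame itself."""

    def __init__(self, frame):
        self._frame = frame

    def total(self, column):
        return sum(row[column] for row in self._frame.rows)


frame = Frame([{"a": 1, "b": 10}, {"a": 2, "b": 20}])
print(frame.stats.total("b"))  # 30
```

The point of the pattern is that third-party code can add a namespace such as `frame.stats` without subclassing or patching the library class directly.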
Key Features and Updates Since 0.27.0
- Stability and Bugfixes
- FIX-#6935: Fix `merge` when right operand is an empty dataframe (#6941)
- FIX-#6936: Fix `read_parquet` when dataset is created with `to_parquet` and `index=False` (#6937)
- FIX-#6944: Apply `isort` formatting for scripts from tutorials (#6945)
- FIX-#6946: Remove `needs: [lint-black-isort, ...]` (#6947)
- FIX-#6948: Fix `groupby` when Modin dataframe has several column partitions (#6951)
- FIX-#6952: Use `render_as_string` to get sqlalchemy engine url (#6953)
- FIX-#6968: Align API with pandas (#6969)
- FIX-#6974: Always use actual pandas version in `test_all_urls_exist` (#6975)
- FIX-#6982: Updating data in notebooks from yellow taxi to green taxi dataset (#6993)
- FIX-#6984: Ensure the results of inplace operations materialize (for tests) (#6985)
- Performance enhancements
- Refactor Codebase
- REFACTOR-#6856: Rename `read_pickle_distributed/to_pickle_distributed` to `read_pickle_glob/to_pickle_glob` (#6957)
- REFACTOR-#6939: Make `modin.pandas.DataFrame._to_pandas` a public method (#6940)
- REFACTOR-#6958: Remove `DataFrame.to_pickle_distributed` in favour of `DataFrame.modin.to_pickle_distributed` (#6959)
- REFACTOR-#7002: Get more information about exceptions from `eval_general` utility (#7003)
- REFACTOR-#7008: Remove `check_exception_type` argument of `eval_general` function (#7009)
- REFACTOR-#7013: Move `to_pandas` and `to_ray_dataset` into modin namespace (#7014)
- REFACTOR-#7017: Align `to_hdf` and `hist` signatures to pandas (#7018)
- Update testing suite
- Documentation improvements
- New Features
- FEAT-#3044: Create Extensions Module in Modin (#6961)
- FEAT-#4622: Unify data type of `log_level` in logging module (#6992)
- FEAT-#6913: Support sqlalchemy connectables in `read_sql` by getting connection url (#6956)
- FEAT-#6934: Support `include_groups=False` parameter in `groupby.apply()` (#6938)
- FEAT-#6942: Enable range-partitioning impl for `groupby().rolling()` by default (#6943)
- FEAT-#6965: Implement `.merge()` using range-partitioning implementation (#6966)
- FEAT-#6970: Implement `to/from_ray_dataset` functions (#6971)
- FEAT-#6983: Add Pluggable Documentation Module Support (#6986)
- FEAT-#7001: Do not force materialization in `MetaList.__getitem__()` (#7006)
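Several entries above mention a range-partitioning implementation. As a rough illustration of the general idea (not Modin's code), the sketch below routes rows into buckets by key range so that every key lands in exactly one bucket, then aggregates each bucket independently. The helper names and the fixed `pivots` are assumptions made for this example; a real system would typically choose split points by sampling the key column.

```python
from bisect import bisect_right
from collections import defaultdict


def range_partition(rows, key, pivots):
    """Split rows into len(pivots) + 1 buckets by comparing row[key] to sorted pivots."""
    buckets = [[] for _ in range(len(pivots) + 1)]
    for row in rows:
        # Equal keys always map to the same bucket index.
        buckets[bisect_right(pivots, row[key])].append(row)
    return buckets


def groupby_sum(rows, key, value):
    """Aggregate one bucket on its own; valid because no key spans two buckets."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


rows = [
    {"k": 5, "v": 1}, {"k": 1, "v": 2}, {"k": 9, "v": 3},
    {"k": 5, "v": 4}, {"k": 1, "v": 5}, {"k": 7, "v": 6},
]
pivots = [3, 6]  # hypothetical split points chosen by hand for this example

partitions = range_partition(rows, "k", pivots)
partials = [groupby_sum(part, "k", "v") for part in partitions]

# Key ranges are disjoint, so the final result is a plain union of the partial results.
result = {}
for partial in partials:
    result.update(partial)
print(result)
```

Because each bucket owns a disjoint key range, the per-bucket work could run in parallel with no cross-bucket reduction step afterwards.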
Contributors
@AndreyPavlenko
@Retribution98
@YarShev
@anmyachev
@arunjose696
@dchigarev
@sfc-gh-dpetersohn
@tochigiv
Modin 0.27.0
This release updates pandas to 2.2, introduces a lazy execution mode on Ray and new functions that support glob
syntax, and speeds up several more groupby cases. It also includes some other new features, performance
optimizations and many bug fixes.
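For readers unfamiliar with glob syntax, the following sketch shows how a single pattern can select several files at once. It uses only the Python standard library and does not call Modin; the file names and the `part_*.json` pattern are made up for the example.

```python
import glob
import json
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Write three small JSON shards plus one unrelated file.
    for i in range(3):
        with open(os.path.join(tmp, f"part_{i}.json"), "w") as f:
            json.dump([{"shard": i, "value": i * 10}], f)
    with open(os.path.join(tmp, "notes.txt"), "w") as f:
        f.write("not a data file")

    # "*" matches any run of characters, so only the three shards are selected.
    pattern = os.path.join(tmp, "part_*.json")
    matched = sorted(glob.glob(pattern))

    # Read every matched shard and concatenate the records.
    records = []
    for path in matched:
        with open(path) as f:
            records.extend(json.load(f))

print([os.path.basename(p) for p in matched])
print(records)
```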
Key Features and Updates Since 0.26.0
- Stability and Bugfixes
- FIX-#2405: Make sure named aggregation work for Series objects (#6892)
- FIX-#5925: Put a sorting-hack into groupby tests to hide #6875 bug (#6896)
- FIX-#6830: Pass AWS related env vars to mpiexec (#6867)
- FIX-#6840: Call `tolist` function in `DtypesDescriptor._merge_dtypes` (#6844)
- FIX-#6855: Make sure `read_parquet` works with integer columns for pyarrow engine (#6874)
- FIX-#6879: Convert the right DF to single partition before broadcasting in `query_compiler.merge` (#6880)
- FIX-#6881: Make sure `astype` works correctly with `int32` and `float32` dtypes (#6884)
- FIX-#6897: Preprocess kernel function that aligns columns in groupby (#6898)
- FIX-#6897: Revert unidist specific fix for groupby (#6902)
- FIX-#6899: Avoid sending lazy categorical proxies to workers (#6900)
- FIX-#6904: Align levels of partially known dtypes with MultiIndex labels (#6905)
- FIX-#6911: Remove unidist specific workaround in `.from_pandas()` (#6912)
- FIX-#6916: Unpin `pydantic` dependency (#6917)
- FIX-#6924: HDK: Use `JoinNode` instead of `MaskNode` for non-range row_position (#6926)
- Performance enhancements
- Refactor Codebase
- REFACTOR-#6293: Corrected `missmatch` to `mismatch` in `ErrorMessage.missmatch_with_pandas` method (#6901)
- REFACTOR-#6812: Remove `PyarrowOnRay` execution in favour of pyarrow-backed pandas dataframes (#6848)
- REFACTOR-#6833: Remove `SocksProxy`, `DoLogRpyc`, `DoTraceRpyc` outdated classes (#6834)
- REFACTOR-#6845: Fix import issues found by CodeQL (#6837)
- REFACTOR-#6852: Remove `OrderedDict` in favor of builtin `dict` (#6853)
- REFACTOR-#6858: Rename `_get_dimensions` and change arguments (#6859)
- REFACTOR-#6889: Define `__all__` in `modin.config.__init__.py` (#6886)
- REFACTOR-#6903: Remove duplicated definitions of `create_test_series` (#6910)
- REFACTOR-#6918: Docstring and type hints fixes (#6925)
- Update testing suite
- TEST-#6708: Create test files using `tmp_path` fixture (#6709)
- TEST-#6777: Make `to_csv` tests on Unidist more stable (for `test-all-unidist` CI job) (#6851)
- TEST-#6830: Use local s3 server instead of public s3 buckets (#6863)
- TEST-#6846: Skip unstable Unidist `to_csv` tests (#6847)
- TEST-#6868: Remove tests for `gs` remote protocol since we rely on `fsspec` (#6882)
- TEST-#6885: Switch to `black>=24.1.0` (#6887)
- TEST-#6893: Added support for `pytest 8.0.0` (#6894)
- TEST-#6920: Remove testing for Ray client (#6921)
- Documentation improvements
- New Features
- FEAT-#3450: Implement `read_json_glob` and `to_json_glob` (#6873)
- FEAT-#5809: New implementation of the Ray lazy execution queue (#6731)
- FEAT-#5925: Enable grouping on categoricals with range-partitioning impl (#6862)
- FEAT-#6382: Execute bitwise NOT (~) operations on HDK (#6383)
- FEAT-#6398: Improved performance of list-like objects insertion into HDK DataFrames (#6412)
- FEAT-#6830: Remove public s3 bucket reference (#6829)
- FEAT-#6831: Implement `read_parquet_glob` and `to_parquet_glob` (#6854)
- FEAT-#6832: Implement `read_xml_glob`, `to_xml_glob` (#6930)
- FEAT-#6835: Do not put binary functions to the Ray storage multiple times (#6836)
- FEAT-#6838: Prefer lazy execution for binary operations with scalar (#6839)
- FEAT-#6841: Fixing ray anti pattern with `.length()` and `.width()` being called in a loop (#6842)
- FEAT-#6849: Removing `to_pandas` call in `merge` and `join` functions (#6850)
- FEAT-#6883: Support grouping on a Series with range-partitioning impl (#6888)
- FEAT-#6906: Update to pandas `2.2.*` (#6907)
- FEAT-#6908: Remove the warning regarding engine initialization (#6909)
- FEAT-#6914: Add a config for setting a number of threads per Dask worker (#6915)
- FEAT-#6918: Add auto mode to the lazy execution. (#6919)
Contributors
@AndreyPavlenko
@YarShev
@anmyachev
@arunjose696
@dchigarev
@leshikus
@vedant
Modin 0.26.1
This release includes a fix for the `concat` function.
Key Features and Updates Since 0.26.0
- Stability and Bugfixes
- Update testing suite
- New Features
Contributors
Modin 0.26.0
This release introduces a new, faster implementation for `groupby.apply`, as well as many performance fixes related to improving asynchronous execution, a new namespace for accessing experimental functions (for example, `DataFrame.modin.to_pickle_distributed`), a fix for a long-standing problem with the use of Modin objects inside UDFs for `apply` and many other fixes.
Note: for Modin on MPI through unidist (as of unidist 0.5.0) to work fully when installed with pip, a working MPI implementation must be installed beforehand.
Key Features and Updates Since 0.25.0
- Stability and Bugfixes
- FIX-#4355: Fix rename algebraic operator to avoid copying (#4356)
- FIX-#6594: Fix usage of Modin objects inside UDFs for `apply` (#6673)
- FIX-#6664: Use `@lazy_metadata_decorator` for `PandasDataFrame.finalize` (#6720)
- FIX-#6684: Adapt to pandas 2.1.2 (#6685)
- FIX-#6687: Explicitly add users to CODEOWNERS (#6688)
- FIX-#6693: Revert creating an additional copy in `astype` op (#6692)
- FIX-#6703: Don't use `set_index_name(None)` (#6698)
- FIX-#6732: Fix inferring result dtypes for binary operations (#6737)
- FIX-#6745: Pin `unidist <= 0.4.1` (#6746)
- FIX-#6752: Preserve dtypes cache on `.insert()` (#6757)
- FIX-#6768: Make sure `to_numpy` use `**kwargs` after #6704 (#6769)
- FIX-#6771: Avoid `ValueError: assignment destination is read-only` for `cumsum` (#6772)
- FIX-#6773: Make sure `_to_pandas` return mutable pandas objects (#6775)
- FIX-#6774: Modify conditions for `loc` to get similar behavior to pandas (#6798)
- FIX-#6778: Read parquet files without file extensions using fastparquet (#6790)
- FIX-#6779: Pass only one indexer into `Series.__getitem__` (#6780)
- FIX-#6781: Use `pandas.api.types.pandas_dtype` to convert to valid numpy and pandas only dtypes (#6788)
- FIX-#6782: Filter pandas warnings when precomputing dtypes (#6811)
- FIX-#6786: Properly d2p for cross `DataFrame.join` (#6787)
- FIX-#6791: Pass additional environment variables to MPI workers (#6792)
- FIX-#6799: Allow creating incomplete `ModinIndex` objects (#6800)
- FIX-#6822: Do not propagate `NotImplementedError` to a user on a `set_columns()` with dupl labels (#6823)
- FIX-#6824: Invalidate `ModinIndex._lengths_id` on empty partitions filtering (#6825)
- Performance enhancements
- PERF-#4777: Don't use `copy=True` parameter for `concat` calls inside `to_pandas` (#4778)
- PERF-#4804: Preserve lengths/widths caches in `broadcast_apply_full_axis` (#6760)
- PERF-#6666: Avoid internal `reset_index` for left `merge` (#6665)
- PERF-#6668: Use `copy=False` for internal usage of `set_axis` (#6667)
- PERF-#6669: Avoid one extra `copy()` call for `Series.reset_index` (#6670)
- PERF-#6671: Don't iterate over the result of the `Series.tolist` function (#6672)
- PERF-#6690: Use `sync_labels=False` for `rank` function (#6689)
- PERF-#6694: Use `lazy_map_partitions()` for dtypes conversion (#6695)
- PERF-#6696: Use cached dtypes in fillna when possible. (#6697)
- PERF-#6701: Use `get_axis` internal function instead of `axes` property (#6700)
- PERF-#6702: Don't materialize axes when calling `to_numpy` (#6699)
- PERF-#6710: Don't materialize index in `_groupby_shuffle` internal function (#6707)
- PERF-#6712: Copy `_shape_hint` in `query_compiler.copy` function (#6713)
- PERF-#6714: Assign `qc._shape_hint = column` in `columnarize` function (#6715)
- PERF-#6716: Avoid materializing axes in `_filter_empties` (#6717)
- PERF-#6718: Use `_get_axis_lengths` function instead of `_axes_lengths` property (#6719)
- PERF-#6721: Use `keep_partitioning=True` for `duplicated` implementation (#6722)
- PERF-#6723: Use `_shape_hint = "column"` in `DataFrame.squeeze` (#6724)
- PERF-#6727: Remove remaining `result.name = None` in groupby code (#6726)
- PERF-#6728: In the case of narrow dataframes, it is cheaper to convert partitions to numpy in the main process. (#6704)
- PERF-#6747: Preserve columns/dtypes cache when merging on a single index level (#6748)
- PERF-#6749: Preserve partial dtype for the result of `reset_index()` (#6751)
- PERF-#6753: Preserve dtypes cache on `.__setitem__()` (#6758)
- PERF-#6754: Merge partial dtype caches on `.concat(axis=0)` (#6759)
- PERF-#6756: Don't materialize index when sorting (#6755)
- PERF-#6762: Carry dtypes information in lazy indices (#6763)
- Refactor Codebase
- REFACTOR-#0000: Cleanup one todo and flake8 issues in modin/utils.py (#6826)
- REFACTOR-#6739: Use `execution_wrapper` instead of directly addressing `DaskWrapper` (#6740)
- REFACTOR-#6805: Move all IO functions to `modin.pandas.io` module (#6806)
- REFACTOR-#6807: Rename experimental groupby and experimental numpy variables (#6809)
- REFACTOR-#6815: Move experimental parsers into `modin.experimental` folder (#6813)
- REFACTOR-#6818: Don't implicitly enable experimental mode (#6817)
- Update testing suite
- Documentation improvements
- New Features
- FEAT-#5836: Introduce 'partial' dtypes cache (#6663)
- FEAT-#6735: Make Modin on MPI through unidist component more obvious (#6736)
- FEAT-#6767: Provide the ability to use experimental functionality when experimental mode is not enabled globally via an environment variable (#6764)
- FEAT-#6784: Add d2p implementations for `DataFrame.__rdivmod__/__divmod__` (#6785)
- FEAT-#6801: Add `modin.pandas.error` module (#6802)
- FEAT-#6803: Enable range-partitioning impl for `groupby.apply()` by default (#6804)
- FEAT-#6820: Make sure IO functions works with path-like filenames (#6821)
Contributors
@AndreyPavlenko
@JignyasAnand
@RehanSD
@YarShev
@anmyachev
@devin-petersohn
@dchigarev
@mvashishtha
@seydar
Modin 0.24.1.post0
Hotfix for Unidist.
Key Features and Updates Since 0.24.1
- Stability and Bugfixes
Note: broken pip wheel, use https://github.com/modin-project/modin/releases/tag/0.24.1.post1 instead
Contributors
Modin 0.25.1
Hotfix for Unidist.
Key Features and Updates Since 0.25.0
- Stability and Bugfixes
Contributors
Modin 0.23.1.post0
The main purpose of this release is to port as many fixes as possible to the latest version, which supports Python 3.8.
Key Features and Updates Since 0.23.1
- Stability and Bugfixes
- FIX-#0000: Pin `unidist<=0.4.1`
- FIX-#4347: `read_excel`: defaults to pandas for unsupported types of `io` (#6462)
- FIX-#4507: Do not call `ray.get()` inside of the kernel executing call queues (#6633)
- FIX-#4687: Change `Column.null_count` to return a built-in `int` instead of NumPy scalar (#6526)
- FIX-#5164: Fix `unwrap_partitions` for virtual partitions when `axis=None` (#6560)
- FIX-#5536: Remove branch disabling `__getattribute__` for experimental mode (#6529)
- FIX-#6465: Fix `groupby.apply()` for UDFs that change the output's shape (#6506)
- FIX-#6479: HDK CalciteBuilder: Do not call `is_bool_dtype()` for categorical (#6480)
- FIX-#6509: Fix `reshuffling` in case of a string key (#6510)
- FIX-#6514: `test_sort_cols_str` from `test_dataframe.py` crashed on HDK 0.7.0 and python 3.9 (#6515)
- FIX-#6516: HDK: `test_dataframe.py` is crashed if Calcite is disabled (#6517)
- FIX-#6518: Fix interchange protocol for string columns (#6523)
- FIX-#6519: Consider `botocore` as an optional dependency (#6521)
- FIX-#6532: Fix `read_excel` so that it doesn't use `rich_text` param for old `openpyxl` (#6534)
- FIX-#6535: Pin `s3fs<2023.9.0` (#6536)
- FIX-#6537: Unpin `s3fs<2023.9.0` (#6544)
- FIX-#6541: Fix `ValueError: buffer source array is read-only` for `iloc` (#6538)
- FIX-#6553: Fix `read_csv` with `iterator=True` (#6554)
- FIX-#6572: Execute simple queries row-wise in pandas backend (#6575)
- FIX-#6594: Fix usage of Modin objects inside UDFs for `apply` (#6673)
- FIX-#6600: Fix usage of list of UDF functions in `Series.groupby.agg` (#6613)
- FIX-#6601: `sort_values` shouldn't affect source dataframe/series (#6603)
- FIX-#6602: Refactor `join` to avoid `distributing a dict object` warning (#6612)
- FIX-#6607: Fix incorrect cache after `.sort_values()` (#6608)
- FIX-#6628: Allow groupby diff for dates (#6631)
- FIX-#6632: Return Series instead of Dataframe for `groupby.apply` in case of experimental groupby (#6649)
- FIX-#6635: HDK: `read_csv`: treat object dtype as string (#6636)
- FIX-#6637: Fix `skiprows` parameter usage for `read_excel` (#6638)
- FIX-#6642: Fix `modin.numpy.array.sum` on HDK (#6643)
- FIX-#6647: Added init file to make `modin/experimental/sql/hdk/query.py` part of modin package (#6646)
- FIX-#6651: Make sure `Series.between` works correctly (#6656)
- FIX-#6680: Specify `navigation_with_keys=True` to fix docs build (#6681)
Contributors
@AndreyPavlenko
@Egor-Krivov
@Garra1980
@RehanSD
@anmyachev
@dchigarev
@vnlitvinov
Modin 0.25.0
This release introduces the `modin.utils.execute` function to improve the benchmarking experience and includes the new HDK version 0.9.
It also includes performance optimizations for `sort_values`, `value_counts`, 2D setitem and several others, as well as many bug fixes.
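Why an explicit execution barrier helps benchmarking can be shown with a toy model of asynchronous work. The sketch below uses only the standard library and is not Modin's code: the `execute` helper defined here is a hypothetical stand-in that simply waits for pending tasks, and `slow_square` is a placeholder for real work.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def slow_square(x):
    time.sleep(0.05)  # stand-in for real work on one partition
    return x * x


def execute(futures):
    """Hypothetical barrier: block until every pending task has finished."""
    wait(futures)


with ThreadPoolExecutor(max_workers=4) as pool:
    start = time.perf_counter()
    futures = [pool.submit(slow_square, i) for i in range(4)]
    submit_time = time.perf_counter() - start  # measures scheduling only

    execute(futures)
    total_time = time.perf_counter() - start  # includes the actual computation

results = [f.result() for f in futures]
print(f"submit: {submit_time:.4f}s, total: {total_time:.4f}s")
```

Stopping the timer right after submission would report only the scheduling cost, which is why a benchmark of asynchronous code needs to force completion before reading the clock.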
Key Features and Updates Since 0.24.0
- Stability and Bugfixes
- FIX-#4507: Do not call `ray.get()` inside of the kernel executing call queues (#6633)
- FIX-#6585: Avoid `FutureWarning`s in `rolling` unless necessary (#6586)
- FIX-#6600: Fix usage of list of UDF functions in `Series.groupby.agg` (#6613)
- FIX-#6602: Refactor `join` to avoid `distributing a dict object` warning (#6612)
- FIX-#6604: HDK: Added support for list to `DataFrame.agg()` (#6606)
- FIX-#6607: Fix incorrect cache after `.sort_values()` (#6608)
- FIX-#6624: Add `FutureWarning`s for `first/last/bool` (#6625)
- FIX-#6628: Allow `groupby.diff()` for dates (#6631)
- FIX-#6632: Return Series instead of Dataframe for `groupby.apply` in case of experimental groupby (#6649)
- FIX-#6635: HDK: `read_csv()`: treat object dtype as string (#6636)
- FIX-#6637: Fix `skiprows` parameter usage for `read_excel` (#6638)
- FIX-#6642: Fix `modin.numpy.array.sum` on HDK (#6643)
- FIX-#6647: Added init file to make `modin/experimental/sql/hdk/query.py` part of modin package (#6646)
- FIX-#6651: Make sure `Series.between` works correctly (#6656)
- FIX-#6680: Specify `navigation_with_keys=True` to fix docs build (#6681)
- Performance enhancements
- PERF-#2813: Distributed `from_pandas()` for numerical data in Ray (#6640)
- PERF-#5533: Improved `sort_values` by reducing the number of partitions (#6589)
- PERF-#6362: Implement 2D setitem without to-pandas conversion (#6618)
- PERF-#6614: HDK: Use `MODIN_CPUS` instead of `os.cpu_count()` for the fragment size calculation (#6615)
- PERF-#6629: HDK: Avoid `LazyProxyCategoricalDtype` materialization on `merge` (#6630)
- PERF-#6645: Avoid label synchronization for `dot` operation (#6644)
- PERF-#6653: `value_counts()`: Eliminate redundant sorting. (#6654)
- PERF-#6661: Do not convert columns dtypes if the new dtypes are the same (#6662)
- Refactor Codebase
- Update testing suite
- Documentation improvements
- New Features
Contributors
@AndreyPavlenko
@Egor-Krivov
@Garra1980
@YarShev
@anmyachev
@dchigarev
Modin 0.24.1
Hotfix for `sort_values`.
Key Features and Updates Since 0.24.0
- Stability and Bugfixes
Contributors
Modin 0.24.0
This release upgrades pandas to 2.1, raises the minimum supported Python version to 3.9, introduces ModinDataLoader to improve interaction with PyTorch, fixes several interchange protocol issues that caused known incompatibilities with Plotly, Seaborn and Altair, and includes the new HDK version 0.8. It also includes some other new features and many bug fixes.
Key Features and Updates Since 0.23.0
- Stability and Bugfixes
- FIX-#0000: Don't test experimental xgboost with Ray nightly build (#6424)
- FIX-#0000: Fix xgboost tests with `ray>2.6.0` (#6425)
- FIX-#1930: Fix one of the cases of heterogeneous data for `read_csv` (#5507)
- FIX-#4347: `read_excel`: defaults to pandas for unsupported types of 'io' (#6462)
- FIX-#4580: Fix access by row label in `query` and `eval` (#6488)
- FIX-#4687: Change `Column.null_count` to return a built-in `int` instead of NumPy scalar (#6526)
- FIX-#5164: Fix `unwrap_partitions` for virtual partitions when `axis=None` (#6560)
- FIX-#5536: Remove branch disabling `__getattribute__` for experimental mode (#6529)
- FIX-#5627: Stop checking `temp_df.dtype == 'category'` (#6360)
- FIX-#5972: Compute correct dtype for `Series.str.find/index/rfind/rindex` (#6426)
- FIX-#6219: Don't default to pandas for `copy` on empty DataFrame/Series objects (#6371)
- FIX-#6299: `__array__` method always returns array of vanilla numpy (#6300)
- FIX-#6334: Improve error message if HDK isn't installed in the environment (#6358)
- FIX-#6347: Remove 'modin in the cloud' experimental feature (#6408)
- FIX-#6364: Make reshuffling work with `BenchmarkMode.put(True)` (#6365)
- FIX-#6367: Enable support for `groupby.size()` in reshuffling groupby (#6370)
- FIX-#6368: Apply deferred indices before map-reduce groupby (#6369)
- FIX-#6372: Precompute dtypes for `sum` operation (#6421)
- FIX-#6375: Don't initialize engines at import time (#6374)
- FIX-#6386: Don't make unnecessary `astype` calls for `modin.array.sum` op (#6395)
- FIX-#6392: Compute dtypes for the `DataFrame.mean()` result (#6520)
- FIX-#6394: Preserve dtypes for `__setitem__` op when using not hashable key (#6547)
- FIX-#6396: Set `__factory` to `None` in case of any problems during initialization (#6397)
- FIX-#6402: Allow datetime and timedelta types in `diff` (#6403)
- FIX-#6405: Apply `disable_logging` to `__getattr__` (#6406)
- FIX-#6410: Add a link to @modin_project twitter (#6411)
- FIX-#6414: Fix `read_feather` with `pyarrow<11.0` (#6415)
- FIX-#6427: Make code compatible with `flake8==6.1.0` (#6428)
- FIX-#6429: Exclude `pymssql==2.2.8` from environments (#6430)
- FIX-#6436: Support `~` in paths in IO functions correctly (#6448)
- FIX-#6443: Cast boolean columns before `sum|mean|median` groupby aggregations (#6444)
- FIX-#6446: Stop requiring modin-xgboost approval (#6447)
- FIX-#6456: Create fake xgboost module for building docs (#6457)
- FIX-#6459: Support `fastparquet>=2023.1.0` (#6458)
- FIX-#6465: Fix `groupby.apply()` for UDFs that change the output's shape (#6506)
- FIX-#6479: HDK CalciteBuilder: Do not call `is_bool_dtype()` for categorical (#6480)
- FIX-#6483: Default to pandas for `__array_ufunc__` (#6486)
- FIX-#6509: Fix 'reshuffling' in case of a string key (#6510)
- FIX-#6514: `test_sort_cols_str` from test_dataframe.py crashed on HDK 0.7.0 and python 3.9 (#6515)
- FIX-#6516: HDK: test_dataframe.py is crashed if Calcite is disabled (#6517)
- FIX-#6518: Fix interchange protocol for string columns (#6523)
- FIX-#6519: Consider `botocore` as an optional dependency (#6521)
- FIX-#6532: Fix `read_excel` so that it doesn't use `rich_text` param for old `openpyxl` (#6534)
- FIX-#6535: Pin `s3fs<2023.9.0` (#6536)
- FIX-#6537: Unpin `s3fs<2023.9.0` (#6544)
- FIX-#6540: Correct handling of range indices and index names in `read_parquet` (#6545)
- FIX-#6541: Fix `ValueError: buffer source array is read-only` for `iloc` (#6538)
- FIX-#6549: Remove usage of `dfsql` module (#6550)
- FIX-#6552: Avoid `FutureWarning`s in `groupby` unless necessary (#6595)
- FIX-#6553: Fix `read_csv` with `iterator=True` (#6554)
- FIX-#6558: Normalize the number of partitions after `.read_parquet()` (#6559)
- FIX-#6561: Remove `MODIN_OMNISCI_*` env vars in favor of `MODIN_HDK_*` (#6562)
- FIX-#6565: Don't implement `map` function via `applymap` (#6566)
- FIX-#6572: Execute simple queries row-wise in pandas backend (#6575)
- FIX-#6582: Avoid `FutureWarning`s in `bfill/backfill/ffill/pad` unless necessary (#6599)
- FIX-#6587: Use different env files for unidist engine for windows and linux (#6588)
- FIX-#6601: `sort_values` shouldn't affect source dataframe/series (#6603)
- Performance enhancements
- PERF-#6332: Don't materialize axes in `concat` operation (#6381)
- PERF-#6373: Preserve dtypes cache for `_repartition` (#6376)
- PERF-#6378: Use `numpy.array` operations in internals of `iloc/loc` operation (#6393)
- PERF-#6388: Avoid masking in `__getitem__` when the number of rows to be taken > 90% (#6423)
- PERF-#6398: Improved performance of list-like objects insertion into DataFrames (#6476)
- PERF-#6433: Implement `.dropna()` using map-reduce pattern (#6472)
- PERF-#6437: Preserve dtypes for `reindex` (#6438)
- PERF-#6464: Improve reshuffling for multi-column groupby in low-cardinality cases (#6533)
- PERF-#6466: Verify indices equality without triggering any computations (#6491)
- PERF-#6478: Do not propagate new columns if they're identical to the previous ones (#6481)
- PERF-#6524: Add a 'column' shape hint for the results of `qc.to_datetime()` (#6525)
- PERF-#6583: Remove redundant index reassignment in `query()` (#6584)
- PERF-#6590: Chunk axes independently in `.from_pandas()` (#6591)
- Refactor Codebase
- REFACTOR-#4278: Remove unused arguments from `BasePandasDataset.apply` (#6451)
- REFACTOR-#4902: Use `isort` (#6551)
- REFACTOR-#6470: Remove `Patcher` internal class (#6471)
- REFACTOR-#6489: Enforce API-layer bool/integer argument for `__invert__` (#6490)
- REFACTOR-#6569: Use `contextlib.nullcontext` instead of custom one (#6570)
- REFACTOR-#6576: Don't use deprecated `is_int64_dtype` and `is_period_dtype` function (#6577)
- Update testing suite
- TEST-#0000: Download ray wheel for python 3.9 (#6513)
- TEST-#2008: Reduce runtime of CI checks a lot (#6356)
- TEST-#4270: Revert disabling `time_groupby_agg_nunique` ASV bench (#6564)
- TEST-#4348: Use `psycopg2-binary` for testing and developing purpose (#6573)
- TEST-#4477: Add tests for `df.eval` with scalar and `groupby.transform` call in the expr (#6546)
- TEST-#4643: Add interchange test for empty dataframe (#6454)
- TEST-#5008: Set benchmark mode within unit test instead of with environment variable (#6359)
- TEST-#6349: Update minimum versions for test dependencies in general environments (#6350)
- TEST-#6439: Create HDK environment manually for ASV (#6431)
- TEST-#6449: Run tests in test_dmatrix.py only for Ray engine (#6450)
- TEST-#6460: Don't use `repr` to force materialization (#6461)
- TEST-#6469: Pin `numexpr<2.8.5` (#6474)
- TEST-#6477: Update ASV to 0.5.1 (#6432)
- TEST-#6497: Remove `boto3` from environments to speedup creation (#6496)
- TEST-#6505: Update python version for ASV benchmarks on HDK (#6504)
- TEST-#6593: Adapt tests for pandas 2.1.1 (#6592)
- Documentation improvements
- New Features
- FEAT-#1611: Add some datetime extraction functions for HDK (#6568)
- FEAT-#5645: Add support for modin's numpy array in `dataframe.insert` function (#6400)
- FEAT-#6139: `DataLoader` interplay. (#6140)
- FEAT-#6377: HDK: Do not keep reference to arrow table imported to HDK (#6380)
- FEAT-#6389: Make sure git ignores logs in `.modin` folder (#6390)
- FEAT-#6401: Support compression param and more file extensions in `to_parquet` (#6404)
- FEAT-#6407: Update minimum dependency versions (#6342)
- FEAT-#6417: Add support for filters to `read_parquet` (#6442)
- FEAT-#6434: HDK: Do not convert dictionary columns to string when importing arrow tables (#6435)
- FEAT-#6440: Use different HDK parameters for different queries (#6441)
- FEAT-#6484: HDK: Add support for `nlargest/nsmallest` groupby aggregation (#6485)
- FEAT-#6500: HDK: Add support for `datetime64` to `int64` cast (#6501)
- FEAT-#6502: HDK: Add `enable_multifrag_execution_result=1` HDK launch parameter (#6503)
- FEAT-#6511: Update the minimum supported python version up to 3.9 (#6508)
- FEAT-#6522: Update to pandas 2.1.0 (#6512)
- FEAT-#6527: HDK: Add support for the quantile group by aggregation. (#6528)
- FEAT-#6597: Bump pyhdk version to 0.8 (#6598)
Contributors
@AndreyPavlenko
@RehanSD
@YarShev
@anmyachev
@dchigarev
@mvashishtha
@vnlitvinov
@abykovsk
@zmbc
@noloerino
@rentruewang