Releases: mosaicml/streaming
v0.10.0
🚀 Streaming v0.10.0
Streaming v0.10.0 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.10.0
Improvements
1. Reusable cloud download clients (#817)
- Streaming now reuses cloud download clients when downloading shard files instead of creating a new client for each download.
- This avoids run failures that sometimes occur with too many open sockets or excessive cloud authentication requests.
2. py1b shuffle algorithm deprecation (#837)
- The py1b shuffle algorithm has now been deprecated. Please use the improved py1e (default) or the py1br shuffle algorithms instead.
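The client reuse described in item 1 can be pictured with a small cache keyed by URL scheme. This is only a sketch of the general pattern, not the code from #817: FakeClient and ClientCache are hypothetical names, and the stand-in client does no real network I/O.

```python
from typing import Callable, Dict
from urllib.parse import urlparse


class FakeClient:
    """Hypothetical stand-in for a cloud SDK client; counts how many were built."""

    created = 0

    def __init__(self, scheme: str) -> None:
        FakeClient.created += 1
        self.scheme = scheme

    def download(self, remote: str, local: str) -> str:
        # A real client would fetch bytes here; the sketch only reports intent.
        return f"{self.scheme}: {remote} -> {local}"


class ClientCache:
    """Builds one client per URL scheme and hands the same one back afterwards."""

    def __init__(self, factory: Callable[[str], FakeClient]) -> None:
        self._factory = factory
        self._clients: Dict[str, FakeClient] = {}

    def get(self, remote: str) -> FakeClient:
        scheme = urlparse(remote).scheme
        if scheme not in self._clients:
            self._clients[scheme] = self._factory(scheme)
        return self._clients[scheme]


cache = ClientCache(FakeClient)
for i in range(3):
    remote = f"s3://bucket/shard.{i:05}.mds"
    cache.get(remote).download(remote, f"/tmp/shard.{i:05}.mds")
```

Three shard downloads share a single client here, so the number of open connections and authentication handshakes stays constant instead of growing with the number of shards.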
What's Changed
- Update FAQs to indicate wrapping not supported by @milocress in #822
- refactored the download module to have reusable clients by @ethantang-db in #817
- Update pytest-cov requirement from <6,>=4 to >=4,<7 by @dependabot in #821
- Consistent errors for unused streams in batching methods by @snarayan21 in #826
- Update setuptools requirement from <68.0.0 to <76.0.0 by @dependabot in #825
- fix f string by @XiaohanZhangCMU in #829
- Bump fastapi from 0.115.4 to 0.115.5 by @dependabot in #830
- Bump uvicorn from 0.32.0 to 0.32.1 by @dependabot in #834
- Bump pydantic from 2.9.2 to 2.10.1 by @dependabot in #833
- Bump pytest from 8.3.3 to 8.3.4 by @dependabot in #836
- Bump pydantic from 2.10.1 to 2.10.2 by @dependabot in #835
- Version bump to 0.11.0.dev0, including deprecations by @snarayan21 in #837
New Contributors
- @ethantang-db made their first contribution in #817
Full Changelog: v0.9.1...v0.10.0
v0.9.1
🚀 Streaming v0.9.1
Streaming v0.9.1 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.9.1
What's New
1. Streaming is added to Gurubase (#805)
- Streaming now has an AI assistant available to help users with their questions! Try out Streaming Guru, which uses an LLM together with data from this repo and the docs to answer questions.
Improvements
1. Permission Issue Resolution (#813)
- Resolved read permission issues occurring when shared memory files are created in shared computing environments. We added retry conditions to allow the creation of new shared memory files upon encountering permission errors.
- Prefix Integrity for Shared Memory Files: When creating shared memory files, both LOCALS and FILELOCKS are now validated to ensure no overlap with existing files, and they are matched with consistent prefix identifiers.
- Handling Non-Normal Program Exits: Enhanced cleanup procedures to address cases where non-normal program exits left some shared memory files uncleared. All files in SHM_TO_CLEAN are now checked to prevent duplicates.
These changes improve shared memory management and reliability in shared environments.
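The retry-on-permission-error idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from #813: create_with_retry, the injected create callable, and the "NNNNNN_locals" naming scheme are all hypothetical.

```python
from typing import Callable, Set


def create_with_retry(create: Callable[[str], object],
                      taken: Set[str],
                      max_attempts: int = 100) -> str:
    """Return the first prefixed name whose shared memory file could be created."""
    for prefix_int in range(max_attempts):
        name = f"{prefix_int:06}_locals"  # Hypothetical naming scheme.
        if name in taken:
            continue  # Name already in use: move on to the next prefix.
        try:
            create(name)
        except PermissionError:
            continue  # File belongs to another user: try a fresh prefix.
        taken.add(name)
        return name
    raise RuntimeError("no usable shared memory prefix found")


def fake_create(name: str) -> object:
    # Simulates a file left behind by another user in a shared environment.
    if name == "000001_locals":
        raise PermissionError(name)
    return object()


taken = {"000000_locals"}
chosen = create_with_retry(fake_create, taken)
```

In this run the first prefix is skipped because it already exists, the second fails with a permission error, and the third is used.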
2. Fix Shard Eviction Hanging (#795)
- Changed the search for coldest shard to avoid looping over remote shards by considering local shards only as possible candidates for eviction.
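A minimal sketch of restricting the eviction search to local shards, assuming a hypothetical Shard record with an is_local flag and a last_access timestamp (the actual bookkeeping in Streaming may look different):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Shard:
    shard_id: int
    is_local: bool      # True if the shard file is present on local disk.
    last_access: float  # Timestamp of the most recent read.


def coldest_local_shard(shards: List[Shard]) -> Optional[Shard]:
    """Pick the least recently used shard among those stored locally."""
    local = [s for s in shards if s.is_local]
    return min(local, key=lambda s: s.last_access) if local else None


shards = [
    Shard(0, is_local=False, last_access=1.0),  # Remote: never a candidate.
    Shard(1, is_local=True, last_access=5.0),
    Shard(2, is_local=True, last_access=3.0),
]
victim = coldest_local_shard(shards)
```

Shard 0 has the oldest timestamp but is remote, so it is ignored and shard 2 is selected.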
What's Changed
- Bump pydantic from 2.9.1 to 2.9.2 by @dependabot in #785
- Bump fastapi from 0.114.2 to 0.115.0 by @dependabot in #786
- Bump uvicorn from 0.30.6 to 0.31.0 by @dependabot in #793
- Fixed broken links in README.md by @LukaszSztukiewicz in #794
- Shard evict fix by @snarayan21 in #795
- Update huggingface-hub requirement from <0.25,>=0.23.4 to >=0.23.4,<0.26 by @dependabot in #787
- Fix dataset.size() typo in docs by @snarayan21 in #798
- Warning -> info about defaults from v0.7.0 by @snarayan21 in #799
- Bump uvicorn from 0.31.0 to 0.31.1 by @dependabot in #803
- Bump fastapi from 0.115.0 to 0.115.2 by @dependabot in #804
- Introducing Streaming Guru on Gurubase.io by @kursataktas in #805
- Add better error message for shared prefix by @XiaohanZhangCMU in #806
- Bump uvicorn from 0.31.1 to 0.32.0 by @dependabot in #809
- Bump pytest-split from 0.9.0 to 0.10.0 by @dependabot in #810
- Fix logo png by @XiaohanZhangCMU in #808
- Update huggingface-hub requirement from <0.26,>=0.23.4 to >=0.23.4,<0.27 by @dependabot in #814
- Bump fastapi from 0.115.2 to 0.115.4 by @dependabot in #815
- Fix shared memory permission issue in a shared pod environment by @XiaohanZhangCMU in #813
New Contributors
- @LukaszSztukiewicz made their first contribution in #794
- @kursataktas made their first contribution in #805
Full Changelog: v0.9.0...v0.9.1
v0.9.0
🚀 Streaming v0.9.0
Streaming v0.9.0 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.9.0
What's New
1. Improved compatibility for ndarray and json types (#776, #777)
Columns containing a map type can now be converted to JSON in an MDS file when the column's type is specified as 'json', and the JSON encoder now handles ndarray types.
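One common way to let a JSON encoder accept array values is to override json.JSONEncoder.default. The sketch below is not the implementation from #777; ArrayFriendlyEncoder is a hypothetical name, and it uses the standard library's array type as a stand-in because it exposes the same tolist() method as a NumPy ndarray.

```python
import json
from array import array


class ArrayFriendlyEncoder(json.JSONEncoder):
    """Serializes any object exposing tolist() (for example an ndarray) as a list."""

    def default(self, o):
        tolist = getattr(o, "tolist", None)
        if callable(tolist):
            return tolist()
        # Fall back to the base class, which raises TypeError for unknown types.
        return super().default(o)


sample = {"tokens": array("i", [1, 2, 3]), "meta": {"lang": "en"}}
encoded = json.dumps(sample, cls=ArrayFriendlyEncoder)
```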
What's Changed
- Bump fastapi from 0.112.1 to 0.112.2 by @dependabot in #768
- Bump ci testing by @snarayan21 in #770
- Bump jupyter from 1.0.0 to 1.1.1 by @dependabot in #772
- Bump fastapi from 0.112.2 to 0.114.0 by @dependabot in #779
- Bump pydantic from 2.8.2 to 2.9.1 by @dependabot in #778
- Allow JSON encoder to handle ndarray by @srowen in #777
- Add MapType as JSON-compatible by @srowen in #776
- Bump fastapi from 0.114.0 to 0.114.2 by @dependabot in #783
- Update datasets requirement from <3,>=2.4.0 to >=2.4.0,<4 by @dependabot in #784
- Bump pytest from 8.3.2 to 8.3.3 by @dependabot in #782
- Bump main branch to 0.10.0.dev0 by @dakinggg in #790
Full Changelog: v0.8.1...v0.9.0
v0.8.1
🚀 Streaming v0.8.1
Streaming v0.8.1 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.8.1
🔧 Improvements
Dataloader hanging between epochs has now been resolved! We've seen training time improvements of up to 40% for some many-epoch training jobs. If this was impacting your runs and has now been fixed, please let us know!
- Fix dataloader hang at the end of an epoch by @XiaohanZhangCMU in #741
- Add default compression, and warning about local paths to dataframe_to_mds by @srowen in #748
- Throw exception when event.is_set() after write()s by @srowen in #754
🐛 Bug Fixes
- Ensure deterministic sample order between epochs when shuffle=False by @snarayan21 in #750
What's Changed
- Make Pytest log in color in Github Action by @eitanturok in #739
- fix azure container name and blob name in download_from_azure by @jaehwana2z in #733
- Bump uvicorn from 0.30.3 to 0.30.5 by @dependabot in #743
- Update huggingface-hub requirement from <0.24,>=0.23.4 to >=0.23.4,<0.25 by @dependabot in #729
- Bump fastapi from 0.111.1 to 0.112.0 by @dependabot in #744
- Bump ci-testing to v0.1.0 by @snarayan21 in #745
- Patching conf.py due to Sphinx deprecating config manipulation by @snarayan21 in #746
- Bump ci-testing to v0.1.2 by @snarayan21 in #747
- Type hints conformant with pep 585 by @snarayan21 in #752
- Ruff rule to remove unused imports by @snarayan21 in #756
- Fix linting for numpy 2.1.0 by @snarayan21 in #764
- Bump fastapi from 0.112.0 to 0.112.1 by @dependabot in #760
- Bump uvicorn from 0.30.5 to 0.30.6 by @dependabot in #762
- Version 0.8.1 bump! by @snarayan21 in #766
New Contributors
- @eitanturok made their first contribution in #739
- @jaehwana2z made their first contribution in #733
- @srowen made their first contribution in #748
Full Changelog: v0.8.0...v0.8.1
v0.8.0
✨ What's New ✨
1. HF File System Streaming (#711)
Streaming now supports streaming data from the HF file system! This adds another popular backend as an option to host your data.
What's Changed
- Bump fastapi from 0.110.2 to 0.111.0 by @dependabot in #670
- Fix: having zero bytes files after converting spark dataframe to MDS saved on dbfs:/Volumes by @XiaohanZhangCMU in #668
- Ensure shards cannot be larger than 4GB by @snarayan21 in #672
- Helpful error on py1e for improperly written datasets by @snarayan21 in #673
- Bump pytest from 8.2.0 to 8.2.1 by @dependabot in #680
- Update platform references by @aspfohl in #675
- Update CODEOWNERS by @karan6181 in #681
- Fix batch_size typo for Stream object in docs by @snarayan21 in #682
- Bump databricks-sdk from 0.27.0 to 0.27.1 by @dependabot in #679
- Improve local temp directory error when only remote is specified by @snarayan21 in #683
- Fix node calculation in replication for World object by @snarayan21 in #685
- Warning condition changed for Sequence Parallelism by @XiaohanZhangCMU in #688
- Bump pydantic from 2.7.1 to 2.7.2 by @dependabot in #692
- Bump uvicorn from 0.29.0 to 0.30.1 by @dependabot in #691
- Make sure epoch_size is an int by @snarayan21 in #693
- Bump databricks-sdk from 0.27.1 to 0.28.0 by @dependabot in #687
- Bump pytest from 8.2.1 to 8.2.2 by @dependabot in #697
- fix: expand user path for Writer's output directory. by @huxuan in #694
- Bump pydantic from 2.7.2 to 2.7.3 by @dependabot in #696
- Fix edge cases with scalar or empty numpy array encoding by @snarayan21 in #702
- Raise IndexError in Spanner object instead of ValueError by @snarayan21 in #701
- Fix linting issues with numpy 2 by @snarayan21 in #705
- Bump pydantic from 2.7.3 to 2.7.4 by @dependabot in #704
- Enable correct resumption from the end of an epoch by @snarayan21 in #700
- Fix drop_first checking in partitioning to account for world_size divisibility by @snarayan21 in #706
- fix convert imagenet by @Hprairie in #708
- Bump pytest-split from 0.8.2 to 0.9.0 by @dependabot in #710
- Remove duplicate dbfs: prefix from error message by @vanshcsingh in #712
- enable adaptive retry for s3 download by @bigning in #713
- Upgrade ci_testing, remove codeql by @snarayan21 in #714
- Fix Linting from Pillow version update by @XiaohanZhangCMU in #719
- Bump pydantic from 2.7.4 to 2.8.2 by @dependabot in #718
- Bump databricks-sdk from 0.28.0 to 0.29.0 by @dependabot in #715
- Add HF File System Support to Streaming by @orionw in #711
- Improve error message on non-0 rank when index file download failed by @bigning in #723
- Bump pytest from 8.2.2 to 8.3.2 by @dependabot in #735
- Bump uvicorn from 0.30.1 to 0.30.3 by @dependabot in #730
- Bump fastapi from 0.111.0 to 0.111.1 by @dependabot in #724
- Bump Streaming Version to 0.8.0 by @mvpatel2000 in #738
New Contributors
- @aspfohl made their first contribution in #675
- @huxuan made their first contribution in #694
- @Hprairie made their first contribution in #708
- @vanshcsingh made their first contribution in #712
- @orionw made their first contribution in #711
Full Changelog: v0.7.6...v0.8.0
v0.7.6
🚀 Streaming v0.7.6
Streaming v0.7.6 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.7.6
💎 New Features
1. device_per_stream batching method
Users can now construct batches such that each device sees only samples from a single stream. This is very useful in cases where different data sources have samples/tensors of different sizes, but the model should still see samples from these different data sources at each optimizer step.
- Adding device_per_stream batching by @snarayan21 in #661
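To make the device_per_stream idea concrete, here is a toy sketch in which every device's batch for one step is drawn from exactly one stream. It is not the algorithm from #661: the function name and the round-robin assignment of streams to devices are assumptions chosen for illustration.

```python
from typing import Dict, List


def device_per_stream_batches(streams: Dict[str, List[int]],
                              num_devices: int,
                              batch_size: int) -> List[List[int]]:
    """Build one step's batches so that each device reads from a single stream."""
    names = sorted(streams)
    cursor = {name: 0 for name in names}  # Next unread sample per stream.
    batches = []
    for device in range(num_devices):
        # Round-robin stream choice; the real selection policy may differ.
        name = names[device % len(names)]
        start = cursor[name]
        batches.append(streams[name][start:start + batch_size])
        cursor[name] = start + batch_size
    return batches


streams = {"short_text": [0, 1, 2, 3], "long_text": [100, 101, 102, 103]}
step = device_per_stream_batches(streams, num_devices=4, batch_size=2)
```

No batch mixes samples from both streams, yet the step as a whole still covers both data sources.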
2. Add ndarray type for Spark dataframes
Enable parsing Spark's ArrayType (of ShortType, LongType, IntegerType, FloatType, DoubleType) when converting a Spark dataframe to MDS.
- Add ndarray type by @XiaohanZhangCMU in #623
3. Support for Alipan storage
Adds support for Alipan, Alibaba's cloud storage service.
- Add support for Alipan Storage backend by @PeterDing in #651
What's Changed
- Bump fastapi from 0.110.0 to 0.110.2 by @dependabot in #660
- Bump pydantic from 2.6.4 to 2.7.0 by @dependabot in #653
- Bump pydantic from 2.7.0 to 2.7.1 by @dependabot in #666
- Bump pytest from 8.1.1 to 8.2.0 by @dependabot in #664
- Bump databricks-sdk from 0.23.0 to 0.27.0 by @dependabot in #667
- Version bump to v0.7.6 by @snarayan21 in #669
New Contributors
- @PeterDing made their first contribution in #651
Full Changelog: v0.7.5...v0.7.6
v0.7.5
🚀 Streaming v0.7.5
Streaming v0.7.5 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.7.5
💎 New Features
1. Tensor/Sequence Parallelism Support
Using the replication argument, easily share data samples across multiple ranks, enabling sequence or tensor parallelism.
- Replicating samples across devices (SP / TP enablement) by @knighton in #597
- Expanded replication testing + documentation by @snarayan21 in #607
- Make streaming use the correct number of unique samples with SP/TP by @snarayan21 in #619
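One way to picture replication is as a mapping from global ranks to a smaller set of data ranks, so that consecutive ranks receive identical samples. The sketch below assumes replicas are grouped as consecutive ranks; data_rank is a hypothetical helper, not part of the Streaming API.

```python
def data_rank(rank: int, replication: int) -> int:
    """Map a global rank to the data rank shared by its replication group."""
    return rank // replication


world_size = 8
replication = 2  # Every 2 consecutive ranks see the same samples.
groups = [data_rank(r, replication) for r in range(world_size)]
unique_data_ranks = len(set(groups))
```

With 8 ranks and a replication of 2, only 4 distinct sample streams are consumed per step, which is why the number of unique samples has to be computed from the replicated world size rather than the raw one.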
2. Overhauled Streaming Documentation
New and improved streaming documentation can be found here -- please submit issues with any feedback.
- Major overhaul of Streaming documentation by @snarayan21 in #636
3. batch_size is now required for StreamingDataset
As we have seen multiple errors and performance degradations from users not setting the batch_size argument to StreamingDataset, it must now be set in order to iterate over the dataset.
- You must set batch size. There is no other way. by @snarayan21 in #624
4. Support for Python 3.11, deprecate Python 3.8
- Add support for Python 3.11 and deprecate Python 3.8 by @karan6181 in #586
🐛 Bug Fixes
- [easy typo fix] fix f-string by @bigning in #596
- Change comparison in partitions to include equals by @JAEarly in #587
- Use type int when initializing SharedMemory size by @bchiang2 in #604
- COCO Dataset fix -- avoids allow_unsafe_types=True by @snarayan21 in #647
🔧 Improvements
- Allow writers to overwrite existing data by @JAEarly in #594
- Update careers link by @milocress in #611
- Update license by @b-chu in #568
- Updated documentation for S3-compatible object stores by @AIproj in #592
- Make yamllint consistent with Composer by @b-chu in #583
- Switch linting workflows to ci-testing repo by @b-chu in #616
What's Changed
- Bump uvicorn from 0.26.0 to 0.27.1 by @dependabot in #599
- Bump pytest-split from 0.8.1 to 0.8.2 by @dependabot in #581
- Update ruff to 0.2.2 by @Skylion007 in #608
- Bump fastapi from 0.109.0 to 0.110.0 by @dependabot in #610
- Bump yamllint from 1.33.0 to 1.35.1 by @dependabot in #601
- Bump uvicorn from 0.27.1 to 0.28.0 by @dependabot in #626
- Update moto requirement from <5,>=4.0 to >=4.0,<6 by @dependabot in #580
- Bump furo from 2023.7.26 to 2024.1.29 by @dependabot in #631
- Bump pypandoc from 1.12 to 1.13 by @dependabot in #630
- Bump databricks-sdk from 0.14.0 to 0.22.0 by @dependabot in #629
- Add batch_size to 1 if not provided for regression testing by @karan6181 in #635
- Fixed docstring note for getting sequential sample ordering by @snarayan21 in #632
- Bump pytest and fix failing test by @snarayan21 in #642
- Update pytest-cov requirement from <5,>=4 to >=4,<6 by @dependabot in #638
- Bump pydantic from 2.5.3 to 2.6.4 by @dependabot in #639
- Bump uvicorn from 0.28.0 to 0.29.0 by @dependabot in #640
- Bump databricks-sdk from 0.22.0 to 0.23.0 by @dependabot in #644
- Version bump to 0.7.5 by @snarayan21 in #650
New Contributors
- @bigning made their first contribution in #596
- @JAEarly made their first contribution in #587
- @AIproj made their first contribution in #592
- @milocress made their first contribution in #611
- @bchiang2 made their first contribution in #604
Full Changelog: v0.7.4...v0.7.5
v0.7.4
🚀 Streaming v0.7.4
Streaming v0.7.4 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.7.4
🐛 Bug Fixes
- Download to temporary path from azure by @philipnrmn in #566
- fix(merge_index): scheme was not well formatted by @fwertel in #576
- Update misplaced params of _format_remote_index_files by @lsongx in #584
- Modifications to resumption shared memory allowing load_state_dict multiple times. by @snarayan21 in #593
What's Changed
- Bump fastapi from 0.108.0 to 0.109.0 by @dependabot in #564
- Bump gitpython from 3.1.40 to 3.1.41 by @dependabot in #565
- Download to temporary path from azure by @philipnrmn in #566
- Use tempfile.gettempdir() instead of a hardcoded temp root. by @knighton in #570
- fix(merge_index): scheme was not well formatted by @fwertel in #576
- Bump uvicorn from 0.25.0 to 0.26.0 by @dependabot in #572
- Bump sphinx-tabs from 3.4.4 to 3.4.5 by @dependabot in #571
- Update misplaced params of _format_remote_index_files by @lsongx in #584
- Remove .ci folder and move FILE_HEADER and CODEOWNERS by @irenedea in #588
- Modifications to resumption shared memory allowing load_state_dict multiple times. by @snarayan21 in #593
- Bump version to 0.7.4 by @snarayan21 in #595
New Contributors
- @philipnrmn made their first contribution in #566
- @fwertel made their first contribution in #576
- @lsongx made their first contribution in #584
- @irenedea made their first contribution in #588
Full Changelog: v0.7.3...v0.7.4
v0.7.3
🚀 Streaming v0.7.3
Streaming v0.7.3 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.7.3
🐛 Bug Fixes
- Logging messages for new defaults only show once per rank. (#543)
- Fixed padding calculation for repeat samples in the partition. (#544)
🔧 Other improvements
- Update copyright license year from 2023 -> 2022-2024. (#560)
What's Changed
- Logging messages from new defaults only show once per rank. by @snarayan21 in #543
- Fixed condition for warning when partitioning over tiny datasets. by @snarayan21 in #544
- Removing stray print statement by @snarayan21 in #553
- Bump pydantic from 2.5.2 to 2.5.3 by @dependabot in #548
- Bump uvicorn from 0.24.0.post1 to 0.25.0 by @dependabot in #549
- Bump fastapi from 0.104.1 to 0.108.0 by @dependabot in #557
- Bump pytest from 7.4.3 to 7.4.4 by @dependabot in #558
- Update copyright: 2023 -> 2022-2024. by @knighton in #560
- Bump version to 0.7.3 by @karan6181 in #562
Full Changelog: v0.7.2...v0.7.3
v0.7.2
🚀 Streaming v0.7.2
Streaming v0.7.2 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.7.2
💎 New Features
1. Canned ACL Support (#512)
Add support for the Canned ACL using the environment variable S3_CANNED_ACL for AWS S3. Check out the Canned ACL document on how to use it.
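As a rough illustration, an uploader could turn the S3_CANNED_ACL environment variable into the extra arguments of an S3 upload call. The helper name s3_upload_extra_args is hypothetical and how Streaming consumes the variable internally is an assumption; the "ACL" key and the "bucket-owner-full-control" value follow the usual S3 conventions.

```python
import os
from typing import Dict


def s3_upload_extra_args() -> Dict[str, str]:
    """Build extra upload arguments from S3_CANNED_ACL, if it is set."""
    acl = os.environ.get("S3_CANNED_ACL")
    # An unset or empty variable leaves the bucket's default ACL in effect.
    return {"ACL": acl} if acl else {}


os.environ["S3_CANNED_ACL"] = "bucket-owner-full-control"
extra_args = s3_upload_extra_args()
```

The resulting dictionary has the shape boto3 expects for the ExtraArgs parameter of its upload calls.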
2. Allow/reject datasets containing unsafe types (#519)
The pickle serialization format, one of the available MDS encodings, is a potential security vulnerability. We added a boolean flag allow_unsafe_types in the StreamingDataset class to allow or reject datasets containing Pickle.
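The check behind such a flag might look roughly like the sketch below. It is not the code from #519: validate_encodings is a hypothetical helper, and "pkl" is assumed to be the name of the pickle encoding.

```python
from typing import Iterable

UNSAFE_ENCODINGS = {"pkl"}  # Assumed name of the pickle encoding.


def validate_encodings(encodings: Iterable[str],
                       allow_unsafe_types: bool = False) -> None:
    """Reject column encodings that can execute arbitrary code when decoded."""
    unsafe = sorted(set(encodings) & UNSAFE_ENCODINGS)
    if unsafe and not allow_unsafe_types:
        raise ValueError(f"dataset uses unsafe encodings {unsafe}; "
                         "pass allow_unsafe_types=True to load it anyway")


validate_encodings(["int", "str"])                          # Safe: passes.
validate_encodings(["int", "pkl"], allow_unsafe_types=True)  # Explicit opt-in.
```

Defaulting the flag to False means a dataset containing pickled columns is refused unless the caller opts in explicitly.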
🐛 Bug Fixes
- Retrieve batch size correctly from vision yamls for the streaming simulator (#501)
- Fix for CVE-2023-47248 (#504)
- Streaming simulator bug fixes (proportion, repeat, yaml ingestion) (#514)
- Proportion of None instead of a string 'None' is now handled correctly.
- Repeat of None instead of a string 'None' is now handled correctly.
- Added warning for StreamingDataset subclass defaults
- Fix sample partitioning algorithm bug for tiny datasets (#517)
🔧 Improvements
- Added warning messages for new streaming dataset defaults to inform users about the old and new values. (#502)
What's Changed
- Migrate pydocstyle to ruff by @Skylion007 in #500
- Bump fastapi from 0.104.0 to 0.104.1 by @dependabot in #496
- Bump uvicorn from 0.23.2 to 0.24.0.post1 by @dependabot in #497
- Retrieve batch size correctly from vision yamls for simulator by @snarayan21 in #501
- Adding warning messages for new defaults by @snarayan21 in #502
- Fix for CVE-2023-47248 by @bandish-shah in #504
- Bump pydantic from 2.4.2 to 2.5.2 by @dependabot in #513
- Bump yamllint from 1.32.0 to 1.33.0 by @dependabot in #506
- Fixed comments and update dataframe_to_MDS API signature by @karan6181 in #515
- Simulator bug fixes (proportion, repeat, yaml ingestion) by @snarayan21 in #514
- Add support for the Canned ACL environment variable for AWS S3 by @karan6181 in #512
- Fixed bugs when trying to use very small datasets by @snarayan21 in #517
- Bump databricks-sdk from 0.8.0 to 0.14.0 by @dependabot in #518
- Add flag to allow or reject datasets containing unsafe types (i.e., Pickle) by @knighton in #519
- improve exception error messages for downloading by @Skylion007 in #525
- doc: add NDArray format by @OrenLeung in #527
- Offload exception to mds_write. by @XiaohanZhangCMU in #528
- Add allow_unsafe_types parameter to the streaming regression tests by @karan6181 in #531
- Bump version to 0.7.2 by @karan6181 in #532
New Contributors
- @OrenLeung made their first contribution in #527
Full Changelog: v0.7.1...v0.7.2