Releases: mosaicml/streaming

v0.10.0

03 Dec 21:14

🚀 Streaming v0.10.0

Streaming v0.10.0 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.10.0

Improvements

1. Reusable cloud download clients (#817)

  • Streaming now reuses cloud download clients when downloading shard files instead of creating a new client for each download.
  • This avoids run failures that sometimes occur with too many open sockets or excessive cloud authentication requests.
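
The idea behind client reuse can be sketched as a small cache keyed by storage scheme. This is only an illustration, not Streaming's implementation: `ClientCache`, its `factory` argument, and the example shard paths are all hypothetical names.

```python
from typing import Callable, Dict


class ClientCache:
    """Hypothetical cache: build one client per storage scheme, then reuse it."""

    def __init__(self, factory: Callable[[str], object]) -> None:
        self._factory = factory  # hypothetical callable that builds a client for a scheme
        self._clients: Dict[str, object] = {}
        self.created = 0  # number of clients actually constructed

    def get(self, remote: str) -> object:
        # Key the cache on the scheme ("s3", "gs", ...) of the remote path.
        scheme = remote.split('://', 1)[0]
        if scheme not in self._clients:
            self._clients[scheme] = self._factory(scheme)
            self.created += 1
        return self._clients[scheme]


# Stand-in factory: a real one would construct a cloud SDK client here.
cache = ClientCache(factory=lambda scheme: object())
shards = [
    's3://bucket/shard.00000.mds',
    's3://bucket/shard.00001.mds',
    'gs://bucket/shard.00000.mds',
]
clients = [cache.get(path) for path in shards]
```

With three shard downloads across two schemes, only two clients are built, which is the property that keeps socket counts and authentication requests low.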

2. py1b shuffle algorithm deprecation (#837)

  • The py1b shuffle algorithm has now been deprecated. Please use the improved py1e (default) or the py1br shuffle algorithms instead.
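
If existing configs still name py1b, a small shim like the one below could redirect them before the value is passed on as the shuffle algorithm. The helper `resolve_shuffle_algo` and the choice to map py1b to py1e are assumptions made for illustration; they are not part of Streaming.

```python
import warnings

# Assumption for this sketch: deprecated names and a suggested replacement.
DEPRECATED_SHUFFLE_ALGOS = {'py1b': 'py1e'}


def resolve_shuffle_algo(name: str = 'py1e') -> str:
    """Return a non-deprecated shuffle algorithm name (hypothetical helper)."""
    if name in DEPRECATED_SHUFFLE_ALGOS:
        replacement = DEPRECATED_SHUFFLE_ALGOS[name]
        warnings.warn(
            f'Shuffle algorithm {name!r} is deprecated; using {replacement!r} instead.',
            DeprecationWarning,
            stacklevel=2,
        )
        return replacement
    return name
```

Note that switching algorithms changes the sample order, so a run resumed from a py1b checkpoint would not see the same ordering.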

Full Changelog: v0.9.1...v0.10.0

v0.9.1

04 Nov 20:50

🚀 Streaming v0.9.1

Streaming v0.9.1 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.9.1

What's New

1. Streaming is added to Gurubase (#805)

  • Streaming now has an AI assistant to help users with their questions! Try Streaming Guru, which draws on this repo and the docs and uses an LLM to answer questions.

Improvements

1. Permission Issue Resolution (#813)

  • Resolved read permission issues occurring when shared memory files are created in shared computing environments. We added retry conditions to allow the creation of new shared memory files upon encountering permission errors.
  • Prefix Integrity for Shared Memory Files: When creating shared memory files, both LOCALS and FILELOCKS are now validated to ensure no overlap with existing files, and they are matched with consistent prefix identifiers.
  • Handling Non-Normal Program Exits: Enhanced cleanup procedures to address cases where non-normal program exits left some shared memory files uncleared. All files in SHM_TO_CLEAN are now checked to prevent duplicates.
    These changes improve shared memory management and reliability in shared environments.
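
The retry behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions: `create_with_retry`, the `create` callable, and the `{prefix:06}_locals` naming are hypothetical and stand in for whatever Streaming does internally.

```python
from typing import Callable, Iterable, Tuple


def create_with_retry(create: Callable[[str], object],
                      prefixes: Iterable[int]) -> Tuple[int, object]:
    """Try successive prefixes until a shared memory file can be created."""
    for prefix_int in prefixes:
        name = f'{prefix_int:06}_locals'  # hypothetical naming scheme
        try:
            return prefix_int, create(name)
        except (PermissionError, FileExistsError):
            # Another user or a previous run owns this name: move to the next prefix.
            continue
    raise RuntimeError('No usable shared memory prefix found')


# Simulate a shared machine where the first two prefixes belong to someone else.
taken = {'000000_locals', '000001_locals'}


def fake_create(name: str) -> str:
    if name in taken:
        raise PermissionError(name)
    return name


prefix, handle = create_with_retry(fake_create, range(10))
```

Here the first two names raise PermissionError, so the third prefix is the one that ends up being used.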

2. Fix Shard Eviction Hanging (#795)

  • Changed the search for coldest shard to avoid looping over remote shards by considering local shards only as possible candidates for eviction.
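
A minimal sketch of the idea, assuming each shard has a state string and a last-access timestamp. The function name, the 'local'/'remote' labels, and the data layout are illustrative assumptions, not Streaming's internals.

```python
from typing import Dict, Optional


def find_coldest_local_shard(states: Dict[int, str],
                             last_access: Dict[int, float]) -> Optional[int]:
    """Return the least recently used shard among those present locally."""
    # Only local shards occupy disk, so only they are eviction candidates.
    local = [shard_id for shard_id, state in states.items() if state == 'local']
    if not local:
        return None  # nothing to evict
    return min(local, key=lambda shard_id: last_access[shard_id])


states = {0: 'remote', 1: 'local', 2: 'local', 3: 'remote'}
last_access = {0: 1.0, 1: 50.0, 2: 20.0, 3: 0.5}
coldest = find_coldest_local_shard(states, last_access)
```

Shard 3 has the oldest timestamp overall but is remote, so it is never considered; the search picks shard 2 instead.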

Full Changelog: v0.9.0...v0.9.1

v0.9.0

25 Sep 02:34

🚀 Streaming v0.9.0

Streaming v0.9.0 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.9.0

What's New

1. Improved compatibility for ndarray and json types (#776, #777)

Columns containing a map type now convert successfully to JSON in an MDS file when the column's type is specified as 'json', and the JSON encoder now handles ndarray values.
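
One common way to let a JSON encoder cope with array values is to override `json.JSONEncoder.default`. The sketch below shows that general technique only; it is not claimed to be Streaming's encoder. `ArrayFriendlyEncoder` and `FakeArray` are hypothetical names, and `FakeArray` stands in for `numpy.ndarray` so the example runs with the standard library alone.

```python
import json


class ArrayFriendlyEncoder(json.JSONEncoder):
    """JSON encoder that converts array-like objects to plain lists."""

    def default(self, obj):
        # NumPy arrays expose tolist(); duck typing keeps this sketch stdlib-only.
        if hasattr(obj, 'tolist'):
            return obj.tolist()
        return super().default(obj)


class FakeArray:
    """Stand-in for numpy.ndarray (assumption made for this example)."""

    def __init__(self, values):
        self._values = list(values)

    def tolist(self):
        return list(self._values)


# A map-typed column value that contains an array.
column_value = {'lang': 'en', 'scores': FakeArray([0.1, 0.9])}
encoded = json.dumps(column_value, cls=ArrayFriendlyEncoder)
```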

Full Changelog: v0.8.1...v0.9.0

v0.8.1

23 Aug 20:26
a9a7d04

🚀 Streaming v0.8.1

Streaming v0.8.1 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.8.1

🔧 Improvements

Dataloader hanging between epochs has now been resolved! We've seen training time improvements of up to 40% for some many-epoch training jobs. If this was impacting your runs and has now been fixed, please let us know!

  • Fix dataloader hang at the end of an epoch by @XiaohanZhangCMU in #741
  • Add default compression, and warning about local paths to dataframe_to_mds by @srowen in #748
  • Throw exception when event.is_set() after write()s by @srowen in #754

🐛 Bug Fixes

  • Ensure deterministic sample order between epochs when shuffle=False by @snarayan21 in #750

Full Changelog: v0.8.0...v0.8.1

v0.8.0

30 Jul 17:00
b14cd7a

✨ What's New ✨

1. HF File System Streaming (#711)

Streaming now supports streaming data from the Hugging Face (HF) file system! This adds another popular backend as an option for hosting your data.

Full Changelog: v0.7.6...v0.8.0

v0.7.6

10 May 22:22
97eae28

🚀 Streaming v0.7.6

Streaming v0.7.6 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.6

💎 New Features

1. device_per_stream batching method

Users can now construct batches such that each device sees only samples from a single stream. This is very useful in cases where different data sources have samples/tensors of different sizes, but the model should still see samples from these different data sources at each optimizer step.
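
To make the batching shape concrete, here is a toy sketch in which every device's batch is drawn from exactly one stream, while different devices may draw from different streams within the same step. The function, its random stream choice, and the sample IDs are assumptions for illustration; Streaming's actual sampling scheme may differ.

```python
import random
from typing import Dict, List


def device_per_stream_batch(streams: Dict[str, List[int]],
                            num_devices: int,
                            device_batch_size: int,
                            rng: random.Random) -> List[List[int]]:
    """Build one global batch as per-device batches, each from a single stream."""
    names = sorted(streams)
    global_batch = []
    for _ in range(num_devices):
        stream = rng.choice(names)  # hypothetical: pick a stream for this device
        global_batch.append(rng.sample(streams[stream], device_batch_size))
    return global_batch


# Two streams whose samples would have different shapes in practice.
streams = {
    'short_sequences': list(range(0, 100)),
    'long_sequences': list(range(1000, 1100)),
}
batch = device_per_stream_batch(streams, num_devices=4, device_batch_size=2,
                                rng=random.Random(0))
```

Because no device batch mixes streams, tensors within a device batch share a shape, yet the optimizer step can still cover several data sources.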

2. Add ndarray type for Spark dataframes.

Enable parsing Spark's ArrayType (of ShortType, LongType, IntegerType, FloatType, DoubleType) when converting a Spark dataframe to MDS.
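
The conversion can be pictured as a lookup from the Spark element type to an MDS ndarray encoding. The bit widths follow Spark's type definitions; the encoding strings (`ndarray:<dtype>`) and the helper name are assumptions for this sketch and should be checked against the library.

```python
# Assumed mapping from the element type of a Spark ArrayType to an MDS encoding.
SPARK_ARRAY_TO_MDS = {
    'ShortType': 'ndarray:int16',     # 16-bit signed integer
    'IntegerType': 'ndarray:int32',   # 32-bit signed integer
    'LongType': 'ndarray:int64',      # 64-bit signed integer
    'FloatType': 'ndarray:float32',   # single precision float
    'DoubleType': 'ndarray:float64',  # double precision float
}


def mds_encoding_for_array(element_type: str) -> str:
    """Look up the assumed MDS encoding for an ArrayType element type."""
    try:
        return SPARK_ARRAY_TO_MDS[element_type]
    except KeyError:
        raise ValueError(f'Unsupported ArrayType element: {element_type}') from None
```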

3. Support for Alipan storage

Adds support for Alipan, Alibaba's cloud storage service.

Full Changelog: v0.7.5...v0.7.6

v0.7.5

09 Apr 00:35
3ba9301

🚀 Streaming v0.7.5

Streaming v0.7.5 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.5

💎 New Features

1. Tensor/Sequence Parallelism Support

Use the replication argument to share data samples across multiple ranks, enabling sequence or tensor parallelism.

  • Replicating samples across devices (SP / TP enablement) by @knighton in #597
  • Expanded replication testing + documentation by @snarayan21 in #607
  • Make streaming use the correct number of unique samples with SP/TP by @snarayan21 in #619
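
The effect of replication on which ranks share samples can be sketched with integer division, assuming that consecutive ranks form a replication group. The helper name and the example sizes are illustrative assumptions.

```python
from typing import List


def replication_groups(world_size: int, replication: int) -> List[int]:
    """Map each rank to the group of ranks that receives the same samples."""
    if world_size % replication != 0:
        raise ValueError('world_size must be divisible by replication')
    # Assumption: every `replication` consecutive ranks share samples.
    return [rank // replication for rank in range(world_size)]


groups = replication_groups(world_size=8, replication=2)
unique_sample_streams = len(set(groups))
```

With 8 ranks and a replication of 2, ranks 0 and 1 see the same samples, ranks 2 and 3 see the same samples, and so on, so only 4 distinct sample sequences are consumed per step.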

2. Overhauled Streaming Documentation

New and improved Streaming documentation is now available. Please submit issues with any feedback.

3. batch_size is now required for StreamingDataset

We have seen multiple errors and performance degradations caused by users not setting the batch_size argument to StreamingDataset, so batch_size must now be set in order to iterate over the dataset.

4. Support for Python 3.11, deprecate Python 3.8

  • Add support for Python 3.11 and deprecate Python 3.8 by @karan6181 in #586

🐛 Bug Fixes

  • [easy typo fix] fix f-string by @bigning in #596
  • Change comparison in partitions to include equals by @JAEarly in #587
  • Use type int when initializing SharedMemory size by @bchiang2 in #604
  • COCO Dataset fix -- avoids allow_unsafe_types=True by @snarayan21 in #647

Full Changelog: v0.7.4...v0.7.5

v0.7.4

08 Feb 22:00
a0443bb

🚀 Streaming v0.7.4

Streaming v0.7.4 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.4

🐛 Bug Fixes

  • Download to temporary path from azure by @philipnrmn in #566
  • fix(merge_index): scheme was not well formatted by @fwertel in #576
  • Update misplaced params of _format_remote_index_files by @lsongx in #584
  • Modifications to resumption shared memory allowing load_state_dict multiple times. by @snarayan21 in #593

Full Changelog: v0.7.3...v0.7.4

v0.7.3

12 Jan 18:12
47efc9d

🚀 Streaming v0.7.3

Streaming v0.7.3 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.3

🐛 Bug Fixes

  • Logging messages for new defaults only show once per rank. (#543)
  • Fixed padding calculation for repeat samples in the partition. (#544)

🔧 Other improvements

  • Update copyright license year from 2023 to 2022-2024. (#560)

Full Changelog: v0.7.2...v0.7.3

v0.7.2

14 Dec 17:26
fac84b4

🚀 Streaming v0.7.2

Streaming v0.7.2 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.7.2

💎 New Features

1. Canned ACL Support (#512)

Adds support for canned ACLs on AWS S3 through the S3_CANNED_ACL environment variable. Check out the Canned ACL documentation for how to use it.
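
As a rough sketch of how such an environment variable might be consumed, the helper below turns S3_CANNED_ACL into the `ExtraArgs` dictionary that boto3 upload calls accept. The helper `s3_extra_args` is hypothetical; only the variable name comes from the release note, and the ACL values are the standard S3 canned ACLs.

```python
import os
from typing import Dict, Mapping

# Standard S3 canned ACL values for objects.
VALID_CANNED_ACLS = {
    'private', 'public-read', 'public-read-write', 'authenticated-read',
    'aws-exec-read', 'bucket-owner-read', 'bucket-owner-full-control',
}


def s3_extra_args(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build upload ExtraArgs from S3_CANNED_ACL (hypothetical helper)."""
    acl = env.get('S3_CANNED_ACL')
    if acl is None:
        return {}  # no ACL requested: keep the bucket default
    if acl not in VALID_CANNED_ACLS:
        raise ValueError(f'Unknown canned ACL: {acl!r}')
    return {'ACL': acl}


extra_args = s3_extra_args({'S3_CANNED_ACL': 'bucket-owner-full-control'})
```

The resulting dictionary could then be passed as `ExtraArgs` to a boto3 `upload_file` call.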

2. Allow/reject datasets containing unsafe types (#519)

The pickle serialization format, one of the available MDS encodings, is a potential security vulnerability. We added a boolean flag, allow_unsafe_types, to the StreamingDataset class to allow or reject datasets containing pickle-encoded data.
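
The kind of check this flag implies can be sketched as a scan over the column encodings of a dataset. Everything here is an assumption for illustration: the helper name, the idea that the pickle encoding is labelled 'pkl', and the error wording.

```python
from typing import Iterable

# Assumption: the pickle encoding is identified by the name 'pkl'.
UNSAFE_ENCODINGS = {'pkl'}


def validate_columns(column_encodings: Iterable[str],
                     allow_unsafe_types: bool = False) -> None:
    """Raise if the dataset uses an unsafe encoding and the flag is not set."""
    unsafe = sorted(set(column_encodings) & UNSAFE_ENCODINGS)
    if unsafe and not allow_unsafe_types:
        raise ValueError(
            f'Dataset contains unsafe encodings {unsafe}; '
            'set allow_unsafe_types=True to load it anyway.'
        )


# A dataset without pickle columns passes regardless of the flag.
validate_columns(['int', 'str', 'jpeg'])
# A dataset with a pickle column passes only when explicitly allowed.
validate_columns(['int', 'pkl'], allow_unsafe_types=True)
```

Rejecting by default means a dataset from an untrusted source cannot trigger pickle deserialization unless the user opts in.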

🐛 Bug Fixes

  • Retrieve batch size correctly from vision yamls for the streaming simulator (#501)
  • Fix for CVE-2023-47248 (#504)
  • Streaming simulator bug fixes (proportion, repeat, yaml ingestion) (#514)
    • Proportion of None instead of a string 'None' is now handled correctly.
    • Repeat of None instead of a string 'None' is now handled correctly.
    • Added warning for StreamingDataset subclass defaults
  • Fix sample partitioning algorithm bug for tiny datasets (#517)

🔧 Improvements

  • Added warning messages for new streaming dataset defaults to inform users about the old and new values. (#502)

Full Changelog: v0.7.1...v0.7.2