Releases: mosaicml/streaming
v0.3.0
🚀 Streaming v0.3.0
Streaming v0.3.0 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.3.0
New Features
☁️ Cloud uploading
Now, you can automatically upload shards to cloud storage on the fly by providing a cloud path to MDSWriter. Track the progress of individual uploads with progress_bar=True, and tune background upload workers with max_workers=4.
You can upload output shard files automatically to a supported cloud provider (AWS S3, GCP, OCI) by passing a bucket location as the out parameter of the Writer class. The example below uploads output files to an AWS S3 bucket:
output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, ...) as out:
    for sample in samples:
        out.write(sample)
To keep the output shard files locally, pass a local directory path as the out parameter of the Writer. For example:
output_dir = '/tmp/mds'
with MDSWriter(out=output_dir, ...) as out:
    for sample in samples:
        out.write(sample)
To see the progress of the cloud upload, set progress_bar=True on the Writer. For example:
output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, progress_bar=True, ...) as out:
    for sample in samples:
        out.write(sample)
You can control the number of background upload threads with the max_workers parameter of the Writer, which is responsible for uploading the shard files to a remote location, if one is provided. Each thread uploads one file at a time. For example, with max_workers=4, at most 4 threads are active at the same time, each uploading one shard file.
output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, max_workers=4, ...) as out:
    for sample in samples:
        out.write(sample)
🔀 2x faster shuffling
We’ve added a new shuffling algorithm, py1s, which is twice as fast on typical workloads. You can toggle which shuffling algorithm is used by overriding shuffle_algo (old behavior: py2s). You will experience this as faster epoch starts and faster mid-epoch resumption for large datasets.
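As a hedged illustration of the override mentioned above, the keyword arguments below select the older py2s algorithm. The paths and split name are placeholders, and the sketch assumes, as the note implies, that py1s is what you get when shuffle_algo is left unset.

```python
# Keyword arguments for StreamingDataset; the paths and split are placeholders.
dataset_kwargs = {
    'local': '/tmp/dataset/',
    'remote': 's3://bucket/dir/path',
    'split': 'train',
    'shuffle': True,
    'shuffle_algo': 'py2s',  # opt back into the pre-0.3.0 shuffling behavior
}

# With mosaicml-streaming installed and the remote reachable, one would then run:
# from streaming import StreamingDataset
# dataset = StreamingDataset(**dataset_kwargs)
```

Passing everything by keyword also matches the kwargs-only constructors introduced in this release.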
📨 2x faster partitioning
We’ve also reimplemented how shards/samples are assigned to nodes/devices/dataloader workers to run about twice as fast on typical workloads while giving identical results. This is exposed as the partition_algo argument to StreamingDataset. You will experience this as faster start and resumption for large datasets.
🔗 Extensible downloads
We provide examples of modifying StreamingDataset to stream from a dataset of links to external data sources. In our examples, using the WebVid dataset, each sample points to a video file which exists outside of the shards in its original format and is downloaded separately. Benchmarking is included.
API changes
- The dirname parameter of class Writer and its derived classes (MDSWriter, XSVWriter, TSVWriter, CSVWriter, and JSONWriter) has been renamed to out, with the following added functionality:
  - If out is a local directory, shard files are saved locally. For example, out=/tmp/mds/.
  - If out is a remote directory, a local temporary directory is created to cache the shard files, and the shard files are then uploaded to the remote location. The temporary directory is deleted once the shards are uploaded. For example, out=s3://bucket/dir/path.
  - If out is a tuple of (local_dir, remote_dir), shard files are saved in local_dir and also uploaded to the remote location. For example, out=('/tmp/mds/', 's3://bucket/dir/path').
- Given the complexity of their arguments, and the need to be able to safely upgrade them over time, we have updated the APIs of Writer and its subclasses (like MDSWriter) and StreamingDataset to require kwargs.
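The three forms of out listed above can be summarized in a small helper. This is a sketch under stated assumptions, not the Writer's actual code: is_remote and resolve_out are hypothetical names, and "remote" is assumed to mean "the path has a URL scheme such as s3://".

```python
import tempfile
from typing import Optional, Tuple, Union
from urllib.parse import urlparse


def is_remote(path: str) -> bool:
    """Assumption: any path with a URL scheme (e.g. s3://) counts as remote."""
    return urlparse(path).scheme != ''


def resolve_out(out: Union[str, Tuple[str, str]]) -> Tuple[str, Optional[str], bool]:
    """Return (local_dir, remote_dir, delete_local_after_upload) for an out value."""
    if isinstance(out, tuple):
        local_dir, remote_dir = out
        return local_dir, remote_dir, False  # keep the local copy and also upload
    if is_remote(out):
        # Remote only: cache shards in a temporary directory, delete it after upload.
        return tempfile.mkdtemp(), out, True
    return out, None, False  # local only: nothing to upload


local_only = resolve_out('/tmp/mds/')
remote_only = resolve_out('s3://bucket/dir/path')
both = resolve_out(('/tmp/mds/', 's3://bucket/dir/path'))
```

Only the remote-only form marks its local directory for deletion, which mirrors the temporary-cache behavior described in the second case.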
Bug Fixes
- Fix broken blog post link and community email link in the README (#177).
- Download the shard files as tmp extension until it finishes for OCI blob storage (#178).
- Add documentation of supported cloud providers (#169).
  - Streaming Dataset supports Amazon S3, Google Cloud Storage, and Oracle Cloud Storage providers to stream your data to any compute cluster. Read [this](https://streaming.docs.mosaicml.com/en/stable/how_to_guides/configure_cloud_storage_cred.html) doc on how to configure cloud storage credentials.
- Make setup.py deterministic by sorting dependencies (#165).
- Fix overlong lines for better readability (#163).
What's Changed
- Bump fastapi from 0.89.1 to 0.91.0 by @dependabot in #154
- Bump sphinxext-opengraph from 0.7.5 to 0.8.1 by @dependabot in #155
- Compare arrow vs mds vs parquet. by @knighton in #160
- Improve serialization format comparison. by @knighton in #161
- WebVid: conversion and benchmarking for storing the MP4s separately vs inside the MDS shards. by @knighton in #143
- Update download badge link to pepy by @karan6181 in #162
- CloudWriter interface: local=, remote=, keep=. by @knighton in #148
- Fix overlong lines. by @knighton in #163
- Make setup.py deterministic by sorting dependencies. by @nharada1 in #165
- Bump pydantic from 1.10.4 to 1.10.5 by @dependabot in #166
- Bump gitpython from 3.1.30 to 3.1.31 by @dependabot in #167
- Bump fastapi from 0.91.0 to 0.92.0 by @dependabot in #168
- Adjust StreamingDataset arguments by @knighton in #170
- add 2x faster shuffle algorithm; add shuffle bench/plot by @knighton in #137
- Docstring fix by @knighton in #173
- Add a supported cloud providers documentation by @karan6181 in #169
- Add callout fence to Configure Cloud Storage Credentials guide by @karan6181 in #174
- Fix broken links in the README by @knighton in #177
- Download the shard files as tmp extension until it finishes for OCI by @karan6181 in #178
- Add a support of uploading shard files to a cloud as part of Writer by @karan6181 in #171
- Refactor partitioning to be much faster. by @knighton in #179
- Bump version to 0.3.0 by @karan6181 in #180
New Contributors
Full Changelog: v0.2.5...v0.3.0
v0.2.5
🚀 Streaming v0.2.5
Streaming v0.2.5 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.2.5
Bug Fixes
What's Changed
- Update README.md by @knighton in #152
- Fix typo by @dakinggg in #156
- Fixed CPU crash by @karan6181 in #153
- Update example notebooks by @karan6181 in #157
- bump version to 0.2.5 by @karan6181 in #158
Full Changelog: v0.2.4...v0.2.5
v0.2.4
🚀 Streaming v0.2.4
Streaming v0.2.4 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.2.4
What's Changed
- Fix Lossy JPEG reencoding for MDS format by @JJGO in #142
- Add message to size assert & change to KeyError by @samhavens in #146
- Synchronize prefix_int across all ranks to resolve hang issue by @karan6181 in #147
- Pin setuptools in build requirements by @dakinggg in #136
- Graphics. by @knighton in #150
- bump version to 0.2.4 by @karan6181 in #151
New Contributors
- @JJGO made their first contribution in #142
- @samhavens made their first contribution in #146
Full Changelog: v0.2.3...v0.2.4
v0.2.3
🚀 Streaming v0.2.3
Streaming v0.2.3 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.2.3
New Features
- Add scalar MDS encodings data types (#130)
- Support of WebVid-10M dataset (#132)
- Support of LAION-400M dataset (#87)
- Make StreamingDataset[sample_id] block to download the given sample's shard if it is not present, so that the dataset can be used lazily (#118)
- Support of a Streaming benchmarking script to get time taken by the individual component (#121)
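The lazy StreamingDataset[sample_id] behavior listed above can be pictured with a toy dataset that fetches a shard only when one of its samples is requested. This is an illustrative sketch, not the library's implementation: LazyShardedDataset and fetch_shard are hypothetical, shards are assumed to hold a fixed number of samples, and the download is simulated in memory.

```python
from typing import Callable, Dict, List


class LazyShardedDataset:
    """Toy dataset that fetches a shard only when one of its samples is requested."""

    def __init__(self, num_samples: int, samples_per_shard: int,
                 fetch_shard: Callable[[int], List[str]]) -> None:
        self.num_samples = num_samples
        self.samples_per_shard = samples_per_shard
        self.fetch_shard = fetch_shard  # hypothetical download function
        self.cache: Dict[int, List[str]] = {}  # shard_id -> samples already present

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, sample_id: int) -> str:
        if not 0 <= sample_id < self.num_samples:
            raise IndexError(sample_id)
        shard_id, offset = divmod(sample_id, self.samples_per_shard)
        if shard_id not in self.cache:
            # Block here until the shard holding this sample is available.
            self.cache[shard_id] = self.fetch_shard(shard_id)
        return self.cache[shard_id][offset]


downloads: List[int] = []


def fetch_shard(shard_id: int) -> List[str]:
    """Simulated download: record the request and build the shard's samples."""
    downloads.append(shard_id)
    return [f'sample-{shard_id * 4 + i}' for i in range(4)]


dataset = LazyShardedDataset(num_samples=12, samples_per_shard=4, fetch_shard=fetch_shard)
first = dataset[5]   # triggers a fetch of shard 1 only
second = dataset[6]  # same shard, served from the cache
```

Random access touches only the shards it needs, so a caller can index into a large dataset without first downloading all of it.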
Bug Fixes
- Nuke concat option in C4 dataset (#129)
- Fixed bug report markdown doc (#140)
- Fixed ADE20K dataset conversion script (#133)
What's Changed
- Make getitem block to download shard if not present. by @knighton in #118
- 2022 -> 2023. by @knighton in #119
- Benchmark generating the epoch. by @knighton in #121
- Move datasets dependency into .[dev]. by @knighton in #123
- Bump sphinxcontrib-katex from 0.9.3 to 0.9.4 by @dependabot in #113
- Bump sphinxext-opengraph from 0.7.4 to 0.7.5 by @dependabot in #114
- Bump pytest from 7.2.0 to 7.2.1 by @dependabot in #124
- Bump fastapi from 0.88.0 to 0.89.1 by @dependabot in #125
- Bump yamllint from 1.28.0 to 1.29.0 by @dependabot in #126
- Update paramiko requirement from <3,>=2.11.0 to >=2.11.0,<4 by @dependabot in #127
- Bump nbsphinx from 0.8.11 to 0.8.12 by @dependabot in #128
- Nuke concat option. by @knighton in #129
- Add scalar MDS encodings (data types). by @knighton in #130
- WebVid. by @knighton in #132
- LAION-400M processing by @knighton in #87
- Update isort version by @karan6181 in #135
- Update pre-commit requirement from <3,>=2.18.1 to >=2.18.1,<4 by @dependabot in #134
- Fixed bug report markdown by @karan6181 in #140
- Fix ade20k conversion script by @dblalock in #133
- bump version to 0.2.3 by @karan6181 in #141
Full Changelog: v0.2.2...v0.2.3
v0.2.2
🚀 Streaming v0.2.2
Streaming v0.2.2 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.2.2
New Features
Bug Fixes
- Get dataloader worker multiprocessing working with spawn, removing Mac OSX fork requirement (#97)
- Improve error messaging (#100)
- Fix CUDA OOM (#103)
- Fix broken source code links in docs (#104)
- Reference the shared memory object in a worker process when using spawn multiprocessing method (#106)
- Release all the StreamingDataset resources during job termination (#107)
What's Changed
- Lazily instantiate the worker barrier in iter (so it all pickles). by @knighton in #97
- linkcode -> viewcode by @dakinggg in #104
- Update writer.py by @sophiawisdom in #100
- Bump sphinxext-opengraph from 0.7.3 to 0.7.4 by @dependabot in #105
- Removed cuda memory allocation which was causing CUDA OOM by @karan6181 in #103
- Reference the shared memory object in a worker process when using spawn multiprocessing method by @karan6181 in #106
- Release all the StreamingDataset resources during job termination by @karan6181 in #107
- Bump gitpython from 3.1.29 to 3.1.30 by @dependabot in #109
- Bump nbsphinx from 0.8.10 to 0.8.11 by @dependabot in #111
- Visualize partitioning by @knighton in #108
- Command-line partitioning visualizer. by @knighton in #115
- Fix (sys.meta_path is None, Python is likely shutting down) by @knighton in #116
- Bump version. by @knighton in #117
New Contributors
- @dakinggg made their first contribution in #104
- @sophiawisdom made their first contribution in #100
Full Changelog: v0.2.1...v0.2.2
v0.2.1
🚀 Streaming v0.2.1
Streaming v0.2.1 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.2.1
Bug Fixes
- Make StreamingDataset smarter about when to init dist itself, fixing env var rendezvous problem (#94).
- Shorten shared memory names for Mac OSX (#95).
- Reduce memory usage in StreamingDataset, alleviating inscrutable worker OOMs with large datasets (#96).
- Better exception handling in downloading (#98).
- Hard require fork for dataloader multiprocessing in Mac OSX due to unpickleable objects (#101).
What's Changed
- Also check if dist env vars are set. If not set, don't init dist. by @knighton in #94
- Shorten the names of shared memory objects to make OSX happy. by @knighton in #95
- Just do the partitioning/shuffling in the local leader worker. by @knighton in #96
- propagate the actual exception and raise by @karan6181 in #98
- Set multiprocessing method as fork for Mac OS by @karan6181 in #101
- Bump version to 0.2.1 by @karan6181 in #102
Full Changelog: v0.2.0...v0.2.1
v0.2.0
🚀 Streaming v0.2.0
Streaming v0.2.0 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.2.0
New Features
- Elastic world size deterministic shuffle
  Shuffled or not, StreamingDataset now collectively traverses the samples in identical order across all the devices, given a seed and a canonical number of nodes. This ordering holds true even if you checkpoint and resume training of the same epoch on a different number of nodes.
- Instant Mid-Epoch Resumption
  Waiting while your data loader spins to resume from where you left off can be costly! StreamingDataset now lets you resume immediately.
- NEW StreamingDataLoader
  A StreamingDataLoader is a drop-in replacement for your PyTorch DataLoader with a Mid-Epoch Resumption functionality where it resumes from where you left off without spinning the dataloader.
- Support for Oracle Cloud Infrastructure (OCI) blob storage
  Streaming now supports OCI blob storage as a storage backend for streaming. One can pass the OCI blob storage as either oci://<bucket_name>@<namespace>/<folder_name>/<filename> or oci://<bucket_name>/<folder_name>/<filename> to a StreamingDataset class. For example:
      from streaming import StreamingDataset
      remote = 'oci://<bucket>@<namespace>/<path>'
      local = '/tmp/dataset/'
      train_dataset = StreamingDataset(local=local, remote=remote, split='train')
  Streaming expects the credentials to be present in ~/.oci/config path.
- Support for public AWS S3 buckets
  Streaming now supports AWS S3 buckets which are public resources that can be accessed without credentials, apart from the already supported private AWS S3 buckets. One can instantiate the StreamingDataset class with an AWS S3 bucket as follows:
      from streaming import StreamingDataset
      remote = 's3://<bucket>/<path>'
      local = '/tmp/dataset/'
      train_dataset = StreamingDataset(local=local, remote=remote, split='train')
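To make the elastic deterministic shuffle feature above more concrete, here is a deliberately simplified sketch. It is not the library's algorithm: it ignores the canonical number of nodes, it hands out samples to devices in a simple strided pattern, and global_order, device_samples, and collective_traversal are hypothetical names. It only shows the property the notes describe: the collective order depends on the seed and epoch, not on how many devices are reading.

```python
import random
from typing import List


def global_order(num_samples: int, seed: int, epoch: int, shuffle: bool = True) -> List[int]:
    """Sample order that depends only on (seed, epoch), never on the world size."""
    order = list(range(num_samples))
    if shuffle:
        random.Random(seed + epoch).shuffle(order)
    return order


def device_samples(order: List[int], rank: int, world_size: int,
                   resume_from: int = 0) -> List[int]:
    """Samples one device reads: every world_size-th entry after the resume point."""
    return order[resume_from:][rank::world_size]


def collective_traversal(order: List[int], world_size: int,
                         resume_from: int = 0) -> List[int]:
    """Interleave the per-device streams step by step to recover the overall order."""
    streams = [device_samples(order, r, world_size, resume_from) for r in range(world_size)]
    merged: List[int] = []
    for step in range(max(len(s) for s in streams)):
        merged.extend(s[step] for s in streams if step < len(s))
    return merged


order = global_order(num_samples=16, seed=42, epoch=0)
# Train on 8 devices for one step each (8 samples), then resume the epoch on 4 devices.
before_checkpoint = collective_traversal(order, world_size=8)[:8]
after_resume = collective_traversal(order, world_size=4, resume_from=8)
```

Because the resume point is just an offset into the same global order, nothing has to be replayed after a checkpoint, which is the idea behind instant mid-epoch resumption.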
API changes
- The class Dataset has been renamed to StreamingDataset (#37).
  - Similarly, the built-in classes for the most popular datasets have also been renamed. For example:
    - C4 renamed to StreamingC4
    - EnWiki renamed to StreamingEnWiki
    - Pile renamed to StreamingPile
    - ADE20K renamed to StreamingADE20K
    - CIFAR10 renamed to StreamingCIFAR10
    - COCO renamed to StreamingCOCO
    - ImageNet renamed to StreamingImageNet
- The parameter prefetch in class Dataset has been renamed to predownload in class StreamingDataset (#37).
- The parameter retry in class Dataset has been renamed to download_retry in class StreamingDataset (#37).
- The parameter timeout in class Dataset has been renamed to download_timeout in class StreamingDataset (#37).
- The parameter hash in class Dataset has been renamed to validate_hash in class StreamingDataset (#37).
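The parameter renames listed above could be applied mechanically when porting older call sites. The helper below is a hypothetical convenience, not part of the library; only the four name pairs come from the notes, and the example values are placeholders.

```python
from typing import Any, Dict

# Old Dataset parameter name -> new StreamingDataset parameter name (per the v0.2.0 notes).
RENAMED_PARAMS = {
    'prefetch': 'predownload',
    'retry': 'download_retry',
    'timeout': 'download_timeout',
    'hash': 'validate_hash',
}


def migrate_kwargs(old_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of old_kwargs with renamed parameters translated."""
    new_kwargs: Dict[str, Any] = {}
    for name, value in old_kwargs.items():
        new_name = RENAMED_PARAMS.get(name, name)
        if new_name in new_kwargs:
            # Both the old and the new spelling were supplied.
            raise ValueError(f'duplicate value for {new_name!r}')
        new_kwargs[new_name] = value
    return new_kwargs


migrated = migrate_kwargs({'local': '/tmp/dataset/', 'prefetch': 100_000, 'retry': 2})
```

Names that were not renamed, such as local, pass through unchanged.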
What's Changed
- Bump nbsphinx from 0.8.9 to 0.8.10 by @dependabot in #73
- Bump sphinx-argparse from 0.3.2 to 0.4.0 by @dependabot in #74
- The Pile (conversion + streaming dataset) by @knighton in #71
- [Docs] Switch back to RTD search by @bandish-shah in #83
- make pyright precommit check actually run by @dblalock in #84
- Fixed stale URL references by @bandish-shah in #85
- Bump sphinx-copybutton from 0.5.0 to 0.5.1 by @dependabot in #78
- Bump pandoc from 2.2 to 2.3 by @dependabot in #79
- Bump sphinxcontrib-katex from 0.9.0 to 0.9.3 by @dependabot in #80
- Bump sphinxext-opengraph from 0.7.2 to 0.7.3 by @dependabot in #81
- Support for concat option in C4 Dataset by @karan6181 in #77
- Elastic world size deterministic shuffle with mid-epoch resumption by @knighton in #37
- Support for S3 public bucket by @karan6181 in #88
- Add OCI Cloud Storage support by @karan6181 in #86
- Make StreamingDataset state_dict() more flexible by @knighton in #90
- Bump version to 0.2.0 by @karan6181 in #92
Full Changelog: v0.1.2...v0.2.0
v0.1.2
🚀 Streaming v0.1.2
Streaming v0.1.2 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.1.2
What's Changed
- Fixed contributing page link by @karan6181 in #61
- Add Distributed test and supported multi device unittest by @karan6181 in #57
- Added template and adhere to standard coding practice by @karan6181 in #62
- Bump pytest from 7.1.3 to 7.2.0 by @dependabot in #63
- Bump pypandoc from 1.9 to 1.10 by @dependabot in #65
- Add code coverage report and moved scripts outside of src by @karan6181 in #66
- Bump sphinxext-opengraph from 0.6.3 to 0.7.2 by @dependabot in #67
- Add Google Cloud Storage support by @karan6181 in #68
- Create and push release branch as part of workflow by @karan6181 in #69
- Add test CI badge in README by @karan6181 in #70
- Add unit test for download, encodings, hashing, and others by @karan6181 in #72
- Bump version to 0.1.2 by @karan6181 in #75
Full Changelog: v0.1.1...v0.1.2
v0.1.1
🚀 Streaming v0.1.1
Streaming v0.1.1 is released! Install via pip:
pip install --upgrade mosaicml-streaming==0.1.1
What's Changed
- Streaming datasets V2 by @knighton in #2
- Initial Docs Site by @bandish-shah in #3
- Added a ADE20K and COCO2017 data conversion scripts by @karan6181 in #5
- Added pre-commit config by @karan6181 in #6
- Added pre-commit config for a License Header by @karan6181 in #7
- Convert relative imports to absolute imports by @karan6181 in #8
- C4 dataset by @knighton in #4
- Add a ADE20K streaming dataset class by @karan6181 in #9
- PyPi mods for setup.py by @bandish-shah in #10
- Disable local shard deletion by @knighton in #12
- Add a COCO streaming dataset class by @karan6181 in #13
- Add docstrings. by @knighton in #14
- Added unittest for Writer and Reader by @karan6181 in #16
- added new streaming logos by @ejyuen in #15
- Update package version code for unification by @karan6181 in #17
- Fix wait-for-unzip race by @knighton in #18
- Added algolia search to streaming docs site by @nqn in #19
- Add a pre-commit GitHub workflow by @karan6181 in #21
- Added pydocstyle and docformatter in pre-commit config by @karan6181 in #20
- Improve algorithmic complexity of sample-to-shard lookup from O(log N) to O(1) by @knighton in #22
- Add enwiki-20200101 streaming dataset by @knighton in #23
- Add submodules to api reference doc by @karan6181 in #24
- Initial Docs site content by @bandish-shah in #11
- Add unittest for compression by @karan6181 in #25
- Fix hang when compression is used but compressed files are not retained by @knighton in #26
- Add long_description for packaging by @bandish-shah in #29
- Update tutorial notebooks to have it run end-to-end by @karan6181 in #30
- Adjustment for last partition bug by @knighton in #27
- Fix preprocessing for English Wikipedia dataset by @knighton in #28
- Fix enwiki dataset by @dskhudia in #31
- Skip pre-commit check for enwiki convert skip to have code parity by @karan6181 in #32
- Update doc and fixed reference links by @karan6181 in #33
- Parallel tfrecord creation, validate sample counts vs MDS by @knighton in #34
- Bump up the version to 0.0.1b by @karan6181 in #35
- Add NLP synthetic dataset jupyter notebook tutorial by @karan6181 in #36
- Add README and CONTRIBUTING guide by @karan6181 in #38
- Typos + copy editing in README by @dblalock in #40
- Re-factor docs tutorials to top-level examples by @bandish-shah in #39
- Fixed typos and update documentation by @karan6181 in #42
- Add CodeQL security scanner and Dependabot workflow by @karan6181 in #43
- Bump gitpython from 3.1.28 to 3.1.29 by @dependabot in #46
- Bump myst-parser from 0.16.1 to 0.18.1 by @dependabot in #47
- Add bug report and feature request template by @karan6181 in #48
- mlperf enwiki conversion code mild cleanup by @knighton in #41
- Add Build publish to PyPI and create GitHub release workflow by @karan6181 in #50
- Added writer unittest and update existing test by @karan6181 in #52
- Bump version to 0.1.0 by @karan6181 in #53
- Fixed dead image link in pypi home page by @karan6181 in #54
- Add TorchVision VisionDataset inheritance. by @knighton in #55
- bump version to 0.1.1b0 by @karan6181 in #56
- Fixed rendering of pypi image by @karan6181 in #59
- Bump version to 0.1.1 by @karan6181 in #60
New Contributors
- @knighton made their first contribution in #2
- @bandish-shah made their first contribution in #3
- @karan6181 made their first contribution in #5
- @ejyuen made their first contribution in #15
- @nqn made their first contribution in #19
- @dskhudia made their first contribution in #31
- @dblalock made their first contribution in #40
- @dependabot made their first contribution in #46
Full Changelog: https://github.com/mosaicml/streaming/commits/v0.1.1