Releases: mosaicml/streaming

v0.3.0

01 Mar 08:30
11d0944

🚀 Streaming v0.3.0

Streaming v0.3.0 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.3.0

New Features

☁️ Cloud uploading

Now, you can automatically upload shards to cloud storage on the fly by providing a cloud path to MDSWriter. Track the progress of individual uploads with progress_bar=True, and tune background upload workers with max_workers=4.

You can upload output shard files automatically to a supported cloud provider (AWS S3, GCP, OCI) by passing a bucket location as the out parameter of the Writer class. The example below uploads output files to an AWS S3 bucket:

output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, ...) as out:
    for sample in samples:
        out.write(sample)

You can keep the output shard files locally by passing a local directory path as the out parameter of the Writer. For example:

output_dir = '/tmp/mds'
with MDSWriter(out=output_dir, ...) as out:
    for sample in samples:
        out.write(sample)

You can watch the progress of the cloud upload by setting progress_bar=True on the Writer. For example:

output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, progress_bar=True, ...) as out:
    for sample in samples:
        out.write(sample)

You can control the number of background upload threads with the Writer's max_workers parameter. These threads upload the shard files to the remote location, if one is provided, and each thread handles one file upload at a time. For example, with max_workers=4, at most four threads are active at the same time, each uploading one shard file.

output_dir = 's3://bucket/dir/path'
with MDSWriter(out=output_dir, max_workers=4, ...) as out:
    for sample in samples:
        out.write(sample)
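The "one thread per file, at most max_workers at a time" behavior can be pictured with a small standard-library sketch. This is not the library's implementation: upload_file and upload_shards are hypothetical names, and a local directory copy stands in for the cloud upload.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def upload_file(src: Path, dest_dir: Path) -> str:
    """Hypothetical stand-in for a cloud upload: copy one shard file."""
    shutil.copy(src, dest_dir / src.name)
    return src.name


def upload_shards(shards, dest_dir, max_workers=4):
    """Upload shards in background threads, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each submitted task uploads exactly one shard file.
        futures = [pool.submit(upload_file, shard, dest_dir) for shard in shards]
        # result() re-raises any exception raised in the worker thread.
        return [future.result() for future in futures]


with tempfile.TemporaryDirectory() as tmp:
    local = Path(tmp) / 'local'
    remote = Path(tmp) / 'remote'  # stands in for a bucket
    local.mkdir()
    remote.mkdir()

    # Create eight small dummy shard files.
    shards = []
    for i in range(8):
        path = local / f'shard.{i:05}.mds'
        path.write_bytes(b'x' * 16)
        shards.append(path)

    uploaded = upload_shards(shards, remote, max_workers=4)
    remote_names = sorted(p.name for p in remote.iterdir())
```

With eight shards and max_workers=4, no more than four copies run concurrently, and the remaining shards are picked up as threads become free.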

🔀 2x faster shuffling

We’ve added a new shuffling algorithm, py1s, which is twice as fast on typical workloads. You can choose the shuffling algorithm by overriding shuffle_algo (the old behavior is py2s). You will experience this as faster epoch starts and faster mid-epoch resumption for large datasets.

📨 2x faster partitioning

We’ve also reimplemented how shards/samples are assigned to nodes/devices/dataloader workers to run about twice as fast on typical workloads while giving identical results. This is exposed as the partition_algo argument to StreamingDataset. You will experience this as faster start and resumption for large datasets.

🔗 Extensible downloads

We provide examples of modifying StreamingDataset to stream from a dataset of links to external data sources. In our examples, using the WebVid dataset, each sample points to a video file which exists outside of the shards in its original format and is downloaded separately. Benchmarking is included.

API changes

  • The dirname parameter of the Writer class and its derived classes (MDSWriter, XSVWriter, TSVWriter, CSVWriter, and JSONWriter) has been renamed to out, which adds the following behavior:

    • If out is a local directory, shard files are saved locally. For example, out='/tmp/mds/'.
    • If out is a remote directory, shard files are cached in a local temporary directory and then uploaded to the remote location. The temporary directory is deleted once the shards are uploaded. For example, out='s3://bucket/dir/path'.
    • If out is a tuple of (local_dir, remote_dir), shard files are saved in local_dir and also uploaded to the remote location. For example, out=('/tmp/mds/', 's3://bucket/dir/path').
  • Given the complexity of their arguments and the need to upgrade them safely over time, the APIs of Writer and its subclasses (like MDSWriter) and of StreamingDataset now require keyword arguments.
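The three forms of out, and the keyword-only requirement, can be sketched as follows. This is an illustration under assumptions rather than the library's code: resolve_out, SketchWriter, and the REMOTE_PREFIXES list are hypothetical.

```python
import tempfile
from typing import Optional, Tuple, Union

# Assumed set of remote URL schemes, for illustration only.
REMOTE_PREFIXES = ('s3://', 'gs://', 'oci://')


def resolve_out(out: Union[str, Tuple[str, str]]) -> Tuple[str, Optional[str], bool]:
    """Return (local_dir, remote_dir, local_is_temporary) for an out value."""
    if isinstance(out, tuple):
        if len(out) != 2:
            raise ValueError('out tuple must be (local_dir, remote_dir)')
        local, remote = out
        return local, remote, False  # keep local copy and upload
    if out.startswith(REMOTE_PREFIXES):
        # Remote only: cache shards in a temporary directory before upload.
        return tempfile.mkdtemp(), out, True
    return out, None, False  # local only, nothing to upload


class SketchWriter:
    """Hypothetical writer; the bare * makes every argument keyword-only."""

    def __init__(self, *, out, max_workers=4):
        self.local, self.remote, self.is_temp = resolve_out(out)
        self.max_workers = max_workers


writer = SketchWriter(out=('/tmp/mds/', 's3://bucket/dir/path'))
```

Calling SketchWriter('/tmp/mds/') with a positional argument raises TypeError, which is the effect of requiring keyword arguments.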

Full Changelog: v0.2.5...v0.3.0

v0.2.5

14 Feb 05:29
4bf9c1c

🚀 Streaming v0.2.5

Streaming v0.2.5 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.2.5

Bug Fixes

  • Fixed CPU crash (#153)
  • Update example notebooks (#157)

Full Changelog: v0.2.4...v0.2.5

v0.2.4

10 Feb 00:52
f392223

🚀 Streaming v0.2.4

Streaming v0.2.4 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.2.4

Full Changelog: v0.2.3...v0.2.4

v0.2.3

31 Jan 20:36
6a30df6

🚀 Streaming v0.2.3

Streaming v0.2.3 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.2.3

New Features

  • Add scalar MDS encodings data types (#130)
  • Support of WebVid-10M dataset (#132)
  • Support of LAION-400M dataset (#87)
  • Make StreamingDataset[sample_id] block until the given sample's shard is downloaded if it is not already present, so that the dataset can be used lazily (#118)
  • Add a Streaming benchmarking script that reports the time taken by each individual component (#121)
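The lazy-access behavior of StreamingDataset[sample_id] can be pictured with a toy model. This is a sketch, not the library's code: LazyShardDataset is a hypothetical class, it assumes a fixed number of samples per shard, and copying from an in-memory list stands in for the download.

```python
class LazyShardDataset:
    """Toy dataset that fetches a sample's shard on first access."""

    def __init__(self, remote_shards, samples_per_shard):
        self.remote_shards = remote_shards  # stands in for remote storage
        self.samples_per_shard = samples_per_shard  # assumed fixed size
        self.local_shards = {}  # shard index -> downloaded samples
        self.downloads = 0  # number of simulated downloads so far

    def __len__(self):
        return sum(len(shard) for shard in self.remote_shards)

    def _ensure_shard(self, shard_id):
        # "Download" the shard only if it is not already present locally.
        if shard_id not in self.local_shards:
            self.local_shards[shard_id] = list(self.remote_shards[shard_id])
            self.downloads += 1

    def __getitem__(self, sample_id):
        if not 0 <= sample_id < len(self):
            raise IndexError(sample_id)
        shard_id, offset = divmod(sample_id, self.samples_per_shard)
        self._ensure_shard(shard_id)  # blocks until the shard is available
        return self.local_shards[shard_id][offset]


dataset = LazyShardDataset([[0, 1, 2, 3], [4, 5, 6, 7]], samples_per_shard=4)
first = dataset[5]  # triggers a download of shard 1 only
```

Only the shard that contains the requested sample is fetched, so random access works without downloading the whole dataset first.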

Bug Fixes

  • Nuke concat option in C4 dataset (#129)
  • Fixed bug report markdown doc (#140)
  • Fixed ADE20K dataset conversion script (#133)

Full Changelog: v0.2.2...v0.2.3

v0.2.2

09 Jan 22:11
f29bac1

🚀 Streaming v0.2.2

Streaming v0.2.2 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.2.2

New Features

  • Add in-browser partitioning visualizer (#108)
  • Add command-line partitioning visualizer (#115)

Bug Fixes

  • Get dataloader worker multiprocessing working with spawn, removing Mac OSX fork requirement (#97)
  • Improve error messaging (#100)
  • Fix CUDA OOM (#103)
  • Fix broken source code links in docs (#104)
  • Reference the shared memory object in a worker process when using spawn multiprocessing method (#106)
  • Release all the StreamingDataset resources during job termination (#107)

Full Changelog: v0.2.1...v0.2.2

v0.2.1

22 Dec 23:09
0dec354

🚀 Streaming v0.2.1

Streaming v0.2.1 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.2.1

Bug Fixes

  • Make StreamingDataset smarter about when to init dist itself, fixing env var rendezvous problem (#94).
  • Shorten shared memory names for Mac OSX (#95).
  • Reduce memory usage in StreamingDataset, alleviating inscrutable worker OOMs with large datasets (#96).
  • Better exception handling in downloading (#98).
  • Hard require fork for dataloader multiprocessing in Mac OSX due to unpickleable objects (#101).

What's Changed

  • Also check if dist env vars are set. If not set, don't init dist. by @knighton in #94
  • Shorten the names of shared memory objects to make OSX happy. by @knighton in #95
  • Just do the partitioning/shuffling in the local leader worker. by @knighton in #96
  • propagate the actual exception and raise by @karan6181 in #98
  • Set multiprocessing method as fork for Mac OS by @karan6181 in #101
  • Bump version to 0.2.1 by @karan6181 in #102

Full Changelog: v0.2.0...v0.2.1

v0.2.0

09 Dec 06:44
1067f1b

🚀 Streaming v0.2.0

Streaming v0.2.0 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.2.0

New Features

  1. Elastic world size deterministic shuffle

    Shuffled or not, StreamingDataset now collectively traverses the samples in identical order across all the devices, given a seed and a canonical number of nodes. This ordering holds true even if you checkpoint and resume training of the same epoch on a different number of nodes.

  2. Instant Mid-Epoch Resumption

    Waiting while your data loader spins to resume from where you left off can be costly! StreamingDataset now lets you resume immediately.

  3. NEW StreamingDataLoader

    StreamingDataLoader is a drop-in replacement for the PyTorch DataLoader that adds mid-epoch resumption: it resumes from where you left off without spinning the dataloader.

  4. Support for Oracle Cloud Infrastructure (OCI) blob storage

    Streaming now supports OCI blob storage as a storage backend. Pass the OCI path to StreamingDataset as either oci://<bucket_name>@<namespace>/<folder_name>/<filename> or oci://<bucket_name>/<folder_name>/<filename>. For example:

    from streaming import StreamingDataset
    
    remote = 'oci://<bucket>@<namespace>/<path>'
    local = '/tmp/dataset/'
    
    train_dataset = StreamingDataset(local=local, remote=remote, split='train')

    Streaming expects the credentials to be present at ~/.oci/config.

  5. Support for public AWS S3 buckets

    In addition to the already supported private AWS S3 buckets, Streaming now supports public AWS S3 buckets, which can be accessed without credentials. Instantiate StreamingDataset with an AWS S3 bucket as follows:

    from streaming import StreamingDataset
    
    remote = 's3://<bucket>/<path>'
    local = '/tmp/dataset/'
    
    train_dataset = StreamingDataset(local=local, remote=remote, split='train')
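The elastic deterministic ordering and mid-epoch resumption described in items 1 and 2 can be illustrated with a simplified model. It is a sketch under assumptions, not the library's algorithm: global_order and device_slice are hypothetical helpers, and the notion of canonical nodes is left out.

```python
import random


def global_order(num_samples, seed, epoch):
    """Sample order for one epoch; depends only on (seed, epoch)."""
    ids = list(range(num_samples))
    # Derive a per-epoch seed so each epoch gets a different, repeatable order.
    random.Random(seed * 1_000_003 + epoch).shuffle(ids)
    return ids


def device_slice(order, rank, world_size, start=0):
    """Samples for one device, skipping the first `start` consumed samples."""
    remaining = order[start:]
    # Devices take turns: rank r gets positions r, r + W, r + 2W, ...
    return remaining[rank::world_size]


order = global_order(16, seed=9176, epoch=0)

# Train on 2 devices until 8 samples have been consumed collectively,
# then resume the same epoch on 4 devices from sample position 8.
before = [device_slice(order[:8], rank, 2) for rank in range(2)]
after = [device_slice(order, rank, 4, start=8) for rank in range(4)]
```

Because the global order is fixed by the seed and epoch alone, interleaving the per-device slices reproduces the same sequence for any world size, and resuming only requires the count of samples already consumed.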
    

API changes

  • The class Dataset has been renamed as class StreamingDataset (#37).
    • Similarly, the built-in classes for the most popular datasets have also been renamed. For example:
      • C4 renamed as StreamingC4
      • EnWiki renamed as StreamingEnWiki
      • Pile renamed as StreamingPile
      • ADE20K renamed as StreamingADE20K
      • CIFAR10 renamed as StreamingCIFAR10
      • COCO renamed as StreamingCOCO
      • ImageNet renamed as StreamingImageNet
  • The parameter prefetch in class Dataset has been renamed as predownload in class StreamingDataset (#37).
  • The parameter retry in class Dataset has been renamed as download_retry in class StreamingDataset (#37).
  • The parameter timeout in class Dataset has been renamed as download_timeout in class StreamingDataset (#37).
  • The parameter hash in class Dataset has been renamed as validate_hash in class StreamingDataset (#37).
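When migrating call sites from Dataset to StreamingDataset, the parameter renames listed above can be applied mechanically. The helper below is a hypothetical convenience written for illustration; it is not part of the library.

```python
# Old Dataset parameter name -> new StreamingDataset parameter name (#37).
RENAMED_KWARGS = {
    'prefetch': 'predownload',
    'retry': 'download_retry',
    'timeout': 'download_timeout',
    'hash': 'validate_hash',
}


def migrate_kwargs(old_kwargs):
    """Return a copy of old_kwargs with renamed parameters updated."""
    new_kwargs = {}
    for name, value in old_kwargs.items():
        new_name = RENAMED_KWARGS.get(name, name)
        if new_name in new_kwargs:
            # Both the old and the new spelling were passed.
            raise ValueError(f'duplicate argument: {new_name}')
        new_kwargs[new_name] = value
    return new_kwargs


migrated = migrate_kwargs({'local': '/tmp/dataset/', 'prefetch': 100, 'retry': 2})
```

Parameters that were not renamed, such as local, pass through unchanged.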

Full Changelog: v0.1.2...v0.2.0

v0.1.2

14 Nov 22:14
0c34652

🚀 Streaming v0.1.2

Streaming v0.1.2 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.1.2

Full Changelog: v0.1.1...v0.1.2

v0.1.1

24 Oct 22:15
2035a72

🚀 Streaming v0.1.1

Streaming v0.1.1 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.1.1

Full Changelog: https://github.com/mosaicml/streaming/commits/v0.1.1