v0.5.0

@karan6181 karan6181 released this 06 Jun 13:31
· 317 commits to main since this release
8e16aa9

🚀 Streaming v0.5.0

Streaming v0.5.0 is released! Install via pip:

pip install --upgrade mosaicml-streaming==0.5.0

New Features

🆕 Cold Shard Eviction. ( #219 )

Dynamically deletes the least recently used shards to keep disk usage under a specified limit. To enable this, set the StreamingDataset argument cache_limit. See the shuffling guide for more details.

from streaming import StreamingDataset

dataset = StreamingDataset(
    cache_limit='100gb',
    ...
)
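
To make the eviction policy concrete, here is a minimal sketch of least-recently-used eviction under a byte limit. It is a toy model, not the library's implementation: the ShardCache class, its touch method, and the byte sizes are all hypothetical names and values chosen for illustration.

```python
from collections import OrderedDict


class ShardCache:
    """Toy LRU cache (hypothetical): tracks shard sizes, evicts the coldest."""

    def __init__(self, cache_limit_bytes: int) -> None:
        self.cache_limit_bytes = cache_limit_bytes
        self.shards = OrderedDict()  # shard_id -> size in bytes, coldest first
        self.usage = 0               # total bytes currently cached
        self.evicted = []            # shard IDs evicted so far, in order

    def touch(self, shard_id: int, size: int) -> None:
        """Record an access, adding the shard to the cache if it is absent."""
        if shard_id in self.shards:
            self.shards.move_to_end(shard_id)  # mark as most recently used
            return
        # Evict least recently used shards until the new shard fits.
        while self.shards and self.usage + size > self.cache_limit_bytes:
            cold_id, cold_size = self.shards.popitem(last=False)
            self.usage -= cold_size
            self.evicted.append(cold_id)
        self.shards[shard_id] = size
        self.usage += size


cache = ShardCache(cache_limit_bytes=100)
cache.touch(0, 40)
cache.touch(1, 40)
cache.touch(0, 40)  # shard 0 becomes the most recently used
cache.touch(2, 40)  # would exceed the limit, so the coldest shard (1) goes
print(cache.evicted)       # [1]
print(list(cache.shards))  # [0, 2]
```

In this model a shard that was just read is the last candidate for eviction, which is why shard 0 survives even though it was downloaded first.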

🤙 Fetch sample using NumPy style indexing. ( #120 )

Users can now randomly access samples using NumPy-style indexing with StreamingDataset. For example,

import numpy as np
from streaming import StreamingDataset

dataset = StreamingDataset(local=local, remote=remote)

dataset[0]  # Fetch sample 0
dataset[-1]  # Fetch last sample
dataset[[10, 20]]  # Fetch samples 10 and 20
dataset[slice(1, 10, 2)]  # Fetch samples 1, 3, 5, 7, and 9
dataset[5:0:-1]  # Fetch samples 5, 4, 3, 2, and 1
dataset[np.array([4, 7])]  # Fetch samples 4 and 7
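
For readers curious how one __getitem__ can accept all of these index types, the following is a rough sketch of the normalization step, assuming each index form is first resolved to a flat list of sample IDs. The helper resolve_indices is hypothetical and is not part of the Streaming API.

```python
import operator


def resolve_indices(index, num_samples):
    """Resolve an int, slice, or sequence of indices into a list of sample IDs.

    Hypothetical helper for illustration only.
    """
    if isinstance(index, slice):
        # slice.indices() clamps start/stop/step to the dataset length.
        return list(range(*index.indices(num_samples)))
    try:
        # operator.index accepts Python ints and NumPy integer scalars.
        i = operator.index(index)
    except TypeError:
        # Not a scalar: treat it as a sequence (list or array) of indices.
        out = []
        for item in index:
            out.extend(resolve_indices(item, num_samples))
        return out
    if i < 0:
        i += num_samples  # negative indices count from the end
    if not 0 <= i < num_samples:
        raise IndexError(f'sample index {index} is out of range')
    return [i]


print(resolve_indices(-1, 100))               # [99]
print(resolve_indices([10, 20], 100))         # [10, 20]
print(resolve_indices(slice(1, 10, 2), 100))  # [1, 3, 5, 7, 9]
print(resolve_indices(slice(5, 0, -1), 100))  # [5, 4, 3, 2, 1]
```

Resolving every form to plain integers first means the sample-fetching logic only needs to handle a single sample ID at a time.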

🦾 Any S3 compatible object store. ( #265 )

Added support for any S3-compatible object store, that is, any object store that exposes the S3 API. Examples include Cloudflare R2, CoreWeave, and Backblaze B2. To use one, set the environment variable S3_ENDPOINT_URL to the endpoint of your object store. Details on how to configure credentials can be found here.
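
As a small illustration, the variable can be set from Python before the dataset is created. The endpoint URL below is a made-up placeholder, and the helper s3_client_kwargs is hypothetical: it only sketches how a custom endpoint might be forwarded to an S3 client (boto3 clients accept an endpoint_url keyword), not how Streaming reads the variable internally.

```python
import os

# Placeholder endpoint (an assumption); use the URL your provider documents.
os.environ['S3_ENDPOINT_URL'] = 'https://example-account.r2.cloudflarestorage.com'


def s3_client_kwargs():
    """Build S3 client keyword arguments from the environment (hypothetical)."""
    kwargs = {}
    endpoint = os.environ.get('S3_ENDPOINT_URL')
    if endpoint:
        # With boto3 this could be passed as boto3.client('s3', **kwargs).
        kwargs['endpoint_url'] = endpoint
    return kwargs


print(s3_client_kwargs())
```

Exporting the variable in the shell that launches the job has the same effect and avoids hard-coding the endpoint in source code.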

🦾 Azure cloud blob storage. ( #256 )

Added support for Azure cloud blob storage. Details on how to configure credentials can be found here.

Bug Fixes

  • Wait for the download and ready threads to finish before terminating the job. ( #286 )
  • Fixed length calculation to use resampled epoch size, not underlying num samples. ( #278 )
  • Fixed mypy errors by adding a py.typed marker file. ( #245 )
  • Create a new boto3 session per thread to avoid sharing resources. ( #241 )

🔧 API changes

  • The argument samples_per_epoch has been renamed to epoch_size in StreamingDataset to better distinguish the actual number of underlying samples as serialized from the number of samples observed when iterating (which may differ because of weighting sub-datasets).
  • The argument samples has been renamed to choose in Stream to better distinguish the underlying sample vs resampled data.
  • The argument keep_raw has been removed in StreamingDataset in the process of finalizing the design for shard eviction (see the newly-added cache_limit parameter).
  • The default value of predownload in StreamingDataset was updated; it is now derived from the batch size and the number of canonical nodes instead of the previous constant value of 100_000. This prevents predownloaded shards from being evicted before they are ever used.
  • The default value of num_canonical_nodes in StreamingDataset was updated to 64 times the number of nodes of the initial run, instead of the number of nodes of the initial run, to increase data source diversity and improve convergence.
  • The default value of shuffle_algo in StreamingDataset was changed from py1b to py1s, as it requires fewer shards to be downloaded during iteration. More details about the different shuffling algorithms can be found here.
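
When upgrading existing code, the renames and the removal above amount to a small keyword-argument migration. The sketch below is one way to express it; the helper names migrate_dataset_kwargs and default_num_canonical_nodes are hypothetical, and the second function simply restates the documented "64 times the number of nodes" default as arithmetic.

```python
def migrate_dataset_kwargs(old_kwargs):
    """Translate pre-0.5.0 StreamingDataset kwargs to v0.5.0 names (hypothetical)."""
    new_kwargs = dict(old_kwargs)  # copy so the caller's dict is untouched
    if 'samples_per_epoch' in new_kwargs:
        # Renamed in v0.5.0.
        new_kwargs['epoch_size'] = new_kwargs.pop('samples_per_epoch')
    # keep_raw was removed in v0.5.0; cache_limit now governs eviction.
    new_kwargs.pop('keep_raw', None)
    return new_kwargs


def default_num_canonical_nodes(num_nodes_initial_run):
    """New default: 64 times the number of nodes of the initial run."""
    return 64 * num_nodes_initial_run


old = {'local': '/tmp/cache', 'samples_per_epoch': 1000, 'keep_raw': True}
print(migrate_dataset_kwargs(old))     # {'local': '/tmp/cache', 'epoch_size': 1000}
print(default_num_canonical_nodes(2))  # 128
```

The same pattern applies to Stream, where the samples argument becomes choose.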

Full Changelog: v0.4.1...v0.5.0