The Simulator! #385

Merged
merged 40 commits into from
Nov 2, 2023
Changes from 35 commits
Commits
b5d6098
simulationgit status
snarayan21 Aug 17, 2023
128214c
pylint fixes
snarayan21 Aug 17, 2023
4bc747f
more pylint fixes
snarayan21 Aug 17, 2023
661d118
simulation bug fixing with cache lim
snarayan21 Aug 23, 2023
6ef391d
added py1e to simulator
snarayan21 Aug 23, 2023
22fccef
throughput scales as device batch size decreases
snarayan21 Aug 24, 2023
d874523
script changes
snarayan21 Aug 24, 2023
ef02aa8
new UI, simulator as generator, simulation testing
snarayan21 Aug 26, 2023
2a3ecde
testing fixes, remove prints
snarayan21 Aug 28, 2023
2cd70f4
ui text
snarayan21 Aug 28, 2023
2a8ba75
ui text change
snarayan21 Aug 28, 2023
9037ddc
added streaming metrics, warnings, errors.
snarayan21 Aug 29, 2023
a4902e6
modified update interval
snarayan21 Sep 6, 2023
a42bc24
sim changes
snarayan21 Sep 18, 2023
66688f2
ported files to streaming repo
snarayan21 Oct 3, 2023
72a2915
merged from main
snarayan21 Oct 3, 2023
c9f024d
add file info strings
snarayan21 Oct 3, 2023
002b9f8
fixing docstrings and typing for core functions
snarayan21 Oct 4, 2023
fac8fd8
fixed all typing and pyright stuff
snarayan21 Oct 4, 2023
0ea3979
Merge branch 'main' of https://github.com/snarayan21/saaketh-streamin…
snarayan21 Oct 4, 2023
590bb4f
reversed change to shuffle init.py
snarayan21 Oct 4, 2023
a16edd2
linting
snarayan21 Oct 4, 2023
25c3e50
fixed yaml parsing bug
snarayan21 Oct 4, 2023
88b92e8
addressed comments on setup.py and create_index.py
snarayan21 Oct 10, 2023
3233687
Merge branch 'main' into simulator
snarayan21 Oct 17, 2023
650bcfb
Merge branch 'main' of https://github.com/snarayan21/saaketh-streamin…
snarayan21 Oct 17, 2023
63c8364
added 'simulator' command for easy startup
snarayan21 Oct 17, 2023
910cbde
Merge branch 'simulator' of https://github.com/snarayan21/saaketh-str…
snarayan21 Oct 17, 2023
bd9ed1b
changed file prefixes to be consistent
snarayan21 Oct 17, 2023
eeaec74
shuffle quality metric is now relative to naive
snarayan21 Oct 17, 2023
da269be
tuple to Tuple
snarayan21 Oct 17, 2023
18488e2
added docs, deleted redundant images folder
snarayan21 Oct 17, 2023
831f708
Merge branch 'main' of https://github.com/snarayan21/saaketh-streamin…
snarayan21 Oct 18, 2023
4cc2655
addressed Karan comments
snarayan21 Oct 18, 2023
1b0c8a4
Merge branch 'main' into simulator
snarayan21 Oct 24, 2023
d5f5d0a
Merge branch 'main' of https://github.com/snarayan21/saaketh-streamin…
snarayan21 Oct 27, 2023
76aeaa7
addressed comments, fixed assert statements
snarayan21 Oct 27, 2023
f67b815
changed to current defaults
snarayan21 Oct 28, 2023
524675f
minor usability improvements
snarayan21 Oct 30, 2023
be6779a
Merge branch 'main' into simulator
snarayan21 Nov 2, 2023
3 changes: 3 additions & 0 deletions Makefile
@@ -28,4 +28,7 @@ test:
web:
	uvicorn scripts.partition.web:app --port 1337 --reload

simulator:
	streamlit run simulation/interfaces/sim_ui.py

.PHONY: test lint style
Binary file added docs/source/_static/images/downloads.png
Binary file added docs/source/_static/images/inputs.png
Binary file added docs/source/_static/images/stats.png
Binary file added docs/source/_static/images/throughput.png
Binary file added docs/source/_static/images/yaml_toggle.png
45 changes: 45 additions & 0 deletions docs/source/fundamentals/simulator.md
@@ -0,0 +1,45 @@
# Streaming Simulator
A simulator for throughput, network use, and shuffle quality with MosaicML Streaming. The simulator allows you to:
- Plan runs and anticipate issues beforehand
- Find optimal run configurations
- Debug issues with underperforming runs
- Better understand the impact of different configurations

## Getting Started
Run the following to install the simulator-specific dependencies, if they are not already installed:
```
pip install --upgrade "mosaicml-streaming[simulator]"
```
Then, simply run `simulator` in your command line to open the Web UI and get simulating!

## Key Features

### Throughput
Throughput is estimated for the duration of the run and displayed as the simulation progresses. We estimate it by iterating over the samples of the dataset in order and simulating shard downloads based on an estimate of network bandwidth. The graph shows the 10-step rolling average.

<img src="../_static/images/throughput.png" alt="Throughput Graph" width="500"/>
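
As a rough illustration of the smoothing described above, here is a minimal sketch of a 10-step rolling average. The function name `rolling_average` and the sample values are hypothetical, and the handling of the first few steps (averaging over however many values exist so far) is an assumption, not necessarily what the simulator does.

```python
from collections import deque
from typing import Iterable, Iterator


def rolling_average(values: Iterable[float], window: int = 10) -> Iterator[float]:
    """Yield the mean of the most recent ``window`` values after each new value."""
    recent = deque(maxlen=window)  # The oldest value is dropped automatically.
    total = 0.0
    for value in values:
        if len(recent) == window:
            total -= recent[0]  # This value is about to be evicted from the window.
        recent.append(value)
        total += value
        yield total / len(recent)


# Hypothetical per-step throughput in samples/second. The dip models a step
# that stalled while waiting for a shard download.
step_throughput = [100.0] * 5 + [20.0] + [100.0] * 6
smoothed = list(rolling_average(step_throughput, window=10))
```

A rolling window keeps a single slow step from dominating the graph while still making sustained slowdowns visible.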

### Network Downloads
Cumulative network downloads are also estimated for the run and displayed. They are calculated in conjunction with throughput. If shards are compressed, we assume they are downloaded in compressed form and immediately decompressed.

<img src="../_static/images/downloads.png" alt="Downloads Graph" width="500"/>
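
The compression assumption above can be sketched as follows. This is an illustrative example only: `cumulative_download_bytes` and its parameters are hypothetical names, and it assumes every shard has the same size, which mirrors the average shard sizes the simulator takes as input.

```python
from itertools import accumulate
from typing import List, Optional, Sequence


def cumulative_download_bytes(shards_per_step: Sequence[int], raw_shard_bytes: int,
                              zip_shard_bytes: Optional[int] = None) -> List[int]:
    """Return the running total of bytes downloaded after each step."""
    # Compressed shards cross the network in compressed form, so their
    # compressed size is what counts toward network downloads.
    bytes_per_shard = zip_shard_bytes if zip_shard_bytes is not None else raw_shard_bytes
    return list(accumulate(n * bytes_per_shard for n in shards_per_step))


# Hypothetical trace: number of shards fetched at each step.
shards_fetched = [2, 0, 1, 3]
compressed_total = cumulative_download_bytes(shards_fetched, raw_shard_bytes=64, zip_shard_bytes=16)
uncompressed_total = cumulative_download_bytes(shards_fetched, raw_shard_bytes=64)
```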

### Simulation Stats
We also provide various useful statistics from the simulation, such as:
- Minimum cache limit (i.e., maximum space used by live shards)
- Steps slowed down by shard downloads
- Estimated time to first batch
- Estimated warmup time (i.e., time until throughput is maximized)

<img src="../_static/images/stats.png" alt="Simulation Stats" width="500"/>
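
To make the first two statistics concrete, here is a small sketch that derives them from a per-step trace. The `StepRecord` type, its fields, and `summarize_steps` are hypothetical and are not taken from the simulator's code; the sketch only assumes that the minimum cache limit is the peak space used by live shards, as stated in the list above.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class StepRecord:
    live_shard_bytes: int    # Bytes occupied by shards present locally at this step.
    download_wait_s: float   # Seconds this step spent blocked on shard downloads.


def summarize_steps(records: Sequence[StepRecord]) -> Dict[str, int]:
    return {
        # A cache limit below the peak live-shard usage would force evictions.
        'min_cache_limit_bytes': max(r.live_shard_bytes for r in records),
        # Any step that waited on a download counts as slowed down.
        'steps_slowed_by_downloads': sum(1 for r in records if r.download_wait_s > 0),
    }


trace = [
    StepRecord(live_shard_bytes=2_000, download_wait_s=1.5),
    StepRecord(live_shard_bytes=6_000, download_wait_s=0.0),
    StepRecord(live_shard_bytes=4_000, download_wait_s=0.2),
]
stats = summarize_steps(trace)
```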

### Shuffle Quality
You can choose to evaluate the quality of different shuffling algorithms for your run. We provide an estimate of shuffle quality based on the entropy calculated over the probability distribution of differences between neighboring sample indices and shard indices of the dataset. *These shuffle quality metrics are noisy and may not reflect the true strength of a shuffle.*

<img src="../_static/images/shuffle_quality_toggle.png" alt="Shuffle Quality Toggle" width="300"/>

<img src="../_static/images/shuffle_quality_graph.png" alt="Shuffle Quality Graph" width="500"/>
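
One way to read the entropy-based estimate described above is sketched here. This is not the simulator's implementation: `neighbor_diff_entropy` is a hypothetical helper that computes the Shannon entropy of the differences between adjacent sample indices, and the same function could be applied to a sequence of shard indices.

```python
import math
import random
from collections import Counter
from typing import Sequence


def neighbor_diff_entropy(indices: Sequence[int]) -> float:
    """Shannon entropy (in bits) of the differences between adjacent entries."""
    diffs = [b - a for a, b in zip(indices, indices[1:])]
    if not diffs:
        return 0.0
    counts = Counter(diffs)
    total = len(diffs)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


ordered = list(range(1000))
shuffled = ordered.copy()
random.Random(0).shuffle(shuffled)  # Fixed seed so the example is repeatable.

unshuffled_entropy = neighbor_diff_entropy(ordered)  # Every difference is 1.
shuffled_entropy = neighbor_diff_entropy(shuffled)
```

An unshuffled ordering has a single difference value and therefore zero entropy, while a well-mixed ordering spreads the differences over many values and scores higher.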

### Yaml Support
Yaml files that follow MosaicML conventions can be uploaded and simulated as well. Simply click the toggle, enter any needed additional information, and see your results. Parameters can also be modified to quickly test out configurations.

<img src="../_static/images/yaml_toggle.png" alt="Yaml Toggle" width="300"/>
1 change: 1 addition & 0 deletions docs/source/index.md
@@ -60,6 +60,7 @@ If you have any questions, please feel free to reach out to us on [Twitter](htt
fundamentals/shuffling.md
fundamentals/sampling.md
fundamentals/batching.md
fundamentals/simulator.md

.. toctree::
:hidden:
13 changes: 13 additions & 0 deletions setup.py
@@ -93,6 +93,16 @@
    'sphinx-tabs==3.4.1',
]

extra_deps['simulator'] = [
    'sortedcollections>=2.1.0,<3',
    'streamlit>=1.26.0,<2',
    'altair>=5.1.1,<6',
    'omegaconf>=2.3.0,<3',
    'PyYAML>=6.0,<7',
    'pandas>=2.0.3,<3',
    'wandb>=0.15.5,<1',
]

extra_deps['spark'] = [
    'pyspark>=3,<4',
]
@@ -123,6 +133,9 @@
        'streaming': ['py.typed'],
    },
    packages=setuptools.find_packages(exclude=['tests*']),
    entry_points={
        'console_scripts': ['simulator = simulation.launcher:launch_simulation_ui',],
    },
    classifiers=classifiers,
    install_requires=install_requires,
    extras_require=extra_deps,
48 changes: 48 additions & 0 deletions simulation/README.md
@@ -0,0 +1,48 @@
# 🤖 Streaming Simulator
A simulator for throughput, network use, and shuffle quality with MosaicML Streaming. The simulator allows you to:
- Plan runs and anticipate issues beforehand
- Find optimal run configurations
- Debug issues with underperforming runs
- Better understand the impact of different configurations

## 🚀 Getting Started
Run the following to install the simulator-specific dependencies, if they are not already installed:
```
pip install --upgrade "mosaicml-streaming[simulator]"
```
Then, simply run `simulator` in your command line to open the Web UI and get simulating!

## 🔑 Key Features

### Throughput
Throughput is estimated for the duration of the run and displayed as the simulation progresses. We estimate it by iterating over the samples of the dataset in order and simulating shard downloads based on an estimate of network bandwidth. The graph shows the 10-step rolling average.

![Throughput Graph](../docs/source/_static/images/throughput.png)

### Network Downloads
Cumulative network downloads are also estimated for the run and displayed. They are calculated in conjunction with throughput. If shards are compressed, we assume they are downloaded in compressed form and immediately decompressed.

![Downloads Graph](../docs/source/_static/images/downloads.png)

### Simulation Stats
We also provide various useful statistics from the simulation, such as:
- Minimum cache limit (i.e., maximum space used by live shards)
- Steps slowed down by shard downloads
- Estimated time to first batch
- Estimated warmup time (i.e., time until throughput is maximized)

![Simulation Stats](../docs/source/_static/images/stats.png)

### Shuffle Quality
You can choose to evaluate the quality of different shuffling algorithms for your run. We provide an estimate of shuffle quality based on the entropy calculated over the probability distribution of differences between neighboring sample indices and shard indices of the dataset. *These shuffle quality metrics are noisy and may not reflect the true strength of a shuffle.*

![Shuffle Quality Toggle](../docs/source/_static/images/shuffle_quality_toggle.png)

![Shuffle Quality Graph](../docs/source/_static/images/shuffle_quality_graph.png)

### Yaml Support
Yaml files that follow MosaicML conventions can be uploaded and simulated as well. Simply click the toggle, enter any needed additional information, and see your results. Parameters can also be modified to quickly test out configurations.

![Yaml Toggle](../docs/source/_static/images/yaml_toggle.png)

## 💬 Contact
If you have problems, questions, or suggestions, please reach out to the MosaicML team on our [community Slack channel](https://mosaicml.me/slack).
4 changes: 4 additions & 0 deletions simulation/__init__.py
@@ -0,0 +1,4 @@
# Copyright 2023 MosaicML Streaming authors
# SPDX-License-Identifier: Apache-2.0

"""Streaming simulation for throughput, network downloads, and shuffle quality."""
86 changes: 86 additions & 0 deletions simulation/core/create_index.py
@@ -0,0 +1,86 @@
# Copyright 2023 MosaicML Streaming authors
# SPDX-License-Identifier: Apache-2.0

"""Create a dataset index file from input parameters."""

import json
import logging
import os
import random
import string
from typing import Optional

from streaming.base.format import get_index_basename

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)


def get_random_foldername() -> str:
    """Generate random folder name to store the index file in.

    Returns:
        str: random alphanumeric folder name.
    """
    return ''.join(
        random.choice(string.ascii_uppercase + string.ascii_lowercase + string.digits)
        for _ in range(16))


def create_stream_index(shards: int, samples_per_shard: int, avg_raw_shard_size: int,
                        avg_zip_shard_size: Optional[int]) -> str:
    """Create dataset index file from input parameters.

    Args:
        shards (int): Number of shards.
        samples_per_shard (int): Number of samples per shard.
        avg_raw_shard_size (int): Average raw shard size.
        avg_zip_shard_size (int): Average compressed shard size.

    Returns:
        local path to created index file for stream.
    """
    index_data = {'version': 2, 'shards': []}

    shards_list = []
    for _ in range(shards):
        shard_data = {
            'column_encodings': [],
            'column_names': [],
            'column_sizes': [],
            'format': 'mds',
            'raw_data': {
                'basename': '',
                'bytes': avg_raw_shard_size,
                'hashes': {}
            },
            'hashes': [],
            'samples': samples_per_shard,
            'size_limit': avg_raw_shard_size,
            'version': 2,
            'zip_data': None,
            'compression': None
        }
        if avg_zip_shard_size is not None:
            shard_data['zip_data'] = {'basename': '', 'bytes': avg_zip_shard_size, 'hashes': {}}
            shard_data['compression'] = ''
        shards_list.append(shard_data)

    index_data['shards'] = shards_list

    # Try making the directory for the stream's index.json file
    foldername = get_random_foldername() + '_indexcreated'
    try:
        os.mkdir(foldername)
    except FileExistsError:
        logger.warning(' Folder already exists, trying again...')
        foldername = get_random_foldername()
        os.mkdir(foldername)

    index_basename = get_index_basename()
    index_path = os.path.join(foldername, index_basename)

    with open(index_path, 'w') as f:
        json.dump(index_data, f)

    return index_path
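
For readers who want to see what the generated index looks like to a consumer, here is a small standalone sketch that totals the sizes in an index of the shape built above. `summarize_index` is a hypothetical helper, not part of this PR, and the example index is hand-written and trimmed to the fields the helper reads.

```python
import json
from typing import Any, Dict


def summarize_index(index: Dict[str, Any]) -> Dict[str, int]:
    """Total the shard count, sample count, and byte sizes recorded in an index."""
    shards = index['shards']
    return {
        'shards': len(shards),
        'samples': sum(s['samples'] for s in shards),
        'raw_bytes': sum(s['raw_data']['bytes'] for s in shards),
        # Uncompressed shards have ``zip_data`` set to ``None``, so fall back to the raw size.
        'download_bytes': sum((s['zip_data'] or s['raw_data'])['bytes'] for s in shards),
    }


# Hand-written index with the same shape as the one created above,
# trimmed to the fields that ``summarize_index`` uses.
example_index = {
    'version': 2,
    'shards': [{
        'samples': 100,
        'raw_data': {'basename': '', 'bytes': 4096, 'hashes': {}},
        'zip_data': {'basename': '', 'bytes': 1024, 'hashes': {}},
    } for _ in range(3)],
}

# Round-trip through JSON to mimic reading the index file back from disk.
summary = summarize_index(json.loads(json.dumps(example_index)))
```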
37 changes: 37 additions & 0 deletions simulation/core/last_used_ordered_set.py
@@ -0,0 +1,37 @@
# Copyright 2023 MosaicML Streaming authors
# SPDX-License-Identifier: Apache-2.0

"""An ordered set that can be used as an LRU cache."""

from collections import OrderedDict
from typing import Any


class LastUsedOrderedSet(OrderedDict):
    """An ordered dict that can be used as an LRU cache.

    This is a subclass of OrderedDict, with some LRU-specific functions and all values as ``None``.
    """

    def setitem(self, key: Any, move_to_end: bool = True):
        """Set/add an item.

        Args:
            key (Any): key to be added.
            move_to_end (bool, optional): whether to move the item to the end, signifying most
                recent access. Defaults to ``True``.
        """
        super().__setitem__(key, None)
        self.move_to_end(key, last=move_to_end)

    def popLRU(self):
        """Pop the least recently used item (located at the front)."""
        return self.popitem(last=False)[0]

    def setuse(self, key: Any):
        """Mark an item as used, moving it to the end.

        Args:
            key (Any): key of element to move to the end, signifying most recent access.
        """
        self.setitem(key)
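
A short usage sketch may help show how the three methods interact. The class is repeated in compact form only so the example runs on its own; the shard IDs and the eviction scenario are hypothetical and are not taken from the simulator.

```python
from collections import OrderedDict
from typing import Any


class LastUsedOrderedSet(OrderedDict):
    """Compact copy of the class above so this example is self-contained."""

    def setitem(self, key: Any, move_to_end: bool = True):
        super().__setitem__(key, None)
        self.move_to_end(key, last=move_to_end)

    def popLRU(self):
        return self.popitem(last=False)[0]

    def setuse(self, key: Any):
        self.setitem(key)


cache = LastUsedOrderedSet()
for shard_id in (0, 1, 2):
    cache.setitem(shard_id)  # Order is now 0, 1, 2 (least to most recent).

cache.setuse(0)              # Shard 0 was just accessed, so it moves to the end.
evicted = cache.popLRU()     # The front of the ordering is the eviction candidate.
remaining = list(cache)      # Keys from least to most recently used.
```

Because the front of the ordering always holds the least recently used key, eviction is a single `popitem(last=False)` call rather than a scan.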