Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set epoch_seed_change attribute on SimulationDataset #840

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

srstevenson
Copy link
Contributor

@srstevenson srstevenson commented Dec 3, 2024

Set epoch_seed_change attribute on SimulationDataset

This was added to the StreamingDataset which the SimulationDataset inherits, so also needed to be added here. Without this, the code attempts to access the missing attribute when running a simulation:

AttributeError: 'SimulationDataset' object has no attribute 'epoch_seed_change'
Traceback:
File "/home/scott/projects/streaming/.venv/lib64/python3.12/site-packages/streamlit/runtime/scriptrunner/exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
             ^^^^^^
File "/home/scott/projects/streaming/.venv/lib64/python3.12/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 579, in code_to_exec
    exec(code, module.__dict__)
File "/home/scott/projects/streaming/simulation/interfaces/sim_ui.py", line 409, in <module>
    submit_jobs(shuffle_quality, dataset, time_per_sample, node_internet_bandwidth,
File "/home/scott/projects/streaming/simulation/interfaces/sim_ui.py", line 110, in submit_jobs
    for output in gen_sim:
                  ^^^^^^^
File "/home/scott/projects/streaming/simulation/core/main.py", line 110, in simulate
    samples_per_node = dataset.get_samples_per_node(epoch, 0)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/scott/projects/streaming/simulation/core/sim_dataset.py", line 367, in get_samples_per_node
    partition = generate_work(self.batching_method, self, self.world, epoch, sample_in_epoch)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/scott/projects/streaming/streaming/base/batching/__init__.py", line 45, in generate_work
    return get(dataset, world, epoch, sample_in_epoch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/scott/projects/streaming/streaming/base/batching/random.py", line 49, in generate_work_random_batching
    shuffle_units, small_per_big = dataset.resample_streams(epoch)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/scott/projects/streaming/streaming/base/dataset.py", line 878, in resample_streams
    epoch, self.epoch_seed_change)
           ^^^^^^^^^^^^^^^^^^^^^^

Closes #831

@srstevenson srstevenson force-pushed the srs/add-missing-call-to-parent branch from c78a1c7 to 64cd184 Compare December 9, 2024 22:07
@snarayan21
Copy link
Collaborator

Hey @srstevenson, we actually don't want to call StreamingDataset's constructor here -- SimulationDataset is meant to run on single process unlike StreamingDataset so there's some logic that's different between the two init methods. The epoch_seed_change attribute not being present is because when that attribute was added to StreamingDataset it was not also added to SimulationDataset. Admittedly, this isn't the best design pattern. But yeah the immediate solution would be to copy over that logic from StreamingDataset to SimulationDataset.

@srstevenson srstevenson marked this pull request as draft December 10, 2024 21:59
@srstevenson
Copy link
Contributor Author

Thanks, @snarayan21, that makes sense. I've marked this PR as draft for now, and will update it to set epoch_seed_change without calling the parent constructor.

This was added to the `StreamingDataset` which the `SimulationDataset`
inherits, so also needed to be added here. Without this, the code
attempts to access the missing attribute when running a simulation:

```
AttributeError: 'SimulationDataset' object has no attribute 'epoch_seed_change'
Traceback:
File "/home/scott/projects/streaming/.venv/lib64/python3.12/site-packages/streamlit/runtime/scriptrunner/exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
             ^^^^^^
File "/home/scott/projects/streaming/.venv/lib64/python3.12/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 579, in code_to_exec
    exec(code, module.__dict__)
File "/home/scott/projects/streaming/simulation/interfaces/sim_ui.py", line 409, in <module>
    submit_jobs(shuffle_quality, dataset, time_per_sample, node_internet_bandwidth,
File "/home/scott/projects/streaming/simulation/interfaces/sim_ui.py", line 110, in submit_jobs
    for output in gen_sim:
                  ^^^^^^^
File "/home/scott/projects/streaming/simulation/core/main.py", line 110, in simulate
    samples_per_node = dataset.get_samples_per_node(epoch, 0)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/scott/projects/streaming/simulation/core/sim_dataset.py", line 367, in get_samples_per_node
    partition = generate_work(self.batching_method, self, self.world, epoch, sample_in_epoch)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/scott/projects/streaming/streaming/base/batching/__init__.py", line 45, in generate_work
    return get(dataset, world, epoch, sample_in_epoch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/scott/projects/streaming/streaming/base/batching/random.py", line 49, in generate_work_random_batching
    shuffle_units, small_per_big = dataset.resample_streams(epoch)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/scott/projects/streaming/streaming/base/dataset.py", line 878, in resample_streams
    epoch, self.epoch_seed_change)
           ^^^^^^^^^^^^^^^^^^^^^^
```

Closes mosaicml#831
@srstevenson srstevenson force-pushed the srs/add-missing-call-to-parent branch from 64cd184 to 148157b Compare December 16, 2024 18:38
@srstevenson srstevenson changed the title Add missing call to parent constructor in SimulationDataset Set epoch_seed_change attribute on SimulationDataset Dec 16, 2024
@srstevenson srstevenson marked this pull request as ready for review December 16, 2024 18:39
@srstevenson
Copy link
Contributor Author

@snarayan21 I've updated this to just set epoch_seed_change with the same logic as in the parent class, PTAL.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Streaming Simulator errors with default parameters
2 participants