Unified dataset specification #111
Conversation
The branch was force-pushed from ec7857a to 6541b7a.
Here are some comments. I didn't get to review the full PR.
```diff
 self._dataset = dataset
 # Only required when using epochs when training dataset.
-self._estimated_length = estimated_length
+self._length = num_samples
```
Why the name change? `estimated_length` made it clear that this is just an estimation, not necessarily the true value.
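For readers without the diff context, here is a minimal sketch of why an estimated length matters for epoch-based training on a lazy stream. The class and method names are illustrative only, not taken from the PR:

```python
from typing import Iterator, Optional

from torch.utils import data


class EstimatedLengthDataset(data.IterableDataset):
    """Illustrative wrapper: exposes __len__ so epoch-based training can derive
    steps per epoch, even though the underlying stream is lazy."""

    def __init__(self, dataset: data.IterableDataset, estimated_length: Optional[int] = None):
        self._dataset = dataset
        # Only an estimate: streaming sources may not report an exact row count.
        self._estimated_length = estimated_length

    def __iter__(self) -> Iterator:
        return iter(self._dataset)

    def __len__(self) -> int:
        if self._estimated_length is None:
            raise TypeError("Length is unknown for this dataset")
        return self._estimated_length
```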
```python
raise ValueError(
    f"Sample is None in dataset {self._config.alias} for row {row}"
)
```
Why raise?
This was supposed to be a mechanism for filtering samples.
```python
    raise ValueError(
        f"Sample is None in dataset {self._config.alias} for row {row}"
    )
else:
```
nit: Why use an `else`? You're raising an error anyway. The level of nesting in this function is a bit high.
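One way to address both the `else`-after-`raise` nit and the nesting, while keeping the filter-style behavior discussed above, is guard clauses. This is only a rough sketch of the method body, reusing the attribute names visible in the diff (`self._dataset`, `self._config.alias`, `self._args.max_audio_duration_secs`, `SAMPLE_RATE`); `logger` and `_load_sample` are assumed helpers, not the PR's actual names:

```python
def __iter__(self):
    for row in self._dataset:
        sample = self._load_sample(row)  # assumed helper that builds a sample from a row

        # Guard clauses keep the happy path at a single indentation level.
        if sample is None:
            logger.warning("Sample is None in dataset %s for row %s", self._config.alias, row)
            continue  # filter the row rather than raise

        duration_secs = sample.audio.shape[-1] / SAMPLE_RATE
        if duration_secs > self._args.max_audio_duration_secs:
            logger.warning(
                "Audio length (%.1fs) exceeds max audio duration (%ss) in dataset %s, skipping sample.",
                duration_secs, self._args.max_audio_duration_secs, self._config.alias,
            )
            continue  # filter rather than warn-and-yield

        yield sample
```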
```python
warnings.warn(
    f"Audio length ({sample.audio.shape[-1] / SAMPLE_RATE}s) exceeds max audio duration ({self._args.max_audio_duration_secs}s) in dataset {self._config.alias}, skipping sample."
)
```
Why did this turn into a warning? It used to be a filter.
```python
        else:
            yield sample
    else:
        yield sample
```
nit: Seems like you want to `yield sample` either way, so you can simplify it:

```diff
-        else:
-            yield sample
-    else:
-        yield sample
+    yield sample
```
```yaml
max_steps: 20 # x8x24 = 2,764,800

train_dataset_args:
```
Why are datasets being rewritten here? Does the llama_whisper combination require a specific set of train/val/eval datasets that other architectures cannot use?
```yaml
report_logs_to: ["tensorboard", "wandb"]

# include speech translation (zh-en, en-zh) and transcription (peoples_speech) for validation, with and without audio
val_dataset_args:
```
Same issue here. What's the point of so much copy & pasting?
```yaml
max_steps: 20 # x8x24 = 2,764,800

train_dataset_args:
```
Same
```python
def sharded_batch_iterator(
    ds: data.IterableDataset, batch_size: int, num_shards: int, shard_index: int
):
    batch = []
    for idx, sample in enumerate(ds):
        if idx % num_shards == shard_index:
            batch.append((idx, sample))
            if len(batch) == batch_size:
                yield batch
                batch = []
    # Yield any remaining samples in the last incomplete batch
    if batch:
        yield batch
```
Why copy & paste this? The same functionality can be used when `batch_size` equals 1. Also, I believe the only reference to `sharded_iterator` is already removed.
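To illustrate the point about `batch_size` equal to 1: a per-sample sharded iterator can be expressed directly in terms of the `sharded_batch_iterator` shown above, so the two need not be maintained separately. A sketch, not the PR's code:

```python
def sharded_iterator(ds: data.IterableDataset, num_shards: int, shard_index: int):
    # With batch_size=1, each yielded batch contains exactly one (idx, sample) tuple.
    for batch in sharded_batch_iterator(ds, batch_size=1, num_shards=num_shards, shard_index=shard_index):
        yield batch[0]
```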
```python
if model_id is None:
    continue
```
What's the use case for making this optional?
Hi @zqhuang211, there are a lot of interesting ideas in this PR, but it has far too much surface area. The key focus here should be on the data/ changes for the generic dataset approach, and we should handle any other changes (e.g., the new eval approach) in separate PRs.

For the data/ classes, I agree with Farzad that the configs have a lot of duplication, and errors can easily creep in as a result. To me, the benefit of this new approach seems to be that you could add a new dataset entirely via config, with no code changes, but I don't think we want to have to write more config code than the existing amount of Python code. (My prior here is that having to manually handle each language in Covost via config is more complicated than the current setup.) This also suggests that starting with a smaller PR will be better, as we can more easily discuss the relative merits of the approaches when the amount of changes is smaller.
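As a purely hypothetical illustration of the Covost point: when the language pairs are expanded in Python, the per-language boilerplate lives in one small loop, whereas a config-only approach would repeat a near-identical YAML block for every pair. Nothing below is taken from the ultravox codebase; the names and fields are invented:

```python
# Hypothetical: expand CoVoST language pairs in Python instead of duplicating YAML blocks.
COVOST_PAIRS = ["en_de", "en_zh-CN", "zh-CN_en", "es_en"]  # illustrative subset


def covost_dataset_configs(base: dict) -> list:
    configs = []
    for pair in COVOST_PAIRS:
        cfg = dict(base)
        cfg["alias"] = f"covost2-{pair}"
        cfg["subset"] = pair
        configs.append(cfg)
    return configs
```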
Yes, it touches many areas because the dataset specification is used throughout different parts of the code. Changes to the dataset specification break other parts that rely on the existing methods. The previous GenericDatasetClass is a trade-off that attempts to add a new dataset class that fits within the existing code, but it definitely has its limits. We may need a more comprehensive process to decide, as a group, what we want to achieve and how to get there.
I agree that some of the duplication in the configs can be removed/reduced (e.g., certain prompts), and these concerns can be addressed. The config approach supports adding new datasets without adding new code, but this is not the main reason to advocate for the use of configs. I agree that for this use case, there is not a significant difference between using Python code or config, because it would involve a similar amount of work. The main reason for using configs is the flexibility to change all aspects of datasets without changing the Python code, e.g.,
The current approach already exposes some options to the config, e.g., which existing datasets to use and some global settings for how to use them, but it doesn't allow fine control of individual datasets. I think if we want to make it flexible (easier, and less likely to break Python code) to run experiments, we want to expose these controls to the config, so that the same Python code can be reused and continuously improved (and can be made more robust by including additional validation and safeguards). We could also fold some of these configs into defaults, like those in meta_config.yaml. I think it comes down to what we want to change in order to run new validations or experiments: Python code or YAML config. My preference is to minimize changes to Python code, if the changes can be made in a well-structured config file. In the end, the use of datasets should be configurable, similar to how model parameters and training specifications are configurable.
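A concrete, hypothetical sketch of what "fine control of individual datasets" could look like as a schema. None of the field names below are taken from the PR; they only illustrate the idea of per-dataset knobs with defaults and selective overrides:

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass
class DatasetOptions:
    # Hypothetical per-dataset knobs; the real schema would live in the PR's data/ classes.
    alias: str
    path: str
    split: str = "train"
    user_template: str = "<|audio|>"
    assistant_template: str = "{{text}}"
    weight: float = 1.0
    max_audio_duration_secs: Optional[float] = None


# Fine control per dataset: each config entry overrides only what it needs.
default_asr = DatasetOptions(alias="example-asr", path="example/asr-corpus")
short_audio_only = dataclasses.replace(default_asr, max_audio_duration_secs=16.0, weight=0.5)
```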
I want to add that this is not a Python code vs. YAML config problem. The real issue (let's ignore the global train/eval/val configs for now) is Python code + dataset-name-only config vs. Python code + YAML config. The use of a YAML config is simply more powerful: it can support different levels of control with a unified dataset specification schema, and it forces us to think modularly about dataset design. Let's consider two extremes. In one extreme, the details of all datasets are hardcoded in Python code, so referencing a dataset by name alone is equivalent to spelling out its full specification. In the other extreme, a new dataset can be specified entirely in the YAML config, following the dataset specification schema, without writing any new lines of code. The Python code + YAML config approach allows us to use the whole spectrum between these two extremes, depending on the stage of model development and the specifics of each dataset (a rough sketch of the two extremes follows below).
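A hypothetical sketch of these two extremes; the dataset names, paths, and fields are illustrative, not from the PR. A bare name resolves to details hardcoded in Python, while an inline specification carries every detail in config, and the same loader accepts anything along that spectrum:

```python
# Extreme 1: a bare name is enough because the details are hardcoded in Python.
REGISTERED_DATASETS = {
    "peoples_speech": {
        "path": "MLCommons/peoples_speech",
        "split": "train",
        "user_template": "<|audio|>",
    },
}


def resolve_dataset_spec(spec):
    """Accept either a registered name (extreme 1) or a full inline spec (extreme 2)."""
    if isinstance(spec, str):
        return dict(REGISTERED_DATASETS[spec])
    return dict(spec)


# Both calls yield a complete specification; partial overrides could sit anywhere in between.
resolve_dataset_spec("peoples_speech")
resolve_dataset_spec({"path": "MLCommons/peoples_speech", "split": "train", "user_template": "<|audio|>"})
```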
As we go through different stages of model development, if a dataset deserves special treatment, it can be moved from being entirely configurable in YAML to being partially or entirely hardcoded in Python, following this unified approach. Similarly, an entirely hardcoded dataset can become partially or entirely configurable in YAML if we want to configure its attributes or reuse its logic to support other datasets. Of course, we can include and update the default validation/evaluation datasets in the meta_config.yaml. I can think of two downsides to this approach:
This PR unifies the dataset configuration and training/validation/evaluation specifications for datasets.
The datasets used for training, validation, and evaluation are configured in the `train_dataset_configs`, `val_dataset_configs`, and `eval_dataset_configs` fields of a config YAML file, with global settings specified in `train_dataset_args`, `val_dataset_args`, and `eval_dataset_args`. Most datasets (including those used in the v0.4 release) can be configured using the GenericVoiceDataset class, while legacy dataset classes can be configured similarly, with the default Huggingface/MDS paths hardcoded. This approach offers several benefits:

Additionally, the evaluation code has been updated to support batch inference. It is now integrated into the training workflow (enabled by default via `eval_dataset_configs` in the training config YAML) and supports standalone evaluation (see `mcloud_eval.yaml`). `ultravox/tools/infer_tool.py` is temporarily removed due to significant differences in how datasets are specified; it can be added back once we settle on the dataset specification mechanism.
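To make the top-level layout above concrete, here is a hedged sketch of what such a training config could look like. Only the six `*_dataset_configs` / `*_dataset_args` field names come from this description; every dataset entry and value is invented for illustration:

```python
import textwrap

import yaml  # PyYAML

# Hypothetical training config in the shape the PR description outlines.
example_training_config = yaml.safe_load(
    textwrap.dedent(
        """
        train_dataset_args:        # global settings applied to all training datasets
          shuffle: true
        train_dataset_configs:     # one entry per training dataset
          - alias: librispeech-clean
            weight: 1.0
        val_dataset_args:
          max_samples: 64
        val_dataset_configs:
          - alias: peoples_speech
        eval_dataset_args:
          batch_size: 8
        eval_dataset_configs:      # evaluation during training is enabled via this field
          - alias: covost2-en-zh
        """
    )
)
assert "train_dataset_configs" in example_training_config
```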