Unified dataset specification #111
Conversation
The branch was force-pushed from ec7857a to 6541b7a.
Here are some comments. I didn't get to review the full PR.
```diff
 self._dataset = dataset
 # Only required when using epochs when training dataset.
-self._estimated_length = estimated_length
+self._length = num_samples
```
Why the name change? `estimated_length` made it clear that this is just an estimation, not necessarily the true value.
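For readers without the diff context, here is a minimal sketch of why an estimated length matters for epoch-based training on a lazy stream. The class and method names are illustrative only, not taken from the PR:

```python
from typing import Iterator, Optional

from torch.utils import data


class EstimatedLengthDataset(data.IterableDataset):
    """Illustrative wrapper: exposes __len__ so epoch-based training can derive
    steps per epoch, even though the underlying stream is lazy."""

    def __init__(self, dataset: data.IterableDataset, estimated_length: Optional[int] = None):
        self._dataset = dataset
        # Only an estimate: streaming sources may not report an exact row count.
        self._estimated_length = estimated_length

    def __iter__(self) -> Iterator:
        return iter(self._dataset)

    def __len__(self) -> int:
        if self._estimated_length is None:
            raise TypeError("Length is unknown for this dataset")
        return self._estimated_length
```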
```python
raise ValueError(
    f"Sample is None in dataset {self._config.alias} for row {row}"
)
```
Why raise?
This was supposed to be a mechanism for filtering samples.
```python
    raise ValueError(
        f"Sample is None in dataset {self._config.alias} for row {row}"
    )
else:
```
nit: Why use an `else`? You're raising an error anyway. The level of nesting in this function is a bit high.
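One way to address both the `else`-after-`raise` nit and the nesting, while keeping the filter-style behavior discussed above, is guard clauses. This is only a rough sketch of the method body, reusing the attribute names visible in the diff (`self._dataset`, `self._config.alias`, `self._args.max_audio_duration_secs`, `SAMPLE_RATE`); `logger` and `_load_sample` are assumed helpers, not the PR's actual names:

```python
def __iter__(self):
    for row in self._dataset:
        sample = self._load_sample(row)  # assumed helper that builds a sample from a row

        # Guard clauses keep the happy path at a single indentation level.
        if sample is None:
            logger.warning("Sample is None in dataset %s for row %s", self._config.alias, row)
            continue  # filter the row rather than raise

        duration_secs = sample.audio.shape[-1] / SAMPLE_RATE
        if duration_secs > self._args.max_audio_duration_secs:
            logger.warning(
                "Audio length (%.1fs) exceeds max audio duration (%ss) in dataset %s, skipping sample.",
                duration_secs, self._args.max_audio_duration_secs, self._config.alias,
            )
            continue  # filter rather than warn-and-yield

        yield sample
```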
```python
warnings.warn(
    f"Audio length ({sample.audio.shape[-1] / SAMPLE_RATE}s) exceeds max audio duration ({self._args.max_audio_duration_secs}s) in dataset {self._config.alias}, skipping sample."
)
```
Why did this turn into a warning? It used to be a filter.
```python
        else:
            yield sample
    else:
        yield sample
```
nit: Seems like you want to `yield sample` either way, so you can simplify it:

```diff
-        else:
-            yield sample
-    else:
-        yield sample
+    yield sample
```
```yaml
max_steps: 20 # x8x24 = 2,764,800

train_dataset_args:
```
Why are datasets being rewritten here? Does the llama_whisper combination require a specific set of train/val/eval datasets that other architectures cannot use?
```yaml
report_logs_to: ["tensorboard", "wandb"]

# include speech translation (zh-en, en-zh) and transcription (peoples_speech) for validation, with and without audio
val_dataset_args:
```
Same issue here. What's the point of so much copy & pasting?
```yaml
max_steps: 20 # x8x24 = 2,764,800

train_dataset_args:
```
Same
```python
def sharded_batch_iterator(
    ds: data.IterableDataset, batch_size: int, num_shards: int, shard_index: int
):
    batch = []
    for idx, sample in enumerate(ds):
        if idx % num_shards == shard_index:
            batch.append((idx, sample))
            if len(batch) == batch_size:
                yield batch
                batch = []
    # Yield any remaining samples in the last incomplete batch
    if batch:
        yield batch
```
Why copy & paste this? The same functionality can be used when `batch_size` equals 1. Also, I believe the only reference to `sharded_iterator` is already removed.
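To illustrate the point about `batch_size` equal to 1: a per-sample sharded iterator can be expressed directly in terms of the `sharded_batch_iterator` shown above, so the two need not be maintained separately. A sketch, not the PR's code:

```python
def sharded_iterator(ds: data.IterableDataset, num_shards: int, shard_index: int):
    # With batch_size=1, each yielded batch contains exactly one (idx, sample) tuple.
    for batch in sharded_batch_iterator(ds, batch_size=1, num_shards=num_shards, shard_index=shard_index):
        yield batch[0]
```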
```python
if model_id is None:
    continue
```
What's the use case for making this optional?
Hi @zqhuang211, there are a lot of interesting ideas in this PR, but it has far too much surface area. The key focus here should be on the data/ changes for the generic dataset approach, and we should handle any other changes (e.g., the new eval approach) in separate PRs.

For the data/ classes, I agree with Farzad that the configs have a lot of duplication, and errors can easily creep in as a result. To me, the benefit of this new approach seems to be that you could add a new dataset entirely via config, with no code changes, but I don't think we want to have to write more config code than the existing amount of Python code. (My prior here is that having to manually handle each language in Covost via config is more complicated than the current setup.) This also suggests that starting with a smaller PR will be better, as we can more easily discuss the relative merits of the approaches when the amount of changes is smaller.
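As a purely hypothetical illustration of the Covost point: when the language pairs are expanded in Python, the per-language boilerplate lives in one small loop, whereas a config-only approach would repeat a near-identical YAML block for every pair. Nothing below is taken from the ultravox codebase; the names and fields are invented:

```python
# Hypothetical: expand CoVoST language pairs in Python instead of duplicating YAML blocks.
COVOST_PAIRS = ["en_de", "en_zh-CN", "zh-CN_en", "es_en"]  # illustrative subset


def covost_dataset_configs(base: dict) -> list:
    configs = []
    for pair in COVOST_PAIRS:
        cfg = dict(base)
        cfg["alias"] = f"covost2-{pair}"
        cfg["subset"] = pair
        configs.append(cfg)
    return configs
```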
Yes, it touches many areas because the dataset specification is used throughout different parts of the code. Changes to the dataset specification break other parts that rely on the existing methods. The previous GenericDatasetClass is a trade-off that attempts to add a new dataset class that fits within the existing code, but it definitely has its limits. We may need a more comprehensive process to decide, as a group, what we want to achieve and how to get there.
I agree that some of the duplication in the configs can be removed/reduced (e.g., certain prompts), and these concerns can be addressed. The config approach supports adding new datasets without adding new code, but this is not the main reason to advocate for the use of configs. I agree that for this use case, there is not a significant difference between using Python code or config, because it would involve a similar amount of work. The main reason for using configs is the flexibility to change all aspects of datasets without changing the Python code, e.g.,
The current approach already exposes some options to the config, e.g., which existing datasets to use and some global settings for how to use them, but it doesn't allow fine control of individual datasets. I think if we want to make it flexible (easier, and less likely to break Python code) to run experiments, we want to expose these controls to the config, so that the same Python code can be reused and continuously improved (and can be made more robust by including additional validation and safeguards). We could also fold some of these configs into defaults, like those in meta_config.yaml. I think it comes down to what we want to change in order to run new validations or experiments: Python code or YAML config. My preference is to minimize changes to Python code, if the changes can be made in a well-structured config file. In the end, the use of datasets should be configurable, similar to how model parameters and training specifications are configurable.
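A concrete, hypothetical sketch of what "fine control of individual datasets" could look like as a schema. None of the field names below are taken from the PR; they only illustrate the idea of per-dataset knobs with defaults and selective overrides:

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass
class DatasetOptions:
    # Hypothetical per-dataset knobs; the real schema would live in the PR's data/ classes.
    alias: str
    path: str
    split: str = "train"
    user_template: str = "<|audio|>"
    assistant_template: str = "{{text}}"
    weight: float = 1.0
    max_audio_duration_secs: Optional[float] = None


# Fine control per dataset: each config entry overrides only what it needs.
default_asr = DatasetOptions(alias="example-asr", path="example/asr-corpus")
short_audio_only = dataclasses.replace(default_asr, max_audio_duration_secs=16.0, weight=0.5)
```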
I want to add that this is not a Python code vs. YAML config problem. The real issue (let's ignore the global train/eval/val configs for now) is Python code + dataset-name-only config vs. Python code + YAML config. The use of a YAML config is simply more powerful: it can support different levels of control with a unified dataset specification schema, and it forces us to think modularly about dataset design. Let's consider two extremes. In one extreme, the details of all datasets are hardcoded in Python code, so referencing a dataset by name alone is equivalent to spelling out its full specification. In the other extreme, a new dataset can be specified entirely in the YAML config, following the dataset specification schema, without writing any new lines of code. The Python code + YAML config approach allows us to use the whole spectrum between these two extremes, depending on the stage of model development and the specifics of each dataset (a rough sketch of the two extremes follows below).
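A hypothetical sketch of these two extremes; the dataset names, paths, and fields are illustrative, not from the PR. A bare name resolves to details hardcoded in Python, while an inline specification carries every detail in config, and the same loader accepts anything along that spectrum:

```python
# Extreme 1: a bare name is enough because the details are hardcoded in Python.
REGISTERED_DATASETS = {
    "peoples_speech": {
        "path": "MLCommons/peoples_speech",
        "split": "train",
        "user_template": "<|audio|>",
    },
}


def resolve_dataset_spec(spec):
    """Accept either a registered name (extreme 1) or a full inline spec (extreme 2)."""
    if isinstance(spec, str):
        return dict(REGISTERED_DATASETS[spec])
    return dict(spec)


# Both calls yield a complete specification; partial overrides could sit anywhere in between.
resolve_dataset_spec("peoples_speech")
resolve_dataset_spec({"path": "MLCommons/peoples_speech", "split": "train", "user_template": "<|audio|>"})
```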
As we go through different stages of model development, if a dataset deserves special treatment, it can be moved from being entirely configurable in YAML to being partially or entirely hardcoded in Python, following this unified approach. Similarly, an entirely hardcoded dataset can become partially or entirely configurable in YAML if we want to configure its attributes or reuse its logic to support other datasets. Of course, we can include and update the default validation/evaluation datasets in the meta_config.yaml. I can think of two downsides to this approach:
This PR unifies the dataset configuration and training/validation/evaluation specifications for datasets.
The datasets used for training, validation, and evaluation are configured in the `train_dataset_configs`, `val_dataset_configs`, and `eval_dataset_configs` fields of a config YAML file, with global settings specified in `train_dataset_args`, `val_dataset_args`, and `eval_dataset_args`. Most datasets (including those used in the v0.4 release) can be configured using the GenericVoiceDataset class, while legacy dataset classes can be configured similarly, with the default Huggingface/MDS paths hardcoded. This approach offers several benefits:

Additionally, the evaluation code has been updated to support batch inference. It is now integrated into the training workflow (enabled by default via `eval_dataset_configs` in the training config YAML) and supports standalone evaluation (see `mcloud_eval.yaml`). `ultravox/tools/infer_tool.py` is temporarily removed due to significant differences in how datasets are specified; it can be added back once we settle on the dataset specification mechanism.
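To make the top-level layout above concrete, here is a hedged sketch of what such a training config could look like. Only the six `*_dataset_configs` / `*_dataset_args` field names come from this description; every dataset entry and value is invented for illustration:

```python
import textwrap

import yaml  # PyYAML

# Hypothetical training config in the shape the PR description outlines.
example_training_config = yaml.safe_load(
    textwrap.dedent(
        """
        train_dataset_args:        # global settings applied to all training datasets
          shuffle: true
        train_dataset_configs:     # one entry per training dataset
          - alias: librispeech-clean
            weight: 1.0
        val_dataset_args:
          max_samples: 64
        val_dataset_configs:
          - alias: peoples_speech
        eval_dataset_args:
          batch_size: 8
        eval_dataset_configs:      # evaluation during training is enabled via this field
          - alias: covost2-en-zh
        """
    )
)
assert "train_dataset_configs" in example_training_config
```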