Data Mixture Configuration

This directory contains scripts for synthesizing and visualizing data mixtures used in the experiments described in the paper.

synthesize_mixture.py

This script synthesizes data mixtures for training the proxy models (the 1M models in the paper). To generate the data mixtures, run:

python synthesize_mixture.py --num_configs 512 --output_folder /path/to/configs

By default, it generates 512 configurations according to the settings defined in the script and saves them in the config_1m directory.
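For intuition, here is a minimal sketch of what generating a batch of mixture configurations can look like. It is not the logic of synthesize_mixture.py: the symmetric Dirichlet sampling, the function names (sample_mixture, synthesize), the domain names, and the n<i>.yaml labels are all assumptions made for illustration.

```python
import random


def sample_mixture(domains, alpha=1.0, rng=None):
    """Draw one set of mixture weights from a symmetric Dirichlet(alpha).

    Assumption: the real script may use a different sampling scheme.
    """
    rng = rng or random.Random()
    # A Dirichlet sample is a vector of independent Gamma(alpha, 1) draws,
    # normalized so that the weights sum to 1.
    draws = [rng.gammavariate(alpha, 1.0) for _ in domains]
    total = sum(draws)
    return {name: value / total for name, value in zip(domains, draws)}


def synthesize(num_configs, domains, seed=0):
    """Return num_configs mixtures, reproducible for a fixed seed."""
    rng = random.Random(seed)
    return [sample_mixture(domains, rng=rng) for _ in range(num_configs)]


if __name__ == "__main__":
    domains = ["web", "code", "books"]  # hypothetical domain names
    for i, weights in enumerate(synthesize(4, domains), start=1):
        # The n<i>.yaml label is an assumed naming scheme, printed only.
        print(f"n{i}.yaml", {k: round(v, 3) for k, v in weights.items()})
```

Each mixture is a dictionary mapping a domain to its sampling weight, and the weights of one mixture sum to 1. Fixing the seed makes the generated set reproducible across runs.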

visualize_mixture.py

This script visualizes the data mixtures generated for training the proxy models. By default, it reads the configurations stored in the config_1m directory and saves the visualization to weight_distributions.png.

To visualize a different folder, run:

python visualize_mixture.py --config_folder <path_to_config_dir>

Note that the folder must contain YAML files whose names start with n and end with .yaml.
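To check in advance that a folder satisfies this naming rule, a small helper along these lines can list the matching files. This is a sketch, not code from visualize_mixture.py; the function name find_config_files is hypothetical, and the glob pattern n*.yaml is a direct reading of the rule above.

```python
from pathlib import Path


def find_config_files(config_folder):
    """Return the files in config_folder named n*.yaml, sorted by path."""
    folder = Path(config_folder)
    # glob("n*.yaml") keeps only names that start with "n" and end with ".yaml".
    return sorted(p for p in folder.glob("n*.yaml") if p.is_file())


if __name__ == "__main__":
    # "config_1m" is the default directory named in this README.
    for path in find_config_files("config_1m"):
        print(path.name)
```

An empty result means the folder contains no files matching the expected pattern.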

Weight Distribution

The following image illustrates a possible weight distribution for the data mixtures:

[Figure: Weight Distribution]