Skip to content

Latest commit

 

History

History

mixture_config

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 

Data Mixture Configuration

This directory contains scripts for synthesizing and visualizing data mixtures used in the experiments described in the paper.

synthesize_mixture.py

This script is used to synthesize data mixtures for training the proxy models (1M models in the paper). You can use the following command to generate the data mixtures:

python synthesize_mixture.py --num_configs 512 --output_folder /path/to/configs

By default, it generates 512 configurations following the settings specified within the script. The configurations are saved in the config_1m directory.

visualize_mixture.py

This script is used to visualize the data mixtures generated for training the proxy models. By default, it visualizes the configurations stored in the config_1m directory. The visualizations are saved in weight_distributions.png.

If you want to visualize a different folder, you can use the following command:

python visualize_mixture.py --config_folder <path_to_config_dir>

Note that the folder must contain several yaml files which starts from n and ends with .yaml.

Weight Distribution

The following image illustrates a possible weight distribution for the data mixtures:

Weight Distribution