[blog post] [paper] [install] [datasets] [select data]
By: Logan Engstrom, Axel Feldmann, Aleksander Madry
DsDm is a model-aware dataset selection method that can greatly improve downstream model performance. See our paper for more details!
Install the necessary Python packages:
git clone [email protected]:madrylab/DsDm.git
cd DsDm
pip install -r requirements.txt
Below, we give instructions for (a) loading our candidate dataset and (b) selecting data with each studied selection method (DsDm and baselines).
The candidate dataset we select from is available on Hugging Face. It is a tokenized version of the C4 en.noblocklist
split prepared by AllenAI (see Appendix A.1 of our work for more details); each example is 1024 tokens.
To load the dataset and display an example:
from dsdm import selections, utils
# load dataset and tokenizer
# (WARNING: this will download a 400GB dataset)
ds = selections.get_candidate_dataset()
tokenizer = utils.tokenizer_maker()
# display the first example in text form
text = tokenizer.decode(ds[0])
print(text)
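As a quick sanity check, the sketch below reuses ds and tokenizer from the snippet above and assumes, as that snippet does, that indexing the dataset yields a token-id sequence; it confirms the 1024-token example length and skims a few examples:
# quick sanity check, reusing `ds` and `tokenizer` from the snippet above
# (assumes indexing returns a token-id sequence, as in the decode call above)

# each candidate example is 1024 tokens long
assert len(ds[0]) == 1024

# skim the first few examples in text form
for i in range(3):
    print(f"=== example {i} ===")
    print(tokenizer.decode(ds[i])[:500])  # first 500 characters only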
We provide selections for five methods (dsdm, classifier, dsir, random, and semdedup) and six target tasks (jeopardy, squad, lambada, cs_algorithms, lm_task_mix, and gpt3_mix). Below, we describe how to load these selections.
Loading selections requires some setup. First, install git lfs. Then, pull all the required metadata files:
git lfs fetch --all
Then load the selections:
from dsdm import selections, utils

# targeted methods: dsdm, classifier, dsir
method = "dsdm"
target = "squad"
num_examples = 100_000
indices = selections.get_indices(method, num_examples, target)

# untargeted methods: semdedup, random
# (untargeted selections do not depend on the target)
method = "semdedup"
num_examples = 100_000
indices = selections.get_indices(method, num_examples, target)

# select the corresponding subset of the candidate dataset
ds = selections.get_candidate_dataset()
selected_ds = ds.select(indices)
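To sanity-check the result, here is a minimal sketch continuing from the snippet above; it assumes get_indices returns exactly num_examples indices and, as in the dataset-loading example, that indexing yields a token-id sequence:
# continuing from the snippet above: `selected_ds` is the selected subset
# the subset should contain exactly `num_examples` examples
assert len(selected_ds) == num_examples

# inspect the first selected example in text form
tokenizer = utils.tokenizer_maker()
print(tokenizer.decode(selected_ds[0])[:500])  # first 500 characters only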
🚧 Coming soon! 🏗️