MadryLab/DsDm


DsDm: Dataset Selection with Datamodels

[blog post] [paper] [install] [datasets] [select data]
By: Logan Engstrom, Axel Feldmann, Aleksander Madry

DsDm is a model-aware dataset selection method that can greatly improve downstream model performance...

...see our paper for more details!

Installation

Install the required Python packages:

git clone git@github.com:madrylab/DsDm.git
cd DsDm
pip install -r requirements.txt

Datasets

Below are instructions for (a) loading our candidate dataset and (b) selecting data with each selection method we study (DsDm and the baselines).

Candidate dataset

The candidate dataset we select from is available on Hugging Face. It is a tokenized version of the C4 en.noblocklist split prepared by AllenAI (see Appendix A.1 of our work for more details); each example is 1024 tokens.

To load the dataset and decode its first example:

from dsdm import selections, utils

# load dataset and tokenizer
# (WARNING: this will download a 400GB dataset)
ds = selections.get_candidate_dataset()
tokenizer = utils.tokenizer_maker()

# decode the first example to text and display it
text = tokenizer.decode(ds[0])
print(text)

Loading selections

We provide selections for five methods (dsdm, classifier, dsir, random, and semdedup) and six target tasks (jeopardy, squad, lambada, cs_algorithms, lm_task_mix, gpt3_mix). Below, we describe how to load these selections.
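The method and task names above can be checked before any selection is requested. The following is a minimal sketch, assuming the split into targeted methods (dsdm, classifier, dsir) and untargeted methods (semdedup, random) used in this README; `check_selection_request` and the three constants are hypothetical names for illustration and are not part of the `dsdm` package.

```python
# Hypothetical helper, not part of the dsdm package: validates a
# (method, target) pair against the names listed in this README.
TARGETED_METHODS = {"dsdm", "classifier", "dsir"}
UNTARGETED_METHODS = {"semdedup", "random"}
TARGET_TASKS = {
    "jeopardy", "squad", "lambada",
    "cs_algorithms", "lm_task_mix", "gpt3_mix",
}


def check_selection_request(method, target=None):
    """Return True if `method` is targeted, False if it is untargeted.

    Raises ValueError for an unknown method, or for a targeted method
    paired with a target task that is not in the list above.
    """
    if method in TARGETED_METHODS:
        if target not in TARGET_TASKS:
            raise ValueError(f"unknown target task for {method!r}: {target!r}")
        return True
    if method in UNTARGETED_METHODS:
        return False
    raise ValueError(f"unknown selection method: {method!r}")


print(check_selection_request("dsdm", "squad"))  # True
print(check_selection_request("semdedup"))       # False
```

Failing early on a misspelled name avoids discovering the problem only after the candidate dataset has been downloaded.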

Download dependencies

Loading selections requires some setup. First, install Git LFS. Then pull all the required metadata files:

git lfs fetch --all

Selecting data

With the metadata files in place, load the selection indices:

from dsdm import selections, utils

# targeted methods: dsdm, classifier, dsir
method = "dsdm"
target = "squad"
num_examples = 100_000
# get_indices is assumed to live in the selections module
indices = selections.get_indices(method, num_examples, target)

# untargeted methods: semdedup, random
# (note: target is still "squad" from above)
method = "semdedup"
num_examples = 100_000
indices = selections.get_indices(method, num_examples, target)

# select a subset using the most recently computed indices
ds = selections.get_candidate_dataset()
selected_ds = ds.select(indices)
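Because each candidate example is 1024 tokens, the token count of a selection follows directly from the number of selected examples. A minimal sketch of that arithmetic, where `selection_token_count` is a hypothetical helper and not part of the `dsdm` package:

```python
# Each candidate example is 1024 tokens (see "Candidate dataset").
TOKENS_PER_EXAMPLE = 1024


def selection_token_count(num_examples, tokens_per_example=TOKENS_PER_EXAMPLE):
    """Total number of tokens in a selection of fixed-length examples."""
    if num_examples < 0:
        raise ValueError("num_examples must be non-negative")
    return num_examples * tokens_per_example


# The 100_000-example selection used above:
print(selection_token_count(100_000))  # 102400000
```

This can help when sizing a selection against a fixed training-token budget.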

Selecting data

🚧 Coming soon! 🏗️
