[blog post] [paper] [install] [datasets] [select data]
By: Logan Engstrom, Axel Feldmann, Aleksander Madry
DsDm is a model-aware dataset selection method that can greatly improve downstream model performance. See our paper for more details!
Install the necessary Python packages:
git clone [email protected]:madrylab/DsDm.git
cd DsDm
pip install -r requirements.txt
Below, we give instructions for (a) loading our candidate dataset and (b) selecting data with each studied selection method (DsDm and baselines).
The candidate dataset we select from is available on Hugging Face. It is a tokenized version of the C4 en.noblocklist
split prepared by AllenAI (see Appendix A.1 of our work for more details); each example is 1024 tokens.
To load the dataset and display an example:
from dsdm import selections, utils
# load dataset and tokenizer
# (WARNING: this will download a 400GB dataset)
ds = selections.get_candidate_dataset()
tokenizer = utils.tokenizer_maker()
# display the first example in text form
text = tokenizer.decode(ds[0])
print(text)
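As a quick sanity check, the sketch below reuses ds and tokenizer from the snippet above and assumes, as that snippet does, that indexing the dataset yields a token-id sequence; it confirms the 1024-token example length and skims a few examples:
# quick sanity check, reusing `ds` and `tokenizer` from the snippet above
# (assumes indexing returns a token-id sequence, as in the decode call above)

# each candidate example is 1024 tokens long
assert len(ds[0]) == 1024

# skim the first few examples in text form
for i in range(3):
    print(f"=== example {i} ===")
    print(tokenizer.decode(ds[i])[:500])  # first 500 characters only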
We provide selections for five methods (dsdm, classifier, dsir, random, and semdedup) and six target tasks (jeopardy, squad, lambada, cs_algorithms, lm_task_mix, and gpt3_mix). Below, we describe how to load these selections.
Loading selections requires some setup. First, install git lfs. Then, pull all the required metadata files:
git lfs fetch --all
Then load the selections:
from dsdm import selections, utils

# targeted methods: dsdm, classifier, dsir
method = "dsdm"
target = "squad"
num_examples = 100_000
indices = selections.get_indices(method, num_examples, target)

# untargeted methods: semdedup, random
# (untargeted selections do not depend on the target)
method = "semdedup"
num_examples = 100_000
indices = selections.get_indices(method, num_examples, target)

# select the corresponding subset of the candidate dataset
ds = selections.get_candidate_dataset()
selected_ds = ds.select(indices)
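To sanity-check the result, here is a minimal sketch continuing from the snippet above; it assumes get_indices returns exactly num_examples indices and, as in the dataset-loading example, that indexing yields a token-id sequence:
# continuing from the snippet above: `selected_ds` is the selected subset
# the subset should contain exactly `num_examples` examples
assert len(selected_ds) == num_examples

# inspect the first selected example in text form
tokenizer = utils.tokenizer_maker()
print(tokenizer.decode(selected_ds[0])[:500])  # first 500 characters only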
🚧 Coming soon! 🏗️