This script computes n-gram overlap between two datasets, typically a training set and a test set, where the test set consists of HELM scenarios. The resulting n-gram overlaps are then used to compute metrics, which are aggregated into final results.
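For intuition, here is a minimal, self-contained sketch of the core idea; it is not the repo's actual implementation (which has its own tokenization, normalization, and output formats), just an illustration of what n-gram overlap means here: collect the set of n-grams in the training data, then flag test documents that share any n-gram with it.

```python
# Conceptual sketch of n-gram overlap detection, not the repo's implementation.
from typing import Iterable, List, Set, Tuple

def ngrams(tokens: List[str], n: int) -> Iterable[Tuple[str, ...]]:
    """Yield all contiguous n-grams from a token list."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i : i + n])

def training_ngrams(train_docs: Iterable[str], n: int) -> Set[Tuple[str, ...]]:
    """Collect every n-gram that appears in the training data."""
    seen: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        seen.update(ngrams(doc.lower().split(), n))
    return seen

def overlapping_docs(test_docs: Iterable[str], train_grams: Set[Tuple[str, ...]], n: int) -> List[str]:
    """Return test documents that share at least one n-gram with training."""
    return [
        doc
        for doc in test_docs
        if any(g in train_grams for g in ngrams(doc.lower().split(), n))
    ]

# Toy example: the trigram "quick brown fox" appears in both documents.
train_grams = training_ngrams(["the quick brown fox jumps"], n=3)
print(overlapping_docs(["a quick brown fox appears"], train_grams, n=3))
```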
# Create a virtual environment.
# Only run this the first time.
python3 -m pip install virtualenv
python3 -m virtualenv -p python3 venv
# Activate the virtual environment.
source venv/bin/activate
# Install requirements.
pip install -r requirements.txt
Depending on your training data format, you may need to update load_documents.py to support it.
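As a rough sketch of the kind of change involved, suppose (hypothetically) your training data is newline-delimited JSON with a "text" field; the loader below is illustrative, not the file's actual API:

```python
import json
from typing import Iterator

# Hypothetical loader for a JSONL format with a "text" field per line.
# Adapt the field name to your data and wire this into load_documents.py's
# format handling.
def load_my_jsonl_documents(path: str) -> Iterator[str]:
    """Yield one document string per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)["text"]
```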
We generally recommend first running the script on a small subset of the training data and a small subset of the test set (the scenario_data file in the repo) to ensure the script works correctly. Minor modifications may be needed so that the script correctly parses your training data.
For the actual test set, we run either on the HELM scenarios, which are a subset of the actual benchmarks, or on the full benchmarks associated with the HELM scenarios (Benchmark Scenarios). For the latter, memory consumption is considerable, and we recommend sharding the test data.
For parallelization, we generally recommend sharding the training and/or test data and running multiple threads in parallel; the results can easily be joined afterward.
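As a sketch of one way to set this up (shard count and file names are arbitrary, and the per-shard ngrams file naming is an assumption based on the <output_stats>_ngrams convention described below):

```python
import subprocess

# Sketch: split a JSONL training file round-robin into shards, run one
# compute_data_overlap_metrics.py process per shard, then concatenate the
# per-shard ngrams outputs. File names and shard count are illustrative.
NUM_SHARDS = 4

with open("pile_input.json", encoding="utf-8") as f:
    lines = f.readlines()

shard_paths = []
for i in range(NUM_SHARDS):
    shard_path = f"input_shard_{i}.jsonl"
    with open(shard_path, "w", encoding="utf-8") as out:
        out.writelines(lines[i::NUM_SHARDS])
    shard_paths.append(shard_path)

# Launch one process per shard and wait for all of them to finish.
procs = [
    subprocess.Popen(
        [
            "python", "compute_data_overlap_metrics.py",
            "--input-data", shard,
            "--scenario-data", "scenario_data",
            "--output-stats", f"output_stats_{i}",
            "--input-format", "the_pile",
            "--N", "13",
        ]
    )
    for i, shard in enumerate(shard_paths)
]
for p in procs:
    p.wait()

# Join the per-shard ngrams outputs into a single file for the next step
# (assumes each run writes <output_stats>_ngrams, as in the example below).
with open("output_stats_ngrams_joined", "w", encoding="utf-8") as joined:
    for i in range(NUM_SHARDS):
        with open(f"output_stats_{i}_ngrams", encoding="utf-8") as part:
            joined.write(part.read())
```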
Usage:
python compute_data_overlap_metrics.py --input-data <input_data> --scenario-data <scenario_data> --output-stats <output_stats> --input-format <input_format>
For instance, you can run this on The Pile, with:
input_data = pile_input.json (download more at https://pile.eleuther.ai/)
scenario_data = an example is included in the repo, but you can also generate it with HELM
output_stats = arbitrary output file name, e.g. "output_stats"
input_format = the_pile
This will output two files: <output_stats> and <output_stats>_ngrams. Pass the ngrams file to the later steps.
There are additional optional args:
--normalization default
--tags tag1 tag2
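For instance, to tag a run while keeping the default normalization:
python compute_data_overlap_metrics.py --input-data ./example_input.jsonl --scenario-data ./scenario_data --output-stats output_stats --input-format the_pile --normalization default --tags tag1 tag2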
Example:
We run the example with example_input.jsonl as the sample training data and scenario_data as the example scenario data; both are in the repo.
python compute_data_overlap_metrics.py --input-data ./example_input.jsonl --scenario-data ./scenario_data --output-stats output_stats --input-format the_pile --N 13
-> produces "output_stats" and "output_stats_ngrams"; "output_stats_ngrams" is the input for the next stage. These outputs are also included in the repo if you want to run the later stages directly.
Run compute_metrics_from_ngrams.py to generate metrics.
sample_metrics_file is a file containing metrics from The Pile, for testing any script that takes metrics as input.
python compute_metrics_from_ngrams.py --ngrams-path 'ngrams_input' --scenario-path 'scenario_data' --out-path 'metrics_output' --filter-path 'filtered_scenarios' --N 13
ngrams-path = the ngrams file generated by compute_data_overlap_metrics.py
scenario-path = the same scenario_data file used in compute_data_overlap_metrics.py
out-path = arbitrary output file name, e.g. "output_metrics"
filter-path = optional; use it to restrict results to a subset of the scenarios
N = the n in n-grams
Example:
We take the output_stats_ngrams file from the last step (or directly from the repo) and run
python compute_metrics_from_ngrams.py --ngrams-path output_stats_ngrams --scenario-path scenario_data --out-path metrics_output --N 13
-> produces the metrics_output file for the next step
Run output_aggregate_metrics.py and output_aggregate_metrics_both.py to aggregate metrics.
python output_aggregate_metrics.py --metrics-path <metrics_path> --out-path <out_path>
metrics-path is the path to the file from compute_metrics_from_ngrams.py
out-path is an arbitrary output file name
Example:
We take the metrics_output file from the last step (or directly from the repo) and run
python output_aggregate_metrics.py --metrics-path metrics_output --out-path aggregate_metrics
python output_aggregate_metrics_both.py --metrics-path metrics_output --out-path aggregate_metrics_both
-> produces the aggregate metrics outputs, aggregate_metrics and aggregate_metrics_both