[dataset] add imbue #282

Merged · 10 commits · Aug 29, 2024
2 changes: 1 addition & 1 deletion .github/workflows/pytest-check.yml
@@ -45,7 +45,7 @@ jobs:
GITHUB_ACTION: 1
- name: Surface failing tests
if: always()
uses: pmeier/pytest-results-action@multi-testsuites
uses: pmeier/pytest-results-action@v0.7.1
with:
# A list of JUnit XML files, directories containing the former, and wildcard
# patterns to process.
17 changes: 13 additions & 4 deletions README.md
@@ -21,7 +21,7 @@ Training
Utilization

- **Blazingly Fast:** By managing the KV Cache of prefixes, we can speed up local inference by up to 6x 🚀.
- **Comprehensive Evaluation:** 56+ commonly used [datasets](https://github.com/RUCAIBox/LLMBox/blob/main/docs/utilization/supported-datasets.md) and benchmarks in evaluating LLMs.
- **Comprehensive Evaluation:** 59+ commonly used [datasets](https://github.com/RUCAIBox/LLMBox/blob/main/docs/utilization/supported-datasets.md) and benchmarks in evaluating LLMs.
- **Evaluation Methods:** Accurately reproduce results from original papers of OpenAI, LLaMA, Mistral, and other models.
- **In-Context Learning:** We support various ICL strategies, including [`KATE`](https://aclanthology.org/2022.deelio-1.10/), [`GlobalE`](https://aclanthology.org/2022.acl-long.556/), and [`APE`](https://arxiv.org/abs/2211.01910).
- **Chain-of-Thought:** For some datasets, we support three types of CoT evaluation: `base`, [`least-to-most`](https://arxiv.org/abs/2205.10625), and [`pal`](https://arxiv.org/abs/2211.10435).
@@ -140,7 +140,7 @@ For more details, view the [training](https://github.com/RUCAIBox/LLMBox/tree/ma

We provide broad support for Huggingface models (e.g. `LLaMA-3`, `Mistral`, or the model you are building on), as well as OpenAI, Anthropic, QWen and other OpenAI-compatible models for further utilization. Full list of model backends: [here](https://github.com/RUCAIBox/LLMBox/tree/main/utilization#supported-models).

Currently a total of 56+ commonly used datasets are supported, including: `HellaSwag`, `MMLU`, `GSM8K`, `GPQA`, `AGIEval`, `CEval`, and `CMMLU`. Full list of datasets: [here](https://github.com/RUCAIBox/LLMBox/blob/main/docs/utilization/supported-datasets.md).
Currently a total of 59+ commonly used datasets are supported, including: `HellaSwag`, `MMLU`, `GSM8K`, `GPQA`, `AGIEval`, `CEval`, and `CMMLU`. Full list of datasets: [here](https://github.com/RUCAIBox/LLMBox/blob/main/docs/utilization/supported-datasets.md).

```bash
CUDA_VISIBLE_DEVICES=0 python inference.py \
@@ -151,7 +151,9 @@ CUDA_VISIBLE_DEVICES=0 python inference.py \
--ranking_type ppl_no_option
```

See [benchmarking LLaMA3](https://github.com/RUCAIBox/LLMBox/blob/main/docs/utilization/benchmarking_llama3.md) for more examples.
- 🔥 Recently supported datasets: `imbue_code`, `imbue_public`, and `imbue_private`.

- 🔥 See [benchmarking LLaMA3](https://github.com/RUCAIBox/LLMBox/blob/main/docs/utilization/benchmarking_llama3.md) for more examples.

<table>
<tr>
@@ -276,7 +278,7 @@ python inference.py -m model -d dataset --cot least_to_most # --base or --pal

For a more detailed instruction on model utilization, view the [utilization](https://github.com/RUCAIBox/LLMBox/tree/main/utilization) documentation.

<!-- For a full list of evaluation results, view our paper. -->
For a full list of evaluation results, see our paper [LLMBox: A Comprehensive Library for Large Language Models](https://aclanthology.org/2024.acl-demos.37.pdf).

## Contributing

@@ -308,4 +310,11 @@ LLMBox uses [MIT License](./LICENSE).
If you find LLMBox useful for your research or development, please cite the following papers:

```
@inproceedings{tang2024llmbox,
title={LLMBox: A Comprehensive Library for Large Language Models},
author={Tang, Tianyi and Yiwen, Hu and Li, Bingqian and Luo, Wenyang and Qin, ZiJing and Sun, Haoxiang and Wang, Jiapeng and Xu, Shiyi and Cheng, Xiaoxue and Guo, Geyang and others},
booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)},
pages={388--399},
year={2024}
}
```
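As an aside on the `imbue_code`, `imbue_public`, and `imbue_private` datasets announced in the README hunk above: the sketch below is illustrative only (not part of this PR) and assumes the Hugging Face `datasets` library plus the field names documented on the dataset card and in the new loaders further down.

```python
# Illustrative sketch: peek at the raw rows of the Hugging Face dataset that
# backs the new `imbue_code` task. Field names ("question", "choices",
# "correct_answer") follow the dataset card.
from datasets import load_dataset

ds = load_dataset("imbue/code-comprehension", split="train")
row = ds[0]
print(row["question"][:100])  # truncated question text
print(row["choices"])         # candidate answers
print(row["correct_answer"])  # gold answer string
```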
37 changes: 33 additions & 4 deletions docs/utilization/supported-datasets.md
@@ -1,19 +1,27 @@
# Supported Datasets of LLMBox

We currently support 56+ commonly used datasets for LLMs. Each dataset is either a multiple-choice dataset or a generation dataset.
We currently support 59+ commonly used datasets for LLMs.

## Understanding Evaluation Type

Each dataset is either a multiple-choice dataset or a generation dataset. You can find the difference between them [here](https://github.com/RUCAIBox/LLMBox/tree/main/utilization#dataset-arguments).

## Understanding Subsets

Some datasets have multiple subsets. For example, the Massive Multitask Language Understanding (`mmlu`) dataset contains 57 different subsets categorized into four categories: `stem`, `social_sciences`, `humanities`, and `other`.

Some datasets are themselves subsets of another dataset. For example, Choice Of Plausible Alternatives (`copa`) is a subset of `super_glue`.

See how to [load datasets with subsets](https://github.com/RUCAIBox/LLMBox/tree/main/docs/utilization/how-to-load-datasets-with-subsets.md).

Some datasets support Chain-of-Thought reasoning. For example, Grade School Math 8K (`gsm8k`) supports three types of CoT: `base`, `least_to_most`, and `pal`.
## Understanding CoT

You can find the supported datasets in the following table.
Some datasets support Chain-of-Thought reasoning. For example, Grade School Math 8K (`gsm8k`) supports three types of CoT: `base`, `least_to_most`, and `pal`.

## Supported Datasets

- 🔥 Recently supported datasets: `imbue_code`, `imbue_public`, and `imbue_private`.

<table>
<tr>
<td><b>Dataset</b></td>
@@ -177,7 +185,7 @@ You can find the supported datasets in the following table.
<tr>
<td>Google-Proof Q&A (<code>GPQA</code>)</td>
<td><code>gpqa_main</code> (default), <code>gpqa_extended</code>, ...</td>
<td>MultipleChoiceDataset</td>
<td>MultipleChoice</td>
<td>✅</td>
<td></td>
</tr>
@@ -216,6 +224,27 @@ You can find the supported datasets in the following table.
<td></td>
<td></td>
</tr>
<tr>
<td><a href="https://huggingface.co/datasets/imbue/code-comprehension">Imbue Code Comprehension</a> (<code>imbue_code</code>)</td>
<td>/</td>
<td>MultipleChoice</td>
<td></td>
<td></td>
</tr>
<tr>
<td><a href="https://huggingface.co/datasets/imbue/high_quality_private_evaluations">Imbue High Quality Private Evaluations</a> (<code>imbue_private</code>)</td>
<td>/</td>
<td>MultipleChoice</td>
<td></td>
<td></td>
</tr>
<tr>
<td><a href="https://huggingface.co/datasets/imbue/high_quality_public_evaluations">Imbue High Quality Public Evaluations</a> (<code>imbue_public</code>)</td>
<td>/</td>
<td>MultipleChoice</td>
<td></td>
<td></td>
</tr>
<tr>
<td>LAnguage Modeling Broadened to Account for Discourse Aspects (<code>lambada</code>)</td>
<td><code>default</code> (default), <code>de</code>, ... (source: <i>EleutherAI/lambada_openai</i>)</td>
2 changes: 1 addition & 1 deletion tests/dry_test/test_datasets.py
@@ -1,11 +1,11 @@
import nltk
import pytest

from utilization.utils.logging import list_datasets

from .fixtures import *

nltk.download('punkt')
nltk.download('punkt_tab')

datasets = {
"agieval": [],
2 changes: 2 additions & 0 deletions tests/dry_test/test_models.py
@@ -26,3 +26,5 @@ def test_models_dry_run(run_evaluate, model, dataset, extra_args):
run_evaluate(["-m", model, "-d", dataset, "-b", "10", "--dry_run"] + extra_args, cuda=0)
except torch.cuda.OutOfMemoryError:
pytest.skip(f"Out of memory error on {model} {dataset}")
except FileNotFoundError:
pytest.skip(f"File not found error on {model} {dataset}")
37 changes: 37 additions & 0 deletions utilization/dataset/imbue_code.py
@@ -0,0 +1,37 @@
from functools import cached_property
from logging import getLogger

from .multiple_choice_dataset import MultipleChoiceDataset

logger = getLogger(__name__)


class ImbueCode(MultipleChoiceDataset):
"""The dataset of Imbue code understanding questions.

These examples fall into 2 categories:
- "cloze": fill in the hole to produce the specified outcome;
- "eval": given a snippet of python code, determine the outcome.
Some questions are very easy, some are much more challenging. Most (if not all) of these questions should be relatively straightforward for an experienced programmer, even without a pencil and paper. Released as part of Imbue's 70b evals post.

Link: https://huggingface.co/datasets/imbue/code-comprehension?row=0

Example (To avoid data contamination, some fields are omitted):
'question': 'If we execute the code below, what will `result` be equal to? ```python ... ```'
'choices': [ "'66-66-66-foo'", "'foo-66-66-66'", "'66--66--66--foo'", "''" ]
'correct_answer': '66- ... -foo'
"""

instruction = "{{question}}{{'\n' + options if options}}\nAnswer:"
evaluation_set = "train"
example_set = None
load_args = ("imbue/code-comprehension",)

def format_instance(self, instance):
instance["target_idx"] = instance["choices"].index(instance["correct_answer"])
instance["options"] = instance["choices"]
return instance

@cached_property
def references(self):
return [instance["target_idx"] for instance in self.evaluation_data]
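A minimal, self-contained sketch of what `format_instance` above does with a row. The row values here are hypothetical stand-ins, since the real answer string is elided in the docstring.

```python
# Hypothetical row, shaped like the fields described in the docstring above.
instance = {
    "question": "If we execute the code below, what will `result` be equal to? ...",
    "choices": ["'66-66-66-foo'", "'foo-66-66-66'", "'66--66--66--foo'", "''"],
    "correct_answer": "'66-66-66-foo'",  # hypothetical value; the real one is elided
}

# Same mapping as ImbueCode.format_instance: the gold answer string becomes an
# index into the options list, which the multiple-choice pipeline then scores.
instance["target_idx"] = instance["choices"].index(instance["correct_answer"])
instance["options"] = instance["choices"]
print(instance["target_idx"])  # -> 0
```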
43 changes: 43 additions & 0 deletions utilization/dataset/imbue_private.py
@@ -0,0 +1,43 @@
from functools import cached_property
from logging import getLogger

from .multiple_choice_dataset import MultipleChoiceDataset

logger = getLogger(__name__)


class ImbuePrivate(MultipleChoiceDataset):
"""The dataset of Imbue private evaluations.

High-quality question-answer pairs, from private versions of datasets designed to mimic ANLI, ARC, BoolQ, ETHICS, GSM8K, HellaSwag, OpenBookQA, MultiRC, RACE, Social IQa, and WinoGrande. For details, see https://imbue.com/research/70b-evals/. Format: each row contains a question, candidate answers, the correct answer (or multiple correct answers in the case of MultiRC questions), and a question quality score.

Link: https://huggingface.co/datasets/imbue/high_quality_private_evaluations

Example (To avoid data contamination, some fields are omitted):
'question': 'For this question, first read the passage below. "The artist ..." Based on the passage above, answer the following question. Which wealth ...?'
'correct_choices': [ "A ... ire" ]
'choices': [ "A billionaire", "A centimillionaire", "A trillionaire", "A decamillionaire" ]
'quality': 0.245109
'original_dataset': race
"""

instruction = "{{question}}{{'\n' + options if options}}\nAnswer:"
evaluation_set = "train"
example_set = None
load_args = ("imbue/high_quality_private_evaluations",)
category_column = "original_dataset"

def format_instance(self, instance):
if len(instance["correct_choices"]) > 1:
logger.warning(
f"Multiple correct choices found: {len(instance['correct_choices'])}. Only the first one is used. Multiple correct choices may be supported in the future."
)

correct_choice = instance["correct_choices"][0]
instance["target_idx"] = instance["choices"].index(correct_choice)
instance["options"] = instance["choices"]
return instance

@cached_property
def references(self):
return [instance["target_idx"] for instance in self.evaluation_data]
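Compared with `imbue_code`, the loader above also handles rows that list several correct answers (e.g. MultiRC-style questions): only the first correct choice is scored, and a warning is emitted. A small sketch of that fallback in isolation, using a hypothetical row:

```python
import logging

logger = logging.getLogger(__name__)


def first_correct_index(row: dict) -> int:
    """Mirror of the fallback in ImbuePrivate.format_instance: when several
    correct choices are listed, score only the first and warn."""
    if len(row["correct_choices"]) > 1:
        logger.warning(
            "Multiple correct choices found: %d. Only the first one is used.",
            len(row["correct_choices"]),
        )
    return row["choices"].index(row["correct_choices"][0])


# Hypothetical MultiRC-style row with two correct answers.
row = {"choices": ["A", "B", "C"], "correct_choices": ["B", "C"]}
print(first_correct_index(row))  # -> 1 (index of "B")
```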
43 changes: 43 additions & 0 deletions utilization/dataset/imbue_public.py
@@ -0,0 +1,43 @@
from functools import cached_property
from logging import getLogger

from .multiple_choice_dataset import MultipleChoiceDataset

logger = getLogger(__name__)


class ImbuePublic(MultipleChoiceDataset):
"""The dataset of Imbue public evaluations.

High-quality question-answer pairs, originally from ANLI, ARC, BoolQ, ETHICS, GSM8K, HellaSwag, OpenBookQA, MultiRC, RACE, Social IQa, and WinoGrande. For details, see https://imbue.com/research/70b-evals/. Format: each row contains a question, candidate answers, the correct answer (or multiple correct answers in the case of MultiRC questions), and a question quality score.

Link: https://huggingface.co/datasets/imbue/high_quality_public_evaluations

Example:
'question': 'The man was released from jail. What is the cause of this?'
'correct_choices': [ "His family paid his bail." ]
'choices': [ "His family paid his bail.", "He attacked a fellow inmate." ]
'quality': 0.348698
'original_dataset': copa
"""

instruction = "{{question}}{{'\n' + options if options}}\nAnswer:"
evaluation_set = "train"
example_set = None
load_args = ("imbue/high_quality_public_evaluations",)
category_column = "original_dataset"

def format_instance(self, instance):
if len(instance["correct_choices"]) > 1:
logger.warning(
f"Multiple correct choices found: {len(instance['correct_choices'])}. Only the first one is used. Multiple correct choices may be supported in the future."
)

correct_choice = instance["correct_choices"][0]
instance["target_idx"] = instance["choices"].index(correct_choice)
instance["options"] = instance["choices"]
return instance

@cached_property
def references(self):
return [instance["target_idx"] for instance in self.evaluation_data]
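Like `ImbuePrivate`, this loader sets `category_column = "original_dataset"`, so results can be broken down by the dataset each question was derived from. A quick way to look at that category distribution (illustrative only, assuming the `datasets` library and the documented `original_dataset` field):

```python
# Illustrative sketch: distribution of source datasets in the public split,
# i.e. the values the loader groups by via `category_column`.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("imbue/high_quality_public_evaluations", split="train")
print(Counter(ds["original_dataset"]).most_common())
# Prints (source_dataset, count) pairs; exact values depend on the release.
```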
2 changes: 1 addition & 1 deletion utilization/dataset/tldr.py
@@ -16,7 +16,7 @@ class TLDR(GenerationDataset):
"""

instruction = "{source}"
evaluation_set = "train"
evaluation_set = "test"
example_set = None
metrics = [Rouge()]
load_args = ("CarperAI/openai_summarize_tldr",)
38 changes: 19 additions & 19 deletions utilization/model/huggingface_model.py
@@ -37,6 +37,7 @@ def load_tokenizer(tokenizer_name_or_path: str, use_fast: bool, max_length: int
truncation_side="left",
add_eos_token=False,
add_bos_token=False, # add in chat_template
trust_remote_code=True,
)

# TODO: [Important]!!! check for each tokenizer
@@ -86,23 +87,8 @@ def get_model_max_length(
def load_hf_model(args: ModelArguments) -> Tuple[PreTrainedModel, Union[PreTrainedTokenizer, PreTrainedTokenizerFast]]:
logger.info(f"Loading {args.model_name_or_path} using Hugging Face Transformers...")

# https://github.com/meta-llama/llama/issues/380#issuecomment-1656714118
if args.torch_dtype == "auto":
try:
with open(args.model_name_or_path + "/config.json") as f:
config = json.load(f)
if "torch_dtype" in config:
if config["torch_dtype"] == "float32":
torch_dtype = "float16"
else:
torch_dtype = config["torch_dtype"]
except:
torch_dtype = "float16"
else:
torch_dtype = args.torch_dtype

model_kwargs = dict(
torch_dtype=getattr(torch, torch_dtype),
torch_dtype=getattr(torch, args.torch_dtype),
device_map=args.device_map,
load_in_4bit=args.load_in_4bit,
load_in_8bit=args.load_in_8bit,
@@ -153,16 +139,31 @@ class HuggingFaceModel(Model):
def __init__(self, args: ModelArguments):
super().__init__(args)

# https://github.com/meta-llama/llama/issues/380#issuecomment-1656714118
if args.torch_dtype == "auto":
torch_dtype = "float16"
try:
with open(args.model_name_or_path + "/config.json") as f:
config = json.load(f)
if "torch_dtype" in config and config["torch_dtype"] != "float32":
torch_dtype = config["torch_dtype"]
except:
pass
else:
torch_dtype = args.torch_dtype
args.torch_dtype = torch_dtype

if getattr(args, "load_hf_model", None) is not None:
_load_hf_model = args.load_hf_model
else:
_load_hf_model = load_hf_model

self.model, self._tokenizer = _load_hf_model(args)
if self._tokenizer.model_max_length is None:
logger.warning(f"`model_max_length` is not set for {self.name}. Set to default {DEFAULT_MODEL_MAX_LENGTH}.")
self._tokenizer.model_max_length = DEFAULT_MODEL_MAX_LENGTH

self.device = "cuda" if torch.cuda.is_available() else "cpu"
self.device = self.model.device if torch.cuda.is_available() else "cpu"
self.model_max_input_and_output = self.tokenizer.model_max_length

# model tests
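The hunk above moves the `torch_dtype == "auto"` resolution out of `load_hf_model` and into `__init__`: default to `float16`, and only honor the checkpoint's `config.json` when it does not request `float32`. A standalone sketch of that logic (illustrative, mirroring the code shown, not the PR's exact function):

```python
import json
import os


def resolve_torch_dtype(model_name_or_path: str, requested: str = "auto") -> str:
    """Sketch of the dtype fallback now performed in HuggingFaceModel.__init__."""
    if requested != "auto":
        return requested
    torch_dtype = "float16"  # default, avoids loading weights in float32
    try:
        with open(os.path.join(model_name_or_path, "config.json")) as f:
            config = json.load(f)
        if "torch_dtype" in config and config["torch_dtype"] != "float32":
            torch_dtype = config["torch_dtype"]
    except (OSError, json.JSONDecodeError):
        pass  # missing or unreadable config: keep the float16 default
    return torch_dtype


print(resolve_torch_dtype("/path/to/checkpoint"))  # "float16" when no config.json exists
```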
@@ -443,8 +444,7 @@ def get_ppl_with_cache(
last_logits = torch.cat(prefix_cache.next_logits, dim=0).to(logits.device)
shift_logits = torch.cat([last_logits, logits[:, :-1]], dim=-2)
labels[labels == self.tokenizer.pad_token_id] = -100
probs = self.loss_fct(shift_logits.view(-1, vocab_size),
labels.view(-1)).view(labels.size(0), -1)
probs = self.loss_fct(shift_logits.view(-1, vocab_size), labels.view(-1)).view(labels.size(0), -1)

if exact_match:
greedy_tokens = torch.argmax(shift_logits, dim=-1)
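For context on the reformatted `probs` line: it computes a per-token negative log-likelihood and reshapes it back to `(batch, seq_len)`, which only works if `self.loss_fct` is a `CrossEntropyLoss` with `reduction="none"` (an assumption here, inferred from the reshape). A minimal standalone illustration:

```python
import torch
import torch.nn as nn

# Assumed setup: per-token losses, padding masked with the ignore index.
loss_fct = nn.CrossEntropyLoss(reduction="none", ignore_index=-100)

batch, seq_len, vocab_size = 2, 5, 11
shift_logits = torch.randn(batch, seq_len, vocab_size)
labels = torch.randint(1, vocab_size, (batch, seq_len))
labels[:, -1] = -100  # e.g. positions that were padding

# Same call shape as in get_ppl_with_cache: flatten, score, reshape per sequence.
nll = loss_fct(shift_logits.view(-1, vocab_size), labels.view(-1)).view(labels.size(0), -1)
print(nll.shape)  # torch.Size([2, 5]); masked positions contribute 0.0
```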