# Evaluating Language Models

EasyLM has built-in support for evaluating language models on a variety of tasks. Once the trained language model is served with LMServer, it can be evaluated against various benchmarks in few-shot and zero-shot settings.
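As a quick sanity check before launching a full benchmark, you can query the server's HTTP API directly. The snippet below is a minimal sketch, assuming the default port 5007 and a `/generate` endpoint that accepts a JSON list of `prefix_text` prompts; the exact route and field names may differ in your version, so check `EasyLM/serving.py` for the schema your server exposes.

```
# Minimal sanity check for a running LMServer instance.
# Assumptions (verify against EasyLM/serving.py): the server listens on
# http://localhost:5007 and exposes a /generate endpoint that takes
# {"prefix_text": [...]} and returns the generated continuations as JSON.
import requests

SERVER_URL = "http://localhost:5007"

response = requests.post(
    f"{SERVER_URL}/generate",
    json={"prefix_text": ["The capital of France is"]},
    timeout=60,
)
response.raise_for_status()
print(response.json())  # inspect the returned generations
```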

## LM Evaluation Harness

EasyLM comes with built-in support for lm-eval-harness, which can evaluate the language model on a variety of tasks. For example, you can use the following command to evaluate the language model served with the HTTP server:

```
python -m EasyLM.scripts.lm_eval_harness \
    --lm_client.url='http://localhost:5007/' \
    --tasks='wsc,piqa,winogrande,openbookqa,logiqa' \
    --shots=0
```

The main command line options of the lm_eval_harness script used above are `lm_client.url`, which points at the LMServer HTTP endpoint, `tasks`, a comma-separated list of lm-eval-harness task names, and `shots`, the number of few-shot examples included in each prompt.
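If you want to sweep over several few-shot settings against the same server, the CLI can be driven from a small script. The sketch below simply shells out to `EasyLM.scripts.lm_eval_harness` with the flags shown above; the task list and URL are the same illustrative values as in the example and should be adjusted to your setup.

```
# Run lm_eval_harness at several shot counts against the same LMServer.
# Only the flags demonstrated above are used.
import subprocess

TASKS = "wsc,piqa,winogrande,openbookqa,logiqa"
SERVER_URL = "http://localhost:5007/"

for shots in (0, 1, 5):
    subprocess.run(
        [
            "python", "-m", "EasyLM.scripts.lm_eval_harness",
            f"--lm_client.url={SERVER_URL}",
            f"--tasks={TASKS}",
            f"--shots={shots}",
        ],
        check=True,  # stop the sweep if any run fails
    )
```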

## Evaluating on MMLU

The served language model can also be evaluated on the MMLU benchmark. In order to run the evaluation, you'll need to use my fork of MMLU, which supports the EasyLM LMServer.

```
git clone https://github.com/young-geng/mmlu_easylm.git
cd mmlu_easylm
python evaluate_easylm.py \
    --name='llama' \
    --lm_server_url='http://localhost:5007' \
    --ntrain=5
```
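
Here `--lm_server_url` should point at the same LMServer instance started earlier, and `--ntrain=5` follows the standard 5-shot MMLU setup, where five development examples are prepended to each prompt.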