EvalBase allows easy testing of new text-to-text generation metrics. A computer program or machine learning model that generates text from text is called a system. Text-to-text generation tasks include summarization, translation, and question answering; correspondingly, the systems for these tasks are called summarizers, translators, and question answerers.
There are usually two approaches to evaluating text-to-text generation: reference-based and reference-free. For each sample in a test set, a reference-based approach first lets a human do the same task to produce a reference and then compares the machine/system-generated output against that reference, while a reference-free approach directly scores the generated text with respect to the input text.
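As a toy illustration (not part of EvalBase and not a real metric), the sketch below scores the same system outputs in both modes with a made-up token-overlap function; the only thing that changes is the second argument.

```python
# Toy illustration only -- not EvalBase code and not a real metric.
from typing import List

def token_overlap(outputs: List[str], anchors: List[str]) -> List[float]:
    """Fraction of output tokens that also appear in the anchor text."""
    scores = []
    for out, anchor in zip(outputs, anchors):
        out_tokens = out.lower().split()
        anchor_tokens = set(anchor.lower().split())
        scores.append(sum(tok in anchor_tokens for tok in out_tokens) / max(len(out_tokens), 1))
    return scores

system_outputs = ["the cat sat on the mat"]
references     = ["a cat was sitting on a mat"]        # written by a human
sources        = ["The cat sat on the mat all day."]   # the input document

print(token_overlap(system_outputs, references))  # reference-based scoring
print(token_overlap(system_outputs, sources))     # reference-free scoring
```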
For each dataset that EvalBase supports, a `main()` function is provided. To use EvalBase, add EvalBase to your `$PYTHONPATH` environment variable and then call the `main()` function with a single argument: a dictionary of configurations. See `run_exp.py` for the available configuration options and for examples of calling `main()`. For how to add new metrics, see the section on adding new metrics below.
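A minimal sketch of the calling pattern is shown below; the exact configuration keys and import paths are defined in `run_exp.py`, so apart from `NLG_metrics` and `result_path_root` (both described later in this document) treat the details here as illustrative assumptions.

```python
# Sketch only: see run_exp.py for the real configuration dictionaries.
import sys
sys.path.append("/path/to/EvalBase")  # or add the EvalBase folder to $PYTHONPATH

import summeval  # one of the dataset scripts (newsroom.py, realsumm.py, summeval.py, tac2010.py)

config = {
    "NLG_metrics": {},               # metric name -> metric function (see below)
    "result_path_root": "results/",  # where evaluation results are written
    # ... other keys as defined in run_exp.py
}
summeval.main(config)
```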
All dataset files and their processing scripts must be under the `dataloader` folder.
- SummEval: Run `summeval_build.sh`.
- Realsumm: Run `realsumm_build.sh`.
- Newsroom:
  - Download `newsroom-human-eval.csv`, the human evaluation results, which include the documents and system summaries but no reference summaries: `wget https://github.com/lil-lab/newsroom/raw/master/humaneval/newsroom-human-eval.csv`
  - Get `test.jsonl`, the test split of Newsroom, which contains the reference summaries. There is no automatic script for this step: you will have to fill out a web form and then follow the link sent to your email to download the data. `test.jsonl` is inside the downloaded tarball.
- TAC: We assume that you have fully and recursively extracted the following two files. Both require applying to NIST for access.
  - `GuidedSumm2010_eval.tgz`: downloadable from the web; contains the human evaluation results and system summaries.
  - `TAC2010_Summarization_Documents.tgz`: emailed by NIST; contains the documents for which summaries are generated and rated.
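As an optional convenience (not part of EvalBase), a short check like the one below can confirm that the artifacts listed above are in place before you run anything; the exact paths under `dataloader` are assumptions, so adjust them to wherever your build scripts put the files.

```python
# Optional sanity check; the paths below are assumed, not prescribed by EvalBase.
from pathlib import Path

expected_files = [
    "dataloader/newsroom-human-eval.csv",
    "dataloader/test.jsonl",
    "dataloader/GuidedSumm2010_eval.tgz",
    "dataloader/TAC2010_Summarization_Documents.tgz",
]
for f in expected_files:
    print(f, "found" if Path(f).exists() else "MISSING")
```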
Once you have the files above, you can run the evaluations with `python3 run_exp.py`. By default, all results are saved under the `results` folder. Alternatively, you can modify `result_path_root` in the configuration dictionary passed to the `main()` function for a dataset.
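For example, a per-dataset override might look like the sketch below; `result_path_root` and `NLG_metrics` are the key names used in this document, while the surrounding structure is an assumption about how `run_exp.py` organizes its configurations.

```python
# Sketch: write SummEval results somewhere other than the default results/ folder.
summeval_config = {
    "NLG_metrics": {},                             # metrics to evaluate (next section)
    "result_path_root": "results/summeval_run1/",  # overrides the default location
}
# summeval.main(summeval_config)
```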
Just add your metric as an entry in the `NLG_metrics` dictionary of the configuration dictionary: keys are metric names as strings, while values are metric functions (see the I/O requirements and the sketch that follows them below).
Each metric function must follow the I/O convention of the BERTScore function in HuggingFace's `evaluate` library:
- Two required positional arguments:
  - the output text produced by a text-to-text generation system, as a `List[str]`;
  - the input text (in ref-free mode) or the reference (in ref-based mode), as a `List[str]`.
- The return value must be a dictionary of type `Dict[str, List[float]]`, e.g., `{'precision': [0.5, 0.6], 'recall': [0.7, 0.8], 'f1': [0.9, 0.75]}`.
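Below is a hedged sketch of a metric function that satisfies this contract and of how it could be registered in `NLG_metrics`; the function itself (`length_ratio_metric`) is made up for illustration and is not an EvalBase built-in.

```python
from typing import Dict, List

def length_ratio_metric(outputs: List[str], refs_or_inputs: List[str]) -> Dict[str, List[float]]:
    """Toy metric obeying the required I/O: two List[str] positional arguments in,
    a dict mapping score names to per-sample float lists out."""
    scores = []
    for out, anchor in zip(outputs, refs_or_inputs):
        ratio = len(out.split()) / max(len(anchor.split()), 1)
        scores.append(min(ratio, 1.0))  # clip so overly long outputs are not rewarded
    return {"length_ratio": scores}

# Registration in the configuration dictionary described above
# (the key name NLG_metrics comes from this README; "length_ratio" is arbitrary):
config = {"NLG_metrics": {"length_ratio": length_ratio_metric}}
```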
- `eval_utils.py`: scoring summaries using automated metrics and computing their correlations with human scores.
  - `eval_summary_level`: the main function that does summary-level evaluation. It loads a `dataset_df` (see the specifications below).
  - `eval_system_level`: to be finished by those taking CS 579X.
- Dataloader and evaluation scripts for each dataset:
  - `newsroom.py`: run experiments on the Newsroom dataset
  - `realsumm.py`: run experiments on the Realsumm dataset
  - `summeval.py`: run experiments on the SummEval dataset
  - `tac2010.py`: run experiments on the TAC 2010 dataset
To keep things consistent, we use the same local variable names for the key DataFrames across functions in `eval.py`. Please see the docstrings/comments inside `eval.py` for more details on these variables.
- `dataset_df`: a summary human evaluation dataset represented as a `pandas.DataFrame` (see the sketch after this list). The following columns must be present:
  - ArticleText -- the source text
  - System -- the name of the system/summarizer
  - SystemSummary -- a system summary generated by the `System` in the same row
  - ReferenceSummary -- the corresponding reference summary provided by the dataset
  - various columns of human evaluation aspects, as defined in `env.human_metrics`
- `batch_result_df`: each row corresponds to a sample, and each column corresponds to a MultiIndex `(approach, model, score_name)`. A `score_name` is a variant of a `model`; for example, ROUGE-1 is a `score_name` for ROUGE, which is a `model`.
- `corr_df`: the correlation coefficient DataFrame, the last column of which is like