- Abstractive summarization models often generate content that is factually inconsistent with their input.
- The paper proposes QAGS, an automatic evaluation framework designed to identify factual inconsistencies in a generated summary.
- Contributions
- Introduction of QAGS
- A set of human judgements of factual consistency for model-generated summaries on two summarization datasets
- QAGS is robust to the quality of the underlying models and to domain mismatch
- The QAGS (pronounced "kags") model:
- Given a generated text (the summary), a question generation (QG) model produces questions about it (decoded with beam search).
- A question answering (QA) model answers these questions using the source document and, separately, the summary
- A consistency score is computed from the similarity of the corresponding answers (see the sketch below)
- By design, the framework focuses on the semantically relevant parts of the text rather than weighting all parts equally (which is what ROUGE does).
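A minimal sketch of the scoring pipeline, not the authors' implementation: `generate_questions` and `answer_question` are hypothetical stand-ins for the trained QG and QA models, and token-level F1 is assumed as the answer-similarity measure.

```python
# Minimal sketch of the QAGS scoring pipeline (not the authors' implementation).
# `generate_questions` and `answer_question` are hypothetical stand-ins for the
# trained QG and QA models; token-level F1 is used as the answer-similarity measure.
from collections import Counter
from typing import Callable, List


def token_f1(pred: str, gold: str) -> float:
    """SQuAD-style token-level F1 between two answer strings."""
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def qags_score(
    source: str,
    summary: str,
    generate_questions: Callable[[str], List[str]],  # QG model: summary -> questions
    answer_question: Callable[[str, str], str],      # QA model: (question, context) -> answer
) -> float:
    """Average answer similarity over questions generated from the summary."""
    scores = []
    for question in generate_questions(summary):
        answer_from_summary = answer_question(question, summary)  # answer using the summary
        answer_from_source = answer_question(question, source)    # answer using the source text
        scores.append(token_f1(answer_from_summary, answer_from_source))
    return sum(scores) / len(scores) if scores else 0.0
```

If the summary hallucinates a fact, the QA model's answer from the summary will disagree with its answer from the source, pulling the average similarity down.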
- QAGS correlates substantially better with human judgements of correctness than other automatic evaluation metrics (Pearson correlation is used); a toy example of the correlation computation follows below.
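A toy illustration of the correlation computation, assuming placeholder scores (not values from the paper); `scipy.stats.pearsonr` returns the correlation coefficient and its p-value.

```python
# Illustrative only: correlating automatic metric scores with human judgements
# using Pearson's r. The numbers below are placeholders, not results from the paper.
from scipy.stats import pearsonr

metric_scores = [0.62, 0.48, 0.91, 0.33, 0.75]  # e.g. QAGS scores, one per summary
human_scores = [0.60, 0.40, 0.95, 0.30, 0.70]   # e.g. averaged human correctness judgements
r, p_value = pearsonr(metric_scores, human_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```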
- QAGS is specifically designed to detect factual inconsistency in summaries; if used as any sort of signal, it should be combined with other signals that capture coherence, conciseness, or fluency.
- Takeaway and future work
- Evaluation metrics that can capture subtle semantic errors are needed to build better models