The Critique of Critique

architectures
social-sciences
MetaCritique evaluates critique quality through precision and recall scores, using AIUs for detailed assessment and providing natural language rationale.
Authors

Shichao Sun

Junlong Li

Weizhe Yuan

Ruifeng Yuan

Wenjie Li

Pengfei Liu

Published

January 9, 2024

Summary

Major Findings

  • MetaCritique, a framework for evaluating critiques, is proposed in this paper. Two aspects, factuality and comprehensiveness, are evaluated using precision and recall scores.
  • The framework uses Atomic Information Units (AIUs) to evaluate critiques at a more fine-grained level and provides natural language rationale to support each judgment.
  • A meta-evaluation dataset covering four tasks (question answering, reasoning, entailment, and summarization) is created to demonstrate the feasibility and effectiveness of MetaCritique. The framework achieved near-human performance and identified high-quality critiques leading to improved results.

Key Concepts

  • MetaCritique: A framework that evaluates critiques on two aspects, factuality and comprehensiveness, using precision and recall scores.
  • Atomic Information Units (AIUs): Fundamental segments of informative critique, used to evaluate a critique at a fine-grained level.
  • Meta-evaluation dataset: A dataset covering four tasks, used to demonstrate the feasibility and effectiveness of MetaCritique.
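The precision and recall scoring over AIUs described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `AIUJudgment` class, the function names, and the convention of returning 0.0 for an empty list are all assumptions made for the example. It assumes each AIU has already received a binary verdict and a rationale, that precision is the share of the critique's own AIUs judged factual, and that recall is the share of the reference critique's AIUs that the critique covers.

```python
from dataclasses import dataclass


@dataclass
class AIUJudgment:
    """Hypothetical container: one AIU with a binary verdict and its rationale."""
    text: str
    verdict: bool    # True = judged factual (precision side) or covered (recall side)
    rationale: str   # natural-language justification for the verdict


def fraction_true(judgments):
    """Share of AIUs with a positive verdict; 0.0 for an empty list (assumed convention)."""
    if not judgments:
        return 0.0
    return sum(j.verdict for j in judgments) / len(judgments)


def meta_critique_scores(critique_aius, reference_aius):
    """Return (precision, recall) for one critique.

    critique_aius:  AIUs extracted from the critique, each judged for factuality.
    reference_aius: AIUs extracted from the reference critique, each judged for
                    whether the critique under evaluation covers it.
    """
    return fraction_true(critique_aius), fraction_true(reference_aius)


# Toy input, invented purely for illustration.
critique_aius = [
    AIUJudgment("The answer misstates the date.", True, "Matches the reference."),
    AIUJudgment("The answer omits the second cause.", True, "Reference lists two causes."),
    AIUJudgment("The answer is too long.", False, "Length is not an error here."),
]
reference_aius = [
    AIUJudgment("Date is wrong.", True, "Critique flags the date."),
    AIUJudgment("Second cause is missing.", True, "Critique flags the omission."),
    AIUJudgment("Units are inconsistent.", False, "Critique never mentions units."),
    AIUJudgment("Conclusion is unsupported.", False, "Critique never mentions it."),
]

precision, recall = meta_critique_scores(critique_aius, reference_aius)
```

Keeping the rationale next to each verdict mirrors the summary's point that every judgment comes with a natural language explanation, so a low score can be traced back to specific AIUs.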

Feasibility and Effectiveness

  • GPT-4 is used to generate references and extract AIUs, and it achieves remarkable performance on both, justifying its use in MetaCritique.
  • GPT-4 also performs well on AIU-level tasks, indicating its suitability for evaluating critiques via the MetaCritique framework.
  • MetaCritique correlates better with human judgments than the other baselines, demonstrating its effectiveness in evaluating critiques.
  • MetaCritique identifies superior critiques that lead to better refined outcomes, indicating its potential to enhance generative AI substantially.
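Agreement between an automatic metric and human judgments, as mentioned above, is commonly measured with a correlation coefficient. The summary does not say which coefficient the paper uses, so the sketch below assumes Pearson correlation; the function name and the sample values are hypothetical and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for five critiques, for illustration only.
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_scores = [0.2, 0.4, 0.6, 0.8, 1.0]

agreement = pearson(human_scores, metric_scores)
```

A metric whose scores rise and fall with the human ratings yields a coefficient near 1, while one unrelated to them yields a coefficient near 0.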

Critique

  • Creative tasks are poorly suited to the recall principle, especially when multiple high-quality answers exist, which limits the framework.
  • The availability of reference answers or critiques remains a challenge. GPT-4 is used to produce the references, so potential errors in its output must be acknowledged.

Limitations

  • Both the difficulty of handling creative tasks and the limited availability of reference answers or critiques are identified as areas for improvement in future work.

Appendix

Model: gpt-3.5-turbo-1106
Date Generated: 2024-02-26
Abstract: http://arxiv.org/abs/2401.04518v1
HTML: https://browse.arxiv.org/html/2401.04518v1
Truncated: False
Word Count: 8182