BatchEval: Towards Human-like Text Evaluation

robustness
prompt-engineering
BatchEval improves LLM-based text evaluation by addressing prompt-design sensitivity, noise resistance, and ensemble performance, achieving 10.5% higher correlation with human judgments at reduced API cost.
Authors

Peiwen Yuan

Shaoxiong Feng

Yiwei Li

Xinglin Wang

Boyuan Pan

Heda Wang

Kan Li

Published

December 31, 2023

Key Findings

  1. Inferior ensemble performance with static reference: Current large language model (LLM)-based evaluators face challenges with ensemble performance due to weak diversity and lack of comparison between analyses.
  2. Sensitivity to prompt design: Minor changes to the prompt can lead to significant variations in evaluation results (see the probe sketch after this list).
  3. Poor resistance to noise: Evaluation scores lack discrimination and exhibit a non-uniform distribution, leading to reduced robustness against noise.

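The prompt-sensitivity finding (item 2) can be made concrete with a small probe: score the same text under several paraphrased evaluation prompts and compare the resulting scores. The sketch below is illustrative only; `call_llm`, the prompt paraphrases, and the parsing logic are hypothetical stand-ins, not the paper's setup.

```python
# Hypothetical probe for prompt-design sensitivity of an LLM evaluator.
# `call_llm` is a placeholder for any chat-completion client that returns text.
import statistics

PROMPT_VARIANTS = [
    "Rate the coherence of the following text on a scale of 1-10:\n{text}\nScore:",
    "On a 1-10 scale, how coherent is this text?\n{text}\nScore:",
    "Assign a coherence score (1 = incoherent, 10 = perfectly coherent):\n{text}\nScore:",
]

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its raw reply."""
    raise NotImplementedError

def score_with_variants(text: str) -> list[float]:
    """Score one text under each prompt paraphrase."""
    scores = []
    for template in PROMPT_VARIANTS:
        reply = call_llm(template.format(text=text))
        scores.append(float(reply.strip().split()[0]))  # naive score parsing
    return scores

def sensitivity(text: str) -> float:
    """Std. dev. of scores across paraphrases; a robust evaluator stays near 0."""
    return statistics.stdev(score_with_variants(text))
```
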
Introduction

  • Text evaluation is crucial for understanding and developing LLMs, and automatic methods have been explored to complement human evaluation, but inconsistencies with human judgments persist.

Proposed Paradigm: BatchEval

  • Addressing the issues: BatchEval alleviates sensitivity to prompt design, improves resistance to noise, and strengthens ensemble performance. It conducts batch-wise evaluation, scoring a batch of samples together so their analyses can be compared against one another, rather than evaluating each sample in isolation (a sketch of the idea follows).

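A minimal sketch of the batch-wise idea, not the authors' exact algorithm: place several samples in one prompt so the model scores them relative to each other, then repeat over re-shuffled batches and average per sample. The function names, batch size, rounds, and prompt wording below are assumptions for illustration.

```python
# Illustrative batch-wise evaluation loop, assuming a generic `call_llm` client.
# This sketches the batch-then-score idea, not BatchEval's exact procedure.
import random
import re

BATCH_SIZE = 10   # assumed batch size
NUM_ROUNDS = 3    # re-batch and re-score several times, then average per sample

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM chat-completion call returning text."""
    raise NotImplementedError

def build_batch_prompt(samples: list[str]) -> str:
    numbered = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(samples))
    return (
        "Compare the following texts and rate each for overall quality "
        "on a 1-10 scale. Reply with one line per text as 'index: score'.\n"
        + numbered
    )

def parse_scores(reply: str, n: int) -> list[float]:
    """Extract 'index: score' lines; entries the model skips stay NaN."""
    scores = [float("nan")] * n
    for idx, val in re.findall(r"(\d+)\s*:\s*([\d.]+)", reply):
        i = int(idx) - 1
        if 0 <= i < n:
            scores[i] = float(val)
    return scores

def batch_evaluate(samples: list[str]) -> list[float]:
    totals = [0.0] * len(samples)
    for _ in range(NUM_ROUNDS):
        order = list(range(len(samples)))
        random.shuffle(order)                      # re-batch each round
        for start in range(0, len(order), BATCH_SIZE):
            idxs = order[start:start + BATCH_SIZE]
            reply = call_llm(build_batch_prompt([samples[i] for i in idxs]))
            for i, s in zip(idxs, parse_scores(reply, len(idxs))):
                totals[i] += s
    return [t / NUM_ROUNDS for t in totals]
```

Scoring samples side by side is what gives the batch-wise setup its comparative signal; re-batching across rounds varies the company each sample is judged in before averaging.
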
Appendix

Model gpt-3.5-turbo-1106
Date Generated 2024-02-26
Abstract http://arxiv.org/abs/2401.00437v1
HTML https://browse.arxiv.org/html/2401.00437v1
Truncated True
Word Count 15893