BatchEval: Towards Human-like Text Evaluation
robustness
prompt-engineering
BatchEval improves LLM-based text evaluation by scoring samples in batches, addressing prompt-design sensitivity, noise resistance, and ensemble performance, and achieves 10.5% higher correlation with human judgments at reduced API cost.

BatchEval: Towards Human-like Text Evaluation
Key Findings
- Inferior ensemble performance with static reference: Current large language model (LLM)-based evaluators achieve limited gains from ensembling because their analyses have weak diversity and are never compared against one another.
- Sensitivity to prompt design: Minor changes to the prompt may lead to significant variations in evaluation results.
- Poor resistance to noise: Evaluation scores lack discrimination and exhibit a non-uniform distribution, leading to reduced robustness against noise.
Introduction
- Text evaluation is crucial for understanding and developing LLMs, and automatic methods have been explored to complement human evaluation, but inconsistencies with human judgments persist.
Proposed Paradigm: BatchEval
- Addressing Issues: BatchEval alleviates prompt sensitivity and improves noise resistance and ensemble performance. Instead of scoring each sample in isolation, it iteratively evaluates batches of samples together, letting the LLM compare samples within a batch before assigning scores (see the sketch below).
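The following is a minimal sketch of what batch-wise evaluation with shuffled re-batching could look like, not the authors' implementation. The `call_llm` helper, the prompt wording, and the batch size and round count are all illustrative assumptions.

```python
# Sketch of batch-wise LLM evaluation with iterative re-batching (illustrative, not the paper's code).
import random
import re
from statistics import mean

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a chat LLM (e.g. gpt-3.5-turbo) and return its reply."""
    raise NotImplementedError

def build_batch_prompt(samples: list[str]) -> str:
    # Present all samples of a batch in one prompt so the model can
    # compare them against each other before assigning scores.
    lines = [
        "Score each summary below from 1 (worst) to 10 (best) for coherence.",
        "Reply with one line per sample in the form 'Sample <i>: <score>'.",
        "",
    ]
    for i, text in enumerate(samples, 1):
        lines.append(f"Sample {i}: {text}")
    return "\n".join(lines)

def parse_scores(reply: str, n: int) -> list[float]:
    scores = [float(m) for m in re.findall(r"Sample \d+:\s*([0-9.]+)", reply)]
    if len(scores) != n:
        raise ValueError("could not parse one score per sample")
    return scores

def batch_eval(samples: list[str], batch_size: int = 5, rounds: int = 3, seed: int = 0) -> list[float]:
    """Evaluate samples in shuffled batches over several rounds and
    average each sample's scores across rounds."""
    rng = random.Random(seed)
    per_sample: list[list[float]] = [[] for _ in samples]
    for _ in range(rounds):
        order = list(range(len(samples)))
        rng.shuffle(order)  # re-allocate batch membership each round for diverse comparisons
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            reply = call_llm(build_batch_prompt([samples[i] for i in idx]))
            for i, score in zip(idx, parse_scores(reply, len(idx))):
                per_sample[i].append(score)
    return [mean(s) for s in per_sample]
```

Averaging scores obtained from different batch allocations is what gives the ensemble effect: each round the model judges a sample against a different set of peers, so the aggregated score is less sensitive to any single prompt or batch composition.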
Appendix
| Field | Value |
| --- | --- |
| Model | gpt-3.5-turbo-1106 |
| Date Generated | 2024-02-26 |
| Abstract | http://arxiv.org/abs/2401.00437v1 |
| HTML | https://browse.arxiv.org/html/2401.00437v1 |
| Truncated | True |
| Word Count | 15893 |