BatchEval: Towards Human-like Text Evaluation
robustness
prompt-engineering
BatchEval improves LLM-based text evaluation by scoring samples in batches, addressing prompt-design sensitivity, noise resistance, and ensemble performance, and achieves 10.5% higher correlation with human judgments at reduced API cost.

BatchEval: Towards Human-like Text Evaluation
Key Findings
- Inferior ensemble performance with static reference: Current large language model (LLM)-based evaluators achieve limited gains from ensembling because their analyses have weak diversity and are never compared against one another.
- Sensitivity to prompt design: Minor changes to the prompt may lead to significant variations in evaluation results.
- Poor resistance to noise: Evaluation scores lack discrimination and exhibit a non-uniform distribution, leading to reduced robustness against noise.
Introduction
- Text evaluation is crucial for understanding and developing LLMs, and automatic methods have been explored to complement human evaluation, but inconsistencies with human judgments persist.
Proposed Paradigm: BatchEval
- Addressing Issues: BatchEval alleviates prompt sensitivity and improves noise resistance and ensemble performance. Instead of scoring each sample in isolation, it iteratively evaluates batches of samples together, letting the LLM compare samples within a batch before assigning scores (see the sketch below).
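The following is a minimal sketch of what batch-wise evaluation with shuffled re-batching could look like, not the authors' implementation. The `call_llm` helper, the prompt wording, and the batch size and round count are all illustrative assumptions.

```python
# Sketch of batch-wise LLM evaluation with iterative re-batching (illustrative, not the paper's code).
import random
import re
from statistics import mean

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a chat LLM (e.g. gpt-3.5-turbo) and return its reply."""
    raise NotImplementedError

def build_batch_prompt(samples: list[str]) -> str:
    # Present all samples of a batch in one prompt so the model can
    # compare them against each other before assigning scores.
    lines = [
        "Score each summary below from 1 (worst) to 10 (best) for coherence.",
        "Reply with one line per sample in the form 'Sample <i>: <score>'.",
        "",
    ]
    for i, text in enumerate(samples, 1):
        lines.append(f"Sample {i}: {text}")
    return "\n".join(lines)

def parse_scores(reply: str, n: int) -> list[float]:
    scores = [float(m) for m in re.findall(r"Sample \d+:\s*([0-9.]+)", reply)]
    if len(scores) != n:
        raise ValueError("could not parse one score per sample")
    return scores

def batch_eval(samples: list[str], batch_size: int = 5, rounds: int = 3, seed: int = 0) -> list[float]:
    """Evaluate samples in shuffled batches over several rounds and
    average each sample's scores across rounds."""
    rng = random.Random(seed)
    per_sample: list[list[float]] = [[] for _ in samples]
    for _ in range(rounds):
        order = list(range(len(samples)))
        rng.shuffle(order)  # re-allocate batch membership each round for diverse comparisons
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            reply = call_llm(build_batch_prompt([samples[i] for i in idx]))
            for i, score in zip(idx, parse_scores(reply, len(idx))):
                per_sample[i].append(score)
    return [mean(s) for s in per_sample]
```

Averaging scores obtained from different batch allocations is what gives the ensemble effect: each round the model judges a sample against a different set of peers, so the aggregated score is less sensitive to any single prompt or batch composition.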
Appendix
| Field | Value |
| --- | --- |
| Model | gpt-3.5-turbo-1106 |
| Date Generated | 2024-02-26 |
| Abstract | http://arxiv.org/abs/2401.00437v1 |
| HTML | https://browse.arxiv.org/html/2401.00437v1 |
| Truncated | True |
| Word Count | 15893 |