BatchEval: Towards Human-like Text Evaluation
robustness
prompt-engineering
BatchEval improves LLM-based text evaluation by addressing prompt-design sensitivity, noise resistance, and ensemble performance, reporting 10.5% higher correlation with human judgments at reduced API cost.
BatchEval: Towards Human-like Text Evaluation
Key Findings
- Inferior ensemble performance with static reference: Current large language model (LLM)-based evaluators gain little from ensembling because their individual analyses lack diversity and are never compared against one another.
- Sensitivity to prompt design: Minor changes to the prompt may lead to significant variations in evaluation results.
- Poor resistance to noise: Evaluation scores lack discrimination and exhibit a non-uniform distribution, leading to reduced robustness against noise.
Introduction
- Text evaluation is crucial for understanding and developing LLMs, and automatic methods have been explored to complement human evaluation, but inconsistencies with human judgments persist.
Proposed Paradigm: BatchEval
- Addressing Issues: BatchEval alleviates prompt-design sensitivity and improves noise resistance and ensemble performance. Instead of scoring each sample in isolation, it conducts batch-wise evaluation, presenting a batch of samples together so the model can compare them against one another before scoring; a minimal code sketch of this idea follows below.
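The sketch below illustrates the batch-wise idea in Python; it is not the authors' reference implementation. The prompt wording, the 1-10 scale, the JSON reply format, the batch size, the number of rounds, and the `call_llm` callable are all illustrative assumptions rather than details taken from the paper summary above.

```python
# Minimal sketch of batch-wise evaluation (illustrative, not the authors' code).
# Assumptions: prompt template, score scale, JSON reply format, batch size,
# and round count are placeholders; `call_llm` stands in for any chat-model API.
import json
import random
from statistics import mean
from typing import Callable, Dict, List

def batch_eval(
    samples: List[str],
    call_llm: Callable[[str], str],   # returns the model's raw text reply
    batch_size: int = 10,
    rounds: int = 3,
) -> Dict[int, float]:
    """Score each sample by evaluating it inside several batches and
    averaging the per-round scores, so samples are compared against
    each other rather than judged in isolation."""
    scores: Dict[int, List[float]] = {i: [] for i in range(len(samples))}
    for _ in range(rounds):
        order = list(range(len(samples)))
        random.shuffle(order)                      # re-group samples each round
        for start in range(0, len(order), batch_size):
            batch_ids = order[start:start + batch_size]
            numbered = "\n".join(
                f"[{k}] {samples[i]}" for k, i in enumerate(batch_ids)
            )
            prompt = (
                "Rate the overall quality of each text below on a 1-10 scale.\n"
                "Compare the texts against each other before scoring.\n"
                f"{numbered}\n"
                'Reply with JSON: {"0": score, "1": score, ...}'
            )
            reply = json.loads(call_llm(prompt))
            for k, i in enumerate(batch_ids):
                scores[i].append(float(reply[str(k)]))
    return {i: mean(v) for i, v in scores.items()}
```

In this sketch, reshuffling the batches each round is one simple way to give every sample several different comparison contexts before its scores are averaged.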
Appendix
| Field | Value |
| --- | --- |
| Model | gpt-3.5-turbo-1106 |
| Date Generated | 2024-02-26 |
| Abstract | http://arxiv.org/abs/2401.00437v1 |
| HTML | https://browse.arxiv.org/html/2401.00437v1 |
| Truncated | True |
| Word Count | 15893 |