Fairness in Serving Large Language Models
A new scheduling algorithm, the Virtual Token Counter (VTC), ensures fair LLM serving while improving performance and resource utilization.
Here’s the summary:
Major Findings
- Most major Large Language Model (LLM) inference services use per-client request rate limits to ensure fair processing, but rate limiting leaves resources under-utilized and degrades the client experience whenever spare capacity goes unused.
- The paper introduces a notion of LLM serving fairness based on a cost function over the number of input and output tokens processed, and proposes a novel fair scheduler called the Virtual Token Counter (VTC); a sketch of such a cost function follows this list.
- Through extensive experiments, the paper demonstrates that VTC ensures fairness better than baseline methods under a variety of conditions.
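For concreteness, here is a minimal sketch of a token-based cost function of the kind the paper describes. The weight values below are illustrative assumptions, not the paper's constants:

```python
# Hypothetical token-based service cost: cost grows with both the prompt
# (input) tokens and the generated (output) tokens of a request.
W_INPUT = 1.0   # assumed weight per input token
W_OUTPUT = 2.0  # assumed weight per output token (generation is costlier)

def request_cost(n_input_tokens: int, n_output_tokens: int) -> float:
    """Service cost of one request as a weighted token count."""
    return W_INPUT * n_input_tokens + W_OUTPUT * n_output_tokens
```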
Methodology
Introduction
- Large Language Models (LLMs) have been integrated into various application domains, and request response time is a key metric for quality of service.
Challenges in LLM Serving
- LLM serving presents unique challenges due to unpredictable request lengths and variable token-rate capacity.
Definition of Fairness in LLM Serving
- The paper discusses how to measure the service each client receives in LLM serving and defines fairness in terms of max-min fairness and work conservation, paraphrased below.
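A hedged paraphrase of the two properties, with notation assumed here rather than taken from the paper:

```latex
% Let W_i denote the service received by client i.
%
% Max-min fairness: an allocation (W_1, \dots, W_n) is max-min fair if
% increasing any client's service requires decreasing that of a client
% with no more service:
%   \forall \text{ feasible } W': \quad
%   W'_i > W_i \;\Rightarrow\; \exists j:\; W'_j < W_j \le W_i .
%
% Work conservation: the server is never idle while any request is queued.
```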
Achieving Fairness with VTC
- The Virtual Token Counter (VTC) algorithm is proposed to achieve fairness in LLM serving: it tracks the service each client has received and prioritizes the client that has received the least, as sketched below.
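A minimal Python sketch of this counter-based selection, assuming the scheduler serves the backlogged client with the smallest counter and charges counters by weighted token cost. The class and method names, and the exact counter-lifting rule for newly backlogged clients, are illustrative assumptions, not the paper's implementation:

```python
from collections import deque

class VTCScheduler:
    def __init__(self):
        self.counters = {}  # client_id -> virtual token counter (service received)
        self.queues = {}    # client_id -> deque of pending requests

    def on_arrival(self, client_id, request):
        if client_id not in self.counters:
            self.counters[client_id] = 0.0
            self.queues[client_id] = deque()
        if not self.queues[client_id]:
            # Lift the counter of a newly backlogged client so an idle
            # client cannot bank unbounded credit (assumed rule: raise it
            # to the minimum counter among currently backlogged clients).
            backlogged = [c for c, q in self.queues.items() if q]
            if backlogged:
                floor = min(self.counters[c] for c in backlogged)
                self.counters[client_id] = max(self.counters[client_id], floor)
        self.queues[client_id].append(request)

    def next_request(self):
        # Work-conserving: serve whenever any request is queued, picking
        # the backlogged client with the least service received so far.
        backlogged = [c for c, q in self.queues.items() if q]
        if not backlogged:
            return None
        client = min(backlogged, key=lambda c: self.counters[c])
        return client, self.queues[client].popleft()

    def on_tokens_processed(self, client_id, cost):
        # Charge the client's counter by the weighted token cost of the
        # tokens just processed (e.g., request_cost from the sketch above).
        self.counters[client_id] += cost
```

Always serving the smallest counter is what bounds the service gap between any two continuously backlogged clients, while the work-conservation check keeps spare capacity from going unused.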
Results
- Through synthetic and real-world workload experiments, the paper demonstrates that VTC maintains fairness among clients across varying request frequencies, request lengths, and arrival patterns.
Critique
The paper provides a comprehensive evaluation of the proposed VTC algorithm, demonstrating its advantages over baseline methods. However, potential limitations of VTC, such as scalability to larger systems or edge cases where it may not perform optimally, are not thoroughly discussed. Further exploration is needed to establish the generalizability of VTC to diverse LLM serving environments.
Appendix
| Field | Value |
| --- | --- |
| Model | gpt-3.5-turbo-1106 |
| Date Generated | 2024-02-26 |
| Abstract | http://arxiv.org/abs/2401.00588v1 |
| HTML | https://browse.arxiv.org/html/2401.00588v1 |
| Truncated | True |
| Word Count | 14021 |