Fairness in Serving Large Language Models
A new scheduling algorithm, the Virtual Token Counter (VTC), ensures fair LLM serving while improving performance and resource utilization.
Here’s the summary:
Major Findings
- Most major Large Language Model (LLM) inference services use per-client request rate limits to ensure fair processing, but rate limiting leaves resources under-utilized and degrades the client experience whenever spare capacity goes unused.
- The paper introduces a notion of LLM serving fairness based on a cost function over the number of input and output tokens processed, and proposes a novel fair scheduler called the Virtual Token Counter (VTC); a sketch of such a cost function follows this list.
- Through extensive experiments, the paper demonstrates that VTC ensures fairness better than baseline methods under a variety of conditions.
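For concreteness, here is a minimal sketch of a token-based cost function of the kind the paper describes. The weight values below are illustrative assumptions, not the paper's constants:

```python
# Hypothetical token-based service cost: cost grows with both the prompt
# (input) tokens and the generated (output) tokens of a request.
W_INPUT = 1.0   # assumed weight per input token
W_OUTPUT = 2.0  # assumed weight per output token (generation is costlier)

def request_cost(n_input_tokens: int, n_output_tokens: int) -> float:
    """Service cost of one request as a weighted token count."""
    return W_INPUT * n_input_tokens + W_OUTPUT * n_output_tokens
```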
Methodology
Introduction
- Large Language Models (LLMs) have been integrated into various application domains, and request response time is a key metric for quality of service.
Challenges in LLM Serving
- LLM serving presents unique challenges due to unpredictable request lengths and variable token-rate capacity.
Definition of Fairness in LLM Serving
- The paper discusses how to measure the service each client receives in LLM serving and defines fairness in terms of max-min fairness and work conservation, paraphrased below.
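A hedged paraphrase of the two properties, with notation assumed here rather than taken from the paper:

```latex
% Let W_i denote the service received by client i.
%
% Max-min fairness: an allocation (W_1, \dots, W_n) is max-min fair if
% increasing any client's service requires decreasing that of a client
% with no more service:
%   \forall \text{ feasible } W': \quad
%   W'_i > W_i \;\Rightarrow\; \exists j:\; W'_j < W_j \le W_i .
%
% Work conservation: the server is never idle while any request is queued.
```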
Achieving Fairness with VTC
- The Virtual Token Counter (VTC) algorithm is proposed to achieve fairness in LLM serving: it tracks the service each client has received and prioritizes the client that has received the least, as sketched below.
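A minimal Python sketch of this counter-based selection, assuming the scheduler serves the backlogged client with the smallest counter and charges counters by weighted token cost. The class and method names, and the exact counter-lifting rule for newly backlogged clients, are illustrative assumptions, not the paper's implementation:

```python
from collections import deque

class VTCScheduler:
    def __init__(self):
        self.counters = {}  # client_id -> virtual token counter (service received)
        self.queues = {}    # client_id -> deque of pending requests

    def on_arrival(self, client_id, request):
        if client_id not in self.counters:
            self.counters[client_id] = 0.0
            self.queues[client_id] = deque()
        if not self.queues[client_id]:
            # Lift the counter of a newly backlogged client so an idle
            # client cannot bank unbounded credit (assumed rule: raise it
            # to the minimum counter among currently backlogged clients).
            backlogged = [c for c, q in self.queues.items() if q]
            if backlogged:
                floor = min(self.counters[c] for c in backlogged)
                self.counters[client_id] = max(self.counters[client_id], floor)
        self.queues[client_id].append(request)

    def next_request(self):
        # Work-conserving: serve whenever any request is queued, picking
        # the backlogged client with the least service received so far.
        backlogged = [c for c, q in self.queues.items() if q]
        if not backlogged:
            return None
        client = min(backlogged, key=lambda c: self.counters[c])
        return client, self.queues[client].popleft()

    def on_tokens_processed(self, client_id, cost):
        # Charge the client's counter by the weighted token cost of the
        # tokens just processed (e.g., request_cost from the sketch above).
        self.counters[client_id] += cost
```

Always serving the smallest counter is what bounds the service gap between any two continuously backlogged clients, while the work-conservation check keeps spare capacity from going unused.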
Results
- Through synthetic and real-world workload experiments, the paper demonstrates that VTC maintains fairness among clients across varying request frequencies, request lengths, and arrival patterns.
Critique
The paper provides a comprehensive evaluation of the proposed VTC algorithm, demonstrating its advantages over baseline methods. However, potential limitations of VTC, such as scalability to larger systems or edge cases where it may not perform optimally, are not thoroughly discussed. Further exploration is needed to establish the generalizability of VTC to diverse LLM serving environments.
Appendix
| Field | Value |
| --- | --- |
| Model | gpt-3.5-turbo-1106 |
| Date Generated | 2024-02-26 |
| Abstract | http://arxiv.org/abs/2401.00588v1 |
| HTML | https://browse.arxiv.org/html/2401.00588v1 |
| Truncated | True |
| Word Count | 14021 |