
[BETA] Request Prioritization

info

Beta feature. Use for testing only.

Help us improve this

Prioritize LLM API requests during high-traffic periods.

  • Adds the request to a priority queue
  • Polls the queue to check if the request can be made. Returns 'True':
    • if a healthy deployment exists
    • or if the request is at the top of the queue
  • Priority - lower is better
    • e.g. priority=0 > priority=2000
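The queue-then-poll flow above can be sketched with Python's heapq. This is a toy model of the idea, not LiteLLM's internal scheduler; the class and method names here are hypothetical:

```python
import heapq
import itertools

class PriorityQueueSketch:
    """Toy model of the polling flow: lower priority number = served first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def add(self, priority, request_id):
        # Step 1: add the request to the priority queue
        heapq.heappush(self._heap, (priority, next(self._counter), request_id))

    def can_make_request(self, request_id, healthy_deployment_exists):
        # Step 2: polling returns True if a healthy deployment exists,
        # or if this request is at the top of the queue
        if healthy_deployment_exists:
            return True
        return bool(self._heap) and self._heap[0][2] == request_id

queue = PriorityQueueSketch()
queue.add(2000, "low-priority-request")
queue.add(0, "high-priority-request")  # priority=0 > priority=2000
print(queue.can_make_request("high-priority-request", healthy_deployment_exists=False))  # True
print(queue.can_make_request("low-priority-request", healthy_deployment_exists=False))   # False
```

The tie-breaking counter keeps requests with equal priority in FIFO order, since Python's heap compares tuples element by element.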

Supported Router endpoints

  • acompletion (/v1/chat/completions on the proxy)
  • atext_completion (/v1/completions on the proxy)

Quick Start

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5-turbo",
            "litellm_params": {
                "model": "gpt-3.5-turbo",
                "mock_response": "Hello world this is Macintosh!", # fakes the LLM API call
                "rpm": 1,
            },
        },
    ],
    timeout=2, # timeout request if takes > 2s
    routing_strategy="usage-based-routing-v2",
    polling_interval=0.03, # poll queue every 3ms if no healthy deployments
)

try:
    _response = await router.acompletion( # 👈 ADDS TO QUEUE + POLLS + MAKES CALL
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hey!"}],
        priority=0, # 👈 LOWER IS BETTER
    )
except Exception as e:
    print("didn't make request")

LiteLLM Proxy

To prioritize requests on the LiteLLM Proxy, add priority to the request

curl -X POST 'http://localhost:4000/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "gpt-3.5-turbo-fake-model",
    "messages": [
        {
        "role": "user",
        "content": "what is the meaning of the universe? 1234"
        }],
    "priority": 0 👈 SET VALUE HERE
}'

Advanced - Redis Caching

Use a Redis cache to prioritize requests across multiple LiteLLM instances.

SDK

import os

from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-3.5-turbo",
            "litellm_params": {
                "model": "gpt-3.5-turbo",
                "mock_response": "Hello world this is Macintosh!", # fakes the LLM API call
                "rpm": 1,
            },
        },
    ],
    ### REDIS PARAMS ###
    redis_host=os.environ["REDIS_HOST"],
    redis_password=os.environ["REDIS_PASSWORD"],
    redis_port=os.environ["REDIS_PORT"],
)

try:
    _response = await router.acompletion( # 👈 ADDS TO QUEUE + POLLS + MAKES CALL
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hey!"}],
        priority=0, # 👈 LOWER IS BETTER
    )
except Exception as e:
    print("didn't make request")

Proxy

model_list:
  - model_name: gpt-3.5-turbo-fake-model
    litellm_params:
      model: gpt-3.5-turbo
      mock_response: "hello world!"
      api_key: my-good-key

litellm_settings:
  request_timeout: 600 # 👈 Will keep retrying until timeout occurs

router_settings:
  redis_host: os.environ/REDIS_HOST
  redis_password: os.environ/REDIS_PASSWORD
  redis_port: os.environ/REDIS_PORT
$ litellm --config /path/to/config.yaml

# RUNNING on http://0.0.0.0:4000
curl -X POST 'http://localhost:4000/queue/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-1234' \
-d '{
    "model": "gpt-3.5-turbo-fake-model",
    "messages": [
        {
        "role": "user",
        "content": "what is the meaning of the universe? 1234"
        }],
    "priority": 0 👈 SET VALUE HERE
}'