VLLM
LiteLLM supports all models on VLLM.
Property | Details |
---|---|
Description | vLLM is a fast and easy-to-use library for LLM inference and serving. Docs |
Provider Route on LiteLLM | hosted_vllm/ (for OpenAI compatible server), vllm/ (for vLLM sdk usage) |
Provider Doc | vLLM ↗ |
Supported Endpoints | /chat/completions, /embeddings, /completions |
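The two provider routes behave differently: hosted_vllm/ sends requests to a running OpenAI-compatible vLLM server over HTTP, while vllm/ runs inference in-process through the vllm pip package (deprecated, see below). A minimal sketch of the distinction - the model name, api_base, and message are illustrative placeholders:
import litellm

# hosted_vllm/ -> call a running OpenAI-compatible vLLM server over HTTP
litellm.completion(
    model="hosted_vllm/facebook/opt-125m",
    api_base="http://localhost:8000",  # wherever your vLLM server is listening (placeholder)
    messages=[{"role": "user", "content": "hi"}],
)

# vllm/ -> run inference locally via the vllm pip package (deprecated route)
litellm.completion(
    model="vllm/facebook/opt-125m",
    messages=[{"role": "user", "content": "hi"}],
)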
Quick Start
Usage - litellm.completion (calling OpenAI compatible endpoint)
vLLM provides an OpenAI compatible endpoint - here's how to call it with LiteLLM.
In order to use litellm to call a hosted vLLM server, add the following to your completion call:
model="hosted_vllm/<your-vllm-model-name>"
api_base = "your-hosted-vllm-server"
import litellm

messages = [{"role": "user", "content": "Hey, how's it going?"}]

response = litellm.completion(
model="hosted_vllm/facebook/opt-125m", # pass the vllm model name
messages=messages,
api_base="https://hosted-vllm-api.co",
temperature=0.2,
max_tokens=80)
print(response)
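Streaming works the same way as with other OpenAI-compatible providers - a minimal sketch, reusing the placeholder model name and api_base from the example above:
import litellm

messages = [{"role": "user", "content": "Hey, how's it going?"}]

# stream=True returns an iterator of chunks instead of a single response
response = litellm.completion(
    model="hosted_vllm/facebook/opt-125m",
    messages=messages,
    api_base="https://hosted-vllm-api.co",
    stream=True,
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")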
Usage - LiteLLM Proxy Server (calling OpenAI compatible endpoint)
Here's how to call an OpenAI compatible endpoint with the LiteLLM Proxy Server
Modify the config.yaml
model_list:
- model_name: my-model
litellm_params:
model: hosted_vllm/facebook/opt-125m # add hosted_vllm/ prefix to route as OpenAI provider
api_base: https://hosted-vllm-api.co # add api base for OpenAI compatible provider
Start the proxy
$ litellm --config /path/to/config.yaml
Send a request to the LiteLLM Proxy Server
- OpenAI Python v1.0.0+
- curl
import openai
client = openai.OpenAI(
api_key="sk-1234", # pass litellm proxy key, if you're using virtual keys
base_url="http://0.0.0.0:4000" # litellm-proxy-base url
)
response = client.chat.completions.create(
model="my-model",
messages = [
{
"role": "user",
"content": "what llm are you"
}
],
)
print(response)
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Authorization: Bearer sk-1234' \
--header 'Content-Type: application/json' \
--data '{
"model": "my-model",
"messages": [
{
"role": "user",
"content": "what llm are you"
}
]
}'
Embeddings
- SDK
- PROXY
from litellm import embedding
import os
os.environ["HOSTED_VLLM_API_BASE"] = "http://localhost:8000"  # point this at your vLLM server
embedding = embedding(model="hosted_vllm/facebook/opt-125m", input=["Hello world"])
print(embedding)
- Set up config.yaml
model_list:
- model_name: my-model
litellm_params:
model: hosted_vllm/facebook/opt-125m # add hosted_vllm/ prefix to route as OpenAI provider
api_base: https://hosted-vllm-api.co # add api base for OpenAI compatible provider
- Start the proxy
$ litellm --config /path/to/config.yaml
# RUNNING on http://0.0.0.0:4000
- Test it!
curl -L -X POST 'http://0.0.0.0:4000/embeddings' \
-H 'Authorization: Bearer sk-1234' \
-H 'Content-Type: application/json' \
-d '{"input": ["hello world"], "model": "my-model"}'
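You can also hit the proxy's /embeddings route with the OpenAI Python client instead of curl - a minimal sketch, assuming the proxy from the steps above is running on port 4000 with the virtual key sk-1234:
import openai

client = openai.OpenAI(
    api_key="sk-1234",              # litellm proxy virtual key
    base_url="http://0.0.0.0:4000"  # litellm proxy base url
)

# "my-model" is routed to hosted_vllm/facebook/opt-125m via config.yaml
response = client.embeddings.create(
    model="my-model",
    input=["hello world"],
)
print(response.data[0].embedding[:5])  # first few dimensions of the embedding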
Send Video URL to VLLM
See here for an example implementation of this for VLLM
- (Unified) Files Message
- (VLLM-specific) Video Message
Use this to send a video URL to VLLM + Gemini in the same format, using OpenAI's files message type.
There are two ways to send a video URL to VLLM:
- Pass the video URL directly
{"type": "file", "file": {"file_id": video_url}},
- Pass the video data as base64 (see the sketch after this list for one way to produce it)
{"type": "file", "file": {"file_data": f"data:video/mp4;base64,{video_data_base64}"}}
- SDK
- PROXY
import os
from litellm import completion
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Summarize the following video"
},
{
"type": "file",
"file": {
"file_id": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
}
}
]
}
]
# call vllm
os.environ["HOSTED_VLLM_API_BASE"] = "https://hosted-vllm-api.co"
os.environ["HOSTED_VLLM_API_KEY"] = "" # [optional], if your VLLM server requires an API key
response = completion(
model="hosted_vllm/qwen", # pass the vllm model name
messages=messages,
)
# call gemini
os.environ["GEMINI_API_KEY"] = "your-gemini-api-key"
response = completion(
model="gemini/gemini-1.5-flash", # pass the gemini model name
messages=messages,
)
print(response)
- Set up config.yaml
model_list:
- model_name: my-model
litellm_params:
model: hosted_vllm/qwen # add hosted_vllm/ prefix to route as OpenAI provider
api_base: https://hosted-vllm-api.co # add api base for OpenAI compatible provider
- model_name: my-gemini-model
litellm_params:
model: gemini/gemini-1.5-flash # add gemini/ prefix to route as Google AI Studio provider
api_key: os.environ/GEMINI_API_KEY
- Start the proxy
$ litellm --config /path/to/config.yaml
# RUNNING on http://0.0.0.0:4000
- Test it!
curl -X POST http://0.0.0.0:4000/chat/completions \
-H "Authorization: Bearer sk-1234" \
-H "Content-Type: application/json" \
-d '{
"model": "my-model",
"messages": [
{"role": "user", "content":
[
{"type": "text", "text": "Summarize the following video"},
{"type": "file", "file": {"file_id": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}}
]
}
]
}'
Use this to send a video URL to VLLM in its native message format (video_url).
There are two ways to send a video URL to VLLM:
- Pass the video URL directly
{"type": "video_url", "video_url": {"url": video_url}},
- Pass the video data as base64
{"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{video_data_base64}"}}
- SDK
- PROXY
from litellm import completion
response = completion(
model="hosted_vllm/qwen", # pass the vllm model name
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Summarize the following video"
},
{
"type": "video_url",
"video_url": {
"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
}
}
]
}
],
api_base="https://hosted-vllm-api.co")
print(response)
- Set up config.yaml
model_list:
- model_name: my-model
litellm_params:
model: hosted_vllm/qwen # add hosted_vllm/ prefix to route as OpenAI provider
api_base: https://hosted-vllm-api.co # add api base for OpenAI compatible provider
- Start the proxy
$ litellm --config /path/to/config.yaml
# RUNNING on http://0.0.0.0:4000
- Test it!
curl -X POST http://0.0.0.0:4000/chat/completions \
-H "Authorization: Bearer sk-1234" \
-H "Content-Type: application/json" \
-d '{
"model": "my-model",
"messages": [
{"role": "user", "content":
[
{"type": "text", "text": "Summarize the following video"},
{"type": "video_url", "video_url": {"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}}
]
}
]
}'
(Deprecated) for vllm pip package
Using - litellm.completion
pip install litellm vllm
import litellm

messages = [{"role": "user", "content": "Hey, how's it going?"}]

response = litellm.completion(
model="vllm/facebook/opt-125m", # add a vllm prefix so litellm knows the custom_llm_provider==vllm
messages=messages,
temperature=0.2,
max_tokens=80)
print(response)
Batch Completion
from litellm import batch_completion
model_name = "facebook/opt-125m"
provider = "vllm"
messages = [[{"role": "user", "content": "Hey, how's it going"}] for _ in range(5)]
response_list = batch_completion(
model=model_name,
custom_llm_provider=provider, # can easily switch to huggingface, replicate, together ai, sagemaker, etc.
messages=messages,
temperature=0.2,
max_tokens=80,
)
print(response_list)
Prompt Templates
For models with special prompt templates (e.g. Llama2), we format the prompt to fit their template.
What if we don't support a model you need? You can also specify your own custom prompt formatting, in case we don't have your model covered yet.
Does this mean you have to specify a prompt for all models? No. By default we'll concatenate your message content together to make a prompt (this is the expected format for Bloom, T-5, Llama-2 base models and more).
Default Prompt Template
def default_pt(messages):
return " ".join(message["content"] for message in messages)
Models we already have Prompt Templates for
Model Name | Works for Models | Function Call |
---|---|---|
meta-llama/Llama-2-7b-chat | All meta-llama llama2 chat models | completion(model='vllm/meta-llama/Llama-2-7b', messages=messages, api_base="your_api_endpoint") |
tiiuae/falcon-7b-instruct | All falcon instruct models | completion(model='vllm/tiiuae/falcon-7b-instruct', messages=messages, api_base="your_api_endpoint") |
mosaicml/mpt-7b-chat | All mpt chat models | completion(model='vllm/mosaicml/mpt-7b-chat', messages=messages, api_base="your_api_endpoint") |
codellama/CodeLlama-34b-Instruct-hf | All codellama instruct models | completion(model='vllm/codellama/CodeLlama-34b-Instruct-hf', messages=messages, api_base="your_api_endpoint") |
WizardLM/WizardCoder-Python-34B-V1.0 | All wizardcoder models | completion(model='vllm/WizardLM/WizardCoder-Python-34B-V1.0', messages=messages, api_base="your_api_endpoint") |
Phind/Phind-CodeLlama-34B-v2 | All phind-codellama models | completion(model='vllm/Phind/Phind-CodeLlama-34B-v2', messages=messages, api_base="your_api_endpoint") |
Custom prompt templates
# Create your own custom prompt template
import litellm
from litellm import completion

litellm.register_prompt_template(
model="togethercomputer/LLaMA-2-7B-32K",
roles={
"system": {
"pre_message": "[INST] <<SYS>>\n",
"post_message": "\n<</SYS>>\n [/INST]\n"
},
"user": {
"pre_message": "[INST] ",
"post_message": " [/INST]\n"
},
"assistant": {
"pre_message": "\n",
"post_message": "\n",
}
} # tell LiteLLM how you want to map the openai messages to this model
)
def test_vllm_custom_model():
    model = "vllm/togethercomputer/LLaMA-2-7B-32K"
    messages = [{"role": "user", "content": "Hey, how's it going?"}]
    response = completion(model=model, messages=messages)
    print(response['choices'][0]['message']['content'])
    return response

test_vllm_custom_model()