# Triton Inference Server

LiteLLM supports chat completion and embedding models served on the NVIDIA Triton Inference Server.
| Property | Details |
|---|---|
| Description | NVIDIA Triton Inference Server |
| Provider route on LiteLLM | `triton/` |
| Supported operations | `/chat/completion`, `/completion`, `/embedding` |
| Supported Triton endpoints | `/infer`, `/generate`, `/embeddings` |
| Link to provider docs | Triton Inference Server ↗ |
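The `triton/` prefix selects LiteLLM's Triton provider, and the path suffix of `api_base` picks which Triton endpoint is called. A minimal sketch of that routing, assuming a Triton server at `localhost:8000` (a hypothetical address; the model name is also an assumption):

```python
from litellm import completion

# The same chat completion call, routed to two different Triton endpoints.
# Only the api_base path suffix changes.
for endpoint in ("generate", "infer"):
    response = completion(
        model="triton/llama-3-8b-instruct",   # triton/ prefix -> Triton provider
        messages=[{"role": "user", "content": "who are u?"}],
        max_tokens=10,
        api_base=f"http://localhost:8000/{endpoint}",  # hypothetical server
    )
    print(endpoint, response.choices[0].message.content)
```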
## Triton /generate - Chat Completion

### SDK

Use the `triton/` prefix to route to your Triton server:
```python
from litellm import completion

response = completion(
    model="triton/llama-3-8b-instruct",
    messages=[{"role": "user", "content": "who are u?"}],
    max_tokens=10,
    api_base="http://localhost:8000/generate",
)
```
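litellm's standard `stream=True` flag can also be passed on this call; whether tokens actually arrive incrementally depends on your Triton deployment, so treat this as a hedged sketch rather than confirmed Triton streaming support:

```python
from litellm import completion

# Hedged sketch: stream=True is litellm's usual streaming flag; the server
# address and model name are assumptions for illustration.
response = completion(
    model="triton/llama-3-8b-instruct",
    messages=[{"role": "user", "content": "who are u?"}],
    max_tokens=10,
    api_base="http://localhost:8000/generate",
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
```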
### Proxy

1. Add the model to your `config.yaml`:

```yaml
model_list:
  - model_name: my-triton-model
    litellm_params:
      model: triton/<your-triton-model>
      api_base: https://your-triton-api-base/triton/generate
```

2. Start the proxy:

```shell
$ litellm --config /path/to/config.yaml --detailed_debug
```
3. Send a request to the LiteLLM proxy server.

OpenAI Python v1.0.0+:

```python
from openai import OpenAI

# set base_url to your proxy server
# set api_key to send to the proxy server
client = OpenAI(api_key="<proxy-api-key>", base_url="http://0.0.0.0:4000")

response = client.chat.completions.create(
    model="my-triton-model",
    messages=[{"role": "user", "content": "who are u?"}],
    max_tokens=10,
)
print(response)
```

curl (the `Authorization` header is optional, only required if you're using the LiteLLM proxy with virtual keys):

```shell
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer sk-1234' \
--data '{
  "model": "my-triton-model",
  "messages": [{"role": "user", "content": "who are u?"}]
}'
```
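The proxy returns an OpenAI-compatible response, so the reply text can be read the usual way (a usage sketch continuing the OpenAI client example above):

```python
# Continuing the example above: the response follows the OpenAI schema.
print(response.choices[0].message.content)
```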
## Triton /infer - Chat Completion

### SDK

Use the `triton/` prefix to route to your Triton server:
```python
from litellm import completion

response = completion(
    model="triton/llama-3-8b-instruct",
    messages=[{"role": "user", "content": "who are u?"}],
    max_tokens=10,
    api_base="http://localhost:8000/infer",
)
```
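The same call works from async code through litellm's `acompletion` entry point (a minimal sketch; the server address and model name are assumptions):

```python
import asyncio

import litellm

async def main():
    # Async variant of the call above, with identical routing.
    response = await litellm.acompletion(
        model="triton/llama-3-8b-instruct",
        messages=[{"role": "user", "content": "who are u?"}],
        max_tokens=10,
        api_base="http://localhost:8000/infer",  # hypothetical server
    )
    print(response)

asyncio.run(main())
```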
### Proxy

1. Add the model to your `config.yaml`:

```yaml
model_list:
  - model_name: my-triton-model
    litellm_params:
      model: triton/<your-triton-model>
      api_base: https://your-triton-api-base/triton/infer
```

2. Start the proxy:

```shell
$ litellm --config /path/to/config.yaml --detailed_debug
```
3. Send a request to the LiteLLM proxy server.

OpenAI Python v1.0.0+:

```python
from openai import OpenAI

# set base_url to your proxy server
# set api_key to send to the proxy server
client = OpenAI(api_key="<proxy-api-key>", base_url="http://0.0.0.0:4000")

response = client.chat.completions.create(
    model="my-triton-model",
    messages=[{"role": "user", "content": "who are u?"}],
    max_tokens=10,
)
print(response)
```

curl (the `Authorization` header is optional, only required if you're using the LiteLLM proxy with virtual keys):

```shell
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer sk-1234' \
--data '{
  "model": "my-triton-model",
  "messages": [{"role": "user", "content": "who are u?"}]
}'
```
## Triton /embeddings - Embedding

### SDK

Use the `triton/` prefix to route to your Triton server:
```python
import asyncio

import litellm

async def main():
    response = await litellm.aembedding(
        model="triton/<your-triton-model>",
        # the /embeddings endpoint you want litellm to call on your server
        api_base="https://your-triton-api-base/triton/embeddings",
        input=["good morning from litellm"],
    )
    print(response)

asyncio.run(main())
```
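Outside of async code, litellm's synchronous `embedding` entry point takes the same parameters (a sketch, assuming it mirrors `aembedding` for this provider):

```python
from litellm import embedding

# Hedged sketch: synchronous variant of the call above.
response = embedding(
    model="triton/<your-triton-model>",
    api_base="https://your-triton-api-base/triton/embeddings",
    input=["good morning from litellm"],
)
print(response)
```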
### Proxy

1. Add the model to your `config.yaml`:

```yaml
model_list:
  - model_name: my-triton-model
    litellm_params:
      model: triton/<your-triton-model>
      api_base: https://your-triton-api-base/triton/embeddings
```

2. Start the proxy:

```shell
$ litellm --config /path/to/config.yaml --detailed_debug
```
3. Send a request to the LiteLLM proxy server.

OpenAI Python v1.0.0+:

```python
from openai import OpenAI

# set base_url to your proxy server
# set api_key to send to the proxy server
client = OpenAI(api_key="<proxy-api-key>", base_url="http://0.0.0.0:4000")

response = client.embeddings.create(
    input=["hello from litellm"],
    model="my-triton-model",
)
print(response)
```

curl (the `Authorization` header is optional, only required if you're using the LiteLLM proxy with virtual keys):

```shell
curl --location 'http://0.0.0.0:4000/embeddings' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer sk-1234' \
--data '{
  "model": "my-triton-model",
  "input": ["write a litellm poem"]
}'
```
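The response follows the OpenAI embeddings schema, so the vector can be pulled out as usual (a usage sketch continuing the OpenAI client example above):

```python
# Continuing the example above: one embedding vector per input string.
vector = response.data[0].embedding
print(len(vector))
```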