# Triton Inference Server

LiteLLM supports chat completion and embedding models served on the NVIDIA Triton Inference Server.
| Property | Details |
|---|---|
| Description | NVIDIA Triton Inference Server |
| Provider route on LiteLLM | `triton/` |
| Supported operations | `/chat/completion`, `/completion`, `/embedding` |
| Supported Triton endpoints | `/infer`, `/generate`, `/embeddings` |
| Link to provider docs | Triton Inference Server ↗ |
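The `triton/` prefix selects LiteLLM's Triton provider, and the path suffix of `api_base` picks which Triton endpoint is called. A minimal sketch of that routing, assuming a Triton server at `localhost:8000` (a hypothetical address; the model name is also an assumption):

```python
from litellm import completion

# The same chat completion call, routed to two different Triton endpoints.
# Only the api_base path suffix changes.
for endpoint in ("generate", "infer"):
    response = completion(
        model="triton/llama-3-8b-instruct",   # triton/ prefix -> Triton provider
        messages=[{"role": "user", "content": "who are u?"}],
        max_tokens=10,
        api_base=f"http://localhost:8000/{endpoint}",  # hypothetical server
    )
    print(endpoint, response.choices[0].message.content)
```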
## Triton /generate - Chat Completion

### SDK

Use the `triton/` prefix to route to your Triton server:
```python
from litellm import completion

response = completion(
    model="triton/llama-3-8b-instruct",
    messages=[{"role": "user", "content": "who are u?"}],
    max_tokens=10,
    api_base="http://localhost:8000/generate",
)
```
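litellm's standard `stream=True` flag can also be passed on this call; whether tokens actually arrive incrementally depends on your Triton deployment, so treat this as a hedged sketch rather than confirmed Triton streaming support:

```python
from litellm import completion

# Hedged sketch: stream=True is litellm's usual streaming flag; the server
# address and model name are assumptions for illustration.
response = completion(
    model="triton/llama-3-8b-instruct",
    messages=[{"role": "user", "content": "who are u?"}],
    max_tokens=10,
    api_base="http://localhost:8000/generate",
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
```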
### Proxy

1. Add the model to your `config.yaml`:

```yaml
model_list:
  - model_name: my-triton-model
    litellm_params:
      model: triton/<your-triton-model>
      api_base: https://your-triton-api-base/triton/generate
```

2. Start the proxy:

```shell
$ litellm --config /path/to/config.yaml --detailed_debug
```
3. Send a request to the LiteLLM proxy server.

OpenAI Python v1.0.0+:

```python
from openai import OpenAI

# set base_url to your proxy server
# set api_key to send to the proxy server
client = OpenAI(api_key="<proxy-api-key>", base_url="http://0.0.0.0:4000")

response = client.chat.completions.create(
    model="my-triton-model",
    messages=[{"role": "user", "content": "who are u?"}],
    max_tokens=10,
)
print(response)
```

curl (the `Authorization` header is optional, only required if you're using the LiteLLM proxy with virtual keys):

```shell
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer sk-1234' \
--data '{
  "model": "my-triton-model",
  "messages": [{"role": "user", "content": "who are u?"}]
}'
```
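The proxy returns an OpenAI-compatible response, so the reply text can be read the usual way (a usage sketch continuing the OpenAI client example above):

```python
# Continuing the example above: the response follows the OpenAI schema.
print(response.choices[0].message.content)
```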
## Triton /infer - Chat Completion

### SDK

Use the `triton/` prefix to route to your Triton server:
```python
from litellm import completion

response = completion(
    model="triton/llama-3-8b-instruct",
    messages=[{"role": "user", "content": "who are u?"}],
    max_tokens=10,
    api_base="http://localhost:8000/infer",
)
```
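The same call works from async code through litellm's `acompletion` entry point (a minimal sketch; the server address and model name are assumptions):

```python
import asyncio

import litellm

async def main():
    # Async variant of the call above, with identical routing.
    response = await litellm.acompletion(
        model="triton/llama-3-8b-instruct",
        messages=[{"role": "user", "content": "who are u?"}],
        max_tokens=10,
        api_base="http://localhost:8000/infer",  # hypothetical server
    )
    print(response)

asyncio.run(main())
```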
### Proxy

1. Add the model to your `config.yaml`:

```yaml
model_list:
  - model_name: my-triton-model
    litellm_params:
      model: triton/<your-triton-model>
      api_base: https://your-triton-api-base/triton/infer
```

2. Start the proxy:

```shell
$ litellm --config /path/to/config.yaml --detailed_debug
```
3. Send a request to the LiteLLM proxy server.

OpenAI Python v1.0.0+:

```python
from openai import OpenAI

# set base_url to your proxy server
# set api_key to send to the proxy server
client = OpenAI(api_key="<proxy-api-key>", base_url="http://0.0.0.0:4000")

response = client.chat.completions.create(
    model="my-triton-model",
    messages=[{"role": "user", "content": "who are u?"}],
    max_tokens=10,
)
print(response)
```

curl (the `Authorization` header is optional, only required if you're using the LiteLLM proxy with virtual keys):

```shell
curl --location 'http://0.0.0.0:4000/chat/completions' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer sk-1234' \
--data '{
  "model": "my-triton-model",
  "messages": [{"role": "user", "content": "who are u?"}]
}'
```
## Triton /embeddings - Embedding

### SDK

Use the `triton/` prefix to route to your Triton server:
```python
import asyncio

import litellm

async def main():
    response = await litellm.aembedding(
        model="triton/<your-triton-model>",
        # the /embeddings endpoint you want litellm to call on your server
        api_base="https://your-triton-api-base/triton/embeddings",
        input=["good morning from litellm"],
    )
    print(response)

asyncio.run(main())
```
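Outside of async code, litellm's synchronous `embedding` entry point takes the same parameters (a sketch, assuming it mirrors `aembedding` for this provider):

```python
from litellm import embedding

# Hedged sketch: synchronous variant of the call above.
response = embedding(
    model="triton/<your-triton-model>",
    api_base="https://your-triton-api-base/triton/embeddings",
    input=["good morning from litellm"],
)
print(response)
```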
### Proxy

1. Add the model to your `config.yaml`:

```yaml
model_list:
  - model_name: my-triton-model
    litellm_params:
      model: triton/<your-triton-model>
      api_base: https://your-triton-api-base/triton/embeddings
```

2. Start the proxy:

```shell
$ litellm --config /path/to/config.yaml --detailed_debug
```
3. Send a request to the LiteLLM proxy server.

OpenAI Python v1.0.0+:

```python
from openai import OpenAI

# set base_url to your proxy server
# set api_key to send to the proxy server
client = OpenAI(api_key="<proxy-api-key>", base_url="http://0.0.0.0:4000")

response = client.embeddings.create(
    input=["hello from litellm"],
    model="my-triton-model",
)
print(response)
```

curl (the `Authorization` header is optional, only required if you're using the LiteLLM proxy with virtual keys):

```shell
curl --location 'http://0.0.0.0:4000/embeddings' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer sk-1234' \
--data '{
  "model": "my-triton-model",
  "input": ["write a litellm poem"]
}'
```
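The response follows the OpenAI embeddings schema, so the vector can be pulled out as usual (a usage sketch continuing the OpenAI client example above):

```python
# Continuing the example above: one embedding vector per input string.
vector = response.data[0].embedding
print(len(vector))
```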