Hugging Face

LiteLLM 支持对托管在 Hugging Face Hub 上的模型，通过多个服务运行推理。

无服务器推理提供商 - Hugging Face 通过多个推理提供商（如 Together AI 和 Sambanova）提供对无服务器 AI 推理的便捷统一访问。这是将 AI 集成到产品中的最快方式，提供免维护且可扩展的解决方案。更多详情请参阅推理提供商文档。
专属推理端点 - 这是一种可轻松将模型部署到生产环境的产品。Hugging Face 在您选择的云提供商的专属、完全托管的基础设施上运行推理。您可以按照这些步骤将模型部署到 Hugging Face 推理端点。

支持的模型

无服务器推理提供商

您可以通过访问 huggingface.co/models，点击“其他”过滤器标签，然后选择您想要的提供商来查看推理提供商可用的模型

Filter models by Inference Provider

例如，您可以在此处找到所有 Fireworks 支持的模型。

专属推理端点

请参阅推理端点目录，获取可用模型列表。

使用方法

无服务器推理提供商
推理端点

认证

只需一个 Hugging Face token，您就可以通过多个提供商访问推理。您的请求会通过 Hugging Face 路由，并且使用量会以标准提供商 API 费率直接计入您的 Hugging Face 账户。

只需使用您的 Hugging Face token 设置 HF_TOKEN 环境变量，您可以在此处创建 token：https://hugging-face.cn/settings/tokens。

export HF_TOKEN="hf_xxxxxx"

或者，您也可以将您的 Hugging Face token 作为参数传递

completion(..., api_key="hf_xxxxxx")

入门指南

要使用 Hugging Face 模型，请按以下格式指定您要使用的提供商和模型

huggingface/<provider>/<hf_org_or_user>/<hf_model>

其中 <hf_org_or_user>/<hf_model> 是 Hugging Face 模型 ID，<provider> 是推理提供商。
默认情况下，如果您未指定提供商，LiteLLM 将使用 HF 推理 API。

示例

# Run DeepSeek-R1 inference through Together AI
completion(model="huggingface/together/deepseek-ai/DeepSeek-R1",...)

# Run Qwen2.5-72B-Instruct inference through Sambanova
completion(model="huggingface/sambanova/Qwen/Qwen2.5-72B-Instruct",...)

# Run Llama-3.3-70B-Instruct inference through HF Inference API
completion(model="huggingface/meta-llama/Llama-3.3-70B-Instruct",...)

基本补全

这里是通过 Together AI 使用 DeepSeek-R1 模型进行聊天补全的示例

import os
from litellm import completion

os.environ["HF_TOKEN"] = "hf_xxxxxx"

response = completion(
    model="huggingface/together/deepseek-ai/DeepSeek-R1",
    messages=[
        {
            "role": "user",
            "content": "How many r's are in the word 'strawberry'?",
        }
    ],
)
print(response)

流式传输

现在，让我们看看流式请求的样子。

import os
from litellm import completion

os.environ["HF_TOKEN"] = "hf_xxxxxx"

response = completion(
    model="huggingface/together/deepseek-ai/DeepSeek-R1",
    messages=[
        {
            "role": "user",
            "content": "How many r's are in the word `strawberry`?",
            
        }
    ],
    stream=True,
)

for chunk in response:
    print(chunk)

图片输入

当模型支持时，您也可以传递图片。这里是通过 Sambanova 使用 Llama-3.2-11B-Vision-Instruct 模型的示例。

from litellm import completion

# Set your Hugging Face Token
os.environ["HF_TOKEN"] = "hf_xxxxxx"

messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    }
                },
            ],
        }
    ]

response = completion(
    model="huggingface/sambanova/meta-llama/Llama-3.2-11B-Vision-Instruct", 
    messages=messages,
)
print(response.choices[0])

函数调用

您可以通过提供工具访问权限来扩展模型的功能。这里是通过 Sambanova 使用 Qwen2.5-72B-Instruct 模型进行函数调用的示例。

import os
from litellm import completion

# Set your Hugging Face Token
os.environ["HF_TOKEN"] = "hf_xxxxxx"

tools = [
  {
    "type": "function",
    "function": {
      "name": "get_current_weather",
      "description": "Get the current weather in a given location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "The city and state, e.g. San Francisco, CA",
          },
          "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location"],
      },
    }
  }
]
messages = [
    {
        "role": "user",
        "content": "What's the weather like in Boston today?",
    }
]

response = completion(
    model="huggingface/sambanova/meta-llama/Llama-3.3-70B-Instruct", 
    messages=messages,
    tools=tools,
    tool_choice="auto"
)
print(response)

基本补全

在专属基础设施上部署 Hugging Face 推理端点后，您可以通过在 api_base 中提供端点基础 URL，并将 huggingface/tgi 指示为模型名称来在其上运行推理。

import os
from litellm import completion

os.environ["HF_TOKEN"] = "hf_xxxxxx"

response = completion(
    model="huggingface/tgi",
    messages=[{"content": "Hello, how are you?", "role": "user"}],
    api_base="https://my-endpoint.endpoints.huggingface.cloud/v1/"
)
print(response)

流式传输

import os
from litellm import completion

os.environ["HF_TOKEN"] = "hf_xxxxxx"

response = completion(
    model="huggingface/tgi",
    messages=[{"content": "Hello, how are you?", "role": "user"}],
    api_base="https://my-endpoint.endpoints.huggingface.cloud/v1/",
    stream=True
)

for chunk in response:
    print(chunk)

图片输入

import os
from litellm import completion

os.environ["HF_TOKEN"] = "hf_xxxxxx"

messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    }
                },
            ],
        }
    ]
response = completion(
    model="huggingface/tgi",
    messages=messages,
    api_base="https://my-endpoint.endpoints.huggingface.cloud/v1/""
)
print(response.choices[0])

函数调用

import os
from litellm import completion

os.environ["HF_TOKEN"] = "hf_xxxxxx"

functions = [{
    "name": "get_weather",
    "description": "Get the weather in a given location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "The location to get weather for"
            }
        },
        "required": ["location"]
    }
}]

response = completion(
    model="huggingface/tgi",
    messages=[{"content": "What's the weather like in San Francisco?", "role": "user"}],
    api_base="https://my-endpoint.endpoints.huggingface.cloud/v1/",
    functions=functions
)
print(response)

使用 Hugging Face 模型的 LiteLLM 代理服务器

您可以设置LiteLLM 代理服务器，通过任何支持的推理提供商来服务 Hugging Face 模型。方法如下

步骤 1. 设置配置文件

在此示例中，我们配置了一个代理，用于服务 Hugging Face 的 DeepSeek R1 模型，并使用 Together AI 作为后端推理提供商。

model_list:
  - model_name: my-r1-model
    litellm_params:
      model: huggingface/together/deepseek-ai/DeepSeek-R1
      api_key: os.environ/HF_TOKEN # ensure you have `HF_TOKEN` in your .env

步骤 2. 启动服务器

litellm --config /path/to/config.yaml

步骤 3. 向服务器发送请求

curl
python

curl --location 'http://0.0.0.0:4000/chat/completions' \
    --header 'Content-Type: application/json' \
    --data '{
    "model": "my-r1-model",
    "messages": [
        {
            "role": "user",
            "content": "Hello, how are you?"
        }
    ]
}'

# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://0.0.0.0:4000",
    api_key="anything",
)

response = client.chat.completions.create(
    model="my-r1-model",
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ]
)
print(response)

Embedding

LiteLLM 也支持 Hugging Face 的文本 Embedding 推理模型。

from litellm import embedding
import os
os.environ['HF_TOKEN'] = "hf_xxxxxx"
response = embedding(
    model='huggingface/microsoft/codebert-base',
    input=["good morning from litellm"]
)

常见问题

Hugging Face 推理提供商如何计费？

计费集中在您的 Hugging Face 账户上，无论您使用哪个提供商。您将按标准提供商 API 费率计费，不收取额外加价 - Hugging Face 只是将提供商成本直接传递给您。请注意，Hugging Face PRO 用户每月可获得价值 2 美元的推理额度，可在所有提供商之间使用。

我需要为每个推理提供商创建账户吗？

不需要，您无需创建单独的账户。所有请求都通过 Hugging Face 路由，因此您只需要您的 HF token。这让您可以轻松地对不同提供商进行基准测试，并选择最适合您需求的提供商。

Hugging Face 未来会支持更多推理提供商吗？

是的！新的推理提供商（和模型）正在逐步添加。

我们欢迎任何改进 Hugging Face 集成的建议 - 创建一个issue / 加入 Discord！

Hugging Face

支持的模型​

无服务器推理提供商​

专属推理端点​

使用方法​

认证​

入门指南​

基本补全​

流式传输​

图片输入​

函数调用​

基本补全​

流式传输​

图片输入​

函数调用​

使用 Hugging Face 模型的 LiteLLM 代理服务器​

步骤 1. 设置配置文件​

步骤 2. 启动服务器​

步骤 3. 向服务器发送请求​

Embedding​

常见问题

支持的模型

无服务器推理提供商

专属推理端点

使用方法

认证

入门指南

基本补全

流式传输

图片输入

函数调用

基本补全

流式传输

图片输入

函数调用

使用 Hugging Face 模型的 LiteLLM 代理服务器

步骤 1. 设置配置文件

步骤 2. 启动服务器

步骤 3. 向服务器发送请求

Embedding