Auto-Inject Prompt Caching Checkpoints

Reduce costs by up to 90% by letting LiteLLM auto-inject prompt caching checkpoints.

How it works

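LiteLLM can automatically inject prompt caching checkpoints into the requests you send to LLM providers. This gives you:

Lower costs: long, static parts of a prompt can be cached instead of being reprocessed on every request.
No application code changes: you configure the auto-caching behavior in the LiteLLM UI or in the litellm config.yaml file.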

Configuration

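You need to specify cache_control_injection_points in your model configuration. This tells LiteLLM:

Where to add the caching directive (location)
Which message to target (role)

LiteLLM then automatically adds the cache_control directive to the specified messages in your request: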

"cache_control": {
    "type": "ephemeral"
}
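
To make the effect of the two fields concrete, here is a minimal sketch of what the injection step amounts to. It is an illustration only, not LiteLLM's actual implementation: the function name inject_cache_control is hypothetical, and placing the directive on the last content block of each matching message is an assumption chosen to match the example request shown in this document.

```python
import copy


def inject_cache_control(messages, injection_points):
    """Return a copy of `messages` with cache_control added to targeted roles.

    Hypothetical helper for illustration; LiteLLM's internals may differ.
    """
    result = copy.deepcopy(messages)  # leave the caller's request untouched
    # Collect the roles named by injection points with location == "message".
    roles = {
        point["role"]
        for point in injection_points
        if point.get("location") == "message" and "role" in point
    }
    for message in result:
        if message.get("role") not in roles:
            continue
        content = message.get("content")
        if isinstance(content, str):
            # Normalize plain-string content into a single text block.
            content = [{"type": "text", "text": content}]
            message["content"] = content
        if content:
            # Assumption: mark the last content block of the message.
            content[-1]["cache_control"] = {"type": "ephemeral"}
    return result
```

Calling inject_cache_control(messages, [{"location": "message", "role": "system"}]) leaves user messages unchanged and marks only the system message.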

Usage Example

In this example, we configure caching for system messages by adding the directive to every message with role: system.

litellm config.yaml
model_list:
  - model_name: anthropic-auto-inject-cache-system-message
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      api_key: os.environ/ANTHROPIC_API_KEY
      cache_control_injection_points:
        - location: message
          role: system

In the LiteLLM UI, you can specify cache_control_injection_points on the Advanced Settings tab when adding a model.
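
Once the model is configured, the client sends an ordinary chat request with no cache_control fields of its own. The sketch below builds such a request with the Python standard library without sending it. It assumes a LiteLLM proxy that exposes an OpenAI-compatible /chat/completions endpoint; the base URL http://localhost:4000, the key sk-1234, and the helper name build_chat_request are placeholders, not values taken from this document.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, system_prompt, question):
    """Build (but do not send) a chat request for the configured model."""
    payload = {
        # Model name from the config.yaml example above.
        "model": "anthropic-auto-inject-cache-system-message",
        "messages": [
            {"role": "system", "content": system_prompt},  # long, static part
            {"role": "user", "content": question},         # changes per request
        ],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Placeholder values; replace with your own proxy URL and key.
request = build_chat_request(
    "http://localhost:4000",
    "sk-1234",
    "You are a helpful assistant. <long static instructions>",
    "What is the main topic of this legal document?",
)
```

Passing the request to urllib.request.urlopen would send it. The point of the sketch is that the payload stays free of caching details, which are added on the LiteLLM side.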

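Detailed Example

1. Original request sent to LiteLLM

In this example, we have a very long, static system message and a user message that changes with every request. Caching the system message is efficient because it rarely changes.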

{
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are a helpful assistant. This is a set of very long instructions that you will follow. Here is a legal document that you will use to answer the user's question."
                }
            ]
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is the main topic of this legal document?"
                }
            ]
        }
    ]
}

2. Request as modified by LiteLLM

Based on our configuration, LiteLLM automatically injects the caching directive into the system message:

{
    "messages": [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are a helpful assistant. This is a set of very long instructions that you will follow. Here is a legal document that you will use to answer the user's question.",
                    "cache_control": {"type": "ephemeral"}
                }
            ]
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is the main topic of this legal document?"
                }
            ]
        }
    ]
}

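When the model provider processes this request, it recognizes the caching directive, processes the system message only once, and caches it for use by subsequent requests.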
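
The size of the saving can be estimated with simple arithmetic. The sketch below is a rough model under stated assumptions, not a billing calculator: it assumes a cache write is billed at 1.25x the base input price and a cache read at 0.10x (check your provider's current pricing), and that every request after the first hits the cache before it expires. The function name and parameters are illustrative.

```python
def relative_input_cost(cached_tokens, fresh_tokens, n_requests,
                        write_mult=1.25, read_mult=0.10):
    """Return (input cost with caching) / (input cost without caching).

    cached_tokens: tokens in the static, cacheable prefix (e.g. system message)
    fresh_tokens:  tokens that change on every request (e.g. user message)
    write_mult / read_mult: assumed price multipliers for cache writes / reads
    """
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")
    # Without caching, every request pays full price for all input tokens.
    uncached = n_requests * (cached_tokens + fresh_tokens)
    # With caching: one cache write, then cache reads on later requests;
    # the changing tokens are always billed at the base price.
    cached = (
        cached_tokens * write_mult
        + (n_requests - 1) * cached_tokens * read_mult
        + n_requests * fresh_tokens
    )
    return cached / uncached


# 10,000-token system message, 100-token user message, 100 requests.
ratio = relative_input_cost(10_000, 100, 100)
print(f"input cost with caching: {ratio:.1%} of the uncached cost")
```

Under these assumptions the ratio approaches the read multiplier as the static prefix dominates the prompt and the number of requests grows, which is where a figure of up to 90% comes from. A single request costs more with caching than without, because only the write is paid.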
