Prompt Caching
Supported Providers
- OpenAI (openai/)
- Anthropic API (anthropic/)
- Bedrock (bedrock/, bedrock/invoke/, bedrock/converse) (all models on Bedrock that support prompt caching)
- Deepseek API (deepseek/)
For the supported providers, LiteLLM follows the OpenAI prompt caching usage object format:
"usage": {
"prompt_tokens": 2006,
"completion_tokens": 300,
"total_tokens": 2306,
"prompt_tokens_details": {
"cached_tokens": 1920
},
"completion_tokens_details": {
"reasoning_tokens": 0
}
# ANTHROPIC_ONLY #
"cache_creation_input_tokens": 0
}
- prompt_tokens: the non-cached prompt tokens (same as Anthropic; equivalent to Deepseek's prompt_cache_miss_tokens).
- completion_tokens: the output tokens generated by the model.
- total_tokens: the sum of prompt_tokens + completion_tokens.
- prompt_tokens_details: object containing cached_tokens.
  - cached_tokens: the tokens that were a cache hit for this call.
- completion_tokens_details: object containing reasoning_tokens.
- ANTHROPIC ONLY: cache_creation_input_tokens is the number of tokens written to the cache. (Anthropic charges for this.)
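For illustration, here is a minimal sketch of reading these fields from a LiteLLM response object (field access mirrors the assertions in the Quick Start below; cache_creation_input_tokens is Anthropic-only and may be absent):
usage = response.usage  # `response` is the object returned by litellm.completion()
print("prompt_tokens:", usage.prompt_tokens)          # non-cached prompt tokens
print("completion_tokens:", usage.completion_tokens)  # generated output tokens
print("total_tokens:", usage.total_tokens)            # prompt_tokens + completion_tokens
print("cached_tokens:", usage.prompt_tokens_details.cached_tokens)  # cache hits for this call
# ANTHROPIC ONLY - tokens written to the cache (Anthropic charges for these)
print("cache_creation_input_tokens:", getattr(usage, "cache_creation_input_tokens", 0))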
Quick Start
Note: OpenAI caching is only available for prompts containing 1024 or more tokens (see the token-count sketch at the end of this Quick Start).
- SDK
- PROXY
from litellm import completion
import os
os.environ["OPENAI_API_KEY"] = ""
for _ in range(2):
response = completion(
model="gpt-4o",
messages=[
# System Message
{
"role": "system",
"content": [
{
"type": "text",
"text": "Here is the full text of a complex legal agreement"
* 400,
}
],
},
# With OpenAI, no cache_control parameter is needed - prompt prefixes of 1024+ tokens are cached automatically.
{
"role": "user",
"content": [
{
"type": "text",
"text": "What are the key terms and conditions in this agreement?",
}
],
},
{
"role": "assistant",
"content": "Certainly! the key terms and conditions are the following: the contract is 1 year long for $10/mo",
},
# Final user turn - the second iteration of the loop should read the shared prefix from the cache.
{
"role": "user",
"content": [
{
"type": "text",
"text": "What are the key terms and conditions in this agreement?",
}
],
},
],
temperature=0.2,
max_tokens=10,
)
print("response=", response)
print("response.usage=", response.usage)
assert "prompt_tokens_details" in response.usage
assert response.usage.prompt_tokens_details.cached_tokens > 0
- Set up config.yaml
model_list:
- model_name: gpt-4o
litellm_params:
model: openai/gpt-4o
api_key: os.environ/OPENAI_API_KEY
- Start the proxy
litellm --config /path/to/config.yaml
- Test it!
from openai import OpenAI
import os
client = OpenAI(
api_key="LITELLM_PROXY_KEY", # sk-1234
base_url="LITELLM_PROXY_BASE" # http://0.0.0.0:4000
)
for _ in range(2):
response = client.chat.completions.create(
model="gpt-4o",
messages=[
# System Message
{
"role": "system",
"content": [
{
"type": "text",
"text": "Here is the full text of a complex legal agreement"
* 400,
}
],
},
# With OpenAI, no cache_control parameter is needed - prompt prefixes of 1024+ tokens are cached automatically.
{
"role": "user",
"content": [
{
"type": "text",
"text": "What are the key terms and conditions in this agreement?",
}
],
},
{
"role": "assistant",
"content": "Certainly! the key terms and conditions are the following: the contract is 1 year long for $10/mo",
},
# Final user turn - the second iteration of the loop should read the shared prefix from the cache.
{
"role": "user",
"content": [
{
"type": "text",
"text": "What are the key terms and conditions in this agreement?",
}
],
},
],
temperature=0.2,
max_tokens=10,
)
print("response=", response)
print("response.usage=", response.usage)
assert "prompt_tokens_details" in response.usage
assert response.usage.prompt_tokens_details.cached_tokens > 0
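A quick way to sanity-check that a prompt clears the 1024-token threshold before expecting cache hits is to count tokens locally - a sketch, assuming LiteLLM's token_counter utility:
import litellm

messages = [
    {"role": "system", "content": "Here is the full text of a complex legal agreement" * 400},
    {"role": "user", "content": "What are the key terms and conditions in this agreement?"},
]

# OpenAI only caches prompts of 1024 tokens or more.
n_tokens = litellm.token_counter(model="gpt-4o", messages=messages)
print("prompt tokens:", n_tokens, "- long enough to cache:", n_tokens >= 1024)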
Anthropic Example
Anthropic charges for cache writes.
Specify the content you want cached with "cache_control": {"type": "ephemeral"}.
If you pass this to any other LLM provider, it is ignored.
- SDK
- PROXY
from litellm import completion
import litellm
import os
litellm.set_verbose = True # 👈 SEE RAW REQUEST
os.environ["ANTHROPIC_API_KEY"] = ""
response = completion(
model="anthropic/claude-3-5-sonnet-20240620",
messages=[
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are an AI assistant tasked with analyzing legal documents.",
},
{
"type": "text",
"text": "Here is the full text of a complex legal agreement" * 400,
"cache_control": {"type": "ephemeral"},
},
],
},
{
"role": "user",
"content": "what are the key terms and conditions in this agreement?",
},
]
)
print(response.usage)
- Set up config.yaml
model_list:
- model_name: claude-3-5-sonnet-20240620
litellm_params:
model: anthropic/claude-3-5-sonnet-20240620
api_key: os.environ/ANTHROPIC_API_KEY
- Start the proxy
litellm --config /path/to/config.yaml
- Test it!
from openai import OpenAI
import os
client = OpenAI(
api_key="LITELLM_PROXY_KEY", # sk-1234
base_url="LITELLM_PROXY_BASE" # http://0.0.0.0:4000
)
response = client.chat.completions.create(
model="claude-3-5-sonnet-20240620",
messages=[
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are an AI assistant tasked with analyzing legal documents.",
},
{
"type": "text",
"text": "Here is the full text of a complex legal agreement" * 400,
"cache_control": {"type": "ephemeral"},
},
],
},
{
"role": "user",
"content": "what are the key terms and conditions in this agreement?",
},
]
)
print(response.usage)
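The first call with a given cache_control block pays for the cache write; repeating the identical request should then show cache reads instead. A minimal sketch of checking this on the response from the SDK example above (cache_creation_input_tokens is Anthropic-only, so it is read defensively here):
# First call: expect cache_creation_input_tokens > 0 (Anthropic charges for the write).
# A repeated identical call: expect cached_tokens > 0 instead.
print("cache writes:", getattr(response.usage, "cache_creation_input_tokens", 0))
print("cache reads:", response.usage.prompt_tokens_details.cached_tokens)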
Deepseek Example
Works the same as OpenAI.
from litellm import completion
import litellm
import os
os.environ["DEEPSEEK_API_KEY"] = ""
litellm.set_verbose = True # 👈 SEE RAW REQUEST
model_name = "deepseek/deepseek-chat"
messages_1 = [
{
"role": "system",
"content": "You are a history expert. The user will provide a series of questions, and your answers should be concise and start with `Answer:`",
},
{
"role": "user",
"content": "In what year did Qin Shi Huang unify the six states?",
},
{"role": "assistant", "content": "Answer: 221 BC"},
{"role": "user", "content": "Who was the founder of the Han Dynasty?"},
{"role": "assistant", "content": "Answer: Liu Bang"},
{"role": "user", "content": "Who was the last emperor of the Tang Dynasty?"},
{"role": "assistant", "content": "Answer: Li Zhu"},
{
"role": "user",
"content": "Who was the founding emperor of the Ming Dynasty?",
},
{"role": "assistant", "content": "Answer: Zhu Yuanzhang"},
{
"role": "user",
"content": "Who was the founding emperor of the Qing Dynasty?",
},
]
message_2 = [
{
"role": "system",
"content": "You are a history expert. The user will provide a series of questions, and your answers should be concise and start with `Answer:`",
},
{
"role": "user",
"content": "In what year did Qin Shi Huang unify the six states?",
},
{"role": "assistant", "content": "Answer: 221 BC"},
{"role": "user", "content": "Who was the founder of the Han Dynasty?"},
{"role": "assistant", "content": "Answer: Liu Bang"},
{"role": "user", "content": "Who was the last emperor of the Tang Dynasty?"},
{"role": "assistant", "content": "Answer: Li Zhu"},
{
"role": "user",
"content": "Who was the founding emperor of the Ming Dynasty?",
},
{"role": "assistant", "content": "Answer: Zhu Yuanzhang"},
{"role": "user", "content": "When did the Shang Dynasty fall?"},
]
response_1 = litellm.completion(model=model_name, messages=messages_1)
response_2 = litellm.completion(model=model_name, messages=message_2)
# Add any assertions here to check the response
print(response_2.usage)
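Because message_2 shares a long prefix with messages_1, the second call should register cache hits in the normalized usage object - a short check, using the OpenAI-style usage fields described above:
# The shared prefix from the first request should show up as cached tokens on the second call.
print("cached_tokens:", response_2.usage.prompt_tokens_details.cached_tokens)
print("prompt_tokens (cache misses):", response_2.usage.prompt_tokens)  # Deepseek's prompt_cache_miss_tokens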
Calculate Cost
Prompt tokens that hit the cache may be billed at a different rate than prompt tokens that miss the cache.
Use the completion_cost() function to calculate cost (it also handles prompt caching cost calculation). See more helper functions.
cost = completion_cost(completion_response=response, model=model)
Usage
- SDK
- PROXY
from litellm import completion, completion_cost
import litellm
import os
litellm.set_verbose = True # 👈 SEE RAW REQUEST
os.environ["ANTHROPIC_API_KEY"] = ""
model = "anthropic/claude-3-5-sonnet-20240620"
response = completion(
model=model,
messages=[
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are an AI assistant tasked with analyzing legal documents.",
},
{
"type": "text",
"text": "Here is the full text of a complex legal agreement" * 400,
"cache_control": {"type": "ephemeral"},
},
],
},
{
"role": "user",
"content": "what are the key terms and conditions in this agreement?",
},
]
)
print(response.usage)
cost = completion_cost(completion_response=response, model=model)
formatted_string = f"${float(cost):.10f}"
print(formatted_string)
LiteLLM returns the calculated cost in the response headers - x-litellm-response-cost
from openai import OpenAI
client = OpenAI(
api_key="LITELLM_PROXY_KEY", # sk-1234..
base_url="LITELLM_PROXY_BASE" # http://0.0.0.0:4000
)
response = client.chat.completions.with_raw_response.create(
messages=[{
"role": "user",
"content": "Say this is a test",
}],
model="gpt-3.5-turbo",
)
print(response.headers.get('x-litellm-response-cost'))
completion = response.parse() # get the object that `chat.completions.create()` would have returned
print(completion)
Check Model Support
Use supports_prompt_caching() to check whether a model supports prompt caching.
- SDK
- PROXY
from litellm.utils import supports_prompt_caching
supports_pc: bool = supports_prompt_caching(model="anthropic/claude-3-5-sonnet-20240620")
assert supports_pc
Use the /model/info endpoint to check whether a model on the proxy supports prompt caching.
- Set up config.yaml
model_list:
- model_name: claude-3-5-sonnet-20240620
litellm_params:
model: anthropic/claude-3-5-sonnet-20240620
api_key: os.environ/ANTHROPIC_API_KEY
- Start the proxy
litellm --config /path/to/config.yaml
- Test it!
curl -L -X GET 'http://0.0.0.0:4000/v1/model/info' \
-H 'Authorization: Bearer sk-1234'
Expected Response
{
"data": [
{
"model_name": "claude-3-5-sonnet-20240620",
"litellm_params": {
"model": "anthropic/claude-3-5-sonnet-20240620"
},
"model_info": {
"key": "claude-3-5-sonnet-20240620",
...
"supports_prompt_caching": true # 👈 LOOK FOR THIS!
}
}
]
}
This checks against our maintained model info / cost map.
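To look up the same information locally instead of calling the proxy, one option is LiteLLM's get_model_info helper, which reads from this map - a sketch, assuming the map entry exposes the supports_prompt_caching flag:
import litellm

# Fetch the model's entry from LiteLLM's model info / cost map.
info = litellm.get_model_info(model="anthropic/claude-3-5-sonnet-20240620")
print(info.get("supports_prompt_caching"))  # expected: True for models that support prompt caching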