-
Notifications
You must be signed in to change notification settings - Fork 0
/
propmt_caching.txt
340 lines (271 loc) · 14.4 KB
/
propmt_caching.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
Prompt Caching (beta)
Prompt Caching is a powerful feature that optimizes your API usage by allowing resuming from specific prefixes in your prompts. This approach significantly reduces processing time and costs for repetitive tasks or prompts with consistent elements.
Here’s an example of how to implement Prompt Caching with the Messages API using a cache_control block:
Shell
Python
import anthropic
client = anthropic.Anthropic()
response = client.beta.prompt_caching.messages.create(
model="claude-3-5-sonnet-20240620",
max_tokens=1024,
system=[
{
"type": "text",
"text": "You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n",
},
{
"type": "text",
"text": "<the entire contents of 'Pride and Prejudice'>",
"cache_control": {"type": "ephemeral"}
}
],
messages=[{"role": "user", "content": "Analyze the major themes in 'Pride and Prejudice'."}],
)
print(response)
In this example, the entire text of “Pride and Prejudice” is cached using the cache_control parameter. This enables reuse of this large text across multiple API calls without reprocessing it each time. Changing only the user message allows you to ask various questions about the book while utilizing the cached content, leading to faster responses and improved efficiency.
Prompt Caching is in beta
We’re excited to announce that Prompt Caching is now in public beta! To access this feature, you’ll need to include the anthropic-beta: prompt-caching-2024-07-31 header in your API requests.
We’ll be iterating on this open beta over the coming weeks, so we appreciate your feedback. Please share your ideas and suggestions using this form.
How Prompt Caching works
When you send a request with Prompt Caching enabled:
The system checks if the prompt prefix is already cached from a recent query.
If found, it uses the cached version, reducing processing time and costs.
Otherwise, it processes the full prompt and caches the prefix for future use.
This is especially useful for:
Prompts with many examples
Large amounts of context or background information
Repetitive tasks with consistent instructions
Long multi-turn conversations
The cache has a 5-minute lifetime, refreshed each time the cached content is used.
Prompt Caching caches the full prefix
Prompt Caching references the entire prompt - tools, system, and messages (in that order) up to and including the block designated with cache_control.
Pricing
Prompt Caching introduces a new pricing structure. The table below shows the price per token for each supported model:
Model Base Input Tokens Cache Writes Cache Hits Output Tokens
Claude 3.5 Sonnet $3 / MTok $3.75 / MTok $0.30 / MTok $15 / MTok
Claude 3 Haiku $0.25 / MTok $0.30 / MTok $0.03 / MTok $1.25 / MTok
Claude 3 Opus $15 / MTok $18.75 / MTok $1.50 / MTok $75 / MTok
Note:
Cache write tokens are 25% more expensive than base input tokens
Cache read tokens are 90% cheaper than base input tokens
Regular input and output tokens are priced at standard rates
How to implement Prompt Caching
Supported models
Prompt Caching is currently supported on:
Claude 3.5 Sonnet
Claude 3 Haiku
Claude 3 Opus
Structuring your prompt
Place static content (tool definitions, system instructions, context, examples) at the beginning of your prompt. Mark the end of the reusable content for caching using the cache_control parameter.
Cache prefixes are created in the following order: tools, system, then messages.
Using the cache_control parameter, you can define up to 4 cache breakpoints, allowing you to cache different reusable sections separately.
Cache Limitations
The minimum cacheable prompt length is:
1024 tokens for Claude 3.5 Sonnet and Claude 3 Opus
2048 tokens for Claude 3 Haiku
Shorter prompts cannot be cached, even if marked with cache_control. Any requests to cache fewer than this number of tokens will be processed without caching. To see if a prompt was cached, see the response usage fields.
The cache has a 5 minute time to live (TTL). Currently, “ephemeral” is the only supported cache type, which corresponds to this 5-minute lifetime.
What can be cached
Every block in the request can be designated for caching with cache_control. This includes:
Tools: Tool definitions in the tools array
System messages: Content blocks in the system array
Messages: Content blocks in the messages.content array, for both user and assistant turns
Images: Content blocks in the messages.content array, in user turns
Tool use and tool results: Content blocks in the messages.content array, in both user and assistant turns
Each of these elements can be marked with cache_control to enable caching for that portion of the request.
Tracking cache performance
Monitor cache performance using these API response fields, within usage in the response (or message_start event if streaming):
cache_creation_input_tokens: Number of tokens written to the cache when creating a new entry.
cache_read_input_tokens: Number of tokens retrieved from the cache for this request.
Best practices for effective caching
To optimize Prompt Caching performance:
Cache stable, reusable content like system instructions, background information, large contexts, or frequent tool definitions.
Place cached content at the prompt’s beginning for best performance.
Use cache breakpoints strategically to separate different cacheable prefix sections.
Regularly analyze cache hit rates and adjust your strategy as needed.
Optimizing for different use cases
Tailor your Prompt Caching strategy to your scenario:
Conversational agents: Reduce cost and latency for extended conversations, especially those with long instructions or uploaded documents.
Coding assistants: Improve autocomplete and codebase Q&A by keeping relevant sections or a summarized version of the codebase in the prompt.
Large document processing: Incorporate complete long-form material including images in your prompt without increasing response latency.
Detailed instruction sets: Share extensive lists of instructions, procedures, and examples to fine-tune Claude’s responses. Developers often include an example or two in the prompt, but with prompt caching you can get even better performance by including 20+ diverse examples of high quality answers.
Agentic tool use: Enhance performance for scenarios involving multiple tool calls and iterative code changes, where each step typically requires a new API call.
Talk to books, papers, documentation, podcast transcripts, and other longform content: Bring any knowledge base alive by embedding the entire document(s) into the prompt, and letting users ask it questions.
Troubleshooting common issues
If experiencing unexpected behavior:
Ensure cached sections are identical and marked with cache_control in the same locations across calls
Check that calls are made within the 5-minute cache lifetime
Verify that tool_choice and image usage remain consistent between calls
Validate that you are caching at least the minimum number of tokens
Note that changes to tool_choice or the presence/absence of images anywhere in the prompt will invalidate the cache, requiring a new cache entry to be created.
Cache Storage and Sharing
Organization Isolation: Caches are isolated between organizations. Different organizations never share caches, even if they use identical prompts..
Exact Matching: Cache hits require 100% identical prompt segments, including all text and images up to and including the block marked with cache control. The same block must be marked with cache_control during cache reads and creation.
Output Token Generation: Prompt caching has no effect on output token generation. The response you receive will be identical to what you would get if prompt caching was not used.
Prompt Caching examples
To help you get started with Prompt Caching, we’ve prepared a prompt caching cookbook with detailed examples and best practices.
Below, we’ve included several code snippets that showcase various Prompt Caching patterns. These examples demonstrate how to implement caching in different scenarios, helping you understand the practical applications of this feature:
Large Context caching example
Shell
Python
import anthropic
client = anthropic.Anthropic()
response = client.beta.prompt_caching.messages.create(
model="claude-3-5-sonnet-20240620",
max_tokens=1024,
system=[
{
"type": "text",
"text": "You are an AI assistant tasked with analyzing legal documents."
},
{
"type": "text",
"text": "Here is the full text of a complex legal agreement: [Insert full text of a 50-page legal agreement here]",
"cache_control": {"type": "ephemeral"}
}
],
messages=[
{
"role": "user",
"content": "What are the key terms and conditions in this agreement?"
}
]
)
print(response)
This example demonstrates basic Prompt Caching usage, caching the full text of the legal agreement as a prefix while keeping the user instruction uncached.
For the first request:
input_tokens: Number of tokens in the user message only
cache_creation_input_tokens: Number of tokens in the entire system message, including the legal document
cache_read_input_tokens: 0 (no cache hit on first request)
For subsequent requests within the cache lifetime:
input_tokens: Number of tokens in the user message only
cache_creation_input_tokens: 0 (no new cache creation)
cache_read_input_tokens: Number of tokens in the entire cached system message
Caching tool definitions
Shell
Python
import anthropic
client = anthropic.Anthropic()
response = client.beta.prompt_caching.messages.create(
model="claude-3-5-sonnet-20240620",
max_tokens=1024,
tools=[
{
"name": "get_weather",
"description": "Get the current weather in a given location",
"input_schema": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "The unit of temperature, either 'celsius' or 'fahrenheit'"
}
},
"required": ["location"]
},
},
# many more tools
{
"name": "get_time",
"description": "Get the current time in a given time zone",
"input_schema": {
"type": "object",
"properties": {
"timezone": {
"type": "string",
"description": "The IANA time zone name, e.g. America/Los_Angeles"
}
},
"required": ["timezone"]
},
"cache_control": {"type": "ephemeral"}
}
],
messages=[
{
"role": "user",
"content": "What's the weather and time in New York?"
}
]
)
In this example, we demonstrate caching tool definitions.
The cache_control parameter is placed on the final tool (get_time) to designate all of the tools as part of the static prefix.
This means that all tool definitions, including get_weather and any other tools defined before get_time, will be cached as a single prefix.
This approach is useful when you have a consistent set of tools that you want to reuse across multiple requests without re-processing them each time.
For the first request:
input_tokens: Number of tokens in the user message
cache_creation_input_tokens: Number of tokens in all tool definitions and system prompt
cache_read_input_tokens: 0 (no cache hit on first request)
For subsequent requests within the cache lifetime:
input_tokens: Number of tokens in the user message
cache_creation_input_tokens: 0 (no new cache creation)
cache_read_input_tokens: Number of tokens in all cached tool definitions and system prompt
Continuing a multi-turn conversation
Shell
Python
import anthropic
client = anthropic.Anthropic()
response = client.beta.prompt_caching.messages.create(
model="claude-3-5-sonnet-20240620",
max_tokens=1024,
system=[
{
"type": "text",
"text": "...long system prompt",
"cache_control": {"type": "ephemeral"}
}
],
messages=[
# ...long conversation so far
{
"role": "user",
"content": [
{
"type": "text",
"text": "Hello, can you tell me more about the solar system?",
"cache_control": {"type": "ephemeral"}
}
]
},
{
"role": "assistant",
"content": "Certainly! The solar system is the collection of celestial bodies that orbit our Sun. It consists of eight planets, numerous moons, asteroids, comets, and other objects. The planets, in order from closest to farthest from the Sun, are: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune. Each planet has its own unique characteristics and features. Is there a specific aspect of the solar system you'd like to know more about?"
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "Tell me more about Mars.",
"cache_control": {"type": "ephemeral"}
}
]
}
]
)
In this example, we demonstrate how to use Prompt Caching in a multi-turn conversation.
The cache_control parameter is placed on the system message to designate it as part of the static prefix.
The conversation history (previous messages) is included in the messages array. The final turn is marked with cache-control, for continuing in followups. The second-to-last user message is marked for caching with the cache_control parameter, so that this checkpoint can read from the previous cache.
This approach is useful for maintaining context in ongoing conversations without repeatedly processing the same information.
For each request:
input_tokens: Number of tokens in the new user message (will be minimal)
cache_creation_input_tokens: Number of tokens in the new assistant and user turns
cache_read_input_tokens: Number of tokens in the conversation up to the previous turn