Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
23 changes: 23 additions & 0 deletions src/lib/navigation.ts
Original file line number Diff line number Diff line change
Expand Up @@ -846,6 +846,29 @@ export const tabNavigation: NavTab[] = [
icon: 'check-double',
items: [
{ title: "Building an Eval Correction Loop: Teaching Your Evaluator What 'Good' Means for Your Domain", href: '/docs/cookbook/evaluation/eval-correction-loop' },
{ title: 'Running Continuous Evals in Production Without Blowing Your Token Budget', href: '/docs/cookbook/observe/continuous-evals-budget' },
]
},
{
title: 'Security',
icon: 'shield',
items: [
{ title: 'Simulating Prompt Injection, Jailbreak, and Social Engineering Attacks Against Your Agent', href: '/docs/cookbook/security/simulate-adversarial-attacks' },
]
},
{
title: 'Optimization',
icon: 'wand-magic-sparkles',
items: [
{ title: 'Closing the Loop: Turning Production Failures Into Automated Prompt Improvements', href: '/docs/cookbook/optimization/closing-the-loop-prod-failures' },
]
},
{
title: 'Voice',
icon: 'microphone',
items: [
{ title: 'How to Test, Evaluate, and Improve Vapi Voice Agents With Future AGI Simulate', href: '/docs/cookbook/voice/simulate-vapi' },
{ title: 'How to Test, Evaluate, and Improve Retell Voice Agents With Future AGI Simulate', href: '/docs/cookbook/voice/simulate-retell' },
]
},
{
Expand Down
37 changes: 26 additions & 11 deletions src/pages/docs/cookbook/command-center/semantic-caching.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,6 @@ title: "Cut LLM Costs 80% With Semantic Caching in Agent Command Center"
description: "Turn on exact and semantic response caching at the gateway so paraphrased duplicate prompts return cached answers instead of paying for the same call twice."
---

<TLDR>
Enable caching once in the Agent Command Center dashboard, switch the strategy to `semantic`, and your existing OpenAI SDK code starts returning cached answers for paraphrased prompts. The `x-agentcc-cache: hit_semantic` response header confirms it. You walk away with sub-100ms latency and near-zero cost on duplicate traffic, no application-code rewrites.
</TLDR>

<div style={{display: "flex", gap: "8px", flexWrap: "wrap", margin: "0.5rem 0 1rem"}}>
<a href="https://colab.research.google.com/github/future-agi/cookbooks/blob/cookbook/falcon-ai-page/command-center/semantic-caching.ipynb" target="_blank" style={{display: "inline-flex"}}><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" style={{height: "28px"}} /></a>
<a href="https://github.com/future-agi/cookbooks/blob/cookbook/falcon-ai-page/command-center/semantic-caching.ipynb" target="_blank" style={{display: "inline-flex"}}><img src="https://img.shields.io/badge/View_on_GitHub-181717?logo=github&logoColor=white" alt="GitHub" style={{height: "28px"}} /></a>
Expand All @@ -16,6 +12,8 @@ Enable caching once in the Agent Command Center dashboard, switch the strategy t
|------|-----------|---------|
| 10 min | Beginner | `agentcc` |

By the end of this cookbook you will have exact + semantic caching enabled at the Agent Command Center gateway, paraphrased duplicate prompts returning cached answers in sub-100ms with near-zero cost, and a way to bypass or invalidate the cache when you need fresh answers. The only application-code change is pointing your OpenAI SDK at the gateway base URL.

<Prerequisites>
- FutureAGI account → [app.futureagi.com](https://app.futureagi.com)
- Agent Command Center API key starting with `sk-agentcc-` (Settings → API Keys)
Expand All @@ -35,12 +33,16 @@ pip install openai agentcc
export AGENTCC_API_KEY="sk-agentcc-your-key"
```

## Tutorial
## What is Agent Command Center?

Agent Command Center is a model gateway in front of your LLM provider calls. Point your OpenAI SDK at the gateway base URL (`https://gateway.futureagi.com/v1`) and every request gets routed through it, where the platform applies caching, routing, fallback, cost tracking, and observability before forwarding to the underlying provider.

This cookbook turns on the **caching** layer specifically. The gateway has two cache tiers: an L1 **exact** cache that matches byte-identical prompts, and an L2 **semantic** cache that matches paraphrased prompts by meaning. The five steps below send a baseline request, enable exact caching from the dashboard, turn on the semantic L2 layer, measure the savings on a realistic batch, and show how to bypass or invalidate the cache when you need fresh answers.

<Steps>
<Step title="Send a baseline request and note the cost">

Point the OpenAI SDK at the gateway and send a request. The response headers tell you exactly what it cost and whether it came from cache.
Before turning anything on, send one request through the gateway with caching disabled. This is your "no caching" control. Every gateway response carries `x-agentcc-cache`, `x-agentcc-cost`, and `x-agentcc-latency-ms` headers, so you can read the exact cost and latency you're paying right now and compare it to the cached numbers in the steps that follow.

```python
import os
Expand All @@ -67,10 +69,12 @@ print(f"latency: {r.headers.get('x-agentcc-latency-ms')}ms")
</Step>
<Step title="Turn on exact caching in the dashboard">

The first cache layer is **L1 exact match**. The gateway hashes the exact request (model, messages, temperature, all parameters) and stores the response under that hash. Any future request with the same hash returns the cached response without ever calling the provider. It's the cheapest, fastest layer to enable, and it kicks in automatically the moment caching is on.

In the dashboard, go to **Gateway → Providers → Cache** and click **Configure Cache**. Toggle:

- **Enable Response Cache**: on
- **Default TTL**: `1h` (or whatever fits your data freshness needs)
- **Default TTL**: `1h` (cached entries expire after an hour. Set this based on how stale your data can be before you need a fresh model call.)

<img src="https://fi-cookbook-assets.s3.ap-south-1.amazonaws.com/command-center/semantic-caching/dashboard-caching.png" alt="Agent Command Center Cache tab with Enabled: Yes, L1 Backend: memory, and Semantic Cache: Disabled. Exact-match caching is active" style={{width: "100%", borderRadius: "0.75rem", border: "1px solid var(--color-border-default)"}} />

Expand All @@ -95,12 +99,12 @@ Use cache **namespaces** to isolate environments or experiments. Set `x-agentcc-
</Step>
<Step title="Switch to semantic caching for paraphrased prompts">

Real users don't ask the same question the same way twice. Semantic caching matches prompts by meaning rather than exact text. It runs as an L2 fallback after the L1 exact-match check.
Real users don't ask the same question the same way twice. *"What is your return policy?"* and *"Can I return a product?"* are the same question to a human but byte-different to the L1 hash, so L1 alone misses both. The **L2 semantic cache** fixes that. The gateway embeds each prompt into a vector, looks for a previously-cached vector within a similarity threshold, and returns that response. L2 only runs after L1 misses, so byte-identical prompts still take the fast path.

In the same **Configure Cache** dialog, enable:

- **L2 Semantic Cache**: on
- **Threshold**: `0.92` (similarity, 0 to 1, higher is stricter)
- **Threshold**: `0.92` (similarity, 0 to 1, higher is stricter. 0.92 catches paraphrases without colliding unrelated questions.)

The same client code now matches paraphrases:

Expand Down Expand Up @@ -128,7 +132,7 @@ Tune the threshold carefully. Too low (e.g., 0.7) and unrelated questions collid
</Step>
<Step title="Measure the savings">

Loop over a realistic mixed batch and tally cache hits, total cost, and latency.
Step 2 and 3 each verified a single hit. To see what production traffic actually saves, run a realistic batch where each unique question repeats several times (mimicking what your support bot or FAQ assistant gets all day). The first occurrence of each question pays the provider; every repeat hits cache for near-zero cost and sub-second latency. Tally the breakdown across the batch and you've measured your actual hit rate and dollar savings.

```python
import time
Expand Down Expand Up @@ -165,7 +169,9 @@ Expect ~80% hits after the first pass over each unique question (`hit_exact` for
</Step>
<Step title="Bypass or invalidate the cache when you need fresh answers">

When you change a system prompt or want a fresh response for a specific call, send `x-agentcc-cache-force-refresh: true` on that request. The gateway skips the cache read but still writes the new response back, so subsequent calls hit the refreshed entry.
Caching only helps if you can defeat it when you need to. Two real situations: you're testing a prompt change and need to see the new model output, or you've shipped a fix and want to invalidate every cached entry that was generated under the old prompt. The gateway gives you both.

For one-off bypass, send `x-agentcc-cache-force-refresh: true` on a single request. The gateway skips the cache read but still writes the new response back, so subsequent identical calls hit the refreshed entry.

```python
r = client.chat.completions.with_raw_response.create(
Expand All @@ -181,10 +187,19 @@ For a global wipe after a prompt-template update, route your traffic to a fresh
</Step>
</Steps>

## What you solved

Repetitive user questions (the kind any production support bot, FAQ assistant, or knowledge-base agent sees daily) now return cached answers for paraphrased duplicates instead of paying the provider for the same call twice. Sub-100ms responses for cache hits, full cost only for the first unique question, and a single `x-agentcc-cache` header on every response so you can audit hit rates from production logs.

<Check>
You enabled exact then semantic caching in the dashboard, watched paraphrased prompts return cached responses with `x-agentcc-cache: hit_semantic`, and measured the cost drop on a realistic batch, without changing application code beyond pointing at the gateway.
</Check>

- **Pay-for-every-call cost** (no caching at all): solved by enabling the L1 exact cache in the dashboard with one toggle
- **Cache misses on paraphrased duplicates** (exact-only caching): solved by turning on the L2 semantic cache with a similarity threshold
- **No visibility into cache hit rate**: solved by the `x-agentcc-cache`, `x-agentcc-cost`, `x-agentcc-latency-ms` response headers on every request
- **Stale cached responses after a prompt change**: solved by `x-agentcc-cache-force-refresh: true` per request, or by routing to a fresh `x-agentcc-cache-namespace`

## Explore further

<CardGroup cols={3}>
Expand Down
Loading
Loading