TL;DR
A 90-turn Claude Code session re-sends the same auth.ts into context 14 times. The same 30KB gh pr list blob appears in 6 turns. The system prompt and tool schemas are re-billed on every request. On Opus 4.7 at $15/M input and $75/M output, this turns a $300/month habit into a $700/month habit.
TokenShield is a local HTTP proxy that sits between your AI client (Claude Code, Cursor, Windsurf, anything using the Anthropic SDK) and api.anthropic.com. It deduplicates repeated tool results, caches identical calls, sends diffs instead of full file re-reads, intercepts runaway output streams, and summarizes long conversation prefixes. Everything runs on your machine. Your ANTHROPIC_API_KEY passes through as a request header to Anthropic; TokenShield never stores or logs it.
The part you can trust on day one is the visibility: a live, local ledger of your real billed Claude spend — fresh input, cached reads, and output — broken down by model and session, that no provider console shows you live. The optimization processors are opt-in and honestly workload-dependent: on prompt-cached workloads (Claude Code's default) much of the repeated context is already discounted, so we do not headline a savings percentage. We ship the measurement so you can read your own number off your own data.
This document describes the problem, the architecture, the token-accounting math (including where it does and doesn't pay off), and the things we deliberately don't do — including why we don't sell a savings guarantee.
1. The problem
1.1 The cumulative cost of repeated context
A naive measurement of Claude Code costs misses where the money actually goes. People assume the cost is roughly:
cost ≈ messages × average_message_size × price_per_token
The real distribution is wildly skewed. In a typical 60-turn agentic session, most turns are cheap (~2–4K input tokens each); a handful of heavy turns — where Claude re-reads a file it already saw, or pulls a 50KB tool result into context for the third time — dominate input cost. The exact split is workload-specific, which is the whole reason we ship a ledger instead of a statistic: it measures your distribution on your traffic rather than asking you to trust a number from us.
A walkthrough of a real session we recorded:
| Turn | Action | Input tokens | $ (Opus 4.7) |
|---|---|---|---|
| 5 | Read('auth.ts') (first read) | 4,213 added | $0.063 |
| 12 | Read('auth.ts') (unchanged) | 4,213 added | $0.063 |
| 23 | Read('auth.ts') (still unchanged) | 4,213 added | $0.063 |
| 31 | gh pr list returns 32 PRs in 28KB | 7,840 added | $0.118 |
| 38 | Read('auth.ts') (one line changed) | 4,213 added | $0.063 |
| 44 | gh pr list (same query, same result) | 7,840 added | $0.118 |
| 51 | Read('auth.ts') (unchanged) | 4,213 added | $0.063 |
Across this 51-turn snippet, auth.ts is re-sent in full four times after its first read, and gh pr list returns the same 28KB blob twice. Together: 35K wasted input tokens, $0.52 unnecessary spend on a 30-minute coding session.
Multiply by 8 sessions a day, 20 days a month, across an engineering team of 12, and a single mid-sized team is spending on the order of $1,000–$2,500/month re-sending context it already paid to send once — spend no provider console breaks out for you. Whether that spend is recoverable (versus already discounted by prompt caching) is the honest question §4 takes up; that it exists, and is invisible by default, is the problem the ledger solves first.
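The arithmetic behind these figures is easy to reproduce. The sketch below recomputes the per-read dollar amounts from the table and the team-month multiplication from the paragraph above; the function names are illustrative and not part of TokenShield, and the only inputs are numbers already quoted in this document.

```typescript
// Opus input price quoted in the TL;DR: $15 per million input tokens.
const INPUT_USD_PER_MTOK = 15;

// Dollar cost of sending `tokens` input tokens at the full (uncached) rate.
function inputCostUsd(tokens: number): number {
  return (tokens * INPUT_USD_PER_MTOK) / 1_000_000;
}

// Scale a per-session figure up to a team-month.
function teamMonthlyUsd(
  perSessionUsd: number,
  sessionsPerDay: number,
  workDaysPerMonth: number,
  engineers: number,
): number {
  return perSessionUsd * sessionsPerDay * workDaysPerMonth * engineers;
}

const authRead = inputCostUsd(4_213); // one full read of auth.ts
const prList = inputCostUsd(7_840);   // one gh pr list result
const monthly = teamMonthlyUsd(0.52, 8, 20, 12);

console.log(authRead.toFixed(3), prList.toFixed(3), monthly.toFixed(2));
// prints "0.063 0.118 998.40"
```

The $0.52 per-session figure lands at the low end of the quoted monthly range; heavier sessions push it toward the upper end.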
1.2 Why nobody is solving this
The seven existing categories of "AI infrastructure" tools each address something else:
| Category | Examples | What they do | What they don't do |
|---|---|---|---|
| Observability | Helicone, Langfuse | Log every call, give you dashboards | Don't reduce the call cost |
| LLM gateways | LiteLLM, OpenRouter | Route to cheapest model, fail over | Don't compress the conversation |
| Caching middleware | Portkey | Cache identical full-prompt requests | Don't dedupe inside a conversation |
| Prompt cache (Anthropic native) | `cache_control` breakpoints in the request body | Re-bills cached prefix at 10% | Only works on stable prefixes; one schema flap invalidates |
| Context compression libraries | LangChain summary memory | App-side context management | Not a transparent proxy; requires code changes |
| Cursor/Cline "local model" routers | Built into the IDE | Route easy work to local Llama | Only works inside that IDE; ignores Claude Code |
| MCP firewalls | Sentinel, others | Block dangerous tool calls | Don't reduce token costs |
None of these give you a local, per-session ledger of the actual conversation traffic between your AI client and Anthropic — what you spend, on which model, broken down by fresh input, cached reads, and output. That visibility is the gap TokenShield fills first; the opt-in optimization layer (measured against your own baseline) is the second act.
1.3 Why Anthropic won't ship this
A "use less Claude" feature is structurally awkward for the company billing you per token. Anthropic shipped prompt-caching, which re-bills a stable prefix at a fraction of the normal input rate — but it puts the burden on you to construct that stable prefix, and a single schema flap invalidates it. The harder wins (cross-turn dedup, diff-based re-reads, output early-stop) require either invasive SDK changes or a layer outside Anthropic's control plane.
A neutral third party can do it without conflict of interest.
2. The architecture
2.1 Where TokenShield sits
┌──────────────┐      ┌─────────────────────┐      ┌─────────────────┐
│ Claude Code  │      │  TokenShield Proxy  │      │  Anthropic API  │
│ (or Cursor,  │ ───▶ │  http://127.0.0.1   │ ───▶ │  api.anthropic. │
│  Windsurf,   │ ◀─── │  :7777              │ ◀─── │  com            │
│  any SDK)    │      │  (your machine)     │      │                 │
└──────────────┘      └─────────────────────┘      └─────────────────┘
                                 │
                                 ▼
                      ~/.tokenshield/ledger.db
                      http://127.0.0.1:7778
                      (local dashboard)
You set ANTHROPIC_BASE_URL=http://127.0.0.1:7777 once in your shell. The Anthropic SDK respects this env var natively, so no code change is needed. Your ANTHROPIC_API_KEY stays in your shell environment; TokenShield never reads it from there and only forwards the key header your client sends.
Every request flows through a fail-open middleware pipeline. If any processor throws, the request continues to Anthropic untouched. The floor is "don't break Claude Code" — it outranks every optimization. What runs unconditionally is the ledger; every token-reducing rewrite is opt-in and yields to that floor.
2.2 The processor pipeline
| Processor | Type | Default | What it targets (not a savings figure) |
|---|---|---|---|
| Token accounting | Observation | On | Nothing — it is the baseline ledger, the number that counts |
| Conversation dedup | Request rewrite | On | Byte-identical tool_result blocks repeated across turns |
| Result cache | Request short-circuit | On | Re-issued identical tool calls |
| Diff-based file reads | Request rewrite | On | Unchanged regions of a re-read file |
| Streaming early-stop | Response truncation | Opt-in | Output that runs past a natural stopping point |
| Context auto-summarize | Request rewrite | Opt-in | History re-billed on sessions past ~100K tokens |
| Prompt-cache enforcer | Diagnostic | On | Prefixes that should be cache-eligible but aren't |
Each processor implements onRequest(messages) → messages and/or onResponse(stream) → stream. The pipeline runs them in order; any uncaught exception trips a per-processor circuit breaker and the request continues with the remaining processors.
Note what the last column deliberately does not contain: a per-processor savings percentage. Each row names the traffic a processor targets, not a number you can bank. How many billed dollars that traffic becomes depends on your workload and, above all, on how much of it Anthropic's prompt cache has already discounted (§4.2) — on Claude Code's default cached traffic, that can be most of it, leaving the realized saving small or near zero. The only savings figure TokenShield will show you is the before/after delta your own ledger records on your own traffic.
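The fail-open behavior described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not TokenShield's actual pipeline: the `FailOpenPipeline` class, the failure threshold of 3, and the JSON-based copy are all hypothetical choices made for the example. It shows one way a throwing processor can be skipped without losing the work of the processors around it.

```typescript
type Message = { role: string; content: unknown };

interface Processor {
  name: string;
  onRequest(messages: Message[]): Message[];
}

class FailOpenPipeline {
  private readonly processors: Processor[];
  private readonly maxFailures: number; // hypothetical breaker threshold
  private readonly failures = new Map<string, number>();

  constructor(processors: Processor[], maxFailures = 3) {
    this.processors = processors;
    this.maxFailures = maxFailures;
  }

  run(original: Message[]): Message[] {
    let current = original;
    for (const p of this.processors) {
      const failed = this.failures.get(p.name) ?? 0;
      if (failed >= this.maxFailures) continue; // breaker open: skip this processor
      try {
        // Each processor gets a copy, so a half-finished rewrite cannot leak
        // into the request if it throws midway. Assumes plain JSON data.
        const copy = JSON.parse(JSON.stringify(current)) as Message[];
        current = p.onRequest(copy);
      } catch {
        // Fail open: keep `current` as it was and count the failure.
        this.failures.set(p.name, failed + 1);
      }
    }
    return current;
  }
}
```

With this shape, the worst case for any single processor is that its rewrite is discarded; the request itself always has a valid message array to forward.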
2.3 Conversation deduplication — the biggest single win
A Claude messages array can contain hundreds of tool_result blocks across turns. Many of those blocks are byte-identical to earlier ones (re-reads, idempotent lookups, schema queries).
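// Sketch of the dedup pass (types trimmed to the fields it touches)
import { createHash } from "node:crypto";

type Block = { type: string; tool_use_id?: string; content?: unknown };
type Message = { role: string; content: string | Block[] };

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");
// Assumption: JSON serialization is a stable enough canonical form here.
const canonicalize = (c: unknown) => (typeof c === "string" ? c : JSON.stringify(c));

function dedupe(messages: Message[]): Message[] {
  const seen = new Map<string, { messageIndex: number; toolUseId: string }>();
  for (const [i, msg] of messages.entries()) {
    if (typeof msg.content === "string") continue; // plain-text turn, no blocks
    for (const block of msg.content) {
      if (block.type !== "tool_result" || !block.tool_use_id) continue;
      const hash = sha256(canonicalize(block.content));
      const prior = seen.get(hash);
      if (prior === undefined) {
        seen.set(hash, { messageIndex: i, toolUseId: block.tool_use_id });
        continue;
      }
      // Replace body with a pointer Claude can follow on demand.
      block.content = `[tokenshield: identical to tool_result ${prior.toolUseId} ` +
        `at message ${prior.messageIndex}, sha:${hash.slice(0, 8)}]`;
    }
  }
  return messages; // note: blocks are rewritten in place
}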
Claude follows the pointer naturally — if it needs the actual content again, it re-issues the tool call and the result cache serves it for zero new tokens. We've never seen Claude get confused by the pointer once the prompt establishes the convention.
The risk is correctness — if Claude needs the content and the cache is cold, it has to wait. We've measured this at <1% of turns in real sessions, and the time saved by smaller context dominates.
2.4 Diff-based file reads
When Claude reads auth.ts at turn 5 and again at turn 30, we don't send 800 lines twice. We send 800 once, then a 12-line unified diff against the prior version:
[tokenshield: auth.ts — unchanged since message 5, except lines 142–154:
@@ -142,4 +142,12 @@
-export async function verify(token: string) {
+export async function verify(token: string, audience: string) {
+ if (audience !== EXPECTED_AUDIENCE) throw new Error("bad audience");
...
]
Claude parses unified diff natively — both Cursor and Claude Code have shown they can reason over diffs without confusion. On a file-heavy iterative loop a 12-line diff stands in for an 800-line re-read, so the raw-token reduction on those turns is large. But raw tokens are not billed dollars: if the original read is sitting in a cached prefix, Anthropic is already billing it at the discounted cached rate (§4.2), so collapsing it to a diff saves far less than the raw-token figure implies — and on a well-cached session it can save almost nothing. The ledger shows the real before/after delta on your own reads; that is the only number to act on.
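To make the re-read rewrite concrete, here is a minimal sketch of how a changed region could be located and rendered as a stub. It is deliberately simpler than a real unified-diff algorithm: it trims the common prefix and suffix of the two versions and reports whatever is left as a single hunk with no context lines. The names `changedRegion` and `renderStub` and the exact stub wording are assumptions for illustration, not TokenShield's implementation.

```typescript
interface ChangedRegion {
  startLine: number; // 1-based line where the two versions first differ
  removed: string[]; // old lines inside the changed region
  added: string[];   // new lines that replace them
}

// Returns null when the two versions are identical.
function changedRegion(oldText: string, newText: string): ChangedRegion | null {
  if (oldText === newText) return null;
  const a = oldText.split("\n");
  const b = newText.split("\n");

  // Count matching lines from the top.
  let prefix = 0;
  while (prefix < a.length && prefix < b.length && a[prefix] === b[prefix]) prefix++;

  // Count matching lines from the bottom without overlapping the prefix.
  let suffix = 0;
  while (
    suffix < a.length - prefix &&
    suffix < b.length - prefix &&
    a[a.length - 1 - suffix] === b[b.length - 1 - suffix]
  ) {
    suffix++;
  }

  return {
    startLine: prefix + 1,
    removed: a.slice(prefix, a.length - suffix),
    added: b.slice(prefix, b.length - suffix),
  };
}

// Render a pointer stub; the hunk header is simplified (no context lines).
function renderStub(path: string, sinceMessage: number, region: ChangedRegion | null): string {
  if (region === null) return `[tokenshield: ${path} unchanged since message ${sinceMessage}]`;
  const { startLine, removed, added } = region;
  const hunk = [
    `@@ -${startLine},${removed.length} +${startLine},${added.length} @@`,
    ...removed.map((line) => `-${line}`),
    ...added.map((line) => `+${line}`),
  ].join("\n");
  return `[tokenshield: ${path} unchanged since message ${sinceMessage}, except:\n${hunk}\n]`;
}
```

A single-hunk trim over-reports when a file has two distant edits, since everything between them lands in one region. A production version would more likely use a proper line-diff library and emit several hunks.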
2.5 Streaming early-stop
Output tokens cost 5× input on Opus 4.7. A 3,000-token response that ends with "Would you like me to continue with the next file?" costs you $0.225 — and the user usually only wanted the first 800 tokens.
The stream-early-stop processor watches the streaming text delta for natural stop patterns:
/(?:Would you like me to|Should I (?:continue|proceed)|Let me know if you want)/i
When detected within ~200 tokens of a code-block-terminated message, the local dashboard surfaces a one-tap "stop here" button that closes the upstream SSE stream. Output cost stops immediately. The partial response is still delivered to your client because we forwarded byte-faithfully until the stop.
This processor is off by default. It switches to on by default after 14 days of in-the-wild correctness validation per user.
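One detail of the pattern match is worth sketching: SSE text arrives in small deltas, so a phrase such as "Would you like me to" can be split across two chunks. The sketch below keeps a rolling tail of recent text and tests the pattern against it. The `StopDetector` class and the 400-character tail are assumptions for illustration; the "within ~200 tokens of a code block" condition and the dashboard button are not modeled here.

```typescript
// The stop pattern quoted above, applied to a rolling tail of streamed text.
const STOP_PATTERN =
  /(?:Would you like me to|Should I (?:continue|proceed)|Let me know if you want)/i;

class StopDetector {
  private tail = "";
  private readonly maxTail: number; // hypothetical window size, in characters

  constructor(maxTail = 400) {
    this.maxTail = maxTail;
  }

  // Feed one text delta; returns true once the recent text matches a stop pattern.
  push(delta: string): boolean {
    this.tail = (this.tail + delta).slice(-this.maxTail);
    return STOP_PATTERN.test(this.tail);
  }
}
```

Because the regex has no global flag, `test` carries no state between calls, so the same pattern object can be shared across streams.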
2.6 Context auto-summarize (the cliff-protector)
Once a Claude Code session passes ~100K cumulative tokens, every new turn re-bills all 100K. A user 4 hours into a session is paying $1.50+ per turn — most of which is re-billing the same context.
The summarizer waits until you cross 100K, then makes a single Haiku 4.5 call that compresses turns 1..N into a ≤2K-token prefix. The next turn re-injects this as a synthetic assistant message at conversation start:
[tokenshield-summary: turns 1–42 compressed to 1,847 tokens.
Originals available via `tokenshield show-original <session-id>`.]
Claude doesn't notice — it just sees a more compact history. This processor is OFF by default until we can show, per workload, that it nets out positive against Anthropic's prompt-cache invalidation cost (see §4.2).
3. Privacy — the architectural commitment, not the marketing word
3.1 What never leaves your machine
- Your `ANTHROPIC_API_KEY`. TokenShield does not read environment variables for keys, does not log Authorization headers, and does not persist credentials. The key flows through as a header from your client to Anthropic; we are a transparent forwarder.
- The content of any prompt, tool result, or assistant message. Aggregates are bucketed and stripped before they leave the proxy.
- The names of your MCP tools, file paths, or any identifiers that could reveal what you're working on.
3.2 What optional cloud telemetry does include
Off by default. When opted in, every 60s the proxy sends a payload like:
{
  "license": "tk_live_…",
  "bucket": "2026-05-16T22:00:00Z",
  "model": "claude-opus-4-7",
  "processor": "conversation-dedup",
  "input_tokens_raw": 184320,
  "input_tokens_sent": 71840,
  "output_tokens_raw": 12400,
  "output_tokens_sent": 12400,
  "dollars_saved": 1.687,
  "request_count": 18
}
No prompt, no tool_name, no text, no content. The schema is enforced at the source: any payload containing a forbidden key is dropped locally with an error logged for the user.
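A source-side gate of this kind can be sketched as a recursive key scan. The forbidden key names below are taken from the sentence above and are an assumption about the real list; `findForbiddenKey` and `gateTelemetry` are hypothetical names, and the actual contract lives in packages/core/src/telemetry-schema.ts.

```typescript
// Assumed deny-list, taken from the prose above; the real schema may differ.
const FORBIDDEN_KEYS = new Set(["prompt", "tool_name", "text", "content"]);

// Walks objects and arrays; returns the path of the first forbidden key, or null.
function findForbiddenKey(value: unknown, path = "$"): string | null {
  if (Array.isArray(value)) {
    for (const [i, item] of value.entries()) {
      const hit = findForbiddenKey(item, `${path}[${i}]`);
      if (hit !== null) return hit;
    }
    return null;
  }
  if (value !== null && typeof value === "object") {
    for (const [key, child] of Object.entries(value as Record<string, unknown>)) {
      if (FORBIDDEN_KEYS.has(key)) return `${path}.${key}`;
      const hit = findForbiddenKey(child, `${path}.${key}`);
      if (hit !== null) return hit;
    }
  }
  return null;
}

// Decide whether a payload may leave the machine.
function gateTelemetry(payload: unknown): { send: boolean; reason?: string } {
  const hit = findForbiddenKey(payload);
  return hit === null ? { send: true } : { send: false, reason: `forbidden key at ${hit}` };
}
```

A deny-list catches known-bad keys; an allow-list that accepts only the fields in the sample payload would be stricter, at the cost of needing an update whenever the schema grows.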
3.3 Localhost binding by default
The proxy listens on 127.0.0.1 only. --bind 0.0.0.0 is opt-in with a 3-second warning prompt that displays the security implications. We will never ship a "default LAN-exposed" configuration.
3.4 Verifiability
The source is MIT-licensed and on GitHub. Every claim in this document is verifiable by reading packages/core/src/proxy/anthropic-passthrough.ts and packages/core/src/telemetry.ts. The cloud-side telemetry contract is packages/core/src/telemetry-schema.ts. There is no separate "enterprise" branch that does different things.
4. The math, honestly
4.1 Why these numbers compose multiplicatively — and why the total is a ceiling, not a forecast
A common marketing trap is to add the processors up: "30% dedup + 10% cache + 15% diff + 20% summarize = 75% off." That arithmetic is wrong, and a vendor who shows it to you is either confused or counting on you to be. Each processor only acts on the tokens the previous ones left behind — the diff pass only sees what dedup didn't already pointer-ify; the cache only catches what dedup let through. Savings don't add; they compound on a shrinking base:
surviving = 1.0
for each processor p in pipeline_order:
    saves_this_pass = surviving × p.efficiency
    surviving -= saves_this_pass
ceiling = 1.0 - surviving
To show how the composition behaves, here is that loop run with illustrative coefficients — round numbers chosen to demonstrate the mechanism, not measurements from your machine and not a number we are promising you:
| Processor | Illustrative coefficient (on raw surviving tokens) | Raw tokens surviving |
|---|---|---|
| Conversation dedup | 0.30 | 0.70 |
| Result cache | 0.07 of survivors | 0.65 |
| Diff-based file reads | 0.12 | 0.57 |
| Streaming early-stop | 0.18 of output | 0.47 (input + output blended) |
| Context auto-summarize | 0.20 | 0.38 |
Run end to end, those illustrative coefficients leave ≈0.38 of the raw tokens — a theoretical ceiling near 62% on _raw_-token reduction for this one hypothetical, fully uncached workload. That number is the single most over-quotable figure in this market, so we will defuse it ourselves:
- It is a ceiling, not a forecast. It is the most arithmetic permits on a conversation Anthropic bills at full price for every token. It is not a prediction of your bill.
- It is raw tokens, not billed dollars. Every coefficient above treats all tokens as full-price. On Claude Code's default prompt-cached traffic, much of what dedup and diffing target is already discounted to the cached rate (§4.2) — so the realized, billed saving is a fraction of this raw-token ceiling, and on a well-cached session it can round to near zero.
- It is illustrative, not measured. We did not lift these coefficients off your workload, and we will not put a blended "~X% savings" on a slide, because we cannot honestly predict it for your sessions.
That is exactly why this document leads with the ledger and not a percentage. The only figure that means anything is the before/after delta TokenShield records on your traffic, in diff-mode, with each processor toggled — you read your number off your own data (§4.3).
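For readers who want to check the table, the compounding loop from §4.1 is short enough to run directly. The sketch below uses the illustrative coefficients from the table, not measurements, and applies every coefficient to the blended surviving fraction. That is a simplification for the early-stop row, which the table defines on output tokens only.

```typescript
interface Stage {
  name: string;
  efficiency: number; // fraction of the surviving tokens this stage removes
}

// Each stage only acts on what the previous stages left behind.
function survivingFraction(stages: Stage[]): number {
  let surviving = 1.0;
  for (const stage of stages) {
    surviving -= surviving * stage.efficiency;
  }
  return surviving;
}

// Illustrative coefficients copied from the table above.
const illustrative: Stage[] = [
  { name: "Conversation dedup", efficiency: 0.30 },
  { name: "Result cache", efficiency: 0.07 },
  { name: "Diff-based file reads", efficiency: 0.12 },
  { name: "Streaming early-stop", efficiency: 0.18 },
  { name: "Context auto-summarize", efficiency: 0.20 },
];

const surviving = survivingFraction(illustrative);
const naiveSum = illustrative.reduce((total, stage) => total + stage.efficiency, 0);

console.log(surviving.toFixed(2), (1 - surviving).toFixed(2), naiveSum.toFixed(2));
// prints "0.38 0.62 0.87"
```

The last number is what naive addition of the same coefficients would claim (87%), against the 62% raw-token ceiling that compounding actually allows.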
4.2 When compression can cost you money
Anthropic prompt caching re-bills cached input at 10% of the normal rate. If TokenShield modifies the prefix on every turn, we invalidate the cache and net-cost you 4× on what was previously cached.
The conservative defaults reflect this:
- Dedup pointer stubs are deterministic — same content always produces the same stub — so prompt caching still hits on stable prefixes.
- Diff-based reads modify only the new file_read response, not the established history. Prompt cache holds.
- Context auto-summarize invalidates the cache by definition (the prefix changes). That's why it stays OFF by default until we can prove the savings exceed the cache-invalidation cost for your specific workload. The dashboard surfaces this break-even line.
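The break-even behind these defaults can be worked out with one ratio. The sketch below compares a prefix left alone and billed at the cached rate against the same prefix rewritten, shrunk, and billed at full price. It is a single-turn simplification under two assumptions: the untouched prefix would have been a full cache hit, and the cache-write surcharge is ignored. The function names are illustrative.

```typescript
// Cached input is billed at 10% of the normal input rate (figure from the text above).
const CACHED_READ_MULTIPLIER = 0.1;

// Cost of a rewritten, uncached prefix relative to the untouched, cached one,
// when the rewrite removes `removedFraction` of the prefix tokens.
function costRatioAfterInvalidation(removedFraction: number): number {
  const remainingAtFullPrice = 1 - removedFraction;
  return remainingAtFullPrice / CACHED_READ_MULTIPLIER;
}

// The rewrite only pays for itself when it removes more than this fraction.
function breakEvenRemovedFraction(): number {
  return 1 - CACHED_READ_MULTIPLIER;
}

console.log(costRatioAfterInvalidation(0.6).toFixed(1)); // prints "4.0"
console.log(breakEvenRemovedFraction().toFixed(2));      // prints "0.90"
```

Under these assumptions, a rewrite that strips 60% of a cached prefix still costs about 4× what the cached prefix did, and a cache-breaking rewrite needs to remove more than 90% of the prefix before it comes out ahead on that turn.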
4.3 We measure; we don't guarantee a percentage
We do not publish a guaranteed savings floor, and we do not sell a "X% or your money back" promise. The reason is in the math above: on prompt-cached workloads — Claude Code's default — much of the repeated context the dedup processor targets is already discounted by Anthropic's cache, so the realized billed-token savings is highly workload-dependent and can be small. A guarantee on a number we can't reliably predict for your workload would be dishonest, and for a trust tool that's disqualifying.
What we ship instead is the measurement. The local ledger records your real billed tokens — fresh input, cached reads, and output — before and after you enable each processor, in diff-mode, on your own traffic. You read your actual effect off your own data and decide. The subscription is month-to-month with a standard money-back window; there is no savings target to dispute because we never set one for you.
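As a rough picture of what such a measurement involves, the sketch below prices a list of per-turn usage records and compares two runs. The usage field names follow the Anthropic Messages API usage object; the input and output prices come from the TL;DR, the cached-read rate from §4.2, and the cache-write rate of 1.25× input is an assumption based on Anthropic's published 5-minute cache pricing. The function names are illustrative and do not describe the ledger's real schema.

```typescript
interface Usage {
  input_tokens: number;                // fresh, uncached input
  cache_read_input_tokens: number;     // input served from the prompt cache
  cache_creation_input_tokens: number; // input written into the prompt cache
  output_tokens: number;
}

// USD per million tokens.
const USD_PER_MTOK = {
  input: 15,         // from the TL;DR
  output: 75,        // from the TL;DR
  cacheRead: 1.5,    // 10% of input, per §4.2
  cacheWrite: 18.75, // assumption: 1.25x input for a cache write
};

function billedUsd(u: Usage): number {
  const micros =
    u.input_tokens * USD_PER_MTOK.input +
    u.cache_read_input_tokens * USD_PER_MTOK.cacheRead +
    u.cache_creation_input_tokens * USD_PER_MTOK.cacheWrite +
    u.output_tokens * USD_PER_MTOK.output;
  return micros / 1_000_000;
}

function sumUsd(turns: Usage[]): number {
  return turns.reduce((total, u) => total + billedUsd(u), 0);
}

// Before/after delta for the same workload with a processor off and then on.
function measuredDelta(before: Usage[], after: Usage[]) {
  const beforeUsd = sumUsd(before);
  const afterUsd = sumUsd(after);
  return { beforeUsd, afterUsd, savedUsd: beforeUsd - afterUsd };
}
```

Pricing cached reads separately is what keeps the delta honest: shrinking tokens that were already billed at the cached rate moves `savedUsd` far less than the raw token count suggests.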
5. What we deliberately do not do
- We do not run a hosted SaaS proxy. A hosted proxy would mean we receive every token your AI sends and receives. That's a liability we won't accept and a privacy story you shouldn't have to trust. Local-only forever for the free + standalone tiers; hosted may exist someday for teams that explicitly want it, with a published BAA.
- We do not charge a percentage of measured savings. Measurement disputes eat support time on both sides. Flat per-seat pricing ($19/mo individual, $29/seat Team Standard bundled with Governance, $59/seat Team Pro) keeps the math simple for both of us.
- We do not break Claude Code to save tokens. Every processor is replay-tested in CI against recorded sessions; PRs are blocked on any correctness regression. Diff-mode is on by default for your first 14 days so every modification is reviewable side-by-side.
- We do not "use less Claude" by routing to other models silently. Model routing (TokenShield's sibling product Orchestra, shipping v1.1) is an explicit, configurable opt-in. We never substitute the model your code asked for without your policy permission.
- We do not fight closed apps. Claude Desktop, ChatGPT desktop, and Gemini app do not expose a custom base URL setting. We don't ship system-level CA-cert tricks to intercept them. The supported integration list is what supports a base URL override.
6. What works today
| Tool | Status | Setup |
|---|---|---|
| Claude Code (CLI) | ✅ Live | export ANTHROPIC_BASE_URL=http://127.0.0.1:7777 |
| Cursor (Anthropic mode) | ✅ Live | Settings → Models → Anthropic → Custom Base URL |
| Windsurf (Anthropic mode) | ✅ Live | Settings → Models → Anthropic → Custom endpoint |
| Zed (Anthropic mode) | ✅ Live | settings.json → assistant.anthropic_api_url |
| Aider (Anthropic mode) | ✅ Live | Same ANTHROPIC_BASE_URL env var |
| Anthropic SDK apps (any language) | ✅ Live | The SDK respects the env var |
| OpenAI (Codex CLI, Cursor GPT, Continue) | 🕒 v1.1, week of 2026-06-14 | Same shape, different adapter |
| Google Gemini | 🕒 v1.2, week of 2026-06-21 | Harder (Google auth quirks), same adapter pattern |
| Claude Desktop (Anthropic GUI) | ❌ No plans | Doesn't expose a base URL setting |
| ChatGPT desktop | ❌ No plans | Session-based auth, no API key flow |
7. Implementation status
7.1 What ships in v0.1 (today)
- Wire-faithful HTTP + SSE passthrough to `api.anthropic.com`
- Token accounting from `message_start`/`message_delta` events
- SQLite ledger on disk (no native deps; uses Node 22+ `node:sqlite`)
- Local dashboard at `:7778`
- Fail-open middleware chain
- Pro-grade CLI: `setup`, `up`, `up --daemon`, `status`, `stop`, `logs`, `doctor`, `demo`, `estimate`, `integrations {list,enable,show,disable}`
- 40-test golden suite (16 core + 24 CLI)
- MIT license
7.2 Shipping in v0.2 (week of 2026-05-24)
- Conversation deduplication
- Result cache
- Diff-mode UI for trust-building
- Cloud telemetry endpoint (opt-in)
7.3 Shipping in v0.3 (week of 2026-05-31)
- Diff-based file reads
- Streaming early-stop
- Workload-tier classifier
7.4 v1.0 GA (week of 2026-06-07)
- Context auto-summarize
- Prompt-cache enforcer
- Stripe checkout live (Solo $19/mo individual)
- Public, per-workload savings measurement in the cloud dashboard (no guarantee — measured, not promised)
- Public launch
7.5 Fast-follows
- v1.1 OpenAI provider adapter (week 2026-06-14)
- v1.2 Google Gemini provider adapter (week 2026-06-21)
- v1.5 Tauri menu-bar app (Mac + Windows + Linux): one-click installer, system-tray savings ticker, GUI wizards that call the same `integrations` library the CLI uses
- Orchestra (sibling product): explicit hybrid model routing + Anthropic failover, bundled into Team Pro
- PrivacyShield (sibling product): local PII tokenization, $99–$499/seat self-serve
8. How to try it in 60 seconds
npm install -g @curatedmcp/tokenshield
tokenshield setup # guided install
# or, manually:
tokenshield up # foreground
export ANTHROPIC_BASE_URL=http://127.0.0.1:7777
claude # your normal workflow
open http://127.0.0.1:7778 # live dashboard
For a no-network demo of the ledger and processor pipeline on sample traffic:
tokenshield demo
To verify the privacy claim on your own machine:
tokenshield doctor # shows what env vars we see
tokenshield --json status | jq
# or read packages/core/src/telemetry-schema.ts
9. The business honesty
CuratedMCP, the parent company, is the local control plane for AI development: a curated, risk-classified MCP catalog and a Governance Control Plane for engineering teams ($29/seat). TokenShield is the consumer on-ramp to that funnel. The pitch to a Platform Lead is direct and measurement-honest: "See exactly what every engineer spends on Claude — per model, per project — and govern which MCP servers they run, from one control plane. Optimization is opt-in and measured, never promised."
We don't pretend to be neutral. The processor pipeline is MIT-licensed and runs on your machine; the dashboard is yours; the ledger is yours. We make money when your team adopts the bundle that includes governance, because that's where the proxy + the policy + the audit log compound into something a procurement department signs.
We deliberately do not sell a savings guarantee. The subscription is month-to-month with a standard money-back window, and the value you can verify on day one — the spend visibility — is real whether or not the optimization moves your particular bill.
10. References & links
- Repository (MIT): `github.com/curatedmcp/tokenshield`
- npm: `@curatedmcp/tokenshield`, `@curatedmcp/tokenshield-core`
- Product page: curatedmcp.com/tokenshield
- Sibling firewall (governance): curatedmcp.com/sentinel
- Anthropic prompt caching: docs.anthropic.com/en/docs/build-with-claude/prompt-caching
- Anthropic SDK base URL override: `ANTHROPIC_BASE_URL` env var (documented in `@anthropic-ai/sdk`)
- Microsoft Presidio (used in PrivacyShield, sibling product): microsoft.github.io/presidio
Last updated 2026-06-13. Comments and corrections: open an issue at github.com/curatedmcp/tokenshield or email team@curatedmcp.com. We will revise this document as measurements come in from real users.