CuratedMCP
Whitepaper · v0.2 · May 2026

TokenShield: A Local-First Proxy for Claude Spend Visibility

A technical whitepaper for engineers running long-form agentic workflows on Claude (Code, Cursor, Windsurf), with practical math, an honest architecture, and a roadmap that respects the constraints of a closed-source upstream.

CC BY 4.0 — share, post, translate. We'd appreciate a backlink to curatedmcp.com/tokenshield.

TL;DR

A 90-turn Claude Code session re-sends the same auth.ts into context 14 times. The same 30KB gh pr list blob appears in 6 turns. The system prompt and tool schemas are re-billed on every request. On Opus 4.7 at $15/M input and $75/M output, this turns a $300/month habit into a $700/month habit.

TokenShield is a local HTTP proxy that sits between your AI client (Claude Code, Cursor, Windsurf, anything using the Anthropic SDK) and api.anthropic.com. It deduplicates repeated tool results, caches identical calls, sends diffs instead of full file re-reads, intercepts runaway output streams, and summarizes long conversation prefixes. Everything runs on your machine. Your ANTHROPIC_API_KEY travels only as a request header from your client to Anthropic; TokenShield never logs or persists it.

The part you can trust on day one is the visibility: a live, local ledger of your real billed Claude spend — fresh input, cached reads, and output — broken down by model and session, that no provider console shows you live. The optimization processors are opt-in and honestly workload-dependent: on prompt-cached workloads (Claude Code's default) much of the repeated context is already discounted, so we do not headline a savings percentage. We ship the measurement so you can read your own number off your own data.

This document describes the problem, the architecture, the token-accounting math (including where it does and doesn't pay off), and the things we deliberately don't do — including why we don't sell a savings guarantee.


1. The problem

1.1 The cumulative cost of repeated context

A naive measurement of Claude Code costs misses where the money actually goes. People assume the cost is roughly:


cost ≈ messages × average_message_size × price_per_token

The real distribution is wildly skewed. In a typical 60-turn agentic session, most turns are cheap (~2–4K input tokens each); a handful of heavy turns — where Claude re-reads a file it already saw, or pulls a 50KB tool result into context for the third time — dominate input cost. The exact split is workload-specific, which is the whole reason we ship a ledger instead of a statistic: it measures your distribution on your traffic rather than asking you to trust a number from us.

A walkthrough of a real session we recorded:

| Turn | Action | Input tokens | $ (Opus 4.7) |
|---|---|---|---|
| 5 | Read('auth.ts') (first read) | 4,213 added | $0.063 |
| 12 | Read('auth.ts') (unchanged) | 4,213 added | $0.063 |
| 23 | Read('auth.ts') (still unchanged) | 4,213 added | $0.063 |
| 31 | gh pr list returns 32 PRs in 28KB | 7,840 added | $0.118 |
| 38 | Read('auth.ts') (one line changed) | 4,213 added | $0.063 |
| 44 | gh pr list (same query, same result) | 7,840 added | $0.118 |
| 51 | Read('auth.ts') (unchanged) | 4,213 added | $0.063 |

Across this 51-turn snippet, auth.ts enters the context five times, fully reproduced each time, and gh pr list returns the same 28KB blob twice. Counting only the repeats (four re-reads of auth.ts and one re-run of gh pr list): 4 × 4,213 + 7,840 = 24,692 wasted input tokens, about $0.37 of unnecessary spend at the uncached rate on a 30-minute coding session.

Multiply by 8 sessions a day, 20 days a month, across an engineering team of 12 (1,920 sessions), and a single mid-sized team is spending roughly $700/month at this session's rate, and potentially $2,500/month on heavier sessions, re-sending context it already paid to send once. No provider console breaks that spend out for you. Whether it is recoverable (versus already discounted by prompt caching) is the honest question §4 takes up; that it exists, and is invisible by default, is the problem the ledger solves first.
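The repeat rows in the session table can be tallied directly. The following is a minimal sketch, assuming the per-row token counts from the table and the $15/M Opus 4.7 input rate quoted in the TL;DR. The `repeat` flag is a labeling added for this illustration (it is not a TokenShield field), and every token is priced at the full uncached rate.

```typescript
// Rows copied from the session table. `repeat` marks a re-send of content the
// model had already been given (turn 38 re-sends the whole file for a one-line change).
type Row = { turn: number; action: string; inputTokens: number; repeat: boolean };

const rows: Row[] = [
  { turn: 5, action: "Read('auth.ts')", inputTokens: 4213, repeat: false },
  { turn: 12, action: "Read('auth.ts')", inputTokens: 4213, repeat: true },
  { turn: 23, action: "Read('auth.ts')", inputTokens: 4213, repeat: true },
  { turn: 31, action: "gh pr list", inputTokens: 7840, repeat: false },
  { turn: 38, action: "Read('auth.ts')", inputTokens: 4213, repeat: true },
  { turn: 44, action: "gh pr list", inputTokens: 7840, repeat: true },
  { turn: 51, action: "Read('auth.ts')", inputTokens: 4213, repeat: true },
];

const INPUT_USD_PER_MTOK = 15; // Opus 4.7 input rate quoted in the TL;DR

function dollars(tokens: number): number {
  return (tokens / 1_000_000) * INPUT_USD_PER_MTOK;
}

const repeatTokens = rows
  .filter((r) => r.repeat)
  .reduce((sum, r) => sum + r.inputTokens, 0);
const allTokens = rows.reduce((sum, r) => sum + r.inputTokens, 0);

// repeatTokens = 4 × 4,213 + 7,840 = 24,692, about $0.37 uncached
// allTokens    = 5 × 4,213 + 2 × 7,840 = 36,745, about $0.55 uncached
const repeatDollars = dollars(repeatTokens);
const allDollars = dollars(allTokens);
```

On a prompt-cached session the billed figure would be lower than either total, which is the distinction §4.2 draws between raw tokens and billed dollars.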

1.2 Why nobody is solving this

The seven existing categories of "AI infrastructure" tools each address something else:

| Category | Examples | What they do | What they don't do |
|---|---|---|---|
| Observability | Helicone, Langfuse | Log every call, give you dashboards | Don't reduce the call cost |
| LLM gateways | LiteLLM, OpenRouter | Route to cheapest model, fail over | Don't compress the conversation |
| Caching middleware | Portkey | Cache identical full-prompt requests | Don't dedupe inside a conversation |
| Prompt cache (Anthropic native) | cache_control headers | Re-bill cached prefix at 10% | Only works on stable prefixes; one schema flap invalidates |
| Context compression libraries | LangChain summary memory | App-side context management | Not a transparent proxy; requires code changes |
| Cursor/Cline "local model" routers | Built into the IDE | Route easy work to local Llama | Only works inside that IDE; ignores Claude Code |
| MCP firewalls | Sentinel, others | Block dangerous tool calls | Don't reduce token costs |

None of these give you a local, per-session ledger of the actual conversation traffic between your AI client and Anthropic — what you spend, on which model, broken down by fresh input, cached reads, and output. That visibility is the gap TokenShield fills first; the opt-in optimization layer (measured against your own baseline) is the second act.

1.3 Why Anthropic won't ship this

A "use less Claude" feature is structurally awkward for the company billing you per token. Anthropic shipped prompt-caching, which re-bills a stable prefix at a fraction of the normal input rate — but it puts the burden on you to construct that stable prefix, and a single schema flap invalidates it. The harder wins (cross-turn dedup, diff-based re-reads, output early-stop) require either invasive SDK changes or a layer outside Anthropic's control plane.

A neutral third party can do it without conflict of interest.


2. The architecture

2.1 Where TokenShield sits


   ┌──────────────┐      ┌─────────────────────┐      ┌─────────────────┐
   │ Claude Code  │      │  TokenShield Proxy  │      │ Anthropic API   │
   │  (or Cursor, │ ───▶ │  http://127.0.0.1   │ ───▶ │ api.anthropic.  │
   │   Windsurf,  │ ◀─── │  :7777              │ ◀─── │ com             │
   │   any SDK)   │      │  (your machine)     │      │                 │
   └──────────────┘      └─────────────────────┘      └─────────────────┘
                                  │
                                  ▼
                         ~/.tokenshield/ledger.db
                         http://127.0.0.1:7778
                         (local dashboard)

You set ANTHROPIC_BASE_URL=http://127.0.0.1:7777 once in your shell. The Anthropic SDK respects this env var natively, so no code change is needed. Your ANTHROPIC_API_KEY stays in your shell's environment; TokenShield never reads it from there and only forwards the header your client sends (§3.1).

Every request flows through a fail-open middleware pipeline. If a processor throws, its rewrite is skipped and the request still reaches Anthropic. The floor is "don't break Claude Code", and it outranks every optimization. What runs unconditionally is the ledger; every token-reducing rewrite is opt-in and yields to that floor.

2.2 The processor pipeline

| Processor | Type | Default | What it targets (not a savings figure) |
|---|---|---|---|
| Token accounting | Observation | On | Nothing; it is the baseline ledger, the number that counts |
| Conversation dedup | Request rewrite | On | Byte-identical tool_result blocks repeated across turns |
| Result cache | Request short-circuit | On | Re-issued identical tool calls |
| Diff-based file reads | Request rewrite | On | Unchanged regions of a re-read file |
| Streaming early-stop | Response truncation | Opt-in | Output that runs past a natural stopping point |
| Context auto-summarize | Request rewrite | Opt-in | History re-billed on sessions past ~100K tokens |
| Prompt-cache enforcer | Diagnostic | On | Prefixes that should be cache-eligible but aren't |

Each processor implements onRequest(messages) → messages and/or onResponse(stream) → stream. The pipeline runs them in order; any uncaught exception trips a per-processor circuit breaker and the request continues with the remaining processors.
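The fail-open behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the shipped implementation: `createPipeline`, the `maxFailures` threshold, and the deep-copy step are hypothetical choices made here, and only the `onRequest` half of the interface is shown.

```typescript
type Message = { role: string; content: unknown };

interface Processor {
  name: string;
  onRequest(messages: Message[]): Message[];
}

// Hypothetical breaker policy: skip a processor once it has thrown `maxFailures` times.
function createPipeline(processors: Processor[], maxFailures = 3) {
  const failures = new Map<string, number>();
  const isOpen = (name: string): boolean => (failures.get(name) ?? 0) >= maxFailures;

  function run(original: Message[]): Message[] {
    let current = original;
    for (const p of processors) {
      if (isOpen(p.name)) continue; // breaker open: processor is bypassed
      try {
        // Work on a deep copy so a half-finished rewrite cannot leak through.
        const copy = JSON.parse(JSON.stringify(current)) as Message[];
        current = p.onRequest(copy);
      } catch {
        // Fail open: `current` stays exactly as it was before this processor ran.
        failures.set(p.name, (failures.get(p.name) ?? 0) + 1);
      }
    }
    return current;
  }

  return { run, isOpen };
}

// Usage: a processor that always throws does not stop the one after it.
const broken: Processor = {
  name: "broken",
  onRequest: () => {
    throw new Error("boom");
  },
};
const trimmer: Processor = {
  name: "trimmer",
  onRequest: (msgs) => msgs.map((m) => ({ ...m, content: String(m.content).trim() })),
};

const pipeline = createPipeline([broken, trimmer], 2);
const firstRun = pipeline.run([{ role: "user", content: "  hello  " }]);
const secondRun = pipeline.run([{ role: "user", content: "  again  " }]);
```

After the second run the `broken` processor has failed twice, so its breaker is open and later requests bypass it entirely.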

Note what the last column deliberately does not contain: a per-processor savings percentage. Each row names the traffic a processor targets, not a number you can bank. How many billed dollars that traffic becomes depends on your workload and, above all, on how much of it Anthropic's prompt cache has already discounted (§4.2) — on Claude Code's default cached traffic, that can be most of it, leaving the realized saving small or near zero. The only savings figure TokenShield will show you is the before/after delta your own ledger records on your own traffic.

2.3 Conversation deduplication — the biggest single win

A Claude messages array can contain hundreds of tool_result blocks across turns. Many of those blocks are byte-identical to earlier ones (re-reads, idempotent lookups, schema queries).


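// Sketch of the dedup pass. Types are trimmed to the fields it touches, and
// canonicalize() is a stand-in for whatever normalization the proxy applies.
import { createHash } from "node:crypto";

type Block = { type: string; tool_use_id?: string; content?: unknown };
type Message = { role: string; content: string | Block[] };

const sha256 = (s: string): string => createHash("sha256").update(s).digest("hex");
const canonicalize = (c: unknown): string =>
  typeof c === "string" ? c : JSON.stringify(c ?? null);

function dedupe(messages: Message[]): Message[] {
  const seen = new Map<string, { messageIndex: number; toolUseId: string }>();
  for (const [i, msg] of messages.entries()) {
    if (typeof msg.content === "string") continue; // plain-text turns carry no tool results
    for (const block of msg.content) {
      if (block.type !== "tool_result" || !block.tool_use_id) continue;
      const hash = sha256(canonicalize(block.content));
      const prior = seen.get(hash);
      if (prior === undefined) {
        // First sighting: remember where this content lives and leave it intact.
        seen.set(hash, { messageIndex: i, toolUseId: block.tool_use_id });
        continue;
      }
      // Replace body with a pointer Claude can follow on demand (mutates in place).
      block.content = `[tokenshield: identical to tool_result ${prior.toolUseId} ` +
                      `at message ${prior.messageIndex}, sha:${hash.slice(0, 8)}]`;
    }
  }
  return messages;
}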

Claude follows the pointer naturally: if it needs the actual content again, it re-issues the tool call and the result cache answers it locally without re-running the tool. We've never seen Claude get confused by the pointer once the prompt establishes the convention.

The risk is a cold cache: if Claude needs the content and the cache has no entry, the tool has to run again and the turn waits on it. We've measured this at <1% of turns in real sessions, and the time saved by smaller context dominates.
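The result cache that backs those pointers can be sketched as a map keyed on the tool name plus a canonical form of its input. This is a minimal illustration, not the shipped processor: `createResultCache`, the TTL policy, and the injectable clock are assumptions introduced here, and a real cache would also need a rule for tools whose output changes between calls.

```typescript
type ToolCall = { name: string; input: unknown };

// Serialize with object keys sorted so {a:1,b:2} and {b:2,a:1} produce the same key.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((v) => canonicalJson(v)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalJson(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value) ?? "null"; // undefined has no JSON form
}

// Hypothetical TTL-based cache; `now` is injectable so expiry can be tested.
function createResultCache(ttlMs: number, now: () => number = Date.now) {
  const entries = new Map<string, { result: string; storedAt: number }>();
  const keyOf = (call: ToolCall): string => `${call.name}:${canonicalJson(call.input)}`;

  return {
    get(call: ToolCall): string | undefined {
      const key = keyOf(call);
      const hit = entries.get(key);
      if (hit === undefined) return undefined; // cold cache: caller runs the tool
      if (now() - hit.storedAt > ttlMs) {
        entries.delete(key); // expired entries are treated as cold
        return undefined;
      }
      return hit.result;
    },
    set(call: ToolCall, result: string): void {
      entries.set(keyOf(call), { result, storedAt: now() });
    },
  };
}
```

Sorting keys before hashing matters because two clients can serialize the same tool input in different key orders; without it, identical calls would miss the cache.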

2.4 Diff-based file reads

When Claude reads auth.ts at turn 5 and again at turn 30, we don't send 800 lines twice. We send 800 once, then a 12-line unified diff against the prior version:


[tokenshield: auth.ts — unchanged since message 5, except lines 142–153:
@@ -142,4 +142,12 @@
-export async function verify(token: string) {
+export async function verify(token: string, audience: string) {
+  if (audience !== EXPECTED_AUDIENCE) throw new Error("bad audience");
   ...
]

Claude parses unified diff natively — both Cursor and Claude Code have shown they can reason over diffs without confusion. On a file-heavy iterative loop a 12-line diff stands in for an 800-line re-read, so the raw-token reduction on those turns is large. But raw tokens are not billed dollars: if the original read is sitting in a cached prefix, Anthropic is already billing it at the discounted cached rate (§4.2), so collapsing it to a diff saves far less than the raw-token figure implies — and on a well-cached session it can save almost nothing. The ledger shows the real before/after delta on your own reads; that is the only number to act on.
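One simple way to produce such a stub is to trim the lines two versions share at the start and at the end and report only what is left. The sketch below does exactly that and nothing more: it is not a full Myers or LCS diff, it emits no context lines, and `singleHunkDiff` is a name invented for this illustration rather than TokenShield's actual diff routine.

```typescript
// Single-hunk diff: trim the common leading and trailing lines, report the rest.
// Scattered edits collapse into one larger hunk, so this over-reports on files
// that changed in several places.
function singleHunkDiff(before: string, after: string): string | null {
  const a = before.split("\n");
  const b = after.split("\n");

  // Count lines shared at the top of both versions.
  let start = 0;
  while (start < a.length && start < b.length && a[start] === b[start]) start++;

  // Walk back from the bottom over lines shared at the end.
  let endA = a.length;
  let endB = b.length;
  while (endA > start && endB > start && a[endA - 1] === b[endB - 1]) {
    endA--;
    endB--;
  }

  if (start === endA && start === endB) return null; // versions are identical

  const removed = a.slice(start, endA).map((line) => `-${line}`);
  const added = b.slice(start, endB).map((line) => `+${line}`);
  // Header in unified-diff style: 1-based start line and line counts per side.
  const header = `@@ -${start + 1},${endA - start} +${start + 1},${endB - start} @@`;
  return [header, ...removed, ...added].join("\n");
}

const oldFile = "a\nb\nc\nd";
const newFile = "a\nB\nc\nd";
const hunk = singleHunkDiff(oldFile, newFile);
// hunk is "@@ -2,1 +2,1 @@\n-b\n+B"
```

A `null` return is the case the dedup pointer already covers: nothing changed, so no diff needs to be sent at all.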

2.5 Streaming early-stop

Output tokens cost 5× input on Opus 4.7. A 3,000-token response that ends with "Would you like me to continue with the next file?" costs you $0.225 — and the user usually only wanted the first 800 tokens.

The stream-early-stop processor watches the streaming text delta for natural stop patterns:


/(?:Would you like me to|Should I (?:continue|proceed)|Let me know if you want)/i

When detected within ~200 tokens of a code-block-terminated message, the local dashboard surfaces a one-tap "stop here" button that closes the upstream SSE stream. Output cost stops immediately. The partial response is still delivered to your client because we forwarded byte-faithfully until the stop.

Default OFF — flips to default ON after 14 days of in-the-wild correctness validation per user.
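Because a stop phrase can be split across two streaming deltas, the detector has to match against a rolling tail of the stream rather than each delta on its own. A minimal sketch of that idea follows; it uses the pattern shown above, while `createStopWatcher` and the 200-character tail are assumptions made for illustration. It leaves out the code-block proximity check and the dashboard button.

```typescript
const STOP_PATTERN =
  /(?:Would you like me to|Should I (?:continue|proceed)|Let me know if you want)/i;

// Keep a bounded tail of the stream so a phrase split across deltas still matches.
function createStopWatcher(tailChars = 200) {
  let tail = "";
  let seenChars = 0;

  return {
    // Returns the character offset in the full stream where the stop phrase
    // begins, or null if no stop phrase has been seen yet.
    push(delta: string): number | null {
      const tailStart = seenChars - tail.length; // stream offset of tail[0]
      tail += delta;
      seenChars += delta.length;

      const match = STOP_PATTERN.exec(tail);
      if (match !== null) return tailStart + match.index;

      // No match: keep only the most recent characters for the next delta.
      if (tail.length > tailChars) tail = tail.slice(-tailChars);
      return null;
    },
  };
}

// Usage: the phrase straddles two deltas and is still caught on the second.
const watcher = createStopWatcher();
const firstDelta = "Here is the fix.\n```ts\nconst x = 1;\n```\nWould you li";
const secondDelta = "ke me to continue with the next file?";
const afterFirst = watcher.push(firstDelta);
const afterSecond = watcher.push(secondDelta);
```

The returned offset marks where the trailing question starts, which is the point at which a proxy could offer to close the upstream stream.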

2.6 Context auto-summarize (the cliff-protector)

Once a Claude Code session passes ~100K cumulative tokens, every new turn re-bills all 100K. A user 4 hours into a session is paying $1.50+ per turn at the uncached input rate (100K × $15/M), most of which is re-billing the same context.

The summarizer waits until you cross 100K, then makes a single Haiku 4.5 call that compresses turns 1..N into a ≤2K-token prefix. The next turn re-injects this as a synthetic assistant message at conversation start:


[tokenshield-summary: turns 1–42 compressed to 1,847 tokens.
Originals available via `tokenshield show-original <session-id>`.]

Claude doesn't notice — it just sees a more compact history. This processor is OFF by default until we can show, per workload, that it nets out positive against Anthropic's prompt-cache invalidation cost (see §4.2).


3. Privacy — the architectural commitment, not the marketing word

3.1 What never leaves your machine

  • Your ANTHROPIC_API_KEY. TokenShield does not read environment variables for keys, does not log Authorization headers, and does not persist credentials. The key flows through as a header from your client to Anthropic; we are a transparent forwarder.
  • The content of any prompt, tool result, or assistant message. Aggregates are bucketed and stripped before they leave the proxy.
  • The names of your MCP tools, file paths, or any identifiers that could reveal what you're working on.

3.2 What optional cloud telemetry does include

Off by default. When opted in, every 60s the proxy sends a payload like:


{
  "license": "tk_live_…",
  "bucket": "2026-05-16T22:00:00Z",
  "model": "claude-opus-4-7",
  "processor": "conversation-dedup",
  "input_tokens_raw": 184320,
  "input_tokens_sent": 71840,
  "output_tokens_raw": 12400,
  "output_tokens_sent": 12400,
  "dollars_saved": 1.687,
  "request_count": 18
}

No prompt, no tool_name, no text, no content. The schema is enforced at the source: any payload containing a forbidden key is dropped locally with an error logged for the user.
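A source-side check of this kind can be sketched as an allow-list over the keys in the sample payload. Note the swap: the text above describes dropping payloads that contain a forbidden key, whereas this sketch rejects anything not explicitly allowed, which is the stricter variant. `checkTelemetryPayload` and the `Verdict` type are hypothetical names, not the contents of telemetry-schema.ts.

```typescript
// Keys taken from the sample payload above; anything else is rejected.
const ALLOWED_KEYS = new Set<string>([
  "license", "bucket", "model", "processor",
  "input_tokens_raw", "input_tokens_sent",
  "output_tokens_raw", "output_tokens_sent",
  "dollars_saved", "request_count",
]);

type Verdict = { ok: true } | { ok: false; rejectedKeys: string[] };

function checkTelemetryPayload(payload: Record<string, unknown>): Verdict {
  const rejectedKeys: string[] = [];
  for (const key of Object.keys(payload)) {
    const value = payload[key];
    const isNested = typeof value === "object" && value !== null;
    // Unknown keys are rejected outright; nested values are rejected too,
    // because an object or array could smuggle content under an allowed key.
    if (!ALLOWED_KEYS.has(key) || isNested) rejectedKeys.push(key);
  }
  return rejectedKeys.length === 0 ? { ok: true } : { ok: false, rejectedKeys };
}

const clean = checkTelemetryPayload({ model: "claude-opus-4-7", request_count: 18 });
const leaky = checkTelemetryPayload({ model: "claude-opus-4-7", prompt: "refactor auth.ts" });
// clean.ok is true; leaky.ok is false with rejectedKeys ["prompt"]
```

A caller would drop the payload locally and log the rejected key names (never their values) whenever `ok` is false.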

3.3 Localhost binding by default

The proxy listens on 127.0.0.1 only. --bind 0.0.0.0 is opt-in with a 3-second warning prompt that displays the security implications. We will never ship a "default LAN-exposed" configuration.

3.4 Verifiability

The source is MIT-licensed and on GitHub. Every claim in this document is verifiable by reading packages/core/src/proxy/anthropic-passthrough.ts and packages/core/src/telemetry.ts. The cloud-side telemetry contract is packages/core/src/telemetry-schema.ts. There is no separate "enterprise" branch that does different things.


4. The math, honestly

4.1 Why these numbers compose multiplicatively — and why the total is a ceiling, not a forecast

A common marketing trap is to add the processors up: "30% dedup + 10% cache + 15% diff + 20% summarize = 75% off." That arithmetic is wrong, and a vendor who shows it to you is either confused or counting on you to be. Each processor only acts on the tokens the previous ones left behind — the diff pass only sees what dedup didn't already pointer-ify; the cache only catches what dedup let through. Savings don't add; they compound on a shrinking base:


surviving = 1.0
for each processor p in pipeline_order:
    saves_this_pass = surviving × p.efficiency
    surviving -= saves_this_pass
ceiling = 1.0 - surviving

To show how the composition behaves, here is that loop run with illustrative coefficients — round numbers chosen to demonstrate the mechanism, not measurements from your machine and not a number we are promising you:

| Processor | Illustrative coefficient (on raw surviving tokens) | Raw tokens surviving |
|---|---|---|
| Conversation dedup | 0.30 | 0.70 |
| Result cache | 0.07 of survivors | 0.65 |
| Diff-based file reads | 0.12 | 0.57 |
| Streaming early-stop | 0.18 of output | 0.47 (input + output blended) |
| Context auto-summarize | 0.20 | 0.38 |
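The same loop, written out in TypeScript with the illustrative coefficients from the table, makes the gap between adding and compounding concrete. This is a sketch of the arithmetic only: the stage names are labels chosen here, and like the table's running column it treats every coefficient as applying to the whole surviving base.

```typescript
type Stage = { name: string; efficiency: number };

// Illustrative coefficients from the table; not measurements.
const stages: Stage[] = [
  { name: "conversation-dedup", efficiency: 0.30 },
  { name: "result-cache", efficiency: 0.07 },
  { name: "diff-reads", efficiency: 0.12 },
  { name: "early-stop", efficiency: 0.18 },
  { name: "auto-summarize", efficiency: 0.20 },
];

// Each stage removes its fraction of whatever the previous stages left behind.
function compose(pipeline: Stage[]): { surviving: number; ceiling: number } {
  let surviving = 1.0;
  for (const p of pipeline) {
    surviving -= surviving * p.efficiency;
  }
  return { surviving, ceiling: 1.0 - surviving };
}

const composed = compose(stages);
const naiveSum = stages.reduce((sum, p) => sum + p.efficiency, 0);
// composed.surviving ≈ 0.376 and composed.ceiling ≈ 0.624
// naiveSum ≈ 0.87, which is what the "just add them up" arithmetic would claim
```

Adding the five coefficients gives 0.87; compounding them gives about 0.62, and that is before any of it is converted from raw tokens to billed dollars.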

Run end to end, those illustrative coefficients leave ≈0.38 of the raw tokens — a theoretical ceiling near 62% on _raw_-token reduction for this one hypothetical, fully uncached workload. That number is the single most over-quotable figure in this market, so we will defuse it ourselves:

  • It is a ceiling, not a forecast. It is the most arithmetic permits on a conversation Anthropic bills at full price for every token. It is not a prediction of your bill.
  • It is raw tokens, not billed dollars. Every coefficient above treats all tokens as full-price. On Claude Code's default prompt-cached traffic, much of what dedup and diffing target is already discounted to the cached rate (§4.2) — so the realized, billed saving is a fraction of this raw-token ceiling, and on a well-cached session it can round to near zero.
  • It is illustrative, not measured. We did not lift these coefficients off your workload, and we will not put a blended "~X% savings" on a slide, because we cannot honestly predict it for your sessions.

That is exactly why this document leads with the ledger and not a percentage. The only figure that means anything is the before/after delta TokenShield records on your traffic, in diff-mode, with each processor toggled — you read your number off your own data (§4.3).

4.2 When compression can cost you money

Anthropic prompt caching re-bills cached input at 10% of the normal rate. If TokenShield modifies the prefix on every turn, we invalidate the cache and what was previously cached goes back to the full input rate, roughly 10× what you were paying for it.

The conservative defaults reflect this:

  • Dedup pointer stubs are deterministic — same content always produces the same stub — so prompt caching still hits on stable prefixes.
  • Diff-based reads modify only the new file_read response, not the established history. Prompt cache holds.
  • Context auto-summarize invalidates the cache by definition (the prefix changes). That's why it stays OFF by default until we can prove the savings exceed the cache-invalidation cost for your specific workload. The dashboard surfaces this break-even line.
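The trade-off in the bullets above comes down to a per-turn comparison between a long prefix billed at the cached rate and a rewritten prefix billed at the full rate. The sketch below works that through using the $15/M input rate and the 10% cached-read figure from this paper. It deliberately ignores cache-write surcharges and the cost of the summarizer call itself, and the function names are hypothetical.

```typescript
const INPUT_USD_PER_MTOK = 15;    // Opus 4.7 input rate quoted in this paper
const CACHED_READ_FRACTION = 0.1; // cached input billed at 10% of the normal rate

// Input cost of one turn: `prefixTokens` of history plus `freshTokens` of new input.
function turnInputCost(prefixTokens: number, freshTokens: number, prefixCached: boolean): number {
  const prefixRate = prefixCached ? CACHED_READ_FRACTION : 1;
  return ((prefixTokens * prefixRate + freshTokens) / 1_000_000) * INPUT_USD_PER_MTOK;
}

// Largest uncached prefix that costs no more than the original prefix did while cached.
function breakEvenPrefixTokens(cachedPrefixTokens: number): number {
  return cachedPrefixTokens * CACHED_READ_FRACTION;
}

const cachedTurn = turnInputCost(100_000, 2_000, true);      // 0.15 + 0.03 = $0.18
const invalidatedTurn = turnInputCost(100_000, 2_000, false); // 1.50 + 0.03 = $1.53
const breakEven = breakEvenPrefixTokens(100_000);             // 10,000 tokens
```

Read this way, a rewrite that busts the cache only pays for itself on that turn if it shrinks a 100K cached prefix below about 10K tokens; a rewrite that leaves the prefix near its original size turns an $0.18 turn into a $1.53 one.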

4.3 We measure; we don't guarantee a percentage

We do not publish a guaranteed savings floor, and we do not sell a "X% or your money back" promise. The reason is in the math above: on prompt-cached workloads — Claude Code's default — much of the repeated context the dedup processor targets is already discounted by Anthropic's cache, so the realized billed-token savings is highly workload-dependent and can be small. A guarantee on a number we can't reliably predict for your workload would be dishonest, and for a trust tool that's disqualifying.

What we ship instead is the measurement. The local ledger records your real billed tokens — fresh input, cached reads, and output — before and after you enable each processor, in diff-mode, on your own traffic. You read your actual effect off your own data and decide. The subscription is month-to-month with a standard money-back window; there is no savings target to dispute because we never set one for you.


5. What we deliberately do not do

  • We do not run a hosted SaaS proxy. A hosted proxy would mean we receive every token your AI sends and receives. That's a liability we won't accept and a privacy story you shouldn't have to trust. Local-only forever for the free + standalone tiers; hosted may exist someday for teams that explicitly want it, with a published BAA.
  • We do not charge a percentage of measured savings. Measurement disputes eat support time on both sides. Flat per-seat pricing ($19/mo individual, $29/seat Team Standard bundled with Governance, $59/seat Team Pro) keeps the math simple for both of us.
  • We do not break Claude Code to save tokens. Every processor is replay-tested in CI against recorded sessions; PRs are blocked on any correctness regression. Diff-mode is on by default for your first 14 days so every modification is reviewable side-by-side.
  • We do not "use less Claude" by routing to other models silently. Model routing (TokenShield's sibling product Orchestra, shipping v1.1) is an explicit, configurable opt-in. We never substitute the model your code asked for without your policy permission.
  • We do not fight closed apps. Claude Desktop, ChatGPT desktop, and Gemini app do not expose a custom base URL setting. We don't ship system-level CA-cert tricks to intercept them. The supported integration list is what supports a base URL override.

6. What works today

| Tool | Status | Setup |
|---|---|---|
| Claude Code (CLI) | ✅ Live | export ANTHROPIC_BASE_URL=http://127.0.0.1:7777 |
| Cursor (Anthropic mode) | ✅ Live | Settings → Models → Anthropic → Custom Base URL |
| Windsurf (Anthropic mode) | ✅ Live | Settings → Models → Anthropic → Custom endpoint |
| Zed (Anthropic mode) | ✅ Live | settings.json → assistant.anthropic_api_url |
| Aider (Anthropic mode) | ✅ Live | Same ANTHROPIC_BASE_URL env var |
| Anthropic SDK apps (any language) | ✅ Live | The SDK respects the env var |
| OpenAI (Codex CLI, Cursor GPT, Continue) | 🕒 v1.1, week of 2026-06-14 | Same shape, different adapter |
| Google Gemini | 🕒 v1.2, week of 2026-06-21 | Harder (Google auth quirks); same adapter pattern |
| Claude Desktop (Anthropic GUI) | ❌ No plans | Doesn't expose a base URL setting |
| ChatGPT desktop | ❌ No plans | Session-based auth, no API key flow |

7. Implementation status

7.1 What ships in v0.1 (today)

  • Wire-faithful HTTP + SSE passthrough to api.anthropic.com
  • Token accounting from message_start / message_delta events
  • SQLite ledger on disk (no native deps; uses the node:sqlite module built into Node 22.5+)
  • Local dashboard at :7778
  • Fail-open middleware chain
  • Pro-grade CLI: setup, up, up --daemon, status, stop, logs, doctor, demo, estimate, integrations {list,enable,show,disable}
  • 40-test golden suite (16 core + 24 CLI)
  • MIT license
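The token-accounting item in the list above can be sketched as a small parser over the SSE body. The event and field names (`message_start`, `message_delta`, `input_tokens`, `cache_read_input_tokens`, `cache_creation_input_tokens`, `output_tokens`) follow Anthropic's streaming format as we understand it; the `Usage` shape and `usageFromSse` are hypothetical names, and a real proxy would parse incrementally as chunks arrive rather than from a buffered string.

```typescript
type Usage = { freshInput: number; cacheRead: number; cacheWrite: number; output: number };

function usageFromSse(body: string): Usage {
  const usage: Usage = { freshInput: 0, cacheRead: 0, cacheWrite: 0, output: 0 };
  for (const line of body.split(/\r?\n/)) {
    if (!line.startsWith("data:")) continue; // skip "event:" lines and blanks
    let event: any;
    try {
      event = JSON.parse(line.slice(5));
    } catch {
      continue; // fail open on anything unparseable
    }
    if (event?.type === "message_start") {
      const u = event.message?.usage ?? {};
      usage.freshInput = u.input_tokens ?? 0;
      usage.cacheRead = u.cache_read_input_tokens ?? 0;
      usage.cacheWrite = u.cache_creation_input_tokens ?? 0;
      usage.output = u.output_tokens ?? 0;
    } else if (event?.type === "message_delta") {
      // output_tokens in message_delta is cumulative, so overwrite rather than add.
      usage.output = event.usage?.output_tokens ?? usage.output;
    }
  }
  return usage;
}

const sampleStream = [
  "event: message_start",
  'data: {"type":"message_start","message":{"usage":{"input_tokens":12,"cache_read_input_tokens":4000,"cache_creation_input_tokens":0,"output_tokens":1}}}',
  "",
  "event: message_delta",
  'data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{"output_tokens":87}}',
  "",
].join("\n");

const ledgerRow = usageFromSse(sampleStream);
// ledgerRow: freshInput 12, cacheRead 4000, cacheWrite 0, output 87
```

Keeping fresh input, cached reads, and output as separate counters is what lets a ledger price each class at its own rate instead of lumping them into one token total.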

7.2 Shipping in v0.2 (week of 2026-05-24)

  • Conversation deduplication
  • Result cache
  • Diff-mode UI for trust-building
  • Cloud telemetry endpoint (opt-in)

7.3 Shipping in v0.3 (week of 2026-05-31)

  • Diff-based file reads
  • Streaming early-stop
  • Workload-tier classifier

7.4 v1.0 GA (week of 2026-06-07)

  • Context auto-summarize
  • Prompt-cache enforcer
  • Stripe checkout live (Solo $19/mo individual)
  • Public, per-workload savings measurement in the cloud dashboard (no guarantee — measured, not promised)
  • Public launch

7.5 Fast-follows

  • v1.1 OpenAI provider adapter (week 2026-06-14)
  • v1.2 Google Gemini provider adapter (week 2026-06-21)
  • v1.5 Tauri menu-bar app (Mac + Windows + Linux) — one-click installer, system-tray savings ticker, GUI wizards that call the same integrations library the CLI uses
  • Orchestra (sibling product) — explicit hybrid model routing + Anthropic failover, bundled into Team Pro
  • PrivacyShield (sibling product) — local PII tokenization, $99–$499/seat self-serve

8. How to try it in 60 seconds


npm install -g @curatedmcp/tokenshield
tokenshield setup                           # guided install
# or, manually:
tokenshield up                              # foreground
export ANTHROPIC_BASE_URL=http://127.0.0.1:7777
claude                                      # your normal workflow
open http://127.0.0.1:7778                  # live dashboard

For a no-network demo of the ledger and processor pipeline on sample traffic:


tokenshield demo

To verify the privacy claim on your own machine:


tokenshield doctor       # shows what env vars we see
tokenshield --json status | jq
# or read packages/core/src/telemetry-schema.ts

9. The business honesty

CuratedMCP, the parent company, is the local control plane for AI development: a curated, risk-classified MCP catalog and a Governance Control Plane for engineering teams ($29/seat). TokenShield is the consumer on-ramp to that funnel. The pitch to a Platform Lead is direct and measurement-honest: "See exactly what every engineer spends on Claude — per model, per project — and govern which MCP servers they run, from one control plane. Optimization is opt-in and measured, never promised."

We don't pretend to be neutral. The processor pipeline is MIT-licensed and runs on your machine; the dashboard is yours; the ledger is yours. We make money when your team adopts the bundle that includes governance, because that's where the proxy + the policy + the audit log compound into something a procurement department signs.

We deliberately do not sell a savings guarantee. The subscription is month-to-month with a standard money-back window, and the value you can verify on day one — the spend visibility — is real whether or not the optimization moves your particular bill.



Last updated 2026-06-13. Comments and corrections: open an issue at github.com/curatedmcp/tokenshield or email team@curatedmcp.com. We will revise this document as measurements come in from real users.

Found this useful? Two things help.

(1) Install and try it for a week; the numbers in this paper come from real users. (2) Post the whitepaper anywhere the audience would benefit.