Token optimization in the Postman plugin for Claude Code

Quinton Wall

Every AI coding agent has the same hidden tax: the context window. Anthropic’s guide to effective context engineering calls context “a critical yet limited resource,” and the research behind it is blunt: as the window fills, model accuracy degrades. The team calls this context rot. Token optimization isn’t only about cost. Every token your tooling injects is a token the model can’t spend reasoning about the user’s actual work.

That tax lands on plugin authors too. The Postman plugin for Claude Code is pure instructional Markdown: commands, skills, and agents that teach Claude the full Postman API lifecycle. There’s no runtime to profile and no binary to shrink. Its entire footprint is context-window tokens, paid inside every user’s session.

I recently ran a token-usage review on the plugin and shipped an optimization pass. The headline numbers: the plugin’s largest skill is now 60% lighter per trigger, the always-on overhead every session pays dropped by 20%, and a typical “explore an API and generate a client” session starts roughly 3,600 tokens lighter — about 65% less plugin overhead before any work happens. Here’s where the savings came from.

Where a Claude Code plugin spends tokens

A plugin built from Markdown spends tokens in three distinct ways, and they’re not equally expensive:

  1. Always-on cost. Every skill, command, and agent description in the YAML front matter is injected into every session’s system prompt, whether or not the user touches Postman that day. These are the most expensive tokens in the plugin: every user pays for them in every session.
  2. Per-trigger cost. When Claude decides a skill is relevant, the entire SKILL.md body loads into context. A 19 KB skill costs roughly 4,800 tokens every time it fires, even if the user only needed a third of it. This layered loading model is documented in the Claude Code skills docs.
  3. Runtime cost. Tool output, async polling loops, and verbose narration while a command runs. Tool schemas add up fast here too — a discussion on the MCP specification repo measured roughly 1,000 tokens per complex tool definition.
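The three cost types can be put into a rough budget. The sketch below is an illustration only: the 4-bytes-per-token ratio is an assumption inferred from the figures in this post (a 19 KB skill at roughly 4,800 tokens), not a tokenizer measurement, and the function names are hypothetical.

```python
# Rough token budgeting for a Markdown-only plugin.
# BYTES_PER_TOKEN is an assumed heuristic for mostly-ASCII Markdown,
# not the output of a real tokenizer.
BYTES_PER_TOKEN = 4


def estimate_tokens(num_bytes: int) -> int:
    """Approximate token count for a file or text block of num_bytes."""
    return num_bytes // BYTES_PER_TOKEN


def session_cost(always_on_bytes: int, triggered_skill_bytes: list[int]) -> int:
    """Always-on descriptions plus the body of every skill that fired."""
    return estimate_tokens(always_on_bytes) + sum(
        estimate_tokens(size) for size in triggered_skill_bytes
    )


# A 19 KB skill body, taken here as 19,000 bytes.
print(estimate_tokens(19_000))
# A 3,182-byte description block plus one trigger of that skill.
print(session_cost(3_182, [19_000]))
```

For exact figures, compare the /context output in a real session rather than relying on the heuristic.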

60% lighter skills with progressive disclosure

The biggest saving came from the per-trigger cost. Anthropic’s engineering team describes progressive disclosure as the foundational pattern of the Agent Skills standard: metadata loads at startup, the SKILL.md body loads when relevant, and bundled reference files load only when a specific step needs them.

A skill doesn’t need to front-load every rule it might ever apply. It needs the workflow, plus pointers to detailed rules that Claude reads at the step that needs them:

## Step 4: Generate the client code

Before writing any code, read `references/code-generation.md` in this
skill's directory. It contains the full rule catalog for idiomatic
client generation.

We applied this split to the plugin’s two largest skills:

| Skill | Before | After | Saving per trigger |
| --- | --- | --- | --- |
| postman-context | ~4,760 tokens | ~1,930 tokens | ~2,800 tokens (60%) |
| generate-spec | ~2,640 tokens | ~1,800 tokens | ~840 tokens (32%) |

No content was deleted. The detailed rules and templates are intact, deferred rather than removed. A user who asks “find me an email API” no longer pays ~2,800 tokens for code-generation rules they aren’t using. A user who does generate code pays the same total as before.
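To find candidates for the same split in another plugin, a small audit script can help. This is a minimal sketch, not the tooling used for this review: it assumes each skill lives in its own directory with a SKILL.md and an optional references/ folder, reuses a rough 4-bytes-per-token heuristic, and flags bodies over 10 KB. The function name and the result keys are hypothetical.

```python
from pathlib import Path

BYTES_PER_TOKEN = 4             # assumed heuristic, not a tokenizer measurement
SPLIT_THRESHOLD_BYTES = 10_000  # rule of thumb: consider splitting above ~10 KB


def audit_skills(root: Path) -> list[dict]:
    """Report per-trigger and deferred size for every SKILL.md under root."""
    rows = []
    for skill_md in sorted(root.glob("**/SKILL.md")):
        body_bytes = skill_md.stat().st_size
        ref_dir = skill_md.parent / "references"
        # Reference files load only when a step asks for them.
        ref_bytes = (
            sum(p.stat().st_size for p in ref_dir.glob("*.md"))
            if ref_dir.is_dir()
            else 0
        )
        rows.append(
            {
                "skill": skill_md.parent.name,
                "body_tokens": body_bytes // BYTES_PER_TOKEN,
                "deferred_tokens": ref_bytes // BYTES_PER_TOKEN,
                "should_split": body_bytes > SPLIT_THRESHOLD_BYTES,
            }
        )
    return rows
```

Calling `audit_skills(Path("skills"))` from a plugin checkout returns one row per skill; `body_tokens` is what every trigger pays, and `deferred_tokens` is paid only when a step reads the reference files.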

100% of the routing skill, gone

The plugin shipped a postman-routing skill (roughly 835 tokens) whose trigger was “use when user mentions APIs.” That’s broad enough to fire in nearly any backend coding session, Postman-related or not. Its body was a routing table that restated what every command’s description already tells Claude.

Modern Claude Code routes natively by matching user intent against component descriptions, so the skill was pure duplicate state. We deleted it.

Saving: ~835 tokens in every session where it fired — and given that trigger, it fired in most sessions in API codebases.

20% off the always-on overhead

Several command descriptions enumerated long quoted trigger-phrase lists:

description: Run Postman collection tests using Postman CLI - use when
  user says "run tests", "run collection", "run my postman tests",
  "verify changes", "check if tests pass", or wants to execute API
  test suites after code changes

Claude’s router doesn’t need a phrasebook. It needs the capability and when to use it:

description: Run Postman Collection tests with the Postman CLI and
  report failures. Use after code changes or when the user asks to
  run API tests.

Rewriting these, combined with the routing-skill removal, shrank the always-on description block from 3,182 to 2,562 bytes, a 20% reduction worth ~155 tokens in every session of every plugin user. Because every session pays for them, descriptions are the highest-leverage bytes in the repo. Anthropic’s Agent Skills announcement makes the same point: discovery metadata works best as one or two tight sentences.
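A quick way to check a description rewrite is to measure the bytes of the description value before and after. The sketch below uses the two descriptions shown above; the parser is a deliberately naive, hypothetical helper that only handles a `description:` key with indented continuation lines, not full YAML.

```python
BEFORE = """description: Run Postman collection tests using Postman CLI - use when
  user says "run tests", "run collection", "run my postman tests",
  "verify changes", "check if tests pass", or wants to execute API
  test suites after code changes
"""

AFTER = """description: Run Postman Collection tests with the Postman CLI and
  report failures. Use after code changes or when the user asks to
  run API tests.
"""


def description_bytes(front_matter: str) -> int:
    """UTF-8 size of the description value with continuation lines folded."""
    parts: list[str] = []
    collecting = False
    for line in front_matter.splitlines():
        if line.startswith("description:"):
            parts.append(line[len("description:"):].strip())
            collecting = True
        elif collecting and line[:1].isspace():
            parts.append(line.strip())  # indented continuation line
        elif collecting:
            break  # next key or blank line ends the value
    return len(" ".join(parts).encode("utf-8"))


before, after = description_bytes(BEFORE), description_bytes(AFTER)
print(f"before={before} bytes, after={after} bytes, saved={before - after} bytes")
```

Summing this over every command, skill, and agent in a plugin gives the size of the always-on block.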

90% fewer tool schemas for subagents

Every MCP-backed command and the plugin’s readiness-analyzer agent previously declared a wildcard, granting access to all 100+ tools on the Postman MCP Server. Each now lists exactly what it uses. The readiness analyzer went from 111 tools to 11 — 90% fewer — and the setup command declares six.

For subagents and clients that resolve tool schemas from the allowlist, that’s an order of magnitude fewer schemas loaded into context. It also keeps the model from wandering into unrelated tools mid-command, which matches Anthropic’s guidance that overlapping tool sets create ambiguity. As a bonus, the audit surfaced three latent permission bugs where commands were instructed to write or edit files without the permissions to do so. Wildcards hide that class of bug; explicit lists make it visible in review.
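The audit described above can be approximated with a few lines of code. This is a hedged sketch rather than the process actually used: it assumes you have already extracted each component's declared tool list and the set of tools its instructions rely on. The tool names in the example are illustrative only and are not taken from the Postman MCP Server.

```python
from fnmatch import fnmatchcase


def audit_allowed_tools(declared: list[str], used: set[str]) -> dict[str, list[str]]:
    """Compare a component's declared tools with the tools it relies on."""
    return {
        # Wildcard grants pull in every matching tool and hide gaps.
        "wildcards": [d for d in declared if "*" in d],
        # Tools the instructions need but no declaration covers.
        "missing": sorted(
            u for u in used if not any(fnmatchcase(u, d) for d in declared)
        ),
        # Explicit grants that nothing in the component uses.
        "unused": sorted(d for d in declared if "*" not in d and d not in used),
    }


# Hypothetical component: a wildcard MCP grant plus Read, while the
# instructions also tell the model to write a file.
report = audit_allowed_tools(
    declared=["mcp__postman__*", "Read"],
    used={"mcp__postman__getCollection", "Read", "Write"},
)
print(report)
```

In this example the `missing` entry is the kind of latent permission bug the audit surfaced: an instruction to write files with no matching grant.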

Less chatter from async workflows

Some Postman MCP Server tools return HTTP 202 and require polling for completion. Left to its own devices, the model will happily narrate every poll, and all of it accumulates in context. The affected commands now carry two lines of instruction: poll with increasing waits (2s, then 4s, then 8s), and report only the final outcome. Fewer round-trips, less narration, same result.
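The plugin expresses this as Markdown instructions to the model, not as code. Purely to illustrate the schedule those instructions describe, here is a minimal sketch in Python; `poll_until_done`, its parameters, and the fake status check are all hypothetical.

```python
import time
from typing import Callable, Optional, Sequence


def poll_until_done(
    check: Callable[[], Optional[str]],
    waits: Sequence[float] = (2, 4, 8),
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll check() with increasing waits and return only the final outcome.

    check() returns None while the job is still running. Nothing is
    reported between polls; None is returned if the job never finishes.
    """
    result = check()
    for wait in waits:
        if result is not None:
            break
        sleep(wait)
        result = check()
    return result


# Demo with a fake job that finishes on the third check. The sleep
# function is replaced with a recorder so the demo runs instantly.
responses = iter([None, None, "completed"])
slept: list[float] = []
outcome = poll_until_done(lambda: next(responses), sleep=slept.append)
print(outcome, slept)
```

Three checks and one reported result replace a running commentary on every poll.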

What it adds up to

| Change | Who pays today | Saving |
| --- | --- | --- |
| Routing skill removed | Nearly every session in an API codebase | ~835 tokens/session (100% of the skill) |
| Description trims | Every session, every user | ~155 tokens/session (20% of always-on overhead) |
| postman-context split | Every session that triggers the skill | up to ~2,800 tokens/trigger (60%) |
| generate-spec split | Every session that triggers the skill | up to ~840 tokens/trigger (32%) |
| Scoped allowed-tools | Subagent spawns; eager-loading clients | 90% fewer schemas loaded |
| Polling guidance | Long-running async commands | variable; fewer round-trips and less narration |

A typical “explore an API and generate a client” session that previously loaded the routing skill plus the full postman-context skill now starts roughly 3,600 tokens (about 65%) lighter. A session that never touches Postman saves ~990 tokens it used to spend on routing overhead. We also updated the plugin’s contributor docs so the conventions stick: descriptions stay short, bulky skill content goes in references/, and allowed-tools lists explicit tool names.
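The session-level figures follow from the per-change numbers. The arithmetic below is a sketch under one assumption: that the "typical session" baseline is the routing skill plus the full pre-split postman-context body, which is what the percentages above imply.

```python
# Per-change figures quoted in this post (approximate token counts).
ROUTING_SKILL = 835
CONTEXT_BEFORE = 4_760
CONTEXT_SPLIT_SAVING = 2_800
DESCRIPTION_TRIM = 155

# Typical "explore an API and generate a client" session.
baseline = ROUTING_SKILL + CONTEXT_BEFORE        # overhead before the change
saved = ROUTING_SKILL + CONTEXT_SPLIT_SAVING     # routing skill gone, skill split
pct = saved / baseline * 100

# Session that never touches Postman: routing skill plus description trims.
idle_saving = ROUTING_SKILL + DESCRIPTION_TRIM

print(f"typical session: {saved} of {baseline} tokens saved ({pct:.0f}%)")
print(f"non-Postman session: {idle_saving} tokens saved")
```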

Real-world test: API spec drift detection end-to-end

Numbers in a table are one thing. Watching them play out on a real task is more useful.

To validate the optimization work, I ran the same prompt against a live GitHub repository using both the pre-optimization and post-optimization versions of the plugin. The task was a non-trivial agentic workflow: scan a workspace for API spec drift, validate the contract, and open a pull request with any code fixes.

look at my workspace. Identify all API spec drift. Ensure the API contract is valid.
Open a PR in https://github.com/buildwithtalia/enterprise-resource-planning
with any code fixes.

Pre-optimization (v1.1.x)

Total cost:               $3.60
Total duration (API):     15m 52s
Total duration (wall):    17m 10s
Total code changes:       1159 lines added, 13 lines removed
Usage by model:
  claude-haiku-4-5:       104.3k input, 22.8k output, 0 cache read, 0 cache write ($0.2181)
  claude-sonnet-4-6:      1.8k input, 60.5k output, 4.5m cache read, 298.7k cache write ($3.38)

Actual tokens of new work (input + cache write + output, excluding re-reads): ~361k tokens for Sonnet alone, plus ~127k for Haiku background tasks.

Post-optimization (v1.2.0)

Total cost:               $2.64
Total duration (API):     7m 47s
Total duration (wall):    9m 12s
Total code changes:       119 lines added, 12 lines removed
Usage by model:
  claude-haiku-4-5:       482 input, 18 output, 0 cache read, 0 cache write ($0.0006)
  claude-sonnet-4-6:      395 input, 31.4k output, 5.5m cache read, 137.3k cache write ($2.64)

Actual tokens of new work: ~169k tokens (395 input + 137.3k cache write + 31.4k output). Haiku dropped to ~500 tokens — effectively background housekeeping.

What the numbers say

| Metric | Pre-optimization | Post-optimization | Change |
| --- | --- | --- | --- |
| Session cost | $3.60 | $2.64 | 27% cheaper |
| Wall time | 17m 10s | 9m 12s | ~46% faster |
| Actual new tokens (Sonnet) | ~361k | ~169k | ~53% fewer |
| Cache efficiency (read:write ratio) | ~15:1 | ~40:1 | 2.7× better |
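The derived columns can be reproduced from the raw Sonnet figures in the two session summaries. A short sketch follows; the dictionary keys and helper names are hypothetical, and the token counts are the rounded values printed above.

```python
# Sonnet figures from the two session summaries (rounded as printed).
PRE = {"cost": 3.60, "wall_s": 17 * 60 + 10, "input": 1_800,
       "output": 60_500, "cache_read": 4_500_000, "cache_write": 298_700}
POST = {"cost": 2.64, "wall_s": 9 * 60 + 12, "input": 395,
        "output": 31_400, "cache_read": 5_500_000, "cache_write": 137_300}


def new_work_tokens(session: dict) -> int:
    """Input + cache write + output, excluding cached re-reads."""
    return session["input"] + session["cache_write"] + session["output"]


def pct_drop(before: float, after: float) -> float:
    """Percentage reduction from before to after."""
    return (before - after) / before * 100


def cache_ratio(session: dict) -> float:
    """Cache reads per cache write."""
    return session["cache_read"] / session["cache_write"]


print(f"cost:      {pct_drop(PRE['cost'], POST['cost']):.0f}% cheaper")
print(f"wall time: {pct_drop(PRE['wall_s'], POST['wall_s']):.0f}% faster")
print(f"new work:  {pct_drop(new_work_tokens(PRE), new_work_tokens(POST)):.0f}% fewer tokens")
print(f"cache:     {cache_ratio(PRE):.0f}:1 -> {cache_ratio(POST):.0f}:1")
```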

The 5.5 million cache reads in the post-optimization session might look alarming until you remember that cache reads are billed at about 10% of the base input-token price. The 40:1 read-to-write ratio means the context window was stable across turns, so almost nothing had to be re-cached. Of the $2.64, cache reads account for roughly $1.65, cache writes ~$0.51, and output tokens ~$0.47. That’s exactly what a well-cached session should look like.
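That breakdown can be checked with simple arithmetic. The per-million-token prices below are assumptions, chosen because they reproduce the figures in this post; check current pricing before reusing them.

```python
# Assumed USD prices per million tokens for the Sonnet model in this run.
# These are assumptions that reproduce the breakdown above, not quoted rates.
PRICE_PER_MTOK = {
    "input": 3.00,
    "cache_write": 3.75,
    "cache_read": 0.30,
    "output": 15.00,
}

# Post-optimization Sonnet usage, as printed in the session summary.
USAGE = {
    "input": 395,
    "cache_write": 137_300,
    "cache_read": 5_500_000,
    "output": 31_400,
}

# Cost of each token category in dollars.
costs = {
    kind: tokens / 1_000_000 * PRICE_PER_MTOK[kind]
    for kind, tokens in USAGE.items()
}
total = sum(costs.values())

for kind, cost in costs.items():
    print(f"{kind:12s} ${cost:.2f}")
print(f"{'total':12s} ${total:.2f}")
```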

There’s also a more interesting data point buried in the code-changes figures: the pre-optimization session added 1,159 lines and removed 13; the post-optimization session added 119 and removed 12. In this run, the optimized plugin didn’t just use fewer tokens. It used them more precisely, producing a tighter, more surgical diff for the same task. Less context noise, more signal.

Takeaways for plugin authors

If you maintain or are about to publish a Claude Code plugin, these are the rules I’d start with:

  1. Treat front-matter descriptions as the most expensive real estate you own. They’re injected into every session. One or two sentences: what it does, when to use it.
  2. Progressive disclosure beats monolithic skills. Keep SKILL.md to the workflow, around 6 KB or less, and move templates, rule catalogs, and edge-case handling to references/*.md files the skill reads on demand.
  3. Don’t build what the harness already does. A routing skill that duplicates Claude’s native description-based routing costs tokens twice and creates a maintenance hazard.
  4. Scope allowed-tools to what each component calls. It’s least-privilege hygiene, it loads fewer schemas where that matters, and the audit itself tends to find permission bugs.
  5. Make polling cheap. Any async workflow needs explicit backoff and final-result-only reporting, or the model narrates every poll.

What I like about this kind of token optimization work is that none of it required clever engineering. It required measuring where the tokens go and being honest about which ones earn their place. The same context-engineering discipline Anthropic recommends for agents applies one level down, to the tooling we hand them.

All of these savings are live today: install the latest version of the Postman plugin from the Claude Plugin Marketplace and your next session picks them up automatically. If you’re building your own plugin, clone the plugin repo to see the patterns in place, then check what your own skills cost per trigger. If your largest SKILL.md is over 10 KB, try the references/ split and compare your /context output before and after.
