How We Cut Claude Code Session Overhead with Lazy-Loaded Personas
If you use Claude Code with a heavily customized CLAUDE.md, every message you send carries that full file as context. Not just once at session start — on every turn.
That matters more than most people realize.
The Problem: Eager-Loading Everything
The naive approach to building a multi-persona system in Claude Code is to define all your personas directly in CLAUDE.md. It feels clean — everything in one place, always available.
The cost: if you have 23 specialist personas, each defined in 150-200 lines, you're looking at 3,000-5,000 tokens of persona definitions loaded on every single message — regardless of whether the current task has anything to do with a UX designer or a financial analyst.
Claude Code's CLAUDE.md is not a one-time setup file. It is re-injected into context on every turn. The larger it is, the more tokens you burn before you type a word.
The Pattern: Route First, Load on Demand
The fix is the same pattern software engineers have used for decades: don't load what you don't need until you need it.
Instead of embedding persona definitions in CLAUDE.md, you define a lightweight routing engine that reads signal words from the user's message and loads the relevant persona file on demand.
Eager approach (expensive):
```markdown
# Personas

## Mary (Business Analyst)
Mary is a meticulous analyst who investigates existing state... [150 more lines]

## Amelia (Developer Agent)
Amelia is an execution-focused developer who builds and edits files... [150 more lines]

## Winston (Architect)
Winston designs systems, data flows, and infrastructure... [150 more lines]

... 20 more persona blocks
```
Lazy approach (efficient):
```markdown
## Persona Routing
Read routing-engine.md on every session. Load personas on demand from
~/.claude/prism/ when triggered by signal words. Only the active
persona's file is in context.
```
With this structure, CLAUDE.md stays lean. The routing engine (routing-engine.md) is a single file that maps signal words to persona file paths. When a message contains "architecture" or "schema," Claude reads persona-architect-winston.md. When it contains "brainstorm" or "ideate," it reads persona-brainstorm-coach-carson.md. Everything else stays off-context.
Why This Matters Right Now
In April 2026, Claude Code users started reporting session costs 10-20x higher than expected. The root cause: a caching bug where context that should have been served from cache was re-tokenized and re-charged on every turn.
Eager-loading large CLAUDE.md files makes this worse. The bigger your baseline context, the higher your exposure when the cache misses. A 5,000-token persona block that should cost fractions of a cent per session can become a material cost per message when caching breaks.
Lazy-loading is not a fix for the cache bug. It is a structural hedge. Smaller baseline context means less blast radius when something goes wrong with token accounting — and it means lower costs even when everything works correctly.
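The "blast radius" point is easy to put in numbers. The prices below are illustrative placeholders, not current Anthropic rates, and the token counts are the round figures from this article; the shape of the result is what matters, not the exact dollars.

```python
# Back-of-envelope cost of re-sending a fixed context block every turn.
# Prices are illustrative assumptions, NOT actual Anthropic pricing.
PRICE_CACHED = 0.30    # assumed cache-read price, $ per 1M tokens
PRICE_UNCACHED = 3.00  # assumed full input price, $ per 1M tokens

def session_cost(baseline_tokens: int, turns: int, price_per_mtok: float) -> float:
    """Dollar cost of a context block re-injected on every turn of a session."""
    return baseline_tokens * turns * price_per_mtok / 1_000_000

EAGER = 5_000  # all 23 persona definitions, every turn
LAZY = 400     # routing section only; one persona loads on demand

print(session_cost(EAGER, 50, PRICE_CACHED))    # eager, cache working
print(session_cost(EAGER, 50, PRICE_UNCACHED))  # eager, cache broken: 10x worse
print(session_cost(LAZY, 50, PRICE_UNCACHED))   # lazy, cheap even uncached
</antml>```

Under these assumptions, the lazy layout when caching is fully broken still costs less than the eager layout when caching works. That is the hedge.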
How to Apply This Pattern
You don't need a 23-persona routing system to benefit from this. Three steps work for any Claude Code setup:
- Audit your CLAUDE.md token weight.
Paste it into a tokenizer (Anthropic's tokenizer playground, or tiktoken as a rough proxy), or run `wc -w CLAUDE.md` for a fast estimate. If you're over 1,000 words, you have room to trim.
- Move reference content to separate files.
Anything that isn't needed on every turn belongs in its own file. Coding style guides, persona definitions, workflow references, architecture docs — pull them out of CLAUDE.md and into named files in your .claude/ directory.
- Add a routing section that tells Claude what to load and when.
```markdown
## Reference Files (load on demand)

- Coding standards: Read ~/.claude/reference/coding-style.md when writing or reviewing code
- Architecture patterns: Read ~/.claude/reference/architecture.md when designing systems
- Deployment guide: Read ~/.claude/reference/deployment.md when working on CI/CD
```
Claude Code follows these instructions literally. The file only enters context when the task requires it.
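For step 1 of the audit, a word count converts to a usable token estimate without any external tooling. The ~1.3 tokens-per-word ratio below is a common rough heuristic for English prose, not an exact figure; use a real tokenizer when you need billing-grade numbers.

```python
# Rough token estimate from a word count. The 1.3 tokens-per-word
# ratio is a heuristic assumption for English prose, not an exact count.
def estimate_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    return round(len(text.split()) * tokens_per_word)

# Usage (path is illustrative):
# with open("CLAUDE.md") as f:
#     print(estimate_tokens(f.read()))
```

If the estimate comes back over ~1,300 tokens (about 1,000 words), that is the trimming threshold the steps above suggest.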
PRISM Forge
This pattern is the foundation of PRISM Forge, an open-source Claude Code persona routing system with 23 specialist personas that load on demand via signal-word routing. The full implementation is at github.com/prism-forge/prism.
The token savings are real. The architecture is simple. And the pattern applies to any Claude Code setup — no persona system required.
If you're building autonomous Claude Code workflows and want this architecture set up for your team, reach out on LinkedIn.
Originally published on DEV Community: https://dev.to/drakkotarkin/how-we-cut-claude-code-session-overhead-with-lazy-loaded-personas-3ann
