Claude Code's Usage Limit Workaround: Switch to Previous Model with /compact
A concrete workflow to avoid Claude Code's usage limits: use the previous model version with the /compact flag set to 200k tokens for long, technical sessions.
The Problem: Usage Burns Too Fast for Technical Work
If you're using Claude Code for serious technical work—repo cleanup, long document rewrites, or multi-step code refactors—you've likely hit the usage limit wall. As discussed in a recent Reddit thread, developers report burning through their allocated usage "absurdly fast" even with disciplined, minimal setups. The core issue isn't casual chat; it's the need for continuity in complex tasks where resetting a session breaks the workflow.
The Solution: Model Version + Context Compression
The specific advice circulating among power users is a two-part configuration change:
- Switch to the previous Claude model. Don't use the latest Opus 4.6 for extended, iterative sessions if you're hitting limits.
- Use the /compact flag with a 200k token target. This tells Claude Code to aggressively compress the conversation history, prioritizing recent context.
You can apply this when starting a Claude Code session from your terminal:
claude code --model claude-3-5-sonnet-20241022 --compact 200000
Or, set it in your CLAUDE.md configuration for persistence:
Model: claude-3-5-sonnet-20241022
Compact: 200000
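If you script your own tooling around this configuration, the key: value format above is trivial to parse. A minimal sketch (this is an illustration; Claude Code's actual CLAUDE.md handling may differ):

```python
# Parse the simple "Key: value" config format shown above.
# Illustrative only -- not Claude Code's real config loader.

def parse_config(text: str) -> dict[str, str]:
    config = {}
    for line in text.splitlines():
        if ":" in line:
            # Split on the first colon only, so values may contain colons.
            key, _, value = line.partition(":")
            config[key.strip().lower()] = value.strip()
    return config
```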
Why This Works: Token Economics
The latest models, like Claude Opus 4.6, are incredibly capable but also more computationally expensive per token. For long sessions where the model re-processes the entire conversation history on each turn, this cost compounds rapidly. The previous generation models (like claude-3-5-sonnet) offer a vastly better performance-to-cost ratio for extended coding and analysis tasks.
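The compounding effect is easy to see with a back-of-envelope calculation: if each turn re-sends the full conversation so far, cumulative input tokens grow roughly quadratically with the number of turns. A quick sketch (the per-turn token count is a made-up illustration, not a measured figure):

```python
# Rough model of per-session input-token growth when every turn
# re-processes the entire history. Turn k re-sends ~k turns of context.

def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    return sum(k * tokens_per_turn for k in range(1, turns + 1))

# 4 turns at ~1,000 tokens each already costs 10,000 input tokens,
# not 4,000 -- and the gap widens every turn.
```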
The /compact flag is the other critical lever. By default, Claude Code may retain a vast amount of context. Setting --compact 200000 instructs the system to aim for a 200k token context window, actively summarizing or dropping older parts of the conversation to stay near that target. This prevents the silent usage drain from endlessly growing context.
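The mechanism can be sketched in a few lines. This is an illustrative model of token-budget compaction, not Claude Code's actual implementation; the 4-characters-per-token heuristic and the summary stub are assumptions for the example:

```python
# Illustrative token-budget compaction: drop the oldest turns until
# the history fits the budget, leaving a stub noting what was removed.

def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def compact_history(messages: list[str], budget: int = 200_000) -> list[str]:
    total = sum(estimate_tokens(m) for m in messages)
    dropped = 0
    # Remove oldest-first until under budget, always keeping the latest turn.
    while total > budget and len(messages) > 1:
        total -= estimate_tokens(messages.pop(0))
        dropped += 1
    if dropped:
        messages.insert(0, f"[compacted: {dropped} earlier messages summarized]")
    return messages
```

Claude Code summarizes rather than simply dropping, but the budget-driven trimming loop is the core idea: recent context survives verbatim, older context is collapsed.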
Implementing the Workflow
Don't just change the model—adapt your prompting style to work with compression.
- Segment Large Tasks: Break a massive repo refactor into logical, folder-by-folder sessions. Use a final summary prompt at the end of each segment to hand off context.
- Be Explicit About Files: When context is compressed, file contents can be dropped. Use commands like /read explicitly when you need to revisit a file, rather than assuming it's in memory.
- Guide the Compression: After a significant milestone, you can prompt: "Please summarize the changes we've made to utils/ so far for context compression." This gives the model a high-quality summary to retain.
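If you do this at the end of every segment, it's worth templating the handoff. A hypothetical helper (the function name and prompt wording are illustrative, not part of Claude Code):

```python
# Hypothetical helper: build a compression-friendly handoff prompt
# from a list of completed changes, so the compacted context retains
# a dense record of prior work in this segment.

def handoff_prompt(segment: str, changes: list[str]) -> str:
    bullets = "\n".join(f"- {c}" for c in changes)
    return (
        f"Summary of completed work in {segment} (retain for context):\n"
        f"{bullets}\n"
        "Please keep this summary when compacting the conversation."
    )
```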
This approach isn't about using shorter prompts; it's about smarter session management that aligns with how Claude Code's usage is calculated. For many developers, this single configuration shift has turned daily limit hits into a weekly occurrence.
Originally published on gentic.news