Models claude model language model available assistant valuation

Preliminary Explorations on Latent Side Task Uplift

LessWrong AIby Bruce W. LeeApril 2, 20266 min read0 views

TL;DR . This document presents a series of experiments exploring latent side task capability in large language models. We adapt Ryan’s filler token experiment into a more AI Control-like setup with main task and side task and find that Claude Opus 4.5 can solve harder arithmetic problems latently when it has a longer trajectory. This shifts its 50% accuracy threshold from ~5-step to ~6-step problems after 240 lines of irrelevant output. However, we don’t observe strong evidence to believe that current generation of models generally benefit much from wider parallel compute enabled by longer trajectories with the exception of Opus 4.5. Code is made available here GitHub . Longer Agent Outputs Can Increase Side Task Capability. Claude Opus 4.5's latent arithmetic accuracy as a function of pro

TL;DR. This document presents a series of experiments exploring latent side task capability in large language models. We adapt Ryan’s filler token experiment into a more AI Control-like setup with main task and side task and find that Claude Opus 4.5 can solve harder arithmetic problems latently when it has a longer trajectory. This shifts its 50% accuracy threshold from ~5-step to ~6-step problems after 240 lines of irrelevant output. However, we don’t observe strong evidence to believe that current generation of models generally benefit much from wider parallel compute enabled by longer trajectories with the exception of Opus 4.5.

Code is made available here GitHub.

Longer Agent Outputs Can Increase Side Task Capability. Claude Opus 4.5's latent arithmetic accuracy as a function of problem difficulty (number of operations), comparing 1 line vs. 240 lines of preceding irrelevant output. With 240 lines of filler, the 50% accuracy threshold shifts from ~5-step to ~6-step problems

Dataset. We use a balanced dataset of 1,000 arithmetic problems, with 100 problems per difficulty level (1–10 operations). Each problem consists of multi-step arithmetic involving addition, subtraction, multiplication, and division.

Latent Arithmetic Accuracy Decreases with Problem Complexity. Baseline latent arithmetic accuracy (no main task, no reasoning) across four models. Opus 4.5 reaches 50% accuracy at ~5.2 operations, followed by Sonnet 4 and GPT-5.2 at ~3.8, and GPT-4.1 at ~3.1.

Claude Models Show Modest Gains from Phrase Repetition

Here, we use a system prompt that contains an arithmetic problem and instructions to answer it in an tag. The user prompt asks the model to repeat an irrelevant phrase ("the quick brown fox") N times, numbered sequentially, and then answer the math problem. The key variable is N, swept across [1, 2, 3, 4, 5, 10, 20, 40, 80, 160, 240]. At N=1, the model writes one line of filler before answering. At N=240, it writes 240 lines first. When the model produces the answer, its forward pass can attend back to all those positions, providing more parallel computation for the math problem.

Reasoning is disabled via API configuration. Additionally, we have a format validation script that filters out responses where the model attempts to reason at all. We also tested with the phrase "i really like math" to check whether math-related filler might prime latent reasoning about the side task.

Claude Models Show Modest Gains from Phrase Repetition. Side task arithmetic accuracy vs. number of phrase repetitions (N) across four models, using two phrase variants ("the quick brown fox" and "i really like math"). Claude Sonnet 4 and Opus 4.5 show modest upward trends with increasing N, while GPT-5.2 and GPT-4.1 show no clear effect. Dashed lines indicate zero-shot baselines. Phrase content does not meaningfully affect the trend.

Does Generating vs. Prefilling Prose Lead to Different Side Task Performance?

We also experiment with prose writing. When the model generates prose, it has freedom in word selection. Therefore, the model could, in principle, encode side task-relevant information into its lexical choices. This could range from explicit steganography to softer biases. To check if the model would be able to benefit from this additional freedom, we compare prefilled prose vs inference-time-generated prose.

Generated and prefilled prose show no meaningful difference in side task accuracy, and temperature has no significant effect.

Generated vs. Prefilled Prose Show No Difference. Claude Opus 4.5's side task accuracy when preceding output is prose, comparing (a) generated vs. (b) prefilled at temperature=0 and temperature=1.

Appendix: How Is Reasoning Suppressed?

Reasoning is suppressed through three mechanisms:

First, the OpenRouter API parameter was set to {"reasoning": {"effort": "none"}} which disables extended thinking, so the model gets no scratchpad.
Second, the prompt explicitly instructs "No reasoning, no explanation, no steps."
Third, format validation filters out any response containing text beyond the required phrase repetitions and the tag. Sometimes, the model sneaks in reasoning steps like "let me compute 46 + -52 = -6...". Those responses are discarded.

As a sanity check, we ran the same zero-shot arithmetic evaluation on Claude Opus 4.5 with reasoning effort set to "high" instead of "none" to see if this parameter is properly used by OpenRouter:

With effort=high, the model achieves 99%+ accuracy across all difficulty levels. Claude Opus 4.5's zero-shot arithmetic accuracy with (a) reasoning effort set to "none" and (b) reasoning effort set to "high."

Appendix: Does Latent Side Task Capability Uplift Saturate?

Our results show that Opus 4.5's latent side task accuracy hadn't clearly saturated at 240 phrase repetitions, the highest default condition tested. To find the ceiling, we extended the prefilled phrase repetition experiment to 10,240 repetitions (which is around 40% of the 200K context window). However, filling an additional 35% of context beyond the 5% point seems to provide no further benefit. The total maximum uplift seems to be about +0.8 operations on the 50% correct threshold.

Latent Side Task Capability Saturates by ~5% of Context Window. Claude Opus 4.5's 50% accuracy threshold (number of arithmetic operations) as a function of prefilled phrase repetitions, extending to 10,240 repetitions (40% of the 200K context window).

Appendix: Does Vague Reasoning Text Help?

We also ask whether models could produce vague, reasoning-adjacent filler text that aids latent computation without actually reasoning about the problem. We tested this by comparing three conditions on Claude Opus 4.5, plotted on a shared x-axis of filler token count so the comparison is at matched token budget:

Phrase repetition (prefill): "the quick brown fox" repeated N times, prefilled in the assistant turn. This is the baseline from the existing main experiments.
Vague reasoning (prefill): Cycling through 20 vague reasoning-like sentences (e.g., "ok, let me consider the first part and think about it", "now I need to combine these pieces together carefully"), prefilled in the assistant's turn. The model did not write this text.
Vague reasoning (generated): The model is prompted to "think out loud in vague terms about problem-solving" for N numbered lines without referencing specific numbers or operations. The model generates this text itself at inference-time, with the math problem visible in the system prompt.

Generated vague reasoning performs worst, and the model does not seem to exploit the vague reasoning channel to encode problem-specific information.

Does Vague Reasoning Help Latent Side Task Capability? Claude Opus 4.5's 50% accuracy threshold as a function of filler token count, comparing phrase repetition (prefill), vague reasoning (prefill), and vague reasoning (generated). At matched token budgets, predictable phrase repetitions outperform vague reasoning text.

Acknowledgement

I'm grateful to Tomek Korbak, John Yueh-Han Chen, Beth Barnes, and Thomas Broadley for discussions on this topic.

Original source

LessWrong AI

https://www.lesswrong.com/posts/ftKQi7jtWpCqKhjmQ/preliminary-explorations-on-latent-side-task-uplift

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

claudemodellanguage model

ModelsLive

Google launches Gemma 4 with a broad licensing model - Techzine Global

Google launches Gemma 4 with a broad licensing model Techzine Global

Google News: DeepMind

1mabout 1 hour ago

ModelsRecent

Anthropic Races to Contain Leak of Code Behind Claude AI Agent - WSJ

Anthropic Races to Contain Leak of Code Behind Claude AI Agent WSJ

GNews AI open source

1m1 day ago

ModelsLive

Google DeepMind unveils Gemma 4: Next-Gen AI models for advanced reasoning - financialexpress.com

Google DeepMind unveils Gemma 4: Next-Gen AI models for advanced reasoning financialexpress.com

Google News: DeepMind

1mabout 1 hour ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 186 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Models

ModelsLive

Google launches Gemma 4 with a broad licensing model - Techzine Global

Google launches Gemma 4 with a broad licensing model Techzine Global

Google News: DeepMind

1mabout 1 hour ago

ModelsRecent

Anthropic Races to Contain Leak of Code Behind Claude AI Agent - WSJ

Anthropic Races to Contain Leak of Code Behind Claude AI Agent WSJ

GNews AI open source

1m1 day ago

ModelsLive

Google DeepMind unveils Gemma 4: Next-Gen AI models for advanced reasoning - financialexpress.com

Google DeepMind unveils Gemma 4: Next-Gen AI models for advanced reasoning financialexpress.com

Google News: DeepMind

1mabout 1 hour ago

ModelsFresh

The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration

arXiv:2603.22862v2 Announce Type: replace Abstract: Tool use enables large language models (LLMs) to access external information, invoke software systems, and act in digital environments beyond what can be solved from model parameters alone. Early research mainly studied whether a model could select and execute a correct single tool call. As agent systems evolve, however, the central problem has shifted from isolated invocation to multi-tool orchestration over long trajectories with intermediate state, execution feedback, changing environments, and practical constraints such as safety, cost, and verifiability. We comprehensively review recent progress in multi-tool LLM agents and analyzes the state of the art in this rapidly developing area. First, we unify task formulations and distinguis

arXiv cs.SE

1mabout 4 hours ago