Live
Black Hat USAAI BusinessBlack Hat AsiaAI BusinessGenerative AI: A Legal Framework in Development - group.bnpparibasGoogle News: Generative AIS. Korea, France Bolster Ties in AI, Quantum Computing - KBS WORLD RadioGNews AI KoreaGoogle launches Gemma 4 with a broad licensing model - Techzine GlobalGoogle News: DeepMindDesktop Nightly v2.2.0-nightly.202604030631LobeChat ReleasesMan uses AI to build $1 billion telehealth company, but secret sauce is GLP-1 drug - India TodayGNews AI IndiaThe Missing Data Problem Behind Broken Computer-Use AgentsHackernoon AINVIDIA’s $2 billion sprinkler remaking the AI supply chain - Asia TimesGNews AI NVIDIAWyldheart developer Wayfinder Studios is "really against generative AI" - Gamereactor UKGoogle News: Generative AItrunk/1ebcc6eaef93880d2e4ca10851d7a40298bbb15aPyTorch ReleasesWayfinder Opposes Generative AI In Wyldheart - Let's Data ScienceGoogle News: Generative AIGoogle DeepMind unveils Gemma 4: Next-Gen AI models for advanced reasoning - financialexpress.comGoogle News: DeepMindAI for EMEA Manufacturers: Databricks' $850m UK Investment - manufacturingdigital.comGNews AI manufacturingBlack Hat USAAI BusinessBlack Hat AsiaAI BusinessGenerative AI: A Legal Framework in Development - group.bnpparibasGoogle News: Generative AIS. Korea, France Bolster Ties in AI, Quantum Computing - KBS WORLD RadioGNews AI KoreaGoogle launches Gemma 4 with a broad licensing model - Techzine GlobalGoogle News: DeepMindDesktop Nightly v2.2.0-nightly.202604030631LobeChat ReleasesMan uses AI to build $1 billion telehealth company, but secret sauce is GLP-1 drug - India TodayGNews AI IndiaThe Missing Data Problem Behind Broken Computer-Use AgentsHackernoon AINVIDIA’s $2 billion sprinkler remaking the AI supply chain - Asia TimesGNews AI NVIDIAWyldheart developer Wayfinder Studios is "really against generative AI" - Gamereactor UKGoogle News: Generative AItrunk/1ebcc6eaef93880d2e4ca10851d7a40298bbb15aPyTorch ReleasesWayfinder Opposes Generative AI In Wyldheart - Let's Data ScienceGoogle News: Generative AIGoogle DeepMind unveils Gemma 4: Next-Gen AI models for advanced reasoning - financialexpress.comGoogle News: DeepMindAI for EMEA Manufacturers: Databricks' $850m UK Investment - manufacturingdigital.comGNews AI manufacturing
AI NEWS HUBbyEIGENVECTOREigenvector

Preliminary Explorations on Latent Side Task Uplift

LessWrong AIby Bruce W. LeeApril 2, 20266 min read0 views
Source Quiz

TL;DR . This document presents a series of experiments exploring latent side task capability in large language models. We adapt Ryan’s filler token experiment into a more AI Control-like setup with main task and side task and find that Claude Opus 4.5 can solve harder arithmetic problems latently when it has a longer trajectory. This shifts its 50% accuracy threshold from ~5-step to ~6-step problems after 240 lines of irrelevant output. However, we don’t observe strong evidence to believe that current generation of models generally benefit much from wider parallel compute enabled by longer trajectories with the exception of Opus 4.5. Code is made available here GitHub . Longer Agent Outputs Can Increase Side Task Capability. Claude Opus 4.5's latent arithmetic accuracy as a function of pro

TL;DR. This document presents a series of experiments exploring latent side task capability in large language models. We adapt Ryan’s filler token experiment into a more AI Control-like setup with main task and side task and find that Claude Opus 4.5 can solve harder arithmetic problems latently when it has a longer trajectory. This shifts its 50% accuracy threshold from ~5-step to ~6-step problems after 240 lines of irrelevant output. However, we don’t observe strong evidence to believe that current generation of models generally benefit much from wider parallel compute enabled by longer trajectories with the exception of Opus 4.5.

Code is made available here GitHub.

Longer Agent Outputs Can Increase Side Task Capability. Claude Opus 4.5's latent arithmetic accuracy as a function of problem difficulty (number of operations), comparing 1 line vs. 240 lines of preceding irrelevant output. With 240 lines of filler, the 50% accuracy threshold shifts from ~5-step to ~6-step problems

Dataset. We use a balanced dataset of 1,000 arithmetic problems, with 100 problems per difficulty level (1–10 operations). Each problem consists of multi-step arithmetic involving addition, subtraction, multiplication, and division.

Latent Arithmetic Accuracy Decreases with Problem Complexity. Baseline latent arithmetic accuracy (no main task, no reasoning) across four models. Opus 4.5 reaches 50% accuracy at ~5.2 operations, followed by Sonnet 4 and GPT-5.2 at ~3.8, and GPT-4.1 at ~3.1.

Claude Models Show Modest Gains from Phrase Repetition

Here, we use a system prompt that contains an arithmetic problem and instructions to answer it in an tag. The user prompt asks the model to repeat an irrelevant phrase ("the quick brown fox") N times, numbered sequentially, and then answer the math problem. The key variable is N, swept across [1, 2, 3, 4, 5, 10, 20, 40, 80, 160, 240]. At N=1, the model writes one line of filler before answering. At N=240, it writes 240 lines first. When the model produces the answer, its forward pass can attend back to all those positions, providing more parallel computation for the math problem.

Reasoning is disabled via API configuration. Additionally, we have a format validation script that filters out responses where the model attempts to reason at all. We also tested with the phrase "i really like math" to check whether math-related filler might prime latent reasoning about the side task.

Claude Models Show Modest Gains from Phrase Repetition. Side task arithmetic accuracy vs. number of phrase repetitions (N) across four models, using two phrase variants ("the quick brown fox" and "i really like math"). Claude Sonnet 4 and Opus 4.5 show modest upward trends with increasing N, while GPT-5.2 and GPT-4.1 show no clear effect. Dashed lines indicate zero-shot baselines. Phrase content does not meaningfully affect the trend.

Does Generating vs. Prefilling Prose Lead to Different Side Task Performance?

We also experiment with prose writing. When the model generates prose, it has freedom in word selection. Therefore, the model could, in principle, encode side task-relevant information into its lexical choices. This could range from explicit steganography to softer biases. To check if the model would be able to benefit from this additional freedom, we compare prefilled prose vs inference-time-generated prose.

Generated and prefilled prose show no meaningful difference in side task accuracy, and temperature has no significant effect.

Generated vs. Prefilled Prose Show No Difference. Claude Opus 4.5's side task accuracy when preceding output is prose, comparing (a) generated vs. (b) prefilled at temperature=0 and temperature=1.

Appendix: How Is Reasoning Suppressed?

Reasoning is suppressed through three mechanisms:

  • First, the OpenRouter API parameter was set to {"reasoning": {"effort": "none"}} which disables extended thinking, so the model gets no scratchpad.
  • Second, the prompt explicitly instructs "No reasoning, no explanation, no steps."
  • Third, format validation filters out any response containing text beyond the required phrase repetitions and the tag. Sometimes, the model sneaks in reasoning steps like "let me compute 46 + -52 = -6...". Those responses are discarded.

As a sanity check, we ran the same zero-shot arithmetic evaluation on Claude Opus 4.5 with reasoning effort set to "high" instead of "none" to see if this parameter is properly used by OpenRouter:

With effort=high, the model achieves 99%+ accuracy across all difficulty levels. Claude Opus 4.5's zero-shot arithmetic accuracy with (a) reasoning effort set to "none" and (b) reasoning effort set to "high."

Appendix: Does Latent Side Task Capability Uplift Saturate?

Our results show that Opus 4.5's latent side task accuracy hadn't clearly saturated at 240 phrase repetitions, the highest default condition tested. To find the ceiling, we extended the prefilled phrase repetition experiment to 10,240 repetitions (which is around 40% of the 200K context window). However, filling an additional 35% of context beyond the 5% point seems to provide no further benefit. The total maximum uplift seems to be about +0.8 operations on the 50% correct threshold.

Latent Side Task Capability Saturates by ~5% of Context Window. Claude Opus 4.5's 50% accuracy threshold (number of arithmetic operations) as a function of prefilled phrase repetitions, extending to 10,240 repetitions (40% of the 200K context window).

Appendix: Does Vague Reasoning Text Help?

We also ask whether models could produce vague, reasoning-adjacent filler text that aids latent computation without actually reasoning about the problem. We tested this by comparing three conditions on Claude Opus 4.5, plotted on a shared x-axis of filler token count so the comparison is at matched token budget:

  • Phrase repetition (prefill): "the quick brown fox" repeated N times, prefilled in the assistant turn. This is the baseline from the existing main experiments.
  • Vague reasoning (prefill): Cycling through 20 vague reasoning-like sentences (e.g., "ok, let me consider the first part and think about it", "now I need to combine these pieces together carefully"), prefilled in the assistant's turn. The model did not write this text.
  • Vague reasoning (generated): The model is prompted to "think out loud in vague terms about problem-solving" for N numbered lines without referencing specific numbers or operations. The model generates this text itself at inference-time, with the math problem visible in the system prompt.

Generated vague reasoning performs worst, and the model does not seem to exploit the vague reasoning channel to encode problem-specific information.

Does Vague Reasoning Help Latent Side Task Capability? Claude Opus 4.5's 50% accuracy threshold as a function of filler token count, comparing phrase repetition (prefill), vague reasoning (prefill), and vague reasoning (generated). At matched token budgets, predictable phrase repetitions outperform vague reasoning text.

Acknowledgement

I'm grateful to Tomek Korbak, John Yueh-Han Chen, Beth Barnes, and Thomas Broadley for discussions on this topic.

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by Eigenvector · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

Knowledge Map

Knowledge Map
TopicsEntitiesSource
Preliminary…claudemodellanguage mo…availableassistantvaluationLessWrong AI

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 186 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!

More in Models