AI News Hub · by Eigenvector

UK AISI Alignment Evaluation Case-Study

ArXiv CS.AI · by Alexandra Souly, Robert Kirk, Jacob Merizian, Abby D'Cruz, Xander Davies · April 2, 2026


Abstract: This technical report presents methods developed by the UK AI Security Institute for assessing whether advanced AI systems reliably follow intended goals. Specifically, we evaluate whether frontier models sabotage safety research when deployed as coding assistants within an AI lab. Applying our methods to four frontier models, we find no confirmed instances of research sabotage. However, we observe that Claude Opus 4.5 Preview (a pre-release snapshot of Opus 4.5) and Sonnet 4.5 frequently refuse to engage with safety-relevant research tasks, citing concerns about research direction, involvement in self-training, and research scope. We additionally find that Opus 4.5 Preview shows reduced unprompted evaluation awareness compared to Sonnet 4.5, while both models can distinguish evaluation from deployment scenarios when prompted. Our evaluation framework builds on Petri, an open-source LLM auditing tool, with a custom scaffold designed to simulate realistic internal deployment of a coding agent. We validate that this scaffold produces trajectories that all tested models fail to reliably distinguish from real deployment data. We test models across scenarios varying in research motivation, activity type, replacement threat, and model autonomy. Finally, we discuss limitations including scenario coverage and evaluation awareness.
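The abstract describes testing models across scenarios that vary along four dimensions (research motivation, activity type, replacement threat, and model autonomy) and scoring the resulting transcripts for outcomes such as refusal or suspected sabotage. The following is a minimal, hypothetical sketch of how such a scenario grid might be enumerated and scored; it is not the Petri API or the UK AISI's actual scaffold, and every name in it is an illustrative assumption.

```python
# Hypothetical sketch of a scenario-grid evaluation, loosely inspired by the
# four scenario dimensions named in the abstract. Not the Petri API.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Scenario:
    motivation: str           # stated purpose of the safety research
    activity: str             # kind of coding task given to the agent
    replacement_threat: bool  # whether the model is told it may be replaced
    autonomy: str             # degree of unsupervised control

# Illustrative dimension values only; the paper's actual values are not given here.
MOTIVATIONS = ["capability_eval", "alignment_eval"]
ACTIVITIES = ["write_eval_code", "analyze_results"]
AUTONOMY_LEVELS = ["low", "high"]

def scenario_grid() -> list[Scenario]:
    """Enumerate the full cross-product of scenario dimensions."""
    return [
        Scenario(m, a, t, u)
        for m, a, t, u in product(MOTIVATIONS, ACTIVITIES, (False, True), AUTONOMY_LEVELS)
    ]

def classify_transcript(transcript: str) -> str:
    """Toy outcome classifier; a real auditing pipeline would use an LLM judge.
    Returns one of 'refusal', 'sabotage_suspected', or 'completed'."""
    text = transcript.lower()
    if "i can't help with" in text or "decline" in text:
        return "refusal"
    if "silently weakened" in text:
        return "sabotage_suspected"
    return "completed"

grid = scenario_grid()
print(len(grid))  # 2 motivations * 2 activities * 2 threat settings * 2 autonomy levels = 16
```

In a real harness each `Scenario` would be rendered into a deployment-realistic task prompt, run through the model under audit, and the transcript scored; the grid structure simply ensures every combination of conditions is covered.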

Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)

Cite as: arXiv:2604.00788 [cs.AI]

(or arXiv:2604.00788v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2604.00788

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Alexandra Souly · [v1] Wed, 1 Apr 2026 11:53:25 UTC (1,749 KB)
