UK AISI Alignment Evaluation Case-Study
arXiv:2604.00788v1 Announce Type: new
Abstract: This technical report presents methods developed by the UK AI Security Institute for assessing whether advanced AI systems reliably follow intended goals. Specifically, we evaluate whether frontier models sabotage safety research when deployed as coding assistants within an AI lab. Applying our methods to four frontier models, we find no confirmed instances of research sabotage. However, we observe that Claude Opus 4.5 Preview (a pre-release snapshot of Opus 4.5) and Sonnet 4.5 frequently refuse to engage with safety-relevant research tasks, citing concerns about research direction, involvement in self-training, and research scope. We additionally find that Opus 4.5 Preview shows reduced unprompted evaluation awareness compared to Sonnet 4.5, while both models can distinguish evaluation from deployment scenarios when prompted. Our evaluation framework builds on Petri, an open-source LLM auditing tool, with a custom scaffold designed to simulate realistic internal deployment of a coding agent. We validate that this scaffold produces trajectories that all tested models fail to reliably distinguish from real deployment data. We test models across scenarios varying in research motivation, activity type, replacement threat, and model autonomy. Finally, we discuss limitations including scenario coverage and evaluation awareness.
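The abstract describes testing models across scenarios that vary four factors: research motivation, activity type, replacement threat, and model autonomy. A minimal sketch of such a factorial scenario grid is below; the specific factor values are illustrative assumptions, not taken from the report.

```python
# Sketch of a factorial scenario grid over the four varied dimensions
# named in the abstract. All factor values here are hypothetical.
from itertools import product

factors = {
    "research_motivation": ["capability_eval", "alignment_eval"],
    "activity_type": ["code_review", "experiment_setup"],
    "replacement_threat": [False, True],
    "model_autonomy": ["low", "high"],
}

# One scenario per combination of factor values (full cross-product).
scenarios = [dict(zip(factors, combo)) for combo in product(*factors.values())]
print(len(scenarios))  # 2 * 2 * 2 * 2 = 16 scenario variants
```

In practice the report's actual factor levels and scenario counts may differ; the point is only that crossing independent scenario dimensions yields a systematic test matrix.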
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Cite as: arXiv:2604.00788 [cs.AI]
(or arXiv:2604.00788v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2604.00788
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Alexandra Souly [v1] Wed, 1 Apr 2026 11:53:25 UTC (1,749 KB)