Live
Black Hat USAAI BusinessBlack Hat AsiaAI BusinessDySCo: Dynamic Semantic Compression for Effective Long-term Time Series ForecastingarXivUQ-SHRED: uncertainty quantification of shallow recurrent decoder networks for sparse sensing via engressionarXivAn Online Machine Learning Multi-resolution Optimization Framework for Energy System Design Limit of Performance AnalysisarXivMalliavin Calculus for Counterfactual Gradient Estimation in Adaptive Inverse Reinforcement LearningarXivEfficient and Principled Scientific Discovery through Bayesian Optimization: A TutorialarXivMassively Parallel Exact Inference for Hawkes ProcessesarXivModel Merging via Data-Free Covariance EstimationarXivDetecting Complex Money Laundering Patterns with Incremental and Distributed Graph ModelingarXivForecasting Supply Chain Disruptions with Foresight LearningarXivSven: Singular Value Descent as a Computationally Efficient Natural Gradient MethodarXivSECURE: Stable Early Collision Understanding via Robust Embeddings in Autonomous DrivingarXivJetPrism: diagnosing convergence for generative simulation and inverse problems in nuclear physicsarXivBlack Hat USAAI BusinessBlack Hat AsiaAI BusinessDySCo: Dynamic Semantic Compression for Effective Long-term Time Series ForecastingarXivUQ-SHRED: uncertainty quantification of shallow recurrent decoder networks for sparse sensing via engressionarXivAn Online Machine Learning Multi-resolution Optimization Framework for Energy System Design Limit of Performance AnalysisarXivMalliavin Calculus for Counterfactual Gradient Estimation in Adaptive Inverse Reinforcement LearningarXivEfficient and Principled Scientific Discovery through Bayesian Optimization: A TutorialarXivMassively Parallel Exact Inference for Hawkes ProcessesarXivModel Merging via Data-Free Covariance EstimationarXivDetecting Complex Money Laundering Patterns with Incremental and Distributed Graph ModelingarXivForecasting Supply Chain Disruptions with Foresight LearningarXivSven: Singular Value Descent as a Computationally Efficient Natural Gradient MethodarXivSECURE: Stable Early Collision Understanding via Robust Embeddings in Autonomous DrivingarXivJetPrism: diagnosing convergence for generative simulation and inverse problems in nuclear physicsarXiv
AI NEWS HUBbyEIGENVECTOREigenvector

How Do Language Models Process Ethical Instructions? Deliberation, Consistency, and Other-Recognition Across Four Models

arXiv cs.CLby Hiroki FukuiApril 2, 20262 min read0 views
Source Quiz

arXiv:2604.00021v1 Announce Type: new Abstract: Alignment safety research assumes that ethical instructions improve model behavior, but how language models internally process such instructions remains unknown. We conducted over 600 multi-agent simulations across four models (Llama 3.3 70B, GPT-4o mini, Qwen3-Next-80B-A3B, Sonnet 4.5), four ethical instruction formats (none, minimal norm, reasoned norm, virtue framing), and two languages (Japanese, English). Confirmatory analysis fully replicated the Llama Japanese dissociation pattern from a prior study ($\mathrm{BF}_{10} > 10$ for all three hypotheses), but none of the other three models reproduced this pattern, establishing it as model-specific. Three new metrics -- Deliberation Depth (DD), Value Consistency Across Dilemmas (VCAD), and O

View PDF HTML (experimental)

Abstract:Alignment safety research assumes that ethical instructions improve model behavior, but how language models internally process such instructions remains unknown. We conducted over 600 multi-agent simulations across four models (Llama 3.3 70B, GPT-4o mini, Qwen3-Next-80B-A3B, Sonnet 4.5), four ethical instruction formats (none, minimal norm, reasoned norm, virtue framing), and two languages (Japanese, English). Confirmatory analysis fully replicated the Llama Japanese dissociation pattern from a prior study ($\mathrm{BF}_{10} > 10$ for all three hypotheses), but none of the other three models reproduced this pattern, establishing it as model-specific. Three new metrics -- Deliberation Depth (DD), Value Consistency Across Dilemmas (VCAD), and Other-Recognition Index (ORI) -- revealed four distinct ethical processing types: Output Filter (GPT; safe outputs, no processing), Defensive Repetition (Llama; high consistency through formulaic repetition), Critical Internalization (Qwen; deep deliberation, incomplete integration), and Principled Consistency (Sonnet; deliberation, consistency, and other-recognition co-occurring). The central finding is an interaction between processing capacity and instruction format: in low-DD models, instruction format has no effect on internal processing; in high-DD models, reasoned norms and virtue framing produce opposite effects. Lexical compliance with ethical instructions did not correlate with any processing metric at the cell level ($r = -0.161$ to $+0.256$, all $p > .22$; $N = 24$; power limited), suggesting that safety, compliance, and ethical processing are largely dissociable. These processing types show structural correspondence to patterns observed in clinical offender treatment, where formal compliance without internal processing is a recognized risk signal.

Comments: 34 pages, 7 figures, 4 tables. Preprint. OSF pre-registration: this http URL. Companion paper: arXiv:2603.04904

Subjects:

Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)

Cite as: arXiv:2604.00021 [cs.CL]

(or arXiv:2604.00021v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2604.00021

arXiv-issued DOI via DataCite

Submission history

From: Hiroki Fukui M.D. Ph.D. [view email] [v1] Wed, 11 Mar 2026 03:20:16 UTC (138 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by Eigenvector · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

llamamodellanguage model

Knowledge Map

Knowledge Map
TopicsEntitiesSource
How Do Lang…llamamodellanguage mo…announceintegrationanalysisarXiv cs.CL

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 237 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!

More in Models