Evaluating alignment of behavioral dispositions in LLMs

Google Research Blog · April 3, 2026 · 8 min read

Generative AI

As LLMs integrate into our daily lives, understanding their behavior becomes essential. In our ongoing efforts to study model behavior and alignment, we present this work as an early step in that direction. We focus on behavioral dispositions — the underlying tendencies that shape responses in social contexts — and introduce a framework to study how closely the dispositions expressed by LLMs align with those of humans.

Behavioral dispositions are typically quantified via self-report questionnaires organized by trait (e.g., empathy, assertiveness), in which individuals rate their agreement with preference statements such as "I am quick to express an opinion." The questionnaires used in this study are standardized, scientifically validated measures widely used for assessing personality traits in international research and psychology, such as the IRI (empathy) and the ERQ (emotion regulation), among others. Each instrument is grounded in peer-reviewed literature that establishes its psychometric validity and reliability using different strategies. We chose the most widely used instruments for our research.

Our objective is to build upon such psychological questionnaires, but directly applying them to LLMs presents technical challenges, as LLM outputs are sensitive to prompt phrasing and distribution shifts. Consequently, dispositions “claimed” by LLMs within a self-report format are not guaranteed to successfully transfer to behavior in realistic, open-ended settings.

To address these challenges, our paper “Evaluating Alignment of Behavioral Dispositions in LLMs” introduces a framework that evaluates LLMs’ behavioral dispositions in realistic user-assistant scenarios where their advisory role can lead to tangible impact. This study is an early step in evaluating the alignment between human consensus and model behavior across realistic, practical scenarios, focusing on everyday human-to-human interactions and workplace situations. We ensure that these scenarios remain grounded in established psychological questionnaires to capture the essence of core behavioral traits. Tested scenarios included professional composure, conflict resolution, practical tasks such as booking a trip, and lifestyle or daily decision-making, highlighting model behavior in settings representative of typical human day-to-day experiences. Our large-scale analysis of 25 LLMs reveals two kinds of gaps: one where model dispositions deviate from consensus among human annotators, and another where model dispositions do not capture the range of human opinions when consensus is absent. These early results highlight the opportunity for better behavioral alignment, so that models can more appropriately navigate the nuances of social dynamics, and we expect future research to build on them.

From self-report to situational judgment

We start by collecting statements from established, scientifically validated psychological questionnaires and adapting them into declarations of the model’s general advising tendency. The adapted statements are then used to generate Situational Judgment Tests (SJTs), an assessment methodology widely used in psychology, behavioral prediction, and other fields. Across these fields, SJTs are the standard for evaluating behavioral competencies and judgment in complex environments. These tests typically consist of realistic scenarios presenting two possible courses of action: one supporting a specific behavioral trait and one opposing it. In our research, each SJT is reviewed by three independent annotators to validate that the (LLM-generated) scenario and actions are coherent and faithfully capture the underlying behavioral markers being tested.

During the evaluation, the model is prompted with the SJT as input and generates a natural response, which is mapped to one of the two courses of action using an LLM-as-a-judge.

Since our goal is not to quantify LLMs’ behavioral dispositions, but to study the extent of their alignment with human behavior, we collect preferred actions from 10 annotators per SJT from a pool of 550 participants, and compare the resulting human preference distribution to the distribution of model responses in each scenario.
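The comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the action labels "support" and "oppose", the function name action_distribution, the number of sampled model responses, and the sample data are all hypothetical.

```python
from collections import Counter

# Hypothetical labels for the two courses of action in an SJT:
# one supporting the behavioral trait and one opposing it.
ACTIONS = ("support", "oppose")


def action_distribution(choices):
    """Return the empirical probability of each action in a list of choices."""
    if not choices:
        raise ValueError("need at least one choice")
    counts = Counter(choices)
    total = len(choices)
    return {action: counts[action] / total for action in ACTIONS}


# Illustrative data for a single scenario (values are made up):
# 10 human annotators, and 20 model responses already mapped to an
# action by an LLM-as-a-judge (the sample size of 20 is an assumption).
human_choices = ["support"] * 8 + ["oppose"] * 2
model_choices = ["support"] * 5 + ["oppose"] * 15

human_dist = action_distribution(human_choices)
model_dist = action_distribution(model_choices)
```

In this made-up scenario the annotators favor the trait-supporting action 8 to 2, while the model mostly chooses the opposing action, which is the kind of gap the framework is designed to surface.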

Directional alignment of LLMs’ behavioral dispositions

Here we focus on a subset of scenarios where there is a consensus between human annotators on the preferred course of action. Alignment in these cases is important, as failure to manifest or suppress a trait under strong human agreement suggests a behavioral profile that tends to act differently than typical human behavioral patterns.

We define directional alignment as an interpretable criterion that tests whether the model assigns a higher probability to the action supported by the human majority. Model alignment is then quantified by the percentage of scenarios where this criterion is met.
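As a rough sketch of this criterion, assuming each scenario is represented by a pair of probability dictionaries over two hypothetical action labels ("support" and "oppose"), the computation might look like the following. The function names and the data are illustrative, and treating ties as not aligned is an assumption rather than something the source specifies.

```python
def is_directionally_aligned(human_dist, model_dist):
    """True if the model gives the human-majority action a higher
    probability than the alternative action (ties count as not aligned)."""
    majority_action = max(human_dist, key=human_dist.get)
    other_probs = [p for a, p in model_dist.items() if a != majority_action]
    return model_dist[majority_action] > max(other_probs)


def directional_alignment_rate(scenarios):
    """Percentage of scenarios in which the criterion is met."""
    if not scenarios:
        raise ValueError("need at least one scenario")
    hits = sum(is_directionally_aligned(h, m) for h, m in scenarios)
    return 100.0 * hits / len(scenarios)


# Made-up (human distribution, model distribution) pairs for four
# high-consensus scenarios.
scenarios = [
    ({"support": 0.9, "oppose": 0.1}, {"support": 0.7, "oppose": 0.3}),
    ({"support": 0.1, "oppose": 0.9}, {"support": 0.6, "oppose": 0.4}),
    ({"support": 1.0, "oppose": 0.0}, {"support": 0.55, "oppose": 0.45}),
    ({"support": 0.8, "oppose": 0.2}, {"support": 0.2, "oppose": 0.8}),
]

rate = directional_alignment_rate(scenarios)
```

Here the model sides with the human majority in the first and third scenarios only, so the rate is 50%, which is what a model choosing at chance would be expected to score.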

The figure below presents the results across 25 different LLMs and four distinct traits. The results are grouped by the level of consensus among human annotators (out of 10 responses per scenario): unanimity (10/10), very high (9, 10), and high consensus (8, 9).

Smaller models (<25B) show markedly lower directional alignment, as indicated by the higher prevalence of red and orange cells in the bottom rows under the black horizontal line. These smaller models frequently do not distinguish between the appropriate expression or suppression of traits, often aligning with consensus at near-chance rates.

Large-capacity (>120B) and frontier closed-weights models show significant improvement, achieving close to perfect alignment when consensus among human annotators is unanimous. However, these models’ directional alignment still plateaus in the low-to-mid 80s (percent of scenarios) when consensus is lower than 90%.

Qualitative analysis of cases where LLMs deviate from the preferred behavioral mode in high-consensus scenarios revealed several interesting patterns. Models tend to encourage emotional openness in professional settings where humans recommend composure. In social disputes, models often prioritize harmony over standing one's ground, contrary to participant preferences. Lastly, models occasionally exhibit higher impulsivity than humans, recommending immediate action over logistical verification for time-sensitive opportunities.

Lack of distributional alignment

Distributional pluralism is a fairness principle arguing that the distribution of a model’s responses should accurately reflect the variety of human viewpoints rather than converging on a single, dominant response. To capture this in our setup, in cases where humans have lower agreement on the preferred action, the model’s probability mass should be more evenly distributed between the two possible actions, resulting in lower confidence in its preferred action.

The figure below presents the model's confidence as a function of human agreement. While a perfectly distributionally aligned model’s confidence should scale proportionally to consensus among human annotators (dotted black line), all 25 evaluated models (blue lines) show systematic overconfidence in their decisions. The solid blue line, representing the average across the 25 LLMs, illustrates that models do not represent the inherent ambiguity and the full spectrum of opinions from the human annotators. Even in the low-consensus cases where human opinion is significantly divided (50–60% agreement), confidence remains high across all evaluated models.
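One way to make the overconfidence concrete is to compare, per scenario, the model's confidence with the level of human agreement. The sketch below assumes that "confidence" means the probability the model assigns to its own preferred action and that "agreement" means the share of annotators choosing the majority action. Both definitions, the function names, and the numbers are illustrative assumptions, not the authors' exact metric.

```python
def top_probability(dist):
    """Probability mass on the most preferred action in a distribution."""
    return max(dist.values())


def mean_overconfidence(scenarios):
    """Average of (model confidence - human agreement) across scenarios.

    A value near 0 would indicate that the model's confidence tracks
    human agreement; a positive value indicates overconfidence.
    """
    if not scenarios:
        raise ValueError("need at least one scenario")
    gaps = [top_probability(m) - top_probability(h) for h, m in scenarios]
    return sum(gaps) / len(gaps)


# Made-up low-consensus scenarios: humans are split, the model is not.
low_consensus = [
    ({"support": 0.6, "oppose": 0.4}, {"support": 0.95, "oppose": 0.05}),
    ({"support": 0.5, "oppose": 0.5}, {"support": 0.1, "oppose": 0.9}),
]

gap = mean_overconfidence(low_consensus)
```

Note that this gap ignores which action the model prefers: in the second scenario the model is confident in the opposite direction from the first, yet both contribute positively, which matches the idea that overconfidence and its direction are separate questions.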

LLMs take a stance when humans have low consensus

We established that when consensus among human annotators regarding the preferred action is low, LLMs do not represent such ambiguity, which is reflected as overconfidence. In the figure below we show that the direction of this overconfidence varies substantially, even between frontier models. This suggests that different training and alignment procedures give rise to unique behavioral dispositions.

Self-reporting and revealed behavior

The validity of assessing LLM dispositions via self-reported agreement with questionnaire statements remains an active area of research. Some researchers question the construct validity of this approach, while others argue that specific prompting frameworks enable reliable assessment. Settling this debate is beyond the scope of this work, but our framework, which maps questionnaire items directly to behavioral scenarios, offers a unique lens to study these dynamics.

The figure below presents a notable divergence between LLMs’ self-reporting and their revealed behavior. For instance, models frequently self-report low impulsiveness, yet their behavior leans toward impulsiveness. When examining the distribution within each trait, there are also clear inconsistencies between LLMs’ self-reporting and their revealed behavior. This analysis suggests potential limitations in the validity of direct self-reporting, and highlights the utility of our framework as a foundation for future research.

Discussion

As an early contribution to our ongoing study of model behavior and alignment, we introduce a framework for evaluating behavioral dispositions in LLMs, grounding our approach in established questionnaire methodology while addressing the limitations of traditional self-reporting measures. This framework provides a way to measure two gaps: models do not consistently reflect consensus among human annotators in high-agreement scenarios, and they underrepresent the range of opinions in low-consensus scenarios. This is a step forward in understanding model behavioral tendencies, and further research is needed in critical areas such as evaluation and addressing the identified gaps.

For a deeper dive into our methodology and results, read the paper here.

Acknowledgements

This research was conducted by Amir Taubenfeld, Zorik Gekhman, Lior Nezry, Omri Feldman, Natalie Harris, Shashir Reddy, Romina Stella, Ariel Goldstein, Marian Croak, Yossi Matias and Amir Feder. We thank Itay Laish, Renee Shelby, Nino Scherrer, Sivan Eiger, Saška Mojsilović, Avinatan Hassidim, and Ronit Levavi Morad for reviewing the work and their valuable suggestions.

Original source

Google Research Blog

https://research.google/blog/evaluating-alignment-of-behavioral-dispositions-in-llms/
