Whose Alignment? Comparing LLM Process Alignment Across Diverse Organizational Decision Contexts

Niklas Weller, Emilio Barkett

May 24, 2026

arXiv:2605.25256v1

cs.AI (primary)


#1441 of 2682 · Artificial Intelligence

Tournament Score: 1400 ± 41 (scale 1050 to 1800)

Win Rate: 40%

Rating: 5.5 / 10

Significance 6.5 · Rigor 4.5 · Novelty 6.5 · Clarity 7


Abstract

Aligning AI systems with organizational decision-making is typically framed as a single-target problem: make the model behave like the organization. We argue this framing obscures a deeper pluralistic challenge. We rely on a decision-policy capturing method to measure process alignment: whether an LLM weights information as the organization does, not merely whether it reaches the same conclusions. Applying this method to ECHR Article 6 decisions, process alignment strongly predicts output accuracy (r = 0.85, p < .001) and externalization substantially improves alignment for poorly-aligned models. Applying it to German consumer credit decisions, this relationship collapses (r = 0.15, p = .60): interventions produce inconsistent effects and the benchmark encodes potentially discriminatory historical patterns. This contrast is itself a pluralistic alignment finding: in contested domains, high process alignment is neither achievable via externalization nor unconditionally desirable. Output agreement alone cannot distinguish a model that has internalized an organizational policy from one that merely approximates its outcomes; process-level measurement is a necessary component of any pluralistic alignment evaluation.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper introduces the Contextualized Alignment Lens Model (CALM), a framework for measuring *process alignment* between LLMs and organizational decision-making policies. Rather than evaluating whether an LLM reaches the same output as an organization, CALM measures whether the model weights information (cues) in the same way the organization does, operationalized via cosine similarity between ridge-regularized logistic regression coefficient vectors fitted to organizational and LLM decisions respectively.
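As a rough illustration of the operationalization described above, the following Python sketch fits a ridge-penalized logistic regression to two sets of decisions over the same cues and compares the resulting coefficient vectors by cosine similarity. It is not the authors' implementation: the gradient-descent fitter, the penalty strength `l2`, the four synthetic binary cues, the simulated policies, and all function names are assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_ridge_logistic(X, y, l2=1.0, lr=0.5, steps=500):
    """Fit L2-penalized logistic regression by batch gradient descent.

    Returns the cue weights only. The intercept is fitted (unpenalized)
    but not returned, since it carries no cue-weighting information.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        # Mean log-loss gradient plus the ridge term (l2 / n) * w.
        w = [wj - lr * (gj / n + l2 * wj / n) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def simulate_decisions(X, true_weights, rng):
    # Draw binary decisions from a logistic policy with the given cue weights.
    return [
        int(rng.random() < sigmoid(sum(w * x for w, x in zip(true_weights, xi))))
        for xi in X
    ]

rng = random.Random(0)
# Synthetic cases with four binary cues (illustrative only, not ECHR data).
X = [[rng.randint(0, 1) for _ in range(4)] for _ in range(300)]

org = simulate_decisions(X, [2.0, -1.5, 0.5, 0.0], rng)
llm_similar = simulate_decisions(X, [1.8, -1.2, 0.4, 0.1], rng)
llm_different = simulate_decisions(X, [-0.5, 0.2, 2.0, 1.5], rng)

w_org = fit_ridge_logistic(X, org)
align_similar = cosine_similarity(w_org, fit_ridge_logistic(X, llm_similar))
align_different = cosine_similarity(w_org, fit_ridge_logistic(X, llm_different))
print(f"process alignment, similar policy:   {align_similar:.2f}")
print(f"process alignment, different policy: {align_different:.2f}")
```

Because cosine similarity ignores vector length, a model that weights cues in the same proportions as the organization but with lower overall confidence would still score as highly aligned under this measure.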

The key conceptual contribution is reframing organizational AI alignment as a pluralistic, between-organization problem rather than a single-target optimization. The paper argues that different organizations encode different value systems, and that output-level agreement is insufficient to determine whether an LLM has genuinely internalized an organizational policy versus merely approximating its outcomes through different reasoning pathways.

The paper's central empirical finding is a cross-domain contrast: in ECHR legal decisions, process alignment strongly predicts output accuracy (r = 0.85), and externalization interventions reliably improve alignment for misaligned models. In German consumer credit decisions, this relationship collapses (r = 0.15), interventions produce inconsistent effects, and the benchmark itself encodes potentially discriminatory patterns. The authors argue this contrast is itself a pluralistic alignment finding — demonstrating that the desirability of process alignment is domain-dependent.

2. Methodological Rigor

The methodological foundation — the Brunswik Lens Model — is well-established in judgment and decision-making research, and its application to LLM-organization alignment is a reasonable extension. The operationalization via ridge logistic regression and cosine similarity is straightforward and interpretable.

However, several methodological concerns arise:

Sample sizes for correlation claims: The headline correlation of r = 0.85 is computed across n = 30 data points (10 models × 3 conditions), and r = 0.15 across approximately n = 13-15 points (5 models × 2-3 conditions). These are small samples. The ECHR correlation is statistically significant, but its estimate is imprecise at this n, and the observations are not independent (the same model appears three times under different conditions), so the reported significance is likely overstated.

Domain selection: The authors acknowledge the potential objection that domains were chosen to produce the contrast, but their defense — that domains were selected before results were examined — is an assertion without verifiable evidence. More critically, only two domains are examined, making it difficult to generalize the contestedness hypothesis.

Cue coding via GPT-5.4-mini: The 45 binary features for ECHR cases were coded by an LLM, introducing potential circularity — LLMs are coding the features that are then used to measure LLM alignment. The paper does not report inter-rater reliability or validate the coding against human annotations.

German Credit dataset limitations: The Statlog German Credit dataset (1994) is extremely well-worn, small (n=1000), and has known issues. Using it as the sole representative of "contested" organizational benchmarks is limiting. The balanced subset of 600 cases further reduces power.

Causal interpretation: The paper sometimes implies that externalization *causes* alignment improvement, but the design is observational across prompting conditions, not a controlled experiment with proper randomization.

Study 2 model selection: Only 5 of the 10 models from Study 1 are used in Study 2, with the promise that "remaining models will be included in a full replication." This is a notable gap for a paper making cross-domain claims.
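The cue-coding concern raised above could be addressed with a standard agreement check between the LLM coder and a human annotator on a sample of cases. The sketch below computes Cohen's kappa for one binary cue. The function name and the ten label pairs are hypothetical and purely illustrative; they are not data from the paper.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical codes for one binary cue across ten cases.
llm_codes   = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
human_codes = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

kappa = cohens_kappa(llm_codes, human_codes)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.80"
```

Reporting kappa per cue, rather than one pooled figure, would show whether disagreement concentrates in a few cues that carry large weights in the fitted decision policy.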

3. Potential Impact

The paper addresses a genuinely important problem. As organizations increasingly deploy LLMs for consequential decisions, understanding *how* models reason — not just their accuracy — is critical for governance, auditing, and trust. The CALM framework offers:

A practical audit tool for organizations deploying LLMs in regulated settings (credit, legal, healthcare)

A framework connecting AI alignment to organizational theory (tacit knowledge, institutional norms)

Relevant input to EU AI Act compliance, particularly for high-risk applications requiring process transparency

The distinction between CALM as a calibration tool (legitimate benchmarks) vs. audit tool (contested benchmarks) is conceptually valuable and could influence how regulators think about AI process requirements.

4. Timeliness & Relevance

The paper is well-timed given increasing regulatory attention to AI decision-making processes (EU AI Act, proposed US frameworks) and the growing deployment of LLMs in organizational decision support. The pluralistic alignment framing connects to active research by Sorensen et al. (2024) and others. The between-organization plurality angle is genuinely under-explored and timely.

5. Strengths & Limitations

Strengths:

Novel and well-motivated conceptual framing: extending pluralistic alignment to between-organization diversity

Clear operationalization of process alignment via an established psychological framework

The cross-domain contrast is genuinely informative — the failure case (German Credit) is arguably more interesting than the success case

The Grok over-correction finding (99.5% approval rate under introspective feedback) is a striking and practically important failure mode

The paper honestly engages with the normative complexity of alignment targets

The faithfulness discussion (behavioral vs. stated reasoning) identifies an important open problem

Limitations:

Only two domains examined — the contestedness hypothesis needs broader testing

Small effective sample sizes for key statistical claims

LLM-coded features for ECHR create potential circularity

The German Credit dataset is dated and small; more modern credit datasets exist

No formal statistical testing of the cross-domain difference in correlations (e.g., Fisher's z-test)

The paper is a preprint and reads as somewhat preliminary — Study 2 uses only half the models

Ridge logistic regression as a proxy for organizational "decision policy" is a strong assumption — it captures linear, additive cue utilization but not interactions, nonlinearities, or case-specific reasoning

The paper does not compare CALM to any existing alignment measurement approach
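The missing cross-domain test noted in the limitations can be sketched with Fisher's z transformation, which also gives approximate confidence intervals for each correlation. The sample sizes below are assumptions: n = 30 follows the count stated in this assessment, and n = 15 is the upper end of the approximate 13-15 range. The test treats observations as independent, which the repeated-model design does not guarantee, so the result should be read as indicative only.

```python
import math

def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95% confidence interval for a Pearson r via Fisher's z."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

def fisher_z_difference(r1, n1, r2, n2):
    """Two-sided test that two independent correlations are equal."""
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (math.atanh(r1) - math.atanh(r2)) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return z, p

# Correlations reported in the abstract; sample sizes are assumptions (see above).
r_echr, n_echr = 0.85, 30
r_credit, n_credit = 0.15, 15

echr_lo, echr_hi = fisher_ci(r_echr, n_echr)
credit_lo, credit_hi = fisher_ci(r_credit, n_credit)
z_stat, p_value = fisher_z_difference(r_echr, n_echr, r_credit, n_credit)

print(f"ECHR   r = {r_echr}: 95% CI [{echr_lo:.2f}, {echr_hi:.2f}]")
print(f"Credit r = {r_credit}: 95% CI [{credit_lo:.2f}, {credit_hi:.2f}]")
print(f"difference: z = {z_stat:.2f}, p = {p_value:.4f}")
```

Under these assumed sample sizes the credit interval spans zero while the ECHR interval does not, which is consistent with the contrast the paper reports; a mixed-effects or cluster-aware analysis would be needed to account for models appearing in several conditions.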

Overall Assessment

This paper presents a conceptually interesting framework that bridges organizational theory, judgment and decision-making psychology, and AI alignment. The core idea — measuring process alignment at the cue-weighting level and recognizing that alignment desirability is domain-dependent — is valuable. However, the empirical evidence is preliminary: two domains, small samples, and incomplete model coverage limit the strength of the conclusions. The paper would benefit substantially from additional domains, larger-scale evaluation, and more rigorous statistical treatment of the cross-domain comparison.

Rating: 5.5 / 10

Significance 6.5 · Rigor 4.5 · Novelty 6.5 · Clarity 7

Generated May 26, 2026

Comparison History (20)

vs. Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration

claude-opus-4.6 · 5/28/2026

Paper 1 addresses a fundamental methodological issue in LLM confidence calibration—showing that measurement protocol choices (conditioning context, token readout, answer string selection) can change the sign of calibration comparisons. This has broad impact across the entire LLM evaluation community, as confidence calibration is central to trustworthy AI deployment. The actionable reporting checklist and systematic experimental design across multiple models/benchmarks provide immediate practical value. Paper 2 offers interesting insights on process vs. output alignment in specific domains, but its narrower scope (two specific decision contexts) and more niche audience limit its breadth of impact.

vs. Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation

gpt-5.2 · 5/28/2026

Paper 1 has higher impact potential: it introduces a deployable edge–cloud split-inference framework for many-to-many speech translation that directly addresses major real-world constraints (privacy, bandwidth, on-device limits) with quantified gains (up to 10× bandwidth reduction) and strong multilingual results across 45 languages and 1,980 directions, plus released code/models enabling adoption and follow-on work. Paper 2 offers an important conceptual/measurement contribution to pluralistic alignment, but its empirical scope is narrower (two decision contexts) and nearer-term applications are more domain- and governance-dependent, likely yielding less immediate cross-field uptake.

vs. CaMBRAIN: Real-time, Continuous EEG Inference with Causal State Space Models

gpt-5.2 · 5/28/2026

Paper 2 has higher potential impact due to a clearer methodological and systems innovation (causal Mamba/SSM for streaming EEG with linear-time scaling plus a tailored multi-stage self-supervised objective), strong real-world applicability (real-time continuous EEG monitoring), and broad relevance across ML, neuroscience, and medical devices. The claimed SOTA across multiple datasets and >10x throughput suggests solid empirical rigor and immediate deployability. Paper 1 introduces an important evaluation lens for LLM organizational “process alignment,” but its impact is narrower, more context-dependent, and primarily conceptual/measurement-focused with less direct generalization to high-stakes deployment performance gains.

vs. Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents

claude-opus-4.6 · 5/27/2026

Paper 2 addresses a broadly applicable problem (tool failure in medical AI agents) with a novel RL-based framework that demonstrates consistent improvements across seven benchmarks. Its practical relevance to clinical safety, methodological contribution (GRPO-based instance-level tool selection with disagreement-aware learning), and breadth of experimental validation give it higher potential impact. Paper 1 offers valuable conceptual insights on pluralistic alignment with process-level measurement, but its findings are more domain-specific (legal/credit decisions) and primarily diagnostic rather than providing a scalable solution, limiting its broader impact.

vs. Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions

gemini-3.1 · 5/27/2026

Paper 2 challenges a fundamental paradigm in AI alignment by distinguishing process alignment from outcome alignment. Its findings on domain dependency and the risk of encoding discriminatory patterns offer profound implications for AI ethics, governance, and organizational implementation. While Paper 1 presents a solid architectural advancement for embodied agents, Paper 2's broader interdisciplinary relevance across AI safety, policy, and fairness gives it a higher potential for widespread scientific and societal impact.

vs. Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

gpt-5.2 · 5/27/2026

Paper 2 likely has higher scientific impact. It targets an emerging, widely relevant problem—reliability of long-lived deployed agents—introducing a concrete benchmark (AgingBench), a taxonomy of aging mechanisms, and diagnostics tied to actionable repairs across the memory pipeline, evaluated across many models/scenarios and long horizons. This combines novelty, methodological rigor, timeliness, and broad applicability to agentic systems, MLOps, and safety/reliability engineering. Paper 1 offers an important alignment measurement lens but is more domain-specific and its broader methodological generalization is less clearly demonstrated.

vs. LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning

gemini-3.1 · 5/26/2026

Paper 1 provides a highly rigorous methodology to uncover a fundamental architectural constraint in LLMs (the 4x4 scale threshold). Its detailed forensic pipeline and discovery of structural failure modes offer concrete, actionable insights for AI researchers to improve LLM reasoning, likely driving significant technical advancements.

vs. CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

gpt-5.2 · 5/26/2026

Paper 1 offers a more novel and general-purpose contribution: a scalable, inspectable environment that disentangles predictive performance from true causal-mechanism recovery for interactive LLM “AI scientist” agents, with ground-truth SCMs and intervention loops. This is methodologically strong, broadly reusable across causal discovery, agent evaluation, and automated science, and timely given interest in tool-using LLM researchers. Paper 2 is important for applied AI governance and highlights pluralistic alignment pitfalls, but its impact is more domain- and dataset-dependent and may be constrained by institutional specifics and normative disagreement.

vs. AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning

claude-opus-4.6 · 5/26/2026

AgentFugue addresses a fundamental scaling question in AI agents—whether multiple peer agents can collectively improve performance on long-horizon tasks through shared reasoning without centralized orchestration. This has broad applicability across agentic AI systems, introduces a novel architectural paradigm (shared reasoning hub), and combines SFT with RL training. Paper 2 makes a valuable conceptual contribution about process vs. output alignment in pluralistic contexts, but its scope is narrower (two specific legal/credit domains) and its impact is more limited to the alignment evaluation community. AgentFugue's framework is more likely to inspire follow-up work across multiple AI subfields.

vs. GRAIL: AI translation for scientists application workflow on satellite data

gemini-3.1 · 5/26/2026

Paper 1 tackles a fundamental and highly timely issue in AI alignment, distinguishing between process and outcome alignment. Its findings on the limitations of current alignment evaluations in contested domains like law and finance have broad implications for AI ethics, governance, and organizational implementation. Paper 2 presents a valuable applied tool for geospatial scientists, but its methodological contribution is less foundational and its breadth of impact is narrower compared to the theoretical and practical implications of Paper 1.

vs. Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact due to a clearer, broadly applicable systems contribution: LLM-guided, traceable “model patches” for continuous re-optimization in deployed OR models, validated on two large real-world case studies. It targets a widespread industrial pain point (maintaining optimization models under changing constraints), offering immediate practical utility and cross-field relevance (OR, AI agents, decision-support, human-in-the-loop systems). Paper 1 is conceptually novel for pluralistic/process alignment measurement, but is narrower in domain scope and its mixed empirical results may limit near-term uptake.

vs. AION: Next-Generation Tasks and Practical Harness for Time Series

gemini-3.1 · 5/26/2026

Paper 1 addresses a fundamental challenge in AI safety and ethics by distinguishing between process and output alignment. Its findings regarding the reproduction of discriminatory historical patterns and the limits of alignment in contested domains offer profound implications for AI governance, fairness, and organizational deployment. Paper 2 provides a valuable but more narrowly focused engineering harness for time series agents, giving Paper 1 a broader and more urgent scientific and societal impact.

vs. Retrying vs Resampling in AI Control

gemini-3.1 · 5/26/2026

Paper 1 addresses a fundamental conceptual challenge in AI alignment by distinguishing between process and outcome alignment across diverse real-world contexts (legal and credit decisions). Its interdisciplinary approach bridges AI, organizational behavior, and ethics, offering broader societal and theoretical impact. Paper 2, while methodologically rigorous and practically useful for AI control, focuses on narrower technical mechanisms (retrying vs resampling) and has a more constrained scope of impact.

vs. SkillOS: Learning Skill Curation for Self-Evolving Agents

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact: it introduces a broadly applicable RL-based framework (SkillOS) for long-horizon skill curation in self-evolving LLM agents, a timely and fast-growing area with clear downstream applications (autonomous agents, continual learning, tool/skill libraries). The method appears more generalizable across tasks/backbones and could influence both agent architectures and learning algorithms. Paper 1 is conceptually novel for alignment evaluation (process alignment, pluralism) and important for governance, but its empirical scope is narrower and domain-dependent, potentially limiting cross-field uptake.

vs. ConceptM$^3$oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology

gpt-5.2 · 5/26/2026

Paper 2 likely has higher scientific impact: it introduces a novel, technically concrete architecture (concept-guided multimodal MoE with residual concept bottlenecks) and validates it on real clinical datasets with expert (neuropathologist) assessment, supporting methodological rigor and translational relevance. Its applications in computational pathology are immediate and high-stakes, and the approach generalizes to broader multimodal medical AI and interpretable ML. Paper 1 offers important conceptual framing and an evaluation method for process alignment, but its impact may be narrower and more context-dependent, with mixed empirical results across domains.

vs. AMEL: Accumulated Message Effects on LLM Judgments

gemini-3.1 · 5/26/2026

Paper 1 offers a profound conceptual shift in AI alignment by distinguishing process alignment from outcome agreement, challenging current paradigms in AI governance and ethics. While Paper 2 rigorously identifies a practical bias in LLM-as-a-judge applications, its findings describe a procedural artifact that is easily mitigated by clearing the context window. Paper 1's insights into contested organizational domains have broader, longer-lasting implications for how we measure and deploy safe AI systems in complex societal frameworks.

vs. Summoning the Oracle to Slay It: Mitigating Look-Ahead Bias in Financial Backtesting with Large Language Models

gemini-3.1 · 5/26/2026

Paper 2 addresses a fundamental challenge in AI safety and governance by distinguishing between process and outcome alignment. Its insights into pluralistic alignment and the risks of encoding historical biases have broad implications across multiple domains (legal, corporate, AI ethics), offering wider scientific impact than Paper 1's domain-specific mitigation for financial backtesting.

vs. Towards end-to-end LLM-based censoring-aware survival analysis

claude-opus-4.6 · 5/26/2026

Paper 2 introduces a novel technical framework (LLMSurvival) that solves a concrete, well-defined problem—enabling LLMs to perform censoring-aware survival analysis without architectural modifications. This has broad applicability across clinical medicine and potentially other fields using time-to-event data. The pairwise ranking reformulation is a creative methodological contribution with demonstrated improvements over established baselines on real clinical datasets. Paper 1 offers valuable conceptual insights about pluralistic alignment measurement but is more narrowly scoped to alignment evaluation methodology, with findings that are primarily diagnostic rather than providing a broadly adoptable new capability.

vs. Explainable Wastewater Digital Twins: Adaptive Context-Conditioned Structured Simulators with Self-Falsifying Decision Support

gemini-3.1 · 5/26/2026

Paper 2 addresses a fundamental and highly timely challenge: LLM alignment. By shifting focus from outcome alignment to process alignment, it offers a novel framework applicable across law, finance, and AI governance. While Paper 1 presents a highly rigorous and valuable solution for industrial digital twins, Paper 2's conceptual breakthrough has broader implications for the rapidly expanding deployment of LLMs in societal decision-making, offering insights into fairness and pluralism that will likely influence a much wider range of disciplines.

vs. Explore Before You Solve: The Speed--Depth Trade-off in Epistemic Agents for ARC-AGI-3

gemini-3.1 · 5/26/2026

Paper 1 offers higher potential scientific impact due to its broad real-world applicability in AI governance and ethics. By distinguishing between process and output alignment, it addresses a fundamental challenge in deploying LLMs in high-stakes domains like law and finance. Its insights into pluralistic alignment and the risks of encoding historical biases have wide-reaching implications across sociology, law, and AI safety. In contrast, Paper 2, while methodologically rigorous, focuses heavily on a specific benchmark (ARC-AGI-3), making its immediate impact narrower and more constrained to the niche of AGI benchmark evaluation.