Whose Alignment? Comparing LLM Process Alignment Across Diverse Organizational Decision Contexts
Niklas Weller, Emilio Barkett
Abstract
Aligning AI systems with organizational decision-making is typically framed as a single-target problem: make the model behave like the organization. We argue this framing obscures a deeper pluralistic challenge. We rely on a decision-policy capturing method to measure process alignment: whether an LLM weights information as the organization does, not merely whether it reaches the same conclusions. Applying this method to ECHR Article 6 decisions, process alignment strongly predicts output accuracy (r = 0.85, p < .001) and externalization substantially improves alignment for poorly-aligned models. Applying it to German consumer credit decisions, this relationship collapses (r = 0.15, p = .60): interventions produce inconsistent effects and the benchmark encodes potentially discriminatory historical patterns. This contrast is itself a pluralistic alignment finding: in contested domains, high process alignment is neither achievable via externalization nor unconditionally desirable. Output agreement alone cannot distinguish a model that has internalized an organizational policy from one that merely approximates its outcomes; process-level measurement is a necessary component of any pluralistic alignment evaluation.
AI Impact Assessments
Scientific Impact Assessment
1. Core Contribution
This paper introduces the Contextualized Alignment Lens Model (CALM), a framework for measuring *process alignment* between LLMs and organizational decision-making policies. Rather than evaluating whether an LLM reaches the same output as an organization, CALM measures whether the model weights information (cues) in the same way the organization does, operationalized via cosine similarity between ridge-regularized logistic regression coefficient vectors fitted to organizational and LLM decisions respectively.
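The measurement described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the data are synthetic, the helper names (`fit_ridge_logistic`, `simulate_decisions`), the penalty strength, the learning rate, and the choice of three cues are all assumptions, and the fit uses plain gradient descent in place of whatever solver the paper relies on.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_ridge_logistic(X, y, l2=1.0, lr=0.5, steps=500):
    """Fit L2-penalized logistic regression by batch gradient descent.

    Returns the cue weights only; the intercept is fitted but not
    penalized and is left out of the returned vector.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        # Gradient of the log loss plus the ridge term (l2 / 2) * ||w||^2.
        w = [wj - lr * (gj + l2 * wj) / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w


def cosine_similarity(u, v):
    """Cosine of the angle between two coefficient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simulate_decisions(X, weights, rng):
    """Draw binary decisions from a logistic policy with the given cue weights."""
    return [
        1 if rng.random() < sigmoid(sum(w * x for w, x in zip(weights, xi))) else 0
        for xi in X
    ]


# Synthetic data only: three standardized cues and made-up policies.
rng = random.Random(0)
cases = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(300)]
org_decisions = simulate_decisions(cases, [2.0, -1.0, 0.5], rng)
llm_a_decisions = simulate_decisions(cases, [1.8, -1.2, 0.4], rng)  # similar policy
llm_b_decisions = simulate_decisions(cases, [-0.5, 2.0, 0.0], rng)  # different policy

org_w = fit_ridge_logistic(cases, org_decisions)
alignment_a = cosine_similarity(org_w, fit_ridge_logistic(cases, llm_a_decisions))
alignment_b = cosine_similarity(org_w, fit_ridge_logistic(cases, llm_b_decisions))
print(f"alignment A: {alignment_a:.2f}, alignment B: {alignment_b:.2f}")
```

Because cosine similarity ignores vector length, the score compares the relative weighting of cues, not how decisive each decision-maker is overall. That fits the stated goal of comparing how information is weighted.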
The key conceptual contribution is reframing organizational AI alignment as a pluralistic, between-organization problem rather than a single-target optimization. The paper argues that different organizations encode different value systems, and that output-level agreement is insufficient to determine whether an LLM has genuinely internalized an organizational policy versus merely approximating its outcomes through different reasoning pathways.
The paper's central empirical finding is a cross-domain contrast: in ECHR legal decisions, process alignment strongly predicts output accuracy (r = 0.85), and externalization interventions reliably improve alignment for misaligned models. In German consumer credit decisions, this relationship collapses (r = 0.15), interventions produce inconsistent effects, and the benchmark itself encodes potentially discriminatory patterns. The authors argue this contrast is itself a pluralistic alignment finding — demonstrating that the desirability of process alignment is domain-dependent.
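The reported r values relate each model's process alignment score to its output accuracy. Assuming r denotes a Pearson correlation computed across models (the review does not say which coefficient is used), the calculation can be sketched as follows. The per-model numbers are hypothetical placeholders, not results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, one entry per evaluated model.
alignment_scores = [0.20, 0.45, 0.60, 0.75, 0.90]
output_accuracy = [0.55, 0.60, 0.68, 0.74, 0.81]

r = pearson_r(alignment_scores, output_accuracy)
print(f"r = {r:.2f}")
```

With only a handful of models per domain, a correlation like this carries wide uncertainty, so any comparison between the two domains needs the sample size in view.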
2. Methodological Rigor
The methodological foundation — the Brunswik Lens Model — is well-established in judgment and decision-making research, and its application to LLM-organization alignment is a reasonable extension. The operationalization via ridge logistic regression and cosine similarity is straightforward and interpretable.
However, several methodological concerns arise:
3. Potential Impact
The paper addresses a genuinely important problem. As organizations increasingly deploy LLMs for consequential decisions, understanding *how* models reason — not just their accuracy — is critical for governance, auditing, and trust. The CALM framework offers:
The distinction between CALM as a calibration tool (legitimate benchmarks) vs. audit tool (contested benchmarks) is conceptually valuable and could influence how regulators think about AI process requirements.
4. Timeliness & Relevance
The paper is well-timed given increasing regulatory attention to AI decision-making processes (EU AI Act, proposed US frameworks) and the growing deployment of LLMs in organizational decision support. The pluralistic alignment framing connects to active research by Sorensen et al. (2024) and others. The between-organization plurality angle is genuinely under-explored and timely.
5. Strengths & Limitations
Strengths:
Limitations:
Overall Assessment
This paper presents a conceptually interesting framework that bridges organizational theory, judgment and decision-making psychology, and AI alignment. The core idea — measuring process alignment at the cue-weighting level and recognizing that alignment desirability is domain-dependent — is valuable. However, the empirical evidence is preliminary: two domains, small samples, and incomplete model coverage limit the strength of the conclusions. The paper would benefit substantially from additional domains, larger-scale evaluation, and more rigorous statistical treatment of the cross-domain comparison.
Generated May 26, 2026
Comparison History (20)
Paper 1 addresses a fundamental methodological issue in LLM confidence calibration—showing that measurement protocol choices (conditioning context, token readout, answer string selection) can change the sign of calibration comparisons. This has broad impact across the entire LLM evaluation community, as confidence calibration is central to trustworthy AI deployment. The actionable reporting checklist and systematic experimental design across multiple models/benchmarks provide immediate practical value. Paper 2 offers interesting insights on process vs. output alignment in specific domains, but its narrower scope (two specific decision contexts) and more niche audience limit its breadth of impact.
Paper 1 has higher impact potential: it introduces a deployable edge–cloud split-inference framework for many-to-many speech translation that directly addresses major real-world constraints (privacy, bandwidth, on-device limits) with quantified gains (up to 10× bandwidth reduction) and strong multilingual results across 45 languages and 1,980 directions, plus released code/models enabling adoption and follow-on work. Paper 2 offers an important conceptual/measurement contribution to pluralistic alignment, but its empirical scope is narrower (two decision contexts) and nearer-term applications are more domain- and governance-dependent, likely yielding less immediate cross-field uptake.
Paper 2 has higher potential impact due to a clearer methodological and systems innovation (causal Mamba/SSM for streaming EEG with linear-time scaling plus a tailored multi-stage self-supervised objective), strong real-world applicability (real-time continuous EEG monitoring), and broad relevance across ML, neuroscience, and medical devices. The claimed SOTA across multiple datasets and >10x throughput suggests solid empirical rigor and immediate deployability. Paper 1 introduces an important evaluation lens for LLM organizational “process alignment,” but its impact is narrower, more context-dependent, and primarily conceptual/measurement-focused with less direct generalization to high-stakes deployment performance gains.
Paper 2 addresses a broadly applicable problem (tool failure in medical AI agents) with a novel RL-based framework that demonstrates consistent improvements across seven benchmarks. Its practical relevance to clinical safety, methodological contribution (GRPO-based instance-level tool selection with disagreement-aware learning), and breadth of experimental validation give it higher potential impact. Paper 1 offers valuable conceptual insights on pluralistic alignment with process-level measurement, but its findings are more domain-specific (legal/credit decisions) and primarily diagnostic rather than providing a scalable solution, limiting its broader impact.
Paper 2 challenges a fundamental paradigm in AI alignment by distinguishing process alignment from outcome alignment. Its findings on domain dependency and the risk of encoding discriminatory patterns offer profound implications for AI ethics, governance, and organizational implementation. While Paper 1 presents a solid architectural advancement for embodied agents, Paper 2's broader interdisciplinary relevance across AI safety, policy, and fairness gives it a higher potential for widespread scientific and societal impact.
Paper 2 likely has higher scientific impact. It targets an emerging, widely relevant problem—reliability of long-lived deployed agents—introducing a concrete benchmark (AgingBench), a taxonomy of aging mechanisms, and diagnostics tied to actionable repairs across the memory pipeline, evaluated across many models/scenarios and long horizons. This combines novelty, methodological rigor, timeliness, and broad applicability to agentic systems, MLOps, and safety/reliability engineering. Paper 1 offers an important alignment measurement lens but is more domain-specific and its broader methodological generalization is less clearly demonstrated.
Paper 1 provides a highly rigorous methodology to uncover a fundamental architectural constraint in LLMs (the 4x4 scale threshold). Its detailed forensic pipeline and discovery of structural failure modes offer concrete, actionable insights for AI researchers to improve LLM reasoning, likely driving significant technical advancements.
Paper 1 offers a more novel and general-purpose contribution: a scalable, inspectable environment that disentangles predictive performance from true causal-mechanism recovery for interactive LLM “AI scientist” agents, with ground-truth SCMs and intervention loops. This is methodologically strong, broadly reusable across causal discovery, agent evaluation, and automated science, and timely given interest in tool-using LLM researchers. Paper 2 is important for applied AI governance and highlights pluralistic alignment pitfalls, but its impact is more domain- and dataset-dependent and may be constrained by institutional specifics and normative disagreement.
AgentFugue addresses a fundamental scaling question in AI agents—whether multiple peer agents can collectively improve performance on long-horizon tasks through shared reasoning without centralized orchestration. This has broad applicability across agentic AI systems, introduces a novel architectural paradigm (shared reasoning hub), and combines SFT with RL training. Paper 2 makes a valuable conceptual contribution about process vs. output alignment in pluralistic contexts, but its scope is narrower (two specific legal/credit domains) and its impact is more limited to the alignment evaluation community. AgentFugue's framework is more likely to inspire follow-up work across multiple AI subfields.
Paper 1 tackles a fundamental and highly timely issue in AI alignment, distinguishing between process and outcome alignment. Its findings on the limitations of current alignment evaluations in contested domains like law and finance have broad implications for AI ethics, governance, and organizational implementation. Paper 2 presents a valuable applied tool for geospatial scientists, but its methodological contribution is less foundational and its breadth of impact is narrower compared to the theoretical and practical implications of Paper 1.
Paper 2 likely has higher impact due to a clearer, broadly applicable systems contribution: LLM-guided, traceable “model patches” for continuous re-optimization in deployed OR models, validated on two large real-world case studies. It targets a widespread industrial pain point (maintaining optimization models under changing constraints), offering immediate practical utility and cross-field relevance (OR, AI agents, decision-support, human-in-the-loop systems). Paper 1 is conceptually novel for pluralistic/process alignment measurement, but is narrower in domain scope and its mixed empirical results may limit near-term uptake.
Paper 1 addresses a fundamental challenge in AI safety and ethics by distinguishing between process and output alignment. Its findings regarding the reproduction of discriminatory historical patterns and the limits of alignment in contested domains offer profound implications for AI governance, fairness, and organizational deployment. Paper 2 provides a valuable but more narrowly focused engineering harness for time series agents, giving Paper 1 a broader and more urgent scientific and societal impact.
Paper 1 addresses a fundamental conceptual challenge in AI alignment by distinguishing between process and outcome alignment across diverse real-world contexts (legal and credit decisions). Its interdisciplinary approach bridges AI, organizational behavior, and ethics, offering broader societal and theoretical impact. Paper 2, while methodologically rigorous and practically useful for AI control, focuses on narrower technical mechanisms (retrying vs resampling) and has a more constrained scope of impact.
Paper 2 likely has higher impact: it introduces a broadly applicable RL-based framework (SkillOS) for long-horizon skill curation in self-evolving LLM agents, a timely and fast-growing area with clear downstream applications (autonomous agents, continual learning, tool/skill libraries). The method appears more generalizable across tasks/backbones and could influence both agent architectures and learning algorithms. Paper 1 is conceptually novel for alignment evaluation (process alignment, pluralism) and important for governance, but its empirical scope is narrower and domain-dependent, potentially limiting cross-field uptake.
Paper 2 likely has higher scientific impact: it introduces a novel, technically concrete architecture (concept-guided multimodal MoE with residual concept bottlenecks) and validates it on real clinical datasets with expert (neuropathologist) assessment, supporting methodological rigor and translational relevance. Its applications in computational pathology are immediate and high-stakes, and the approach generalizes to broader multimodal medical AI and interpretable ML. Paper 1 offers important conceptual framing and an evaluation method for process alignment, but its impact may be narrower and more context-dependent, with mixed empirical results across domains.
Paper 1 offers a profound conceptual shift in AI alignment by distinguishing process alignment from outcome agreement, challenging current paradigms in AI governance and ethics. While Paper 2 rigorously identifies a practical bias in LLM-as-a-judge applications, its findings describe a procedural artifact that is easily mitigated by clearing the context window. Paper 1's insights into contested organizational domains have broader, longer-lasting implications for how we measure and deploy safe AI systems in complex societal frameworks.
Paper 2 addresses a fundamental challenge in AI safety and governance by distinguishing between process and outcome alignment. Its insights into pluralistic alignment and the risks of encoding historical biases have broad implications across multiple domains (legal, corporate, AI ethics), offering wider scientific impact than Paper 1's domain-specific mitigation for financial backtesting.
Paper 2 introduces a novel technical framework (LLMSurvival) that solves a concrete, well-defined problem—enabling LLMs to perform censoring-aware survival analysis without architectural modifications. This has broad applicability across clinical medicine and potentially other fields using time-to-event data. The pairwise ranking reformulation is a creative methodological contribution with demonstrated improvements over established baselines on real clinical datasets. Paper 1 offers valuable conceptual insights about pluralistic alignment measurement but is more narrowly scoped to alignment evaluation methodology, with findings that are primarily diagnostic rather than providing a broadly adoptable new capability.
Paper 2 addresses a fundamental and highly timely challenge: LLM alignment. By shifting focus from outcome alignment to process alignment, it offers a novel framework applicable across law, finance, and AI governance. While Paper 1 presents a highly rigorous and valuable solution for industrial digital twins, Paper 2's conceptual breakthrough has broader implications for the rapidly expanding deployment of LLMs in societal decision-making, offering insights into fairness and pluralism that will likely influence a much wider range of disciplines.
Paper 1 offers higher potential scientific impact due to its broad real-world applicability in AI governance and ethics. By distinguishing between process and output alignment, it addresses a fundamental challenge in deploying LLMs in high-stakes domains like law and finance. Its insights into pluralistic alignment and the risks of encoding historical biases have wide-reaching implications across sociology, law, and AI safety. In contrast, Paper 2, while methodologically rigorous, focuses heavily on a specific benchmark (ARC-AGI-3), making its immediate impact narrower and more constrained to the niche of AGI benchmark evaluation.