Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration
Jiaju Chen, Yuxuan Lu, Jiayi Su, Chaoran Chen, Songlin Xiao, Zheng Zhang, Yun Wang, Yunyao Li
Abstract
Recent advances in LLM agents have enabled complex cognitive capabilities, such as multi-step reasoning, planning, and tool use, that increasingly position these agents as human collaborators. Effective collaboration, however, requires collaborators to continuously maintain and align mental models of their own reasoning, partners' intentions, and shared goals during the collaborative process. Today's agents rarely develop such capabilities since they are primarily optimized for task completion, and the community lacks authentic human collaboration data with action-level mental model annotations that could guide agents toward process-level collaborative competence. To bridge this gap, we present ALMANAC, a dataset of Action-Level Mental model ANnotations for Agent Collaboration built from the Map Task, a classic dyadic routing task from social science. ALMANAC contains 2,987 collaboration actions, each paired with theory-informed mental model annotations that record the participants' self-reasoning, perceived partner intent, and perceived team goal. We benchmark six LLMs on predicting humans' next-turn behavior and mental models. Our results demonstrate ALMANAC's utility in evaluating models' ability to simulate human collaborative behaviors and infer their underlying mental models.
AI Impact Assessments
Scientific Impact Assessment: ALMANAC
1. Core Contribution
ALMANAC introduces a novel dataset that pairs authentic human collaborative behaviors with theory-grounded, action-level mental model annotations. The key novelty lies in capturing not just *what* collaborators do (actions, dialogue), but *why* they do it — their self-reasoning, perceived partner intent, and perceived team goal at each action step. Built on the Map Task, a well-established dyadic routing paradigm from psycholinguistics, the dataset contains 2,987 annotated actions from 50 participants across 25 sessions.
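To make the three annotation dimensions concrete, here is a minimal Python sketch of how one annotated action might be represented. The class name, field names, and example values are hypothetical and chosen only for illustration; the released dataset's actual schema may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotatedAction:
    """One collaboration action with its mental model annotations.

    All field names are assumptions for illustration, not the dataset's schema.
    """
    session_id: int
    participant_id: str
    action: str                    # observable behavior (utterance or canvas action)
    self_reasoning: str            # why the participant acted this way
    perceived_partner_intent: str  # what they believed the partner wanted
    perceived_team_goal: str       # what they believed the pair was trying to achieve


def mental_model(record: AnnotatedAction) -> dict:
    """Return only the cognitive layer of a record, keyed by dimension."""
    return {
        "self_reasoning": record.self_reasoning,
        "perceived_partner_intent": record.perceived_partner_intent,
        "perceived_team_goal": record.perceived_team_goal,
    }


# Made-up example values, not taken from the dataset.
example = AnnotatedAction(
    session_id=1,
    participant_id="P01",
    action="Go left around the lake.",
    self_reasoning="The lake is the clearest landmark on my map.",
    perceived_partner_intent="Partner wants the next turn direction.",
    perceived_team_goal="Reproduce the route on the partner's map.",
)
```

Separating the observable `action` from the three annotation fields mirrors the review's distinction between what collaborators do and why they do it.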
The paper addresses a genuine gap: existing collaboration datasets (DealNoDeal, CaSiNo, MutualFriends) capture observable interactions but omit the cognitive layer that drives collaborative behavior. Existing agent benchmarks (ToolBench, WebArena, τ-Bench) evaluate task completion, not collaborative competence. ALMANAC bridges these by providing supervision signals for process-level collaboration modeling rather than outcome-level task completion.
2. Methodological Rigor
The annotation framework is thoughtfully designed with a two-step approach: in-session checkpoints (at 25%, 50%, 75% progress) capture real-time mental states, which then serve as memory anchors for post-session retrospective action-level annotation. This design mitigates recall bias, though it does not eliminate it entirely — a limitation the authors acknowledge.
The annotation schema is grounded in established collaboration theories (Cannon-Bowers et al., 1993; Marks et al., 2001; Gutwin & Greenberg, 2002; Traum, 1995), lending theoretical validity. The grounding act coding achieves Fleiss' κ = 0.81 among human annotators, and GPT-5.5's automated annotation achieves κ = 0.76 against humans, which is respectable.
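For readers unfamiliar with the agreement statistic cited above, the following is a minimal pure-Python sketch of the standard Fleiss' κ formula. It is not the authors' code; the function name and the count-matrix input layout are assumptions made for illustration.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a count matrix.

    ratings[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    n_categories = len(ratings[0])

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)
```

With three raters and two categories, `fleiss_kappa([[3, 0], [0, 3], [3, 0]])` corresponds to perfect agreement, while mixed rows such as `[[2, 1], [1, 2]]` yield a value below zero, meaning agreement worse than chance.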
The experimental design includes a meaningful between-subjects manipulation (canvas visibility), which creates natural variation in grounding behaviors and mental model alignment patterns. The benchmark evaluates six LLMs across two tasks (next action prediction and mental model prediction), with both prompt-based and fine-tuning conditions.
However, several methodological concerns arise. The dataset is modest in scale (25 sessions, 50 participants), which limits statistical power and generalizability. The train/test split is only 19/6 sessions, making test-set results potentially unstable. The retrospective annotation approach, despite mitigation efforts, remains susceptible to post-hoc rationalization. The paper also does not report inter-annotator agreement on the mental model annotations themselves (only on grounding acts), which is a notable gap since mental models are the central contribution.
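The 19/6 split mentioned above is at the session level. A minimal sketch of such a split follows, assuming integer session IDs and a seeded random shuffle; how the authors actually chose their test sessions is not stated here, so the helper name and procedure are hypothetical.

```python
import random


def split_sessions(session_ids, n_test, seed=0):
    """Split session IDs into disjoint train and test sets.

    Splitting by session rather than by action keeps all actions from one
    dyad on the same side of the split. The seeded shuffle is an assumption
    for illustration, not the paper's procedure.
    """
    ids = sorted(set(session_ids))
    if not 0 < n_test < len(ids):
        raise ValueError("n_test must be between 1 and the number of sessions - 1")
    rng = random.Random(seed)
    rng.shuffle(ids)
    test = set(ids[:n_test])
    train = set(ids[n_test:])
    return train, test


train_sessions, test_sessions = split_sessions(range(25), n_test=6)
```

With only six held-out sessions, a different seed can change the test set substantially, which is one way to see why the review calls test-set results potentially unstable.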
3. Potential Impact
For the LLM agent community, ALMANAC offers a rare resource for training and evaluating agents on collaborative process understanding rather than task completion. The findings that mental model annotations improve behavioral prediction, though not consistently, and that self-reasoning remains hard to predict point to concrete research directions.
For human-AI collaboration, the dataset provides empirical grounding for developing agents that maintain partner models and shared situational awareness — capabilities essential for genuine collaboration beyond instruction-following.
For computational social science, the annotation framework could be adapted to other collaborative settings, providing a reusable methodology for capturing cognitive processes during interaction.
The practical impact is somewhat constrained by the task domain (Map Task), which, while theoretically motivated, is relatively simple compared to real-world collaboration scenarios like collaborative programming or decision-making. Extension to richer domains would be needed to validate generalizability.
4. Timeliness & Relevance
The paper is well-timed. The rapid deployment of LLM agents as collaborative partners (coding assistants, writing tools, meeting support) has outpaced our understanding of how to make them genuinely collaborative rather than merely responsive. The community is actively seeking benchmarks that go beyond task completion metrics. ALMANAC directly addresses this need by providing human-grounded collaboration data with cognitive annotations.
The framing of human-agent collaboration through the lens of human-human collaboration theory (common ground, shared mental models, workspace awareness) is increasingly relevant as agents take on more autonomous, peer-like roles.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
6. Additional Observations
The analysis connecting grounding acts to mental model alignment (Figure 4) is insightful and provides validation that the annotations capture meaningful cognitive variation. The finding that self-reasoning is consistently the hardest mental model dimension to predict is theoretically interesting and suggests that private cognition remains a fundamental challenge for LLMs, beyond simple scaling.
The dataset's release on HuggingFace supports reproducibility. However, the paper could benefit from more detailed error analysis examining *when* and *why* models fail at mental model prediction, rather than primarily reporting aggregate metrics.
Overall, ALMANAC makes a meaningful conceptual contribution by explicitly connecting collaboration theory to LLM agent evaluation, but its empirical impact is tempered by modest scale and limited domain coverage.
Generated Jun 5, 2026
Comparison History (17)
Paper 2 likely has higher scientific impact due to a concrete, reusable dataset with theory-informed, action-level mental model annotations—an enabling resource for many labs to train/evaluate collaborative agents. It is timely given rapid growth of LLM agents and the field’s clear bottleneck in authentic interaction data. The work is methodologically grounded (Map Task, explicit annotation schema, multi-LLM benchmarks) and broadly relevant across NLP, HCI, cognitive science, and social science. Paper 1 is a perspective/overview; while important, it is less immediately enabling and has narrower near-term adoption potential.
Paper 1 offers higher scientific impact due to its broader applicability across cognitive science, human-computer interaction, and general AI development. By providing a novel dataset with action-level mental model annotations, it addresses a fundamental bottleneck in agent-human collaboration (theory of mind and shared goals). In contrast, Paper 2, while highly valuable and methodologically rigorous, is primarily focused on an applied, domain-specific framework (legal tech), making its foundational scientific impact narrower.
Paper 2 offers higher potential impact by addressing a critical, timely paradox in generative AI: the homogenization of creative outputs. While Paper 1 provides a valuable dataset for human-agent collaboration, Paper 2 introduces a broad theoretical framework spanning HCI, cognitive science, and AI ethics. By identifying the underlying mechanisms of selective metacognitive adaptation, Paper 2 explains current empirical observations and provides actionable design principles to mitigate systemic socio-technical risks, ensuring broader multidisciplinary relevance.
Paper 1 introduces a novel, generalizable test-time adaptation framework (hybrid long-term trajectory memory + on-the-fly short-term strategy memory) with demonstrated performance gains across multiple agent benchmarks and an added training method (STEP-MFT). It is timely for deployable LLM agents and broadly applicable to many interactive tasks, suggesting wide downstream impact. Paper 2 provides a valuable dataset and evaluation suite for human-agent collaboration, but its scope is narrower (Map Task-derived dyadic routing) and impact depends on adoption and generalization beyond the dataset. Overall, Paper 1 is more likely to shift methods and practice.
Paper 2 likely has higher impact: it proposes a generally applicable, method-level advance (efficient test-time inference via a modified Open-Closed List search integrating a generative rollout model and learned heuristic) that can improve performance and compute efficiency across many planning domains. This directly targets a timely bottleneck (test-time compute and distribution shift) and can transfer to robotics, operations research, program synthesis, and LLM-based planning. Paper 1 is valuable and novel as a dataset for human-agent collaboration, but its impact may be narrower and more data/task-dependent.
Paper 1 is likely to have higher scientific impact because it delivers a concrete, reusable dataset with theory-informed, action-level mental model annotations—an immediate community resource for benchmarking and training collaborative agents. Its methodology is clearer and more easily verifiable (grounded in a classic social-science task, quantified scale, model benchmarks). The dataset can directly drive progress across NLP, HCI/CSCW, and cognitive science. Paper 2 is conceptually ambitious and timely, but is primarily a proposed architecture with less evident empirical validation, making near-term uptake and measurable impact less certain.
Paper 1 introduces a novel, training-free causal tool-filtering method that directly improves reliability, safety (fewer premature/wrong tool calls), and efficiency (large token savings) for practical LLM agents—a timely, widely applicable problem as tool-using agents proliferate. It appears methodologically rigorous via multi-model, multi-task benchmarking with strong baselines and multiple metrics. Paper 2 provides a valuable annotated dataset for human-agent collaboration research, but its immediate real-world impact may be narrower (Map Task domain) and downstream gains depend on subsequent model-training work. Overall breadth and near-term applicability favor Paper 1.
ALMANAC introduces a novel, reusable dataset resource addressing a clear gap in human-AI collaboration research—action-level mental model annotations grounded in social science theory. It has broader applicability across HCI, multi-agent systems, and cognitive science, and provides benchmarks for LLM evaluation. Paper 2 offers valuable negative/cautionary findings about intervention timing reliability, but its scope is narrower (runtime safety for autonomous agents), its findings are largely diagnostic rather than constructive, and the low inter-rater reliability result, while important, limits the field's ability to build on it. Paper 1's dataset contribution has more lasting utility.
Goedel-Architect achieves extraordinary results in formal theorem proving: 100% on MiniF2F-test, 88.8% on PutnamBench, and strong performance on IMO 2025 and Putnam 2025—representing massive leaps over prior state-of-the-art. The blueprint generation and refinement paradigm is a novel architectural contribution with immediate practical applications in mathematics and formal verification. The 500x cost reduction and use of open-weight models further amplify impact. Paper 2 contributes a useful but niche dataset for human-AI collaboration research with more incremental impact on the field.
Paper 2 demonstrates higher potential scientific impact due to several factors: (1) it addresses critical enterprise-level challenges (hallucination, compliance, domain drift) with a formal neurosymbolic architecture validated across multiple LLMs and industries; (2) the 'inverse parametric knowledge effect' is a novel, generalizable finding; (3) it has immediate real-world deployment (650+ agents, 22 verticals); (4) the cross-model replication strengthens methodological rigor; (5) it bridges the neurosymbolic AI and enterprise systems communities. Paper 1, while valuable for human-AI collaboration research, is more niche in scope with a smaller dataset and narrower evaluation framework.
Paper 1 addresses a highly timely and critical bottleneck in human-AI collaboration: equipping LLM agents with theory-of-mind capabilities. By providing a novel dataset with action-level mental model annotations, it opens up new avenues for evaluating and training agents beyond mere task completion. While Paper 2 offers a solid optimization approach for a classic problem (class imbalance), Paper 1 has broader cross-disciplinary implications for AI, social science, and human-computer interaction, representing a more significant leap in capability for next-generation AI assistants.
Paper 1 addresses a highly critical and timely challenge in the rapidly expanding field of LLM agents: endowing them with theory-of-mind capabilities for human-AI collaboration. By providing a novel dataset and benchmark for action-level mental models, it offers broad applicability and high potential for rapid adoption. Paper 2 is methodologically rigorous but focuses on a more specialized intersection of deep learning and constrained optimization, which likely has a narrower breadth of impact.
Paper 1 introduces a novel conceptual framework for knowledge infusion in generative models that addresses a critical and timely problem (reliability, safety, domain compliance). Its layered intervention taxonomy provides broadly applicable design principles across multimodal generative AI, supported by empirical validation showing 70.97% reduction in knowledge-violating outputs. Paper 2 contributes a valuable but narrower dataset for human-AI collaboration mental models. While useful, its scope (2,987 annotations from a specific routing task) and benchmarking focus limit its breadth of impact compared to Paper 1's framework-level contribution addressing safety-critical AI generation.
Paper 2 likely has higher impact due to its broadly useful, timely dataset contribution: authentic human collaboration traces with theory-informed, action-level mental model annotations. Such resources can become community benchmarks, enabling reproducible evaluation and training across many agent-collaboration methods and fields (NLP, HCI, social science). The methodological framing (Map Task) and multi-model benchmarks strengthen rigor and adoption potential. Paper 1 is a practical architecture for local preference learning in agents, but appears narrower in scope and may be harder to generalize beyond specific skill-selection setups.
Paper 1 addresses a critical vulnerability in LLM-as-judge evaluations, a ubiquitous methodology in current AI benchmarking. By exposing post-decision manipulability and introducing a robustness metric, it has immediate, widespread implications for AI safety, alignment, and evaluation protocols. While Paper 2 offers a valuable dataset for human-agent collaboration, Paper 1's findings challenge foundational practices across the broader LLM research community, granting it a higher potential for immediate and broad scientific impact.
Paper 2 likely has higher impact: it contributes a reusable, theory-informed human collaboration dataset with action-level mental model annotations—an enabling resource for broad research on human–AI/agent collaboration, evaluation, and training. Its methodology is grounded in a classic social-science task and provides benchmarks across multiple LLMs, supporting rigor and comparability. The dataset can generalize across domains (dialogue, HCI, cognitive science, agent alignment), making its cross-field reach high and timely as agents become collaborators. Paper 1 is promising for UI/UX, but is narrower and more solution-specific.
Paper 2 has higher potential scientific impact due to its direct application in drug discovery and chemistry. By addressing multi-objective molecular optimization through a novel tree-structured multi-agent framework, it solves complex, high-stakes real-world problems with significant downstream societal value. While Paper 1 provides a valuable dataset for human-agent interaction, Paper 2's methodological innovation in navigating conflicting objectives in vast chemical spaces offers broader and more immediate scientific breakthroughs in the critical field of AI-driven scientific discovery.