Shang Ma, Jisheng Dang, Wencan Zhang, Yifan Zhang, Bimei Wang, Hong Peng, Bin Hu, Qi Tian
We propose a multi-agent collaborative framework built upon a lightweight Multimodal Large Language Model (MLLM), specifically designed for social intelligence reasoning. A key feature of our approach is that both the training and inference phases are augmented via knowledge distillation. Within this architecture, multi-modal data pertinent to social intelligence is precisely localized. Furthermore, relevant long-tail events are identified, extracted, and rendered as formatted, explicit text. This formatting strategy prevents critical long-tail information from being overshadowed by head events and environmental noise during the tokenization process. Specifically, we integrate Test-Time Adaptation (TTA) across the entire reasoning pipeline, encompassing the extraction and representation of long-tail events, Chain-of-Thought (CoT) prompting, and self-reflection. This TTA mechanism is also distillation-enhanced, utilizing Low-Rank Adaptation (LoRA) to fine-tune the foundation model exclusively for instance-level reasoning. Extensive evaluations against various open-source and proprietary AI models across multiple benchmarks demonstrate the effectiveness of the proposed framework. With around 30% of the training data from IntentTrain, we achieve state-of-the-art results. Code is available at https://github.com/eeee-sys/MODF-SIR, a demo at https://huggingface.co/spaces/Harry-1234/MODF-SIR, the LoRA weights at https://huggingface.co/Harry-1234/MODF-SIR, and the dataset for training the router at https://huggingface.co/datasets/Harry-1234/IntentRouterTrain.
MODF-SIR proposes a multi-agent collaborative framework for social intelligence reasoning, the task of inferring human intentions, emotions, and interpersonal dynamics from multimodal (video, audio, text) signals. The framework is built on a lightweight 7B-parameter MLLM (Qwen2.5-Omni) and orchestrates five specialized agents.
The central novelty lies in combining knowledge distillation (both at training and inference), test-time adaptation with instance-specific LoRA that is discarded post-inference, and a dual-process inspired routing mechanism. The framework explicitly targets "long-tail events" — subtle behavioral cues like micro-expressions and vocal tremors — that are typically overwhelmed by dominant signals during tokenization.
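The idea of an instance-specific LoRA that is discarded after inference can be illustrated with a toy example. The sketch below is not the authors' implementation: it uses a tiny linear layer in pure Python, a squared-error loss, and a `teacher_target` vector standing in for whatever signal the teacher model provides. The class name, function name, and hyperparameters are all assumptions made for illustration.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class EphemeralLoRA:
    """Frozen base weight W plus a temporary low-rank update B @ A (toy sketch)."""

    def __init__(self, W, rank=1, seed=0):
        rng = random.Random(seed)
        self.W = W  # frozen base weight: read, never written
        d_out, d_in = len(W), len(W[0])
        # A gets a small random init, B starts at zero, so the adapter
        # initially leaves the base model's output unchanged.
        self.A = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        h = matvec(self.A, x)
        base = matvec(self.W, x)
        delta = matvec(self.B, h)
        return [b + d for b, d in zip(base, delta)]

    def step(self, x, target, lr):
        """One gradient step on 0.5 * ||forward(x) - target||^2; returns the pre-step loss."""
        h = matvec(self.A, x)
        err = [y - t for y, t in zip(self.forward(x), target)]
        # Gradient w.r.t. h is computed with the current B, before B is updated.
        grad_h = [sum(err[i] * self.B[i][r] for i in range(len(err)))
                  for r in range(len(h))]
        for i in range(len(self.B)):
            for r in range(len(h)):
                self.B[i][r] -= lr * err[i] * h[r]
        for r in range(len(self.A)):
            for j in range(len(x)):
                self.A[r][j] -= lr * grad_h[r] * x[j]
        return 0.5 * sum(e * e for e in err)


def answer_with_ephemeral_adapter(W, x, teacher_target, steps=200, lr=0.1):
    """Adapt to a single instance, answer it, and drop the adapter on return.

    teacher_target is a stand-in (assumption) for a teacher-derived signal.
    """
    adapter = EphemeralLoRA(W)
    losses = [adapter.step(x, teacher_target, lr) for _ in range(steps)]
    y = adapter.forward(x)
    return y, losses  # adapter goes out of scope here; W was never modified
```

Calling `answer_with_ephemeral_adapter` once per input gives each instance a fresh adapter, so nothing learned for one sample leaks into the next and the base weights stay identical across calls.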
The framework addresses a meaningful gap: current MLLMs struggle with implicit social signals, particularly in long-form video content where subtle cues are drowned by dominant events. The multi-agent decomposition provides interpretability — each agent's output is visible, making the reasoning process auditable.
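The auditability point can be made concrete with a small sketch of a pipeline that records every agent's output. This is an illustration only: the agent names and their toy outputs below are hypothetical placeholders, not the framework's actual agents or API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class AgentStep:
    agent: str
    output: str


@dataclass
class Trace:
    steps: List[AgentStep] = field(default_factory=list)

    def render(self) -> str:
        """Human-readable audit log, one line per agent."""
        return "\n".join(
            f"[{i}] {s.agent}: {s.output}" for i, s in enumerate(self.steps, 1)
        )


def run_pipeline(agents: List[Tuple[str, Callable[[str], str]]], query: str):
    """Run agents in order, feeding each one the previous output and logging it."""
    trace = Trace()
    current = query
    for name, fn in agents:
        current = fn(current)
        trace.steps.append(AgentStep(name, current))
    return current, trace


# Hypothetical agents with toy string outputs, for illustration only.
HYPOTHETICAL_AGENTS = [
    ("grounding", lambda q: f"relevant segment located for {q}"),
    ("event_extraction", lambda s: f"explicit event text derived from: {s}"),
    ("reasoning", lambda e: f"answer based on: {e}"),
]

if __name__ == "__main__":
    answer, trace = run_pipeline(HYPOTHETICAL_AGENTS, "clip_001")
    print(trace.render())
```

Because every intermediate output is stored as text, a reviewer can inspect where a wrong answer originated instead of treating the system as a single opaque model call.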
Real-world applications include:
Broader influence: The TTA with ephemeral LoRA is a potentially generalizable technique beyond social intelligence — any task requiring instance-level adaptation could benefit. The dual-process routing concept could influence other multi-agent MLLM architectures.
However, the inference-time computational cost (iterative LoRA + teacher evaluation) significantly limits practical deployment. The reliance on a 30B teacher model during inference is a substantial constraint.
The paper is highly timely. Multi-agent LLM frameworks, test-time computation scaling, and social AI are all active research frontiers. The integration of GRPO (inspired by DeepSeek-R1) and LoRA-based TTA reflects cutting-edge techniques. The benchmarks used (Daily-Omni, IntentBench, WorldSense) appear recent (2025), suggesting the paper targets the current frontier.
The focus on omni-modal (video + audio + text) reasoning is increasingly relevant as models move beyond vision-language to truly multimodal understanding. Social intelligence reasoning is an emerging and underexplored area compared to standard VQA.
MODF-SIR presents a comprehensive and technically sound multi-agent framework for social intelligence reasoning with several innovative components (ephemeral LoRA TTA, GRPO-trained grounding, asymmetric knowledge distillation for routing). It achieves consistent improvements across benchmarks. However, the practical viability is questionable due to unquantified inference costs, teacher model dependencies, and modest gains on the most relevant benchmark. The contribution is primarily in systems-level integration rather than fundamental algorithmic innovation.
Generated Jun 11, 2026
MODF-SIR addresses the broader and more impactful problem of social intelligence reasoning with multimodal models, combining knowledge distillation, test-time adaptation, and multi-agent collaboration in a novel framework. It demonstrates state-of-the-art results across multiple benchmarks with only about 30% of the training data, suggesting strong practical efficiency. Paper 1 (SkillJuror) makes a narrower contribution studying how skill organization affects agent behavior, with modest outcome improvements (+4.1%). Paper 2's combination of techniques and its applicability to social AI reasoning give it wider relevance and stronger potential impact across multiple research communities.
Paper 2 presents a broader, more fundamental contribution to AI through a multi-agent omni-modal framework addressing long-tail event extraction and test-time adaptation. Its open-source release of models, data, and code maximizes reproducibility and future research potential. In contrast, Paper 1 is a competition-specific technical report for an autonomous driving challenge, which, while practically valuable, has a narrower scope and more incremental methodological impact.
Paper 2 addresses the highly timely and rapidly expanding field of Multimodal Large Language Models and multi-agent systems. Its focus on social intelligence reasoning and handling long-tail events offers broader potential real-world applications in human-AI interaction compared to Paper 1's focus on classical search algorithms. Furthermore, Paper 2 provides an open-source framework, datasets, and demonstrable state-of-the-art results, ensuring immediate accessibility and high potential for widespread adoption and follow-up research across the broader AI community.
Paper 1 presents a highly innovative, technically rigorous framework for social intelligence reasoning, integrating multi-agent systems, multimodal LLMs, knowledge distillation, and Test-Time Adaptation. Its achievement of state-of-the-art results and provision of open-source models, code, and datasets give it strong potential for high scientific impact and immediate utility in the AI research community. While Paper 2 is timely for AI policy, Paper 1 offers broader, foundational advancements in core AI methodology.
SciConBench addresses a critical gap in evaluating AI agents' ability to synthesize scientific conclusions in high-stakes domains like health. Its large-scale benchmark (9.11K questions), clean-room evaluation methodology to counter data leakage, and audit of consumer-facing tools (Google AI Overview, OpenEvidence) have broad implications for AI safety, scientific integrity, and policy. The finding that even the best agents achieve only 0.337 F1 highlights a fundamental limitation with wide-reaching consequences. Paper 2 presents an incremental multi-agent framework for social intelligence reasoning with narrower scope and less transformative potential.
Paper 2 addresses a highly active and broader field (multimodal large language models and multi-agent systems) with state-of-the-art results. Its combination of test-time adaptation, knowledge distillation, and long-tail event extraction offers significant methodological innovation. The provision of open-source code, models, and datasets further accelerates its potential adoption and citations. In contrast, Paper 1 proposes a valuable but highly domain-specific ontology pattern for compliance knowledge graphs, which has a narrower scope of impact.
Paper 1 addresses a critical, high-stakes domain (medical diagnosis) by introducing a novel, large-scale pulmonary knowledge graph (LungKG) and a KG-guided LLM framework. Its focus on grounding diagnostic reasoning in EMR data directly tackles major limitations of LLMs in healthcare. While Paper 2 presents a strong open-source multimodal framework, Paper 1's creation of a foundational medical resource and its highly translational clinical applications give it a higher potential for significant scientific and societal impact.
Paper 1 has higher likely scientific impact due to clearer methodological novelty (a normative intermediate representation with formal operational semantics plus an executable ASP compilation) and stronger rigor/verification potential in a high-stakes domain (regulatory compliance). Its contributions are more reusable across legal/requirements engineering, knowledge representation, and formal methods, with practical applicability to safety standards. Paper 2 is timely and open-sourced, but it primarily combines existing techniques (multi-agent, distillation, TTA, LoRA, CoT) with benchmark-driven gains, which may be less durable scientifically and harder to generalize beyond the presented tasks.
Paper 2 likely has higher impact: it tackles a key scalability bottleneck in a well-known neurosymbolic framework (NeurASP) and delivers orders-of-magnitude speedups via broadly applicable systems techniques (vectorization, batching, caching). This directly enables larger, more complex neurosymbolic tasks and can be adopted by other neuro-symbolic/logic-learning systems, giving cross-field relevance (ML systems + symbolic reasoning). Paper 1 is timely and application-oriented but combines several existing ideas (multi-agent, distillation, TTA, LoRA, CoT) with narrower domain focus, making generalizable methodological contribution less clear.
Paper 2 addresses a fundamental challenge in neural combinatorial optimization (TSP) by introducing structure-aware projections to diffusion models. Its ability to outperform state-of-the-art neural solvers and classical heuristics while significantly reducing inference time and memory usage provides high methodological rigor and broad real-world applicability in logistics and operations research. Paper 1, while comprehensive, primarily focuses on assembling existing MLLM techniques (LoRA, CoT, TTA) and exhibits narrower fundamental algorithmic innovation compared to Paper 2.