Songliang Cao, Jiele Zhao, Yuru Wang, Hao Li, Daqi Liu, Zehan Zhang, Fangzhen Li, Yu Wang
With the development of autonomous driving systems, mining high-value, safety-critical, and planning-relevant scenarios from large-scale driving logs has become essential for data-driven evaluation. In this paper, we propose AutoMine, a robust self-refining scenario mining method based on LLMs and VLMs. AutoMine uses semantics-preserving prompt augmentation to reduce LLM prompt sensitivity, combines robust trajectory atomic functions with VLM-based functions to handle perception noise and open-world visual cues, and refines generated code through execution feedback from real logs. In the Argoverse 2 Scenario Mining Competition at CVPR 2026, AutoMine achieves a HOTA-Temporal score of 36.38 and a Timestamp BA score of 77.21.
AutoMine presents a multimodal scenario mining framework that retrieves safety-critical driving scenarios from large-scale logs given natural language descriptions. The system combines four main components: (1) trajectory refinement to handle noisy perception outputs, (2) semantics-preserving prompt augmentation to reduce LLM sensitivity to query phrasing, (3) a hybrid library of robust trajectory-based and VLM-enhanced atomic functions, and (4) an execution-driven self-refinement loop that uses real-log feedback to iteratively fix code errors.
The core novelty lies in the integration of these components into a coherent pipeline rather than in any single algorithmic breakthrough. The self-refinement loop, where execution diagnostics (e.g., empty-log ratio, category distribution, track statistics) are fed back to the LLM for code repair, is the most distinctive element and addresses a genuine practical problem: LLM-generated code frequently contains systematic errors that are invisible from code inspection alone but manifest clearly during execution.
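To make the feedback loop concrete, here is a minimal sketch of how such execution diagnostics might be computed and used to trigger a repair round. It is not the authors' implementation: the `Track` record, the shape of the mining result (log ID mapped to matched tracks), the `run` and `repair` callables, and the `max_empty_ratio` threshold are all hypothetical names chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Track:
    """Hypothetical record for one referred object in one log."""
    category: str
    timestamps: List[int]


# Assumed result shape: log ID -> tracks matched by the generated code.
MiningResult = Dict[str, List[Track]]


def execution_diagnostics(result: MiningResult) -> dict:
    """Summarize one execution of generated code over a set of logs."""
    n_logs = len(result)
    n_empty = sum(1 for tracks in result.values() if not tracks)
    all_tracks = [t for tracks in result.values() for t in tracks]
    lengths = [len(t.timestamps) for t in all_tracks]
    return {
        "empty_log_ratio": n_empty / n_logs if n_logs else 0.0,
        "category_distribution": dict(Counter(t.category for t in all_tracks)),
        "num_tracks": len(all_tracks),
        "mean_track_length": sum(lengths) / len(lengths) if lengths else 0.0,
    }


def self_refine(
    code: str,
    run: Callable[[str], MiningResult],   # executes code on real logs (assumed)
    repair: Callable[[str, dict], str],   # asks an LLM to fix code (assumed)
    max_rounds: int = 3,
    max_empty_ratio: float = 0.9,         # illustrative acceptance threshold
) -> str:
    """Run, diagnose, and repair until the diagnostics look plausible."""
    for _ in range(max_rounds):
        diag = execution_diagnostics(run(code))
        if diag["empty_log_ratio"] <= max_empty_ratio:
            return code  # accepted: not (almost) every log came back empty
        code = repair(code, diag)
    # Round budget exhausted; the last repair is returned unvalidated.
    return code
```

A single acceptance test on the empty-log ratio keeps the sketch short; a fuller version would presumably also check the category distribution and track statistics against what the query implies.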
As a competition technical report, the methodological presentation is adequate but limited in depth. The ablation study (Table 2) is well structured, incrementally adding each component and showing that gains are complementary rather than redundant. The authors provide useful qualitative observations about *why* each component helps. For instance, trajectory refinement primarily improves HOTA-Track (lifetime of referred objects) rather than HOTA-Temporal, while atomic function optimization brings the largest single-component gain.
However, several aspects lack rigor:
The practical impact is moderate. Scenario mining is increasingly important for autonomous driving validation, and the problem of retrieving specific driving events from massive datasets using natural language is genuinely useful. The approach of combining symbolic program execution with LLM code generation and VLM visual reasoning is a sensible paradigm that others in the field are likely to adopt.
The execution-driven self-refinement concept applies beyond scenario mining: it could generalize to any setting where LLM-generated code operates on structured data and can be validated against execution statistics. This is perhaps the most transferable idea in the paper.
However, the impact is constrained by:
The paper is timely. Scenario mining from large driving datasets is an active area, and the use of LLMs/VLMs for this task is still emerging. The competition setting (CVPR 2026) ensures the work addresses a current benchmark. The observation that LLMs are sensitive to prompt wording in code generation tasks is well motivated by recent literature, and the proposed mitigations are practical.
The work builds directly on RefProg [1], extending it with multimodal capabilities, robustness mechanisms, and self-refinement. This represents an incremental but practical advancement over the baseline approach.
AutoMine is a competent competition solution that achieves strong results through careful engineering of multiple components around LLM/VLM-based scenario mining. The execution-driven self-refinement loop and the dual symbolic/visual reasoning approach are its most notable contributions. However, as a short technical report, it lacks the depth, analysis, and novelty typically expected for significant scientific impact. The contributions are primarily systems-level integration rather than fundamental methodological advances.
Generated Jun 11, 2026
Paper 2 presents a broader, more fundamental contribution to AI through a multi-agent omni-modal framework addressing long-tail event extraction and test-time adaptation. Its open-source release of models, data, and code maximizes reproducibility and future research potential. In contrast, Paper 1 is a competition-specific technical report for an autonomous driving challenge, which, while practically valuable, has a narrower scope and more incremental methodological impact.
Paper 1 introduces a novel framework (HELM) addressing a significant gap in automating finite element modeling for safety-critical infrastructure, combining AI agents with human-in-the-loop verification. It presents comprehensive experimental evaluation across 20 cases, provides open-source tools, and addresses fundamental challenges in AI-assisted engineering simulation. Paper 2, while technically sound, is primarily a competition solution report for a specific challenge with narrower scope and less generalizable contributions. HELM's cross-disciplinary impact (AI + structural engineering) and its systematic analysis of agent failure modes offer broader scientific value.
Paper 2 presents a concrete, methodologically successful solution to a critical problem in autonomous driving (safety-critical scenario mining), demonstrating strong empirical results in a recognized competition. In contrast, Paper 1, while targeting the important domain of medical research, reports inconclusive, exploratory findings with limited statistical significance and poor expert agreement, reducing its immediate scientific and practical impact.
Paper 1 investigates foundational cognitive architectures for AI agents with high methodological rigor, including pre-registered experiments and robust statistical analyses. In contrast, Paper 2 is a solution tailored to a specific dataset challenge (AV2 2026 Scenario Mining Challenge), which typically has a narrower, more applied impact compared to foundational AI memory research.
Paper 2 addresses a fundamental and broadly applicable challenge in BIM compliance checking with a novel graph-based semantic reasoning framework (SGR-BIM) that bridges regulatory semantics and geometric data. It offers a reusable paradigm for the entire AEC industry with rigorous validation on 679 expert-verified queries. Paper 1, while technically competent, is a competition solution report for a specific challenge (AV2 2026), which typically has narrower impact and lower novelty beyond the competition context. Paper 2's cross-modal knowledge graph approach has broader methodological contributions and real-world applicability.
Paper 1 has higher likely scientific impact due to stronger novelty (LLM/VLM-driven self-refining scenario mining with execution-feedback code refinement and prompt-sensitivity mitigation), clear methodological integration across language, vision, and trajectory analysis, and broad relevance to safety-critical autonomous driving evaluation. It targets a timely, high-stakes problem (mining safety-critical scenarios from large logs) with direct real-world application and potential transfer to other domains needing robust data mining from multimodal streams. Paper 2 is useful but is mainly an engineering combination (curriculum + multi-model selection) with narrower breadth and evaluation limited to BERTScore on one dataset.
Paper 2 likely has higher scientific impact: it introduces a novel, timely LLM/VLM-driven self-refining scenario mining pipeline with execution-feedback code refinement and demonstrates competitive performance on a prominent CVPR 2026 benchmark/competition, with clear real-world relevance to autonomous driving safety evaluation and potential reuse across robotics/ML. Paper 1 is valuable for AI regulation clarity, but its contribution is more domain-specific (legal/definition of inference under the EU AI Act) and less likely to drive broad technical adoption or cross-field methodological advances at scale.
Lung-R1 presents a more impactful contribution: a novel knowledge graph (LungKG) with 59K nodes and 164K edges, a new training methodology combining KG-constrained reasoning with reinforcement learning, and direct clinical application to pulmonary diagnosis. It addresses a clearly defined gap (Knowledge-to-Diagnosis) with reusable resources and rigorous evaluation across 20 systems. Paper 1, while technically sound, is a competition solution for a narrow scenario mining task with incremental engineering contributions (prompt augmentation, code refinement). Paper 2 has broader cross-disciplinary impact spanning NLP, medical AI, and clinical practice.
Paper 2 presents a generalizable framework for open-ended scientific discovery, addressing a critical bottleneck in AI-driven research: evidence calibration. Its potential to accelerate discoveries across diverse scientific disciplines gives it a much broader and more profound impact. In contrast, Paper 1 is an engineering solution tailored to a specific autonomous driving competition. While highly valuable for AV safety, its scope, methodology, and impact are narrow and domain-specific compared to the foundational AI-for-science advancements proposed in Paper 2.
Paper 2 is likely higher impact: it addresses a broadly relevant, timely bottleneck (long-context efficiency) with a general recipe (SA→SWA conversion + RL adaptation) that can transfer across LLM reasoning tasks and model families, potentially influencing both research and deployment. Its key insight (data-architecture mismatch and RL as an adaptation mechanism) is conceptually novel and could affect how the community evaluates efficient attention. Paper 1 is strong for autonomous driving scenario mining and competition results, but is more domain-specific and appears more system/engineering-oriented, limiting cross-field breadth.