Yoojin Nam, Jinhoon Jeong, Namkug Kim
Objective. Large language models (LLMs) increasingly draft clinical research manuscripts, but their fluency can hide fabricated citations, numbers that drift from source tables, and unmet reporting-guideline items. Existing tools generate text without verifying it, and self-critique inherits the blind spots that produce confident fabrication. We describe an architecture that pairs generation with verification.
Methods. The design rests on three principles: decompose the workflow into self-contained skills, gate every stage transition with halt-on-failure, and resolve each integrity question with the cheapest sufficient mechanism -- a deterministic, re-executable check where one suffices, and a prose-level probe only where interpretation is unavoidable. This determinism-where-possible split, organized as an integrity-gate taxonomy, is the core contribution. It is realized as MedSci Skills, an open-source toolkit of 43 skills coordinated by one orchestrator, whose deterministic tier comprises 21 standard-library detectors. We evaluate it on three reproducible public-dataset pipelines (STARD, PRISMA, STROBE) and a seeded-defect ablation.
Results. Across the three pipelines every content-hash manifest verified clean and the gates surfaced real defects. On 27 identical injected defects the deterministic gates detected all 27 with no false positives on the matched clean fixtures, whereas a generic single-prompt LLM reviewer detected 11, its misses concentrated in generated-code, bibliography-internal, and style defects the prose does not expose.
Conclusion. Determinism-where-possible verification yields an auditable, re-executable trail that exposes the evidence a human needs to check an LLM-assisted manuscript -- feasibility and reproducibility evidence, not a claim of human-competitive quality, which a separate blinded study addresses. MedSci Skills is MIT-licensed and archived (v3.8.0).
The paper introduces an architecture called "MedSci Skills" that pairs LLM-based manuscript generation with a structured verification layer. The central idea is an "integrity-gate taxonomy" that classifies each verification question by whether it can be resolved deterministically (lookup, arithmetic, pattern matching) or requires interpretive judgment (prose-level probes). The deterministic tier comprises 21 standard-library detectors organized into five families: citation/reference integrity, numerical/cohort arithmetic, confounding/scope contracts, reporting compliance, and style/review-process integrity. The pipeline enforces halt-on-failure at every stage transition, preventing defective artifacts from propagating.
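The gating idea summarized above can be illustrated with a minimal sketch. It is not the MedSci Skills implementation: the names `GateResult`, `GateFailure`, `run_gates`, the two example checks, and the artifact layout (a dict with `text`, `bibliography`, and `cohort` keys) are all hypothetical, chosen only to show deterministic checks combined with halt-on-failure.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class GateResult:
    gate: str
    passed: bool
    detail: str = ""


class GateFailure(Exception):
    """Raised to stop the pipeline as soon as one gate fails."""


def check_citations_resolve(artifact: dict) -> GateResult:
    # Deterministic lookup: every [@key] cited in the prose must exist
    # in the bibliography. The [@key] syntax is an assumption of this sketch.
    cited = set(re.findall(r"\[@([\w-]+)\]", artifact["text"]))
    missing = sorted(cited - set(artifact["bibliography"]))
    detail = f"unresolved keys: {missing}" if missing else ""
    return GateResult("citation-integrity", not missing, detail)


def check_cohort_arithmetic(artifact: dict) -> GateResult:
    # Deterministic arithmetic: enrolled must equal analyzed + excluded.
    c = artifact["cohort"]
    ok = c["enrolled"] == c["analyzed"] + c["excluded"]
    detail = "" if ok else f"{c['enrolled']} != {c['analyzed']} + {c['excluded']}"
    return GateResult("cohort-arithmetic", ok, detail)


# Hypothetical gate list; a real deployment would order gates by stage.
GATES: list[Callable[[dict], GateResult]] = [
    check_citations_resolve,
    check_cohort_arithmetic,
]


def run_gates(artifact: dict, gates: list[Callable[[dict], GateResult]]) -> list[GateResult]:
    """Run gates in order and halt on the first failure."""
    results = []
    for gate in gates:
        result = gate(artifact)
        results.append(result)
        if not result.passed:
            raise GateFailure(f"{result.gate}: {result.detail}")
    return results
```

Calling `run_gates(artifact, GATES)` returns the list of passing results for a clean artifact and raises `GateFailure` at the first failing gate, so a defective artifact never reaches the next stage. Because each check is a pure function of the artifact, rerunning it on the same input gives the same verdict, which is the property the review refers to as re-executable.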
The intellectual contribution is essentially architectural: the principle of "determinism-where-possible" applied to the medical manuscript verification problem. Individual components (citation checking, numerical reconciliation, checklist auditing) are not novel in isolation, but their systematic organization under a unified taxonomy with halt-on-failure orchestration represents a useful engineering contribution.
The evaluation has notable strengths and weaknesses. Strengths: The three end-to-end demonstrations use public datasets (Wisconsin Breast Cancer, BCG vaccine trials, NHANES 2017-2018), are fully reproducible via content-hash manifests, and span three reporting guidelines (STARD, PRISMA, STROBE). The seeded-defect evaluation with 27 injected defects is well-designed as a regression-style test, and the comparison against a single-prompt LLM reviewer provides useful contrast.
Weaknesses: The evaluation is fundamentally a feasibility demonstration, not a rigorous empirical study. The 27 seeded defects come from 17 injectors designed by the same developer who built the detectors, so the test essentially asks whether hand-crafted detectors catch the specific defect classes they were designed for. Perfect recall (27/27) in this setting is expected rather than impressive. The comparison against a "generic single-prompt LLM reviewer" is a weak baseline: a multi-turn, structured LLM review with tool access would likely be far more competitive. The paper acknowledges this limitation, but the framing still risks overstating the contrast.
The false-positive analysis is particularly thin: testing clean inputs paired with each injection tells us about specificity on pristine artifacts, not about real-world false-positive rates on messy, imperfect manuscripts. The single operational false positive (confounding-completeness on Demonstration 3) is reported transparently, which is commendable, but one instance is insufficient to characterize precision.
The primary author is both the toolkit developer and the evaluator, and human judgments about true/false positive status were made by this same person. The paper discloses this forthrightly, but it remains a significant limitation.
The practical need is genuine and growing. LLM-assisted manuscript drafting is becoming widespread, and the failure modes identified (fabricated citations, numerical drift, reporting-guideline gaps) are real and documented. The architecture addresses a legitimate gap: most LLM writing tools lack systematic verification.
Immediate applications: Pre-submission screening by authors, editorial office triage of LLM-assisted submissions, and integration into institutional research integrity workflows. The open-source release (MIT license, Zenodo archive) lowers adoption barriers.
Broader influence: The "determinism-where-possible" principle could transfer to other domains — legal document verification, financial reporting, regulatory submissions. The paper explicitly notes this generalizability. However, the current implementation is heavily tailored to medical research manuscripts, so actual cross-domain transfer would require substantial new development.
The impact is likely moderate. The toolkit serves a real need but occupies a niche: researchers who use LLMs for manuscript drafting AND want systematic verification AND work in clinical research. The competitive landscape includes rapidly evolving general-purpose guardrail frameworks that could subsume much of this functionality.
The paper is timely. LLM-assisted academic writing is accelerating (the paper cites Siler 2026 on widespread LLM language in published articles and Topaz et al. 2026 on fabricated citations across 2.5 million papers). Journals are increasingly concerned about AI-generated content, and tools for verification are needed now. The reporting-guideline integration is particularly relevant given ongoing efforts by the EQUATOR Network and journals to improve adherence.
Strengths:
1. Transparency and intellectual honesty: The paper is unusually forthright about its limitations. It reports false positives, acknowledges the developer-as-evaluator conflict, explicitly bounds its claims to feasibility rather than quality, and notes two instructive false positives found when the toolkit was run on the paper itself.
2. Reproducibility infrastructure: Content-hash manifests, version-pinned archives, two-snapshot versioning, and standard-library-only detectors represent genuine commitment to reproducibility that goes beyond typical software papers.
3. Well-articulated design principles: The decomposition-gating-determinism framework is clearly stated and consistently applied, making the architecture transferable even if the specific implementation is not reused.
4. Practical relevance: The focus on manuscript-level integrity (not just text quality) addresses a genuine gap in the AI-assisted writing ecosystem.
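As an illustration of the content-hash manifests mentioned in the list above, the following is a minimal sketch using only the Python standard library. The paper's actual manifest format is not described in this review, so the flat `{relative_path: sha256}` mapping and the function names `build_manifest` and `verify_manifest` are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large artifacts do not load into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by POSIX-style relative path) to its SHA-256."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the manifest verified clean."""
    problems = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_of(target) != expected:
            problems.append(f"changed: {rel_path}")
    return problems
```

A manifest built once at the end of a pipeline run can be re-verified later by anyone holding the same files. This sketch only detects missing or altered files that the manifest lists; it does not flag extra files added afterwards.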
Weaknesses:
1. Circular evaluation: Detectors designed for specific defect classes predictably catch those defects. The absence of naturally-occurring errors from real manuscripts severely limits external validity.
2. Weak baseline: A single-prompt LLM review is not a credible upper bound on what LLM-based verification can achieve; structured multi-step LLM review with tool access would likely narrow the gap substantially.
3. Scale of evaluation: Three demonstrations and 27 seeded defects are a small evidence base. The 54% exact-match audit-trail figure, while honestly reported, suggests significant gaps in traceability.
4. Single-developer artifact: The toolkit reflects one researcher's domain expertise and workflow assumptions, limiting generalizability without substantial community contribution.
5. No user study: The paper makes no claim about whether the audit trail actually helps human reviewers catch errors — the central practical question — deferring this entirely to a companion study.
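To make the scale concern in the list above concrete, one can attach a confidence interval to the two detection counts reported in the abstract (27 of 27 for the deterministic gates, 11 of 27 for the single-prompt LLM reviewer). The sketch below uses the Wilson score interval. This is an illustrative choice made here, not an analysis taken from the paper, and the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    for label, hits in [("deterministic gates", 27), ("single-prompt LLM reviewer", 11)]:
        low, high = wilson_interval(hits, 27)
        print(f"{label}: {hits}/27, 95% interval {low:.2f} to {high:.2f}")
```

With only 27 seeded defects, a perfect 27/27 still leaves a lower bound of roughly 0.88 on the detection rate, and 11/27 spans roughly 0.25 to 0.59. The intervals also assume the defects are independent draws from the population of interest, which the circular-evaluation point above calls into question.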
This is a well-conceived systems paper that addresses a timely problem with a principled architectural approach. The contribution is primarily in the design philosophy (determinism-where-possible verification) rather than in technical novelty. The evaluation demonstrates feasibility but not effectiveness, and the evidence base is too small and too circular to support strong claims. The transparency and reproducibility practices are exemplary for a software-methods paper. Impact will depend heavily on community adoption and the forthcoming blinded evaluation study.
Generated Jun 9, 2026
Paper 2 addresses a highly critical and timely issue: LLM hallucinations and integrity in scientific publishing. By introducing deterministic integrity gates for clinical manuscript preparation, it safeguards medical literature and ensures reproducibility. This provides a broader and more vital real-world impact across all scientific domains compared to the web navigation advancements presented in Paper 1.
Paper 1 addresses the broader and more foundational topic of Self-Explainability in complex adaptive systems, providing a systematic literature review, unified taxonomy, and research roadmap that can influence multiple fields (AI, robotics, distributed systems, etc.). Its breadth of impact and timeliness given the AI trust crisis give it higher potential. Paper 2, while practically useful, addresses a narrower engineering problem (LLM manuscript verification) with an incremental architectural contribution and limited generalizability beyond clinical manuscript preparation.
Paper 2 addresses a critical, highly timely issue: LLM hallucinations and data integrity in clinical research manuscripts. Its architecture for verifiable, deterministic checks has broad applications across biomedical informatics and scientific writing, potentially preventing widespread misinformation. While Paper 1 offers valuable algorithmic improvements for the Traveling Salesman Problem, Paper 2's impact extends across the broader scientific community by ensuring the reliability and safety of AI-assisted research outputs.
Paper 1 addresses fundamental theoretical challenges in AI alignment, exploring how large language models generalize ethical behavior and project specific personas. Its insights into 'emergent alignment' offer broad, foundational implications for training paradigms across the entire AI field. While Paper 2 presents a highly practical and rigorous framework for clinical manuscript verification, Paper 1's findings have a significantly wider potential impact on the core development, understanding, and safety mechanisms of frontier AI systems, making it more impactful broadly.
Paper 1 addresses a significant engineering optimization problem (IPMSM design) with a novel multi-agent framework combining RAG, uncertainty-aware FEA-AI hybrid optimization, and automated workflow—potentially impacting the broader fields of electrical machine design, multi-objective optimization, and AI-assisted engineering. Paper 2 addresses a narrower problem (LLM manuscript verification) with a practical but more incremental contribution focused on deterministic checking of LLM outputs. Paper 1's methodological innovations (uncertainty-driven surrogate/FEA switching, ANOVA+LLM failure analysis) have broader transferability across engineering domains, giving it higher potential impact.
Paper 1 offers a concrete, technically novel and timely architecture for auditable LLM-assisted clinical writing, with open-source implementation and empirical evaluation across multiple reporting-guideline pipelines plus defect-injection tests. Its deterministic “halt-on-failure” integrity-gate taxonomy is directly actionable and likely to see adoption in biomedical informatics, research ops, and regulated documentation, giving broad near-term real-world impact. Paper 2 is conceptually interesting and potentially important for AI governance, but is more normative/theoretical with limited methodological validation, making its scientific impact more uncertain and longer-term.
SIGA addresses a broader and more transformative problem—enabling general-purpose coding agents to operate complex scientific simulators with minimal adaptation. It demonstrates practical speedups (36x over human experts), generalizes across multiple simulators (GEOS, OpenFOAM, LAMMPS), and introduces a self-evolution mechanism. The concept of simulator-interface grounding adapters has wide applicability across computational sciences. Paper 1, while solving a real problem in LLM-assisted manuscript preparation, addresses a narrower application domain with incremental engineering contributions (deterministic verification gates) rather than a fundamentally new paradigm.
Paper 1 has higher impact potential due to a more novel, concrete, and rigorously evaluated architecture addressing an urgent real-world problem (auditability and error prevention in LLM-assisted clinical writing). It offers an open-source, deterministic “halt-on-failure” integrity-gate taxonomy with reproducible pipelines and strong ablation evidence (perfect detection on seeded defects vs an LLM reviewer). Its applications are immediate in biomedical publishing, compliance, and research integrity, with likely spillover to other regulated domains. Paper 2 is timely and broadly relevant, but appears more conceptual with less demonstrated methodological validation.
Paper 2 addresses a critical and highly timely issue—LLM hallucinations in clinical research—with a practical, open-source verification architecture. While Paper 1 provides valuable theoretical insights into active inference for a specialized audience, Paper 2 offers significantly broader interdisciplinary impact, immediate real-world utility in biomedical informatics, and directly tackles a major bottleneck in the safe adoption of generative AI in scientific writing.
Paper 1 targets a timely, high-stakes bottleneck—reliability of LLM-assisted clinical scientific writing—where failures have direct patient-care and scientific-integrity implications. Its “determinism-where-possible” integrity-gate taxonomy plus an open-source, auditable toolkit (43 skills) suggests strong real-world uptake, reproducibility, and cross-domain applicability to other LLM-mediated workflows (science, compliance, regulated documentation). The evaluation includes seeded-defect ablations and comparisons showing clear advantages over LLM self-review. Paper 2 is a solid algorithmic improvement in bidirectional search, but likely impacts a narrower community and has less immediate societal/industry pull than clinical AI governance tooling.