Giorgio Leonardi, Stefania Montani, Manuel Striani, Alessandro Canessa, Delfina Ferrandi
Objective: Conformance checking in healthcare seeks to assess whether patient care pathways adhere to clinical guidelines. However, its practical application often depends on the availability of formal, machine-interpretable representations of guidelines, such as Computer-Interpretable Guidelines (CIGs), which are seldom available in real-world clinical settings. Methods: This work introduces a modular framework based on the orchestration of Large Language Models (LLMs) to support medical conformance checking directly from unstructured clinical and guideline texts, without requiring predefined CIGs. The proposed architecture integrates multiple LLMs and supporting components to extract patient traces from clinical discharge letters, identify normative rules from textual clinical guidelines, translate these rules into executable scripts, and compute a Trace Conformance Indicator to quantify compliance within the event log. Results: The framework was implemented and evaluated in the stroke care domain at the neurological ward of Alessandria Hospital. Hundreds of patient traces were automatically extracted from hospital data and assessed against 50 rules derived from the reference guideline. The analysis showed that more than 86% of the available traces were conformant. Conclusion: The results demonstrate the feasibility of using orchestrated LLMs for practical healthcare conformance analysis. At the same time, the study provides evidence of a high level of adherence to stroke care guidelines at Alessandria Hospital.
This paper presents a modular framework that orchestrates multiple Large Language Models (LLMs) to perform medical conformance checking—assessing whether patient care pathways comply with clinical guidelines—without requiring formal Computer-Interpretable Guidelines (CIGs). The pipeline involves six stages: (1) trace extraction from discharge letters using Gemini 2.5 Flash, (2) rule extraction from textual guidelines using NotebookLM, (3) rule filtering against the available event log, (4) rule coding into executable Python scripts, (5) rule refinement/debugging, and (6) computation of a Trace Conformance Indicator (TCI). The central novelty is eliminating the knowledge engineering bottleneck of manually creating CIGs, which has long been a barrier to real-world conformance checking in healthcare.
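Stages (4) and (6) can be made concrete with a minimal Python sketch. It assumes that the TCI is the percentage of rules a trace satisfies; that definition is an assumption for illustration, not the paper's confirmed formula. The trace schema, the activity names, and both rules are hypothetical and are not taken from the paper or the guideline.

```python
from typing import Callable, Dict, List

# Hypothetical trace schema: a time-ordered list of events, each a dict
# with at least an "activity" key. The paper's real schema may differ.
Trace = List[Dict[str, str]]
Rule = Callable[[Trace], bool]


def rule_ct_before_thrombolysis(trace: Trace) -> bool:
    """Illustrative rule: if thrombolysis occurs, a CT scan must precede it."""
    activities = [event["activity"] for event in trace]
    if "thrombolysis" not in activities:
        return True  # rule does not apply, so it is not violated
    first_thrombolysis = activities.index("thrombolysis")
    return "ct_scan" in activities[:first_thrombolysis]


def rule_swallowing_screen_done(trace: Trace) -> bool:
    """Illustrative rule: a swallowing screen appears somewhere in the trace."""
    return any(event["activity"] == "swallowing_screen" for event in trace)


def trace_conformance_indicator(trace: Trace, rules: List[Rule]) -> float:
    """Percentage of rules satisfied by the trace (assumed TCI definition)."""
    if not rules:
        raise ValueError("at least one rule is required")
    satisfied = sum(1 for rule in rules if rule(trace))
    return 100.0 * satisfied / len(rules)


RULES: List[Rule] = [rule_ct_before_thrombolysis, rule_swallowing_screen_done]
example_trace: Trace = [
    {"activity": "ct_scan"},
    {"activity": "thrombolysis"},
]
example_tci = trace_conformance_indicator(example_trace, RULES)
print(f"TCI = {example_tci:.0f}%")
```

Writing each rule as a standalone predicate over a trace keeps the modular property the review describes: a rule can be debugged or replaced without touching the indicator computation.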
The idea of chaining LLMs for different subtasks in a pipeline is sensible and reflects the broader trend toward compound AI systems. The problem addressed—bridging unstructured clinical text and unstructured guidelines for automated compliance analysis—is genuinely important and underserved.
Several aspects of the methodological rigor raise concerns:
Validation of trace extraction is limited: only a 20% random subsample of the 463 extracted traces was validated by medical experts. While all checked traces were deemed correct, this leaves 80% unverified. No quantitative metrics (precision, recall, F1) are reported for trace extraction accuracy, nor are inter-annotator agreement measures provided.
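What a zero-error subsample does and does not establish can be estimated with a short calculation. The sketch below uses the exact one-sided binomial bound for zero observed failures. It assumes simple random sampling, a binary correct/incorrect verdict per trace, and a subsample of about 93 traces (20% of 463, rounded); the paper's exact count is not given here. It ignores the finite population, so the bound is likely slightly conservative.

```python
def zero_error_upper_bound(n_checked: int, confidence: float = 0.95) -> float:
    """Upper bound on the true error rate when 0 of n_checked items failed.

    Solves (1 - p) ** n_checked = 1 - confidence for p, i.e. the exact
    one-sided binomial (Clopper-Pearson) bound for zero observed failures.
    """
    if n_checked <= 0:
        raise ValueError("n_checked must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / n_checked)


# Assumed subsample size: 20% of the 463 extracted traces, rounded.
n_checked = round(0.20 * 463)
bound = zero_error_upper_bound(n_checked)
print(f"checked={n_checked}, 95% upper bound on error rate={bound:.1%}")
```

Under these assumptions, an error rate of roughly 3% among all extracted traces remains compatible with the clean subsample, which is why reporting per-stage accuracy metrics would strengthen the claim.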
Rule extraction evaluation is largely qualitative. The authors note that 161 rules were extracted, physicians checked them (especially the first 4 of 9 categories), and found them "semantically accurate" and "clinically relevant." Three rules were incomplete, and cerebral hemorrhage was undertreated in the extracted rule set. However, no systematic evaluation framework (e.g., coverage analysis against a gold-standard rule set, precision/recall of rule extraction) is provided. The human-in-the-loop correction required only one iteration, but the criteria for "satisfactory" are not formalized.
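The kind of evaluation that is missing could look like the following sketch, which scores an extracted rule set against a gold-standard set. It assumes that clinicians have already aligned extracted rules with gold rules and assigned shared identifiers, which is the expensive manual step; the identifiers below are placeholders, not rules from the paper.

```python
from typing import Set, Tuple


def precision_recall_f1(extracted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 over rule identifiers."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder identifiers: R3 and R4 are spurious extractions, R5 was missed.
extracted_rules = {"R1", "R2", "R3", "R4"}
gold_rules = {"R1", "R2", "R5"}
precision, recall, f1 = precision_recall_f1(extracted_rules, gold_rules)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Recall in particular would quantify coverage gaps such as the one noted for cerebral hemorrhage, which a purely qualitative check can only flag.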
Rule coding and refinement also lack formal evaluation. The paper does not report how many bugs were found and fixed in the refinement step, or what percentage of the Python scripts were initially correct. The decision to use Gemini 3 Pro-Preview for refinement because it "exceeds the capacity of the other LLM models" is not supported by evidence.
Conformance checking results report that >86% of traces were conformant across 50 rules, but the paper does not provide a ground truth or gold-standard conformance assessment by clinicians to validate the automated TCI scores. The two detailed case studies (TCI=62% and TCI=6%) are insightful and clinically interesting, but they serve as anecdotal illustrations rather than systematic validation.
LLM selection was driven by institutional access ("availability of Google tools within an educational agreement") rather than systematic benchmarking. The paper mentions that GPT-5 Thinking yielded only 28 rules, but it does not back this comparison with controlled experiments.
The practical value proposition is clear: if this approach works reliably, it could democratize conformance checking across hospitals that lack the resources for CIG development. The stroke care domain is clinically significant, and the approach is in principle domain-agnostic. The framework could extend to other medical specialties and other guideline-driven domains.
However, the impact is currently limited by the lack of rigorous validation. Healthcare applications demand high reliability, and the absence of systematic accuracy metrics may hinder adoption. The finding that some "non-conformances" actually represented clinically appropriate deviations (e.g., treating infection-caused hyperthermia with antibiotics rather than paracetamol) highlights an important limitation: rule-based conformance checking from guidelines cannot easily capture justified clinical deviations, which are pervasive in medicine.
The code availability (OSF repository) supports reproducibility, though the patient data cannot be shared for privacy reasons, limiting full replication.
The paper is timely on multiple fronts: LLM orchestration is an active research paradigm, conformance checking in healthcare remains practically challenging, and there is growing interest in applying LLMs to clinical workflows. The connection to the compound AI systems trend is well-articulated. The work addresses a real bottleneck—the CIG development effort—that has constrained practical deployment of conformance checking for decades.
Strengths:
1. Addresses a real gap: The CIG bottleneck is well-known, and automating the pipeline from raw text to conformance metrics is valuable.
2. End-to-end pipeline: Covering both event log extraction and rule extraction from unstructured text in a single framework is a meaningful integration.
3. Clinical grounding: The collaboration with Alessandria Hospital and use of real discharge letters (463 patients over 2022-2024) with ethics approval lends practical credibility.
4. Insightful case analysis: The investigation of specific non-conformance cases reveals nuanced clinical situations and demonstrates practical utility.
5. Modular design: The architecture allows component-level replacement and improvement.
Weaknesses:
1. Insufficient quantitative evaluation: No gold-standard comparisons, no precision/recall metrics for any pipeline stage, and no systematic error analysis.
2. No ablation study: The contribution of each pipeline stage is not isolated; it's unclear whether simpler approaches could achieve comparable results.
3. Single-domain evaluation: Only stroke care at one hospital is tested; generalizability is unknown.
4. No comparison with baselines: The framework is compared against neither traditional conformance checking methods (on manually created CIGs) nor single-LLM approaches.
5. Scalability and cost considerations are not discussed.
6. The paper does not address LLM non-determinism and its potential impact on reproducibility of results.
7. The TCI metric, while intuitive, is simplistic—it does not weight rules by clinical importance.
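The last point can be illustrated with a small sketch of an importance-weighted indicator. This is a hypothetical variant, not something the paper proposes: the rule names and weights are invented, and in practice clinicians would have to assign the weights.

```python
from typing import Dict


def weighted_tci(outcomes: Dict[str, bool], weights: Dict[str, float]) -> float:
    """Percentage of total rule weight satisfied by a trace.

    outcomes maps rule id -> whether the trace satisfied the rule;
    weights maps rule id -> clinical importance (hypothetical values).
    """
    total_weight = sum(weights[rule_id] for rule_id in outcomes)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    satisfied_weight = sum(weights[rule_id] for rule_id, ok in outcomes.items() if ok)
    return 100.0 * satisfied_weight / total_weight


# Invented example: one critical rule violated, two minor rules satisfied.
outcomes = {"critical_rule": False, "minor_rule_a": True, "minor_rule_b": True}
weights = {"critical_rule": 5.0, "minor_rule_a": 1.0, "minor_rule_b": 1.0}
uniform = {rule_id: 1.0 for rule_id in outcomes}

unweighted_score = weighted_tci(outcomes, uniform)
weighted_score = weighted_tci(outcomes, weights)
print(f"unweighted={unweighted_score:.1f}% weighted={weighted_score:.1f}%")
```

With uniform weights the function reduces to a plain satisfied-rule percentage, so the two scores diverge only when violated rules carry above-average importance, as in this invented case.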
This paper presents a promising proof-of-concept for LLM-orchestrated conformance checking that addresses a genuine practical need. However, it reads more as a feasibility demonstration than a rigorous scientific evaluation. The lack of quantitative validation metrics, baseline comparisons, and systematic error analysis significantly weakens the scientific contribution. The clinical insights from the case studies are valuable but cannot substitute for formal evaluation. The work would benefit substantially from a gold-standard annotation effort and comparative evaluation against traditional approaches.
Generated Jun 9, 2026
Paper 1 presents a practical, validated framework addressing a real and widespread problem in healthcare—conformance checking without requiring formal computer-interpretable guidelines. It was tested with real hospital data (hundreds of patient traces, 50 rules) and demonstrates clear clinical utility. Paper 2, while creative in concept, raises significant credibility concerns: it references nonexistent models (GPT-5.5-pro, DeepSeek-V4-pro), the mathematical contributions appear narrow (proving only N=n+1), and the 'autonomous conjecture generation' claim lacks rigorous validation. Paper 1's methodological rigor and immediate real-world healthcare applicability give it higher impact potential.
Paper 2 has higher estimated impact due to a more novel, generalizable integration of LLM reasoning with physically grounded world models and principled uncertainty handling (double-loop learning). Its methodological framing (graph-latent conservation, KL-bounded adaptation, quantified effect sizes) is stronger and targets a broad, timely problem—resilient, policy-constrained decision-making in complex cyber-physical systems—applicable beyond supply chains. Paper 1 is valuable and practical in healthcare, but relies on LLM extraction/translation pipelines whose novelty and rigor may be more incremental and domain-specific, with heavier dependence on prompt/model behavior.
Paper 2 introduces a foundational architectural improvement for LLM agents (long-term persistent memory), addressing a critical bottleneck in core AI research. This methodological advancement has broad applicability across any domain utilizing AI agents. In contrast, Paper 1, while highly valuable and practical, is a domain-specific applied study focused on medical informatics and stroke care. Thus, Paper 2 has a higher potential for widespread scientific impact and adoption.
Paper 1 has a significantly higher potential scientific and societal impact due to its application in healthcare. Automating conformance checking directly from unstructured clinical texts addresses a major bottleneck in medical informatics, with direct implications for patient care quality and hospital efficiency. While Paper 2 presents an innovative methodological adaptation for sports analytics, the life-saving potential, broader applicability to various medical domains, and timely integration of LLMs in critical real-world clinical settings give Paper 1 a substantially higher overall scientific impact.
Paper 2 has higher estimated scientific impact due to clear, near-term real-world applicability in healthcare operations: enabling conformance checking without Computer-Interpretable Guidelines addresses a major deployment bottleneck and can generalize to many clinical domains. It demonstrates an end-to-end, modular system evaluated on substantial hospital data (hundreds of traces, 50 rules), indicating methodological maturity and translational potential. Paper 1 is novel and valuable for AI-for-math evaluation, but its immediate applications are narrower and its impact is more specialized, whereas Paper 2 can influence clinical informatics, process mining, guideline engineering, and applied LLM systems.
Paper 2 presents a novel, practical framework applying LLMs to a concrete healthcare problem: conformance checking without requiring formal computer-interpretable guidelines. This addresses a real bottleneck in clinical quality assessment with immediate real-world applicability, validated on actual hospital data. While Paper 1 contributes a useful benchmark for GUI agents, benchmarks have incremental impact unless widely adopted. Paper 2's cross-disciplinary contribution (NLP + process mining + healthcare) and demonstrated feasibility in a real clinical setting give it broader impact potential across multiple fields.
Paper 2 is likely higher impact: it identifies a broadly relevant failure mode in pretrained biomedical encoders (spurious high similarity for unrelated cross-domain pairs) that can directly corrupt causal reasoning in user-level models, then proposes generalizable contrastive and KG-mined hard-negative fixes with sizable quantitative gains. It also contributes practical systems insights (CPU/AMX/OpenVINO latency) plus released benchmarks and tooling, enabling adoption and follow-on work across biomed NLP, retrieval, personalization, and causal discovery. Paper 1 is applied and useful but more domain-specific, with heavier dependence on LLM orchestration and less general methodological novelty.
Paper 1 introduces a novel, generalizable framework (SkeMex) for medical agent reasoning with a self-evolving skill memory system that addresses fundamental limitations in how AI agents accumulate and reuse clinical experience. Its contributions—structured skill distillation, value-aware retrieval, and a closed-loop governance lifecycle—are broadly applicable across clinical tasks and model backbones. Paper 2 presents a useful but more narrowly scoped LLM orchestration framework for conformance checking in stroke care, validated at a single hospital. Paper 1 has greater novelty, broader applicability, and stronger methodological contributions with higher potential to influence multiple research directions.
Paper 2 has higher estimated impact due to immediate real-world applicability in healthcare: it operationalizes conformance checking without needing scarce computer-interpretable guidelines, and demonstrates deployment on hundreds of hospital traces and 50 guideline-derived rules. This combination of timeliness (LLM-based clinical workflow analysis), demonstrated utility, and domain significance (stroke care) supports broader adoption across hospitals and other guideline-driven domains. Paper 1 is novel conceptually, but appears more proof-of-concept and may face harder validation and integration hurdles, reducing near-term impact.
Paper 2 introduces a novel, generalizable framework for healthcare conformance checking that eliminates the need for Computer-Interpretable Guidelines—a significant practical barrier. It has broader applicability across medical domains, stronger methodological contribution (modular LLM orchestration architecture), and addresses a well-recognized gap in healthcare process mining. Paper 1, while rigorous in its ablation study, is narrower in scope (drug-asset valuation), heavily tied to a proprietary commercial product (Noah AI), and its core finding—that proprietary data matters—is somewhat intuitive. Paper 2's potential to impact clinical quality assessment across many healthcare settings gives it greater breadth of impact.