LLM-Orchestrated Conformance Checking in Stroke Care Without Computer-Interpretable Guidelines

Giorgio Leonardi, Stefania Montani, Manuel Striani, Alessandro Canessa, Delfina Ferrandi

Jun 8, 2026 · arXiv:2606.09489v1

cs.AI

#2364 of 3489 · Artificial Intelligence

Tournament Score: 1353±42

Win Rate: 43%

Rating: 4.5 / 10

Significance 6 · Rigor 3.5 · Novelty 5.5 · Clarity 6

Abstract

Objective: Conformance checking in healthcare seeks to assess whether patient care pathways adhere to clinical guidelines. However, its practical application often depends on the availability of formal, machine-interpretable representations of guidelines, such as Computer-Interpretable Guidelines (CIGs), which are seldom available in real-world clinical settings. Methods: This work introduces a modular framework based on the orchestration of Large Language Models (LLMs) to support medical conformance checking directly from unstructured clinical and guideline texts, without requiring predefined CIGs. The proposed architecture integrates multiple LLMs and supporting components to extract patient traces from clinical discharge letters, identify normative rules from textual clinical guidelines, translate these rules into executable scripts, and compute a Trace Conformance Indicator to quantify compliance within the event log. Results: The framework was implemented and evaluated in the stroke care domain at the neurological ward of Alessandria Hospital. Hundreds of patient traces were automatically extracted from hospital data and assessed against 50 rules derived from the reference guideline. The analysis showed that more than 86% of the available traces were conformant. Conclusion: The results demonstrate the feasibility of using orchestrated LLMs for practical healthcare conformance analysis. At the same time, the study provides evidence of a high level of adherence to stroke care guidelines at Alessandria Hospital.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper presents a modular framework that orchestrates multiple Large Language Models (LLMs) to perform medical conformance checking—assessing whether patient care pathways comply with clinical guidelines—without requiring formal Computer-Interpretable Guidelines (CIGs). The pipeline involves six stages: (1) trace extraction from discharge letters using Gemini 2.5 Flash, (2) rule extraction from textual guidelines using NotebookLM, (3) rule filtering against the available event log, (4) rule coding into executable Python scripts, (5) rule refinement/debugging, and (6) computation of a Trace Conformance Indicator (TCI). The central novelty is eliminating the knowledge engineering bottleneck of manually creating CIGs, which has long been a barrier to real-world conformance checking in healthcare.
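The review does not reproduce the paper's TCI formula. As a minimal sketch, assuming the TCI is the percentage of applicable rules that a trace satisfies, the final stage of the pipeline could look like the following. The `Rule` type, the event names, and the two example rules are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumption: a trace is an ordered list of event names.
Trace = list[str]


@dataclass(frozen=True)
class Rule:
    name: str
    # Returns True (satisfied), False (violated), or None (not applicable).
    check: Callable[[Trace], Optional[bool]]


def trace_conformance_indicator(trace: Trace, rules: list[Rule]) -> Optional[float]:
    """Percentage of applicable rules satisfied by the trace (assumed definition)."""
    outcomes = [rule.check(trace) for rule in rules]
    applicable = [o for o in outcomes if o is not None]
    if not applicable:
        return None  # no rule applies, so the indicator is undefined
    return 100.0 * sum(applicable) / len(applicable)


# Hypothetical rules, for illustration only.
def imaging_before_thrombolysis(trace: Trace) -> Optional[bool]:
    if "thrombolysis" not in trace:
        return None  # rule only applies to patients who received thrombolysis
    return "ct_scan" in trace and trace.index("ct_scan") < trace.index("thrombolysis")


def swallowing_test_done(trace: Trace) -> Optional[bool]:
    return "swallowing_test" in trace


RULES = [
    Rule("imaging_before_thrombolysis", imaging_before_thrombolysis),
    Rule("swallowing_test_done", swallowing_test_done),
]

example_trace = ["admission", "ct_scan", "thrombolysis", "discharge"]
tci = trace_conformance_indicator(example_trace, RULES)
print(f"TCI = {tci}")
```

Returning `None` for non-applicable rules keeps them out of the denominator, which is one plausible reading of the rule-filtering stage; the paper may handle applicability differently.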

The idea of chaining LLMs for different subtasks in a pipeline is sensible and reflects the broader trend toward compound AI systems. The problem addressed—bridging unstructured clinical text and unstructured guidelines for automated compliance analysis—is genuinely important and underserved.

Methodological Rigor

The methodological rigor presents several concerns:

Validation of trace extraction is limited: only a 20% random subsample of the 463 extracted traces was validated by medical experts. While all checked traces were deemed correct, this leaves 80% unverified. No quantitative metrics (precision, recall, F1) are reported for trace extraction accuracy, nor are inter-annotator agreement measures provided.
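To put the subsample result in perspective: a sample with zero observed errors still leaves a nonzero upper bound on the true error rate. The sketch below computes the exact one-sided 95% bound for that case. It assumes the 20% subsample amounts to about 93 of the 463 traces (the exact count is not given here), that sampling was uniform at random, and that each trace is judged simply correct or incorrect. It ignores the finite-population correction.

```python
def zero_error_upper_bound(n_checked: int, confidence: float = 0.95) -> float:
    """Upper bound on the error rate p after 0 errors in n_checked samples.

    Solves (1 - p) ** n_checked = 1 - confidence for p.
    """
    if n_checked <= 0:
        raise ValueError("n_checked must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_checked)


N_TRACES = 463
n_checked = round(0.20 * N_TRACES)  # assumed size of the validated subsample
bound = zero_error_upper_bound(n_checked)
print(f"checked={n_checked}, 95% upper bound on error rate={bound:.2%}")
```

Under these assumptions the bound is a little over 3%, close to the familiar 3/n rule of thumb, so the unverified 80% could plausibly still contain a handful of incorrect traces.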

Rule extraction evaluation is largely qualitative. The authors note that 161 rules were extracted, physicians checked them (especially the first 4 of 9 categories), and found them "semantically accurate" and "clinically relevant." Three rules were incomplete, and cerebral hemorrhage was undertreated. However, no systematic evaluation framework (e.g., coverage analysis against a gold-standard rule set, precision/recall of rule extraction) is provided. The human-in-the-loop correction required only one iteration, but the criteria for "satisfactory" are not formalized.
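A coverage analysis of the kind mentioned above could be as simple as comparing extracted rules with a clinician-curated gold set. The following sketch assumes the rules have already been normalized to comparable identifiers, which is itself a nontrivial matching step. The identifiers and the gold set are invented for illustration.

```python
def extraction_scores(extracted: set[str], gold: set[str]) -> dict[str, float]:
    """Precision, recall, and F1 of extracted rule identifiers against a gold set."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical identifiers: two rules match, one is spurious, two are missed.
gold_rules = {"r1", "r2", "r3", "r4"}
extracted_rules = {"r1", "r2", "r5"}
scores = extraction_scores(extracted_rules, gold_rules)
print(scores)
```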

Rule coding and refinement also lack formal evaluation. There is no reporting of how many bugs were found/fixed in the refinement step, or what percentage of Python scripts were initially correct. The decision to use Gemini 3 Pro-Preview for refinement because it "exceeds the capacity of the other LLM models" is not supported with evidence.

Conformance checking results report that >86% of traces were conformant across 50 rules, but the paper does not provide a ground truth or gold-standard conformance assessment by clinicians to validate the automated TCI scores. The two detailed case studies (TCI=62% and TCI=6%) are insightful and clinically interesting, but they serve as anecdotal illustrations rather than systematic validation.

LLM selection was driven by institutional access ("availability of Google tools within an educational agreement") rather than systematic benchmarking. The comparison with GPT-5 Thinking yielding only 28 rules is mentioned but not elaborated with controlled experiments.

Potential Impact

The practical value proposition is clear: if this approach works reliably, it could democratize conformance checking across hospitals that lack the resources for CIG development. The stroke care domain is clinically significant, and the approach is in principle domain-agnostic. The framework could extend to other medical specialties and other guideline-driven domains.

However, the impact is currently limited by the lack of rigorous validation. Healthcare applications demand high reliability, and the absence of systematic accuracy metrics may hinder adoption. The finding that some "non-conformances" actually represented clinically appropriate deviations (e.g., treating infection-caused hyperthermia with antibiotics rather than paracetamol) highlights an important limitation: rule-based conformance checking from guidelines cannot easily capture justified clinical deviations, which are pervasive in medicine.
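One way to make justified deviations visible, rather than counting them as violations, is to give each rule more than two outcomes. The sketch below is an illustration only, not clinical guidance and not the paper's implementation: the event labels are hypothetical, and the exception clause would have to be defined by clinicians.

```python
from enum import Enum


class Outcome(Enum):
    CONFORMANT = "conformant"
    JUSTIFIED_DEVIATION = "justified_deviation"
    VIOLATION = "violation"
    NOT_APPLICABLE = "not_applicable"


def check_hyperthermia_rule(events: set[str]) -> Outcome:
    """Hypothetical rule: hyperthermia should be treated with paracetamol,
    unless a documented infection was treated with antibiotics instead."""
    if "hyperthermia" not in events:
        return Outcome.NOT_APPLICABLE
    if "paracetamol" in events:
        return Outcome.CONFORMANT
    if "infection_diagnosed" in events and "antibiotics" in events:
        return Outcome.JUSTIFIED_DEVIATION  # clinician-approved exception (assumed)
    return Outcome.VIOLATION


result = check_hyperthermia_rule({"hyperthermia", "infection_diagnosed", "antibiotics"})
print(result.value)
```

Reporting justified deviations as a separate category lets a reviewer audit them without inflating or deflating the conformance score.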

The code availability (OSF repository) supports reproducibility, though the patient data cannot be shared for privacy reasons, limiting full replication.

Timeliness & Relevance

The paper is timely on multiple fronts: LLM orchestration is an active research paradigm, conformance checking in healthcare remains practically challenging, and there is growing interest in applying LLMs to clinical workflows. The connection to the compound AI systems trend is well-articulated. The work addresses a real bottleneck—the CIG development effort—that has constrained practical deployment of conformance checking for decades.

Strengths

1. Addresses a real gap: The CIG bottleneck is well-known, and automating the pipeline from raw text to conformance metrics is valuable.

2. End-to-end pipeline: Covering both event log extraction and rule extraction from unstructured text in a single framework is a meaningful integration.

3. Clinical grounding: The collaboration with Alessandria Hospital and use of real discharge letters (463 patients over 2022-2024) with ethics approval lends practical credibility.

4. Insightful case analysis: The investigation of specific non-conformance cases reveals nuanced clinical situations and demonstrates practical utility.

5. Modular design: The architecture allows component-level replacement and improvement.

Limitations

1. Insufficient quantitative evaluation: No gold-standard comparisons, no precision/recall metrics for any pipeline stage, and no systematic error analysis.

2. No ablation study: The contribution of each pipeline stage is not isolated; it's unclear whether simpler approaches could achieve comparable results.

3. Single-domain evaluation: Only stroke care at one hospital is tested; generalizability is unknown.

4. No comparison with baselines: Neither traditional conformance checking methods (on manually created CIGs) nor single-LLM approaches are compared against.

5. Scalability and cost considerations are not discussed.

6. The paper does not address LLM non-determinism and its potential impact on reproducibility of results.

7. The TCI metric, while intuitive, is simplistic—it does not weight rules by clinical importance.
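The last limitation can be illustrated with a weighted variant of the indicator. This is a sketch of one possible extension, not something the paper proposes: the rule names and weights are hypothetical, and the unweighted baseline assumes the TCI is the plain percentage of satisfied rules.

```python
def weighted_tci(outcomes: dict[str, bool], weights: dict[str, float]) -> float:
    """Percentage of total rule weight that the trace satisfies."""
    total = sum(weights[rule] for rule in outcomes)
    if total <= 0:
        raise ValueError("total weight must be positive")
    satisfied = sum(weights[rule] for rule, ok in outcomes.items() if ok)
    return 100.0 * satisfied / total


# Hypothetical trace: one clinically critical rule violated, two minor rules met.
outcomes = {"critical_rule": False, "minor_rule_a": True, "minor_rule_b": True}
clinical_weights = {"critical_rule": 3.0, "minor_rule_a": 1.0, "minor_rule_b": 1.0}
uniform_weights = {rule: 1.0 for rule in outcomes}

weighted = weighted_tci(outcomes, clinical_weights)
unweighted = weighted_tci(outcomes, uniform_weights)
print(f"weighted={weighted:.1f}, unweighted={unweighted:.1f}")
```

With uniform weights the function reduces to the unweighted percentage, so the two scores can be reported side by side; here the weighted score is lower because the violated rule carries more weight.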

Overall Assessment

This paper presents a promising proof-of-concept for LLM-orchestrated conformance checking that addresses a genuine practical need. However, it reads more as a feasibility demonstration than a rigorous scientific evaluation. The lack of quantitative validation metrics, baseline comparisons, and systematic error analysis significantly weakens the scientific contribution. The clinical insights from the case studies are valuable but cannot substitute for formal evaluation. The work would benefit substantially from a gold-standard annotation effort and comparative evaluation against traditional approaches.

Rating: 4.5 / 10

Significance 6 · Rigor 3.5 · Novelty 5.5 · Clarity 6

Generated Jun 9, 2026

Comparison History (23)

Won vs. Moonshine: An Autonomous Mathematical Research Agent Centered on Conjecture Generation

Paper 1 presents a practical, validated framework addressing a real and widespread problem in healthcare—conformance checking without requiring formal computer-interpretable guidelines. It was tested with real hospital data (hundreds of patient traces, 50 rules) and demonstrates clear clinical utility. Paper 2, while creative in concept, raises significant credibility concerns: it references nonexistent models (GPT-5.5-pro, DeepSeek-V4-pro), the mathematical contributions appear narrow (proving only N=n+1), and the 'autonomous conjecture generation' claim lacks rigorous validation. Paper 1's methodological rigor and immediate real-world healthcare applicability give it higher impact potential.

claude-opus-4-6 · Jun 10, 2026

Lost vs. ReflectiChain: Epistemic Grounding in LLM-Driven World Models for Supply Chain Resilience

Paper 2 has higher estimated impact due to a more novel, generalizable integration of LLM reasoning with physically grounded world models and principled uncertainty handling (double-loop learning). Its methodological framing (graph-latent conservation, KL-bounded adaptation, quantified effect sizes) is stronger and targets a broad, timely problem—resilient, policy-constrained decision-making in complex cyber-physical systems—applicable beyond supply chains. Paper 1 is valuable and practical in healthcare, but relies on LLM extraction/translation pipelines whose novelty and rigor may be more incremental and domain-specific, with heavier dependence on prompt/model behavior.

gpt-5.2 · Jun 10, 2026

Lost vs. Infini Memory: Maintainable Topic Documents for Long-Term LLM Agent Memory

Paper 2 introduces a foundational architectural improvement for LLM agents (long-term persistent memory), addressing a critical bottleneck in core AI research. This methodological advancement has broad applicability across any domain utilizing AI agents. In contrast, Paper 1, while highly valuable and practical, is a domain-specific applied study focused on medical informatics and stroke care. Thus, Paper 2 has a higher potential for widespread scientific impact and adoption.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football

Paper 1 has a significantly higher potential scientific and societal impact due to its application in healthcare. Automating conformance checking directly from unstructured clinical texts addresses a major bottleneck in medical informatics, with direct implications for patient care quality and hospital efficiency. While Paper 2 presents an innovative methodological adaptation for sports analytics, the life-saving potential, broader applicability to various medical domains, and timely integration of LLMs in critical real-world clinical settings give Paper 1 a substantially higher overall scientific impact.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. Evaluating Research-Level Math Proofs via Strict Step-Level Verification

Paper 2 has higher estimated scientific impact due to clear, near-term real-world applicability in healthcare operations: enabling conformance checking without Computer-Interpretable Guidelines addresses a major deployment bottleneck and can generalize to many clinical domains. It demonstrates an end-to-end, modular system evaluated on substantial hospital data (hundreds of traces, 50 rules), indicating methodological maturity and translational potential. Paper 1 is novel and valuable for AI-for-math evaluation, but its immediate applications are narrower and its impact is more specialized, whereas Paper 2 can influence clinical informatics, process mining, guideline engineering, and applied LLM systems.

gpt-5.2 · Jun 10, 2026

Won vs. Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Paper 2 presents a novel, practical framework applying LLMs to a concrete healthcare problem—conformance checking without requiring formal computer-interpretable guidelines. This addresses a real bottleneck in clinical quality assessment with immediate real-world applicability, validated on actual hospital data. While Paper 1 contributes a useful benchmark for GUI agents, benchmarks have incremental impact unless widely adopted. Paper 2's cross-disciplinary contribution (NLP + process mining + healthcare) and demonstrated feasibility in a real clinical setting gives it broader impact potential across multiple fields.

claude-opus-4-6 · Jun 10, 2026

Lost vs. Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

Paper 2 is likely higher impact: it identifies a broadly relevant failure mode in pretrained biomedical encoders (spurious high similarity for unrelated cross-domain pairs) that can directly corrupt causal reasoning in user-level models, then proposes generalizable contrastive and KG-mined hard-negative fixes with sizable quantitative gains. It also contributes practical systems insights (CPU/AMX/OpenVINO latency) plus released benchmarks and tooling, enabling adoption and follow-on work across biomed NLP, retrieval, personalization, and causal discovery. Paper 1 is applied and useful but more domain-specific, with heavier dependence on LLM orchestration and less general methodological novelty.

gpt-5.2 · Jun 9, 2026

Lost vs. Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory

Paper 1 introduces a novel, generalizable framework (SkeMex) for medical agent reasoning with a self-evolving skill memory system that addresses fundamental limitations in how AI agents accumulate and reuse clinical experience. Its contributions—structured skill distillation, value-aware retrieval, and a closed-loop governance lifecycle—are broadly applicable across clinical tasks and model backbones. Paper 2 presents a useful but more narrowly scoped LLM orchestration framework for conformance checking in stroke care, validated at a single hospital. Paper 1 has greater novelty, broader applicability, and stronger methodological contributions with higher potential to influence multiple research directions.

claude-opus-4-6 · Jun 9, 2026

Won vs. Rashomon Memory: Towards Argumentation-Driven Retrieval for Multi-Perspective Agent Memory

Paper 2 has higher estimated impact due to immediate real-world applicability in healthcare: it operationalizes conformance checking without needing scarce computer-interpretable guidelines, and demonstrates deployment on hundreds of hospital traces and 50 guideline-derived rules. This combination of timeliness (LLM-based clinical workflow analysis), demonstrated utility, and domain significance (stroke care) supports broader adoption across hospitals and other guideline-driven domains. Paper 1 is novel conceptually, but appears more proof-of-concept and may face harder validation and integration hurdles, reducing near-term impact.

gpt-5.2 · Jun 9, 2026

Won vs. AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation

Paper 2 introduces a novel, generalizable framework for healthcare conformance checking that eliminates the need for Computer-Interpretable Guidelines—a significant practical barrier. It has broader applicability across medical domains, stronger methodological contribution (modular LLM orchestration architecture), and addresses a well-recognized gap in healthcare process mining. Paper 1, while rigorous in its ablation study, is narrower in scope (drug-asset valuation), heavily tied to a proprietary commercial product (Noah AI), and its core finding—that proprietary data matters—is somewhat intuitive. Paper 2's potential to impact clinical quality assessment across many healthcare settings gives it greater breadth of impact.

claude-opus-4-6 · Jun 9, 2026
