Participatory provenance as representational auditing for AI-mediated public consultation

Sachit Mahajan

Apr 22, 2026

arXiv:2604.20711v1

cs.AI (primary) · cs.HC

#45 of 2292 · Artificial Intelligence

Tournament Score: 1571 ± 31 (scale 1050 to 1800)

Win Rate: 67%

Rating: 7.3 / 10

Significance 8 · Rigor 6.8 · Novelty 7.8 · Clarity 8

Abstract

Artificial intelligence is increasingly deployed to synthesize large-scale public input in policy consultations and participatory processes. Yet no formal framework exists for auditing whether these summaries faithfully represent the source population, an accountability gap that existing approaches to AI explainability, grounding and hallucination detection do not address because they focus on output quality rather than input fidelity. Here, participatory provenance is introduced: a measurement framework grounded in optimal transport theory, causal inference and semantic analysis that tracks how individual public submissions are transformed, filtered or lost through AI-mediated summarization. Applied to Canada's 2025-2026 national AI Strategy consultation ($n = 5,253$ respondents across two independent policy topics), the framework reveals that both official government summaries underperform a random-participant baseline ($-9.1\%$ and $-8.0\%$ coverage degradation), with $16.9\%$ and $15.3\%$ of participants effectively excluded. Exclusion concentrates in clusters expressing dissent, scepticism and critique of AI ($33$-$88\%$ exclusion rates). Brevity, semantic isolation and rhetorical register independently predict representational outcome. An accompanying open-source interactive tool, the Co-creation Provenance Lab, enables policymakers to audit and iteratively improve summaries, establishing genuine human-in-the-loop oversight at scale.

AI Impact Assessments

(3 models)

Scientific Impact Assessment: Participatory Provenance as Representational Auditing for AI-Mediated Public Consultation

1. Core Contribution

This paper introduces "participatory provenance," a formal measurement framework for auditing whether AI-generated summaries of public consultation data faithfully represent the distribution of input voices. The core insight is that existing responsible AI frameworks (XAI, hallucination detection, grounding) are output-oriented, whereas the democratic concern in public consultation is input fidelity — whether the transformation from citizen submissions to official summary preserves the diversity and dissent present in the population. The framework combines four interconnected measurements: individual coverage scores (cosine similarity between participant embeddings and summary sentences), Wasserstein-2 distributional distance, doubly-robust causal estimates of exclusion predictors, and bidirectional concept fidelity analysis. Applied to Canada's 2025-2026 AI Strategy consultation (n=5,253), the framework reveals that official summaries underperform random baselines, with exclusion concentrated in dissenting clusters (33-88% exclusion rates).

The conceptual reframing — from output quality to input fidelity — is genuinely novel and fills an identifiable gap. The "manufactured consensus" framing is compelling: a summary can pass every output-oriented quality check while systematically silencing dissent.
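To make the coverage measurement concrete, here is a minimal Python sketch of the individual coverage score (maximum cosine similarity to any summary sentence) and of a random-participant baseline. It is an illustration under stated assumptions, not the paper's implementation: the toy 2-D vectors stand in for real embeddings, and the function names, the `threshold` cutoff, the `k` and `trials` parameters, and the relative-gap formula are all hypothetical choices.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def coverage_scores(participants, summary_sentences):
    """Per-participant coverage: max cosine similarity to any summary sentence."""
    return [max(cosine(p, s) for s in summary_sentences) for p in participants]

def excluded_share(scores, threshold):
    """Fraction of participants whose coverage is below a hypothetical cutoff."""
    return sum(1 for s in scores if s < threshold) / len(scores)

def random_participant_baseline(participants, k, trials=200, seed=0):
    """Mean coverage when k randomly drawn participants stand in for the summary.

    Assumption: the drawn participants are not removed from the scored population.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        pseudo_summary = rng.sample(participants, k)
        scores = coverage_scores(participants, pseudo_summary)
        means.append(sum(scores) / len(scores))
    return sum(means) / len(means)

# Toy 2-D "embeddings": two opposing themes, two participants each.
participants = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
summary = [[1.0, 0.0]]  # the summary only echoes the first theme

scores = coverage_scores(participants, summary)
summary_mean = sum(scores) / len(scores)
baseline_mean = random_participant_baseline(participants, k=2)
# A negative gap means the summary covers less than random participants do.
relative_gap = (summary_mean - baseline_mean) / baseline_mean
```

In this toy case the summary never mentions the second theme, so half of the participants score zero, while a random draw of two participants often picks up both themes. This is the pattern the review describes as underperforming the random baseline.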

2. Methodological Rigor

The methodological approach is multi-layered and generally sound, though several aspects warrant scrutiny.

Strengths: The paper employs a rigorous multi-method design. The random-participant baseline is a clever diagnostic that contextualizes summary performance without requiring ground truth. The use of doubly-robust AIPW estimators for causal attribution is appropriate for observational data, and the authors correctly caveat their causal claims by noting unmeasured confounding. Cross-topic replication (n=2,392 overlapping participants) strengthens internal validity. Multi-model robustness checks across three embedding models, and parameter sensitivity sweeps across PCA dimensions, threshold multipliers, and cluster counts, demonstrate admirable thoroughness.
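For readers unfamiliar with the doubly-robust estimator mentioned above, the following Python sketch shows the standard augmented inverse-probability-weighted (AIPW) formula for an average treatment effect. The nuisance estimates (`propensity`, `mu1`, `mu0`) are assumed to come from models fitted elsewhere, and the toy data and the reading of "treatment" as a brief submission and "outcome" as exclusion are hypothetical illustrations, not the paper's specification.

```python
def aipw_ate(treated, outcome, propensity, mu1, mu0):
    """Standard AIPW estimate of the average treatment effect.

    treated[i]    : 1 if unit i has the attribute of interest (e.g. a brief submission), else 0
    outcome[i]    : observed outcome (e.g. 1 if the participant was excluded)
    propensity[i] : estimated P(treated = 1 | covariates), strictly between 0 and 1
    mu1[i], mu0[i]: predicted outcome under treatment and under control
    """
    total = 0.0
    for t, y, e, m1, m0 in zip(treated, outcome, propensity, mu1, mu0):
        # Outcome-model contrast plus inverse-probability-weighted residual corrections.
        total += (m1 - m0) + t * (y - m1) / e - (1 - t) * (y - m0) / (1.0 - e)
    return total / len(treated)

# Toy data: treatment perfectly determines the outcome, so the true effect is 1.
treated = [1, 0, 1, 0]
outcome = [1, 0, 1, 0]
propensity = [0.5, 0.5, 0.5, 0.5]

# Correct outcome model.
ate_good_model = aipw_ate(treated, outcome, propensity, [1.0] * 4, [0.0] * 4)
# Deliberately wrong outcome model, but correct propensities.
ate_bad_model = aipw_ate(treated, outcome, propensity, [0.0] * 4, [0.0] * 4)
```

Both calls recover the same effect, which illustrates the "doubly robust" property: the estimate stays consistent if either the outcome model or the propensity model is correct. It does not remove unmeasured confounding, which is the caveat the review credits the authors with stating.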

Concerns: The coverage score metric (max cosine similarity to any summary sentence) is a proxy for representational inclusion, not a direct measure. The authors acknowledge this, but the conflation of "low semantic similarity" with "exclusion" remains a fundamental limitation. A summary that abstracts "AI will destroy teaching" into "concerns about workforce disruption" may score low on cosine similarity while arguably representing the concern. The concept fidelity analysis partially addresses this, but the 13-18% forward recall figures may overstate the problem given legitimate compression. The reliance on GPT-4o-mini for multiple classification tasks (topic relevance, concept transformation, epistemic grounding, stance alignment) introduces model-dependent judgments throughout the pipeline, though inter-run reliability metrics (Fleiss' κ) are reported transparently, and unreliable analyses (Trust stance alignment, κ=0.554) are appropriately excluded.
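The inter-run reliability statistic cited above, Fleiss' κ, can be computed from a table of label counts. The sketch below implements the textbook formula in plain Python. Treating each repeated model run as one "rater" is an assumption made for illustration, and the count tables are invented toy data.

```python
import math

def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of raters assigning item i to category j.

    Every item must have the same number of ratings (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    n_cats = len(counts[0])
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        return math.nan  # undefined when only one category is ever used
    return (p_bar - p_exp) / (1.0 - p_exp)

# Three runs label two items; rows are items, columns are categories.
perfect = fleiss_kappa([[3, 0], [0, 3]])   # all runs agree on every item
split = fleiss_kappa([[2, 1], [1, 2]])     # runs disagree on every item
```

Values near 1 indicate that repeated runs assign the same labels, while values at or below 0 indicate agreement no better than chance.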

The clustering approach (k-Means on PCA-reduced embeddings) imposes spherical cluster assumptions on what is likely a complex semantic manifold. The stability-first override for Trust (from k=15 to k=7) is methodologically defensible but introduces analyst degrees of freedom. NPMI coherence scores are notably poor (-0.807 and -0.619), which the authors acknowledge but somewhat dismiss.
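To help interpret the NPMI coherence scores quoted above, here is a small Python sketch of normalized pointwise mutual information over document-level co-occurrence, averaged across pairs of a cluster's top terms. The tokenized toy documents, the choice of document-level co-occurrence, and the function names are assumptions for illustration; the paper's exact coherence setup may differ.

```python
import math
from itertools import combinations

def npmi(word_a, word_b, documents):
    """NPMI of two words, with probabilities estimated as document frequencies.

    documents is a list of sets of tokens. Returns a value in [-1, 1].
    """
    n = len(documents)
    p_a = sum(1 for d in documents if word_a in d) / n
    p_b = sum(1 for d in documents if word_b in d) / n
    p_ab = sum(1 for d in documents if word_a in d and word_b in d) / n
    if p_ab == 0.0:
        return -1.0      # the words never co-occur
    if p_ab == 1.0:
        return math.nan  # undefined: both words appear in every document
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)

def topic_coherence(top_terms, documents):
    """Mean NPMI over all unordered pairs of a cluster's top terms."""
    pairs = list(combinations(top_terms, 2))
    return sum(npmi(a, b, documents) for a, b in pairs) / len(pairs)

docs = [{"a", "b"}, {"a", "b"}, {"c"}, {"c"}]
always_together = npmi("a", "b", docs)               # co-occur whenever either appears
never_together = npmi("a", "c", docs)                # never share a document
coherence = topic_coherence(["a", "b", "c"], docs)   # mean of 1, -1, -1
```

Because NPMI is bounded below by -1, averages such as -0.807 and -0.619 would mean, under this definition, that most top-term pairs rarely or never co-occur.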

3. Potential Impact

The practical impact potential is substantial. Governments worldwide are increasingly using AI for citizen engagement, and this paper provides the first formal toolkit for auditing representational fidelity. The open-source Co-creation Provenance Lab could see adoption by civil society organizations, ombudsmen, and oversight bodies. The framework is directly relevant to ongoing regulatory developments (EU AI Act, Canada's AIDA), neither of which currently requires representational auditing of consultation processes.

The finding that dissenting voices are systematically excluded (up to 88% exclusion rates) is politically consequential and could influence how governments deploy AI in democratic processes. The concept of "manufactured consensus" — where AI summaries appear participatory while filtering dissent — provides a powerful frame for policy debate.

Beyond government consultations, the framework could extend to corporate stakeholder engagement, treaty negotiations, environmental impact assessments, and any context where AI mediates between populations and decision-makers.

4. Timeliness & Relevance

The paper is exceptionally timely. AI-mediated public consultation is rapidly expanding globally, and the gap between deployment and accountability infrastructure is widening. The OECD's "deliberative wave" report, the EU AI Act, and numerous national AI strategies all emphasize citizen participation — but none provide tools for verifying whether AI faithfully represents citizen input. The paper addresses this gap at precisely the moment it is becoming operationally critical.

5. Strengths & Limitations

Key Strengths:

Novel conceptual contribution: the input-fidelity/output-quality distinction is clearly articulated and fills a genuine gap in the responsible AI literature

Strong empirical grounding on real policy data rather than synthetic benchmarks

Built-in replication across two independent topics with overlapping participants

Comprehensive robustness testing across embedding models, parameters, and thresholds

Practical tool (Co-creation Provenance Lab) bridges the gap between measurement and action

Transparent about limitations (the "AI Trilemma" framing, caveats on causal interpretation)

Notable Limitations:

Single consultation, single country — generalizability to other democratic contexts, languages, and consultation formats is unknown

No ground truth for "correct" representation — the framework can identify distributional gaps but cannot definitively distinguish harmful exclusion from legitimate editorial compression

The absence of demographic metadata prevents intersectional analysis at the population level, which is critical for equity claims

Lower-bound cross-model correlation (ρ=0.665) suggests individual-level findings are embedding-dependent

The paper is primarily retrospective; whether provenance metrics can improve summarization in real-time remains undemonstrated

The "AI Trilemma" framing, while rhetorically effective, is asserted rather than formally proven — it's unclear why measurement infrastructure specifically resolves the three-way tradeoff rather than merely improving one dimension

Additional observations: The paper is well-written and clearly structured, though it at times veers toward advocacy rather than dispassionate analysis (e.g., "manufactured consensus," "procedural disenfranchisement"). The framing of results could be more balanced: the finding that summaries underperform random baselines is striking, but random participant quotes would not constitute usable policy documents, and the paper somewhat undersells this important caveat. The mathematical framework, while drawing on optimal transport theory, uses it in a relatively straightforward way (empirical Wasserstein distance computation rather than novel theoretical results).
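For reference, the Wasserstein-2 distance mentioned above has the following standard definition, together with its discrete form for two empirical point clouds. The symbols here are assumptions introduced for illustration: $x_i$ for participant embeddings with weights $a_i$, $y_j$ for summary-sentence embeddings with weights $b_j$, and the Euclidean norm as ground cost. The paper's exact weighting and ground metric are not stated in this assessment.

```latex
% General definition: infimum over couplings \gamma with marginals \mu and \nu.
W_2(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)}
    \int \lVert x - y \rVert^2 \, \mathrm{d}\gamma(x, y) \right)^{1/2}

% Discrete (empirical) case: \mu = \sum_i a_i \delta_{x_i}, \quad \nu = \sum_j b_j \delta_{y_j}.
W_2^2(\mu, \nu) = \min_{\pi \ge 0} \sum_{i,j} \pi_{ij} \, \lVert x_i - y_j \rVert^2
\quad \text{subject to} \quad
\sum_j \pi_{ij} = a_i, \qquad \sum_i \pi_{ij} = b_j .
```

The discrete form is a linear program over the coupling matrix $\pi$, which is why the review characterizes the use of optimal transport as a computation rather than a new theoretical result.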

Rating: 7.3 / 10

Significance 8 · Rigor 6.8 · Novelty 7.8 · Clarity 8

Generated Apr 23, 2026

Comparison History (42)

vs. Generative structure search for efficient and diverse discovery of molecular and crystal structures

claude-opus-4.6 · 5/6/2026

Paper 2 addresses a fundamental challenge in materials science and chemistry—efficient structure prediction across high-dimensional energy landscapes—with broad applicability to molecular and materials discovery. Its unified framework bridging generative models and random structure search offers >10x efficiency gains and works beyond training distributions, making it highly impactful for computational chemistry, drug discovery, and materials design. While Paper 1 introduces a novel and timely auditing framework for AI-mediated democracy with strong methodological rigor, its impact is more domain-specific (AI governance/policy). Paper 2's potential to accelerate discovery across multiple scientific fields gives it broader and deeper impact.

vs. GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

gemini-3 · 5/5/2026

Paper 1 addresses a critical socio-technical issue with a highly novel, rigorously grounded framework for auditing AI in democratic processes. Its real-world application reveals alarming biases in policy consultation summaries, offering profound implications for AI governance, ethics, and public policy. While Paper 2 presents a solid technical improvement for LLM training, Paper 1's timely intervention in AI-mediated governance promises much broader interdisciplinary and real-world societal impact.

vs. Targeted Exploration via Unified Entropy Control for Reinforcement Learning

gpt-5.2 · 5/5/2026

Paper 2 likely has higher impact: it introduces a new auditing framework (“participatory provenance”) targeting a rapidly growing, high-stakes application—AI-mediated public consultation—where accountability gaps are widely recognized. Its methodological blend (optimal transport, causal inference, semantic analysis) and validation on a large real national consultation dataset strengthens rigor and credibility. The work has clear real-world uptake potential (policy workflows, governance) and broad cross-field relevance spanning ML, HCI, public policy, and AI ethics. Paper 1 is timely and useful for RL training stability, but is more incremental within a crowded optimization/exploration literature and narrower in downstream societal breadth.

vs. The World Leaks the Future: Harness Evolution for Future Prediction Agents

claude-opus-4.6 · 5/5/2026

Paper 1 introduces a novel formal framework (participatory provenance) addressing a critical accountability gap in AI-mediated democratic processes—a timely and underexplored problem with broad societal implications. It combines optimal transport theory, causal inference, and semantic analysis in a methodologically rigorous way, applies it to real government data revealing systematic exclusion of dissenting voices, and provides an open-source tool for practitioners. Paper 2, while technically sound, represents an incremental improvement in LLM-based prediction benchmarks. Paper 1's interdisciplinary reach (AI governance, democratic theory, public policy) and real-world policy relevance give it higher potential impact.

vs. RankGuide: Tensor-Rank-Guided Routing and Steering for Efficient Reasoning

gpt-5.2 · 5/5/2026

Paper 1 introduces a novel, auditable framework (participatory provenance) targeting a major and under-addressed accountability gap: input fidelity/representational harms in AI-mediated public consultation. It combines optimal transport, causal inference, and semantic analysis, is validated on a real national consultation dataset, and ships an open-source tool—supporting methodological rigor and immediate policy deployment. Its impact spans ML auditing, HCI, public policy, and governance, and is timely given rapid adoption of LLM summarization in civic processes. Paper 2 is technically solid but more incremental within a crowded efficiency/routing literature.

vs. Self-Awareness before Action: Mitigating Logical Inertia via Proactive Cognitive Awareness

gemini-3 · 4/23/2026

Paper 1 addresses a critical, highly timely societal issue—AI accountability in democratic processes. Its interdisciplinary approach, bridging optimal transport theory, causal inference, and public policy, offers a novel auditing framework with profound real-world applications for governance. In contrast, while Paper 2 provides a useful algorithmic improvement for LLM reasoning on puzzle benchmarks, its scope and potential societal impact are significantly narrower. The breadth of Paper 1's impact across AI ethics, political science, and human-computer interaction makes it substantially more impactful.

vs. Self-Awareness before Action: Mitigating Logical Inertia via Proactive Cognitive Awareness

gpt-5.2 · 4/23/2026

Paper 2 is more novel and broadly impactful: it introduces a formal, auditable framework (participatory provenance) for representational fidelity in AI-mediated public consultation, combining optimal transport, causal inference, and semantic analysis, validated on a large real-world national dataset with actionable findings and an open-source tool. Its applications span AI governance, public policy, HCI, NLP evaluation, and fairness/accountability, making it timely and likely to influence practice and regulation. Paper 1 improves LLM puzzle reasoning, but is narrower in scope and resembles existing self-reflection/query-based reasoning frameworks.

vs. pAI/MSc: ML Theory Research with Humans on the Loop

gemini-3 · 4/23/2026

Paper 2 addresses a critical, highly relevant issue (AI accountability in democratic processes) with strong methodological rigor, combining optimal transport and causal inference. Its real-world application to national policy consultation reveals systemic biases, demonstrating significant societal and interdisciplinary impact. Paper 1, while useful, offers a narrower technical tool for academic drafting with more limited broader impact.

vs. pAI/MSc: ML Theory Research with Humans on the Loop

gpt-5.2 · 4/23/2026

Paper 2 offers a novel, theory-grounded auditing framework (optimal transport + causal inference + semantics) targeting a timely, high-stakes accountability gap in AI-mediated governance. It demonstrates methodological rigor with a large real-world dataset and clear quantitative findings, and provides an open-source tool enabling practical deployment by policymakers. Its impact spans AI/ML, HCI, public policy, computational social science, and ethics. Paper 1 is useful infrastructure but appears more incremental and narrower in scope, with less demonstrated empirical validation and broader societal leverage.

vs. QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning

gpt-5.2 · 4/23/2026

Paper 2 has higher potential impact due to its novelty in defining a formal, auditable framework for input-fidelity (representational) accountability in AI-mediated public consultation, with immediate policy and governance applications. It demonstrates methodological rigor (optimal transport + causal inference + semantic analysis) on a real national-scale dataset, surfaces actionable findings about systematic exclusion of dissenting voices, and provides an open-source tool that can be adopted broadly by governments and civic platforms. Its relevance and cross-field reach (AI, HCI, political science, public administration, auditing) exceed the narrower domain focus of Paper 1.

vs. QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning

gemini-3 · 4/23/2026

Paper 1 addresses a critical and highly timely societal issue—democratic accountability in AI-mediated policy consultations. By bridging optimal transport theory with socio-technical auditing, it introduces a novel, mathematically rigorous framework for a previously unaddressed problem (input fidelity). Its real-world application to national policy data demonstrates immediate, high-impact utility across AI governance and public policy. While Paper 2 offers solid technical improvements for LLM scientific reasoning, Paper 1's profound societal implications, interdisciplinary methodological rigor, and establishment of a new accountability paradigm give it a higher potential for broad scientific and real-world impact.

vs. AAC: Admissible-by-Architecture Differentiable Landmark Compression for ALT

claude-opus-4.6 · 4/23/2026

Paper 1 introduces a novel framework ('participatory provenance') addressing a critical and timely accountability gap in AI-mediated democratic processes. It combines optimal transport theory with causal inference in a new application domain, demonstrates real-world impact through analysis of an actual national policy consultation, and provides an open-source tool for policymakers. Its breadth of impact spans AI governance, democratic theory, public policy, and fairness—fields with enormous societal relevance. Paper 2, while technically rigorous, addresses a narrower problem in heuristic search optimization with incremental improvements over existing baselines and limited cross-disciplinary reach.

vs. SPARD: Self-Paced Curriculum for RL Alignment via Integrating Reward Dynamics and Data Utility

gemini-3 · 4/23/2026

Paper 1 pioneers a novel, interdisciplinary framework addressing a critical sociotechnical gap: representational accountability in AI-mediated policy-making. By combining optimal transport theory with a high-stakes real-world application (Canada's AI strategy consultation) and providing an open-source auditing tool, it offers profound multidisciplinary impact across AI, political science, and HCI. Paper 2 presents a solid, but more incremental, algorithmic improvement to LLM alignment in a crowded research area.

vs. Hodoscope: Unsupervised Monitoring for AI Misbehaviors

gemini-3 · 4/23/2026

Paper 1 addresses a critical and immediate socio-technical challenge: the use of AI in democratic processes and public policy. Its framework combines rigorous methods (optimal transport, causal inference) to expose significant biases in real-world government AI summaries, specifically the systemic exclusion of dissenting voices. This offers profound implications for AI governance, ethics, and computational social science, arguably presenting a higher societal and cross-disciplinary impact than the technical benchmarking improvements in Paper 2.

vs. Emotion Concepts and their Function in a Large Language Model

claude-opus-4.6 · 4/23/2026

Paper 2 investigates a fundamental question about LLM internals—whether emotion-like representations exist and causally influence behavior including misalignment (reward hacking, sycophancy, blackmail). This has broad implications for AI safety/alignment, mechanistic interpretability, and cognitive science. The finding that abstract emotion representations causally drive misaligned behaviors is highly novel and actionable for the entire AI safety community. Paper 1, while rigorous and policy-relevant, addresses a narrower niche (AI-mediated public consultation auditing) with more limited cross-field impact. Paper 2's relevance to the urgent alignment problem gives it greater estimated impact.

vs. AI scientists produce results without reasoning scientifically

gemini-3 · 4/23/2026

Paper 2 addresses a foundational issue in the rapidly growing field of AI-driven science. By demonstrating that LLM agents lack true scientific reasoning, it challenges the validity of autonomous AI research tools across all scientific domains. Its massive experimental scale (25,000+ runs) and deep epistemological critique offer broader implications for core AI development, model evaluation, and the philosophy of science compared to Paper 1's narrower, albeit important, focus on AI in public policy consultations.

vs. HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

claude-opus-4.6 · 4/23/2026

Paper 1 introduces a novel theoretical framework (participatory provenance) addressing a critical emerging gap at the intersection of AI, democratic governance, and accountability. It combines optimal transport theory, causal inference, and semantic analysis in a methodologically rigorous way, applies it to real government data revealing systematic exclusion of dissenting voices, and provides an open-source tool. This addresses a timely, high-stakes problem with broad societal implications as AI-mediated public consultation scales globally. Paper 2, while valuable, addresses a narrower technical benchmarking problem in coding agents with more incremental contributions to the existing agent evaluation literature.

vs. Polysemantic Experts, Monosemantic Paths: Routing as Control in MoEs

gpt-5.2 · 4/23/2026

Paper 2 has higher likely scientific impact: it introduces a formal, auditable framework (participatory provenance) with clear methodological grounding (optimal transport + causal inference), validates it on a large real-world national consultation dataset, and ships an open-source tool enabling immediate adoption. Its applications span AI governance, public policy, HCI, NLP evaluation, and accountability research, making breadth and societal relevance high and timely amid regulatory pressure. Paper 1 is novel and rigorous for MoE interpretability, but its primary impact is within ML interpretability and model analysis with less direct near-term cross-sector deployment.

vs. Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis

claude-opus-4.6 · 4/23/2026

Paper 1 introduces a novel, formally grounded measurement framework (participatory provenance) addressing a critical and timely accountability gap in AI-mediated democratic processes. It combines optimal transport theory, causal inference, and semantic analysis with a compelling empirical demonstration on real government data, revealing systematic exclusion of dissenting voices. Its breadth of impact spans AI governance, democratic theory, and public policy. Paper 2 addresses an important but more domain-specific problem (prior contamination in LLM reasoning) with a pragmatic protocol. While valuable, epistemic blinding is conceptually simpler and narrower in scope, primarily serving as a diagnostic rather than establishing a new theoretical framework with broad societal implications.

vs. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

gpt-5.2 · 4/23/2026

Paper 2 offers a novel, generalizable auditing framework (“participatory provenance”) that targets an unmet accountability gap—input fidelity and representational harm in AI-mediated public consultation—grounded in optimal transport, causal inference, and semantic analysis, with an open-source tool for deployment. Its applications extend across policy, HCI, AI governance, and civic tech, and it is timely given growing governmental use of AI summarization. Paper 1 is impactful operationally but is more deployment/reporting-focused and may face faster obsolescence as models change and strong normative/ethics constraints limit adoption.