A governance horizon for ethical-use constraints in open-weight AI models

Weiwei Xu, Hengzhi Ye, Haoran Ye, Kai Gao, Vladimir Filkov, Minghui Zhou

May 23, 2026

arXiv:2605.24383v1

cs.AI (primary), cs.CY

#325 of 2682 · Artificial Intelligence

Tournament Score: 1502 ± 43 (scale: 1050 to 1800)

Win Rate: 71%

Rating: 8.2 / 10

Significance 8.5 · Rigor 8 · Novelty 8 · Clarity 8.5

Abstract

Ethical constraints on open-weight AI models are both a reflection of societal concerns and a foundation for AI governance policy. They are expected to propagate to downstream derivatives while implemented as voluntary metadata disclosures that must be restated at each generation of reuse. We audit 2,142,823 model repositories on Hugging Face Hub to test whether this disclosure-based governance infrastructure can sustain traceability across deep model lineages. Restriction evidence decays with a half-life of 1.31 derivation steps ($R^{2} = 0.98$), and beyond seven downstream generations at least 80% of descendant models lack sufficient public evidence for a governance determination, a depth boundary we formalize as the governance horizon. Platform-level interventions to restore missing licence metadata reveal that policy design (not enforcement alone) is the binding factor: inheritance-only designs require near-complete enforcement to move the horizon, whereas a mandatory-declaration design that explicitly resolves orphan lineage components shifts the horizon already at moderate enforcement. The structural bottleneck is lineages with no inheritable upstream intent: such orphan components remain undecidable under any inheritance-only policy regardless of enforcement rate, and unresolved upstream nodes additionally create direct downstream undecidability bottlenecks that inheritance rules alone cannot recover. Comparison with PyPI, where governance signals are carried by explicit machine-readable declarations, corroborates that the collapse is topology-specific to open-weight derivation rather than inherent to open ecosystems. These results establish that disclosure-based governance has a shallow, structurally determined reach in open-weight AI, and that achieving deep supply-chain accountability requires provenance mechanisms propagating governance signals through derivation itself.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper identifies and formalizes a "governance horizon" — a structural depth boundary beyond which disclosure-based ethical-use constraints become practically unauditable in open-weight AI model ecosystems. Through an ecosystem-scale audit of 2.14 million model repositories on Hugging Face, the authors demonstrate that restriction evidence decays exponentially with a half-life of 1.31 derivation steps, and that beyond ~7 downstream generations, 80%+ of descendant models lack sufficient public evidence for governance determination. The key insight is that this is not merely a documentation compliance problem but a structural-topological one: governance signals depend on voluntary re-declaration at each derivation step, while model capabilities propagate automatically through weights.
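To make the reported half-life concrete, the short sketch below evaluates a pure exponential decay with a half-life of 1.31 derivation steps. It assumes the decay is exactly exponential at every depth; the function names and the printed depth range are illustrative and do not come from the paper. The curve describes surviving restriction evidence only, which is not the same quantity as the share of undecidable descendants used to define the horizon.

```python
import math

# Half-life of restriction evidence, in derivation steps, as reported in the abstract.
HALF_LIFE_STEPS = 1.31


def remaining_fraction(depth: float, half_life: float = HALF_LIFE_STEPS) -> float:
    """Share of restriction evidence left after `depth` steps under pure exponential decay."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return 0.5 ** (depth / half_life)


def decay_rate(half_life: float = HALF_LIFE_STEPS) -> float:
    """Equivalent continuous decay constant: lambda = ln(2) / half-life."""
    return math.log(2) / half_life


# Illustrative table for depths 0..7 (an assumed range, chosen to reach the reported horizon).
for depth in range(8):
    print(f"depth {depth}: {remaining_fraction(depth):.3f}")
```

Under this assumption the evidence halves roughly every 1.3 steps, so only a few percent would survive by depth seven.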

The concept of formalizing this as a measurable "horizon" with a specific decay function is genuinely novel. Prior work documented individual documentation failures, licensing conflicts, or transparency gaps, but none quantified the depth-dependent degradation of governance signals across full derivation graphs at ecosystem scale.

Methodological Rigor

The methodology is thorough and multi-layered. The pipeline for extracting and validating 1.03M model-to-model derivation relationships is well-documented, with stratified evaluation against LLM judges (topology-level micro-F1 = 94.0%) and independent re-annotation stability checks (Cohen's κ = 0.865). The ethical-use restriction classifier achieves strong performance (precision = 0.96, recall = 0.91) against a human-adjudicated gold standard.
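For readers who want a single summary figure for the restriction classifier, the sketch below computes the F1 score implied by the reported precision and recall. This value is derived arithmetic, not a number stated in the assessment, and it is distinct from the 94.0% micro-F1 reported for derivation-topology extraction.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the restriction classifier as reported above.
print(round(f1_score(0.96, 0.91), 3))  # prints 0.934
```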

The audit-state framework (Decidable, Inconsistent, Undecidable-Missing, Undecidable-Ambiguous) is a thoughtful operationalization that distinguishes between different failure modes. The governance horizon formalization with bootstrap stability testing (95% CI [7,7]) and sensitivity analysis across 24 parameter combinations is commendable.
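One way to read the horizon is as the shallowest depth beyond which every observed generation has at least 80% of its models in an undecidable state. The sketch below implements that reading. The paper's exact formal definition may differ, and the per-depth shares in the example are hypothetical values invented for illustration, not data from the audit.

```python
from typing import Mapping, Optional


def governance_horizon(
    undecidable_share: Mapping[int, float], threshold: float = 0.80
) -> Optional[int]:
    """Return depth h such that every observed depth > h has share >= threshold.

    Assumed reading of the definition; returns None if the deepest observed
    generation is still below the threshold.
    """
    horizon = None
    # Walk from the deepest generation upward while the threshold still holds.
    for depth in sorted(undecidable_share, reverse=True):
        if undecidable_share[depth] >= threshold:
            horizon = depth - 1
        else:
            break
    return horizon


# Hypothetical shares of undecidable models per derivation depth (not paper data).
toy_shares = {1: 0.35, 2: 0.48, 3: 0.57, 4: 0.64, 5: 0.70,
              6: 0.75, 7: 0.78, 8: 0.81, 9: 0.86, 10: 0.90}
print(governance_horizon(toy_shares))
```

Scanning from the deepest generation keeps a shallow, noisy spike above the threshold from being mistaken for the horizon.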

The merge-conflict analysis employs progressively stronger identification strategies (raw comparison, PSM, IPW, AIPW), and the authors are appropriately transparent about residual imbalance on model age and the modest absolute effect sizes. The PyPI comparator ecosystem provides a valuable structural control, though the comparison is somewhat limited by the different nature of governance signals (continuous LRI vs. binary auditability).

Some methodological concerns: the restriction classifier relies on pattern matching over licence text, which may miss implicit restrictions or misclassify edge cases. The single-platform, single-timepoint snapshot limits temporal and cross-platform generalizability. The simulation of platform interventions, while informative, makes simplifying assumptions about enforcement that may not capture real-world dynamics.

Potential Impact

Policy impact: This paper has direct implications for the EU AI Act's supply-chain documentation requirements, NIST AI RMF provenance expectations, and OECD recommendations on open-weight governance. The finding that disclosure-based governance has a structurally bounded reach — independent of enforcement intensity for inheritance-only designs — is a concrete, actionable result for policymakers. It suggests that current regulatory frameworks may be building on fundamentally insufficient infrastructure.

Technical impact: The paper motivates development of cryptographic provenance attestation, weight-embedded licence chains, and platform-enforced derivation registries. These are technically nascent but the empirical evidence presented here provides strong justification for investment in such infrastructure.

Ecosystem governance: For platforms like Hugging Face, the distinction between inheritance-only and mandatory-declaration designs (with the latter shifting the governance horizon even at moderate enforcement) provides a concrete design recommendation that could be implemented relatively quickly.
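The difference between the two designs can be illustrated on a toy lineage. The sketch below is not the paper's simulation: the graph, the node names, the licence label, and the two policy functions are all hypothetical, and enforcement is modelled simply as a set of nodes the platform acts on. It shows only the structural point that a component whose root declares nothing stays undecidable under inheritance, even with every node enforced.

```python
from typing import Dict, Optional, Set

# Hypothetical lineage: child -> parent (None marks a root model).
PARENT: Dict[str, Optional[str]] = {
    "base-a": None, "a-ft": "base-a", "a-ft-merge": "a-ft",
    "base-b": None, "b-ft": "base-b", "b-ft-quant": "b-ft",
}
# Only base-a declares a licence, so base-b's component is an orphan.
DECLARED: Dict[str, str] = {"base-a": "restricted-use"}


def inherited_licence(node: str, parent: Dict[str, Optional[str]],
                      declared: Dict[str, str], enforced: Set[str]) -> Optional[str]:
    """Inheritance-only: an enforced node copies whatever its parent resolves to."""
    if node in declared:
        return declared[node]
    upstream = parent[node]
    if upstream is None or node not in enforced:
        return None  # nothing to inherit, or the platform did not act on this node
    return inherited_licence(upstream, parent, declared, enforced)


def undecidable_inheritance_only(parent, declared, enforced) -> Set[str]:
    return {n for n in parent if inherited_licence(n, parent, declared, enforced) is None}


def undecidable_mandatory_declaration(parent, declared, enforced) -> Set[str]:
    """Mandatory declaration: an enforced node must state its own licence."""
    return {n for n in parent if n not in declared and n not in enforced}


everyone = set(PARENT)  # full enforcement
print(sorted(undecidable_inheritance_only(PARENT, DECLARED, everyone)))
print(sorted(undecidable_mandatory_declaration(PARENT, DECLARED, everyone)))
```

With full enforcement, the inheritance-only rule still leaves the three models under base-b unresolved, while the declaration rule leaves none.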

Research impact: The formalization of governance horizons as measurable quantities opens a new empirical research program. The methodology could be applied longitudinally to track whether the horizon is expanding or contracting, and cross-platform to other model registries.

Timeliness & Relevance

This paper arrives at a critical moment. The EU AI Act is entering implementation, open-weight models now constitute ~55% of commercially available foundation models, and model merging/fine-tuning ecosystems are growing rapidly. The gap between regulatory expectations and technical reality identified here is immediately consequential. The paper also addresses the ongoing debate about open vs. closed AI governance by showing that the problem is not openness per se but the specific topology of weight-level derivation.

Strengths

1. Scale and completeness: Auditing 2.14M repositories provides ecosystem-level evidence rather than case studies

2. Conceptual clarity: The "governance horizon" concept is intuitive, formally defined, and empirically grounded

3. Actionable policy implications: The distinction between inheritance-only and mandatory-declaration designs offers concrete guidance

4. Strong comparator design: The PyPI comparison isolates the topology-specific nature of the collapse

5. Comprehensive robustness: 24-combination sensitivity analysis, multiple causal estimators, bootstrap stability

6. Reproducibility commitment: Deterministic analysis pipeline release

Limitations

1. Single-platform snapshot: No temporal dynamics or cross-platform validation beyond PyPI

2. Legal abstraction: The classifier captures public disclosure evidence, not legal enforceability — the distinction between "governance infrastructure reach" and "legal reach" could be more deeply explored

3. Simulation simplifications: The intervention models assume random enforcement targeting rather than strategic prioritization, which likely understates the effectiveness of targeted interventions

4. Dataset dependencies excluded: The paper acknowledges this but dataset provenance is an important component of real governance

5. Merge-conflict effect modest: The absolute relicensing probabilities are below 7%, limiting practical significance despite statistical significance

6. PyPI comparison asymmetry: Comparing continuous LRI with binary auditability limits the precision of the cross-ecosystem comparison

Overall Assessment

This is a high-quality empirical contribution that identifies a fundamental structural limitation in current AI governance infrastructure. The combination of massive-scale data, rigorous methodology, and clear policy relevance makes it a significant contribution. The governance horizon concept provides both a diagnostic tool and a benchmark for evaluating future governance interventions. The main limitation is its cross-sectional nature, but the methodological framework enables longitudinal and cross-platform extensions.

Rating: 8.2 / 10

Significance 8.5 · Rigor 8 · Novelty 8 · Clarity 8.5

Generated May 26, 2026

Comparison History (24)

vs. SIA: Self Improving AI with Harness & Weight Updates

gemini-3.1 · 5/27/2026

Paper 2 tackles a foundational goal in artificial intelligence—autonomous self-improvement—by unifying scaffold optimization and weight updates. Its methodological breadth, demonstrated by massive performance gains across disparate domains (law, GPU optimization, RNA denoising), gives it immense potential for widespread, cross-disciplinary scientific application. While Paper 1 provides a highly rigorous and important empirical analysis for AI governance, Paper 2 directly advances core AI capabilities, which typically drives deeper technical impact and broader adoption in the machine learning community.

vs. What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation

claude-opus-4.6 · 5/27/2026

Paper 2 addresses a fundamental question about why chain-of-thought prompting works in LLMs, revealing that local co-occurrence rather than logical reasoning drives much of the gain. This challenges widely-held assumptions about CoT and has broad implications for interpretability, prompt engineering, and understanding of LLM reasoning across the entire AI/NLP community. Paper 1 provides valuable empirical governance analysis of Hugging Face model licensing, but addresses a narrower policy/infrastructure audience. Paper 2's mechanistic insight into a ubiquitously used technique will likely influence more research directions and has higher citation potential.

vs. Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs

gpt-5.2 · 5/27/2026

Paper 2 likely has higher scientific impact: it identifies a broadly relevant failure mode (monitoring-control gap) in retrieval-augmented LLMs, a dominant deployment paradigm, with direct implications for safety in high-stakes applications. It combines large-scale multi-turn evaluation (50k+ turn-level), cross-model replication, human validation, and mechanism-oriented analyses, making the claim both timely and methodologically rigorous. The concept generalizes across domains (alignment, RAG evaluation, decision-making under uncertainty). Paper 1 is novel and valuable for governance of open-weight ecosystems, but its impact is more policy/platform-specific and narrower in technical breadth.

vs. LipoAgent: Coordinating Fine-Tuned LLM Agents for Safer Lipid Design

claude-opus-4.6 · 5/26/2026

Paper 1 addresses a fundamental and broadly impactful problem in AI governance—the traceability and sustainability of ethical-use constraints across open-weight model ecosystems. Its large-scale empirical audit of over 2 million repositories, formalization of the 'governance horizon' concept, and comparison across platforms provide novel, rigorous insights with direct policy implications for the entire open-source AI ecosystem. Paper 2 makes a solid applied contribution to lipid nanoparticle design using LLM agents, but its impact is more domain-specific. Paper 1's breadth across AI policy, supply-chain accountability, and open-source governance gives it wider and more lasting influence.

vs. Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs

gemini-3.1 · 5/26/2026

Paper 2 conducts a massive empirical audit of over 2 million models, establishing a fundamental 'governance horizon' that highlights critical failures in current open-weight AI policy. Its insights profoundly impact the broader fields of AI governance, policy-making, and open-source ecosystems. While Paper 1 offers a valuable technical solution for LLM safety alignment, Paper 2 addresses a structural bottleneck in AI accountability with sweeping implications for how open models are regulated globally.

vs. FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization

gpt-5.2 · 5/26/2026

Paper 2 has higher likely scientific impact due to its broad, timely relevance to AI governance and open-weight model supply-chain accountability, with immediate policy and platform applications. It introduces a clear, quantifiable concept (the “governance horizon”), audits a very large real-world dataset, and derives actionable design implications validated via comparisons/interventions. Its conclusions generalize across stakeholders (research, platforms, regulators) and fields (ML, security, policy, software ecosystems). Paper 1 is valuable and novel for OR/LLM evaluation but is narrower in domain reach and downstream policy leverage.

vs. Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy

gemini-3.1 · 5/26/2026

Paper 2 addresses a critical, timely issue in AI governance by empirically auditing over 2 million models to reveal the systemic decay of ethical constraints across open-weight AI lineages. Its formalization of the 'governance horizon' and actionable policy insights give it profound, cross-disciplinary implications for AI regulation, policy, and safety, offering broader real-world impact compared to the technical improvements in LLM interpretability presented in Paper 1.

vs. NeurIPS: Neuro-anatomical Inductive Priors for Sphere-based Brain Decoding

claude-opus-4.6 · 5/26/2026

Paper 2 addresses a critical and timely AI governance challenge with broad policy implications. Its large-scale empirical audit of 2.1M+ model repositories provides novel, quantitative evidence (governance horizon, half-life of restriction decay) that directly informs AI regulation and open-source policy design. The findings have immediate real-world applications for policymakers, platform designers, and the AI safety community. Paper 1, while technically strong in neuroimaging, addresses a narrower domain (fMRI decoding) with incremental improvements. Paper 2's cross-disciplinary impact spanning AI governance, supply-chain accountability, and policy design gives it broader and more timely significance.

vs. Controllable User Simulation

claude-opus-4.6 · 5/26/2026

Paper 1 presents a large-scale empirical audit of 2.1M+ model repositories, formalizing a novel concept (governance horizon) with immediate policy implications for AI governance—a critically timely topic. It bridges AI policy, software supply-chain analysis, and open-source ecosystems with rigorous methodology and actionable design recommendations. Paper 2 makes solid theoretical contributions to controllable user simulation via causal inference, but addresses a narrower problem (evaluation of conversational agents). Paper 1's breadth of impact across AI governance, policy design, and open-source communities, combined with its timeliness amid global AI regulation efforts, gives it higher potential impact.

vs. Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

gemini-3.1 · 5/26/2026

Paper 2 addresses an urgent and universally relevant issue: AI governance and supply-chain accountability. By empirically auditing over 2 million models, it provides rigorous, large-scale evidence of the failure of current ethical-use constraints in open-weight models. This highly novel approach bridges technical ecosystems and policy design, offering immediate real-world implications for global AI regulation, whereas Paper 1 focuses on a technical benchmark for future AI agents.

vs. Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact due to its scale (auditing 2.1M repos), strong quantitative finding (a measurable “governance horizon” with predictive fit), and immediate policy relevance for open-weight model governance and supply-chain accountability. Its conclusions generalize across platforms (Hugging Face vs PyPI) and inform actionable platform/policy designs, affecting research, industry compliance, and regulation. Paper 1 is timely and useful for agent safety evaluation, but its domain specificity (industrial multi-agent workflows) and narrower stakeholder reach suggest comparatively smaller cross-field and real-world governance impact.

vs. A Signal-Language Foundation Model for Broad-Spectrum Cardiovascular Assessment from Routine Electrocardiography

gpt-5.2 · 5/26/2026

Paper 2 likely has higher scientific impact: it introduces a scalable foundation-model paradigm (signal–language contrastive pretraining) trained on very large clinical data, with extensive multi-cohort external validation (~1.5M ECGs) across 89 tasks, suggesting strong methodological rigor and broad applicability to diagnosis, screening, and representation learning in healthcare. Its real-world translational potential is immediate (routine ECG workflows, rare disease detection, echocardiography proxy targets) and timely given foundation-model momentum. Paper 1 is novel and relevant for AI governance, but its impact is more policy/infrastructure-focused and less directly transformative across multiple scientific/clinical domains.

vs. Neuro-Inspired Inverse Learning for Planning and Control

gpt-5.2 · 5/26/2026

Paper 1 offers a novel learning/control framework (inverse learning with trajectory-level optimization) that improves performance and drastically reduces inference compute on standard benchmarks, plus a compelling cross-domain demo in quantum control. This combination of methodological innovation, measurable gains, and broad applicability to robotics/embodied AI and potentially other control problems suggests high scientific and practical impact. Paper 2 is timely and policy-relevant with strong empirical auditing, but its primary impact is narrower (AI governance/metadata infrastructure) and more contingent on institutional adoption than on a generalizable technical advance.

vs. AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

claude-opus-4.6 · 5/26/2026

Paper 1 presents a novel, large-scale empirical analysis of AI governance infrastructure with a rigorous methodology (auditing >2M model repositories), introduces formalized concepts (governance horizon, half-life of restriction evidence), and has broad implications for AI policy, supply-chain accountability, and open-source governance. Its findings are structurally fundamental and relevant across regulatory, legal, and technical domains. Paper 2, while timely and practical, is a more incremental robustness benchmark contribution within the narrower scope of computer-use agents, with less transformative potential for the broader field.

vs. Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks

gpt-5.2 · 5/26/2026

Paper 1 has higher potential impact due to its novelty and breadth: it introduces a quantitative “governance horizon” for open-weight model lineages, empirically audits a massive ecosystem (2.1M repos), and yields policy-relevant, structurally grounded conclusions about traceability limits and intervention design. Its implications span AI governance, software supply-chain provenance, platform policy, and responsible open-source AI. Paper 2 is timely and practically useful for production benchmarking rigor, but is narrower in scope (client-side measurement bias and tooling) and likely more incremental/engineering-focused, with less cross-field policy impact.

vs. Credit Assignment with Resets in Language Model Reasoning

claude-opus-4.6 · 5/26/2026

Paper 1 addresses a core technical challenge in LLM training—credit assignment in reinforcement learning for reasoning—with novel methods (RRPO, SRPO) grounded in theoretical frameworks (CPI) and validated empirically. This directly advances the rapidly growing field of LLM reasoning improvement, with broad applicability across models and benchmarks. Paper 2 provides valuable empirical insights into AI governance limitations on Hugging Face, but its impact is narrower, primarily informing policy design for open-weight model ecosystems. Paper 1's methodological contributions are more likely to be widely adopted and cited in the highly active LLM research community.

vs. Evolutionary Enhanced Multi-Agent Reinforcement Learning for Cooperative Air Combat

gemini-3.1 · 5/26/2026

Paper 2 addresses a highly timely and universally relevant challenge in AI governance, analyzing over 2 million models to expose critical flaws in open-weight AI policy enforcement. Its findings on the 'governance horizon' have broad implications for AI ethics, law, and supply-chain accountability. In contrast, Paper 1 presents incremental methodological improvements for a niche application (military air combat), resulting in a much narrower scope of scientific and societal impact.

vs. Inference Time Context Sparsity: Illusion or Opportunity?

gemini-3.1 · 5/26/2026

Paper 1 addresses a critical, universal bottleneck in modern AI: LLM long-context inference compute and memory costs. By empirically demonstrating that extreme context sparsity is robust across models without fine-tuning, and providing hardware-level optimizations yielding up to a 10x speedup, it offers immediate, transformative utility for AI development. While Paper 2 provides a valuable, rigorous audit of AI governance and open-weight licensing decay, Paper 1's findings will directly and fundamentally drive future LLM architecture, systems engineering, and deployment capabilities across a massive global industry.

vs. GRAIL: AI translation for scientists application workflow on satellite data

claude-opus-4.6 · 5/26/2026

Paper 1 presents a novel, rigorous empirical analysis of AI governance infrastructure at scale (2.1M+ repositories), introduces the formalized concept of a 'governance horizon,' and provides actionable insights for AI policy design with broad implications across AI governance, open-source ecosystems, and supply-chain accountability. Its methodological rigor (quantitative decay modeling, cross-platform comparison) and timeliness given ongoing AI regulation debates give it significantly higher potential impact. Paper 2 is a useful engineering contribution but addresses a narrower problem (translating geospatial Python to Spark) with limited generalizability beyond its specific domain.

vs. From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

gemini-3.1 · 5/26/2026

Paper 2 addresses the critical and highly timely issue of AI governance and open-weight model proliferation. By conducting a massive empirical audit of over 2.1 million models, it introduces quantifiable metrics like the 'governance horizon' and provides actionable insights into policy design. Its interdisciplinary breadth, impacting AI regulation, platform engineering, and machine learning safety, gives it a higher potential for widespread scientific and real-world policy impact compared to the narrower, domain-specific agent evaluation study in Paper 1.