Event-Driven Reinforcement Learning Enables Long-Horizon Control in Semiconductor Fabrication

Yavar Yeganeh, Mahsa Shekari, Nicla Frigerio, Daniele Pagano, Andrea Matta

Jun 9, 2026 · arXiv:2606.10705v1

cs.LG · cs.AI · eess.SY

#2234 of 5669 · cs.LG

Tournament Score: 1426 ± 43 (scale 1050 to 1750)

Win Rate: 56%

Rating: 6.2 / 10

Significance 6.5 · Rigor 6 · Novelty 6.5 · Clarity 5.5

Abstract

Reinforcement learning promises to optimize sequential decisions in large-scale systems. Semiconductor manufacturing systems are stochastic and highly constrained environments where heterogeneous wafers traverse hundreds of processing steps across extensive equipment networks. These characteristics yield complex, high-dimensional decision problems with delayed feedback and long-horizon requirements, complicating production planning and control. We propose a deep reinforcement learning framework for multi-objective policy optimization at this scale. Specifically, we formulate control as a centralized-agent problem, where a core policy coordinates system-wide decisions, while system evolution is represented as an interconnected temporal process driven by discrete events. Accordingly, we develop a tailored event-driven temporal-difference formulation that remains general and can be integrated with various policy optimization methods under relevant training settings. We investigate several core model-free algorithms incorporated into this framework and evaluate their effectiveness using high-fidelity simulations of diverse, industry-real operating scenarios. Across extensive validation experiments, agents trained in both offline and online settings show significant and consistent gains in throughput and utilization. We further evaluate performance and generalization across training phases, clarifying the relative strengths of alternative reinforcement learning formulations and algorithms. Overall, the results support the scalability, generality, and transferability of the proposed framework for controlling event-driven complex adaptive systems.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

The paper introduces an event-driven reinforcement learning framework for long-horizon control in semiconductor fabrication (fab) systems. The central novelty lies in the event-group temporal-difference (TD) learning formulation, which aggregates credit assignment across temporally overlapping, asynchronous events rather than treating each decision as an independent transition. This addresses a genuine structural mismatch: in fabs, actions initiate extended processes that overlap in time, and system-level KPIs (throughput, utilization) emerge only from the collective effect of thousands of interdependent decisions. The framework uses a centralized agent with parameter sharing across all equipment, a decomposed reward (event-level + system-level), and supports both offline and online training phases with multiple algorithm backbones (DQL, CQL, IQL, SAC, PPO).
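As a rough illustration of the aggregation idea, the sketch below contrasts per-event TD errors with a grouped objective. It follows the one-line description of Eq. 12 given later in this review (squared deviation between the mean TD error of a group and the system-level reward); the paper's exact objective, grouping rule, and discounting are not reproduced here. The function names, the `group_ids` encoding, the one-group-per-event simplification, and the toy numbers are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def td_errors(rewards, values, next_values, gamma=0.99):
    """One-step TD error per event: r + gamma * V(s') - V(s)."""
    return [r + gamma * v_next - v
            for r, v, v_next in zip(rewards, values, next_values)]


def event_group_loss(errors, group_ids, system_rewards):
    """Mean over groups of (mean TD error in group - system reward)^2.

    Assumption: each event belongs to exactly one group, and each group
    has a single scalar system-level reward (e.g. a throughput signal).
    """
    grouped = defaultdict(list)
    for err, gid in zip(errors, group_ids):
        grouped[gid].append(err)
    per_group = [(mean(errs) - system_rewards[gid]) ** 2
                 for gid, errs in grouped.items()]
    return mean(per_group)


# Toy data (hypothetical): three events, the first two share a group.
errors = td_errors(rewards=[1.0, 0.0, 2.0],
                   values=[0.5, 0.5, 1.0],
                   next_values=[1.0, 1.0, 0.0],
                   gamma=0.5)
loss = event_group_loss(errors, group_ids=[0, 0, 1],
                        system_rewards={0: 0.5, 1: 0.0})
print(errors, loss)
```

The point of the grouping is that an individual event's error no longer has to vanish on its own; only the group average is tied to the system-level signal, which is also why the review later asks whether such an objective preserves Bellman consistency.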

2. Methodological Rigor

Strengths in experimental design:

Evaluation spans 31 real industrial scenarios (21 train, 10 test) with temporal separation to prevent leakage—a commendable practice rarely seen in applied RL work.

Each test scenario uses 3 random seeds, yielding 30 evaluation instances with Bonferroni-adjusted confidence intervals and paired statistical tests (permutation, Wilcoxon).

The ablation study on TD formulations (Table 2) is particularly informative: the truncated discounted-sequence baseline performs poorly (negative gains), validating the motivation for event-driven aggregation.
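The statistical protocol in the list above (paired tests over 30 evaluation instances, Bonferroni-adjusted intervals) can be sketched as follows. This is a generic illustration using only the Python standard library, not the authors' analysis code: the sign-flip permutation scheme, the normal-approximation interval, the number of comparisons, and the example numbers are assumptions.

```python
import random
from statistics import NormalDist, mean, stdev


def paired_permutation_pvalue(a, b, n_perm=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired differences a[i] - b[i]."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null, each paired difference is equally likely to flip sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def bonferroni_ci(diffs, alpha=0.05, n_comparisons=1):
    """Normal-approximation CI for the mean difference at level alpha / n_comparisons."""
    adjusted_alpha = alpha / n_comparisons
    z = NormalDist().inv_cdf(1 - adjusted_alpha / 2)
    half_width = z * stdev(diffs) / len(diffs) ** 0.5
    centre = mean(diffs)
    return centre - half_width, centre + half_width


# Hypothetical throughput values for 30 paired instances (agent vs. baseline).
baseline = [100.0 + 0.5 * i for i in range(30)]
agent = [b + 5.0 + 0.1 * (i % 7) for i, b in enumerate(baseline)]
diffs = [x - y for x, y in zip(agent, baseline)]

p_value = paired_permutation_pvalue(agent, baseline, n_perm=2000)
ci_single = bonferroni_ci(diffs, n_comparisons=1)
ci_adjusted = bonferroni_ci(diffs, n_comparisons=5)
print(p_value, ci_single, ci_adjusted)
```

Dividing alpha by the number of comparisons widens each interval, which is the price paid for controlling the family-wise error rate across several algorithm-versus-baseline contrasts. With only 30 instances a t-based or bootstrap interval would be a reasonable alternative to the normal approximation used here.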

Limitations in rigor:

The theoretical grounding of the event-group TD objective is acknowledged as incomplete. The authors note it is a "surrogate aggregation objective" without formal guarantees of Bellman consistency or fixed-point preservation. This is a significant gap—the method works empirically but lacks convergence guarantees.

Hyperparameter tuning is explicitly not exhaustive, making cross-algorithm comparisons (e.g., DQL vs. SAC vs. PPO) somewhat informal. The authors acknowledge this, but it weakens claims about algorithmic superiority.

The offline training uses data from a random policy only, which is an extremely weak behavior policy. While this tests robustness, it doesn't reflect realistic deployment where historical expert/heuristic data would be available.

The proprietary data and simulator prevent full reproducibility, though the RL framework code is released.

3. Potential Impact

Domain-specific: The framework addresses a real bottleneck in semiconductor manufacturing, namely dispatching decisions across hundreds of tools with re-entrant flows and delayed feedback. The reported gains (up to ~21% throughput improvement over FIFO, ~42% over random policy) are operationally meaningful. The collaboration with STMicroelectronics and the use of industry-real scenarios lend credibility.

Broader applicability: The event-driven TD formulation is positioned as general for any discrete-event system with asynchronous, temporally overlapping actions—healthcare operations, logistics, telecommunications. This generality is plausible but unvalidated beyond the semiconductor domain.

Methodological influence: The event-group aggregation idea could influence how RL researchers handle credit assignment in systems with temporally extended, overlapping actions. However, the lack of theoretical analysis limits its adoption as a principled method versus an empirical heuristic.

4. Timeliness & Relevance

The paper is well-timed. Industrial RL deployment remains challenging, and semiconductor manufacturing is experiencing renewed strategic importance globally. The emphasis on offline-to-online training pipelines aligns with current trends in practical RL deployment. The connection to semi-MDP and temporal abstraction literature is appropriate, though the paper could engage more with recent return decomposition methods (RUDDER is cited but not compared against).

5. Strengths & Limitations

Key Strengths:

Problem formulation is well-motivated: The temporal structure diagram (Fig. 1) and the distinction between contained/started events provide clear intuition for why standard TD fails here.

Comprehensive algorithmic coverage: Testing DQL, CQL, IQL, SAC, and PPO under the same framework enables meaningful (if imperfect) comparisons across value-based, conservative, and policy-gradient families.

Offline model selection analysis (Table 3): The finding that TD loss poorly correlates with downstream performance is practically important and aligns with recent findings in offline RL.

Multi-phase training: The offline→online pipeline with conservative pretraining is practically relevant.

Scale: The state/action space (5500+ dimensional input) and 31 scenarios represent genuine industrial complexity.
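One way to check the Table 3 observation on one's own checkpoints is a rank correlation between offline TD loss and evaluated performance. The sketch below computes Spearman's rho from scratch with the standard library; the checkpoint values are made up for illustration and do not come from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks (undefined for constant input)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical checkpoints: final TD loss and throughput gain (%) on held-out scenarios.
td_loss = [0.9, 0.7, 0.5, 0.4, 0.3]
throughput_gain = [3.0, 8.0, 2.0, 9.0, 5.0]
rho = spearman(td_loss, throughput_gain)
print(rho)
```

A rho close to -1 would mean that lower TD loss reliably picks better checkpoints; values near zero indicate that the loss is a poor model-selection signal, which is the situation the review describes.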

Notable Weaknesses:

No comparison with existing RL-for-fab methods: Despite an extensive literature review, no prior RL approach is used as a baseline. Only FIFO, SPT, and random policies serve as comparisons. This makes it difficult to assess whether the gains come from the event-group TD formulation specifically or from applying any reasonable deep RL approach.

Feature engineering vs. learned representations: The paper uses hand-crafted features (500+ per action, 5000+ for state), which limits generalizability claims. The authors acknowledge this, but it remains a substantial limitation for transferability.

Reward design sensitivity: Table 7 shows that segment-only reward actually outperforms event+segment reward for PPO, which somewhat contradicts the motivation for the decomposed reward structure.

Limited theoretical analysis: The aggregation objective in Eq. 12 lacks formal justification—does minimizing the squared deviation between mean TD errors and system reward lead to correct value estimates? The connection to Bellman consistency is unclear.

Paper length and organization: At 63 pages with appendices, the paper is difficult to navigate. Core contributions could be more concisely presented.

Single domain evaluation: Despite claims of generality to "complex adaptive systems," validation is limited to one semiconductor fab setting.

Additional Observations

The sector-level analysis (Appendix D.2) showing how learned policies redistribute workload across fab sectors provides useful interpretability. The process mining diagnostics (Appendix D.1.2) are thorough but somewhat tangential to the RL contribution. The weighted aggregation variant (Eq. 13) showing instability issues (Appendix F.4.1) suggests the approach's sensitivity to design choices needs further investigation.

Rating: 6.2 / 10

Significance 6.5 · Rigor 6 · Novelty 6.5 · Clarity 5.5

Generated Jun 10, 2026

Comparison History (18)

Lost vs. Latent World Recovery for Multimodal Learning with Missing Modalities

Paper 2 addresses a fundamental and pervasive challenge in machine learning—handling missing modalities in multimodal data—which has broad applicability across diverse fields like healthcare, robotics, and autonomous systems. While Paper 1 offers a strong, practical application of RL to semiconductor manufacturing, Paper 2's methodological innovation in latent space alignment without explicit imputation offers greater theoretical novelty and wider potential cross-disciplinary impact, particularly in high-stakes areas like cancer prediction.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. Implicit Neural Representations of Individual Behavior

Paper 2 introduces a highly novel methodological bridge by adapting Implicit Neural Representations to behavioral data, creating a flexible, self-supervised framework for policy representation. This fundamental algorithmic innovation is applicable across diverse domains like robotics, autonomous driving, and gaming. While Paper 1 addresses a high-value specific industrial application, Paper 2's broader applicability across various subfields of artificial intelligence suggests a significantly wider potential scientific impact.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. On Subquadratic Architectures: From Applications to Principles

Paper 2 is likely to have higher scientific impact due to its strong real-world applicability and timeliness: scalable RL for long-horizon, event-driven control in semiconductor fabrication targets a high-value, highly constrained industrial domain. The event-driven TD formulation appears broadly reusable for other discrete-event complex systems (e.g., logistics, networks), extending impact beyond semiconductors. The use of high-fidelity simulations across diverse operating scenarios and both offline/online training suggests solid methodological rigor and practical validation. Paper 1 is valuable but is primarily comparative/analytical within an active subquadratic-architecture space, with potentially narrower cross-domain impact.

gpt-5.2 · Jun 11, 2026

Won vs. Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

Paper 1 addresses a fundamental challenge in semiconductor manufacturing—a critical industry—with a novel event-driven RL framework demonstrating real-world applicability at industrial scale. Its contributions span RL theory (event-driven temporal-difference formulation), methodology, and practical validation on industry-realistic scenarios. Paper 2 offers a technically sound but incremental improvement (replacing ratio clipping with divergence constraints) for flow matching models in generative AI. While timely, it is a narrower algorithmic refinement. Paper 1's broader cross-disciplinary impact (RL + manufacturing), novelty of formulation, and potential for real-world deployment give it higher estimated scientific impact.

claude-opus-4-6 · Jun 10, 2026

Lost vs. GRAFT: Gain-Recalibrated Adapters for Transformer-Based Neural Population Activity Modeling

Paper 2 addresses a fundamental bottleneck in brain-computer interfaces (BCIs)—cross-day recalibration due to changing neural populations. By decoupling temporal dynamics from the neuron interface, it sets a new state-of-the-art on a standard benchmark while drastically reducing the data needed for recalibration. This has profound implications for long-term biomedical and neuroscientific applications. While Paper 1 offers a valuable industrial application of RL, Paper 2 demonstrates higher foundational scientific impact by advancing the rapidly growing field of neuro-AI and addressing a critical hurdle in viable, long-term neural prosthetics.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. Weak Diffusion Priors Can Still Achieve Strong Inverse-Problem Performance

Paper 2 addresses a critical real-world industrial problem (semiconductor fabrication control) with a novel event-driven RL framework that demonstrates practical scalability and generalization. Its impact spans RL methodology, manufacturing optimization, and complex systems control. Paper 1 provides useful theoretical insights about diffusion priors for inverse problems, but is more incremental—explaining an observed phenomenon rather than enabling new capabilities. Paper 2's broader applicability to event-driven complex adaptive systems and its direct industrial relevance give it higher potential impact across both academia and industry.

claude-opus-4-6 · Jun 10, 2026

Won vs. Encoding the Euler Characteristic Transform

Paper 2 addresses a critical, large-scale real-world problem—semiconductor manufacturing control—which is a major global bottleneck. Its event-driven reinforcement learning framework has immediate and significant industrial applications, bridging AI and operations research. While Paper 1 presents an elegant methodological advance in topological data analysis, Paper 2's potential for broad economic and technological impact, combined with its high timeliness and relevance to complex adaptive systems, gives it a higher overall scientific impact.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. Express Language Modeling

Paper 2 addresses fundamental bottlenecks in large language models (attention complexity and KV cache) with strong theoretical guarantees and practical speedups over FlashAttention 2. Given the current dominance of LLMs, this will have massive, immediate, and broad scientific impact across AI. While Paper 1 presents a highly valuable RL application for semiconductor manufacturing, its impact is inherently more domain-specific compared to the ubiquitous applicability of faster, memory-efficient language modeling.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. Overcoming Rank Collapse in Feedback Alignment

Paper 1 addresses a high-impact industrial problem (semiconductor manufacturing control) with a novel event-driven RL framework validated on industry-real scenarios, demonstrating significant practical gains in throughput and utilization. Its contributions span RL methodology, manufacturing systems, and complex adaptive systems. Paper 2 makes a solid contribution to understanding feedback alignment's limitations (rank collapse) and proposes remedies, but it addresses a more niche problem in biologically plausible learning that has struggled to gain practical traction. Paper 1's real-world applicability to a critical industry and methodological generality give it broader potential impact.

claude-opus-4-6 · Jun 10, 2026

Lost vs. Recoverable but Not Stationary: Local Linear Structures in Weights and Activations

Paper 2 provides fundamental theoretical insights into the inner workings of large language models, specifically regarding fine-tuning mechanisms like LoRA and activation steering. Given the explosive growth and broad applicability of LLMs across disciplines, these foundational discoveries will likely influence a vast array of ongoing AI research. In contrast, while Paper 1 offers a strong and practical application of reinforcement learning, its impact is more narrowly focused on industrial control systems and semiconductor manufacturing.

gemini-3.1-pro-preview · Jun 10, 2026
