Anton Bolychev, Georgiy Malaniya, Sinan Ibrahim, Pavel Osinenko
Training reinforcement learning (RL) policies from scratch is costly: it requires careful reward and environment design, extensive tuning, and substantial computation. Yet many control problems already have a functional but suboptimal policy available as a baseline. This paper proposes a method for embedding such a baseline into the RL training process, simultaneously improving training efficiency relative to from-scratch methods and producing a learning policy that outperforms the baseline. At each step, the method arbitrates between the baseline policy and a trainable learning policy, initially relying strongly on the baseline policy and then progressively transferring agency to the learning policy. By the end of training, the learning policy is a standalone neural network that operates without baseline policy support. The paper formalizes what it means for the baseline policy to be functional: under this policy, the agent reaches a goal set and remains there with high probability. The proposed arbitration mechanism is designed to exploit this property during training, yielding high goal-reaching rates right from the beginning of training. A theoretical analysis provides a formal interpretation of this behavior under stated assumptions and extends it to the final baseline-free regime, where explicit lower bounds are derived for the goal-reaching probability of the standalone learning policy. Empirical results on continuous-control benchmarks show that the proposed method achieves returns that match or exceed those of competitive approaches, while maintaining the highest goal-reaching rates throughout training among the compared methods -- including in the final stage, where the learning policy operates without any baseline support.
This paper introduces an agency-transferring arbitration mechanism that embeds a pre-existing "functional but suboptimal" baseline policy into the RL training loop. The key idea is a mixing coefficient α_t that determines whether the learning policy or the baseline policy acts at each timestep. The learning action is accepted under a two-part criterion: (1) a deterministic critic-improvement gate, which accepts the learning action if the critic value exceeds an episode-local benchmark by a margin ν, and (2) a stochastic relaxation term, which accepts it with decaying probability p_rel·λ^(t−τ). Over training, both p_rel and λ are increased toward 1, eventually yielding a standalone neural network policy that requires no baseline at deployment.
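The acceptance rule summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function and argument names are hypothetical, the reference step `tau` is passed in because the summary does not say how it is chosen, and treating α_t as a binary switch is an assumption based on the description that it decides which policy acts.

```python
import random


def accept_learning_action(critic_value, benchmark, nu, p_rel, lam, t, tau, rng):
    """Return True if the learning policy's action should be used at step t.

    All names are illustrative; they are not taken from the paper's code.
    """
    # (1) Deterministic critic-improvement gate: the critic value must exceed
    #     the episode-local benchmark by the margin nu.
    if critic_value > benchmark + nu:
        return True
    # (2) Stochastic relaxation: accept with probability p_rel * lam**(t - tau),
    #     which decays in t whenever lam < 1.
    return rng.random() < p_rel * lam ** (t - tau)


def arbitrate(learning_action, baseline_action, accepted):
    """Select the acting policy's action (assumed binary alpha_t: 1 or 0)."""
    alpha_t = 1 if accepted else 0
    return learning_action if alpha_t == 1 else baseline_action


# Hypothetical usage with placeholder numbers (not values from the paper).
rng = random.Random(0)
accepted = accept_learning_action(
    critic_value=0.4, benchmark=0.5, nu=0.1,
    p_rel=0.3, lam=0.9, t=12, tau=10, rng=rng,
)
action = arbitrate("learning", "baseline", accepted)
```

In this sketch, scheduling `p_rel` and `lam` toward 1 makes the second branch accept almost every step, which corresponds to the learning policy acting without baseline support.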
The contribution is methodologically distinct from residual RL (which adds baseline and learned actions algebraically and requires the baseline at deployment) and from demonstration-based methods (which use offline data). The paper provides both the algorithmic framework and formal theoretical analysis, including goal-reaching guarantees during training and a transfer theorem bounding degradation when the baseline is removed.
Theoretical analysis. The paper provides three main theoretical results: Theorem 1 (the composite policy inherits the ε-improbable goal-reaching property of the baseline during training when λ<1), Theorem 2 (a uniform version with explicit overshoot bounds and reaching-time distributions), and Theorem 3 (a transfer theorem that bounds the degradation in goal-reaching probability after baseline removal in terms of trajectory distance). The proofs are constructive and clearly structured. Theorem 1's proof uses the Borel-Cantelli lemma to show that the learning policy is activated only finitely many times when λ<1. Theorem 3 rests on a clean Markov inequality argument.
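The two standard tools named above can be written out as follows. This is only a sketch of the generic facts, assuming the relaxation term accepts at step t with probability p_rel·λ^(t−τ) for 0 ≤ p_rel ≤ 1 and 0 ≤ λ < 1, and that the trajectory distance Δ_T is a nonnegative random variable; the threshold δ is introduced here for illustration, and the exact statements and constants in the paper may differ.

```latex
% First Borel-Cantelli lemma applied to the relaxation acceptances:
% the acceptance probabilities form a convergent geometric series.
\[
\sum_{t=\tau}^{\infty} p_{\mathrm{rel}}\,\lambda^{t-\tau}
  = \frac{p_{\mathrm{rel}}}{1-\lambda} < \infty
\quad\Longrightarrow\quad
\mathbb{P}\bigl(\text{relaxation accepts infinitely often}\bigr) = 0 .
\]
% Markov's inequality for a nonnegative trajectory distance \Delta_T:
\[
\mathbb{P}\bigl(\Delta_T \ge \delta\bigr) \le \frac{\mathbb{E}[\Delta_T]}{\delta},
\qquad \delta > 0 .
\]
```

When λ = 1 the series diverges, so the first argument no longer applies; this matches the restriction to λ<1 in Theorem 1.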
However, there are important caveats. Theorem 1 assumes unbounded episode length (no truncation), making it an "interpretive" result rather than a practical guarantee; the authors are transparent about this. Theorem 2 requires additional assumptions (critic envelope bounds, a uniform KL-certificate for the baseline) that, while implementable, add a non-trivial design burden. The critic-monotonicity gate (equation 8) used in Theorem 2 is introduced solely for the analysis and is not part of the default algorithm, which leaves a gap between theory and implementation. Theorem 3's bound depends on the trajectory distance Δ_T, which must be estimated empirically, limiting its predictive utility.
Experimental design. The experiments use two custom continuous-control environments with explicit goal sets and hand-designed baseline policies. Ten random seeds are used per configuration, with appropriate statistical reporting (medians, IQRs, standard deviations). The ablation studies (Sections 6.1–6.3) systematically evaluate ν, T_tran, and the schedule parameters. Code and data are publicly available.
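For readers who want to reproduce this style of per-configuration reporting, the summary statistics mentioned above can be computed as in the sketch below. The helper name `summarize` and the ten input values are placeholders rather than results from the paper, and the choice of the "inclusive" quantile method is an assumption; the paper may define its IQR differently.

```python
import statistics


def summarize(values):
    """Median, interquartile range, and sample standard deviation across seeds."""
    # quantiles(..., n=4) returns the three quartile cut points (Q1, Q2, Q3).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "median": statistics.median(values),
        "iqr": q3 - q1,
        "std": statistics.stdev(values),
    }


# Placeholder per-seed metric values for one configuration (ten seeds).
per_seed_values = [float(x) for x in range(1, 11)]
summary = summarize(per_seed_values)
print(summary)
```

Reporting the median and IQR alongside the standard deviation is useful with only ten seeds, since the first two are less sensitive to a single outlying run.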
A significant limitation is the evaluation scope: only two custom environments are tested, both relatively simple (6D AUV dynamics and a 2D robot). No standard RL benchmarks (MuJoCo locomotion, manipulation suites) are used. The authors justify this by arguing that standard benchmarks lack explicit goal sets and easily specifiable baselines, but this significantly limits generalizability claims. The baseline policies are simple PD/proportional controllers, leaving open whether the method works with more complex baselines.
The practical premise is strong: many real-world control problems have functional but suboptimal controllers (PID, heuristic planners, etc.), and methods to bootstrap RL from these are valuable. The key practical advantage—high goal-reaching rates from the start of training—is important for safety-sensitive domains where catastrophic exploration failures are costly (robotics, autonomous vehicles).
However, the impact is somewhat limited by: (1) the requirement for a callable baseline policy, which is more restrictive than demonstration-based approaches; (2) the assumption that the baseline satisfies the ε-improbable goal-reaching property, which may be hard to verify for complex baselines; (3) the need to tune four additional hyperparameters (p_rel,0, λ_0, ν, T_tran) beyond the backbone RL algorithm's own hyperparameters.
The method occupies a niche between residual RL (which requires the baseline at deployment) and from-scratch RL (which ignores available policies). The deployment independence from the baseline is a genuine advantage over residual RL.
The work addresses a genuine practical need. As RL methods are increasingly applied to real-world control problems, methods for incorporating prior knowledge efficiently become more important. The paper is well-positioned in the current landscape where sim-to-real transfer, safe exploration, and sample efficiency remain bottlenecks.
Strengths.
1. Clean algorithmic design with clear separation between arbitration module and backbone RL algorithm, making it backbone-agnostic
2. Theoretical-empirical alignment: the formal results, while interpretive, provide genuine insight into why the method works
3. Deployment independence: unlike residual RL, the final policy is a standalone neural network
4. Strong empirical goal-reaching performance throughout training, including after baseline removal
5. Thorough ablation studies and open-source implementation
6. Excellent presentation: the paper is well-organized with clear diagrams, comprehensive related work comparison (Table 1), and honest discussion of limitations
Weaknesses.
1. Narrow experimental evaluation: only two custom environments with simple dynamics and baselines
2. Gap between theory and practice: Theorem 2's gated mechanism differs from the implemented algorithm, and Theorem 1 assumes unbounded episode length
3. Scalability unknown: no evidence the method works in high-dimensional observation/action spaces or with complex baseline policies
4. Hyperparameter sensitivity: the schedule parameters require careful tuning (Section 6.3 shows degradation with poor choices)
5. Comparison scope: only vanilla and residual RL baselines are compared; no comparison with DAPG, DQfD adapted to the callable-policy setting, or other curriculum/teacher-student methods
6. The ε-improbable goal-reaching property may be difficult to verify or ensure for real-world baseline policies
This is a technically competent paper that introduces a well-motivated algorithm with supporting theory and experiments. The arbitration mechanism is intuitive and the theoretical framework, while primarily interpretive, adds genuine value. However, the limited experimental scope and narrow comparison set significantly weaken the empirical contribution. The work would benefit substantially from evaluation on standard benchmarks with more diverse and complex baseline policies.
Generated Jun 9, 2026
Paper 1 addresses a major bottleneck in reinforcement learning—high training costs and instability—by effectively leveraging existing suboptimal baseline policies. This approach has widespread, immediate practical applications in robotics and industrial control systems. Paper 2, while methodologically rigorous, focuses on a more specialized area of Bayesian particle transport, giving Paper 1 a broader and more significant potential impact across multiple fields.
Paper 1 presents a foundational contribution to reinforcement learning with broad applicability across various control problems, supported by rigorous theoretical bounds and empirical validation. Its methodological innovation in safely integrating suboptimal baselines has wide-reaching implications for AI and robotics. In contrast, while Paper 2 offers a valuable real-world clinical tool, its impact is narrower and potentially limited by its single-center retrospective design.
Paper 1 offers a foundational contribution to reinforcement learning by addressing the critical bottleneck of training efficiency. Its proposed agency-transferring method includes theoretical guarantees and is applicable across a wide variety of control problems, granting it a broad impact across disciplines like robotics and automation. In contrast, Paper 2 presents a specialized architecture for EEG emotion recognition, which, while valuable for brain-computer interfaces, is narrower in scope and relies on combining existing techniques without providing foundational theoretical advancements.
While Paper 2 offers a rigorous theoretical and algorithmic advancement in reinforcement learning, Paper 1 addresses a critical, widespread bottleneck in modern AI research: managing and editing large deep learning models. By providing a reproducible, robust tool for 'tensor surgery' and model upcycling, Paper 1 has the potential for massive adoption across various deep learning domains, similar to other foundational ML infrastructure tools, thereby enabling a broader range of downstream scientific breakthroughs.
Paper 1 likely has higher scientific impact due to a clear, rate-optimal theoretical advance: improving queue-length regret from ~T^{-1/4} to ~T^{-1/2} and matching it with a minimax lower bound, tightly characterizing the problem’s statistical limits. The three-phase algorithm and queue-specific coupling lower-bound technique are novel and methodologically rigorous, and results are broadly relevant to bandits, online learning, and stochastic networks/scheduling. Paper 2 targets an important practical RL setting and shows promising empirical performance, but the idea resembles existing safe/baseline-guided or residual RL paradigms and its theory depends on stronger assumptions, making its incremental novelty and rigor less clear.
Paper 1 presents a more rigorous and comprehensive contribution with both theoretical guarantees (formal lower bounds on goal-reaching probability) and empirical validation on continuous-control benchmarks. The problem of leveraging existing suboptimal policies to bootstrap RL training is broadly applicable and well-formalized. Paper 2 addresses an interesting practical problem (continual learning for deployed LLM agents) but offers a more incremental systems-level contribution (CLaaS) with relatively standard techniques (experience replay, asynchronous training) and limited evaluation scope (a single adversarial task). Paper 1's theoretical depth and methodological rigor give it higher potential for lasting scientific impact.
Paper 1 addresses a highly timely and critical intersection of AI safety, adversarial robustness, and machine unlearning. By proposing safe reinforcement unlearning to defend against data poisoning without retraining from scratch, it solves a pressing bottleneck in deploying offline RL for safety-critical systems. While Paper 2 presents a solid efficiency improvement for RL training using baselines, Paper 1's focus on unlearning and security offers greater novelty and broader implications for trustworthy AI deployment.
Paper 2 is a foundational review that synthesizes a rapidly growing, highly impactful field at the intersection of AI and physics. By proposing a unifying framework (REO) and a phase diagram for equation discoverability, it offers a broad conceptual roadmap that will likely influence multiple disciplines across the physical sciences. In contrast, Paper 1 offers a valuable but narrower methodological improvement within reinforcement learning. Paper 2's potential to guide scientific discovery across diverse physical systems gives it a broader and higher potential scientific impact.
Paper 2 addresses a fundamental challenge in reinforcement learning—sample efficiency and safe exploration—by seamlessly integrating existing suboptimal policies. Its theoretical guarantees and broad applicability across various continuous-control domains give it a wider potential impact compared to Paper 1, which focuses on an empirical benchmark for a domain-specific application (OLED molecular generation).
Paper 1 likely has higher impact due to strong timeliness and clear practical contribution: it ports proven large-scale parallelism (TP, FSDP) into formal neural network verification, addressing a core bottleneck (GPU memory) that directly limits scalability and benchmarks. It demonstrates integration with state-of-the-art complete verification (α/β-CROWN + BaB), convolutional layers, and achieves a notable complete UNSAT result on a challenging VNN-COMP 2024 model, suggesting real-world applicability and community relevance. Paper 2 is useful but resembles existing safe/assisted RL ideas and may have narrower cross-field impact.