Paul Kahlmeyer, Henrik Voigt, Michael Habeck, Joachim Giesen
Equation discovery aims to automate the discovery of scientific models in the form of mathematical equations from data. Technically, equation discovery is implemented by symbolic regression algorithms. Performance of symbolic regression for equation discovery is measured along two dimensions: Prediction accuracy on test data, and recovery of known groundtruth formulas. For standard regression, accuracy is typically measured on in-domain test data, for instance, by splitting a data set randomly into training and test data. While this makes sense for in-domain interpolation, which is the common goal in ordinary regression, it can be a misleading proxy for true model discovery and generalization. The obvious alternative is to measure out-of-domain accuracy. However, obtaining challenging out-of-domain test data is a non-trivial problem. Therefore, we focus on equation recovery for evaluating symbolic regression algorithms for equation discovery. The rationale is that symbolic regression algorithms that perform well in recovering known groundtruth formulas are good candidates to perform well in unknown equation discovery. Existing benchmarks for symbolic regression include equation recovery tasks, however, with only a small number of groundtruth formulas that are publicly known. Moreover, these benchmarks place less emphasis on evaluating the robustness of algorithms in terms of their behavior under changing dimensionality, sampling size, sampling distribution and sampling domain. This, however, is of central importance to practitioners wanting to discover equations for modeling natural phenomena, since data is almost certainly noisy and comes from diverse domains, distributions, and sample sizes. To fill this gap, we introduce the Equation Recovery Benchmark (ERBench), a new evaluation framework designed to rigorously assess algorithms explicitly targeting the task of equation discovery.
ERBench introduces a structured benchmark and competition framework specifically designed to evaluate symbolic regression algorithms on the task of equation recovery — recovering the exact ground-truth symbolic expression from data, rather than merely achieving low predictive error on in-domain test points. The key distinction from prior benchmarks (notably SRBench) is the emphasis on: (a) a large public development set of 10,000 formulas spanning multiple scientific domains and synthetic expressions, (b) a secret evaluation set of 1,000 formulas accessible only through a permutation-based competition protocol, (c) systematic evaluation across multiple robustness axes (dimensionality, complexity, sample size, noise, domain, distribution), and (d) metrics centered on symbolic equivalence rather than predictive accuracy.
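As a rough illustration of what the robustness axes in (c) mean in practice, the following sketch draws a dataset for one formula while varying sample size, sampling domain, sampling distribution, and noise level. It is not the benchmark's generator: the function name `sample_dataset`, its parameters, the clipped-normal alternative distribution, and the noise model (Gaussian noise scaled by the standard deviation of the targets) are all assumptions made for this example.

```python
import numpy as np

def sample_dataset(formula, n_vars, n_samples, low, high,
                   distribution="uniform", noise_level=0.0, seed=0):
    """Hypothetical helper: sample inputs for `formula` and add optional noise.

    formula      -- callable taking n_vars arrays and returning an array
    low, high    -- bounds of the sampling domain (same for every variable)
    distribution -- "uniform" or "normal" (assumed options, not from the paper)
    noise_level  -- noise standard deviation relative to std(y) (assumed model)
    """
    rng = np.random.default_rng(seed)
    if distribution == "uniform":
        X = rng.uniform(low, high, size=(n_samples, n_vars))
    elif distribution == "normal":
        # Centre the normal on the domain and clip so samples stay inside it.
        center, spread = (low + high) / 2.0, (high - low) / 6.0
        X = np.clip(rng.normal(center, spread, size=(n_samples, n_vars)), low, high)
    else:
        raise ValueError(f"unknown distribution: {distribution}")

    y = formula(*X.T)  # one column per variable
    y = y + noise_level * np.std(y) * rng.standard_normal(n_samples)
    return X, y

# Same formula, two different settings along the robustness axes.
X_clean, y_clean = sample_dataset(lambda a, b: a * b, 2, 50, 1.0, 5.0)
X_noisy, y_noisy = sample_dataset(lambda a, b: a * b, 2, 500, 1.0, 5.0,
                                  distribution="normal", noise_level=0.1)
```

Sweeping one argument at a time while holding the others fixed gives the kind of per-axis comparison the review refers to.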
The benchmark follows a "common task framework" design with public training data and secret test data, analogous to successful paradigms in NLP and computer vision. The competition protocol is well-designed: permutation of problem order, sample order, and variable order prevents information leakage across repeated queries.
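The three permutations mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the competition's actual implementation: the function names `permute_problem` and `permute_benchmark` and the representation of a problem as an `(X, y)` pair are hypothetical.

```python
import numpy as np

def permute_problem(X, y, rng):
    """Shuffle the sample order (rows) and variable order (columns) of one problem."""
    n_samples, n_vars = X.shape
    row_perm = rng.permutation(n_samples)
    col_perm = rng.permutation(n_vars)
    X_perm = X[row_perm][:, col_perm]
    y_perm = y[row_perm]  # targets follow the same row shuffle
    return X_perm, y_perm, row_perm, col_perm

def permute_benchmark(problems, rng):
    """Shuffle the problem order, then shuffle each problem internally."""
    order = rng.permutation(len(problems))
    served = []
    for idx in order:
        X, y = problems[idx]
        X_perm, y_perm, _, _ = permute_problem(X, y, rng)
        served.append((X_perm, y_perm))
    return served, order

# Toy example: two small problems served in a fresh random arrangement.
rng = np.random.default_rng(0)
X0 = np.arange(12, dtype=float).reshape(4, 3)
y0 = X0[:, 0] + 2.0 * X0[:, 1]
X1 = np.arange(10, dtype=float).reshape(5, 2)
y1 = X1[:, 0] * X1[:, 1]
served, order = permute_benchmark([(X0, y0), (X1, y1)], rng)
```

Because a new arrangement is drawn for every query, a participant cannot line up answers across submissions by position alone. The organizer would keep `order` and the per-problem permutations private in order to score the returned expressions.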
The paper is methodologically sound in several respects. The evaluation metrics are well-defined: symbolic recovery rate (strict), Jaccard Index over subexpressions (relaxed/structural), and normalized Tree Edit Distance (structural). The authors acknowledge the undecidability of symbolic equivalence verification (Richardson, 1968) and provide both symbolic (lower bound) and numerical (upper bound) equivalence checks, finding that the two checks agree in practice on their test cases.
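To make these metrics concrete, here is a hedged sketch using SymPy. It is one plausible reading of the metrics, not the authors' code: treating "subexpressions" as the nodes returned by `sympy.preorder_traversal`, sampling the numerical check uniformly from [1, 2], and the tolerance are all assumptions, as are the function names.

```python
import numpy as np
import sympy as sp

def subexpressions(expr):
    """Set of all subexpressions (tree nodes) of a SymPy expression."""
    return set(sp.preorder_traversal(expr))

def jaccard_index(expr_a, expr_b):
    """Overlap of the two subexpression sets: |A & B| / |A | B|."""
    a, b = subexpressions(expr_a), subexpressions(expr_b)
    return len(a & b) / len(a | b)

def symbolically_equal(expr_a, expr_b):
    """Strict check: may miss true equivalences, hence a lower bound on recovery."""
    return sp.simplify(expr_a - expr_b) == 0

def numerically_equal(expr_a, expr_b, variables, n_points=100, tol=1e-8, seed=0):
    """Lenient check on random points: may accept near-matches, hence an upper bound."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(1.0, 2.0, size=(n_points, len(variables)))  # assumed domain
    f_a = sp.lambdify(variables, expr_a, "numpy")
    f_b = sp.lambdify(variables, expr_b, "numpy")
    return bool(np.allclose(f_a(*points.T), f_b(*points.T), atol=tol))

x, y = sp.symbols("x y")
truth = x * y + sp.sin(x)
candidate = x * y + sp.cos(x)  # shares the x*y term but has the wrong trig function

partial_overlap = jaccard_index(truth, candidate)
exact_symbolic = symbolically_equal((x + 1) ** 2, x**2 + 2 * x + 1)
exact_numeric = numerically_equal((x + 1) ** 2, x**2 + 2 * x + 1, [x])
```

A candidate that shares some structure with the ground truth scores strictly between 0 and 1 on the Jaccard Index even when both equivalence checks reject it, which is what makes it a relaxed metric.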
The benchmark design is motivated by a systematic review of algorithmic paradigms (enumeration-based, sampling-based, pre-trained, hybrid), with each design principle (marked with ·) derived from identified weaknesses of specific algorithm classes. This is a strength — the benchmark is not arbitrary but engineered to probe known failure modes: pre-trained methods' dependence on pre-training distributions (Figure 4), search methods' sensitivity to sampling domains (Figure 3, right), and the universal degradation with expression complexity (Figure 3, left).
However, the experimental evaluation is somewhat limited. Only six methods are benchmarked (PySR, DSR, E2E, Operon, gplearn, Linear), and several important recent methods (LLM-SR, RSRM, KANs, AI-Feynman) are absent. The diagnostic analysis (Figure 6) is conducted only for PySR on a single subset (Feynman), limiting the generalizability of insights. The paper could have been strengthened by deeper cross-method diagnostic comparisons.
The benchmark fills a genuine gap. Prior benchmarks like SRBench primarily evaluate predictive accuracy with publicly known test sets, creating risks of overfitting and inadequate evaluation of generalization. ERBench's secret test set with a continuously running competition protocol is a meaningful infrastructure contribution.
Practical utility: The robustness evaluation across noise levels, sample sizes, and distributions directly addresses practitioner needs. Scientists working with real experimental data face exactly these variations, and knowing which algorithms degrade gracefully is valuable.
Community infrastructure: The Hugging Face hosting, competition website, and reproducible evaluation scripts lower barriers to adoption. The continuously running competition (vs. fixed events like GECCO competitions) enables ongoing benchmarking.
Revealing findings: The result that most state-of-the-art methods achieve near-zero recovery rates on the secret test set (Table 3), and that a simple linear baseline matches or exceeds complex methods on Jaccard Index, is a sobering but important finding. It quantifies how far the field is from reliable equation discovery and identifies complexity as the primary bottleneck.
The paper is highly timely. The explosion of LLM-based symbolic regression methods (LLM-SR, in-context approaches) creates an urgent need for benchmarks that control for data leakage, which ERBench's secret, procedurally generated test set addresses directly. The growing interest in AI for Science (AI-Descartes, AI-Newton, Science-Gym) makes rigorous equation discovery evaluation increasingly important. The paper correctly identifies that conflating symbolic regression (curve fitting) with equation discovery (law recovery) has been a persistent confusion in the field.
The 30% recovery rate achieved by PySR — the clear winner — sets a useful but concerning baseline. The rapid performance degradation with complexity (Figure 6a) suggests fundamental algorithmic limitations rather than mere engineering challenges. The benchmark thus serves both as an evaluation tool and as a diagnostic instrument for guiding algorithmic development.
The paper's philosophical framing through Popper's falsificationism, while appropriate for motivating equation recovery over predictive accuracy, somewhat overstates the binary nature of scientific modeling — in practice, approximate models with known domains of validity are scientifically useful.
Overall, ERBench represents a solid infrastructure contribution that addresses a real need in the equation discovery community, with thoughtful design choices and useful initial findings, though the experimental evaluation could be more comprehensive.
Generated Jun 9, 2026
Paper 2 introduces a novel theoretical framework for capacity-constrained online convex optimization with delayed feedback, addressing a practical gap in online learning theory. It provides new algorithmic contributions (Delayed-Weighted FTRL), a refined semi-clairvoyant delay model, and rigorous regret bounds that gracefully degrade with capacity constraints. This advances fundamental optimization theory with broad applicability. Paper 1, while useful as a benchmarking tool for symbolic regression, is more incremental—extending existing benchmarks with robustness evaluations. Benchmarks can be impactful but typically have narrower theoretical contribution compared to foundational algorithmic advances with provable guarantees.
Paper 2 introduces a novel benchmark for equation discovery (symbolic regression), a crucial tool for AI-driven scientific discovery across multiple disciplines. Benchmarks typically have high scientific impact by standardizing evaluation, exposing algorithmic weaknesses, and driving future research directions. While Paper 1 offers a strong methodological improvement for decentralized federated learning, Paper 2's focus on enabling robust automated scientific discovery gives it broader applicability and higher potential for widespread cross-disciplinary impact.
Paper 1 focuses on integrating audio understanding into LLMs via a novel distillation approach, addressing critical latency and efficiency bottlenecks in multimodal AI. Given the explosive growth and broad applicability of LLMs across numerous fields, this method promises significant and immediate real-world impact. While Paper 2 provides a valuable benchmark for symbolic regression, its scope is much narrower and targets a more specialized subfield compared to the widespread relevance of multimodal LLMs.
Paper 2 addresses a critical flaw in interpretability and pruning methods for Mixture-of-Experts (MoE) models, a highly active and impactful area in AI. By demonstrating that observational metrics fail to predict causal importance, it challenges widespread assumptions in deep learning. While Paper 1 introduces a valuable benchmark for symbolic regression, its impact is confined to the narrower niche of equation discovery. Paper 2's findings have broader, more immediate implications for LLM architecture design and interpretability.
Paper 2 addresses the highly active and broadly impactful field of LLM training efficiency, unifying three critical bottlenecks (data, memory, compute) under a novel constraint-centric framework. Its breadth of coverage and practical relevance to the rapidly growing LLM community gives it wider potential impact. Paper 1, while valuable for the symbolic regression/equation discovery niche, serves a narrower community. Paper 2's timeliness—given the explosive growth in LLM research and deployment—and its potential to guide resource-efficient training practices across industry and academia give it higher estimated scientific impact.
Paper 1 likely has higher scientific impact due to broader, cross-domain relevance: a rigorous benchmark/test suite for equation discovery can shape evaluation standards across ML, scientific computing, physics, chemistry, and engineering. Its focus on robustness to dimensionality, sampling, and domain shift addresses a key methodological gap and can influence many future algorithm papers. Paper 2 has strong clinical application potential, but is single-center/retrospective and domain-specific; impact may be constrained by generalizability, external validation, and deployment hurdles despite interpretability and solid performance gains.
Paper 1 likely has higher impact because it introduces a general-purpose benchmark/test suite (ERBench) addressing a widely recognized evaluation gap in symbolic regression/equation discovery (robustness to noise, sampling regimes, dimensionality). Benchmarks often catalyze community-wide progress, standardize comparisons, and influence many downstream methods across ML and scientific computing. Paper 2 proposes a solid, timely method for Fokker–Planck operator learning, but it is narrower in scope (specific PDE class) and closer to incremental advances in PINNs/normalizing flows. Overall, Paper 1’s breadth and community-enabling nature suggest greater potential impact.
Paper 2 introduces a novel architecture (CNPC) that combines causal inference with probabilistic circuits and concept bottleneck models, addressing a fundamental limitation in interpretable AI—respecting causal dependencies during interventions. It provides both theoretical guarantees and empirical validation across multiple benchmarks. While Paper 1 contributes a useful benchmark for symbolic regression evaluation, benchmarks typically have narrower impact than novel methodological contributions. Paper 2's integration of causality into interpretable deep learning has broader implications for trustworthy AI, a highly active and impactful research area.
Paper 1 proposes a novel methodology addressing the critical global challenge of biodiversity monitoring, specifically tackling spatio-temporal dynamics and rare species. Its integration of graph-temporal encoding and contrastive learning provides significant advancements for conservation planning. Paper 2, while offering a useful benchmark for symbolic regression, is more narrowly focused on algorithm evaluation and lacks the immediate, broad real-world impact on urgent ecological issues demonstrated by Paper 1.
Paper 2 addresses a timely intersection of foundation models, Bayesian optimization, and molecular design—areas of high current interest. Its novel approach of adapting tabular foundation model surrogates for latent-space BO through continued pretraining with domain-specific synthetic tasks is innovative and has direct applications in drug discovery and materials science. Paper 1, while useful as a benchmark contribution for symbolic regression, is more incremental—it improves evaluation methodology rather than introducing new algorithmic capabilities. Benchmarks have impact but typically less than methodological advances with clear real-world applications in high-impact domains like molecular optimization.