ERBench: A Benchmark and Testsuite for Equation Discovery Algorithms

Paul Kahlmeyer, Henrik Voigt, Michael Habeck, Joachim Giesen

Jun 8, 2026 · arXiv:2606.09276v1

cs.LG

#3512 of 5669 · cs.LG

Tournament Score: 1373 ± 43

Win Rate: 48%

Rating: 6.5 / 10

Significance 7 · Rigor 6.5 · Novelty 5.5 · Clarity 7.5

Abstract

Equation discovery aims to automate the discovery of scientific models in the form of mathematical equations from data. Technically, equation discovery is implemented by symbolic regression algorithms. Performance of symbolic regression for equation discovery is measured along two dimensions: prediction accuracy on test data, and recovery of known ground-truth formulas. For standard regression, accuracy is typically measured on in-domain test data, for instance, by splitting a data set randomly into training and test data. While this makes sense for in-domain interpolation, which is the common goal in ordinary regression, it can be a misleading proxy for true model discovery and generalization. The obvious alternative is to measure out-of-domain accuracy. However, obtaining challenging out-of-domain test data is a non-trivial problem. Therefore, we focus on equation recovery for evaluating symbolic regression algorithms for equation discovery. The rationale is that symbolic regression algorithms that perform well in recovering known ground-truth formulas are good candidates to perform well at discovering unknown equations. Existing benchmarks for symbolic regression include equation recovery tasks, but with only a small number of ground-truth formulas that are publicly known. Moreover, these benchmarks place less emphasis on evaluating the robustness of algorithms in terms of their behavior under changing dimensionality, sample size, sampling distribution and sampling domain. This, however, is of central importance to practitioners wanting to discover equations for modeling natural phenomena, since data is almost certainly noisy and comes from diverse domains, distributions, and sample sizes. To fill this gap, we introduce the Equation Recovery Benchmark (ERBench), a new evaluation framework designed to rigorously assess algorithms explicitly targeting the task of equation discovery.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: ERBench

1. Core Contribution

ERBench introduces a structured benchmark and competition framework specifically designed to evaluate symbolic regression algorithms on the task of equation recovery — recovering the exact ground-truth symbolic expression from data, rather than merely achieving low predictive error on in-domain test points. The key distinction from prior benchmarks (notably SRBench) is the emphasis on: (a) a large public development set of 10,000 formulas spanning multiple scientific domains and synthetic expressions, (b) a secret evaluation set of 1,000 formulas accessible only through a permutation-based competition protocol, (c) systematic evaluation across multiple robustness axes (dimensionality, complexity, sample size, noise, domain, distribution), and (d) metrics centered on symbolic equivalence rather than predictive accuracy.

The benchmark follows a "common task framework" design with public training data and secret test data, analogous to successful paradigms in NLP and computer vision. The competition protocol is well-designed: permutation of problem order, sample order, and variable order prevents information leakage across repeated queries.
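The permutation idea can be illustrated with a minimal sketch. It assumes each problem is a pair of NumPy arrays (a sample matrix and a target vector); the function names `permute_problem` and `permute_benchmark` are hypothetical and are not taken from the ERBench code base, whose actual protocol may differ.

```python
import numpy as np

def permute_problem(X, y, rng):
    """Shuffle sample order (rows) and variable order (columns) of one problem."""
    row_perm = rng.permutation(X.shape[0])
    col_perm = rng.permutation(X.shape[1])
    # Rows of X and entries of y are shuffled together so pairs stay aligned.
    return X[row_perm][:, col_perm], y[row_perm], col_perm

def permute_benchmark(problems, seed):
    """Shuffle problem order, then shuffle samples and variables in each problem."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(problems))
    shuffled = []
    for idx in order:
        X, y = problems[idx]
        Xp, yp, _ = permute_problem(X, y, rng)
        shuffled.append((Xp, yp))
    # In this sketch the organizer would keep `order` private to map results back.
    return shuffled, order
```

With a fresh seed per query, a participant sees the same problems in a different order each time, with rows and columns rearranged, which makes it harder to match a problem across repeated submissions.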

2. Methodological Rigor

The paper is methodologically sound in several respects. The evaluation metrics are well-defined: symbolic recovery rate (strict), Jaccard Index over subexpressions (relaxed/structural), and normalized Tree Edit Distance (structural). The authors acknowledge the undecidability of symbolic equivalence verification (Richardson, 1968) and provide both symbolic (lower bound) and numerical (upper bound) equivalence checks, finding them equivalent in practice on their test cases.
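To make the two equivalence checks and the Jaccard Index concrete, here is a rough sketch using SymPy. It is an illustration under assumptions, not the benchmark's implementation: the helper names, the sampling interval [1, 2], the tolerance, and the choice of "all nodes of the expression tree" as the subexpression set are all hypothetical. Tree Edit Distance is omitted because it would need an extra library.

```python
import numpy as np
import sympy as sp

def symbolically_equal(f, g):
    """Conservative check: True only if SymPy can simplify f - g to zero."""
    return sp.simplify(f - g) == 0

def numerically_equal(f, g, variables, n=200, tol=1e-8, seed=0):
    """Permissive check: compare both expressions on random sample points."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(1.0, 2.0, size=(n, len(variables)))  # assumed domain
    cols = [pts[:, i] for i in range(len(variables))]
    f_num = sp.lambdify(variables, f, "numpy")
    g_num = sp.lambdify(variables, g, "numpy")
    return bool(np.allclose(f_num(*cols), g_num(*cols), rtol=tol, atol=tol))

def subexpressions(expr):
    """All nodes of the expression tree, as a set."""
    return set(sp.preorder_traversal(expr))

def jaccard(f, g):
    """Jaccard Index of the two subexpression sets."""
    a, b = subexpressions(f), subexpressions(g)
    return len(a & b) / len(a | b)
```

The symbolic check can miss true equivalences when simplification fails, so it undercounts recoveries; the numerical check can accept expressions that merely agree on the sampled points, so it overcounts. This is the sense in which the two act as lower and upper bounds.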

The benchmark design is motivated by a systematic review of algorithmic paradigms (enumeration-based, sampling-based, pre-trained, hybrid), with each design principle (marked with ·) derived from identified weaknesses of specific algorithm classes. This is a strength — the benchmark is not arbitrary but engineered to probe known failure modes: pre-trained methods' dependence on pre-training distributions (Figure 4), search methods' sensitivity to sampling domains (Figure 3, right), and the universal degradation with expression complexity (Figure 3, left).

However, the experimental evaluation is somewhat limited. Only six methods are benchmarked (PySR, DSR, E2E, Operon, gplearn, Linear), and several important recent methods (LLM-SR, RSRM, KANs, AI-Feynman) are absent. The diagnostic analysis (Figure 6) is conducted only for PySR on a single subset (Feynman), limiting the generalizability of insights. The paper could have been strengthened by deeper cross-method diagnostic comparisons.

3. Potential Impact

The benchmark fills a genuine gap. Prior benchmarks like SRBench primarily evaluate predictive accuracy with publicly known test sets, creating risks of overfitting and inadequate evaluation of generalization. ERBench's secret test set with a continuously running competition protocol is a meaningful infrastructure contribution.

Practical utility: The robustness evaluation across noise levels, sample sizes, and distributions directly addresses practitioner needs. Scientists working with real experimental data face exactly these variations, and knowing which algorithms degrade gracefully is valuable.
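As an illustration of what varying these axes could look like, the sketch below samples data for one known formula under different sample sizes, noise levels, and input distributions. Everything here is an assumption made for the example: the function `sample_dataset`, the noise model (Gaussian noise scaled by the standard deviation of the targets), and the way the normal distribution is fitted to the interval are not described in the source.

```python
import numpy as np

def sample_dataset(formula, n_vars, n_samples, low, high,
                   noise_level=0.0, distribution="uniform", seed=0):
    """Draw inputs from the chosen distribution and add relative Gaussian noise."""
    rng = np.random.default_rng(seed)
    if distribution == "uniform":
        X = rng.uniform(low, high, size=(n_samples, n_vars))
    elif distribution == "normal":
        # Assumed convention: center the normal on the interval midpoint.
        center, spread = (low + high) / 2.0, (high - low) / 6.0
        X = rng.normal(center, spread, size=(n_samples, n_vars))
    else:
        raise ValueError(f"unknown distribution: {distribution}")
    y = formula(X)
    noise = noise_level * np.std(y) * rng.standard_normal(n_samples)
    return X, y + noise

def kinetic_energy(X):
    """Example ground truth: E = 0.5 * m * v**2 with columns (m, v)."""
    return 0.5 * X[:, 0] * X[:, 1] ** 2

# One dataset per combination of sample size, noise level, and distribution.
grid = [
    sample_dataset(kinetic_energy, 2, n, 1.0, 5.0, noise_level=s, distribution=d)
    for n in (50, 500)
    for s in (0.0, 0.1)
    for d in ("uniform", "normal")
]
```

Running an algorithm on every entry of such a grid and recording whether the ground-truth formula is recovered gives a per-axis picture of where performance degrades.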

Community infrastructure: The Hugging Face hosting, competition website, and reproducible evaluation scripts lower barriers to adoption. The continuously running competition (vs. fixed events like GECCO competitions) enables ongoing benchmarking.

Revealing findings: The result that most state-of-the-art methods achieve near-zero recovery rates on the secret test set (Table 3), and that a simple linear baseline matches or exceeds complex methods on Jaccard Index, is a sobering but important finding. It quantifies how far the field is from reliable equation discovery and identifies complexity as the primary bottleneck.

4. Timeliness & Relevance

The paper is highly timely. The explosion of LLM-based symbolic regression methods (LLM-SR, in-context approaches) creates an urgent need for benchmarks that control for data leakage, and ERBench's secret, procedurally generated test set addresses this directly. The growing interest in AI for Science (AI-Descartes, AI-Newton, Science-Gym) makes rigorous evaluation of equation discovery increasingly important. The paper correctly identifies that conflating symbolic regression (curve fitting) with equation discovery (law recovery) has been a persistent confusion in the field.

5. Strengths & Limitations

Key Strengths:

Well-motivated benchmark design with each principle traced to identified algorithmic weaknesses

Large, diverse formula collection (10,000 public + 1,000 secret) substantially exceeding prior benchmarks (~119 Feynman equations)

Anti-leakage measures for LLM evaluation (secret test set, permutation protocol)

Multi-axis robustness evaluation framework

Lightweight, continuously running competition design

Clear articulation of the distinction between interpolation accuracy and equation recovery

Notable Limitations:

The secret test set generation process is undisclosed, making it impossible to assess potential biases in formula selection or difficulty distribution

Limited baseline evaluation: only 6 methods tested, missing several important recent approaches

The SynEq dataset (5,303 formulas, 53% of public set) is generated from random DAGs, which may not represent the structural properties of real scientific equations

The OEIS integer sequences (3,757 formulas, 38% of public set) are discrete/combinatorial, which may not align well with continuous equation discovery

Computation time comparison is acknowledged as only a rough proxy due to heterogeneous hardware

No analysis of how the secret test set difficulty compares to the public development set

The paper does not discuss how the benchmark handles dimensional analysis or physical units, which are important constraints in scientific equation discovery

Reliance on SymPy for equivalence checking introduces potential systematic biases in what counts as "recovered"

Additional Observations:

The 30% recovery rate achieved by PySR — the clear winner — sets a useful but concerning baseline. The rapid performance degradation with complexity (Figure 6a) suggests fundamental algorithmic limitations rather than mere engineering challenges. The benchmark thus serves both as an evaluation tool and as a diagnostic instrument for guiding algorithmic development.

The paper's philosophical framing through Popper's falsificationism, while appropriate for motivating equation recovery over predictive accuracy, somewhat overstates the binary nature of scientific modeling — in practice, approximate models with known domains of validity are scientifically useful.

Overall, ERBench represents a solid infrastructure contribution that addresses a real need in the equation discovery community, with thoughtful design choices and useful initial findings, though the experimental evaluation could be more comprehensive.

Rating: 6.5 / 10

Significance 7 · Rigor 6.5 · Novelty 5.5 · Clarity 7.5

Generated Jun 9, 2026

Comparison History (21)

Lost vs. Capacity-Constrained Online Convex Optimization with Delayed Feedback

Paper 2 introduces a novel theoretical framework for capacity-constrained online convex optimization with delayed feedback, addressing a practical gap in online learning theory. It provides new algorithmic contributions (Delayed-Weighted FTRL), a refined semi-clairvoyant delay model, and rigorous regret bounds that gracefully degrade with capacity constraints. This advances fundamental optimization theory with broad applicability. Paper 1, while useful as a benchmarking tool for symbolic regression, is more incremental—extending existing benchmarks with robustness evaluations. Benchmarks can be impactful but typically have narrower theoretical contribution compared to foundational algorithmic advances with provable guarantees.

claude-opus-4-6 · Jun 11, 2026

Won vs. Inverse Probability Weighting and Age-of-Information Aggregation for Decentralized Federated Learning under Partial Reception

Paper 2 introduces a novel benchmark for equation discovery (symbolic regression), a crucial tool for AI-driven scientific discovery across multiple disciplines. Benchmarks typically have high scientific impact by standardizing evaluation, exposing algorithmic weaknesses, and driving future research directions. While Paper 1 offers a strong methodological improvement for decentralized federated learning, Paper 2's focus on enabling robust automated scientific discovery gives it broader applicability and higher potential for widespread cross-disciplinary impact.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. AuRA: Internalizing Audio Understanding into LLMs as LoRA

Paper 1 focuses on integrating audio understanding into LLMs via a novel distillation approach, addressing critical latency and efficiency bottlenecks in multimodal AI. Given the explosive growth and broad applicability of LLMs across numerous fields, this method promises significant and immediate real-world impact. While Paper 2 provides a valuable benchmark for symbolic regression, its scope is much narrower and targets a more specialized subfield compared to the widespread relevance of multimodal LLMs.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. From Observation to Intervention: A Causal Audit of Expert Importance in Mixture-of-Experts Models

Paper 2 addresses a critical flaw in interpretability and pruning methods for Mixture-of-Experts (MoE) models, a highly active and impactful area in AI. By demonstrating that observational metrics fail to predict causal importance, it challenges widespread assumptions in deep learning. While Paper 1 introduces a valuable benchmark for symbolic regression, its impact is confined to the narrower niche of equation discovery. Paper 2's findings have broader, more immediate implications for LLM architecture design and interpretability.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. Unifying Data, Memory, and Compute Efficiency in LLM training: A Survey

Paper 2 addresses the highly active and broadly impactful field of LLM training efficiency, unifying three critical bottlenecks (data, memory, compute) under a novel constraint-centric framework. Its breadth of coverage and practical relevance to the rapidly growing LLM community gives it wider potential impact. Paper 1, while valuable for the symbolic regression/equation discovery niche, serves a narrower community. Paper 2's timeliness—given the explosive growth in LLM research and deployment—and its potential to guide resource-efficient training practices across industry and academia give it higher estimated scientific impact.

claude-opus-4-6 · Jun 10, 2026

Won vs. Pre-AF 13: An Interpretable Atrial Fibrillation Risk Score Mined from Discharge Reports

Paper 1 likely has higher scientific impact due to broader, cross-domain relevance: a rigorous benchmark/test suite for equation discovery can shape evaluation standards across ML, scientific computing, physics, chemistry, and engineering. Its focus on robustness to dimensionality, sampling, and domain shift addresses a key methodological gap and can influence many future algorithm papers. Paper 2 has strong clinical application potential, but is single-center/retrospective and domain-specific; impact may be constrained by generalizability, external validation, and deployment hurdles despite interpretability and solid performance gains.

gpt-5.2 · Jun 10, 2026

Won vs. Operator learning for solving Fokker-Planck equations with various initial conditions

Paper 1 likely has higher impact because it introduces a general-purpose benchmark/test suite (ERBench) addressing a widely recognized evaluation gap in symbolic regression/equation discovery (robustness to noise, sampling regimes, dimensionality). Benchmarks often catalyze community-wide progress, standardize comparisons, and influence many downstream methods across ML and scientific computing. Paper 2 proposes a solid, timely method for Fokker–Planck operator learning, but it is narrower in scope (specific PDE class) and closer to incremental advances in PINNs/normalizing flows. Overall, Paper 1’s breadth and community-enabling nature suggest greater potential impact.

gpt-5.2 · Jun 9, 2026

Lost vs. Causal Neural Probabilistic Circuits

Paper 2 introduces a novel architecture (CNPC) that combines causal inference with probabilistic circuits and concept bottleneck models, addressing a fundamental limitation in interpretable AI—respecting causal dependencies during interventions. It provides both theoretical guarantees and empirical validation across multiple benchmarks. While Paper 1 contributes a useful benchmark for symbolic regression evaluation, benchmarks typically have narrower impact than novel methodological contributions. Paper 2's integration of causality into interpretable deep learning has broader implications for trustworthy AI, a highly active and impactful research area.

claude-opus-4-6 · Jun 9, 2026

Lost vs. STELLAR: Spatio-Temporal Environmental Learning with Latent Alignment and Refinement for Long-Tailed Species Distribution Modeling

Paper 1 proposes a novel methodology addressing the critical global challenge of biodiversity monitoring, specifically tackling spatio-temporal dynamics and rare species. Its integration of graph-temporal encoding and contrastive learning provides significant advancements for conservation planning. Paper 2, while offering a useful benchmark for symbolic regression, is more narrowly focused on algorithm evaluation and lacks the immediate, broad real-world impact on urgent ecological issues demonstrated by Paper 1.

gemini-3.1-pro-preview · Jun 9, 2026

Lost vs. In-Context Learning for Latent Space Bayesian Optimization

Paper 2 addresses a timely intersection of foundation models, Bayesian optimization, and molecular design—areas of high current interest. Its novel approach of adapting tabular foundation model surrogates for latent-space BO through continued pretraining with domain-specific synthetic tasks is innovative and has direct applications in drug discovery and materials science. Paper 1, while useful as a benchmark contribution for symbolic regression, is more incremental—it improves evaluation methodology rather than introducing new algorithmic capabilities. Benchmarks have impact but typically less than methodological advances with clear real-world applications in high-impact domains like molecular optimization.

claude-opus-4-6 · Jun 9, 2026
