TIGER: Text-Informed Generalized Enzyme-Reaction Retrieval
Yuhang Zhang, Keyan Ding, Peilin Chen, Han Liu, Can Lin, Ruixi Chen, Shiqi Wang, Qi Song
Abstract
Enzyme-reaction retrieval is a fundamental problem in computational biology, underpinning enzyme characterization, reaction mechanism elucidation, and the rational design of metabolic pathways and biocatalysts. As a bidirectional task, it entails both enzyme-to-reaction and reaction-to-enzyme mapping. However, existing approaches suffer from poor generalization across tasks and distributions, with performance highly sensitive to dataset splits and substantial asymmetry between retrieval directions. To address these challenges, we present TIGER, a Text-Informed Generalized Enzyme-Reaction Retrieval framework that leverages protein-to-text generation models to distill textual semantic knowledge from enzyme sequences, providing a generalized representation that bridges enzymes and biochemical reactions. To ensure the quality and reliability of textual semantics, we design a Dynamic Gating Network that adaptively fuses text-derived knowledge with sequence features, enabling more consistent and informative enzyme representations, while a Structure-Shared Feature Projector aligns enzyme and reaction representations within a unified latent space. Extensive experiments demonstrate that, under bidirectional retrieval supervision, TIGER significantly outperforms state-of-the-art baselines across diverse distributions and exhibits strong robustness and transferability across tasks.
AI Impact Assessments
Scientific Impact Assessment: TIGER: Text-Informed Generalized Enzyme-Reaction Retrieval
1. Core Contribution
TIGER addresses a well-defined problem in computational biology: bidirectional enzyme-reaction retrieval, where existing methods suffer from directional asymmetry (enzyme→reaction vs. reaction→enzyme performance gaps) and sensitivity to dataset splitting strategies. The key insight is that pretrained protein language models (e.g., ESM2) capture evolutionary and structural signals but lack explicit understanding of chemical transformations, creating a semantic gap when aligning enzymes with reactions.
The main novelty lies in using protein-to-text generation models (ESM2Text, ProtT3) to produce textual descriptions of enzymes that encode functional and catalytic semantics, then fusing these with sequence embeddings via a Dynamic Gating Network (DGN). The DGN addresses a practical concern—AI-generated text may be noisy or hallucinated—by learning reliability-aware gating weights. A Structure-Shared Feature Projector (SSFP) then maps enzyme and reaction representations into a unified latent space for contrastive training.
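The gating idea described above can be illustrated with a minimal sketch. This is not the paper's DGN, whose architecture is not given in this assessment. It assumes a single scalar gate computed from the concatenated sequence and text embeddings, and the names `gated_fusion`, `gate_weights`, and `gate_bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fusion(seq_emb, text_emb, gate_weights, gate_bias):
    """Fuse two equal-length embeddings with one scalar gate (illustrative only).

    gate_weights has length 2 * d and scores the concatenation
    [seq_emb; text_emb]. In a trained model these would be learned parameters.
    """
    if len(seq_emb) != len(text_emb):
        raise ValueError("embeddings must share a dimension")
    concat = list(seq_emb) + list(text_emb)
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must have length 2 * d")

    logit = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
    g = sigmoid(logit)  # g near 1 trusts the text; g near 0 falls back to the sequence

    fused = [g * t + (1.0 - g) * s for s, t in zip(seq_emb, text_emb)]
    return fused, g


if __name__ == "__main__":
    seq = [1.0, 0.0]
    text = [0.0, 1.0]
    # A strongly negative bias mimics a gate that has learned to distrust the text.
    fused, g = gated_fusion(seq, text, [0.0] * 4, -30.0)
    print(fused, g)
```

When the gate is close to zero the fused vector is essentially the sequence embedding, which is the behavior one would want for an unreliable generated description.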
2. Methodological Rigor
Strengths in experimental design: The paper evaluates on ReactZyme, the largest and most standardized benchmark for this task, across three complementary splits (time-based, enzyme similarity-based, reaction similarity-based) that test progressively harder generalization scenarios. The evaluation is thorough, with Hit@k at multiple cutoffs, Precision@k, MRR, and Mean Rank reported in extensive appendix tables. Ablation studies isolate contributions of textual knowledge, DGN, and SSFP.
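For readers unfamiliar with the metrics named above, the following sketch computes Hit@k, Precision@k, MRR, and Mean Rank from ranked candidate lists. It uses common textbook definitions; whether ReactZyme or the paper applies exactly these conventions (for example, how queries with several correct answers or no retrieved answer are handled) is an assumption, and the function names are hypothetical.

```python
def rank_of_first_hit(ranked_ids, relevant_ids):
    """Return the 1-based rank of the first relevant candidate, or None."""
    relevant = set(relevant_ids)
    for position, candidate in enumerate(ranked_ids, start=1):
        if candidate in relevant:
            return position
    return None


def retrieval_metrics(rankings, ground_truth, k=1):
    """Compute retrieval metrics over a batch of queries.

    rankings: one ranked candidate list per query (best first).
    ground_truth: one collection of relevant candidate ids per query.
    """
    if not rankings or len(rankings) != len(ground_truth):
        raise ValueError("need one non-empty ranking per ground-truth entry")

    n = len(rankings)
    ranks = [rank_of_first_hit(r, g) for r, g in zip(rankings, ground_truth)]
    found = [r for r in ranks if r is not None]

    # Hit@k: fraction of queries with a relevant item in the top k.
    hit_at_k = sum(1 for r in found if r <= k) / n
    # Precision@k: average share of the top k that is relevant.
    precision_at_k = sum(
        len(set(r[:k]) & set(g)) / k for r, g in zip(rankings, ground_truth)
    ) / n
    # MRR: mean reciprocal rank; queries with no hit contribute 0.
    mrr = sum(1.0 / r for r in found) / n
    # Mean Rank: averaged over queries where a relevant item was retrieved.
    mean_rank = sum(found) / len(found) if found else float("inf")

    return {
        "hit_at_k": hit_at_k,
        "precision_at_k": precision_at_k,
        "mrr": mrr,
        "mean_rank": mean_rank,
    }


if __name__ == "__main__":
    rankings = [["r1", "r2", "r3"], ["r2", "r3", "r1"]]
    truth = [["r1"], ["r1"]]
    print(retrieval_metrics(rankings, truth, k=1))
```

The same function serves both directions: for E→R the queries are enzymes and the candidates reactions, and for R→E the roles are swapped.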
Concerns:
3. Potential Impact
The practical implications are meaningful for several domains:
The text-informed paradigm itself could influence adjacent fields. The idea of using protein-to-text models to bridge modality gaps is transferable to other protein function prediction tasks. The DGN mechanism for handling noisy AI-generated supervision is a generally applicable technique.
However, the impact is somewhat bounded by the task's current scope—enzyme-reaction retrieval, while important, is still a relatively niche subfield compared to broader protein function prediction or drug discovery tasks.
4. Timeliness & Relevance
The paper is well-timed. ReactZyme (NeurIPS 2024) recently established this benchmark, and CLIPZyme provided the first contrastive learning approach. The identified limitations (directional asymmetry, split sensitivity) are genuine and well-documented issues. The use of protein-to-text generation models is a natural evolution given recent advances in multimodal protein representation learning (ProtST, ProTrek, ProtCLIP). The paper sits at the intersection of two active research areas: protein language models and cross-modal retrieval.
5. Strengths & Limitations
Key Strengths:
1. Substantial empirical improvements: Hit@1 gains of 14% to over 200% relative to baselines are significant. On the most challenging reaction similarity-based split, TIGER achieves 0.416 E→R Hit@1 vs. 0.131 for the best baseline, roughly a 3.2-fold increase (about 218% relative).
2. Bidirectional consistency: TIGER substantially reduces the E→R vs. R→E performance gap, addressing a fundamental limitation of prior work. This is well-demonstrated across all splits.
3. Comprehensive ablations: The paper systematically evaluates the contribution of each component (text source, DGN, SSFP, loss balancing γ), providing clear evidence for design decisions.
4. Robustness across distribution shifts: Performance degradation across splits is much more graceful than baselines, suggesting genuine generalization rather than memorization.
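The relative gain quoted in item 1 can be checked directly from the two Hit@1 values reported there. The helper below is only an arithmetic check; `relative_improvement` is a hypothetical name, not something taken from the paper.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, as a fraction (1.0 == +100%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


# E→R Hit@1 on the reaction similarity-based split, as quoted in item 1.
gain = relative_improvement(0.416, 0.131)
print(f"{gain:.1%}")  # → 217.6%
```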
Notable Limitations:
1. Dependence on text generation quality: While DGN mitigates noise, the framework's ceiling is partially determined by the quality of protein-to-text models. The gap between AI-generated and human-reviewed text (Figure 4) shows 5-12% Hit@1 differences, suggesting room for improvement.
2. Lack of structural information on the enzyme side: The framework uses sequence-based representations (ESM2) rather than structure-aware models. Given the importance of active site geometry for catalysis, incorporating structural features could further improve performance.
3. Limited analysis of failure modes: The paper doesn't analyze where TIGER fails—are there systematic patterns (e.g., multi-functional enzymes, promiscuous reactions)?
4. Scalability considerations: The paper doesn't discuss computational overhead from text generation and multi-modal fusion during inference, which matters for large-scale screening applications.
5. Single benchmark evaluation: All experiments are on ReactZyme. Cross-dataset validation would strengthen generalizability claims.
6. The SSFP ablation (Table 3) shows mixed results: In some directions/splits, the simpler 2-layer MLP outperforms SSFP, weakening the case for this component's necessity.
Additional Observations
The paper's framework is modular and could benefit from future improvements in any component (better protein LMs, better text generation, better molecular encoders). The choice of γ=0.5 for balanced bidirectional supervision is well-motivated, but the sensitivity analysis (Figure 5) shows that γ=0.7 achieves the best E→R performance, suggesting that task-specific tuning could be valuable.
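To make the role of γ concrete, here is a minimal sketch of a γ-weighted bidirectional contrastive objective. The exact loss used by TIGER is not stated in this assessment, so the form below is an assumption: a symmetric InfoNCE-style loss in which γ weights the E→R term and 1 − γ weights the R→E term. The `temperature` parameter and the function names are hypothetical.

```python
import math


def _mean_info_nce(sim, temperature):
    """Mean cross-entropy over rows, where row i's positive is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_denominator - logits[i]
    return total / len(sim)


def bidirectional_loss(sim, gamma=0.5, temperature=0.1):
    """Gamma-weighted sum of E→R and R→E contrastive losses (illustrative).

    sim[i][j] is the similarity between enzyme i and reaction j in a batch;
    matched pairs are assumed to lie on the diagonal of a square matrix.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    transposed = [list(col) for col in zip(*sim)]
    loss_e2r = _mean_info_nce(sim, temperature)         # enzymes query reactions
    loss_r2e = _mean_info_nce(transposed, temperature)  # reactions query enzymes
    return gamma * loss_e2r + (1.0 - gamma) * loss_r2e


if __name__ == "__main__":
    sim = [[1.0, 0.9], [0.0, 0.2]]
    for gamma in (0.3, 0.5, 0.7):
        print(gamma, bidirectional_loss(sim, gamma=gamma))
```

Under this assumed form, γ=0.5 treats both directions equally, while a larger γ puts more weight on the E→R term.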
The paper is clearly written with good figure quality, though the main text focuses primarily on Hit@1 and MRR, deferring extensive additional metrics to the appendix.
Generated May 26, 2026
Comparison History (22)
Paper 1 addresses a foundational challenge in modern AI: mechanistic interpretability of Large Language Models. By uncovering the specific attention heads and deductive circuits responsible for logical reasoning and algorithmic execution, this work provides critical insights into the 'black box' of LLMs. Its methodological rigor using causal mediation analysis advances the field of AI safety and model optimization. While Paper 2 offers strong applied value in computational biology, Paper 1 has a broader potential impact, as understanding and improving LLM reasoning fundamentally affects all downstream applications of AI, including scientific discovery itself.
Paper 2 addresses a fundamental question about how LLMs develop human-like perceptual representations despite lacking sensory input, with broad implications across cognitive science, AI interpretability, and neuroscience. Its findings about transient geometric structure emerging in intermediate layers provide novel mechanistic insights into transformer representations. While Paper 1 makes solid contributions to enzyme-reaction retrieval with practical bioinformatics applications, Paper 2's cross-disciplinary relevance (linguistics, cognitive science, AI alignment, mechanistic interpretability) and its surprising findings about grounded cognition emerging from text-only training give it higher potential for broad scientific impact.
AIBuildAI-2 addresses the broad challenge of automating AI model development, with demonstrated state-of-the-art results on established benchmarks (MLE-Bench) and real competitions. Its knowledge-enhanced agent framework with self-evolving knowledge systems has wide applicability across all fields using AI, potentially democratizing AI for non-experts. While TIGER makes a solid contribution to enzyme-reaction retrieval in computational biology, its scope is narrower. AIBuildAI-2's breadth of impact across scientific disciplines, practical utility, and novel self-improving knowledge architecture give it higher potential impact.
Paper 1 addresses a fundamental safety concern in retrieval-augmented LLMs—a technology with massive and growing deployment. The discovery of the 'monitoring-control gap' (models detect but don't resolve contradictions) has broad implications for AI safety, trust, and deployment policy across many high-stakes domains. Its rigorous multi-turn evaluation protocol across 50,000+ evaluations with mechanistic analysis (probing, attention) provides deep methodological contributions. Paper 2, while solid work in computational biology, addresses a more specialized problem (enzyme-reaction retrieval) with narrower impact scope. The timeliness and breadth of Paper 1's contribution to LLM safety gives it higher potential impact.
Paper 1 challenges the prevailing assumption that Chain-of-Thought prompting relies on logical derivation, revealing it primarily stems from lexical activation and local token co-occurrence. This provides a fundamental paradigm shift in understanding LLM 'reasoning', impacting the broad and highly active AI research community. While Paper 2 offers a valuable methodological advancement with practical applications in computational biology, Paper 1's theoretical insights into the inner workings of foundation models have a wider breadth of impact and higher potential to alter the trajectory of AI research.
Paper 2 likely has higher impact: it targets a central, timely bottleneck in post-training LMs for reasoning (credit assignment under outcome-only rewards) with a generally applicable mechanism (resets) plus analysis in a CPI framework and empirical gains across multiple models/benchmarks. Its methods are broadly reusable across RLHF/RLAIF-style training and could influence many downstream reasoning and agentic applications. Paper 1 is innovative for enzyme–reaction retrieval and valuable for bioinformatics, but its impact is more domain-specific and depends on adoption within a narrower community.
Paper 1 develops a comprehensive algebraic framework grounding deep learning architectures in lattice theory and mathematical morphology, providing fundamental theoretical insights into why depth creates representational power in CNNs. This addresses a core question in deep learning theory with broad implications across the field. Its rigorous mathematical unification of disparate architectural components (convolutions, ReLU, pooling, skip connections, encoder-decoders) under a single algebraic framework has potential to influence both theoretical understanding and architectural design across all of deep learning. Paper 2, while valuable, addresses a more specialized retrieval problem in computational biology with incremental methodological contributions.
Paper 2 addresses a fundamental problem in computational biology with direct implications for metabolic pathway design and biocatalysis. By bridging NLP and biochemistry, it offers broader cross-disciplinary impact and significant real-world applications compared to Paper 1, which primarily focuses on theoretical improvements and benchmark tasks like Sudoku in machine learning.
Paper 2 targets a broad, timely bottleneck in agentic AI—system-level “harness” design—likely affecting many domains using foundation-model agents (software engineering, HCI, security, governance, evaluation). It proposes a unifying framing, research agenda, new evaluation dimensions, and a reference implementation, which can catalyze community benchmarks and standards. Paper 1 is methodologically concrete and impactful within computational enzymology, but its scope is narrower. Overall, Paper 2 has higher potential cross-field adoption and near-term relevance as agent systems proliferate.
TIGER addresses a fundamental problem in computational biology (enzyme-reaction retrieval) with a well-motivated, methodologically rigorous framework combining protein-to-text generation, dynamic gating, and structure-shared projectors. It demonstrates strong generalization across diverse distributions and tasks, directly enabling applications in enzyme engineering, metabolic pathway design, and biocatalysis. Paper 2 (SkillEvolver) presents an interesting meta-learning approach for LLM agent skills, but operates in a more niche, rapidly-evolving area with less established scientific foundations and narrower real-world biological/scientific applications. TIGER's cross-disciplinary impact (NLP + biology) and practical utility give it higher potential impact.
Paper 1 addresses a fundamental problem in computational biology with significant real-world applications in drug discovery, metabolic engineering, and biocatalyst design. Its novel integration of text-derived semantics with protein sequences for bidirectional retrieval aligns with highly impactful current trends in AI for Science. While Paper 2 offers a valuable algorithmic advancement for imperfect-information games, Paper 1's cross-disciplinary approach and immediate potential to accelerate biochemical research give it a broader and more substantial scientific impact.
Paper 1 introduces a novel cross-disciplinary framework bringing formal robotics control theory to foundation model safety—a highly timely problem given rapid LLM deployment. Its breadth of impact spans AI safety, HCI, education, mental health, and caregiving, addressing a critical gap (trajectory-level behavioral guarantees vs. single-output safety). The conceptual reframing from output-level to interaction-trajectory guardrails is innovative and broadly applicable. Paper 2, while technically solid and useful for computational biology, addresses a more specialized retrieval problem with incremental methodological advances (text-informed fusion for enzyme-reaction matching).
Paper 2 introduces a surprising and broadly applicable finding—that a single dominant adaptation module can outperform full LoRA with ~0.7% of parameters—with implications across the entire LLM fine-tuning community. The PAGE metric and DomLoRA method are simple, generalizable, and practically useful across diverse tasks and architectures. Paper 1, while solid, addresses a more niche problem (enzyme-reaction retrieval) with incremental improvements over existing methods. Paper 2's breadth of impact across NLP/AI, its counterintuitive finding challenging conventional wisdom, and its immediate practical applicability give it higher potential scientific impact.
Paper 2 (TIGER) targets a core, broadly useful bioinformatics problem (enzyme–reaction retrieval) with direct downstream applications in enzyme annotation, pathway design, and biocatalysis, and emphasizes robustness/generalization across distributions and bidirectional retrieval—key for real-world deployment. Its integration of protein-to-text semantic distillation plus gated fusion and shared projection is a novel, cross-modal approach with likely impact across computational biology and drug/metabolic engineering. Paper 1 is timely and useful for LLM efficiency, but may be more incremental within model-inference optimization and less directly tied to high-stakes domain applications.
TIGER addresses a fundamental problem in computational biology (enzyme-reaction retrieval) with a novel framework that bridges protein sequences and biochemical reactions through text-informed representations. Its impact spans enzyme engineering, metabolic pathway design, and biocatalysis—areas with significant real-world applications in drug discovery, synthetic biology, and industrial biotechnology. While Paper 1 offers a useful training-free method for reducing hallucinations in VLMs (an incremental improvement in a crowded field), Paper 2 introduces a more novel cross-modal framework with broader interdisciplinary impact and stronger potential for enabling downstream scientific discoveries.
Paper 1 identifies a critical inverse scaling phenomenon in LLMs for high-stakes forecasting tasks like epidemiology and finance. Its findings challenge current assumptions about model scaling and highlight crucial flaws in evaluation metrics. This has profound, timely implications for AI safety, evaluation, and deployment across multiple domains, offering broader scientific impact than Paper 2, which focuses on a specialized (albeit important) application in computational biology.
TIGER addresses a fundamental problem in computational biology (enzyme-reaction retrieval) with a novel framework combining protein-to-text generation, dynamic gating, and structure-shared projection. It has broader impact across enzyme engineering, metabolic pathway design, and biocatalysis—fields with significant real-world applications in drug discovery, synthetic biology, and industrial biotechnology. Its methodological innovations (bridging text semantics with sequence/structure features) are more transferable. Geo-Expert, while valuable, primarily applies existing techniques (LoRA fine-tuning) to a narrower geological domain with less transformative methodological contribution.
Paper 1 addresses a fundamental problem in computational biology with broad implications for metabolic engineering, drug discovery, and biocatalyst design. By bridging AI and biochemistry, its potential for transformative, cross-disciplinary scientific breakthroughs is higher than Paper 2, which focuses on a more specialized, albeit important, engineering application in hardware verification.
Paper 1 has higher likely impact: it proposes a concrete, implementable method for enzyme–reaction retrieval with demonstrated empirical gains, robustness, and transfer across distributions—immediately useful for enzyme annotation, pathway design, and biocatalysis. Its novelty (text-informed enzyme representations with dynamic gating and shared projection) is plausible and actionable, and the application domain is large in biotech. Paper 2 is ambitious and wide-ranging, but the sweeping “architecture-only accuracy ceiling” and many cross-domain impossibility claims risk being overly strong or hard to validate; impact depends on exceptionally rigorous proofs and community acceptance.
TIGER addresses a fundamental problem in computational biology (enzyme-reaction retrieval) with a novel framework combining protein-to-text generation, dynamic gating, and cross-modal alignment. It demonstrates strong empirical results with clear practical applications in enzyme engineering, metabolic pathway design, and biocatalysis. Paper 2 introduces a benchmark for in-context RL in ad-hoc teamwork but primarily reports negative results (baselines fail), which, while informative, has narrower impact. TIGER's methodological contributions and direct applicability to biotechnology give it broader and more immediate scientific impact.