Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

Jiazheng Kang, Bowen Zhang, Zixin Song, Jiangwang Chen, Xiao Yang, Da Zhu, Guanjun Jiang

May 22, 2026

arXiv:2605.23590v1

cs.AI (primary)

#1489 of 2682 · Artificial Intelligence

Tournament Score: 1397±41

Win Rate: 50%

Rating: 6.5/10

Significance 6.5 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Abstract

ReAct-style agents for search-intensive, multi-step reasoning tasks rely largely on their own internal judgment to decide what evidence to seek, which reasoning or action step to take next, and when to stop, often producing shallow, redundant, or poorly targeted trajectories. Prior work has explored rubrics as external quality signals, but existing uses are mostly evaluative rather than action-guiding: rubrics typically serve as training-time rewards or post-hoc evaluators of completed outputs, and in deep-research settings they are often coarse-grained and report-level rather than step-level. We introduce Co-ReAct, a rubric-guided action-selection framework that uses rubrics as step-level guidance during inference. At each decision step, Co-ReAct injects a rubric into the agent's context to guide the next Reason-or-Act decision, specifying what the agent should target in evidence seeking, search, reasoning, or self-evaluation. To make this guidance reliable, we train a dedicated rubric generator with GRPO. Unlike prior pairwise or binary preference formulations, our objective optimizes a list-wise Spearman rank-correlation reward against multi-judge expert consensus rankings, encouraging rubrics that are discriminative rather than merely plausible. On DeepResearchBench and SQA-CS-V2, Co-ReAct consistently improves over ReAct and representative test-time compute baselines across search agents built on both 8B/14B open-source and frontier closed-source base models. The trained rubric generator can also serve as a drop-in component that improves these baselines without changing their underlying decision mechanisms. Our code is publicly available at https://github.com/ZBWpro/Co-ReAct.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Co-ReAct

1. Core Contribution

Co-ReAct reframes rubrics from evaluative artifacts (used as rewards during training or post-hoc quality checkers) into prescriptive, step-level action-guidance signals consumed by ReAct agents during inference. The key insight is that at each decision point in a multi-step search trajectory, a rubric conditioned on the current partial trajectory can specify what the next action should accomplish—what tool to use, what query to issue, what evidence gap to fill. This is operationalized through three innovations: (1) a five-tuple loop (Rubric, Reason, Act, Verify, Observe) extending ReAct's three-tuple; (2) a dedicated rubric generator trained via GRPO with a listwise Spearman rank-correlation reward against multi-judge expert consensus rankings; and (3) the rubric generator's dual use as both the core of Co-ReAct and a drop-in plug-in for other test-time compute methods.
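
The five-tuple loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`gen_rubric`, `propose`, `verify`, `execute`), the `"finish"` action, and the `max_steps` default are all hypothetical, and the rule "retry once when the verification score falls below τ" is an assumption that combines the single-retry limit and the τ=0.5 threshold mentioned later in this review.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

TAU = 0.5          # verification threshold cited in the review
MAX_RETRIES = 1    # the review reports at most one retry per step


@dataclass
class Step:
    rubric: str
    thought: str
    action: str
    score: float
    observation: str


def co_react_episode(
    query: str,
    gen_rubric: Callable[[str, List[Step]], str],
    propose: Callable[[str, List[Step], str], Tuple[str, str]],
    verify: Callable[[str, str], float],
    execute: Callable[[str], str],
    max_steps: int = 8,  # hypothetical cap, not from the paper
) -> List[Step]:
    """Run a Rubric -> Reason -> Act -> Verify -> Observe loop (illustrative)."""
    trajectory: List[Step] = []
    for _ in range(max_steps):
        # Rubric: conditioned on the query and the partial trajectory.
        rubric = gen_rubric(query, trajectory)
        # Reason + Act: the agent proposes a thought and an action under the rubric.
        thought, action = propose(query, trajectory, rubric)
        # Verify: score the proposed action against the rubric.
        score = verify(rubric, action)
        retries = 0
        while score < TAU and retries < MAX_RETRIES:
            # Assumed policy: re-propose when the action fails verification.
            thought, action = propose(query, trajectory, rubric)
            score = verify(rubric, action)
            retries += 1
        if action == "finish":  # hypothetical stop action
            trajectory.append(Step(rubric, thought, action, score, ""))
            break
        # Observe: execute the action and record the tool output.
        trajectory.append(Step(rubric, thought, action, score, execute(action)))
    return trajectory
```

In a real agent, each callable would wrap a model or tool call; here they are left as parameters so the control flow can be inspected on its own.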

The distinction between evaluative and prescriptive rubric use is conceptually clean and practically meaningful. Prior work (DR-Tulu, Rubric-ARM, OpenRubrics) treats rubrics as training signals or final-output evaluators; Co-ReAct is the first to train rubrics specifically for step-level, trajectory-conditioned inference guidance in agentic search.

2. Methodological Rigor

Training pipeline. The data collection procedure is well-designed: branching points from real trajectories, diverse candidate slates (three model scales × four temperatures, deduplicated via MMR-BM25), and multi-judge listwise rankings via Borda count over three frontier LLMs. The scale is reasonable (29,866 branching points from 11,406 queries). The choice of listwise over pairwise ranking as a training signal is well-motivated—when k=4 candidates exist, a full ranking provides richer gradient information than pairwise comparisons.
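
To make the consensus step concrete, here is a small sketch of Borda-count aggregation over several judges' best-first rankings. It shows the standard Borda rule only; the candidate labels, the alphabetical tie-break, and the function name `borda_consensus` are assumptions for illustration, and the paper's exact aggregation details may differ.

```python
from typing import Dict, List, Sequence


def borda_consensus(rankings: Sequence[Sequence[str]]) -> List[str]:
    """Aggregate best-first rankings into one consensus order via Borda count."""
    k = len(rankings[0])
    points: Dict[str, int] = {c: 0 for c in rankings[0]}
    for ranking in rankings:
        if len(ranking) != k or set(ranking) != set(points):
            raise ValueError("every judge must rank the same candidates")
        for position, candidate in enumerate(ranking):
            # Top place earns k-1 points, last place earns 0.
            points[candidate] += k - 1 - position
    # Highest total first; ties broken alphabetically (an assumption).
    return sorted(points, key=lambda c: (-points[c], c))


# Three hypothetical judges ranking k=4 candidate actions.
judges = [
    ["a", "b", "c", "d"],
    ["b", "a", "c", "d"],
    ["a", "c", "b", "d"],
]
consensus = borda_consensus(judges)  # a=8, b=6, c=4, d=0 points
```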

GRPO with Spearman reward. The reward design is principled: a rubric earns high reward only when the ranking it induces over candidate actions correlates with the expert consensus. The auxiliary atomicity and format rewards are sensibly weighted (0.75/0.15/0.10) to keep the discriminative signal dominant. The use of an independent evaluator LLM (Gemini 2.5 Pro) to score actions against rubrics during training adds a layer of indirection that could introduce noise, though results suggest this is manageable.
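
A rough sketch of how such a reward could be composed is shown below. The 0.75/0.15/0.10 weights come from the review; everything else is an assumption: the function names, the use of raw ρ in [-1, 1] without rescaling, tie-free ranks, and atomicity expressed as a score in [0, 1].

```python
from typing import Sequence

# Weights as reported in the review: rank correlation, atomicity, format.
W_RANK, W_ATOMIC, W_FORMAT = 0.75, 0.15, 0.10


def spearman_rho(pred_ranks: Sequence[int], gold_ranks: Sequence[int]) -> float:
    """Spearman rank correlation for tie-free ranks (1 = best)."""
    n = len(pred_ranks)
    if n != len(gold_ranks) or n < 2:
        raise ValueError("need two rank lists of equal length >= 2")
    d2 = sum((p - g) ** 2 for p, g in zip(pred_ranks, gold_ranks))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def rubric_reward(
    pred_ranks: Sequence[int],
    gold_ranks: Sequence[int],
    atomicity: float,
    format_ok: bool,
) -> float:
    """Weighted sum of discriminative, atomicity, and format terms (assumed form)."""
    rho = spearman_rho(pred_ranks, gold_ranks)
    return W_RANK * rho + W_ATOMIC * atomicity + W_FORMAT * (1.0 if format_ok else 0.0)


# A rubric whose induced ranking matches the consensus exactly.
best = rubric_reward([1, 2, 3, 4], [1, 2, 3, 4], atomicity=1.0, format_ok=True)
# A rubric that ranks the candidates in exactly the reverse order.
worst = rubric_reward([4, 3, 2, 1], [1, 2, 3, 4], atomicity=0.0, format_ok=False)
```

With these weights, a well-formatted, atomic rubric that induces the wrong ranking still scores poorly, which matches the review's point that the discriminative signal dominates.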

Experimental design. The evaluation is methodical: two benchmarks (DRB, SQA-CS-V2), two open-source scales (8B, 14B), one frontier model (Gemini 3.1 Pro), four baselines, and ablations. Using Qwen3-235B as a shared answer rewriter across all methods is a smart control that isolates search quality from writing ability. The plug-in study (Figure 3) is a compelling demonstration of composability.

Weaknesses in rigor. The improvements on DRB with Qwen3-8B are modest (2.5% overall), and some sub-metric comparisons are mixed (e.g., Co-ReAct doesn't always win Instruction Following or Answer Precision). The reliance on LLM-based judges throughout—both for training data generation and evaluation—creates a circularity concern the authors acknowledge but don't fully address. The paper lacks statistical significance testing; with benchmark-level averages, it's unclear whether the improvements are robust to sampling variance. The ablation is only on SQA-CS-V2 with one model scale.

3. Potential Impact

Practical applications. The framework is immediately applicable to any ReAct-based deep research agent. The plug-in portability is particularly valuable—organizations already using Best-of-N, CRITIC, or Step-Back can inject the rubric generator without architectural changes, with reported gains of 1.2–14.8% depending on the method.

Conceptual contribution. The evaluative → prescriptive shift for rubrics is a general idea that could influence adjacent areas: code generation agents, embodied agents, multi-agent systems. The listwise GRPO training with Spearman reward is a reusable technique for any setting where a generated artifact must induce correct rankings.

Scalability. The method adds inference-time cost (rubric generation + verification per step), but the paper shows this is bounded (one retry max, rubric capped at 1024 tokens). The 25% increase in tool calls yielding 52% more retrieved documents suggests favorable efficiency.
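
The efficiency claim can be checked with a short calculation, assuming both percentages are measured against the same ReAct baseline (the review does not state this explicitly). The symbols $D$ and $C$, for retrieved documents and tool calls, are introduced here for illustration.

```latex
\frac{D_{\text{Co-ReAct}} / C_{\text{Co-ReAct}}}{D_{\text{ReAct}} / C_{\text{ReAct}}}
  = \frac{1.52}{1.25}
  \approx 1.216
```

Under that assumption, each tool call retrieves roughly 22% more documents on average, so the extra calls are not merely adding volume.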

4. Timeliness & Relevance

This paper addresses a genuine current bottleneck. Deep research agents (Gemini Deep Research, OpenScholar, etc.) are a rapidly growing application area, and the observation that ReAct agents make shallow, redundant searches is well-documented. The agentic AI community is actively seeking better step-level guidance mechanisms, and rubrics offer a human-interpretable, verifiable specification language. The timing—building on the GRPO training paradigm from DeepSeek, leveraging frontier models as judges—places this squarely at the intersection of current capabilities.

5. Strengths & Limitations

Key strengths:

Clean conceptual framing (evaluative → prescriptive rubrics)

Strong ablation showing each component matters, especially that untrained rubrics *hurt* performance (w/o RL Rubric: 72.44 vs. ReAct: 72.76)

Plug-in portability demonstrating the rubric's general utility

Thorough behavioral analysis (Table 3) showing how rubrics change search patterns

The case study (Figure 4) is illustrative and concrete

Notable limitations:

No comparison with end-to-end RL-trained agents (Search-R1, R1-Searcher), which the authors acknowledge but frame as "orthogonal"—this is fair but limits understanding of the absolute quality ceiling

The expert ranking depends on frontier LLMs as judges, creating potential systematic biases in what "good" actions look like

Improvements with 8B models are relatively small; the method seems to benefit stronger models more, raising questions about accessibility

Single retry limit and fixed verification threshold (τ=0.5) are not justified or tuned

Reproducibility depends on access to frontier APIs for judge councils during training

Overall Assessment

Co-ReAct makes a well-motivated conceptual contribution—step-level prescriptive rubrics for agentic search—backed by a sound training methodology and consistent empirical gains. The plug-in composability result is particularly notable for practical adoption. However, the magnitude of improvements is moderate, the evaluation relies heavily on LLM judges, and the exclusion of RL-trained agent baselines leaves open questions about the method's positioning in the broader landscape. The work is a solid contribution to the growing literature on improving agentic reasoning at inference time.

Rating: 6.5/10

Significance 6.5 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Generated May 25, 2026

Comparison History (24)

vs. Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

claude-opus-4.6 · 5/26/2026

Co-ReAct introduces a novel, principled framework for step-level rubric-guided reasoning in ReAct agents, with a dedicated rubric generator trained via a novel list-wise Spearman rank-correlation GRPO objective. It demonstrates consistent improvements across multiple models and benchmarks, and the rubric generator serves as a modular drop-in component. While Paper 2 introduces a valuable benchmark for always-on assistants, benchmarks typically have narrower methodological impact compared to new training/inference frameworks. Co-ReAct's approach is more broadly applicable across reasoning agent architectures and introduces reusable methodological innovations.

vs. Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork

gemini-3.1 · 5/26/2026

Paper 2 addresses a highly active and broadly applicable area—enhancing LLM agent reasoning and search capabilities. By introducing step-level rubrics and a novel GRPO-based training objective for test-time guidance, it offers a constructive method that improves agent performance across models. In contrast, Paper 1 primarily introduces a benchmark and reports negative results for a more specialized subfield (In-Context RL for Ad-Hoc Teamwork), making Paper 2's potential real-world utility and methodological impact significantly broader.

vs. PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

claude-opus-4.6 · 5/26/2026

Co-ReAct introduces a novel rubric-guided action-selection framework with a theoretically grounded training objective (list-wise Spearman rank-correlation reward via GRPO), demonstrating broad applicability across multiple model scales and both open/closed-source models. Its contribution—using rubrics as step-level inference-time guidance rather than just evaluation signals—represents a more fundamental conceptual advance in agentic reasoning. PANDO, while practically valuable for efficiency gains on web tasks, is more narrowly scoped to multimodal web agents and focuses on engineering optimizations. Co-ReAct's modular rubric generator as a drop-in component gives it wider potential adoption across diverse agent architectures.

vs. ConceptM$^3$oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology

gpt-5.2 · 5/26/2026

Paper 1 likely has higher impact due to stronger real-world clinical applicability and breadth: it advances interpretable multimodal learning in computational pathology with concept bottlenecks plus residual pathways, and includes validation by an independent neuropathologist. If robust, this can influence medical AI deployment, regulation, and trust across healthcare domains. Paper 2 is timely and useful for LLM agent performance, but is more incremental (test-time rubric guidance + learned rubric generator) and its impact may be narrower or faster-moving given rapid agent-method turnover.

vs. AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning

gpt-5.2 · 5/26/2026

Paper 2 (Co-ReAct) is likely higher impact: it introduces a broadly applicable, inference-time rubric-guidance mechanism for ReAct agents plus a novel training objective (list-wise Spearman correlation to multi-judge consensus) that improves step-level decision making across multiple model scales and benchmarks, with public code enabling adoption. Its method is modular and can transfer across many agentic systems, boosting real-world reliability in research/search workflows. Paper 1 is innovative in collective reasoning, but its shared-hub multi-agent setup may be more complex to deploy and its impact may be narrower to multi-agent scaling scenarios.

vs. CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

claude-opus-4.6 · 5/26/2026

CUA-Gym addresses a fundamental bottleneck in training computer-use agents—the scarcity of scalable, verifiable training environments and rewards. It introduces a comprehensive pipeline for co-generating tasks, environments, and reward functions at scale (32K verified tuples across 110 environments), demonstrates strong empirical results on established benchmarks (OSWorld, WebArena), and promises to open-source all components. This infrastructure contribution has broader impact potential, enabling future RLVR research for CUAs. Co-ReAct, while a solid contribution on rubric-guided reasoning, addresses a narrower problem with more incremental improvements to existing ReAct-style agents.

vs. AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

claude-opus-4.6 · 5/26/2026

Co-ReAct introduces a novel rubric-guided action-selection framework that addresses a fundamental limitation of ReAct-style agents—reliance on internal judgment—with a principled approach using step-level rubric guidance and a list-wise GRPO training objective with Spearman rank-correlation reward. This has broader applicability across reasoning and search tasks, offers a transferable component (drop-in rubric generator), and advances both methodology (list-wise preference optimization) and practical agent design. AgentHijack, while valuable as a robustness benchmark, is more incremental and narrower in scope, focusing specifically on computer-use agent corruption scenarios.

vs. Learning to Search and Searching to Learn for Generalization in Planning

claude-opus-4.6 · 5/26/2026

Paper 1 presents a fundamentally novel self-improving framework combining classical AI search (WA*) with deep learning (relational GNNs) that achieves remarkable zero-shot combinatorial generalization (e.g., training on 30 blocks, solving 488). This bridges classical planning and DRL in a principled way, addressing a core challenge in AI. Paper 2 introduces a useful engineering contribution (rubric-guided ReAct agents) but is more incremental, building on existing prompting/agent paradigms. Paper 1's methodological innovation, theoretical depth, and demonstrated generalization capabilities suggest broader and more lasting impact across planning, RL, and combinatorial optimization.

vs. EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation

gpt-5.2 · 5/25/2026

Paper 2 has higher potential impact due to a more generally applicable training paradigm: reliably internalizing privileged context (persona/private facts/solutions) without unintended behavioral drift is central to practical LLM post-training and deployment. EDGE-OPD’s guided rollouts plus evidence-masked updates address a fundamental OPSD failure mode and offer a transferable mechanism for capability injection while preserving general performance, with clear relevance to safety, personalization, and knowledge transfer. Paper 1 is valuable but more specialized to ReAct-style search agents and inference-time control via rubrics, with narrower cross-field reach.

vs. EVE-Agent: Evidence-Verifiable Self-Evolving Agents

gpt-5.2 · 5/25/2026

Paper 1 has higher estimated scientific impact due to a more novel and broadly applicable principle: making self-evolution data generation evidence-verifiable via a measurable marginal-utility signal for retrieved spans, enabling auditable, label-free improvement. This directly addresses a central reliability failure mode (self-reinforced hallucinated curricula) and can generalize across search/RAG, agent training, and safety/verification. Paper 2 is timely and useful for improving ReAct trajectories, but rubric-guided inference and learned evaluators are a more incremental extension with narrower cross-field reach and potentially higher sensitivity to rubric quality/domain shift.

vs. SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents

gemini-3.1 · 5/25/2026

Paper 1 introduces a foundational benchmark for a critical bottleneck in LLM agents: skill generation and distillation. By isolating skill creation from execution and providing a standardized evaluation protocol, it is likely to shape future research directions and gather broad citations. Paper 2 presents a strong methodological improvement for ReAct agents, but as a benchmark establishing a new reproducible testbed, Paper 1 has higher potential for foundational, field-wide impact.

vs. Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

gpt-5.2 · 5/25/2026

Paper 2 likely has higher scientific impact due to addressing a core scalability bottleneck (KV-cache memory/time in long-context LLM inference) with broadly applicable techniques (dynamic meta-token synthesis + integration to mitigate information loss). This is timely for deployment and can impact many domains relying on long-context generation, benefiting systems, model-serving, and efficient LLM research. Paper 1 is novel and useful for agentic search/reasoning quality, but its impact is narrower (primarily ReAct-style agents and rubric generation) and more dependent on specific benchmarks and task setups.

vs. Mediative Fuzzy Logic: From Type-1 Foundations to Type-2, Type-3 and Quantum Extensions

gpt-5.2 · 5/25/2026

Paper 2 is likely higher impact due to timeliness and broad applicability in current LLM agent research, with clear empirical validation on benchmarks, practical deployment pathways (drop-in rubric generator), and open-source code enabling rapid adoption. The step-level rubric guidance and list-wise Spearman objective against expert consensus are novel and methodologically concrete. Paper 1 is conceptually ambitious (type-2/3/quantum extensions of mediative fuzzy logic) but appears more theoretical, with narrower immediate adoption prospects and less evidence of broad real-world uptake beyond illustrative examples.

vs. ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning

gemini-3.1 · 5/25/2026

Paper 1 addresses a fundamental system-level bottleneck (KV cache memory limits) in the rapidly growing area of tree-based LLM reasoning. By enabling up to 4x memory reduction, it unblocks hardware constraints and allows broader exploration of test-time search scaling. Paper 2 presents a valuable but narrower algorithmic improvement for ReAct agents; foundational infrastructure solutions like ArborKV typically have broader, more lasting impact across the field.

vs. CP or DP? Why Not Both: A Case Study in the Partial Shop Scheduling Problem

gpt-5.2 · 5/25/2026

Paper 1 has higher likely impact: it introduces a novel, generally applicable inference-time rubric-guided framework for LLM agents, plus a dedicated rubric generator trained with a distinctive list-wise rank-correlation objective, validated across multiple benchmarks and model families with released code—supporting reproducibility and adoption. Its applications (search/research agents, multi-step reasoning, tool use) are timely and broadly relevant across AI/NLP and agentic systems. Paper 2 is a solid hybrid DP+CP case study, but it is problem-specific (PSSP) and explicitly not competitive with state-of-the-art solvers, limiting real-world uptake and cross-field breadth.

vs. GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models

gpt-5.2 · 5/25/2026

Paper 1 likely has higher scientific impact: it introduces an evergreen, procedurally generated benchmark distribution plus new diagnostic constructs (capability profiles across multiple strategic axes and a “jaggedness” stability measure) for evaluating LLM strategic behavior in deployment-relevant, variable environments. This is a broadly useful methodological contribution for AI evaluation, safety, and agentic economics, with potential to become a standard stress-test framework resistant to saturation/contamination. Paper 2 is a solid, practical inference-time control improvement for ReAct agents, but is more incremental and narrower in scope.

vs. The Impact of AI Usage and Informativeness on Skill Development in Logical Reasoning

gemini-3.1 · 5/25/2026

Paper 1 presents a novel, actionable technical advancement (Co-ReAct) that directly improves the reasoning and search capabilities of LLM agents, a highly active and critical area of AI research. Its introduction of a trained rubric generator using a list-wise reward offers a reusable, drop-in component for future research. While Paper 2 addresses an important societal question regarding human-AI interaction, Paper 1 is likely to have a more immediate and measurable scientific impact through direct adoption and citations in the rapidly evolving field of autonomous AI agents.

vs. Agentic Proving for Program Verification

gpt-5.2 · 5/25/2026

Paper 2 has higher estimated impact due to broader applicability and clearer methodological novelty: a step-level rubric-guided inference framework plus a dedicated rubric generator trained with a list-wise Spearman rank-correlation objective against multi-judge rankings. It targets a widely used agent paradigm (ReAct), shows consistent gains across multiple benchmarks and model families, and releases code, improving reproducibility and adoption. Paper 1 is timely and valuable for program verification benchmarking, but its impact is narrower (Lean/CLEVER-specific) and depends heavily on evaluating a particular proprietary system, limiting generalization and reuse.

vs. Foundation Protocol: A Coordination Layer for Agentic Society

gemini-3.1 · 5/25/2026

Paper 1 offers a concrete, methodologically rigorous approach to improving LLM agents with empirical validation on established benchmarks and open-source code. In contrast, Paper 2 presents a broad, ambitious conceptual framework for multi-agent coordination but lacks the empirical evidence and experimental validation typical of high-impact scientific research. Paper 1's timely solution to agent reasoning bottlenecks, backed by measurable results, gives it higher immediate scientific utility and impact.

vs. SkillOpt: Executive Strategy for Self-Evolving Agent Skills

claude-opus-4.6 · 5/25/2026

SkillOpt demonstrates broader impact with its systematic text-space optimizer for agent skills, showing improvements across 52 evaluation cells spanning six benchmarks, seven models, and three execution harnesses. Its conceptual contribution—treating skills as optimizable external state with learning-rate budgets and validation-based acceptance—introduces a novel paradigm analogous to weight-space optimization. The transfer results across models, environments, and tasks further strengthen its generalizability. Co-ReAct contributes meaningful step-level rubric guidance for ReAct agents but addresses a narrower problem scope with fewer benchmarks and less demonstrated generalization.