CP-Agent: Context-Aware Multimodal Reasoning for Cellular Morphological Profiling under Chemical Perturbations

Yuxin Zhang, Yiyao Li, Ping Shu Ho, Simon See, Zhenqin Wu, Kevin Tsia

Jun 2, 2026

arXiv:2606.03435v1

cs.AI (primary)

#940 of 3404 · Artificial Intelligence

Tournament Score: 1450±45 (scale 1050 to 1800)

Win Rate: 56%

Rating: 6.8/10

Significance 7 · Rigor 6.5 · Novelty 6.5 · Clarity 7

Abstract

Cell Painting combines multiplexed fluorescent staining, high-content imaging, and quantitative analysis to generate high-dimensional phenotypic readouts to support diverse downstream tasks such as mechanism-of-action (MoA) inference, toxicity prediction, and construction of drug-disease atlases. However, existing workflows are slow, costly and difficult to interpret. Approaches for drug screening modeling predominantly focus on molecular representation learning, while neglecting actual experimental context (e.g., cell line, dosing schedule, etc.), limiting generalization and MoA resolution. We introduce CP-Agent, an agentic multimodal large language model (MLLM) capable of generating mechanism-relevant, human-interpretable rationales for cell morphological changes under drug perturbations. At its core, CP-Agent leverages a context-aware alignment module, CP-CLIP, that jointly embeds high-content images and experimental metadata to enable robust treatment and MoA discrimination (achieving a maximum F1-score of 0.896). By integrating CP-CLIP outputs with agentic tool usage and reasoning, CP-Agent compiles rationales into a structured report to guide experimental design and hypothesis refinement. These capabilities highlight CP-Agent's potential to accelerate drug discovery by enabling more interpretable, scalable, and context-aware phenotypic screening -- streamlining iterative cycles of hypothesis generation in drug discovery.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: CP-Agent

1. Core Contribution

CP-Agent addresses a genuine gap at the intersection of high-content imaging, drug discovery, and multimodal AI. The paper makes two interlinked contributions: (1) CP-CLIP, a contrastive alignment module that jointly embeds Cell Painting microscopy images with structured experimental metadata (cell line, compound identity, dose, time) through a custom token injection strategy; and (2) CP-Agent, a modular agentic system that chains perception, segmentation, feature extraction, statistical synthesis, and LLM-based report generation to produce interpretable mechanistic rationales for observed morphological changes.
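As a rough illustration of what "contrastive alignment" means here, the sketch below computes a CLIP-style symmetric contrastive loss over paired image and context embeddings. It is an assumption that CP-CLIP uses this exact objective; the assessment only states that the module is contrastive. The function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def diagonal_log_probs(logits):
    """For each row i, log-softmax probability assigned to column i (the true pair)."""
    out = []
    for i, row in enumerate(logits):
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        out.append(row[i] - log_sum_exp)
    return out

def symmetric_contrastive_loss(image_embs, context_embs, temperature=0.07):
    """CLIP-style loss: row i of each list is assumed to be a matching pair.

    The temperature of 0.07 is an illustrative default, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    ctxs = [l2_normalize(v) for v in context_embs]
    # Cosine similarity between every image and every context, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(img, ctx)) / temperature for ctx in ctxs]
              for img in imgs]
    logits_t = [list(col) for col in zip(*logits)]
    image_to_context = -sum(diagonal_log_probs(logits)) / len(imgs)
    context_to_image = -sum(diagonal_log_probs(logits_t)) / len(ctxs)
    return 0.5 * (image_to_context + context_to_image)

# Toy embeddings (made up): correctly paired vs. deliberately mismatched.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
```

The loss is low when each image embedding is closest to its own context embedding and high when the pairs are mismatched, which is the property the alignment module relies on.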

The key novelty lies in the context-aware token projection mechanism, where continuous molecular descriptors, normalized concentration, and time are injected as learned embeddings into placeholder positions within the text encoder's token sequence. This is a clean engineering solution to the problem of encoding heterogeneous metadata types (categorical, continuous, structured chemical) within a contrastive learning framework. The paired perturbation-control image input design is also sensible for capturing treatment-specific morphological shifts.
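The token injection idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the placeholder token ids, the embedding width, the random "learned" weights, and every function name below are hypothetical. It shows only the mechanism the paragraph describes, where continuous values are projected to the embedding width and written into reserved positions of the token sequence.

```python
import random

EMBED_DIM = 4  # hypothetical embedding width, kept tiny for readability

def make_projection(in_dim, out_dim, rng):
    """Return a linear map; random weights stand in for learned parameters."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    def project(values):
        return [sum(w * x for w, x in zip(row, values)) for row in weights]
    return project

def inject_context(token_ids, token_table, placeholders, projected):
    """Embed ordinary tokens by lookup; fill placeholder positions with projections."""
    sequence = []
    for token_id in token_ids:
        if token_id in placeholders:
            sequence.append(projected[placeholders[token_id]])
        else:
            sequence.append(token_table[token_id])
    return sequence

rng = random.Random(0)
# Toy vocabulary of ten ordinary tokens with random embeddings.
token_table = {tid: [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
               for tid in range(10)}
# Hypothetical reserved ids marking where each continuous field goes.
placeholders = {100: "descriptor", 101: "dose", 102: "time"}

project_descriptor = make_projection(3, EMBED_DIM, rng)
project_dose = make_projection(1, EMBED_DIM, rng)
project_time = make_projection(1, EMBED_DIM, rng)

projected = {
    "descriptor": project_descriptor([0.3, -1.2, 0.8]),  # made-up descriptor values
    "dose": project_dose([0.5]),                         # normalized concentration
    "time": project_time([0.25]),                        # normalized exposure time
}

# A prompt whose placeholder positions are 2, 4, and 6.
token_ids = [1, 4, 100, 7, 101, 2, 102]
sequence = inject_context(token_ids, token_table, placeholders, projected)
```

The resulting `sequence` has the same length as the prompt, so it could be passed to a text encoder in place of the usual embedding lookup; categorical fields such as cell line would remain ordinary tokens in this scheme.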

2. Methodological Rigor

Strengths in evaluation design: The paper evaluates CP-CLIP across multiple axes—seen-drug classification, unseen-drug matching, embedding structure analysis, and dose-response trajectory visualization—providing a reasonably comprehensive picture. The comparison against four frontier MLLMs (GPT-5, Grok-4, Claude-4-Sonnet, Gemini-2.5-Pro) on Cell Painting tasks convincingly demonstrates that general-purpose models fail at compound identification (near-zero F1), establishing the need for domain-specific perception.

Concerns:

The classification benchmark (Table 2) uses only 10 randomly sampled compounds, which is a narrow evaluation. The F1 of 0.896 is impressive but should be contextualized against the scale of real drug screening libraries (hundreds to thousands of compounds).

The comparison with CLOOME is somewhat indirect—CLOOME uses molecular structure as input while CP-CLIP uses molecular descriptors plus full experimental context, making it hard to isolate the contribution of context injection versus simply having richer input.

The unseen-drug matching evaluation (Table 3) reports cosine similarities rather than task-specific metrics. Average similarity of 0.432-0.444 is modest in absolute terms, though the improvement over baselines is consistent.

The expert evaluation (N=11) with Kendall's W of 0.33-0.37 indicates only fair inter-rater agreement, and no significant differences between LLM backends were found, somewhat weakening claims about reasoning quality.

The counterfactual prompt experiments (Appendix S) are a valuable addition, demonstrating CP-CLIP is not merely exploiting metadata shortcuts, though these appear only in the appendix.
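For readers unfamiliar with the agreement statistic cited in the concerns above, Kendall's W for m raters ranking n items is W = 12S / (m²(n³ − n)), where S is the sum of squared deviations of the per-item rank sums from their mean. The sketch below implements the version without tie correction. The function name and the toy rankings are illustrative and are not data from the paper.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance, without tie correction.

    rankings: list of m lists; each inner list gives one rater's ranks
    (1..n, no ties) for the same n items in the same order.
    """
    m = len(rankings)
    n = len(rankings[0])
    rank_sums = [sum(rater[i] for rater in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    s = sum((rs - mean_rank_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Toy data (made up): three raters in full agreement, then two in full opposition.
full_agreement = kendalls_w([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
full_opposition = kendalls_w([[1, 2, 3], [3, 2, 1]])
```

W ranges from 0 (no agreement) to 1 (identical rankings), so the reported 0.33 to 0.37 sits in the lower third of that range.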

3. Potential Impact

Drug discovery workflows: CP-Agent's most compelling value proposition is translating opaque high-dimensional morphological profiles into human-readable mechanistic hypotheses. The case studies (Taxol, Sorbinil, BGT226) demonstrate the system's ability to handle canonical, subtle, and complex phenotypes respectively, with appropriate uncertainty flagging. This could genuinely accelerate iterative hypothesis refinement in phenotypic screening.

Scalability questions: The system depends on CellProfiler feature extraction, fine-tuned segmentation models, and multiple LLM calls per image pair. The practical throughput for large-scale screens (millions of images) is unclear and likely limiting.

Broader applicability: The context-aware contrastive alignment paradigm could generalize to other experimental biology domains where metadata is rich but underutilized (e.g., spatial transcriptomics, flow cytometry). The modular agent architecture is extensible, though the current instantiation is tightly coupled to Cell Painting.

4. Timeliness & Relevance

This work is well-timed. Cell Painting has become increasingly central to pharmaceutical phenotypic screening, with growing public datasets (JUMP-CP, RxRx). Simultaneously, agentic AI systems are emerging rapidly but have barely been applied to high-content imaging. The paper fills this specific niche. The training on 1.9M image-context pairs across three public datasets provides a useful scale of pretraining for this domain.

However, the concurrent emergence of works like CLOOME, MolPhenix, and CellCLIP means the contrastive alignment component is evolutionary rather than revolutionary. The agentic report generation is the more distinctive contribution.

5. Strengths & Limitations

Key Strengths:

Well-motivated integration of experimental context that is typically discarded or poorly utilized

Comprehensive system design from perception to report generation with clear modularity

Strong demonstration that frontier MLLMs cannot handle Cell Painting data zero-shot, establishing the need for specialized perception

Dose-response embedding trajectories (Figure 3c) provide compelling evidence of biologically meaningful representations

Extensive appendix with reproducibility details, prompts, and ablations

Notable Limitations:

The "agentic" framing is somewhat generous—the workflow is largely a predetermined pipeline with LLM reasoning at specific nodes rather than truly autonomous planning

No comparison against traditional Cell Painting analysis pipelines (e.g., standard CellProfiler → morphological profiling → NSC/MoA enrichment) on the same MoA prediction task

The report quality assessment lacks ground-truth comparison—reports are evaluated for internal consistency and expert preference, not factual correctness against known biology

Training requires paired control-perturbation images with metadata, limiting applicability to well-annotated datasets

The paper does not address computational cost or latency, which matters for practical deployment

Descriptor vs. fingerprint comparison shows mixed results across tasks, without clear guidance on when to prefer each

Summary

CP-Agent represents a solid systems-level contribution that thoughtfully integrates contrastive multimodal learning with agentic LLM reasoning for an important application domain. The CP-CLIP component demonstrates clear improvements over baselines for context-aware morphological profiling. The agentic pipeline, while more of a structured workflow than truly autonomous reasoning, produces useful interpretable outputs. The work's impact will depend on community adoption and whether the approach scales to realistic drug screening campaigns. It establishes a useful paradigm but leaves open questions about scalability, biological validation, and comparison with established computational phenotyping methods.

Rating: 6.8/10

Significance 7 · Rigor 6.5 · Novelty 6.5 · Clarity 7

Generated Jun 3, 2026

Comparison History (16)

vs. When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

gemini-3.1 · 6/5/2026

Paper 2 introduces a highly interdisciplinary application of multimodal LLMs to accelerate drug discovery, offering significant real-world impact and cross-field utility (AI and biomedicine). While Paper 1 provides valuable meta-analysis on AI evaluation methodologies, Paper 2's direct contribution to scalable, interpretable biological research and hypothesis generation presents a higher ceiling for both scientific advancement and societal impact.

vs. SCI-PRM: A Tool Aware Process Reward Model for Scientific Reasoning Verification

gemini-3.1 · 6/5/2026

While Paper 2 presents a highly valuable domain-specific tool for drug discovery, Paper 1 has a broader potential scientific impact. By extending Process Reward Models (PRMs) beyond mathematics into general scientific reasoning (biology, chemistry, physics) and addressing critical LLM hallucination issues via tool-aware verification, Paper 1 contributes to foundational AI methodology. Its ability to enable test-time scaling and improve reinforcement learning for broad scientific problem-solving gives it wider applicability and greater potential to accelerate research across multiple disciplines.

vs. Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories

gpt-5.2 · 6/5/2026

Paper 2 likely has higher scientific impact due to broad relevance and timeliness: inference-time safety vulnerabilities affect nearly all deployed LLMs across domains. It extends “shallow safety” to a more general, actionable threat model (mid-sequence token injections) and proposes a methodology (trajectory-based alignment via simulated perturbations) that could influence alignment training practices widely. Paper 1 is innovative for phenotypic screening and drug discovery, but its impact is narrower (Cell Painting workflows) and depends on dataset access and biological validation; Paper 2’s insights generalize across models and applications.

vs. Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

claude-opus-4.6 · 6/3/2026

CP-Agent addresses a critical bottleneck in drug discovery by combining multimodal LLMs with cell painting analysis, offering interpretable MoA inference and context-aware phenotypic screening. Its interdisciplinary impact spans AI, biology, and pharmacology, with direct real-world applications in accelerating drug discovery pipelines. While Harness-1 is a solid engineering contribution to search agents with strong benchmark results, it represents an incremental improvement in retrieval methodology. CP-Agent's novelty in bridging high-content imaging with agentic reasoning and its potential to transform experimental workflows gives it broader and deeper scientific impact.

vs. AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

claude-opus-4.6 · 6/3/2026

CP-Agent introduces a novel multimodal agent framework for cell morphological profiling that directly addresses a critical bottleneck in drug discovery—interpretability and context-awareness in phenotypic screening. Its CP-CLIP alignment module is a concrete technical contribution with strong quantitative results (F1=0.896), and its real-world application to accelerating drug discovery gives it high translational impact. While AutoMedBench is a solid benchmarking contribution for evaluating medical AI agents, benchmarks tend to have narrower long-term impact compared to novel methodological frameworks that enable new capabilities in high-value domains like pharmaceutical development.

vs. Recognize Your Orchestrator: An Entropy Dynamics Perspective for LLM Multi-Agent Systems

claude-opus-4.6 · 6/3/2026

CP-Agent addresses a critical bottleneck in drug discovery by combining multimodal LLMs with Cell Painting data for interpretable phenotypic screening. It has direct real-world applications in pharmaceutical research, offers strong methodological innovation (context-aware alignment via CP-CLIP, agentic reasoning), and demonstrates concrete quantitative results (F1=0.896). Paper 2 provides interesting theoretical analysis of MAS orchestration dynamics and identifies the 'Reasoning Trap,' but its impact is more narrowly focused on LLM system design. CP-Agent's intersection of AI and drug discovery gives it broader interdisciplinary impact and more immediate practical relevance.

vs. Decomposing how prompting steers behavior

gpt-5.2 · 6/3/2026

Paper 2 likely has higher scientific impact due to its general, mechanistic contribution to understanding how prompting alters internal representations across multiple LLMs/VLMs, tasks, and modalities. The nested geometric decomposition plus causal layerwise state-mapping tests provide strong methodological rigor and a broadly applicable analysis toolkit for interpretability, controllability, and model design. Its relevance is high given widespread reliance on prompting. Paper 1 is impactful for drug discovery, but it is more domain-specific and its “agentic” component is less fundamentally novel than the cross-model causal geometry framework in Paper 2.

vs. Joint Agent Memory and Exploration Learning via Novelty Signals

gpt-5.2 · 6/3/2026

Paper 1 targets a high-value, domain-critical bottleneck in drug discovery: interpretable, context-aware phenotypic screening from Cell Painting with experimental metadata. The approach is novel in tightly aligning multimodal imaging with perturbation context and producing human-interpretable MoA rationales, with clear real-world translational applications and broad relevance across chemical biology, pharmacology, and biomedical AI. Paper 2 is timely and potentially impactful for agent exploration, but novelty-signal supervision (e.g., code coverage) is more domain-specific and likely to have narrower immediate cross-field and societal impact than accelerating scalable, interpretable drug screening.

vs. Inducing Reasoning Primitives from Agent Traces

gemini-3.1 · 6/3/2026

Paper 1 presents a foundational methodological advancement in LLM agent reasoning by automating the discovery of reusable reasoning primitives. Its generalizable approach offers broad impact across numerous AI applications, outperforming existing baselines across diverse tasks. While Paper 2 provides high value in a specific domain (drug discovery), Paper 1's domain-agnostic framework for self-improving agents is likely to drive wider adoption, stimulate more subsequent research, and have a more profound, cross-disciplinary scientific impact in the rapidly evolving field of AI.

vs. Evaluating Bivariate Causal Statements Based on Mutual Compatibility

gpt-5.2 · 6/3/2026

Paper 2 likely has higher impact due to strong real-world applicability in drug discovery and phenotypic screening, a timely multimodal/LLM-based approach, and broader downstream utility (MoA inference, toxicity, experimental design). Its reported quantitative performance (F1=0.896) and integration of images + metadata with interpretable reports suggest practical deployability and cross-field relevance (biology, ML, pharma). Paper 1 is conceptually novel for evaluating causal claim sets and LLM causal outputs, but its immediate applications and empirical validation breadth may be narrower and more dependent on assumptions/modeling choices.

vs. EvoBrain: Continual Learning of EEG Foundation Models Across Heterogeneous BCI Tasks

gemini-3.1 · 6/3/2026

Paper 2 addresses a critical bottleneck in drug discovery by integrating multimodal LLMs with high-content biological imaging. The application of agentic reasoning to generate human-interpretable rationales for morphological changes has profound implications for pharmaceutical research, accelerating mechanism-of-action discovery and toxicity prediction. While Paper 1 presents an innovative approach to BCI scalability, the massive real-world healthcare and economic impact of accelerating drug discovery using foundation models arguably gives Paper 2 a broader and more transformative scientific footprint.

vs. Overlaying Governance: A Compositional Authorization Framework for Delegation and Scope in Agentic AI

gpt-5.2 · 6/3/2026

Paper 1 is more likely to have higher broad scientific impact: it targets a foundational, cross-domain problem (authorization/delegation for agentic systems) with a compositional framework that can overlay onto existing IAM policies, suggesting wide applicability across AI deployments, cybersecurity, finance, and governance. Its emphasis on formal relational definitions, compositional operators, proofs, and empirical evaluation indicates higher methodological rigor and potential to become a standard reference as agentic AI adoption grows. Paper 2 is timely and impactful for drug discovery, but its scope is narrower to Cell Painting workflows and specific modeling pipelines.

vs. Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models

gemini-3.1 · 6/3/2026

Paper 1 offers substantial potential for real-world application in drug discovery, a field where accelerating hypothesis generation and experimental design has massive societal and economic value. By combining high-content imaging, experimental metadata, and MLLMs, it demonstrates strong cross-disciplinary innovation and tackles a highly complex bottleneck in biomedical research. While Paper 2 presents a solid methodological improvement for LLMs, Paper 1's integration of multimodal AI into a specialized scientific workflow promises a broader and more transformative impact across both artificial intelligence and computational biology.

vs. ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

gemini-3.1 · 6/3/2026

Paper 2 addresses a critical and highly timely challenge in foundational AI: the inefficiency and 'over-thinking' of Large Reasoning Models (LRMs). By reducing token usage by 56% without sacrificing accuracy, ThoughtFold offers massive computational savings and scalability improvements for state-of-the-art models. While Paper 1 presents an innovative application of multimodal LLMs in drug discovery, Paper 2's methodological advancements in preference learning and reasoning efficiency will have a much broader impact across all fields utilizing advanced AI, making its overall scientific and practical footprint significantly larger.

vs. When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning

gemini-3.1 · 6/3/2026

Paper 1 addresses a fundamental mechanism in LLM reasoning (multi-agent debate) with exceptional methodological rigor, evaluating over 6,000 pairs and deriving a broadly applicable mathematical condition. Its findings generalize across multiple domains, offering foundational insights for the rapidly growing field of AI agents. While Paper 2 presents a valuable application in drug discovery, Paper 1's theoretical contributions and cross-domain generalizability promise a wider breadth of impact across the entire AI community.

vs. Regularized Offline Policy Optimization with Posterior Hybrid Bayesian Belief

gpt-5.2 · 6/3/2026

Paper 1 likely has higher scientific impact due to stronger real-world applicability and timeliness: context-aware multimodal modeling for Cell Painting can directly accelerate drug discovery workflows and improve interpretability in high-content screening, a major bottleneck in pharma/biomed. Its integration of images + experimental metadata and agentic reporting is a novel, translational contribution with cross-field reach (biology, cheminformatics, ML, automation). Paper 2 is methodologically rigorous and valuable for offline RL, but its impact is more specialized and benchmark-driven, with less immediate downstream adoption outside RL.