Kairos Team, Fei Wang, Shan You, Qiming Zhang, Tao Huang, Zuoyi Fu, Zhisheng Zheng, Yunlong Xi
World models are transitioning from passive visual generators to foundational, operational infrastructure for Physical AI: they must natively acquire world knowledge from heterogeneous experience, maintain persistent states over long horizons, and execute efficiently within real deployment constraints. We introduce Kairos, a native world model stack designed around these requirements. (1) Kairos learns the world by pioneering a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum, which organizes open-world videos, human behavioral data, and robot interactions into a progressive developmental pathway. (2) Kairos maintains the world by unifying world understanding, generation, and prediction within a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding-window attention captures local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention maintains persistent global memory. We establish formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation, mathematically guaranteeing state propagation across extended horizons. (3) Kairos runs the world by incorporating a Deployment-Aware System Co-Design to support low-latency rollout generation on server and consumer-grade hardware for real-world observation-action-feedback loops. Experiments on embodied world-model, long-horizon, and action-policy benchmarks show that Kairos achieves top-level performance while offering a strong efficiency-capability trade-off. Together, these results position Kairos as a cohesive operational foundation for future self-evolving physical intelligence.
Kairos presents an integrated "world model stack" targeting three interconnected challenges in Physical AI: (1) learning from heterogeneous cross-embodiment data, (2) maintaining persistent world states over long horizons, and (3) deploying efficiently on real hardware. The paper's key intellectual contributions are:
Architecture: The hybrid attention design is well-motivated and clearly described. The combination of local (SWA), mid-range (DSWA), and global (GLA via GatedDeltaNet) attention mechanisms is a sensible engineering choice with clear computational benefits — linear scaling in sequence length versus quadratic for standard attention.
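To make the linear-versus-quadratic point concrete, the following is a minimal sketch of which key positions a query can see under a causal sliding window and a causal dilated sliding window, and how the number of query-key pairs grows with sequence length. The function names, the window size of 16, and the dilation of 4 are illustrative assumptions, not values taken from the paper. The global gated-linear-attention branch is not modeled here; the sketch only assumes that it keeps a fixed-size recurrent state and therefore adds a constant cost per token.

```python
def swa_keys(t, window):
    """Key positions visible to query t under causal sliding-window attention."""
    return list(range(max(0, t - window + 1), t + 1))


def dswa_keys(t, window, dilation):
    """Key positions visible to query t under a causal dilated sliding window.

    The window still holds at most `window` keys, but they are spaced
    `dilation` positions apart, so it reaches back (window - 1) * dilation.
    """
    positions = (t - k * dilation for k in range(window - 1, -1, -1))
    return [p for p in positions if p >= 0]


def attended_pairs(seq_len, window, dilation):
    """Count query-key pairs for full causal attention vs. SWA plus DSWA."""
    full = seq_len * (seq_len + 1) // 2
    local = sum(len(swa_keys(t, window)) for t in range(seq_len))
    dilated = sum(len(dswa_keys(t, window, dilation)) for t in range(seq_len))
    return full, local + dilated


if __name__ == "__main__":
    # Hypothetical settings chosen only for illustration.
    for n in (256, 512, 1024):
        full, hybrid = attended_pairs(n, window=16, dilation=4)
        print(f"seq_len={n}: full causal pairs={full}, SWA+DSWA pairs={hybrid}")
```

Each windowed branch contributes at most `window` pairs per query, so the combined count is bounded by `2 * window * seq_len`, while full causal attention grows as `seq_len * (seq_len + 1) / 2`.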
Theoretical Analysis: The paper provides formal theorems establishing (a) the necessity of persistent latent states beyond sliding-window attention (Theorem 1) and (b) the approximate sufficiency of the hybrid multi-scale memory under contraction assumptions (Theorem 2/4). While mathematically correct, these results are somewhat expected — the necessity result essentially restates that conditioning on less information yields worse predictions, and the sufficiency result depends on assumptions (Lipschitz decoder, contractive updates, Bayes predictor factorization) that may not hold in practice. The gap between the theoretical guarantees and empirical behavior is not bridged.
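For readers unfamiliar with this style of argument, a generic contraction bound of the kind the sufficiency result appears to rely on can be sketched as follows. This is a textbook derivation under stated assumptions, not the paper's Theorem 2/4; all symbols are introduced here for illustration.

```latex
% Assumptions (illustrative, not the paper's exact statement):
%   e_t          : norm of the error in the persistent state after step t
%   \rho         : contraction factor of the state update, 0 \le \rho < 1
%   \varepsilon  : bound on the error injected at each step
%   L            : Lipschitz constant of the decoder
\begin{align}
e_{t+1} &\le \rho\, e_t + \varepsilon \\
e_T &\le \rho^{T} e_0 + \varepsilon \sum_{k=0}^{T-1} \rho^{k}
     = \rho^{T} e_0 + \varepsilon\,\frac{1-\rho^{T}}{1-\rho}
     \le \rho^{T} e_0 + \frac{\varepsilon}{1-\rho} \\
\|\hat{y}_T - y_T\| &\le L\, e_T
\end{align}
```

The bound is independent of the horizon T only because \rho < 1. With \rho = 1 the sum grows linearly in T, and with \rho > 1 it grows geometrically, which is why the contraction and Lipschitz assumptions carry most of the weight in a guarantee of this form.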
Experimental Evaluation: Benchmarks span WorldModelBench, DreamGen, PAI-Bench, RoboTwin 2.0, LIBERO-Plus, and VideoPhy. Kairos achieves state-of-the-art or near-SOTA results across most benchmarks, often with significantly fewer parameters (4B vs. 14-28B competitors). The ablation studies on human-centric data scaling, VLM encoder choice, and joint training are informative but limited in scope. Notably, many baseline results are "reproduced by our team" (marked with *), which introduces potential confounds. Real-world robot deployment results are conspicuously absent — all evaluations are on simulation benchmarks.
Efficiency Claims: The linear scaling claim is well-supported by the DiT step timing curves showing near-perfect linearity (R² = 0.9997). The 28-85× speedup over Cosmos-Predict2.5-14B is impressive and practically significant.
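As a reminder of what an R² value near 1 measures, here is a minimal sketch of an ordinary least-squares line fit and its coefficient of determination. The helper name `linear_fit_r2` and the data points are synthetic and hypothetical; they are not the paper's timing measurements.

```python
def linear_fit_r2(xs, ys):
    """Fit y = slope * x + intercept by least squares and return (slope, intercept, R^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Synthetic example: sequence lengths and made-up step times.
    lengths = [1, 2, 3, 4, 5]
    linear_times = [2.0 * n + 1.0 for n in lengths]   # exactly linear
    quadratic_times = [float(n * n) for n in lengths]  # quadratic growth
    print(linear_fit_r2(lengths, linear_times))
    print(linear_fit_r2(lengths, quadratic_times))
```

A high R² over a narrow range of sequence lengths does not by itself rule out a small quadratic term, so the range of lengths covered by the timing curves matters when interpreting the reported value.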
Immediate Applications: The efficiency gains are practically significant for robotics: real-time 480p video generation on A800 GPUs and consumer-grade deployment on an RTX 5090 could enable broader adoption of world models in robotics research.
Broader Influence: The CEDC paradigm of progressive cross-embodiment training could influence how the community approaches data organization for embodied AI. The philosophical shift from "fine-tune video generators for robots" to "natively pre-train for physical AI" is timely and could set a new standard, though the evidence that this native approach is fundamentally superior (rather than merely better-engineered) needs stronger ablation.
Limitations on Impact: Without real-world deployment results, the "deployment-aware" claims remain only partially validated. The self-evolution framework (Section 5.1) is described aspirationally but not empirically validated beyond prompt rewriting.
The paper arrives at a critical juncture where world models are transitioning from visual generation to operational infrastructure. The explicit comparison and positioning against Cosmos, V-JEPA, Genie 3, and Dreamer 4 demonstrates awareness of the competitive landscape. The emphasis on efficiency and deployment readiness addresses a genuine bottleneck — many world models remain impractical for closed-loop robotics.
The paper's framing as a "stack" is strategic but raises the question of whether the integrated system's benefits arise from genuine architectural synergy or from careful engineering and data curation. The 34× data engineering speedup (Table 3) suggests substantial infrastructure investment that may not be replicable by most research groups.
The distillation results (4-step inference) are practically valuable, but the observed "motion diminution" and "visual homogenization" artifacts suggest fundamental limitations in the distillation approach that are only partially addressed.
Generated Jun 16, 2026
Paper 1 fundamentally reshapes LLM evaluation by invalidating a widespread but flawed methodological trend: treating LLMs as human proxies for psychological testing. By proving these profiles are measurement artifacts via rigorous psychometric frameworks, it prevents wasted effort and bad science across AI safety, psychology, and HCI. While Paper 2 offers an impressive, theoretically grounded systems architecture for physical AI, Paper 1 promises broader, longer-lasting scientific impact by correcting a foundational measurement error in a rapidly growing, cross-disciplinary research area.
Paper 2 (Kairos) has higher estimated scientific impact due to its broader, more foundational scope: a unified world-model stack spanning data curriculum, architecture, theory (formal bounds on error accumulation), and deployment-efficient systems. This combination targets a core bottleneck for “Physical AI” across robotics, embodied agents, and video/sequence modeling, increasing cross-field spillover. Paper 1 (ENPIRE) is highly practical and timely for autonomous real-world robotics iteration, but is more framework/engineering-centric and likely narrower in generality beyond manipulation pipelines and specific lab setups.
Paper 2 (Kairos) targets a broader, highly active area—scalable world models for Physical AI/robotics—with direct real-world deployment implications (long-horizon state, efficiency, system co-design). Its contributions span data curriculum, unified architecture, theoretical error-accumulation bounds, and deployment constraints, suggesting cross-field impact (ML, robotics, systems). Paper 1 is methodologically strong and novel for autoformalization faithfulness with solid benchmarks and theory, but its primary impact is narrower to formal methods and math proof assistants. Overall, Paper 2 is likely to have wider and more timely scientific and practical influence.
Kairos addresses the fundamental infrastructure challenge for Physical AI—world models that can learn, maintain state, and deploy efficiently across embodiments. Its breadth of impact spans robotics, embodied AI, and autonomous systems, with novel contributions in cross-embodiment pre-training, a theoretically grounded temporal attention architecture, and deployment-aware design. While User as Code is a clever and practical contribution to personalized agents with strong benchmark results, its scope is narrower (user memory for conversational agents). Kairos's potential to serve as foundational infrastructure for physical AI gives it broader and deeper scientific impact.
Kairos addresses a fundamental infrastructure challenge for Physical AI with broad implications across robotics, autonomous systems, and embodied intelligence. Its contributions span architecture design (hybrid temporal attention with theoretical guarantees), training paradigm (cross-embodiment curriculum), and deployment optimization—offering a comprehensive stack with wide applicability. Paper 2 makes an important contribution to AI safety by formalizing hallucination-to-action conversion and proposing ECA, but it addresses a narrower problem (multimodal agent safety in tool-use settings). Kairos's breadth of impact across robotics, simulation, and embodied AI, combined with its foundational nature, gives it higher potential scientific impact.
Kairos presents a comprehensive technical system (world model stack) addressing critical infrastructure needs for Physical AI with novel contributions in architecture design, training paradigms, and deployment optimization, backed by theoretical guarantees and empirical benchmarks. Its potential impact spans robotics, embodied AI, and autonomous systems—fields with massive industrial investment. While Paper 2 raises important conceptual points about AI pluralism and ontological flattening, it offers a preliminary qualitative framework (PLG) without validated empirical results, limiting its near-term scientific impact compared to Kairos's immediately actionable technical contributions.
MiniMax Sparse Attention (MSA) addresses a fundamental and immediate bottleneck in LLM deployment—quadratic attention cost at long contexts—with a practical, well-engineered solution achieving 28.4x compute reduction and significant wall-clock speedups on production hardware. It is deployed in a publicly released 109B-parameter model, demonstrating real-world impact. The open-sourced kernel and integration with GQA make it broadly adoptable. While Kairos proposes an ambitious world model stack for Physical AI, it spans many components with less focused depth, and its real-world deployment impact remains more speculative. MSA's immediate applicability to the massive LLM ecosystem gives it broader near-term impact.
Paper 1 introduces a foundational world model stack for Physical AI, a transformative and rapidly growing field. Its comprehensive approach—combining novel pre-training paradigms, an innovative architecture with theoretical bounds, and deployment-aware design—promises significant advancements in robotics and embodied AI. In contrast, Paper 2 offers valuable insights into affect forecasting but addresses a narrower niche in NLP and psychology, primarily showing that simple numeric baselines outperform textual models for future predictions.
Paper 1 introduces a foundational architecture and training paradigm for physical AI world models, supported by theoretical bounds on error accumulation and deployment-aware co-design. Its comprehensive approach to enabling embodied AI represents a core capability leap with massive downstream applications. While Paper 2 offers a highly practical and clever auditing protocol for LLMs, Paper 1's development of a native world model stack is likely to have a more transformative and foundational impact on the future development of robotics and self-evolving physical intelligence.
Paper 1 leverages an unprecedented dataset of 200 million enrollees to train a healthcare foundation model, bridging a critical gap in utilizing administrative claims for AI. Its immediate, demonstrable improvements in disease-onset prediction (especially rare diseases), healthcare expenditure forecasting, and bias reduction in trial emulation offer massive, near-term real-world applicability in a high-stakes domain. While Paper 2 presents strong architectural advancements for physical AI, Paper 1's scale, external validation, and direct impact on medical decision-making and epidemiology give it a higher potential for broad scientific and societal impact.