FF-JEPA: Long-Horizon Planning in World Models with Latent Planners

Sergi Masip, Jonathan Swinnen, Yutong Hu, Renaud Detry, Tinne Tuytelaars

cs.AI
#1839 of 3489 · Artificial Intelligence
Tournament Score: 1393 ± 44
Win Rate: 50% (9 wins, 9 losses, 18 matches)
Rating: 4.2 / 10
Significance 4.5 · Rigor 3.5 · Novelty 4.5 · Clarity 7

Abstract

Joint Embedding Predictive Architectures (JEPAs) have shown promising world modeling capabilities, enabling planning in latent space by optimizing action trajectories using methods like the Cross-Entropy Method (CEM). These methods are, however, too computationally expensive and ineffective for long-horizon planning. Furthermore, these methods typically require an explicit image of the goal state, which is not always possible in real-world tasks. In this work, we tackle these limitations by proposing Forward-Forward-JEPA (FF-JEPA), a hierarchical approach leveraging two forward dynamics models. Alongside a standard action-conditioned forward model, we introduce an action-free latent planner that predicts the next subgoal given the current state. This approach removes the need for goal images and enables long-horizon planning by decomposing complex trajectories into a sequence of tractable, short-term optimization problems. Preliminary results on PushT demonstrate that FF-JEPA successfully overcomes flat world models' long-horizon collapse, highlighting this approach as a promising direction for goal-free planning.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: FF-JEPA: Long-Horizon Planning in World Models with Latent Planners

1. Core Contribution

FF-JEPA proposes a hierarchical planning framework that augments JEPA-style world models with a learned latent planner to enable long-horizon, goal-free planning. The key idea is to train an action-free forward dynamics model (the "latent planner") that predicts subgoal states in the world model's latent space. These predicted subgoals replace the need for explicit goal images and decompose long trajectories into sequences of short-horizon CEM optimization problems. Two planner variants are explored: a deterministic transformer-based planner and a diffusion-based planner (DiT backbone). The approach eliminates the requirement for goal images at inference time and addresses the compounding error problem inherent to flat, long-horizon CEM planning.

The conceptual contribution is relatively straightforward, since hierarchical planning via subgoal prediction is a well-established idea, but the specific instantiation within the JEPA latent space and the reinterpretation of the world model as an implicit inverse dynamics module provide a clean and practical formulation.
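The planning loop described in this section can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: latents are 2D points, `world_model` is a hand-written additive model, `latent_planner` is a hand-written rule standing in for the learned action-free planner, and the horizon and CEM settings are arbitrary. The sketch only shows the control flow: predict a subgoal, run short-horizon CEM toward it, execute, and repeat, with no goal image involved.

```python
import math
import random

H = 5                  # short CEM horizon per subgoal (illustrative value)
TARGET = (10.0, 6.0)   # task target, known only to the toy planner


def world_model(z, a):
    """Toy action-conditioned forward model: next latent = latent + action."""
    return (z[0] + a[0], z[1] + a[1])


def latent_planner(z, step=3.0):
    """Toy action-free planner: the next subgoal lies at most `step` toward TARGET."""
    dx, dy = TARGET[0] - z[0], TARGET[1] - z[1]
    d = math.hypot(dx, dy)
    if d <= step:
        return TARGET
    return (z[0] + step * dx / d, z[1] + step * dy / d)


def rollout(z, actions):
    """Predict the final latent reached by applying `actions` from `z`."""
    for a in actions:
        z = world_model(z, a)
    return z


def cem(z, subgoal, rng, samples=128, elites=16, iters=10):
    """Cross-Entropy Method over an H-step action sequence (actions clipped to [-1, 1])."""
    dim = 2 * H
    mean, std = [0.0] * dim, [1.0] * dim
    for _ in range(iters):
        pop = []
        for _ in range(samples):
            x = [max(-1.0, min(1.0, rng.gauss(m, s))) for m, s in zip(mean, std)]
            end = rollout(z, [(x[2 * t], x[2 * t + 1]) for t in range(H)])
            cost = (end[0] - subgoal[0]) ** 2 + (end[1] - subgoal[1]) ** 2
            pop.append((cost, x))
        pop.sort(key=lambda p: p[0])
        best = [x for _, x in pop[:elites]]
        # Refit the sampling distribution to the elite set.
        mean = [sum(col) / elites for col in zip(*best)]
        std = [max(1e-3, math.sqrt(sum((v - m) ** 2 for v in col) / elites))
               for col, m in zip(zip(*best), mean)]
    return [(mean[2 * t], mean[2 * t + 1]) for t in range(H)]


def plan_and_execute(z0, n_subgoals=6, seed=0):
    """Chain short CEM problems, one per predicted subgoal."""
    rng = random.Random(seed)
    z = z0
    for _ in range(n_subgoals):
        subgoal = latent_planner(z)         # action-free subgoal prediction
        for a in cem(z, subgoal, rng):      # short-horizon action optimization
            z = world_model(z, a)           # toy setting: environment equals the model
    return z


final = plan_and_execute((0.0, 0.0))
print("distance to target:", math.dist(final, TARGET))
```

Because each CEM call only has to reach a nearby subgoal, the search space per call stays small regardless of the total trajectory length, which is the intuition behind the decomposition. In a real system the environment would differ from the world model, so prediction error would enter at each step.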

2. Methodological Rigor

The experimental evaluation has several notable limitations:

Single task evaluation: All experiments are conducted on the PushT environment, a 2D pushing task. This is a commonly used benchmark but is relatively simple. No evaluation is provided on higher-dimensional environments, manipulation tasks with richer dynamics, or tasks requiring more complex reasoning.

Comparison fairness: The authors acknowledge that their results are not directly comparable with those of DINO-WM because the evaluation protocols differ. The primary fair comparison is against flat LeWM, which is essentially an ablation rather than a comparison against the state of the art in long-horizon planning.

Limited statistical reporting: Success rates are reported over 256 episodes, but no confidence intervals or standard deviations across seeds are provided. The single-seed nature of the results limits reliability of the conclusions.
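As a rough illustration of the missing uncertainty estimate, the sketch below computes a 95% Wilson score interval for a success rate measured over 256 episodes. The success counts are back-computed from the percentages cited later in this assessment (3.52% and 91.80% for t=75, i.e. 9 and 235 of 256) and are therefore an inference, not figures taken from the paper's tables. The interval reflects episode sampling noise only; it says nothing about variance across training seeds, which is the larger concern raised above.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


EPISODES = 256
# Success counts inferred from the reported percentages (assumption).
for label, successes in [("flat planning", 9), ("FF-JEPA", 235)]:
    lo, hi = wilson_interval(successes, EPISODES)
    print(f"{label}: {successes / EPISODES:.2%} (95% CI {lo:.2%} to {hi:.2%})")
```

With 256 episodes the two intervals are far apart, so the headline gap is unlikely to be an artifact of episode sampling alone; the open question is how much the numbers move across seeds.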

Ablation quality: The demonstration quality ablation (Table II) is interesting but shallow — only two data points are compared. The success rate vs. budget analysis (Figure 4) provides useful insight into convergence behavior.

Training details: The deterministic planner uses a context window of 3 while the diffusion planner uses 1, making it difficult to isolate whether performance differences stem from the architecture or the context window choice. The authors acknowledge this confound but don't resolve it.

3. Potential Impact

The paper addresses a genuine practical limitation of world model-based planning: the need for goal images and the inability to plan over long horizons. Removing the goal image requirement is particularly relevant for robotics applications where specifying precise goal states is impractical. The hierarchical decomposition strategy, while not novel in concept, is cleanly implemented within the JEPA framework.

However, the impact is limited by:

  • Single benchmark: Without demonstration on more complex tasks, it's unclear how well this transfers.
  • Reliance on pretrained world model quality: The approach inherits all limitations of the underlying world model.
  • Scalability unknowns: Whether the latent planner can learn meaningful subgoal sequences for tasks with multi-modal solutions, contact-rich dynamics, or longer horizons (hundreds of steps) remains untested.
  • The finding that high-quality demonstrations can substitute for large datasets (Table II) is practically useful but needs more thorough investigation.

4. Timeliness & Relevance

This work is timely. JEPA-based world models are an active research area, and the paper builds directly on very recent work (LeWorldModel, DINO-WM, and hierarchical JEPA planning, all from 2025-2026). The problem of goal-free, long-horizon planning in learned world models is a current bottleneck. The concurrent work by Zhang et al. [11] addresses a very similar problem with a similar hierarchical approach but still requires goal images, giving FF-JEPA a meaningful differentiator.

The paper is positioned well within the current discourse around JEPA architectures and their use in robotics/control.

5. Strengths & Limitations

Strengths:

  • Clean formulation that naturally decomposes the problem into a latent planner and a world model acting as an inverse dynamics module
  • Removal of goal image requirement is a meaningful practical advance
  • Dramatic improvement over flat planning on long-horizon tasks (3.52% → 91.80% for t=75)
  • Lightweight deterministic planner variant adds minimal overhead (2.1ms vs 926.6ms for CEM)
  • The framework allows training the latent planner on unlabeled demonstrations, which is practically valuable
  • Good qualitative analysis of failure cases

Limitations:

  • Single, simple environment: PushT is 2D and relatively low-dimensional; claims about "real-world tasks" are unsupported
  • Preliminary nature: The paper is explicitly described as presenting "preliminary results" — this limits the strength of conclusions
  • No comparison with modern planning baselines: Missing comparisons with MPC variants, tree search methods, or other hierarchical planners
  • Fixed subgoal spacing: The horizon H=25 is fixed; adaptive subgoal spacing could be important for variable-difficulty segments
  • No analysis of latent planner quality: How accurate are the predicted subgoals? What is the distribution of errors? How does subgoal quality degrade over long sequences?
  • Limited theoretical justification: No analysis of when/why the hierarchical decomposition should work or fail
  • Reproducibility: While the authors use publicly available libraries, full training details and hyperparameters for reproduction are sparse

Additional Observations

The paper is well-written and clearly presented for its scope. The naming "Forward-Forward JEPA" is somewhat confusing as it could be conflated with Hinton's Forward-Forward algorithm. The paper reads more as a workshop paper or extended abstract than a full contribution, which aligns with its self-described preliminary nature.

The diffusion planner's substantial parameter overhead (50.1M vs 18M for the world model) and inference cost raise questions about whether simpler alternatives to CEM-based action extraction (e.g., a learned inverse dynamics model) might be more efficient end-to-end.

Summary

FF-JEPA presents a clean and practical idea: augmenting JEPA world models with latent planners for goal-free long-horizon planning. The idea is sound and the preliminary results are promising, but the evaluation is too narrow (single simple task, limited baselines, no statistical rigor) to support strong claims about the approach's general utility. This represents a reasonable early-stage contribution that needs substantially more validation to achieve significant impact.

Rating: 4.2 / 10
Significance 4.5 · Rigor 3.5 · Novelty 4.5 · Clarity 7

Generated Jun 9, 2026

Comparison History (18)

Won vs. Moonshine: An Autonomous Mathematical Research Agent Centered on Conjecture Generation

Paper 2 has higher likely impact: it proposes a concrete, generalizable planning method (hierarchical latent planner + forward dynamics) addressing recognized bottlenecks in world-model planning (long-horizon collapse, CEM cost, goal-image dependence), with clear robotics/control applications and relevance to current model-based RL trends. Paper 1 is novel conceptually (autonomous conjecture generation, Neural Jacobian Conjecture) and rigorous in a special case, but its broader impact depends on future mathematical uptake and resolution of higher-width cases, making near-term real-world and cross-field impact less certain.

gpt-5.2 · Jun 10, 2026

Won vs. Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise Evaluation

Paper 2 has higher potential impact due to a more novel algorithmic contribution (hierarchical latent planning with an action-free subgoal predictor) that targets a central, timely bottleneck in world-model planning: long-horizon collapse and goal specification. If validated beyond preliminary PushT results, it could influence reinforcement learning, robotics, control, and representation learning broadly. Paper 1 is methodologically rigorous and useful, but as a benchmark/evaluation framework its primary impact is narrower (engineering VLM assessment) and more incremental relative to the fast-moving benchmarking landscape.

gpt-5.2 · Jun 10, 2026

Won vs. STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios

FF-JEPA introduces a novel hierarchical planning architecture addressing fundamental limitations of JEPA-based world models for long-horizon planning, a core challenge in AI/robotics. The action-free latent planner enabling goal-free planning is a meaningful architectural contribution with broad implications for model-based reinforcement learning and robotics. While Paper 2 (STAGE-Claw) provides a useful benchmarking framework, it is more incremental in nature, improving agent evaluation methodology rather than advancing fundamental capabilities. Paper 1's methodological innovation has greater potential for broad, lasting impact across multiple research areas.

claude-opus-4-6 · Jun 10, 2026

Lost vs. Beyond Static Evaluation: Co-Evolutionary Mechanisms for LLM-Driven Strategy Evolution in Adversarial Games

Paper 2 demonstrates strong empirical validation, including discovering novel algorithmic structures and achieving top placements in an international competition (AAMAS 2026). It advances the highly relevant field of LLM-driven code evolution in multi-agent environments. In contrast, Paper 1 tackles an important problem in world models but relies on 'preliminary results' on a single baseline, indicating it is likely earlier-stage research with less proven real-world impact.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. ReflectiChain: Epistemic Grounding in LLM-Driven World Models for Supply Chain Resilience

Paper 2 has higher estimated impact due to a more application-critical domain (supply-chain resilience), clearer methodological contributions (physically constrained graph-latent world model + explicit epistemic/aleatoric separation via double-loop learning), and stronger empirical claims with statistical testing on a defined benchmark. Its framing bridges LLM semantics and RL control, potentially influencing ML, operations research, and risk modeling. Paper 1 is novel for long-horizon latent planning without goal images, but evidence is preliminary and demonstrated on a narrow robotics task (PushT), suggesting earlier-stage impact and limited breadth so far.

gpt-5.2 · Jun 10, 2026

Lost vs. (Auto)formalization is supposed to be easy: Trellis process semantics for spelling out rigorous proofs

Paper 2 likely has higher impact: it proposes a broadly applicable, timely methodology for reliable LLM-driven autoformalization with deterministic process semantics, demonstrated by an end-to-end Lean formalization of a recent major Ramsey theory result, which is strong evidence of rigor and real-world utility. Its approach can influence multiple fields (formal methods, theorem proving, ML/LLMs, mathematics) and addresses a central bottleneck in scalable verification. Paper 1 is innovative for long-horizon latent planning, but is more specialized, has preliminary results on a limited benchmark, and impact depends on further validation.

gpt-5.2 · Jun 9, 2026

Won vs. Beyond Probabilistic Similarity: Structural, Temporal, and Causal Limitations of Retrieval-Augmented Generation in the Legal Domain

Paper 1 introduces a fundamental methodological advancement in world models and long-horizon planning, solving key computational bottlenecks in JEPAs. This technical innovation has broad applicability across robotics, reinforcement learning, and autonomous agents, likely yielding wider scientific impact. Paper 2, while highly relevant to AI safety and real-world applications, provides a largely domain-specific theoretical critique of RAG in the legal field, which may have a narrower overall scientific footprint.

gemini-3.1-pro-preview · Jun 9, 2026

Lost vs. SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

SpatialWorld introduces a comprehensive benchmark with 760 human-annotated tasks across 8 simulation backends, addressing a significant gap in evaluating interactive spatial reasoning for multimodal agents. It provides extensive evaluation of 15 advanced agents including GPT-5 and open-source models, revealing critical performance bottlenecks. Its breadth of impact is larger, serving the entire MLLM agent community. FF-JEPA presents an interesting hierarchical planning idea but offers only preliminary results on a single task (PushT), limiting its demonstrated impact. SpatialWorld's scale, rigor, and community utility give it higher potential impact.

claude-opus-4-6 · Jun 9, 2026

Lost vs. Structure Enables Effective Self-Localization of Errors in LLMs

Paper 1 addresses the fundamental and highly active problem of LLM self-correction with a well-motivated, structured approach (Thought-ICS) that shows strong empirical results (20-40% self-correction lift). It tackles a core limitation of LLMs relevant across virtually all applications. Paper 2 presents an interesting hierarchical planning approach (FF-JEPA) but reports only preliminary results on a single task (PushT), limiting its demonstrated impact. The breadth of applicability, methodological rigor, and timeliness of Paper 1's contribution to the rapidly growing LLM reasoning/self-correction literature give it higher potential impact.

claude-opus-4-6 · Jun 9, 2026

Won vs. Vision Language Model Helps Private Information De-Identification in Vision Data

Paper 2 is likely to have higher scientific impact: it proposes a conceptually novel hierarchical latent-planning mechanism (action-free subgoal predictor + action-conditioned dynamics) that addresses key bottlenecks in model-based RL (CEM compute cost, long-horizon collapse, and dependence on goal images). If validated beyond preliminary PushT results, it could influence broad areas (world models, planning, robotics, autonomy). Paper 1 is timely and valuable for privacy-preserving VLM deployment, but is more application/dataset-and-tuning oriented with narrower methodological novelty and field reach.

gpt-5.2 · Jun 9, 2026