GRASP: Practical Long-Horizon Planning with World Models

From Moocchen, the free encyclopedia of technology

Introduction

Large learned world models are transforming how we approach sequential decision-making. These models can predict long sequences of future observations in high-dimensional visual spaces and generalize across tasks in ways that seemed impossible just a few years ago. As they scale, they begin to resemble general-purpose simulators rather than task-specific predictors.

[Figure: GRASP: Practical Long-Horizon Planning with World Models. Source: bair.berkeley.edu]

However, having a powerful predictive model is not the same as being able to use it effectively for control, learning, or planning. In practice, long-horizon planning with modern world models remains fragile: optimization becomes ill-conditioned, non-greedy structure introduces bad local minima, and high-dimensional latent spaces bring subtle failure modes. To address these challenges, a team of researchers—including Mike Rabbat, Aditi Krishnapriyan, Yann LeCun, and Amir Bar (with equal advisorship)—has developed GRASP, a new gradient-based planner that makes long-horizon planning practical. This article explores the problems that motivated GRASP and the key innovations that make it robust.

What Is a World Model?

The term “world model” is overloaded in modern AI. Depending on context, it might refer to an explicit dynamics model or an implicit internal state that a generative model relies on (for instance, when an LLM generates chess moves, it may have some internal representation of the board). For our purposes, we adopt a loose working definition:

Suppose you take actions a_t from an action space A and observe states s_t from a state space S (images, latent vectors, proprioception). A world model is a learned model that, given a short history of recent states and the current action, predicts what will happen next. Formally, it defines a predictive distribution:

P_θ(s_{t+1} | s_{t−h:t}, a_t)

that approximates the environment’s dynamics. This model can then be used for planning by simulating future trajectories and optimizing action sequences.
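
To make this concrete, here is a minimal sketch of planning-by-simulation. The linear `world_model`, the random-shooting planner, and all constants are illustrative stand-ins for a trained network, not part of GRASP:

```python
import numpy as np

# Toy linear "world model": a fixed map standing in for a trained network.
A_mat = np.array([[1.0, 0.1],
                  [0.0, 1.0]])   # position/velocity drift
B_mat = np.array([[0.0],
                  [0.1]])        # action nudges velocity

def world_model(s, a):
    """Predict the next state from the current state and action."""
    return A_mat @ s + B_mat @ a

def rollout(s0, actions):
    """Simulate a trajectory by feeding predictions back into the model."""
    states, s = [s0], s0
    for a in actions:
        s = world_model(s, a)
        states.append(s)
    return states

def plan_random_shooting(s0, goal, horizon=10, n_candidates=256, seed=0):
    """Score random action sequences under the model; keep the best."""
    rng = np.random.default_rng(seed)
    best_cost, best_actions = np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, 1))
        final = rollout(s0, actions)[-1]
        cost = float(np.sum((final - goal) ** 2))
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions, best_cost

s0 = np.array([0.0, 0.0])
goal = np.array([1.0, 0.0])
best_actions, best_cost = plan_random_shooting(s0, goal)
```

Random shooting is the simplest such planner; gradient-based planners like GRASP instead differentiate the simulated cost with respect to the actions.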

The Long-Horizon Challenge

While world models have become increasingly accurate, using them for planning over long horizons remains a stress test. Three main issues arise:

  • Ill-conditioned optimization: Gradient signals can vanish or explode over long trajectories, making it hard to adjust early actions.
  • Non-greedy structure: Optimal sequences often require coordination across many time steps, creating many local minima that trap gradient-based optimizers.
  • High-dimensional latent spaces: Modern world models operate in rich latent spaces (e.g., from vision encoders). Gradients taken through these high-dimensional models can be brittle, noisy, or uninformative.

Standard gradient-based planners often fail when horizons extend beyond a few steps. This fragility motivated the design of GRASP.
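
The ill-conditioning point can be made concrete with a toy calculation: the sensitivity of a late state to an early action is a product of per-step Jacobians, so it shrinks or grows exponentially with the horizon. The matrices below are illustrative:

```python
import numpy as np

def sensitivity(J, T):
    """Norm of d s_T / d a_0 when every step contributes Jacobian J:
    the chain rule multiplies T copies of J, so the signal scales like
    the T-th power of J's spectral radius."""
    return float(np.linalg.norm(np.linalg.matrix_power(J, T)))

contracting = 0.9 * np.eye(2)   # slightly damped dynamics
expanding = 1.1 * np.eye(2)     # slightly unstable dynamics

vanished = sensitivity(contracting, 50)   # ~0.9**50: early actions barely register
exploded = sensitivity(expanding, 50)     # ~1.1**50: gradients blow up
```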

GRASP: Three Key Innovations

GRASP introduces three core ideas that together make gradient-based planning robust even for long horizons:

1. Parallel Trajectory Lifting

Instead of optimizing actions sequentially over time, GRASP lifts the entire trajectory into a set of virtual states. This reframes the optimization as a parallel problem: all time steps can be updated simultaneously, which improves gradient flow and reduces the risk of vanishing signals. The trajectory becomes a batch of independent optimization variables, each corresponding to a time step, and the world model’s predictions are computed in parallel.
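
A minimal sketch of the lifting idea, using a toy linear model in place of a learned one. The penalty formulation, learning rate, and all constants here are assumptions for illustration; GRASP's actual objective may differ:

```python
import numpy as np

# Toy linear dynamics standing in for a learned world model (illustrative).
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])

def lifted_plan(s0, goal, T=10, iters=500, lr=0.05, seed=0):
    """Treat every virtual state s_1..s_T as a free variable and enforce
    consistency with the dynamics via a penalty, so gradients for all
    time steps are computed in parallel instead of being backpropagated
    sequentially through a rollout."""
    rng = np.random.default_rng(seed)
    S = np.tile(s0, (T + 1, 1)).astype(float)   # virtual states s_0..s_T
    U = rng.normal(0.0, 0.1, size=(T, 1))       # actions a_0..a_{T-1}
    for _ in range(iters):
        pred = S[:-1] @ A.T + U @ B.T           # f(s_t, a_t) for all t at once
        r = S[1:] - pred                        # consistency residuals
        gS = np.zeros_like(S)
        gS[1:] += 2 * r                         # d||r_t||^2 / d s_{t+1}
        gS[:-1] -= 2 * r @ A                    # d||r_t||^2 / d s_t
        gS[-1] += 2 * (S[-1] - goal)            # terminal-cost gradient
        gS[0] = 0.0                             # s_0 is observed; keep it fixed
        gU = -2 * r @ B                         # d||r_t||^2 / d a_t
        S -= lr * gS
        U -= lr * gU
    pred = S[:-1] @ A.T + U @ B.T
    return S, U, float(np.sum((S[1:] - pred) ** 2))

S, U, residual = lifted_plan(np.array([0.0, 0.0]), np.array([1.0, 0.0]))
```

Because every gradient above is a batched matrix operation over all time steps, the update cost per iteration is independent of the horizon on parallel hardware.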


2. Stochasticity for Exploration

To escape bad local minima, GRASP injects stochasticity directly into the state iterates during optimization. By adding noise to the virtual states, the planner can explore alternative paths in the trajectory space. This randomness is controlled and annealed, allowing the optimizer to gradually settle on better solutions.
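
The role of annealed noise can be illustrated on a one-dimensional non-convex objective standing in for a trajectory cost. The linear schedule and all constants below are generic illustrative choices, not GRASP's specific scheme:

```python
import numpy as np

def double_well(x):
    """Non-convex stand-in for a trajectory cost: a poor local minimum
    near x = +1 and a better one near x = -1."""
    return (x ** 2 - 1) ** 2 + 0.3 * x

def grad(x):
    return 4 * x * (x ** 2 - 1) + 0.3

def anneal_optimize(x0, steps=3000, lr=0.02, noise0=0.8, seed=0):
    """Gradient descent with noise injected into the iterate; the noise
    scale is annealed linearly to zero, so early iterations can hop
    between basins while late iterations settle cleanly."""
    rng = np.random.default_rng(seed)
    x = float(x0)
    for k in range(steps):
        scale = noise0 * (1.0 - k / steps)          # linear annealing
        x -= lr * grad(x) + scale * np.sqrt(lr) * rng.normal()
    return x

# Plain gradient descent from x0 = 1 stays stuck in the worse basin;
# annealed-noise runs can cross the barrier to the better one.
finals = [anneal_optimize(1.0, seed=s) for s in range(20)]
```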

3. Gradient Reshaping

Gradients through high-dimensional vision models (like those used in world models) can be brittle, as they mix action signals with state inputs. GRASP reshapes gradients so that actions receive clean, informative gradients while avoiding the noisy “state-input” gradients that come from the vision model. This is achieved by reformulating the gradient computation to decouple the action updates from the state estimation path.
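
One common way to realize this kind of decoupling is a stop-gradient: give each action only the direct one-step gradient and treat the state inputs as constants. The sketch below illustrates that idea on a toy linear model; it is an assumption-laden illustration of the general technique, not GRASP's exact reshaping rule:

```python
import numpy as np

# Toy linear model (illustrative); the state s = (position, velocity).
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.1],
              [0.1]])
goal = np.array([0.5, 0.0])

def rollout(s0, U):
    states, s = [s0], s0
    for a in U:
        s = A @ s + B @ a
        states.append(s)
    return np.array(states)

def cost(states):
    return float(np.sum((states[1:] - goal) ** 2))

def reshaped_grad(s0, U):
    """Action gradient with state inputs treated as constants (detached):
    each a_t receives only the direct one-step gradient through B, never
    the long chain back through earlier states -- the path that tends to
    be noisy when it runs through a high-dimensional vision model."""
    states = rollout(s0, U)
    g = np.zeros_like(U)
    for t in range(len(U)):
        g[t] = 2.0 * B.T @ (states[t + 1] - goal)
    return g

s0 = np.array([0.0, 0.0])
U = np.zeros((10, 1))
c_before = cost(rollout(s0, U))
for _ in range(500):
    U -= 0.1 * reshaped_grad(s0, U)
c_after = cost(rollout(s0, U))
```

Even though the detached gradient ignores long-range terms, on this toy problem it is still a useful descent signal and steadily reduces the rollout cost.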

Benefits and Implications

Together, these innovations yield a planner that can handle much longer horizons than previous gradient-based methods. GRASP is not only more robust to ill-conditioning and local minima but also computationally efficient because the parallel lifting maps well to modern hardware (GPUs). The method opens the door to using large world models for planning tasks that require foresight—such as robotics, autonomous driving, or game playing—without needing to handcraft reward shaping or rely on sample-inefficient model-free RL.

The ability to plan over long horizons with learned dynamics also suggests that world models can serve as general-purpose simulators for many tasks, reducing the need for separate planning algorithms for different domains.

Conclusion

GRASP demonstrates that careful algorithmic design—parallel trajectory lifting, stochastic exploration, and gradient reshaping—can overcome the fragility of gradient-based planning with modern world models. By addressing the core issues of ill-conditioned optimization, local minima, and brittle gradients, GRASP makes long-horizon planning practical. As world models continue to scale, approaches like GRASP will be essential to unlock their full potential for decision-making and control.