Towards Policy-Aware World Models

Paper Authors Anonymized for Submission
How can we predict the downstream performance of a policy extracted from a world model, without actually training the policy?
[Teaser figure: rollouts of an unstable vs. stable policy toward a goal image]

Abstract

World models have received significant attention from the robotics and computer vision communities, both of which have started scaling to networks with billions of parameters in the hope of unlocking new robot skills. In this paradigm, models are pre-trained on internet-scale data and then fine-tuned on robot data to learn policies. However, it remains unclear what makes a world model good for downstream policy learning, resulting in slow, costly iterations of model training and policy evaluation.

In this work, we demonstrate that the expected signal-to-noise ratio (ESNR) of policy gradients provides a reliable training-time metric for downstream policy performance. ESNR thus gives a handle on a world model's policy awareness: how well a policy can be learned from the model.

We show that ESNR can be used to understand (1) when world models are sufficiently pre-trained, (2) how architecture changes affect downstream performance, and (3) which policy-learning method is best for a given world model. Crucially, ESNR can be computed on the fly with minimal overhead and without a trained policy. We validate our metric on traditional architectures and tasks as well as on large pretrained world models, demonstrating its practical utility for practitioners who wish to train or fine-tune such models for robot applications.

[Teaser figure]

Expected Signal-to-Noise Ratio (ESNR) as a Policy Performance Metric

ESNR measures how clean and learnable the policy-gradient signal is by comparing the gradient's mean (signal) to its variance under action sampling. On a simple discontinuous task, higher-ESNR estimators learn faster (CEM > FoG > ZoG), so we can gauge "policy awareness" during pretraining without training a policy or running rollouts.
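As a rough illustration of the idea (a minimal sketch, not the paper's implementation), the helper below assumes a callable grad_estimate(rng) that returns one sampled, flattened policy-gradient vector obtained from the world model; ESNR is then the squared norm of the mean gradient divided by the total variance across samples.

    import numpy as np

    def esnr(grad_estimate, n_samples=64, rng=None):
        """Sketch of an expected signal-to-noise ratio for a stochastic
        policy-gradient estimator: squared norm of the mean gradient divided
        by the total variance of the gradient under action (re)sampling.
        `grad_estimate(rng)` is an assumed, illustrative interface."""
        rng = np.random.default_rng() if rng is None else rng
        grads = np.stack([np.ravel(grad_estimate(rng)) for _ in range(n_samples)])
        mean_grad = grads.mean(axis=0)
        total_var = grads.var(axis=0).sum()   # noise, summed over gradient coordinates
        return float(mean_grad @ mean_grad) / (total_var + 1e-12)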


Selecting the policy-extraction method

ESNR tracks returns across ZoG, FoG, and CEM; CEM attains both the highest ESNR and the best returns, letting us pick the extraction method without costly policy training.
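A minimal usage sketch of this selection step, reusing the esnr helper above; the per-method gradient estimators here are hypothetical placeholders, since in practice each would differentiate through (or search over) the frozen world model.

    # Hypothetical gradient estimators, one per extraction method (placeholders only).
    candidates = {
        "ZoG": lambda rng: rng.normal(size=128) * 2.0 + 0.1,   # noisiest placeholder
        "FoG": lambda rng: rng.normal(size=128) * 0.5 + 0.1,
        "CEM": lambda rng: rng.normal(size=128) * 0.1 + 0.1,   # cleanest placeholder
    }

    scores = {name: esnr(g) for name, g in candidates.items()}
    best_extractor = max(scores, key=scores.get)   # "CEM" for these placeholder values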

[Figures: ESNR vs. performance (bar chart and scatter plot), with policy-extraction legend]

Ranking architectures

ESNR mirrors performance gaps (regularized TD-MPC2 > TD-MPC2 basic > DreamerV3), surfacing architectural bottlenecks before any policy training.

[Figures: ESNR across architectures (bar chart and scatter plot), with architecture legend]

Training readiness

ESNR measured over training checkpoints gives a stop/continue signal: once ESNR stabilizes, policy extraction succeeds reliably.
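One way to turn the checkpoint-wise ESNR curve into a stop/continue signal is a simple plateau check; the window size and tolerance below are illustrative placeholders, not values from the paper.

    def esnr_stabilized(esnr_history, window=5, rel_tol=0.05):
        """Return True once the last `window` ESNR measurements deviate from
        their mean by less than `rel_tol` (relative), i.e. the curve has flattened."""
        if len(esnr_history) < window:
            return False
        recent = esnr_history[-window:]
        mean = sum(recent) / window
        return all(abs(x - mean) <= rel_tol * abs(mean) for x in recent)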

[Figures: mean SNR over training checkpoints; mean SNR for DreamerV3]

Large models & wall-time

For ResNet, R3M, DINO-WM, and V-JEPA2, ESNR predicts planning performance and takes seconds to minutes to compute, versus hours for full policy evaluation, dramatically tightening the iteration loop.

[Figure: performance of large pretrained models]
Method     ESNR time (hrs)    Eval time (hrs)
ResNet     0.003              0.70
R3M        0.003              2.21
DINO-WM    0.027              0.72
V-JEPA2    0.358              4.58