Temporal Distance-aware Transition Augmentation for Offline Model-based Reinforcement Learning
ICML 2025

  • This work was partly completed at Carnegie Mellon University.

Why does prior offline MBRL fail in long-horizon tasks?

  • Inaccurate Value Estimation in Sparse Reward Settings. Sparse rewards in long-horizon tasks lead to flat or noisy value estimates across distant states. This undermines the critic's ability to reflect meaningful progress, distorting temporal distance signals.
  • Unstable Policy Optimization. When value targets are unreliable, policy gradients become erratic or misleading. This results in suboptimal or unstable policy updates during training.
  • Model Overgeneralization and Geometric Bias in Rollouts. Learned models often overgeneralize in underexplored regions, generating unrealistic transitions. As a result, agents exploit geometrically short but temporally infeasible paths during planning.

Main Idea

  • We propose a novel method to learn temporal distance representations from fixed offline data for transition augmentation in MBRL, which we dub TempDATA.
  • Our key idea is to build a latent space where distances reflect temporal separation, enabling both trajectory-level and transition-level understanding of time (a short code sketch follows this list).
  • At the trajectory level, we approximate the temporal distance between two states by finding their closest alignment across dataset trajectories, capturing long-term structure.
  • At the transition level, we control the latent space resolution by tuning a smoothing factor that defines the minimal unit of temporal or spatial progression captured between transitions.
  • A notable property of our method:
    Modularity of the representation space: Our autoencoder is fully compatible with any off-the-shelf model-free or model-based RL algorithm. Moreover, its latent dynamics support a wide range of downstream objectives, including standard RL, skill-based RL, and goal-conditioned RL (GCRL), without requiring any modifications to the core encoding or planning modules.
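
For illustration, the sketch below shows what "distances reflect temporal separation" buys us once an encoder is trained: the latent gap between two states reads out directly as an approximate number of environment steps. The encoder handle, the Euclidean metric, and the example threshold are assumptions for the sketch, not details taken from the text above.

    import torch

    def temporal_distance(encoder, s_a, s_b):
        # With a temporal-distance-aware latent space, the latent gap between
        # two states approximates how many environment steps separate them.
        return torch.norm(encoder(s_a) - encoder(s_b), dim=-1)

    def temporally_plausible(encoder, s, s_candidate_next, max_steps=1.0):
        # Example use: flag candidate transitions whose latent jump exceeds
        # what a single environment step could plausibly cover, i.e. states
        # that look geometrically close but are temporally far apart.
        return temporal_distance(encoder, s, s_candidate_next) <= max_steps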

TempDATA Framework

  • The autoencoder \(f_\theta, h_\theta\) is trained with the following loss (a PyTorch-style sketch follows this list): $$\begin{aligned} \underbrace{\mathbb{E}_{s \sim \mathcal D}\Big[\big\| s - h \circ f(s;\theta)\big\|\Big]}_{\texttt{Reconstruction}} + \eta_1 \cdot \underbrace{\mathbb{E}_{\substack{(s, s')\sim \mathcal{D}\\ s_{\mathrm{goal}}\sim p_{\mathrm{goal}}}}\bigg[L_2^\tau\Big(\mathcal{B}d - d\big(f(s;\theta), f(s_{\mathrm{goal}};\theta)\big)\Big)\bigg]}_{\texttt{Trajectory loss}} + \eta_2 \cdot \underbrace{\mathbb{E}_{(s, s')\sim \mathcal{D}}\bigg[L_2^{\tau=1}\Big(d\big(f(s;\theta), f(s';\theta)\big) - d_0\Big)\bigg]}_{\texttt{Transition loss}}. \end{aligned}$$
  • The latent dynamics model \(\zeta(z'|z,a)\) is trained with the following maximum-likelihood loss (sketched after this list): $$\mathcal{L}(\zeta) = \mathbb{E}_{(s,a,s^\prime) \sim \mathcal D,\, (z, z') \sim f(\cdot;\theta)} \big[-\log \zeta(z'|z,a) \big].$$
  • The skill policy and value function \(\phi_\pi, \phi_Q\) are trained with Implicit Q-Learning and unsupervised policy training (an IQL-style sketch follows this list).
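
For illustration, here is a minimal PyTorch-style sketch of the autoencoder loss above. The Euclidean latent metric \(d\), the expectile form of \(L_2^\tau\), and the stop-gradient Bellman-backup target \(\mathcal{B}d\) used below are assumptions for the sketch, not implementation details taken from the text above.

    import torch

    def expectile_loss(diff, tau):
        # Assumed form of L_2^tau: |tau - 1{diff < 0}| * diff^2.
        return torch.abs(tau - (diff < 0).float()) * diff.pow(2)

    def tempdata_ae_loss(f, h, s, s_next, s_goal, is_goal, tau, d0, eta1, eta2):
        z, z_next, z_goal = f(s), f(s_next), f(s_goal)

        # Reconstruction: ||s - h(f(s))||.
        recon = torch.norm(s - h(z), dim=-1).mean()

        # Trajectory loss: regress d(f(s), f(s_goal)) toward an (assumed)
        # Bellman backup of the temporal distance, with the target detached.
        d_goal = torch.norm(z - z_goal, dim=-1)
        with torch.no_grad():
            backup = torch.where(is_goal, torch.zeros_like(d_goal),
                                 1.0 + torch.norm(z_next - z_goal, dim=-1))
        traj = expectile_loss(backup - d_goal, tau).mean()

        # Transition loss: keep consecutive latent states roughly d0 apart.
        trans = expectile_loss(torch.norm(z - z_next, dim=-1) - d0, 1.0).mean()

        return recon + eta1 * traj + eta2 * trans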
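
The latent dynamics loss is a standard negative log-likelihood. The sketch below assumes \(\zeta\) is a diagonal Gaussian over the encoded next state; this parameterization is an illustrative choice, not one stated above.

    import torch
    import torch.nn as nn

    class LatentDynamics(nn.Module):
        # Assumed parameterization: an MLP outputs the mean and log-std of a
        # diagonal Gaussian over the next latent state z'.
        def __init__(self, z_dim, a_dim, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(z_dim + a_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 2 * z_dim),
            )

        def forward(self, z, a):
            mean, log_std = self.net(torch.cat([z, a], dim=-1)).chunk(2, dim=-1)
            return torch.distributions.Normal(mean, log_std.exp())

    def dynamics_loss(zeta, z, a, z_next):
        # L(zeta) = E[-log zeta(z' | z, a)], with (z, z') taken from the encoder.
        return -zeta(z, a).log_prob(z_next).sum(dim=-1).mean()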
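
Finally, a compact sketch of IQL-style training in the latent space. The reward and termination handling, the expectile value update, and the advantage-weighted policy extraction below are common IQL components assumed for the sketch; they are not details taken from the text above.

    import torch

    def iql_value_losses(q_net, v_net, target_q_net, z, a, r, z_next, done,
                         gamma=0.99, expectile=0.7):
        # Expectile regression for V: push V(z) toward Q(z, a) asymmetrically.
        with torch.no_grad():
            q_target = target_q_net(z, a)
        diff = q_target - v_net(z)
        v_loss = (torch.abs(expectile - (diff < 0).float()) * diff.pow(2)).mean()

        # TD regression for Q against the frozen value of the next latent state.
        with torch.no_grad():
            target = r + gamma * (1.0 - done) * v_net(z_next)
        q_loss = (q_net(z, a) - target).pow(2).mean()
        return v_loss, q_loss

    def awr_policy_loss(policy, q_net, v_net, z, a, beta=3.0):
        # Advantage-weighted regression: imitate dataset actions, weighted by
        # exp(beta * advantage); policy(z) is assumed to return a distribution.
        with torch.no_grad():
            weight = torch.clamp(torch.exp(beta * (q_net(z, a) - v_net(z))),
                                 max=100.0)
        return -(weight * policy(z).log_prob(a).sum(dim=-1)).mean()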

Experiments

Result figures: antmaze-medium-play, antmaze-medium-diverse, antmaze-large-play, antmaze-large-diverse, antmaze-ultra-play, antmaze-ultra-diverse, CALVIN (Traj 1), CALVIN (Traj 2), kitchen-mixed, kitchen-partial, visual-kitchen-mixed, and visual-kitchen-partial.

Citation

 @article{lee2025temporal,
  title={Temporal Distance-aware Transition Augmentation for Offline Model-based Reinforcement Learning},
  author={Lee, Dongsu and Kwon, Minhae},
  journal={arXiv preprint arXiv:2505.13144},
  year={2025}
}  

The website template was borrowed from Seohong Park, Michaël Gharbi, and Jon Barron.