Temporal Distance-aware Transition Augmentation for Offline Model-based Reinforcement Learning
ICML 2025
Dongsu Lee
Soongsil Univ.
Minhae Kwon
Soongsil Univ.
- This work was partly completed at Carnegie Mellon University.
Why does prior offline MBRL fail in long-horizon tasks?
- Inaccurate Value Estimation in Sparse Reward Settings. Sparse rewards in long-horizon tasks lead to flat or noisy value estimates across distant states. This undermines the critic's ability to reflect meaningful progress, distorting temporal distance signals.
- Unstable Policy Optimization. When value targets are unreliable, policy gradients become erratic or misleading. This results in suboptimal or unstable policy updates during training.
- Model Overgeneralization and Geometric Bias in Rollouts. Learned models often overgeneralize in underexplored regions, generating unrealistic transitions. As a result, agents exploit geometrically short but temporally infeasible paths during planning.
Main Idea
- We propose a novel method to learn temporal distance representations from fixed offline data for transition augmentation in MBRL, which we dub TempDATA.
- Our key idea is to build a latent space where distances reflect temporal separation, enabling both trajectory-level and transition-level understanding of time.
- At the trajectory level, we approximate the temporal distance between two states by finding their closest alignment across dataset trajectories, capturing long-term structure (see the sketch after this list).
- At the transition level, we control the latent space resolution by tuning a smoothing factor that defines the minimal unit of temporal or spatial progression captured between transitions.
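As an illustration of the trajectory-level idea (not the paper's exact procedure), the temporal distance between two states that lie on offline trajectories can be approximated by the smallest index gap over all trajectories that pass approximately through both of them. The matching tolerance `atol` and the nearest-match criterion below are assumed choices for this sketch.

```python
# Illustrative NumPy sketch: approximate the temporal distance between two
# states as their closest alignment (smallest index gap) across trajectories.
import numpy as np

def temporal_distance_label(trajectories, s_a, s_b, atol=1e-3):
    """trajectories: list of arrays, each of shape (T_i, state_dim)."""
    best = np.inf
    for traj in trajectories:
        # Indices where this trajectory (approximately) visits each state.
        idx_a = np.where(np.linalg.norm(traj - s_a, axis=-1) < atol)[0]
        idx_b = np.where(np.linalg.norm(traj - s_b, axis=-1) < atol)[0]
        if len(idx_a) > 0 and len(idx_b) > 0:
            # Closest alignment of the two visits within this trajectory.
            gap = np.abs(idx_a[:, None] - idx_b[None, :]).min()
            best = min(best, gap)
    return best  # np.inf if no single trajectory connects the two states
```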
Key properties of our method:
Modularity of the representation space: Our autoencoder is compatible with any off-the-shelf model-free or model-based RL algorithm. Moreover, its latent dynamics support a wide range of downstream objectives, including standard RL, skill-based RL, and goal-conditioned RL (GCRL), without modifying the core encoding or planning modules.
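A hedged illustration of this modularity: once the encoder \(f\), decoder \(h\), and latent dynamics \(\zeta\) are trained, short latent rollouts can be turned into synthetic transitions for whatever downstream learner is plugged in. All names and the sampling interfaces below are placeholders, reward handling is omitted, and this sketches the interface rather than the paper's exact augmentation procedure.

```python
# Hedged sketch: rolling out the latent dynamics to augment an offline buffer.
# `f`, `h`, and `zeta` follow the framework section; `policy.sample` and
# `zeta.sample` are assumed interfaces, and rewards are omitted.
import torch

@torch.no_grad()
def augment_transitions(f, h, zeta, policy, states, horizon=5):
    """Generate synthetic transitions by rolling out in latent space."""
    z = f(states)                      # encode real offline states
    synthetic = []
    for _ in range(horizon):
        a = policy.sample(z)           # any downstream policy works here
        z_next = zeta.sample(z, a)     # one latent step under the model
        # Decode back to state space if the downstream learner needs raw
        # states; latent-space learners can keep (z, a, z_next) directly.
        synthetic.append((h(z), a, h(z_next)))
        z = z_next
    return synthetic
```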
TempDATA Framework
- The autoencoder \(f_\theta, h_\theta\) is trained with the following loss: $$\begin{aligned} \underbrace{\mathbb{E}_{s \sim \mathcal D}\Big[|| s - h \circ f(s;\theta)||\Big]}_{\texttt{Reconstruction}} + \eta_1 \cdot \underbrace{\mathbb{E}_{\substack{(s, s')\sim \mathcal{D}\\ s_{\mathrm{goal}}\sim p_{\mathrm{goal}}}}\bigg [L_2^\tau\Big(\mathcal{B}d - d\big(f(s;\theta), f(s_{\mathrm{goal}};\theta)\big)\Big)\bigg]}_{\texttt{Trajectory loss}} + \eta_2 \cdot \underbrace{\mathbb{E}_{(s, s')\sim \mathcal{D}}\bigg[L_2^{\tau=1}\Big(d\big(f(s;\theta), f(s';\theta)\big) - d_0\Big)\bigg]}_{\texttt{Transition loss}}. \end{aligned}$$
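A minimal PyTorch sketch of this objective, assuming \(L_2^\tau\) is the expectile (asymmetric squared) loss, \(d\) is a Euclidean latent distance, and \(\mathcal{B}d\) is a one-step Bellman-style target on temporal distance computed with a target encoder; these are assumptions where the formula leaves details implicit, and all module and variable names are placeholders.

```python
# Sketch of the autoencoder objective: reconstruction + trajectory-level
# expectile regression + transition-level spacing. `f_target` is an assumed
# target copy of the encoder; eta1, eta2, tau, d0 match the formula above.
import torch

def expectile_loss(residual, tau):
    # L_2^tau(x) = |tau - 1{x < 0}| * x^2  (assumed expectile convention)
    weight = torch.abs(tau - (residual < 0).float())
    return weight * residual ** 2

def temporal_distance(z1, z2):
    # Latent distance; Euclidean is one common choice.
    return torch.norm(z1 - z2, dim=-1)

def tempdata_loss(f, h, f_target, s, s_next, s_goal,
                  eta1, eta2, tau, d0, gamma=1.0):
    z, z_next, z_goal = f(s), f(s_next), f(s_goal)

    # Reconstruction: || s - h(f(s)) ||
    recon = torch.norm(s - h(z), dim=-1).mean()

    # Trajectory loss: expectile regression of d(f(s), f(s_goal)) toward an
    # assumed Bellman-style target B·d = 1{s != s_goal} + gamma * d(f(s'), f(s_goal)).
    with torch.no_grad():
        not_goal = (torch.norm(s - s_goal, dim=-1) > 1e-6).float()
        target = not_goal + gamma * temporal_distance(f_target(s_next),
                                                      f_target(s_goal))
    traj = expectile_loss(target - temporal_distance(z, z_goal), tau).mean()

    # Transition loss: with the expectile convention above, tau = 1 penalizes
    # only adjacent latents whose distance exceeds the smoothing factor d0.
    trans = expectile_loss(temporal_distance(z, z_next) - d0, tau=1.0).mean()

    return recon + eta1 * traj + eta2 * trans
```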
- The latent dynamics model \(\zeta(z'|z,a)\) is trained with the following loss: $$\mathcal{L}(\zeta) = \mathbb{E}_{(s,a,s^\prime) \sim \mathcal D,\, (z, z') \sim f(\cdot;\theta)} \big[-\log \zeta(z'|z,a) \big].$$
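A corresponding sketch of the latent dynamics model trained by maximum likelihood; the diagonal-Gaussian parameterization and network sizes are assumptions for illustration, not specified above.

```python
# Sketch: latent dynamics zeta(z'|z,a) as a diagonal Gaussian, trained by
# minimizing the negative log-likelihood of encoded offline transitions.
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    def __init__(self, z_dim, a_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + a_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, z_dim)
        self.log_std = nn.Linear(hidden, z_dim)

    def distribution(self, z, a):
        h = self.net(torch.cat([z, a], dim=-1))
        std = self.log_std(h).clamp(-5.0, 2.0).exp()
        return torch.distributions.Normal(self.mean(h), std)

def dynamics_loss(zeta, f, s, a, s_next):
    # Encode real transitions with the (frozen) encoder, then maximize
    # log-likelihood of the next latent state.
    with torch.no_grad():
        z, z_next = f(s), f(s_next)
    return -zeta.distribution(z, a).log_prob(z_next).sum(-1).mean()
```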
- The skill policy and value function \(\phi_\pi, \phi_Q\) are trained via Implicit Q-learning (IQL) and unsupervised policy training.
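Since the bullet above names Implicit Q-learning, here is a minimal sketch of its three losses on latent transitions; the expectile value regression, advantage-weighted policy extraction, hyperparameters, and the `policy.log_prob` interface are standard IQL choices assumed for illustration, and the unsupervised skill-training part is omitted.

```python
# Sketch of IQL losses over (latent) transitions: expectile V regression,
# one-step Q backup, and advantage-weighted policy extraction.
import torch

def iql_losses(q_net, v_net, policy, z, a, r, z_next, done,
               gamma=0.99, expectile=0.9, beta=3.0):
    # V update: expectile regression of V(z) toward Q(z, a).
    with torch.no_grad():
        q_target = q_net(z, a)
    diff = q_target - v_net(z)
    v_loss = (torch.abs(expectile - (diff < 0).float()) * diff ** 2).mean()

    # Q update: one-step Bellman backup using V(z').
    with torch.no_grad():
        backup = r + gamma * (1.0 - done) * v_net(z_next)
    q_loss = ((q_net(z, a) - backup) ** 2).mean()

    # Policy extraction: advantage-weighted regression with clipped weights.
    with torch.no_grad():
        weight = torch.exp(beta * (q_target - v_net(z))).clamp(max=100.0)
    pi_loss = -(weight * policy.log_prob(z, a)).mean()

    return v_loss, q_loss, pi_loss
```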
Experiments
Benchmarks: antmaze-medium/large/ultra (play, diverse), CALVIN (Traj 1, Traj 2), kitchen-mixed/partial, and visual-kitchen-mixed/partial.
Citation
@article{lee2025temporal,
title={Temporal Distance-aware Transition Augmentation for Offline Model-based Reinforcement Learning},
author={Lee, Dongsu and Kwon, Minhae},
journal={arXiv preprint arXiv:2505.13144},
year={2025}
}