Temporal Distance-aware Transition Augmentation for Offline Model-based Reinforcement Learning
ICML 2025
Dongsu Lee
Soongsil Univ.
Minhae Kwon
Soongsil Univ.
- This work was partly completed at Carnegie Mellon University.
Why does prior offline MBRL fail in long-horizon tasks?
- Inaccurate Value Estimation in Sparse Reward Settings. Sparse rewards in long-horizon tasks lead to flat or noisy value estimates across distant states. This undermines the critic's ability to reflect meaningful progress, distorting temporal distance signals.
- Unstable Policy Optimization. When value targets are unreliable, policy gradients become erratic or misleading. This results in suboptimal or unstable policy updates during training.
- Model Overgeneralization and Geometric Bias in Rollouts. Learned models often overgeneralize in underexplored regions, generating unrealistic transitions. As a result, agents exploit geometrically short but temporally infeasible paths during planning.
Main Idea
- We propose a novel method to learn temporal distance representations from fixed offline data for transition augmentation in MBRL, which we dub TempDATA.
- Our key idea is to build a latent space where distances reflect temporal separation, enabling both trajectory-level and transition-level understanding of time.
- At the trajectory level, we approximate the temporal distance between two states by finding their closest alignment across dataset trajectories, capturing long-term structure (see the sketch after this list).
- At the transition level, we control the latent space resolution by tuning a smoothing factor that defines the minimal unit of temporal or spatial progression captured between transitions.
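As an illustration of the trajectory-level idea (not the paper's exact procedure), the temporal distance between two states that lie on offline trajectories can be approximated by the smallest index gap over all trajectories that pass approximately through both of them. The matching tolerance `atol` and the nearest-match criterion below are assumed choices for this sketch.

```python
# Illustrative NumPy sketch: approximate the temporal distance between two
# states as their closest alignment (smallest index gap) across trajectories.
import numpy as np

def temporal_distance_label(trajectories, s_a, s_b, atol=1e-3):
    """trajectories: list of arrays, each of shape (T_i, state_dim)."""
    best = np.inf
    for traj in trajectories:
        # Indices where this trajectory (approximately) visits each state.
        idx_a = np.where(np.linalg.norm(traj - s_a, axis=-1) < atol)[0]
        idx_b = np.where(np.linalg.norm(traj - s_b, axis=-1) < atol)[0]
        if len(idx_a) > 0 and len(idx_b) > 0:
            # Closest alignment of the two visits within this trajectory.
            gap = np.abs(idx_a[:, None] - idx_b[None, :]).min()
            best = min(best, gap)
    return best  # np.inf if no single trajectory connects the two states
```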
Key properties of our method:
Modularity of the representation space: Our autoencoder is compatible with any off-the-shelf model-free or model-based RL algorithm. Moreover, its latent dynamics support a wide range of downstream objectives, including standard RL, skill-based RL, and goal-conditioned RL (GCRL), without modifying the core encoding or planning modules.
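A hedged illustration of this modularity: once the encoder \(f\), decoder \(h\), and latent dynamics \(\zeta\) are trained, short latent rollouts can be turned into synthetic transitions for whatever downstream learner is plugged in. All names and the sampling interfaces below are placeholders, reward handling is omitted, and this sketches the interface rather than the paper's exact augmentation procedure.

```python
# Hedged sketch: rolling out the latent dynamics to augment an offline buffer.
# `f`, `h`, and `zeta` follow the framework section; `policy.sample` and
# `zeta.sample` are assumed interfaces, and rewards are omitted.
import torch

@torch.no_grad()
def augment_transitions(f, h, zeta, policy, states, horizon=5):
    """Generate synthetic transitions by rolling out in latent space."""
    z = f(states)                      # encode real offline states
    synthetic = []
    for _ in range(horizon):
        a = policy.sample(z)           # any downstream policy works here
        z_next = zeta.sample(z, a)     # one latent step under the model
        # Decode back to state space if the downstream learner needs raw
        # states; latent-space learners can keep (z, a, z_next) directly.
        synthetic.append((h(z), a, h(z_next)))
        z = z_next
    return synthetic
```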
TempDATA Framework
- The autoencoder \(f_\theta, h_\theta\) is trained with the following loss: $$\begin{aligned} \underbrace{\mathbb{E}_{s \sim \mathcal D}\Big[|| s - h \circ f(s;\theta)||\Big]}_{\texttt{Reconstruction}} + \eta_1 \cdot \underbrace{\mathbb{E}_{\substack{(s, s')\sim \mathcal{D}\\ s_{\mathrm{goal}}\sim p_{\mathrm{goal}}}}\bigg [L_2^\tau\Big(\mathcal{B}d - d\big(f(s;\theta), f(s_{\mathrm{goal}};\theta)\big)\Big)\bigg]}_{\texttt{Trajectory loss}} + \eta_2 \cdot \underbrace{\mathbb{E}_{(s, s')\sim \mathcal{D}}\bigg[L_2^{\tau=1}\Big(d\big(f(s;\theta), f(s';\theta)\big) - d_0\Big)\bigg]}_{\texttt{Transition loss}}. \end{aligned}$$
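A minimal PyTorch sketch of this objective, assuming \(L_2^\tau\) is the expectile (asymmetric squared) loss, \(d\) is a Euclidean latent distance, and \(\mathcal{B}d\) is a one-step Bellman-style target on temporal distance computed with a target encoder; these are assumptions where the formula leaves details implicit, and all module and variable names are placeholders.

```python
# Sketch of the autoencoder objective: reconstruction + trajectory-level
# expectile regression + transition-level spacing. `f_target` is an assumed
# target copy of the encoder; eta1, eta2, tau, d0 match the formula above.
import torch

def expectile_loss(residual, tau):
    # L_2^tau(x) = |tau - 1{x < 0}| * x^2  (assumed expectile convention)
    weight = torch.abs(tau - (residual < 0).float())
    return weight * residual ** 2

def temporal_distance(z1, z2):
    # Latent distance; Euclidean is one common choice.
    return torch.norm(z1 - z2, dim=-1)

def tempdata_loss(f, h, f_target, s, s_next, s_goal,
                  eta1, eta2, tau, d0, gamma=1.0):
    z, z_next, z_goal = f(s), f(s_next), f(s_goal)

    # Reconstruction: || s - h(f(s)) ||
    recon = torch.norm(s - h(z), dim=-1).mean()

    # Trajectory loss: expectile regression of d(f(s), f(s_goal)) toward an
    # assumed Bellman-style target B·d = 1{s != s_goal} + gamma * d(f(s'), f(s_goal)).
    with torch.no_grad():
        not_goal = (torch.norm(s - s_goal, dim=-1) > 1e-6).float()
        target = not_goal + gamma * temporal_distance(f_target(s_next),
                                                      f_target(s_goal))
    traj = expectile_loss(target - temporal_distance(z, z_goal), tau).mean()

    # Transition loss: with the expectile convention above, tau = 1 penalizes
    # only adjacent latents whose distance exceeds the smoothing factor d0.
    trans = expectile_loss(temporal_distance(z, z_next) - d0, tau=1.0).mean()

    return recon + eta1 * traj + eta2 * trans
```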
- The latent dynamics model \(\zeta(z'|z,a)\) is trained with the following loss: $$\mathcal{L}(\zeta) = \mathbb{E}_{(s,a,s^\prime) \sim \mathcal D,\, (z, z') \sim f(\cdot;\theta)} \big[-\log \zeta(z'|z,a) \big].$$
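A corresponding sketch of the latent dynamics model trained by maximum likelihood; the diagonal-Gaussian parameterization and network sizes are assumptions for illustration, not specified above.

```python
# Sketch: latent dynamics zeta(z'|z,a) as a diagonal Gaussian, trained by
# minimizing the negative log-likelihood of encoded offline transitions.
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    def __init__(self, z_dim, a_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + a_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, z_dim)
        self.log_std = nn.Linear(hidden, z_dim)

    def distribution(self, z, a):
        h = self.net(torch.cat([z, a], dim=-1))
        std = self.log_std(h).clamp(-5.0, 2.0).exp()
        return torch.distributions.Normal(self.mean(h), std)

def dynamics_loss(zeta, f, s, a, s_next):
    # Encode real transitions with the (frozen) encoder, then maximize
    # log-likelihood of the next latent state.
    with torch.no_grad():
        z, z_next = f(s), f(s_next)
    return -zeta.distribution(z, a).log_prob(z_next).sum(-1).mean()
```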
- The skill policy and value function \(\phi_\pi, \phi_Q\) are trained via Implicit Q-learning (IQL) and unsupervised policy training.
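Since the bullet above names Implicit Q-learning, here is a minimal sketch of its three losses on latent transitions; the expectile value regression, advantage-weighted policy extraction, hyperparameters, and the `policy.log_prob` interface are standard IQL choices assumed for illustration, and the unsupervised skill-training part is omitted.

```python
# Sketch of IQL losses over (latent) transitions: expectile V regression,
# one-step Q backup, and advantage-weighted policy extraction.
import torch

def iql_losses(q_net, v_net, policy, z, a, r, z_next, done,
               gamma=0.99, expectile=0.9, beta=3.0):
    # V update: expectile regression of V(z) toward Q(z, a).
    with torch.no_grad():
        q_target = q_net(z, a)
    diff = q_target - v_net(z)
    v_loss = (torch.abs(expectile - (diff < 0).float()) * diff ** 2).mean()

    # Q update: one-step Bellman backup using V(z').
    with torch.no_grad():
        backup = r + gamma * (1.0 - done) * v_net(z_next)
    q_loss = ((q_net(z, a) - backup) ** 2).mean()

    # Policy extraction: advantage-weighted regression with clipped weights.
    with torch.no_grad():
        weight = torch.exp(beta * (q_target - v_net(z))).clamp(max=100.0)
    pi_loss = -(weight * policy.log_prob(z, a)).mean()

    return v_loss, q_loss, pi_loss
```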
Experiments
Benchmarks: antmaze-medium/large/ultra (play, diverse), CALVIN (Traj 1, Traj 2), kitchen-mixed/partial, and visual-kitchen-mixed/partial.
Citation
@article{lee2025temporal,
title={Temporal Distance-aware Transition Augmentation for Offline Model-based Reinforcement Learning},
author={Lee, Dongsu and Kwon, Minhae},
journal={arXiv preprint arXiv:2505.13144},
year={2025}
}