<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Agent on Agent Lightning ⚡ Blog</title><link>https://agent-lightning.github.io/tags/agent/</link><description>Recent content in Agent on Agent Lightning ⚡ Blog</description><generator>Hugo -- 0.152.2</generator><language>en-us</language><lastBuildDate>Fri, 13 Feb 2026 00:00:00 +0800</lastBuildDate><atom:link href="https://agent-lightning.github.io/tags/agent/index.xml" rel="self" type="application/rss+xml"/><item><title>(ICLR26) Training Memory-Augmented LLM Agent via Online Self-Distillation</title><link>https://agent-lightning.github.io/posts/empo2/</link><pubDate>Fri, 13 Feb 2026 00:00:00 +0800</pubDate><guid>https://agent-lightning.github.io/posts/empo2/</guid><description>&lt;h1 id="empo-exploratory-memory-augmented-llm-agent-via-hybrid-on--and-off-policy-optimization"&gt;EMPO²: Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization&lt;/h1&gt;
&lt;p&gt;&lt;em&gt;Zeyuan Liu¹*, Jeonghye Kim¹˒²*, Xufang Luo¹†, Dongsheng Li¹, Yuqing Yang¹&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;small style="color:#64748b;"&gt;Microsoft Research¹ · KAIST² · ICLR 2026&lt;/small&gt;&lt;br&gt;
&lt;small style="color:#94a3b8; font-size:0.8em;"&gt;* Equal contribution; work done during an internship at Microsoft Research | † Corresponding author&lt;/small&gt;&lt;/p&gt;
&lt;p&gt;📄 &lt;a href="https://openreview.net/pdf/c3f914c63072858c90376dcdf90ee00023322f05.pdf" target="_blank"&gt;Paper&lt;/a&gt; · 💻 &lt;a href="https://github.com/microsoft/agent-lightning/tree/main/contrib/recipes/envs" target="_blank"&gt;Code&lt;/a&gt; · 📝 &lt;a href="https://openreview.net/forum?id=UOzxviKVFO" target="_blank"&gt;OpenReview&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;img loading="lazy" src="https://agent-lightning.github.io/posts/empo2/images/empo2_gif.gif"&gt;&lt;/p&gt;
&lt;p&gt;Existing LLM-based agents rely heavily on prior knowledge and therefore struggle to learn in environments that require discovering and exploring novel states. To address this limitation, we propose EMPO², a reinforcement learning framework that promotes exploration through memory and combines on- and off-policy optimization, improving generalization without relying on memory at inference time.&lt;/p&gt;</description></item></channel></rss>