(ICLR26) Training Memory-Augmented LLM Agent via Online Self-Distillation

EMPO²: Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

Zeyuan Liu¹*, Jeonghye Kim¹,²*, Xufang Luo¹†, Dongsheng Li¹, Yuqing Yang¹
¹Microsoft Research · ²KAIST · ICLR 2026
*Equal contribution; work done during an internship at Microsoft Research | †Corresponding author

📄 Paper · 💻 Code · 📝 OpenReview

Existing LLM-based agents rely heavily on prior knowledge and thus fail to learn effectively in environments that require discovering and exploring novel states. To address this limitation, we propose a reinforcement learning framework that promotes exploration through memory and combines on- and off-policy optimization to improve generalization without relying on memory at inference time. ...
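
To make the hybrid objective concrete, below is a minimal PyTorch sketch of one common way an on-policy PPO-style loss can be mixed with a down-weighted, importance-corrected loss over replayed (off-policy) samples. This is an illustrative assumption, not the paper's EMPO² implementation: the function name `hybrid_policy_loss`, the per-sample weighting scheme, and every hyperparameter here are hypothetical.

```python
import torch
import torch.nn.functional as F

def hybrid_policy_loss(
    logits_new: torch.Tensor,     # (B, A) logits from the current policy
    logits_old: torch.Tensor,     # (B, A) logits from the behavior policy
    actions: torch.Tensor,        # (B,) integer actions taken
    advantages: torch.Tensor,     # (B,) advantage estimates
    is_off_policy: torch.Tensor,  # (B,) 1.0 for replayed (memory) samples
    clip_eps: float = 0.2,        # PPO clipping range (assumed value)
    off_policy_weight: float = 0.5,  # down-weight for replayed samples (assumed)
) -> torch.Tensor:
    """Clipped surrogate over a mixed batch of on- and off-policy samples."""
    # Log-probability of the taken action under each policy.
    logp_new = F.log_softmax(logits_new, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
    logp_old = F.log_softmax(logits_old, dim=-1).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Importance ratio pi_new / pi_old, clipped as in PPO.
    ratio = torch.exp(logp_new - logp_old.detach())
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = -torch.min(ratio * advantages, clipped * advantages)
    # Off-policy (replayed) samples contribute with a smaller weight.
    weight = torch.where(is_off_policy.bool(),
                         torch.full_like(surrogate, off_policy_weight),
                         torch.ones_like(surrogate))
    return (weight * surrogate).mean()

# Tiny smoke test on random data: half fresh, half replayed samples.
B, A = 8, 4
loss = hybrid_policy_loss(
    logits_new=torch.randn(B, A, requires_grad=True),
    logits_old=torch.randn(B, A),
    actions=torch.randint(0, A, (B,)),
    advantages=torch.randn(B),
    is_off_policy=torch.tensor([0., 0., 0., 0., 1., 1., 1., 1.]),
)
loss.backward()
```

The design choice sketched here, reusing the PPO ratio machinery and simply reweighting replayed samples, is only one common way to blend the two regimes; how EMPO² actually balances its on- and off-policy terms is specified in the paper.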

February 13, 2026 · 4 min