Reinforcement Learning

(ICLR26) Training Memory-Augmented LLM Agent via Online Self-Distillation

📄 Paper · 💻 Code · 📝 OpenReview

Zeyuan Liu¹*, Jeonghye Kim¹˒²*, Xufang Luo¹†, Dongsheng Li¹, Yuqing Yang¹

Microsoft Research¹ · KAIST² · ICLR 2026

* Equal contribution; work done during an internship at Microsoft Research | † Corresponding author

Existing LLM-based agents rely heavily on prior knowledge and thus fail to learn effectively in environments that require discovering and exploring novel states. To address this limitation, we propose a reinforcement learning framework that promotes exploration through memory and combines on- and off-policy optimization to improve generalization without relying on memory at inference time. ...

Adopting the Trajectory Level Aggregation for Faster Training

Agent Lightning (AGL) Team

Date: Dec. 2025

1. Introduction

In multi-turn agent reinforcement learning (RL), data collection relies on rollouts in which an agent interacts with an environment over multiple sequential turns. The strategy used to process these rollouts into training samples is a critical architectural decision that affects both training efficiency and model performance. Agent Lightning currently supports two primary strategies for aggregating these interaction traces: Transition Aggregation and Trajectory Aggregation. ...

Tinker X Agent Lightning

Tuning ANY AI agent with Tinker X Agent-lightning

Yuge Zhang

Nov. 2025

Tinker is the first product built by Thinking Machines Lab, an all-star company whose team members come from leading organizations such as OpenAI. Notable members include former OpenAI CTO Mira Murati; John Schulman, the first author of PPO; Barret Zoph, a leading scientist in AutoML (the area I previously worked in); and well-known Asian researchers such as Danqi Chen and Lilian Weng. ...